How I built an end-to-end AI pipeline in Python to classify legal documents using NLP, transformers, and vector similarity — and the wild lessons I learned along the way.

1. The Problem: Classifying Legal PDFs by Type

I had thousands of legal PDF documents — contracts, agreements, court notices — and my job was to classify each into categories. Manual sorting? Out of the question. Regex and keyword matching? Too brittle.

So I challenged myself: Can I build a true AI classifier that understands what a document is about — and not just what words it contains?

2. Setting Up the Workflow: OCR to Text Pipeline

Most PDFs had scanned images. So, I first needed a pipeline to extract readable text:

import pytesseract
from pdf2image import convert_from_path

def pdf_to_text(pdf_path):
    pages = convert_from_path(pdf_path, dpi=300)
    full_text = ""
    for page in pages:
        text = pytesseract.image_to_string(page)
        full_text += text + "\n"
    return full_text

I combined pdf2image and pytesseract to convert the scanned pages into readable text. This worked well, even on noisy scans.
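
In practice, running the helper on a single scanned file (the path below is just a placeholder) looked like this:

# Any scanned PDF path works here; this file name is hypothetical
raw_text = pdf_to_text("contracts/nda_scan_001.pdf")
print(raw_text[:500])  # peek at the first 500 characters of OCR output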

3. Cleaning Text for NLP Processing

After OCR, the text was messy — broken lines, headers, numbers everywhere. I cleaned it up with simple regex and token filters:

import re

def clean_text(text):
    text = re.sub(r'\n+', ' ', text)         # collapse line breaks into spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # strip digits and punctuation
    text = text.lower()
    return text.strip()

This gave me a smooth, clean paragraph per document — perfect for feeding into an NLP model.
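
To make the effect concrete, here is the cleaner run on a made-up snippet of OCR output:

sample = "EMPLOYMENT AGREEMENT\nSection 4.2:\nThe Employee shall receive $5,000 per month."
print(clean_text(sample))
# line breaks become spaces, digits and punctuation disappear, everything is lowercased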

4. Labeling Sample Data for Training

Before training an AI, I needed labeled examples. I hand-labeled 500 PDFs into categories:

  • Employment Contract
  • NDA
  • Vendor Agreement
  • Lease Agreement
  • Court Ruling

I stored the data as CSV: filename, category, cleaned_text.
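
Loading that CSV back for training looked roughly like this (the file name is a placeholder; the columns match the ones above):

import pandas as pd

# Hand-labeled data with columns: filename, category, cleaned_text
df = pd.read_csv("labeled_documents.csv")

cleaned_texts = df["cleaned_text"].tolist()
labels = df["category"].tolist()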

5. Embedding the Documents with Transformers

I used the sentence-transformers library, which wraps compact BERT-style transformer encoders, to convert each document into a vector:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = model.encode(cleaned_texts, show_progress_bar=True)

Now, each document had a 384-dimensional vector representing its semantics.
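
A quick way to sanity-check the embeddings is to compare two documents with cosine similarity, sketched here with the library's util helper:

from sentence_transformers import util

# Scores close to 1.0 mean the model considers the two documents semantically similar
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity.item():.3f}")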

6. Training a KNN Classifier

I used K-Nearest Neighbors to classify new documents based on vector proximity:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(embeddings, labels)

Now, I could classify new documents by comparing their embeddings to the labeled ones.
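
Before trusting it, it is worth checking accuracy on a hold-out split. A sketch with scikit-learn (the 80/20 split and random seed are arbitrary choices):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the labeled documents for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=42, stratify=labels
)

knn_eval = KNeighborsClassifier(n_neighbors=3)
knn_eval.fit(X_train, y_train)

print("Hold-out accuracy:", accuracy_score(y_test, knn_eval.predict(X_test)))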

7. Building a Predictive Function

I created a function to predict document categories using this trained model:

def classify_document(text):
    clean = clean_text(text)
    embed = model.encode([clean])
    prediction = knn.predict(embed)
    return prediction[0]

This took in raw OCR text and returned a category like "NDA" or "Court Ruling."
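
Chained with the OCR step, classifying a file on disk becomes a one-liner (the path is just an example):

# Hypothetical input file; cleaning and embedding happen inside classify_document
print(classify_document(pdf_to_text("incoming/unknown_agreement.pdf")))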

8. Adding a Frontend: Upload and Classify

To let others use the tool, I built a simple web interface using Streamlit:

import streamlit as st
import pytesseract
from pdf2image import convert_from_bytes

st.title("AI Document Classifier")

uploaded_file = st.file_uploader("Upload a PDF", type=["pdf"])

if uploaded_file:
    # st.file_uploader returns a file-like object, not a path on disk,
    # so convert the uploaded bytes directly instead of calling pdf_to_text
    pages = convert_from_bytes(uploaded_file.read(), dpi=300)
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    category = classify_document(text)
    st.success(f"This document is a: **{category}**")

Now, anyone could upload a PDF and instantly get an AI-generated category.

9. Final Thoughts: Why This AI Was Different

What made this feel like real AI wasn't just automation — it was the semantic understanding:

  • It didn't rely on specific keywords.
  • It could adapt to different formats of the same document type.
  • It learned patterns I couldn't manually define.

And best of all? It outperformed my regex scripts by over 40% in accuracy.

This journey taught me that true AI isn't just about code — it's about building systems that learn from data, adapt, and become smarter over time.

From PDFs to predictions, I built something that made document handling feel… kind of magical.
