This wasn't another GPT API wrapper. I built my own local RAG system with OCR, Whisper, CLIP, and open-source language models — all running on a mid-tier laptop.

Why Build an Offline AI Agent in 2025?

Cloud LLMs are amazing — but not always ideal.

  • What if your data is private?
  • What if you need to work on a plane?
  • What if you don't want a surprise bill once your usage climbs into the millions of tokens?

I needed a local, multimodal assistant that could answer questions from documents, images, and audio — without leaving my machine.

So I built one.

The Stack I Chose: Fully Local and Modular

To build a system that's private and capable, I selected:

  • LLM: Mistral 7B via Ollama
  • Embedding: sentence-transformers
  • OCR: Tesseract for images
  • Speech-to-Text: OpenAI Whisper
  • Document Parsing: PyMuPDF
  • Indexing: FAISS
  • QA Orchestration: LangChain
  • Interface: Terminal and Gradio

No cloud APIs. Just raw compute and a dream.

Step 1: Extracting Text From PDFs Using PyMuPDF

Most knowledge lives in PDF documents. First job — make them readable.

import fitz  # PyMuPDF's import name

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        # Concatenate the text layer of every page
        full_text += page.get_text()
    doc.close()
    return full_text

I extracted text and metadata from reports, invoices, whitepapers — you name it.

Step 2: OCR Scanned Images With Tesseract

Some PDFs had images with text — useless to LLMs without OCR.

import pytesseract
from PIL import Image

def ocr_image(image_path):
    return pytesseract.image_to_string(Image.open(image_path))

Every image was now a text chunk in my dataset.
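
For fully scanned PDFs with no text layer at all, I first rendered each page to a bitmap and fed that to Tesseract. A sketch of the bridge between the two steps (the 300 DPI value is my own choice, not a requirement):

import fitz
import pytesseract
from PIL import Image

def ocr_pdf_pages(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        # Render the page to a bitmap; higher DPI helps Tesseract accuracy
        pix = page.get_pixmap(dpi=300)
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        text += pytesseract.image_to_string(img)
    doc.close()
    return text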

Step 3: Transcribing Audio With Whisper Locally

I had 20+ lectures, recorded in noisy conference rooms. Whisper handled them beautifully.

import whisper

model = whisper.load_model("base")

def transcribe_audio(file_path):
    result = model.transcribe(file_path)
    return result['text']

I chunked long files into 10-minute segments and transcribed them in a loop. The quality blew me away.
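
The splitting itself was simple. Here's a sketch of the loop, assuming ffmpeg is on the PATH and mp3 input (adjust the extension otherwise):

import glob
import subprocess

def transcribe_long_audio(file_path):
    # Split into 10-minute (600 s) segments without re-encoding
    subprocess.run([
        "ffmpeg", "-i", file_path,
        "-f", "segment", "-segment_time", "600",
        "-c", "copy", "segment_%03d.mp3",
    ], check=True)
    # Transcribe each segment in order and stitch the text together
    return " ".join(
        transcribe_audio(seg) for seg in sorted(glob.glob("segment_*.mp3"))
    )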

Step 4: Chunking Documents and Capturing Metadata

I chunked all extracted text into manageable blocks.

def chunk_text(text, chunk_size=1000, overlap=200):
    chunks = []
    # Slide a window with overlap so context isn't cut off at chunk borders
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks

Each chunk had tags: source type, page number, file name, etc. This metadata was critical for search relevance later.
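
Concretely, each chunk became a small record. The field names here are illustrative, not from any library:

def chunk_with_metadata(text, source_type, file_name, page=None):
    # Wrap each chunk with the provenance tags used for filtering later
    return [
        {
            "text": chunk,
            "source_type": source_type,  # "pdf", "image", or "audio"
            "file_name": file_name,
            "page": page,
        }
        for chunk in chunk_text(text)
    ]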

Step 5: Generating Embeddings Locally

Once chunked, I converted the text into embeddings using Sentence Transformers.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_text_chunks(chunks):
    return model.encode(chunks)

No OpenAI API. Just a fast local model and a few GB of RAM.

Step 6: Storing Embeddings in a FAISS Vector Store

I used FAISS for fast vector similarity search.

import faiss
import numpy as np

def build_faiss_index(embeddings):
    # FAISS expects a contiguous float32 array
    embeddings = np.asarray(embeddings, dtype="float32")
    d = embeddings.shape[1]
    index = faiss.IndexFlatL2(d)
    index.add(embeddings)
    return index

Every chunk, whether from audio, image, or PDF, was now searchable.
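
One practical note: Step 7 below loads the index through LangChain's FAISS wrapper, which expects its own saved format (an index file plus a docstore), not a bare faiss index. The easiest bridge is to let the wrapper build and persist the store itself. A sketch, reusing the chunk records from the Step 4 sketch:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

texts = [r["text"] for r in records]  # records from the Step 4 sketch
metadatas = [{k: v for k, v in r.items() if k != "text"} for r in records]

embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectordb = FAISS.from_texts(texts, embedding, metadatas=metadatas)
vectordb.save_local("my_index")  # the folder Step 7 loads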

Step 7: Building the Retrieval QA Chain With LangChain

With data stored, I built the actual RAG system:

from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import Ollama

# Use the same model as Step 5 so query vectors match the index
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectordb = FAISS.load_local("my_index", embeddings=embedding)
llm = Ollama(model="mistral")

qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectordb.as_retriever())

response = qa_chain.run("What did the CEO say about revenue in Q4?")

And just like that — I had a working question-answering system from raw media.
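
To surface the Step 4 metadata alongside answers, RetrievalQA can also return the retrieved chunks, which is handy for citing which file and page an answer came from. A sketch:

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

result = qa_chain({"query": "What did the CEO say about revenue in Q4?"})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata)  # e.g. file_name, page, source_type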

Step 8: Adding a CLIP-Based Vision Search Engine

I added CLIP to match images semantically:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# Named clip_model/clip_processor so they don't clobber the earlier models
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(image_path):
    image = Image.open(image_path)
    inputs = clip_processor(images=image, return_tensors="pt")
    features = clip_model.get_image_features(**inputs)
    return features.detach().numpy()

This let me search charts and slides with text queries — like "find the revenue pie chart".
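
Under the hood, a text query is embedded with CLIP's text tower and compared to the image embeddings by cosine similarity. A minimal sketch:

import numpy as np

def embed_query(text):
    inputs = clip_processor(text=[text], return_tensors="pt", padding=True)
    return clip_model.get_text_features(**inputs).detach().numpy()

def search_images(query, image_paths):
    # Rank images by cosine similarity between text and image vectors
    q = embed_query(query)
    q = q / np.linalg.norm(q)
    scored = []
    for path in image_paths:
        v = embed_image(path)
        v = v / np.linalg.norm(v)
        scored.append(((q @ v.T).item(), path))
    return sorted(scored, reverse=True)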

Step 9: Creating a Voice-Activated Terminal Chat Interface

I wanted to talk to my AI, so I added voice input with speech_recognition.

import speech_recognition as sr

def get_voice_input():
    r = sr.Recognizer()
    with sr.Microphone() as source:
        audio = r.listen(source)
    # recognize_whisper runs Whisper locally; recognize_google would
    # call Google's web API and break the offline promise
    return r.recognize_whisper(audio, model="base")

Now I could say: "Summarize the earnings call" — and it would just work.
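
Wiring that into the QA chain was a couple of lines of plumbing:

question = get_voice_input()
print("You:", question)
print("Agent:", qa_chain.run(question))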

Step 10: Building a Local Gradio Interface

For a better UX, I built a simple Gradio UI:

import gradio as gr

def ask_bot(question):
    return qa_chain.run(question)

gr.Interface(fn=ask_bot, inputs="text", outputs="text", title="Local RAG AI").launch()

Running locally, this UI was fast, private, and persistent.

Step 11: Packaging the Agent Into a Desktop App

Using PyInstaller:

pyinstaller --onefile my_ai_agent.py

Boom — an executable that runs without a Python install. I shared it with teammates as-is, though Ollama and the model weights still had to be present on their machines.

Step 12: Performance Optimization and Model Swapping

I later tested swapping Mistral out for Llama 3 8B:

ollama run llama3

It performed better on reasoning tasks but needed more memory. For lightweight devices, Mistral or Phi-2 worked better.
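
Since Ollama exposes every pulled model behind the same interface, the swap was a one-line change on the Python side. A sketch:

# Requires `ollama pull llama3` beforehand
llm = Ollama(model="llama3")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectordb.as_retriever())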

Step 13: Logging and Memory Persistence

I logged every query and response:

import sqlite3

conn = sqlite3.connect("logs.db")
conn.execute("CREATE TABLE IF NOT EXISTS chatlog (query TEXT, response TEXT)")
conn.commit()

def log_interaction(query, response):
    # Append one row per question/answer pair
    conn.execute("INSERT INTO chatlog VALUES (?, ?)", (query, response))
    conn.commit()

This let me replay and audit every session — useful for debugging and training.

Step 14: Using the Agent on Documents From USB Drives

I even added a script to auto-index USB files:

import os

def scan_usb_for_pdfs(path="/mnt/usb"):
    pdfs = []
    for root, _, files in os.walk(path):
        for file in files:
            if file.lower().endswith(".pdf"):
                pdfs.append(os.path.join(root, file))
    return pdfs
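
Wiring the scanner into the earlier pipeline took only a few lines. A sketch reusing extract_text_from_pdf, chunk_text, and the Step 5–6 objects:

import numpy as np

# Index every PDF found on the stick, reusing the earlier helpers
for pdf in scan_usb_for_pdfs():
    chunks = chunk_text(extract_text_from_pdf(pdf))
    vectors = model.encode(chunks)  # the SentenceTransformer from Step 5
    index.add(np.asarray(vectors, dtype="float32"))  # the FAISS index from Step 6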

This made the agent useful during travel, workshops, and private audits.

Step 15: Final Thoughts — Why You Should Build Your Own

This wasn't just a coding project. It was a capability multiplier.

  • I learned how every layer of an AI system works
  • I owned the full pipeline — no vendor lock-in
  • I could run it anywhere, any time

If you want to truly understand AI — build your own agent. One chunk, one model, one command at a time.
