This wasn't another GPT API wrapper. I built my own local RAG system with OCR, Whisper, CLIP, and open-source language models, all running on a mid-tier laptop.
Why Build an Offline AI Agent in 2025?
Cloud LLMs are amazing — but not always ideal.
- What if your data is private?
- What if you need to work on a plane?
- What if you don't want a surprise API bill at the end of the month?
I needed a local, multimodal assistant that could answer questions from documents, images, and audio — without leaving my machine.
So I built one.
The Stack I Chose: Fully Local and Modular
To build a system that's private and capable, I selected:
- LLM: Mistral 7B via Ollama
- Embeddings: sentence-transformers
- OCR: Tesseract for images
- Speech-to-Text: OpenAI Whisper
- Document Parsing: PyMuPDF
- Indexing: FAISS
- QA Orchestration: LangChain
- Interface: Terminal and Gradio
No cloud APIs. Just raw compute and a dream.
Step 1: Extracting Text From PDFs Using PyMuPDF
Most knowledge lives in PDF documents. First job — make them readable.
```python
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += page.get_text()
    return full_text
```

I extracted text and metadata from reports, invoices, whitepapers, you name it.
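The function above flattens a whole document into one string. Since later steps attach page numbers and file names to each chunk, it helps to capture text page by page along with the PDF's own metadata. Here is a minimal sketch; the `extract_pages_with_metadata` helper is hypothetical, not part of PyMuPDF or the original project:

```python
import fitz  # PyMuPDF

def extract_pages_with_metadata(pdf_path):
    # Hypothetical helper: one record per page, so later chunks
    # can carry the file name and page number with them.
    doc = fitz.open(pdf_path)
    records = []
    for page_number, page in enumerate(doc, start=1):
        records.append({
            "text": page.get_text(),
            "file": pdf_path,
            "page": page_number,
            "title": doc.metadata.get("title", ""),  # document-level PDF metadata
        })
    return records
```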
Step 2: OCR Scanned Images With Tesseract
Some PDFs had images with text — useless to LLMs without OCR.
```python
import pytesseract
from PIL import Image

def ocr_image(image_path):
    return pytesseract.image_to_string(Image.open(image_path))
```

Every image was now a text chunk in my dataset.
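To run scanned pages through OCR, the embedded images first have to come out of the PDF. Here is a rough sketch using PyMuPDF's `get_images` and `extract_image`; the `ocr_pdf_images` helper is my own glue code, not necessarily how the original project did it:

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def ocr_pdf_images(pdf_path):
    # Hypothetical helper: OCR every embedded image in a PDF.
    doc = fitz.open(pdf_path)
    texts = []
    for page in doc:
        for img in page.get_images(full=True):
            xref = img[0]                    # cross-reference number of the image
            info = doc.extract_image(xref)   # raw image bytes plus format info
            image = Image.open(io.BytesIO(info["image"]))
            texts.append(pytesseract.image_to_string(image))
    return texts
```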
Step 3: Transcribing Audio With Whisper Locally
I had 20+ lectures, recorded in noisy conference rooms. Whisper handled them beautifully.
```python
import whisper

model = whisper.load_model("base")

def transcribe_audio(file_path):
    result = model.transcribe(file_path)
    return result["text"]
```

I chunked long files into 10-minute segments and transcribed them in a loop. The quality blew me away.
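The splitting step isn't shown above; one way to do it is with pydub (my assumption; ffmpeg or any other audio splitter works just as well):

```python
from pydub import AudioSegment  # requires ffmpeg to be installed

def split_audio(file_path, segment_minutes=10):
    # Cut a long recording into fixed-length segments for transcription.
    audio = AudioSegment.from_file(file_path)
    segment_ms = segment_minutes * 60 * 1000
    paths = []
    for i, start in enumerate(range(0, len(audio), segment_ms)):
        chunk_path = f"{file_path}.part{i}.wav"
        audio[start:start + segment_ms].export(chunk_path, format="wav")
        paths.append(chunk_path)
    return paths

# transcript = " ".join(transcribe_audio(p) for p in split_audio("lecture.mp3"))
```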
Step 4: Chunking Documents and Capturing Metadata
I chunked all extracted text into manageable blocks.
```python
def chunk_text(text, chunk_size=1000, overlap=200):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks
```

Each chunk had tags: source type, page number, file name, etc. This metadata was critical for search relevance later.
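The tagging itself isn't shown. A simple approach is to keep each chunk as a small record next to its metadata; this is my own sketch, not the project's exact schema, and it reuses the `chunk_text` function above:

```python
def chunk_with_metadata(text, source_type, file_name, page=None,
                        chunk_size=1000, overlap=200):
    # Wrap each chunk in a record so its origin survives indexing.
    records = []
    for chunk in chunk_text(text, chunk_size, overlap):
        records.append({
            "text": chunk,
            "source_type": source_type,   # "pdf", "image", or "audio"
            "file_name": file_name,
            "page": page,
        })
    return records
```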
Step 5: Generating Embeddings Locally
Once chunked, I converted the text into embeddings using Sentence Transformers.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_text_chunks(chunks):
    return model.encode(chunks)
```

No OpenAI API. Just a fast local model and a few GB of RAM.
Step 6: Storing Embeddings in a FAISS Vector Store
I used FAISS for fast vector similarity search.
```python
import faiss
import numpy as np

def build_faiss_index(embeddings):
    embeddings = np.asarray(embeddings, dtype="float32")  # FAISS expects float32 vectors
    d = embeddings.shape[1]
    index = faiss.IndexFlatL2(d)
    index.add(embeddings)
    return index
```

Every chunk, whether from audio, image, or PDF, was now searchable.
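To make that concrete, here is a minimal query sketch against the index; `model` and `chunks` refer to the SentenceTransformer model and chunk list from the earlier steps, and the top-k value of 5 is arbitrary:

```python
def search_index(index, query, chunks, k=5):
    # Embed the query with the same model used for the chunks,
    # then return the k nearest chunks by L2 distance.
    query_vec = model.encode([query]).astype("float32")
    distances, indices = index.search(query_vec, k)
    return [chunks[i] for i in indices[0]]

# results = search_index(index, "Q4 revenue guidance", chunks)
```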
Step 7: Building the Retrieval QA Chain With LangChain
With data stored, I built the actual RAG system:
```python
from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import Ollama

embedding = HuggingFaceEmbeddings()
# "my_index" is a folder created earlier with vectordb.save_local("my_index")
vectordb = FAISS.load_local("my_index", embeddings=embedding)
llm = Ollama(model="mistral")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectordb.as_retriever())

response = qa_chain.run("What did the CEO say about revenue in Q4?")
```

And just like that, I had a working question-answering system built from raw media.
Step 8: Adding a CLIP-Based Vision Search Engine
I added CLIP to match images semantically:
```python
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(image_path):
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    features = model.get_image_features(**inputs)
    return features.detach().numpy()
```

This let me search charts and slides with text queries, like "find the revenue pie chart".
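The matching step isn't shown in the article. The sketch below embeds a text query with the same CLIP model and ranks images by cosine similarity; the `rank_images` helper and the `image_paths` list are my own assumptions:

```python
import numpy as np

def embed_text(query):
    # Project the query into the same embedding space as the images.
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    features = model.get_text_features(**inputs)
    return features.detach().numpy()

def rank_images(query, image_paths):
    # Hypothetical helper: cosine similarity between the query and each image.
    q = embed_text(query)[0]
    q = q / np.linalg.norm(q)
    scores = []
    for path in image_paths:
        v = embed_image(path)[0]
        scores.append((path, float(np.dot(q, v / np.linalg.norm(v)))))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# rank_images("revenue pie chart", ["slide1.png", "slide2.png"])
```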
Step 9: Creating a Voice-Activated Terminal Chat Interface
I wanted to talk to my AI, so I added voice input with speech_recognition.
```python
import speech_recognition as sr

def get_voice_input():
    r = sr.Recognizer()
    with sr.Microphone() as source:
        audio = r.listen(source)
    # Note: recognize_google sends audio to Google's web API. For a fully
    # offline setup, an offline recognizer such as recognize_sphinx
    # (PocketSphinx) can be swapped in here.
    return r.recognize_google(audio)
```

Now I could say "Summarize the earnings call" and it would just work.
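Wiring voice input into the QA chain is then a short loop; a small usage sketch (the loop and exit words are mine):

```python
while True:
    question = get_voice_input()          # speak a question into the mic
    if question.lower() in {"quit", "exit"}:
        break
    print("Q:", question)
    print("A:", qa_chain.run(question))
```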
Step 10: Building a Local Gradio Interface
For a better UX, I built a simple Gradio UI:
```python
import gradio as gr

def ask_bot(question):
    return qa_chain.run(question)

gr.Interface(fn=ask_bot, inputs="text", outputs="text", title="Local RAG AI").launch()
```

Running locally, this UI was fast, private, and persistent.
Step 11: Packaging the Agent Into a Desktop App
Using PyInstaller:
```bash
pyinstaller --onefile my_ai_agent.py
```

Boom: an executable file that runs everything offline. I shared it with teammates, who didn't have to install Python or any pip dependencies (Ollama and the model weights still need to be present on the machine).
Step 12: Performance Optimization and Model Swapping
I later tested swapping Mistral with LLaMA-3 8B:
```bash
ollama run llama3
```

It performed better on reasoning tasks but needed more memory. For lightweight devices, Mistral or Phi-2 worked better.
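On the Python side, swapping models is just a different model name passed to Ollama and a rebuilt chain; a quick sketch, assuming the `vectordb` retriever from Step 7 is still in scope:

```python
from langchain.llms import Ollama
from langchain.chains import RetrievalQA

# Same pipeline, different local model behind it.
llm = Ollama(model="llama3")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectordb.as_retriever())
```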
Step 13: Logging and Memory Persistence
I logged every query and response:
```python
import sqlite3

conn = sqlite3.connect("logs.db")
conn.execute("CREATE TABLE IF NOT EXISTS chatlog (query TEXT, response TEXT)")
conn.commit()
```

This let me replay and audit every session, useful for debugging and training.
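The insert step isn't shown above; a small wrapper around the chain could handle it (the `ask_and_log` name is hypothetical):

```python
def ask_and_log(question):
    # Answer the question, then persist the exchange for later replay.
    response = qa_chain.run(question)
    conn.execute("INSERT INTO chatlog (query, response) VALUES (?, ?)",
                 (question, response))
    conn.commit()
    return response
```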
Step 14: Using the Agent on Documents From USB Drives
I even added a script to auto-index USB files:
```python
import os

def scan_usb_for_pdfs(path="/mnt/usb"):
    pdfs = []
    for root, _, files in os.walk(path):
        for file in files:
            if file.endswith(".pdf"):
                pdfs.append(os.path.join(root, file))
    return pdfs
```

This made the agent useful during travel, workshops, and private audits.
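Indexing a drive then just means running the earlier steps over each file. Here is a rough sketch reusing the helpers from Steps 1, 4, 5, and 6; the exact glue code is my assumption:

```python
def index_usb_drive(path="/mnt/usb"):
    # Extract, chunk, embed, and index every PDF found on the drive.
    all_chunks = []
    for pdf in scan_usb_for_pdfs(path):
        text = extract_text_from_pdf(pdf)
        all_chunks.extend(chunk_text(text))
    embeddings = embed_text_chunks(all_chunks)
    return build_faiss_index(embeddings), all_chunks

# index, chunks = index_usb_drive()
```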
Step 15: Final Thoughts — Why You Should Build Your Own
This wasn't just a coding project. It was a capability multiplier.
- I learned how every layer of an AI system works
- I owned the full pipeline — no vendor lock-in
- I could run it anywhere, any time
If you want to truly understand AI — build your own agent. One chunk, one model, one command at a time.