Nowadays, nobody is surprised by running a deep learning model in the cloud. But the situation can be much more complicated in the edge or consumer-device world, for several reasons. First, cloud APIs require devices to always be online. This is not a problem for a web service but can be a dealbreaker for a device that needs to work without Internet access. Second, cloud APIs cost money, and customers will likely not be happy to pay yet another subscription fee. Last but not least, after several years the project may be finished, the API endpoints will be shut down, and the expensive hardware will turn into a brick, which is not friendly to customers, the ecosystem, or the environment. That's why I am convinced that end-user hardware should be fully functional offline, without extra costs or online APIs (they can be optional but not mandatory).

In this article, I will show how to run a LLaMA GPT model and automatic speech recognition (ASR) on a Raspberry Pi. This will allow us to ask the Raspberry Pi questions and get answers. And as promised, all of this will work fully offline.

Let's get into it!

The code presented in this article is intended to work on the Raspberry Pi. But most of the methods (except the "display" part) will also work on a Windows, macOS, or Linux laptop. So, readers who don't have a Raspberry Pi can easily test the code without any problems.

Hardware

For this project, I will be using a Raspberry Pi 4. It is a single-board computer running Linux; it is small and needs only a 5V DC power supply, without fans or active cooling:

Raspberry Pi 4, Image source Wikipedia

A newer 2023 model, the Raspberry Pi 5, should be even better; according to benchmarks, it's almost 2x faster. But it is also almost 50% more expensive, and for our test, the model 4 is good enough.

As for the RAM size, we have two options:

  • A Raspberry Pi with 8 GB of RAM allows us to run a 7B LLaMA-2 GPT model, whose memory footprint in a 4-bit quantization mode is about 5 GB.
  • A 2 or 4 GB device allows us to run a smaller model like TinyLlama-1B. As a bonus, this model is also faster, but as we will see later, its answers can be a bit less "smart."

Both models can be downloaded from HuggingFace, and in general, almost no code changes will be required.

The Raspberry Pi is a full-fledged Linux computer, and we can easily see the output in a terminal via SSH. But that is not as much fun and is not suitable for a mobile device like a robot. For the Raspberry Pi, I will be using a monochrome 128x64 OLED display with an I2C interface. This display needs only 4 wires to connect:

I2C OLED display connection, Image made by author in Fritzing

A display and wires can be obtained on Amazon for $5–10; no soldering skills are required. The I2C interface must be enabled in the Raspberry Pi settings; there are plenty of tutorials about that. For simplicity, I will omit the rest of the hardware part here and focus only on the Python code.
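For reference, on Raspberry Pi OS the I2C bus can be enabled from the command line, and the display library is available on PyPI. A typical setup looks roughly like this (package names and tooling may differ on your system):

sudo raspi-config nonint do_i2c 0
pip3 install adafruit-circuitpython-ssd1306 pillow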

Display

I will start with the display because it's better to see something on the screen during the tests. The Adafruit_CircuitPython_SSD1306 library allows us to display an image on the OLED display. This library has a low-level interface; it can only draw pixels or a monochrome bitmap from a memory buffer. To make the text scrollable, I created an array that stores the text buffer and a _display_update method that draws the text:

from PIL import Image, ImageDraw, ImageFont

# Display parameters: resolution, font size, and characters per line / lines per screen
pixels_size = (128, 64)
char_h = 11
max_x, max_y = 22, 5
rpi_font_path = "DejaVuSans.ttf"
font = ImageFont.truetype(rpi_font_path, char_h)
display_lines = [""]

try:
    import board
    import adafruit_ssd1306
    i2c = board.I2C()
    oled = adafruit_ssd1306.SSD1306_I2C(pixels_size[0], pixels_size[1], i2c)
except ImportError:
    # No OLED libraries available (e.g. when running on a laptop)
    oled = None

def _display_update():
    """ Show lines on the screen """
    global oled
    image = Image.new("1", pixels_size)
    draw = ImageDraw.Draw(image)
    for y, line in enumerate(display_lines):
        draw.text((0, y*char_h), line, font=font, fill=255, align="left")

    if oled:
        oled.fill(0)
        oled.image(image)
        oled.show()

Here, the max_x and max_y variables (22 and 5) contain the number of columns and rows we can display. The oled variable can also be None if an ImportError occurs, for example, if we run this code on a laptop instead of the Raspberry Pi.

To emulate text scrolling, I also created two helper methods:

def add_display_line(text: str):
    """ Add new line with scrolling """
    global display_lines
    # Split line to chunks according to screen width
    text_chunks = [text[i: i+max_x] for i in range(0, len(text), max_x)]
    for text in text_chunks:
        for line in text.split("\n"):
            display_lines.append(line)
            display_lines = display_lines[-max_y:]
    _display_update()

def add_display_tokens(text: str):
    """ Add new tokens with or without extra line break """
    global display_lines
    last_line = display_lines.pop()
    new_line = last_line + text
    add_display_line(new_line)

The first method adds a new line to the display; if the string is too long, it is automatically split into several lines. The second method appends a text token to the last line without a "carriage return"; I will use it to display the answer from a GPT model. The add_display_line method allows us to use code like this:

from datetime import datetime
import time

for p in range(20):
    add_display_line(f"{datetime.now().strftime('%H:%M:%S')}: Line-{p}")
    time.sleep(0.2)

If everything is done correctly, the output should look like this:

OLED Display, Image by author

There are also dedicated libraries for external displays, like luma.oled, but our solution is good enough for the task.

Automatic Speech Recognition (ASR)

For ASR, I will be using the Transformers library from HuggingFace 🤗, which allows us to implement speech recognition in several lines of Python code:

from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live


asr_model_id = "openai/whisper-tiny.en"
transcriber = pipeline("automatic-speech-recognition",
                       model=asr_model_id,
                       device="cpu")

Here, I used the Whisper-tiny-en model, which was trained on 680K hours of speech data. This is the smallest Whisper model; its file size is 151MB. When the model is loaded, we can use the ffmpeg_microphone_live method to get data from a microphone:

def transcribe_mic(chunk_length_s: float) -> str:
    """ Transcribe the audio from a microphone """
    global transcriber
    sampling_rate = transcriber.feature_extractor.sampling_rate
    mic = ffmpeg_microphone_live(
            sampling_rate=sampling_rate,
            chunk_length_s=chunk_length_s,
            stream_chunk_s=chunk_length_s,
        )
    
    result = ""
    for item in transcriber(mic):
        result = item["text"]
        # Stop as soon as the transcription of the current phrase is finalized
        if not item["partial"][0]:
            break
    return result.strip()

Practically, 5–10 seconds are enough to say the phrase. The Raspberry Pi has no built-in microphone, but any USB microphone will do the job. This code can also be tested on a laptop; in that case, the internal microphone will be used.
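To check the ASR part in isolation, a minimal test (not part of the original snippets; the chunk length is arbitrary) can look like this:

print("Say something...")
question = transcribe_mic(chunk_length_s=5.0)
print("Recognized:", question)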

Large Language Model (LLM)

Now, let's add the large language model. First, we need to install the needed libraries:

pip3 install llama-cpp-python
pip3 install huggingface-hub sentence-transformers langchain

Before using the LLM, we need to download it. As was discussed before, we have two options. For an 8GB Raspberry Pi, we can use a 7B model. For a 2GB device, a 1B "tiny" model is the only viable option; a larger model just will not fit into the RAM. To download the model, we can use the huggingface-cli tool:

huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
OR
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

As we can see, I use the Llama-2-7B-Chat-GGUF and TinyLlama-1.1B-Chat-v1.0-GGUF models. The smaller model works faster, but the bigger model can potentially provide better results.
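As a side note, if you prefer to stay in Python, the same files can be downloaded with the huggingface_hub library; here is a small sketch, using the same repository and file names as the commands above:

from huggingface_hub import hf_hub_download

# Download the 1B model file into the current folder
hf_hub_download(repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
                filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
                local_dir=".")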

When the model is downloaded, we can use it:

from typing import Optional, Any, Dict, List
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser


llm: Optional[LlamaCpp] = None
callback_manager: Any = None

model_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"  # OR "llama-2-7b-chat.Q4_K_M.gguf"
template_tiny = """<|system|>
                   You are a smart mini computer named Raspberry Pi. 
                   Write a short but funny answer.</s>
                   <|user|>
                   {question}</s>
                   <|assistant|>"""
template_llama = """<s>[INST] <<SYS>>
                    You are a smart mini computer named Raspberry Pi.
                    Write a short but funny answer.<</SYS>>
                    {question} [/INST]"""
template = template_tiny


def llm_init():
    """ Load large language model """
    global llm, callback_manager

    callback_manager = CallbackManager([StreamingCustomCallbackHandler()])
    llm = LlamaCpp(
        model_path=model_file,
        temperature=0.1,
        n_gpu_layers=0,  # CPU-only inference (no GPU offloading on the Raspberry Pi)
        n_batch=256,
        callback_manager=callback_manager,
        verbose=True,
    )


def llm_start(question: str):
    """ Ask LLM a question """
    global llm, template

    prompt = PromptTemplate(template=template, input_variables=["question"])
    chain = prompt | llm | StrOutputParser()
    chain.invoke({"question": question}, config={})

Interestingly, despite sharing the LLaMA name, the two models were trained using different prompt formats.
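As a small convenience (not part of the original snippet), the template can also be selected automatically from the model file name:

# Pick the prompt template that matches the downloaded model file
template = template_tiny if "tinyllama" in model_file else template_llama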

Using the model is simple, but here is the tricky part: we need to display the answer token by token on the OLED screen. For this purpose, I will use a custom callback, which is executed whenever the LLM generates a new token:

class StreamingCustomCallbackHandler(StreamingStdOutCallbackHandler):
    """ Callback handler for LLM streaming """

    def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any
    ) -> None:
        """ Run when LLM starts running """
        print("<LLM Started>")

    def on_llm_end(self, response: Any, **kwargs: Any) -> None:
        """ Run when LLM ends running """
        print("<LLM Ended>")

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        """ Run on new LLM token. Only available when streaming is enabled """
        print(f"{token}", end="")
        add_display_tokens(token)

Here, I used the add_display_tokens method created earlier. A print call is also included, so the same code can be executed on a regular PC without a Raspberry Pi.
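Before combining everything, the LLM part can be checked on its own; a quick sanity test (the question is arbitrary):

llm_init()
llm_start("Why is the sky blue?")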

Testing

Finally, it is time to combine all the parts. The code is straightforward:

if __name__ == "__main__":
    add_display_line("Init automatic speech recogntion...")
    asr_init()

    add_display_line("Init LLaMA GPT...")
    llm_init()

    while True:
        # Q-A loop:
        add_display_line("Start speaking")
        add_display_line("")
        question = transcribe_mic(chunk_length_s=5.0)
        if len(question) > 0:
            add_display_tokens(f"> {question}")
            add_display_line("")

            llm_start(question)

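Note that asr_init is not shown in the snippets above; a minimal version, assuming it simply wraps the pipeline creation from the ASR section, could look like this:

def asr_init():
    """ Load the automatic speech recognition model """
    global transcriber
    transcriber = pipeline("automatic-speech-recognition",
                           model=asr_model_id,
                           device="cpu")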
In the main loop, the Raspberry Pi records audio for 5 seconds, then the speech recognition model converts the audio into text; finally, the recognized text is sent to the LLM. After that, the process repeats. This approach can be improved, for example, by using an automatic audio-level threshold (a naive version is sketched below), but for a "weekend project," it is good enough.
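As an illustration of that idea only (not part of the code above), a simple energy-based check, assuming float32 audio samples in the [-1, 1] range, could look like this:

import numpy as np

def is_speech(chunk: np.ndarray, threshold: float = 0.01) -> bool:
    """ Return True if the audio chunk is louder than the given RMS threshold """
    rms = np.sqrt(np.mean(np.square(chunk)))
    return rms > threshold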

Practically, the output looks like this:

Tiny LLaMA inference, Image by author

Here, we can see the 1B LLM inference speed on the Raspberry Pi 4. As mentioned before, a Raspberry Pi 5 should be almost 2x faster.

I did not compare the 1B and 7B models' quality using any "official" benchmark like BLEU or ROUGE. Subjectively, the 7B model provides more correct and informative answers, but it also requires more RAM, more time to load (file sizes are 4.6 and 0.7GB, respectively), and works 3–5x slower. As for the power consumption, the Raspberry Pi 4 requires on average 3–5W with the running model, connected OLED screen, and USB microphone.

Conclusion

In this article, we were able to run an automatic speech recognition model and a large language model on a Raspberry Pi, a portable Linux computer that can run fully autonomously. Different models can be used in the cloud, but personally, it's more fun for me to work with a real thing that I can touch and see how it works.

A prototype like this is also an interesting milestone in using GPT models. Even 1–2 years ago, it was unthinkable to imagine that large language models could run on cheap consumer hardware. We are entering the era of smart devices that will be able to understand human speech, respond to text commands, or perform different actions. Probably, in the future, devices like TVs or microwaves will have no buttons at all, and we will just talk to them. As we can see from the video, the LLM still works a bit slowly. But we all know Moore's law: apparently, 5–10 years later, the same model will easily run on a $1 chip, just as we can now run a full-fledged PDP-11 emulator (the PDP was a $100K computer in the 80s) on a $5 ESP32 board.

In the next part, I will show how to use the push-to-talk button for speech recognition, and we will implement a data processing AI pipeline in the "Rabbit" style. Stay tuned.

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.

Those who are interested in using language models and natural language processing are also welcome to read my other articles.

Thanks for reading.