Nowadays, nobody will be surprised by running a deep learning model in the cloud. But the situation can be much more complicated in the edge or consumer-device world. There are several reasons for that. First, the use of cloud APIs requires devices to always be online. This is not a problem for a web service, but it can be a dealbreaker for a device that needs to be functional without Internet access. Second, cloud APIs cost money, and customers likely will not be happy to pay yet another subscription fee. Last but not least, after several years, the project may be finished, the API endpoints will be shut down, and the expensive hardware will turn into a brick, which is naturally not friendly for customers, the ecosystem, or the environment. That's why I am convinced that end-user hardware should be fully functional offline, without extra costs or the use of online APIs (well, online access can be optional but not mandatory).
In this article, I will show how to run a LLaMA GPT model and an automatic speech recognition (ASR) model on a Raspberry Pi. That will allow us to ask the Raspberry Pi questions and get answers. And as promised, all this will work fully offline.
Let's get into it!
The code presented in this article is intended to work on the Raspberry Pi. But most of the methods (except the "display" part) will also work on a Windows, OSX, or Linux laptop. So, those readers who don't have a Raspberry Pi can easily test the code without any problems.
Hardware
For this project, I will be using a Raspberry Pi 4. It is a single-board computer running Linux; it is small and requires only 5V DC power, with no fans or active cooling:
A newer 2023 model, the Raspberry Pi 5, should be even better; according to benchmarks, it's almost 2x faster. But it is also almost 50% more expensive, and for our test, the model 4 is good enough.
As for the RAM size, we have two options:
- A Raspberry Pi with 8 GB of RAM allows us to run a 7B LLaMA-2 GPT model, whose memory footprint in a 4-bit quantization mode is about 5 GB.
- A 2 or 4 GB device allows us to run a smaller model like TinyLlama-1B. As a bonus, this model is also faster, but as we will see later, its answers can be a bit less "smart."
Both models can be downloaded from HuggingFace, and in general, almost no code changes will be required.
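If you are unsure which option applies to your board, the total RAM can be checked directly from Python before picking a file. Here is a minimal sketch; the pick_model_file helper is my own illustration (Linux only), and the file names match the GGUF models used later in this article:
import os

def pick_model_file() -> str:
    """ Choose a GGUF model file based on the total RAM of the board """
    # Total RAM in GB, read from the kernel (Linux only)
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    # An 8 GB board may report slightly less than 8 GB, so use a lower threshold
    if ram_gb >= 7.5:
        return "llama-2-7b-chat.Q4_K_M.gguf"       # ~5 GB footprint in 4-bit mode
    return "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"  # fits into 2-4 GB boards

print(pick_model_file())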
The Raspberry Pi is a full-fledged Linux computer, and we can easily see the output in the terminal via SSH. But that is not as much fun and is not suitable for a mobile device like a robot. For the Raspberry Pi, I will be using a monochrome 128x64 I2C OLED display. This display needs only 4 wires to connect:
A display and wires can be obtained on Amazon for $5–10; no soldering skills are required. The I2C interface must be enabled in the Raspberry Pi settings; there are enough tutorials about that. For simplicity, I will omit the hardware part here and focus only on the Python code.
Display
I will start with the display because it's better to see something on the screen during the tests. The Adafruit_CircuitPython_SSD1306 library allows us to display any image on the OLED display. This library has a low-level interface; it can only draw pixels or a monochrome bitmap from a memory buffer. To get scrollable text, I created an array that stores the text buffer and a _display_update method that draws the text:
from PIL import Image, ImageDraw, ImageFont

# Display parameters: resolution, font size, and how many characters/lines fit on the screen
pixels_size = (128, 64)
char_h = 11
max_x, max_y = 22, 5

rpi_font_path = "DejaVuSans.ttf"
font = ImageFont.truetype(rpi_font_path, char_h)

try:
    import board
    import adafruit_ssd1306
    i2c = board.I2C()
    oled = adafruit_ssd1306.SSD1306_I2C(pixels_size[0], pixels_size[1], i2c)
except ImportError:
    oled = None

display_lines = [""]
def _display_update():
    """ Show lines on the screen """
    global oled
    image = Image.new("1", pixels_size)
    draw = ImageDraw.Draw(image)
    for y, line in enumerate(display_lines):
        draw.text((0, y*char_h), line, font=font, fill=255, align="left")

    if oled:
        oled.fill(0)
        oled.image(image)
        oled.show()
Here, the (22, 5) values are the number of text columns and rows that fit on the screen. The oled variable can also be None if an ImportError occurs, for example, if we run this code on a laptop instead of the Raspberry Pi.
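Since oled can be None when there is no display connected, it may also be handy to mirror the text buffer to the terminal. Here is a possible variant of _display_update with such a fallback; this tweak is my own addition, not part of the original code:
def _display_update():
    """ Show lines on the screen, or in the terminal if no OLED is found """
    global oled
    image = Image.new("1", pixels_size)
    draw = ImageDraw.Draw(image)
    for y, line in enumerate(display_lines):
        draw.text((0, y*char_h), line, font=font, fill=255, align="left")

    if oled:
        oled.fill(0)
        oled.image(image)
        oled.show()
    else:
        # No display connected: print the same text buffer to the console
        print("\n".join(display_lines))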
To emulate text scrolling, I also created two helper methods:
def add_display_line(text: str):
    """ Add new line with scrolling """
    global display_lines
    # Split line to chunks according to screen width
    text_chunks = [text[i: i+max_x] for i in range(0, len(text), max_x)]
    for text in text_chunks:
        for line in text.split("\n"):
            display_lines.append(line)
            display_lines = display_lines[-max_y:]
    _display_update()


def add_display_tokens(text: str):
    """ Add new tokens with or without extra line break """
    global display_lines
    last_line = display_lines.pop()
    new_line = last_line + text
    add_display_line(new_line)
The first method adds a new line to the display; if the string is too long, it is automatically split into several lines. The second method adds a text token without a "carriage return"; I will use it to display the answers from the GPT model. The add_display_line method allows us to use code like this:
from datetime import datetime
import time

for p in range(20):
    add_display_line(f"{datetime.now().strftime('%H:%M:%S')}: Line-{p}")
    time.sleep(0.2)
If everything is done correctly, the output should look like this:
There are also special libraries for external displays like Luma-oled, but our solution is good enough for the task.
Automatic Speech Recognition (ASR)
For ASR, I will be using the Transformers library from HuggingFace 🤗, which allows us to implement speech recognition in several lines of Python code:
from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live

asr_model_id = "openai/whisper-tiny.en"
transcriber = pipeline("automatic-speech-recognition",
                       model=asr_model_id,
                       device="cpu")
Here, I used the Whisper-tiny-en model, which was trained on 680K hours of speech data. This is the smallest Whisper model; its file size is 151 MB. When the model is loaded, we can use the ffmpeg_microphone_live method to get data from a microphone:
def transcribe_mic(chunk_length_s: float) -> str:
    """ Transcribe the audio from a microphone """
    global transcriber
    sampling_rate = transcriber.feature_extractor.sampling_rate
    mic = ffmpeg_microphone_live(
        sampling_rate=sampling_rate,
        chunk_length_s=chunk_length_s,
        stream_chunk_s=chunk_length_s,
    )

    result = ""
    for item in transcriber(mic):
        result = item["text"]
        if not item["partial"][0]:
            break
    return result.strip()
Practically, 5–10 seconds are enough to say a phrase. A Raspberry Pi does not have a microphone, but any USB microphone will do the job. This code can also be tested on a laptop; in this case, the internal microphone will be used.
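The ASR part can be tested in isolation before wiring everything together; a trivial usage example (not from the original code):
if __name__ == "__main__":
    print("Say something...")
    # Record for about 5 seconds, then print the recognized text
    question = transcribe_mic(chunk_length_s=5.0)
    print(f"Recognized: {question}")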
Large Language Model (LLM)
Now, let's add the large language model. First, we need to install the needed libraries:
pip3 install llama-cpp-python
pip3 install huggingface-hub sentence-transformers langchain
Before using the LLM, we need to download it. As was discussed before, we have two options. For an 8GB Raspberry Pi, we can use a 7B model. For a 2GB device, a 1B "tiny" model is the only viable option; a larger model just will not fit into the RAM. To download the model, we can use the huggingface-cli tool:
huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
OR
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
As we can see, I use the Llama-2-7b-Chat-GGUF and TinyLlama-1.1B-Chat-v1.0-GGUF models. A smaller model works faster, but a bigger model can potentially provide better results.
When the model is downloaded, we can use it:
from typing import Any, Dict, List, Optional

from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser

llm: Optional[LlamaCpp] = None
callback_manager: Any = None

model_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"  # OR "llama-2-7b-chat.Q4_K_M.gguf"

template_tiny = """<|system|>
You are a smart mini computer named Raspberry Pi.
Write a short but funny answer.</s>
<|user|>
{question}</s>
<|assistant|>"""

template_llama = """<s>[INST] <<SYS>>
You are a smart mini computer named Raspberry Pi.
Write a short but funny answer.<</SYS>>
{question} [/INST]"""

template = template_tiny
def llm_init():
    """ Load large language model """
    global llm, callback_manager
    callback_manager = CallbackManager([StreamingCustomCallbackHandler()])
    llm = LlamaCpp(
        model_path=model_file,
        temperature=0.1,
        n_gpu_layers=0,
        n_batch=256,
        callback_manager=callback_manager,
        verbose=True,
    )


def llm_start(question: str):
    """ Ask LLM a question """
    global llm, template

    prompt = PromptTemplate(template=template, input_variables=["question"])
    chain = prompt | llm | StrOutputParser()
    chain.invoke({"question": question}, config={})
Interestingly, despite the same LLaMA name, the two models were trained using different prompt formats.
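Because the formats differ, the template has to match the downloaded file; one possible way to select it automatically from the model file name (my own small addition, not from the original code):
# Pick the prompt template that matches the GGUF file we downloaded
template = template_llama if "llama-2" in model_file.lower() else template_tiny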
Using the model is simple, but here is the tricky part: we need to display the answer token by token on the OLED screen. For this purpose, I will use a custom callback, which will be executed whenever the LLM is generating a new token:
class StreamingCustomCallbackHandler(StreamingStdOutCallbackHandler):
    """ Callback handler for LLM streaming """

    def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any
    ) -> None:
        """ Run when LLM starts running """
        print("<LLM Started>")

    def on_llm_end(self, response: Any, **kwargs: Any) -> None:
        """ Run when LLM ends running """
        print("<LLM Ended>")

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        """ Run on new LLM token. Only available when streaming is enabled """
        print(f"{token}", end="")
        add_display_tokens(token)
Here, I used the add_display_tokens method that was created before. The print function is also used, so the same code can be executed on a regular PC without a Raspberry Pi.
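At this point, the LLM part can already be tried on its own; a minimal usage sketch, assuming the GGUF file is in the current folder:
if __name__ == "__main__":
    llm_init()
    # The answer is streamed token by token to the console and the OLED screen
    llm_start("Who are you?")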
Testing
Finally, it is time to combine all the parts. The code is straightforward:
if __name__ == "__main__":
    add_display_line("Init automatic speech recognition...")
    asr_init()

    add_display_line("Init LLaMA GPT...")
    llm_init()

    while True:
        # Q-A loop:
        add_display_line("Start speaking")
        add_display_line("")
        question = transcribe_mic(chunk_length_s=5.0)
        if len(question) > 0:
            add_display_tokens(f"> {question}")
            add_display_line("")
            llm_start(question)
Here, the Raspberry Pi records audio for 5 seconds, then the speech recognition model converts the audio into text; finally, the recognized text is sent to the LLM. After that, the process is repeated. This approach can be improved, for example, by using an automatic audio-level threshold, but for a "weekend project," it is good enough.
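The asr_init method is not shown earlier in the article; presumably, it just moves the transcriber creation from the ASR section into a function. A minimal sketch under that assumption:
def asr_init():
    """ Load the automatic speech recognition model """
    global transcriber
    transcriber = pipeline("automatic-speech-recognition",
                           model=asr_model_id,
                           device="cpu")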
Practically, the output looks like this:
Here, we can see the 1B LLM inference speed on the Raspberry Pi 4. As was mentioned before, the Raspberry Pi 5 should be 30–40% faster.
I did not compare the 1B and 7B models' quality using any "official" benchmark like BLEU or ROUGE. Subjectively, the 7B model provides more correct and informative answers, but it also requires more RAM, more time to load (the file sizes are 4.6 and 0.7 GB, respectively), and works 3–5x slower. As for the power consumption, the Raspberry Pi 4 requires on average 3–5 W with the model running, the OLED screen connected, and a USB microphone attached.
Conclusion
In this article, we were able to run an automatic speech recognition model and a large language model on a Raspberry Pi, a portable Linux computer that can run fully autonomously. Different models can be used in the cloud, but personally, it's more fun for me to work with a real thing that I can touch and watch at work.
A prototype like this is also an interesting milestone in using GPT models. Even 1–2 years ago, it was unthinkable to imagine that large language models could run on cheap consumer hardware. We are entering the era of smart devices that will be able to understand human speech, respond to text commands, or perform different actions. Probably, in the future, devices like TVs or microwaves will have no buttons at all, and we will just talk to them. As we can see from the video, the LLM still works a bit slowly. But we all know Moore's law: apparently, in 5–10 years, the same model will easily run on a $1 chip, just as we can now run a full-fledged PDP-11 emulator (the PDP was a $100K computer in the 80s) on a $5 ESP32 board.
In the next part, I will show how to use the push-to-talk button for speech recognition, and we will implement a data processing AI pipeline in the "Rabbit" style. Stay tuned.
If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.
Those who are interested in using language models and natural language processing are also welcome to read other articles:
- LLMs for Everyone: Running LangChain and a MistralAI 7B Model in Google Colab
- LLMs for Everyone: Running the LLaMA-13B model and LangChain in Google Colab
- LLMs for Everyone: Running the HuggingFace Text Generation Inference in Google Colab
- Natural Language Processing For Absolute Beginners
- 16, 8, and 4-bit Floating Point Formats – How Does it Work?
Thanks for reading.