One of the main motivations behind our project is to bring large language models to a general audience. Allowing users to interact with the system using voice instead of a keyboard offers many advantages. It is more efficient, safer, and appeals to a broader audience, such as seniors or professionals whose work requires constant use of their hands, like factory workers and drivers.

In this blog post, we'll explore how to build a voice-to-text and text-to-speech system using FastAPI, OpenAI's Whisper, a text-to-speech engine, and AWS S3 for file storage. We'll break down the code structure, outline the key steps, highlight the essential tech stack for setting up the application, and share the insights gained from this project.

Tech Stack Overview

Before we dive into the code, let's review the tech stack and the role each piece plays in the project:

1. FastAPI: A modern, fast web framework for building APIs in Python.
2. OpenAI Whisper & GPT-4: For transcription (voice-to-text) and language model capabilities (text generation).
3. AWS S3: For storing audio files and text files securely in the cloud.
4. Docker: To containerize the application, making it easy to deploy and run on any platform.
5. Python libraries:
   - pydantic: For input validation.
   - requests: For making HTTP requests to the FastAPI server.
   - sounddevice, numpy, and scipy: For recording and handling audio locally.
   - boto3: For interacting with AWS services (like S3).
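The Dockerfile in Step 3 installs dependencies from a requirements.txt file, so for reference, here is a rough sketch of a requirements file covering both the server and the local recording client. Packages beyond the stack listed above (uvicorn for serving, python-dotenv for the .env file, python-multipart for form/file parsing) are our assumptions; pin versions as needed.

```text
# Web API
fastapi
uvicorn[standard]
pydantic
python-dotenv
python-multipart   # needed by FastAPI to parse Form / UploadFile inputs
openai
boto3

# Local recording client
requests
sounddevice
numpy
scipy
```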

Work Steps

Step 1: FastAPI Setup

We begin by setting up the FastAPI application. This involves creating a simple API with a /api/voicebot/ endpoint that supports both voice and text inputs.

FastAPI Application


```python
from fastapi import FastAPI, HTTPException, Form, UploadFile, File
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import Optional
from io import BytesIO
import os
import uuid
import boto3
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Set up CORS middleware to allow requests from any origin
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Input and Output models
class QueryInput(BaseModel):
    input_wav_url: str

class QueryResponse(BaseModel):
    output_wav_url: Optional[str] = None
    return_text: str
```
 

Here, we configure FastAPI with CORS middleware to allow requests from any origin. We also define input and output models (QueryInput and QueryResponse) to keep the API responses clean and well typed.

Step 2: Voicebot Endpoint

The /api/voicebot/ endpoint processes both text and audio. If the client uploads an audio file, it is first transcribed using OpenAI's Whisper model. Otherwise, if text is provided, it goes directly to GPT-4 for processing.

 
```python
@app.post("/api/voicebot/", response_model=QueryResponse)
async def voicebot_endpoint(
    audio: UploadFile = File(None),  # Audio file (optional)
    text: str = Form(None)           # Text input (optional)
):
    # Initialize AWS S3 client
    s3 = boto3.client('s3')
    bucket_name = 'voicebot-text-to-speech'

    if audio:
        # Process uploaded audio
        content = await audio.read()
        file_extension = os.path.splitext(audio.filename)[1]
        file_name = f"input_{uuid.uuid4()}{file_extension}"

        # Upload audio to S3
        file_buffer = BytesIO(content)
        s3.upload_fileobj(file_buffer, bucket_name, file_name)

        # Transcribe audio using Whisper
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=(audio.filename, content),
            response_format="text"
        )
    elif text:
        transcript = text
    else:
        raise HTTPException(status_code=400, detail="Audio or text input required")

    # Send transcript to GPT-4
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": transcript}
        ]
    )
    gpt_response = response.choices[0].message.content

    # Convert the GPT-4 response to speech
    speech_response = client.audio.speech.create(
        model="tts-1",
        voice="nova",
        input=gpt_response
    )

    # Upload speech to S3
    speech_file = BytesIO()
    for chunk in speech_response.iter_bytes(chunk_size=4096):
        speech_file.write(chunk)
    speech_file.seek(0)
    speech_file_name = f"speech_{uuid.uuid4()}.mp3"
    s3.upload_fileobj(speech_file, bucket_name, speech_file_name)

    # Generate a pre-signed URL to access the speech file
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket_name, 'Key': speech_file_name},
        ExpiresIn=3600
    )

    return QueryResponse(output_wav_url=url, return_text=gpt_response)
```
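Before containerizing, it is worth sanity-checking the text branch of the endpoint. Here is a minimal sketch (assuming the API is running locally on port 8000 and your AWS and OpenAI credentials are configured) that sends plain text as form data and prints the response:

```python
import requests

# Hit the /api/voicebot/ endpoint with text only; no audio file is attached
resp = requests.post(
    "http://localhost:8000/api/voicebot/",
    data={"text": "What is the capital of France?"},
    timeout=120,  # transcription-free, but TTS generation can still take a while
)
resp.raise_for_status()

body = resp.json()
print(body["return_text"])     # GPT-4's answer
print(body["output_wav_url"])  # Pre-signed S3 link to the spoken reply
```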
 

Step 3: Dockerizing the Application

Containerizing the application with Docker ensures that the API can run anywhere, from your local machine to a cloud platform like AWS or Railway.


 Dockerfile
 
```dockerfile
FROM python:3.9.6

WORKDIR /code

COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade -r requirements.txt

COPY ./app/ ./app

EXPOSE 8000

RUN useradd -m appuser
USER appuser

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

This Dockerfile sets up the Python environment, installs dependencies, and uses uvicorn to serve the FastAPI application.

Step 4: Recording and Sending Audio

We also implement a Python client to record audio and send it to the API. This allows us to test the voicebot locally.

Client-Side Audio Recording


```python
import sounddevice as sd
from scipy.io.wavfile import write
import requests
import numpy as np

# Recording configuration
fs = 44100    # Sample rate (Hz)
duration = 5  # Recording length (seconds)

print("Recording...")
myrecording = sd.rec(int(duration * fs), samplerate=fs, channels=1)
sd.wait()

# Save the recording as a 16-bit WAV file
audio_file = './data/audio_input.wav'
write(audio_file, fs, np.int16(myrecording * 32767))

# Send audio to the FastAPI endpoint
def send_audio(file_path):
    with open(file_path, 'rb') as f:
        files = {'audio': (file_path, f, 'audio/wav')}
        response = requests.post("http://localhost:8000/api/voicebot/", files=files)
    return response

response = send_audio(audio_file)
print(response.json())
```
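The JSON body returned by the endpoint contains return_text and a pre-signed output_wav_url for the generated speech. As a small follow-up sketch (the helper name and output path are ours, purely for illustration), the client can download that file for local playback:

```python
import requests

def save_reply_audio(api_response, out_path="./data/reply.mp3"):
    """Download the synthesized speech referenced by the voicebot response."""
    body = api_response.json()
    print("Assistant said:", body["return_text"])

    audio_url = body.get("output_wav_url")
    if audio_url:
        # The pre-signed URL is valid for one hour (ExpiresIn=3600 on the server)
        audio_bytes = requests.get(audio_url, timeout=30).content
        with open(out_path, "wb") as f:
            f.write(audio_bytes)
        print(f"Saved reply audio to {out_path}")

save_reply_audio(response)
```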
 

Step 5: Deploying to the Cloud

Once the application is tested locally, you can deploy it to services like AWS EC2, Railway, or any other container-based cloud provider.

- AWS EC2: Launch an EC2 instance, SSH into it, install Docker, and run the FastAPI container.
- Railway: Push your code to Railway, which automatically builds and deploys the app.

Challenges and Lessons Learned

It is easy to develop a proof of concept (POC) using a notebook, but building a running web app is much more challenging. In addition to learning how to write backend APIs and set up databases, we encountered many issues with building the frontend app. One of the main challenges was ensuring seamless communication between the backend and frontend without significant latency. Initially, we tried limiting GPT responses, but eventually, storing audio files in the browser cache and directly triggering the endpoint allowed us to provide a good user experience with minimal lag between voice recording and responses.

Cost: We initially deployed the API on AWS ECS, which made it straightforward to set up clusters and control HTTP inbound and outbound traffic. However, this convenience came with a high cost. With minimal traffic, our cluster generated a $10 bill over a single weekend, which we knew wasn't sustainable. We then switched to Railway.app, signing up for their hobbyist membership. With a $5 monthly fee, we were able to deploy and host the web app successfully without worrying about excessive costs.

CI/CD: During development, we built and tested the API locally, then deployed the code to a Docker container for further testing. Eventually, we deployed the Docker image on Railway.app. Manual code updates quickly became unsustainable, so we connected Railway to our GitHub repository. Now it automatically rebuilds and redeploys the FastAPI app whenever we push new code, giving us continuous deployment. The process is quite impressive.

Conclusion

In this tutorial, we've demonstrated how to build a voice bot that handles voice-to-text and text-to-speech using FastAPI, OpenAI models, and AWS S3 for storage. We also containerized the app with Docker for easy deployment. The next step is to expand the app to address specific customer issues. The app serves as a vehicle to bring our research in Retrieval-Augmented Generation (RAG), model training, and evaluation to the end user. The possibilities are endless!

There is no doubt there will be challenges and roadblocks, but having come this far, we might as well keep going. Happy coding!

The authors thank OpenAI Research Funding for providing testing tokens for this project.