GPT-4 App Prototyping: How I Built a Voice Assistant App in 60 Minutes

Michal Langmajer
9 min read · Mar 20, 2023


Edit: Read my latest article on how I challenged myself to get an app built by GPT-4 into the macOS App Store.

As a teenager, I dreamed of creating apps. I had dozens of ideas but struggled with programming.

Later, I surrounded myself with experts, which helped a lot. In a team, you can build big, robust apps by leveraging the expertise of every member. But it was still a slow and tedious process.

That is until GPT-4 arrived on the scene!

With GPT-4, it’s like having a team of experts on WhatsApp. You can now prototype simple apps in a matter of hours, with only very basic programming skills.

In this article, I'd like to walk you through how I created a voice assistant app (which I named Jarvis) that is able to:

  • Catch the activation keyword (kinda like Hey Siri!)
  • Understand my question
  • Provide me with an answer

So let's start! (00:00)

To save myself an hour or two of research, I decided to ask the AI about the process, so I asked:

These are already quite reasonable steps to follow. We know what we need to do, and the AI suggested some libraries and tools we can use later, so let's get going!

Keep in mind that if you are unsure about what the AI is referring to, you can always request more details or a guide on the next steps.

Step 1 & 2: Programming Language (04:36)

Python is the only programming language I know at least a little, and I already have PyCharm installed on my computer, so it's my go-to choice.

If you're a complete beginner, I'd recommend Python, too. It's a very beginner-friendly language with well-written documentation, plenty of tutorials, a big community, and it's quite universal.

Step 3: Keyword for Voice Activation (11:52)

Similar to “Hey Siri” or “Okay Google,” I’d like Jarvis to be capable of recognizing its activation phrase.

Originally, I tried to follow the AI's guidance and checked the Kitt-AI Snowboy library for this purpose, but I found that it's no longer maintained.

So I asked the AI for some alternatives and, without doing extensive research, went with its first suggestion, PicoVoice.ai:

  • It's free for basic use cases
  • Supports Python
  • It's well documented
  • Also includes an API for speech-to-text (which we will need later)

I generated the API key (registration required) and copy-pasted the Quick Start code into my JarvisAI Python file:

import pvporcupine

porcupine = pvporcupine.create(
    access_key='Your access key here',
    keywords=['picovoice', 'bumblebee']
)

def get_next_audio_frame():
    pass

while True:
    audio_frame = get_next_audio_frame()
    keyword_index = porcupine.process(audio_frame)
    if keyword_index == 0:
        print(0)
    elif keyword_index == 1:
        print(1)

But it didn't work. 😪 Why? The error says the voice input is missing: the get_next_audio_frame() function that's supposed to supply the audio frames is just an empty stub…

Let's provide the AI with this code and ask it to add the logic for recording audio:

GPT-4 not only wrote the code for me:

import pvporcupine
import pyaudio
import struct

porcupine = pvporcupine.create(
    access_key='your API key',
    keywords=['picovoice', 'bumblebee']
)

# Initialize PyAudio
audio = pyaudio.PyAudio()
stream = audio.open(
    rate=porcupine.sample_rate,
    channels=1,
    format=pyaudio.paInt16,
    input=True,
    frames_per_buffer=porcupine.frame_length,
)

# Main loop
print("Listening for keywords...")
try:
    while True:
        # Read audio data from the microphone
        audio_data = stream.read(porcupine.frame_length)
        audio_frame = struct.unpack_from("h" * porcupine.frame_length, audio_data)

        # Process audio frame with Porcupine
        keyword_index = porcupine.process(audio_frame)

        if keyword_index == 0:
            print("Keyword 0 detected!")
        elif keyword_index == 1:
            print("Keyword 1 detected!")
finally:
    # Clean up resources
    stream.stop_stream()
    stream.close()
    audio.terminate()
    porcupine.delete()

But it also gave me the following explanation and instructions for installing the necessary library:
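The short version: the new dependency is PyAudio, which handles microphone access. Assuming a standard Python setup, pip install pyaudio takes care of it (pip install pvporcupine covers the keyword engine, and on macOS or Linux you may also need the PortAudio system library first).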

I've tested the code, and it's working! 🎉 The app correctly detects the keywords in my speech. Let's change the trigger word to “Jarvis” (the name of our AI assistant) and move on!
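For reference, that's a one-line change in the pvporcupine.create() call, and it's exactly what the later snippets use ("jarvis" is one of Porcupine's built-in keywords):

porcupine = pvporcupine.create(
    access_key='your API key',
    keywords=['jarvis']  # built-in keyword, no custom wake-word model needed
)

With a single keyword in the list, porcupine.process() returns 0 when it's detected and -1 otherwise, which is why the later code only checks for index 0.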

Step 4: Speech-to-text (27:09)

We’ve got the trigger word detection working, but now we also have to record the user’s voice and turn it into text. PicoVoice.ai has our backs with an API for exactly this purpose: the Leopard speech-to-text engine.
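Stripped of everything else, the flow we're after looks roughly like this; a minimal sketch assuming you already have a recorded WAV file:

import pvleopard

# Create the engine with the same PicoVoice access key
leopard = pvleopard.create(access_key='your API key')

# Transcribe a pre-recorded audio file; returns the text plus word-level metadata
transcript, words = leopard.process_file('recorded_audio.wav')
print(transcript)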

First, I tried to incorporate the function into my existing code by myself, but unfortunately I wasn't successful on the first try. Luckily, my AI buddy doesn't judge me. 😅

My first attempt didn't work. The code threw an exception while recording audio (judging by the fix below, most likely an input overflow from PyAudio). So I simply copy-pasted the error message to the AI and asked again:

The AI politely apologised for the mistake and provided me with updated code, which actually worked! 🎉

Thanks to the comments, we can get a better idea of what's going on and understand the code.

import os
import pvleopard as pvleopard
import pvporcupine
import pyaudio
import struct
import wave

porcupine = pvporcupine.create(
    access_key='your API key',
    keywords=['jarvis']
)

leopard = pvleopard.create(access_key='your API key')

# Initialize PyAudio
audio = pyaudio.PyAudio()
stream = audio.open(
    rate=porcupine.sample_rate,
    channels=1,
    format=pyaudio.paInt16,
    input=True,
    frames_per_buffer=porcupine.frame_length * 2,
)

def record_audio(filename, duration):
    frames = []

    for _ in range(0, int(porcupine.sample_rate / porcupine.frame_length * duration)):
        audio_data = stream.read(porcupine.frame_length, exception_on_overflow=False)
        audio_frame = struct.unpack_from("h" * porcupine.frame_length, audio_data)
        frames.append(audio_data)

    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(audio.get_sample_size(pyaudio.paInt16))
        wf.setframerate(porcupine.sample_rate)
        wf.writeframes(b''.join(frames))

# Main loop
print("Listening for keywords...")
try:
    while True:
        # Read audio data from the microphone
        audio_data = stream.read(porcupine.frame_length, exception_on_overflow=False)
        audio_frame = struct.unpack_from("h" * porcupine.frame_length, audio_data)

        # Process audio frame with Porcupine
        keyword_index = porcupine.process(audio_frame)

        if keyword_index == 0:
            print("Keyword detected! Recording speech...")

            # Record speech for a fixed duration
            duration_seconds = 5
            audio_file = "recorded_audio.wav"
            record_audio(audio_file, duration_seconds)

            # Transcribe the recorded speech using Leopard
            print("Transcribing speech...")
            transcript, words = leopard.process_file(os.path.abspath(audio_file))
            print("Transcript:", transcript)

            # Remove the audio file if you don't need it
            os.remove(audio_file)

finally:
    # Clean up resources
    stream.stop_stream()
    stream.close()
    audio.terminate()
    porcupine.delete()
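
If you run this version, you should be able to say "Jarvis", speak for up to five seconds (the window is hard-coded via duration_seconds), and see the transcript printed in the console.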

Step 5: Let's Make Jarvis Smart (39:13)

It’s time to bring Jarvis to life and make him intelligent. For this purpose, we'll use ChatGPT, and since I have no idea how to proceed, I'm once again turning to the AI for help:

Awesome! Now I have clear steps to follow.

In case I had trouble copy-pasting the code, I could just ask the AI to help me out by saying something like “Here’s my code <<<paste code>>> — can you make the changes you described?” Then, the AI would give me the complete code to use.

Note: After checking the OpenAI docs, I decided to use the ChatGPT model (gpt-3.5-turbo) instead of davinci, since it's better and cheaper. If you need more help with this, don't hesitate to ask the AI for assistance.
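For context, here is a rough side-by-side of the two options, using the openai Python SDK as it existed in early 2023 (the question string is just a stand-in for the transcribed speech):

import openai

openai.api_key = 'your API key'
question = "What is the capital of France?"  # stand-in for the transcript

# Older completions endpoint with a davinci model (shown for comparison only)
completion = openai.Completion.create(
    model="text-davinci-003",
    prompt="Formulate a short answer for this question: " + question,
)
print(completion.choices[0].text)

# Chat completions endpoint with gpt-3.5-turbo: cheaper, and what the code below uses
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Formulate a short answer for this question: " + question}],
)
print(chat.choices[0].message.content)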

Here is the updated code:

import os
import pvleopard as pvleopard
import pvporcupine
import pyaudio
import struct
import wave
import openai

porcupine = pvporcupine.create(
    access_key='your API key',
    keywords=['jarvis']
)

leopard = pvleopard.create(access_key='your API key')

openai.api_key = 'your API key'

# Initialize PyAudio
audio = pyaudio.PyAudio()
stream = audio.open(
    rate=porcupine.sample_rate,
    channels=1,
    format=pyaudio.paInt16,
    input=True,
    frames_per_buffer=porcupine.frame_length * 2,
)

def record_audio(filename, duration):
    frames = []

    for _ in range(0, int(porcupine.sample_rate / porcupine.frame_length * duration)):
        audio_data = stream.read(porcupine.frame_length, exception_on_overflow=False)
        audio_frame = struct.unpack_from("h" * porcupine.frame_length, audio_data)
        frames.append(audio_data)

    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(audio.get_sample_size(pyaudio.paInt16))
        wf.setframerate(porcupine.sample_rate)
        wf.writeframes(b''.join(frames))

# Main loop
print("Listening for keywords...")
try:
    while True:
        # Read audio data from the microphone
        audio_data = stream.read(porcupine.frame_length, exception_on_overflow=False)
        audio_frame = struct.unpack_from("h" * porcupine.frame_length, audio_data)

        # Process audio frame with Porcupine
        keyword_index = porcupine.process(audio_frame)

        if keyword_index == 0:
            print("Keyword detected! Recording speech...")

            # Record speech for a fixed duration
            duration_seconds = 5
            audio_file = "recorded_audio.wav"
            record_audio(audio_file, duration_seconds)

            # Transcribe the recorded speech using Leopard
            print("Transcribing speech...")
            transcript, words = leopard.process_file(os.path.abspath(audio_file))
            print("Transcript:", transcript)

            # These new lines take the transcript and send it to the OpenAI API.
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "assistant",
                           "content": ("Formulate a short answer for this question:" + transcript)}],
                temperature=0.6,
            )

            # Then print the text from the response.
            print(response.choices[0].message.content)

            # Remove the audio file if you don't need it
            os.remove(audio_file)

finally:
    # Clean up resources
    stream.stop_stream()
    stream.close()
    audio.terminate()
    porcupine.delete()

You can now run the code and ask questions just like you're used to asking ChatGPT. It works!

Let's recap:

  • Jarvis gets activated when you say the trigger word ✅
  • It understands what you say and transcribes it into text ✅
  • It can reply to your questions ✅
  • It's quiet. It can't speak. 😢 Well, let's fix this.

Step 6: Jarvis, Speak! (51:32)

I guess you already know the process… Here is the question I asked:

GPT-4 pointed me to an easy-to-use, working library and also gave me the exact steps for adjusting my code to make the voice work. 🤯
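The library in question is pyttsx3, an offline text-to-speech wrapper (installable with pip install pyttsx3). Its core usage, which the final code builds on, is just:

import pyttsx3

engine = pyttsx3.init()           # initialize the TTS engine (uses the OS voices)
engine.say("Hello, I'm Jarvis.")  # queue a phrase
engine.runAndWait()               # block until the phrase has been spoken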

Here is the final working code of Jarvis, my personal voice assistant:

import os
import pvleopard as pvleopard
import pvporcupine
import pyaudio
import struct
import wave
import openai
import pyttsx3

porcupine = pvporcupine.create(
    access_key='your API key',
    keywords=['jarvis']
)

leopard = pvleopard.create(access_key='your API key')

openai.api_key = 'your API key'

# Initialize the voice library
engine = pyttsx3.init()

# Say a fun welcome message with instructions for the user
engine.say("Hello, I'm Jarvis, your personal assistant. Say my name and ask me anything:")
engine.runAndWait()

# Initialize PyAudio
audio = pyaudio.PyAudio()
stream = audio.open(
    rate=porcupine.sample_rate,
    channels=1,
    format=pyaudio.paInt16,
    input=True,
    frames_per_buffer=porcupine.frame_length * 2,
)

def record_audio(filename, duration):
    frames = []

    for _ in range(0, int(porcupine.sample_rate / porcupine.frame_length * duration)):
        audio_data = stream.read(porcupine.frame_length, exception_on_overflow=False)
        audio_frame = struct.unpack_from("h" * porcupine.frame_length, audio_data)
        frames.append(audio_data)

    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(audio.get_sample_size(pyaudio.paInt16))
        wf.setframerate(porcupine.sample_rate)
        wf.writeframes(b''.join(frames))

# Main loop
print("Listening for keywords...")
try:
    while True:
        # Read audio data from the microphone
        audio_data = stream.read(porcupine.frame_length, exception_on_overflow=False)
        audio_frame = struct.unpack_from("h" * porcupine.frame_length, audio_data)

        # Process audio frame with Porcupine
        keyword_index = porcupine.process(audio_frame)

        if keyword_index == 0:
            print("Keyword detected! Recording speech...")

            # Record speech for a fixed duration
            duration_seconds = 5
            audio_file = "recorded_audio.wav"
            record_audio(audio_file, duration_seconds)

            # Transcribe the recorded speech using Leopard
            print("Transcribing speech...")
            transcript, words = leopard.process_file(os.path.abspath(audio_file))
            print("Transcript:", transcript)

            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "assistant",
                           "content": ("Formulate a very short reply for the question. Here is the question:" + transcript)}],
                temperature=0.6,
            )

            print(response.choices[0].message.content)

            # This line reads the OpenAI response out loud
            pyttsx3.speak(response.choices[0].message.content)

            # Remove the audio file if you don't need it
            os.remove(audio_file)

finally:
    # Clean up resources
    stream.stop_stream()
    stream.close()
    audio.terminate()
    porcupine.delete()

And That’s a Wrap Folks… (57:01)

That's it, we've done it! Check out the demo video:

Instead of spending 2–3 hours writing documentation of what I need, asking for a developer's allocation, and waiting a few days for the result, I had an MVP ready by the end of my lunch break.

In under an hour, we have a working prototype of an AI voice assistant, which we could even turn into a website or an iOS app with just a few more questions, then start testing with users, collect feedback, and continue development accordingly.

Looks like we’ve officially entered the age of Artificial Intelligence!

For more content like this, follow me on Twitter: @michallangmajer
