Bored of building the same text-based chatbots that just... chat? 🥱
Yeah, same here.
What if you could talk to your AI model and have it control Gmail, Notion, Google Sheets, or any other application you use without touching your keyboard?

If that sounds like something you want to build, stick around till the end. It’s gonna be fun.
Let’s build it all, step by step. It's going to be a bit lengthy, but it will be worth it. ✌️
What’s Covered?
In this tutorial, you will learn:
How to work with Speech Recognition in Next.js
How to power your voice AI agent with multiple SaaS apps like Gmail, Google Docs, etc., using Composio
And most importantly, how to code all of it to complete a web app
If you're impatient, here is the GitHub link for the AI Voice Assistant Chatbot
Want to know how it turns out? Check out this quick demo where I've used Gmail and Google Sheets together! 👇
Project Setup 👷
Initialize a Next.js Application
🙋‍♂️ In this section, we'll complete all the prerequisites for building the project.
Initialize a new Next.js application with the following command:
ℹ️ You can use any package manager of your choice. For this project, I will use npm.
Next, navigate into the newly created Next.js project:
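The two commands look like this (the project name voice-assistant is just a placeholder; pick whatever you like and accept the default prompts):

```shell
# Scaffold a new Next.js app (the name is up to you)
npx create-next-app@latest voice-assistant

# Move into the project directory
cd voice-assistant
```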
Install Dependencies
We need some dependencies. Run the following command to install them all:
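Assuming npm, a single install command covers everything:

```shell
npm install composio-core zustand openai framer-motion react-speech-recognition use-debounce
```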
Here's what they are used for:
composio-core: Integrates tools into the agent
zustand: A simple library for state management
openai: Provides AI-powered responses
framer-motion: Adds smooth animations to the UI
react-speech-recognition: Enables speech recognition
use-debounce: Adds debounce to the voice input
Configure Composio
We'll use Composio to add integrations to our application. You can choose any integration you like, but make sure to authenticate first.
Before moving forward, you need to obtain a Composio API key.
Go ahead and create an account on Composio, get your API key, and paste it in the .env file in the root of the project.
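Your .env will end up looking something like this (the variable names here are assumptions; match whatever your code reads from process.env):

```env
# Assumed variable names — align them with your own code
COMPOSIO_API_KEY=your_composio_api_key
OPENAI_API_KEY=your_openai_api_key
```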

Install and Set Up Shadcn/UI
Shadcn/UI comes with many ready-to-use UI components, so we'll use it for this project. Initialize it with the default settings by running:
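The init command looks like this (accept the defaults when prompted):

```shell
npx shadcn@latest init
```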
We will need a few UI components, but we won't focus heavily on the UI side for the project. We'll keep it simple and concentrate mainly on the logic.
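To pull in the five components we'll use, a single add command does it:

```shell
npx shadcn@latest add button dialog input label separator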
This should add five different files in the components/ui directory called button.tsx, dialog.tsx, input.tsx, label.tsx, and separator.tsx.
Code Implementation
🙋‍♂️ In this section, we'll cover all the coding needed to create the chat interface, work with Speech Recognition, and connect it with Composio tools.
Add Helper Functions
Before coding the project logic, let's start by writing some helper functions and constants that we will use throughout the project.
Let's begin by setting up some constants. Create a new file called constants.ts in the root of the project and add the following lines of code:
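Here's a sketch of what constants.ts can hold. The CONFIG values and the alias suggestions below are assumptions you can adapt to your own integrations:

```typescript
// Hypothetical sketch of constants.ts — names and values are assumptions.

// Settings shared across the app (the TTS route reads its model/voice from here).
export const CONFIG = {
  ttsModel: "tts-1",
  ttsVoice: "alloy",
  maxToolTurns: 5,
};

// Suggested alias key names per integration, shown in the settings modal.
export const SUGGESTED_ALIASES: Record<string, string[]> = {
  Discord: ["Gaming Channel ID"],
  Gmail: ["Work Email"],
  "Google Sheets": ["Budget Sheet ID"],
};
```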
These are just some constants we'll use throughout the project. You might be wondering what the alias names we're passing in here are for.
Basically, these are alias key names used to hold key-value pairs. For example, when working with Discord, you might have an alias like 'Gaming Channel ID' that holds the ID of your gaming channel.
We use these aliases because it's not practical to dictate raw IDs, email addresses, and similar values by voice. Setting up aliases lets you refer to them easily.
🗣️ Say "Can you summarize the recent chats in my gaming channel?" and it will use the relevant alias to pass to the LLM, which in turn calls the Composio API with the relevant fields.
If you're confused right now, no worries. Follow along, and you'll soon figure out what this is all about.
Now, let's work on setting up the store that will hold all the aliases that we will store in localStorage. Create a new file called alias-store.ts in the lib directory and add the following lines of code:
If you've used Zustand before, this setup should feel familiar. If not, here's a quick breakdown: we have an Alias type that holds a key-value pair and an AliasState interface that represents the full alias state along with functions to add, edit, or remove an alias.
Each alias is grouped under an integration name (such as "Slack" or "Discord"), making it easy to manage them by service. These are stored in an aliases object using the IntegrationAliases type, which maps integration names to arrays of aliases.
We use the persist middleware to persist the aliases so they don't get lost when reloading the page, and this will come in really handy.
The use of createJSONStorage ensures the state is serialized and stored in localStorage under the key "voice-agent-aliases-storage".
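Stripped of the Zustand wiring, the add, edit, and remove operations boil down to pure object updates. Here's a framework-free sketch (the function names are my own); in the real store, this logic lives inside create() wrapped with the persist middleware:

```typescript
// Framework-free sketch of the alias operations from alias-store.ts.
export type Alias = { name: string; value: string };
export type IntegrationAliases = Record<string, Alias[]>;

// Append an alias under an integration, creating the group if needed.
export function addAlias(all: IntegrationAliases, integration: string, alias: Alias): IntegrationAliases {
  const existing = all[integration] ?? [];
  return { ...all, [integration]: [...existing, alias] };
}

// Update the value of a named alias within an integration.
export function editAlias(all: IntegrationAliases, integration: string, name: string, value: string): IntegrationAliases {
  const updated = (all[integration] ?? []).map((a) => (a.name === name ? { ...a, value } : a));
  return { ...all, [integration]: updated };
}

// Remove a named alias from an integration's group.
export function removeAlias(all: IntegrationAliases, integration: string, name: string): IntegrationAliases {
  const remaining = (all[integration] ?? []).filter((a) => a.name !== name);
  return { ...all, [integration]: remaining };
}
```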
ℹ️ For this tutorial, I've kept it simple and stored everything in localStorage, which should be fine. But if you're interested, you could even set up a database and store it there.
Now, let's add another helper function that will return the correct error message and status code based on the error our application throws.
Create a new file called error-handler.ts in the lib directory and add the following lines of code:
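Here's a minimal sketch of what that file can contain (the exact shape is an assumption, but it matches the description below):

```typescript
// Hypothetical sketch of lib/error-handler.ts.

// Custom error class that carries an HTTP status code alongside the message.
export class AppError extends Error {
  constructor(message: string, public statusCode = 500) {
    super(message);
    this.name = "AppError";
  }
}

// Maps any thrown value to a user-facing message and an HTTP status code.
export function handleApiError(error: unknown): { message: string; status: number } {
  if (error instanceof AppError) return { message: error.message, status: error.statusCode };
  if (error instanceof Error) return { message: error.message, status: 500 };
  return { message: "An unexpected error occurred", status: 500 };
}
```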
We define our own Error class called AppError that extends the built-in Error class. We will not use this in our program just yet, as we don't need to throw an error in any API endpoint.
However, you can use it if you ever need to extend the application's functionality and throw an error.
handleApiError is pretty simple: it takes in the error and returns a message and status code based on the error type.
Finally, let's wrap up the helper functions by writing a Zod validator for validating user input.
Create a new file called message-validator.ts in the lib directory and add the following lines of code:
This Zod schema validates an object with a message string and an aliases record, where each key maps to an array of { name, value } string pairs.
This is going to look something like this:
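For instance, a validated payload could look like the following (all alias names and values here are made up):

```json
{
  "message": "Send the budget summary to my work email",
  "aliases": {
    "Gmail": [{ "name": "Work Email", "value": "me@example.com" }],
    "Google Sheets": [{ "name": "Budget Sheet ID", "value": "example-sheet-id" }]
  }
}
```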
The idea is that for each message sent, we will send the message along with all the set-up aliases. We will then use an LLM to determine which aliases are needed to handle the user's query.
Now that we're done with the helper functions, let's move on to the main application logic. 🎉
Create Custom Hooks
We will create a few hooks for working with audio, speech recognition, and related browser APIs.
Create a new directory called hooks in the root of the project, then create a new file inside it called use-speech-recognition.ts and add the following lines of code:
We’re using react-speech-recognition to handle voice input and add debounce so we don’t trigger actions on every tiny change.
Basically, whenever the transcript stops changing for a bit (debounceMs) and it's different from the last one we processed, we stop listening, call onTranscriptComplete, and reset the transcript.
startListening clears old data and starts speech recognition in continuous mode. And stopListening... well, stops it. 🥴
That’s it. It's a simple hook to manage speech input with debounce, so it doesn't submit the instant we stop speaking but waits a beat instead.
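If you want to see the timing logic in isolation, here's a framework-free sketch of the debounce decision with an injectable clock (the class and method names are my own, not the hook's API):

```typescript
// The debounce decision at the heart of the hook, shown framework-free.
export class TranscriptDebouncer {
  private lastChange = 0;
  private lastProcessed = "";
  private current = "";

  constructor(private debounceMs: number) {}

  // Record the latest transcript and the time it changed.
  update(transcript: string, now: number) {
    if (transcript !== this.current) {
      this.current = transcript;
      this.lastChange = now;
    }
  }

  // Returns the finished transcript once it has been stable for debounceMs
  // and differs from the last one we processed; otherwise null.
  poll(now: number): string | null {
    const stable = now - this.lastChange >= this.debounceMs;
    if (stable && this.current && this.current !== this.lastProcessed) {
      this.lastProcessed = this.current;
      return this.current;
    }
    return null;
  }
}
```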
Now that we've covered handling speech input, let's move on to audio. Create a new file called use-audio.ts and add the following lines of code:
Its job is simple: to play or stop audio using the Web Audio API. We’ll use it to handle audio playback for the speech generated by OpenAI’s TTS.
The playAudio function takes in user input (text), sends it to an API endpoint (/api/tts), gets the audio response, decodes it, and plays it in the browser. It uses AudioContext under the hood and manages state, like whether the audio is currently playing, through isPlaying. We also expose a stopAudio function to stop playback early if needed.
We have not yet implemented the /api/tts route, but we will do it shortly.
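For reference, the playback path inside the hook boils down to something like this browser-only sketch (the function name and endpoint payload are assumptions):

```typescript
// Hypothetical sketch of the playback flow inside use-audio.ts.
// Browser-only: AudioContext comes from the Web Audio API.
export async function playText(text: string, ctx: AudioContext): Promise<AudioBufferSourceNode> {
  // Ask our TTS route for the MP3 bytes of the given text.
  const res = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  // Decode the audio and pipe it to the speakers.
  const audio = await ctx.decodeAudioData(await res.arrayBuffer());
  const source = ctx.createBufferSource();
  source.buffer = audio;
  source.connect(ctx.destination);
  source.start();
  return source; // keep a handle so stopAudio can call source.stop()
}
```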
Now, let's implement another hook for working with chats; we'll use it to manage all the messages.
This one’s pretty straightforward. We use useChat to manage a simple chat flow — it keeps track of all messages and whether we’re currently waiting for a response.
When sendMessage is called, it adds the user’s input to the chat, hits our /api/chat route with the message and any aliases, and then updates the messages with whatever the assistant replies. If it fails, we just drop in a fallback error message instead. That’s it.
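That flow can be sketched outside React with an injectable fetch, which also makes it easy to test (the names and the { response } shape are assumptions):

```typescript
// Framework-free sketch of the sendMessage flow from use-chat.ts.
export type ChatMessage = { role: "user" | "assistant"; content: string };

type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ json(): Promise<{ response: string }> }>;

export async function sendMessage(
  messages: ChatMessage[],
  input: string,
  aliases: Record<string, { name: string; value: string }[]>,
  fetchImpl: FetchLike,
): Promise<ChatMessage[]> {
  // Add the user's input to the chat first.
  const next = [...messages, { role: "user", content: input } as ChatMessage];
  try {
    // Hit the chat route with the message and any aliases.
    const res = await fetchImpl("/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message: input, aliases }),
    });
    const data = await res.json();
    return [...next, { role: "assistant", content: data.response }];
  } catch {
    // Drop in a fallback message instead of surfacing the raw error.
    return [...next, { role: "assistant", content: "Sorry, something went wrong." }];
  }
}
```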
We're pretty much done with the hooks. Since we've modularized everything into hooks, why not create one more small helper hook to check whether a component has mounted?
Create a new file called use-mounted.ts and add the following lines of code:
Just a tiny hook to check if the component has mounted on the client. Returns true after the first render, handy for skipping SSR-specific stuff.
Finally, after working on four hooks, we are done with the hooks setup. Let's move on to building the API.
Build the API Logic
Great, now it makes sense to work on the API part and then move on to the UI.
Head to the app directory and create two different routes: api/tts and api/chat.
Great, now let's implement the /tts route. Create a new file called route.ts in the api/tts directory and add the following lines of code:
This is our /api/tts route that takes in some text and generates an MP3 audio using OpenAI’s TTS API. We grab the text from the request body, call OpenAI with the model and voice we've set in CONFIG, and get back a streamable MP3 blob.
The important thing is that OpenAI returns an arrayBuffer, so we first convert it to a Node Buffer before sending the response back to the client.
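Put together, a sketch of the route could look like this. Note the assumptions: the article's version uses the openai SDK and reads the model and voice from CONFIG, while this sketch calls OpenAI's /v1/audio/speech endpoint directly with stand-in values:

```typescript
// Hypothetical sketch of app/api/tts/route.ts.

// Wrap raw MP3 bytes in a Response the browser can decode.
export function mp3Response(audio: ArrayBuffer): Response {
  // OpenAI hands back an arrayBuffer; convert it to a Node Buffer for the body.
  return new Response(Buffer.from(audio), {
    status: 200,
    headers: { "Content-Type": "audio/mpeg" },
  });
}

export async function POST(req: Request) {
  const { text } = await req.json();
  // Stand-in model/voice values; the real project reads them from CONFIG.
  const upstream = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "tts-1", voice: "alloy", input: text }),
  });
  return mp3Response(await upstream.arrayBuffer());
}
```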
Now, the main logic of our application comes into play, which is to identify if the user is requesting a tool call. If so, we find any relevant alias; otherwise, we generate a generic response.
Create a new file called route.ts in the api/chat directory and add the following lines of code:
First, we parse the request body and validate it using Zod (messageSchema). If it passes, we check whether the message needs tool usage with checkToolUseIntent(). If not, it’s just a regular chat, and we pass the message to the LLM (llm.invoke) and return the response.
If tool use is needed, we pull the available apps out of the user's saved aliases and then try to figure out which apps are actually being referred to in the message using identifyTargetApps().
Once we know which apps are in play, we filter the aliases for only those apps and send them through findRelevantAliases(). This uses the LLM again to guess which ones are relevant based on the message. If we find any, we add them to the message as a context block (--- Relevant Parameters ---) so the LLM knows what it’s working with. From here, the heavy lifting is done by executeToolCallingLogic(). This is where the magic happens. We:
fetch tools from Composio for the selected apps,
start a little conversation history,
call the LLM and check if it wants to use any tools (via tool_calls),
execute each tool,
and push the results back into the convo.
We keep doing this in a loop (max N times), and finally ask the LLM for a clean summary of what just happened.
That’s basically it. Long story short, it's like:
💡 “Do we need tools? No? Chat. Yes? Find apps → match aliases → call tools → summarize.”
This is the heart of our application. If you've understood this, all that's left is working with the UI.
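The loop at the core of executeToolCallingLogic can be sketched framework-free, with the LLM and tool executor injected (all names and shapes here are my own assumptions, not the repo's exact API):

```typescript
// Framework-free sketch of the tool-calling loop described above.
type ToolCall = { name: string; args: unknown };
type LlmReply = { content: string; toolCalls: ToolCall[] };
type Llm = (history: string[]) => Promise<LlmReply>;
type ExecuteTool = (call: ToolCall) => Promise<string>;

export async function runToolLoop(
  message: string,
  llm: Llm,
  executeTool: ExecuteTool,
  maxTurns = 5,
): Promise<string> {
  // Start a little conversation history with the user's message.
  const history = [message];
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await llm(history);
    // No tool calls means the LLM is done: return its answer.
    if (reply.toolCalls.length === 0) return reply.content;
    // Otherwise, execute each tool and push the result back into the convo.
    for (const call of reply.toolCalls) {
      const result = await executeTool(call);
      history.push(`tool ${call.name} -> ${result}`);
    }
  }
  // Out of turns: ask for a clean summary of what just happened.
  const summary = await llm([...history, "Summarize what happened."]);
  return summary.content;
}
```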
If you're following along, try building it yourself. If not, let's keep going!
Integrate With the UI
Let's start with the modal where the user assigns aliases, as we've discussed. We'll use the aliasStore to access the aliases and the functions to add, edit, and remove them.
It's going to be pretty straightforward to understand, as we've already worked on the logic; this is just wiring the logic we've already written into the UI.
Create a new file called settings-modal.tsx in the components directory and add the following lines of code:
Great, now that the modal is done, let's implement the component that will be responsible for displaying all the messages in the UI.
Create a new file called chat-messages.tsx in the components directory and add the following lines of code:
This component will receive all the messages and the isLoading prop, and all it does is display them in the UI.
The only interesting part of this code is the messagesEndRef, which we're using to scroll to the bottom of the messages when new ones are added.
Great, so now that displaying the messages is set up, it makes sense to work on the input where the user will send the messages either through voice or with chat.
Create a new file called chat-input.tsx in the components directory and add the following lines of code:
This component takes quite a few props, but they're mostly related to voice input.
Its main job is to call the handleSubmit function that's passed in as a prop.
We are also passing the browserSupportsSpeechRecognition prop to the component because many browsers (including Firefox) still do not support the Web Speech API. In such cases, the user can only interact with the bot through chat.
Since we're writing very reusable code, let's write the header in a separate component as well, because why not?
Create a new file called chat-header.tsx in the components directory and add the following lines of code:
This component is very simple; it's just a header with a title and a settings button.
Cool, so now let's put all of these UI components together in a separate component, which we'll display in the page.tsx, and that concludes the project.
Create a new file called chat-interface.tsx in the components directory and add the following lines of code:
And again, this is pretty straightforward. The first thing we need to do is check whether the component is mounted, since everything here relies on browser-specific APIs and must run on the client. Then we extract all the fields from useSpeechRecognitionWithDebounce and, based on whether the browser supports speech recognition, show conditional UI.
Once the transcription is done, we send the message text to the handleProcessMessage function, which in turn calls the sendMessage function, which, as you remember, sends the message to our /api/chat API endpoint.
Finally, update the page.tsx in the root directory to display the ChatInterface component.
And with this, our entire application is done! 🎉
By the way, I have built another similar MCP-powered chat application that can connect with both remotely hosted and locally hosted MCP servers! If it sounds interesting, check it out: 👇
Conclusion
Wow, this was a lot of work, but it was worth it. Imagine being able to control all your apps with just your voice. How cool is that? 😎
And to be honest, it's ready for you to use in your daily workflow, and I'd suggest you do exactly that.
This was fun to build. 👀
You can find the entire source code here: AI Voice Assistant