Phone#

Making automated phone calls using Twilio, VoiceStream, and LangChain is straightforward. This tutorial walks through making automated outgoing calls, but the same endpoint can also handle incoming calls; only the setup within Twilio is a bit different.

Follow the instructions below from the GitHub example directory.

This app demonstrates a basic integration with Twilio. It can make a call to any number, and then provide the user with a voice interface to GPT-4 during the call.

App Server Setup#

Python Setup#

The recommended way to run this example is to set up a virtual environment and install your dependencies there.

Create your virtualenv: python -m venv .venv
Run source .venv/bin/activate
Run pip install -r requirements.txt

Set up Google APIs and Credentials#

You will need to set up Google credentials and APIs to use the server. See the quickstart. If you have already done the quickstart, copy your google_cred.json file to this example directory.

Set up Twilio account and get a phone number#

To place calls using Twilio, you need an account with credentials and a phone number. A new account comes with a default phone number, which can only be used to call verified numbers. That is enough for this demo, or you can purchase a number.

Use the Twilio console to set up your account.

Create a .env file with your account details#

Rename the .env.example file to .env
Add your OPENAI_API_KEY to the file
Add your Twilio account information to the file.

On startup, load_dotenv reads all of these variables in as environment variables.
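Before starting the server, it can help to confirm that the variables are actually set. The following is a minimal sketch and not part of the example app: OPENAI_API_KEY, TWILIO_PHONE_NUMBER, and DOMAIN appear in this tutorial, while the helper name missing_env_vars is hypothetical. The Twilio credential variables are not listed here; add them using whatever names your .env.example uses.

```python
import os
from typing import Iterable, List, Mapping

# Names taken from this tutorial. Extend this tuple with the Twilio
# credential names from your own .env.example (not listed here).
REQUIRED_VARS = ("OPENAI_API_KEY", "TWILIO_PHONE_NUMBER", "DOMAIN")


def missing_env_vars(
    required: Iterable[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Run it after the .env file has been loaded into the environment, for example from the same process that starts the app.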

Make your server publicly available#

In order to make calls using Twilio, you must have a publicly available webhook for Twilio to call. You can use ngrok to set one up quickly and for free.

Install ngrok
Run ngrok http 8000
Ngrok will host your server at https://<domain>.ngrok-free.app
Take the domain name from the URL that ngrok generated and set it as DOMAIN in your .env file
Point a browser to the ngrok URL. Follow the directions there to connect your ngrok account.
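The handlers later in this tutorial build https://{DOMAIN}/twiml and wss://{DOMAIN}/ws, so the value in .env should be the bare host name, without a scheme or path. As a hedged illustration, the sketch below extracts that host from a full URL using the standard library. The function name domain_from_url is hypothetical, and example.ngrok-free.app is a placeholder rather than a real tunnel.

```python
from urllib.parse import urlparse


def domain_from_url(url: str) -> str:
    """Return the host part of a URL, suitable for the DOMAIN entry."""
    parsed = urlparse(url.strip())
    if not parsed.hostname:
        # Happens when the scheme is missing, e.g. "example.ngrok-free.app".
        raise ValueError(f"No host found in URL: {url!r}")
    return parsed.hostname


if __name__ == "__main__":
    # Placeholder URL; substitute the one ngrok printed for you.
    print(domain_from_url("https://example.ngrok-free.app/"))
```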

Run the server#

Run the app with:

uvicorn main:app --reload

Point your browser to: http: and you should see:

Enter a phone number and click the call button, and the app will place the call.

Twilio Walkthrough#

Initiating the call#

The HTML is fairly straightforward: the page lets you enter a phone number and has a button that sends a request to the server to initiate the call.

The handler on the server side just calls the Twilio API to initiate the call. Twilio takes a webhook URL that returns the TwiML script describing what to do on the call.

@app.get("/call")
async def outbound_call(phone):
    call_instance = await twilio_client.calls.create_async(
        from_=os.environ["TWILIO_PHONE_NUMBER"],
        to=phone,
        url=f"https://{os.environ['DOMAIN']}/twiml",
    )
    return {"callSid": call_instance.sid}
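Twilio expects phone numbers in E.164 format: a leading +, then the country code and number, with at most 15 digits in total. The handler above passes the phone value through unchanged, so a small check before calling the API may give clearer errors. The sketch below is one possible way to do that; is_e164 is a hypothetical helper, not part of the example app or the Twilio library, and it only checks the shape of the string, not whether the number exists.

```python
import re

# E.164 shape: "+", a non-zero first digit, then 1 to 14 more digits.
_E164_PATTERN = re.compile(r"\+[1-9][0-9]{1,14}")


def is_e164(phone: str) -> bool:
    """Return True if the string has the shape of an E.164 number."""
    return _E164_PATTERN.fullmatch(phone) is not None


if __name__ == "__main__":
    for candidate in ("+15551234567", "5551234567", "+1 555 123 4567"):
        print(candidate, is_e164(candidate))
```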

TwiML#

When Twilio receives the API call and makes the call, it posts to the /twiml endpoint to decide how to handle it.

@app.post("/twiml")
async def twiml_webhook():
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://{os.environ["DOMAIN"]}/ws"></Stream>
  </Connect>
  <Pause length="10"/>
</Response>
    """
    return HTMLResponse(twiml)

The TwiML for this call is very simple: a <Connect> element wraps a <Stream> element that sends the audio stream to our websocket URL.
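An f-string works well for a fixed document like this one. If the values placed in the XML ever come from outside your control, an XML builder will escape them for you. The sketch below produces the same structure with the standard library; build_stream_twiml is a hypothetical helper and is not part of the example app.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(domain: str, pause_seconds: int = 10) -> str:
    """Build TwiML that connects the call audio to wss://<domain>/ws."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", url=f"wss://{domain}/ws")
    ET.SubElement(response, "Pause", length=str(pause_seconds))
    # tostring() with encoding="unicode" returns a str without an XML
    # declaration, so the declaration is prepended by hand.
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Placeholder domain; in the app this would come from the DOMAIN variable.
    print(build_stream_twiml("example.ngrok-free.app"))
```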

Audio URL#


We will break down the VoiceStream data flow.

First, we set up the websocket to receive JSON messages from Twilio.

@app.websocket("/ws")
async def audio_websocket_endpoint(websocket: WebSocket):
    logger.info("Receiving audio for new call")
    stream = fastapi_websocket_json_source(websocket)
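For orientation, each frame arriving on this websocket is a JSON object with an event field such as connected, start, media, or stop, and media messages carry base64-encoded audio in media.payload. The sketch below uses simplified, made-up messages (real ones contain more fields, and the SID shown is a placeholder) to show how a media payload decodes to raw mu-law bytes. The helper names are hypothetical and this is not VoiceStream's own implementation.

```python
import base64
import json
from typing import List

MULAW_SAMPLE_RATE = 8000  # 8 kHz mu-law uses one byte per sample


def media_payload_to_bytes(message: dict) -> bytes:
    """Decode the base64 audio carried by a "media" message."""
    return base64.b64decode(message["media"]["payload"])


def sample_frames() -> List[str]:
    """Return made-up JSON frames shaped like Twilio stream messages."""
    # 0xFF is the mu-law encoding of silence.
    silence = base64.b64encode(b"\xff" * 160).decode("ascii")
    messages = [
        # The "connected" message carries no streamSid.
        {"event": "connected", "protocol": "Call", "version": "1.0.0"},
        {"event": "start", "sequenceNumber": "1", "streamSid": "MZ-example"},
        {"event": "media", "sequenceNumber": "2", "streamSid": "MZ-example",
         "media": {"payload": silence}},
        {"event": "stop", "sequenceNumber": "3", "streamSid": "MZ-example"},
    ]
    return [json.dumps(m) for m in messages]


if __name__ == "__main__":
    for frame in sample_frames():
        message = json.loads(frame)
        if message["event"] == "media":
            audio = media_payload_to_bytes(message)
            seconds = len(audio) / MULAW_SAMPLE_RATE
            print(f"media: {len(audio)} bytes, {seconds:.3f} s of audio")
        else:
            print(f"event: {message['event']}")
```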

Initial message handling#

Next, we filter out the ‘connected’ message, since it carries no information we need. We then check the sequence numbers to verify that the stream is valid, and use extract_value_step to extract the Twilio streamSid from the first remaining message. We will need this later to format the output audio messages.

    stream = fastapi_websocket_json_source(websocket)
    stream = filter_step(stream, lambda x: x["event"] != "connected")
    stream = twilio_check_sequence_step(stream)
    stream, twilio_sid_f = extract_value_step(stream, value=lambda x: x["streamSid"])

Separating Audio from Call Events#

We then separate the call events, such as “call started” and “call ended”, from the audio messages using a partition step. We deal with the event stream later. For now, we continue processing the audio stream by converting the Twilio media JSON messages to bytes.

    stream, event_stream = partition_step(stream, lambda x: x["event"] == "media")
    stream = twilio_media_to_audio_bytes_step(stream)

Running LangChain#

Once we have a stream of audio bytes, we use the same flow as in the quickstart to generate the LLM response and convert it to audio. In this case we use the telephony recognition model and an AudioFormat of WAV_MULAW_8KHZ, which is the format Twilio expects.

    stream = google_speech_v1_step(
        stream,
        speech_async_client,
        model="telephony",
        audio_format=AudioFormat.WAV_MULAW_8KHZ,
    )
    stream = log_step(stream, "Recognized speech")
    stream = map_step(stream, lambda x: {"query": x})
    stream = langchain_load_memory_step(stream, chain, on_completion="")
    stream = recover_exception_step(
        stream,
        Exception,
        lambda x: "Google blocked the response.  Ending conversation.",
    )
    stream = log_step(stream, "LLM Output")
    stream = google_text_to_speech_step(
        stream, text_to_speech_async_client, audio_format=AudioFormat.WAV_MULAW_8KHZ
    )
    stream = map_step(stream, lambda x: x.audio)

Audio Output#

Finally, we take the output audio, convert it into Twilio media messages, and send them out over the websocket.

    stream = audio_bytes_to_twilio_media_step(stream, twilio_sid_f)
    done = fastapi_websocket_text_sink(stream, websocket)

Handling the `stop` event#

We then handle the event stream. All we do with it is watch for a stop message and use it to close the websocket. Finally, we wait for both streams to finish.

    async def close_websocket():
        await websocket.close()

    event_stream = twilio_close_on_stop_step(event_stream, close_func=close_websocket)
    event_done = empty_sink(event_stream)

    logger.info("Streams set up")
    await asyncio.gather(done, event_done)
    logger.info("Call completed")
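To make the last two steps of the walkthrough more concrete, the plain-Python sketch below shows roughly what they amount to: wrapping audio bytes in an outbound media message tagged with the streamSid, and closing the connection when a stop event arrives, with asyncio.gather waiting on both sides. It rests on stated assumptions only: to_twilio_media, close_on_stop, and demo are hypothetical names, the SID is a placeholder, and none of this is VoiceStream's actual implementation.

```python
import asyncio
import base64
import json
from typing import AsyncIterator, Awaitable, Callable


def to_twilio_media(stream_sid: str, audio: bytes) -> str:
    """Wrap raw audio bytes in a JSON "media" text frame for Twilio."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })


async def close_on_stop(
    events: AsyncIterator[dict],
    close_func: Callable[[], Awaitable[None]],
) -> None:
    """Consume call events and invoke close_func when "stop" arrives."""
    async for event in events:
        if event.get("event") == "stop":
            await close_func()
            return


async def demo() -> list:
    closed = []

    async def fake_events():
        # Stand-in for the event stream produced by the partition step.
        yield {"event": "start", "streamSid": "MZ-example"}
        yield {"event": "stop", "streamSid": "MZ-example"}

    async def fake_close():
        # Stand-in for websocket.close().
        closed.append(True)

    async def send_audio():
        # Stand-in for the audio side of the pipeline.
        return to_twilio_media("MZ-example", b"\xff" * 4)

    # Wait on both sides of the call, as the handler does with gather.
    frame, _ = await asyncio.gather(
        send_audio(), close_on_stop(fake_events(), fake_close)
    )
    return [frame, closed]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```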

Full Call Handler#

Here is the full code for the call handler, all together.

@app.websocket("/ws")
async def audio_websocket_endpoint(websocket: WebSocket):
    logger.info("Receiving audio for new call")
    # Receive JSON messages from Twilio over the websocket.
    stream = fastapi_websocket_json_source(websocket)
    # Drop the 'connected' message, verify sequence numbers, capture the streamSid.
    stream = filter_step(stream, lambda x: x["event"] != "connected")
    stream = twilio_check_sequence_step(stream)
    stream, twilio_sid_f = extract_value_step(stream, value=lambda x: x["streamSid"])
    # Split media messages from call events, then convert media to audio bytes.
    stream, event_stream = partition_step(stream, lambda x: x["event"] == "media")
    stream = twilio_media_to_audio_bytes_step(stream)
    # Speech recognition with the telephony model and the format Twilio uses.
    stream = google_speech_v1_step(
        stream,
        speech_async_client,
        model="telephony",
        audio_format=AudioFormat.WAV_MULAW_8KHZ,
    )
    stream = log_step(stream, "Recognized speech")
    # Run the recognized text through the LangChain chain.
    stream = map_step(stream, lambda x: {"query": x})
    stream = langchain_load_memory_step(stream, chain, on_completion="")
    stream = recover_exception_step(
        stream,
        Exception,
        lambda x: "Google blocked the response.  Ending conversation.",
    )
    stream = log_step(stream, "LLM Output")
    # Convert the LLM output to audio.
    stream = google_text_to_speech_step(
        stream, text_to_speech_async_client, audio_format=AudioFormat.WAV_MULAW_8KHZ
    )
    stream = map_step(stream, lambda x: x.audio)
    # Wrap the audio in Twilio media messages and send them out over the websocket.
    stream = audio_bytes_to_twilio_media_step(stream, twilio_sid_f)
    done = fastapi_websocket_text_sink(stream, websocket)

    async def close_websocket():
        await websocket.close()

    # Close the websocket when the stop event arrives.
    event_stream = twilio_close_on_stop_step(event_stream, close_func=close_websocket)
    event_done = empty_sink(event_stream)

    logger.info("Streams set up")
    # Wait for both the audio stream and the event stream to finish.
    await asyncio.gather(done, event_done)
    logger.info("Call completed")