How to star in Planet Earth

Intro

Last week, I recorded a video of David Attenborough narrating my life.

https://x.com/charliebholtz/status/1724815159590293764?s=20

It unexpectedly went viral. Business Insider and Ars Technica wrote about it, and the video was seen by 3.7 million people.

Here’s how you can make your own.

There are three parts to what we’re going to build. In the past I’ve liked to describe AI systems as “magic boxes”: we don’t need to deeply understand how they work. We can think of a model as a magical box that takes an input, transforms it, and gives us an output.

Our code is going to need three boxes:

  • An image-to-text model that can “see” through our computer camera and describe what we’re seeing.
  • A language model that writes our script (in my case, in the style of David Attenborough)
  • A text-to-speech model that speaks

In this blog post, I’ll talk mostly about the concepts behind an interactive voice clone. At the end I’ll show you some code you can use to make your own.

A model that sees

Our first step is finding a model that can “see.”

We have two inputs for our model that can see: an image and a text prompt. The model then returns a text response.

We have a few options here, and be warned that I am biased.

Llava 13B

Llava 13B is open source, cheap, and fast enough. This is what I’d recommend. Llava runs on an A40 GPU and costs $0.000725 per second to run.

Here’s an example Llava 13B prediction:

What is this person doing?

A Llava prediction takes 1.5-3 seconds to return a response, so each request will cost about $0.0017.
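
If you want to try it yourself, a Llava call on Replicate looks roughly like the sketch below. Treat the model slug, whether you need to pin a version id, and the input field names as assumptions, and check the model page before copying it.

import replicate  # assumes REPLICATE_API_TOKEN is set in your environment

# The model slug is an assumption; you may need to append a specific
# version id from the model page, e.g. "yorickvp/llava-13b:<version>".
output = replicate.run(
    "yorickvp/llava-13b",
    input={
        "image": open("frames/latest.jpg", "rb"),
        "prompt": "What is this person doing?",
    },
)

# Llava streams its answer back in chunks; join them into one description.
description = "".join(output)
print(description)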

GPT-4-Vision

This is the model I used in my demo video. It’s smarter* than Llava, but it’s slightly slower and more expensive to run.

Example prediction:

What is this person doing?

Response (takes 2.5-4 seconds):

The person in the image is holding a red cup to their open mouth, as if they are about to eat or drink from it. They’re looking directly at the camera, and the expression on their face could be perceived as playful or humorous. They are not actually consuming the cup, but the pose suggests a mock action, perhaps for a lighthearted photo or joke.

GPT-4-Vision pricing is a bit more complicated: it’s priced by image resolution plus per output token. A 250px by 140px image costs $0.00255, plus $0.03 per 1K output tokens.

Also, there is a waitlist, and once you’re off it you’re limited to a maximum of 100 calls per day. UPDATE: I think this isn’t true anymore.
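
For reference, here’s roughly what that request looks like with the openai Python package. This is a sketch of the API as it was when I wrote this (model name gpt-4-vision-preview, image sent as a base64 data URL); it may have changed since.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Send the (downsized) webcam frame as a base64 data URL.
with open("frames/latest.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # the vision model name at the time of writing
    max_tokens=128,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is this person doing?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
            ],
        },
    ],
)

print(response.choices[0].message.content)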

Feed the model

We also need a way to feed these models images. I’d recommend using your computer’s webcam. Here’s a script I wrote (with a lot of help from GPT-4) that takes a photo from your webcam every 5 seconds and saves it to a local file. It also downsizes the images, which is important: smaller images are faster (and cheaper) for our image models to read.
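
If you just want the shape of it, here’s a minimal sketch of that loop using OpenCV (opencv-python). The output path and target width are placeholder assumptions.

import os
import time
import cv2

os.makedirs("frames", exist_ok=True)
camera = cv2.VideoCapture(0)  # the default webcam

while True:
    ok, frame = camera.read()
    if not ok:
        raise RuntimeError("Could not read a frame from the webcam")

    # Downsize the frame; smaller images are faster (and cheaper) for the
    # vision model to read. 500px wide is an arbitrary choice.
    height, width = frame.shape[:2]
    new_width = 500
    new_height = int(height * new_width / width)
    small = cv2.resize(frame, (new_width, new_height))

    # Overwrite the same file each time; the rest of the pipeline just
    # reads whatever the latest frame is.
    cv2.imwrite("frames/latest.jpg", small)
    time.sleep(5)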

We have something for our model to “see”. Let’s wire up a function that can describe the image we’re seeing live.

A model that writes a script

Next, we want a model to write our script. The output of this box is the script that David Attenborough speaks in my video.

Here’s an example of how to do this with Mistral 7B. Our prompt is:

Here is a description of an image. Can you pretend you are {narrator} and narrate it as if it is a nature documentary. Make it short. If I do anything remotely interesting, make a big deal about it!

DESCRIPTION:
The person in the image is holding a red cup up to their mouth, opening their mouth wide, and pretending to take a bite or drink from the cup.

The description here is the output from our vision model in the previous step. The response looks something like:

In this fascinating observation, we see a remarkable display of human behavior. The person in the image is holding a red cup, poised to take a bite or drink. As they open their mouth wide, we can see the intricate workings of their teeth and jaw, a testament to the incredible adaptability of the human body. It's a sight to behold, a reminder of the complexities that make us who we are.

I’d recommend limiting the max tokens returned to somewhere around 128. We want as fast a response time as possible. The output above took 2.7s, so right now we’re at 5-6ish second total response time.
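
On Replicate, that looks something like the sketch below. The model slug and input names (prompt, max_new_tokens) are assumptions based on how the Mistral models were set up when I wrote this, so double-check them against the model page.

import replicate  # assumes REPLICATE_API_TOKEN is set in your environment

narrator = "David Attenborough"

# This description would come from the vision model in the previous step.
description = (
    "The person in the image is holding a red cup up to their mouth, "
    "opening their mouth wide, and pretending to take a bite or drink from the cup."
)

prompt = f"""Here is a description of an image. Can you pretend you are {narrator} and narrate it as if it is a nature documentary. Make it short. If I do anything remotely interesting, make a big deal about it!

DESCRIPTION:
{description}"""

output = replicate.run(
    "mistralai/mistral-7b-instruct-v0.1",
    input={"prompt": prompt, "max_new_tokens": 128},  # cap tokens for speed
)

script = "".join(output)
print(script)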

Note that we can also use GPT-4-Vision to generate our response directly, and skip a step: we don’t need to ask for a description of the image and then translate that into a David Attenborough-style script. GPT-4-Vision is smart enough to do both in one prediction.

Here’s the system prompt I used:

f"""
    You are {narrator}. Narrate the picture of the human as if it is a nature documentary. Make it snarky and funny. Don't repeat yourself. Make it short. If I do anything remotely interesting, make a big deal about it!
"""

The GPT-4-Vision API returns a response in 3.5-8 seconds (but generally closer to the low end of that range).
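
Wired into the API, the one-call version looks roughly like this. Same caveats as the earlier sketch about the model name and message format, and the user text here is just a placeholder.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment
narrator = "David Attenborough"

system_prompt = f"""
    You are {narrator}. Narrate the picture of the human as if it is a nature documentary. Make it snarky and funny. Don't repeat yourself. Make it short. If I do anything remotely interesting, make a big deal about it!
"""

with open("frames/latest.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision model name at the time of writing
    max_tokens=128,
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                # Placeholder user text; the image is what really matters here.
                {"type": "text", "text": "Here is what I am doing right now."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
            ],
        },
    ],
)

script = response.choices[0].message.content
print(script)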

A model that speaks

Finally, we want our model to speak. And speak with style! We don’t want a robotic voice. We want some gusto.

There’s a bunch of options here. The highest quality output you’re going to get is through ElevenLabs’s voice cloning feature. A cheaper, open source equivalent for voice cloning is XTTS-v2. Both allow you to upload text for your speaker to say, and audio for your speaker to sound like. Then you get an output that sounds like your speaker audio.

If you’re using Eleven, use their Turbo v2 model; it has 400ms latency. Check out my play_audio() function in the codebase linked at the end of this post.
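
Here’s a minimal sketch using the elevenlabs Python package as it existed when I wrote this; the package has been reworked since, and the voice name is a placeholder for whatever you called your cloned voice.

from elevenlabs import generate, play, set_api_key

set_api_key("your-elevenlabs-api-key")  # or set the ELEVEN_API_KEY env var

# This script would come from the language model in the previous step.
script = "In this fascinating observation, we see a remarkable display of human behavior."

# "my-narrator" is a placeholder for the name of your cloned voice.
audio = generate(
    text=script,
    voice="my-narrator",
    model="eleven_turbo_v2",  # the low-latency Turbo v2 model
)

play(audio)  # may need ffmpeg installed on your system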

Conclusion

Our full workflow now looks like this: the webcam grabs a frame every few seconds, the vision model describes it (or GPT-4-Vision narrates it directly), the language model turns that description into a script in our narrator’s voice, and the text-to-speech model reads it out loud.
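
If it helps to see it as code, the whole thing is conceptually just a loop. The three helpers here are hypothetical stand-ins for the pieces sketched above, not real library calls.

import time

def narrate_forever(capture_frame, write_script, speak, interval_seconds=5):
    # capture_frame, write_script, and speak are hypothetical callables
    # standing in for the three boxes above: webcam -> image on disk,
    # image -> narration script, script -> spoken audio.
    while True:
        frame_path = capture_frame()
        script = write_script(frame_path)
        speak(script)
        time.sleep(interval_seconds)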

And that’s it! Here’s a link to my full codebase. Happy hacking!