Crazy Joe Davola

I used suno-ai/bark and haoheliu/audio-ldm to replace all the audio from a scene in Seinfeld. Of course, I chose a scene from my favorite episode, Crazy Joe Davola. Here are the results and how I made it.

Step 1: Download

First I downloaded the video clip using pytube.

python -m pip install pytube
pytube https://www.youtube.com/watch?v=rrtwWraQDn8
open .

Then I wrote down the audio I'd need to generate:

[italian opera singing]
[seinfeld music]
hi clown, make us laugh clown, make me laugh clown!
[fighting sounds]
[groans]
---
paliacci, paliacci, tragic clown
what did you say?
what are you, a cop?
no, I'm a clown.
You look familiar.
You ever been to the circus?
When I was a kid.
Did you like it?
Uh, well, it was fun, but I was kind of scared of the clowns.
Are you still scared of clowns?
...Yeaaah
[audience laugh effect]
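The list above can also be kept as data, which makes it easier to track which cues still need a take. This is only a sketch: the names `CUES` and `split_cues` are made up for illustration, and treating every bracketed entry as a sound cue is an assumption (the opera singing, for example, ended up being generated with bark rather than audio-ldm).

```python
# Hypothetical cue sheet: the transcript above, one entry per line.
CUES = [
    "[italian opera singing]",
    "[seinfeld music]",
    "hi clown, make us laugh clown, make me laugh clown!",
    "[fighting sounds]",
    "[groans]",
    "---",
    "paliacci, paliacci, tragic clown",
    "what did you say?",
    "what are you, a cop?",
    "no, I'm a clown.",
    "You look familiar.",
    "You ever been to the circus?",
    "When I was a kid.",
    "Did you like it?",
    "Uh, well, it was fun, but I was kind of scared of the clowns.",
    "Are you still scared of clowns?",
    "...Yeaaah",
    "[audience laugh effect]",
]


def split_cues(cues):
    """Separate bracketed sound cues from spoken lines, skipping '---' separators."""
    effects, dialogue = [], []
    for index, cue in enumerate(cues):
        text = cue.strip()
        if not text or text == "---":
            continue
        if text.startswith("[") and text.endswith("]"):
            # Keep the position so the cue can be placed back in order later.
            effects.append((index, text[1:-1]))
        else:
            dialogue.append((index, text))
    return effects, dialogue


effects, dialogue = split_cues(CUES)
print(f"{len(effects)} sound cues, {len(dialogue)} spoken lines")
```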

Step 2: Generate and clip together

Then, I pulled the mp4 into iMovie and clipped out the audio. For each sound effect / line, I generated the audio using the bark colab notebook or the Replicate GUI for audio-ldm.
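The audio removal could also be scripted instead of done in iMovie. The sketch below is not what was used for this clip: it assumes ffmpeg is installed, and the file names and helper functions are placeholders. It only builds the commands; passing one to `subprocess.run(cmd, check=True)` would execute it.

```python
import shlex


def strip_audio_command(src, dst):
    """Build an ffmpeg command that copies the video stream and drops the audio."""
    return ["ffmpeg", "-i", src, "-c:v", "copy", "-an", dst]


def mux_audio_command(video, audio, dst):
    """Build an ffmpeg command that pairs a silent video with a new audio track."""
    return [
        "ffmpeg", "-i", video, "-i", audio,
        "-map", "0:v:0",   # video from the first input
        "-map", "1:a:0",   # audio from the second input
        "-c:v", "copy",    # do not re-encode the video
        "-shortest",       # stop at the end of the shorter stream
        dst,
    ]


# Placeholder file names, for illustration only.
strip_cmd = strip_audio_command("scene.mp4", "scene_silent.mp4")
mux_cmd = mux_audio_command("scene_silent.mp4", "generated_mix.wav", "scene_new_audio.mp4")
print(shlex.join(strip_cmd))
print(shlex.join(mux_cmd))
```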

It took some trial and error to line the audio up with the lip movements. The end result looked like:

Prompts

italian opera singing (suno-ai/bark colab notebook)

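from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

preload_models()  # download and cache the bark model weights

text_prompt = """
♪ ridi, Pagliaccio, e ognun applaudirà! ♪
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)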
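To get a take out of the notebook and into a video editor, it has to be saved as a file. Here is a rough sketch using only the standard library. It assumes the samples are floats between -1.0 and 1.0 and that the sample rate is 24 kHz (which I believe matches bark's `SAMPLE_RATE`). The `save_wav` helper and the file name are made up for illustration, and a generated sine tone stands in for `audio_array` so the snippet runs without the model.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # assumption: same value as bark's SAMPLE_RATE


def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] to a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for audio_array: one second of a quiet 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
save_wav("take_01.wav", tone)
```

In the notebook, the last line would become something like `save_wav("take_01.wav", audio_array)`.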

jazz brush beat (Replicate)

import replicate

replicate.run(
  "haoheliu/audio-ldm:b61392adecdd660326fc9cfc5398182437dbe5e97b5decfb36e1a36de68b5b95",
  input={"prompt": "upbeat jazz beat brush", "guidance_scale": 10, "duration": 2.5}
)

guys in park

text_prompt = """MAN: Hi clown. MAN: Make us laugh clown. MAN: Make me laugh clown!"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_2")
Audio(audio_array, rate=SAMPLE_RATE)

fighting sounds

replicate.run(
  "haoheliu/audio-ldm:b61392adecdd660326fc9cfc5398182437dbe5e97b5decfb36e1a36de68b5b95",
  input={"prompt": "a group of men groaning outside", "guidance_scale": 10, "duration": 2.5}
)

sitcom audience

replicate.run(
  "haoheliu/audio-ldm:b61392adecdd660326fc9cfc5398182437dbe5e97b5decfb36e1a36de68b5b95",
  input={"prompt": "upbeat jazz beat brush", "guidance_scale": 2.5, "duration": 5}
)

kramer

Note that I split up the Kramer / Joe Davola text. It didn't work well when I asked bark to do the whole conversation together. I got decently consistent results with NARRATOR prepended to each line.

text_prompt = """
NARRATOR: paliacci, paliacci, tragic clown
NARRATOR: what are you, a cop?
NARRATOR:  You look familiar.
NARRATOR: When I was a kid.
NARRATOR: Uh, well, it was fun, but I was kind of scared of the clowns.
NARRATOR: ...Yeaaah
"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_1")
Audio(audio_array, rate=SAMPLE_RATE)

crazy joe davola

text_prompt = """
MAN: what did you say?
MAN: no, I'm a clown.
MAN: You ever been to the circus?
MAN: Did you like it?
MAN: Are you still scared of clowns?
"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_1")
Audio(audio_array, rate=SAMPLE_RATE)
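The per-speaker split used for the Kramer and Joe Davola prompts can be sketched in a few lines. The names `CONVERSATION`, `LABELS`, and `build_prompts` are made up for this example; only the NARRATOR and MAN labels and the lines themselves come from the prompts above.

```python
from collections import defaultdict

# The conversation in order, as (speaker, line) pairs.
CONVERSATION = [
    ("kramer", "paliacci, paliacci, tragic clown"),
    ("davola", "what did you say?"),
    ("kramer", "what are you, a cop?"),
    ("davola", "no, I'm a clown."),
    ("kramer", "You look familiar."),
    ("davola", "You ever been to the circus?"),
    ("kramer", "When I was a kid."),
    ("davola", "Did you like it?"),
    ("kramer", "Uh, well, it was fun, but I was kind of scared of the clowns."),
    ("davola", "Are you still scared of clowns?"),
    ("kramer", "...Yeaaah"),
]

# Label prepended to every line of a given speaker.
LABELS = {"kramer": "NARRATOR", "davola": "MAN"}


def build_prompts(conversation, labels):
    """Group lines by speaker and prepend that speaker's label to each line."""
    lines = defaultdict(list)
    for speaker, text in conversation:
        lines[speaker].append(f"{labels[speaker]}: {text}")
    # One multi-line prompt per speaker, generated in a separate bark call.
    return {speaker: "\n".join(speaker_lines) for speaker, speaker_lines in lines.items()}


prompts = build_prompts(CONVERSATION, LABELS)
for speaker, prompt in prompts.items():
    print(f"--- {speaker} ---\n{prompt}\n")
```

Each value in `prompts` would then be passed to `generate_audio` on its own, and the resulting audio cut up and interleaved again in the editor.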

Take-aways

When bark works, it's incredible. When it doesn't, or something is off, it's terrifying. I would say I get amazing results ~20% of the time. It can take a few tries to get things right.
I'm not sure how, but bark added the eerie music behind Joe Davola's lines by itself. Did it realize he was supposed to be creepy?
Sometimes it helps to prepend a speaker label such as MAN:, NARRATOR:, or WOMAN: to each line. Sometimes it doesn't.
You can enter sound effects in bark, like [gasp] and [laughter], and I even got it to work with [answering machine beep]. But that tends to make it less likely that the rest of the audio generation sounds good.
I'm really happy this is open source. This is definitely the most exciting open source text-to-audio model. I'm sure it will get better fast.
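
To put the ~20% figure in perspective, here is some back-of-the-envelope arithmetic. It assumes every generation is an independent try with the same chance of success, which is a simplification, and the function names are just for this sketch.

```python
import math


def chance_of_good_take(p_success, tries):
    """Probability that at least one of `tries` independent attempts succeeds."""
    return 1.0 - (1.0 - p_success) ** tries


def tries_needed(p_success, target):
    """Smallest number of attempts giving at least `target` chance of one success."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_success))


P_GOOD = 0.2  # rough success rate per generation, from the estimate above

for n in (1, 5, 10):
    print(f"{n:>2} tries: {chance_of_good_take(P_GOOD, n):.0%} chance of a good take")

print("tries for a 90% chance:", tries_needed(P_GOOD, 0.9))
```

Under that assumption, the expected number of generations per usable line is 1 / 0.2 = 5.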