Crazy Joe Davola

I used suno-ai/bark and haoheliu/audio-ldm to replace all the audio in a scene from Seinfeld. Of course, I chose a scene from my favorite episode, Crazy Joe Davola. Here are the results and how I made them.

Step 1: Download

First I downloaded the video clip using pytube.

python -m pip install pytube
pytube https://www.youtube.com/watch?v=rrtwWraQDn8
open .

Then I wrote down the audio I'd need to generate:

[italian opera singing]
[seinfeld music]
hi clown, make us laugh clown, make me laugh clown!
[fighting sounds]
[groans]
---
paliacci, paliacci, tragic clown
what did you say?
what are you, a cop?
no, I'm a clown.
You look familiar.
You ever been to the circus?
When I was a kid.
Did you like it?
Uh, well, it was fun, but I was kind of scared of the clowns.
Are you still scared of clowns?
...Yeaaah
[audience laugh effect]
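
A list like this is effectively a cue sheet. As a rough sketch (not part of the original workflow), it could be parsed into structured cues so that each one can be routed to a generator later. The `Cue` class, the `parse_cue_sheet` helper, and the rule that bracketed lines are sound effects are all assumptions made for illustration. The rule is only a default: the opera cue is bracketed but was generated with bark.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    text: str   # the line to speak, or a description of the sound
    kind: str   # "effect" for [bracketed] cues, "speech" otherwise

def parse_cue_sheet(sheet: str) -> list[Cue]:
    """Turn a plain-text cue sheet into a list of Cue objects.

    Assumption: a line wrapped in [square brackets] is a sound effect and
    anything else is spoken dialogue. Blank lines and '---' are skipped.
    """
    cues = []
    for raw in sheet.splitlines():
        line = raw.strip()
        if not line or line == "---":
            continue
        if line.startswith("[") and line.endswith("]"):
            cues.append(Cue(text=line[1:-1], kind="effect"))
        else:
            cues.append(Cue(text=line, kind="speech"))
    return cues

# A few cues from the list above.
sheet = """
[seinfeld music]
hi clown, make us laugh clown, make me laugh clown!
[fighting sounds]
---
what did you say?
"""
cues = parse_cue_sheet(sheet)
```

Here `cues` holds four entries: two effects and two spoken lines, in scene order.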

Step 2: Generate and clip together

Then, I pulled the mp4 into iMovie and clipped out the audio. For each sound effect / line, I generated the audio using the bark colab notebook or the Replicate GUI for audio-ldm.
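
The assembly for this clip was done by hand in iMovie. For anyone who would rather script it, the sketch below shows one way to drop generated clips onto a silent track at chosen start times and save the result as a WAV file. The function names and the start times are hypothetical, and it assumes mono audio as floats in [-1, 1] at bark's 24 kHz sample rate; audio from other sources may need resampling first.

```python
import struct
import wave

SAMPLE_RATE = 24_000  # bark's output sample rate

def mix_clips(clips, total_seconds, rate=SAMPLE_RATE):
    """Place clips on a silent mono track.

    clips: list of (start_seconds, samples), with samples as floats in [-1, 1]
    and non-negative start times. Overlapping clips are summed, then the
    result is clamped back into [-1, 1].
    """
    track = [0.0] * int(total_seconds * rate)
    for start_seconds, samples in clips:
        offset = int(start_seconds * rate)
        for i, sample in enumerate(samples):
            if offset + i >= len(track):
                break  # drop anything that runs past the end of the scene
            track[offset + i] += sample
    return [max(-1.0, min(1.0, s)) for s in track]

def write_wav(path, track, rate=SAMPLE_RATE):
    """Write a mono 16-bit PCM WAV file that a video editor can import."""
    frames = b"".join(struct.pack("<h", int(s * 32767)) for s in track)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(frames)

# Toy example at a tiny sample rate so the numbers are easy to follow:
# the second clip starts half a second in and overlaps the first by one sample.
track = mix_clips([(0.0, [0.5, 0.5]), (0.5, [0.75, 0.75])], total_seconds=2, rate=2)
```

With real clips, nudging a line earlier or later is just a matter of changing its start time and re-running, which is the scripted equivalent of dragging it along the timeline.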

It took some trial and error to line the audio up with the lip movements. The end result looked like:


Prompts

italian opera singing (suno-ai/bark colab notebook)

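from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

preload_models()  # download and cache the bark models (slow on first run)
text_prompt = """
♪ ridi, Pagliaccio, e ognun applaudirà! ♪
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)  # play the clip inline in the notebook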

jazz brush beat (Replicate)

import replicate

replicate.run(
  "haoheliu/audio-ldm:b61392adecdd660326fc9cfc5398182437dbe5e97b5decfb36e1a36de68b5b95",
  input={"prompt": "upbeat jazz beat brush", "guidance_scale": 10, "duration": 2.5}
)

guys in park

text_prompt = """MAN: Hi clown. MAN: Make us laugh clown. MAN: Make me laugh clown!"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_2")
Audio(audio_array, rate=SAMPLE_RATE)

fighting sounds

replicate.run(
  "haoheliu/audio-ldm:b61392adecdd660326fc9cfc5398182437dbe5e97b5decfb36e1a36de68b5b95",
  input={"prompt": "a group of men groaning outside", "guidance_scale": 10, "duration": 2.5}
)

sitcom audience

replicate.run(
  "haoheliu/audio-ldm:b61392adecdd660326fc9cfc5398182437dbe5e97b5decfb36e1a36de68b5b95",
  input={"prompt": "upbeat jazz beat brush", "guidance_scale": 2.5, "duration": 5}
)

kramer

Note that I split up the Kramer / Joe Davola text. It didn't work well when I asked bark to do the whole conversation together. I got decently consistent results with NARRATOR prepended to each line.

text_prompt = """
NARRATOR: paliacci, paliacci, tragic clown
NARRATOR: what are you, a cop?
NARRATOR:  You look familiar.
NARRATOR: When I was a kid.
NARRATOR: Uh, well, it was fun, but I was kind of scared of the clowns.
NARRATOR: ...Yeaaah
"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_1")
Audio(audio_array, rate=SAMPLE_RATE)

crazy joe davola

text_prompt = """
MAN: what did you say?
MAN: no, I'm a clown.
MAN: You ever been to the circus?
MAN: Did you like it?
MAN: Are you still scared of clowns?
"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_1")
Audio(audio_array, rate=SAMPLE_RATE)
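
The two prompts above were written out by hand. If the scene were longer, the same split could be scripted. This is only a sketch: `build_speaker_prompt` is a hypothetical helper and the `kramer` / `davola` labels are made up for the example. It takes the conversation in order, keeps one speaker's lines, and prepends a tag to each.

```python
def build_speaker_prompt(lines, speaker, tag):
    """Collect one speaker's lines and prefix each with a tag such as 'MAN'."""
    body = "\n".join(f"{tag}: {text}" for who, text in lines if who == speaker)
    return f"\n{body}\n"

# Hypothetical speaker labels; the dialogue itself is from the scene.
dialogue = [
    ("kramer", "what are you, a cop?"),
    ("davola", "no, I'm a clown."),
    ("kramer", "You look familiar."),
    ("davola", "You ever been to the circus?"),
]
kramer_prompt = build_speaker_prompt(dialogue, "kramer", "NARRATOR")
davola_prompt = build_speaker_prompt(dialogue, "davola", "MAN")
```

Each resulting string can be passed to `generate_audio` as `text_prompt`, one call per speaker, just like the hand-written versions.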

Take-aways

  • When bark works, it's incredible. When it doesn't, or something is off, it's terrifying. I would say I get amazing results ~20% of the time. It can take a few tries to get things right.
  • I'm not sure how, but bark added the eerie music behind Joe Davola's lines by itself. Did it realize he was supposed to be creepy?
  • Sometimes it helps to have MAN/NARRATOR/WOMAN: prepended. Sometimes it doesn't.
  • You can enter sound effects in bark, like [gasp] and [laughter], and I even got it to work with [answering machine beep]. But that tends to make it less likely that the rest of the audio generation sounds good.
  • I'm really happy this is open source. This is definitely the most exciting open source text-to-audio model. I'm sure it will get better fast.
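
To put the ~20% figure from the first take-away in perspective, here is a bit of back-of-the-envelope arithmetic. It treats every generation as an independent try with the same chance of sounding good, which is a simplifying assumption, and the 20% itself is only a rough estimate. The function names are made up for this sketch.

```python
import math

def chance_of_a_keeper(p_good, tries):
    """Probability that at least one of `tries` independent generations is good."""
    return 1 - (1 - p_good) ** tries

def tries_needed(p_good, target):
    """Smallest number of tries whose chance of a keeper reaches `target`."""
    return math.ceil(math.log(1 - target) / math.log(1 - p_good))

five = chance_of_a_keeper(0.2, 5)   # about 0.67
needed = tries_needed(0.2, 0.9)     # 11
```

Under those assumptions, five attempts give roughly a two-in-three chance of at least one usable take, and it takes about eleven to get to 90%. That matches the experience of needing a few tries per line.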