How To Make an AI Voice Model (That’s Actually Good)

· By Will Harken

Are you looking to incorporate any voice you want into your content? With AI, you can do that now. I'm going to show you how to create really good voice models, which you can use in your next chart-topping hit, or just to make silly memes.

The beauty? You can create AI vocals of someone else or yourself for just $10 by leveraging free tools and subscribing to an AI cloning service. Crazy affordable for unlocking such immense power! 

General Tips

1) Plan to do voice-to-voice. This means you need to have a human sing, then you convert that vocal into your new AI vocal. 

2) Quality matters more than quantity when it comes to creating exceptional AI vocal models. Cut out any parts of your target vocals that have instrument bleed or sound garbled.

And...

3) It's usually better to make specialized models tailored for specific purposes rather than trying to create general-purpose models.

For example: It's more effective to have a separate model for breathy falsetto vocals and a separate model for powerful belting takes. You'd want a different model for Freddie Mercury's emotional, quiet sections versus his bold, near-shouting moments.

If you DO create a general-purpose model, keep in mind that the model will rely on the singer's ability. So if a part of a song is breathy, the singer will need to be breathy.

On the other hand, if you have a specialized model trained only on belty, screamy vocals, it will infer belty, screamy vocals regardless of how you sing the part.

For general purpose models, aim for vocals that sound similar in terms of mixing (EQ/compression).

Getting Audio Of The Target Voice

Setup & Vocal Isolation

Start by selecting the tracks you'll use. While acapella vocals are ideal for training, they're often not available. In those cases, pick tracks with sparse instrumentals.

If you're only changing the lyrics of one song, training on just the vocals from that one track can help the AI capture the intended sound accurately. This USUALLY works.

Use a YouTube MP3 downloader to download the MP3s. Just Google "Youtube Mp3 downloader" and find one that works.

Pro tip: Use a Python script to create project folders, subfolders, and automatically download MP3s. Want to learn how to do this? Join my AI Vocal Engineering Series.

I use Ultimate Vocal Remover's VR and MDX-Net models to isolate vocals, following the tutorial below. Running two different isolations can slightly increase model fidelity by providing more training data.

 

Clean Up & Export

Import the isolations into your DAW and remove any non-vocal sections, group vocals, and garbled parts.

If you need more training data, you can cheat a little bit like I do... You can incorporate vocal isolations of similar-sounding singers to build a hybrid model.

Optional, but helpful: Add Clear by Supertone ($70 at the time of this writing) to strip away lingering reverb and delay for a crisp, clean vocal. I often turn the Ambience and Reverb knobs all the way down. 

I recommend gain staging and/or using a vocal rider to keep the vocals hovering around the same volume; you can watch the volume meter in your DAW to confirm the level is staying consistent. You also want the vocal peaking between -10 dB and 0 dB so that it's not too quiet for the training.

Make sure you have a Limiter on your vocals to prevent clipping.
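If you'd rather check levels with a script than by eyeballing a meter, here's a small sketch that measures the peak level of an exported WAV in dBFS. It assumes 16-bit PCM files; the -10 dB target is just the guideline from above:

```python
# Sketch: measure the peak level of an exported vocal WAV in dBFS,
# assuming a 16-bit PCM file (sample width of 2 bytes).
import math
import wave
from array import array


def peak_dbfs(path: str) -> float:
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        samples = array("h", wf.readframes(wf.getnframes()))
    peak = max(abs(s) for s in samples)
    # Full scale for 16-bit audio is 32768; clamp to avoid log10(0).
    return 20 * math.log10(max(peak, 1) / 32768)


# Usage: if peak_dbfs("exports/vocal.wav") is below -10, raise the gain
# before training.
```

A full-scale file reads 0 dBFS; a half-scale one reads about -6 dBFS.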

If you plan to use Weights.gg, you will likely need to export multiple audio files because of their file size limit.
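One way to split a long export into chunks is ffmpeg's segment muxer. This is a sketch, not Weights.gg's official workflow: the 10-minute chunk length and the file names are assumptions, and `-c copy` avoids re-encoding so the split is fast and lossless. Check the service's current size limit yourself:

```python
# Sketch: build an ffmpeg command that splits one long MP3 into
# fixed-length chunks (part_000.mp3, part_001.mp3, ...).
import shutil
import subprocess
from pathlib import Path


def split_command(src: str, chunk_seconds: int = 600) -> list[str]:
    return ["ffmpeg", "-i", src, "-f", "segment",
            "-segment_time", str(chunk_seconds),
            "-c", "copy", "part_%03d.mp3"]


cmd = split_command("vocals_full.mp3")
# Run only if ffmpeg is installed and the source file actually exists.
if shutil.which("ffmpeg") and Path("vocals_full.mp3").exists():
    subprocess.run(cmd, check=True)
```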

I generally export my audio files as both MP3 and WAV, because some tools only support one or the other.

Training Your Model

Use Weights.gg or Jammable (formerly Voicify) for training. With Weights you can download the model for local conversions.

Aim to upload at least 2-3 minutes of data as an MP3. WAV will take longer to upload and isn't actually higher quality because you stripped the vocal from an MP3 anyway.

Typically this amount will be enough for lyric swaps, but can be problematic if you are trying to sing a completely different song with the voice.
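If your clips are scattered across files, a quick script can confirm you've hit the 2-3 minute mark. This sketch only counts WAVs, since Python's standard-library wave module can't read MP3s, and the folder name is a placeholder:

```python
# Sketch: total up the duration (in seconds) of all WAV clips in a folder,
# to check against the 2-3 minute training-data guideline above.
import wave
from pathlib import Path


def total_seconds(folder: str) -> float:
    total = 0.0
    for path in Path(folder).glob("*.wav"):
        with wave.open(str(path), "rb") as wf:
            total += wf.getnframes() / wf.getframerate()
    return total


# Usage: total_seconds("cleaned") / 60 gives you the minutes of data.
```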

Weights.gg must be using some nice cloud GPUs because they are able to train pretty good models within an hour.

From here, you can run inference - creating the new AI vocal - which you can edit into your song with your DAW.

With a little setup, you can run training or inference locally, which is often much faster and can support batch processing. If you train locally, you can also potentially train for longer to get higher quality. Want to learn how to do this? Join my AI Vocal Engineering Series.

Related AI Vocal Insights

For deeper dives into the world of AI vocals, explore these posts:

👉 Anyone Can 'Sing' Now: AI Voice Cloning

👉 The AI Vocal Mixing Technique No One's Talking About

👉 How AI is Revolutionizing Music

Level Up Your AI Vocals
