We are all becoming increasingly aware of the challenges around AI deepfakes. With elections looming this year, there are many concerns about how deepfakes might affect the outcomes. Every week we are hearing new and disturbing stories. I was in a conversation with some parents who were considering using safe words so their family would know they were really talking to one another on calls and not being scammed.
Governments across the world have been slow to react. At a recent event on AI regulation, Professor Dame Wendy Hall pointed out that it would have been relatively simple to make it illegal to create a deepfake of someone else without their consent.
In this post I’m going to work on the basis that:
- It’s better to understand something if we are to discuss it sensibly.
- Any bad actors who want to figure out how to make a deepfake already can, so I’m not sharing any secret information.
- There are very legitimate uses of this technology, and those also pose questions that we need to consider.
I’m going to describe how I created a cloned video of myself. I wanted to figure out a process that wasn’t too expensive and wasn’t too complicated. It’s a two-stage process using a couple of products: the first to clone my voice, and the second to sync it to a new video of me.
Voice Cloning
For the voice cloning part, I used ElevenLabs’ tools. An article by them on how they were looking to prevent their voice cloning software from being misused during the forthcoming elections got me intrigued as to how good it actually was.
To clone your own voice you need to sign up for one of their plans – $1 for the first month, which I did.
I then recorded three 30-second files of me talking fairly randomly to train the cloned voice. I could have recorded up to 25 of these – the more you record, in theory, the more accurate it should get. I found that with just one training file it was fairly poor, but with three it was sounding pretty good, so I didn’t record any more. Once I’d done this I got it to read the intro to one of my blog posts. The result, to my ears, was pretty staggering. It sounded like me, but better: no mistakes, perfect pronunciation, but otherwise pretty much like me.
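For anyone curious about what sits behind the web interface, the same workflow can also be scripted. The sketch below uses ElevenLabs’ public REST API as I understand it at the time of writing (the voices/add and text-to-speech endpoints); the file names, script text and model name are placeholders, and exact endpoints, parameters and plan limits may differ or change.

```python
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # from your account settings
BASE = "https://api.elevenlabs.io/v1"
HEADERS = {"xi-api-key": API_KEY}

# 1. Create an "instant" voice clone from a few short recordings of yourself.
#    The voices/add endpoint takes the samples as multipart file uploads.
sample_paths = ["sample1.mp3", "sample2.mp3", "sample3.mp3"]  # ~30 seconds each
files = [("files", (p, open(p, "rb"), "audio/mpeg")) for p in sample_paths]
resp = requests.post(
    f"{BASE}/voices/add",
    headers=HEADERS,
    data={"name": "My cloned voice"},
    files=files,
)
resp.raise_for_status()
voice_id = resp.json()["voice_id"]

# 2. Ask the cloned voice to read a script (e.g. the intro to a blog post).
script = "We are all becoming increasingly aware of the challenges around AI deepfakes..."
resp = requests.post(
    f"{BASE}/text-to-speech/{voice_id}",
    headers={**HEADERS, "Content-Type": "application/json"},
    json={"text": script, "model_id": "eleven_multilingual_v2"},
)
resp.raise_for_status()

# The response body is the rendered audio (MP3 by default).
with open("cloned_narration.mp3", "wb") as f:
    f.write(resp.content)
```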
I shared it with my team – at least one person thought I’d uploaded the wrong file and that it was actually me.
I played it to my wife – she said it wouldn’t fool her as the lower registers weren’t right. But it was close.
Here’s the result:
As I said, I only recorded 3 short training files – I could have recorded up to 25. I’ll do this when I have enough time and see how much it improves.
Video
I then wanted to explore what it would look like synced up to a video. There are tools that can animate a still image into a lip-synced video, but the results are a bit weird at the moment. So instead I decided to look at software aimed at improving lip syncing in movies and games. I picked Sync – an early-stage company with an impressive promo video. I subscribed to their basic package at $19 a month.
The process was incredibly simple – I recorded a video of me, in the Jisc London office, randomly talking about my cat. I then uploaded the video, alongside the cloned audio of me talking about AI. After a couple of minutes the video was ready to download. Opinions in my team varied as to how well the syncing worked. I have to say I thought it was pretty impressive.
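Sync also offers an API, so in principle this step could be scripted as well. I haven’t verified the exact endpoint or payload shape, so treat everything below (the generate endpoint, the input fields, the polling loop) as an illustrative assumption rather than their documented interface.

```python
import time
import requests

API_KEY = "YOUR_SYNC_API_KEY"   # assumption: key-based auth
BASE = "https://api.sync.so"     # assumption: API base URL
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}

# Submit a lip-sync job: the video of me talking about my cat, plus the
# cloned audio of me talking about AI. (Field names are illustrative.)
job = requests.post(
    f"{BASE}/v2/generate",
    headers=HEADERS,
    json={
        "input": [
            {"type": "video", "url": "https://example.com/me-in-the-office.mp4"},
            {"type": "audio", "url": "https://example.com/cloned_narration.mp3"},
        ],
    },
).json()

# Poll until the re-synced video is ready, then print the output location.
while True:
    status = requests.get(f"{BASE}/v2/generate/{job['id']}", headers=HEADERS).json()
    if status.get("status") in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

print(status.get("outputUrl") or status)
```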
Use cases and implications
The potential for misuse has been well discussed, and it is very real. I’ll come back to that in a moment, but one thing that really struck me was how much time it could save for some tasks. Specifically, it had taken me all afternoon to narrate a 30-minute PowerPoint a couple of weeks ago, and I have another one of around an hour that I need to record tomorrow. It takes time because I want the result to be as good as possible, so I tend to redo sections until I’m more or less happy.
Updating the material is equally challenging. If, say, we want to update a small section because the name of the technology has changed (hello Gemini/Bard and Copilot/Bing Chat), it often means redoing the whole video. How much easier it would be just to edit the script slightly and re-render the video.
Of course, there’s the whole issue of how viewers will react to a clone. I want to test this at some point, but I think it probably depends very much on the context and content. My feeling is that for a straightforward informational video it would be fine. For a movie or video with a decent actor, the cloned voice would be pretty bad, but re-synced real audio would probably be OK.
We talk a lot about the impact on the workplace, and I think this is real here. One of my friends is a voice-over/audiobook actor, and he very much sees the threat as real and immediate.
So should we use it?
As I wrote this blog I realised what a strange direction it was taking. Yes, here was a technology that could change the result of elections, make life awful for young women through terrible deepfakes, be used by criminal gangs to extract the life savings from unsuspecting victims, and take my actor friend’s livelihood. But hey, it could save me a few minutes in creating my narrated PowerPoints, so all is good, right?!
The technology is moving faster than legislation can handle, but some of this we could and should have seen coming. In the meantime, we can at least set the ground rules of the use of these technologies in our institutions.
And in broader society, we must focus on legislating against the threats that are here today, and we must work out the right way to make sure creative people are fairly paid.
I’m still torn – I really want to see how this works out for our training materials, as there is huge potential for time-saving and getting material out and updated quickly. More thought and discussion are needed. Let me know what you think.
Find out more by visiting our Artificial Intelligence page to view publications and resources, join us for events and discover what AI has to offer through our range of interactive online demos.
For regular updates from the team sign up to our mailing list.
Get in touch with the team directly at AI@jisc.ac.uk