Which Text-to-Speech service has the best English voice? Comparison of AWS vs Google

22 September, 2021



Text-to-Speech services turn a text into a voice.

 

With an API, it’s easy to integrate a voice generator into a website or to create the voice-over for a video.

 

In the past, computer voices sounded robotic and monotonous. This is no longer the case. Machine Learning has transformed the way computers speak, and computer voices can now sound like natural human voices.

 

With Machine Learning, a neural network is trained on a large volume of speech samples. During training, the neural network extracts the structure of speech, such as which tones follow each other and what realistic speech looks like.

 

AWS and Google Cloud both offer Text-to-Speech services.

 

To compare these 2 services, we will test the voice quality and the voice customization options.

 

The AWS Text-to-Speech service is called Amazon Polly. Google's is called Cloud Text-to-Speech.

 

 

Voice Quality: AWS vs Google

 

AWS wins! Let's see the details of the test:

 

AWS – Voice Quality

 

You can test all the AWS voices with the demo page. You need to create an AWS account to access this demo.

 

With AWS, you can choose between 2 types of voice quality: neural or standard Text-to-Speech. The neural voice produces the most natural-sounding speech.

 

With the AWS neural option, you can choose from a list of voices with different accents:

 

- US accent: 9 voices

 

- British accent: 3 voices

 

- Australian, New Zealand, South African accents: 1 voice for each

 

Each Amazon Polly voice is identified by a first name: Kimberly, Justin, etc.

 

My favorite is the voice called Joanna with a US accent.
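
As a rough illustration of how a neural voice such as Joanna could be requested through the API instead of the demo page, here is a minimal Python sketch. It assumes the boto3 library is installed and that AWS credentials and a region are already configured. The helper names `polly_request` and `save_polly_speech` and the output file name are hypothetical.

```python
def polly_request(text, voice_id="Joanna", engine="neural"):
    """Build the keyword arguments for Polly's synthesize_speech call."""
    return {
        "Text": text,
        "OutputFormat": "mp3",
        "VoiceId": voice_id,  # Polly voices are identified by first name
        "Engine": engine,     # "neural" or "standard"
    }


def save_polly_speech(text, path="joanna.mp3"):
    """Send the request to Amazon Polly and write the MP3 audio to disk."""
    import boto3  # imported lazily so polly_request works without boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**polly_request(text))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
    return path
```

Calling `save_polly_speech("Hello from Joanna")` should produce an MP3 file you can compare with the demo page output.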

 

Google Cloud – Voice Quality

 

Google Text-to-Speech offers a group of premium voices based on the WaveNet model, the same technology used to produce speech for Google Assistant and Google Translate. The other group is called Standard voices.

 

You don’t need to have a Google Cloud account to access the demo page.

 

In the Google WaveNet group, you can choose from:

 

- US accent: 10 voices

 

- British accent: 5 voices

 

- Australian accent: 5 voices

 

Among the WaveNet voices, my favorite has the id en-US-Wavenet-D, a male voice with a US accent.
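
For reference, a WaveNet voice such as en-US-Wavenet-D could be requested from Python roughly as follows. This is a sketch that assumes the google-cloud-texttospeech client library is installed and that Google Cloud credentials are configured. The helper names, the output file name, and the assumption that the language code is always the first two parts of the voice id are mine.

```python
def language_code_of(voice_name):
    """Derive "en-US" from "en-US-Wavenet-D" (assumes this naming pattern)."""
    return "-".join(voice_name.split("-")[:2])


def save_google_speech(text, voice_name="en-US-Wavenet-D", path="wavenet.mp3"):
    """Ask Cloud Text-to-Speech for MP3 audio and write it to disk."""
    # Imported lazily so language_code_of works without the library.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(
            language_code=language_code_of(voice_name),
            name=voice_name,
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    with open(path, "wb") as audio_file:
        audio_file.write(response.audio_content)
    return path
```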

 

When I test the Google WaveNet voices with the demo page, I feel that the voices have less life in them than the AWS neural voices. This is subjective, and you can run your own test with the 2 demo pages.

 

 

Voice Customization: AWS vs Google

 

The winner depends on which type of customization you are looking for.

 

Both services let you control the speech output with the Speech Synthesis Markup Language (SSML).

 

For example, SSML tags let you control the speed of the speech. There are many other options, such as adding a pause or changing the pitch of the voice.

 

One drawback of using SSML tags is that you need to escape the reserved characters in your text (for example & and <) before sending it to the API.

 

See this link for more information about the reserved characters.
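
Since SSML is an XML format, one way to handle the reserved characters is to escape the raw text before wrapping it in SSML tags. The sketch below uses Python's standard library. The helper name `to_ssml` is hypothetical, and escaping the quote characters as well is a cautious choice on my part.

```python
from xml.sax.saxutils import escape


def to_ssml(text):
    """Escape XML reserved characters, then wrap the text in <speak> tags."""
    escaped = escape(text, {'"': "&quot;", "'": "&apos;"})
    return f"<speak>{escaped}</speak>"


print(to_ssml("Tom & Jerry"))
# prints: <speak>Tom &amp; Jerry</speak>
```

Escape only the text content, not the SSML tags themselves, otherwise the tags would be read aloud or rejected.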

 

On AWS, the SSML tags let you choose between 2 speaking styles: Newscaster (speech in the style of a TV or radio newscaster) or Conversational (speech that simulates the tone of a friendly conversation).
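
As far as I know, these two styles are selected in Amazon Polly with the `amazon:domain` SSML tag, using the names `news` and `conversational`, and only some neural voices support them. The sketch below builds that SSML in Python. The helper name `styled_ssml` and the `STYLE_DOMAINS` mapping are hypothetical.

```python
from xml.sax.saxutils import escape

# Speaking style -> value of the amazon:domain "name" attribute.
STYLE_DOMAINS = {"newscaster": "news", "conversational": "conversational"}


def styled_ssml(text, style):
    """Wrap escaped text in the amazon:domain tag for the given style."""
    domain = STYLE_DOMAINS[style]
    return (
        f'<speak><amazon:domain name="{domain}">'
        f"{escape(text)}"
        "</amazon:domain></speak>"
    )


print(styled_ssml("Here are today's headlines.", "newscaster"))
```

The resulting string would be sent to Polly as SSML (with the text type set to `ssml` and the neural engine selected) rather than as plain text.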

 

Voice customization has its limitations. For example, slowing down the speech can make it sound less natural in some cases.

 

For AWS, you can test the SSML tags by checking the corresponding checkbox on the right side of the demo page.

For Google, the SSML option is under the input text box.

 

To test a slow voice you can use this SSML tag:

 

<speak>
<prosody rate="slow">Your text</prosody>
</speak>
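
If you prefer to run the slow-voice test through the APIs rather than the demo pages, the same SSML can be built in Python and placed in each service's request. This is only a sketch of the request data and sends nothing. The helper name `prosody_ssml` is hypothetical, the Polly dictionary mirrors the parameters of boto3's `synthesize_speech`, and the Google dictionary mirrors the JSON body of the REST `text:synthesize` method.

```python
from xml.sax.saxutils import escape


def prosody_ssml(text, rate="slow"):
    """Wrap escaped text in a prosody tag that sets the speaking rate."""
    return f'<speak><prosody rate="{rate}">{escape(text)}</prosody></speak>'


ssml = prosody_ssml("Your text")

# Amazon Polly: pass the SSML as Text and set TextType to "ssml".
polly_kwargs = {
    "Text": ssml,
    "TextType": "ssml",
    "OutputFormat": "mp3",
    "VoiceId": "Joanna",
    "Engine": "neural",
}

# Google Cloud Text-to-Speech: put the SSML in input.ssml instead of input.text.
google_body = {
    "input": {"ssml": ssml},
    "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-D"},
    "audioConfig": {"audioEncoding": "MP3"},
}
```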

 

The Google en-US-Wavenet-D slow voice sounds OK to me.

 

When I run the same test on AWS with the Joanna voice (a neural voice with a US accent), the slow voice sounds a little distorted.

 

In this small test of slowing down the voice, Google wins against Amazon Polly.

 

 

Conclusion

 

We have seen how to quickly test the possibilities of the Text-to-Speech services of Google and AWS.

 

You can check which one is the best according to your requirements and your taste.

 

Personally, I like the AWS option to easily choose between Newscaster and Conversational (friendly conversation), and I feel that the AWS voices have more life in them.