BASE TTS: Amazon Creates the Largest Text-to-Speech Model

The technology will not be available for widespread use due to ethical considerations.

A team of artificial intelligence researchers from Amazon AGI has announced the development of the largest text-to-speech model to date. "Largest" here means the model has the most parameters and was trained on the largest dataset. The researchers published a paper on the preprint server arXiv describing how the model was developed and trained.

Artificial intelligence models like ChatGPT have attracted attention for their ability to answer questions intelligently and produce complex text in human language. But AI continues to spread into other application areas. In this new study, the researchers tried to improve text-to-speech capabilities by increasing the number of model parameters and expanding the training data.

The new model, called Big Adaptive Streamable TTS with Emergent abilities (BASE TTS), contains 980 million parameters and was trained on 100,000 hours of speech recordings (found on open resources), most of which are in English. The team also gave the model examples of words and phrases spoken in other languages so that it could correctly pronounce well-known expressions when it encountered them, such as "au contraire" or "adios, amigo".

Amazon researchers also tested smaller versions of the model trained on smaller datasets, hoping to identify so-called emergent abilities, where an AI suddenly begins to show a higher level of capability. They found that for their application, such a jump occurred with the medium-sized configuration, at 150 million parameters.

They also noted that the jump affected many aspects of language, such as the ability to handle compound nouns, express emotions, pronounce foreign words, use paralinguistic cues, respect punctuation, and place emphasis correctly in questions.

The team reports that the BASE TTS model will not be released for widespread use because of concerns about unethical use. Instead, they plan to use it as a learning tool to make synthesized speech sound more natural in applications generally.
 