ByteDance proposes PolyVoice, a new LLM-based paradigm that achieves speech-to-speech translation with a decoder-only framework
In recent years, large language models (LLMs) have achieved many breakthroughs in NLP, especially with the success of ChatGPT, which is ushering in a new era of AI. So far, models based on the encoder-decoder framework still dominate speech processing tasks, while methods based on language models (LMs) remain at an early exploration stage. AudioLM and VALL-E, as prior work, have already demonstrated the effectiveness of combining discrete semantic units and discrete acoustic units with language models for audio generation tasks.
Based on this, researchers at ByteDance have proposed the PolyVoice framework, a speech-to-speech translation (S2ST) system based on discrete speech units. PolyVoice has two outstanding contributions:
(1) Decoder-only: it implements direct speech translation with a decoder-only framework that can accommodate multiple sources of training data.
(2) Units-based: it builds a units-based audio LM for speech translation, which also works for non-written languages.
Speech-to-speech translation (S2ST) is a challenging task as it requires solving all the problems in automatic speech recognition (ASR), machine translation (MT) and text-to-speech synthesis (TTS) simultaneously. Unlike traditional cascaded approaches, direct modeling approaches have the advantages of low latency and simplified pipelines. Existing direct modeling approaches for S2ST can be further classified based on whether the model predicts continuous mel-spectrogram features or discrete units. Recently, unit-based methods have become increasingly popular for several reasons:
(1) The units-based approach treats the discrete units of speech as a "pseudo-language" to which existing NLP techniques can be applied.
(2) The units-based approach avoids the difficulty of learning to predict spectrograms.
(3) Discrete units can be obtained in an unsupervised manner, enabling the modeling of non-written languages.
Semantic Units and Acoustic Units are two commonly used speech discrete units. Semantic Units are mainly used to capture the semantic content in speech. Acoustic Units, also known as Codec Units, were originally used to transmit high-quality speech signals in limited bandwidth.
Introduction to PolyVoice
PolyVoice is a language model-based S2ST framework that can handle both written and non-written languages. PolyVoice uses discrete units obtained through self-supervised training as an intermediate representation between source and target speech. PolyVoice is composed of two parts:
The Speech-to-Unit (S2UT) translation module converts discrete units of the source language speech into discrete units of the target language speech.
The Unit-to-Speech (U2S) synthesis module synthesizes target language speech while retaining the speaking style of the source language speaker.
The following diagram shows the overall architecture of PolyVoice:
Speech-to-Unit (S2UT) Translation Module
By discretizing continuous speech representations into units obtained through self-supervised training, information irrelevant to semantics is removed. S2UT then uses a language model to learn cross-lingual generation over these discrete speech units.
1. Semantic Unit Extractor: S2UT processes raw speech using the Semantic Unit Extractor. First, it discretizes the continuous representations output by HuBERT using k-means clustering. Then, it merges consecutive runs of duplicate units to compress the sequence length, reducing computational cost and aiding convergence.
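The two steps above can be sketched as follows. This is a minimal illustration, not code from the PolyVoice release: the function and variable names are assumptions, and a real extractor would use trained k-means centroids over HuBERT features.

```python
import numpy as np

def extract_semantic_units(features, centroids):
    """Assign each frame-level feature vector (T, D) to its nearest
    k-means centroid (K, D), yielding one discrete unit per frame."""
    # Squared Euclidean distance from every frame to every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).tolist()

def merge_duplicates(units):
    """Collapse runs of identical consecutive units to shorten the
    sequence before language modeling, as described above."""
    merged = []
    for u in units:
        if not merged or merged[-1] != u:
            merged.append(u)
    return merged

# e.g. merge_duplicates([5, 5, 5, 12, 12, 7]) -> [5, 12, 7]
```

The duplicate merging discards per-frame duration information, which is why a separate duration model is needed later in the U2S stage.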
2. Units-based cross-lingual language model (U-XLM): U-XLM translates source language units into target language units. The prompt format for U-XLM can be defined as: Translate [src lang] unit “…” to [tgt lang] unit: “…”, where the quoted slots hold the source and target unit sequences.
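A hedged sketch of how such a prompt might be rendered as a string for a decoder-only LM. The serialization of units as space-separated integers is an assumption for illustration; the paper does not specify the exact tokenization.

```python
def build_uxlm_prompt(src_lang, tgt_lang, src_units, tgt_units=None):
    """Render the U-XLM prompt described above. At inference time the
    prompt ends after the opening quote and the model generates the
    target units; at training time the target continuation is appended."""
    src = " ".join(str(u) for u in src_units)
    prompt = f'Translate {src_lang} unit "{src}" to {tgt_lang} unit: "'
    if tgt_units is not None:  # training sample: append the target side
        prompt += " ".join(str(u) for u in tgt_units) + '"'
    return prompt
```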
3. S2UT Training: To address the scarcity of parallel cross-lingual unit data in real-world scenarios, as shown in the table below, PolyVoice adapted the prompts and constructed training samples from various types of data sources (such as ASR, MT, etc.), then trained the model with shared parameters across tasks.
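One way such multi-source training samples could be unified under a single prompt space is sketched below. The template wording and task names here are illustrative assumptions, not the paper's actual templates.

```python
# Hypothetical prompt templates for reusing one decoder-only model
# across data sources; supervision from ASR/MT/TTS corpora then shares
# parameters with the S2ST task.
TEMPLATES = {
    "s2st": 'Translate {src_lang} unit "{x}" to {tgt_lang} unit: "{y}"',
    "asr":  'Transcribe {src_lang} unit "{x}" to {src_lang} text: "{y}"',
    "mt":   'Translate {src_lang} text "{x}" to {tgt_lang} text: "{y}"',
    "tts":  'Synthesize {tgt_lang} text "{x}" to {tgt_lang} unit: "{y}"',
}

def make_sample(task, src_lang, tgt_lang, x, y):
    """Format one training example for the given task type."""
    return TEMPLATES[task].format(
        src_lang=src_lang, tgt_lang=tgt_lang, x=x, y=y
    )
```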
U-XLM has several prominent features, including the ability to process both written and non-written languages, multilingual modeling capabilities, and the ability to make zero-shot predictions by utilizing large amounts of unlabeled data. These features make U-XLM a promising framework for advancing research in speech-to-speech translation.
Unit-to-Speech (U2S) synthesis module
1. Unit-to-Speech Language Model (U-SLM): Like VALL-E, U-SLM includes both an autoregressive model and a non-autoregressive model. In PolyVoice, the input consists of Semantic Units in both the source and target languages, as well as Codec Units that carry the speaking style of the source speaker.
2. SoundStream codec: The encoder of SoundStream is used to generate Codec Units that contain the speaking style of the source speaker, while the decoder reconstructs the Acoustic Units predicted by U-SLM into speech waveforms.
3. Duration model: The duration information of discrete units is crucial for the stability of synthesized speech. PolyVoice uses LM to predict duration information. Specifically, as shown in the bottom right corner of the above figure, the merged source Semantic Units, merged target Semantic Units, and source duration value sequence (D) are input as prompts into the Duration LM. The Duration LM predicts the target duration value sequence based on the input prompts and performs the corresponding number of repetitions for each target Semantic Unit.
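The final expansion step, repeating each target Semantic Unit by its predicted duration, can be sketched as below. This is a minimal illustration under the assumption of integer frame counts; the Duration LM that predicts the durations is not reproduced here.

```python
def expand_by_duration(units, durations):
    """Repeat each merged semantic unit by its predicted duration,
    restoring a frame-level sequence for speech synthesis."""
    assert len(units) == len(durations), "one duration per unit"
    expanded = []
    for u, d in zip(units, durations):
        expanded.extend([u] * d)
    return expanded

# e.g. expand_by_duration([5, 12], [3, 2]) -> [5, 5, 5, 12, 12]
```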
The author validated the performance of PolyVoice on two S2ST benchmark datasets, EMIME and CVSS.
The ASV score is used to evaluate how well the speaker's voice is preserved in the output speech, while ASR-BLEU measures translation quality. From the experiments, the authors draw several conclusions:
1. When the ground-truth target translation sequence is available, PolyVoice demonstrates better voice cloning capability.
2. PolyVoice's translation quality declines slightly, but voice quality improves significantly. The drop in translation quality may stem from information loss introduced by unsupervised audio discretization; the improvement in speech naturalness may come from the generation capacity of large language models.
To verify the effectiveness of PolyVoice on non-written languages, the authors evaluated an English-to-Spanish S2ST system trained without any Spanish text supervision. The ASR-BLEU result (18.3) shows that the Spanish speech generated by PolyVoice is semantically understandable.
Analysis and Ablation Experiments
1. Comparison between Decoder-only and Encoder-Decoder frameworks
The decoder-only model brought a significant improvement of 3.9 BLEU, and when U2S replaced the vocoder for speech synthesis, the performance gap narrowed, demonstrating the robustness of the U2S backend.
2. Multi-task capability of U-XLM: U-XLM achieves remarkable performance across various tasks including S2ST, ASR, ST, MT, and TTS, validating the universal modeling capability of the decoder-only framework.
3. Optimization of the U2S module
The experimental results show that removing the duration model from U2S significantly increases the WER, possibly because the units themselves carry less duration information than phonemes. The duration model is therefore essential when using discrete units learned without supervision. Additionally, the authors trained an extra multilingual HuBERT model (mHuBERT_zh_en) for Chinese and English as the Semantic Unit Extractor, and experimental comparisons showed that larger models may produce better semantic units.
PolyVoice is an S2ST framework based on speech discrete units. Experimental results demonstrate that the units-based S2ST system outperforms existing systems in ASR-BLEU, ASV, and naturalness. Additionally, the authors have demonstrated the ability of PolyVoice in non-written language scenarios without the use of text information supervision. As the performance of PolyVoice is highly correlated with the quality of speech discrete units, future work will continue to investigate how to better perform speech discretization.