Abstract
Speech language models have significantly advanced in generating realistic speech, with neural codec language models standing out. However, the integration of human feedback to align speech outputs to human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec language models, highlighting how it leads to discrepancies between the training and inference phases, which negatively affect performance. Then we explore leveraging learning from human feedback to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences. SpeechAlign involves constructing a preference codec dataset contrasting golden codec tokens against synthetic tokens, followed by preference optimization to improve the codec language model. This cycle of improvement is carried out iteratively to steadily convert weak models into strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.
1 Introduction
Large language models (LLMs) have showcased their potent abilities through techniques such as pretraining, supervised fine-tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF) (OpenAI, 2023; Touvron et al., 2023). The field of speech language modeling has seen significant progress (Wang et al., 2023a; Borsos et al., 2022; Zhang et al., 2023a), particularly with the adoption of discrete speech representations (Hsu et al., 2021; Zhang et al., 2023b) like audio codecs (Défossez et al., 2022; Zeghidour et al., 2021; Zhang et al., 2023d). However, current speech language models primarily focus on the SFT stage associated with empowering the LLM's instruction-following capabilities, neglecting the integration of human feedback to align speech outputs to human preferences regarding quality, naturalness, and expressiveness. Fortunately, learning from human feedback has emerged as a powerful solution for aligning LLM output distributions with human expectations (Stiennon et al., 2022; Bai et al., 2022; Ouyang et al., 2022). The most successful approach, reinforcement learning from human feedback (RLHF), achieves this by integrating reward modeling and a reinforcement learning phase. Additionally, some computationally efficient alternatives have proven to be effective in aligning LLM behavior without the need for explicit reward modeling (Rafailov et al., 2023; Zhang et al., 2023c; Wang et al., 2024).
The key to the success of speech language models (Wang et al., 2023a; Borsos et al., 2022; Zhang et al., 2023a) that build on LLMs is utilizing audio codecs that discretize the speech representations. Neural codec language models, leveraging audio codecs, have demonstrated their effectiveness in speech generation tasks (Yang et al., 2023; Wang et al., 2023a). They primarily utilize a hierarchical approach that consists of a pipeline of autoregressive (AR) and non-autoregressive (NAR) models, as illustrated in Figure 3 (a). The AR model generates semantic tokens (Borsos et al., 2022) or the first layer of codec tokens (Wang et al., 2023a), referred to as AR tokens. These AR tokens serve as input for the NAR model to generate acoustic tokens (Borsos et al., 2022) or subsequent layers of codec tokens (Wang et al., 2023a), termed NAR tokens. However, this pipeline system introduces a discrepancy between the training and inference phases for the codec language model. In training, the NAR model is fed with golden AR tokens derived from real speech. However, the model receives synthetic AR tokens generated by the AR model during inference. As demonstrated in Section 2.3, there is a distribution gap between these two types of AR tokens, which adversely impacts the performance of the NAR model.
Explanation (Figure 3 (a))

Training phase:
- First, real speech is discretized into tokens by the codec.
- These tokens fall into two categories:
  - AR tokens (blue dots in the figure): semantic tokens or first-layer codec tokens
  - NAR tokens (yellow dots in the figure): acoustic tokens or subsequent-layer codec tokens
- During training, the AR LM takes a prompt as input and generates AR tokens, while the NAR LM takes "golden" AR tokens extracted from real speech as input.
- The NAR LM learns to generate NAR tokens conditioned on these AR tokens extracted from real speech.

Inference phase:
- The AR LM takes a prompt as input and generates AR tokens.
- Now, however, the NAR LM receives the synthetic AR tokens produced by the AR LM rather than tokens extracted from real speech.
- The NAR LM generates NAR tokens based on these synthetic AR tokens.
- Finally, the codec decoder converts all tokens into the output speech.

Differences
- Distribution mismatch: during training the NAR LM consumes AR tokens extracted from real speech, while at inference it consumes synthetic AR tokens produced by the model.
- Performance impact: this inconsistency between training and inference degrades the NAR LM's performance.
- Challenge: a distribution gap exists between the synthetic AR tokens and the AR tokens extracted from real speech.
Example

Training-phase example
Suppose we have a real utterance: "The weather is really nice today".
Speech encoding:
- The raw waveform is first processed by the codec.
- The codec converts it into discrete token sequences.
- For example, the AR tokens might be [245, 78, 156, 32, 189] (carrying the semantic information, q1), and the NAR tokens might be [67, 23, 198, 45, 112, 89, 201, 34] (carrying the acoustic detail, q2-q8).

Training the AR LM:
- Input: the text prompt "The weather today"
- Target output: the AR tokens [245, 78, 156, 32, 189]
- The AR LM learns to generate semantic tokens from text.

Training the NAR LM:
- Input: the AR tokens extracted from real speech, [245, 78, 156, 32, 189]
- Target output: the NAR tokens [67, 23, 198, 45, 112, 89, 201, 34]
- The NAR LM learns to generate acoustic-detail tokens from semantic tokens.

Inference-phase example
Now the user inputs the text "The weather today" and wants speech:

AR LM generation:
- Input: the text prompt "The weather today"
- Output: synthetic AR tokens [245, 76, 158, 30, 190]
- Note: these generated tokens differ slightly from the real-speech tokens.

NAR LM generation:
- Input: the synthetic AR tokens [245, 76, 158, 30, 190] generated by the AR LM
- Output: NAR tokens [65, 25, 195, 48, 110, 92, 200, 36]
- Problem: because the input AR tokens differ from those seen during training, the quality of the generated NAR tokens may drop.

Final decoding:
- The codec combines the AR tokens and the NAR tokens
- and decodes them into the final speech output.

Concrete impact of the train/inference mismatch
Distribution-gap example:
- Golden AR tokens: [245, 78, 156, 32, 189]
- Synthetic AR tokens: [245, 76, 158, 30, 190]
- The difference looks tiny, yet these small changes shift the input distribution that the NAR LM sees.

Performance impact:
- During training the NAR LM never sees the imperfect tokens produced by the AR LM.
- When it receives slightly off AR tokens, errors can be amplified.
- For example, a small semantic deviation may cause noticeable pronunciation, pitch, or prosody problems.

This training/inference inconsistency is a practical challenge for the model; possible remedies include injecting noise during training, training partly on generated AR tokens, or designing dedicated training strategies that narrow the distribution gap.
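To make the mismatch concrete, here is a minimal, self-contained sketch of what the NAR model sees during training versus inference. The token values and both stand-in functions are made up for illustration; they are not the paper's models.

```python
import random

def golden_ar_tokens(utterance: str) -> list[int]:
    """Stand-in for SpeechTokenizer: 'golden' first-layer tokens from real speech."""
    rng = random.Random(utterance)          # deterministic per utterance
    return [rng.randint(0, 1023) for _ in range(5)]

def synthetic_ar_tokens(text: str) -> list[int]:
    """Stand-in for the AR model: tokens that drift slightly from the golden ones."""
    rng = random.Random(text + "_model")
    return [t + rng.choice([-2, -1, 0, 1, 2]) for t in golden_ar_tokens(text)]

text = "the weather is really nice today"
golden = golden_ar_tokens(text)        # what the NAR model sees during training
synthetic = synthetic_ar_tokens(text)  # what the NAR model sees at inference

print("golden   :", golden)
print("synthetic:", synthetic)
# The NAR model is trained only on inputs like `golden`, so the slightly shifted
# `synthetic` distribution it receives at inference degrades its outputs.
```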
Can we calibrate the output of codec language models to the authentic codec distribution by learning from human feedback? Collecting a large, high-quality preference dataset for codec language models is challenging. First, codec tokens are often represented in numerical form, which is not directly understandable by humans, making it impossible to collect human preferences for these tokens directly. Furthermore, collecting human preferences on speech to gather feedback on codec tokens poses multiple challenges, including inconsistency across various human annotators and the difficulty of scaling up the dataset size.
We propose SpeechAlign, an iterative self-improving strategy that aligns speech language models to human preferences. To avoid the need for additional human-annotated data, we construct the pairwise preference codec dataset by considering golden AR tokens as preferred data and synthetic AR tokens as dis-preferred data. Human verification is conducted to ensure its consistency with human preferences. After obtaining the preference dataset, we explore different preference optimization strategies to improve codec language models. Following a complete cycle, we iteratively perform preference dataset collection and preference-aware optimization to convert weak codec language models to stronger ones continually. Experimental results show that SpeechAlign can continually improve the speech generation performance of speech language models.
Our contributions are summarized below:
- We propose SpeechAlign, the first to align speech language models by learning from human feedback.
- We propose an iterative self-improving strategy to convert weak codec language models to stronger ones without additional human-annotated data.
- We analyze the issue of distribution gaps in codec language models and explore various strategies to bridge the gap.
2 Preliminary Analysis on Distribution Gap
In this section, we conduct preliminary experiments to analyze the distribution gap between golden codec tokens and synthetic codec tokens, and to show that this gap degrades the performance of codec language models.
2.1 Background
The conventional model (core question: verify whether a distribution gap exists between golden AR tokens and synthetic AR tokens).
We build a codec language model, referred to as SpeechAlign-sft, serving as the baseline system to analyze the distribution gap. Similar to (Zhang et al., 2024; Budzianowski et al., 2024), we rely on SpeechTokenizer (Zhang et al., 2023d) to extract speech codec tokens. SpeechTokenizer is a Residual Vector Quantization (RVQ)-based speech tokenization method and hierarchically disentangles different aspects of speech information across different RVQ layers. The output of SpeechTokenizer comprises Q = 8 hierarchical RVQ tokens (q1,…,qQ). SpeechAlign-sft consists of a SpeechGPT (Zhang et al., 2023a)-based autoregressive (AR) model and a SoundStorm (Borsos et al., 2023)-based non-autoregressive (NAR) model. The AR model learns the mapping from input golden text to the first layer of codec tokens q1. We continue finetuning the pretrained SpeechGPT model in (Zhan et al., 2024) on the LibriSpeech dataset to get the AR model. Details about the training process are described in Section 4.1. The NAR model adopts the training and inference procedure of SoundStorm (Borsos et al., 2023) and learns to generate subsequent layers of SpeechTokenizer tokens conditioned on the first-layer tokens and prompt speech. We use the pretrained SoundStorm model in (Zhan et al., 2024). At inference time, the AR model converts input text to AR tokens and the NAR model uses these tokens along with prompt speech as conditions to generate NAR tokens. These tokens are then concatenated and converted into speech by the SpeechTokenizer decoder.
[Original model] → generates speech (possibly unnatural)
↓
Human feedback: label it "good" or "bad"
↓
[SpeechAlign]: automatically contrast "golden tokens vs. synthetic tokens" → update the model parameters
↓
[Upgraded model] → generates more natural speech (repeat the loop)
Baseline system
- Model name: SpeechAlign-sft (the baseline codec language model)
- Core goal: analyze the distribution gap problem of codec language models.

Speech tokenization
- Technical basis: SpeechTokenizer (Zhang et al., 2023d), built on Residual Vector Quantization (RVQ)
- Properties:
  - hierarchically disentangles speech information into Q = 8 layers of RVQ tokens (q₁, …, q₈)
  - yields a hierarchical discrete encoding of speech.

Model architecture
- Autoregressive (AR) model:
  - based on SpeechGPT (Zhang et al., 2023a)
  - task: map the input text to the first-layer codec tokens q₁
  - training: finetune the pretrained model of (Zhan et al., 2024) on LibriSpeech; details in Section 4.1.
- Non-autoregressive (NAR) model:
  - based on SoundStorm (Borsos et al., 2023)
  - task: generate the subsequent layers (q₂–q₈) conditioned on the first-layer tokens q₁ and the prompt speech
  - implementation: the pretrained model of (Zhan et al., 2024) is used directly.

Inference flow
[Input text] → AR model generates q₁ (AR tokens) → NAR model combines q₁ with the prompt speech to generate q₂–q₈ (NAR tokens) → concatenate all tokens (q₁–q₈) → the SpeechTokenizer decoder synthesizes the final speech
Demo
1. Input stage
- User input text: "Please read this in a calm male voice: it is sunny today, 25 °C"
- Prompt speech: an optional reference audio clip (here, a 1-second male-voice sample used to control timbre and intonation).

2. Tokenization and division of labor
SpeechTokenizer acts like a speech "disassembler":
- If the real recording of the whole sentence is fed into SpeechTokenizer, it outputs 8 layers of discrete tokens (q₁–q₈), each capturing different speech attributes:
  - q₁: semantic information (keywords such as "weather" and "25 °C")
  - q₂–q₄: timbre and pitch (male, calm)
  - q₅–q₈: fine details (breathing, rhythm)

3. Model generation flow
(1) AR model (like a painter sketching the outline)
- Input: the plain text "Please read this in a calm male voice: it is sunny today, 25 °C"
- Output: the first-layer tokens q₁ (semantic outline)
  - e.g., q₁ = [101, 205, 309, ...] (numeric codes representing keywords such as "weather" and "temperature")

(2) NAR model (like an assistant filling in the colors)
- Input:
  - the q₁ generated by the AR model (semantic outline)
  - the user-provided prompt speech (from which timbre features are extracted)
- Output: all tokens of q₂–q₈, generated in parallel
  - e.g., q₂ = [78, 92, ...] (male timbre), q₃ = [55, 203, ...] (calm intonation), …, q₈ = [12, 39, ...] (natural pauses)

4. Speech synthesis
- Token concatenation: stack q₁–q₈ by layer → [q₁, q₂, ..., q₈]
- SpeechTokenizer decoder: like a "Lego assembler", it converts the discrete tokens into the final waveform; output audio: the sentence read with a natural, calm male voice.
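As a rough illustration of this pipeline, the following sketch wires together placeholder stand-ins for the AR model, the NAR model, and the codec decoder. The function names, token shapes, and the 320-samples-per-frame figure are assumptions for illustration, not the actual SpeechGPT / SoundStorm / SpeechTokenizer APIs.

```python
import numpy as np

N_LAYERS, SEQ_LEN, CODEBOOK = 8, 50, 1024
rng = np.random.default_rng(0)

def generate_ar_tokens(text: str) -> np.ndarray:
    """AR model stand-in: text -> first-layer (q1) tokens."""
    return rng.integers(0, CODEBOOK, size=SEQ_LEN)

def generate_nar_tokens(q1: np.ndarray, prompt_speech: np.ndarray) -> np.ndarray:
    """NAR model stand-in: (q1, speaker prompt) -> layers q2..q8, produced in parallel."""
    return rng.integers(0, CODEBOOK, size=(N_LAYERS - 1, len(q1)))

def codec_decode(tokens: np.ndarray) -> np.ndarray:
    """Codec decoder stand-in: stacked RVQ tokens -> waveform (dummy audio here)."""
    return rng.standard_normal(tokens.shape[1] * 320)  # assumed 320 samples per frame

prompt_speech = rng.standard_normal(16000 * 3)          # 3 s of prompt audio
q1 = generate_ar_tokens("It is sunny today, 25 degrees.")
q2_to_q8 = generate_nar_tokens(q1, prompt_speech)
all_tokens = np.vstack([q1, q2_to_q8])                  # shape (8, SEQ_LEN)
waveform = codec_decode(all_tokens)
print(all_tokens.shape, waveform.shape)
```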
2.2 Visualization of Distribution Gap
To analyze the distribution gap, we randomly select 1000 speech-text pairs from the LibriSpeech dataset and construct a test corpus composed of triplets $D_{vis} = \{(t, y_g, y_s)\}$ following the procedure in Section 3.1. Here $t$ is the input text, $y_g$ is the golden AR tokens and $y_s$ is the synthetic AR tokens generated by SpeechAlign-sft. The input text $t$ is concatenated with the golden AR tokens $y_g$ and fed into the AR model to obtain the hidden states of each token in the sequence. By applying mean pooling across the temporal dimension, these hidden states are aggregated to produce a single vector representation $Rep_g$ for the golden AR tokens. Similarly, we acquire $Rep_s$ for the synthetic AR tokens using the same procedure. The vectors are visualized in a 2D space using t-SNE, as shown in Figure 2 (a). We can observe that the representations of golden AR tokens and synthetic AR tokens are so dissimilar that they naturally form two distinct clusters, indicating that a significant distribution gap exists between them.
Figure 2 visualizes the representations of the different AR tokens with t-SNE:
Figure 2 (a) shows raw AR tokens (red squares) and synthetic AR tokens (blue triangles) in 2D space. The two kinds of tokens form two distinct clusters, indicating a substantial difference between their representations, i.e., a clear distribution gap.
Figure 2 (b) shows raw AR tokens (red squares) and aligned synthetic AR tokens (blue triangles). After SpeechAlign, the distribution of the synthetic AR tokens moves much closer to that of the raw AR tokens, showing that SpeechAlign effectively narrows the gap and aligns the synthetic tokens with the raw-token distribution.
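A minimal sketch of how such a visualization can be produced, assuming the per-token hidden states have already been extracted; random arrays stand in for them here, with the synthetic set deliberately shifted.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_utts, seq_len, hidden = 1000, 120, 256

# (n_utts, seq_len, hidden) hidden states for golden and synthetic AR tokens (dummy data)
h_golden = rng.normal(0.0, 1.0, size=(n_utts, seq_len, hidden))
h_synth = rng.normal(0.5, 1.0, size=(n_utts, seq_len, hidden))

rep_g = h_golden.mean(axis=1)   # Rep_g: mean pooling over the temporal dimension
rep_s = h_synth.mean(axis=1)    # Rep_s

proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.concatenate([rep_g, rep_s], axis=0)
)
plt.scatter(proj[:n_utts, 0], proj[:n_utts, 1], s=4, label="golden AR tokens")
plt.scatter(proj[n_utts:, 0], proj[n_utts:, 1], s=4, label="synthetic AR tokens")
plt.legend(); plt.title("t-SNE of mean-pooled AR-token representations"); plt.show()
```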
2.3 Distribution Gap Degrades Performance
The NAR model is trained using golden AR tokens as input, but during inference, the input switches to synthetic AR tokens. This results in a discrepancy between the training and inference processes due to the existing distribution gap, potentially affecting performance. To delve into this issue, we conduct a speech reconstruction experiment with the NAR model. We construct a dataset composed of triplet data $D_{test} = \{(z, y_g, y_s)\}$, with $y_g$ and $y_s$ described in Section 2.2 and $z$ representing 3-second prompt speech from the same speaker but distinct from the speech used for $y_g$ and $y_s$. The NAR model performs speech reconstruction by taking prompt speech combined with either golden AR tokens or synthetic AR tokens as input, to generate speech for each type of tokens respectively. The quality of the generated speech is evaluated based on the word error rate (WER) and speaker similarity (SIM) metrics, compared against the ground truth. As shown in Table 1, speech generated from golden AR tokens exhibits superior WER and speaker similarity scores compared to that generated from synthetic AR tokens. This finding proves that the distribution gap adversely affects the NAR model's performance.
Table 1 reports the NAR model's speech reconstruction performance for different inputs: the ground truth, golden AR tokens, and synthetic AR tokens, measured by word error rate (WER, lower is better) and speaker similarity (SIM, higher is better).

| Input | WER ↓ | SIM ↑ |
| --- | --- | --- |
| Groundtruth | 3.4 | – |
| Golden AR tokens | 5.9 | 0.93 |
| Synthetic AR tokens | 7.2 | 0.87 |

Golden AR tokens thus yield better speech reconstruction than synthetic AR tokens, i.e., speech closer to the real recording.
3 SpeechAlign
We take SpeechAlign-sft detailed in Section 2.1 as the baseline system, referred to as $p_{\theta_0}$. Within this framework, the AR model is represented as $p_{\theta_0}^{ar}$, and the NAR model as $p_{\theta_0}^{nar}$. As shown in Figure 3 (b), the first step of SpeechAlign is to construct a preference dataset that contrasts golden codec tokens with synthetic codec tokens. Utilizing this dataset, we implement various preference optimization strategies to align the baseline model. This process is iteratively executed, enabling the continuous self-improvement of codec language models.
3.1 Preference Data Collection
A standard method for collecting human preferences involves prompting the model to produce two distinct responses to a query, after which annotators are asked to select the one they prefer. However, collecting human preferences for codec data is impractical and unscalable. Instead, we construct the preference codec dataset by contrasting the golden codec tokens against synthetic codec tokens. Concretely, we randomly sample $N$ speech-text golden pairs $P = \{(s, x)\}_{i=1}^N$ from the LibriSpeech dataset, where $s = (s_1, ..., s_{|s|})$ is the speech, $x = (x_1, ..., x_{|x|})$ is the corresponding transcript, and $N$ is 50,000. For each speech $s$, we adopt the pretrained SpeechTokenizer to extract discrete representations and denote the tokens of the first RVQ layer as golden AR tokens $y_g$. For the corresponding transcript $x$, the AR model $p_{\theta_0}^{ar}$ takes it as input to generate synthetic AR tokens $y_s$. Following these steps, we can get the preference codec dataset $D_{pf} = \{(x, y_g, y_s)\}_{i=1}^N$.
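A minimal sketch of this construction, with placeholder functions standing in for SpeechTokenizer and the AR model; the "chosen"/"rejected" field names are an illustrative convention, not the paper's data format.

```python
import random

def speech_tokenizer_q1(speech_id: str) -> list[int]:
    """Placeholder for SpeechTokenizer: first-RVQ-layer tokens of a real utterance."""
    rng = random.Random(speech_id)
    return [rng.randint(0, 1023) for _ in range(40)]

def ar_model_generate(text: str) -> list[int]:
    """Placeholder for the AR model p_theta0^ar: synthetic first-layer tokens from text."""
    rng = random.Random(text + "_ar")
    return [rng.randint(0, 1023) for _ in range(40)]

# stand-in for N speech-text pairs sampled from LibriSpeech
speech_text_pairs = [(f"utt_{i}.flac", f"transcript number {i}") for i in range(5)]

D_pf = []
for speech, text in speech_text_pairs:
    y_g = speech_tokenizer_q1(speech)   # golden AR tokens  -> preferred ("chosen")
    y_s = ar_model_generate(text)       # synthetic AR tokens -> dispreferred ("rejected")
    D_pf.append({"text": text, "chosen": y_g, "rejected": y_s})

print(len(D_pf), D_pf[0]["text"], D_pf[0]["chosen"][:5])
```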
Human Verification  To validate the quality of the constructed preference codec dataset, we perform human verification by randomly sampling 100 entries from $D_{pf}$ and employing the same procedure outlined in Section 2.3 to convert $y_g$ and $y_s$ back into speech. This allows humans to compare them side by side and choose the better speech in terms of both speech quality and voice similarity. From the results in Table 2, we can conclude that humans prefer speech reconstructed from golden AR tokens over that from synthetic AR tokens, indicating that the constructed preference codec dataset effectively aligns with human preferences.
- Golden Win: in 71% of cases, listeners prefer the speech reconstructed from golden AR tokens, indicating that golden AR tokens usually yield higher quality and better voice similarity than synthetic AR tokens.
- Tie: in 21% of cases, listeners perceive no clear difference between the two.
- Golden Lose: in 8% of cases, listeners prefer the speech reconstructed from synthetic AR tokens, presumably cases where the synthetic tokens happen to produce higher-quality speech or better match the listener's preference.
3.2 Preference Optimization
In this section, we introduce how we conduct preference optimization to align codec language models using the preference codec dataset, covering Chain-of-Hindsight (Liu et al., 2023b), Direct Preference Optimization (Rafailov et al., 2023), RLHF-PPO (Ouyang et al., 2022) and Best-of-N Sampling.
Chain-of-Hindsight (CoH)  By converting various forms of feedback into sentences and integrating these with the respective responses, CoH enables models to learn from both positive and negative feedback, allowing the identification and correction of negative attributes or errors. At inference time, the model is guided to generate the desired outputs according to the feedback type in the prompt. In our case, we first convert feedback signals into a descriptive template and construct training data by combining responses with the corresponding feedback template as follows:
T_g = "[Human]: Read this text and give me a high-quality speech response: {x} <eoh> [SpeechGPT]: {y_g} <eoq>."
T_s = "[Human]: Read this text and give me a low-quality speech response: {x} <eoh> [SpeechGPT]: {y_s} <eoq>."
The AR model is optimized via the negative log-likelihood loss on the preference corpus $D_{pf}$ as follows:
$$L_{CoH} = - \mathbb{E}_{(x, y_g, y_s) \sim D_{pf}} \left[ \log p_{\theta_0}^{ar}(y_g \mid x, T_g) + \log p_{\theta_0}^{ar}(y_s \mid x, T_s) \right]$$
During the inference phase, we prompt the model with positive feedback in the form of ‘high-quality’ to guide the model in generating favorable outputs.
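A small sketch of how CoH training examples could be assembled from one preference triple; the token-to-string rendering is a hypothetical convention used only for illustration.

```python
def render_tokens(tokens):
    """Hypothetical textual rendering of codec tokens as unit strings."""
    return "".join(f"<{t}>" for t in tokens)

def coh_training_examples(x, y_g, y_s):
    t_g = ("[Human]: Read this text and give me a high-quality speech response: "
           f"{x} <eoh> [SpeechGPT]: {render_tokens(y_g)} <eoq>.")
    t_s = ("[Human]: Read this text and give me a low-quality speech response: "
           f"{x} <eoh> [SpeechGPT]: {render_tokens(y_s)} <eoq>.")
    # both sequences are used as ordinary targets for the negative log-likelihood loss
    return [t_g, t_s]

for example in coh_training_examples("the weather is nice", [245, 78, 156], [245, 76, 158]):
    print(example)
```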
Intuitively, the loss teaches the model to reproduce the golden tokens $y_g$ when prompted for a high-quality response, reinforcing high-quality outputs.
Direct Preference Optimization (DPO)  Without using explicit reward modeling or reinforcement learning, DPO can fine-tune the model to align with human preferences. DPO considers the likelihood of the preferred response over the dispreferred response and optimizes the LLM towards that objective. The prompt template for DPO training is as follows:
T = "[Human]: Read this text and give me a speech response: {x} <eoh> [SpeechGPT]: {y} <eoq>."
(Algorithm 1: SpeechAlign — its inputs and steps are summarized in the step-by-step walkthrough later in this section.)
In our case, the DPO loss can be formulated as follows:
$$L_{DPO} = - \mathbb{E}_{(x, y_g, y_s) \sim D_{pf}} \left[ \log \sigma \left( \log \frac{p_{\theta}^{ar}(y_g \mid x, T)}{p_{ref}^{ar}(y_g \mid x, T)} - \log \frac{p_{\theta}^{ar}(y_s \mid x, T)}{p_{ref}^{ar}(y_s \mid x, T)} \right) \right]$$
where $p_{ref}^{ar}$ is the reference model, initialized with $p_{\theta}^{ar}$.
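A minimal PyTorch sketch of this loss over per-sequence log-probabilities. The `beta` temperature is an assumption (the formula above has no explicit coefficient, so it defaults to 1 here), and the log-probability values are dummies; in SpeechAlign they would be summed token log-probs of the AR model on the golden ($y_g$) and synthetic ($y_s$) tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_g, logp_s, ref_logp_g, ref_logp_s, beta: float = 1.0):
    """Each argument is a 1-D tensor of summed per-sequence log-probabilities."""
    chosen_logratio = logp_g - ref_logp_g      # log [ p_theta(y_g|x,T) / p_ref(y_g|x,T) ]
    rejected_logratio = logp_s - ref_logp_s    # log [ p_theta(y_s|x,T) / p_ref(y_s|x,T) ]
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# toy batch of 4 preference pairs (dummy log-probabilities)
logp_g = torch.tensor([-52.1, -60.3, -48.7, -55.0])
logp_s = torch.tensor([-50.9, -61.5, -49.9, -57.2])
ref_logp_g = torch.tensor([-53.0, -60.0, -49.5, -55.5])
ref_logp_s = torch.tensor([-50.0, -60.8, -49.0, -56.0])
print(dpo_loss(logp_g, logp_s, ref_logp_g, ref_logp_s).item())
```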
At the start of the algorithm we have the initial model $p_{\theta_0}^{ar}$ and create an identical copy of it as the reference model $p_{ref}^{ar}$.

In the first iteration:
- the current model is $p_{\theta}^{ar} = p_{\theta_0}^{ar}$
- the reference model is $p_{ref}^{ar} = p_{\theta_0}^{ar}$
- i.e., the two are exactly the same

After preference optimization the current model is updated to $p_{\theta_1}^{ar}$, while the reference model stays unchanged.

In subsequent iterations:
- the current model keeps updating: $p_{\theta_2}^{ar}$, $p_{\theta_3}^{ar}$, ...
- the reference model remains $p_{ref}^{ar} = p_{\theta_0}^{ar}$

The ratio $\frac{p_{\theta}^{ar}(y \mid x, T)}{p_{ref}^{ar}(y \mid x, T)}$ in the loss measures how much the current model's probability of generating particular tokens has shifted relative to the initial model.

In this way the algorithm steers the model towards generating tokens closer to the golden AR tokens, while avoiding training instability or drifting too far from the initial distribution.
Step-by-step walkthrough: DPO and the SpeechAlign algorithm

1. Core idea of DPO
Goal: adjust the model directly with preference data, without an explicit reward model or reinforcement learning, so that its outputs better match human preferences.
Principle: drive the optimization by contrasting the generation probabilities of the preferred response (golden AR tokens) and the dispreferred response (synthetic AR tokens).

2. DPO training template
Template design:

T = "[Human]: Read this text and give me a speech response: {x} <eoh> [SpeechGPT]: {y} <eoq>."

- Purpose:
  - a unified training format that binds the input text $x$ to the generated AR tokens $y$ (golden or synthetic).
  - `<eoh>` and `<eoq>` are separators marking the boundary between the human instruction and the model response.

Example: with input text $x$ = "It is sunny today" and golden tokens $y_g = [101, 205, 309]$, the filled template is:

[Human]: Read this text and give me a speech response: It is sunny today <eoh>
[SpeechGPT]: [101, 205, 309] <eoq>

3. SpeechAlign algorithm flow
Inputs:
- speech-text dataset $\{(s_i, x_i)\}_{i=0}^N$ ($s_i$ is speech, $x_i$ is text)
- pretrained SpeechTokenizer model $m_\phi$ (parameters $\phi$)
- initial AR model $p_{\theta_0}^{ar}$ (parameters $\theta_0$)
- number of iterations $T$

Steps:
- Outer loop (repeat $T$ times):
  - Inner loop (over every sample $i$ in the dataset):
    - Generate golden AR tokens: $y_{r_i} \sim m_\phi(\cdot \mid s_i)$ (extract the first-layer tokens from the real speech $s_i$ with SpeechTokenizer)
    - Generate synthetic AR tokens: $y_{s_i} \sim p_{\theta_t}^{ar}(\cdot \mid x_i)$ (generate first-layer tokens from the text $x_i$ with the current AR model)
  - Build the preference dataset: $D_{pf} = \{(x_i, y_{r_i}, y_{s_i})\}_{i=1}^N$
  - Preference optimization: update the AR model parameters $\theta_t \rightarrow \theta_{t+1}$ using $D_{pf}$

Output: the optimized AR model parameters $\theta_T$

4. Understanding the DPO loss
Formula:

$$L_{DPO} = - \mathbb{E}_{(x, y_g, y_s) \sim D_{pf}} \left[ \log \sigma \left( \log \frac{p_{\theta}^{ar}(y_g \mid x, T)}{p_{ref}^{ar}(y_g \mid x, T)} - \log \frac{p_{\theta}^{ar}(y_s \mid x, T)}{p_{ref}^{ar}(y_s \mid x, T)} \right) \right]$$

Symbols:
- $p_{\theta}^{ar}$: the current AR model being optimized
- $p_{ref}^{ar}$: the reference model (initialized from $p_{\theta}^{ar}$)
- $\sigma$: the sigmoid function, mapping the difference into (0, 1)

Meaning of the loss:
- Core objective: maximize the difference between the log-probability ratios of the preferred response (golden tokens $y_g$) and the dispreferred response (synthetic tokens $y_s$).
- First term $\log \frac{p_{\theta}^{ar}(y_g)}{p_{ref}^{ar}(y_g)}$: encourages the current model to assign the golden tokens more probability than the reference model does.
- Second term $\log \frac{p_{\theta}^{ar}(y_s)}{p_{ref}^{ar}(y_s)}$: suppresses the probability of generating the synthetic tokens.
- The sigmoid keeps the optimization direction stable and prevents an overly large ratio difference from destabilizing training.

5. Advantages and a worked example
Advantages:
- No reward modeling: preference data is used directly, with no separate reward model to train.
- Stability: the reference model $p_{ref}^{ar}$ constrains the update so the model does not drift too far from its initial capability.

Worked example:
Training: input text $x$ = "It is sunny today", golden tokens $y_g = [101, 205, 309]$, synthetic tokens $y_s = [101, 205, 310]$. When computing the loss:
- if the current model assigns $y_g$ a higher probability than the reference model does, the loss decreases;
- if it assigns $y_s$ a higher probability than the reference model does, the loss increases.
Inference: given the same text $x$, the model now tends to generate tokens close to $y_g$ (e.g., [101, 205, 309]) rather than $y_s$.
RLHF-PPO  RLHF methods involve training a reward model on a dataset reflecting human preferences. RL algorithms are then applied to adjust a language model's policy to favor responses that are highly rewarded, while ensuring minimal deviation from the original model's behavior. With the preference dataset $D_{pf}$, we can parameterize a reward model $r_\phi(x, y)$ and estimate the parameters via maximum likelihood. By treating the task as a binary classification, we utilize the negative log-likelihood loss:
$$L_{rm} = - \mathbb{E}_{(x, y_g, y_s) \sim D_{pf}} \left[ \log \sigma \left( r_\phi(x, y_g) - r_\phi(x, y_s) \right) \right]$$
where $\sigma$ is the logistic function. The reward model $r_\phi(x, y)$ is initialized from the AR model $p_\theta^{ar}$ with a linear layer atop the last Transformer layer to yield a single scalar prediction as the reward value. During the RL stage, we optimize the AR model against the reward model using the PPO algorithm. Specifically, we refine the AR model $p_{\theta_0}^{ar}$ by solving the following optimization problem:
$$\max_{p_{\theta_0}^{ar}} \; \mathbb{E}_{x \sim D_{pf},\, y \sim p_{\theta_0}^{ar}(y \mid x)} \left[ r_\phi(x, y) \right] - \beta\, \mathbb{D}_{KL} \left[ p_{\theta_0}^{ar}(y \mid x) \,\|\, p_{ref}^{ar}(y \mid x) \right]$$
where $\beta$ is a coefficient regulating the extent of the KL penalty and $p_{ref}^{ar}$ is the reference model, initialized with $p_\theta^{ar}$.
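A minimal sketch of the pairwise reward-model loss $L_{rm}$; the reward scores are dummy scalars here, whereas in the paper they come from the AR model with a scalar head on the last Transformer layer.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_golden: torch.Tensor, r_synthetic: torch.Tensor) -> torch.Tensor:
    # negative log-likelihood of ranking the golden sample above the synthetic one
    return -F.logsigmoid(r_golden - r_synthetic).mean()

r_g = torch.tensor([1.3, 0.8, 2.1])   # rewards for speech from golden AR tokens
r_s = torch.tensor([0.4, 1.0, 1.5])   # rewards for speech from synthetic AR tokens
print(reward_model_loss(r_g, r_s).item())
```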
Best-of-N Sampling (BoN)  With the reward model trained on the preference data, we implement a Best-of-N approach to enhance the quality of output codec tokens. Concretely, we sample $N$ responses using the AR model. These responses are then evaluated by the reward model, and the one receiving the highest reward score is chosen as the final response to serve as input for the NAR model.
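A sketch of Best-of-N selection; the sampling and reward functions are placeholders, not the actual models.

```python
import random

def sample_ar_tokens(text: str) -> list[int]:
    """Placeholder: one stochastic AR-token sample for the given text."""
    return [random.randint(0, 1023) for _ in range(40)]

def reward(text: str, tokens: list[int]) -> float:
    """Placeholder reward model: any scalar scoring function would do here."""
    return -abs(sum(tokens) / len(tokens) - 512)

def best_of_n(text: str, n: int = 8) -> list[int]:
    candidates = [sample_ar_tokens(text) for _ in range(n)]
    return max(candidates, key=lambda y: reward(text, y))  # highest-reward candidate

best = best_of_n("the weather is nice", n=8)   # passed on as input to the NAR model
print(len(best))
```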
3.3 Iterative Self-Improvement
Following the aforementioned steps results in an updated AR model, denoted as $p_{\theta_1}^{ar}$. Using this updated model, we can create a new preference codec dataset $D_{pf}$. This dataset then serves as the basis for further improvement of the AR model through preference optimization. The iterative self-improvement process of the AR model, as detailed in Algorithm 1, enables continuous calibration of the output distribution towards the authentic codec token distribution.
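A pseudocode-style sketch of this loop (Algorithm 1) under assumed interfaces (`speech_tokenizer_q1`, `ar_model.generate`, `preference_optimize`); it is not the released implementation.

```python
def speechalign(dataset, speech_tokenizer_q1, ar_model, preference_optimize, T):
    """dataset: iterable of (speech, text) pairs; ar_model exposes generate(text) -> tokens."""
    for _ in range(T):
        d_pf = []
        for speech, text in dataset:
            y_g = speech_tokenizer_q1(speech)   # golden AR tokens from real speech
            y_s = ar_model.generate(text)       # synthetic AR tokens from the current model
            d_pf.append((text, y_g, y_s))
        ar_model = preference_optimize(ar_model, d_pf)  # e.g. DPO over (chosen, rejected) pairs
    return ar_model
```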
4 Experiments
4.1 Setups
Data For the continue finetuning stage in Section 2.1, we use the LibriSpeech dataset. To construct the preference codec dataset, we randomly sample 50k speech-text pairs from LibriSpeech training set. During the iterative process of SpeechAlign, we utilize the synthetic data generated in the most recent iteration and combine it with the newly produced synthetic data. As a result, the size of the synthetic dataset increases across iterations: starting at 50k in iteration 0, and expanding to 100k in iterations 1, 2, and 3.
Model  For the AR model, we further finetune the pretrained SpeechGPT model in (Zhan et al., 2024) on the LibriSpeech dataset. For the NAR model, we use the pretrained SoundStorm model in (Zhan et al., 2024).
Training  For the continue finetuning stage in Section 2.1, the batch size is set to 256, with a learning rate of 1e-5, and we train for 3500 steps on 8 A100 80G GPUs. For CoH finetuning, the batch size is set to 32, with a learning rate of 1e-5, and we train for 12000 steps on 8 A100 80G GPUs. For DPO finetuning, the batch size is set to 128, with a learning rate of 5e-7, and we train for 2000 steps on 8 A100 80G GPUs. For reward model training, the batch size is set to 32, with a learning rate of 1e-5, and we train for 1000 steps on 8 A100 80G GPUs. For PPO training, the batch size is set to 16, with a learning rate of 1e-5, and we train for 1000 steps on 8 A100 80G GPUs.
(Note: every iteration produces new synthetic AR tokens; being model-generated, they are not perfectly accurate, but they can be kept to enlarge the preference dataset across iterations.)
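For reference, the hyperparameters listed above gathered in one place; the keys are informal labels, not arguments of any particular training framework.

```python
# All runs use 8x A100 80G GPUs, per the paper.
TRAINING_CONFIGS = {
    "continue_finetune": {"batch_size": 256, "lr": 1e-5, "steps": 3500},
    "coh":               {"batch_size": 32,  "lr": 1e-5, "steps": 12000},
    "dpo":               {"batch_size": 128, "lr": 5e-7, "steps": 2000},
    "reward_model":      {"batch_size": 32,  "lr": 1e-5, "steps": 1000},
    "ppo":               {"batch_size": 16,  "lr": 1e-5, "steps": 1000},
}
```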
4.2 Evaluation and Metrics
We conduct zero-shot TTS evaluation on the LibriSpeech test-clean set and the VCTK dataset. For each speaker, we randomly select a 3s utterance as the prompt, while the textual content of a different utterance is used as the input text. To reduce the randomness in the evaluation process, we evaluate each model ten times and then calculate the average to obtain the final result. The metrics we adopt are as follows:
Word Error Rate (WER) is utilized to assess the content accuracy of synthesized speech by calculating the distance between the synthesized speech's transcription and the input text. We use the Whisper medium-en model (Radford et al., 2022) to transcribe the synthesized speech.
Speaker Similarity (SIM) evaluates the consistency of timbre between the synthesized and the prompt speech. This is measured by the similarity between the speaker embedding of the generated speech and that of the speech prompt. The similarity calculation involves the following steps: 1) employing a speaker embedding extractor to derive the speaker embeddings for both the generated and prompt speech, and 2) computing the cosine similarity between these normalized embeddings.
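A minimal sketch of the two objective metrics: a plain edit-distance WER and cosine speaker similarity. The ASR transcription and speaker-embedding extraction themselves are assumed to happen upstream (e.g., with Whisper and a speaker encoder); random vectors stand in for embeddings here.

```python
import numpy as np

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level edit distance, normalized by reference length."""
    r, h = reference.split(), hypothesis.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(r), len(h)] / max(len(r), 1)

def speaker_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between normalized speaker embeddings."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

print(wer("the weather is nice today", "the weather is nice"))
print(speaker_similarity(np.random.randn(192), np.random.randn(192)))
```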
Human Evaluation  We conduct comparative testing of various models' outputs against the baseline system's speech. During the evaluation phase, the evaluators are provided with prompt speech, the baseline system's speech, and our model's speech. Human evaluators are tasked with determining which utterance sounds more natural and closer to the prompt speech. Evaluators have the option to choose either of the two utterances or indicate that they perceive them as equally natural. Each evaluation receives 6 ratings from 6 different human evaluators.
4.3 Main Results
Preference Optimization Boosts Speech Generation  Figure 1 reveals that our preference-optimized models, SpeechAlign-BoN, SpeechAlign-RLHF-PPO and SpeechAlign-DPO-Iter1, significantly outperform the baseline model in win rates. As for objective evaluation, Table 3 shows that the WERs of SpeechAlign-RLHF-PPO and SpeechAlign-DPO series models are lower than that of SpeechAlign-sft. This suggests that preference optimization can enhance the accuracy of content modeling. Furthermore, these models also achieve superior performance in Speaker Similarity, indicating that preference optimization can also improve the effectiveness of timbre modeling. These findings underscore the effectiveness of learning from human feedback in significantly improving the capabilities of codec language models across content, timbre, and audio quality dimensions.
Speech Language Model Can Self-Improve Iteratively The quantitative results in Table 3 show that from Iteration 1, DPO contributes to enhancements in speech generation. Iteration 2 further amplifies these improvements, with a notably significant impact on the WER. And the trend of gradual enhancement is maintained in subsequent iterations. By Iteration 3, there is a reduction in WER by 0.8 compared to the Baseline, and Speaker Similarity has increased to 0.9. Figure 1 shows that SpeechAlign-DPO-Iter3 achieves a higher win rate compared to SpeechAlign-DPO-Iter1, indicating superior performance in qualitative evaluation. This confirms that iterative DPO can consistently enhance the quality of the speech generated by the model. It demonstrates that SpeechAlign is an effective method for the speech language model to undergo continuous and efficient self-improvement.
Generalization to Unseen Speakers  We also evaluate whether learning from human feedback brings better speech generation when encountering speakers unseen in the preference data. We evaluate different models' performances on the VCTK dataset. As shown in Table 3, SpeechAlign-RLHF-PPO, SpeechAlign-BoN, and SpeechAlign-DPO can still improve the generated speech across all metrics. We also observe similar improvements in subjective evaluation in Figure 1. And iterative optimization can bring continuous improvement, suggesting that SpeechAlign generalizes to unseen speakers.
5 Analysis
6 Related Work
7 Conclusion
This paper first analyzes the distribution gap existing in current neural codec language models and proposes to solve it by learning from human feedback. To avoid the need for additional human-annotated preference data, we construct a preference codec dataset contrasting golden codec tokens against synthetic tokens. Then we conduct preference optimization to align codec language models with human preference. Subjective and objective evaluation results prove the effectiveness of SpeechAlign in continuously converting weak codec language models into stronger ones.