Align-SLM论文学习


Textless Spoken Language Models with Reinforcement Learning from AI Feedback
通过AI反馈学习的无文本口语模型

Abstract

While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning with AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization to improve the semantics of SLMs.

虽然无文本的语音语言模型(SLMs)在端到端的语音到语音建模方面展现出潜力,但它们在语义连贯性和相关性方面仍不及基于文本的大语言模型(LLMs)。本文介绍了一种名为Align-SLM的框架,该框架利用受强化学习人工智能反馈(RLAIF)启发的偏好优化方法,以增强SLMs的语义理解能力。我们的方法从给定的提示生成多个语音延续,并使用语义指标为直接偏好优化(DPO)创建偏好数据。我们使用ZeroSpeech 2021基准测试来评估该框架的词汇和句法建模能力,使用故事补全数据集的语音版本来评估语义连贯性,并采用其他语音生成指标(包括GPT4-o评分和人工评估)进行进一步评估。实验结果表明,我们的方法在大多数基准测试中为SLMs实现了最先进的性能,突显了偏好优化对于提升SLMs语义能力的重要性。

1 Introduction

Significant strides have been made in Large Language Models (LLMs) by training decoder-only transformer models on vast amounts of text data. In speech processing, Textless NLP (Lakhotia et al., 2021; Kharitonov et al., 2022b; Nguyen et al., 2023; Lin et al., 2022) employs discrete speech units to train Spoken Language Models (SLMs) through next speech unit prediction. This approach is particularly promising, as SLMs are end-to-end speech-to-speech models that bypass the traditional cascaded pipeline of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems, enabling joint optimization and real-time human-computer interaction. Furthermore, SLMs are applicable to all spoken languages, including those without written scripts, as they only require unlabeled speech data, thus promoting inclusivity in speech technology.

在大型语言模型(LLMs)领域,通过在海量文本数据上训练仅含解码器的transformer模型,已取得重大进展。在语音处理领域,无文本自然语言处理(Lakhotia等人,2021;Kharitonov等人,2022b;Nguyen等人,2023;Lin等人,2022)采用离散语音单元,通过预测下一个语音单元来训练语音语言模型(SLMs)。这种方法极具前景,因为SLMs是端到端的语音到语音模型,跳过了传统自动语音识别(ASR)和语音合成(TTS)系统的级联流程,能够实现联合优化和实时的人机交互。此外,SLMs适用于所有口语语言,包括没有书写系统的语言,因为它们仅需未标注的语音数据,从而推动了语音技术的包容性发展。


Despite increasing efforts to develop and improve SLMs—through text model initialization (Hassid et al., 2024; Shih et al., 2024), speech tokenizer design (Lakhotia et al., 2021; Hassid et al., 2024; Baade et al., 2024), text & speech token interleaving (Chou et al., 2023; Nguyen et al., 2024), scaling data and model (Hassid et al., 2024; Cuervo and Marxer, 2024)—a substantial gap remains between the understanding capabilities of text-based LLMs and SLMs. Current SLMs, when prompted, often produce speech continuations characterized by repetitive phrases, grammatical inaccuracies, and low relevance. Zhang et al. (2023); Nachmani et al. propose predicting text during intermediate decoding steps in a chain that mimics the ASR, LM, and TTS tasks within a single model. While these intermediate text steps improve the semantics of the generated speech, they still rely on text tokens as conditions to guide speech generation, and the additional decoding steps introduce latency, preventing real-time interactive SLMs. The question of whether textless SLMs can generate semantically relevant speech remains under-explored.

尽管在开发和改进语音语言模型(SLMs)方面做出了越来越多的努力——通过文本模型初始化(Hassid等,2024;Shih等,2024)、语音分词器设计(Lakhotia等,2021;Hassid等,2024;Baade等,2024)、文本与语音标记交错(Chou等,2023;Nguyen等,2024)、扩展数据和模型(Hassid等,2024;Cuervo和Marxer,2024)——但基于文本的大型语言模型(LLMs)与语音语言模型(SLMs)之间的理解能力仍存在显著差距。目前的SLMs在接收到提示时,通常会生成包含重复短语、语法错误和低相关性的语音延续。Zhang等(2023)与Nachmani等提出在一个模拟ASR、LM和TTS任务的单一模型中,在中间解码步骤中预测文本。虽然这些中间文本步骤改善了生成语音的语义,但它们仍依赖于文本标记作为条件来指导语音生成,并且额外的解码步骤引入了延迟,阻碍了实时交互的SLMs。关于无文本SLMs是否能够生成语义相关的语音的问题仍未得到充分探索。


Most research on SLMs has relied exclusively on next-speech-token prediction. Few studies have explored alternative optimization objectives. Compared to text subwords, which on average carry more information, speech tokens are finer-grained and less compact. We argue that the next-speech token prediction task may overlook long-term semantics, as loosely compressed speech units exhibit significant variability along spectral and temporal dimensions. Consequently, SLMs require a better training objective to effectively capture long-range semantics.

大多数关于语音语言模型(SLMs)的研究都仅依赖于下一个语音标记预测(语音标记在频谱特征(如音高、音色)和时间维度(如语速)上高度变化,导致模型难以捕捉长期语义依赖)。很少有研究探索替代的优化目标。与平均携带更多信息的文本子词相比,语音标记更细粒度且不够紧凑(英语单词"cat"可能被分解为音素标记 /k/、/æ/、/t/,而文本子词可能直接编码为"cat"一个标记。)。我们认为,下一个语音标记预测任务可能忽略了长期语义,因为松散压缩的语音单元在频谱和时间维度上表现出显著的变化。因此,SLMs需要更好的训练目标来有效捕捉长距离语义。


Our motivation stems from the observation that SLMs produce inconsistent results, sometimes generating high-quality speech continuations, while at other times producing suboptimal ones. Can we train SLMs to consistently generate better speech continuations while avoiding failures? Drawing inspiration from Reinforcement Learning with Human Feedback (RLHF) for text LLM alignment (Ouyang et al., 2022; Rafailov et al., 2024), we propose Align-SLM, the first framework that enhances the semantics of SLMs through RL. Starting with a pre-trained SLM (the open-sourced TWIST (Hassid et al., 2024) model), we generate multiple speech continuations from a given speech prompt. The next step is to create preference data (prompt, chosen, rejected) for preference optimization. Since collecting human preferences by listening is costly and time-consuming, following the concept of Reinforcement Learning from AI Feedback (RLAIF), we propose an automatic preference data selection strategy with LLM-guided semantic feedback. After preparing the preference data, Direct Preference Optimization (DPO) (Rafailov et al., 2024) is applied to learn from the feedback. Additionally, we couple the proposed technique with curriculum learning and demonstrate further improvements. The proposed framework is pure speech-to-speech, data-efficient, and does not require text injection (Nguyen et al., 2024; Chou et al., 2023) or text-to-speech synthesized speech (Zhang et al., 2023).

我们的动机源于这样一个发现:语音语言模型(SLMs)会产生不一致的结果,有时生成高质量的语音延续,有时则生成次优的语音延续。我们能否训练SLMs,使其始终生成更好的语音延续,同时避免失败呢?从用于文本大语言模型(LLMs)对齐的基于人类反馈的强化学习(RLHF)(Ouyang等人,2022;Rafailov等人,2024)中获得灵感,我们提出了Align-SLM,这是第一个通过强化学习(RL)增强SLMs语义的框架。从一个预训练的SLM(开源的TWIST(Hassid等人,2024)模型)开始,我们从给定的语音提示中生成多个语音延续 <1>。下一步是创建偏好数据(提示、选择、拒绝)用于偏好优化 <2>。由于通过收听收集人类偏好既昂贵又耗时,我们遵循从人工智能反馈进行强化学习(RLAIF)的概念,提出了一个具有LLM引导的语义反馈的自动偏好数据选择策略。在准备好偏好数据后,应用直接偏好优化(DPO)(Rafailov等人,2024)从反馈中学习 <3>。此外,我们将所提技术与课程学习相结合,展示了进一步的改进。所提出的框架是纯语音到语音的,数据高效的,不需要文本注入(Nguyen等人,2024;Chou等人,2023)或文本到语音合成的语音(Zhang等人,2023)。

<1>

步骤1:生成多个语音延续

原理
输入一段语音提示(如“今天天气不错”),让预训练的SLM生成多个可能的延续(如“适合去公园散步”“我想吃冰淇淋冰淇淋”“嗯…下雨了吗?”)。这些候选答案在语义连贯性、语法正确性上存在差异。

实例

  • 输入语音:“如何学习英语?”
  • 生成候选
    • A:“每天背单词,多听播客。”(正确)
    • B:“英语学习学习需要坚持坚持。”(重复)
    • C:“苹果是红色的。”(无关)

步骤2:自动创建偏好数据

a. 核心挑战

人工标注语音质量成本高,需设计自动筛选机制。

b. 实现方法:LLM语义反馈
  1. 语音转文本(仅评估用):使用ASR将候选语音转为文字(如B→“英语学习学习需要坚持坚持”)。
  2. LLM评分:将文本输入LLM(如GPT-4),要求从相关性、逻辑性、无重复等维度打分。
  3. 构建偏好对:选择得分最高的作为“优选”(chosen),随机选一个低分作为“劣选”(rejected)。

实例

  • 候选A文本:“每天背单词,多听播客。” → LLM评分95
  • 候选B文本:“英语学习学习需要坚持坚持。” → LLM评分30
  • 候选C文本:“苹果是红色的。” → LLM评分10
  • 偏好对
    • 提示:“如何学习英语?”
    • 优选:A
    • 劣选:B
关键点
  • 无需人工参与:ASR和LLM全自动处理
  • 纯语音闭环:ASR仅在评估阶段临时使用,最终模型生成仍不依赖文本
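
把上面“ASR转写 → LLM评分 → 组成偏好对”的流程串起来,大致可以写成下面这个简化的Python草图。其中 asr_transcribe 与 llm_score 是假设的占位函数(分别代表Whisper转写和Mistral打分的调用),并非论文官方实现,仅用于说明偏好数据是如何被自动选出来的。

```python
# 示意草图:自动构建偏好数据对
# asr_transcribe / llm_score 为假设的占位函数,分别代表 Whisper 转写与 Mistral 打分

def build_preference_pair(prompt_wav, candidate_wavs, asr_transcribe, llm_score):
    """对同一语音提示的多个候选延续打分,返回 (prompt, chosen, rejected)。"""
    prompt_text = asr_transcribe(prompt_wav)     # 语音提示 -> 文本(仅用于评估)
    scored = []
    for wav in candidate_wavs:
        text = asr_transcribe(wav)               # 候选延续 -> 文本
        score = llm_score(prompt_text, text)     # LLM 按相关性/连贯性/无重复给 1~5 分
        scored.append((score, wav))

    scored.sort(key=lambda item: item[0], reverse=True)
    chosen = scored[0][1]                        # 得分最高者作为“优选”
    rejected = scored[-1][1]                     # 得分最低者作为“劣选”(也可在低分样本中随机选)
    return {"prompt": prompt_wav, "chosen": chosen, "rejected": rejected}
```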

步骤3:直接偏好优化(DPO)

原理
通过偏好数据调整模型,使其对“优选”输出的概率高于“劣选”。数学上优化以下目标:
$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left( \beta \left( \log \frac{\pi_\theta(y_{\text{chosen}}|x)}{\pi_{\text{ref}}(y_{\text{chosen}}|x)} - \log \frac{\pi_\theta(y_{\text{rejected}}|x)}{\pi_{\text{ref}}(y_{\text{rejected}}|x)} \right) \right)$$
其中 $\pi_\theta$ 为待优化模型,$\pi_{\text{ref}}$ 为原始(参考)模型,$\beta$ 为调节系数。

实例解释

  • 原始模型对候选B的概率为30%,对A为50%
  • 训练目标:调整参数θ,使模型对A的概率升至70%,对B降至10%
  • 效果:输入相同提示时,模型更倾向生成A类优质回答

步骤4:课程学习(Curriculum Learning)

原理
分阶段训练,从简单样本逐步过渡到复杂样本,避免模型过早陷入局部最优。

实例训练计划

  1. 阶段1:短语音提示(3秒内)+ 明确主题
    • 示例输入:“帮我订机票”
    • 生成目标:简短确认(如“目的地是哪里?”)
  2. 阶段2:中等长度提示(5秒)+ 多话题
    • 示例输入:“我想学英语,但工作太忙怎么办?”
    • 生成目标:多步骤建议(如“利用通勤时间听材料,每天抽15分钟练习”)
  3. 阶段3:长对话历史+隐含意图
    • 示例输入:(连续对话)
      User:“推荐一部电影。”
      AI:“您喜欢科幻还是喜剧?”
      User:“科幻,但不要太烧脑。”
    • 生成目标:精准推荐(如《火星救援》)

全程纯语音示例

输入输出全流程(不涉及文本中间态):

  1. 用户语音:“明天需要带伞吗?”(语音波形)
  2. SLM生成候选
    • 候选1语音:“今天会下雨,建议带伞。”(波形A)
    • 候选2语音:“伞伞带带明天。”(波形B)
  3. 自动评估
    • ASR转文本→LLM判断候选1更优
  4. DPO训练:模型学习强化候选1的生成模式
  5. 最终效果:用户再次问“需要带伞吗?”,模型直接生成高质量语音回应

与传统方法的对比

| 维度 | 传统方法(如Zhang等2023) | Align-SLM(本文) |
| --- | --- | --- |
| 依赖文本 | 需中间文本生成(ASR→文本LLM→TTS) | 纯语音端到端,无需文本 |
| 延迟 | 高(级联处理) | 低(直接生成) |
| 适用语言 | 依赖文字书写系统 | 支持无文字语言(如方言、少数民族语) |
| 数据需求 | 需配对语音-文本数据 | 仅需无标注语音 |

We evaluate the SLM’s performance using the sWUGGY and sBLIMP from ZeroSpeech 2021 (Nguyen et al., 2020) for lexical and syntactic modeling, and Spoken-StoryCloze and Topic StoryCloze (Hassid et al., 2024) for textual nuances and continuation coherence. Additionally, we perform generative evaluations for speech continuation using (i) human listening tests and (ii) GPT-4 as a proxy for assessing semantic coherence and relevance. The results show that the proposed method achieves superior performance in semantic understanding and speech generation. The contributions can be summarized as follows:

  • We propose the first preference optimization framework for textless SLMs, demonstrating that preference optimization is crucial for improving the semantics of SLMs.
  • We develop an automated preference data selection strategy by designing effective semantic metrics to score preference data pairs.
  • We couple DPO with curriculum learning by iteratively opting for higher criterion of preference data to further enhance performance.

我们通过ZeroSpeech 2021的sWUGGY和sBLIMP(Nguyen等人,2020)评估SLM的词汇和句法建模性能,通过Spoken-StoryCloze和Topic StoryCloze(Hassid等人,2024)评估文本细微差别和连贯性。此外,我们通过(i)人类听觉测试和(ii)GPT-4(作为语义连贯性和相关性评估的代理)对语音延续进行生成式评估。结果表明,所提方法在语义理解和语音生成方面表现出色。主要贡献如下:

  • 我们提出了首个针对无文本SLMs的偏好优化框架,证明了偏好优化对提升SLMs语义能力至关重要。
  • 我们通过设计有效的语义指标来评分偏好数据对,开发了一种自动化的偏好数据选择策略
  • 我们通过迭代选择更高标准的偏好数据,将DPO与课程学习相结合,以进一步提升性能。

Align-SLM achieves state-of-the-art performance for end-to-end spoken language models on the ZeroSpeech and StoryCloze benchmarks (77.9% on sWUGGY, 61.1% on S-StoryCloze, and 86.8% on T-StoryCloze) and achieves superior Meaningfulness Mean Opinion Scores in human evaluations.

Align-SLM在ZeroSpeech和StoryCloze基准测试中达到了端到端语音语言模型的最先进性能(sWUGGY得分为77.9%,S-StoryCloze得分为61.1%,T-StoryCloze得分为86.8%),并在人类评估中获得了更高的有意义性平均意见得分。


1. ZeroSpeech 2021 基准测试

测试目标

评测模型在无文本监督下对语音的词汇句法建模能力,重点关注语言结构的无监督学习。
示例说明

  • 输入语音句子:“The glorp is on the table.”(“glorp”为虚构词)
    • 模型需判断虚构词是否符合英语音系规则(如辅音聚类合理性)
    • 77.9%准确率表明Align-SLM能有效区分“glorp”(合理)与“gxrpt”(不合理)

2. StoryCloze 系列测试

测试目标

评估模型对长程语义连贯性主题一致性的理解能力,模拟真实对话/叙事场景。

子任务对比
| 测试项 | 挑战重点 | Align-SLM表现 | 典型示例 |
| --- | --- | --- | --- |
| S-StoryCloze | 故事逻辑补全,负样本经对抗性筛选、语义接近,区分难度高 | 61.1% | 输入故事“小明忘记带钥匙”,需在结局A(找开锁匠)和B(买冰淇淋)中选择合理项 |
| T-StoryCloze | 主题层面的连贯性,负样本随机取自其他故事、主题不相关,区分难度较低 | 86.8% | 输入“太空探索”主题的故事,需在延续该主题的合理结局与来自无关主题的结局之间做选择 |

性能解读

  • S-StoryCloze 61.1%:超过此前最佳SLM(54.3%),证明对开放叙事语义的理解提升
  • T-StoryCloze 86.8%:接近文本LLM水平(如GPT-3在该任务约89%),显示强大的主题聚焦能力

2 Related Work

2.1 Spoken Language Models (SLMs)

Recent advancements in self-supervised representation learning and acoustic unit discovery convert continuous speech signals into discrete speech tokens (Polyak et al., 2021a). SLMs are end-to-end language models with discrete speech tokens, enabling speech continuation given a speech prompt. GSLM (Lakhotia et al., 2021) utilizes speech tokens to train a decoder-only language model and synthesize speech waveforms using a unit-based vocoder. pGSLM (Kharitonov et al., 2022b) injects prosodic tokens to enhance expressiveness. dGSLM (Nguyen et al., 2023) adopts a dual-tower model for two-channel spoken dialogue modeling.

Although SLMs can generate words and short-term phrases, the long-term semantics of the generated speech are often poor. TWIST (Hassid et al., 2024) proposes using a text-based LLM as initialization with large-scale training data to improve semantics. VoxtLM (Maiti et al., 2024) leverages both paired and unpaired speech in addition to text data for joint training of ASR, TTS, SLM, and Text LM. Another research direction is text token prediction as an intermediate step, as in SpeechGPT (Zhang et al., 2023) and SPECTRON (Nachmani et al.), which perform a chain of tasks (ASR → LM → TTS) in a single model. SUTLM (Chou et al., 2023) and SPIRIT-LM (Nguyen et al., 2024) utilize phrase-level interleaving of speech and text tokens. Concurrent works like LLaMA-Omni (Fang et al., 2024), Mini-Omni (Xie and Wu, 2024), and Moshi (Défossez et al., 2024) leverage simultaneous text token prediction to guide speech generation. SyllableLM (Baade et al., 2024) recently proposes using syllable-level coarse speech tokens to improve SLM semantics. Compared to prior works focusing on multi-tasking in a single model or using text tokens to guide speech generation, this is the first work to improve the long-term semantics of speech-only SLMs through preference optimization.

最近,在自监督表示学习和声学单元发现方面取得的进展,能够将连续的语音信号转换为离散的语音单元(Polyak等人,2021a)。语音语言模型(SLMs)是端到端的语言模型,使用离散语音单元,在给定语音提示的情况下能够生成语音延续。生成式语音语言模型(GSLM)(Lakhotia等人,2021)利用语音单元训练仅含解码器的语言模型,并通过基于单元的波形合成器合成语音波形。部分生成式语音语言模型(pGSLM)(Kharitonov等人,2022b)引入韵律单元以增强表达效果。对偶生成式语音语言模型(dGSLM)(Nguyen等人,2023)采用双塔模型用于双通道口语对话建模。

尽管SLMs能够生成单词和短语,但生成语音的长期语义往往较差。TWIST(Hassid等人,2024)提出使用基于文本的大语言模型(LLM)进行初始化,并利用大规模训练数据来提升语义性。VoxtLM(Maiti等人,2024)则利用配对和未配对的语音以及文本数据,联合训练自动语音识别(ASR)、语音合成(TTS)、SLM和文本语言模型(Text LM)。另一个研究方向是在中间步骤预测文本标记,例如SpeechGPT(Zhang等人,2023)和SPECTRON(Nachmani等人),它们在一个模型中执行一系列任务(ASR→LM→TTS)。SUTLM(Chou等人,2023)和SPIRIT LM(Nguyen等人,2024)在短语级别上交错使用语音和文本标记。同时进行的工作如LLaMA-Omni(Fang等人,2024)、Mini Omni(Xie和Wu,2024)以及Moshi(Défossez等人,2024)则利用同时的文本标记预测来引导语音生成。音节LM(Baade等人,2024)最近提出使用音节级别的粗粒度语音单元来提升SLMs的语义性。与以往在单个模型中进行多任务处理或使用文本标记引导语音生成的工作不同,这是首次通过偏好优化来提升仅语音SLMs的长期语义性。


2.2 Preference Optimization

Training the language model with next-token prediction is effective for learning human knowledge, but this objective might differ from human preference. RLHF (Christiano et al., 2017; Ouyang et al., 2022) leverages an external reward model, combined with proximal policy optimization (Schulman et al., 2017), to align LLMs based on human feedback. DPO (Rafailov et al., 2024) proposes that LLMs can learn implicit rewards, allowing them to perform preference optimization independently, without the need for an external model. RLAIF (Bai et al., 2022; Lee et al., 2023) demonstrates that using AI as an alternative to human feedback can reduce labor costs and is more scalable, offering performance comparable to RLHF.

These methods show effectiveness in aligning an LLM with human preference. However, there are limited studies leveraging preference optimization in the speech and audio processing field. Recently, Zhang et al. (2024); Chen et al. (2024) adopted preference optimization for Text-to-Speech (TTS) models to align the quality of speech synthesis with human preference, but not for enhancing SLMs’ semantics. Liao et al. (2024); Majumder et al. (2024) leverage preference optimization for text-to-audio generation with diffusion models, but the text-to-audio task is very different from SLM.

通过下一个标记预测来训练语言模型,对于学习人类知识是有效的,但这一目标可能与人类偏好有所不同。基于人类反馈的强化学习(RLHF)(Christiano等人,2017;Ouyang等人,2022)利用外部奖励模型,结合近端策略优化(Schulman等人,2017),根据人类反馈对大语言模型(LLMs)进行对齐。直接偏好优化(DPO)(Rafailov等人,2024)提出,LLMs可以学习隐式奖励,从而能够独立地进行偏好优化,无需外部模型。从人工智能反馈进行强化学习(RLAIF)(Bai等人,2022;Lee等人,2023)表明,使用人工智能替代人类反馈可以降低劳动成本,更具可扩展性,并且性能可与RLHF相媲美。

这些方法在使LLM与人类偏好对齐方面表现出有效性。然而,在语音和音频处理领域,利用偏好优化的研究有限。最近,Zhang等人(2024);Chen等人(2024)采用偏好优化,使文本到语音(TTS)模型的语音合成质量与人类偏好对齐,但并未用于提升SLMs的语义性。Liao等人(2024);Majumder等人(2024)利用偏好优化,结合扩散模型进行文本到音频生成,但文本到音频任务与SLM有很大不同。

3 Align-SLM Framework

The illustration of the proposed framework is shown in Figure 1.
所提框架的示意图如图1所示。
[图1:Align-SLM框架示意图]

这张图展示了Align-SLM框架的完整工作流程:

整体框架结构

图片展示了一个黄色背景的大框,代表整个Align-SLM框架,包含以下主要组件:

  1. 左侧输入部分:语音提示(speech prompt)
  2. 中间处理部分
    • 预训练SLM模型(带LoRA适配器)
    • 语音生成与评估流程
  3. 右侧输出部分:偏好数据对(preference data pair)

详细工作流程

1. 输入处理

  • 底部左侧显示波形图,标记为"speech prompt"(语音提示)
  • 这个语音提示通过"Speech tokenizer"(语音分词器)被转换为离散的语音标记

2. 多样本生成

  • 预训练的SLM模型(带有LoRA适配器)接收语音标记
  • 模型生成多个不同的语音续集(图中显示了三种不同颜色的标记序列):
    • 黄色标记序列
    • 粉色标记序列
    • 蓝色标记序列

3. 语音合成与评估

  • 生成的标记序列通过"Unit Vocoder"(单元声码器)转换回语音波形
  • 合成的语音波形通过ASR(自动语音识别)转换为文本
  • 文本被送入"LLM Evaluator"(大语言模型评估器)进行评分

4. 偏好数据选择

LLM评估器根据评分将样本分为三类:

  • 评分≤拒绝阈值:标记为"rejected"(拒绝),图中用红色X标记
  • 评分介于两个阈值之间:标记为"filtered out"(过滤掉)
  • 评分≥选择阈值:标记为"chosen"(选择),图中用绿色对勾标记

5. 偏好数据对构建

  • 右下角显示最终的偏好数据对:(prompt, chosen, rejected)
    • prompt: 原始语音提示
    • chosen: 被选择的高质量语音续集(蓝色)
    • rejected: 被拒绝的低质量语音续集(黄色)

6. 模型优化

  • 这些偏好数据对通过"Direct Preference Optimization (DPO)"(直接偏好优化)用于训练SLM模型的LoRA适配器
  • 左上角的"Curriculum Learning"(课程学习)循环箭头表示整个过程可以迭代进行,逐步提高选择标准

关键创新点

  1. 全自动评估流程:无需人工标注,通过ASR和LLM自动评估语音质量
  2. 纯语音端到端:虽然评估过程使用文本,但最终模型仍是纯语音到语音的
  3. 课程学习集成:通过逐步提高标准,迭代改进模型性能
  4. LoRA高效微调:只更新适配器参数,保持基础模型不变

这个框架的核心优势在于它能够在不依赖人工标注的情况下,自动构建高质量的偏好数据,并通过DPO有效地提升语音语言模型的语义理解能力。


3.1 Spoken Language Models

The pre-trained SLM used in this work is TWIST (Hassid et al., 2024), a decoder-only transformer model that is trained on the next speech token prediction task with text model initialization. We utilize the TWIST model in two sizes (1.3B and 7B parameters) from the official release. Specifically, the speech tokenizer consists of a self-supervised speech model (wen Yang et al., 2021; Lin et al., 2023) and K-means clustering. In this work, HuBERT (Hsu et al., 2021) is used and the cluster number K is set to 500. Notably, when continuous representations are clustered into discrete units, they primarily capture content information, which can be leveraged for modeling and understanding (Polyak et al., 2021b; Lin et al., 2022; Wu et al., 2023). This process first extracts 25Hz frame-level continuous representations from the 11th layer of the HuBERT model, assigns each frame to its closest cluster index, and then de-duplicates consecutive identical indices to shorten the sequence. The unit-based vocoder is a HifiGAN-based (Kong et al., 2020) model that can convert the discrete units back into a continuous waveform. We use the model checkpoint from the textlesslib (Kharitonov et al., 2022a) library.

本工作中使用的预训练SLM是TWIST (Hassid等,2024),这是一个仅解码器的Transformer模型,通过文本模型初始化训练,用于预测下一个语音标记任务。我们使用官方发布的两种规模的TWIST模型(1.3B和7B参数)。具体来说,语音分词器由自监督语音模型(wen Yang等,2021; Lin等,2023)和K-means聚类组成。在本工作中,使用了HuBERT (Hsu等,2021),聚类数量K设置为500。值得注意的是,当连续表示被聚类为离散单元时,它们主要捕获内容信息,这可用于建模和理解(Polyak等,2021b; Lin等,2022; Wu等,2023)。这个过程首先从HuBERT模型的第11层提取25Hz帧级连续表示,将每一帧分配给最接近的聚类索引,然后去除连续的相同索引以缩短序列。基于单元的声码器是一个基于HifiGAN (Kong等,2020)的模型,可以将离散单元转换回连续波形。我们使用来自textlesslib (Kharitonov等,2022a)库的模型检查点。

[原始语音波形]  
    │  
    ▼  
[HuBERT模型] → 提取连续声学特征 (25Hz帧级)  
    │  
    ▼  
[K-means聚类] → 离散语音单元 (500)  
    │  
    ▼  
[TWIST模型] → 基于上下文的下一单元预测  
    │  
    ▼  
[HifiGAN声码器] → 生成合成语音波形
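
下面用一小段示意代码把“帧级连续表示 → K-means量化为离散单元 → 去除相邻重复”的过程写出来。这里用随机数代替HuBERT第11层的真实输出,K-means中心也是随机示例,仅用于说明流程,并非官方实现。

```python
import numpy as np

def frames_to_units(frames, centroids):
    """把帧级连续表示量化为离散单元,并去除相邻重复的索引。
    frames:    (T, D) 帧级特征(示意中用随机数代替HuBERT第11层输出)
    centroids: (K, D) K-means聚类中心(论文中 K=500)"""
    # 每一帧到各聚类中心的欧氏距离,取最近中心的索引
    dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
    unit_ids = dists.argmin(axis=1)

    # 去重:连续相同的索引只保留一个,以缩短序列
    dedup = [int(unit_ids[0])]
    for u in unit_ids[1:]:
        if int(u) != dedup[-1]:
            dedup.append(int(u))
    return dedup

# 用法示意:2 秒语音按 25Hz 约对应 50 帧,特征维度此处假设为 768
frames = np.random.randn(50, 768)
centroids = np.random.randn(500, 768)
units = frames_to_units(frames, centroids)   # 形如 [17, 231, 4, ...](已去除相邻重复)
```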

3.2 Automatic Preference Data Selection

To prepare the preference data pair (prompt, chosen, rejected), given the speech prompt x, nucleus sampling (Holtzman et al., 2020) is used to generate N different continuations y1, y2, …, yN. Ideally, humans could listen to the samples and select the desirable and semantically correct one as the chosen continuation yc, and a semantically incorrect one as the rejected continuation yr. However, it is costly and time-consuming for human annotators to listen to the samples. Following the idea of RLAIF (Bai et al., 2022; Lee et al., 2023) to simulate human feedback, we propose an automatic preference data selection strategy to create preference data pairs. Since the focus is the semantics of SLMs, which is the content information in the speech, we first use Whisper-large-v2 (Radford et al., 2023) to transcribe the speech into text, then measure the semantics of the transcribed text. In this work, we explore two types of AI feedback from a text LLM. The text LLM is the open-sourced Mistral 7B (instruct-v02).

为了准备偏好数据对(提示、选择、拒绝),给定语音提示x,使用核采样(nucleus sampling)(Holtzman等,2020)生成N个不同的续集 $y_1, y_2, \dots, y_N$。理想情况下,人类可以听取样本,选择理想且语义正确的作为选择的续集 $y_c$,将语义不正确的作为拒绝的续集 $y_r$。然而,让人类标注者听取样本是昂贵且耗时的。遵循RLAIF(Bai等,2022;Lee等,2023)模拟人类反馈的思路,我们提出了一种自动偏好数据选择策略来创建偏好数据对。由于重点是SLMs的语义,即语音中的内容信息,我们首先使用Whisper-large-v2(Radford等,2023)将语音转录为文本,然后测量转录文本的语义。在本工作中,我们探索了来自文本LLM的两种AI反馈类型。文本LLM是开源的Mistral 7B(instruct-v02)。
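
下面给出一个按Hugging Face transformers风格接口做核采样、一次返回N个延续的示意片段。模型名“TWIST-1.3B”只是占位符,top_p、N、max_new_tokens等参数均为示意值,TWIST实际的加载与推理方式以官方发布为准。

```python
import torch
from transformers import AutoModelForCausalLM

# 假设语音提示已由语音分词器转成离散单元ID序列(示意数据)
prompt_units = torch.tensor([[17, 231, 4, 88, 402, 7]])

# "TWIST-1.3B" 仅为占位模型名,实际模型标识以官方发布为准
model = AutoModelForCausalLM.from_pretrained("TWIST-1.3B")

N = 6  # 为同一提示采样 N 个不同的延续
outputs = model.generate(
    prompt_units,
    do_sample=True,          # 随机采样而非贪心解码
    top_p=0.95,              # 核采样(nucleus sampling)阈值,示意值
    max_new_tokens=150,      # 延续长度上限,示意值
    num_return_sequences=N,  # 一次返回 N 个候选序列
)
```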

示例场景

假设我们有一段语音提示:“今天天气怎么样?”

步骤1:生成多个语音续集

系统使用核采样(nucleus sampling)方法从预训练的SLM模型生成5个不同的语音续集

  • 续集1:“今天北京晴朗,气温在25度左右,非常适合户外活动。”
  • 续集2:“今天下雨了,建议带伞出门,气温较低。”
  • 续集3:“今天天气不错,阳光明媚,微风轻拂。”
  • 续集4:“今天天气天气天气,很好很好很好,外面外面。”(重复、不连贯)
  • 续集5:“今天是星期三,我喜欢吃苹果。”(语义不相关)
步骤2:语音转文本

使用Whisper-large-v2模型将这些语音续集转录为文本

  • 文本1:“今天北京晴朗,气温在25度左右,非常适合户外活动。”
  • 文本2:“今天下雨了,建议带伞出门,气温较低。”
  • 文本3:“今天天气不错,阳光明媚,微风轻拂。”
  • 文本4:“今天天气天气天气,很好很好很好,外面外面。”
  • 文本5:“今天是星期三,我喜欢吃苹果。”

步骤3:LLM评估文本质量

使用Mistral 7B模型评估这些文本的语义质量:

  • 文本1:评分4.5/5(高质量,相关且信息丰富)
  • 文本2:评分4.2/5(高质量,相关且有用)
  • 文本3:评分4.0/5(高质量,相关)
  • 文本4:评分1.5/5(低质量,重复且不连贯)
  • 文本5:评分2.0/5(低质量,虽然流畅但与问题不相关)
步骤4:创建偏好数据对

根据评分结果,系统可以创建多个偏好数据对:

  • 偏好对1:(提示:“今天天气怎么样?”, 选择:续集1, 拒绝:续集4)
  • 偏好对2:(提示:“今天天气怎么样?”, 选择:续集2, 拒绝:续集5)
  • 偏好对3:(提示:“今天天气怎么样?”, 选择:续集3, 拒绝:续集4)

步骤5:使用偏好数据对训练模型

这些偏好数据对被用于直接偏好优化(DPO),训练SLM模型生成更符合语义要求的语音续集。


3.2.1 Continuation Likelihood: Perplexity

Perplexity (PPL) is a common metric to measure the likelihood of a sentence given a pre-trained language model. In this work, PPL is calculated on the generated transcribed text conditioned on the ground truth text prompt. PPL is used in previous SLM work to evaluate the generation (Lakhotia et al., 2021; Hassid et al., 2024). However, Lakhotia et al. (2021) found out that SLMs sometimes generate repeated phrases without clear meaning, and the PPL would be extremely low with naively repeated phrases. To measure this, the auto-BLEU score (a) calculates the n-gram counting within the sentence. Given text sentence t and the set of n-gram NG(t), auto-BLEU score of sentence t is:

$$a_t = \frac{\sum_{s \in NG(t)} \mathbb{1}\left[s \in NG(t) \setminus s\right]}{|NG(t)|}$$

2-gram is used for auto-BLEU calculation (Lakhotia et al., 2021).

For the PPL of the N continuations, $PPL_N$, we first filter out samples whose auto-BLEU $a_i$ is higher than $\delta$, then select the lowest-PPL sample as $y_c$. The threshold of the auto-BLEU score is selected based on the score distribution between the ground-truth continuation and the generated results (please see Appendix G for details). $y_r$ is the continuation with the highest PPL. The pair $(y_c, y_r)$ is created as below:
$$y_i = \begin{cases} y_c, & \text{if } PPL_i = \min(PPL_N) \land a_i \leq \delta \\ y_r, & \text{if } PPL_i = \max(PPL_N) \end{cases} \tag{1}$$

困惑度(PPL)是衡量给定预训练语言模型下句子可能性的常用指标。在本研究中,PPL是在生成的转录文本上计算的,该文本基于真实的文本提示。PPL在之前的SLM研究中被用于评估生成结果(Lakhotia等人,2021;Hassid等人,2024)。然而,Lakhotia等人(2021)发现,SLMs有时会生成没有明确意义的重复短语,而如果短语被简单地重复,PPL将会非常低。为了衡量这一点,自动BLEU分数(a)计算句子内的n-gram计数。给定文本句子t和n-gram集合NG(t),句子t的自动BLEU分数为:

$$a_t = \frac{\sum_{s \in NG(t)} \mathbb{1}\left[s \in NG(t) \setminus s\right]}{|NG(t)|}$$

使用2-gram进行自动BLEU计算(Lakhotia等人,2021)。

对于N个延续的困惑度 $PPL_N$,我们首先过滤掉自动BLEU($a_i$)高于 $\delta$ 的样本,然后选择PPL最低的样本作为 $y_c$。自动BLEU分数的阈值是根据真实延续和生成结果之间的分数分布来选择的(详细信息请参阅附录G)。$y_r$ 是PPL最高的延续。$(y_c, y_r)$ 的创建如下:

$$y_i = \begin{cases} y_c, & \text{if } PPL_i = \min(PPL_N) \land a_i \leq \delta \\ y_r, & \text{if } PPL_i = \max(PPL_N) \end{cases} \tag{1}$$
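
按式(1)的逻辑,可以先计算每个候选转写的句内2-gram重复率(auto-BLEU),过滤掉重复严重的样本,再按PPL选出 $y_c$ 与 $y_r$。下面是一个示意实现:PPL假定已由外部语言模型算好后传入,阈值 $\delta$ 为示意值,并非论文官方代码。

```python
def auto_bleu_2gram(text):
    """句内2-gram重复率:某个2-gram若在句子其余位置再次出现则计为重复。"""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    if not ngrams:
        return 0.0
    repeated = sum(1 for i, g in enumerate(ngrams)
                   if g in ngrams[:i] + ngrams[i + 1:])
    return repeated / len(ngrams)

def select_by_ppl(candidates, ppls, delta=0.3):
    """candidates:候选延续的转写文本;ppls:对应的困惑度(由外部LM计算)。
    按式(1)返回 (y_c, y_r);若过滤后没有合格的 y_c,返回 None。"""
    bleus = [auto_bleu_2gram(t) for t in candidates]

    # y_r:PPL 最高的延续(式(1)第二行)
    y_r = candidates[max(range(len(ppls)), key=lambda i: ppls[i])]

    # y_c:auto-BLEU 不超过 delta 的样本中 PPL 最低者(式(1)第一行)
    valid = [i for i in range(len(candidates)) if bleus[i] <= delta]
    if not valid:
        return None
    y_c = candidates[min(valid, key=lambda i: ppls[i])]
    return y_c, y_r
```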


3.2.2 LLM Evaluation: Mistral Score

Instruction-tuned LLMs can follow instructions and understand semantics well (Chung et al., 2024). We propose using an LLM to judge the quality of the speech continuation, which evaluates the entire input and predicts a score. A prompt (see the Appendix for more details) is utilized to instruct the model to provide a score between 1 and 5 (1 denoting bad and 5 denoting good) based on the likelihood and meaningfulness of the continuation given the text prompt. Since we use the Mistral model for LLM evaluation, we call this the “Mistral score”, denoted as s. To let the model learn to distinguish preferred and unpreferred continuations, certain thresholds are set for the Mistral score: $s_c$ is the threshold for the chosen sample, and $s_r$ is the threshold for the rejected sample, where $s_c$ should be larger than $s_r$. The auto-BLEU threshold is also used to mark naively repeated samples as rejected. We select $(y_c, y_r)$ as below:
$$y_i = \begin{cases} y_c, & \text{if } s_i \geq s_c \land a_i \leq \delta \\ y_r, & \text{if } s_i \leq s_r \lor a_i > \delta \end{cases} \tag{2}$$

The $s_c$ and $s_r$ values are selected based on a preliminary analysis of the SLM's score distribution. For more details on the distribution of the SLM's score, please see Appendix H.

经过指令微调的大语言模型能够很好地遵循指令并理解语义(Chung等人,2024)。我们建议使用大语言模型来判断语音延续的质量,该模型评估整个输入并预测得分。提示(请参阅附录了解更多详细信息)用于指示模型根据给定的文本提示,基于延续的可能性和有意义性,提供1到5之间的得分(1表示差,5表示好)。由于我们使用Mistral模型进行大语言模型评估,我们称之为“Mistral评分”,记作s。为了让模型学会区分优选和非优选的延续,需要为Mistral评分设定阈值:$s_c$ 是所选样本的阈值,$s_r$ 是被拒绝样本的阈值,且 $s_c$ 应该大于 $s_r$。自动BLEU阈值也用于把简单重复的样本识别为拒绝样本。我们按如下方式选择 $(y_c, y_r)$:

$$y_i = \begin{cases} y_c, & \text{若 } s_i \geq s_c \text{ 且 } a_i \leq \delta \\ y_r, & \text{若 } s_i \leq s_r \text{ 或 } a_i > \delta \end{cases} \tag{2}$$

$s_c$ 和 $s_r$ 的值是根据对SLM评分分布的初步分析来选择的。有关SLM评分分布的更多详细信息,请参阅附录H。
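
式(2)本身只是对每个候选做一次阈值判断,下面把它直接翻译成一个小函数,阈值取正文给出的首轮设置($s_c$=3、$s_r$=1),auto-BLEU的计算可沿用3.2.1小节的示意实现,$\delta$ 仍为示意值。

```python
def label_candidate(mistral_score, auto_bleu, s_c=3.0, s_r=1.0, delta=0.3):
    """按式(2)给单个候选打标签:'chosen' / 'rejected' / 'filtered'(分数居中,被过滤)。"""
    if mistral_score >= s_c and auto_bleu <= delta:
        return "chosen"                 # 高分且无明显重复
    if mistral_score <= s_r or auto_bleu > delta:
        return "rejected"               # 低分,或重复严重
    return "filtered"                   # 介于两个阈值之间,不参与偏好对
```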

困惑度和自动评估方法详解

上面主要讲解了两种评估语音语言模型(SLM)生成质量的方法:基于困惑度(Perplexity)的评估和基于大语言模型(LLM)的评估。

1. 困惑度(Perplexity)评估

困惑度的基本概念

困惑度(PPL)是衡量语言模型对句子预测能力的指标。PPL值越低,表示模型对该句子的预测越准确,生成质量越高。

问题与解决方案

研究发现SLM有时会生成无意义的重复短语,而这些重复短语往往会得到非常低的PPL值(看起来很好),但实际质量很差。例如:

例子

  • 高质量句子:“今天天气很好,阳光明媚。”
  • 低质量重复句子:“今天今天今天天气天气天气很好很好很好。”

虽然第二个句子质量明显较差,但由于重复模式使模型很容易预测下一个词,可能会得到更低的PPL值。

auto-BLEU评分

为了解决这个问题,研究者引入了auto-BLEU评分来检测句子内的重复情况。它计算句子内n-gram的重复程度。

计算方法

  1. 提取句子中所有的n-gram (这里使用2-gram)
  2. 计算每个n-gram在句子其他部分出现的比例
  3. 得分越高,表示重复程度越高

具体例子

  • 句子:“今天天气很好”

  • 2-gram:[“今天天气”, “天气很”, “很好”]

  • 没有重复,auto-BLEU = 0

  • 句子:“今天今天天气天气”

  • 2-gram:[“今天今天”, “今天天气”, “天气天气”]

  • "今天"和"天气"都有重复,auto-BLEU较高

选择最佳续集

生成N个续集后,系统会:

  1. 过滤掉auto-BLEU高于阈值δ的样本(排除重复严重的)
  2. 从剩余样本中选择PPL最低的作为优选样本 $y_c$
  3. 选择PPL最高的作为拒绝样本 $y_r$

例子
假设生成了5个续集:

  1. “天气晴朗,适合出行” (PPL=2.1, auto-BLEU=0.1)
  2. “下雨了,记得带伞” (PPL=2.5, auto-BLEU=0.2)
  3. “天气天气天气好好好” (PPL=1.8, auto-BLEU=0.8)
  4. “有点阴,可能会下雨” (PPL=3.2, auto-BLEU=0.1)
  5. “风很大,注意保暖” (PPL=4.5, auto-BLEU=0.2)

如果设定auto-BLEU阈值δ=0.5:

  • 续集3因auto-BLEU=0.8>0.5被过滤掉
  • 在剩余样本中,续集1的PPL最低,被选为 $y_c$
  • 续集5的PPL最高,被选为 $y_r$

2. Mistral评分(LLM评估)

基本思路

使用指令调优的大语言模型(这里是Mistral)来评估生成文本的质量,给出1-5分的评分。

评分标准

  • 1分:质量很差
  • 5分:质量很好
  • 评分基于续集的合理性和有意义程度

选择最佳续集

系统设定两个阈值:

  • $s_c$:选择样本的阈值(高分)
  • $s_r$:拒绝样本的阈值(低分)

选择规则:

  1. 如果评分 ≥ $s_c$ 且 auto-BLEU ≤ δ,选为 $y_c$
  2. 如果评分 ≤ $s_r$ 或 auto-BLEU > δ,选为 $y_r$

例子
假设同样的5个续集,Mistral给出的评分为:

  1. “天气晴朗,适合出行” (评分=4.5, auto-BLEU=0.1)
  2. “下雨了,记得带伞” (评分=4.2, auto-BLEU=0.2)
  3. “天气天气天气好好好” (评分=1.5, auto-BLEU=0.8)
  4. “有点阴,可能会下雨” (评分=3.8, auto-BLEU=0.1)
  5. “风很大,注意保暖” (评分=2.0, auto-BLEU=0.2)

如果设定 $s_c$=4.0,$s_r$=2.5,$\delta$=0.5:

  • 续集1和2的评分≥4.0且auto-BLEU≤0.5,可以选为 $y_c$
  • 续集3的auto-BLEU=0.8>0.5,被选为 $y_r$
  • 续集5的评分=2.0≤2.5,也可以选为 $y_r$

最终可能选择续集1作为 $y_c$,续集3作为 $y_r$,形成偏好数据对。

这两种方法都是为了自动构建高质量的偏好数据对:

  1. 困惑度方法:基于语言模型的预测概率,但需要auto-BLEU来过滤重复
  2. Mistral评分:利用大语言模型的理解能力直接评估质量

这些偏好数据对将用于训练SLM模型,使其生成更符合人类偏好的语音内容。


3.3 Direct Preference Optimization for SLMs

In our framework, training with online metrics calculation is computationally infeasible due to the chain of models involved (vocoder, ASR, and LLM evaluator) and the computational complexity associated with sampling the SLM multiple times. Instead of calculating the reward online like RLHF, we adopt DPO, a simplified version of RLHF with implicit reward modeling, for preference optimization. The preference data pairs can be prepared offline, making the training more efficient. Additionally, DPO training is stable, simple, and does not require training an external reward model.
The DPO training objective is

$$\mathcal{L}_{DPO} = - \mathbb{E}_{(x, y_c, y_r) \sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_c | x)}{\pi_{ref}(y_c | x)} - \beta \log \frac{\pi_\theta(y_r | x)}{\pi_{ref}(y_r | x)} \right) \right]$$

where $\pi_{ref}$ is the reference model with the pre-trained model's parameters, which is kept frozen. $\pi_\theta$ is the policy model trained with a LoRA (Hu et al., 2022) adapter, while the parameters of the backbone pre-trained model are fixed. $\beta$ controls the deviation from the reference model $\pi_{ref}$.

在我们的框架中,由于涉及多个模型(波形合成器、自动语音识别系统和大语言模型评估器)以及多次采样语音语言模型所带来的计算复杂度,在线计算评估指标在计算上是不可行的。因此,我们没有采用像RLHF(基于人类反馈的强化学习)那样在线计算奖励的方法,而是选择了直接偏好优化(DPO),这是一种具有隐式奖励建模的简化版RLHF方法,用于偏好优化。偏好数据对可以离线准备,这使得训练更加高效。此外,DPO训练过程稳定、简单,不需要训练外部奖励模型。

DPO训练目标为

$$\mathcal{L}_{DPO} = - \mathbb{E}_{(x, y_c, y_r) \sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_c | x)}{\pi_{ref}(y_c | x)} - \beta \log \frac{\pi_\theta(y_r | x)}{\pi_{ref}(y_r | x)} \right) \right] \tag{3}$$

其中,$\pi_{ref}$ 是带有预训练模型参数的参考模型,该模型保持冻结。$\pi_\theta$ 是通过LoRA(Hu等人,2022)适配器训练的策略模型,而骨干预训练模型的参数固定。$\beta$ 控制相对于参考模型 $\pi_{ref}$ 的偏差。
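
下面用PyTorch按式(3)写出DPO损失的一个最小草图。输入是四个“序列级对数概率”(即逐token对数概率之和),策略模型与参考模型的前向计算在此省略,$\beta$ 与示例数值均为随意取的示意值,并非论文官方实现。

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_c, policy_logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    """式(3)的DPO损失;四个输入均为形状 (batch,) 的序列对数概率(逐token求和后)。"""
    chosen_ratio = policy_logp_c - ref_logp_c        # log[pi_theta(y_c|x) / pi_ref(y_c|x)]
    rejected_ratio = policy_logp_r - ref_logp_r      # log[pi_theta(y_r|x) / pi_ref(y_r|x)]
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# 用法示意:只有策略模型的对数概率需要梯度;参考模型保持冻结
loss = dpo_loss(torch.tensor([-42.0], requires_grad=True),
                torch.tensor([-55.0], requires_grad=True),
                torch.tensor([-44.0]), torch.tensor([-50.0]))
loss.backward()   # 实际训练中仅更新 LoRA 适配器参数
```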


3.4 Coupling with Curriculum Learning

Curriculum Learning (CL) is a machine learning approach where models are trained by gradually increasing the complexity of tasks, allowing them to learn simpler concepts first before tackling more difficult ones (Bengio et al., 2009). In this work, we propose to couple DPO with curriculum learning to iteratively improve automated preference data selection. We iteratively raise the difficulty in discerning the preference data by tuning the thresholds $s_c$ and $s_r$ in Equation 2. Specifically for the Mistral score, we raise $s_c$ from 3 to 4 for chosen samples and $s_r$ from 1 to 2 for rejected samples. With curriculum learning, we expect the model to iteratively improve, given better feedback data.

课程学习(CL)是一种机器学习方法,通过逐渐增加任务的复杂度来训练模型,使其能够先学习更简单的概念,再应对更困难的任务(Bengio等人,2009)。在本研究中,我们提出将直接偏好优化(DPO)与课程学习相结合,以迭代改进自动偏好数据选择。我们通过调整方程2中的阈值 $s_c$ 和 $s_r$,逐步提高辨别偏好数据的难度。具体到Mistral评分,我们将所选样本的 $s_c$ 从3提高到4,将被拒绝样本的 $s_r$ 从1提高到2。借助课程学习,我们期望模型在获得更好的反馈数据的情况下能够迭代改进。

基本原理

  1. 课程学习的核心思想:就像人类学习一样,先学简单的,再学难的。比如学习数学时,先学加减法,再学乘除法,最后才学微积分。

  2. 在这项研究中的应用:研究者通过调整评判标准的严格程度,逐步提高训练难度。

具体实施方法

研究者通过调整两个关键阈值来实现课程学习:

  1. $s_c$ - 选择样本的阈值:

    • 初始设置为3分(满分5分)
    • 随着训练进行,提高到4分
    • 这意味着随着训练的深入,只有质量更高的样本才会被选为"好样本"
  2. $s_r$ - 拒绝样本的阈值:

    • 初始设置为1分
    • 随着训练进行,提高到2分
    • 这意味着随着训练的深入,更多中等质量的样本也会被归类为"不好的样本"
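
把“提高阈值 → 重新构建偏好数据 → 再做一轮DPO”的循环写出来,大致如下。其中 build_preference_data 与 dpo_train 是假设的封装函数(分别对应3.2节的数据构建和3.3节的DPO训练),阈值取正文给出的两轮设置,仅为示意。

```python
# 示意:DPO 与课程学习的迭代耦合(build_preference_data / dpo_train 为假设的封装)
CURRICULUM = [
    {"s_c": 3.0, "s_r": 1.0},   # 第1轮:阈值较宽松,保证能构成足够多的偏好对
    {"s_c": 4.0, "s_r": 2.0},   # 第2轮:提高标准,只保留区分度更高的偏好对
]

def align_slm_with_curriculum(slm, prompts, build_preference_data, dpo_train,
                              curriculum=CURRICULUM):
    for stage in curriculum:
        # 用当前(更强的)模型重新采样延续,并按本轮阈值筛选偏好对
        pref_data = build_preference_data(slm, prompts,
                                          s_c=stage["s_c"], s_r=stage["s_r"])
        # 在偏好数据上做一轮 DPO(仅更新 LoRA 适配器参数)
        slm = dpo_train(slm, pref_data)
    return slm
```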

4 Experiments

4.1 Dataset

We use LibriSpeech (Panayotov et al., 2015) as our primary dataset. We use the official training, dev-clean, and test-clean set. To further expand our dataset size, we leverage the English subset of the Multilingual Librispeech (MLS) (Pratap et al., 2020) as an additional training set, which is around 3 times larger than the Librispeech training set. We use a subset of the MLS data comprising 673K utterances in this work for data scaling, denoted as mls. We apply the following data pre-processing steps to create the final training data:

  1. Speech prompt segment selection using word alignment: Since our task involves speech continuation, the speech prompt should contain a sufficient amount of contextual information. We filter out samples shorter than 6 seconds. Unlike previous works that directly split the first 3 seconds as the prompt (Lakhotia et al., 2021; Hassid et al., 2024), we use forced alignment to select the closest word boundary around 3 seconds. This avoids cutting off spoken words in the middle, which could cause ASR errors in the speech continuations generated by the model. This potentially leads to poor perplexity or LLM evaluation scores.

  2. Filtering out unsuitable chosen/rejected pairs: We apply a second layer of filtering over Mistral score annotations of ASR transcripts by thresholding chosen and rejection scores to ensure separability. Some samples fail to create preference data pairs due to these thresholds. When multiple continuations have the same lowest or highest score, we on-the-fly randomly choose among them. The number of preference data samples for different setups is listed in Table 5 in the Appendix.

我们主要使用LibriSpeech (Panayotov等人,2015)数据集,包括官方的训练集、dev-clean集和test-clean集。为扩大数据集规模,我们还使用了多语言LibriSpeech (MLS) (Pratap等人,2020)的英文子集作为额外训练集,其规模约是LibriSpeech训练集的3倍。在本研究中,我们使用了MLS数据的一个子集,包含673k个语音片段,用于数据扩展,记作mls。我们进行了以下数据预处理步骤来创建最终的训练数据:

  1. 基于词对齐的语音提示片段选择:由于我们的任务涉及语音延续,语音提示应包含足够的上下文信息。我们过滤掉了时长少于6秒的样本。与之前直接将前3秒作为提示的工作不同,我们使用强制对齐方法,在3秒附近选择最近的词边界。这可以避免在中间切断正在说的词,这可能会导致模型生成的语音延续出现自动语音识别(ASR)错误,进而可能导致较差的困惑度或大语言模型评估得分。

  2. 过滤掉不合适的选中/拒绝对:我们对ASR转录本的Mistral评分注释进行第二层过滤,通过设定选中和拒绝分数的阈值来确保区分度。一些样本因这些阈值而无法创建偏好数据对。当多个延续有相同的最低或最高分数时,我们随机选择其中一个。附录中的表5列出了不同设置下的偏好数据样本数量。

研究者使用了两个主要数据集:

  1. LibriSpeech:一个广泛使用的英语语音数据集

    • 使用了官方的训练集、dev-clean集和test-clean集
    • 这是基础数据集
  2. 多语言LibriSpeech (MLS)

    • 仅使用其中的英语子集
    • 规模约为LibriSpeech训练集的3倍
    • 实际使用了673K条语音片段
    • 在文中标记为"mls"

数据预处理流程

1. 语音提示片段选择(使用词对齐)

目标:创建合适的语音提示,为模型提供足够的上下文信息

具体步骤

  • 过滤掉短于6秒的样本(太短的样本上下文信息不足)
  • 不同于以往研究直接截取前3秒作为提示,本研究使用了强制对齐技术
  • 在3秒左右找到最近的词边界进行切分
  • 这样避免了在单词中间切断语音,这种切断可能导致:
    • ASR(自动语音识别)错误
    • 模型生成的语音续集质量下降
    • 困惑度或LLM评分不佳

举例说明

  • 传统方法:直接在3秒处切分 → “Today I went to the scho-”(单词"school"被切断)
  • 本研究方法:找到3秒附近的词边界 → “Today I went to the”(在完整单词后切分)

2. 过滤不合适的选择/拒绝对

目标:确保偏好数据对的质量和可区分性

具体步骤

  • 对ASR转录文本应用Mistral评分
  • 设置阈值筛选"选择"和"拒绝"样本
    • 确保"好"样本和"差"样本之间有明显差距
  • 当多个续集具有相同的最低或最高分数时:
    • 随机选择其中一个作为代表
  • 由于这些阈值限制,一些样本无法创建有效的偏好数据对(被丢弃)

数据规模:不同设置下的偏好数据样本数量在附录表5中列出

整体流程图解

[原始语音数据集]
    │
    ▼
[长度过滤] → 移除<6秒样本
    │
    ▼
[词对齐切分] → 在~3秒处找词边界
    │
    ▼
[生成多个续集] → 使用预训练SLM
    │
    ▼
[ASR转录] → 将语音转为文本
    │
    ▼
[Mistral评分] → 对文本质量打分
    │
    ▼
[阈值筛选] → 确定选择/拒绝样本
    │
    ▼
[创建偏好数据对] → (提示, 选择, 拒绝)
    │
    ▼
[DPO训练] → 优化SLM模型

这种精心设计的数据处理流程确保了训练数据的质量,为后续的偏好优化提供了坚实基础,从而提高了最终模型的语义理解和生成能力。
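
针对第1步“在约3秒附近选最近的词边界”再给一个很小的示意函数:输入是强制对齐得到的每个词的结束时间(秒),输出截断时间点;对齐工具与数据格式均为假设,仅说明思路。

```python
def pick_prompt_cut_time(word_end_times, target_sec=3.0):
    """word_end_times:强制对齐得到的各词结束时间(秒,升序)。
    返回离 target_sec 最近的词边界,避免把单词从中间切断。"""
    return min(word_end_times, key=lambda t: abs(t - target_sec))

# 用法示意:词边界分别在 0.8s / 1.7s / 2.7s / 3.5s / 4.1s
cut = pick_prompt_cut_time([0.8, 1.7, 2.7, 3.5, 4.1])   # -> 2.7(最接近3秒的词边界)
```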


4.2 Objective Evaluation

4.2.1 Zerospeech 2021 Benchmark

sWUGGY and sBLIMP metrics evaluate SLMs’ lexical and syntactic modeling on pure speech input (Nguyen et al., 2020). sWUGGY tests if models prefer real words over phonetically similar non-words. We typically use the “in-vocab” split for reporting results, following the standard practice established by Lakhotia et al. (2021). sBLIMP assesses grammaticality judgments between correct and incorrect sentences. Both metrics compare geometric means of sequence probabilities assigned to paired utterances.
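
sWUGGY / sBLIMP 的判分方式可以概括为:分别计算成对语句在SLM下的(长度归一化的)序列概率,模型给真词或合法句更高概率即判对。下面是一个按“几何平均概率 = 平均token对数概率”比较一对序列的示意函数,其中 logprob_fn 是假设的打分接口。

```python
def prefers_first(logprob_fn, units_a, units_b):
    """比较一对语句(离散单元序列)的几何平均概率,返回模型是否偏好第一个。
    logprob_fn(units) 需返回该序列的总对数概率(假设的打分接口)。"""
    geo_a = logprob_fn(units_a) / len(units_a)   # 对数域中的几何平均
    geo_b = logprob_fn(units_b) / len(units_b)
    return geo_a > geo_b

# 用法示意:sWUGGY 中 units_a 为真词、units_b 为发音相近的伪词;
# 基准准确率 = 模型偏好真词(或合法句)的样本比例
```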

4.2.2 Spoken StoryCloze Benchmark

StoryCloze benchmarks (Mostafazadeh et al., 2017) evaluate the model’s ability to identify the more plausible ending among two scenarios given a short story as a prompt. This requires a degree of high-level semantic understanding and common sense. We utilize the spoken version of the original StoryCloze (S-StoryCloze) as well as the topic-based StoryCloze (T-StoryCloze) created by Hassid et al. (2024). T-StoryCloze uses simpler negative samples that are randomly drawn while S-StoryCloze uses adversarially curated negative samples. The random baseline performance for the above tasks is 50%. We name the spoken version of StoryCloze as “StoryCloze” for simplicity, but note that this is different from the text StoryCloze.

4.2.3 Generative Speech Continuation

GPT4-o score: GPT4-o (OpenAI, 2023) has shown remarkable text understanding performance and can serve as an alternative to human evaluators (Chiang and Lee, 2023a; Liu et al., 2023; Chiang and Lee, 2023b), showing high correlation with human judgments. We leverage GPT4-o as a proxy for human evaluations. Following LLM evaluation (Chiang and Lee, 2023a), the instruction first analyzes the sentence and then provides a score from 1 to 5 to judge the semantic coherence, meaningfulness, and grammatical correctness of the ASR-transcribed continuation given a prompt (1 denoting bad and 5 denoting good). The instruction prompt is shown in Appendix F.

MOSNet score: To measure the audio quality, we utilize the MOSNet (Cooper et al., 2022) to predict the Mean Opinion Score (MOS) of audio quality. Specifically, MOSNet is based on self-supervised wav2vec 2.0 (Baevski et al., 2020), fine-tuned on MOS prediction task. The model has shown a high correlation with human MOS scores and good generalization ability for unseen data.

4.2.1 ZeroSpeech 2021 基准

sWUGGY 和 sBLIMP 指标用于评估语音语言模型(SLMs)在纯语音输入下的词汇和句法建模能力(Nguyen 等人,2020)。sWUGGY 测试模型是否更倾向于选择真实单词而非发音相似的非单词。我们通常使用“in-vocab”划分来报告结果,遵循 Lakhotia 等人(2021)确立的标准做法。sBLIMP 评估模型对正确和错误句子的语法判断。这两个指标都比较分配给成对语句的序列概率的几何平均值。

4.2.2 语音 StoryCloze 基准

StoryCloze 基准(Mostafazadeh 等人,2017)评估模型在给定一个简短故事作为提示的情况下,识别两个场景中更合理结尾的能力。这需要一定程度的高层语义理解和常识。我们使用原始 StoryCloze 的语音版本(S-StoryCloze)以及 Hassid 等人(2024)创建的主题 StoryCloze(T-StoryCloze)。T-StoryCloze 使用随机抽取的简单负样本,而 S-StoryCloze 使用对抗性选择的负样本。上述任务的随机基线性能为 50%。为简化起见,我们将 StoryCloze 的语音版本称为“StoryCloze”,但需注意这与文本版 StoryCloze 不同。

4.2.3 生成式语音延续

GPT4-o 评分:GPT4-o(OpenAI,2023)在文本理解方面表现出色,可作为人类评估者的替代方案(Chiang 和 Lee,2023a;Liu 等人,2023;Chiang 和 Lee,2023b),其评分与人类判断高度相关。我们利用 GPT4-o 作为人类评估的代理。按照大语言模型评估(Chiang 和 Lee,2023a)的方法,指令先分析句子,再针对给定提示下 ASR 转录的延续,从语义连贯性、有意义性和语法正确性三个方面给出 1 到 5 的评分(1 表示差,5 表示好)。指令提示的具体内容在附录 F 中给出。

MOSNet 评分:为衡量音频质量,我们使用 MOSNet(Cooper 等人,2022)来预测音频质量的平均意见得分(MOS)。具体而言,MOSNet 基于自监督的 wav2vec 2.0(Baevski 等人,2020),并在 MOS 预测任务上进行了微调。该模型与人类 MOS 评分高度相关,对未见数据具有良好的泛化能力。


4.3 Subjective Evaluation

We conducted human listening evaluations to assess the meaningfulness of the generated speech. We follow Lakhotia et al. (2021); Kharitonov et al. (2022b) to use the Meaningfulness Mean Opinion Score (MMOS). Specifically, given a speech prompt and generated speech continuations, evaluators listen to audio samples and rate the meaningfulness in terms of relevance, coherence, and grammatical correctness. We randomly sample 100 speech prompts from the Librispeech test-clean set for evaluation. Each sample has 10 evaluators to provide the rating on a scale from 1 to 5 with an increment of 1. The human evaluation template and instruction are shown in Appendix B. CrowdMOS (Ribeiro et al., 2011) package is used for outlier removal (Lakhotia et al., 2021).

我们进行了人类听觉评估,以评估生成语音的意义性。我们遵循Lakhotia等人(2021)和Kharitonov等人(2022b)的方法,使用意义性平均意见得分(MMOS)。具体而言,给定一个语音提示和生成的语音延续,评估者会听取音频样本,并根据相关性、连贯性和语法正确性对意义性进行评分。我们从Librispeech的test-clean集中随机抽取了100个语音提示用于评估。每个样本有10名评估者,他们根据1到5的评分标准(每次递增1分)进行评分。人类评估的模板和说明在附录B中给出。我们使用CrowdMOS(Ribeiro等人,2011)包来去除异常值(Lakhotia等人,2021)。


4.4 Baselines

We compare Align-SLM with other SLMs on the ZeroSpeech and StoryCloze benchmarks. The most comparable baselines are speech-only SLMs, including GSLM (Lakhotia et al., 2021), AudioLM (Borsos et al., 2023), TWIST (Hassid et al., 2024), and the model from Cuervo and Marxer (2024). Among these, the TWIST model is initialized from a text-based LLM. Additionally, we compare the performance against SLMs that leverage the text modality (underlined in Table 2), specifically VoxtLM (Maiti et al., 2024), SPIRIT-LM (Nguyen et al., 2024) with speech-text interleaving, and Moshi (Défossez et al., 2024), which leverages text-guided speech generation.

我们在ZeroSpeech和StoryCloze基准上将Align-SLM与其他语音语言模型(SLMs)进行比较。最相似的基准是纯语音SLMs,包括GSLM (Lakhotia等人,2021)、AudioLM (Borsos等人,2023)、TWIST (Hassid等人,2024)以及Cuervo和Marxer (2024)的模型。其中,TWIST模型是从基于文本的大语言模型(LLM)初始化的。此外,我们还与利用文本模态的SLMs进行性能比较(表2中带下划线部分),特别是VoxtLM (Maiti等人,2024)、具有语音-文本交错的SPIRITLM (Nguyen等人,2024)以及利用文本引导语音生成的Moshi (Défossez等人,2024)。


5 Results

5.1 Mistral Score Provides Better Semantic Feedback Than Perplexity

To determine which preference data selection strategy is beneficial for the SLM's semantics, we first conduct preliminary experiments on Align-SLM with the PPL and Mistral score in Table 1, using the TWIST 1.3B model. Additionally, we continually fine-tune the pre-trained model using the same speech data, which serves as the baseline for next speech token prediction. “Proxy Metric” refers to the metrics used for preference data selection, while “Zerospeech”, “StoryCloze”, and “Speech Continuation” are zero-shot speech evaluation metrics.

Align-SLM w/PPL successfully improves the proxy metrics for auto-BLEU and PPL, but the performance on speech continuation is slightly worse than the pre-trained model. As for the Zerospeech and StoryCloze benchmark, Align-SLM w/PPL has marginal improvement on most metrics and degrades on T-StoryCloze. Particularly, perplexity feedback shows a much greater improvement on sBLIMP, which measures grammatical correctness. This finding suggests that optimizing toward perplexity might overly focus on grammar rather than semantics and relevance.

On the other hand, Align-SLM w/Mistral score significantly outperforms the pre-trained model across metrics. Specifically, the performance on the Zerospeech and StoryCloze benchmarks improves significantly (+1.6 on S-StoryCloze and +4.5 on T-StoryCloze). The generated continuations also yield a better GPT4-o score, indicating the generated content is more relevant and coherent to the speech prompt. Regarding proxy metrics, the auto-BLEU and Mistral scores are improved, whereas the PPL is similar to the pre-trained model.

This finding suggests that LLM evaluations, such as the Mistral score, provide general semantic feedback and achieve superior performance across benchmarks. Furthermore, Align-SLM w/Mistral score significantly outperforms the fine-tuned baseline, which only marginally improves performance. Therefore, in the following experiments for Align-SLM, we use the Mistral score as the AI feedback.

5.2 Consistently Improves Pre-trained SLMs

Given our findings from Section 5.1, we use the Mistral score as a “proxy metric” for the rest of our experiments. In Table 2, we observe that Align-SLM consistently outperforms the pre-trained model on the Zerospeech, StoryCloze, and speech continuation task for both the 1.3B and 7B models (rows 7 to 8 for 1.3B, row 15 to 16 for 7B). For instance, on T-StoryCloze, training Align-SLM with Librispeech data yields relative improvements of 6.5% and 11.1% for the 1.3B and 7B models, respectively. Additionally, Table 2 demonstrates that Align-SLM 7B performs better in speech continuation, as reflected by the GPT-4 score improving from 2.70 to 3.50. The performance on T-StoryCloze is close to human-level accuracy (90.2). However, for sBLIMP (grammatical correctness), AudioLM and SyllableLM achieve the best performance, possibly due to their superior design of speech tokens. Compared to the cascaded top line (ASR+LLM), end-to-end SLMs still have room for improvement.

5.3 Improvement of Curriculum Learning

After training the pre-trained model with the first iteration of Align-SLM, we consider the resulting model a stronger starting point and generate new preference data with more stringent preference data selection criteria. Results in Table 2 indicate that curriculum learning improves more metrics on the Zerospeech and StoryCloze benchmarks (rows 8 to 9 for 1.3B, rows 16 to 17 for 7B), particularly for T-StoryCloze, which requires fine-grained relevance between the speech prompt and its continuation. Additionally, we observe improvements in speech continuation; for example, the GPT4-o score increases from 2.06 to 2.29 for the 1.3B model. We also experimented with further increasing the number of curriculum learning iterations, which continued to enhance performance (see the discussion in Appendix D).

5.4 Amount of Preference Data

In the previous experiments, we only use the LibriSpeech training data for DPO training. After applying filtering as described in Section 4.1, the number of samples is around 39K and 63K for the 1.3B and 7B models, respectively. To investigate whether additional data helps Align-SLM learn semantics, we scale up the data around three times by including a subset of MLS (mls). It is worth noting that the scale of the MLS subset is still much smaller than the SLM's pre-training data. Table 2 shows that with more data, the model learns significantly better semantics across model sizes and benchmarks (rows 10 to 11 for 1.3B, rows 18 to 19 for 7B). Nevertheless, we observe that adding mls data for the 7B SLM yields little improvement in GPT4-o score compared to the 1.3B model. This can be attributed to the amount of preference data for the 7B model from LibriSpeech already being sufficient (63K for the first iteration and 71K for the second iteration). In contrast, the 1.3B model only has 39K and 20K preference data pairs from LibriSpeech.

5.5 Comparison with Baselines

Align-SLM-CL achieves state-of-the-art performance for textless SLMs on T-StoryCloze (86.8), S-StoryCloze (61.1), and sWUGGY (77.9), even surpassing text-guided approaches (Moshi).

5.6 Human Prefer Align-SLM’s Generation

Table 3 presents the MMOS scores from the subjective evaluation of speech. We compare the re-synthesized speech of the original continuation, the pre-trained TWIST 7B, and the proposed Align-SLM 7B with CL. The results show that human evaluators perceive Align-SLM as generating more meaningful speech continuations than the pre-trained model, and even surpassing the original continuation. This can be attributed to the fact that the original continuation, derived from audiobook content, may rely on the broader context, whereas Align-SLM learns to generate more relevant and meaningful content based on the speech prompt.

5.7 Impact on Audio Quality

Preference optimization can potentially adversarially exploit the ASR and LLM evaluators, leading the SLM to generate nonsensical or noisy speech with artificially high rewards. To address this concern, we evaluate the audio quality of generated speech using MOSNet, which has shown a high correlation with human judgments. Table 2 presents the MOSNet scores on the LibriSpeech dev-clean and test-clean sets. Results indicate that the MOSNet scores for Align-SLM are comparable to or slightly higher than those of the pre-trained SLM, suggesting that Align-SLM training preserves audio quality. Since the proposed framework requires the generated speech to pass through the ASR model for speech-to-text conversion, natural and clear speech is essential to avoid speech recognition errors. Consequently, in some cases, the MOSNet score for Align-SLM is even higher than that of the pre-trained SLM.

5 结果

5.1 Mistral 评分比困惑度提供更好的语义反馈

为了确定哪种偏好数据选择策略对SLM的语义有益,我们首先使用TWIST 1.3B模型在Align-SLM上进行了初步实验,并在表1中展示了PPL和Mistral评分两种策略的结果。此外,我们还使用相同的语音数据持续微调预训练模型,作为下一个语音标记预测的基线。“Proxy Metric”指的是用于偏好数据选择的指标,而“ZeroSpeech”、“StoryCloze”和“Speech Continuation”是零样本语音评估指标。

使用PPL反馈的Align-SLM(Align-SLM w/PPL)成功改进了auto-BLEU和PPL这两项代理指标,但在语音延续任务上的表现略逊于预训练模型。在ZeroSpeech和StoryCloze基准上,Align-SLM w/PPL在大多数指标上仅有边际改进,在T-StoryCloze上表现有所下降。特别是,困惑度反馈在衡量语法正确性的sBLIMP上显示出更大的改进。这一发现表明,朝着困惑度优化可能会过于关注语法,而不是语义和相关性。

另一方面,使用Mistral评分的Align-SLM(Align-SLM w/Mistral score)在多个指标上显著优于预训练模型。具体来说,在ZeroSpeech和StoryCloze基准上的表现显著提升(S-StoryCloze提升+1.6,T-StoryCloze提升+4.5)。生成的延续还获得了更好的GPT4-o评分,表明生成的内容与语音提示更相关且连贯。在代理指标方面,auto-BLEU和Mistral评分有所提高,而PPL与预训练模型相似。

这一发现表明,像Mistral评分这样的LLM评估提供了通用的语义反馈,并在各个基准上实现了卓越的表现。此外,使用Mistral评分的Align-SLM显著优于微调基线,后者仅能边际提高性能。因此,在后续的Align-SLM实验中,我们使用Mistral评分作为AI反馈。

[表1:以PPL与Mistral评分作为反馈的对比结果]

  • 表格结构

    • Evaluation:评估任务,包括代理指标(Proxy Metric)、ZeroSpeech、StoryCloze和语音延续(Continuation)。
    • Metrics:具体的评估指标,如auto-BLEU、PPL、Mistral评分等。
    • Pre-trained:预训练模型的评估结果。
    • Fine-tuned (Diff):微调模型的评估结果及与预训练模型的差异。
    • Align-SLM w/PPL (Diff):使用PPL的Align-SLM的评估结果及与预训练模型的差异。
    • Align-SLM w/Mistral score (Diff):使用Mistral评分的Align-SLM的评估结果及与预训练模型的差异。
  • 评估指标

    • auto-BLEU:衡量生成句子内部n-gram的重复程度,值越低表示重复越少、生成越自然。
    • PPL:困惑度,衡量模型对文本序列的预测能力,值越低表示模型的预测能力越强。
    • Mistral score:由Mistral模型给出的评分,衡量生成语音的质量,值越高表示生成语音越有意义和连贯。
    • sBLIMP:衡量语法正确性,值越高表示语法越正确。
    • sWUGGY:衡量词汇建模能力,值越高表示模型越能区分真实单词和非单词。
    • S-StoryCloze:基于原始StoryCloze的语音版本,值越高表示模型越能选择合理的结尾。
    • T-StoryCloze:基于主题的StoryCloze,值越高表示模型越能选择合理的结尾。
    • GPT4-o:由GPT4-o模型给出的评分,衡量生成语音的语义连贯性和相关性,值越高表示生成语音越有意义和连贯。
    • MOSNet:预测音频质量的平均意见得分,值越高表示音频质量越好。
  • 结果分析

    • Pre-trained:预训练模型在各项指标上的表现。
    • Fine-tuned:微调模型在大多数指标上略有改进,但在某些指标上表现下降。
    • Align-SLM w/PPL:在代理指标(auto-BLEU和PPL)上有所改进,但在语音延续任务上表现略逊于预训练模型。
    • Align-SLM w/Mistral score:在大多数指标上显著优于其他模型,特别是在语音延续任务上表现更好,生成的语音更连贯和相关。
  • 颜色标识

    • 深绿色(darkgreen):表示相比预训练模型有改进。
    • 红色(red):表示相比预训练模型性能下降。
  • 总结

    • 使用Mistral评分的Align-SLM在大多数评估指标上表现最佳,特别是在语音延续任务上,生成的语音更连贯和有意义。这表明Mistral评分作为语义反馈,能够有效提升模型的语义理解和生成能力。

5.2 持续改进预训练的SLMs

基于5.1节的发现,我们在后续实验中将Mistral评分用作“代理指标”。从表2可以看出,Align-SLM在ZeroSpeech、StoryCloze和语音延续任务上始终优于预训练模型,无论是在1.3B模型还是7B模型上(1.3B模型对应表2的第7至8行,7B模型对应第15至16行)。例如,在T-StoryCloze上,使用Librispeech数据训练Align-SLM,1.3B和7B模型分别获得了6.5%和11.1%的相对改进。此外,表2显示Align-SLM 7B在语音延续任务上表现更佳,GPT-4评分从2.70提升至3.50。T-StoryCloze上的表现接近人类水平(90.2)。然而,在sBLIMP(语法正确性)上,AudioLM和SyllableLM表现最佳,这可能归功于它们在语音标记设计上的优势。与级联系统的上限(ASR+LLM)相比,端到端SLMs仍有改进空间。

[表2:各模型在ZeroSpeech、StoryCloze与语音延续任务上的结果]

5.3 课程学习的改进

在使用Align-SLM的第一轮迭代训练预训练模型后,我们认为得到的模型是一个更强的起点,并使用更严格的偏好数据选择标准生成新的偏好数据。表2的结果表明,课程学习在ZeroSpeech和StoryCloze基准上改进了更多指标(1.3B模型对应表2的第8至9行,7B模型对应第16至17行),特别是对于T-StoryCloze,这需要语音提示及其延续之间的细粒度相关性。此外,我们还观察到语音延续任务的改进;例如,1.3B模型的GPT4-o评分从2.06提升至2.29。我们还尝试进一步增加课程学习的迭代次数,这继续提升了性能(详见附录D的讨论)。

5.4 偏好数据量

在之前的实验中,我们仅使用LibriSpeech训练数据进行DPO训练。经过第4.1节所述的过滤后,1.3B模型和7B模型的样本数量分别约为39K和63K。为了探究额外数据是否有助于Align-SLM学习语义,我们将数据量扩大了约三倍,加入了MLS的一个子集(mls)。值得注意的是,MLS子集的规模仍远小于SLM预训练所用的数据。表2显示,随着数据量的增加,模型在不同规模和基准测试中学到了明显更好的语义(1.3B模型对应表2的第10至11行,7B模型对应第18至19行)。不过我们观察到,对于7B SLM,加入mls数据在GPT4-o评分上的提升不如1.3B模型明显。这可以归因于仅用LibriSpeech时7B模型的偏好数据量已经足够(第一次迭代为63K,第二次迭代为71K);相比之下,1.3B模型仅有39K和20K的偏好数据对。

5.5 与基线模型的比较

Align-SLM-CL在无文本SLMs中实现了最先进性能,在T-StoryCloze上达到86.8,在S-StoryCloze上达到61.1,在sWUGGY上达到77.9,甚至超越了基于文本引导的方法(如Moshi)。

5.6 人类更偏好Align-SLM的生成结果

表3展示了语音主观评估的MMOS评分。我们比较了原始延续、预训练的TWIST 7B以及带有CL的Align-SLM 7B的重新合成语音。结果显示,人类评估者认为Align-SLM生成的语音延续比预训练模型更有意义,甚至超越了原始延续。这可能是因为原始延续源自有声读物内容,依赖于更广泛的上下文,而Align-SLM基于语音提示学习生成更相关和有意义的内容。
[表3:MMOS主观评估结果]

5.7 对音频质量的影响

偏好优化可能会对抗性地利用ASR和LLM评估器,导致SLM生成无意义或嘈杂的语音,从而获得人为夸大的高奖励。为解决这一问题,我们使用与人类判断高度相关的MOSNet来评估生成语音的音频质量。表2给出了LibriSpeech dev-clean和test-clean集上的MOSNet评分。结果表明,Align-SLM的MOSNet评分与预训练SLM相当,甚至略高,说明Align-SLM的训练保留了音频质量。由于所提框架要求生成的语音经过ASR模型进行语音到文本的转换,自然且清晰的语音对于避免识别错误至关重要;因此在某些情况下,Align-SLM的MOSNet评分甚至高于预训练SLM。

6 Conclusion

This work introduces Align-SLM, a novel framework that significantly enhances the semantics of SLMs via preference optimization. By utilizing LLM-guided semantic feedback and direct preference optimization, Align-SLM achieves state-of-the-art performance for SLMs across various benchmarks and generative tasks, consistently outperforming pre-trained SLMs. The framework demonstrates superior results with the proposed LLM evaluation feedback and curriculum learning. This work highlights the critical role of preference optimization for SLMs and paves the way for better end-to-end speech-to-speech models.
本研究提出了一种名为Align-SLM的新颖框架,通过偏好优化显著增强了语音语言模型(SLM)的语义能力。借助大型语言模型(LLM)引导的语义反馈和直接偏好优化,Align-SLM在多项基准测试和生成任务中实现了SLM的最先进性能,持续超越预训练的SLM。该框架通过引入LLM评估反馈和课程学习,展示了卓越的效果。本研究凸显了偏好优化在SLM中的关键作用,为更优秀的端到端语音到语音模型的发展铺平了道路。