25.6.29 | Added Gummy real-time / one-sentence speech recognition |
25.6.28 | Added Qwen TTS local audio file and real-time playback |
Background
Environment Setup
A macOS machine with an M1 chip (other M-series chips work too).
To keep the Python environment manageable, use Miniconda. Download link: Download Anaconda Distribution | Anaconda
For editing code, download and install the most popular editor: VS Code. Recent versions include GitHub Copilot for free; remember to enable it.
To fix the problem where the interpreter VS Code picks by default cannot find dashscope when Python is managed through Miniconda (a verification sketch follows this list):
- Press Cmd+Shift+P, then type and select Python: Select Interpreter.
- Pick the miniconda environment where you installed dashscope (e.g. miniconda3/envs/xxx).
- The status bar at the bottom right shows the current environment.
- Verify: which python
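To double-check from inside the editor, a minimal verification sketch; run it with the interpreter you just selected, and the import fails with ModuleNotFoundError if the wrong environment is active:

import sys

# Show which interpreter is actually running
print(sys.executable)

# Fails with ModuleNotFoundError when the wrong environment is selected
import dashscope
print(dashscope.__file__)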
Gummy-ASR
Real-time recognition
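A minimal sketch of continuous real-time recognition, assuming the SDK's TranslationRecognizerRealtime class and the gummy-realtime-v1 model name from the official real-time demo; the callback shape matches the one-sentence demo below, but the session streams until you interrupt it:

import pyaudio
import dashscope
from dashscope.audio.asr import *

mic = None
stream = None


class RealtimeCallback(TranslationRecognizerCallback):
    def on_open(self) -> None:
        global mic, stream
        mic = pyaudio.PyAudio()
        stream = mic.open(
            format=pyaudio.paInt16, channels=1, rate=16000, input=True
        )

    def on_close(self) -> None:
        global mic, stream
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_event(self, request_id, transcription_result, translation_result, usage) -> None:
        # Partial results stream in continuously as you speak
        if transcription_result is not None:
            print("transcription: ", transcription_result.text)


translator = TranslationRecognizerRealtime(
    model="gummy-realtime-v1",
    format="pcm",
    sample_rate=16000,
    transcription_enabled=True,
    translation_enabled=False,
    callback=RealtimeCallback(),
)
translator.start()
print("Speak into the microphone; press Ctrl+C to stop.")
try:
    while True:
        if stream:
            data = stream.read(3200, exception_on_overflow=False)
            translator.send_audio_frame(data)
        else:
            break
except KeyboardInterrupt:
    pass
translator.stop()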
One-sentence recognition
Official demo, unmodified.
How it works: recording starts when the program launches; the decision to stop recording is made by the cloud (confirmed by reading the code and the logs), and one minute of audio is the upper limit.
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
# One-sentence recognition transcribes a speech stream of up to one minute (whether captured
# from an external device such as a microphone, or read from a local file) and returns the
# results as a stream.

import pyaudio
import dashscope
from dashscope.audio.asr import *

# If the API key is not configured in the environment, replace your-api-key with your own API key.
# dashscope.api_key = "your-api-key"

mic = None
stream = None


class Callback(TranslationRecognizerCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print("TranslationRecognizerCallback open.")
        mic = pyaudio.PyAudio()
        stream = mic.open(
            format=pyaudio.paInt16, channels=1, rate=16000, input=True
        )

    def on_close(self) -> None:
        global mic
        global stream
        print("TranslationRecognizerCallback close.")
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_event(
        self,
        request_id,
        transcription_result: TranscriptionResult,
        translation_result: TranslationResult,
        usage,
    ) -> None:
        print("request id: ", request_id)
        print("usage: ", usage)
        if translation_result is not None:
            print(
                "translation_languages: ",
                translation_result.get_language_list(),
            )
            english_translation = translation_result.get_translation("en")
            print("sentence id: ", english_translation.sentence_id)
            print("translate to english: ", english_translation.text)
            if english_translation.vad_pre_end:
                print("vad pre end {}, {}, {}".format(
                    transcription_result.pre_end_start_time,
                    transcription_result.pre_end_end_time,
                    transcription_result.pre_end_timemillis,
                ))
        if transcription_result is not None:
            print("sentence id: ", transcription_result.sentence_id)
            print("transcription: ", transcription_result.text)


callback = Callback()

translator = TranslationRecognizerChat(
    model="gummy-chat-v1",
    format="pcm",
    sample_rate=16000,
    transcription_enabled=True,
    translation_enabled=True,
    translation_target_languages=["en"],
    callback=callback,
)

translator.start()
print("Speak into the microphone to try one-sentence speech recognition and translation.")

# 3200 bytes = 100 ms of 16 kHz, 16-bit mono PCM per frame
while True:
    if stream:
        data = stream.read(3200, exception_on_overflow=False)
        if not translator.send_audio_frame(data):
            print("sentence end, stop sending")
            break
    else:
        break

translator.stop()
Qwen-TTS
Non-real-time TTS
Fixed the case where the TTS request returns a bad response (for example, when the API key is wrong).
import os
import requests
import dashscope

# Sample Chinese marketing copy used as the synthesis input
text = "那我来给大家推荐一款T恤,这款呢真的是超级好看,这个颜色呢很显气质,而且呢也是搭配的绝佳单品,大家可以闭眼入,真的是非常好看,对身材的包容性也很好,不管啥身材的宝宝呢,穿上去都是很好看的。推荐宝宝们下单哦。"

response = dashscope.audio.qwen_tts.SpeechSynthesizer.call(
    model="qwen-tts",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    text=text,
    voice="Cherry",
)

# ====== Start of response validation ======
print(response)
if not hasattr(response, 'output') or response.output is None:
    print("The response has no output field; check your permissions or whether the model is enabled")
    exit()
if not hasattr(response.output, 'audio') or response.output.audio is None:
    print("The response has no audio data; check the returned content")
    exit()
# output.audio is dict-like, so test for the key rather than an attribute
if "url" not in response.output.audio:
    print("The audio has no url field; check the response structure")
    exit()
# ====== End of response validation ======

audio_url = response.output.audio["url"]
save_path = "downloaded_audio.wav"  # customize the save path

try:
    response = requests.get(audio_url)
    response.raise_for_status()  # raise if the download failed
    with open(save_path, 'wb') as f:
        f.write(response.content)
    print(f"Audio file saved to: {save_path}")
except Exception as e:
    print(f"Download failed: {str(e)}")
Problem 1: pip install pyaudio fails
Solution:
(1) First install brew: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)", then restart the terminal
(2) Then install portaudio: brew install portaudio
(3) Then install pyaudio: pip install pyaudio
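After installing, a quick sanity check that pyaudio can see your audio devices (standard pyaudio API; device names differ per machine):

import pyaudio

# List audio devices to confirm pyaudio/portaudio work and a microphone is visible
p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    print(i, info["name"], "input channels:", info["maxInputChannels"])
p.terminate()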
The generated TTS audio file is saved locally. This fits offline playback scenarios where the audio is generated ahead of time, such as narrating a PPT.
It does not fit real-time voice interaction; for that we need real-time TTS.
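Before moving on, a minimal playback sketch for checking the saved file locally, using the standard wave module plus pyaudio (assumes the downloaded_audio.wav produced by the script above):

import wave
import pyaudio

# Play back the WAV file saved by the non-real-time TTS script
wf = wave.open("downloaded_audio.wav", "rb")
p = pyaudio.PyAudio()
stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                channels=wf.getnchannels(),
                rate=wf.getframerate(),
                output=True)
data = wf.readframes(1024)
while data:
    stream.write(data)
    data = wf.readframes(1024)
stream.stop_stream()
stream.close()
p.terminate()
wf.close()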
Real-time TTS
Just run the official demo as-is.
We wrap the real-time TTS into a function API and provide a demo that tests the API.
Function wrapper code: qwen_play_tts.py
# coding=utf-8
import os
import time
import base64

import dashscope
import pyaudio
import numpy as np


def qwen_play_tts(text, voice="Ethan", api_key=None):
    """
    Stream speech synthesis with Qwen TTS and play it back as it arrives.

    :param text: text to synthesize
    :param voice: speaker voice
    :param api_key: DashScope API key (optional; defaults to the environment variable)
    """
    api_key = api_key or os.getenv("DASHSCOPE_API_KEY")
    if not api_key:
        raise ValueError("DASHSCOPE_API_KEY is not set.")

    # qwen-tts streams 16-bit mono PCM at 24 kHz
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16,
                    channels=1,
                    rate=24000,
                    output=True)

    responses = dashscope.audio.qwen_tts.SpeechSynthesizer.call(
        model="qwen-tts",
        api_key=api_key,
        text=text,
        voice=voice,
        stream=True
    )

    for chunk in responses:
        # Each chunk carries a base64-encoded PCM fragment
        audio_string = chunk["output"]["audio"]["data"]
        wav_bytes = base64.b64decode(audio_string)
        audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
        stream.write(audio_np.tobytes())

    time.sleep(0.8)  # let the buffered tail finish playing before closing
    stream.stop_stream()
    stream.close()
    p.terminate()


# Example call
if __name__ == "__main__":
    sample_text = "你好,这是一段测试语音。"  # sample Chinese text
    qwen_play_tts(sample_text)
API test code: qwen_api_test.py
from qwen_play_tts import qwen_play_tts
qwen_play_tts("这是一个函数调用测试。")