一、前言
在当今信息呈现方式越来越多样化的背景下,如何将文字、图片甚至视频高效转化为可听的音频体验,已经成为内容创作者、教育者和研究者们共同关注的重要话题。Podcastfy是一款基于Python的开源工具,它专注于将多种形式的内容智能转换成音频,正在引领一场“可听化”的创作新风潮。
通过结合生成式人工智能(GenAI)和先进的文本转语音(TTS)技术,Podcastfy能够将网页、PDF文件、图片甚至YouTube视频等多种输入,转变为自然流畅的多语言音频对话。
与传统的单一内容转化工具不同,Podcastfy支持从短小的2分钟精华片段到长达30分钟的深度播客生成,还允许用户在音频风格、语言结构和语音模型上进行高度自定义。并且,Podcastfy以其开源特性和程序化接口,为各种场景下的内容创作提供了灵活且专业的解决方案。这一工具的推出,不仅为信息的可及性带来了重要突破,还重新定义了“声音经济”时代的内容表达方式。
二、术语介绍
2.1.Podcastfy
是一款基于 Python 开发的开源多模态内容转换工具,其核心作用是通过生成式人工智能(GenAI)技术,将文本、图像、网页、PDF、YouTube 视频等多种形式的内容,智能转化为多语言音频对话,从而革新内容创作与传播方式。
技术定位与核心功能
1. 多模态输入兼容性
- Podcastfy 支持从网页、PDF、图像、YouTube 视频甚至用户输入的主题中提取内容,并自动生成对话式文本脚本。
2.多语言与音频定制化
- 工具内置多语言支持(包括中文、英语等),可生成不同语言版本的音频,并允许调整播客的风格、声音、时长(如 2-5 分钟短片或 30 分钟以上的长篇内容),甚至模拟自然对话的互动感。
3.技术架构与开源特性
- 生成式 AI 驱动:集成 100+ 主流语言模型(如 OpenAI、Anthropic、Google 等),支持本地运行 HuggingFace 上的 156+ 模型,兼顾生成质量与隐私控制。
- 高级 TTS 引擎:与 ElevenLabs、Microsoft Edge 等文本转语音平台无缝整合,生成拟人化语音效果。
- 开源可扩展:用户可自由修改代码,定制播客生成逻辑或集成私有模型,突破闭源工具(如 Google NotebookLM)的功能限制。
2.2.Gradio
是一个开源的 Python 库,专注于快速构建交互式 Web 应用程序,尤其适用于机器学习模型、API 或任意 Python 函数的可视化展示和用户交互。通过简单代码即可生成功能丰富的界面,无需前端开发经验。
2.3.nohup
命令
是类 Unix 系统中使用的一个工具,用于在后台运行程序并使其忽略挂起信号。在使用命令行运行程序时,通常如果你关闭终端或注销用户,正在运行的程序也会被终止。使用 nohup
可以避免这种情况,让程序在后台持续运行。
三、前置条件
3.1.基础环境及前置条件
1. 操作系统:无限制
3.2.安装依赖
conda create --name podcastfy-app python=3.12
conda activate podcastfy-app
pip install gradio-client==1.4.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install gradio==5.4.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install podcastfy==0.4.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install python-dotenv==1.0.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
四、技术实现
4.1.Gradio代码
# -*- coding:utf-8 -*-
import gradio as gr
import os
import tempfile
import logging
from podcastfy.client import generate_podcast
from dotenv import load_dotenv
# Configure logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
# Load environment variables
load_dotenv()
os.environ["GEMINI_API_KEY"] = 'xxxxxxxxxxxxxx-xxxxxxxx-xx'
os.environ["OPENAI_API_KEY"] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
def get_api_key(key_name, ui_value):
return ui_value if ui_value else os.getenv(key_name)
def process_inputs(
text_input,
urls_input,
pdf_files,
image_files,
gemini_key,
openai_key,
elevenlabs_key,
word_count,
conversation_style,
roles_person1,
roles_person2,
dialogue_structure,
podcast_name,
podcast_tagline,
tts_model,
creativity_level,
user_instructions,
longform
):
try:
logger.info("Starting podcast generation process")
# API key handling
logger.debug("Setting API keys")
os.environ["GEMINI_API_KEY"] = get_api_key("GEMINI_API_KEY", gemini_key)
if tts_model == "openai":
logger.debug("Setting OpenAI API key")
if not openai_key and not os.getenv("OPENAI_API_KEY"):
raise ValueError("OpenAI API key is required when using OpenAI TTS model")
os.environ["OPENAI_API_KEY"] = get_api_key("OPENAI_API_KEY", openai_key)
if tts_model == "elevenlabs":
logger.debug("Setting ElevenLabs API key")
if not elevenlabs_key and not os.getenv("ELEVENLABS_API_KEY"):
raise ValueError("ElevenLabs API key is required when using ElevenLabs TTS model")
os.environ["ELEVENLABS_API_KEY"] = get_api_key("ELEVENLABS_API_KEY", elevenlabs_key)
print(f'GEMINI_API_KEY: {os.environ["GEMINI_API_KEY"]},OPENAI_API_KEY: {os.environ["OPENAI_API_KEY"]}')
# Process URLs
urls = [url.strip() for url in urls_input.split('\n') if url.strip()]
logger.debug(f"Processed URLs: {urls}")
temp_files = []
temp_dirs = []
# Handle PDF files
if pdf_files is not None and len(pdf_files) > 0:
logger.info(f"Processing {len(pdf_files)} PDF files")
pdf_temp_dir = tempfile.mkdtemp()
temp_dirs.append(pdf_temp_dir)
for i, pdf_file in enumerate(pdf_files):
pdf_path = os.path.join(pdf_temp_dir, f"input_pdf_{i}.pdf")
temp_files.append(pdf_path)
with open(pdf_path, 'wb') as f:
f.write(pdf_file)
urls.append(pdf_path)
logger.debug(f"Saved PDF {i} to {pdf_path}")
# Handle image files
image_paths = []
if image_files is not None and len(image_files) > 0:
logger.info(f"Processing {len(image_files)} image files")
img_temp_dir = tempfile.mkdtemp()
temp_dirs.append(img_temp_dir)
for i, img_file in enumerate(image_files):
# Get file extension from the original name in the file tuple
original_name = img_file.orig_name if hasattr(img_file, 'orig_name') else f"image_{i}.jpg"
extension = original_name.split('.')[-1]
logger.debug(f"Processing image file {i}: {original_name}")
img_path = os.path.join(img_temp_dir, f"input_image_{i}.{extension}")
temp_files.append(img_path)
try:
# Write the bytes directly to the file
with open(img_path, 'wb') as f:
if isinstance(img_file, (tuple, list)):
f.write(img_file[1]) # Write the bytes content
else:
f.write(img_file) # Write the bytes directly
image_paths.append(img_path)
logger.debug(f"Saved image {i} to {img_path}")
except Exception as e:
logger.error(f"Error saving image {i}: {str(e)}")
raise
# Prepare conversation config
logger.debug("Preparing conversation config")
conversation_config = {
"word_count": word_count,
"conversation_style": conversation_style.split(','),
"roles_person1": roles_person1,
"roles_person2": roles_person2,
"dialogue_structure": dialogue_structure.split(','),
"podcast_name": podcast_name,
"podcast_tagline": podcast_tagline,
"creativity": creativity_level,
"user_instructions": user_instructions
}
# Generate podcast
logger.info("Calling generate_podcast function")
logger.debug(f"URLs: {urls}")
logger.debug(f"Image paths: {image_paths}")
logger.debug(f"Text input present: {'Yes' if text_input else 'No'}")
audio_file = generate_podcast(
urls=urls if urls else None,
text=text_input if text_input else None,
image_paths=image_paths if image_paths else None,
tts_model=tts_model,
conversation_config=conversation_config,
longform = eval(longform)
)
logger.info("Podcast generation completed")
# Cleanup
logger.debug("Cleaning up temporary files")
for file_path in temp_files:
if os.path.exists(file_path):
os.unlink(file_path)
logger.debug(f"Removed temp file: {file_path}")
for dir_path in temp_dirs:
if os.path.exists(dir_path):
os.rmdir(dir_path)
logger.debug(f"Removed temp directory: {dir_path}")
return audio_file
except Exception as e:
logger.error(f"Error in process_inputs: {str(e)}", exc_info=True)
# Cleanup on error
for file_path in temp_files:
if os.path.exists(file_path):
os.unlink(file_path)
for dir_path in temp_dirs:
if os.path.exists(dir_path):
os.rmdir(dir_path)
return str(e)
# Create Gradio interface with updated theme
with gr.Blocks(
title="Podcastfy.ai",
theme=gr.themes.Base(
primary_hue="blue",
secondary_hue="slate",
neutral_hue="slate"
),
css="""
/* Move toggle arrow to left side */
.gr-accordion {
--accordion-arrow-size: 1.5em;
}
.gr-accordion > .label-wrap {
flex-direction: row !important;
justify-content: flex-start !important;
gap: 1em;
}
.gr-accordion > .label-wrap > .icon {
order: -1;
}
"""
) as demo:
with gr.Tab("Content"):
# API Keys Section
gr.Markdown(
"""
<h2 style='color: #2196F3; margin-bottom: 10px; padding: 10px 0;'>
🔑 API Keys
</h2>
""",
elem_classes=["section-header"]
)
with gr.Accordion("Configure API Keys", open=False):
gemini_key = gr.Textbox(
label="Gemini API Key",
type="password",
value=os.getenv("GEMINI_API_KEY", ""),
info="Required"
)
openai_key = gr.Textbox(
label="OpenAI API Key",
type="password",
value=os.getenv("OPENAI_API_KEY", ""),
info="Required only if using OpenAI TTS model"
)
elevenlabs_key = gr.Textbox(
label="ElevenLabs API Key",
type="password",
value=os.getenv("ELEVENLABS_API_KEY", ""),
info="Required only if using ElevenLabs TTS model [recommended]"
)
# Content Input Section
gr.Markdown(
"""
<h2 style='color: #2196F3; margin-bottom: 10px; padding: 10px 0;'>
📝 Input Content
</h2>
""",
elem_classes=["section-header"]
)
with gr.Accordion("Configure Input Content", open=False):
with gr.Group():
text_input = gr.Textbox(
label="Text Input",
placeholder="Enter or paste text here...",
lines=3
)
urls_input = gr.Textbox(
label="URLs",
placeholder="Enter URLs (one per line) - supports websites and YouTube videos.",
lines=3
)
# Place PDF and Image uploads side by side
with gr.Row():
with gr.Column():
pdf_files = gr.Files( # Changed from gr.File to gr.Files
label="Upload PDFs", # Updated label
file_types=[".pdf"],
type="binary"
)
gr.Markdown("*Upload one or more PDF files to generate podcast from*",
elem_classes=["file-info"])
with gr.Column():
image_files = gr.Files(
label="Upload Images",
file_types=["image"],
type="binary"
)
gr.Markdown("*Upload one or more images to generate podcast from*", elem_classes=["file-info"])
# Customization Section
gr.Markdown(
"""
<h2 style='color: #2196F3; margin-bottom: 10px; padding: 10px 0;'>
⚙️ Customization Options
</h2>
""",
elem_classes=["section-header"]
)
with gr.Accordion("Configure Podcast Settings", open=False):
# Basic Settings
gr.Markdown(
"""
<h3 style='color: #1976D2; margin: 15px 0 10px 0;'>
📊 Basic Settings
</h3>
""",
)
word_count = gr.Slider(
minimum=500,
maximum=5000,
value=2000,
step=100,
label="Word Count",
info="Target word count for the generated content"
)
conversation_style = gr.Textbox(
label="Conversation Style",
value="engaging,fast-paced,enthusiastic",
info="Comma-separated list of styles to apply to the conversation"
)
# Roles and Structure
gr.Markdown(
"""
<h3 style='color: #1976D2; margin: 15px 0 10px 0;'>
👥 Roles and Structure
</h3>
""",
)
roles_person1 = gr.Textbox(
label="Role of First Speaker",
value="main summarizer",
info="Role of the first speaker in the conversation"
)
roles_person2 = gr.Textbox(
label="Role of Second Speaker",
value="questioner/clarifier",
info="Role of the second speaker in the conversation"
)
dialogue_structure = gr.Textbox(
label="Dialogue Structure",
value="Introduction,Main Content Summary,Conclusion",
info="Comma-separated list of dialogue sections"
)
# Podcast Identity
gr.Markdown(
"""
<h3 style='color: #1976D2; margin: 15px 0 10px 0;'>
🎙️ Podcast Identity
</h3>
""",
)
podcast_name = gr.Textbox(
label="Podcast Name",
value="PODCASTFY",
info="Name of the podcast"
)
podcast_tagline = gr.Textbox(
label="Podcast Tagline",
value="YOUR PERSONAL GenAI PODCAST",
info="Tagline or subtitle for the podcast"
)
# Voice Settings
gr.Markdown(
"""
<h3 style='color: #1976D2; margin: 15px 0 10px 0;'>
🗣️ Voice Settings
</h3>
""",
)
tts_model = gr.Radio(
choices=["openai", "elevenlabs", "edge", "gemini", "geminimulti"],
value="openai",
label="Text-to-Speech Model",
info="Choose the voice generation model (edge is free but of low quality, others are superior but require API keys)"
)
# Advanced Settings
gr.Markdown(
"""
<h3 style='color: #1976D2; margin: 15px 0 10px 0;'>
🔧 Advanced Settings
</h3>
""",
)
creativity_level = gr.Slider(
minimum=0,
maximum=1,
value=0.7,
step=0.1,
label="Creativity Level",
info="Controls the creativity of the generated conversation (0 for focused/factual, 1 for more creative)"
)
user_instructions = gr.Textbox(
label="Custom Instructions",
value="",
lines=2,
placeholder="Add any specific instructions to guide the conversation...",
info="Optional instructions to guide the conversation focus and topics"
)
longform = gr.Radio(
choices=["True", "False"],
value="False",
label="Podcasts Generation Way",
info="Choose the podcasts generation Content Length"
)
# Output Section
gr.Markdown(
"""
<h2 style='color: #2196F3; margin-bottom: 10px; padding: 10px 0;'>
🎵 Generated Output
</h2>
""",
elem_classes=["section-header"]
)
with gr.Group():
generate_btn = gr.Button("🎙️ Generate Podcast", variant="primary")
audio_output = gr.Audio(
type="filepath",
label="Generated Podcast"
)
# Handle generation
generate_btn.click(
process_inputs,
inputs=[
text_input, urls_input, pdf_files, image_files,
gemini_key, openai_key, elevenlabs_key,
word_count, conversation_style,
roles_person1, roles_person2,
dialogue_structure, podcast_name,
podcast_tagline, tts_model,
creativity_level, user_instructions,longform
],
outputs=audio_output
)
DEFAULT_SERVER_NAME = '0.0.0.0'
DEFAULT_PORT = 8000
DEFAULT_USER = "zhangshan"
DEFAULT_PASSWORD = '123456'
if __name__ == "__main__":
demo.queue().launch(debug=False,
share=False,
inbrowser=False,
server_port=DEFAULT_PORT,
server_name=DEFAULT_SERVER_NAME,
auth=(DEFAULT_USER, DEFAULT_PASSWORD) )
4.2.测试
4.2.1.启动Gradio服务
nohup python /podcastfy-app/gradio-server.py > /logs/podcastfy-app.log 2>&1 &
浏览器访问:http://IP:8000
输入账号:zhangshan/123456
4.2.2.测试文本输入
注意:需要具备科学上网的能力
PS:服务端输出的日志:
DEBUG:openai._base_client:HTTP Response: POST https://api.openai.com/v1/audio/speech "200 OK" Headers({'date': 'Wed, 16 Apr 2025 07:20:51 GMT', 'content-type': 'audio/mpeg', 'transfer-encoding': 'chunked', 'connection': 'keep-alive', 'access-control-expose-headers': 'X-Request-ID', 'openai-organization': 'everblessed-technology-inc', 'openai-processing-ms': '1334', 'openai-version': '2020-10-01', 'strict-transport-security': 'max-age=31536000; includeSubDomains; preload', 'via': 'envoy-router-84dd794555-brjjp', 'x-envoy-upstream-service-time': '1313', 'x-ratelimit-limit-requests': '10000', 'x-ratelimit-remaining-requests': '9999', 'x-ratelimit-reset-requests': '6ms', 'x-request-id': 'req_cc00076d234569e896d01ee281a07938', 'cf-cache-status': 'DYNAMIC', 'x-content-type-options': 'nosniff', 'server': 'cloudflare', 'cf-ray': '9311ec21cd46fb30-SJC', 'alt-svc': 'h3=":443"; ma=86400'})
DEBUG:openai._base_client:request_id: req_cc00076d234569e896d01ee281a07938
DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/audio/speech', 'headers': {'Accept': 'application/octet-stream'}, 'files': None, 'json_data': {'input': "Exactly! It's not just about the present moment. It's about envisioning a future, a forever, with this person. And that forever is clear, sharply defined.", 'model': 'tts-1-hd', 'voice': 'shimmer'}}
DEBUG:openai._base_client:Sending HTTP Request: POST https://api.openai.com/v1/audio/speech
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/1_question.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/1_answer.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/2_question.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/2_answer.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/3_question.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/3_answer.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/4_question.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/4_answer.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/5_question.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/5_answer.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/6_question.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/6_answer.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/7_question.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'mp3', '-i', '/opt/anaconda3/envs/podcastfy-app/lib/python3.12/site-packages/podcastfy/data/audio/tmp/tmpxsbm6y6y/7_answer.mp3', '-acodec', 'pcm_s16le', '-vn', '-f', 'wav', '-'])
DEBUG:pydub.converter:subprocess.call(['ffmpeg', '-y', '-f', 'wav', '-i', '/tmp/tmpytxxw8ea', '-f', 'mp3', '/tmp/tmptv8lgkb9'])
4.2.3.测试文件输入
注意:需要具备科学上网的能力
测试的PDF文件共25页,大小1.3M。