Pocket-TTS: CPU-Friendly Text-to-Speech with Voice Cloning - Openclaw Skills
Author: Internet
2026-03-30
What is Pocket-TTS?
Pocket-TTS is a high-performance speech-synthesis tool designed to run efficiently on local hardware. As part of the Openclaw Skills collection, it uses a 100M-parameter model to deliver high-quality English speech generation. Unlike many modern AI models that demand substantial GPU resources, this skill is tuned specifically for CPU efficiency: on a modern processor such as an M4 it runs roughly 6x faster than real time using only two cores.
This skill suits developers who need to integrate low-latency voice responses into their applications while keeping data private. With streaming generation and a first-chunk latency of roughly 200 ms, it delivers a smooth interactive experience. Whether you are building an automated assistant or a local media tool, this entry in the Openclaw Skills library produces professional results without relying on external APIs.
Download: https://github.com/openclaw/skills/tree/main/skills/leonaaardob/lb-pocket-tts-skill
Installation and Download
1. ClawHub CLI
The fastest way to install the skill directly from the source.
npx clawhub@latest install lb-pocket-tts-skill
2. Manual installation
Copy the skill folder into one of the following locations:
- Global mode: ~/.openclaw/skills/
- Workspace: /skills/ (under the workspace root)
Priority: workspace > local > built-in
3. Prompt-based installation
Paste this prompt into OpenClaw to install automatically:
Please install lb-pocket-tts-skill for me using Clawhub. If Clawhub is not installed yet, install it first (npm i -g clawhub).
Pocket-TTS Use Cases
- Generate local speech from text for privacy-conscious AI applications.
- Create custom voice clones from short 3-10 second audio samples.
- Implement low-latency streaming for real-time conversational agents.
- Batch-process text files into 24 kHz mono WAV audio for content creation.
- Run a local self-hosted TTS server with a built-in web interface.
How it works:
1. Install the package and load the lightweight 100M-parameter model into memory.
2. Establish a voice state by selecting a default voice or supplying a 3-10 second audio sample for instant cloning.
3. The engine processes the input text through the CPU-optimized model, applying quality-tuning parameters such as temperature and LSD decode steps.
4. Audio is synthesized as high-quality 24 kHz mono WAV.
5. The output can be saved directly to a file, streamed in chunks for immediate playback, or served through a FastAPI web server.
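The batch-processing scenario above can be sketched with the Python API shown later in this document (`TTSModel.load_model`, `get_state_for_audio_prompt`, `generate_audio`). The directory names and the `./my_voice.wav` prompt are placeholders, and the function is a sketch rather than part of the library:

```python
from pathlib import Path

def synthesize_directory(src_dir, out_dir, voice_prompt):
    """Batch-convert every .txt file in src_dir to a WAV in out_dir.

    Sketch only: assumes the pocket_tts Python API documented on this
    page (requires `pip install pocket-tts` and scipy).
    """
    from pocket_tts import TTSModel   # imported lazily so the helper can be defined without the package
    import scipy.io.wavfile

    model = TTSModel.load_model()
    voice = model.get_state_for_audio_prompt(voice_prompt)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for txt in sorted(Path(src_dir).glob("*.txt")):
        # One WAV per text file, named after the source file.
        audio = model.generate_audio(voice, txt.read_text().strip())
        scipy.io.wavfile.write(out / (txt.stem + ".wav"),
                               model.sample_rate, audio.numpy())

# Example invocation (requires pocket-tts installed):
# synthesize_directory("texts/", "audio/", "./my_voice.wav")
```

Keeping the model and voice state loaded once across the loop avoids paying the model-load cost per file.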
Pocket-TTS Setup Guide
To get started with this skill from the Openclaw Skills series, install it with your preferred Python package manager:
# Standard installation via pip
pip install pocket-tts
# Or use uv for faster management
uv add pocket-tts
Once installed, you can generate your first audio file directly from the command line:
pocket-tts generate --text "Welcome to the Openclaw Skills ecosystem."
Pocket-TTS Data Structure and Taxonomy
Pocket-TTS organizes its outputs and assets with a focus on speed and compatibility. The metadata and file formats are as follows:
| Data type | Description | Format |
|---|---|---|
| Audio output | High-quality mono speech | .wav (24 kHz, 16-bit PCM) |
| Voice state | Precomputed voice embedding for fast loading | .safetensors |
| Input prompt | Source audio for voice cloning | .wav, .mp3, or URL |
| Model configuration | Parameters for temperature and decode steps | CLI flags / JSON |
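A quick way to sanity-check that a generated file matches the audio-output row of this table is the Python standard-library `wave` module. Here a synthetic one-second clip of silence stands in for real Pocket-TTS output:

```python
import wave
import array

# Write a short synthetic clip in the documented output format:
# 24 kHz, mono, 16-bit PCM ("h" = signed 16-bit samples).
samples = array.array("h", [0] * 24000)   # one second of silence
with wave.open("check.wav", "wb") as w:
    w.setnchannels(1)        # mono
    w.setsampwidth(2)        # 16-bit = 2 bytes per sample
    w.setframerate(24000)    # 24 kHz
    w.writeframes(samples.tobytes())

# Read the header back and confirm it matches the table.
with wave.open("check.wav", "rb") as w:
    assert w.getnchannels() == 1
    assert w.getsampwidth() == 2
    assert w.getframerate() == 24000
    assert w.getnframes() == 24000
```

Swapping `"check.wav"` for a real Pocket-TTS output file turns this into a format check for your own pipeline.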
id: pocket-tts
name: Pocket-TTS
version: 1.0.0
author: Leonardo Balland
description: Generate speech from text using Kyutai Pocket TTS - lightweight, CPU-friendly, streaming TTS with voice cloning. English only. ~6x real-time on M4 MacBook Air.
categories:
- text-to-speech
- documentation
tags:
- kyutai
- text-to-speech
- tts
- cpu
- streaming
- voice-cloning
homepage: https://kyutai.org/blog/2026-01-13-pocket-tts
repository: https://github.com/kyutai-labs/pocket-tts
documentation: https://github.com/kyutai-labs/pocket-tts/tree/main/docs
Pocket TTS
Lightweight CPU-friendly text-to-speech with voice cloning. No GPU required.
When to Use
- Generating speech from text on CPU without GPU
- Voice cloning from audio samples
- Streaming audio generation (low latency)
- Local TTS without API dependencies
- Real-time speech synthesis (~6x faster than real-time)
Key Features
- 100M parameters - Small, efficient model
- CPU-optimized - No GPU needed, uses only 2 cores
- ~6x real-time - Fast generation on modern CPUs
- ~200ms latency - To first audio chunk (streaming)
- Voice cloning - From 3-10s audio samples
- 24kHz mono WAV - High-quality output
- English only - More languages planned
Installation
pip install pocket-tts
# or
uv add pocket-tts
CLI Commands
Generate Speech
# Basic generation (default voice)
pocket-tts generate --text "Hello world"
# Custom voice (local file, URL, or safetensors)
pocket-tts generate --voice ./my_voice.wav
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
pocket-tts generate --voice ./voice.safetensors
# Quality tuning
pocket-tts generate --temperature 0.7 --lsd-decode-steps 3
See docs/generate.md for full CLI reference.
Start Web Server
# Start FastAPI server with web UI
pocket-tts serve
# Custom host/port
pocket-tts serve --host localhost --port 8080
See docs/serve.md for server options.
Export Voice Embeddings
Convert audio files to .safetensors for faster loading:
# Single file
pocket-tts export-voice voice.mp3 voice.safetensors
# Batch conversion
pocket-tts export-voice voices/ embeddings/ --truncate
See docs/export_voice.md for export options.
Python API
Basic Usage
from pocket_tts import TTSModel
import scipy.io.wavfile
# Load model
model = TTSModel.load_model()
# Get voice state
voice = model.get_state_for_audio_prompt(
"hf://kyutai/tts-voices/alba-mackenna/casual.wav"
)
# Generate audio
audio = model.generate_audio(voice, "Hello world!")
# Save
scipy.io.wavfile.write("output.wav", model.sample_rate, audio.numpy())
Load Model
model = TTSModel.load_model(
config="b6369a24", # Model variant
temp=0.7, # Temperature (0.5-1.0)
lsd_decode_steps=1, # Generation steps (1-5)
eos_threshold=-4.0 # End-of-sequence threshold
)
Voice State
# From audio file/URL
voice = model.get_state_for_audio_prompt("./voice.wav")
voice = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")
# From safetensors (fast loading)
voice = model.get_state_for_audio_prompt("./voice.safetensors")
Streaming Generation
# Stream audio chunks
for chunk in model.generate_audio_stream(voice, "Long text..."):
# Process/save/play each chunk as generated
print(f"Chunk: {chunk.shape[0]} samples")
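Because the model yields audio incrementally, chunks can be appended to a WAV file as they arrive instead of buffering the whole utterance in memory. This sketch uses synthetic silence buffers as a stand-in for `generate_audio_stream` output:

```python
import wave
import array

def fake_chunks(n_chunks=5, samples_per_chunk=1920):
    """Synthetic stand-in for model.generate_audio_stream():
    yields int16 sample buffers (~80 ms each at 24 kHz)."""
    for _ in range(n_chunks):
        yield array.array("h", [0] * samples_per_chunk)

# Append each chunk to the WAV as it arrives; the wave module
# fixes up the header length when the file is closed.
total = 0
with wave.open("stream.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24000)
    for chunk in fake_chunks():
        w.writeframes(chunk.tobytes())
        total += len(chunk)

print(total)  # 9600 samples = 0.4 s of audio
```

Replacing `fake_chunks()` with `model.generate_audio_stream(voice, text)` (and converting each chunk's samples to int16 bytes) gives incremental saving with the real model.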
Multi-Voice Management
# Preload multiple voices
voices = {
"casual": model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav"),
"announcer": model.get_state_for_audio_prompt("./announcer.safetensors"),
}
# Use different voices
audio1 = model.generate_audio(voices["casual"], "Hey there!")
audio2 = model.generate_audio(voices["announcer"], "Breaking news!")
See docs/python-api.md for complete API reference.
Available Voices
Pre-made voices from hf://kyutai/tts-voices/:
- alba-mackenna/casual.wav (default, female)
- jessica-jian/casual.wav (female)
- voice-donations/Selfie.wav (male, marius)
- voice-donations/Butter.wav (male, javert)
- ears/p010/freeform_speech_01.wav (male, jean)
- vctk/p244_023.wav (female, fantine)
- vctk/p262_023.wav (female, eponine)
- vctk/p303_023.wav (female, azelma)
Or clone any voice from your own audio samples.
Voice Cloning Tips
- Clean audio - Remove background noise (use Adobe Podcast Enhance)
- Length - 3-10 seconds of speech is ideal
- Quality - Input quality affects output quality
- Format - WAV, MP3, or any common audio format supported
Performance Tips
- CPU-only - GPU provides no speedup (model too small, batch size 1)
- 2 cores - Uses only 2 CPU cores efficiently
- Streaming - Low latency (<200ms to first chunk)
- Safetensors - Pre-process voices to .safetensors for instant loading
Output Format
All commands output WAV files:
- Sample rate: 24 kHz
- Channels: Mono
- Bit depth: 16-bit PCM
Links
- GitHub
- Tech Report
- Paper (arXiv)
- HuggingFace Model
- Voice Repository
- Live Demo