YouTube Video Analyzer: Multimodal AI Insights - Openclaw Skills


Author: Internet

2026-03-27

AI Tutorials

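What is the YouTube Video Analyzer (Multimodal)?

The YouTube Video Analyzer is a specialized tool for analyzing technical content that goes beyond simple transcript extraction. Built on Openclaw Skills, it bridges the gap between a video's audio and visual channels. It lets an AI agent understand not only what is being said but also exactly what is shown on screen, which makes it well suited to analyzing software tutorials, programming demos, and complex diagrams.

The skill uses a robust workflow to handle common problems such as rate limits and subtitle availability. By extracting keyframes at strategic intervals and aligning them with timestamped text, it produces high-fidelity guides and technical summaries that include the visual context text-only tools often miss.

Download: https://github.com/openclaw/skills/tree/main/skills/sdrabent/you@tube-video-analyzer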

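Installation and Download

1. ClawHub CLI

The fastest way to install the skill directly from the source.

npx clawhub@latest install you@tube-video-analyzer

2. Manual Installation

Copy the skill folder to one of the following locations:

Global: ~/.openclaw/skills/ Workspace: /skills/

Priority: workspace > local > built-in

3. Prompt Installation

Copy this prompt into OpenClaw to install the skill automatically.

Please install you@tube-video-analyzer for me using Clawhub. If Clawhub is not installed yet, install it first (npm i -g clawhub).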

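YouTube Video Analyzer (Multimodal) Use Cases

Create step-by-step technical documentation from software UI demos.
Summarize educational content in which visual diagrams are essential to the explanation.
Extract code blocks and terminal commands directly from programming video tutorials.
Audit video tutorials for visual accuracy and synchronize spoken instructions with on-screen actions.
Generate comprehensive technical reports that highlight visual-only information not mentioned in the audio track.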

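How the YouTube Video Analyzer (Multimodal) Works

Use yt-dlp to fetch basic video metadata, including title, duration, and ID.
Extract subtitles through a multi-step process that works around common rate limits by locating the json3 subtitle URL.
Download the video at 720p to balance visual clarity against processing speed.
Use ffmpeg to extract keyframes, either at fixed intervals or, for dynamic UI content, via scene-change detection.
Synchronize each extracted frame with the corresponding transcript segment by timestamp.
Perform multimodal synthesis, comparing visual data (UI elements, text, actions) with audio data to produce structured output.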

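YouTube Video Analyzer (Multimodal) Configuration Guide

To get started with this skill, make sure ffmpeg, python3, and curl are installed on your system. You can install the main dependency with uv:

uv tool install yt-dlp

The skill works in a temporary directory to manage video and image assets: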

VIDEO_URL=""
WORK_DIR=$(mktemp -d /tmp/yt-analysis-XXXXXX)
mkdir -p "$WORK_DIR/frames"

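YouTube Video Analyzer (Multimodal) Data Architecture and Taxonomy

The skill manages data across several files to keep the channels cleanly synchronized:

Data component	Format	Description
Subtitle metadata	JSON	Raw subtitle data with exact millisecond timestamps.
Formatted transcript	TXT	Human-readable, AI-friendly timestamped segments.
Visual frames	JPEG	Keyframes extracted at specific intervals (for example, every 10-20 seconds).
Video asset	MP4	Source video file for local frame processing.
Sync mapping	Logical	Mapping from frame filenames to transcript segments.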

name: you@tube-video-analyzer
description: >
  Multimodal YouTube video analysis through both audio (transcript) and visual (frame extraction + image analysis) channels.
  Especially powerful for HowTo videos, tutorials, demos, and explainer videos where what is SHOWN (screenshots, UI demos,
  diagrams, code, physical actions) is just as important as what is SAID. Use this skill whenever a user wants to analyze,
  summarize, or create step-by-step guides from YouTube videos, or when they share a YouTube URL and want to understand
  what happens in the video. Triggers on requests like "Analyze this YouTube video", "Create a step-by-step guide from
  this video", "What does this video show?", "Summarize this tutorial", or any YouTube URL shared with analysis intent.
version: 1.0.0
metadata:
  openclaw:
    requires:
      bins:
        - ffmpeg
        - python3
        - curl
    emoji: "??"
    os:
      - linux
      - macos
    install:
      - kind: uv
        package: yt-dlp
        bins: [yt-dlp]

YouTube Video Analyzer — Multimodal

This skill performs deep analysis of YouTube videos through both information channels:

Audio channel: Transcript with timestamps (what is SAID)
Visual channel: Frame extraction + image analysis (what is SHOWN)

Most YouTube skills only extract transcripts. This skill closes the gap by synchronizing visual frames with spoken content, enabling accurate step-by-step guides where "click the blue button" is matched with the actual screenshot showing which button.

Workflow Overview

YouTube URL
    |
    +---> 1. Get metadata (title, duration, video ID)
    |
    +---> 2. Extract transcript (yt-dlp --dump-json + curl)
    |         -> Timestamped segments
    |
    +---> 3. Extract frames (yt-dlp + ffmpeg)
    |         -> Keyframes at strategic intervals
    |
    +---> 4. Synchronize frames <-> transcript
    |         -> Match frames to spoken content by timestamp
    |
    +---> 5. Multimodal analysis
              -> Read each frame image, combine with transcript
              -> Generate structured output

Step 1: Setup Working Directory

VIDEO_URL=""
WORK_DIR=$(mktemp -d /tmp/yt-analysis-XXXXXX)
mkdir -p "$WORK_DIR/frames"

Step 2: Get Video Metadata

yt-dlp --print title --print duration --print id "$VIDEO_URL" 2>/dev/null

This returns three lines: title, duration in seconds, video ID. Store these for later use.
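If the three lines are captured from a script rather than read by eye, they can be parsed as in the sketch below. The function name parse_metadata and the returned key names are hypothetical. The sketch assumes the duration line is numeric, which may not hold for live streams.

```python
def parse_metadata(output: str) -> dict:
    """Parse the title, duration, and ID lines printed by the yt-dlp command above."""
    lines = [line.strip() for line in output.strip().splitlines()]
    if len(lines) < 3:
        raise ValueError(f"expected title, duration, and id lines, got {len(lines)}")
    # The last two lines are duration and ID; anything before them is the title.
    title = " ".join(lines[:-2])
    return {
        "title": title,
        "duration_s": int(float(lines[-2])),  # raises ValueError if not numeric
        "id": lines[-1],
    }
```

For example, parse_metadata("My Tutorial\n754\nabc123XYZ_-\n") yields a dict with duration_s equal to 754.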

Step 3: Extract Transcript

IMPORTANT: Direct subtitle download via --write-sub frequently hits YouTube rate limits (HTTP 429). Use the reliable two-step method below instead.

Step 3a: Get subtitle URL from video JSON

yt-dlp --dump-json "$VIDEO_URL" 2>/dev/null | python3 -c "
import json, sys
data = json.load(sys.stdin)
auto = data.get('automatic_captions', {})
subs = data.get('subtitles', {})

# Priority: manual subs > auto subs. Prefer user's language, fallback chain.
for source in [subs, auto]:
    for lang in ['en', 'de', 'en-orig', 'fr', 'es']:
        if lang in source:
            for fmt in source[lang]:
                if fmt.get('ext') == 'json3':
                    print(fmt['url'])
                    sys.exit(0)

# Fallback: take first available auto-caption, get json3 URL
for lang in sorted(auto.keys()):
    for fmt in auto[lang]:
        if fmt.get('ext') == 'json3':
            url = fmt['url']
            # Remove translation param to get original language
            import re
            url = re.sub(r'&tlang=[^&]+', '', url)
            print(url)
            sys.exit(0)

print('NO_SUBS', file=sys.stderr)
sys.exit(1)
" > "$WORK_DIR/sub_url.txt"

Step 3b: Download and parse transcript

curl -s "$(cat "$WORK_DIR/sub_url.txt")" -o "$WORK_DIR/transcript.json3"

Verify it is valid JSON (not an HTML error page):

head -c 20 "$WORK_DIR/transcript.json3"
# Should start with { ; if it starts with < instead, the response is most likely an HTML error page (see Error Handling & Edge Cases)
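
The same check can be made stricter than inspecting the first bytes. The sketch below is an illustration only: the function name looks_like_json3 is hypothetical, and it assumes a valid json3 payload is a JSON object with an events list, which is the key the parsing step in Step 3c reads.

```python
import json


def looks_like_json3(raw: bytes) -> bool:
    """Return True if raw is a JSON object containing an 'events' list."""
    text = raw.decode("utf-8-sig", errors="replace").lstrip()
    if not text.startswith("{"):
        return False  # HTML error pages start with "<"
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False  # truncated or otherwise invalid JSON
    return isinstance(data.get("events"), list)
```

It could be called on the bytes of transcript.json3 before Step 3c runs, so that a blocked download is caught early.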


Step 3c: Parse json3 into readable timestamped segments
python3 -c "
import json

with open('$WORK_DIR/transcript.json3') as f:
    data = json.load(f)

for event in data.get('events', []):
    segs = event.get('segs', [])
    if not segs:
        continue
    start_ms = event.get('tStartMs', 0)
    duration_ms = event.get('dDurationMs', 0)
    text = ''.join(s.get('utf8', '') for s in segs).strip()
    if not text or text == '\n':
        continue
    s = start_ms / 1000
    e = (start_ms + duration_ms) / 1000
    print(f'[{int(s//60):02d}:{int(s%60):02d} - {int(e//60):02d}:{int(e%60):02d}] {text}')
" > "$WORK_DIR/transcript.txt"

Read $WORK_DIR/transcript.txt to get the full transcript with timestamps.

Fallback: No transcript available

If no subtitles exist at all, inform the user and proceed with visual-only analysis.

Step 4: Download Video and Extract Frames

Step 4a: Download video (720p is sufficient for frame analysis)

yt-dlp -f "bestvideo[height<=720]+bestaudio/best[height<=720]" \
       -o "$WORK_DIR/video.mp4" "$VIDEO_URL"

Step 4b: Get exact duration

DURATION=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 "$WORK_DIR/video.mp4")

Step 4c: Extract frames using adaptive interval strategy

Choose interval based on video length:

Duration	Interval	Approx. Frames	Rationale
< 5 min	10s	20-30	Dense enough for detailed analysis
5-20 min	20s	15-60	Good balance of coverage vs. volume
20-60 min	30-45s	30-120	Focus on key moments
> 60 min	60s	60-120+	Ask user if they want to focus on specific sections
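
The interval table can also be expressed as a small helper. This is a sketch: the function names are hypothetical, and the linear scaling inside the 30-45s band is an assumption, since the table gives only a range for 20-60 minute videos.

```python
def choose_interval(duration_s: float) -> int:
    """Pick a frame interval in seconds from the video duration, following the table."""
    minutes = duration_s / 60
    if minutes < 5:
        return 10
    if minutes < 20:
        return 20
    if minutes <= 60:
        # Assumption: scale linearly from 30s at 20 min to 45s at 60 min.
        return round(30 + 15 * (minutes - 20) / 40)
    return 60


def estimate_frames(duration_s: float, interval_s: int) -> int:
    """Approximate number of frames a fixed-interval extraction will produce."""
    return int(duration_s // interval_s)
```

The result of choose_interval would replace the 20 in the fps=1/20 filter shown in the ffmpeg example.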
# Example for a 5-20 minute video (interval=20):
ffmpeg -i "$WORK_DIR/video.mp4" -vf "fps=1/20" -q:v 3 "$WORK_DIR/frames/frame_%04d.jpg" 2>&1

For scene-change detection (software HowTos, UI demos):

ffmpeg -i "$WORK_DIR/video.mp4" \
       -vf "select='gt(scene,0.3)',showinfo" \
       -vsync vfr -q:v 3 "$WORK_DIR/frames/scene_%04d.jpg" 2>&1
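
Scene-change frames are not evenly spaced, so their timestamps have to come from the showinfo log that the command above prints. The sketch below pulls the pts_time values out of that log. The function names are hypothetical, and it assumes that the n-th showinfo line corresponds to the n-th scene_NNNN.jpg file, with numbering starting at 1.

```python
import re

# showinfo log lines contain a field such as "pts_time:14.4"
PTS_TIME = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def scene_timestamps(ffmpeg_log: str) -> list:
    """Return the pts_time of every frame reported by the showinfo filter, in order."""
    times = []
    for line in ffmpeg_log.splitlines():
        if "showinfo" not in line:
            continue  # skip progress lines and other ffmpeg output
        match = PTS_TIME.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


def scene_frame_map(ffmpeg_log: str) -> dict:
    """Map scene_0001.jpg, scene_0002.jpg, ... to timestamps in seconds."""
    times = scene_timestamps(ffmpeg_log)
    return {f"scene_{i:04d}.jpg": t for i, t in enumerate(times, start=1)}
```

To use it, the ffmpeg output would need to be saved to a file or captured by the calling process instead of only being shown on the terminal.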

Step 4d: Calculate timestamps for each frame

For fixed-interval extraction: frame N has timestamp (N-1) * interval seconds.
frame_0001.jpg -> 0:00
frame_0002.jpg -> 0:20
frame_0003.jpg -> 0:40
...
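
The (N-1) * interval rule can be applied directly to frame filenames. A minimal sketch, assuming the frame_%04d.jpg naming from Step 4c; both function names are hypothetical.

```python
import re


def frame_timestamp(frame_name: str, interval_s: int) -> int:
    """Timestamp in seconds of a fixed-interval frame such as frame_0003.jpg."""
    match = re.search(r"_(\d+)\.jpg$", frame_name)
    if match is None:
        raise ValueError(f"unexpected frame name: {frame_name}")
    frame_number = int(match.group(1))  # ffmpeg numbering starts at 1
    return (frame_number - 1) * interval_s


def format_mmss(seconds: int) -> str:
    """Format seconds as M:SS, matching the listing above."""
    return f"{seconds // 60}:{seconds % 60:02d}"
```

With an interval of 20, frame_0003.jpg maps to 40 seconds, which format_mmss renders as 0:40.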

Step 5: Synchronize Frames with Transcript

For each extracted frame:

Calculate the frame's timestamp in seconds
Find the transcript segment(s) covering that timestamp
Create a synchronized pair: {timestamp, transcript_text, frame_path}

This is done mentally or via a simple lookup; no external script is needed.
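
When a video has many frames, the lookup can still be scripted. The sketch below is one possible version, not part of the skill itself. It assumes the "[MM:SS - MM:SS] text" line format written by Step 3c and segments in chronological order. The function names, and the fallback to the most recent segment when a frame falls into a gap between captions, are assumptions.

```python
import re

# Matches lines written by Step 3c, e.g. "[00:15 - 00:42] Click the blue button"
SEGMENT = re.compile(r"^\[(\d+):(\d{2}) - (\d+):(\d{2})\] (.*)$")


def parse_transcript(lines):
    """Turn transcript.txt lines into (start_s, end_s, text) tuples."""
    segments = []
    for raw in lines:
        line = raw.strip()
        match = SEGMENT.match(line)
        if match:
            start = int(match[1]) * 60 + int(match[2])
            end = int(match[3]) * 60 + int(match[4])
            segments.append((start, end, match[5]))
        elif line and segments:
            # Continuation of a multi-line caption: append to the previous segment.
            start, end, text = segments[-1]
            segments[-1] = (start, end, f"{text} {line}")
    return segments


def sync_frame(timestamp_s, frame_path, segments):
    """Pair one frame with the transcript text spoken at its timestamp."""
    covering = [text for start, end, text in segments if start <= timestamp_s <= end]
    if not covering:
        # Assumption: in a gap between captions, use the most recent segment.
        earlier = [seg for seg in segments if seg[0] <= timestamp_s]
        covering = [earlier[-1][2]] if earlier else []
    return {
        "timestamp": timestamp_s,
        "transcript_text": " ".join(covering),
        "frame_path": frame_path,
    }
```

Calling sync_frame once per extracted frame produces the list of synchronized pairs that Step 6 consumes.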
Step 6: Multimodal Analysis

Step 6a: Read and analyze each frame

Use the Read tool (or view tool) to look at each frame image. For each frame, consider:

UI elements: Buttons, menus, dialogs, settings panels visible
Text on screen: Code, labels, error messages, URLs, terminal output
Diagrams/graphics: Charts, flow diagrams, architecture drawings
Physical actions: Hand positions, tool usage (for physical HowTos)
Changes: What changed compared to the previous frame?

Step 6b: Synthesize both channels

For each key moment, combine audio and visual:

Segment [TIMESTAMP]:
  SAID: "Click the blue button in the top right"
  SHOWN: Settings page screenshot, blue "Save" button highlighted
         in top-right corner, cursor pointing at it
  SYNTHESIS: -> On the Settings page, click the blue "Save" button
               in the top-right corner

Step 6c: Identify visual-only information

Flag moments where the visual channel provides information NOT present in audio:

Specific button names, menu paths, exact UI locations
Code that is shown but not read aloud
Error messages visible on screen
Before/after comparisons

Output Formats

Generate the appropriate format based on the user's request:

Format A: Step-by-Step Guide (most common)
# [Video Title] — Guide

## Step 1: [Action] (00:15)
[Description based on transcript + frame analysis]
> Visual: [What the screen/image shows at this point]

## Step 2: [Action] (00:42)
[...]

Format B: Comprehensive Summary with Visual Anchors
# [Video Title] — Summary

## Overview
[2-3 sentence summary of the entire video]

## Key Sections

### [Section Name] (00:00 - 02:30)
[Summary of this section]
- Key visual: [Description of what's shown]
- Key quote: "[Important spoken content]"

### [Section Name] (02:30 - 05:00)
[...]

## Key Takeaways
- [Takeaway 1]
- [Takeaway 2]

Format C: Technical Detail Analysis
Separate analysis of both channels plus discrepancy detection:
# [Video Title] — Technical Analysis

## Audio Channel Analysis
[What was said, key points, structure]

## Visual Channel Analysis
[What was shown, UI flows, code, diagrams]

## Channel Synchronization
[Where audio and visual complement each other]

## Visual-Only Information
[Important details only visible in frames, not mentioned in speech]

Error Handling & Edge Cases

Problem	Solution
HTTP 429 on subtitle download	Use --dump-json method (Step 3a). If curl also gets blocked, wait 10-15 seconds and retry with different User-Agent
No subtitles available at all	Proceed with visual-only analysis, inform user
Original audio language not in auto-captions list	The original language is the source; auto-captions are translations. Remove &tlang=XX from any auto-caption URL to get the original
transcript.json3 contains HTML instead of JSON	YouTube returned an error page. Wait 10s, retry with: curl -s --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" "$URL"
Video > 60 min	Ask user if they want to focus on specific time ranges or chapters
Poor video quality / blurry frames	Extract more frames at tighter intervals to compensate
Video is age-restricted or private	Inform user that the video cannot be accessed. Suggest using --cookies-from-browser if they have access
yt-dlp download fails	Try alternative format: -f "best[height<=720]" without separate audio+video streams
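
The wait-and-retry advice for blocked subtitle downloads can be sketched with the standard library. This is an illustration under assumptions: the function names are hypothetical, the single retry and the 10 second wait follow the table, and the fetcher and sleep parameters exist only so the logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request

# User-Agent string taken from the table above.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def fetch(url, user_agent=None):
    """Download url, optionally sending a custom User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def download_transcript(url, fetcher=fetch, wait_s=10.0, sleep=time.sleep):
    """Try once with the default agent, then wait and retry with USER_AGENT."""
    for attempt, agent in enumerate([None, USER_AGENT]):
        if attempt:
            sleep(wait_s)  # back off before the retry
        try:
            body = fetcher(url, agent)
        except urllib.error.URLError:
            continue  # covers HTTPError, including HTTP 429
        if body.lstrip().startswith(b"{"):
            return body  # looks like JSON rather than an HTML error page
    raise RuntimeError("subtitle URL did not return JSON after retrying")
```

If the function raises, the remaining option from the table is to proceed with visual-only analysis and tell the user.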
Cleanup

After analysis is complete, remove temporary files:

rm -rf "$WORK_DIR"

Tips for Best Results

Software HowTos: Use scene-change detection — UI transitions create clear visual breaks
Physical HowTos: Use tighter frame intervals (10-15s) — movements are subtler
Read the transcript first: Identify "interesting timestamps" before extracting frames. Look for phrases like "as you can see here", "let me show you", "on the screen" — these signal important visual moments
Context-aware frame analysis: When analyzing a frame, always provide the transcript context. The speaker often explains what's about to be shown
Batch frame reading: Read frames in batches of 8-10 to maintain context across sequential frames and detect visual changes
Always extract both channels in parallel: Start the video download while processing the transcript to save time
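
The "read the transcript first" tip can be automated with a simple phrase scan. A minimal sketch: the cue phrases are the three quoted in the tip, the function name is hypothetical, and plain case-insensitive substring matching is an assumption that will miss paraphrases.

```python
# Cue phrases quoted in the tip above; extend the tuple as needed.
CUE_PHRASES = ("as you can see here", "let me show you", "on the screen")


def interesting_lines(transcript_lines):
    """Return transcript lines whose text contains a visual cue phrase."""
    hits = []
    for line in transcript_lines:
        lowered = line.lower()
        if any(phrase in lowered for phrase in CUE_PHRASES):
            hits.append(line.strip())
    return hits
```

Each returned line still carries its "[MM:SS - MM:SS]" prefix, so the timestamps can guide where to extract extra frames.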



														
                            