Gemini Computer Use: Building Automated Browser Agents with Playwright

Author: Internet

2026-03-21

AI Tutorials

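What Is Gemini Computer Use?

Gemini Computer Use is a framework for building autonomous agents that control a browser the way a person would. It pairs the Gemini 2.5 model with Playwright in a continuous feedback loop: the agent observes the screen, decides on an action, and executes it in real time. Within the Openclaw Skills ecosystem, this integration gives developers a tool for vision-first web navigation and complex UI interactions.

The skill is most useful where traditional API-based automation is not an option. A structured agent loop turns screenshots into executable function calls, so the agent navigates dynamic web pages by what is actually rendered rather than by page markup. It also supports several browser channels and a confirmation step for actions flagged as risky.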

Download: https://github.com/openclaw/skills/tree/main/skills/am-will/gemini-computer-use

Installation and Download

1. ClawHub CLI

The fastest way to install the skill directly from the source.

npx clawhub@latest install gemini-computer-use

2. Manual installation

Copy the skill folder to one of the following locations:

Global: ~/.openclaw/skills/
Workspace: /skills/

Priority: workspace > local > built-in
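
The priority rule can be read as a first-match lookup over an ordered list of directories. The sketch below illustrates that idea only: `resolve_skill` is a hypothetical helper, not part of ClawHub or OpenClaw, and which directory counts as "local" or "built-in" is left to the caller because the text above does not say.

```python
from pathlib import Path
from typing import Iterable, Optional


def resolve_skill(name: str, search_roots: Iterable[Path]) -> Optional[Path]:
    """Return the first <root>/<name> directory that exists, or None.

    search_roots must be ordered from highest to lowest priority
    (assumption: workspace first, then local, then built-in).
    """
    for root in search_roots:
        candidate = Path(root).expanduser() / name
        if candidate.is_dir():
            return candidate
    return None
```

For example, `resolve_skill("gemini-computer-use", [Path("skills"), Path("~/.openclaw/skills")])` would prefer a workspace copy over a global one when both exist.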

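3. Install via prompt

Copy this prompt into OpenClaw to install the skill automatically.

Please install gemini-computer-use for me using Clawhub. If Clawhub is not installed yet, install it first (npm i -g clawhub).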

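Gemini Computer Use: Use Cases

Automating research tasks that involve navigating multiple websites and capturing visual data.
Running end-to-end regression tests on web applications without hand-writing scripts.
Interacting with legacy web platforms or SaaS tools that offer no public API.
Building autonomous shopping or booking assistants that need complex UI navigation.
Creating visual data scrapers that sidestep the difficulties inherent in traditional HTML parsing.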

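How Gemini Computer Use Works

Visual capture: the agent takes a high-resolution screenshot of the current browser state to serve as the model's visual input.
Goal processing: the user's prompt and the current screenshot are sent to the Gemini 2.5 model, which determines the next logical step.
Action generation: the model returns function calls that represent specific browser actions, such as clicking, typing, or scrolling.
Safety validation: if an action is flagged as risky, the system enters a safety decision loop and asks the user to confirm before executing it.
Playwright execution: validated actions are executed in the browser through Playwright.
Response iteration: the system sends the result (including the new URL and an updated screenshot) back to the model, and the loop continues until the task is complete.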
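
The action-generation and execution steps can be sketched as a small dispatcher. This is a minimal sketch, not the skill's actual code: the action names (`navigate`, `click_at`, `type_text_at`, `scroll_document`), their argument keys, and the 0-999 coordinate grid are assumptions based on how Google documents the Computer Use tool, and `execute_action` and `denormalize` are hypothetical helpers. The `page` argument is expected to be a Playwright sync-API `Page`.

```python
from typing import Any, Dict

# Assumed viewport; matches the 1440x900 screenshot size the skill reports.
SCREEN_WIDTH, SCREEN_HEIGHT = 1440, 900


def denormalize(value: int, size: int) -> int:
    """Map a coordinate on an assumed 0-999 grid to a pixel offset."""
    return int(value / 1000 * size)


def execute_action(page: Any, name: str, args: Dict[str, Any]) -> None:
    """Run one model-requested action on a Playwright sync-API Page."""
    if name == "navigate":
        page.goto(args["url"])
    elif name == "click_at":
        page.mouse.click(denormalize(args["x"], SCREEN_WIDTH),
                         denormalize(args["y"], SCREEN_HEIGHT))
    elif name == "type_text_at":
        # Focus the target element first, then type into it.
        page.mouse.click(denormalize(args["x"], SCREEN_WIDTH),
                         denormalize(args["y"], SCREEN_HEIGHT))
        page.keyboard.type(args["text"])
        if args.get("press_enter", False):
            page.keyboard.press("Enter")
    elif name == "scroll_document":
        # Only vertical scrolling is handled in this sketch.
        delta = 600 if args.get("direction") == "down" else -600
        page.mouse.wheel(0, delta)
    else:
        raise ValueError(f"unsupported action: {name}")
```

Raising on unknown action names keeps the loop from silently ignoring something the model asked for; a fuller implementation would report the failure back to the model instead.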

Gemini Computer Use: Configuration Guide

To start using the skill, first configure the environment by copying the example file and adding your API key:

cp env.example env.sh
# Edit env.sh and add your API key
source env.sh

Next, create a Python virtual environment and install the skill's dependencies:

python -m venv .venv
source .venv/bin/activate
pip install google-genai playwright
playwright install chromium

Run the agent by supplying a prompt and a start URL:

python scripts/computer_use_agent.py \
  --prompt "Search for the Openclaw Skills documentation" \
  --start-url "https://google.com" \
  --turn-limit 10
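
For readers who want to build a similar entry point, the three flags used above could be parsed roughly as follows. This is a sketch under assumptions: only the flag names come from the command above, while `parse_args`, the defaults, and the help strings are hypothetical and may not match `scripts/computer_use_agent.py`.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the flags shown in the example command (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Run a browser-control agent.")
    parser.add_argument("--prompt", required=True,
                        help="Task for the agent to carry out")
    parser.add_argument("--start-url", default="about:blank",
                        help="Page to open before the first screenshot")
    parser.add_argument("--turn-limit", type=int, default=10,
                        help="Maximum number of model turns")
    return parser.parse_args(argv)
```

argparse exposes `--start-url` as `args.start_url` and `--turn-limit` as `args.turn_limit`, the latter already converted to `int`.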

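Gemini Computer Use: Data Architecture and Taxonomy

The skill manages its operational data as a structured sequence of screenshots and function responses:

Data type	Description
Screenshot	1440x900 PNG image used as the model's visual input.
Function call	JSON object defining a browser action (such as `click` or `type`).
Environment config	Shell variables that select the browser (Chrome or Edge channel, or a custom executable such as Brave).
Session context	Iteration log of the screenshot -> action -> function response loop.
Safety flags	Configuration parameters that exclude specific risky UI actions.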
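
To make the "function call" row concrete, here is one way such a JSON object could be loaded into a typed record. The `name` and `args` keys and the `FunctionCall` and `parse_function_call` names are illustrative assumptions, not the skill's documented schema.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FunctionCall:
    """One browser action requested by the model (assumed shape)."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


def parse_function_call(raw: str) -> FunctionCall:
    """Parse a JSON object such as {"name": "click", "args": {...}}."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("name"), str):
        raise ValueError("function call needs a string 'name' field")
    return FunctionCall(name=data["name"], args=dict(data.get("args") or {}))
```

Validating the shape at the boundary means later stages can rely on `name` being a string and `args` being a dictionary.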

name: gemini-computer-use
description: Build and run Gemini 2.5 Computer Use browser-control agents with Playwright. Use when a user wants to automate web browser tasks via the Gemini Computer Use model, needs an agent loop (screenshot → function_call → action → function_response), or asks to integrate safety confirmation for risky UI actions.

Gemini Computer Use

Quick start

Copy the example env file, set your API key in it, then source it:

cp env.example env.sh
$EDITOR env.sh
source env.sh

Create a virtual environment and install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install google-genai playwright
playwright install chromium

Run the agent script with a prompt:

python scripts/computer_use_agent.py \
  --prompt "Find the latest blog post title on example.com" \
  --start-url "https://example.com" \
  --turn-limit 6

Browser selection

Default: Playwright's bundled Chromium (no env vars required).
Choose a channel (Chrome/Edge) with COMPUTER_USE_BROWSER_CHANNEL.
Use a custom Chromium-based executable (e.g., Brave) with COMPUTER_USE_BROWSER_EXECUTABLE.

If both are set, COMPUTER_USE_BROWSER_EXECUTABLE takes precedence.
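
The precedence rule above could be implemented as a small function that turns the two environment variables into Playwright launch options. This is a sketch, not the skill's code: `browser_launch_kwargs` is a hypothetical helper, and treating an empty variable as unset is an assumption.

```python
import os
from typing import Any, Dict, Mapping, Optional


def browser_launch_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build keyword arguments for Playwright's chromium.launch()."""
    env = os.environ if env is None else env
    executable = env.get("COMPUTER_USE_BROWSER_EXECUTABLE", "").strip()
    channel = env.get("COMPUTER_USE_BROWSER_CHANNEL", "").strip()
    if executable:
        # A custom executable wins when both variables are set.
        return {"executable_path": executable}
    if channel:
        return {"channel": channel}
    # Neither set: fall back to Playwright's bundled Chromium.
    return {}
```

With Playwright's sync API the result can be passed as `p.chromium.launch(**browser_launch_kwargs())`; `channel` and `executable_path` are existing launch options.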

Core workflow (agent loop)

Capture a screenshot and send the user goal + screenshot to the model.
Parse function_call actions in the response.
Execute each action in Playwright.
If a safety_decision is require_confirmation, prompt the user before executing.
Send function_response objects containing the latest URL + screenshot.
Repeat until the model returns only text (no actions) or you hit the turn limit.
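
The control flow of this loop can be sketched independently of the Gemini and Playwright APIs by injecting them as callables. Everything named here (`Action`, `ModelTurn`, `run_agent`, and the four callback parameters) is hypothetical and only mirrors the steps listed above. Stopping the run when the user declines a confirmation is one possible policy, chosen for simplicity.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Action:
    name: str
    args: Dict[str, Any] = field(default_factory=dict)
    safety_decision: Optional[str] = None  # "require_confirmation" when flagged


@dataclass
class ModelTurn:
    text: str = ""
    actions: List[Action] = field(default_factory=list)


def run_agent(
    goal: str,
    ask_model: Callable[[str, List[Dict[str, Any]]], ModelTurn],
    execute: Callable[[Action], None],
    observe: Callable[[], Dict[str, Any]],
    confirm: Callable[[Action], bool],
    turn_limit: int = 6,
) -> str:
    """Drive the screenshot -> function_call -> action -> function_response loop."""
    responses = [observe()]  # initial URL + screenshot
    for _ in range(turn_limit):
        turn = ask_model(goal, responses)
        if not turn.actions:  # text only: the model considers the task done
            return turn.text
        responses = []
        for action in turn.actions:
            needs_ok = action.safety_decision == "require_confirmation"
            if needs_ok and not confirm(action):
                return f"stopped: user declined {action.name}"
            execute(action)
            # One function_response per action, with the latest page state.
            responses.append({"name": action.name, **observe()})
    return "turn limit reached"
```

In a real agent, `ask_model` would wrap the Gemini call, `execute` would drive Playwright, and `observe` would return the current URL plus a fresh screenshot.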