
Browser Automation Agent

A Playwright-powered browser agent that takes natural language instructions and autonomously navigates websites to complete tasks. Uses vision models to understand page layouts and action models to interact with elements. Handles multi-step workflows like "Go to Amazon, find the cheapest USB-C hub with 4+ stars, and add it to cart."

GPT-4o Vision · Claude Sonnet
78% task completion · Multi-step navigation · Visual DOM understanding

Category

Automation

Status

Live

Tech Stack

Playwright · GPT-4o Vision · Python · LangGraph · Redis

Models

GPT-4o Vision · Claude Sonnet
Overview

Tell it what you want in plain English — "Find the cheapest flight from SFO to JFK next Friday" — and watch it navigate websites, fill forms, click buttons, and extract results autonomously. This experiment combines Playwright browser control with GPT-4o Vision for page understanding and LangGraph for multi-step planning, creating an agent that can complete real web tasks without any site-specific code.

Methodology

I built a 50-task benchmark spanning 5 categories: e-commerce (search, filter, add to cart), travel (flight search, hotel booking), information retrieval (Wikipedia lookups, documentation search), form filling (contact forms, applications), and multi-site workflows (price comparison across 3 sites). Each task was tested 3 times to measure consistency. The agent uses GPT-4o Vision to understand page layouts, identifies interactive elements via accessibility tree parsing, and plans multi-step actions through LangGraph.
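
Scoring follows directly from the 3-repetition protocol. A minimal sketch (task names and the helper functions are illustrative, not the project's actual harness) of how per-run results roll up into the headline completion rate and a per-task consistency check:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    category: str
    runs: list[bool]  # pass/fail for each of the 3 repetitions

def completion_rate(results: list[TaskResult]) -> float:
    """Fraction of (task, run) attempts that succeeded end-to-end."""
    attempts = [ok for r in results for ok in r.runs]
    return sum(attempts) / len(attempts)

def is_consistent(result: TaskResult) -> bool:
    """A task is consistent when all repetitions agree (all pass or all fail)."""
    return len(set(result.runs)) == 1
```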

Tech Stack

Playwright controls the browser with full CDP (Chrome DevTools Protocol) access. GPT-4o Vision understands page layouts and identifies interactive elements from screenshots. LangGraph manages the multi-step task planning and execution graph. Redis caches page states for error recovery and rollback.

Key Findings

The most important insights from this experiment.

1

78% task completion across 50 diverse tasks

The agent successfully completed 78% of benchmark tasks end-to-end. Success rate was highest on e-commerce (88%) and lowest on multi-site workflows (62%), where maintaining context across site transitions was challenging.

2

Accessibility tree > screenshot-only approaches

Combining vision (screenshot understanding) with accessibility tree parsing (structured element identification) improved task completion by 22% over vision-only approaches. The accessibility tree provides reliable element selectors that screenshots alone cannot.

3

Error recovery is the key differentiator

Implementing a "retry with different strategy" mechanism (e.g., switching from clicking a button to using keyboard navigation) recovered 15% of initially failed actions, boosting overall task completion from 65% to 78%.

4

CAPTCHAs and anti-bot measures are the ceiling

60% of failures were caused by CAPTCHAs, bot detection, or dynamic loading that changed the page after initial rendering. These are fundamental limitations of browser automation agents.

Architecture

The user provides a natural language goal. LangGraph decomposes it into a sequence of web actions (navigate, click, type, scroll, extract). For each action, Playwright captures a screenshot and accessibility tree, GPT-4o Vision identifies the target element and action type, and Playwright executes the interaction. After each step, the agent verifies the expected outcome by analyzing the resulting page. If verification fails, it triggers an error recovery loop that tries alternative interaction strategies.
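
Stripped of the LangGraph specifics, the per-action loop is execute → verify → recover. A plain-Python sketch with the browser and model calls injected as callables (all names hypothetical):

```python
from typing import Callable

def run_goal(plan: list[str],
             execute: Callable[[str], None],
             verify: Callable[[str], bool],
             recover: Callable[[str], bool]) -> bool:
    """Run planned web actions in order; after each, verify the page
    reached the expected state, falling back to the recovery loop
    (alternative interaction strategies) when verification fails."""
    for action in plan:
        execute(action)
        if not verify(action) and not recover(action):
            return False  # failed even after recovery: abort the task
    return True
```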

Results

78% task completion rate across 50 benchmark tasks. Average task completion time: 34 seconds for single-site tasks, 92 seconds for multi-site workflows. The error recovery mechanism rescued 15% of initially failed actions, lifting completion from 65% to 78%. The agent handled dynamic content, pop-ups, and cookie banners without site-specific code.

Challenges

Key technical challenges encountered during this experiment.

Challenge 1

Dynamic content and lazy loading

Elements that load after initial page render were invisible to the first screenshot. Implemented a "wait and re-capture" strategy that scrolls the page, waits for network idle, and takes multiple screenshots before declaring an element missing.
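
That strategy generalizes to a small retry loop: capture, check, scroll, repeat. A sketch with the Playwright-specific capture and scroll steps injected as callables (helper name hypothetical; the real agent waits for network idle between captures rather than sleeping):

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def wait_and_recapture(capture: Callable[[], T],
                       found: Callable[[T], bool],
                       scroll: Callable[[], None],
                       attempts: int = 3,
                       delay: float = 0.0) -> Optional[T]:
    """Re-capture the page state up to `attempts` times, scrolling between
    captures to trigger lazy-loaded content, before declaring the target
    element missing."""
    for _ in range(attempts):
        state = capture()
        if found(state):
            return state
        scroll()           # bring below-the-fold content into view
        time.sleep(delay)  # placeholder for a network-idle wait
    return None
```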

Challenge 2

Action ambiguity in complex UIs

Pages with multiple similar buttons (e.g., several "Add to Cart" buttons) confused the vision model. Added a two-step process: first identify all candidate elements, then use surrounding context (product name, price) to select the correct one.
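
The second step can be sketched as a context-overlap score over the candidate elements the vision model returned (field names like `nearby_text` and the helper itself are illustrative):

```python
def pick_candidate(candidates: list[dict], context_terms: list[str]) -> dict:
    """Among visually similar elements (e.g. several 'Add to Cart' buttons),
    pick the one whose surrounding text shares the most terms with the task
    context (product name, price, etc.)."""
    def score(element: dict) -> int:
        nearby = element.get("nearby_text", "").lower()
        return sum(term.lower() in nearby for term in context_terms)
    return max(candidates, key=score)
```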


Interested in working with Forward?

We build production AI systems and run experiments like this for teams who value rigorous engineering.