Test Management Blog – Test Management

AI Document Consistency and Reducing Conflicts

Bill Echlin — Thu, 06 Nov 2025 23:21:55 +0000

Document Consistency and Building the System That Prevents AI Conflict

Why Your AI Agent Keeps Changing Its Mind

One of the quickest ways to send your AI agent off track is to give it conflicting information. Conflicting information in files, memory, or context almost guarantees unreliable results. One run it’ll do X, the next run it’ll go off and do Y.

Now this might seem, on the surface, like quite an easy thing to avoid. However, when you’re getting AI agents to generate information in the first place, you can (most likely will) end up with a lot of data and files. And let’s face it — we don’t always read and review all of the content that’s generated.

As you create all these documents, you’re not just creating documents. You’re building an information system that shapes your AI’s behavior. That behavior becomes more unpredictable the more conflict it’s exposed to. Conversely, the more consistency you provide, the more predictability you’ll get.

The content in your system MUST be consistent if you want to build a good system.

Two Fast Exercises to See the Problem

I’ve designed two exercises so you can experience this immediately. You just need a browser and a claude.ai or ChatGPT account. Each exercise takes under 10 minutes.

Exercise 1: Simple personal fitness project (blindingly obvious conflicts)
Exercise 2: Test case development system (realistic professional scenario)

Each exercise focuses on creating WHAT, HOW, and WHO/WHEN/WHERE documents, then analyzing them for conflicts. More on this later.

To see what I’m talking about, try the following exercise.

Exercise 1: Simple Personal Fitness Plan

This exercise gets you to build a personal fitness plan. It’s based on creating 3 documents:

WHAT document for my fitness goals
HOW document for my fitness training approach
WHO/WHEN/WHERE context document for my fitness situation

This is a simple exercise with blindingly obvious discrepancies. The idea is to drive home the concept and technique … not actually help you run a 10K race!

You can do this entire exercise in a single browser session (with 4 tabs open). It’s designed to give you the “aha moment” in under 10 minutes.

The Setup: Open a Browser with 4 Tabs

Complete these actions:

open a browser
open 4 tabs
navigate to Claude.ai or ChatGPT in all 4 tabs.

Step 1: Create Your WHAT Document

Open claude.ai or ChatGPT in a browser (Tab 1) and use this exact prompt:

"Please create a WHAT document for my fitness goals. I want to:
- Lose 15 pounds over 6 months
- Complete a 10K race 
- Improve overall strength and muscle tone

Define what success looks like for each goal, what the finished state looks like, and what standards I need to meet. Format this as a clear and concise document I can reference."

Save this response as a document called what.md

Step 2: Create Your HOW Document

Open a second browser tab with claude.ai or ChatGPT (Tab 2) and use this exact prompt:

"Please create a HOW document for my fitness training approach. My methods are:
- Running 4 times per week building up distance gradually
- Strength training 3 times per week focusing on compound movements
- Calorie deficit of 500 calories per day through portion control

Describe how I'll approach training, how I'll structure my week, and how I'll execute this plan. Format this as a clear and concise document."

Save this response as a document called how.md

Step 3: Create Your WHO/WHEN/WHERE Document

Open a third browser tab with claude.ai or ChatGPT (Tab 3) and use this exact prompt:

"Please create a WHO/WHEN/WHERE context document for my fitness situation. My reality:
- I work standard 45-hour weeks with occasional evening meetings
- I have a 45-minute commute each way
- I enjoy exercise but struggle with early mornings
- I have a basic home gym setup (dumbbells, resistance bands, yoga mat)
- I typically go to bed around 11pm and wake at 7am
- I have family dinner commitments 4 evenings per week at 6:30pm

Describe my constraints, my current situation, and the resources and limitations I'm working with. Format this as a clear and concise document."

Save this response as a document called who-when-where.md

Step 4: Run Your Conflict Analysis

Open a fourth browser tab with claude.ai or ChatGPT (Tab 4):

Upload the following documents or paste them into your AI tool of choice.

what.md how.md who-when-where.md

Then enter the following prompt:

"I have three documents about my fitness plan. Analyze them for conflicts, contradictions, and inconsistencies.

Identify:
1. Goal conflicts (goals that work against each other)
2. Method mismatches (approaches that don't support the goals)
3. Reality gaps (constraints that make goals/methods impossible)
4. Resource conflicts (time/access issues)

Be specific about which parts of which documents conflict with each other."

From here you can work with your AI agent to create 3 updated documents that are consistent and work together.
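To make the kind of "reality gap" this analysis should surface a little more concrete, here is a minimal sketch of the arithmetic, using only figures from the three prompts above. It assumes, purely for illustration, one training session per day, evening-only training (the context prompt says early mornings are a struggle), and that family dinner evenings are unavailable. Weekend daytime slots are ignored, so treat the result as a rough signal rather than a verdict.

```javascript
// Figures taken from the HOW and WHO/WHEN/WHERE prompts in this exercise.
const plan = { runsPerWeek: 4, strengthSessionsPerWeek: 3 };
const context = { daysPerWeek: 7, familyDinnerEvenings: 4 };

// Assumption (not in the prompts): one session per day, evenings only.
const sessionsNeeded = plan.runsPerWeek + plan.strengthSessionsPerWeek;
const freeEvenings = context.daysPerWeek - context.familyDinnerEvenings;
const shortfall = Math.max(0, sessionsNeeded - freeEvenings);

console.log(
  `Sessions needed: ${sessionsNeeded}, free evenings: ${freeEvenings}, shortfall: ${shortfall}`
);
```

Under those assumptions the HOW document asks for more sessions than the context document leaves room for, which is exactly the sort of conflict the analysis prompt is designed to expose.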

This conflict analysis isn’t just debugging documentation for AI systems. In many ways it’s debugging your own decision-making process. The clearer your thought process is the clearer you can make it for an AI system.

Maybe I could even go as far as saying humans and AI agents share very similar fundamental requirements. Consistent information produces consistent behavior. Contradictory information, whether in documents, plans, or your own thinking, produces conflict.

Build consistency into your system, and you build reliability into your results.

From Document Management to Behavior Design

Most people think about documentation as storage. However, if you start thinking about this from a higher level, you start to understand that your documentation is the behavioral programming for your AI system.

Lower-Level Thinking: “I need to organize my project documents better”

Higher-Level Thinking: “I need coherent and consistent documentation so that conflicting signals are minimised”

This isn’t about being tidy. It’s about understanding that every document is a behavior modifier for your AI system. Conflicts create behavioral instability. Consistency creates predictable and powerful AI behavior.

Separating WHAT from HOW

After months of working with AI agents, and debugging when things are going wrong, I discovered this technique. It’s deceptively simple, but it’s the difference between AI that works and AI that works reliably.

The Architecture

You’ll see in that first exercise that I’ve structured my documents into distinct areas:

Domain A: The WHAT Documents (Templates/Examples/Specifications)

What the output should look like
What standards to follow
What patterns to match
What quality criteria to meet

Domain B: The HOW Documents (Process/Instructions/Methods)

How to approach the task
How to make decisions
How to handle edge cases
How to iterate and improve

Domain C: The WHO/WHEN/WHERE Context (Persistent Memory)

Who is involved (roles, expertise)
When things happen (schedules, triggers)
Where resources exist (tools, systems)
Why certain choices were made (decisions, history)

In many ways this approach helps ME as much as it helps the AI engine.

That’s because I can break down what I’m trying to do into components that are easier to manage and understand. When you separate WHAT from HOW, you’re creating clear lanes where different aspects of your information can be created with clarity and consistency.

One of the big reasons I do this is because it makes it easier to pick up on conflicts and build consistency.

Now let’s try this with something more realistic.

Exercise 2: The Test Case Creation System

In this exercise we create the document set that builds a system to create test cases for a software testing project. We’ll create the following documents:

WHAT : a template that shows what a test case should look like
HOW : a set of principles for creating good test cases
WHO/WHEN/WHERE : other constraints and useful context

The Setup: Open a Browser with 4 Tabs

Complete these actions:

open a browser
open 4 tabs
navigate to Claude.ai or ChatGPT in all 4 tabs.

Step 1: Create Your WHAT Document

Open claude.ai or ChatGPT in a browser (Tab 1) and enter the following prompt:

Create a test case template that shows the standard structure and format for writing software test cases. Include:
- All essential fields (ID, title, preconditions, steps, expected results, etc.)
- 2-3 concrete examples showing the template in use
- Clear formatting that makes it easy to replicate

Keep it practical and scannable.

Save this response to a file called test-case-format.md.

Step 2: Create Your HOW Document

Open claude.ai or ChatGPT in a browser (Tab 2) and use this exact prompt:

Define 5-7 core principles for writing effective test cases. For each principle:
- State the principle clearly (e.g., "Test one thing at a time")
- Explain why it matters in 1-2 sentences
- Give a brief example or counter-example

Focus on principles that lead to maintainable, clear, and valuable test cases.

Save this response to a file named test-principles.md.

Step 3: Create Your Context

For this one you can just create the markdown document with the following content.

You don’t need to get AI to generate anything here – just use this as it is.

# Project Context

## Team
- 4 developers with varying TDD experience
- QA lead reviews all test PRs
- Product owner reads test names for validation

## Environment
- JavaScript/Node.js project
- Jest test framework
- CI/CD runs tests on every commit
- Test coverage target: 80%

## History
- Previous tests were inconsistent
- Team agreed to define structure in last retro

Save this content to a file called project-context.md.

Step 4: Run the Consistency Analysis

Open claude.ai or ChatGPT in a browser (Tab 4).

Then copy and paste the contents of these three documents into the chat (one after another), or if your AI tool supports it, upload them as files:

test-case-format.md
test-principles.md
project-context.md

Then use this exact prompt:

Analyze these three test documentation files for conflicts or inconsistencies:
1. test-case-format.md (WHAT we build)
2. test-principles.md (HOW we build)
3. project-context.md (WHO/WHEN/WHERE constraints)

Specifically identify:
- BLOCKING conflicts: Where HOW contradicts WHAT
- TENSION points: Where following one might compromise another
- GAPS: What's undefined that could cause inconsistency

Then create a test case following all three documents and show me where you had to make judgment calls.
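If you want to rerun this check on other document sets, the prompt can be assembled from a list of files and their roles. The following is only a sketch: `buildConsistencyPrompt` and the `file`/`role` field names are hypothetical helpers invented for illustration, and the wording is lifted from the prompt above (minus the final "create a test case" instruction).

```javascript
// Hypothetical helper: rebuilds the analysis prompt for any set of documents.
function buildConsistencyPrompt(docs) {
  // One numbered line per document, e.g. "1. test-case-format.md (WHAT we build)".
  const listing = docs.map((doc, i) => `${i + 1}. ${doc.file} (${doc.role})`);
  return [
    'Analyze these documentation files for conflicts or inconsistencies:',
    ...listing,
    '',
    'Specifically identify:',
    '- BLOCKING conflicts: Where HOW contradicts WHAT',
    '- TENSION points: Where following one might compromise another',
    "- GAPS: What's undefined that could cause inconsistency",
  ].join('\n');
}

const analysisPrompt = buildConsistencyPrompt([
  { file: 'test-case-format.md', role: 'WHAT we build' },
  { file: 'test-principles.md', role: 'HOW we build' },
  { file: 'project-context.md', role: 'WHO/WHEN/WHERE constraints' },
]);
console.log(analysisPrompt);
```

Keeping the roles next to the file names makes it harder to forget which document is meant to be the WHAT and which the HOW when the set grows.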

You’ll probably find significant conflicts here – and that’s precisely the point!

You’ve just experienced what your AI experiences every time it reads your project documents.

Those contradictions between template structure and guiding principles? That tension between comprehensive documentation and incomplete examples? Your AI navigates these conflicts constantly, making different choices each run.

Build consistency into your process, not after it. Create documents sequentially, each informing the next.

You’re not organizing information – you’re architecting behavior.

You’re Building a Behavioral Operating System

This isn’t really about documents. It’s about understanding that you’re designing a system to control AI behavior.

Every document is a behavioral instruction. Every conflict is a defect in that system. Every consistency check is behavioral debugging.

When you think this way, you stop being a prompt writer and become a behavior architect.

The Challenge I Leave You With

Tomorrow, before you create your next AI prompt, ask yourself:

“What behavior am I programming?” Then check your documents for conflicts before your AI does.

The post AI Document Consistency and Reducing Conflicts appeared first on Test Management.

AI Experiment #5: Can Test Automation AI Learn From Its Own Failures?

Bill Echlin — Mon, 03 Nov 2025 22:08:32 +0000

Can this prompt-driven test automation system scale with complex applications using lessons learnt loops?

I wondered if AI could fail at automating a complex test case, learn from that failure, and succeed on the second try. Here’s my attempt at building a process with a feedback loop that achieves that.

The Question

Can this prompt-driven test automation system scale with complex applications using lessons learnt loops? I know from experience that AG Grid scenarios are really difficult to automate – drag-and-drop, row grouping, complex UI interactions. They’re automation nightmares. So what happens when AI comes up against these sorts of challenges?

What I’m looking at is scenarios where you encounter a complex application that defeats standard automation approaches, and where you can get Claude Code to complete some deep research and work out solutions that you can then build into a learning loop. The aim is a growing knowledge base of lessons that gets better over time, making each subsequent test easier to automate.

Let’s see if we can get this to work.

What I’m Using

Chrome DevTools MCP for browser automation
Claude Code with ultra-thinking capability for deep research
AG Grid example application (demanding complex automation interactions)
Test case markdown documents from video recordings
The deterministic prompt-driven automation framework from Experiment #4

I picked AG Grid specifically because I know it’s going to fail with the standard, simple approaches used in test automation. The drag-and-drop implementation in AG Grid is non-standard, the components are complex, and it’s exactly the kind of application that makes automation engineers question their career choices.

The Setup

Here’s how the Lessons Learnt Loop works:

The Core Philosophy:

Test Case (Markdown) → Discovery (AI Execution) → Failure Analysis → Lessons Document → Enhanced Discovery
           ↑                                                                              ↓
           └──────────────────── Success with Learned Patterns ─────────────────────────┘

Traditional test automation often fails when encountering complex UI patterns – and then you’re stuck. You either spend hours debugging selector strategies, or you give up and mark it as “manual only.” The Lessons Learnt Loop treats this as a learning problem instead. Your first execution attempt gathers data about what didn’t work. An ultra-thinking phase analyzes why it failed and documents solutions. The second attempt uses these lessons to succeed where the first attempt failed.

This shift in thinking has interesting implications. Just like a student learns from mistakes, our automation system can build knowledge about specific testing challenges. The AG Grid drag-and-drop that defeats standard automation becomes a documented, solved problem. The lessons learnt document becomes organizational knowledge that can be shared across teams. The result: automation that gets smarter with each challenge it encounters.
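The loop described above can be sketched as a small control function. This is a minimal illustration under stated assumptions, not the framework's actual implementation: `runLessonsLearntLoop`, the `discover` and `research` callbacks, and the result shape (`passed`, `failure`) are all hypothetical stand-ins for the AI-driven phases.

```javascript
// Hypothetical orchestration of the fail -> research -> retry loop.
// `discover` and `research` are injected so the AI-driven phases stay abstract.
function runLessonsLearntLoop({ discover, research, testCase, maxAttempts = 2 }) {
  const lessons = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Each discovery run only sees the test case plus the lessons so far.
    const result = discover(testCase, lessons);
    if (result.passed) {
      return { passed: true, attempts: attempt, lessons };
    }
    // A failure is turned into a documented lesson for the next run.
    lessons.push(research(result.failure));
  }
  return { passed: false, attempts: maxAttempts, lessons };
}

const outcome = runLessonsLearntLoop({
  testCase: 'TC-002',
  // Stub: fails until at least one lesson is supplied.
  discover: (testCase, lessons) =>
    lessons.length > 0
      ? { passed: true }
      : { passed: false, failure: 'drag started but drop did not register' },
  // Stub: turns a failure description into a lesson entry.
  research: (failure) => `Lesson for: ${failure}`,
});
console.log(outcome);
```

The only state carried between attempts is the lessons list, which mirrors the idea that the lessons document, not the agent's memory, is what makes the second run different.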

The Three Key Components:

Failure Detection and Analysis
- Identify exactly what failed and why
- Capture error patterns and behaviors
- Document the gap between expected and actual

Not all failures are created equal. An “element not found” error is different from “drag started but drop didn’t register.” This approach captures the nuances of complex failures – particularly important with frameworks like AG Grid that implement custom event handling. The failure analysis isn’t just logging errors; it’s understanding the underlying cause. This deep understanding is what enables the ultra-thinking phase to generate meaningful solutions.

Ultra-Thinking Research Phase
- Deep dive into framework documentation
- Analyze alternative approaches
- Generate multiple solution strategies
- Document findings in structured format

The ultra-thinking capability is like having a senior automation engineer research the problem for you. It doesn’t just try random alternatives – it systematically investigates the framework’s architecture, understands the implementation choices, and proposes solutions based on that understanding. For AG Grid, this meant discovering the custom event system and proposing API-based alternatives. This isn’t trial and error; it’s informed problem-solving.

Knowledge Integration
- Feed lessons back into discovery
- Apply learned patterns in execution
- Build reusable knowledge base

The lessons learnt document isn’t just a one-time fix – it becomes part of the automation’s knowledge base. Future test cases can reference these lessons, avoiding the same failures. Over time, you build a comprehensive understanding of your application’s quirks and complexities. This is organizational learning embedded in your automation system.

The Four Commands Enhanced:

Building on the Test Automation Compiler from Experiment #4, I’m using the same four commands but with lessons learnt integration:

/discover – Discovery execution (now accepts lessons learnt documents)
/learn – Extract automation patterns (incorporates lessons)
/generate – Create YAML from learnings (uses lesson-based strategies)
/validate – Validation execution (confirms lesson effectiveness)

The Experiment

I took an AG Grid test case through the complete learning loop process – expecting failure, researching solutions, and applying lessons.

Step 1: The Expected Failure

What I asked:

/discover TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md

What happened:

Claude Code attempted to execute the test case using standard Chrome DevTools MCP approaches. The test involved dragging column headers into the row grouping panel – a seemingly simple interaction that’s actually one of AG Grid’s most complex features.

The discovery phase captured:

Successful navigation to the AG Grid example
Correct identification of column headers
Successful initiation of drag events
Complete failure on the drop action

“It’s not behaving in the way our AI agent was expecting. It’s a non-standard implementation of the drag and drop components… It can drag, but it can’t actually drop.”

This wasn’t a surprise – I specifically chose AG Grid because I knew it would fail. AG Grid uses a custom drag-and-drop implementation that doesn’t respond to standard browser events. The drag appears to work visually, but the drop zone doesn’t accept the element. This is exactly the kind of complex scenario that makes automation engineers either write custom JavaScript handlers or give up entirely.

The key insight here is that the failure was informative. We learned exactly what doesn’t work and why – setting up the next phase perfectly.

Step 2: Ultra-Thinking Research

What I asked:

please can you examine why this drag-and-drop for the row grouping didn't work,
ultrathink and come up with solutions. Please write these solutions to the
file aggrid-lessons-learnt.md

What happened:

Claude Code spent five minutes in deep research mode. This wasn’t just error analysis – it was comprehensive investigation into AG Grid’s architecture and implementation choices.

The ultra-thinking phase discovered:

Root Cause: AG Grid doesn’t use native HTML5 drag-and-drop APIs
Implementation: Custom event handling with synthetic events
Complexity: Multiple ambiguous drop zones with custom hit detection
Event Sequence: Specific order of mouse events required that standard automation misses
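To illustrate why the two approaches can miss each other, here is a rough sketch contrasting the native HTML5 drag event family with a plain mouse-event sequence. The event names are standard DOM events; the exact sequence AG Grid expects is not given in the research output above, so `planMouseDrag`, its `steps` parameter, and the coordinates are hypothetical and only show the general shape of a mouse-driven drag.

```javascript
// Native HTML5 drag-and-drop event family (what a "standard" drag fires).
const html5DragEvents = ['dragstart', 'dragenter', 'dragover', 'drop', 'dragend'];

// Hypothetical planner for a mouse-event drag: press, several moves, release.
// It only builds a description of the events; it does not touch the DOM.
function planMouseDrag(from, to, steps = 5) {
  const events = [{ type: 'mousedown', x: from.x, y: from.y }];
  for (let i = 1; i <= steps; i++) {
    const t = i / steps; // fraction of the path covered so far
    events.push({
      type: 'mousemove',
      x: Math.round(from.x + (to.x - from.x) * t),
      y: Math.round(from.y + (to.y - from.y) * t),
    });
  }
  events.push({ type: 'mouseup', x: to.x, y: to.y });
  return events;
}

const plan = planMouseDrag({ x: 0, y: 0 }, { x: 100, y: 50 }, 4);
console.log(plan.map((e) => e.type).join(' -> '));
console.log('No overlap with:', html5DragEvents.join(', '));
```

A component that listens only for the mouse events in the plan would never react to the HTML5 events in the first list, and vice versa, which is consistent with a drag that starts visually but never registers a drop.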

The research went deeper, analyzing AG Grid’s documentation and finding:

The grid exposes a comprehensive API for programmatic control
Column state can be managed without UI interaction
Row grouping can be applied through applyColumnState() method
The API approach is actually more reliable than UI automation

The output was a comprehensive lessons learnt document with:

Detailed explanation of why standard approaches fail
Multiple solution strategies ranked by reliability
Code examples for API-based approaches
Fallback strategies if API isn’t available
Timing considerations for animations and state changes

This research phase is transformative. Instead of blindly trying different selectors or timing strategies (the traditional debugging approach), we’re building understanding of the underlying system. The lessons learnt document isn’t just a workaround – it’s a knowledge artifact that captures expert-level understanding of AG Grid automation.

Step 3: The Informed Second Attempt

What I asked:

/discover TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md
          aggrid-lessons-learnt.md

What happened:

I completely cleared the context – this was a fresh start with no memory of the previous failure. The only difference was the inclusion of the lessons learnt document.

This second discovery run was radically different:

Identified AG Grid’s API access pattern through React Fiber
Used gridApi.applyColumnState() instead of drag-and-drop
Applied row grouping programmatically in milliseconds
Verified the UI updated correctly
Expanded grouped rows using node.setExpanded(true)

“It’s been through all of the test steps. It says that it matches exactly what was expected in the test case.”

The transformation was impressive. What failed completely in the first attempt now executed flawlessly. The drag-and-drop interaction that would have taken complex custom code was replaced with a simple API call. The visual result was identical – the Country column appeared in the row groups panel, the data reorganized into groups, and the Belgium group expanded to show Isabella Kingston’s data.

But here’s the crucial point: the test case didn’t change. The markdown still described dragging and dropping. The AI learned to interpret “drag Country to row groups” as “apply row grouping by Country” and implement it the most reliable way possible.

That’s pretty impressive!
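For readers who want to see roughly what the API-based replacement looks like, here is a hedged sketch. `applyColumnState()` and `setExpanded()` are the calls named in the run above; `forEachNode`, `node.group`, `node.key`, the `country` column id, and the helper names `groupByColumn` and `expandGroup` are my assumptions for illustration. It also assumes you already hold a grid API object (the run obtained it through React Fiber, which is not shown here) and that your AG Grid version exposes these members on it.

```javascript
// Assumes `gridApi` is an AG Grid API object obtained elsewhere.
// Applies row grouping on one column and clears grouping on all others.
function groupByColumn(gridApi, colId) {
  gridApi.applyColumnState({
    state: [{ colId, rowGroup: true }],
    defaultState: { rowGroup: false },
  });
}

// Expands the group row whose key matches, e.g. 'Belgium'.
// Returns true if a matching group row was found.
function expandGroup(gridApi, groupKey) {
  let expanded = false;
  gridApi.forEachNode((node) => {
    if (node.group && node.key === groupKey) {
      node.setExpanded(true);
      expanded = true;
    }
  });
  return expanded;
}
```

In a test step this might be used as `groupByColumn(gridApi, 'country')` followed by `expandGroup(gridApi, 'Belgium')`, with the UI assertions from the test case run afterwards to confirm the panel and rows look as expected.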

Step 4: Complete the 4-Phase Workflow

What I asked:

/learn TestManagement\discoveries\TC-002-discovery-log.json aggrid-lessons-learnt.md

Then generate the YAML specification:

/generate TestManagement\learnings\TC-002-learnings.json
          TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md
          aggrid-lessons-learnt.md

Finally, validate the compiled automation:

/validate TestManagement\specs\TC-002-test-spec.yaml

What happened:

The complete 4-phase workflow (Discover → Learn → Generate → Validate) now incorporated the lessons learnt at every stage:

Learning Extraction Enhanced:
The /learn command didn’t just extract patterns from the discovery log – it cross-referenced them with the lessons learnt document. It understood that certain UI interactions should be replaced with API calls. It learned that timing for AG Grid animations needs special handling. It identified which selectors were reliable and which were fragile. The output wasn’t just a pattern list – it was an intelligent synthesis of discovered behavior and documented solutions.

YAML Generation Transformed:
The /generate command produced a radically different YAML specification than it would have without the lessons. Instead of drag-and-drop instructions, it contained API calls. Instead of complex event sequences, it had simple state changes. The YAML included:

API access patterns through React Fiber
Programmatic column state management
Direct node manipulation for row expansion
Optimized timing based on measured animations
Fallback strategies for unreliable elements

Validation Results:
The validation phase confirmed that our learning loop worked perfectly:

✅ Validation Complete - TC-002 PASSED

Outcome: ✅ PASS (13/13 steps successful)

📊 Key Metrics
| Metric                  | Result                        |
|-------------------------|-------------------------------|
| Total Steps             | 13 (Setup: 3, Test Steps: 10) |
| Successful Steps        | 13                            |
| Failed Steps            | 0                             |
| Assertions Passed       | 4/4                           |
| Deterministic Execution | ✅ YES                         |
| Spec Followed Exactly   | ✅ YES                         |
| Fallbacks Used          | ❌ NONE                        |
| Improvisation           | ❌ NONE                        |

The validation was twice as fast as discovery because it executed deterministically without exploration. Every step passed on the first attempt. No fallbacks were needed. The API approach was not just a workaround – it was demonstrably superior to UI automation.

The Run Twice Pattern Validated:
Both the discovery run (with lessons) and the validation run produced identical functional outcomes:

Belgium group expanded
Isabella Kingston data visible and correct
20 country groups created
Row grouping by Country active

This proves the YAML specification successfully captured ALL automation knowledge, including the lessons learnt. The compilation from markdown to YAML was complete and correct.

Patterns I Noticed

After watching this 4-phase workflow (Discover → Learn → Generate → Validate) complete, some clear patterns emerged:

Works well for:

Complex UI components that have API alternatives
Applications where standard automation approaches fail initially
Building reusable knowledge bases for specific frameworks (AG Grid, Kendo UI, etc.)
Scenarios where you need to “teach” the automation about application quirks
Catching up on technical debt where automation has been deferred

Gets messy with:

Applications without good API access (purely visual interfaces)
Scenarios where the UI approach is the only way to test user experience

Surprises:

The ultra-thinking capability produced genuinely useful, actionable research
The lessons learnt document was comprehensive enough to use immediately
Validation was 2x faster than discovery (530ms vs 1,127ms)
The API approach was actually MORE reliable than UI automation
The YAML specification captured ALL automation knowledge perfectly
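As a quick check of the "2x faster" figure, the ratio can be worked out from the two timings quoted above. This is just arithmetic on the reported numbers; the post gives one pair of timings, so it says nothing about run-to-run variance.

```javascript
// Timings as reported in this post (milliseconds).
const discoveryMs = 1127;
const validationMs = 530;

const speedup = discoveryMs / validationMs;      // how many times faster
const savedMs = discoveryMs - validationMs;      // absolute time saved
const savedPercent = (savedMs / discoveryMs) * 100;

console.log(`Speedup: ${speedup.toFixed(2)}x`);
console.log(`Saved: ${savedMs} ms (${savedPercent.toFixed(0)}% of discovery time)`);
```

So "2x" is a fair rounding of the measured ratio, with validation taking a little under half the discovery time.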

The Honest Take

Quick Verdict:
GREEN: “I’m going to start using this in production” – for catching up on technical debt

The Good:

Self-healing automation: Tests that learn from failure and fix themselves
Knowledge preservation: Lessons learnt documents capture expert knowledge permanently
Speed improvement: 2x faster execution after learning (530ms vs 1,127ms)
Reliability boost: API approach more stable than UI automation
Reusable solutions: Lessons apply to all similar test cases
Systematic process: Clear fail → research → learn → succeed workflow
Production-ready: 13/13 steps passing with deterministic execution

The Concerns:

API vs UI testing: The API approach doesn’t test exactly what the end user uses
Research time: Ultra-thinking takes 5+ minutes per complex problem (that’s quicker than me doing it!)
Framework-specific: Lessons are tied to specific frameworks (AG Grid, Kendo, etc.)
Not universal: Only works when alternative approaches exist (API, keyboard nav, etc.)

Would I use this?
Absolutely. I’m already planning to scale this up for production use. The ability to fail, learn, and succeed systematically is transformative for complex test automation. This isn’t just a clever experiment – it’s a practical solution to real problems I face daily.

Here’s where I’d use it immediately:

Legacy application automation where standard approaches have failed
Complex UI frameworks (AG Grid, Kendo UI, DevExpress) that defeat simple automation
Technical debt catch-up – finally automate those “too hard” test cases
Knowledge base building – create a library of lessons for the entire team
Onboarding acceleration – new team members inherit accumulated automation knowledge

When would I NOT use it?
There are clear boundaries to this approach:

Pure UI validation – when you must test exact user interactions, not API equivalents
Simple applications – overhead isn’t justified for basic forms and buttons
Visual regression testing – pixel-perfect validation needs different tools
One-off tests – the learning investment doesn’t pay off for single-use cases

Still Curious About

What I’m still curious about and want to test further…

Shared knowledge libraries: Can we build a centralized repository of lessons learnt that multiple teams can contribute to and benefit from? Imagine a GitHub repo of automation lessons for every major framework.
Other complex frameworks: Would this pattern work with Kendo UI, DevExpress, Telerik, or other notoriously difficult frameworks? Each has its own quirks that might benefit from documented lessons.
Lessons evolution: How many test cases need to fail and be fixed before the lessons document becomes truly comprehensive? Is there a point of diminishing returns?
Playwright MCP integration: Could we swap in Playwright MCP instead of Chrome DevTools for cross-browser testing? That would open up Firefox and Safari automation with the same learning approach.
Lessons versioning: How do we handle framework updates that invalidate lessons? Can we version lessons alongside application versions?

The Main Lesson

The Lessons Learnt Loop isn’t just error recovery – it’s systematic knowledge building for test automation.

Traditional automation fails and stays failed. This approach fails, learns why, documents solutions, and succeeds. The difference is profound:

Failure becomes valuable: Each failure generates knowledge that prevents future failures
Knowledge persists: Solutions are documented, not trapped in one engineer’s head
Complexity becomes manageable: Even AG Grid’s notorious drag-and-drop can be automated reliably
Teams scale better: Junior engineers inherit senior engineers’ solutions
Maintenance simplifies: When something breaks, check if there’s already a lesson for it

The pattern demonstrated here – fail → research → document → succeed – mirrors how human experts develop. We’re essentially teaching our automation system to become an expert through experience. The ultra-thinking phase acts like a senior engineer researching a problem. The lessons learnt document captures that expertise permanently. The second attempt applies that expertise successfully.

This approach transforms test automation from a brittle, high-maintenance burden into a learning system that gets smarter over time. Every challenge makes it stronger. Every failure makes it wiser.

Conclusion

This experiment shows that AI-driven test automation can learn from its own failures, at least for this AG Grid test case. The Lessons Learnt Loop isn’t just a workaround for complex scenarios – it’s a fundamental shift in how we approach automation challenges.

What started as an expected failure with AG Grid’s drag-and-drop became a demonstration of systematic problem-solving. The automation failed, researched why, documented solutions, and succeeded on the second attempt. The real success was turning an “impossible to automate” scenario into a solved problem.

The implications extend beyond this single test case. Every organization has applications with quirky behaviors that defeat standard automation. Every team has that list of “manual only” test cases they’ve given up on. The Lessons Learnt Loop offers a path forward: systematic learning that turns automation failures into documented solutions.

What makes this approach particularly powerful is its alignment with how organizations actually work. Senior engineers naturally build mental models of application quirks. The Lessons Learnt Loop captures that knowledge explicitly, making it shareable, reusable, and permanent. When that senior engineer leaves, their automation knowledge stays.

The four-phase workflow (Discover → Learn → Generate → Validate) now has an enhancement: front-load the earlier phases with lessons learnt. This creates a positive feedback loop where automation gets progressively smarter. Each failure contributes to future success. Each lesson learnt benefits all subsequent tests.

I’m moving this from experiment to production. The approach is mature enough, the results are compelling enough, and the need is definitely there. Those AG Grid tests that have been “manual only” for years? They’re about to become automated.

The Prompts I Actually Used

If you’re interested in trying this, these are the exact prompts I used:

# Initial discovery (expected to fail)
/discover TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md

# Ultra-thinking research after failure
please can you examine why this drag-and-drop for the row grouping didn't work,
ultrathink and come up with solutions. Please write these solutions to the
file aggrid-lessons-learnt.md

# Second discovery with lessons learnt
/discover TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md
          aggrid-lessons-learnt.md

# Learning phase with lessons
/learn TestManagement\discoveries\TC-002-discovery-log.json aggrid-lessons-learnt.md

# Generation phase with lessons
/generate TestManagement\learnings\TC-002-learnings.json
          TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md
          aggrid-lessons-learnt.md

# Validation to confirm it works
/validate TestManagement\specs\TC-002-test-spec.yaml

Resources

AG Grid Examples: https://www.ag-grid.com/example/
Chrome DevTools MCP: https://github.com/ChromeDevTools/chrome-devtools-mcp
Previous Experiments: Building on the framework from Experiment #4

Want to try this yourself?

It really was simple to set up once you understand the pattern: Fail → Research → Learn → Succeed. I’ll get this packaged up and released on GitHub very soon. Then you can try it and let me know what happens – I’m especially curious about which frameworks defeat your first attempt and whether ultra-thinking helps you too.

The post AI Experiment #5: Can Test Automation AI Learn From Its Own Failures? appeared first on Test Management.

AI Experiment #4: The Test Automation Compiler

Bill Echlin — Mon, 27 Oct 2025 22:17:55 +0000

Can you treat test cases like source code that compiles into automation?

I wondered if this might work… so I tried it.

The Question

How far can I take building a fully “prompt driven” test automation system?

In previous experiments, I’ve been exploring different aspects of AI-driven test automation:

Experiment #1: Creating test cases from screen recordings with FFMPEG
Experiment #2: Running test cases directly with Chrome DevTools MCP
Experiment #3: The “run twice” pattern – agentic discovery then deterministic YAML

This experiment brings it all together into a systematic approach I’m calling The Test Automation Compiler.

The core idea: Treat markdown test cases as source code that gets compiled into executable automation through a defined process – just like a programming language compiler turns source code into machine code.

I don’t honestly know if this is a good approach yet. Intuition just tells me it’s worth trying!

What I’m Using

Chrome DevTools MCP (configured from Experiment #2)
Claude Code
Financial Dashboard demo application
Markdown test case documents
Test Automation Compiler strategy document (version 2.0)

That last one is crucial – I took everything learned from Experiments #2 and #3, plus a document I developed on deterministic AI test automation, refined it with Claude Code’s help, and built a comprehensive strategy document that defines the entire compilation philosophy and process.

The Setup

Here’s how the Test Automation Compiler works:

The Core Philosophy:

Human Intent (Markdown) → Compiler (AI Discovery) → Machine Code (YAML) → Execution (Deterministic)
           ↑                                                                        ↓
           └─────────────────── Feedback Loop (Continuous Learning) ────────────────┘

Traditional test automation requires translating human test cases into code – a manual, error-prone process that creates two artifacts that need maintaining in parallel. The Test Automation Compiler approach treats this as a compilation problem instead. Your markdown test case is the source code. An AI discovery run acts as the compiler, learning implementation details through intelligent exploration. The generated YAML is the compiled output – optimized, deterministic and ready to execute. Not ready to execute in the traditional sense, though: ready to execute with an AI coding engine and an MCP connection to a browser.

This shift in approach could have profound implications. Just like a programming language compiler converts high-level code into machine instructions, this system converts human-readable test cases into a script that can be run. Much as modern compilers include optimization passes, the learning phase extracts the most efficient patterns from discovery. The feedback loop acts like profiling tools, identifying where the compiled code needs refinement. The result: a single source of truth (your markdown test case) that stays in sync with automation through systematic recompilation.

The Three Pillars:

Automation-Aware Test Case Creation
- Markdown test cases written with automation in mind
- Structured natural language (GIVEN/WHEN/THEN)
- Clear element identification
- Specific assertions
Not all test cases are created equal. A test case written as “Click the blue button” is human-readable but nightmarish to automate. A test case written as “CLICK button labeled ‘Submit’” is both human-readable AND automatable. This approach ensures test cases are written in a structured format that humans can read naturally while AI can parse reliably. It’s similar to BDD but not as prescriptive: an automation-aware format that gives you “compiler-friendly” test documentation.
Run Twice Pattern (from Experiment #3)
- Run 1: Discovery – AI explores and learns
- Run 2: Validation – YAML executes deterministically
The first run is exploratory – the AI coder with MCP DevTools connection tries multiple approaches, measures actual timing, discovers the most reliable selectors, and logs everything. Maybe a bit like a compiler’s analysis phase, understanding the structure before generating code. The second run validates that what was learned actually works deterministically. This two-phase approach separates the intelligent discovery (which can be non-deterministic) from the production execution (which must be deterministic). You get the benefits of AI exploration without the unreliability of agentic execution in production.
Intelligent Feedback Loop
- Learn from execution results
- Update YAML for implementation changes
- Update markdown for business logic changes
- Continuous improvement
Over time, the system learns from failures and successes. When a selector breaks, the feedback loop updates the YAML without touching the markdown test case. When business logic changes, the markdown is updated and the YAML recompiled. This feedback loop identifies and undertakes maintenance for your automated tests. The system gets smarter with every execution, learning which patterns work and which need adjustment.
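
To make the first pillar a bit more concrete, here is a minimal sketch of what “compiler-friendly” could mean in practice. It assumes a hypothetical step grammar of the form VERB, element type, then a quoted label. The grammar, the parse_step function and the field names are illustrative assumptions, not the format used in this experiment.

```python
import re
from typing import Optional

# Hypothetical step grammar: VERB <element type> labeled '<label>'
STEP_PATTERN = re.compile(
    r"^(?P<verb>[A-Z]+)\s+(?P<element>\w+)\s+labeled\s+['\"](?P<label>[^'\"]+)['\"]$"
)


def parse_step(line: str) -> Optional[dict]:
    """Return a structured step, or None if the line is too vague to parse."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


# A structured step parses into verb, element and label.
print(parse_step("CLICK button labeled 'Submit'"))
# A free-form step does not match the grammar, so it is rejected.
print(parse_step("Click the blue button"))
```

The point of the sketch is the rejection path: a step that cannot be parsed is a signal to rewrite the test case rather than something to guess at later.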

That’s the idea … I still have a bit to work out and finish off here. Watch out for Experiment #5.

The Four Commands:

I built four Claude Code slash commands that implement the compilation pipeline:

/discover – Discovery execution (Run 1)
/learn – Extract automation patterns
/generate – Create YAML from learnings
/validate – Validation execution (Run 2)
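
One way to picture the four commands is as a chain where each stage consumes the artifacts produced before it. The sketch below is only a model of that chain in Python, not how the slash commands are implemented. The file names come from the steps described in this post, except "validation-report", which is a placeholder name assumed for illustration.

```python
# Each stage: (command, input artifacts, output artifact)
PIPELINE = [
    ("/discover", ["TC-001-add-investment-account.md"], "discovery-log.json"),
    ("/learn", ["discovery-log.json"], "learnings.json"),
    ("/generate", ["learnings.json", "TC-001-add-investment-account.md"], "test-spec.yaml"),
    ("/validate", ["test-spec.yaml"], "validation-report"),  # placeholder name
]


def check_pipeline(pipeline, sources):
    """Confirm every stage only consumes artifacts that already exist."""
    available = set(sources)
    for command, inputs, output in pipeline:
        missing = [name for name in inputs if name not in available]
        if missing:
            raise ValueError(f"{command} is missing inputs: {missing}")
        available.add(output)
    return available


artifacts = check_pipeline(PIPELINE, ["TC-001-add-investment-account.md"])
print(sorted(artifacts))
```

Running the stages out of order raises an error, which mirrors the idea that compilation has defined inputs and outputs at every step.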

The Experiment

I took a markdown test case through this complete 4-step compilation process.

Step 1: Discovery Execution

What I asked:

/discover TC-001-add-investment-account.md

What happened:

Claude Code took the markdown test case and executed it using Chrome DevTools MCP. But this wasn’t just a simple run – it was a discovery session designed to learn everything about automating this test:

Discovery captured:

Multiple selector strategies attempted
Successful selectors with confidence scores
Timing requirements measured in milliseconds
Screenshots at each step
Patterns in the application behavior
Modal animations (600ms discovered)
Form validation delays (1000ms required)
Success indicators and their duration

The output: discovery-log.json – a comprehensive record of everything learned during the first run.

This discovery run analyzes the application’s structure, parses the UI patterns, and measures the real-world behavior. It’s not just blindly executing steps; it’s building an internal model of how this application works so it can generate optimal automation instructions. The discovery log is a structured representation of everything needed for the YAML script generation.
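
Here is a rough sketch of what a single entry in a discovery log might hold. The actual schema of discovery-log.json isn’t shown in this post, so the class names, field names and selector strings below are assumptions for illustration. Only the 600ms modal animation figure comes from the discovery run.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SelectorAttempt:
    selector: str      # hypothetical selector string
    succeeded: bool
    confidence: float  # 0.0 to 1.0


@dataclass
class DiscoveryStep:
    description: str
    attempts: list = field(default_factory=list)
    wait_ms: int = 0   # measured delay before the step became reliable


step = DiscoveryStep(
    description="Open the Add Account modal",
    attempts=[
        SelectorAttempt("text=Add Account", True, 0.9),
        SelectorAttempt("#add-btn", False, 0.2),
    ],
    wait_ms=600,  # modal animation time reported by the discovery run
)

# asdict() converts nested dataclasses too, so the result is JSON-ready.
log_entry = asdict(step)
print(json.dumps(log_entry, indent=2))
```

Keeping failed attempts alongside successful ones matters here, because the later learning step needs both to judge which selectors are reliable.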

The key insight here is that this first run is intelligent exploration, not just execution.

Step 2: Learning Extraction

What I asked:

/learn discovery-log.json

What happened:

Claude Code analyzed the discovery log and extracted automation patterns specific to this application:

Learnings extracted:

Most reliable selectors identified
Optimal wait strategies (immediate vs timed)
Element interaction patterns
Data transformation rules
Application-specific quirks
Best practices for this UI

The output: learnings.json – distilled intelligence ready for YAML generation.

This learning extraction step optimises the test automation specifically for our test case and our application. It looks at all the attempted selectors and picks the most reliable ones. It analyzes timing patterns and calculates optimal wait strategies. It identifies application-specific behaviors (like a 600ms modal animation) that need special handling. This is more than just data aggregation – it’s AI pattern recognition that extracts reusable automation knowledge from raw execution data. The learnings become the “optimisation rules” that guide YAML generation (the next step).
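
In the experiment this extraction was done by Claude Code, not by hand-written rules. Still, the kind of decision it makes can be sketched in a few lines. The example below assumes a simple success-rate rule for selectors and a safety margin on the slowest measured delay. Both rules, the function names and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def pick_selector(attempts):
    """Pick the selector with the highest success rate across attempts.

    `attempts` is a list of (selector, succeeded) tuples.
    Ties go to the selector that was seen first.
    """
    stats = defaultdict(lambda: [0, 0])  # selector -> [successes, tries]
    for selector, succeeded in attempts:
        stats[selector][1] += 1
        if succeeded:
            stats[selector][0] += 1
    # max() returns the first maximal item, and dicts keep insertion order.
    return max(stats, key=lambda s: stats[s][0] / stats[s][1])


def recommended_wait_ms(measured_ms, margin=1.2):
    """Take the slowest observed delay and add a safety margin."""
    return round(max(measured_ms) * margin)


attempts = [
    ("#add-btn", False),
    ("text=Add Account", True),
    ("#add-btn", True),
    ("text=Add Account", True),
]
best = pick_selector(attempts)
wait = recommended_wait_ms([520, 600, 580])
print(best, wait)
```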

Step 3: YAML Generation

What I asked:

/generate learnings.json TC-001-add-investment-account.md

What happened:

Claude Code combined:

The learnings (implementation details)
The original markdown test case (requirements)

To generate a deterministic YAML specification.

The YAML included:

Structured steps using discovered selectors
Optimized waits based on measured timing
Proper sequencing learned from discovery
Fallback strategies for flaky elements
Confidence scores for each selector
Execution notes documenting quirks

The output: test-spec.yaml – the “compiled” version of the markdown test case.

This is the YAML script generation phase – where high-level requirements (markdown) and implementation intelligence (learnings) combine to produce a script that can be followed by AI coding tools. The YAML specification includes everything needed for deterministic execution: precise selectors with confidence scores, optimized wait times based on measured behavior, proper sequencing learned from discovery, and even fallback strategies for unreliable elements. It’s structured, readable, and maintainable – just like well-written code. But unlike hand-written automation, this is a “compiled” output based on actual observed behavior.

The YAML reads like hand-crafted automation code, but it was generated entirely from the discovery process. It’s deterministic, optimized, and includes all the learned implementation details.
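
As a rough illustration of the generation idea, the sketch below turns a list of learned steps into YAML text using only the standard library. The key names, selectors and layout are assumptions, since the real test-spec.yaml isn’t reproduced here. Values are written with json.dumps because JSON scalars are valid YAML, which keeps a selector starting with # from being read as a comment.

```python
import json


def to_yaml(spec_id, steps):
    """Render step dicts as simple YAML text (flat scalar values only)."""
    lines = [f"test_id: {json.dumps(spec_id)}", "steps:"]
    for step in steps:
        first = True
        for key, value in step.items():
            prefix = "  - " if first else "    "
            lines.append(f"{prefix}{key}: {json.dumps(value)}")
            first = False
    return "\n".join(lines)


steps = [
    {"action": "click", "selector": "text=Add Account", "confidence": 0.9, "wait_ms": 600},
    {"action": "fill", "selector": "#account-name", "value": "Main Investment", "wait_ms": 0},
]
yaml_text = to_yaml("TC-001", steps)
print(yaml_text)
```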

Step 4: Validation Execution

What I asked:

/validate test-spec.yaml

What happened:

Claude Code executed the generated YAML specification to validate that the compilation was successful.

Validation performed:

Executed all steps deterministically
Compared results to discovery run
Verified repeatability
Scored reliability across multiple dimensions
Identified maintenance points
Confirmed CI/CD readiness

Validation Results:

Dimension	Score	Notes
Reliability	10/10	All steps execute consistently
Timing	10/10	Optimal waits discovered
Data Handling	10/10	Correct transformations
Determinism	10/10	No randomness in execution
Maintainability	10/10	Well-structured, documented

Overall Assessment: Production-ready, fully deployable to CI/CD

Maintenance Notes: Selectors need monitoring for UI changes (which is true for any automation).

The validation step completes the compilation cycle by proving the generated YAML actually works. It’s comparing results against the discovery run, checking for deterministic behavior, confirming there’s no randomness in execution. The scoring system provides objective metrics across multiple dimensions, giving you confidence the automation is production-ready. This isn’t just a binary pass/fail – it’s a comprehensive quality assessment that identifies potential maintenance points before they become problems. The 10/10 scores across all dimensions mean the compilation was successful: the YAML faithfully represents the markdown test case with optimal implementation.
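
The scores in the table were produced by Claude Code’s validation report, and the post doesn’t describe how it calculated them. As a sketch of how repeatability could be checked mechanically, the example below compares step outcomes across repeated runs. The scoring formula and step names are assumptions and not the method used in the report.

```python
def is_deterministic(runs):
    """True when every run produced the same ordered list of step outcomes."""
    return all(run == runs[0] for run in runs[1:])


def reliability_score(runs, out_of=10):
    """Share of passing steps across all runs, scaled to a 0..out_of score."""
    outcomes = [passed for run in runs for (_step, passed) in run]
    return round(out_of * sum(outcomes) / len(outcomes))


baseline = [("open modal", True), ("fill form", True), ("submit", True)]
runs = [list(baseline) for _ in range(3)]  # three identical validation runs

print(is_deterministic(runs), reliability_score(runs))
```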

Patterns I Noticed

After completing the 4-step compilation workflow:

Works well for:

Converting test documentation directly into automation without coding
Maintaining single source of truth (markdown test case)
Creating deterministic execution from agentic discovery
Systematic approach with clear, defined stages
Production-ready output with confidence scores
Auditable compilation process (all artifacts saved)

Might not work quite so well for:

Currently Chrome-only (a Chrome DevTools MCP limitation)
Needs extensive testing with more complex scenarios
Unknown performance in production GitHub workflows
No feedback loop implemented yet (planned next)

Surprises:

The 4-command workflow felt natural and systematic
10/10 reliability score on first attempt with no tweaking
Generated YAML was readable and maintainable by humans
The “compiler” concept works as both metaphor and reality
Validation report proactively identified future maintenance points
Discovery log captured quirks I wouldn’t have thought to document

The Honest Take

Quick Verdict:
AMBER: “Interesting . . . an approach that’s taking shape! Need to work in the feedback and learning layer!”

The Good:

Single source of truth: markdown test case is THE documentation
No coding required: entire workflow is prompt-driven
Systematic process: 4 clear steps with defined inputs/outputs
Production-ready: 10/10 reliability score, validated for CI/CD
Intelligent compilation: learns optimal selectors and timing automatically
Self-validating: system knows its own reliability and limitations
Auditable: every stage produces artifacts for review

The Concerns:

Still needs extensive testing with complex scenarios (multi-step flows, error cases)
Chrome-only currently (Chrome DevTools MCP limitation)
Unknown reliability in production CI/CD workflows
Feedback loop not yet implemented (continuous learning needed)
Need to test with UI changes to verify maintenance approach

Would I use this?
Maybe – the approach is solid and the 4-step workflow makes intuitive sense. I need to:

Validate with more complex tests
Run in actual CI/CD pipeline
Add the feedback loop
Test Playwright MCP for multi-browser

I think the foundation we have here is strong enough to convince me to invest more time in this approach.

The Main Lesson

The “Test Automation Compiler” concept isn’t just a metaphor – it’s a working proof of concept.

Treating test cases as source code that compiles into automation provides several benefits:

Clear mental model: Everyone understands compilers
Defined stages: Each step has specific inputs/outputs
Separation of concerns: Requirements (markdown) vs implementation (YAML)
Optimization opportunity: Learning phase extracts best practices
Validation built-in: Compiler verifies its own output

The three pillars work together:

Automation-Aware Creation ensures good input (well-structured test cases)
Run Twice Pattern provides intelligent translation (discovery → YAML)
Feedback Loop enables continuous improvement (not yet implemented but designed)

I think the concept and foundation are solid: from markdown to production-ready automation with a systematic compilation process and no code written.

Conclusion

This experiment demonstrates that the Test Automation Compiler approach isn’t just an interesting idea – it’s a practical, working proof of concept. The compiler metaphor proved to be more than just a convenient analogy; it’s a useful description of what’s happening. We’re taking human-readable test documentation (source code) and systematically transforming it through analysis, optimization, and code generation phases into deterministic, production-ready automation (executable machine code). When I say “executable machine code” I really mean a script that’s reliably executable by an AI coding engine with an MCP connection to a browser.

The four-command workflow provides a clear, repeatable process. A process that separates concerns: discovery for learning, extraction for optimization, generation for compilation, and validation for quality assurance.

What makes this approach fundamentally different from traditional test automation is the elimination of parallel artifacts. There’s no separate test documentation that falls out of sync with test code. There’s no manual translation step where implementation details get lost or misinterpreted. Although it could be argued that losing this translation step means you’re missing a human review and analysis step that traditionally would find issues – both in the test case and the application under test.

The markdown test case becomes THE documentation, and the YAML specification is automatically compiled from observed behavior rather than assumed implementation. When the application changes, you update the markdown and recompile – just like updating source code and rebuilding. When implementation details change (selectors, timing), the feedback loop updates the YAML without touching your documentation (at least that’s what I’m hoping once this stage is implemented).
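
Since the feedback loop isn’t implemented yet, the following is only a sketch of the routing rule described above: implementation changes touch the YAML, business changes touch the markdown and trigger a recompile. The change categories and the function name are hypothetical.

```python
# Hypothetical change categories; the real feedback loop is not built yet.
YAML_ONLY = {"selector", "timing"}
MARKDOWN_FIRST = {"business_logic", "new_step", "expected_result"}


def route_change(kind):
    """Return (artifact to edit, whether the YAML must be recompiled)."""
    if kind in YAML_ONLY:
        return ("yaml", False)
    if kind in MARKDOWN_FIRST:
        # Edit the source of truth, then regenerate the YAML from it.
        return ("markdown", True)
    raise ValueError(f"unknown change kind: {kind}")


print(route_change("selector"))
print(route_change("business_logic"))
```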

This single source of truth approach, combined with intelligent compilation and continuous learning, represents a genuinely interesting way to think about test automation maintenance and sustainability.

The post AI Experiment #4: The Test Automation Compiler appeared first on Test Management.

AI Experiment #3: When Claude Code Builds a Framework You Didn’t Ask For

Bill Echlin — Fri, 24 Oct 2025 07:27:07 +0000

Can you transform unreliable agentic tests into deterministic, repeatable tests using a “run twice” pattern?

I wondered if this works… so I tried it.

The Question

Why did Claude Code suddenly start building a test automation framework I didn’t ask for?

Towards the end of Experiment #2, things went sideways. I asked Claude Code to create a slash command for test automation. Instead, it started generating commands for converting tests to YAML, creating action libraries, building test runners… way beyond what I’d asked for.

Then I realized: I’d forgotten about a markdown document in my project folder. A document about “deterministic test automation” that I’d been exploring with Claude Code in a previous session. When I said “read all the files in this project,” it read that forgotten document.

And that context guided everything that followed.

What I’m Using

Chrome DevTools MCP (already configured from Experiment #2)
Claude Code
Financial Dashboard app (v0 app for testing)
Test case markdown documents
Hidden context: a forgotten “deterministic testing” document

That last one – the forgotten document – is the key to understanding this entire experiment.

The Setup

Here’s how I got this working (or rather, how it got itself working):

Installation:
Chrome DevTools MCP already configured from Experiment #2. Nothing new needed.

Configuration:
Discovered Claude Code had read a “deterministic testing” markdown file I’d forgotten about. That file explained concepts like:

Moving from agentic (AI-driven, somewhat unpredictable) to deterministic (scripted, reliable) testing
Creating action libraries
Building YAML specifications
Test execution frameworks

Starting Point:
Picking up where Experiment #2 unexpectedly pivoted – with three slash commands Claude Code had generated:

/run-test – Execute markdown test case using Chrome DevTools MCP
/convert-test-to-yaml – Create deterministic YAML specification
/create-action – Build reusable action library

The Experiment

I followed the three-step workflow Claude Code had created to see what would actually happen.

Try #1: Run the agentic test

What I asked:

/run-test test-cases/test-management/TC-001-add-investment-account.md

What happened:
Claude Code executed the markdown test case using Chrome DevTools MCP. It:

Navigated to the accounts dashboard
Clicked the “Add Account” button
Selected “Investment” account type
Filled in account name: “Main Investment”
Filled in description: “My Investment Account”
Submitted the form
Verified the account was created
Hovered over the account to show highlight effect
Completed all 13 test steps

But it didn’t just run the test. It also:

Created a test-evidence folder
Captured screenshots at key points
Generated a comprehensive test execution report with:
- Summary
- Execution details
- Step-by-step verification results
- Evidence captured
- Observations
- Issues (none found)
- Test status: PASSED

Wait, it created a complete test execution framework just from the markdown?

That’s what I wasn’t expecting. It built evidence capture, reporting, and verification tracking automatically.

Try #2: Convert to YAML specification

What I asked:

/convert-test-to-yaml test-cases/test-management/TC-001-add-investment-account.md

What happened:
Claude Code created a structured YAML specification. It:

Created a test-specs folder
Generated a YAML file with:
- Metadata (test ID, functional area, timestamp)
- Preconditions (verified text on page)
- Steps using Chrome DevTools MCP commands as primitives:
  - navigate – Go to URL
  - click-button – Click elements
  - select-dropdown – Choose options
  - fill-text – Enter data
  - verify-text – Assert expected content
- Assertions for each step
- Evidence collection points (screenshots)
- Custom actions needed
Generated a conversion notes document explaining the structure

The YAML spec captured everything from the agentic run – navigation steps, form interactions, verifications – but structured it as commands that could execute deterministically.

This is a “run twice” pattern – agentic first to explore, then deterministic for reliability.

The first run uses the LLM to figure out how to interact with the application. The second run captures that as a deterministic specification.
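
No runner has been built at this point, so the following is just a sketch of what deterministic execution of such a spec could look like. The primitive names (navigate, click-button, fill-text, verify-text) come from the generated YAML described above. The handlers are stand-ins that only record what they would do, and the URL and argument names are assumptions.

```python
def run_spec(steps, handlers):
    """Execute steps in order by looking up each primitive's handler.

    No model is involved: the same spec always produces the same calls.
    """
    results = []
    for step in steps:
        handler = handlers[step["action"]]  # KeyError means unknown primitive
        results.append(handler(**step["args"]))
    return results


# Stand-in handlers; a real runner would drive the browser instead.
handlers = {
    "navigate": lambda url: f"navigate to {url}",
    "click-button": lambda label: f"click {label}",
    "fill-text": lambda field, value: f"fill {field} with {value}",
    "verify-text": lambda text: f"verify {text}",
}

spec = [
    {"action": "navigate", "args": {"url": "https://example.com/accounts"}},
    {"action": "click-button", "args": {"label": "Add Account"}},
    {"action": "fill-text", "args": {"field": "Account name", "value": "Main Investment"}},
    {"action": "verify-text", "args": {"text": "Main Investment"}},
]
trace = run_spec(spec, handlers)
print("\n".join(trace))
```

The design choice worth noting is that all the intelligence lives in how the spec was produced. The runner itself is a plain lookup table.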

Try #3: Create reusable action library

What I asked:

/create-action "add a new account through the dashboard modal"

What happened:
Claude Code extracted a reusable action from the test case. It:

Created a reusable-actions folder
Generated an “add account” action with:
- Parameters (account type, name, description, initial value)
- Implementation steps
- Selectors for UI elements
- Error handling
- Success criteria

Here’s what’s interesting: the original test case was specific – “add an investment account with these exact values.” But the action it extracted was generic – “add any type of account with any values.”

That’s pretty impressive!

It’s building a test automation framework without being explicitly asked – extracting patterns and creating reusable components that could work for multiple test scenarios.
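
The jump from a specific test case to a generic action is essentially parameterization. A minimal sketch is below, using the four parameters listed above (account type, name, description, initial value). The field labels and the step layout are assumptions, not the action file Claude Code generated.

```python
def add_account(account_type, name, description="", initial_value=None):
    """Expand one reusable action into the concrete steps it stands for."""
    steps = [
        {"action": "click-button", "args": {"label": "Add Account"}},
        {"action": "select-dropdown", "args": {"label": "Account type", "option": account_type}},
        {"action": "fill-text", "args": {"field": "Account name", "value": name}},
    ]
    # Optional fields only produce steps when a value is supplied.
    if description:
        steps.append({"action": "fill-text", "args": {"field": "Description", "value": description}})
    if initial_value is not None:
        steps.append({"action": "fill-text", "args": {"field": "Initial value", "value": str(initial_value)}})
    steps.append({"action": "click-button", "args": {"label": "Submit"}})
    return steps


# The original test case becomes one call among many possible ones.
investment = add_account("Investment", "Main Investment", "My Investment Account")
savings = add_account("Savings", "Rainy Day", initial_value=100)
print(len(investment), len(savings))
```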

Patterns I Noticed

After running through the three-command workflow, some clear patterns emerged:

Works well for:

Converting agentic (unreliable) runs into deterministic (reliable) specifications
Extracting reusable actions from specific test cases
Capturing test evidence automatically (screenshots, execution reports)
Parameterizing test data for reuse
Building action libraries incrementally from executed tests

Gets messy with:

Need to understand what Claude Code actually built (lots of files generated across multiple folders)
Framework emerged from accidental context (the forgotten document) – not planned

Surprises:

Claude Code built an entire framework I wasn’t planning
“Run twice” pattern emerged: agentic → deterministic
Context from a forgotten document guided the entire implementation
Three-command workflow was created automatically
Framework suggested building a YAML test runner as the next step
It generalized specific test cases into reusable actions

The Honest Take

Quick Verdict:
AMBER: “Really interesting, no idea where this goes”

The Good:

Solves a real problem: converting unreliable agentic tests to reliable deterministic tests
Three-step workflow makes sense: run → convert → extract actions
Automatically captures test evidence and execution reports
Creates parameterizable, reusable test specifications
Could enable a test case library that runs consistently
Extracts reusable patterns from specific implementations

The Concerns:

Completely unplanned – emerged from forgotten context
Haven’t validated that the YAML specs actually work (no YAML test runner built yet)
Lots of generated files to understand and validate
Unknown if this scales to multiple test cases
Need to understand what Claude Code built before trusting it

Would I use this?
Maybe – need to complete the framework and validate it works.

The “run twice” pattern feels right: agentic for exploration, deterministic for reliability. If the YAML specs actually execute reliably, this could be valuable.

For what?

When you need reliable test execution but want the speed of agentic test creation
Building test automation frameworks from exploratory testing
Creating action libraries from executed test patterns
Capturing test specifications from manual testing sessions

When would I NOT use it?

When I need deterministic tests immediately (framework not complete yet)
For simple tests where plain Playwright would be faster
Until I understand what Claude Code actually built and validate it works

Still Curious About

What I’m still curious about and want to test further…

Can the YAML specs actually execute reliably?
How would the YAML test runner work? (Claude Code suggested this as next step)
Does this scale to dozens or hundreds of test cases?
Can I build an action library that covers most test scenarios?
Is the “run twice” pattern a general principle for AI test automation?
What happens if the application UI changes – can the YAML specs adapt?
Could this work for API testing or just UI testing?

The Main Lesson

Context matters enormously.

A forgotten document about deterministic testing guided Claude Code to build an entire framework I wasn’t expecting.

I was just exploring ideas with Claude Code – writing down thoughts about moving from agentic to deterministic testing, discussing action libraries, thinking about YAML specifications. That document sat in my project folder. I moved it to a temp location and forgot about it.

When I started Experiment #2 and said “read all the files in this project,” Claude Code read that document. And when things went off the rails, that context guided it to build exactly what I’d been theorizing about.

This suggests a pattern: have Claude Code help you explore ideas and concepts, keep those documents in context, and let that guide future implementation.

I’m working this out as I go along, but the “run twice” pattern that emerged – agentic for exploration, then deterministic for reliability – feels like a real insight.

The Prompts I Actually Used

If you’re interested in trying this, these are the exact prompts I used:

/run-test test-cases/test-management/TC-001-add-investment-account.md

/convert-test-to-yaml test-cases/test-management/TC-001-add-investment-account.md

/create-action "add a new account through the dashboard modal"

Important Note: These slash commands were generated by Claude Code at the end of Experiment #2. They were guided by a forgotten “deterministic testing” markdown document that was sitting in my project context.

If you want to try this approach, you’d need to:

Set up Chrome DevTools MCP (see Experiment #2)
Create markdown test case documents
Give Claude Code context about deterministic testing concepts
Let it build the framework (or create the slash commands yourself)

Resources

Chrome DevTools MCP: https://github.com/ChromeDevTools/chrome-devtools-mcp
Financial Dashboard (demo app): v0 Demo app
Experiment #2 (where this started): https://www.testmanagement.com/blog/2025/10/test-cases-automated-in-minutes-with-chrome-devtools-mcp/

Want to try this yourself?

Watch this space – this needs follow-up to see if it actually delivers on the promise. I need to:

Validate the YAML specs actually work
Build (or have Claude Code build) the YAML test runner
Test with multiple test cases
See if the action library approach scales

Let me know if you’ve experimented with agentic vs deterministic testing approaches – I’m especially curious about whether the “run twice” pattern resonates with your experience.

The post AI Experiment #3: When Claude Code Builds a Framework You Didn’t Ask For appeared first on Test Management.

AI Experiment #2: Test Case to Automated Execution with Chrome DevTools MCP

Bill Echlin — Wed, 15 Oct 2025 20:26:48 +0000

Can you actually go straight from test case doc to automated execution?

Thought I’d give this a go and see what kind of results I came up with.

The Question

“Can I just give Claude Code my test case document, connect the Chrome DevTools MCP, and have it run the tests? Like, actually run them in a real browser, without writing any test code or setting up a framework?”

Potential scenarios for this include…

When you have documented test cases but no automation yet – could this give you instant automated execution?
Maybe you could use this to figure out how to interact with difficult applications before building proper test frameworks.

Anyway, no point speculating about how to use this if it doesn’t work. We just need to know if it does work!

What I’m Using

Chrome DevTools MCP – Model Context Protocol for Chrome browser automation
Claude Code – AI coding assistant with MCP support
Demo financial dashboard – Test application
Test case from Experiment #1 – Markdown-formatted test case I’d previously created

I was curious whether the Chrome DevTools MCP could bridge the gap between human-readable test documentation and actual automated execution.

The Setup

Here’s how I got this working:

Installation:
First you need Node and NPM installed. Then install the Chrome DevTools MCP:

npm install -g chrome-devtools-mcp@latest

I had ffmpeg already installed (it comes with a Playwright install), but you’ll need it for screenshots.

Configuration:
Add the Chrome DevTools MCP to Claude Code’s configuration. You can do this via command line:

claude mcp add chrome-devtools npx chrome-devtools-mcp@latest

This updates your Claude Code configuration file to include the MCP integration.

Verification:
Start Claude Code and check that Claude can see the MCP using:

/mcp

You should see chrome-devtools listed and connected. When I ran this, Claude actually fired up an initial Chrome instance just to verify it had access.

Test Case:
I used the test case created in Experiment #1 – a markdown file with structured test steps for adding an investment account. I added the application URL to the preconditions so Claude would know where to navigate.

Starting Point:
So my starting point was a fresh Claude Code session with the test case file and Chrome DevTools MCP configured.

The Experiment

Try #1: Initial Prompt

I primed Claude Code:

“You are an expert test automation engineer. Our goal is to take a test case documented in markdown and run it directly as an automated test using the Chrome DevTools MCP.”

Then I gave it the path to my test case file.

What happened:
Claude Code reviewed the test case, created a todo list to track execution, and… opened Chrome automatically. It navigated straight to the dashboard page.

Then it clicked the “Add Account” button.

That’s pretty amazing. We’ve gone from a standard human-readable test case directly to automating it in a browser with no framework, no code – just directly via the Chrome DevTools MCP.

I’m genuinely surprised. It followed every step in my test documentation – clicked buttons, filled forms, validated results. All from my markdown test case.

It actually caught specific details – the account name I specified, the exact description text, even the modal form behavior and validation messages.

That’s pretty impressive!

Try #2: Verify Consistency

I ran the test again to see if it would work consistently.

What happened:
Second run was actually quicker. Claude had context from the first run and executed more efficiently.

What I learned:
The approach is repeatable. Once Claude understands your test case structure, subsequent runs are faster and smoother.

Try #3: Token Usage Check

I wanted to understand the cost implications, so I used a tool called CC Usage to monitor token consumption.

Result:
Each test run consumed about 1-2% of my session token allocation. That means roughly 50-100 test runs per session maximum.

Observation:
Not free, but not prohibitive either. For exploratory testing or figuring out automation approaches, this token cost is worth it. For CI/CD pipeline runs with hundreds of tests daily, you’d want a more deterministic coded approach.
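
The arithmetic behind that “50-100 runs” figure can be sketched in a few lines of TypeScript. This is only a rough estimate: it assumes every run costs the same share of the session allowance, and the function name is made up for illustration.

```typescript
// Rough estimate only: assumes each test run uses the same percentage
// of the session token allocation.
function maxRunsPerSession(percentPerRun: number): number {
  if (percentPerRun <= 0 || percentPerRun > 100) {
    throw new RangeError("percentPerRun must be greater than 0 and at most 100");
  }
  return Math.floor(100 / percentPerRun);
}

const runsAtOnePercent = maxRunsPerSession(1); // 100
const runsAtTwoPercent = maxRunsPerSession(2); // 50
console.log(`Between ${runsAtTwoPercent} and ${runsAtOnePercent} runs per session`);
```

In practice the number will be lower, because anything else you do in the same session also draws on the allocation.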

Patterns I Noticed

After testing this approach, some patterns emerged:

Works well for:

Initial automation of documented test cases
Exploratory test automation – figuring out how to interact with an application
Applications where you’re unsure of the automation approach
Generating automation insights before building test frameworks
Quick verification of test cases
Learning how to interact with complex UI components

Gets messy with:

Long exploratory sessions (with no clear objective)
Tests requiring exact timing control
High-volume test execution (token consumption)
Production CI/CD pipelines (need deterministic frameworks)
Tests with two-factor authentication or complex login flows

Surprises:

It actually followed test steps reliably
Chrome DevTools MCP has comprehensive browser control (list console messages, network requests, screenshots, drag, hover, fill forms, click, navigate)
The test execution logs were detailed and useful
You could potentially use all the telemetry from Chrome DevTools to generate Playwright scripts for more deterministic automation
Claude suggested creating a slash command to make this repeatable
It even started suggesting ways to make the approach more deterministic (YAML test case format)
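
To make the “telemetry to Playwright script” idea a bit more concrete, here is a minimal sketch of how a list of recorded steps could be turned into Playwright test source. I haven’t built this. The RecordedStep shape, the URL, and the field labels are all assumptions for illustration, and it is not what the Chrome DevTools MCP actually emits. The generated lines use standard Playwright calls (page.goto, getByRole, getByLabel, getByText, expect).

```typescript
// Hypothetical shape for a recorded step; real telemetry would need mapping into this.
type RecordedStep =
  | { kind: "navigate"; url: string }
  | { kind: "click"; role: "button" | "link"; name: string }
  | { kind: "fill"; label: string; value: string }
  | { kind: "expectText"; text: string };

// Turn one recorded step into one line of Playwright code.
function toPlaywrightLine(step: RecordedStep): string {
  switch (step.kind) {
    case "navigate":
      return `await page.goto(${JSON.stringify(step.url)});`;
    case "click":
      return `await page.getByRole(${JSON.stringify(step.role)}, { name: ${JSON.stringify(step.name)} }).click();`;
    case "fill":
      return `await page.getByLabel(${JSON.stringify(step.label)}).fill(${JSON.stringify(step.value)});`;
    case "expectText":
      return `await expect(page.getByText(${JSON.stringify(step.text)})).toBeVisible();`;
  }
}

// Wrap the generated lines in a complete Playwright test file.
function toPlaywrightTest(title: string, steps: RecordedStep[]): string {
  const body = steps.map((step) => `  ${toPlaywrightLine(step)}`).join("\n");
  return [
    `import { test, expect } from "@playwright/test";`,
    "",
    `test(${JSON.stringify(title)}, async ({ page }) => {`,
    body,
    "});",
  ].join("\n");
}

// Example steps; the URL and the "Account Name" label are placeholders.
const exampleSteps: RecordedStep[] = [
  { kind: "navigate", url: "https://example.com/dashboard" },
  { kind: "click", role: "button", name: "Add Account" },
  { kind: "fill", label: "Account Name", value: "Main Investment" },
  { kind: "expectText", text: "Main Investment" },
];
console.log(toPlaywrightTest("add investment account", exampleSteps));
```

The point of the sketch is that the agentic run does the exploring, and the output is a plain script that runs the same way every time.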

The Honest Take

Quick Verdict:
I’d use this tomorrow

Would I use this?
Absolutely. Not for everything, but definitely for specific scenarios.

For what?

Working out how to interact with difficult applications before building test automation
Initial automation of existing test cases to understand feasibility
Exploratory automation where you need quick feedback
Generating insights for building proper test frameworks
Applications with complex UI where you’re unsure of the automation approach

When would I NOT?

Production CI/CD pipelines (needs deterministic frameworks)
High-volume test execution (token consumption)

Token usage will add up with frequent runs, but you could test with lower-spec models to reduce costs. The real value is using this approach to figure out automation strategies, then converting to deterministic Playwright or similar frameworks for production use.

Still Curious About

What I’m still curious about and want to test further…

How consistent will this be over repeated runs? (It’s totally agentic – different models might give different results)
Will lower-spec models work as reliably with lower token costs?
Can I use all the Chrome DevTools telemetry to automatically generate Playwright scripts for deterministic automation?
Will this work with something like AG Grid or other complex component libraries? (That’ll be interesting!)
How would it handle applications with two-factor authentication or complex security?
Then I’m thinking… might this just work in a CI/CD pipeline?

The Prompts I Actually Used

If you’re interested in trying this, these are the exact prompts I used:

Please review the files in this project.

Please check that you have access to the chrome devTools mcp

Please tell me which commands you can use with the chrome devTools mcp

You are an expert test automation engineer

Our goal is to take a test case documented in markdown and run it directly as an automated test using the chrome DevTools mcp

[path to test case markdown file]

please guide me on how to take the lessons learnt from this session and create a claude code slash command

Resources

Chrome DevTools MCP: https://github.com/chrome-devtools/chrome-devtools-mcp
Demo Application: Vercel Demo App
Test Case from Experiment #1: Created using the video-to-test-case approach
Claude Code: https://claude.com/claude-code

Want to try this yourself?

It really was simple to get set up once you have the Chrome DevTools MCP installed and configured. Let me know what happens when you try it – I’m especially curious about how it handles complex component libraries like AG Grid or applications with difficult authentication flows.

The post AI Experiment #2: Test Case to Automated Execution with Chrome DevTools MCP appeared first on Test Management.

AI Experiment #1: Recording Videos and Converting Them to Test Cases

Bill Echlin — Wed, 15 Oct 2025 20:13:11 +0000

Can you actually record yourself testing and get test case documentation back automatically?

I wondered if this works… so I tried it.

The Question

I hate writing test case documentation, especially after I’ve already done the testing manually. So I wondered… what if I just record a video of myself testing and let Claude Code write the test cases? Could this actually work?

I’m kind of thinking of two potential scenarios if this works. First, when you have something to test but nobody has bothered to write a decent spec – you could record yourself exploring the application and generate test cases from that. Second, maybe you have a wireframe of the application (before it’s built) and you can record yourself stepping through the wireframe to create your test cases before development even starts.

Anyway, no point speculating about how to use this if it doesn’t work. So will it work?

What I’m Using

Screen recording – Windows Snipping Tool for capturing test execution
ffmpeg – To extract frames and process video
Claude Code – AI coding assistant with vision capabilities
Financial dashboard demo app – Test application for the experiment

I had ffmpeg already installed (it comes with a Playwright install). The approach would be to use ffmpeg to convert video into a sequence of screenshots that Claude Code could then process and convert into a structured test case in markdown format.

The Setup

Here’s how I got this working:

Installation:
ffmpeg was already on my system from Playwright. If you need it, you can get it from ffmpeg.org.

Configuration:
Created a test case template and some principles for writing good test cases (both in markdown). I used Claude.ai to generate these with these prompts:

Please create me a Test case template with standard fields (ID, title, preconditions, test steps, expected results)
Please create a Test case creation principles document with rules for writing good test cases

These documents provide context for Claude Code so it knows what structure to follow.

Test Case Template:
Standard template with placeholders for ID, title, description, preconditions, test steps, expected results, and actual results.
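
To show roughly what that template structure looks like as data, here is a small TypeScript sketch that models the fields listed above and renders them as markdown. The exact headings, the table layout, and the example steps are my assumptions for illustration. The real template Claude.ai generated will differ.

```typescript
// Fields taken from the template description; the layout below is an assumption.
interface TestStep {
  action: string;
  expected: string;
}

interface TestCase {
  id: string;
  title: string;
  description: string;
  preconditions: string[];
  steps: TestStep[];
  actualResults?: string; // filled in after execution
}

// Render a test case as a markdown document.
function renderTestCase(testCase: TestCase): string {
  const lines: string[] = [
    `# ${testCase.id}: ${testCase.title}`,
    "",
    "## Description",
    testCase.description,
    "",
    "## Preconditions",
    ...testCase.preconditions.map((item) => `- ${item}`),
    "",
    "## Test Steps",
    "| Step | Action | Expected Result |",
    "| --- | --- | --- |",
    ...testCase.steps.map((step, i) => `| ${i + 1} | ${step.action} | ${step.expected} |`),
    "",
    "## Actual Results",
    testCase.actualResults ?? "_Not yet executed_",
  ];
  return lines.join("\n");
}

// Example values; the steps are illustrative, not the generated test case.
const exampleTestCase: TestCase = {
  id: "TC-001",
  title: "Add investment account",
  description: "Verify that a user can add an investment account from the dashboard.",
  preconditions: ["The dashboard application is open"],
  steps: [
    { action: "Click Add Account", expected: "The add account form opens" },
    { action: "Enter the name Main Investment", expected: "The name field shows Main Investment" },
  ],
};
console.log(renderTestCase(exampleTestCase));
```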

Starting Point:
Fresh Claude Code project initialized with /init, the template files, and a 43-second screen recording of me adding an investment account to the financial dashboard application.

The Experiment

Try #1: Can Claude Process Video Directly?

I asked Claude Code: “You are an expert in software testing. My goal is to create a process where I can record a video of me using an application and have you turn it into a structured test case in markdown format. Are you able to process and understand video?”

What happened:
Claude said yes, it could process video. So I asked it to take my test-add-account.mp4 file and create a test case.

It came back with an error reading the file.

Reaction:
As expected – Claude can’t directly process video files. But this was the learning moment that led to the actual solution.

Try #2: Using ffmpeg to Extract Frames

I prompted Claude to use ffmpeg to convert the video into screenshots that it could then analyze.

What happened:
Claude immediately understood the approach and started generating the ffmpeg command. It suggested using 1 frame per second, but I thought that wouldn’t be enough detail.

Adjustment:
I asked it to use 2 frames per second instead: “Please continue with this approach but use fps of 2.”

Result:
It created a temporary frames directory and extracted 84 PNG images from my 43-second video (roughly 2 per second as requested).
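
I didn’t save the exact command Claude generated, so treat the following as a sketch of the idea rather than a record of what ran. It builds the arguments for ffmpeg’s fps filter and estimates how many frames to expect. The frames directory name and the frame_%04d.png pattern are assumptions.

```typescript
// Build the argument list for an ffmpeg frame-extraction run.
// The output directory and file name pattern are assumptions for illustration.
function buildFfmpegArgs(inputVideo: string, fps: number, outDir: string): string[] {
  return ["-i", inputVideo, "-vf", `fps=${fps}`, `${outDir}/frame_%04d.png`];
}

// Approximate number of frames: clip duration multiplied by the sampling rate.
function estimateFrameCount(durationSeconds: number, fps: number): number {
  return Math.round(durationSeconds * fps);
}

const ffmpegArgs = buildFfmpegArgs("test-add-account.mp4", 2, "frames");
console.log(`ffmpeg ${ffmpegArgs.join(" ")}`);
// prints: ffmpeg -i test-add-account.mp4 -vf fps=2 frames/frame_%04d.png
console.log(estimateFrameCount(43, 2)); // prints: 86
```

The estimate comes out at 86 against the 84 images I actually got, which is why I say “roughly”. The real count depends on the exact clip length. The output directory also needs to exist before ffmpeg runs.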

Try #3: Generate Test Case from Screenshots

Claude then read and analyzed all 84 screenshots, extracting test steps and expected results from the sequence of images.

What happened:
It created a complete test case file: TC-001-add-investment-account.md

The test case included:

Test case ID and title
Full description of what’s being tested
The actual values I’d used in the test (account type: “Investment”, name: “Main Investment”, and the exact description text I typed)
All the navigation steps
Expected results for each step
Screenshots attached as evidence

It actually caught specific details I’d entered – “Main Investment” as the account name, the exact description text I typed, even the modal form behavior and validation messages.

That’s pretty impressive!

Patterns I Noticed

After running this experiment, some patterns emerged:

Works well for:

Short, focused recordings (under 2 minutes)
Clear, deliberate actions (not clicking too fast)
Happy path flows where you’re showing the intended behavior
When you narrate what you’re doing (helps provide context)
Recording exploratory testing sessions you’re doing anyway
Creating test cases for applications with poor or missing specifications

Gets messy with:

Long exploratory sessions (with no clear objective)
Multiple test scenarios in one video (boundaries get confused)
Fast clicking and navigation (harder to capture what’s happening)
Expected vs actual results – it sees what happened, not necessarily what SHOULD happen

Surprises:

It caught UI elements I didn’t explicitly mention or focus on
Better at documenting steps than creating assertions/validations
Quality of test case depends heavily on recording quality and pace
Narration in video helps A LOT with context and understanding
The 2 frames per second rate worked well – enough detail without overwhelming
For production use, you’d create the frames, generate the test case, then delete the temporary screenshots

The Honest Take

Quick Verdict:
Yes, but not for everything

Would I use this?
Absolutely. Not as a replacement for all test documentation, but definitely for specific scenarios.

For what?

Quick documentation of manual tests I’m already doing anyway
Bug reproductions where I want both video evidence and structured test steps
Onboarding new team members – record once, get sharable test cases
When I’m too lazy to write it up properly (let’s be honest)
Applications where specs are missing or poor
Exploratory testing sessions where I want to capture what I discovered

When would I NOT?

Complex business logic verification where precise assertions matter
Anything needing precise validation points

Still needs refinement for longer test runs. A lot of potential once perfected. You’d want to add some validation loops to ensure the test cases being generated are accurate and complete.

Still Curious About

What I’m still curious about and want to test further…

Can it spot differences between expected and actual results if I point them out in the video?
Would audio narration improve the accuracy significantly?
Can it generate Given-When-Then format specifically for BDD test cases?
Will it work with something like AG Grid or other complex component libraries? (That’ll be interesting!)
What about testing mobile apps using screen recordings from devices?
How would it handle wireframe recordings to generate test cases before the app is built?

The Prompts I Actually Used

If you’re interested in trying this, these are the exact prompts I used:

In Claude.ai (for setup):

Please can you create me a test case template for software testing in markdown

Please can you create me a list of principles and rules that should be followed when creating good test cases for software testing

In Claude Code:

please review files in this project

You are an expert in software testing. My goal is to create a process where I can record a video of me using an application and have you turn it into a structured test case in Markdown format. Are you able to process and understand video?

please take the video test-add-account.mp4 and create a test case

[Request interrupted by user for tool use] please continue with this approach but use fps of 2

please could you take this process and the steps followed and create a claude code slash command named create-test-from-video

Resources

ffmpeg: https://www.ffmpeg.org/
Demo Application: Vercel Demo Application
Claude Code: https://claude.com/claude-code
Claude Code Cookbook (Vision best practices): https://github.com/anthropics/claude-cookbooks/blob/main/multimodal/best_practices_for_vision.ipynb

Want to try this yourself?

It really was simple to get set up once you have ffmpeg installed. Let me know what happens when you try it – I’m especially curious about whether audio narration improves the results and how it handles more complex applications.

The post AI Experiment #1: Recording Videos and Converting Them to Test Cases appeared first on Test Management.

An Intro to Agentic Testing : Overview and the PRP System

Bill Echlin — Sat, 30 Aug 2025 16:27:00 +0000

Contents : Module 1 Lesson 1

This Module

Module 1 – Building an Automation Framework with the PRP System

Lesson 1 : Overview and the PRP System (this document)
Lesson 2 : Pre-requisites and Setup
Lesson 3 : Step-by-step Build and Execute
Lesson 4 : Documentation and Debugging
Bonus Lesson : Getting Started with Claude Code

New Modules Coming Soon

Module 2: Developing your PRP Process to Add Tests
Module 3: Using MCPs to build a better Automation Framework
Module 4: PTP System for Test Case Creation
Module 5: Executing and Running Tests with PTP
Module 6: Agentic Test Maintenance

Introduction

The high-level architecture we’re working with is based around three core components: PRD, PRP, and the PTP.

PRD : Product Requirements Document — it defines WHAT to build. In our case this will be our “Playwright Test Automation Framework”.

PRP : Product Requirement Prompts – These are prompts that expand on the PRD to provide detailed instructions on HOW to build the application that your team are developing.

PTP : Product Testing Prompts – These are the HOW to test the product (we cover this in a later module)

These three components come together to provide us with this agentic product development and testing framework that works so well together.

For now though our initial focus for this particular lesson is based around just the PRD and the PRP because what we’re going to do is take that concept, extract it, and use it to build our automated test framework.

So think of the testing framework as the product we’re building. Later we’ll combine our framework with the PTP into a complete system for building and testing a web application.

This PRP system is an approach to software development that’s gaining ground fast. There are several variations on this theme, but its focus is on building comprehensive prompts that deliver well-built working software.

What is the PRP Methodology

The PRP methodology, or system, is basically a library of assets and templates used for agentic engineering, that is, developing products and code with AI tools like Claude Code.

Provided below is a link to the original repository that Rasmus Widing has built. This enables you to take a product requirements document, create product requirement prompts, and then execute those prompts with a coding tool like Claude Code to build a product.

If you want to go deeper then I’d recommend exploring his GitHub repo here:

https://github.com/Wirasm/PRPs-agentic-eng

I’ve started using this methodology successfully to build test automation harnesses, and then built a PTP (Product Testing Prompts) methodology on top of it that embeds into the whole system to create an agentic development and testing loop. But first, we need to understand the PRP methodology, and the best way to do that is to use it to build a test automation framework. We’re going to do that with a framework built around Playwright.

What we’re going to do is take the core most important parts from this system and we’re going to use those to build our test automation framework. We have Playwright-orientated prompts and examples that we plug into this framework that enable us to easily and quickly build out a Playwright project with no coding.

Course GitHub Repository

We get into the details of how to use this in the next lesson.

For now though I want to describe the high-level components and process, and show you how to modify and apply this system to building a framework with Playwright.

In later modules in this course I’m going to introduce you to the PTP (Product Testing Prompts) methodology that slots into this PRP (Product Requirement Prompts) methodology.

PRD – Product Requirements Document
PRP – Product Requirement Prompts
PTP – Product Testing Prompts

For now though … one step at a time let’s focus on one aspect, the PRD and PRP steps to build a Playwright automation framework.

How Does the PRP System Work?

How does this system work? How do the different parts connect? What do you need to do to use it?

The Process

The overall workflow is like this.

You’d start with an idea (possibly working with an AI chatbot like Claude Code or ChatGPT to develop and work through your idea).

Then we create a product requirements document. This is the high-level “what to build” document. What I’ve done for this course is already gone through the idea and the AI chat to create the PRD on the main branch. You’ll find the PRD document in the repository. It’s been honed to help provide the basis for a well-structured, well-designed Playwright test automation framework in TypeScript.

Now take that PRD, along with the PRP README.md (which outlines the process for Claude Code), the PRP template (which gives an example of what to create), and the Claude.md (which contains the high-level instructions that guide Claude Code when it’s working). With the Claude Code slash command, the PRP create command, all of those components come together to create your specific PRP document. You just run the create PRP slash command in Claude Code. The output from this is the PRP document.

You now have a VERY detailed how-to-build-the-playwright-framework document.

There’s an example PRP document that I’ve created on the main branch of the code repo. I’ve used this to build my own Playwright frameworks.

Once you have the PRP document, we want to execute on it to build our Playwright framework. The execute PRP slash command takes in the PRP document, the PRP README, which provides the overview of the whole process to your Claude Code instance, and of course the Claude.md file, which Claude Code uses as high-level instructions and a guide to any coding project it’s working on. We run the PRP execute command with all those inputs, and from that, Claude Code will go away and build you your Playwright framework in TypeScript.

Some key bits to understand here

The PRP README, the Claude.md, and the create PRP commands are specific to the overall PRP system.
The PRD, the PRP templates and our AUT (Application Under Test) that we feed in are what guide this system to build our specific Playwright framework PRP document. Executing on that PRP document, which is our very detailed “how to build” guide for the Playwright framework, is what results in a working TypeScript Playwright framework for OUR application.
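
One way to picture the two stages is as a pair of commands, each with a list of inputs and one output. The sketch below models that in TypeScript and checks which inputs are still missing before a stage can run. The type names and the helper are made up for illustration. They are not part of the PRP repo.

```typescript
// The artifacts described above. The names are labels, not file paths.
type Artifact =
  | "PRD"
  | "PRP README"
  | "PRP template"
  | "Claude.md"
  | "Application under test"
  | "PRP";

interface PrpStage {
  command: string;   // descriptive label for the slash command
  inputs: Artifact[];
  output: string;
}

// Stage 1: the "what to build" inputs go in, the detailed "how to build" PRP comes out.
const createPrpStage: PrpStage = {
  command: "create PRP",
  inputs: ["PRD", "PRP README", "PRP template", "Claude.md", "Application under test"],
  output: "PRP",
};

// Stage 2: the PRP goes in, the Playwright framework comes out.
const executePrpStage: PrpStage = {
  command: "execute PRP",
  inputs: ["PRP", "PRP README", "Claude.md"],
  output: "Playwright framework in TypeScript",
};

// Report which inputs a stage still needs, given what is available so far.
function missingInputs(stage: PrpStage, available: Artifact[]): Artifact[] {
  return stage.inputs.filter((input) => available.indexOf(input) === -1);
}

// Before stage 1 has run, the execute stage is still waiting on the PRP itself.
console.log(missingInputs(executePrpStage, ["PRD", "PRP README", "Claude.md"])); // [ "PRP" ]
```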

How do all of these components fit together then?

The Components

So there are a number of parts to this system.

It consists of some templates, PRP templates that Claude Code can use as an example to create your specific PRP. Secondly, there are a number of key slash commands that Claude Code will understand to help you generate your PRP and to help you execute on your PRP to create your actual coded framework.

The three core commands you’ll use are: PRP Planning, PRP base and PRP execute.

Now I have modified these commands specifically to enhance them to create test automation frameworks. I will take you through these commands so that you understand how to use them and how to update them when you need to. These commands are designed to leverage the PRP template that we use to create the PRP specifically for YOUR implementation of a Playwright framework.

There is also a PRP README.md file that provides an overview of the whole process. That’s useful for you too, but it’s also useful for the AI coding agent so that it understands the context and the whole end-to-end process.

Thirdly, there’s the global rules directory or claude.md file that acts as a high-level directive or memory for Claude Code when it’s working on your project. Again, I’ve customized this for your project for the building of test automation frameworks.

Another key input to this whole system and methodology is the set of example templates. AI loves examples to help it see what its output is expected to look like. I’ve created a PRD template and a PRP template specifically around a Playwright framework. This gives Claude Code loads of context when it comes to creating our specific version of the PRD and PRP documents.

Finally we’ll key Claude Code into the actual application we want our Playwright automation framework to test. In this instance we’ll provide the source code to our application in our repo. You’ll find the directory APP-UNDER-TEST in the course GitHub repo. This could also be a direct pointer to a running instance of the application you’re testing. We’ll stick to the source code for now. This will help Claude Code create a Playwright framework specific to the application that we want to test.

What We’re Going To Do Next

If you want to go deep into the PRP system, then I would go to this Git repo, clone it and work through a lot of the examples here.

PRP Agentic Engineering Repo

This framework/repo from Rasmus will enable you to build TypeScript and Python applications from scratch using this system. And within this repo are templates for building many different types of applications.

If you want to keep things simple to start with use our course repo here…

Course GitHub Repository

What I’ve done in our repo for this course is take the core parts of this PRP system and build templates and commands specifically for building test automation frameworks in TypeScript for Playwright.

Over the duration of this six-module course, you will find examples, templates, and architecture documents to help you build both UI and API test automation frameworks.

Next, in Lesson 2, I’m going to walk you through setting up your system and the tools you need.

Then, in Lesson 3, I want to take you through, step by step, this process to build your first framework with the PRP system.

I’m going to walk you through a 5 step process

1. Define your idea
2. Turn your idea into a PRD
3. Develop your idea into a comprehensive PRP
4. Feed the PRP into Claude Code to build your Playwright framework
5. Test your finished framework

Remember – learn this and you’ll have a system at your fingertips for building all sorts of test tools and testing frameworks!

The post An Intro to Agentic Testing : Overview and the PRP System appeared first on Test Management.

Playwright Tutorial : Executing Tests in Playwright

Bill Echlin — Mon, 03 Mar 2025 22:43:40 +0000

This is the final lesson in this introductory tutorial for Playwright where we look at executing tests. You have a number of options and ways to execute tests in Playwright.

Your three main options for executing tests include:

Execute from within Visual Studio Code
Execute with Trace Viewer
Execute from the Command Line

We’re going to explore each of these options in the next few sections as well as look at how to debug and select browsers.

Executing Playwright Tests in VS Code

We’re in our Playwright project, in the Explorer view. You’ll see your individual tests can be executed by pressing the run tests button on the left hand side.

However, if you created a new test or you’ve edited a test, you won’t see that button. You’ll need to come into the test explorer view and you’ll need to refresh tests.

For example, if we rename a test, Playwright won’t have picked up this change yet, so we can’t execute the test. After we’ve saved this specification file we can come into the “Test Explorer” view and refresh the tests list. Now we’ll see that Playwright picks this up and we have the option to run this test. You’ll also notice that you can now run from the Test Explorer view too.

You can run all of the tests in all of the test specs in this directory. You can also come in and run all of the tests in a particular specification file, or you can run individual tests within a specification file.

If you drill down to a specific test you want to edit then you’ll also find that you’ve got this option to turn on “continuous run”. Select this option and whenever you edit a test, and then save it, it will run it immediately.

When you run the tests, you’ll see that the test result panel will open up. You’ll get the results of that particular test run displayed along with the history presented in the right hand panel.

Trace Viewer

When it comes to debugging these tests, you can either show the test in a browser when you execute them so you can view the test run itself (“headed” mode). Or you can run in the background (“headless” mode with no browser displayed) and use the “trace viewer” tool to see how the test ran. You can select which one of these you need in the Playwright panel.

When you run a test with the trace viewer this shows you the recording of the test execution (especially useful if you’re running in headless mode). So you can view all of the steps executed within the test.

You can now look at the before hooks and after hooks (which we’ll talk about later). You can look at the steps and investigate the actual code that was used to run that step. You can even see the locators that have been used.

You can even experiment with those locators because all of those locators are recorded as part of that Trace Viewer recording. You can use the Pick locator feature, inspect the graphical view of the page, and find a new locator… all from the recording!

There’s lots of useful tools within the Trace Viewer, which we’ll talk about in a dedicated lesson later on. For now though just be aware that you can step through all of the steps in your script and visually see where any errors might have cropped up.

Selecting Browsers for Execution

When we installed Playwright for this particular project, remember that the command palette installer gave us the option to select the browsers we needed.

Those browsers that were installed as part of this project are now represented in the left hand Playwright panel.

So if we want to run a test in multiple browsers, we just select the browsers that we need. Using these check boxes I can run this single test against one or more browsers. After the execution completes you can then see the test results broken down by browser.

Executing Tests from the Command Line

The last thing to mention then is that tests can be run from the command line. You can either do this from within VS Code (in the “Terminal” panel) or in a separate command prompt window. We’ll step through how to run in a separate command prompt.

First find the path to your Playwright project. Then copy the path.

Then I can open a PowerShell or a command prompt, and change the directory with the path I just copied. Then run the command to execute my tests.

You’ll see from the above example that the command we ran was:

npx playwright test example.spec.ts

This can be broken down into 4 parts.

npx - Node package execute (the executor)
playwright - run the playwright package
test - get the playwright package to run the tests
example.spec.ts - run the tests defined in the spec.ts file

In our example we’re running all the tests in the example.spec.ts file against all the browsers we installed when we set up the project. If you want to run against specific browsers use the --project switch. And if you want to see the results displayed in the Trace Viewer use the --ui switch. For example:

npx playwright test example.spec.ts --ui --project=chromium

In this instance then, the Chromium test run results will be shown in the trace viewer view. You can come in and look at each of the tests and execute them from the trace viewer and see them executing within there.
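
If you end up scripting these runs, the way the parts of the command combine can be sketched as a small TypeScript helper. The helper and its option names are hypothetical. Only the npx playwright test command and the --ui and --project switches come from Playwright itself.

```typescript
// Hypothetical options object; only the resulting CLI switches are Playwright's own.
interface RunOptions {
  spec?: string;       // spec file to run; leave out to run every spec
  projects?: string[]; // browser projects to run against, e.g. "chromium"
  ui?: boolean;        // add the --ui switch
}

// Assemble the command line from its parts.
function buildPlaywrightCommand(options: RunOptions = {}): string {
  const parts = ["npx", "playwright", "test"];
  if (options.spec) {
    parts.push(options.spec);
  }
  if (options.ui) {
    parts.push("--ui");
  }
  for (const project of options.projects ?? []) {
    parts.push(`--project=${project}`);
  }
  return parts.join(" ");
}

console.log(buildPlaywrightCommand({ spec: "example.spec.ts", ui: true, projects: ["chromium"] }));
// prints: npx playwright test example.spec.ts --ui --project=chromium
```

Passing more than one project name produces one --project switch per browser, which is how you’d target, say, Chromium and Firefox in the same run.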

Another option you have at your disposal, after the tests have run, is to view a report. As you’ll see in the command prompt output, you can run this command:

npx playwright show-report

This will open the full HTML report in your default browser.

From this report you can drill down and look at specific steps or explore any errors that were generated.

Conclusion

That brings our final module in this series to a conclusion. Over 6 modules we’ve taken you from the initial install of Playwright through to executing your tests and generating a test report.

In future modules we’re going to cover lots more on inspectors, writing tests and debugging tests. More to follow soon.

The post Playwright Tutorial : Executing Tests in Playwright appeared first on Test Management.

Playwright Tutorial for Beginners : Lesson 5

Bill Echlin — Sun, 02 Mar 2025 22:21:08 +0000

We've already seen how to write your first test in Playwright and how to record tests using the “Test Generator” tool. We've seen how to use the pick locator option on our web page to identify objects that we want to interact with too.

However, there is another standalone tool called the “Playwright Inspector”. This very useful tool can be used to help you construct your tests and identify locators.

This “Inspector” tool is not directly initialized from within Visual Studio Code. However, you can start it from a terminal using npx.

Starting Playwright Inspector

First open a terminal from within VS Code (View > Terminal).

Then you need to execute this npx command:

npx playwright codegen https://www.google.com

Use the URL of the web page you want to navigate directly to as the final parameter in the command.

You can run this command directly from within Visual Studio Code or, if you've got a command prompt open, you can run it directly from a command prompt as well.

Using Playwright Inspector

Once you've run this, you'll get the “Playwright Inspector” tool, which allows you to inspect objects and record tests.

As you interact with your application, you'll see that test steps are added to the test panel:

We can also use the locator tool to identify objects that we might want to interact with.

Then we can see and copy the locator code that Playwright will generate. We can jump back into record mode again if we need to, if we want to continue to record in our script, and we can use the assert visibility, assert text, and assert value options in the toolbar as well to add expect statements to our script that we're recording.
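
To give a feel for the expect statements those three assert options add, here is a small TypeScript sketch that maps each option to the kind of line you would typically see. The helper is hypothetical, and the recorder's exact output may differ. The matchers themselves (toBeVisible, toContainText, toHaveValue) are standard Playwright assertions.

```typescript
// The three assert options on the toolbar.
type AssertKind = "visibility" | "text" | "value";

// Build the expect statement for a given locator expression.
// The locator is passed in as source text, e.g. "page.getByLabel('Name')".
function expectSnippet(kind: AssertKind, locator: string, expected: string = ""): string {
  switch (kind) {
    case "visibility":
      return `await expect(${locator}).toBeVisible();`;
    case "text":
      return `await expect(${locator}).toContainText(${JSON.stringify(expected)});`;
    case "value":
      return `await expect(${locator}).toHaveValue(${JSON.stringify(expected)});`;
  }
}

// The locators and expected values here are placeholders.
console.log(expectSnippet("visibility", "page.getByRole('heading', { name: 'Dashboard' })"));
console.log(expectSnippet("text", "page.getByRole('status')", "Saved"));
console.log(expectSnippet("value", "page.getByLabel('Name')", "Main Investment"));
```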

Once we've completed the recording of that script, we can, of course, copy the code, and add to one of our test specifications (or we can create a completely new test test specification if required).

The Playwright Inspector is a useful little tool that helps you experiment with your tests and work out locators before you add them to your test specifications within Visual Studio Code.

So in the next lesson, we're going to look at executing tests within the Visual Studio Code environment and how to execute them from the command line too.

The post Playwright Tutorial for Beginners : Lesson 5 appeared first on Test Management.

Playwright Tutorial 2025 : Recording Tests

Bill Echlin — Tue, 25 Feb 2025 15:29:57 +0000

In this lesson we want to have a look at recording tests with the Playwright test generator.

The Playwright “Test Generator” is a tool for recording tests. And I agree with you … recording tests is not a good approach to build automated tests. However, it is a good approach for learning Playwright and working out how Playwright works. It can also help you build chunks of code that you can copy into your projects.

Starting the Test Generator

You can start the “Test Generator” from the Playwright menu in the test explorer view.

From here you should see a browser open with the recording bar displayed.

In the background, in VS Code, you should see an initial test specification file, test-1.spec.ts, created along with the initial test that we're going to record.

Recording The Test

Now enter the URL in the browser, and you'll see each step you complete added as TypeScript code in your specification file.

You'll see the recording bar displayed in the browser, showing that we're ready to start recording.

The Test Generator Tool Bar

You'll see that the tool bar has five options.

The first shows that we're in record mode. The second is the locator selection tool. The remaining three are the assert functions, which let us assert on visibility, on text, and on a value.

For example, if we use the visibility assert feature, we can check that the Google logo is displayed. When we select the Google logo, an expect statement is generated.
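To give a feel for what gets generated, here is a minimal sketch of the three assert options written as expect statements. The locators (an image named 'Google', a link named 'About', a combobox named 'Search') are assumptions about the page. The recorder picks its own locators for whatever you click, so your generated code may differ.

```typescript
import { test, expect } from '@playwright/test';

test('assertions of the kind the recorder adds', async ({ page }) => {
  await page.goto('https://www.google.com/');

  // Assert visibility: passes if the element is visible on the page.
  await expect(page.getByRole('img', { name: 'Google' })).toBeVisible();

  // Assert text: passes if the element's text contains the given string.
  await expect(page.getByRole('link', { name: 'About' })).toContainText('About');

  // Assert value: passes if the input currently holds the given value.
  const searchBox = page.getByRole('combobox', { name: 'Search' });
  await searchBox.fill('playwright');
  await expect(searchBox).toHaveValue('playwright');
});
```

Each expect waits and retries for a short time before failing, so a recorded check like this does not usually need an explicit wait in front of it.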

The other option in the record bar is the pick locator. Select the pick locator option and you can look at the locators that Playwright would use to identify the objects or elements in your web application.

At first it looks a bit difficult to copy the locator you need. However, if you click on the element, you will see the locator entered in the VS Code bar at the top of the page. From there, you can copy the locator and add it to your scripts manually.

Another tip for you – once you've used that locator the test generator does flip out of record mode. The key to continuing your recording is to put the cursor back in your script where you want to continue recording, and then click on the record button to continue where you left off.

In the next lesson then we're going to look at these inspectors and locators in a little bit more detail. We'll see how Playwright identifies elements in your application and how it creates those locator statements.

The post Playwright Tutorial 2025 : Recording Tests appeared first on Test Management.