Browser agents are easy to demo and hard to audit. The question I wanted to answer was narrow: can I record one real browser-use session at the browser boundary and replay it later with matching strict observables?
The setup used two processes. A Python process ran browser-use against a Chromium instance. A Node process attached to the same browser over CDP (the Chrome DevTools Protocol) and recorded the session with DBAR.
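For two processes to share one browser, both need the same CDP endpoint. A minimal sketch of discovering it, assuming Chromium was started with `--remote-debugging-port=9222` (the port and the helper names are my assumptions, not something the session setup above specifies):

```python
import json
from urllib.request import urlopen


def parse_ws_endpoint(version_json: str) -> str:
    """Extract the browser-level WebSocket URL from a /json/version response body."""
    return json.loads(version_json)["webSocketDebuggerUrl"]


def discover_ws_endpoint(port: int = 9222) -> str:
    """Ask a running Chromium for its CDP endpoint (the port is an assumption)."""
    with urlopen(f"http://127.0.0.1:{port}/json/version", timeout=5) as response:
        return parse_ws_endpoint(response.read().decode("utf-8"))
```

Both the agent process and the recorder would then connect to the URL this returns, which is what lets the recorder observe the same tabs the agent is driving.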
browser-use agent
|
v
Chromium over CDP
|
v
DBAR capture -> capsule -> replay
DBAR recorded virtual time, network requests and responses, DOM snapshots, accessibility snapshots, screenshots, and hashes at step boundaries. Twenty-four hours later, I replayed the capsule on a fresh browser instance. For this session, the strict observables matched: DOM hash, accessibility-tree hash, and network digest.
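To make "hashes at step boundaries" concrete, here is a rough sketch of reducing one step to three comparable digests. This is not DBAR's implementation: `stable_hash`, `step_record`, the field names, and the choice of which request fields feed the network digest are all hypothetical, picked only to show the idea of hashing a canonical encoding.

```python
import hashlib
import json


def stable_hash(value) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    encoded = json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def step_record(step: int, dom: dict, ax_tree: dict, requests: list) -> dict:
    """Reduce one step boundary to three strict observables (hypothetical layout)."""
    # Only method, URL, and status feed the digest; timing fields are left out
    # because wall-clock durations differ between capture and replay.
    network = [(r["method"], r["url"], r["status"]) for r in requests]
    return {
        "step": step,
        "dom": stable_hash(dom),
        "a11y": stable_hash(ax_tree),
        "network": stable_hash(network),
    }
```

The design point is canonicalization: two snapshots that differ only in key order or in fields excluded from the digest should hash identically, so a mismatch means something the observable actually covers has changed.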
That is the important phrasing. I am not claiming every browser session can be replayed perfectly. Fonts, GPU differences, cross-origin behavior, service workers, browser versions, and unsupported APIs can all introduce drift. I am saying this captured Chromium session replayed cleanly on the observables DBAR treats as strict.
The more useful case is divergence. If a replay fails, DBAR reports the step and observable:
step: 3
observable: dom
expected: a3f2...
actual: b7c1...
That changes the debugging loop. Instead of rerunning and hoping the failure appears again, you inspect the recorded step where the page stopped matching the capsule.
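A report of that shape falls out of a simple comparison: walk the recorded and replayed steps in order and stop at the first observable whose hash differs. The sketch below assumes each step is a dict with `step`, `dom`, `a11y`, and `network` keys; those names and `first_divergence` are illustrative, not DBAR's actual data model.

```python
from typing import Optional

# Comparison order is an assumption; any fixed order works.
OBSERVABLES = ("dom", "a11y", "network")


def first_divergence(recorded: list, replayed: list) -> Optional[dict]:
    """Return the first mismatching step and observable, or None if all match."""
    # This sketch assumes both runs produced the same number of steps;
    # zip() silently ignores any extra trailing steps.
    for expected, actual in zip(recorded, replayed):
        for name in OBSERVABLES:
            if expected[name] != actual[name]:
                return {
                    "step": expected["step"],
                    "observable": name,
                    "expected": expected[name],
                    "actual": actual[name],
                }
    return None
```

Stopping at the first mismatch matters because later steps usually diverge as a consequence of the first one, so only the earliest difference points at the cause.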
For agent systems, this is the kind of evidence I want: not just "the agent clicked submit," but the browser state before and after that click, plus the network data that got it there. Logs tell you what the harness thought happened. Replay evidence tells you what the browser exposed at the time.
DBAR is public here: github.com/pyyush/dbar.