The Macro: Everyone Hates E2E Testing and the Tools Deserve It
End-to-end testing is one of those problems where every engineering team agrees it matters and almost every engineering team does it badly. The numbers are grim. By most estimates, somewhere between 30% and 50% of e2e test suites are flaky, meaning they pass sometimes and fail sometimes for reasons unrelated to actual bugs. Engineers spend hours debugging test failures that turn out to be timing issues, selector changes, or race conditions in the test itself. The result is predictable: teams stop trusting their tests, start ignoring failures, and ship bugs to production anyway.
The tooling landscape reflects this frustration. Selenium has been around since 2004 and remains widely used despite being universally disliked. Cypress modernized the developer experience but introduced its own limitations around cross-origin testing and parallel execution. Playwright from Microsoft’s team is technically excellent but still requires you to write and maintain JavaScript or TypeScript test code. Each of these tools has the same fundamental architecture: find elements using CSS selectors or XPath, interact with them programmatically, assert on the results.
That architecture is the root cause of flakiness. CSS selectors break when designers change class names. XPath expressions break when the DOM structure changes. Element IDs break when component libraries get updated. A button that was #submit-btn yesterday might be .form-actions > button:first-child today because someone refactored a component. The test was correct. The app was correct. The selector was stale. That is a test maintenance problem, not a test quality problem, and it consumes enormous engineering effort.
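To make the staleness concrete, here is a small TypeScript sketch. The element model and helper names are invented for illustration (this is not Playwright's or any real framework's API): a selector lookup fails after a refactor, while a lookup on what the user actually sees survives.

```typescript
// Sketch: why selector-based tests go stale. We model a page as a flat
// list of elements; findBySelector stands in for document.querySelector.
interface UiElement {
  selector: string; // the element's current CSS-style handle
  text: string;     // what the user actually sees
}

function findBySelector(page: UiElement[], selector: string): UiElement | undefined {
  return page.find((el) => el.selector === selector);
}

function findByVisibleText(page: UiElement[], text: string): UiElement | undefined {
  return page.find((el) => el.text === text);
}

// Before the refactor, the submit button is #submit-btn...
const before: UiElement[] = [{ selector: "#submit-btn", text: "Submit" }];
// ...after a component update renames it, the user sees the same button.
const after: UiElement[] = [{ selector: ".form-actions > button:first-child", text: "Submit" }];

// The selector lookup breaks; the text-based (visual) lookup still works.
console.log(findBySelector(after, "#submit-btn"));      // undefined — test fails
console.log(findByVisibleText(after, "Submit")?.text);  // "Submit" — still found
```

The app and the test logic are both "correct" in the second case; only the handle went stale, which is the maintenance tax the article describes.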
The AI testing wave has started. QA Wolf offers human-plus-AI test creation and maintenance. Mabl uses machine learning for visual testing. Testim (acquired by Tricentis) applies AI to test stability. But most of these solutions layer AI on top of the same DOM-based approach. They make selectors smarter rather than eliminating selectors entirely.
The Micro: Stripe and Citadel Alumni Watching Pixels Instead of Parsing HTML
Docket was founded by Nishant Hooda and Boris Skurikhin. Nishant was an engineer at Stripe and Brex. Boris was a quant developer at Citadel and a software engineer at Patreon. They are part of Y Combinator’s Spring 2025 batch, based in San Francisco.
The core technical insight is simple and radical: stop parsing the DOM. Instead of finding elements through CSS selectors, Docket uses multimodal AI agents that look at the rendered page and interact with it using pixel coordinates. The agent sees a button labeled “Submit” and clicks on it at the (X, Y) position where that button appears on screen. No selector. No XPath. No DOM dependency at all.
This approach solves the flakiness problem at the root. If a designer changes the CSS class of a button but the button still says “Submit” and still appears in roughly the same location, the test keeps working. If the button moves 50 pixels to the right because of a layout change, Docket’s self-healing mechanism adjusts the click coordinates automatically. The test breaks only when the actual user experience breaks, which is exactly when you want it to break.
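Docket has not published how its self-healing works; one plausible mechanism, sketched below with invented names, is to cache a target's label plus its last-known coordinates, re-locate the label visually on each run, and update the cached position when the layout shifts. The test only fails when the label genuinely disappears.

```typescript
// Hypothetical sketch of coordinate-based self-healing (not Docket's API).
interface Target { label: string; x: number; y: number }

// Stand-in for "what the vision model sees": labeled regions and positions.
type Screen = Map<string, { x: number; y: number }>;

function selfHealingClick(target: Target, screen: Screen): Target {
  const seen = screen.get(target.label);
  if (!seen) {
    // The button is gone from the rendered page: a real UX break, so fail.
    throw new Error(`"${target.label}" is no longer visible`);
  }
  // If the button moved, adopt the new coordinates instead of failing.
  return { label: target.label, x: seen.x, y: seen.y };
}

// Yesterday's layout had Submit at (400, 620); today it shifted 50px right.
const cached: Target = { label: "Submit", x: 400, y: 620 };
const todaysScreen: Screen = new Map([["Submit", { x: 450, y: 620 }]]);

const healed = selfHealingClick(cached, todaysScreen);
console.log(healed.x, healed.y); // 450 620 — clicks the new position
```

The key property is the failure condition: a moved button heals silently, while a missing button fails loudly, which matches "breaks only when the user experience breaks."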
You write tests in plain English. “Log in with test credentials, navigate to the dashboard, click the export button, verify the CSV downloads.” Docket’s AI translates that into a sequence of visual interactions. It handles canvases, iframes, shadow DOM elements, and other structures that make traditional selector-based testing painful or impossible.
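As a rough mental model of that translation step, here is a toy compiler from English steps to action objects. In reality Docket uses an AI model, not a regex parser, and the step grammar and action shapes here are invented for illustration.

```typescript
// Toy sketch: compiling plain-English steps into visual actions.
type Action =
  | { kind: "click"; label: string }
  | { kind: "navigate"; target: string }
  | { kind: "verify"; condition: string };

function compileStep(raw: string): Action {
  const step = raw.trim();
  const s = step.toLowerCase();
  if (s.startsWith("click")) {
    return { kind: "click", label: step.replace(/^click (the )?/i, "") };
  }
  if (s.startsWith("navigate")) {
    return { kind: "navigate", target: step.replace(/^navigate to (the )?/i, "") };
  }
  if (s.startsWith("verify")) {
    return { kind: "verify", condition: step.replace(/^verify /i, "") };
  }
  throw new Error(`unrecognized step: ${step}`);
}

const plan = "navigate to the dashboard, click the export button, verify the CSV downloads"
  .split(",")
  .map(compileStep);

console.log(plan.map((a) => a.kind)); // [ "navigate", "click", "verify" ]
```

Each action would then be executed visually (locate the label on screen, click its coordinates) rather than through a selector, per the approach described above.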
The product integrates with CI/CD pipelines, includes a dedicated mailbox for email testing, supports two-factor authentication flows, offers scheduled test runs, and sends results to Slack. The demo reel shows tests running against Perplexity, Amazon, Airbnb, Character AI, Mercury, and Patreon. Customers include Centerpoint and Paradigm.
The Verdict
I think Docket is making the right architectural bet. The DOM-based testing paradigm has had twenty years to get good and it is still producing flaky tests. Coordinate-based visual testing is how humans actually interact with software, and the AI models are now good enough to make that approach reliable at scale.
The competitive question is whether this becomes the new standard or remains a niche approach. Playwright is free, backed by a massive team, and deeply integrated into the JavaScript ecosystem. QA Wolf has raised significant funding and offers a managed service that takes testing off your plate entirely. Docket needs to be dramatically better at staying green to justify switching costs.
The Stripe and Citadel backgrounds are relevant here. Both founders come from environments where reliability is not optional and where testing infrastructure is taken seriously. That is a different mindset from founders who see testing as a chore to automate away.
At thirty days, I want to see what percentage of tests stays green after a week without maintenance. That is the metric that matters. At sixty days, whether plain-English test writing is actually faster than Playwright for experienced engineers, or whether the value is primarily in maintenance reduction. At ninety days, the pricing question: can Docket charge enough per test to build a real business, or does the coordinate-based approach need to be packaged as a platform with broader QA capabilities? The technical approach is sound. The team has the right DNA. The market is desperate for something that works.