The Macro: Everyone Agrees E2E Testing Is Important and Nobody Wants to Do It
I have never met a developer who enjoys writing end-to-end tests. I have met many developers who say they believe in testing, who advocate for high test coverage in code reviews, and who then quietly skip writing the E2E suite because it takes forever and breaks constantly.
The tools are part of the problem. Cypress is powerful but verbose. A simple “log in, navigate to the dashboard, verify the chart loads” flow requires dozens of lines of code, explicit waits, CSS selectors that break when the frontend changes, and retry logic for flaky network requests. Playwright improved on Cypress in meaningful ways, but the fundamental authoring experience is the same: writing imperative test scripts in JavaScript or Python.
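For a sense of scale, here is roughly what that flow looks like as a Playwright test. The app URL, selectors, and credentials are hypothetical stand-ins, not any real application:

```typescript
import { test, expect } from '@playwright/test';

test('dashboard revenue chart loads after login', async ({ page }) => {
  // Hypothetical app URL and selectors -- stand-ins for a real frontend.
  await page.goto('https://app.example.com/login');
  await page.fill('#email', 'qa@example.com');
  await page.fill('#password', process.env.TEST_PASSWORD ?? '');
  await page.click('#login-button');

  // Explicit navigation wait: breaks if the route is renamed.
  await page.waitForURL('**/dashboard');

  // CSS selector tied to implementation details -- renaming the class breaks it.
  const chart = page.locator('.chart-container');
  await expect(chart).toBeVisible({ timeout: 10_000 });

  // "Chart is showing data" becomes a structural assertion about rendered SVG.
  await expect(chart.locator('svg')).not.toHaveCount(0);
});
```

Even the happy path is a dozen-plus lines, and every selector, route pattern, and timeout in it is a separate maintenance liability.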
The maintenance burden is worse than the authoring cost. E2E tests are brittle by nature. A designer moves a button. A backend team renames an API endpoint. A CSS class gets refactored. The tests break. Someone has to go update the selectors, adjust the waits, and rerun the suite. In my experience, most teams eventually reach a point where the E2E suite is more trouble than it is worth, and they either abandon it or assign one unlucky engineer to maintain it full-time.
This is a known problem and the solutions so far have been incremental. Playwright’s auto-wait is better than Cypress’s explicit waits. Testing Library’s semantic selectors are more resilient than CSS selectors. But nobody has fundamentally rethought the authoring model. You are still writing procedural scripts that describe clicks, inputs, and assertions at the implementation level.
The AI testing space is getting crowded. QA Wolf offers a managed service. Testim uses AI for element selection. Mabl provides low-code test creation. Each takes a different approach to reducing the pain, but most still require some form of scripted test definition or a visual recorder that generates fragile code underneath.
The Micro: Two Stripe Engineers Who Got Tired of Test Maintenance
Vijit Dhingra and Jack Brown founded Lark after spending time at Stripe, where they built billing infrastructure and developer tooling. If you have worked at a company with Stripe’s engineering standards, you know the testing culture is intense. You also know that even at Stripe, maintaining E2E test suites is a grind. They saw the problem up close and decided the authoring model itself needed to change.
They came through Y Combinator’s Summer 2025 batch and are based in San Francisco.
The product replaces scripted tests with plain English descriptions. Instead of writing code that says “find the element with id login-button, click it, wait for the dashboard to load, find the element with class chart-container, assert it is visible,” you write something like “log in, go to the dashboard, verify the revenue chart is showing data.” Lark’s AI figures out the selectors, handles the waits, and generates the assertions.
The value proposition is not just easier authoring. It is resilience. When a designer renames a CSS class or moves a button to a different part of the page, a Cypress test breaks. A Lark test that says “click the login button” should still work because the AI re-identifies the element based on its semantic meaning rather than its implementation details. That is a meaningful difference if it holds up in practice.
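The resilience claim can be illustrated with a toy model. This is not Lark's actual mechanism, which is not public; it just shows why identifying an element by its semantic role and accessible name survives refactors that break a class-based selector:

```typescript
// Toy model of a page element -- not a real DOM, just enough to show the idea.
interface El {
  role: string;      // semantic role, e.g. "button"
  name: string;      // accessible name, e.g. the visible label
  cssClass: string;  // implementation detail that designers rename freely
}

// Brittle lookup: tied to an implementation detail.
const byClass = (els: El[], cls: string): El | undefined =>
  els.find(e => e.cssClass === cls);

// Resilient lookup: tied to what the element means to a user.
const byRoleAndName = (els: El[], role: string, name: string): El | undefined =>
  els.find(e => e.role === role && e.name.toLowerCase() === name.toLowerCase());

// The page before a refactor...
const v1: El[] = [{ role: 'button', name: 'Log in', cssClass: 'btn-login' }];
// ...and after a designer renames the class.
const v2: El[] = [{ role: 'button', name: 'Log in', cssClass: 'auth__submit' }];

console.log(byClass(v1, 'btn-login')?.name);              // "Log in"
console.log(byClass(v2, 'btn-login')?.name);              // undefined -- test breaks
console.log(byRoleAndName(v2, 'button', 'log in')?.name); // "Log in" -- survives
```

This is the same bet Testing Library made with semantic queries; the open question is whether an AI can make the "which element did the user mean" judgment reliably on messy real-world pages.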
Lark supports testing across dashboards, APIs, and SDKs. You can run tests on feature branches, in CI/CD pipelines, or against production. The continuous testing angle means tests run automatically as your code changes, not just when someone remembers to trigger the suite.
The pitch is clean. Write tests in English. Run them everywhere. Stop maintaining selector-based scripts. I have seen similar promises from other tools, but the Stripe pedigree gives me some confidence that these founders understand the engineering rigor required to make AI-driven testing actually reliable and not just a demo that works on a todo app.
The Verdict
I think Lark is attacking the right layer of the testing problem. The issue was never that developers lacked powerful enough testing frameworks. Cypress and Playwright are both excellent. The issue is that the authoring and maintenance model is fundamentally at odds with how fast modern frontends change. Natural language test definitions that are resilient to UI changes would be a genuine step-function improvement.
The risk is reliability. If a plain English test fails, debugging it is harder than debugging a Cypress test because you cannot see exactly what the AI tried to do. False positives and false negatives are the enemies of any testing tool, and AI-driven element identification introduces new ways for both to occur. The Cypress and Playwright teams have spent years making their tools deterministic. Lark needs to match that reliability bar while using a fundamentally less deterministic approach.
In thirty days I want to see how it handles a real application with a complex frontend. Not a demo app. A production dashboard with tables, charts, modals, and nested navigation. Sixty days in, the question is the flakiness rate compared to equivalent Cypress tests. If Lark tests are flakier, nobody will adopt them no matter how easy they are to write. Ninety days in, I want to see CI/CD integration stories. Testing tools live or die by how smoothly they fit into existing development workflows. If adding Lark to a GitHub Actions pipeline takes more than fifteen minutes, the adoption curve gets steep fast.
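To make the fifteen-minute bar concrete: the `lark` CLI invocation below is entirely invented, since Lark's real CI integration is not documented here, but something of roughly this shape is all the ceremony a team should have to tolerate in a GitHub Actions workflow:

```yaml
# Hypothetical workflow -- the `lark` command is invented for illustration;
# Lark's actual CI integration may look different.
name: e2e
on: [pull_request]
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run E2E suite against the preview deployment
        run: npx lark run --base-url "$PREVIEW_URL"  # hypothetical CLI
        env:
          PREVIEW_URL: ${{ vars.PREVIEW_URL }}
          LARK_API_KEY: ${{ secrets.LARK_API_KEY }}
```

If integration requires much more than a checkout step, one run command, and a couple of secrets, the tool is fighting the workflow instead of fitting into it.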