The Macro: Testing AI Agents Is the Hardest Unsolved Problem in Dev Tools
Traditional software testing is well-understood. You write unit tests, integration tests, end-to-end tests. You know what the expected output should be. You run the tests, compare results, and ship with confidence.
AI agents break this model completely. The outputs are non-deterministic. The same input can produce different results depending on conversation history, tool state, and model behavior. User journeys through an agent are unpredictable. And the failure modes are subtle. An agent might give a technically correct but misleading answer. It might call the right tool at the wrong time, or the wrong tool entirely. It might hallucinate a fact buried in an otherwise accurate response.
The result is that most teams ship AI agents with minimal testing. They try a few prompts manually, eyeball the results, and hope for the best. This is how you get agents that work in demos but fail with real users who ask questions you never anticipated.
The testing and evaluation space for AI is growing. Braintrust, LangSmith, and Humanloop all offer evaluation frameworks. But most of these focus on single-turn evaluations. Test a prompt, check the response. What they do not handle well is multi-step user journeys that span multiple tool calls, involve multiple modalities, and reveal failure modes that only appear after several interactions.
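The gap is easy to see in code. Here is a minimal sketch (all names are hypothetical, not any vendor's API): a single-turn eval checks one response in isolation, while a journey eval has to thread conversation history through several turns and assert on the whole trace, which is where multi-step failures hide.

```python
# Hypothetical sketch: single-turn eval vs. multi-step journey eval.
# `agent`, `toy_agent`, and the check functions are illustrative
# stand-ins, not a real framework's API.

def single_turn_eval(agent, prompt, check):
    """Test one prompt, check one response."""
    return check(agent(prompt))

def journey_eval(agent, turns, checks):
    """Drive the agent through several turns, collecting the full
    trace, then run checks that can see the whole conversation."""
    trace = []
    for turn in turns:
        trace.append(agent(turn, history=trace))
    return all(check(trace) for check in checks)

# A toy agent that answers correctly in isolation but contradicts
# itself once there is history -- invisible to the single-turn eval.
def toy_agent(prompt, history=None):
    if history:
        return "The refund window is 14 days."
    return "The refund window is 30 days."

print(single_turn_eval(toy_agent, "Refund policy?",
                       lambda r: "30 days" in r))   # True: passes alone
print(journey_eval(
    toy_agent,
    ["Refund policy?", "Say that again?"],
    [lambda t: len(set(t)) == 1],  # answers must stay consistent
))                                                  # False: they disagree
```

The single-turn harness reports green; only the journey-level check catches the inconsistency.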
The Micro: Synthetic User Journeys That Find What You Missed
Shreyas Kaps and Rohan Kulkarni founded Ashr. Shreyas has spent the past two years building AI agents in finance and devops. Rohan cofounded Ask Geri, which was acquired, and holds a Berkeley EECS degree. They are a two-person team from San Francisco, part of YC Winter 2026 with Harshita Arora.
Ashr generates large volumes of synthetic user journeys through your agent’s tool calls, results, and questions. Not one-off test cases. Realistic, multi-step user stories that exercise your product the way real users would. The platform picks up on errors, inconsistencies, and edge cases that manual testing misses.
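To make the idea concrete, here is a hedged sketch of what a generated journey record might look like: a persona, a goal, and a multi-step turn sequence with the tool calls each step should trigger. This is an illustration of the concept, not Ashr's actual schema.

```python
# Hypothetical data shape for a synthetic user journey; names and
# fields are illustrative, not Ashr's real format.
from dataclasses import dataclass, field

@dataclass
class Turn:
    user_message: str
    expected_tools: list = field(default_factory=list)

@dataclass
class Journey:
    persona: str   # who the synthetic user is
    goal: str      # what they are trying to accomplish
    turns: list    # the multi-step path through the agent

journey = Journey(
    persona="first-time user on mobile",
    goal="dispute a duplicate charge",
    turns=[
        Turn("I think I was charged twice", ["lookup_transactions"]),
        Turn("Yes, the second one", ["open_dispute"]),
        Turn("When will I hear back?"),
    ],
)

# A generator would emit thousands of variations of this journey --
# different phrasings, personas, and orderings -- to reach paths
# that manual spot-checking never exercises.
print(len(journey.turns))  # 3
```

The point of the structure is that expectations attach to the whole path, not to a single prompt-response pair.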
The features that differentiate Ashr are dataset management with full status and trace visibility, complete test timelines showing speaker turns and tool calls, side-by-side comparison of expected versus actual outputs, and prompt versioning with inline diffs and per-version pass rates. That last one is particularly useful. When you change a prompt, you want to know exactly how it affects every test case, not just the one you are looking at.
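A rough sketch of why per-version pass rates matter (the data and function names are invented for illustration, not Ashr's API): aggregate pass rates can stay flat while individual cases regress, so you want the per-case diff between versions.

```python
# Hypothetical per-version results over a fixed test suite.
results = {
    "prompt-v1": {"case_a": True, "case_b": True, "case_c": False},
    "prompt-v2": {"case_a": True, "case_b": False, "case_c": True},
}

def pass_rate(cases):
    return sum(cases.values()) / len(cases)

def regressions(old, new):
    """Cases that passed on the old version but fail on the new one."""
    return [c for c in old if old[c] and not new[c]]

print(f"v1: {pass_rate(results['prompt-v1']):.0%}")   # v1: 67%
print(f"v2: {pass_rate(results['prompt-v2']):.0%}")   # v2: 67%
print(regressions(results["prompt-v1"], results["prompt-v2"]))  # ['case_b']
```

Both versions score 67%, but v2 silently broke case_b while fixing case_c. A single aggregate number hides exactly the information an inline per-case diff surfaces.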
The platform runs evals for UC Berkeley, Stanford, and several startups. There is a free tier available, and booking a demo is straightforward through their site.
The Verdict
Ashr is building for a problem that every AI team will have as agents get more complex. Testing multi-step, multi-tool agent behavior is genuinely hard, and the current alternatives are either too simple or too manual. The synthetic user journey approach is the right idea because it generates diversity at scale.
The risk is that the evaluation platform market consolidates around one of the better-funded players. Braintrust has significant traction. LangSmith has the LangChain distribution advantage. But neither handles multi-modal, multi-step journey testing as well as Ashr claims to.
In 30 days, I want to see the number of production agents being tested on the platform. In 60 days, the question is whether Ashr catches bugs that users would have found. Real examples of production issues caught by synthetic testing are the most compelling proof point. In 90 days, I want to know about CI/CD integration. If Ashr runs automatically on every deploy, it becomes essential infrastructure rather than an optional tool.