The Macro: Reinforcement Learning Is the Secret Sauce Nobody Can Cook
Here is something that everyone building with LLMs has figured out by now: the base model is not enough. Fine-tuning helps. Prompt engineering helps. But the biggest performance gains in the last two years have come from reinforcement learning, specifically RLHF (reinforcement learning from human feedback) and its variants. It is how ChatGPT went from a clever demo to a product people actually use daily. It is how Claude got good at following instructions. It is the difference between a model that sounds smart and a model that actually does what you want.
The problem is that reinforcement learning is genuinely hard to implement. It requires specialized infrastructure, careful reward function design, significant compute, and a team that understands both RL theory and the practical engineering of training pipelines. Most companies building AI products do not have that team. They have application developers who are good at prompt engineering and fine-tuning but have never set up an RL training loop.
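To make the gap concrete, here is what the smallest possible RL training loop looks like: REINFORCE on a two-armed bandit, in plain Python. This is a teaching sketch, not RLHF. In a real pipeline the two logits become billions of LLM parameters, the bandit arms become sampled completions, and the scalar reward becomes a learned reward model, which is exactly where the infrastructure burden appears.

```python
import math
import random

random.seed(0)

# Toy policy: a softmax over two actions. Action 0 pays reward 1, action 1 pays 0.
logits = [0.0, 0.0]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def reward(action):
    return 1.0 if action == 0 else 0.0

lr = 0.1
baseline = 0.0  # running average of reward, used to reduce gradient variance
for step in range(500):
    probs = softmax(logits)
    # Sample an action from the current policy
    a = 0 if random.random() < probs[0] else 1
    r = reward(a)
    baseline += 0.01 * (r - baseline)
    advantage = r - baseline
    # Policy-gradient update: d log pi(a) / d logit_i = 1[i == a] - probs[i]
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * advantage * grad

final = softmax(logits)  # policy should now strongly prefer action 0
```

Even this toy version needs a baseline to stabilize training; scaling it to language models adds distributed rollouts, KL penalties against a reference model, and reward-model inference, each a project in itself.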
The existing options are limited. You can hire a team of ML researchers, which is expensive and slow. You can use open-source frameworks like TRL or DeepSpeed, which require substantial engineering effort to operationalize. Or you can pay OpenAI or Anthropic for access to their already-trained models and hope the general-purpose alignment is close enough to what you need. None of these options serve the growing middle market of companies that want custom RL-trained models without building an ML research team.
This is a familiar pattern in infrastructure. Managed databases replaced self-hosted Postgres for most companies. Managed Kubernetes replaced DIY container orchestration. The question is whether reinforcement learning is ready for the same abstraction layer.
The Micro: Give Them a Model, a Prompt, and a Reward
RunRL was founded by Andrew Gritsevskiy and Derik, and is based in San Francisco. They are a three-person team that came through Y Combinator’s Spring 2025 batch, working with YC partner Diana Hu.
The value proposition is clean. You give RunRL a model, a set of prompts, and a reward function that defines what good output looks like. They give you back a better model. That is the entire pitch, and the simplicity is the point. The hard parts all sit behind their API: the infrastructure for distributed RL training, the hyperparameter tuning, and the training stability issues.
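To illustrate what "a reward function that defines what good output looks like" can mean in practice, here is a minimal example of the kind of function a customer might supply: it scores a completion by whether it is valid JSON containing a required field. The signature and names here are hypothetical illustrations; RunRL's actual interface is not publicly documented.

```python
import json

def reward(prompt: str, completion: str) -> float:
    """Hypothetical customer-supplied reward: 1.0 if the completion is a
    valid JSON object with an 'answer' key, else 0.0."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(obj, dict) and "answer" in obj else 0.0

# Example scores for three candidate completions
scores = [
    reward("q", '{"answer": 42}'),   # well-formed, has the key
    reward("q", "not json at all"),  # fails to parse
    reward("q", "[1, 2, 3]"),        # parses, but wrong shape
]
```

The appeal of the model is that this small, domain-specific function is all the customer has to write; everything between it and an improved model is the service's problem.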
The website is live but minimal. It has dark mode detection and basic structure, but it is not a marketing-heavy site. There is no public pricing, no customer logos, no detailed documentation visible from the landing page. This is early. They are clearly in the phase of working closely with initial customers rather than scaling a self-serve product.
What I find interesting about the timing is that the demand for custom RL training is growing faster than the supply of people who know how to do it. Every company that has fine-tuned a model and hit a performance ceiling is a potential RunRL customer. Every team that has tried to implement RLHF from research papers and gotten lost in the engineering complexity is a potential RunRL customer. That is a large and growing addressable market.
The three-person team is lean for an infrastructure play, but RL infrastructure is one of those areas where a small team of people who deeply understand the problem can outpace a larger team that is figuring it out as they go. The key question is whether their platform is general enough to handle diverse reward functions and model architectures, or whether each customer requires custom engineering.
The Verdict
I think RunRL is attacking a real gap in the market. The distance between “we want RL-trained models” and “we can actually do RL training” is wide, and most companies are stuck on the wrong side of it. A managed service that handles the infrastructure complexity is the obvious solution, and I am surprised more companies are not building this.
The risk is that the big cloud providers could offer this as a feature. AWS SageMaker, Google Vertex AI, and Azure ML are all inching toward managed RL training. If Amazon announces “RLHF as a Service” at re:Invent, the independent market contracts quickly. The counter-argument is that cloud provider ML tools tend to be general-purpose and mediocre, and a focused startup can stay ahead on quality and developer experience.
Thirty days out, I want to see customer case studies with measurable improvements. Sixty days out, I want to see whether the platform can handle different model architectures without custom engineering for each customer. Ninety days out, the question is pricing and unit economics. RL training is compute-intensive, and the margin between what customers will pay and what the training actually costs is where this business either works or does not.