August 12, 2026 edition

wafer

AI that makes AI fast

Wafer Optimizes Your GPU Code So You Don't Have to Learn CUDA

The Macro: GPU Performance Is Everyone’s Problem and Almost Nobody Can Fix It

I want to talk about one of the dirtiest secrets in AI infrastructure. Most companies running GPU workloads are leaving enormous performance on the table. Not because they don’t care, but because the people who know how to fix it are absurdly scarce.

Writing optimized GPU kernels is one of the hardest skills in software engineering. You need to understand memory hierarchies, warp scheduling, tensor core utilization, and about fifteen other concepts that most engineers have never encountered. The people who are genuinely good at this can name their salary. And there are maybe a few thousand of them on the planet.

This creates a brutal dynamic. Companies are spending millions on GPU compute, running workloads at a fraction of theoretical throughput, and they cannot fix it because the talent pool is too small. NVIDIA’s tooling (Nsight Compute, CUTLASS) is powerful but requires deep expertise to use effectively. AMD’s ecosystem is even worse. The result is that most AI inference workloads run slow and expensive, and the teams running them just accept the cost.

The market for GPU optimization tooling is not new. TensorRT has been around for years. Triton made kernel writing more accessible. But there is a meaningful gap between “more accessible” and “actually automated.” Most existing tools still require an engineer who understands what they are looking at. The question is whether AI can close that gap, using models to do the profiling, diagnosis, and code generation that currently requires a specialist.

The Micro: Two UChicago Grads Who Worked at the Right Places

Wafer profiles your PyTorch code, identifies bottlenecks, generates optimized kernels, and verifies that the fixes actually work. The workflow is end-to-end: trace to speedup, with compiler-level verification at each step. You do not need to know what PTX is. You do not need to read a flame graph. The system handles it.
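Wafer's internal pipeline is not public, but the trace-to-speedup loop described here — measure, swap in a candidate rewrite, confirm it computes the same thing — can be sketched in miniature with plain Python timing. The function names and the toy workload below are illustrative stand-ins, not Wafer's API:

```python
import time

def profile(fn, *args, repeats=5):
    """Time a function over several runs and return the best wall-clock result."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Baseline: a deliberately naive reduction, standing in for an unoptimized kernel.
def naive_sum(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

# "Optimized" version, standing in for a generated replacement kernel.
def fast_sum(xs):
    return sum(xs)

data = [float(i) for i in range(1_000_000)]

t_before = profile(naive_sum, data)
t_after = profile(fast_sum, data)

# Verification gate: the rewrite must produce the same result before it ships.
assert abs(naive_sum(data) - fast_sum(data)) < 1e-6
print(f"speedup: {t_before / t_after:.1f}x")
```

The shape of the loop is the point: nothing gets reported as a win until the timing delta and the correctness check both pass.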

Steven Arellano and Emilio Andere founded the company. Steven worked at Two Sigma and Sei Labs after studying CS and economics at UChicago. Emilio came from Argonne National Lab and Elicit, with a math background from the same school. They went through Y Combinator’s Summer 2025 batch and are running a four-person team in San Francisco.

The product integrates with Nsight Compute for NVIDIA hardware counters and ROCProfiler for AMD GPUs. That hardware portability piece matters. Most optimization tooling is NVIDIA-only, which locks companies into a single vendor. Wafer claims to work across GPU types, which would be genuinely valuable as AMD and Intel push harder into the AI accelerator market.

The customer list is interesting. Intel, LinkedIn, Pinterest, Datadog, Naver, and MIT are all listed on the homepage. The investor backing includes Jeff Dean and Woj Zaremba as individual investors alongside Fifty Years and Liquid 2. Getting Jeff Dean to invest in your GPU optimization startup is not nothing. The man essentially invented the modern approach to large-scale distributed systems at his day job.

On-demand GPU sandbox environments are part of the offering. You can test optimized kernels on actual B200 and MI300X hardware without provisioning your own instances. That is a nice touch for teams that want to validate performance before deploying to production.

Pricing is enterprise-oriented. Team and Enterprise tiers, no self-serve free tier visible. That makes sense for a product where the value proposition scales with compute spend.

The Verdict

I think Wafer is attacking a real and expensive problem. GPU utilization rates at most companies are embarrassingly low, and the talent to fix it does not exist at scale. If Wafer can reliably deliver 1.5 to 5x speedups through automated profiling and kernel generation, the ROI math is straightforward. Any company spending six figures a month on GPU compute should at least evaluate this.
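That ROI math is easy to sanity-check. The numbers below are illustrative assumptions — the low end of "six figures a month" and the low end of the quoted 1.5 to 5x range — not Wafer's pricing or benchmarks:

```python
# Illustrative ROI check: all numbers here are assumptions, not Wafer's figures.
monthly_gpu_spend = 100_000   # low end of "six figures a month"
speedup = 1.5                 # low end of the quoted 1.5-5x range

# A 1.5x speedup means the same work needs 1/1.5 of the compute.
new_spend = monthly_gpu_spend / speedup
annual_savings = (monthly_gpu_spend - new_spend) * 12
print(f"annual savings at {speedup}x: ${annual_savings:,.0f}")  # → $400,000
```

Even at the pessimistic end of the claimed range, the savings clear the cost of evaluating the tool by a wide margin.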

The risk is trust. Letting an automated system rewrite your inference kernels is a big ask. The verification step is critical. If Wafer’s compiler-level checks are robust, adoption should follow. If teams find edge cases where the optimized code produces different outputs, trust collapses fast and is hard to rebuild.
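It is worth spelling out what robust verification would even look like. Wafer's actual compiler-level checks are not public, but the basic idea can be sketched as differential testing: run the original and the rewritten code on many probe inputs and require agreement within a numerical tolerance. Everything below (function names, the toy ops) is a hypothetical illustration:

```python
import math
import random

def outputs_match(baseline, optimized, inputs, rel_tol=1e-5, abs_tol=1e-9):
    """Differential test: both versions must agree, within floating-point
    tolerance, on every probe input before the rewrite is trusted."""
    for x in inputs:
        if not math.isclose(baseline(x), optimized(x),
                            rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True

# Stand-ins for an original op and a numerically equivalent rewrite.
baseline = lambda x: x * x + 2 * x + 1
optimized = lambda x: (x + 1) ** 2    # algebraically identical

probes = [random.uniform(-1e3, 1e3) for _ in range(1000)]
print(outputs_match(baseline, optimized, probes))  # → True
```

The hard part, and the one that decides whether trust holds, is choosing probe inputs that actually hit the edge cases (denormals, overflow boundaries, empty tensors) rather than just the happy path.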

The competitive landscape includes NVIDIA’s own tooling, Modular (which is building a new AI compiler stack), and various open-source efforts around Triton and MLIR. But none of those are fully automated end-to-end. That is Wafer’s wedge. In thirty days, I want to see case studies with specific numbers. In sixty days, I want to know whether the AMD support is production-ready or aspirational. In ninety days, the question is whether this becomes standard infrastructure for AI teams or remains a nice-to-have that loses to “just rent more GPUs.”