The Macro: Inference Cost Is the Real AI Bottleneck Now
I keep seeing the same conversation play out. A team picks a model, gets it running, and then discovers that serving it to actual users costs more than they budgeted. The model works. The inference bill does not.
This is the stage the AI industry is in. Training costs get the headlines. Inference costs determine whether a product is viable. When you are running a 70-billion-parameter model and serving thousands of concurrent requests, the difference between 26,000 tokens per second and 36,000 tokens per second is not academic. It is the difference between needing eight H100s and needing six. At current GPU rental prices, that is tens of thousands of dollars per month.
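The arithmetic behind that claim is easy to check. Here is a back-of-envelope sketch using the throughput figures above; the rental rate is a hypothetical placeholder, not a quoted price, and real pricing varies widely by provider.

```python
import math

# Throughput figures from the article: tokens/sec on an 8xH100 node.
BASELINE_TPS = 26_000   # interpreter-style serving (e.g. vLLM)
COMPILED_TPS = 36_000   # compiled serving (Luminal's reported number)

# Hypothetical H100 rental rate, for illustration only.
DOLLARS_PER_GPU_HOUR = 2.50
HOURS_PER_MONTH = 730

def gpus_for_demand(demand_tps: float, node_tps: float, node_gpus: int = 8) -> int:
    """GPUs needed to serve `demand_tps`, scaling linearly from one node."""
    per_gpu_tps = node_tps / node_gpus
    return math.ceil(demand_tps / per_gpu_tps)

# Serve the traffic that ten 8-GPU baseline nodes would handle.
demand = 260_000
baseline_gpus = gpus_for_demand(demand, BASELINE_TPS)   # 80
compiled_gpus = gpus_for_demand(demand, COMPILED_TPS)   # 58
monthly_savings = (baseline_gpus - compiled_gpus) * DOLLARS_PER_GPU_HOUR * HOURS_PER_MONTH

print(baseline_gpus, compiled_gpus, round(monthly_savings))  # 80 58 40150
```

At fleet scale the 38 percent throughput gap compounds into exactly the "tens of thousands per month" range, which is why the per-GPU number matters more than the headline benchmark.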
The inference stack right now is dominated by a few options, none of which are ideal. vLLM is the open-source standard, and it is good but not fast. TensorRT-LLM from NVIDIA is faster but locks you into NVIDIA hardware and is notoriously painful to set up. PyTorch native inference is slow enough that nobody uses it in production if they can avoid it. Then there are managed services from Fireworks, Together, and Anyscale, where you trade control for convenience.
The underlying problem is architectural. Most inference frameworks are interpreters. They take a model graph and execute it operation by operation at runtime, making scheduling decisions on the fly. This is flexible but inherently slower than compiling the model into optimized native code ahead of time. It is the same tradeoff that separates interpreted languages like Python from compiled languages like C. Interpreters are easy to work with. Compilers are fast.
Almost nobody in the inference space is taking the compiler approach seriously. That is the gap Luminal is targeting.
The Micro: Three People With a Compiler and $5.3 Million
Joe Fioti (CEO), Matthew Gunton, and Jake Stevens founded Luminal and went through Y Combinator Summer 2025. The team is three people in San Francisco. They raised $5.3 million, announced in November 2025 via TechCrunch. Their YC partner is Jared Friedman, which is notable because Friedman tends to work with deeply technical infrastructure companies.
The product is an ahead-of-time ML compiler. You give it a model in PyTorch or Hugging Face format, and Luminal compiles it into optimized native GPU code. No runtime interpretation. No dynamic graph execution. The compiler analyzes the entire computation graph, applies hardware-aware optimizations like operator fusion, memory tiling, and kernel scheduling, and emits code that runs directly on the GPU with no interpreter overhead at runtime.
The benchmark numbers are striking. On a GPT-class 120-billion-parameter model running on 8xH100 SXM GPUs, Luminal reports 36,000 tokens per second. vLLM gets 26,000. TensorRT-LLM gets 28,000. PyTorch gets 3,000. If those numbers hold up in production workloads and not just cherry-picked benchmarks, this is a meaningful improvement.
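The relative gaps those figures imply are worth stating precisely, since "faster" hides a lot:

```python
# Throughput figures quoted above: tokens/sec for a 120B-parameter
# model on 8xH100 SXM, as reported by Luminal.
reported = {
    "luminal": 36_000,
    "tensorrt-llm": 28_000,
    "vllm": 26_000,
    "pytorch": 3_000,
}

# Luminal's claimed speedup over each alternative.
speedups = {name: reported["luminal"] / tps for name, tps in reported.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
# vLLM works out to ~1.38x, TensorRT-LLM to ~1.29x, eager PyTorch to 12x.
```

So the gap over vLLM is about 38 percent, not a multiple; the order-of-magnitude win is only over eager PyTorch, which nobody serious runs in production anyway.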
The compilation pipeline works in three stages. First, the model is converted into a graph-level intermediate representation, a dataflow graph that captures the full computation. Second, hardware-aware optimizations are applied: fusion of adjacent operations, tiling strategies that match GPU memory hierarchies, and memory planning that minimizes data movement. Third, the compiler generates native kernel code that runs directly on the hardware.
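To make the first two stages concrete, here is a deliberately minimal sketch. Nothing below reflects Luminal's actual IR or data structures; it only illustrates the idea of capturing a dataflow graph and applying one classic optimization, fusing runs of adjacent elementwise operations into a single kernel so intermediate results never round-trip through GPU memory.

```python
from dataclasses import dataclass

# Toy graph-level IR: a linear sequence of ops standing in for a
# real dataflow graph. Invented for illustration, not Luminal's IR.
ELEMENTWISE = {"add", "mul", "relu"}

@dataclass
class Node:
    op: str  # e.g. "matmul", "add", "relu"

def fuse_elementwise(nodes: list[Node]) -> list[Node]:
    """Greedily merge runs of adjacent elementwise ops into fused kernels."""
    fused, run = [], []
    for n in nodes:
        if n.op in ELEMENTWISE:
            run.append(n.op)          # keep accumulating the fusable run
        else:
            if run:                   # flush the run as one fused kernel
                fused.append(Node("fused(" + "+".join(run) + ")"))
                run = []
            fused.append(n)           # non-fusable op passes through
    if run:
        fused.append(Node("fused(" + "+".join(run) + ")"))
    return fused

# matmul -> bias add -> relu: the two elementwise ops become one kernel.
graph = [Node("matmul"), Node("add"), Node("relu")]
print([n.op for n in fuse_elementwise(graph)])  # ['matmul', 'fused(add+relu)']
```

A real compiler does this over an arbitrary DAG with legality checks, and layers tiling and memory planning on top, but the payoff is the same: fewer kernel launches and less traffic through the memory hierarchy.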
Beyond the compiler itself, Luminal is building what they call an Inference OS. This is a dynamic scheduling and load-balancing layer that manages workloads across heterogeneous compute clusters. It supports CPUs, GPUs, and ASICs, and can redistribute workloads in real time as demand fluctuates. The vision is that you do not just compile one model for one GPU. You compile your entire inference fleet and let the OS manage everything.
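The scheduling idea can be sketched in a few lines. This is a hypothetical illustration of least-utilization routing across a mixed fleet; the class and method names are invented, and Luminal's actual Inference OS presumably does far more (preemption, batching, compiled-kernel awareness).

```python
# Hypothetical load balancer over a heterogeneous fleet.
# All names here are invented for illustration, not Luminal's API.

class Fleet:
    def __init__(self, workers: list[tuple[str, float]]):
        # name -> [capacity in tokens/sec, currently assigned load]
        self.workers = {name: [capacity, 0.0] for name, capacity in workers}

    def assign(self, cost: float) -> str:
        """Route a request to the least-utilized device (load / capacity)."""
        name = min(self.workers, key=lambda n: self.workers[n][1] / self.workers[n][0])
        self.workers[name][1] += cost
        return name

# Mixed fleet: a GPU node, a CPU pool, and an ASIC, with rough capacities.
fleet = Fleet([("h100-node", 36_000), ("cpu-pool", 2_000), ("asic", 20_000)])
print([fleet.assign(9_000) for _ in range(4)])
# -> ['h100-node', 'cpu-pool', 'asic', 'h100-node']
```

Even this toy version shows the design question: utilization-based routing spreads load, but a real scheduler also has to weigh per-device latency and whether a compiled kernel even exists for that hardware target.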
They offer two deployment modes. Luminal Cloud gives you serverless endpoints with auto-scaling and pay-per-use pricing. On-prem gives you a licensed deployment with dedicated support and custom kernel optimization. The dual approach covers both startups that want to get running fast and enterprises that need to keep models inside their own infrastructure.
The GitHub repository is public, which suggests they are open-sourcing at least part of the stack. That is the right move for an infrastructure tool competing against vLLM’s open-source momentum. Developers are not going to adopt a closed-source inference compiler when vLLM is free and good enough for many workloads.
The Verdict
Luminal is making a bet that the inference stack will eventually look like the traditional software compilation stack: ahead-of-time compilation, static optimization, and hardware-specific code generation. If that bet is right, this is one of the most important infrastructure companies in the current AI wave.
The benchmarks are the strongest argument. A roughly 38 percent improvement over vLLM on the same hardware, and 12x over eager PyTorch, is not incremental. That is the kind of performance gap that changes deployment economics. Companies spending millions on GPU clusters will pay attention to anything that lets them serve the same traffic on fewer machines.
My concern is the compilation tradeoff itself. Ahead-of-time compilation is fast at runtime but inflexible. If your model changes, you recompile. If your batch sizes are highly variable, a statically optimized kernel might not be optimal for every workload shape. The interpreter-based frameworks like vLLM are slower but more adaptive, and that adaptability matters in production where traffic patterns are unpredictable.
The team size is also worth noting. Three people building a compiler, a runtime, a cloud service, and an inference OS is ambitious to the point of being concerning. Compilers are some of the hardest software to build correctly. The bugs are subtle, the edge cases are infinite, and the testing surface is enormous. $5.3 million buys runway, but this is a problem that could use ten engineers tomorrow.
At 30 days, I want to see independent benchmark validation from someone outside the company. At 60 days, the question is compilation time. How long does it take to compile a 70B model, and how does that fit into deployment workflows? At 90 days, I want to know whether any production API provider has switched from vLLM or TensorRT to Luminal and seen the promised improvements at scale.
If the numbers are real, this team is sitting on something that every AI company will eventually need. Inference is the cost center that matters most, and a 30-to-40 percent improvement in cost efficiency is worth a lot of money.