May 17, 2027 edition


Performant serverless GPU inference

Cumulus Labs Promises 12.5-Second GPU Cold Starts and That Changes the Math on Serverless Inference

Cloud Computing · Infrastructure · AI · GPU

The Macro: GPU Inference Is Too Expensive and Too Slow

Running AI models in production is a surprisingly painful experience. You have two bad options. Self-host on your own GPUs, which means provisioning servers, managing CUDA drivers, handling load balancing, and paying for idle compute when traffic is low. Or use a managed provider, which means paying premium prices and often waiting through slow cold starts when your model needs to scale up from zero.

The cold start problem is particularly brutal for serverless GPU platforms. When a model has not been called recently, the infrastructure needs to spin up a GPU, load the model weights, and initialize the runtime. On platforms like Modal, this can take 60 seconds or more. For applications that need fast response times, like real-time image generation or voice transcription, a 60-second wait is unacceptable.
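One way to see that gap from the outside is to time a first (potentially cold) request against subsequent warm ones. The helper below is a generic sketch that works against any callable; it assumes nothing about Cumulus or Modal internals:

```python
import time


def measure_latency(call, n_warm=3):
    """Time a first (potentially cold) call, then several warm calls.

    Returns (cold_seconds, best_warm_seconds). The difference is a rough
    client-side estimate of cold-start overhead: GPU spin-up, weight
    loading, and runtime initialization all land in the first call.
    """
    t0 = time.perf_counter()
    call()
    cold = time.perf_counter() - t0

    warm = []
    for _ in range(n_warm):
        t0 = time.perf_counter()
        call()
        warm.append(time.perf_counter() - t0)
    return cold, min(warm)
```

In practice `call` would be something like `lambda: requests.post(endpoint, json=payload)` against a scaled-to-zero deployment; taking the best of several warm calls filters out network jitter.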

The cost problem is equally real. GPU compute is expensive. If you keep instances warm to avoid cold starts, you pay for idle time. If you scale to zero to save money, you hit cold starts. Teams are constantly trading off between cost and latency, and neither option is great.
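That trade-off is easy to put in numbers. The break-even sketch below uses illustrative hourly rates, not any provider's actual pricing: serverless billing typically carries a higher per-hour rate than a reserved instance, so scale-to-zero only wins below some utilization threshold.

```python
def breakeven_busy_hours(warm_hourly, serverless_hourly, hours_per_month=730):
    """Busy-hours threshold below which scale-to-zero beats always-warm.

    Always-warm costs warm_hourly * hours_per_month regardless of traffic;
    scale-to-zero costs serverless_hourly * busy_hours. Rates here are
    hypothetical, chosen only to show the shape of the comparison.
    """
    return warm_hourly * hours_per_month / serverless_hourly


# e.g. a $1.10/hr reserved GPU vs $2.50/hr serverless billing:
# scale-to-zero is cheaper if the model is busy fewer than ~321 hours/month.
threshold = breakeven_busy_hours(1.10, 2.50)
```

The point of the arithmetic: for spiky or low-traffic workloads, scale-to-zero wins comfortably, but only if cold starts are short enough to be tolerable at the moments traffic does arrive.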

Cumulus Labs, backed by Y Combinator, is attacking both problems simultaneously. They claim 12.5-second GPU cold starts, which they say is roughly 4x faster than Modal. And they offer scale-to-zero pricing, so you only pay for the GPU time your models actually consume.

The Micro: One Function, Zero Infrastructure

The developer experience is minimalist. You deploy a model with a single Python function. Cumulus handles GPU selection, autoscaling, failover, and load balancing. No infrastructure configuration required.

The 12.5-second cold start claim is the headline number, and if it holds up in production, it is a meaningful improvement. Most serverless GPU platforms take 30-90 seconds for cold starts depending on model size. Cutting that to 12.5 seconds makes serverless GPU inference viable for a much broader range of applications.

The founding team brings relevant infrastructure experience. Suryaa Rajinikanth was Lead Engineer at TensorDock, building distributed GPU marketplaces. Veer Shah led a Space Force SBIR contract for military satellite communications and contributed to NASA programs. The founders hold CS degrees from Georgia Tech and UW-Madison.

Cumulus supports a range of workloads: LLMs, image generation, speech-to-text, and computer vision models. They are also part of the NVIDIA Inception Program, which provides access to NVIDIA’s technical resources and early hardware.

The competitive field is intense. Modal is the most direct competitor, offering serverless GPU compute with a strong developer experience. RunPod provides on-demand GPU instances at competitive prices. Replicate focuses on running open-source models. Together AI and Fireworks target inference specifically. Each has different strengths, but none has solved the cold start problem as aggressively as Cumulus claims to have.

The risk is that GPU infrastructure is a scale game. The biggest providers have the most hardware, the best pricing from GPU vendors, and the deepest pockets to absorb losses while building market share. Cumulus needs to demonstrate that their technical advantages are durable and not just a function of operating at small scale.

The Verdict

GPU inference infrastructure is one of the most competitive spaces in tech right now, and for good reason. Every AI application needs it. The market is enormous and growing.

At 30 days: do the 12.5-second cold starts hold up under real production workloads with large models? The benchmark number needs to be validated across different model sizes and architectures.

At 60 days: what is the cost comparison at scale versus Modal, RunPod, and Replicate? The pricing advantage needs to be real and significant, not just marginal.

At 90 days: how many production applications are running on Cumulus, and what is the reliability track record? Uptime and consistency matter as much as raw performance for production workloads.

I think Cumulus has identified the right bottleneck. Cold starts are the single biggest barrier to serverless GPU adoption. If they have genuinely solved that problem at the infrastructure level, they have a real competitive advantage. But GPU infrastructure is a knife fight, and they will need to scale fast to stay ahead.