Key Features & Capabilities
Preemptive Inference Scaling with Cache Hierarchy
What It Is
Modern inference workloads are bursty. Traffic spikes suddenly, then drops. Traditional approaches wait for the spike to hit, then scramble to allocate resources—resulting in slow cold starts.
Cumulus takes a different approach: we predict when spikes will occur and preemptively move your model up the cache hierarchy before the spike hits.
The Cache Hierarchy
Your model's CUDA checkpoint moves through multiple tiers based on predicted demand:
S3 (Cloud Storage)
↓ [Predictive signal: spike incoming]
Disk (Local SSD)
↓ [Spike imminent]
Host RAM
↓ [Spike now]
VRAM (GPU Memory) ← Ready to serve instantly
How Predictive Scheduling Works
Our system continuously monitors:
- Historical traffic patterns for your models
- Current request rates across regions
- Seasonal/temporal trends
When we detect signs of an incoming spike, we move the checkpoint up the hierarchy proactively. By the time requests arrive, your model is already in VRAM, ready for instant serving.
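As a rough mental model, the scheduling loop maps a demand forecast to a target tier and promotes the checkpoint one step at a time. The sketch below is illustrative only: the thresholds and the names predicted_rps, target_tier, and reconcile are assumptions, not part of the Cumulus API.

# Tiers ordered from coldest to hottest, mirroring the diagram above
TIERS = ["s3", "disk", "host_ram", "vram"]

def target_tier(predicted_rps: float) -> str:
    """Map a traffic forecast to the tier a checkpoint should occupy.
    The thresholds are made up purely for illustration."""
    if predicted_rps > 100:
        return "vram"       # spike now: serve instantly
    if predicted_rps > 10:
        return "host_ram"   # spike imminent
    if predicted_rps > 1:
        return "disk"       # spike incoming
    return "s3"             # idle: keep only the cheap cloud copy

def reconcile(current: str, predicted_rps: float) -> str:
    """Promote the checkpoint one tier toward where the forecast says it should be."""
    cur, tgt = TIERS.index(current), TIERS.index(target_tier(predicted_rps))
    return TIERS[cur + 1] if cur < tgt else current

reconcile("s3", predicted_rps=50)   # -> "disk": start moving up before the spike lands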
Why This Matters
- Zero Cold Starts: No waiting for models to load from cloud storage
- Burst Ready: Handle traffic spikes without latency degradation
- Automatic: Happens behind the scenes; you don't configure anything
Cold Start Elimination via CUDA Checkpoints
What It Is
A CUDA checkpoint captures the complete live execution state of a running model: weights in VRAM, intermediate tensors, metadata, everything. It's like taking a snapshot of the GPU's memory.
How Cumulus Uses This
Rather than loading model weights fresh each time (slow), we:
- Capture the checkpoint of your running model
- Replicate it across our global network of providers
- Position replicas in strategic locations (close to your users)
- Restore from the nearest checkpoint when a request comes in
This is orders of magnitude faster than downloading weights from cloud storage.
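To make the flow concrete, here is a minimal sketch of the capture, replicate, and restore steps as a small replica registry. The class, method, and region names are hypothetical and only illustrate the idea, not the actual Cumulus implementation.

from dataclasses import dataclass, field

@dataclass
class CheckpointRegistry:
    """Tracks which regions hold a replica of one model's CUDA checkpoint."""
    model: str
    replicas: set[str] = field(default_factory=set)

    def capture_and_replicate(self, regions: list[str]) -> None:
        # 1. Snapshot the live GPU state (weights, tensors, metadata).
        # 2. Copy the snapshot to each target region in the provider network.
        self.replicas.update(regions)

    def restore_nearest(self, user_region: str) -> str:
        # Restore from the replica closest to the user; "closest" is
        # simplified here to an exact region match with a fallback.
        if user_region in self.replicas:
            return user_region
        return next(iter(self.replicas))  # fall back to any available replica

registry = CheckpointRegistry(model="detector_model")
registry.capture_and_replicate(["eu-central", "us-east", "ap-south"])
registry.restore_nearest("eu-central")   # -> "eu-central"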
Benefits
- Instant Serving: Restore execution state in milliseconds, not seconds
- Distributed Ready: Replicas are pre-warmed across regions
- No Warm-Up: Models are immediately at peak performance
Multi-Tenant Fractional GPU Optimization
What It Is
Most developers rent full GPUs even if they only need 25% of the capacity. Cumulus automatically detects small models and packs them together on shared GPUs.
How It Works
When you deploy a model, our orchestrator analyzes:
- Model size (parameter count)
- Memory footprint
- Compute requirements
If it's small (< 20B parameters), we automatically bin it with other small models on the same GPU. You pay only for the capacity you use.
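One way to picture the packing step is first-fit bin packing over VRAM, where each shared GPU is a bin and each small model is an item. A minimal sketch, assuming a 40GB GPU and hypothetical model sizes (they match the example below); this is not the actual orchestrator logic:

GPU_VRAM_GB = 40  # assumed GPU size for illustration

def pack_models(model_sizes_gb: dict[str, float]) -> list[dict[str, float]]:
    """First-fit-decreasing packing of small models onto shared GPUs (illustrative)."""
    gpus: list[dict[str, float]] = []
    for name, size in sorted(model_sizes_gb.items(), key=lambda kv: -kv[1]):
        for gpu in gpus:
            if sum(gpu.values()) + size <= GPU_VRAM_GB:
                gpu[name] = size  # fits alongside the existing tenants
                break
        else:
            gpus.append({name: size})  # no room anywhere: open a new GPU
    return gpus

pack_models({"model_a": 4, "model_b": 3, "model_c": 2})
# -> [{"model_a": 4, "model_b": 3, "model_c": 2}]: all three share one 40GB GPU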
Example
- Model A needs 4GB VRAM
- Model B needs 3GB VRAM
- Model C needs 2GB VRAM
- Total: 9GB used on a 40GB GPU
| Scenario | Cost |
|---|---|
| Without Cumulus | 3 dedicated GPUs × $X/hour = 3X |
| With Cumulus | Fractional share of 1 GPU ≈ 0.3X |
Benefits
- Cost Efficiency: Pay for fractional GPU capacity
- Automatic Binning: No manual configuration required
- No Performance Loss: Each model gets isolated compute and memory limits
import cumulus  # Cumulus SDK client

# Deploy the model; the orchestrator decides whether it can share a GPU
endpoint = cumulus.deploy(
    model="detector_model",
    workload_type="inference"
)

# Run inference against the deployed endpoint
result = endpoint("input_data_here")
In this example, if detector_model is small, Cumulus automatically packs it with other small models on the same GPU. You don't need to do anything—the SDK handles it.
Global Geographic Proximity & Ultra-Fast Scaling
What It Is
We maintain aggregated GPU infrastructure across major regions and individual countries worldwide. This means two things:
- Your inference runs geographically close to your users (low latency)
- You can scale from 1 request to billions instantly because replicas are pre-positioned globally
Geographic Distribution
We have providers in:
- Every major cloud region (US, EU, Asia, etc.)
- Individual countries (Switzerland, Germany, India, etc.)
- Strategic data centers for regulatory compliance
Pre-Positioned Replicas
Rather than spinning up new instances when traffic spikes, we maintain warm replicas across our network. When a request arrives from a user in Switzerland, it routes to the closest replica—already running, already cached.
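One way to picture the routing decision is a nearest-replica lookup by great-circle distance. The region names, coordinates, and helper functions below are assumptions made for illustration, not Cumulus internals.

import math

# Hypothetical warm replica locations: region -> (latitude, longitude)
REPLICAS = {"eu-ch": (47.4, 8.5), "eu-de": (50.1, 8.7), "us-east": (39.0, -77.5)}

def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def nearest_replica(user_location):
    """Route the request to the warm replica closest to the user."""
    return min(REPLICAS, key=lambda r: distance_km(user_location, REPLICAS[r]))

nearest_replica((46.9, 7.4))   # a user in Bern -> "eu-ch"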
Ultra-Fast Scaling
Because replicas are distributed:
- Low latency: requests route to the nearest replica and are served immediately
- Load distribution: requests spread across existing replicas
- Intelligent scaling: additional replicas spin up in high-demand regions
No cold starts. No orchestration delays. Just instant scaling.
Example: EU Customer
You have users in Switzerland, Germany, and France. Cumulus automatically:
- Maintains replicas in each country
- Routes Swiss requests to the Switzerland replica
- Routes German requests to the Germany replica
- Scales to other EU regions if one becomes overloaded
All transparent. You don't manage it.
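For intuition only, the sketch below prefers the user's own region and spills over to the least-loaded replica when the home region is overloaded. The region names, load numbers, and threshold are made up for illustration and are not Cumulus internals.

# Hypothetical current load per EU replica (0.0 = idle, 1.0 = saturated)
LOAD = {"ch": 0.95, "de": 0.40, "fr": 0.30}

def route(user_region: str, max_load: float = 0.9) -> str:
    """Prefer the user's own region; spill over to the least-loaded EU replica."""
    if LOAD.get(user_region, 1.0) < max_load:
        return user_region
    return min(LOAD, key=LOAD.get)  # home region overloaded: pick the least-loaded one

route("de")   # -> "de": the German replica has headroom
route("ch")   # -> "fr": the Swiss replica is over the threshold, so spill over within the EU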