
Key Features & Capabilities

Preemptive Inference Scaling with Cache Hierarchy

What It Is

Modern inference workloads are bursty. Traffic spikes suddenly, then drops. Traditional approaches wait for the spike to hit, then scramble to allocate resources—resulting in slow cold starts.

Cumulus takes a different approach: we predict when spikes will occur and preemptively move your model up the cache hierarchy before the spike hits.

The Cache Hierarchy

Your model's CUDA checkpoint moves through multiple tiers based on predicted demand:

S3 (Cloud Storage)
↓ [Predictive signal: spike incoming]
Disk (Local SSD)
↓ [Spike imminent]
Host RAM
↓ [Spike now]
VRAM (GPU Memory) ← Ready to serve instantly
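
As a rough mental model, the sketch below encodes the tiers from the diagram and how a predictive signal promotes a checkpoint toward the GPU. The Tier enum, the signal names, and promote are illustrative assumptions for this page, not part of the Cumulus SDK.

from enum import IntEnum

class Tier(IntEnum):
    # Ordered coldest to hottest; a higher value means closer to the GPU.
    S3 = 0        # cloud object storage
    DISK = 1      # local SSD
    HOST_RAM = 2  # host memory
    VRAM = 3      # GPU memory, ready to serve

# Each predictive signal maps to the tier the checkpoint should reach.
TARGET_TIER = {
    "spike_incoming": Tier.DISK,
    "spike_imminent": Tier.HOST_RAM,
    "spike_now": Tier.VRAM,
}

def promote(current: Tier, signal: str) -> Tier:
    """Move the checkpoint up the hierarchy; never demote on a signal."""
    return max(current, TARGET_TIER.get(signal, current))

print(promote(Tier.S3, "spike_imminent").name)  # -> HOST_RAM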

How Predictive Scheduling Works

Our system continuously monitors:

  • Historical traffic patterns for your models
  • Current request rates across regions
  • Seasonal/temporal trends

When we detect signs of an incoming spike, we move the checkpoint up the hierarchy proactively. By the time requests arrive, your model is already in VRAM, ready for instant serving.
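
A toy version of that decision, assuming only a single signal (recent request rate versus the historical baseline), might look like the following. The function name and the 2x threshold are illustrative; the production scheduler also weighs per-region rates and seasonal trends as described above.

import statistics

def spike_predicted(recent_rps: list[float], historical_rps: list[float],
                    factor: float = 2.0) -> bool:
    """Flag an incoming spike when the recent request rate rises well above
    the model's historical baseline."""
    baseline = statistics.mean(historical_rps)
    return statistics.mean(recent_rps) > factor * baseline

# Illustrative numbers: traffic has roughly doubled over the recent window.
if spike_predicted([95, 110, 120], [40, 55, 50, 45, 60]):
    print("promote checkpoint up the cache hierarchy")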

Why This Matters

  • Zero Cold Starts: No waiting for models to load from cloud storage
  • Burst Ready: Handle traffic spikes without latency degradation
  • Automatic: Happens behind the scenes; you don't configure anything

Cold Start Elimination via CUDA Checkpoints

What It Is

A CUDA checkpoint captures the complete live execution state of a running model: weights in VRAM, intermediate tensors, metadata, everything. It's like taking a snapshot of the GPU's memory.

How Cumulus Uses This

Rather than loading model weights fresh each time (slow), we:

  1. Capture the checkpoint of your running model
  2. Replicate it across our global network of providers
  3. Position replicas in strategic locations (close to your users)
  4. Restore from the nearest checkpoint when a request comes in

This is orders of magnitude faster than downloading weights from cloud storage.
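
The flow in steps 1-4 can be sketched with plain Python objects. Checkpoint, capture, and replicate here are illustrative stand-ins rather than Cumulus APIs; they only show how one snapshot fans out to pre-warmed regional replicas.

from dataclasses import dataclass

@dataclass
class Checkpoint:
    model_id: str
    gpu_state: bytes  # weights in VRAM, intermediate tensors, metadata

def capture(model_id: str) -> Checkpoint:
    """Step 1: snapshot the live execution state of the running model."""
    return Checkpoint(model_id=model_id, gpu_state=b"<serialized VRAM contents>")

def replicate(ckpt: Checkpoint, regions: list[str]) -> dict[str, Checkpoint]:
    """Steps 2-3: copy the checkpoint to providers in each target region."""
    return {region: ckpt for region in regions}

# Step 4 happens at request time: restore from whichever replica is closest.
replicas = replicate(capture("detector_model"), ["eu-central", "us-east", "ap-south"])
print(sorted(replicas))  # -> ['ap-south', 'eu-central', 'us-east']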

Benefits

  • Instant Serving: Restore execution state in milliseconds, not seconds
  • Distributed Ready: Replicas are pre-warmed across regions
  • No Warm-Up: Models are immediately at peak performance

Multi-Tenant Fractional GPU Optimization

What It Is

Most developers rent full GPUs even if they only need 25% of the capacity. Cumulus automatically detects small models and packs them together on shared GPUs.

How It Works

When you deploy a model, our orchestrator analyzes:

  • Model size (parameter count)
  • Memory footprint
  • Compute requirements

If it's small (< 20B parameters), we automatically bin it with other small models on the same GPU. You pay only for the capacity you use.

Example

  • Model A needs 4GB VRAM
  • Model B needs 3GB VRAM
  • Model C needs 2GB VRAM
  • Total: 9GB used on a 40GB GPU
Scenario             Cost
Without Cumulus      3 × $X/hour = 3X cost
With Cumulus         1 × $X/hour (fractional) ≈ 0.3X cost
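
The packing itself is a classic bin-packing problem. The sketch below uses a first-fit heuristic and the VRAM figures from the example above; it is an assumption about the general idea, not the exact algorithm the orchestrator runs.

GPU_VRAM_GB = 40  # capacity of one shared GPU

def pack(models: dict[str, int], capacity: int = GPU_VRAM_GB) -> list[dict[str, int]]:
    """First-fit decreasing: place each model on the first GPU with enough free VRAM."""
    gpus: list[dict[str, int]] = []
    for name, vram in sorted(models.items(), key=lambda kv: kv[1], reverse=True):
        for gpu in gpus:
            if sum(gpu.values()) + vram <= capacity:
                gpu[name] = vram
                break
        else:
            gpus.append({name: vram})
    return gpus

# Models A, B, and C (9 GB total) share a single 40 GB GPU.
print(pack({"model_a": 4, "model_b": 3, "model_c": 2}))
# -> [{'model_a': 4, 'model_b': 3, 'model_c': 2}]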

Benefits

  • Cost Efficiency: Pay for fractional GPU capacity
  • Automatic Binning: No manual configuration required
  • No Performance Loss: Each model gets isolated compute and memory limits
# Example: deploying a small model with the Cumulus SDK
import cumulus

# Deploy the model; the orchestrator decides GPU placement automatically
endpoint = cumulus.deploy(
    model="detector_model",
    workload_type="inference"
)

# Run inference against the deployed endpoint
result = endpoint("input_data_here")

In this example, if detector_model is small, Cumulus automatically packs it with other small models on the same GPU. You don't need to do anything—the SDK handles it.


Global Geographic Proximity & Ultra-Fast Scaling

What It Is

We maintain aggregated GPU infrastructure across every major cloud region and in individual countries worldwide. This means two things:

  1. Your inference runs geographically close to your users (low latency)
  2. You can scale from 1 request to billions instantly because replicas are pre-positioned globally

Geographic Distribution

We have providers in:

  • Every major cloud region (US, EU, Asia, etc.)
  • Individual countries (Switzerland, Germany, India, etc.)
  • Strategic data centers for regulatory compliance

Pre-Positioned Replicas

Rather than spinning up new instances when traffic spikes, we maintain warm replicas across our network. When a request arrives from a user in Switzerland, it routes to the closest replica—already running, already cached.
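
Routing can be pictured as choosing the lowest-latency replica that still has headroom. The Replica fields, the 0.9 load cutoff, and the numbers below are illustrative assumptions, not the production router.

from dataclasses import dataclass

@dataclass
class Replica:
    region: str
    latency_ms: float  # network proximity to the requesting user
    load: float        # fraction of capacity currently in use

def route(replicas: list[Replica], max_load: float = 0.9) -> Replica:
    """Prefer the nearest replica with headroom; overflow to the next closest
    when a region is saturated."""
    candidates = [r for r in replicas if r.load < max_load] or replicas
    return min(candidates, key=lambda r: r.latency_ms)

# A Swiss user lands on the Switzerland replica while it has capacity.
replicas = [Replica("ch", 4.0, 0.55), Replica("de", 11.0, 0.30), Replica("fr", 14.0, 0.70)]
print(route(replicas).region)  # -> ch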

Ultra-Fast Scaling

Because replicas are distributed:

  • Low latency: Requests route to the nearest replica and are served immediately
  • Load distribution: Requests spread across existing replicas
  • Intelligent scaling: Additional replicas spin up in high-demand regions

No cold starts. No orchestration delays. Just instant scaling.

Example: EU Customer

You have users in Switzerland, Germany, and France. Cumulus automatically:

  • Maintains replicas in each country
  • Routes Swiss requests to the Switzerland replica
  • Routes German requests to the Germany replica
  • If one region is overloaded, scales traffic to other replicas within the EU

All transparent. You don't manage it.