Use Cases & Scenarios
Burst Inference for Detection Models
The Challenge
Security detection models receive bursty traffic: quiet periods followed by sudden spikes when threats are detected.
How Cumulus Helps
endpoint = client.deploy(
    model="threat_detector",
    workload_type="inference"
)

# Whether 1 detection or 1M, same latency
for threat in threat_stream:
    result = endpoint(threat)
Our predictive scheduler anticipates spikes and preemptively moves your model into VRAM. When the spike hits, the model is already resident. No cold starts, consistent latency.
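For illustration, the sketch below fires a burst of concurrent requests at the endpoint and reports latency percentiles. It assumes the endpoint object returned by client.deploy() above is callable on a single payload; the payloads themselves are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

def timed_call(payload):
    """Send one request and return its latency in milliseconds."""
    start = time.perf_counter()
    endpoint(payload)  # assumes the deployed endpoint is callable, as above
    return (time.perf_counter() - start) * 1000

# Simulate a spike: 500 detections arriving at once (placeholder payloads).
burst = [{"event_id": i, "features": [0.0] * 64} for i in range(500)]

with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = sorted(pool.map(timed_call, burst))

p50 = latencies[len(latencies) // 2]
p99 = latencies[int(len(latencies) * 0.99)]
print(f"p50={p50:.1f}ms p99={p99:.1f}ms")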
Multi-Region Compliance
The Challenge
You have customers in regulated regions (Switzerland, Saudi Arabia). Data must stay in-region.
How Cumulus Helps
# Deploy in Switzerland only
endpoint_ch = client.deploy(
    model="classifier",
    workload_type="inference",
    region="eu-ch"
)

# Deploy in Saudi Arabia only
endpoint_sa = client.deploy(
    model="classifier",
    workload_type="inference",
    region="me-sa"
)
Cumulus maintains providers in every region. Your model runs only where required. No data leaves the specified region.
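A minimal client-side routing sketch, assuming the two region-pinned endpoints deployed above; the customer_region value and the request payload are your own application data, not part of the Cumulus API.
# Map each regulated region to its region-pinned endpoint (from the deploys above).
ENDPOINTS_BY_REGION = {
    "eu-ch": endpoint_ch,
    "me-sa": endpoint_sa,
}

def classify_in_region(customer_region, payload):
    """Route the request to the endpoint pinned to the customer's region.

    Raises instead of falling back, so data never crosses a border by accident.
    """
    try:
        endpoint = ENDPOINTS_BY_REGION[customer_region]
    except KeyError:
        raise ValueError(f"No in-region deployment for {customer_region!r}")
    return endpoint(payload)

# Example: a Swiss customer's request stays on the eu-ch deployment.
result = classify_in_region("eu-ch", {"text": "transaction description"})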
Multiple Specialized Models
The Challenge
Your security system uses 5 different small models (detector, classifier, ranker, etc.). Renting 5 separate GPUs is wasteful.
How Cumulus Helps
detector = client.deploy(model="threat_detector", workload_type="inference")
classifier = client.deploy(model="threat_classifier", workload_type="inference")
ranker = client.deploy(model="threat_ranker", workload_type="inference")
Cumulus automatically:
- Recognizes that all three are small models
- Bins them onto the same GPU (fractional allocation)
- Bills you for the fraction you use, roughly one-third the cost of three separate GPUs
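To show how the co-located endpoints might be used together, here is a sketch that chains them into a single detection pipeline. The event and result shapes (the is_threat flag, the category field) are illustrative assumptions, not part of the Cumulus API.
def analyze(event):
    """Run one security event through the detect -> classify -> rank pipeline.

    Each call hits a fractional-GPU endpoint deployed above; the event and
    result shapes are illustrative placeholders.
    """
    detection = detector(event)
    if not detection.get("is_threat"):
        return None

    category = classifier(detection)
    ranking = ranker({"detection": detection, "category": category})
    return ranking

# Example: score a single event end to end.
top_threat = analyze({"source_ip": "203.0.113.7", "bytes": 51200})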
Performance-Critical Inference
The Challenge
Your application demands sub-100ms latency for every inference, even during traffic spikes.
How Cumulus Helps
endpoint = client.deploy(
    model="critical_model",
    workload_type="inference",
    region="eu-ch",                              # Deploy near users
    cache_config={"model_cache": "aggressive"},  # Keep in VRAM
    min_replicas=2                               # Always have backups ready
)
With geographic proximity and predictive scaling:
- Model is always in VRAM in the closest region
- Replicas are pre-warmed globally
- Spikes are anticipated and handled preemptively
Result: Consistent sub-100ms latency across all inference requests.
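One way to verify this against your own traffic is a thin client-side wrapper that times every call and flags budget violations. This is a monitoring sketch around the endpoint deployed above, not a Cumulus feature.
import logging
import time

LATENCY_BUDGET_MS = 100  # the SLO this deployment is expected to meet

def call_with_budget(payload):
    """Call the endpoint and log any request that blows the latency budget."""
    start = time.perf_counter()
    result = endpoint(payload)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > LATENCY_BUDGET_MS:
        logging.warning("Inference took %.1f ms (budget %d ms)",
                        elapsed_ms, LATENCY_BUDGET_MS)
    return result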
Platform Characteristics
High-Speed Inference
- Predictive scheduling moves models up the cache hierarchy before traffic spikes
- CUDA checkpoints eliminate warm-up delays
- Geographic proximity minimizes network latency
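As a conceptual illustration of the first bullet (not Cumulus's actual scheduler), the toy sketch below promotes a model up a disk -> RAM -> VRAM hierarchy whenever a naive moving-average forecast predicts a spike.
from collections import deque

TIERS = ["disk", "ram", "vram"]  # cheapest/slowest -> most expensive/fastest

class ToyPredictiveScheduler:
    """Toy illustration only: promote a model one tier ahead of predicted demand."""

    def __init__(self, promote_threshold=10.0, window=12):
        self.requests = deque(maxlen=window)   # recent per-interval request counts
        self.tier = 0                          # start on disk
        self.promote_threshold = promote_threshold

    def observe(self, requests_this_interval):
        self.requests.append(requests_this_interval)
        forecast = sum(self.requests) / len(self.requests)  # naive moving average
        if forecast > self.promote_threshold and self.tier < len(TIERS) - 1:
            self.tier += 1       # move the model up before the spike fully lands
        elif forecast < self.promote_threshold / 2 and self.tier > 0:
            self.tier -= 1       # demote when traffic cools off
        return TIERS[self.tier]

scheduler = ToyPredictiveScheduler()
for load in [1, 2, 40, 55, 60, 3, 1]:
    print(load, "->", scheduler.observe(load))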
Multi-Cloud Aggregation
- Access GPUs from major clouds and individual providers
- Automatic failover and load balancing across vendors
- Wide choice of GPU types and price points across vendors
Performance-Focused Optimization
- Latency is the priority, not just cost
- Cold start elimination via checkpoint replication
- Intelligent request routing based on geography and capacity
Geographic Flexibility & Regulatory Compliance
- Providers in every major region and country
- Deploy in specific geographies for data residency
- No cross-border data movement unless explicitly configured
Intelligent Resource Management
- Automatic fractional GPU allocation
- Multi-tenant workload packing
- Continuous monitoring and optimization
Ultra-Fast Scaling
- Scale from 1 to billions of requests with pre-positioned replicas
- No orchestration delays or cold starts
- Consistent latency at any scale