
Use Cases & Scenarios

Burst Inference for Detection Models

The Challenge

Security detection models receive bursty traffic: quiet periods followed by sudden spikes when threats are detected.

How Cumulus Helps

endpoint = client.deploy(
    model="threat_detector",
    workload_type="inference"
)

# Whether 1 detection or 1M, same latency
for threat in threat_stream:
    result = endpoint(threat)

Our predictive scheduler anticipates spikes and preemptively moves your model into VRAM, so when the spike hits, the model is already resident. No cold starts, consistent latency.
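
For illustration, here is a minimal client-side sketch of what consistent latency looks like from the application's side during a burst. It reuses the endpoint returned by client.deploy above and assumes threat_stream is a list of recent detections; the timing and percentile helpers are plain standard-library Python, not part of the Cumulus SDK.

import statistics
import time

# Hypothetical burst: a quiet trickle followed by a sudden spike of detections.
quiet_period = threat_stream[:10]
spike = threat_stream[10:1010]

def measure(batch):
    """Call the endpoint for each item and return per-request latencies in ms."""
    latencies = []
    for threat in batch:
        start = time.perf_counter()
        endpoint(threat)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

quiet_ms = measure(quiet_period)
spike_ms = measure(spike)

# With the model pre-positioned in VRAM, the spike should show no cold-start tail.
print(f"quiet median: {statistics.median(quiet_ms):.1f} ms")
print(f"spike p99:    {statistics.quantiles(spike_ms, n=100)[98]:.1f} ms")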


Multi-Region Compliance

The Challenge

You have customers in regulated regions (Switzerland, Saudi Arabia). Data must stay in-region.

How Cumulus Helps

# Deploy in Switzerland only
endpoint_ch = client.deploy(
    model="classifier",
    workload_type="inference",
    region="eu-ch"
)

# Deploy in Saudi Arabia only
endpoint_sa = client.deploy(
    model="classifier",
    workload_type="inference",
    region="me-sa"
)

Cumulus maintains providers in every region. Your model runs only where required. No data leaves the specified region.
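
To keep requests in-region, the application routes each call to the endpoint deployed in the customer's home region. The sketch below is illustrative only: it reuses endpoint_ch and endpoint_sa from the example above, and the customer.residency field and route helper are hypothetical, not part of the Cumulus SDK.

# Map each regulated residency to its region-pinned endpoint (assumed mapping).
ENDPOINTS_BY_RESIDENCY = {
    "CH": endpoint_ch,  # Swiss customers -> eu-ch deployment
    "SA": endpoint_sa,  # Saudi customers -> me-sa deployment
}

def route(customer, payload):
    """Send the request to the endpoint in the customer's home region.

    Raising on an unknown residency ensures data is never sent cross-border
    by accident.
    """
    try:
        regional_endpoint = ENDPOINTS_BY_RESIDENCY[customer.residency]
    except KeyError:
        raise ValueError(f"no in-region deployment for residency {customer.residency!r}")
    return regional_endpoint(payload)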


Multiple Specialized Models

The Challenge

Your security system uses five different small models (a detector, a classifier, a ranker, and so on). Renting five separate GPUs for them is wasteful.

How Cumulus Helps

# Deploy three of the small models; Cumulus co-locates them on a shared GPU
detector = client.deploy(model="threat_detector", workload_type="inference")
classifier = client.deploy(model="threat_classifier", workload_type="inference")
ranker = client.deploy(model="threat_ranker", workload_type="inference")

Cumulus automatically:

  • Recognizes that all three are small models
  • Bins them onto the same GPU (fractional allocation)
  • Bills you roughly one-third of what three dedicated GPUs would cost

A usage sketch of the three co-located endpoints follows below.
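
As an illustration, here is a minimal pipeline that chains the three co-located endpoints. It reuses detector, classifier, and ranker from the deployments above; the event payload and the field names in the intermediate results are hypothetical.

def triage(event):
    """Run one security event through the detector -> classifier -> ranker chain.

    Each call hits its own endpoint, but Cumulus packs the three small models
    onto a shared GPU, so no dedicated per-model GPU is rented.
    """
    detection = detector(event)
    if not detection.get("is_threat"):  # hypothetical response field
        return None
    category = classifier(detection)
    return ranker({"detection": detection, "category": category})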

Performance-Critical Inference

The Challenge

Your application demands sub-100ms latency for every inference, even during traffic spikes.

How Cumulus Helps

endpoint = client.deploy(
    model="critical_model",
    workload_type="inference",
    region="eu-ch",                              # Deploy near users
    cache_config={"model_cache": "aggressive"},  # Keep in VRAM
    min_replicas=2                               # Always have backups ready
)

With geographic proximity and predictive scaling:

  • The model stays resident in VRAM in the region closest to your users
  • Warm replicas are always standing by (min_replicas=2)
  • Traffic spikes are anticipated and handled preemptively

Result: Consistent sub-100ms latency across all inference requests.
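
One way to sanity-check the latency target is a small concurrent smoke test against the deployed endpoint. This is a minimal sketch, assuming the endpoint from the deployment above and a hypothetical sample_payload; the concurrency and percentile code is plain standard-library Python, not a Cumulus feature.

import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def timed_call(payload):
    """Return the round-trip latency of one inference call in milliseconds."""
    start = time.perf_counter()
    endpoint(payload)
    return (time.perf_counter() - start) * 1000

# Fire 500 requests with 32 concurrent callers to mimic a traffic spike.
with ThreadPoolExecutor(max_workers=32) as pool:
    latencies = list(pool.map(timed_call, [sample_payload] * 500))

p99 = statistics.quantiles(latencies, n=100)[98]
assert p99 < 100, f"p99 latency {p99:.1f} ms exceeds the 100 ms target"
print(f"p99 latency: {p99:.1f} ms")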


Platform Characteristics

High-Speed Inference

  • Predictive scheduling moves models up the cache hierarchy before traffic spikes
  • CUDA checkpoints eliminate warm-up delays
  • Geographic proximity minimizes network latency

Multi-Cloud Aggregation

  • Access GPUs from major clouds and individual providers
  • Automatic failover and load balancing across vendors
  • Unmatched flexibility in GPU selection and pricing

Performance-Focused Optimization

  • Latency is the priority, not just cost
  • Cold start elimination via checkpoint replication
  • Intelligent request routing based on geography and capacity

Geographic Flexibility & Regulatory Compliance

  • Providers in every major region and country
  • Deploy in specific geographies for data residency
  • No cross-border data movement unless explicitly configured

Intelligent Resource Management

  • Automatic fractional GPU allocation
  • Multi-tenant workload packing
  • Continuous monitoring and optimization

Ultra-Fast Scaling

  • Scale from a single request to billions with pre-positioned replicas
  • No orchestration delays or cold starts
  • Consistent latency at any scale