
Advanced Configuration

For power users who need control over region selection, replica management, caching, concurrency, and monitoring.

Region Selection

Lock your deployment to a specific geographic region:

endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    region="eu-ch"  # Lock to Switzerland
)

result = endpoint("input_data_here")

Supported Regions

Region Code | Location | Use Case
us-east | US East Coast | Default US deployments
us-west | US West Coast | West Coast users
eu-ch | Switzerland | Regulated EU data
eu-de | Germany | EU data residency
eu-fr | France | EU data residency
eu-uk | United Kingdom | EU data residency
eu-nl | Netherlands | EU data residency
eu-se | Sweden | EU data residency
me-sa | Saudi Arabia | Middle East deployments
me-ae | United Arab Emirates | GCC region
asia-in | India | Asia-Pacific region
asia-sg | Singapore | Southeast Asia
asia-jp | Japan | East Asia
auto | Global | Automatic regional selection (default)
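
Because region is a per-deployment parameter, the same model can be served from several regions by issuing separate deploy() calls. A minimal sketch, with the pairing of regions purely illustrative:

# Sketch: pin the same model to two EU regions with independent endpoints.
# Region codes come from the table above; the combination is illustrative.
eu_ch_endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    region="eu-ch"  # Regulated data stays in Switzerland
)

eu_de_endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    region="eu-de"  # German data residency
)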

Replica Management & Auto-Scaling

By default, Cumulus automatically manages replicas across regions. Inference is stateless—each replica loads from a single CUDA checkpoint, so replicas can spin up/down concurrently without data loss or re-initialization.

# Default: Auto-managed replicas (recommended)
endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    region="eu-ch"
)

# Advanced: Set minimum replicas to maintain
endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    region="eu-ch",
    min_replicas=2  # Keep at least 2 replicas running
)

How Auto-Scaling Works

  • Cumulus monitors request load in real-time
  • When a replica reaches capacity, Cumulus automatically spins up additional replicas (see the sketch after this list)
  • Requests distribute evenly across all replicas in the region
  • Idle replicas scale down when traffic drops, but never below min_replicas
  • All replicas share the same CUDA checkpoint—no state sync needed
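
The first two points can be exercised from the client side simply by sending more traffic than a single replica can absorb. A minimal sketch, assuming the endpoint object from the examples above and illustrative payloads:

# Sketch: fire a burst of concurrent requests; once a replica is saturated,
# Cumulus spins up additional replicas behind the same endpoint.
from concurrent.futures import ThreadPoolExecutor

inputs = [f"input_{i}" for i in range(100)]  # Illustrative payloads

with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(endpoint, inputs))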

Parameters

Parameter | Description | Default
min_replicas | Minimum number of replicas to keep running. Prevents cold starts. | 1

Caching & Persistence

Models are cached across multiple tiers for faster serving and intelligent memory management. Cumulus handles movement automatically.

endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    cache_config={
        "model_cache": "aggressive"
    }
)

Cache Strategies

Strategy | Behavior | Best For
aggressive | Keep the model in VRAM as much as possible; spill to disk when memory pressure is high | Latency-critical apps
auto | Automatically move between VRAM and disk based on request patterns | Most use cases (default)
conservative | Prefer disk; only load to VRAM on request | Cost optimization

How Cumulus Uses Caching

  1. Models start in S3 (cloud storage)
  2. Move to disk (local SSD) when region gets traffic
  3. Move to VRAM (GPU memory) when spike is predicted
  4. Spill back to disk intelligently when memory pressure increases
  5. Persist checkpoints across replica restarts

You just specify the strategy—Cumulus handles the movement automatically.

# Aggressive caching for latency-critical apps
endpoint = client.deploy(
    model="critical_model",
    workload_type="inference",
    cache_config={"model_cache": "aggressive"}
)

# Conservative caching for cost optimization
endpoint = client.deploy(
    model="large_model",
    workload_type="inference",
    cache_config={"model_cache": "conservative"}
)

Concurrency & Request Handling

Control how many concurrent requests each replica can handle and set timeouts for long-running inference.

endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    max_concurrent_requests=32,  # Requests per replica
    request_timeout=300          # 5 minutes
)

Parameters

Parameter | Description | Default
max_concurrent_requests | How many requests one replica processes simultaneously | 32
request_timeout | Maximum seconds to wait for inference completion | 300

How It Works

  • Each replica queues incoming requests up to max_concurrent_requests
  • Requests exceeding this limit trigger auto-scaling (new replica spins up)
  • Any request exceeding request_timeout is terminated to free resources

Example

endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    max_concurrent_requests=64,  # High concurrency for bursty workloads
    request_timeout=120          # Fail fast on stuck requests
)

# Send request
result = endpoint("input_data_here") # Returns in < 120 seconds or times out
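
This page does not specify which exception the client raises when request_timeout is exceeded, so the retry sketch below catches a generic Exception as a placeholder; the helper name infer_with_retry is hypothetical.

import time

def infer_with_retry(endpoint, payload, retries=1, backoff_s=2):
    # Hypothetical helper: retry once after a failed or timed-out request.
    # Replace the generic Exception with the SDK's actual timeout error.
    for attempt in range(retries + 1):
        try:
            return endpoint(payload)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff_s)

result = infer_with_retry(endpoint, "input_data_here")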

Monitoring & Observability

Query real-time replica status and distribution across regions:

# Get replica status
status = client.get_replica_status(endpoint.id)
print(status)

# Output:
# {
#     "eu-ch": {
#         "active_replicas": 3,
#         "servers": ["server-1a", "server-2b", "server-3c"],
#         "load": "85%"
#     },
#     "eu-de": {
#         "active_replicas": 1,
#         "servers": ["server-5f"],
#         "load": "32%"
#     }
# }

Finding Your Endpoint ID

# From deployment response
endpoint = client.deploy(model="detector_model", workload_type="inference")
endpoint_id = endpoint.id # Returns: "ep_1a2b3c4d5e6f7g8h"

# Or retrieve from your account
endpoints = client.list_endpoints()
for ep in endpoints:
    print(f"{ep.name}: {ep.id}")

Available Metrics

Metric | Description
active_replicas | Number of running replicas per region
servers | Server IDs where replicas are running
load | Average CPU/GPU utilization per region
requests_queued | Pending requests waiting for an available replica
avg_latency_ms | Average inference latency per region
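
As a rough sketch, the per-region fields shown in the sample output above can drive simple alerting. The 80% threshold and the parsing of the load string are assumptions, and requests_queued is read only if present:

# Sketch: flag regions under pressure using get_replica_status() fields.
status = client.get_replica_status(endpoint.id)

for region, stats in status.items():
    load_pct = float(stats["load"].rstrip("%"))  # e.g. "85%" -> 85.0
    queued = stats.get("requests_queued", 0)     # Assumed optional; not in the sample output above
    if load_pct > 80 or queued > 0:
        print(f"{region}: {stats['active_replicas']} replicas, "
              f"load {stats['load']}, {queued} queued")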

Workload Types

Type | Behavior | Best For
inference | High priority, optimized for latency | Real-time model serving, detection, classification
training | Lower priority, optimized for throughput | Model fine-tuning, batch processing
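
For completeness, a training deployment uses the same deploy() call with workload_type="training". A minimal sketch, with the model name illustrative and no assumption made here about how training jobs are invoked afterwards:

# Sketch: submit a fine-tuning workload at lower priority, optimized for throughput.
training_endpoint = client.deploy(
    model="detector_model",
    workload_type="training",
    region="eu-ch"
)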