Advanced Configuration
For power users who need control over region, caching, replicas, and more.
Region Selection
Lock your deployment to a specific geographic region:
```python
endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    region="eu-ch"  # Lock to Switzerland
)

result = endpoint("input_data_here")
```
Supported Regions
| Region Code | Location | Use Case |
|---|---|---|
| us-east | US East Coast | Default US deployments |
| us-west | US West Coast | West Coast users |
| eu-ch | Switzerland | Regulated EU data |
| eu-de | Germany | EU data residency |
| eu-fr | France | EU data residency |
| eu-uk | United Kingdom | UK data residency |
| eu-nl | Netherlands | EU data residency |
| eu-se | Sweden | EU data residency |
| me-sa | Saudi Arabia | Middle East deployments |
| me-ae | United Arab Emirates | GCC region |
| asia-in | India | Asia-Pacific region |
| asia-sg | Singapore | Southeast Asia |
| asia-jp | Japan | East Asia |
| auto | Global | Automatic regional selection (default) |
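If you don't need to pin data to a specific country, you can leave region selection to Cumulus. The sketch below passes `region="auto"` explicitly, which the table above lists as the default, so it behaves the same as omitting the argument; the model name and input are placeholders from the earlier examples.

```python
# Let Cumulus pick the region automatically (the documented default)
endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    region="auto"  # equivalent to omitting the region argument
)

result = endpoint("input_data_here")
```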
Replica Management & Auto-Scaling
By default, Cumulus automatically manages replicas across regions. Inference is stateless: each replica loads from a single CUDA checkpoint, so replicas can spin up and down concurrently without data loss or re-initialization.
```python
# Default: Auto-managed replicas (recommended)
endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    region="eu-ch"
)

# Advanced: Set minimum replicas to maintain
endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    region="eu-ch",
    min_replicas=2  # Keep at least 2 replicas running
)
```
How Auto-Scaling Works
- Cumulus monitors request load in real time
- When a replica reaches capacity, Cumulus automatically spins up additional replicas
- Requests are distributed evenly across all replicas in the region
- Idle replicas scale back down toward your min_replicas floor when traffic drops
- All replicas share the same CUDA checkpoint, so no state synchronization is needed (see the sketch after this list)
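One way to observe this behavior is to send some traffic and poll the replica-status API described under Monitoring & Observability below. This is a minimal sketch that assumes the status object behaves like the dictionary shown in that section; the request count and polling interval are arbitrary.

```python
import time

# Deploy with a floor of 2 replicas so scaling starts from a warm baseline
endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    region="eu-ch",
    min_replicas=2
)

# Generate some load...
for _ in range(100):
    endpoint("input_data_here")

# ...then watch replica counts and load per region as Cumulus scales
for _ in range(5):
    status = client.get_replica_status(endpoint.id)
    for region, info in status.items():
        print(f"{region}: {info['active_replicas']} replicas, load {info['load']}")
    time.sleep(30)
```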
Parameters
| Parameter | Description | Default |
|---|---|---|
| min_replicas | Minimum number of replicas to keep running. Prevents cold starts. | 1 |
Caching & Persistence
Models are cached across multiple tiers for faster serving and intelligent memory management. Cumulus handles movement automatically.
```python
endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    cache_config={
        "model_cache": "aggressive"
    }
)
```
Cache Strategies
| Strategy | Behavior | Best For |
|---|---|---|
| aggressive | Keep the model in VRAM as much as possible, spill to disk when memory pressure is high | Latency-critical apps |
| auto | Automatically move between VRAM and disk based on request patterns | Most use cases (default) |
| conservative | Prefer disk, only load to VRAM on request | Cost optimization |
How Cumulus Uses Caching
- Models start in S3 (cloud storage)
- Move to disk (local SSD) when the region starts receiving traffic
- Move to VRAM (GPU memory) when a traffic spike is predicted
- Spill back to disk when memory pressure increases
- Persist checkpoints across replica restarts
You only specify the strategy; Cumulus handles the movement between tiers automatically.
```python
# Aggressive caching for latency-critical apps
endpoint = client.deploy(
    model="critical_model",
    workload_type="inference",
    cache_config={"model_cache": "aggressive"}
)

# Conservative caching for cost optimization
endpoint = client.deploy(
    model="large_model",
    workload_type="inference",
    cache_config={"model_cache": "conservative"}
)
```
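Cache strategy composes with the other deployment options in this section. The following sketch pairs aggressive caching with a replica floor for a latency-critical service; both parameters are documented above, and the model name is a placeholder.

```python
# Sketch: low-latency deployment that trades cost for responsiveness
endpoint = client.deploy(
    model="critical_model",
    workload_type="inference",
    region="eu-ch",
    min_replicas=2,                             # avoid cold starts
    cache_config={"model_cache": "aggressive"}  # keep weights hot in VRAM
)
```

The trade-off is cost: keeping a replica floor and weights in VRAM reserves GPU capacity even when traffic is low, which is why the conservative strategy exists for cost-sensitive models.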
Concurrency & Request Handling
Control how many concurrent requests each replica can handle and set timeouts for long-running inference.
```python
endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    max_concurrent_requests=32,  # Requests per replica
    request_timeout=300          # 5 minutes
)
```
Parameters
| Parameter | Description | Default |
|---|---|---|
| max_concurrent_requests | How many requests one replica processes simultaneously | 32 |
| request_timeout | Maximum seconds to wait for inference completion | 300 |
How It Works
- Each replica queues incoming requests up to `max_concurrent_requests`
- Requests exceeding this limit trigger auto-scaling (a new replica spins up)
- Any request exceeding `request_timeout` is terminated to free resources
Example
```python
endpoint = client.deploy(
    model="detector_model",
    workload_type="inference",
    max_concurrent_requests=64,  # High concurrency for bursty workloads
    request_timeout=120          # Fail fast on stuck requests
)

# Send request
result = endpoint("input_data_here")  # Returns in < 120 seconds or times out
```
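When a request runs past `request_timeout`, the call fails rather than blocking indefinitely. This guide doesn't specify the exception type the SDK raises, so the sketch below uses a hypothetical `CumulusTimeoutError` as a stand-in; substitute the actual class from your SDK before relying on it.

```python
# Sketch: retry once after a timeout.
# CumulusTimeoutError is a hypothetical placeholder for the SDK's real
# timeout exception; replace it with the actual class.
try:
    result = endpoint("input_data_here")
except CumulusTimeoutError:
    # The request exceeded request_timeout (120 s here) and was terminated;
    # a single retry lands on whichever replica has capacity.
    result = endpoint("input_data_here")
```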
Monitoring & Observability
Query real-time replica status and distribution across regions:
```python
# Get replica status
status = client.get_replica_status(endpoint.id)
print(status)

# Output:
# {
#     "eu-ch": {
#         "active_replicas": 3,
#         "servers": ["server-1a", "server-2b", "server-3c"],
#         "load": "85%"
#     },
#     "eu-de": {
#         "active_replicas": 1,
#         "servers": ["server-5f"],
#         "load": "32%"
#     }
# }
```
Finding Your Endpoint ID
```python
# From deployment response
endpoint = client.deploy(model="detector_model", workload_type="inference")
endpoint_id = endpoint.id  # Returns: "ep_1a2b3c4d5e6f7g8h"

# Or retrieve from your account
endpoints = client.list_endpoints()
for ep in endpoints:
    print(f"{ep.name}: {ep.id}")
```
Available Metrics
| Metric | Description |
|---|---|
| active_replicas | Number of running replicas per region |
| servers | Server IDs where replicas are running |
| load | Average CPU/GPU utilization per region |
| requests_queued | Pending requests waiting for an available replica |
| avg_latency_ms | Average inference latency per region |
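These metrics are useful for tuning the options earlier in this guide, e.g. raising `min_replicas` or `max_concurrent_requests` when requests queue persistently. The sketch below assumes every metric in the table appears in the per-region dictionaries returned by `get_replica_status`, which the example output above only partially shows.

```python
# Sketch: summarize replica metrics across all regions
status = client.get_replica_status(endpoint.id)

total_replicas = 0
for region, metrics in status.items():
    total_replicas += metrics["active_replicas"]
    queued = metrics.get("requests_queued", 0)      # assumed key, see table above
    latency = metrics.get("avg_latency_ms", "n/a")  # assumed key, see table above
    print(f"{region}: {metrics['active_replicas']} replicas, "
          f"{queued} queued, avg latency {latency} ms")

print(f"Total replicas: {total_replicas}")
```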
Workload Types
| Type | Behavior | Best For |
|---|---|---|
| inference | High priority, optimized for latency | Real-time model serving, detection, classification |
| training | Lower priority, optimized for throughput | Model fine-tuning, batch processing |
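A training deployment differs only in the `workload_type` argument and is scheduled at lower priority, as the table describes. This is a sketch only: the model name is a placeholder, and submitting training data or tracking job progress is outside the scope of this section.

```python
# Sketch: lower-priority, throughput-optimized deployment for fine-tuning.
# "finetune_model" is a placeholder name.
training_endpoint = client.deploy(
    model="finetune_model",
    workload_type="training",
    region="eu-ch"
)
```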