Configuration Reference
Complete reference for all Cumulus job configuration options.
Quick Reference
from cumulus import CumulusClient
client = CumulusClient()
job = client.submit(
    # Required
    script="train.py",
    # File handling
    include_patterns=["*.yaml", "data/*.csv"],
    exclude_patterns=["*.pyc", "__pycache__/*"],
    requirements=["torch", "transformers"],
    requirements_file="requirements.txt",
    # Resources
    gpu_count=1,
    memory_request="16Gi",
    memory_limit="32Gi",
    worker_image="pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime",
    # Job settings
    job_id="my-training-001",
    workload_type="training",
    priority=7,
    # Environment
    env={"WANDB_API_KEY": "...", "HF_TOKEN": "..."},
    # Optimization hints
    model_architecture={...},
    training_config={...},
    auto_detect=True
)
File Handling
script (required)
Path to the main Python script to execute.
job = client.submit(script="train.py")
job = client.submit(script="src/main.py")
include_patterns
Glob patterns for files to include with your job.
job = client.submit(
    script="train.py",
    include_patterns=[
        "*.yaml",          # All YAML files in root
        "data/*.csv",      # CSV files in data folder
        "models/**/*.pt",  # All .pt files in models (recursive)
        "configs/*.json"   # JSON files in configs folder
    ]
)
Pattern syntax:
| Pattern | Matches |
|---|---|
| *.yaml | All YAML files in current directory |
| **/*.yaml | All YAML files recursively |
| data/*.csv | CSV files in data/ directory |
| models/**/*.pt | All .pt files under models/ |
By default, the SDK automatically detects files referenced in your code. Use include_patterns when you need to include files that aren't directly referenced, or when auto-detection misses something.
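For example, a file path assembled at runtime is invisible to static analysis. In this hypothetical train.py, the selected YAML would be missed unless a pattern covers it:
# Inside train.py (hypothetical): the config path is computed at
# runtime, so static auto-detection cannot tell which file is needed.
import os
import yaml

stage = os.environ.get("STAGE", "dev")
with open(f"configs/{stage}.yaml") as f:
    config = yaml.safe_load(f)
Submitting with include_patterns=["configs/*.yaml"] ensures every candidate config ships with the job.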
exclude_patterns
Glob patterns for files to exclude from your job.
job = client.submit(
    script="train.py",
    exclude_patterns=[
        "*.pyc",          # Compiled Python
        "__pycache__/*",  # Cache directories
        "*.log",          # Log files
        ".git/*",         # Git directory
        "*.egg-info/*",   # Package metadata
        "tests/*"         # Test files
    ]
)
requirements
List of pip packages to install.
job = client.submit(
    script="train.py",
    requirements=["torch", "transformers", "wandb"]
)

# With version constraints
job = client.submit(
    script="train.py",
    requirements=[
        "torch>=2.0.0",
        "transformers==4.35.0",
        "wandb~=0.16.0"
    ]
)
additional_files
Explicitly include files that aren't auto-detected.
job = client.submit(
    script="train.py",
    additional_files=["model.py", "utils.py", "../data/processed"]
)
When including files outside your script's directory (paths with ../), the .. components are stripped and files are placed alongside your script. For example, ../data/processed/ becomes data/processed/ relative to your script.
Use relative paths in your script to access these files.
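For instance, with the submission above, the directory passed as ../data/processed is staged as data/processed next to your script (the CSV name below is hypothetical):
# Inside train.py: open the staged copy via its stripped relative
# path, not the original ../ path from your local machine.
import pandas as pd  # assumes pandas is listed in your requirements

df = pd.read_csv("data/processed/train.csv")  # hypothetical file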
requirements_file
Path to a requirements.txt file.
job = client.submit(
    script="train.py",
    requirements_file="requirements.txt"
)
You can combine both:
job = client.submit(
    script="train.py",
    requirements_file="requirements.txt",
    requirements=["wandb"]  # Additional packages
)
Resource Configuration
gpu_count
Number of GPUs to request.
job = client.submit(
    script="train.py",
    gpu_count=1  # Default: 1
)

# Multi-GPU training
job = client.submit(
    script="distributed_train.py",
    gpu_count=4
)
memory_request
Minimum memory guaranteed for your job.
job = client.submit(
    script="train.py",
    memory_request="16Gi"  # Default: "8Gi"
)
Format: Kubernetes quantity notation ("8Gi", "16Gi", "32Gi", etc.).
memory_limit
Maximum memory your job can use.
job = client.submit(
    script="train.py",
    memory_limit="32Gi"  # Default: "16Gi"
)
Guidelines:
| Model Size | memory_request | memory_limit |
|---|---|---|
| Small (< 1B params) | 8Gi | 16Gi |
| Medium (1B - 7B params) | 16Gi | 32Gi |
| Large (7B+ params) | 32Gi | 64Gi |
worker_image
Docker image to run your job.
job = client.submit(
    script="train.py",
    worker_image="pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime"
)
Available images:
| Image | Description |
|---|---|
| pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime | Default image; PyTorch 2.5 with CUDA 12.4 |
| nvcr.io/nvidia/pytorch:24.01-py3 | NVIDIA-optimized PyTorch |
| nvcr.io/nvidia/tensorflow:24.01-tf2-py3 | NVIDIA TensorFlow 2 |
| nvcr.io/nvidia/tritonserver:24.01-py3 | NVIDIA Triton for serving |
Job Settings
job_id
Custom identifier for your job.
job = client.submit(
    script="train.py",
    job_id="bert-finetune-exp-001"
)
Requirements:
- Alphanumeric characters and hyphens only
- Maximum 63 characters
- Must be unique
If not provided, a unique ID is auto-generated: job-20240103-143052-a1b2c3
workload_type
Type of workload; affects scheduling priority.
job = client.submit(
    script="train.py",
    workload_type="training"  # Default
)
| Type | Use Case | Priority |
|---|---|---|
| training | Model training | Highest (rarely evicted) |
| finetuning | Fine-tuning pre-trained models | Medium |
| inference | Batch inference, evaluation | Lowest (evicted first) |
priority
Job priority (1-10); affects scheduling order and eviction.
job = client.submit(
    script="train.py",
    priority=8  # Default: 5
)
| Priority | Behavior |
|---|---|
| 1-3 | Low priority, evicted first for higher priority jobs |
| 4-6 | Normal priority |
| 7-10 | High priority, rarely evicted |
Use high priority (7-10) for important production jobs. Use low priority (1-3) for experimental or non-urgent work.
Environment Variables
env
Dictionary of environment variables for your script.
job = client.submit(
    script="train.py",
    env={
        "WANDB_API_KEY": "your-wandb-key",
        "HF_TOKEN": "your-huggingface-token",
        "DEBUG": "true",
        "LEARNING_RATE": "0.001",
        "BATCH_SIZE": "32"
    }
)
Access in your script:
import os
wandb_key = os.environ.get("WANDB_API_KEY")
lr = float(os.environ.get("LEARNING_RATE", "0.001"))
batch_size = int(os.environ.get("BATCH_SIZE", "32"))
debug = os.environ.get("DEBUG", "false").lower() == "true"
System-provided environment variables:
These are automatically set inside your job; a resume sketch follows the table:
| Variable | Description | Example |
|---|---|---|
| JOB_ID | Current job identifier | job-20240103-143052-abc123 |
| ORIGINAL_JOB_ID | Original job ID (for requeued jobs) | job-20240103-143052-abc123 |
| S3_BUCKET | Storage bucket | cumulus-jobs-... |
| AWS_REGION | Cloud region | us-east-2 |
| RESUME_FROM_CHECKPOINT | Whether to resume | true or false |
| CHECKPOINT_PATH | Path to checkpoint (if resuming) | s3://bucket/job/checkpoints/ |
| REQUEUE_COUNT | Times job has been requeued | 0, 1, 2, ... |
| CUDA_VISIBLE_DEVICES | Assigned GPU index | 0 |
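A minimal sketch of checkpoint resumption using these variables; load_checkpoint and train are hypothetical stand-ins for your own code:
import os

resume = os.environ.get("RESUME_FROM_CHECKPOINT", "false").lower() == "true"
requeues = int(os.environ.get("REQUEUE_COUNT", "0"))

if resume:
    # CHECKPOINT_PATH points at this job's checkpoint location
    ckpt = os.environ["CHECKPOINT_PATH"]
    print(f"Requeue #{requeues}: resuming from {ckpt}")
    state = load_checkpoint(ckpt)  # hypothetical helper
else:
    state = None  # fresh run

train(state)  # hypothetical training entry point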
Optimization Hints
model_architecture
Model architecture details for VRAM estimation.
job = client.submit(
    script="train.py",
    model_architecture={
        "architecture_type": "transformer",
        "num_layers": 12,
        "hidden_dim": 768,
        "num_heads": 12,
        "total_params": 110_000_000,
        "base_model": "llama-7b",  # For known model baselines
        "lora_rank": 32            # For LoRA fine-tuning
    }
)
| Field | Description |
|---|---|
| architecture_type | Model type: transformer, diffusion, cnn, unet, etc. |
| num_layers | Number of layers |
| hidden_dim | Hidden dimension size |
| num_heads | Number of attention heads (transformers) |
| total_params | Total parameter count |
| base_model | Base model name for known baselines (e.g., sdxl, llama-7b, flux) |
| lora_rank | LoRA rank for accurate trainable parameter calculation |
For diffusion models like SDXL or Flux, set architecture_type: "diffusion" and include base_model for the most accurate VRAM estimates.
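For example, an SDXL LoRA job might pass hints like these (the script name and rank are illustrative):
job = client.submit(
    script="train_sdxl_lora.py",  # hypothetical script
    model_architecture={
        "architecture_type": "diffusion",
        "base_model": "sdxl",
        "lora_rank": 32
    }
)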
training_config
Training configuration for optimization.
job = client.submit(
    script="train.py",
    training_config={
        "batch_size": 32,
        "precision": "fp16",
        "optimizer": "adamw",
        "sequence_length": 512,
        "peft": True,                    # LoRA/PEFT fine-tuning
        "image_size": 1024,              # For diffusion models
        "gradient_checkpointing": True,
        "cfg_scale": 7.5                 # Classifier-free guidance
    }
)
| Field | Description | Values |
|---|---|---|
| batch_size | Training batch size | Integer |
| precision | Numerical precision | fp32, fp16, bf16 |
| optimizer | Optimizer type | adam, adamw, sgd |
| sequence_length | Sequence length (for NLP) | Integer |
| peft | Using PEFT/LoRA (reduces VRAM estimate) | True, False |
| image_size | Image resolution for vision/diffusion models | Integer (e.g., 512, 1024) |
| gradient_checkpointing | Enables gradient checkpointing (reduces VRAM ~30%) | True, False |
| cfg_scale | Classifier-free guidance scale (diffusion); >1 doubles effective batch | Float |
For SDXL LoRA fine-tuning, a typical configuration is:
training_config={
    "batch_size": 1,
    "precision": "fp16",
    "peft": True,
    "image_size": 1024,
    "gradient_checkpointing": True,
    "cfg_scale": 7.5
}
This yields an estimate of roughly 12-16 GB of VRAM, depending on whether gradient checkpointing is enabled.
auto_detect
Enable automatic detection of model configuration from your script.
job = client.submit(
    script="train.py",
    auto_detect=True  # Default: True
)
When enabled, the SDK parses your script to detect (see the sketch after this list):
- Model architecture (PyTorch, HuggingFace, torchvision)
- Batch sizes
- Precision settings
- Local file imports
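For illustration, a hypothetical train.py like the following contains the kinds of signals static parsing can typically pick up:
# train.py (hypothetical) -- statically detectable hints
import torch
from transformers import AutoModelForSequenceClassification
from utils import evaluate  # local import -> utils.py gets bundled

# HuggingFace load -> model architecture can be inferred
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

batch_size = 32  # literal batch size
scaler = torch.cuda.amp.GradScaler()  # mixed-precision (fp16) signal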
Disable if auto-detection causes issues:
job = client.submit(
    script="train.py",
    auto_detect=False
)
Advanced Configuration
These options let you manually specify GPU resources when you know your exact requirements.
Use vram_gb and sm_percent when:
- You know your exact requirements from profiling or documentation
- You need guaranteed resources for consistent performance (inference servers)
- Jobs need specific memory allocations based on your model size
sm_percent
Manually set the GPU compute percentage (1-100). This controls what portion of GPU compute resources your job receives.
job = client.submit(
    script="train.py",
    sm_percent=50  # 50% of GPU compute resources
)
| sm_percent | Use Case |
|---|---|
| 100 | Full GPU (maximum performance) |
| 50 | Half GPU, suitable for smaller models |
| 25 | Quarter GPU, good for inference or light workloads |
vram_gb
Manually set VRAM (GPU memory) allocation in GB. This is the guaranteed amount of GPU memory reserved for your job.
job = client.submit(
    script="train.py",
    vram_gb=20.0  # Guaranteed 20GB VRAM
)
| GPU | Total VRAM | Typical vram_gb values |
|---|---|---|
| A100 | 40GB / 80GB | 10, 20, 40, 80 |
| H100 | 80GB | 20, 40, 80 |
When you set both sm_percent and vram_gb, you get guaranteed GPU resources (see the sketch after this list). This is ideal for:
- Production inference servers needing consistent latency
- Training jobs where you know exact memory requirements
- Workloads that need predictable performance
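A minimal sketch combining both, assuming a model that fits comfortably in half a GPU:
job = client.submit(
    script="serve.py",          # hypothetical inference server
    workload_type="inference",
    sm_percent=50,              # half of the GPU's compute
    vram_gb=20.0                # 20GB guaranteed VRAM
)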
queue_timeout_seconds
Maximum time (in seconds) to wait in the queue before triggering the autoscaler.
job = client.submit(
    script="train.py",
    queue_timeout_seconds=300  # Wait up to 5 minutes
)
service_port
Expose a port via public tunnel for running inference servers, APIs, or development endpoints. The tunnel provides a public URL accessible from anywhere.
job = client.submit(
    script="sglang_server.py",
    service_port=30000  # Expose SGLang server
)
# Wait for tunnel to be ready
tunnel_url = client.wait_for_tunnel(job.job_id, timeout=300)
print(f"Server available at: {tunnel_url}")
# Returns: http://tunnel.cumuluslabs.io:8443/12345
How it works:
- Your job starts and listens on the specified port (e.g., 30000)
- A tunnel sidecar connects your port to tunnel.cumuluslabs.io
- You get a unique public URL like http://tunnel.cumuluslabs.io:8443/12345
- Requests to this URL require API key authentication
Getting the tunnel URL:
# Method 1: Wait for tunnel (blocking)
tunnel_url = client.wait_for_tunnel(job.job_id, timeout=300)
# Method 2: Check immediately (non-blocking)
tunnel_url = client.get_tunnel_url(job.job_id) # Returns None if not ready
# Method 3: Poll in your own loop
import time
while True:
    tunnel_url = client.get_tunnel_url(job.job_id)
    if tunnel_url:
        break
    time.sleep(5)
Common service ports:
| Framework | Default Port |
|---|---|
| SGLang | 30000 |
| vLLM | 8000 |
| Ollama | 11434 |
| Gradio | 7860 |
| FastAPI | 8000 |
Example: Inference server with guaranteed resources
Tunnel URLs require your Cumulus API key for access. Include it via an X-API-Key: <your-api-key> header or an Authorization: Bearer <your-api-key> header.
import os
job = client.submit(
    script="vllm_server.py",
    service_port=8000,
    workload_type="inference",
    sm_percent=100,  # Full GPU compute
    vram_gb=40.0,    # 40GB guaranteed VRAM
    gpu_count=1
)
tunnel_url = client.wait_for_tunnel(job.job_id, timeout=300)
print(f"vLLM server ready at: {tunnel_url}")
# Make authenticated inference requests
import requests
api_key = os.environ.get("CUMULUS_API_KEY")
response = requests.post(
    f"{tunnel_url}/v1/completions",
    headers={"X-API-Key": api_key},  # Auth required!
    json={
        "prompt": "Hello, world!",
        "max_tokens": 100
    }
)
user_id
Override the owner ID for job filtering in the web dashboard.
job = client.submit(
    script="train.py",
    user_id="team-ml-research"  # Custom owner for dashboard filtering
)
Complete Example
from cumulus import CumulusClient
import os
client = CumulusClient()
job = client.submit(
    # Script and files
    script="src/train.py",
    include_patterns=["configs/*.yaml", "data/*.csv"],
    exclude_patterns=["*.pyc", "__pycache__/*", "*.log"],
    requirements_file="requirements.txt",
    requirements=["wandb"],
    # Resources
    gpu_count=1,
    memory_request="16Gi",
    memory_limit="32Gi",
    worker_image="pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime",
    # Job settings
    job_id="bert-finetune-v1",
    workload_type="training",
    priority=7,
    # Environment
    env={
        "WANDB_API_KEY": os.environ.get("WANDB_API_KEY"),
        "HF_TOKEN": os.environ.get("HF_TOKEN"),
        "EXPERIMENT_NAME": "bert-finetune-v1"
    },
    # Optimization hints
    model_architecture={
        "architecture_type": "transformer",
        "num_layers": 12,
        "hidden_dim": 768,
        "num_heads": 12,
        "total_params": 110_000_000
    },
    training_config={
        "batch_size": 32,
        "precision": "fp16",
        "optimizer": "adamw",
        "sequence_length": 512
    },
    auto_detect=True
)
print(f"Job submitted: {job.job_id}")
print(f"Storage path: {job.s3_path}")
# Wait for completion
status = client.wait_for_completion(job.job_id, timeout=7200)
print(f"Final status: {status}")
# Get results
if status == "SUCCEEDED":
results = client.get_results(job.job_id)
print(results)
Parameter Summary
| Parameter | Type | Default | Description |
|---|---|---|---|
| script | str | required | Path to main Python script |
| include_patterns | list[str] | [] | Glob patterns to include |
| exclude_patterns | list[str] | [] | Glob patterns to exclude |
| requirements | list[str] | [] | pip packages to install |
| requirements_file | str | None | Path to requirements.txt |
| gpu_count | int | 1 | Number of GPUs |
| memory_request | str | "8Gi" | Minimum memory |
| memory_limit | str | "16Gi" | Maximum memory |
| worker_image | str | PyTorch default | Docker image |
| job_id | str | auto-generated | Custom job ID |
| workload_type | str | "training" | Job type |
| priority | int | 5 | Priority (1-10) |
| env | dict | {} | Environment variables |
| model_architecture | dict | None | Model hints |
| training_config | dict | None | Training hints |
| auto_detect | bool | True | Auto-detect config |
| sm_percent | int | auto | GPU compute percentage (1-100) |
| vram_gb | float | auto | GPU memory allocation in GB |
| queue_timeout_seconds | int | 900 | Max queue wait time |
| service_port | int | None | Port to expose via tunnel |
| user_id | str | API key user | Owner ID for dashboard |