# Resource Management

Configure GPU, memory, and compute resources for optimal inference performance. This page covers:

- How to size resources for different model types
- Best practices for batch processing
- Monitoring jobs and retrieving results
## GPU Allocation

Request GPU resources based on your model requirements:

```python
from cumulus import CumulusClient

client = CumulusClient()

job = client.submit(
    script="inference.py",
    workload_type="inference",
    gpu_count=1  # Number of GPUs
)
```
**When to use multiple GPUs:**

- Models too large for a single GPU's VRAM
- Parallel processing of independent batches
- Multi-GPU inference with model parallelism
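For the first case, a back-of-the-envelope estimate of `gpu_count` is the model's weight footprint divided by usable per-GPU VRAM. A sketch (the `gpus_needed` helper and the 80% headroom factor are illustrative assumptions, not part of the client):

```python
import math

def gpus_needed(model_vram_gb: float, gpu_vram_gb: float, headroom: float = 0.8) -> int:
    """Estimate how many GPUs are needed to hold a model's weights.

    Only `headroom` of each GPU's VRAM is counted as usable, leaving
    room for activations and CUDA overhead.
    """
    usable = gpu_vram_gb * headroom
    return max(1, math.ceil(model_vram_gb / usable))

# A ~26 GB model (roughly a 7B model in FP32) on 24 GB cards:
print(gpus_needed(26, 24))  # 2
```

The result feeds directly into `gpu_count` in `client.submit(...)`; measure actual usage before relying on the estimate.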
## Memory Configuration

Set memory limits based on your model size and batch requirements:

```python
job = client.submit(
    script="inference.py",
    workload_type="inference",
    memory_request="16Gi",  # Minimum guaranteed
    memory_limit="32Gi"     # Maximum allowed
)
```
| Parameter | Description | Default |
|---|---|---|
| `memory_request` | Minimum memory guaranteed for your job | `8Gi` |
| `memory_limit` | Maximum memory your job can use | `16Gi` |

Use Kubernetes quantity notation: `"8Gi"`, `"16Gi"`, `"32Gi"`, etc. These values refer to system RAM, not GPU VRAM.
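Kubernetes quantity notation uses binary units, so `Gi` means 2^30 bytes. A tiny converter for the values used on this page (an illustrative helper, not part of the client):

```python
def gi_to_bytes(quantity: str) -> int:
    """Convert a Kubernetes-style binary quantity like '16Gi' to bytes."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
    number, unit = quantity[:-2], quantity[-2:]
    if unit not in units:
        raise ValueError(f"unsupported unit in {quantity!r}")
    return int(number) * units[unit]

print(gi_to_bytes("16Gi"))  # 17179869184
```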
**Sizing guidelines by model type:**
| Model Type | Parameters | Memory Request | Memory Limit |
|---|---|---|---|
| BERT-base | 110M | 8Gi | 16Gi |
| GPT-2 | 1.5B | 16Gi | 32Gi |
| LLaMA-7B | 7B | 32Gi | 64Gi |
| LLaMA-13B | 13B | 48Gi | 80Gi |
| ResNet-50 | 25M | 8Gi | 16Gi |
| YOLO-v5 | 7M | 8Gi | 16Gi |
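The sizing table can be kept in code as a lookup that builds the memory keyword arguments for `client.submit()` (the dictionary and key spellings here are illustrative; extend it for your own models):

```python
# Memory sizing from the table above, keyed by model type.
MEMORY_SIZING = {
    "bert-base": ("8Gi", "16Gi"),
    "gpt-2": ("16Gi", "32Gi"),
    "llama-7b": ("32Gi", "64Gi"),
    "llama-13b": ("48Gi", "80Gi"),
    "resnet-50": ("8Gi", "16Gi"),
    "yolo-v5": ("8Gi", "16Gi"),
}

def memory_kwargs(model_type: str) -> dict:
    """Return memory_request/memory_limit kwargs for client.submit()."""
    request, limit = MEMORY_SIZING[model_type.lower()]
    return {"memory_request": request, "memory_limit": limit}

print(memory_kwargs("llama-7b"))
# {'memory_request': '32Gi', 'memory_limit': '64Gi'}
```

Usage would look like `client.submit(script="inference.py", workload_type="inference", **memory_kwargs("llama-7b"))`.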
## Priority Settings

Control scheduling priority with the `priority` parameter:

```python
job = client.submit(
    script="inference.py",
    workload_type="inference",
    priority=3  # Lower priority for cost savings
)
```
| Priority | Behavior | Best For |
|---|---|---|
| 1-3 | Low priority, may be evicted | Batch jobs, non-urgent tasks |
| 4-6 | Normal priority (default: 5) | Standard workloads |
| 7-10 | High priority, rarely evicted | Time-sensitive inference |
Inference jobs with lower priority cost less. If your job can tolerate interruptions, use `priority=2` or `priority=3`.
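One way to keep these tiers consistent across scripts is a small mapping from workload urgency to a priority value (the tier names and the specific values chosen within each documented range are illustrative):

```python
# Priority tiers based on the table above; each value sits inside
# the documented range for that tier.
PRIORITY_TIERS = {
    "batch": 2,      # 1-3: low priority, may be evicted, cheapest
    "standard": 5,   # 4-6: normal priority (the default is 5)
    "urgent": 8,     # 7-10: high priority, rarely evicted
}

def priority_for(tier: str) -> int:
    """Return a priority value for client.submit(priority=...)."""
    return PRIORITY_TIERS[tier]

print(priority_for("batch"))  # 2
```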
## Docker Images

Use an image optimized for your inference workload:

```python
job = client.submit(
    script="inference.py",
    workload_type="inference",
    worker_image="pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime"
)
```
**Available images:**

| Image | Best For |
|---|---|
| `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime` | PyTorch models (default) |
| `nvcr.io/nvidia/pytorch:24.01-py3` | NVIDIA-optimized PyTorch |
| `nvcr.io/nvidia/tritonserver:24.01-py3` | Production serving |
| `nvcr.io/nvidia/tensorflow:24.01-tf2-py3` | TensorFlow models |
## Batch Processing Best Practices

Maximize GPU utilization with efficient batching:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased").cuda()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.eval()

# Load your data
texts = load_texts_from_file("input.txt")

# Process in optimized batches
batch_size = 32  # Tune based on GPU memory
results = []

with torch.no_grad():
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        # Tokenize the batch together for efficiency
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to("cuda")

        # Single forward pass for the entire batch
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state[:, 0, :].cpu()
        results.extend(embeddings.tolist())

        # Progress logging
        print(f"Processed {min(i + batch_size, len(texts))}/{len(texts)}")

print(f"Generated {len(results)} embeddings")
```
**Batch size guidelines:**
| GPU VRAM | Recommended Batch Size |
|---|---|
| 8GB | 8-16 |
| 16GB | 16-32 |
| 24GB | 32-64 |
| 40GB+ | 64-128 |
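The table and the slicing loop above can be folded into two small helpers (a sketch; it conservatively picks the low end of each recommended range, and the function names are illustrative):

```python
def pick_batch_size(vram_gb: int) -> int:
    """Suggest a starting batch size from GPU VRAM, per the table above
    (conservative: the low end of each recommended range)."""
    if vram_gb >= 40:
        return 64
    if vram_gb >= 24:
        return 32
    if vram_gb >= 16:
        return 16
    if vram_gb >= 8:
        return 8
    return 4  # Below 8 GB: start small and tune upward

def batches(items, batch_size):
    """Yield successive slices of `items`, mirroring the loop above."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

print(pick_batch_size(24))                # 32
print(list(batches([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```

Start from the suggested value, then increase it until GPU memory or latency becomes the constraint.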
## Monitoring Job Status

Check job progress during execution:

```python
from cumulus import CumulusClient
import time

client = CumulusClient()
job_id = "your-job-id"

# Poll for status updates
while True:
    status = client.get_status(job_id)
    print(f"Status: {status}")
    if status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(10)

# Get final results
if status == "SUCCEEDED":
    results = client.get_results(job_id)
    print(results)
```
**Job status values:**

| Status | Description |
|---|---|
| `SUBMITTED` | Job uploaded, waiting to be scheduled |
| `PENDING` | Pod created, waiting for GPU allocation |
| `RUNNING` | Job is executing |
| `SUCCEEDED` | Job completed successfully |
| `FAILED` | Job failed (check logs for details) |
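The polling loop above can also be factored into a reusable helper with a timeout (a sketch; `get_status` and `sleep` are injected here so the logic is testable without a live cluster — with the real client you would pass `lambda: client.get_status(job_id)`):

```python
import time

TERMINAL_STATUSES = ("SUCCEEDED", "FAILED")

def wait_until_terminal(get_status, timeout=3600, poll_interval=10.0, sleep=time.sleep):
    """Poll get_status() until a terminal status or the timeout elapses.

    Returns the final status, or "TIMEOUT" if the job is still running
    after `timeout` seconds.
    """
    waited = 0.0
    while waited < timeout:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_interval)
        waited += poll_interval
    return "TIMEOUT"
```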
## Retrieving Results

Get output files from completed jobs:

```python
# Get default output (stdout/stderr)
output = client.get_results(job_id)
print(output)

# Get specific output files
predictions = client.get_results(job_id, file="predictions.json")
embeddings = client.get_results(job_id, file="embeddings.npy")
```
**Saving outputs in your script:**

```python
import json
import numpy as np

# Save JSON results
with open("predictions.json", "w") as f:
    json.dump(results, f)

# Save NumPy arrays
np.save("embeddings.npy", embeddings_array)

# Save text output
with open("output.txt", "w") as f:
    for item in results:
        f.write(f"{item}\n")

print("Results saved!")  # This appears in the default output
```
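Once a file like `predictions.json` has been downloaded, reading it back is plain stdlib work. A round-trip sketch (file names and record shapes are illustrative):

```python
import json
import os
import tempfile

def save_predictions(results, path):
    """Write results as JSON, as the inference script would."""
    with open(path, "w") as f:
        json.dump(results, f)

def load_predictions(path):
    """Read a downloaded predictions file back into Python objects."""
    with open(path) as f:
        return json.load(f)

# Round-trip example
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "predictions.json")
    save_predictions([{"label": "positive", "score": 0.98}], path)
    print(load_predictions(path))  # [{'label': 'positive', 'score': 0.98}]
```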
## Using `wait_for_completion()`

Block until a job finishes:

```python
from cumulus import CumulusClient

client = CumulusClient()

job = client.submit(
    script="inference.py",
    requirements=["torch", "transformers"],
    workload_type="inference"
)

# Wait with a timeout
final_status = client.wait_for_completion(
    job.job_id,
    timeout=3600,       # Maximum wait time in seconds
    poll_interval=10.0  # How often to check (default: 10s)
)

if final_status == "SUCCEEDED":
    results = client.get_results(job.job_id)
    print("Job completed successfully!")
    print(results)
elif final_status == "FAILED":
    print("Job failed - check logs for details")
elif final_status == "TIMEOUT":
    print("Job still running after timeout")
```
## Performance Optimization Tips

**1. Use Mixed Precision**

```python
import torch

model = model.half()  # Convert weights to FP16

# Or use automatic mixed precision
with torch.autocast("cuda"):
    outputs = model(inputs)
```

**2. Enable CUDA Optimizations**

```python
torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True
```

**3. Pre-compile with torch.compile (PyTorch 2.0+)**

```python
model = torch.compile(model, mode="reduce-overhead")
```

**4. Use Efficient Data Loading**

```python
from torch.utils.data import DataLoader

# Pin host memory for faster CPU-to-GPU transfer
data = data.pin_memory()

# Use worker processes and pinned memory for async data loading
loader = DataLoader(dataset, num_workers=4, pin_memory=True)
```
## Troubleshooting

**Job runs out of memory**

- Reduce `batch_size` in your script
- Increase `memory_limit` in the submit call
- Use mixed precision (FP16)

**Job is slow**

- Increase `batch_size` to better utilize the GPU
- Use `torch.no_grad()` for inference
- Enable CUDA optimizations

**Job keeps getting evicted**

- Increase `priority` (7-10 for important jobs)
- Consider `workload_type="training"` for critical workloads
## Next Steps

- Inference Examples - Real-world use cases
- Configuration Reference - All available options