Resource Management

Configure GPU, memory, and compute resources for optimal inference performance.

What you'll learn
  • How to size resources for different model types
  • Best practices for batch processing
  • Monitoring and retrieving results

GPU Allocation

Request GPU resources based on your model requirements:

from cumulus import CumulusClient

client = CumulusClient()

job = client.submit(
    script="inference.py",
    workload_type="inference",
    gpu_count=1,  # number of GPUs
)

When to use multiple GPUs:

  • Models too large for single GPU VRAM
  • Parallel processing of independent batches
  • Multi-GPU inference with model parallelism
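
For example, a model too large for one GPU's VRAM can request two GPUs and shard itself across them. The submit call below uses only parameters shown above; how the script splits the model (for instance Hugging Face's device_map="auto") is an illustrative assumption, not something Cumulus mandates:

job = client.submit(
    script="inference.py",
    workload_type="inference",
    gpu_count=2,  # shard the model across two GPUs
)

Inside inference.py, torch.cuda.device_count() should then report 2.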

Memory Configuration

Set memory limits based on your model size and batch requirements:

job = client.submit(
    script="inference.py",
    workload_type="inference",
    memory_request="16Gi",  # minimum guaranteed
    memory_limit="32Gi",    # maximum allowed
)
Parameter        Description                              Default
memory_request   Minimum memory guaranteed for your job   8Gi
memory_limit     Maximum memory your job can use          16Gi
Memory Format

Use Kubernetes notation: "8Gi", "16Gi", "32Gi", etc. This refers to RAM, not GPU VRAM.

Sizing guidelines by model type:

Model Type   Parameters   Memory Request   Memory Limit
BERT-base    110M         8Gi              16Gi
GPT-2        1.5B         16Gi             32Gi
LLaMA-7B     7B           32Gi             64Gi
LLaMA-13B    13B          48Gi             80Gi
ResNet-50    25M          8Gi              16Gi
YOLO-v5      7M           8Gi              16Gi
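
For example, applying the LLaMA-7B row directly:

job = client.submit(
    script="inference.py",
    workload_type="inference",
    gpu_count=1,
    memory_request="32Gi",  # from the LLaMA-7B row above
    memory_limit="64Gi",
)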

Priority Settings

Control scheduling priority with the priority parameter:

job = client.submit(
    script="inference.py",
    workload_type="inference",
    priority=3,  # lower priority for cost savings
)
Priority   Behavior                        Best For
1-3        Low priority, may be evicted    Batch jobs, non-urgent tasks
4-6        Normal priority (default: 5)    Standard workloads
7-10       High priority, rarely evicted   Time-sensitive inference
Cost Optimization

Inference jobs with lower priority cost less. If your job can tolerate interruptions, use priority=2 or priority=3.


Docker Images

Use optimized images for your inference workload:

job = client.submit(
    script="inference.py",
    workload_type="inference",
    worker_image="pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime",
)

Available images:

Image                                           Best For
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime   PyTorch models (default)
nvcr.io/nvidia/pytorch:24.01-py3                NVIDIA-optimized PyTorch
nvcr.io/nvidia/tritonserver:24.01-py3           Production serving
nvcr.io/nvidia/tensorflow:24.01-tf2-py3         TensorFlow models

Batch Processing Best Practices

Maximize GPU utilization with efficient batching:

efficient_inference.py
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased").cuda()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.eval()

# Load your data (here: one text per line)
with open("input.txt") as f:
    texts = [line.strip() for line in f]

# Process in optimized batches
batch_size = 32  # tune based on GPU memory
results = []

with torch.no_grad():
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        # Tokenize the batch together for efficiency
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512,
        ).to("cuda")

        # Single forward pass for the entire batch
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state[:, 0, :].cpu()

        results.extend(embeddings.tolist())

        # Progress logging
        print(f"Processed {min(i + batch_size, len(texts))}/{len(texts)}")

print(f"Generated {len(results)} embeddings")

Batch size guidelines:

GPU VRAM   Recommended Batch Size
8GB        8-16
16GB       16-32
24GB       32-64
40GB+      64-128
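
If you don't know in advance which GPU your job will land on, you can choose a batch size at runtime from the detected VRAM. The thresholds below simply mirror the table and are a starting point to tune, not a guarantee:

import torch

# Map detected VRAM to the guideline table above
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3

if vram_gb >= 40:
    batch_size = 64
elif vram_gb >= 24:
    batch_size = 32
elif vram_gb >= 16:
    batch_size = 16
else:
    batch_size = 8

print(f"Detected {vram_gb:.0f}GB VRAM, using batch_size={batch_size}")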

Monitoring Job Status

Check job progress during execution:

from cumulus import CumulusClient
import time

client = CumulusClient()
job_id = "your-job-id"

# Poll for status updates
while True:
    status = client.get_status(job_id)
    print(f"Status: {status}")

    if status in ("SUCCEEDED", "FAILED"):
        break

    time.sleep(10)

# Get final results
if status == "SUCCEEDED":
    results = client.get_results(job_id)
    print(results)

Job status values:

Status      Description
SUBMITTED   Job uploaded, waiting to be scheduled
PENDING     Pod created, waiting for GPU allocation
RUNNING     Job is executing
SUCCEEDED   Job completed successfully
FAILED      Job failed (check logs for details)

Retrieving Results

Get output files from completed jobs:

# Get default output (stdout/stderr)
output = client.get_results(job_id)
print(output)

# Get specific output files
predictions = client.get_results(job_id, file="predictions.json")
embeddings = client.get_results(job_id, file="embeddings.npy")

Saving outputs in your script:

inference.py
import json
import numpy as np

# `results` and `embeddings_array` come from your inference code above

# Save JSON results
with open("predictions.json", "w") as f:
    json.dump(results, f)

# Save NumPy arrays
np.save("embeddings.npy", embeddings_array)

# Save text output
with open("output.txt", "w") as f:
    for item in results:
        f.write(f"{item}\n")

print("Results saved!")  # this appears in the default output

Using wait_for_completion()

Block until a job finishes:

from cumulus import CumulusClient

client = CumulusClient()

job = client.submit(
    script="inference.py",
    requirements=["torch", "transformers"],
    workload_type="inference",
)

# Wait with a timeout
final_status = client.wait_for_completion(
    job.job_id,
    timeout=3600,        # maximum wait time in seconds
    poll_interval=10.0,  # how often to check (default: 10s)
)

if final_status == "SUCCEEDED":
    results = client.get_results(job.job_id)
    print("Job completed successfully!")
    print(results)
elif final_status == "FAILED":
    print("Job failed - check logs for details")
elif final_status == "TIMEOUT":
    print("Job still running after timeout")

Performance Optimization Tips

1. Use Mixed Precision

import torch

model = model.half()  # convert weights to FP16
# ...or use automatic mixed precision
# (on PyTorch 2.4+, torch.amp.autocast("cuda") is the non-deprecated spelling)
with torch.cuda.amp.autocast():
    outputs = model(inputs)

2. Enable CUDA Optimizations

torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True

3. Pre-compile with torch.compile (PyTorch 2.0+)

model = torch.compile(model, mode="reduce-overhead")

4. Use Efficient Data Loading

from torch.utils.data import DataLoader

# Pin host memory for faster CPU->GPU transfers
data = data.pin_memory()

# Use asynchronous data loading with worker processes
loader = DataLoader(dataset, num_workers=4, pin_memory=True)

Troubleshooting

Job runs out of memory

  • Reduce batch_size in your script (see the retry sketch below)
  • Increase memory_limit in submit call
  • Use mixed precision (FP16)
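
A sketch of automating the first fix: catch CUDA OOM and halve the batch size until a forward pass fits. This is plain PyTorch, not part of the Cumulus API, and make_inputs() is a placeholder for your own tokenization/collation step:

import torch

batch_size = 64
while batch_size >= 1:
    try:
        with torch.no_grad():
            # Try one forward pass at the current batch size;
            # make_inputs() stands in for your own batching code
            outputs = model(**make_inputs(texts[:batch_size]))
        break
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release the failed allocation
        batch_size //= 2
        print(f"OOM, retrying with batch_size={batch_size}")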

Job is slow

  • Increase batch_size to better utilize GPU
  • Use torch.no_grad() for inference
  • Enable CUDA optimizations

Job keeps getting evicted

  • Increase priority (7-10 for important jobs), as shown below
  • Consider workload_type="training" for critical workloads
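
Both fixes are one-line changes to the submit call; priority=8 falls in the "rarely evicted" band from the priority table above:

# Resubmit an eviction-prone job at high priority
job = client.submit(
    script="inference.py",
    workload_type="inference",
    priority=8,  # 7-10: rarely evicted
)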

Next Steps