Inference Jobs
Run model inference on GPU with automatic scaling and efficient resource utilization. This guide covers:
- How to submit batch inference jobs
- How to deploy live inference servers with public URLs
- Difference between inference and training workloads
- Best practices for efficient GPU utilization
Quick Start
Submit any inference script to run on GPU:
from cumulus import CumulusClient
client = CumulusClient()
job = client.submit(
script="inference.py",
requirements=["torch", "transformers"],
workload_type="inference"
)
print(f"Job ID: {job.job_id}")
# Wait for results
client.wait_for_completion(job.job_id)
results = client.get_results(job.job_id)
print(results)
Your script runs on an NVIDIA GPU with all dependencies installed automatically.
Why workload_type="inference"?
Setting workload_type="inference" optimizes your job for batch processing:
| Benefit | Description |
|---|---|
| Optimized resources | Your job runs on GPUs with available capacity |
| Cost efficiency | Lower priority means lower cost |
| Batch-friendly | Designed for jobs that run to completion |
Inference jobs have lower eviction priority than training jobs. This means if GPU resources become scarce, inference jobs may be paused to make room for training workloads. This is usually fine since inference jobs are typically shorter and easier to restart.
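One defensive pattern is to make the inference script itself safe to rerun, so a restart redoes little work. A minimal sketch of that idea, assuming a prompts list and a run_model() helper that are placeholders (not part of the Cumulus API), and assuming the job's output file is available again on restart; see the checkpointing section below for the CumulusJob-based approach:
import json
import os

results_file = "results.json"

# Reload anything a previous run of this job already wrote
if os.path.exists(results_file):
    with open(results_file) as f:
        results = json.load(f)
else:
    results = []

done = {r["prompt"] for r in results}

for prompt in prompts:
    if prompt in done:
        continue  # already processed before the restart
    results.append({"prompt": prompt, "output": run_model(prompt)})
    # Persist after each prompt so a restart loses little work
    with open(results_file, "w") as f:
        json.dump(results, f, indent=2)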
Batch Inference Example
Process a dataset through your model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model once
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()
prompts = [
"The quick brown fox",
"Machine learning is",
"In the year 2025"
]
results = []
with torch.no_grad():
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=50)
        result = tokenizer.decode(outputs[0], skip_special_tokens=True)
        results.append({"prompt": prompt, "output": result})
        print(f"Processed: {prompt[:30]}...")
# Save results
import json
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
print(f"Inference complete! Processed {len(results)} prompts.")
Submit the script as an inference job:
from cumulus import CumulusClient
client = CumulusClient()
job = client.submit(
script="inference.py",
requirements=["torch", "transformers"],
workload_type="inference"
)
print(f"Job: {job.job_id}")
# Wait and retrieve results
client.wait_for_completion(job.job_id)
output = client.get_results(job.job_id, file="results.json")
print(output)
Model Evaluation
Evaluate a trained model on a test dataset:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
# Load your trained model (saved with torch.save(model));
# weights_only=False is required to load a full module on PyTorch >= 2.6
model = torch.load("model.pt", weights_only=False).cuda()
model.eval()
# Test data (replace with your actual test set)
X_test = torch.randn(1000, 784)
y_test = torch.randint(0, 10, (1000,))
test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=64)
correct = 0
total = 0
with torch.no_grad():
    for X, y in test_loader:
        X, y = X.cuda(), y.cuda()
        outputs = model(X)
        _, predicted = outputs.max(1)
        total += y.size(0)
        correct += predicted.eq(y).sum().item()
accuracy = 100.0 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")
print(f"Correct: {correct}/{total}")
Submit it, bundling the saved model file alongside the script:
from cumulus import CumulusClient
client = CumulusClient()
job = client.submit(
script="evaluate.py",
include_patterns=["*.pt"], # Include model file
requirements=["torch"],
workload_type="inference"
)
Long-Running Inference with Checkpointing
For processing large datasets, use CumulusJob to handle interruptions gracefully:
import torch
import json
from transformers import AutoModel, AutoTokenizer
from cumulus import CumulusJob
model = AutoModel.from_pretrained("bert-base-uncased").cuda()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.eval()
# Large dataset to process
texts = [f"Sample text {i}" for i in range(10000)]
with CumulusJob() as job:
    # Resume from where we left off if interrupted
    if job.is_resumed:
        processed = job.checkpoint.get('processed', [])
        start_idx = job.checkpoint.get('last_index', 0) + 1
        print(f"Resuming from index {start_idx}")
    else:
        processed = []
        start_idx = 0

    batch_size = 32
    for i in range(start_idx, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to("cuda")
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings = outputs.last_hidden_state[:, 0, :].cpu().tolist()

        for text, emb in zip(batch, embeddings):
            processed.append({"text": text, "embedding": emb})

        # Update state for checkpoint
        job.state = {
            'processed': processed,
            'last_index': i + len(batch) - 1
        }

        if (i // batch_size) % 10 == 0:
            print(f"Processed {len(processed)}/{len(texts)}")

    job.complete()
# Save final results
with open("embeddings.json", "w") as f:
    json.dump(processed, f)
print(f"Complete! Generated {len(processed)} embeddings.")
Live Inference Servers
Deploy long-running inference servers (vLLM, SGLang, FastAPI, etc.) and get a public URL to access them.
Tunnel URLs require your Cumulus API key for access. Include it via either header:
- X-API-Key: <your-api-key>
- Authorization: Bearer <your-api-key>
from cumulus import CumulusClient
import requests
import os
client = CumulusClient()
# Deploy server with exposed port
job = client.submit(
script="server.py",
requirements=["vllm", "fastapi", "uvicorn"],
service_port=8000, # Port your server listens on
workload_type="inference"
)
# Wait for public URL
tunnel_url = client.wait_for_tunnel(job.job_id, timeout=300)
print(f"Server ready at: {tunnel_url}")
# Example: http://tunnel.cumuluslabs.io:8443/54321
# Make authenticated requests to your server
api_key = os.environ.get("CUMULUS_API_KEY")
response = requests.post(
f"{tunnel_url}/generate",
headers={"X-API-Key": api_key},
json={"prompt": "Hello!"}
)
How it works:
- Set service_port to the port your server listens on (see the server.py sketch below)
- Cumulus creates a tunnel to make your server publicly accessible
- Use wait_for_tunnel() to get the public URL
- The URL stays active as long as your job is running
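For reference, here is a minimal server.py sketch that matches the client code above. It assumes FastAPI and uvicorn, listens on port 8000 to line up with service_port=8000, and exposes the /generate endpoint the example calls; the endpoint name and response shape are illustrative, not a Cumulus requirement:
# server.py - minimal FastAPI app exposed through the Cumulus tunnel
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(request: GenerateRequest):
    # Replace this stub with real model inference (vLLM, SGLang, transformers, ...)
    return {"output": f"Echo: {request.prompt}"}

if __name__ == "__main__":
    # Bind to 0.0.0.0 on the port passed as service_port so the tunnel can reach it
    uvicorn.run(app, host="0.0.0.0", port=8000)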
For guaranteed performance, set explicit resource limits:
job = client.submit(
script="server.py",
service_port=8000,
workload_type="inference",
sm_percent=100, # 100% GPU compute (dedicated)
vram_gb=40.0 # 40GB VRAM reserved
)
See Inference Examples for complete server examples.
Resource Configuration
Configure resources based on your model size:
job = client.submit(
script="inference.py",
workload_type="inference",
gpu_count=1, # Number of GPUs
memory_request="16Gi", # Minimum guaranteed memory
memory_limit="32Gi" # Maximum allowed
)
Memory guidelines:
| Model Size | Recommended Memory |
|---|---|
| Small (< 1B params) | 8Gi - 16Gi |
| Medium (1B - 7B params) | 16Gi - 32Gi |
| Large (7B+ params) | 32Gi - 64Gi |
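These ranges follow a simple rule of thumb: model weights in fp16 take roughly 2 bytes per parameter, plus headroom for activations, tokenizer buffers, and batching. A back-of-the-envelope sketch (the 2x headroom factor is an assumption, not a Cumulus requirement):
def estimate_memory_gib(params_billion, bytes_per_param=2, headroom=2.0):
    """Rough memory estimate: fp16 weights plus headroom for activations and batching."""
    weights_gib = params_billion * 1e9 * bytes_per_param / (1024 ** 3)
    return weights_gib * headroom

# A 7B-parameter model in fp16 is about 13 GiB of weights,
# so roughly 26 GiB with headroom -- in line with the 32Gi row above
print(f"{estimate_memory_gib(7):.1f} GiB")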
Environment Variables
Pass API keys, model paths, or configuration:
job = client.submit(
script="inference.py",
workload_type="inference",
env={
"MODEL_PATH": "s3://my-bucket/model.pt",
"HF_TOKEN": "your-huggingface-token",
"BATCH_SIZE": "64"
}
)
Access in your script:
import os
model_path = os.environ.get("MODEL_PATH")
hf_token = os.environ.get("HF_TOKEN")
batch_size = int(os.environ.get("BATCH_SIZE", "32"))
Workload Types Comparison
| Type | Use Case | Eviction Priority | Typical Duration |
|---|---|---|---|
| training | Model training | Highest (rarely evicted) | Hours to days |
| finetuning | Fine-tuning pre-trained models | Medium | Hours |
| inference | Batch inference, evaluation | Lowest | Minutes to hours |
Choose the right workload type to optimize performance and cost.
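For example, the same client can submit different scripts under different workload types; this sketch reuses the submit() call shown above (the script names are illustrative):
# Batch scoring or evaluation: lowest priority, cheapest to run
eval_job = client.submit(script="evaluate.py", workload_type="inference")

# A long training run: highest priority, rarely evicted
train_job = client.submit(script="train.py", workload_type="training")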
Next Steps
- Resource Management - Configure GPU and memory
- Inference Examples - Real-world use cases
- SDK Reference - Complete API documentation