Inference Jobs

Run model inference on GPU with automatic scaling and efficient resource utilization.

What you'll learn
  • How to submit batch inference jobs
  • How to deploy live inference servers with public URLs
  • Difference between inference and training workloads
  • Best practices for efficient GPU utilization

Quick Start

Submit any inference script to run on GPU:

from cumulus import CumulusClient

client = CumulusClient()

job = client.submit(
    script="inference.py",
    requirements=["torch", "transformers"],
    workload_type="inference"
)

print(f"Job ID: {job.job_id}")

# Wait for results
client.wait_for_completion(job.job_id)
results = client.get_results(job.job_id)
print(results)

Your script runs on an NVIDIA GPU with all dependencies installed automatically.


Why workload_type="inference"?

Setting workload_type="inference" optimizes your job for batch processing:

Benefit               Description
Optimized resources   Your job runs on GPUs with available capacity
Cost efficiency       Lower priority means lower cost
Batch-friendly        Designed for jobs that run to completion
Eviction Priority

Inference jobs have lower eviction priority than training jobs. This means if GPU resources become scarce, inference jobs may be paused to make room for training workloads. This is usually fine since inference jobs are typically shorter and easier to restart.
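One way to make restarts cheap is to write the inference script idempotently, so a re-run skips work that already finished before an eviction. A minimal sketch in plain Python (no Cumulus APIs; `process_prompt` is a hypothetical stand-in for your model call):

```python
import json
import os

def process_prompt(prompt: str) -> str:
    # Stand-in for real model inference.
    return prompt.upper()

def run_batch(prompts, out_dir="outputs"):
    """Write one result file per prompt, skipping prompts whose
    results already exist, so a restarted job only pays for the
    work it has not done yet. Returns how many prompts were newly
    processed on this run."""
    os.makedirs(out_dir, exist_ok=True)
    done = 0
    for i, prompt in enumerate(prompts):
        path = os.path.join(out_dir, f"{i}.json")
        if os.path.exists(path):  # already processed before eviction
            continue
        result = process_prompt(prompt)
        with open(path, "w") as f:
            json.dump({"prompt": prompt, "output": result}, f)
        done += 1
    return done
```

Run the same script again after an eviction and only the missing outputs are computed; the CumulusJob checkpointing shown later in this guide is a heavier-weight alternative for in-memory state.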


Batch Inference Example

Process a dataset through your model:

inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model once
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

prompts = [
    "The quick brown fox",
    "Machine learning is",
    "In the year 2025"
]

results = []
with torch.no_grad():
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=50)
        result = tokenizer.decode(outputs[0], skip_special_tokens=True)
        results.append({"prompt": prompt, "output": result})
        print(f"Processed: {prompt[:30]}...")

# Save results
import json
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"Inference complete! Processed {len(results)} prompts.")
submit.py
from cumulus import CumulusClient

client = CumulusClient()

job = client.submit(
    script="inference.py",
    requirements=["torch", "transformers"],
    workload_type="inference"
)

print(f"Job: {job.job_id}")

# Wait and retrieve results
client.wait_for_completion(job.job_id)
output = client.get_results(job.job_id, file="results.json")
print(output)

Model Evaluation

Evaluate a trained model on a test dataset:

evaluate.py
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Load your trained model
# weights_only defaults to True in recent PyTorch; loading a full
# pickled model requires weights_only=False (only for trusted files)
model = torch.load("model.pt", weights_only=False).cuda()
model.eval()

# Test data (replace with your actual test set)
X_test = torch.randn(1000, 784)
y_test = torch.randint(0, 10, (1000,))
test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=64)

correct = 0
total = 0

with torch.no_grad():
    for X, y in test_loader:
        X, y = X.cuda(), y.cuda()
        outputs = model(X)
        _, predicted = outputs.max(1)
        total += y.size(0)
        correct += predicted.eq(y).sum().item()

accuracy = 100.0 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")
print(f"Correct: {correct}/{total}")
submit.py
from cumulus import CumulusClient

client = CumulusClient()

job = client.submit(
    script="evaluate.py",
    include_patterns=["*.pt"],  # Include model file
    requirements=["torch"],
    workload_type="inference"
)

Long-Running Inference with Checkpointing

For processing large datasets, use CumulusJob to handle interruptions gracefully:

batch_inference.py
import torch
import json
from transformers import AutoModel, AutoTokenizer
from cumulus import CumulusJob

model = AutoModel.from_pretrained("bert-base-uncased").cuda()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.eval()

# Large dataset to process
texts = [f"Sample text {i}" for i in range(10000)]

with CumulusJob() as job:
    # Resume from where we left off if interrupted
    if job.is_resumed:
        processed = job.checkpoint.get('processed', [])
        start_idx = job.checkpoint.get('last_index', 0) + 1
        print(f"Resuming from index {start_idx}")
    else:
        processed = []
        start_idx = 0

    batch_size = 32
    for i in range(start_idx, len(texts), batch_size):
        batch = texts[i:i+batch_size]

        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to("cuda")
        with torch.no_grad():
            outputs = model(**inputs)
            embeddings = outputs.last_hidden_state[:, 0, :].cpu().tolist()

        for text, emb in zip(batch, embeddings):
            processed.append({"text": text, "embedding": emb})

        # Update state for checkpoint
        job.state = {
            'processed': processed,
            'last_index': i + len(batch) - 1
        }

        if (i // batch_size) % 10 == 0:
            print(f"Processed {len(processed)}/{len(texts)}")

    job.complete()

# Save final results
with open("embeddings.json", "w") as f:
    json.dump(processed, f)

print(f"Complete! Generated {len(processed)} embeddings.")

Live Inference Servers

Deploy long-running inference servers (vLLM, SGLang, FastAPI, etc.) and get a public URL to access them.

Authentication Required

Tunnel URLs require your Cumulus API key for access. Include it via:

  • X-API-Key: <your-api-key> header, or
  • Authorization: Bearer <your-api-key> header
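Both header forms carry the same key, so a small helper can keep the choice in one place. A sketch (`auth_headers` is a hypothetical convenience function, not part of the Cumulus SDK; it works with any HTTP client that accepts a headers dict):

```python
def auth_headers(api_key: str, scheme: str = "x-api-key") -> dict:
    """Build the auth headers accepted by tunnel URLs: either the
    X-API-Key header or the Authorization: Bearer form."""
    if scheme == "x-api-key":
        return {"X-API-Key": api_key}
    if scheme == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    raise ValueError(f"unknown scheme: {scheme!r}")
```

For example, `requests.post(url, headers=auth_headers(api_key), json=...)` attaches the key to every call.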
from cumulus import CumulusClient
import requests
import os

client = CumulusClient()

# Deploy server with exposed port
job = client.submit(
    script="server.py",
    requirements=["vllm", "fastapi", "uvicorn"],
    service_port=8000,  # Port your server listens on
    workload_type="inference"
)

# Wait for public URL
tunnel_url = client.wait_for_tunnel(job.job_id, timeout=300)
print(f"Server ready at: {tunnel_url}")
# Example: http://tunnel.cumuluslabs.io:8443/54321

# Make authenticated requests to your server
api_key = os.environ.get("CUMULUS_API_KEY")

response = requests.post(
    f"{tunnel_url}/generate",
    headers={"X-API-Key": api_key},
    json={"prompt": "Hello!"}
)

How it works:

  1. Set service_port to the port your server listens on
  2. Cumulus creates a tunnel to make your server publicly accessible
  3. Use wait_for_tunnel() to get the public URL
  4. The URL stays active as long as your job is running
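Note that the tunnel URL can come up before the server inside the job has finished loading its model, so it helps to poll a readiness endpoint before sending traffic. A generic sketch (the `probe` callable and the `/health` endpoint are assumptions; substitute whatever your server exposes):

```python
import time

def wait_until_healthy(probe, attempts=30, delay=2.0):
    """Call `probe` (e.g. a lambda that hits the server's /health
    endpoint and returns True on HTTP 200) until it succeeds or the
    attempt budget runs out. Returns True if the server came up."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Typical usage: `wait_until_healthy(lambda: requests.get(f"{tunnel_url}/health", headers={"X-API-Key": api_key}).ok)`.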

For guaranteed performance, set explicit resource limits:

job = client.submit(
    script="server.py",
    service_port=8000,
    workload_type="inference",
    sm_percent=100,  # 100% GPU compute (dedicated)
    vram_gb=40.0     # 40GB VRAM reserved
)

See Inference Examples for complete server examples.


Resource Configuration

Configure resources based on your model size:

job = client.submit(
    script="inference.py",
    workload_type="inference",
    gpu_count=1,             # Number of GPUs
    memory_request="16Gi",   # Minimum guaranteed memory
    memory_limit="32Gi"      # Maximum allowed
)

Memory guidelines:

Model Size                Recommended Memory
Small (< 1B params)       8Gi - 16Gi
Medium (1B - 7B params)   16Gi - 32Gi
Large (7B+ params)        32Gi - 64Gi
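The guidelines above can be encoded as a small lookup so submission scripts pick a consistent request/limit pair. A sketch (`recommended_memory` is a hypothetical helper, not a Cumulus API; the values come straight from the table):

```python
def recommended_memory(params_billion: float) -> tuple:
    """Map model size (in billions of parameters) to the
    (memory_request, memory_limit) pair from the guidelines table."""
    if params_billion < 1:
        return ("8Gi", "16Gi")
    if params_billion < 7:
        return ("16Gi", "32Gi")
    return ("32Gi", "64Gi")
```

For example, `req, lim = recommended_memory(7)` yields the large-model pair, which you could pass as `memory_request=req, memory_limit=lim`.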

Environment Variables

Pass API keys, model paths, or configuration:

job = client.submit(
    script="inference.py",
    workload_type="inference",
    env={
        "MODEL_PATH": "s3://my-bucket/model.pt",
        "HF_TOKEN": "your-huggingface-token",
        "BATCH_SIZE": "64"
    }
)

Access in your script:

import os

model_path = os.environ.get("MODEL_PATH")
hf_token = os.environ.get("HF_TOKEN")
batch_size = int(os.environ.get("BATCH_SIZE", "32"))

Workload Types Comparison

Type         Use Case                         Eviction Priority          Typical Duration
training     Model training                   Highest (rarely evicted)   Hours to days
finetuning   Fine-tuning pre-trained models   Medium                     Hours
inference    Batch inference, evaluation      Lowest                     Minutes to hours

Choose the right workload type to optimize performance and cost.
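Since `workload_type` is passed as a plain string, a typo is only caught at submission time; a small lookup built from the table above can fail fast locally. A sketch (`validate_workload_type` is a hypothetical helper, not part of the SDK):

```python
# (eviction priority, typical duration) per type, from the table above
WORKLOAD_TYPES = {
    "training":   ("highest", "hours to days"),
    "finetuning": ("medium", "hours"),
    "inference":  ("lowest", "minutes to hours"),
}

def validate_workload_type(workload_type: str) -> str:
    """Raise early on an unknown workload type instead of letting a
    typo reach the scheduler."""
    if workload_type not in WORKLOAD_TYPES:
        raise ValueError(
            f"unknown workload_type {workload_type!r}; "
            f"expected one of {sorted(WORKLOAD_TYPES)}"
        )
    return workload_type
```

For example, `client.submit(script="inference.py", workload_type=validate_workload_type("inference"))` is a no-op on valid input and raises on `"serving"`.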


Next Steps