Inference Jobs
Run model inference on GPU with automatic scaling and efficient resource utilization. This guide covers:
- How to submit batch inference jobs
- How to deploy live inference servers with public URLs
- Difference between inference and training workloads
- Best practices for efficient GPU utilization
Quick Start
Submit any inference script to run on GPU:
from cumulus import CumulusClient
client = CumulusClient()
job = client.submit(
script="inference.py",
requirements=["torch", "transformers"],
workload_type="inference"
)
print(f"Job ID: {job.job_id}")
# Wait for results
client.wait_for_completion(job.job_id)
results = client.get_results(job.job_id)
print(results)
Your script runs on an NVIDIA GPU with all dependencies installed automatically.
Why workload_type="inference"?
Setting workload_type="inference" optimizes your job for batch processing:
| Benefit | Description |
|---|---|
| Optimized resources | Your job runs on GPUs with available capacity |
| Cost efficiency | Lower priority means lower cost |
| Batch-friendly | Designed for jobs that run to completion |
Inference jobs have lower eviction priority than training jobs. This means if GPU resources become scarce, inference jobs may be paused to make room for training workloads. This is usually fine since inference jobs are typically shorter and easier to restart.
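One defensive pattern is to make the inference script itself safe to rerun, so a restart redoes little work. A minimal sketch of that idea, assuming a prompts list and a run_model() helper that are placeholders (not part of the Cumulus API), and assuming the job's output file is available again on restart; see the checkpointing section below for the CumulusJob-based approach:
import json
import os

results_file = "results.json"

# Reload anything a previous run of this job already wrote
if os.path.exists(results_file):
    with open(results_file) as f:
        results = json.load(f)
else:
    results = []

done = {r["prompt"] for r in results}

for prompt in prompts:
    if prompt in done:
        continue  # already processed before the restart
    results.append({"prompt": prompt, "output": run_model(prompt)})
    # Persist after each prompt so a restart loses little work
    with open(results_file, "w") as f:
        json.dump(results, f, indent=2)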
Batch Inference Example
Process a dataset through your model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model once
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()
prompts = [
"The quick brown fox",
"Machine learning is",
"In the year 2025"
]
results = []
with torch.no_grad():
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=50)
        result = tokenizer.decode(outputs[0], skip_special_tokens=True)
        results.append({"prompt": prompt, "output": result})
        print(f"Processed: {prompt[:30]}...")
# Save results
import json
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
print(f"Inference complete! Processed {len(results)} prompts.")
Submit the script as an inference job:
from cumulus import CumulusClient
client = CumulusClient()
job = client.submit(
script="inference.py",
requirements=["torch", "transformers"],
workload_type="inference"
)
print(f"Job: {job.job_id}")
# Wait and retrieve results
client.wait_for_completion(job.job_id)
output = client.get_results(job.job_id, file="results.json")
print(output)
Model Evaluation
Evaluate a trained model on a test dataset:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
# Load your trained model (saved with torch.save(model));
# weights_only=False is required to load a full module on PyTorch >= 2.6
model = torch.load("model.pt", weights_only=False).cuda()
model.eval()
# Test data (replace with your actual test set)
X_test = torch.randn(1000, 784)
y_test = torch.randint(0, 10, (1000,))
test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=64)
correct = 0
total = 0
with torch.no_grad():
    for X, y in test_loader:
        X, y = X.cuda(), y.cuda()
        outputs = model(X)
        _, predicted = outputs.max(1)
        total += y.size(0)
        correct += predicted.eq(y).sum().item()
accuracy = 100.0 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")
print(f"Correct: {correct}/{total}")
Submit it, bundling the saved model file alongside the script:
from cumulus import CumulusClient
client = CumulusClient()
job = client.submit(
script="evaluate.py",
include_patterns=["*.pt"], # Include model file
requirements=["torch"],
workload_type="inference"
)
Long-Running Inference with Checkpointing
For processing large datasets, use CumulusJob to handle interruptions gracefully:
import torch
import json
from transformers import AutoModel, AutoTokenizer
from cumulus import CumulusJob
model = AutoModel.from_pretrained("bert-base-uncased").cuda()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.eval()
# Large dataset to process
texts = [f"Sample text {i}" for i in range(10000)]
with CumulusJob() as job:
    # Resume from where we left off if interrupted
    if job.is_resumed:
        processed = job.checkpoint.get('processed', [])
        start_idx = job.checkpoint.get('last_index', 0) + 1
        print(f"Resuming from index {start_idx}")
    else:
        processed = []
        start_idx = 0

    batch_size = 32
    for i in range(start_idx, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to("cuda")
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings = outputs.last_hidden_state[:, 0, :].cpu().tolist()

        for text, emb in zip(batch, embeddings):
            processed.append({"text": text, "embedding": emb})

        # Update state for checkpoint
        job.state = {
            'processed': processed,
            'last_index': i + len(batch) - 1
        }

        if (i // batch_size) % 10 == 0:
            print(f"Processed {len(processed)}/{len(texts)}")

    job.complete()
# Save final results
with open("embeddings.json", "w") as f:
    json.dump(processed, f)
print(f"Complete! Generated {len(processed)} embeddings.")
Live Inference Servers
Deploy long-running inference servers (vLLM, SGLang, FastAPI, etc.) and get a public URL to access them.
Tunnel URLs require your Cumulus API key for access. Include it via either header:
- X-API-Key: <your-api-key>
- Authorization: Bearer <your-api-key>
from cumulus import CumulusClient
import requests
import os
client = CumulusClient()
# Deploy server with exposed port
job = client.submit(
script="server.py",
requirements=["vllm", "fastapi", "uvicorn"],
service_port=8000, # Port your server listens on
workload_type="inference"
)
# Wait for public URL
tunnel_url = client.wait_for_tunnel(job.job_id, timeout=300)
print(f"Server ready at: {tunnel_url}")
# Example: http://tunnel.cumuluslabs.io:8443/54321
# Make authenticated requests to your server
api_key = os.environ.get("CUMULUS_API_KEY")
response = requests.post(
f"{tunnel_url}/generate",
headers={"X-API-Key": api_key},
json={"prompt": "Hello!"}
)
How it works:
- Set service_port to the port your server listens on (see the server.py sketch below)
- Cumulus creates a tunnel to make your server publicly accessible
- Use wait_for_tunnel() to get the public URL
- The URL stays active as long as your job is running
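For reference, here is a minimal server.py sketch that matches the client code above. It assumes FastAPI and uvicorn, listens on port 8000 to line up with service_port=8000, and exposes the /generate endpoint the example calls; the endpoint name and response shape are illustrative, not a Cumulus requirement:
# server.py - minimal FastAPI app exposed through the Cumulus tunnel
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(request: GenerateRequest):
    # Replace this stub with real model inference (vLLM, SGLang, transformers, ...)
    return {"output": f"Echo: {request.prompt}"}

if __name__ == "__main__":
    # Bind to 0.0.0.0 on the port passed as service_port so the tunnel can reach it
    uvicorn.run(app, host="0.0.0.0", port=8000)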
For guaranteed performance, set explicit resource limits:
job = client.submit(
script="server.py",
service_port=8000,
workload_type="inference",
sm_percent=100, # 100% GPU compute (dedicated)
vram_gb=40.0 # 40GB VRAM reserved
)
See Inference Examples for complete server examples.
Resource Configuration
Configure resources based on your model size:
job = client.submit(
script="inference.py",
workload_type="inference",
gpu_count=1, # Number of GPUs
memory_request="16Gi", # Minimum guaranteed memory
memory_limit="32Gi" # Maximum allowed
)
Memory guidelines:
| Model Size | Recommended Memory |
|---|---|
| Small (< 1B params) | 8Gi - 16Gi |
| Medium (1B - 7B params) | 16Gi - 32Gi |
| Large (7B+ params) | 32Gi - 64Gi |
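These ranges follow a simple rule of thumb: model weights in fp16 take roughly 2 bytes per parameter, plus headroom for activations, tokenizer buffers, and batching. A back-of-the-envelope sketch (the 2x headroom factor is an assumption, not a Cumulus requirement):
def estimate_memory_gib(params_billion, bytes_per_param=2, headroom=2.0):
    """Rough memory estimate: fp16 weights plus headroom for activations and batching."""
    weights_gib = params_billion * 1e9 * bytes_per_param / (1024 ** 3)
    return weights_gib * headroom

# A 7B-parameter model in fp16 is about 13 GiB of weights,
# so roughly 26 GiB with headroom -- in line with the 32Gi row above
print(f"{estimate_memory_gib(7):.1f} GiB")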
Environment Variables
Pass API keys, model paths, or configuration:
job = client.submit(
script="inference.py",
workload_type="inference",
env={
"MODEL_PATH": "s3://my-bucket/model.pt",
"HF_TOKEN": "your-huggingface-token",
"BATCH_SIZE": "64"
}
)
Access in your script:
import os
model_path = os.environ.get("MODEL_PATH")
hf_token = os.environ.get("HF_TOKEN")
batch_size = int(os.environ.get("BATCH_SIZE", "32"))
Workload Types Comparison
| Type | Use Case | Eviction Priority | Typical Duration |
|---|---|---|---|
| training | Model training | Highest (rarely evicted) | Hours to days |
| finetuning | Fine-tuning pre-trained models | Medium | Hours |
| inference | Batch inference, evaluation | Lowest | Minutes to hours |
Choose the right workload type to optimize performance and cost.
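For example, the same client can submit different scripts under different workload types; this sketch reuses the submit() call shown above (the script names are illustrative):
# Batch scoring or evaluation: lowest priority, cheapest to run
eval_job = client.submit(script="evaluate.py", workload_type="inference")

# A long training run: highest priority, rarely evicted
train_job = client.submit(script="train.py", workload_type="training")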
Next Steps
- Resource Management - Configure GPU and memory
- Inference Examples - Real-world use cases
- SDK Reference - Complete API documentation