
Getting Started

Run your first GPU job in under 5 minutes.

1. Install the SDK

pip install cumulus-sdk

2. Set Up Your API Key

Get your API key from the Cumulus Dashboard and set it as an environment variable:

export CUMULUS_API_KEY="your-api-key-here"

Persisting Your API Key

Add this line to your shell profile (~/.bashrc, ~/.zshrc, etc.) to avoid setting it every session:

echo 'export CUMULUS_API_KEY="your-api-key-here"' >> ~/.zshrc
source ~/.zshrc
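If you want to confirm the variable is actually visible to Python before submitting anything, a quick stdlib-only check (nothing Cumulus-specific) looks like this:

```python
import os

def check_api_key() -> bool:
    """Return True if CUMULUS_API_KEY is set and non-empty."""
    return bool(os.environ.get("CUMULUS_API_KEY"))

if check_api_key():
    print("CUMULUS_API_KEY is set")
else:
    print("CUMULUS_API_KEY is missing - export it first")
```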

3. Create a Training Script

Save this as train.py:

import torch
import torch.nn as nn

# Simple model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
).cuda()

optimizer = torch.optim.Adam(model.parameters())

# Training loop
for epoch in range(10):
    x = torch.randn(32, 784).cuda()
    y = torch.randint(0, 10, (32,)).cuda()

    loss = nn.CrossEntropyLoss()(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch}: loss={loss.item():.4f}")

print("Training complete!")

4. Submit the Job

Create submit.py:

from cumulus import CumulusClient

client = CumulusClient()

job = client.submit(
    script="train.py",
    requirements=["torch"]
)

print(f"Job submitted: {job.job_id}")
print(f"S3 path: {job.s3_path}")

Run it:

python submit.py

5. Monitor and Get Results

from cumulus import CumulusClient

client = CumulusClient()
job_id = "your-job-id" # From step 4

# Check status
status = client.get_status(job_id)
print(f"Status: {status}")

# Wait for completion (blocks until done)
final_status = client.wait_for_completion(job_id)
print(f"Final status: {final_status}")

# Get output
if final_status == "SUCCEEDED":
    results = client.get_results(job_id)
    print(results)

What Happened?

When you called client.submit():

  1. Packaged - Your script was bundled into a tarball
  2. Uploaded - Code and config were uploaded to cloud storage
  3. Queued - Your job waited in the queue for GPU resources
  4. Scheduled - A GPU pod was created with your code
  5. Executed - Your script ran on an NVIDIA GPU
  6. Results - Output was saved for retrieval

All of this happened automatically. No Docker, no Kubernetes, no GPU drivers.
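The SDK's actual packaging format is internal, but the first step above - bundling your script into a tarball - can be illustrated with a plain stdlib sketch:

```python
import tarfile
import tempfile
from pathlib import Path

def bundle_script(script_path: str) -> str:
    """Bundle a script into a gzipped tarball, roughly what step 1 does."""
    bundle = Path(tempfile.mkdtemp()) / "job_bundle.tar.gz"
    with tarfile.open(bundle, "w:gz") as tar:
        # arcname strips the local directory so the job sees just the filename
        tar.add(script_path, arcname=Path(script_path).name)
    return str(bundle)
```

This is only a sketch of the idea; `client.submit()` handles all of it (plus uploading and scheduling) for you.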

Next Steps

Add Checkpointing

Make your training resilient to interruptions with CumulusJob:

# train.py
import torch
import torch.nn as nn
from cumulus import CumulusJob

model = nn.Linear(784, 10).cuda()
optimizer = torch.optim.Adam(model.parameters())

with CumulusJob() as job:
    # Resume from checkpoint if this is a requeued job
    if job.is_resumed:
        model.load_state_dict(job.checkpoint['model'])
        optimizer.load_state_dict(job.checkpoint['optimizer'])

    for epoch in range(job.start_epoch, 100):
        x = torch.randn(32, 784).cuda()
        y = torch.randint(0, 10, (32,)).cuda()

        loss = nn.CrossEntropyLoss()(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Keep state updated for automatic checkpoint on eviction
        job.state = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch
        }
        print(f"Epoch {epoch}: loss={loss.item():.4f}")

    job.complete()

If your job gets interrupted (preemption, crashes), it automatically resumes from the last checkpoint.
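The save/resume pattern that CumulusJob automates can be sketched in isolation. This torch-free toy (a plain dict as state, pickle as the checkpoint format - both illustrative assumptions, not the SDK's internals) shows the core idea:

```python
import pickle
from pathlib import Path

CHECKPOINT = Path("checkpoint.pkl")

def load_state():
    """Resume from a prior checkpoint if one exists, else start fresh."""
    if CHECKPOINT.exists():
        return pickle.loads(CHECKPOINT.read_bytes())
    return {"epoch": 0, "weights": [0.0] * 4}

def save_state(state):
    """Persist state so an interrupted run can pick up where it left off."""
    CHECKPOINT.write_bytes(pickle.dumps(state))

state = load_state()
for epoch in range(state["epoch"], 10):
    # Stand-in for a real training step
    state["weights"] = [w + 0.1 for w in state["weights"]]
    state["epoch"] = epoch + 1
    save_state(state)
```

If this process dies mid-loop, rerunning it continues from the last saved epoch rather than epoch 0 - the same guarantee CumulusJob gives you via `job.state` and `job.is_resumed`.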

Include Multiple Files

The SDK scans your script's imports and bundles the local modules and data files it references, so in most cases you just submit:

job = client.submit(
    script="main.py",
    requirements_file="requirements.txt"
)
# model.py, data.py, config.yaml are detected automatically!

For explicit control, use glob patterns:

job = client.submit(
    script="main.py",
    include_patterns=["*.yaml", "data/*.csv"],  # Include by pattern
    exclude_patterns=["*.pyc", "__pycache__/*"]  # Exclude patterns
)
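The SDK's exact matching rules are its own, but if the patterns above follow standard glob semantics (an assumption), you can preview which files would match locally with the stdlib's `fnmatch`:

```python
from fnmatch import fnmatch

def selected(path, include_patterns, exclude_patterns):
    """True if path matches any include pattern and no exclude pattern."""
    included = any(fnmatch(path, p) for p in include_patterns)
    excluded = any(fnmatch(path, p) for p in exclude_patterns)
    return included and not excluded

include = ["*.yaml", "data/*.csv"]
exclude = ["*.pyc", "__pycache__/*"]
print(selected("config.yaml", include, exclude))     # True
print(selected("data/train.csv", include, exclude))  # True
print(selected("module.pyc", include, exclude))      # False
```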

Pass Environment Variables

job = client.submit(
    script="train.py",
    env={
        "WANDB_API_KEY": "your-key",
        "HF_TOKEN": "your-token"
    }
)

Set Priority

Higher priority jobs run sooner and are less likely to be interrupted:

job = client.submit(
    script="train.py",
    priority=8  # 1-10, higher = more important
)

Learn More