Getting Started
Run your first GPU job in under 5 minutes.
1. Install the SDK
pip install cumulus-sdk
2. Set Up Your API Key
Get your API key from the Cumulus Dashboard and set it as an environment variable:
export CUMULUS_API_KEY="your-api-key-here"
Add this line to your shell profile (~/.bashrc, ~/.zshrc, etc.) to avoid setting it every session:
echo 'export CUMULUS_API_KEY="your-api-key-here"' >> ~/.zshrc
source ~/.zshrc
3. Create a Training Script
Save this as train.py:
import torch
import torch.nn as nn

# Simple model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
).cuda()

optimizer = torch.optim.Adam(model.parameters())

# Training loop
for epoch in range(10):
    x = torch.randn(32, 784).cuda()
    y = torch.randint(0, 10, (32,)).cuda()
    loss = nn.CrossEntropyLoss()(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch}: loss={loss.item():.4f}")

print("Training complete!")
4. Submit the Job
Create submit.py:
from cumulus import CumulusClient

client = CumulusClient()
job = client.submit(
    script="train.py",
    requirements=["torch"]
)
print(f"Job submitted: {job.job_id}")
print(f"S3 path: {job.s3_path}")
Run it:
python submit.py
5. Monitor and Get Results
from cumulus import CumulusClient

client = CumulusClient()
job_id = "your-job-id"  # From step 4

# Check status
status = client.get_status(job_id)
print(f"Status: {status}")

# Wait for completion (blocks until done)
final_status = client.wait_for_completion(job_id)
print(f"Final status: {final_status}")

# Get output
if final_status == "SUCCEEDED":
    results = client.get_results(job_id)
    print(results)
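Under the hood, wait_for_completion presumably polls the job's status until it reaches a terminal state. A minimal sketch of such a polling loop, purely illustrative (the FAILED and CANCELLED status names are assumptions; only SUCCEEDED appears above):

```python
import time

def wait_until_done(get_status, job_id, poll_interval=1.0):
    """Poll get_status(job_id) until the job reaches a terminal state."""
    terminal = {"SUCCEEDED", "FAILED", "CANCELLED"}  # assumed status names
    while True:
        status = get_status(job_id)
        if status in terminal:
            return status
        time.sleep(poll_interval)

# Demo with a fake status source that succeeds on the third poll
statuses = iter(["QUEUED", "RUNNING", "SUCCEEDED"])
print(wait_until_done(lambda _id: next(statuses), "job-123", poll_interval=0.01))
# SUCCEEDED
```

The real method may use a different interval or server-side notification; the point is only that it blocks your script until the job is finished.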
What Happened?
When you called client.submit():
- Packaged - Your script was bundled into a tarball
- Uploaded - The code and config were uploaded to cloud storage
- Queued - The job waited in line for GPU resources
- Scheduled - A GPU pod was created with your code
- Executed - Your script ran on an NVIDIA GPU
- Results - The output was saved for retrieval
All of this happened automatically. No Docker, no Kubernetes, no GPU drivers.
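The "Packaged" step is ordinary tarball creation. As a self-contained illustration using only the standard library (not Cumulus's actual packaging code), bundling a script looks like:

```python
import tarfile
import tempfile
from pathlib import Path

# Create a throwaway script to package
workdir = Path(tempfile.mkdtemp())
script = workdir / "train.py"
script.write_text("print('hello from the GPU')\n")

# Bundle it into a gzipped tarball, as the "Packaged" step does
bundle = workdir / "job.tar.gz"
with tarfile.open(bundle, "w:gz") as tar:
    tar.add(script, arcname="train.py")

# Inspect the archive contents
with tarfile.open(bundle, "r:gz") as tar:
    print(tar.getnames())  # ['train.py']
```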
Next Steps
Add Checkpointing
Make your training resilient to interruptions with CumulusJob:
# train.py
import torch
import torch.nn as nn
from cumulus import CumulusJob

model = nn.Linear(784, 10).cuda()
optimizer = torch.optim.Adam(model.parameters())

with CumulusJob() as job:
    # Resume from checkpoint if this is a requeued job
    if job.is_resumed:
        model.load_state_dict(job.checkpoint['model'])
        optimizer.load_state_dict(job.checkpoint['optimizer'])

    for epoch in range(job.start_epoch, 100):
        x = torch.randn(32, 784).cuda()
        y = torch.randint(0, 10, (32,)).cuda()
        loss = nn.CrossEntropyLoss()(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Keep state updated for automatic checkpoint on eviction
        job.state = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch
        }
        print(f"Epoch {epoch}: loss={loss.item():.4f}")

    job.complete()
If your job is interrupted (by preemption or a crash), it automatically resumes from the last checkpoint when it is requeued.
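To see why the loop starts at job.start_epoch, here is a toy stand-in for the checkpoint contract. FakeJob is hypothetical: it only mimics the fields the example above uses, with a local JSON file playing the role of the saved checkpoint:

```python
import json
import tempfile
from pathlib import Path

CKPT = Path(tempfile.gettempdir()) / "cumulus_demo_ckpt.json"
CKPT.unlink(missing_ok=True)  # start the demo with no checkpoint

class FakeJob:
    """Toy stand-in for CumulusJob's checkpoint fields (illustrative only)."""
    def __init__(self):
        self.is_resumed = CKPT.exists()
        self.checkpoint = json.loads(CKPT.read_text()) if self.is_resumed else None
        # Resume after the last completed epoch, else start from zero
        self.start_epoch = self.checkpoint["epoch"] + 1 if self.is_resumed else 0
        self.state = None

    def save(self):  # the real SDK checkpoints automatically on eviction
        CKPT.write_text(json.dumps(self.state))

job = FakeJob()
for epoch in range(job.start_epoch, 3):
    job.state = {"epoch": epoch}
    job.save()

# Simulate a restart: a new job picks up after the last saved epoch
job2 = FakeJob()
print(job2.is_resumed, job2.start_epoch)  # True 3
```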
Include Multiple Files
The SDK automatically detects your Python imports and data files, so in most cases a plain submit picks up everything your script needs:
job = client.submit(
    script="main.py",
    requirements_file="requirements.txt"
)
# model.py, data.py, config.yaml are detected automatically!
For explicit control, use glob patterns:
job = client.submit(
    script="main.py",
    include_patterns=["*.yaml", "data/*.csv"],  # Include by pattern
    exclude_patterns=["*.pyc", "__pycache__/*"]  # Exclude by pattern
)
Pass Environment Variables
job = client.submit(
    script="train.py",
    env={
        "WANDB_API_KEY": "your-key",
        "HF_TOKEN": "your-token"
    }
)
Set Priority
Higher-priority jobs run sooner and are less likely to be interrupted:
job = client.submit(
    script="train.py",
    priority=8  # 1-10, higher = more important
)