# Training Jobs

Run GPU training jobs with automatic checkpointing and resume capability.
## Basic Training Job

Submit any PyTorch training script:
```python
from cumulus import CumulusClient

client = CumulusClient()

job = client.submit(
    script="train.py",
    requirements=["torch", "transformers"],
    workload_type="training",
)

print(f"Job ID: {job.job_id}")
```
Your script runs on an NVIDIA GPU with the specified dependencies installed.
## With Checkpointing

Training jobs can be interrupted (GPU preemption, node failures). Use `CumulusJob` to checkpoint and resume automatically:
```python
# train.py
import torch
import torch.nn as nn
from cumulus import CumulusJob

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

with CumulusJob() as job:
    # Resume from checkpoint if this is a requeued job
    if job.is_resumed:
        model.load_state_dict(job.checkpoint['model'])
        optimizer.load_state_dict(job.checkpoint['optimizer'])

    for epoch in range(job.start_epoch, 100):
        for batch in dataloader:
            loss = train_step(model, batch)

            # Keep state updated for automatic checkpoint on eviction
            job.state = {
                'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'epoch': epoch,
            }

        print(f"Epoch {epoch} complete")

    job.complete()
```
What happens on interruption:

- Cumulus detects the interruption
- `CumulusJob` saves your `job.state` to cloud storage
- The job is automatically requeued
- On restart, `job.is_resumed` is `True` and `job.checkpoint` contains your saved state
- Training continues where it left off
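The save-on-exit mechanics can be approximated in plain Python. The sketch below is a toy stand-in for `CumulusJob` (the real SDK uploads to cloud storage and reacts to eviction signals; `CheckpointingJob` and its local JSON file are illustrative assumptions, not the actual implementation):

```python
import json
import tempfile
from pathlib import Path

class CheckpointingJob:
    """Toy stand-in for CumulusJob: persists `state` when the block exits.

    The real SDK checkpoints to cloud storage on eviction; this sketch
    just writes JSON to a local file so the resume flow is visible.
    """

    def __init__(self, checkpoint_path):
        self.checkpoint_path = Path(checkpoint_path)
        self.state = None
        self.is_resumed = self.checkpoint_path.exists()
        self.checkpoint = (
            json.loads(self.checkpoint_path.read_text()) if self.is_resumed else None
        )

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Save whatever state the loop last assigned, whether we exit
        # normally or via an interruption raised inside the block.
        if self.state is not None:
            self.checkpoint_path.write_text(json.dumps(self.state))
        return False  # don't swallow exceptions

path = Path(tempfile.gettempdir()) / "toy_checkpoint.json"
path.unlink(missing_ok=True)

# First "run": record progress, then exit
with CheckpointingJob(path) as job:
    job.state = {"epoch": 3}

# Second "run": the saved state is available for resuming
resumed = CheckpointingJob(path)
print(resumed.is_resumed, resumed.checkpoint["epoch"])  # True 3
```

The key property mirrored here is that `job.state` is kept current inside the loop, so whatever was last assigned is what a requeued job resumes from.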
## Priority

Jobs with higher priority are scheduled first and are less likely to be interrupted:
```python
job = client.submit(
    script="train.py",
    priority=8,  # 1-10, higher = more important
)
```
| Priority | Behavior |
|---|---|
| 1-3 | Low priority, may be evicted for higher priority jobs |
| 4-6 | Normal priority (default: 5) |
| 7-10 | High priority, rarely evicted |
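To build intuition for the table, here is one plausible eviction rule a scheduler could apply — lowest priority loses, with the most recently submitted job evicted on ties. This is an illustrative sketch only; the actual Cumulus scheduling policy is not documented here:

```python
def pick_eviction_victim(jobs):
    """Return the running job to evict first under a simple policy:
    lowest priority, ties broken by most recent submission time."""
    return min(jobs, key=lambda j: (j["priority"], -j["submitted_at"]))

jobs = [
    {"id": "a", "priority": 8, "submitted_at": 100},
    {"id": "b", "priority": 2, "submitted_at": 200},
    {"id": "c", "priority": 2, "submitted_at": 300},
]
print(pick_eviction_victim(jobs)["id"])  # c
```

Under any such rule, a priority-8 job survives as long as lower-priority work is available to evict, which is why the 7-10 band is "rarely evicted".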
## GPU and Memory

Request specific resources:
```python
job = client.submit(
    script="train.py",
    gpu_count=1,            # Number of GPUs
    memory_request="16Gi",  # Minimum memory guaranteed
    memory_limit="32Gi",    # Maximum memory allowed
)
```
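The memory strings use Kubernetes-style binary quantities (`Ki`, `Mi`, `Gi`, `Ti` are powers of 1024). A minimal parser makes the units concrete — a sketch for illustration, not part of the SDK:

```python
def parse_quantity(q):
    """Convert a Kubernetes-style memory quantity like '16Gi' to bytes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return int(q[:-len(suffix)]) * factor
    return int(q)  # a bare number is a plain byte count

print(parse_quantity("16Gi"))  # 17179869184
```

So `memory_request="16Gi"` guarantees 16 × 1024³ ≈ 17.2 GB of RAM, and the job is killed if it exceeds `memory_limit`.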
## Multi-File Projects

The SDK automatically detects your Python imports and data files. Just submit your main script:
```python
job = client.submit(
    script="main.py",
    requirements_file="requirements.txt",
)
# model.py, data.py, utils/helpers.py, config.yaml detected automatically!
```
For explicit control, use glob patterns:
```python
job = client.submit(
    script="main.py",
    include_patterns=["*.yaml", "data/*.csv", "models/**/*.pt"],
    exclude_patterns=["*.pyc", "__pycache__/*", "*.log"],
)
```
All files are extracted to the same directory. Use relative imports as normal.
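The include/exclude filtering can be approximated with `fnmatch` from the standard library. This is a rough model of the behavior, not the SDK's actual matcher (`fnmatch` lets `*` cross `/` boundaries, so `**` semantics are only approximated):

```python
from fnmatch import fnmatch

def select_files(paths, include_patterns, exclude_patterns):
    """A file is packaged if it matches at least one include pattern
    and no exclude pattern."""
    return [
        p for p in paths
        if any(fnmatch(p, pat) for pat in include_patterns)
        and not any(fnmatch(p, pat) for pat in exclude_patterns)
    ]

paths = ["config.yaml", "data/train.csv", "debug.log", "models/v1/best.pt"]
print(select_files(paths, ["*.yaml", "data/*.csv", "models/**/*.pt"], ["*.log"]))
# ['config.yaml', 'data/train.csv', 'models/v1/best.pt']
```

Exclude patterns win over includes, so `*.log` files stay out even when a broad include pattern would otherwise pick them up.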
## Environment Variables

Pass configuration to your script:
```python
job = client.submit(
    script="train.py",
    env={
        "WANDB_API_KEY": "your-key",
        "HF_TOKEN": "your-huggingface-token",
        "LEARNING_RATE": "0.001",
    },
)
```
Access them in your script:
```python
import os

wandb_key = os.environ.get("WANDB_API_KEY")
lr = float(os.environ.get("LEARNING_RATE", "0.001"))
```
## Docker Images

Use a specific PyTorch or CUDA image:
```python
job = client.submit(
    script="train.py",
    worker_image="pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime",
)
```
Available images:
| Image | Description |
|---|---|
| `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime` | Default PyTorch 2.5 |
| `nvcr.io/nvidia/pytorch:24.01-py3` | NVIDIA optimized PyTorch |
| `nvcr.io/nvidia/tensorflow:24.01-tf2-py3` | TensorFlow 2 |
## Model Architecture Hints

Provide VRAM hints for better job placement:
```python
job = client.submit(
    script="train.py",
    model_architecture={
        "architecture_type": "transformer",
        "num_layers": 12,
        "hidden_dim": 768,
        "num_heads": 12,
        "total_params": 110_000_000,
    },
    training_config={
        "batch_size": 32,
        "precision": "fp16",
        "sequence_length": 512,
    },
)
```
Cumulus uses this information to place your job on a GPU with sufficient memory.
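To see why these hints matter, here is a back-of-envelope fp16 training estimate built from the same fields — weights, gradients, Adam moments, plus a crude activation term. This is purely illustrative arithmetic, not Cumulus's actual predictor:

```python
def estimate_vram_gb(total_params, batch_size, sequence_length, hidden_dim,
                     num_layers, bytes_per_param=2):
    """Rough fp16 training-memory estimate in GiB."""
    weights = total_params * bytes_per_param
    grads = total_params * bytes_per_param
    # Adam keeps two fp32 moment tensors per parameter
    optimizer_state = total_params * 4 * 2
    # Very crude: one hidden-state activation tensor per layer
    activations = (batch_size * sequence_length * hidden_dim
                   * num_layers * bytes_per_param)
    return (weights + grads + optimizer_state + activations) / 1024**3

gb = estimate_vram_gb(
    total_params=110_000_000, batch_size=32,
    sequence_length=512, hidden_dim=768, num_layers=12,
)
print(f"{gb:.1f} GiB")  # 1.5 GiB
```

Even this crude estimate shows how batch size and sequence length feed the activation term, which is why `training_config` matters for placement as much as the parameter count.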
VRAM predictions improve as more jobs run on Cumulus. If you find predictions are off for your model, you can override with explicit values:
```python
job = client.submit(
    script="train.py",
    vram_gb=24.0,   # Explicit VRAM (you know from profiling)
    sm_percent=50,  # Explicit GPU compute percentage
)
```
This is especially useful for novel architectures or when jobs are being evicted due to underestimated VRAM.