Training Jobs

Run GPU training jobs with automatic checkpointing and resume capability.

Basic Training Job

Submit any PyTorch training script:

from cumulus import CumulusClient

client = CumulusClient()

job = client.submit(
    script="train.py",
    requirements=["torch", "transformers"],
    workload_type="training"
)

print(f"Job ID: {job.job_id}")

Your script runs on an NVIDIA GPU with the specified dependencies installed.
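
A quick way to confirm that from inside the script itself is to check the device and dependency versions at startup (standard PyTorch/Transformers calls):

# train.py — sanity-check the environment before training starts
import torch
import transformers

assert torch.cuda.is_available(), "expected an NVIDIA GPU"
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"torch {torch.__version__}, transformers {transformers.__version__}")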

With Checkpointing

Training jobs can be interrupted (GPU preemption, node failures). Use CumulusJob to automatically checkpoint and resume:

# train.py
import torch
import torch.nn as nn
from cumulus import CumulusJob

model = MyModel().cuda()  # MyModel, dataloader, train_step are your own code
optimizer = torch.optim.Adam(model.parameters())

with CumulusJob() as job:
    # Resume from checkpoint if this is a requeued job
    if job.is_resumed:
        model.load_state_dict(job.checkpoint['model'])
        optimizer.load_state_dict(job.checkpoint['optimizer'])

    for epoch in range(job.start_epoch, 100):
        for batch in dataloader:
            loss = train_step(model, batch)

        # Keep state updated for automatic checkpoint on eviction
        job.state = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch
        }
        print(f"Epoch {epoch} complete")

    job.complete()

What happens on interruption:

  1. Cumulus detects the interruption
  2. CumulusJob saves your job.state to cloud storage
  3. The job is automatically requeued
  4. On restart, job.is_resumed is True and job.checkpoint contains your saved state
  5. Training continues where it left off
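
If single epochs are long, the same mechanism supports finer-grained resume: keep a step counter in job.state and skip already-processed batches on restart. A sketch, using only is_resumed, checkpoint, state, and complete() from this page; the 'step' key is our own convention, and skipping batches assumes a deterministic dataloader order:

with CumulusJob() as job:
    start_step = 0
    if job.is_resumed:
        model.load_state_dict(job.checkpoint['model'])
        optimizer.load_state_dict(job.checkpoint['optimizer'])
        start_step = job.checkpoint['step']

    for step, batch in enumerate(dataloader):
        if step < start_step:
            continue  # skip batches completed before the eviction
        loss = train_step(model, batch)
        job.state = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'step': step + 1  # resume after the last finished batch
        }

    job.complete()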

Priority

Jobs with higher priority are scheduled first and less likely to be interrupted:

job = client.submit(
    script="train.py",
    priority=8  # 1-10, higher = more important
)

Priority   Behavior
1-3        Low priority, may be evicted for higher-priority jobs
4-6        Normal priority (default: 5)
7-10       High priority, rarely evicted
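
Low-priority jobs are evicted most often, so pair them with the CumulusJob checkpointing pattern above; they then make cheap, restartable background runs:

# A preemptible sweep job: safe to evict because train.py
# uses CumulusJob checkpoint/resume
job = client.submit(
    script="train.py",
    priority=2,
    workload_type="training"
)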

GPU and Memory

Request specific resources:

job = client.submit(
    script="train.py",
    gpu_count=1,            # Number of GPUs
    memory_request="16Gi",  # Minimum memory guaranteed
    memory_limit="32Gi"     # Maximum memory allowed
)

Multi-File Projects

The SDK automatically detects your Python imports and data files. Just submit your main script:

job = client.submit(
    script="main.py",
    requirements_file="requirements.txt"
)
# model.py, data.py, utils/helpers.py, config.yaml detected automatically!

For explicit control, use glob patterns:

job = client.submit(
    script="main.py",
    include_patterns=["*.yaml", "data/*.csv", "models/**/*.pt"],
    exclude_patterns=["*.pyc", "__pycache__/*", "*.log"]
)

All project files are extracted on the worker with the same directory layout, so imports and relative file paths work exactly as they do locally.
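
For example, given the files named above, main.py can import them directly (MyModel and set_seed are hypothetical names for illustration; reading config.yaml requires PyYAML in your requirements):

# main.py — these modules travel with the job automatically
from model import MyModel            # model.py (hypothetical class name)
from utils.helpers import set_seed   # utils/helpers.py (hypothetical function)
import yaml

with open("config.yaml") as f:       # data files keep their relative paths
    config = yaml.safe_load(f)

set_seed(config.get("seed", 42))
model = MyModel(**config.get("model", {}))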

Environment Variables

Pass configuration to your script:

job = client.submit(
    script="train.py",
    env={
        "WANDB_API_KEY": "your-key",
        "HF_TOKEN": "your-huggingface-token",
        "LEARNING_RATE": "0.001"
    }
)

Access them in your script:

import os

wandb_key = os.environ.get("WANDB_API_KEY")
lr = float(os.environ.get("LEARNING_RATE", "0.001"))
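
For secrets your script cannot run without, it is worth failing fast at startup with a clear message rather than crashing mid-run. A small generic Python pattern, not SDK functionality:

import os

def require_env(name: str) -> str:
    # Raise a clear error at startup instead of a confusing one later
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

hf_token = require_env("HF_TOKEN")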

Docker Images

Use a specific PyTorch or CUDA image:

job = client.submit(
    script="train.py",
    worker_image="pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime"
)

Available images:

Image                                           Description
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime   Default PyTorch 2.5
nvcr.io/nvidia/pytorch:24.01-py3                NVIDIA-optimized PyTorch
nvcr.io/nvidia/tensorflow:24.01-tf2-py3         TensorFlow 2
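
To confirm which build your job actually runs on, print the framework and CUDA versions at startup (standard PyTorch introspection):

import torch

# Runs inside the worker container
print(f"PyTorch {torch.__version__}")
print(f"CUDA {torch.version.cuda}, cuDNN {torch.backends.cudnn.version()}")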

Model Architecture Hints

Provide VRAM hints for better job placement:

job = client.submit(
    script="train.py",
    model_architecture={
        "architecture_type": "transformer",
        "num_layers": 12,
        "hidden_dim": 768,
        "num_heads": 12,
        "total_params": 110_000_000
    },
    training_config={
        "batch_size": 32,
        "precision": "fp16",
        "sequence_length": 512
    }
)

Cumulus uses this information to place your job on a GPU with sufficient memory.
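
As a rough sanity check on those hints, a common rule of thumb for mixed-precision Adam training is about 16 bytes of persistent state per parameter (fp16 weights and gradients, fp32 master weights, two fp32 Adam moments). This is a back-of-the-envelope figure, not Cumulus's internal estimator, and activation memory comes on top:

# Rule-of-thumb state size for the 110M-parameter fp16 example above
params = 110_000_000
state_gb = params * 16 / 1024**3  # ~16 bytes/param for mixed-precision Adam
print(f"~{state_gb:.1f} GiB for weights/grads/optimizer state")  # ~1.6 GiB
# Activations scale with batch size and sequence length and often dominate.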

Prediction Gets Better Over Time

VRAM predictions improve as more jobs run on Cumulus. If predictions are off for your model, override them with explicit values:

job = client.submit(
    script="train.py",
    vram_gb=24.0,   # Explicit VRAM (you know from profiling)
    sm_percent=50   # Explicit GPU compute percentage
)

This is especially useful for novel architectures or when jobs are being evicted due to underestimated VRAM.
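
One way to get that profiled number is PyTorch's built-in peak-memory counters; the tiny model below is a stand-in for your own:

import torch
import torch.nn as nn

# Measure peak VRAM over a few representative steps, then set vram_gb
# from the result (plus headroom). Standard PyTorch profiling calls.
model = nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.Adam(model.parameters())

torch.cuda.reset_peak_memory_stats()
for _ in range(3):
    x = torch.randn(32, 4096, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gb:.2f} GB")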

Next Steps