Checkpointing
Save your training progress automatically. If a job is interrupted, it resumes from the last checkpoint.
Quick Start
Use CumulusJob as a context manager for automatic checkpoint handling:
import torch
from cumulus import CumulusJob

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

with CumulusJob() as job:
    # Resume from checkpoint if this is a requeued job
    if job.is_resumed:
        model.load_state_dict(job.checkpoint['model'])
        optimizer.load_state_dict(job.checkpoint['optimizer'])

    for epoch in range(job.start_epoch, num_epochs):
        for batch in dataloader:
            loss = train_step(model, batch)

        # Keep state updated - saved automatically on eviction
        job.state = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch
        }

    job.complete()
What Happens on Interruption
- Cumulus detects when your job needs to pause
- CumulusJob automatically saves your job.state to cloud storage
- The job is automatically requeued with higher priority
- On restart, job.is_resumed is True and job.checkpoint contains your saved state
- Training continues from job.start_epoch
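Because the checkpoint captures whatever job.state holds at the moment of eviction, it can help to seed job.state as soon as the context opens, so a very early eviction still leaves something to resume from. A minimal sketch, reusing the names from the Quick Start example and assuming start_epoch is derived from the saved 'epoch' key as shown there:

with CumulusJob() as job:
    if job.is_resumed:
        model.load_state_dict(job.checkpoint['model'])
        optimizer.load_state_dict(job.checkpoint['optimizer'])

    # Seed the state up front; 'epoch' records the last *completed* epoch,
    # so start_epoch - 1 is -1 when no epoch has finished yet
    job.state = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'epoch': job.start_epoch - 1
    }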
CumulusJob API
Properties
with CumulusJob() as job:
    job.is_resumed   # True if resuming from checkpoint
    job.start_epoch  # 0 for fresh run, or (checkpoint epoch + 1) for resume
    job.checkpoint   # Dict with saved state (if resuming), else None
    job.state        # Current state to save on eviction (you set this)
Methods
job.state = {...} # Set state to save on eviction
job.complete() # Mark job as successfully completed
job.save_checkpoint() # Manually trigger checkpoint save
Manual Checkpoints
Save checkpoints at specific intervals:
with CumulusJob() as job:
    for epoch in range(job.start_epoch, num_epochs):
        train_one_epoch()
        job.state = {'model': model.state_dict(), 'epoch': epoch}

        # Save every 10 epochs
        if epoch % 10 == 0:
            job.save_checkpoint()

    job.complete()
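save_checkpoint() can also be driven by wall-clock time rather than epoch count. A minimal sketch of that pattern, reusing the loop above; the interval constant and timer variable are illustrative, not part of the Cumulus API:

import time

CHECKPOINT_INTERVAL = 30 * 60  # seconds; illustrative value

with CumulusJob() as job:
    last_save = time.monotonic()
    for epoch in range(job.start_epoch, num_epochs):
        train_one_epoch()
        job.state = {'model': model.state_dict(), 'epoch': epoch}

        # Persist a checkpoint roughly every 30 minutes
        if time.monotonic() - last_save > CHECKPOINT_INTERVAL:
            job.save_checkpoint()
            last_save = time.monotonic()

    job.complete()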
Low-Level Control
Use CheckpointManager for custom checkpoint logic:
from cumulus import CheckpointManager

manager = CheckpointManager()

# Check if resuming from previous run
if manager.should_resume():
    state = manager.load_checkpoint()
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    start_epoch = state['epoch'] + 1
else:
    start_epoch = 0

# Training loop
for epoch in range(start_epoch, num_epochs):
    train_one_epoch()

    # Save custom state
    manager.save_checkpoint({
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'epoch': epoch,
        'custom_data': my_data
    })

# Mark training complete
manager.mark_complete()
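The custom state you pass to save_checkpoint() is a natural place for anything else your run needs in order to resume cleanly. As one example, the sketch below carries the PyTorch CPU RNG state across a restart; the 'rng_state' key is illustrative, and whether the tensor round-trips unchanged depends on how CheckpointManager serializes state:

import torch
from cumulus import CheckpointManager

manager = CheckpointManager()

if manager.should_resume():
    state = manager.load_checkpoint()
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    torch.set_rng_state(state['rng_state'])  # restore the CPU RNG stream
    start_epoch = state['epoch'] + 1
else:
    start_epoch = 0

for epoch in range(start_epoch, num_epochs):
    train_one_epoch()
    manager.save_checkpoint({
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'epoch': epoch,
        'rng_state': torch.get_rng_state(),  # capture the CPU RNG stream
    })

manager.mark_complete()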
Environment Variables
Cumulus sets environment variables automatically to handle checkpointing and resumption. The main one you may want to check yourself is RESUME_FROM_CHECKPOINT, which is set to "true" when your job is resuming from a previous checkpoint.
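If you need to branch on this outside of CumulusJob (for example in a launch script), you can read the variable directly. A minimal sketch using only the documented variable name:

import os

# Cumulus sets this to "true" when the job is resuming from a checkpoint
if os.environ.get('RESUME_FROM_CHECKPOINT', 'false') == 'true':
    print('Resuming from a previous checkpoint')
else:
    print('Starting a fresh run')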
Legacy API: CumulusTrainer
For backward compatibility, CumulusTrainer is still available:
from cumulus import CumulusTrainer

trainer = CumulusTrainer(model, optimizer)
start_epoch = trainer.start()

for epoch in range(start_epoch, num_epochs):
    for batch in dataloader:
        loss = train_step(model, batch)
        trainer.step()
    trainer.end_epoch(epoch)

trainer.mark_complete()
We recommend using CumulusJob for new projects - it's simpler and more explicit.
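For reference, the legacy loop above maps roughly onto the CumulusJob API as sketched below, reusing the same model, optimizer, and dataloader names. There is no documented per-step call in CumulusJob, so trainer.step() simply drops out in this sketch:

from cumulus import CumulusJob

with CumulusJob() as job:  # replaces CumulusTrainer(...) and trainer.start()
    if job.is_resumed:
        model.load_state_dict(job.checkpoint['model'])
        optimizer.load_state_dict(job.checkpoint['optimizer'])

    for epoch in range(job.start_epoch, num_epochs):
        for batch in dataloader:
            loss = train_step(model, batch)

        # replaces trainer.end_epoch(epoch)
        job.state = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch
        }

    job.complete()  # replaces trainer.mark_complete()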