Checkpointing

Save your training progress automatically. If a job is interrupted, it resumes from the last checkpoint.

Quick Start

Use CumulusJob as a context manager for automatic checkpoint handling:

import torch
from cumulus import CumulusJob

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

with CumulusJob() as job:
    # Resume from checkpoint if this is a requeued job
    if job.is_resumed:
        model.load_state_dict(job.checkpoint['model'])
        optimizer.load_state_dict(job.checkpoint['optimizer'])

    for epoch in range(job.start_epoch, num_epochs):
        for batch in dataloader:
            loss = train_step(model, batch)

        # Keep state updated - saved automatically on eviction
        job.state = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch
        }

    job.complete()

What Happens on Interruption

  1. Cumulus detects when your job needs to pause
  2. CumulusJob automatically saves your job.state to cloud storage
  3. The job is automatically requeued with higher priority
  4. On restart, job.is_resumed is True and job.checkpoint contains your saved state
  5. Training continues from job.start_epoch
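The save-on-eviction / resume contract above can be sketched in plain Python. This is an illustration only: `FakeJob`, `evict()`, and the local JSON file are stand-ins for Cumulus internals (which persist `job.state` to cloud storage), not real API.

```python
import json
import os

class FakeJob:
    """Illustrative stand-in for CumulusJob's resume behavior."""

    def __init__(self, path):
        self.path = path
        self.checkpoint = None
        if os.path.exists(path):
            with open(path) as f:
                self.checkpoint = json.load(f)  # state saved by a prior run
        self.is_resumed = self.checkpoint is not None
        # start_epoch is 0 for a fresh run, checkpoint epoch + 1 on resume
        self.start_epoch = self.checkpoint['epoch'] + 1 if self.is_resumed else 0
        self.state = None

    def evict(self):
        # Stands in for Cumulus saving job.state when the job is paused
        with open(self.path, 'w') as f:
            json.dump(self.state, f)
```

A first run starts at epoch 0; after an eviction with `state = {'epoch': 3}` saved, the requeued run sees `is_resumed == True` and `start_epoch == 4`.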

CumulusJob API

Properties

with CumulusJob() as job:
    job.is_resumed    # True if resuming from checkpoint
    job.start_epoch   # 0 for fresh run, or (checkpoint epoch + 1) for resume
    job.checkpoint    # Dict with saved state (if resuming), else None
    job.state         # Current state to save on eviction (you set this)

Methods

job.state = {...}       # Set state to save on eviction
job.complete()          # Mark job as successfully completed
job.save_checkpoint()   # Manually trigger checkpoint save

Manual Checkpoints

Save checkpoints at specific intervals:

with CumulusJob() as job:
    for epoch in range(job.start_epoch, num_epochs):
        train_one_epoch()

        job.state = {'model': model.state_dict(), 'epoch': epoch}

        # Save every 10 epochs
        if epoch % 10 == 0:
            job.save_checkpoint()

    job.complete()

Low-Level Control

Use CheckpointManager for custom checkpoint logic:

from cumulus import CheckpointManager

manager = CheckpointManager()

# Check if resuming from previous run
if manager.should_resume():
    state = manager.load_checkpoint()
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    start_epoch = state['epoch'] + 1
else:
    start_epoch = 0

# Training loop
for epoch in range(start_epoch, num_epochs):
    train_one_epoch()

    # Save custom state
    manager.save_checkpoint({
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'epoch': epoch,
        'custom_data': my_data
    })

# Mark training complete
manager.mark_complete()

Environment Variables

Cumulus sets environment variables automatically to coordinate checkpointing and resumption. The main variable you may want to check yourself is RESUME_FROM_CHECKPOINT, which is set to "true" when your job is resuming from a previous checkpoint.
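If you need to branch on this flag outside of CumulusJob (for example, in a launcher script), a minimal check using only the standard library might look like the following. The variable name comes from the paragraph above; the helper function is ours, not part of the Cumulus API:

```python
import os

def is_resuming() -> bool:
    # Cumulus sets RESUME_FROM_CHECKPOINT="true" on a requeued, resumed job;
    # treat an unset or any other value as a fresh run.
    return os.environ.get("RESUME_FROM_CHECKPOINT", "").lower() == "true"
```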

Legacy API: CumulusTrainer

For backward compatibility, CumulusTrainer is still available:

from cumulus import CumulusTrainer

trainer = CumulusTrainer(model, optimizer)
start_epoch = trainer.start()

for epoch in range(start_epoch, num_epochs):
    for batch in dataloader:
        loss = train_step(model, batch)
        trainer.step()
    trainer.end_epoch(epoch)

trainer.mark_complete()

We recommend using CumulusJob for new projects - it's simpler and more explicit.