Checkpointing

Save your training progress automatically. If a job is interrupted, it resumes from the last checkpoint.

Quick Start

Use CumulusJob as a context manager for automatic checkpoint handling:

import torch
from cumulus import CumulusJob

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

with CumulusJob() as job:
    # Resume from checkpoint if this is a requeued job
    if job.is_resumed:
        model.load_state_dict(job.checkpoint['model'])
        optimizer.load_state_dict(job.checkpoint['optimizer'])

    for epoch in range(job.start_epoch, num_epochs):
        for batch in dataloader:
            loss = train_step(model, batch)

        # Keep state updated - saved automatically on eviction
        job.state = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch
        }

    job.complete()

What Happens on Interruption

  1. Cumulus detects when your job needs to pause
  2. CumulusJob automatically saves your job.state to cloud storage
  3. The job is automatically requeued with higher priority
  4. On restart, job.is_resumed is True and job.checkpoint contains your saved state
  5. Training continues from job.start_epoch
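The save-on-eviction / resume contract above can be sketched in plain Python. This is an illustration only: `FakeJob`, `evict()`, and the local JSON file are stand-ins for Cumulus internals (which persist `job.state` to cloud storage), not real API.

```python
import json
import os

class FakeJob:
    """Illustrative stand-in for CumulusJob's resume behavior."""

    def __init__(self, path):
        self.path = path
        self.checkpoint = None
        if os.path.exists(path):
            with open(path) as f:
                self.checkpoint = json.load(f)  # state saved by a prior run
        self.is_resumed = self.checkpoint is not None
        # start_epoch is 0 for a fresh run, checkpoint epoch + 1 on resume
        self.start_epoch = self.checkpoint['epoch'] + 1 if self.is_resumed else 0
        self.state = None

    def evict(self):
        # Stands in for Cumulus saving job.state when the job is paused
        with open(self.path, 'w') as f:
            json.dump(self.state, f)
```

A first run starts at epoch 0; after an eviction with `state = {'epoch': 3}` saved, the requeued run sees `is_resumed == True` and `start_epoch == 4`.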

CumulusJob API

Properties

with CumulusJob() as job:
    job.is_resumed    # True if resuming from checkpoint
    job.start_epoch   # 0 for fresh run, or (checkpoint epoch + 1) for resume
    job.checkpoint    # Dict with saved state (if resuming), else None
    job.state         # Current state to save on eviction (you set this)

Methods

job.state = {...}       # Set state to save on eviction
job.complete()          # Mark job as successfully completed
job.save_checkpoint()   # Manually trigger checkpoint save

Manual Checkpoints

Save checkpoints at specific intervals:

with CumulusJob() as job:
    for epoch in range(job.start_epoch, num_epochs):
        train_one_epoch()

        job.state = {'model': model.state_dict(), 'epoch': epoch}

        # Save every 10 epochs
        if epoch % 10 == 0:
            job.save_checkpoint()

    job.complete()

Low-Level Control

Use CheckpointManager for custom checkpoint logic:

from cumulus import CheckpointManager

manager = CheckpointManager()

# Check if resuming from previous run
if manager.should_resume():
    state = manager.load_checkpoint()
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    start_epoch = state['epoch'] + 1
else:
    start_epoch = 0

# Training loop
for epoch in range(start_epoch, num_epochs):
    train_one_epoch()

    # Save custom state
    manager.save_checkpoint({
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'epoch': epoch,
        'custom_data': my_data
    })

# Mark training complete
manager.mark_complete()

Environment Variables

Cumulus sets environment variables automatically to coordinate checkpointing and resumption. The main variable you may want to check yourself is RESUME_FROM_CHECKPOINT, which is set to "true" when your job is resuming from a previous checkpoint.
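If you need to branch on this flag outside of CumulusJob (for example, in a launcher script), a minimal check using only the standard library might look like the following. The variable name comes from the paragraph above; the helper function is ours, not part of the Cumulus API:

```python
import os

def is_resuming() -> bool:
    # Cumulus sets RESUME_FROM_CHECKPOINT="true" on a requeued, resumed job;
    # treat an unset or any other value as a fresh run.
    return os.environ.get("RESUME_FROM_CHECKPOINT", "").lower() == "true"
```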

Legacy API: CumulusTrainer

For backward compatibility, CumulusTrainer is still available:

from cumulus import CumulusTrainer

trainer = CumulusTrainer(model, optimizer)
start_epoch = trainer.start()

for epoch in range(start_epoch, num_epochs):
    for batch in dataloader:
        loss = train_step(model, batch)
        trainer.step()
    trainer.end_epoch(epoch)

trainer.mark_complete()

We recommend using CumulusJob for new projects - it's simpler and more explicit.