Check the preview of 2nd version of this platform being developed by the open MLCommons taskforce on automation and reproducibility as a free, open-source and technology-agnostic on-prem platform.

Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization

lib:f6c909338968ac2d (v1.0.0)

Vote to reproduce this paper and share portable workflows   1 
Authors: Paras Jain,Ajay Jain,Aniruddha Nrusimha,Amir Gholami,Pieter Abbeel,Kurt Keutzer,Ion Stoica,Joseph E. Gonzalez
ArXiv: 1910.02653
Document:  PDF  DOI 
Artifact development version: GitHub
Abstract URL: https://arxiv.org/abs/1910.02653v3


We formalize the problem of trading-off DNN training time and memory requirements as the tensor rematerialization optimization problem, a generalization of prior checkpointing strategies. We introduce Checkmate, a system that solves for optimal rematerialization schedules in reasonable times (under an hour) using off-the-shelf MILP solvers or near-optimal schedules with an approximation algorithm, then uses these schedules to accelerate millions of training iterations. Our method scales to complex, realistic architectures and is hardware-aware through the use of accelerator-specific, profile-based cost models. In addition to reducing training cost, Checkmate enables real-world networks to be trained with up to 5.1x larger input sizes. Checkmate is an open-source project, available at https://github.com/parasj/checkmate.

Relevant initiatives  

Related knowledge about this paper Reproduced results (crowd-benchmarking and competitions) Artifact and reproducibility checklists Common formats for research projects and shared artifacts Reproducibility initiatives

Comments  

Please log in to add your comments!
If you notice any inapropriate content that should not be here, please report us as soon as possible and we will try to remove it within 48 hours!