Colloquium Talk: What Does CPS Have to do with Deep Learning?
Title: What Does CPS Have to do with Deep Learning?
Speaker: Jeffrey Mark Siskind, PhD, Associate Professor, School of Electrical and Computer Engineering, Purdue University
Location: Northeastern University, 440 Huntington Avenue, West Village H, 3rd Floor, Room #366, Boston, Massachusetts 02115
Deep learning is formulated around backpropagation, a method for computing gradients of a particular class of data flow graphs in order to train models to minimize error by gradient descent. Backpropagation is a special case of a more general method known as reverse-mode automatic differentiation (AD) for computing gradients of functions expressed as programs. Reverse-mode AD imposes only a small constant-factor overhead in operation count over the original computation, but has storage requirements that grow, in the worst case, in proportion to the time consumed by the original computation. This storage blowup can be ameliorated by checkpointing, a process that reorders application of classical reverse-mode AD over an execution interval to trade off space against time.
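To make the storage claim concrete, here is a minimal sketch of classical reverse-mode AD applied to a loop that repeats one primitive step. The step function (sin), the helper names, and the explicit tape are assumptions chosen for illustration; they are not taken from the talk. The point is only that the tape holds one entry per executed step, so storage grows with running time.

```python
import math

def step(x):
    # Hypothetical primitive step of the computation being differentiated.
    return math.sin(x)

def step_vjp(x, ybar):
    # Adjoint of step: d sin(x)/dx = cos(x), applied to the incoming adjoint.
    return math.cos(x) * ybar

def reverse_mode(x0, n):
    """Run n steps forward while recording a tape, then sweep it backward."""
    tape = []
    x = x0
    for _ in range(n):
        tape.append(x)      # one saved value per executed step
        x = step(x)
    xbar = 1.0              # seed: d(output)/d(output)
    for xi in reversed(tape):
        xbar = step_vjp(xi, xbar)
    return x, xbar, len(tape)

if __name__ == "__main__":
    y, grad, stored = reverse_mode(0.5, 1000)
    print(y, grad, stored)
```

The backward sweep costs about as much as the forward run, but the tape length equals the number of steps, which is the linear storage growth described above.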
Application of checkpointing in a divide-and-conquer fashion to strategically chosen nested execution intervals can break classical reverse-mode AD into stages, which can reduce the worst-case growth in storage from linear to logarithmic in the running time of the original computation. Doing this has been fully automated only for computations of particularly simple form, with checkpoints spanning execution intervals resulting from a limited set of program constructs.
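The divide-and-conquer idea can be sketched on the simplest possible case: a loop of n identical steps, split recursively at its midpoint. This is an illustrative bisection scheme under assumed names (step, advance, grad_bisect), not the speaker's implementation. The only state kept per recursion level is one midpoint snapshot, and the first half is recomputed when needed, so space is traded for extra forward runs.

```python
import math

def step(x):
    # Hypothetical primitive step of the computation being differentiated.
    return math.sin(x)

def step_vjp(x, ybar):
    # Adjoint of step: d sin(x)/dx = cos(x).
    return math.cos(x) * ybar

def advance(x, k):
    # Run k steps forward without saving any intermediate values.
    for _ in range(k):
        x = step(x)
    return x

def grad_bisect(x, n, ybar, live=0, stats=None):
    """Adjoint with respect to x of n steps starting at x, given output adjoint ybar.

    live counts the midpoint snapshots currently held by enclosing calls;
    stats (optional dict with key "peak") records the maximum seen.
    """
    if stats is not None:
        stats["peak"] = max(stats["peak"], live)
    if n == 0:
        return ybar
    if n == 1:
        return step_vjp(x, ybar)
    half = n // 2
    mid = advance(x, half)                        # snapshot at the midpoint
    ybar = grad_bisect(mid, n - half, ybar, live + 1, stats)  # second half first
    return grad_bisect(x, half, ybar, live, stats)            # mid no longer needed

if __name__ == "__main__":
    stats = {"peak": 0}
    print(grad_bisect(0.5, 1024, 1.0, stats=stats), stats["peak"])
```

In this sketch the number of snapshots held at once grows with the recursion depth, roughly log2(n), at the cost of recomputing forward segments on the order of n log n steps in total.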
We show how the technique can be automated for arbitrary computations. Doing so relies on implementing general-purpose mechanisms for counting the number of instructions executed by a program, interrupting the execution after a specified number of steps, and resuming the execution with a nonstandard interpretation. We implement these general-purpose mechanisms with a compiler that converts programs to continuation-passing style (CPS). The process of efficiently computing gradients with checkpointing requires running, and rerunning, little bits of the program out of order. This is made easier by applying the technique to functional programs.
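A rough sketch of why CPS helps: once every step of a program hands its result to an explicit continuation, a driver can count steps, stop after a budget, and later pick up from the saved continuation. The code below is a hand-written CPS loop with a trampoline, using assumed names (iterate_cps, Done, run); it is not the compiler described in the talk, and it does not show resumption under a nonstandard interpretation.

```python
import math

class Done:
    """Final continuation result: wraps the value the program returns."""
    def __init__(self, value):
        self.value = value

def iterate_cps(x, n, k):
    """CPS form of 'apply sin n times to x'; k receives the final value.

    Returns a thunk so the driver regains control after every step.
    """
    if n == 0:
        return lambda: k(x)
    return lambda: iterate_cps(math.sin(x), n - 1, k)

def run(thunk, budget=None):
    """Trampoline that counts steps and optionally interrupts execution.

    Returns ("done", value, steps) or ("paused", saved_thunk, steps).
    """
    steps = 0
    while True:
        if isinstance(thunk, Done):
            return "done", thunk.value, steps
        if budget is not None and steps >= budget:
            return "paused", thunk, steps   # saved_thunk is the rest of the program
        thunk = thunk()
        steps += 1

if __name__ == "__main__":
    status, capsule, used = run(iterate_cps(0.5, 5, Done), budget=2)
    print(status, used)
    print(run(capsule))   # resume from the interruption point
```

The saved thunk plays the role of a checkpoint: it captures everything needed to rerun the remainder of the computation from that step.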
There is a deeper and higher-level message in this talk: machine learning can benefit from advanced techniques from programming language theory.
About the Speaker
Jeffrey Mark Siskind received a BA degree in Computer Science from the Technion - Israel Institute of Technology, Haifa, in 1979, an S.M. degree in Computer Science from the Massachusetts Institute of Technology (MIT), Cambridge, in 1989, and a PhD degree in Computer Science from MIT in 1992.
He held a postdoctoral fellowship at the University of Pennsylvania Institute for Research in Cognitive Science from 1992 to 1993. He was an assistant professor in the University of Toronto Department of Computer Science from 1993 to 1995, a senior lecturer in the Technion Department of Electrical Engineering in 1996, a visiting assistant professor in the University of Vermont Department of Computer Science and Department of Electrical Engineering from 1996 to 1997, and a research scientist at NEC Research Institute, Inc. from 1997 to 2001.
Siskind joined the Purdue University School of Electrical and Computer Engineering in 2002, where he is currently an associate professor. His research interests include computer vision, robotics, artificial intelligence, neuroscience, cognitive science, computational linguistics, child language acquisition, automatic differentiation, and programming languages and compilers.
Monday, September 11 at 1:30pm