by Dimitri P. Bertsekas
Publication: August 2022, 237 pages, hardcover
EBOOK at Google Play
Contents and Preface, Overview Slides
In this research monograph, we focus on a new conceptual framework for approximate Dynamic Programming (DP) and Reinforcement Learning (RL). It revolves around two algorithms that operate in synergy through the powerful mechanism of Newton's method, applied to Bellman's equation for solving the underlying DP problem. These are the off-line training and the on-line play algorithms, and they are exemplified by the architectures of the AlphaZero program (which plays chess), the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon), and the model predictive control (MPC) architecture (which is central in control system design).
The player obtained through extensive off-line training in AlphaZero, is not used directly during on-line play (it is too inaccurate due to approximation errors that are inherent in off-line neural network training). Instead a separate on-line player is used, which is based on multistep lookahead and a terminal position evaluator that was trained using experience with the off-line player. The on-line player performs a form of policy improvement, which unlike the off-line player, is not degraded by neural network approximations. As a result, it greatly improves the performance of the off-line player.
An important lesson from AlphaZero is that the performance of an off-line trained controller can be greatly improved by on-line approximation in value space, with long lookahead (involving minimization or rollout with an off-line obtained policy, or both), and terminal cost approximation that is obtained off-line. The performance enhancement is often dramatic and is due to a simple fact, which is the focal point of this monograph.
On-line approximation in value space with one-step lookahead amounts to a single step of Newton?s method for solving Bellman's equation, while the starting point for the Newton step is based on the results of off-line training, and is enhanced by longer on-line lookahead minimization and on-line rollout.
We aim to explain the beneficial effects of this synergism, and to provide a conceptual bridge between the sequential decision making cultures of artificial intelligence/RL, control theory/MPC, and discrete optimization/integer programming. In particular, we show through the unifying principles of abstract DP and the algorithmic ideas of Newton?s method that the AlphaZero/TD-Gammon ideas apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies such as adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization and integer programming.
The book is an excellent supplement to our other Dynamic Programming, Abstract Dynamic Programming, Reinforcement Learning, and Rollout books.
Dimitri P. Bertsekas is Fulton Professor of Computational Decision Making at the Arizona State University, McAfee Professor of Engineering at the Massachusetts Institute of Technology, and a member of the prestigious United States National Academy of Engineering. He is the recipient of the 2001 A. R. Raggazini ACC education award and the 2009 INFORMS expository writing award. He has also received 2014 ACC Richard E. Bellman Control Heritage Award for "contributions to the foundations of deterministic and stochastic optimization-based methods in systems and control," the 2014 Khachiyan Prize for Life-Time Accomplishments in Optimization, the SIAM/MOS 2015 George B. Dantzig Prize, and the 2022 IEEE Control Systems Award. Together with his coauthor John Tsitsiklis, he was awarded the 2018 INFORMS John von Neumann Theory Prize, for the contributions of the research monographs "Parallel and Distributed Computation" and "Neuro-Dynamic Programming".
Course material from ASU classes, 2019-2022.
Lessons from AlphaZero Videolecture: An overview presentation at KTH, Nov. 2021
Slides from the KTH Lecture
"Newton's Method for Reinforcement Learning and Model Predictive Control," Results in Control and Optimization J., 2022.