Dynamic Programming and Optimal Control
Preface:
This two-volume book is based on a first-year graduate course on dynamic programming and optimal control that I have taught for over twenty years at Stanford University, the University of Illinois, and the Massachusetts Institute of Technology. The course has typically been attended by students from engineering, operations research, economics, and applied mathematics. Accordingly, a principal objective of the book has been to provide a unified treatment of the subject, suitable for a broad audience. In particular, problems with a continuous character, such as the stochastic control problems popular in modern control theory, are treated alongside problems with a discrete character, such as the Markovian decision problems popular in operations research. Furthermore, many applications and examples, drawn from a broad variety of fields, are discussed.
The book may be viewed as a greatly expanded and pedagogically improved version of my 1987 book "Dynamic Programming: Deterministic and Stochastic Models," published by Prentice-Hall. I have included much new material on deterministic and stochastic shortest path problems, as well as a new chapter on continuous-time optimal control problems and the Pontryagin Maximum Principle, developed from a dynamic programming viewpoint. I have also added a fairly extensive exposition of simulation-based approximation techniques for dynamic programming. These techniques, which are often referred to as "neuro-dynamic programming" or "reinforcement learning," represent a breakthrough in the practical application of dynamic programming to complex problems that involve the dual curse of large dimension and lack of an accurate mathematical model. Other material was also augmented, substantially modified, and updated.
With the new material, however, the book grew so much in size that it became necessary to divide it into two volumes: one on finite horizon problems, and the other on infinite horizon problems. This division was natural not only in terms of size, but also in terms of style and orientation. The first volume is more oriented towards modeling, and the second is more oriented towards mathematical analysis and computation. To make the first volume self-contained for instructors who wish to cover a modest amount of infinite horizon material in a course that is primarily oriented towards modeling, conceptualization, and finite horizon problems, I have added a final chapter that provides an introductory treatment of infinite horizon problems.
Many of the topics in the book are relatively independent of one another. For example, Chapter 2 of Vol. I on shortest path problems can be skipped without loss of continuity, and the same is true for Chapter 3 of Vol. I, which deals with continuous-time optimal control. As a result, the book can be used to teach several different types of courses.
The mathematical prerequisite for the text is knowledge of advanced calculus, introductory probability theory, and matrix-vector algebra. A summary of this material is provided in the appendixes. Naturally, prior exposure to dynamic system theory, control, optimization, or operations research will be helpful to the reader, but based on my experience, the material given here is reasonably self-contained.
The book contains a large number of exercises, and the serious reader will benefit greatly by going through them. Solutions to all exercises are compiled in a manual that is available to instructors from the author. Many thanks are due to the several people who spent long hours contributing to this manual, particularly Steven Shreve, Eric Loiederman, Lakis Polymenakos, and Cynara Wu.
Dynamic programming is a conceptually simple technique that can be adequately explained using elementary analysis. Yet a mathematically rigorous treatment of general dynamic programming requires the complicated machinery of measure-theoretic probability. My choice has been to bypass the complicated mathematics by developing the subject in generality, while claiming rigor only when the underlying probability spaces are countable. A mathematically rigorous treatment of the subject is carried out in my monograph "Stochastic Optimal Control: The Discrete Time Case", Academic Press, 1978, coauthored by Steven Shreve. This monograph complements the present text and provides a solid foundation for the subjects developed somewhat informally here.
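To give the flavor of this conceptual simplicity, here is a generic statement of the basic finite horizon DP algorithm, in notation close to (but not necessarily identical with) the one developed in Chapter 1. For a system $x_{k+1} = f_k(x_k,u_k,w_k)$, with state $x_k$, control $u_k \in U_k(x_k)$, random disturbance $w_k$, and cost per stage $g_k(x_k,u_k,w_k)$, the optimal cost-to-go functions are obtained by the backward recursion
\[
J_N(x_N) = g_N(x_N),
\]
\[
J_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k}\Big\{ g_k(x_k,u_k,w_k) + J_{k+1}\big(f_k(x_k,u_k,w_k)\big) \Big\}, \qquad k = N-1,\ldots,0,
\]
and an optimal policy is obtained by selecting, for each $x_k$ and $k$, a control attaining the minimum above. When the probability spaces involved are countable, this recursion can be justified with entirely elementary arguments.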
Finally, I am thankful to a number of individuals and institutions for their contributions to the book. My understanding of the subject was sharpened while I worked with Steven Shreve on our 1978 monograph. My interaction and collaboration with John Tsitsiklis on stochastic shortest paths and approximate dynamic programming have been most valuable. Michael Caramanis, Emmanuel Fernandez-Gaucherand, Pierre Humblet, Lennart Ljung, and John Tsitsiklis taught from versions of the book, and contributed several substantive comments and homework problems. A number of colleagues offered valuable insights and information, particularly David Castanon, Eugene Feinberg, and Krishna Pattipati. NSF provided research support. Prentice-Hall graciously allowed the use of material from my 1987 book. Teaching and interacting with the students at MIT have kept up my interest and excitement for the subject.
Preface to the Second Edition:

This second edition has expanded the coverage of the original by nearly 30%. Most of the new material is concentrated in four areas:
(a) In Chapter 4, a section was added on estimation and control of systems with a non-probabilistic (set membership) description of uncertainty. This subject, a personal favorite of the author since it was the subject of his 1971 Ph.D. thesis at MIT, has become popular, as minimax and H-Infinity control methods have gained increased prominence.
(b) Chapter 6 was doubled in size, to reflect the popularity of suboptimal control and neuro-dynamic programming methods. In particular, the coverage of certainty equivalent and limited lookahead methods has been substantially expanded. Furthermore, a new section was added on neuro-dynamic programming and rollout algorithms, and their applications in combinatorial optimization and stochastic optimal control.
(c) In Chapter 7, an introduction to continuous-time semi-Markov decision problems was added in a separate last section.
(d) A new appendix was included, which deals with various formulations of problems of decision under uncertainty. The foundations of the minimax and expected utility approaches are framed within a broader context, and some of the aspects of utility theory are discussed.
Preface to the Second Edition of Vol. II:

This second edition of Vol. II should be viewed as a relatively minor revision of the original. The coverage was expanded in a few areas, as follows:
(a) In Chapter 1, material was added on variants of the policy iteration method.
(b) In Chapter 2, the material on neuro-dynamic programming methods was updated and expanded to reflect some recent developments.
(c) In Chapter 4, material was added on some new value iteration methods.
(d) In Chapter 5, the material on semi-Markov problems was revised, with a significant portion simplified and shifted to Volume I.
There are also miscellaneous additions and improvements scattered throughout the text, and a more detailed coverage of deterministic problems is given in Chapter 1. Finally, a new internet-based feature was added to the book, which extends its scope and coverage. Many of the theoretical exercises have been solved in detail, and their solutions have been posted on the book's www page. These exercises have been marked with the symbol www.
I would like to express my thanks to the many colleagues who contributed suggestions for improvement of the second edition.
Preface to the Third Edition:

The third edition contains a substantial amount of new material, particularly on approximate dynamic programming, which has now become one of the principal focal points of the book. In particular:
(a) The subject of minimax control was developed in greater detail, including a new section in Chapter 1, which connects with new material in Chapter 6.
(b) The section on auction algorithms for shortest paths in Chapter 2 was eliminated. These methods are not currently used in dynamic programming, and a detailed discussion is provided in a chapter of the author's Network Optimization book, which can be freely downloaded from http://web.mit.edu/dimitrib/www/net.html
(c) A section was added in Chapter 2 on dynamic programming and shortest path algorithms for constrained and multiobjective problems.
(d) The material on sufficient statistics and partially observable Markov decision problems in Section 5.4 was restructured and expanded.
(e) Considerable new material was added in Chapter 6:
(1) An expanded discussion of one-step lookahead policies and associated performance bounds in Section 6.3.1.
(2) A discussion of aggregation methods and discretization of continuous-state problems (see Subsection 6.3.4).
(3) A discussion of model predictive control and its relation to other suboptimal control methods (see Subsection 6.5.2).
(4) An expanded treatment of open-loop feedback control and related methods based on a restricted structure (see Subsection 6.5.3).
I have also added a few exercises, and revised a few sections while preserving their essential content. Thanks are due to Haixia Lin, who worked out several exercises, and to Janey Yu, who reviewed some of the new sections and gave me valuable feedback.
Dimitri P. Bertsekas
Summer 2005
Preface to the Third Edition of Vol. II:

This is a major revision of the 2nd edition, and contains a substantial amount of new material, as well as a reorganization of old material. The length of the text has increased by more than 50%, and more than half of the old material has been restructured and/or revised. Most of the added material is in four areas.
(a) The coverage of the average cost problem of Chapter 4 has greatly increased in scope and depth. In particular, there is now a full analysis of multi-chain problems, as well as a more extensive analysis of infinite state space problems (Section 4.6).
(b) The material on approximate dynamic programming has been collected in Chapter 6. It has been greatly expanded to include new research, thereby supplementing the 1996 book "Neuro-Dynamic Programming."
(c) Contraction mappings and their role in various analyses have been highlighted in new material on infinite state space problems (Sections 1.4, 2.5, and 4.6), and in their use in the approximate dynamic programming material of Chapter 6.
(d) The mathematical measure-theoretic issues that must be addressed for a rigorous theory of stochastic dynamic programming have been illustrated and summarized in an appendix for the benefit of the mathematically oriented reader.
Also, some exercises were added, and a few sections were revised while preserving their essential content.
I would like to express my thanks to many colleagues who contributed valuable comments. I am particularly thankful to Ciamac Moallemi, Steven Shreve, John Tsitsiklis, and Ben Van Roy, who reviewed some of the new material and each contributed several substantial suggestions. I wish to thank especially Janey Yu for her extraordinary help. Janey read with great care and keen eye large parts of the book, contributed important analysis and many incisive, substantive comments, and also collaborated with me in research that was included in Chapter 6.
Dimitri P. Bertsekas
Fall 2006
Preface to the Fourth Edition:

The fourth edition contains a substantial amount of new material, particularly on approximate DP in Chapter 6. This chapter was thoroughly reorganized and rewritten, to bring it in line both with the contents of Vol. II, whose latest edition appeared in 2012, and with recent developments, which have propelled approximate DP to the forefront of attention.
Some of the highlights of the revision of Chapter 6 are an increased emphasis on one-step and multistep lookahead methods, parametric approximation architectures, neural networks, rollout, and Monte Carlo tree search. Among other applications, these methods have been instrumental in the recent spectacular success of computer Go programs. The material on approximate DP also provides an introduction and some perspective for the more analytically oriented treatment of Vol. II.
The material of the chapters other than Chapter 6 has been reorganized and somewhat enriched, but not nearly as much. I have just added a few exercises and illustrative examples, and revised a few sections while preserving their essential content. The material on minimum variance control of autoregressive and moving average linear models (Sections 5.3, 6.1.4, and Appendix F in the 3rd edition) was eliminated in this edition, as over time it became disconnected from the remainder of the book. This material is now covered far more comprehensively in specialized textbooks. The material on adaptive control (Section 6.1 in the 3rd edition) has also been substantially reduced. A copy of the 3rd edition material that has been omitted from the 4th edition is posted at the book's web site, together with other instructional resources, such as my classroom slides and solutions to the exercises marked with the symbol www. The course material from several offerings of my class can be found at the MIT OpenCourseWare (OCW) site: https://ocw.mit.edu/index.htm
Links to a series of video lectures on approximate DP and related topics may be found at my website, which also contains my research papers on the subject.
Another change in this edition is that the chapter sequence has been reordered, so that the book is now naturally divided into two parts. Part I consists of Chapters 1-4, which are fundamental and ideally should be read as a group. Part II consists of Chapters 5, 6, and 7, each of which is terminal. These chapters can be read independently of each other, and in fact they may be attempted by some readers immediately after Chapter 1, with relatively little loss of continuity. The introductory Section 1.1 explains the new structure of the book in more detail.
Together with Vol. II, this volume provides an instructor with flexibility to follow several different pathways through the material. In my recent graduate offerings of the subject at MIT, I have covered most of Chapters 1 and 2, parts of Chapters 3 and 4, and most of Chapter 5 in the first half of a semester, and then spent the second half of the semester on approximate DP, using Chapter 6 and Vol. II.
As always, many thanks are due to the students in my classes at MIT and elsewhere for stimulating interactions. I would like to single out Thomas Stahlbuhk, who has been very helpful with his comments. Thanks are also due to colleagues and collaborators, particularly John Tsitsiklis, Ben Van Roy, Mengdi Wang, and Huizhen Yu, for valuable interactions and suggestions for revision.
Dimitri P. Bertsekas
January 2017
Preface to the Fourth Edition of Vol. II:

This is a major revision of the 3rd edition, and contains a substantial amount of new material, as well as a reorganization of old material. The length has increased by more than 60% from the 3rd edition, and most of the old material has been restructured and/or revised. Vol. II now numbers more than 700 pages and is larger in size than Vol. I. It can arguably be viewed as a new book!
Approximate DP has become the central focal point of Vol. II, and occupies more than half of the book (the last two chapters, and large parts of Chapters 1-3). Thus one may also view Vol. II as a followup to my 1996 book "Neuro-Dynamic Programming" (coauthored with John Tsitsiklis). The present book focuses to a great extent on new research that became available after 1996. On the other hand, the textbook style of the book has been preserved, and some material has been explained at an intuitive or informal level, while referring to the journal literature or the Neuro-Dynamic Programming book for a more mathematical treatment.
In the process of expansion and reorganization, the design of the book became more modular and suitable for classroom use. The core material, which can be covered in about a third to a half of one semester, consists of Chapter 1 (except for the application-specific Sections 1.3 and 1.4), Chapter 2, and Chapter 6; these chapters are self-contained when taken together. This material focuses on discounted problems, and may be supplemented by parts of Chapter 3 and Section 7.1 on stochastic shortest path problems. Indeed, this comprises half of what I cover in my MIT class (the remaining half comes from Volume I, including Chapter 6 of that volume, which deals with finite horizon approximate DP). The material on average cost problems (Chapter 5 and Sections 7.2 and 7.4), the advanced material on positive and negative DP models (Chapter 4), and Monte Carlo linear algebra (Section 7.3) are terminal subjects that may be covered at the instructor's discretion.
As the book's focus shifted, I placed increased emphasis on new or recent research in approximate DP and simulation-based methods, as well as on asynchronous iterative methods, in view of the central role of simulation, which is by nature asynchronous. A lot of this material is an outgrowth of my research and the research of my collaborators, conducted in the six years since the previous edition. Some of the highlights, in the order appearing in the book, are:
(a) Computational methods for generalized discounted DP (Sections 2.5 and 2.6), including error bounds for approximations (Section 2.5), the asynchronous optimistic policy iteration methods of Sections 2.6.2 and 2.6.3, and their application to game and minimax problems, constrained policy iteration, and Q-learning.
(b) Policy iteration methods (including asynchronous optimistic versions) for stochastic shortest path problems that involve improper policies (Section 3.4).
(c) Extensive new material on various simulation-based, approximate value and policy iteration methods in Sections 6.3-6.6 (projected equation and aggregation methods).
(d) New reliable Q-learning algorithms for optimistic policy iteration (Sections 2.6.3 and 6.6.2).
(e) New simulation techniques for multistep methods, such as geometric and free-form sampling (Sections 6.4.1 and 7.3.3).
(f) Extensive new material on Monte Carlo linear algebra in Section 7.3 (primarily the simulation-based and approximate solution of large systems of linear equations), which extends the DP methodology of approximate policy evaluation.
Much of the research in (a)-(e) is based on my work with Janey (Huizhen) Yu, while most of the research in (f) is based on my work with Janey Yu and Mengdi Wang. My collaboration with Janey and Mengdi has had a strong impact on the book, and is greatly appreciated. Some of our work was presented in summary only, and was adapted to fit the style and purposes of this book; naturally, any shortcomings in its presentation are entirely my responsibility. The reader is referred to our joint and individual papers, which describe more fully our research, including material that could not be covered in this book.
I want to express my appreciation to colleagues and collaborators in approximate DP research, who contributed to the book in various ways, particularly Vivek Borkar, Angelia Nedic, and Ben Van Roy. A special thanks goes to John Tsitsiklis, with whom I have interacted extensively through collaboration and sharing of ideas on DP and asynchronous algorithms for more than 30 years. I also wish to acknowledge helpful interactions with many colleagues, including Vivek Farias, Eugene Feinberg, Warren Powell, Martin Puterman, Uriel Rothblum, and Bruno Scherrer. Finally, I want to thank the many students in my DP classes of the last decade, who patiently labored with a textbook under development, and contributed their ideas and experiences through their research projects from a broad variety of application fields.
Dimitri P. Bertsekas
June 2012