Reinforcement Learning and Dynamic Programming Using Function Approximators

Publisher: CRC Press

Author: Busoniu, Lucian

Pages: 286

Publication date: 2017-07-28

Price: 695.00 yuan

ISBN: 9781439821084

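Book tags:

Reinforcement learning
Operations research
Mathematics
Textbook
Dynamic programming
Optimization
Function approximation
Machine learning
Artificial intelligence
Control theory
Algorithms
Decision processes
Numerical methods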

Description

From household appliances to applications in robotics, engineered systems involving complex dynamics can only be as effective as the algorithms that control them. While Dynamic Programming (DP) has provided researchers with a way to optimally solve decision and control problems involving complex dynamic systems, its practical value was limited by algorithms that lacked the capacity to scale up to realistic problems.

However, in recent years, dramatic developments in Reinforcement Learning (RL), the model-free counterpart of DP, changed our understanding of what is possible. Those developments led to the creation of reliable methods that can be applied even when a mathematical model of the system is unavailable, allowing researchers to solve challenging control problems in engineering, as well as in a variety of other disciplines, including economics, medicine, and artificial intelligence.

Reinforcement Learning and Dynamic Programming Using Function Approximators provides a comprehensive and unparalleled exploration of the field of RL and DP. With a focus on continuous-variable problems, this seminal text details essential developments that have substantially altered the field over the past decade. In its pages, pioneering experts provide a concise introduction to classical RL and DP, followed by an extensive presentation of the state-of-the-art and novel methods in RL and DP with approximation. Combining algorithm development with theoretical guarantees, they elaborate on their work with illustrative examples and insightful comparisons. Three individual chapters are dedicated to representative algorithms from each of the major classes of techniques: value iteration, policy iteration, and policy search. The features and performance of these algorithms are highlighted in extensive experimental studies on a range of control applications.
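To make the value-iteration class concrete, here is a minimal sketch of model-based Q-iteration for a deterministic problem. It uses x for states, u for actions, f for the transition function and rho for the reward function; these names were chosen for this sketch. It is not code from the book or its companion website. The five-state chain, its reward and the discount factor 0.9 are illustrative assumptions.

```python
import numpy as np

# Hypothetical 5-state deterministic chain (not from the book): states 0..4,
# action 0 moves left, action 1 moves right, state 4 is an absorbing goal.
N_STATES, N_ACTIONS, GAMMA, GOAL = 5, 2, 0.9, 4


def f(x, u):
    """Deterministic transition function x' = f(x, u)."""
    if x == GOAL:
        return x  # the goal is absorbing
    return max(0, x - 1) if u == 0 else min(N_STATES - 1, x + 1)


def rho(x, u):
    """Reward function: 1 for the transition that enters the goal, 0 otherwise."""
    return 1.0 if x != GOAL and f(x, u) == GOAL else 0.0


def q_iteration(tol=1e-10, max_iter=1000):
    """Model-based Q-iteration: Q_{l+1}(x, u) = rho(x, u) + gamma * max_u' Q_l(f(x, u), u')."""
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(max_iter):
        q_new = np.array([[rho(x, u) + GAMMA * q[f(x, u)].max()
                           for u in range(N_ACTIONS)]
                          for x in range(N_STATES)])
        converged = np.max(np.abs(q_new - q)) < tol
        q = q_new
        if converged:
            break
    return q


q_star = q_iteration()
greedy_policy = q_star.argmax(axis=1)  # greedy policy with respect to the converged Q-function
print(q_star.round(4))
print(greedy_policy)
```

Because gamma < 1, each backup is a contraction and the loop settles after a few sweeps. The greedy policy moves right in every non-goal state. Q*(0, right) equals 0.9^3 = 0.729 because the reward arrives on the fourth step.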

The recent development of applications involving complex systems has led to a surge of interest in RL and DP methods and the subsequent need for a quality resource on the subject. For graduate students and others new to the field, this book offers a thorough introduction to both the basics and emerging methods. And for those researchers and practitioners working in the fields of optimal and adaptive control, machine learning, artificial intelligence, and operations research, this resource offers a combination of practical algorithms, theoretical analysis, and comprehensive examples that they will be able to adapt and apply to their own work.

Access the authors' website at www.dcsc.tudelft.nl/rlbook/ for additional material, including computer code used in the studies and information concerning new developments.

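Exploring the deep logic of intelligent decision-making: a book on adaptive control and learning methods

This book examines in depth how an intelligent agent can make optimal decisions in complex, dynamic environments. It focuses on settings where the agent must learn by trial and error, adjusting its behavior from experience to maximize long-term return. Rather than static, predefined solutions, it looks at how an agent in an uncertain, changing world gradually acquires an optimal policy through its own exploration and interaction.

The central idea is that the agent needs to build an internal model that predicts the consequences of its actions, and to use those predictions to guide future actions. This model is not fixed: it is refined and updated continually as the agent interacts with the environment. Several key learning paradigms are presented in detail. They let the agent start from inefficient initial attempts and gradually converge to near-optimal, or even optimal, sequences of decisions.

One important branch is dynamic programming. Rather than merely mentioning the theory, the book explores how careful algorithm design turns dynamic programming ideas into workable solutions in practice. This involves representing the state space efficiently, iteratively updating value functions, and repeatedly improving policies. The book shows how to overcome the bottleneck that classical dynamic programming hits when the state space is high-dimensional. It does so by introducing more advanced approximation methods that can handle the much larger state spaces of real-world problems.

Going further, the book emphasizes the role of function approximation in reinforcement learning. When the state or action space is so large that a separate value cannot be stored for every discrete element, function approximation becomes essential. Instead of simply listing a few approximators, the book analyzes how they work, their strengths and weaknesses, and where each applies. From classical linear approximation to powerful neural networks, it shows how these tools can represent and learn complex policies and value functions. You will learn how to choose a suitable approximator, how to train it, and how to avoid training pitfalls such as convergence problems and overfitting.

Another highlight is the integration of dynamic programming with function approximation. This is not a mere stacking of techniques. It is an exploration of how function approximation can overcome the limits of classical dynamic programming on large-scale problems. The book explains in detail how to represent a value function or policy as a parameterized function, and how to adjust those parameters iteratively using the update signals of reinforcement learning algorithms. The process resembles a student who, without an answer key, improves through repeated practice and feedback until mastering the best approach.

The book also discusses the exploration-exploitation trade-off, a central difficulty in reinforcement learning. How can an agent try new actions to discover potentially higher returns without giving up the return it already obtains from the best policy known so far? Various exploration strategies are analyzed, from simple ε-greedy to more sophisticated uncertainty-based methods, together with their performance in different environments. Understanding this trade-off is essential for building agents that can truly adapt to unknown environments.

The content combines theoretical depth with practical guidance. You will see how abstract algorithmic concepts become concrete code, and detailed case studies show how these techniques work on real problems. The frameworks and methods in the book provide a solid foundation for complex decision problems in robot control, game AI, resource scheduling and personalized recommendation systems.

The book aims to be a comprehensive guide for researchers, engineers and students who want to understand and apply intelligent decision-making techniques. By mastering its core concepts and techniques, you will be able to build smarter, more adaptive systems that make wiser, more forward-looking choices in a changing world. This is not a book about "what" but about "how". It leads you deep into the workings of intelligent learning and optimal control.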
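The combination described above can be sketched as follows: a parameterized value function updated with Bellman backups, plus ε-greedy action selection. This is an illustration under stated assumptions, not the book's code. The one-dimensional task, the 11 triangular basis functions, the move size 0.15 and gamma = 0.9 are all hypothetical choices. Q(x, a) is approximated as a weighted sum of triangular basis functions centered on a uniform grid, which is exactly piecewise-linear interpolation, so `np.interp` can evaluate it. Because these basis weights are nonnegative and sum to one, the approximate backup stays a gamma-contraction, so the iteration converges.

```python
import numpy as np

# Hypothetical 1-D task (illustrative assumption, not from the book): state x in [0, 1];
# action 0 moves left by STEP, action 1 moves right by STEP. A move that would reach
# x >= 1 ends the episode with reward 1; every other move pays 0.
GAMMA, STEP = 0.9, 0.15
ACTIONS = (-STEP, +STEP)
CENTERS = np.linspace(0.0, 1.0, 11)  # centers of the triangular basis functions


def step(x, a):
    """Hypothetical dynamics: return (x_next, reward, terminal) for action index a."""
    x_raw = x + ACTIONS[a]
    if x_raw >= 1.0:
        return 1.0, 1.0, True            # goal reached at the right end
    return max(0.0, x_raw), 0.0, False   # otherwise clip at the left boundary


def q_hat(theta, x):
    """Q-hat(x, a) = sum_i phi_i(x) * theta[i, a] with triangular phi_i.
    On a uniform grid this equals piecewise-linear interpolation of theta[:, a]."""
    return np.array([np.interp(x, CENTERS, theta[:, a]) for a in range(len(ACTIONS))])


def approximate_q_iteration(tol=1e-10, max_iter=1000):
    """Synchronous approximate Q-iteration: Bellman backups evaluated at the grid centers."""
    theta = np.zeros((len(CENTERS), len(ACTIONS)))
    for _ in range(max_iter):
        theta_new = np.empty_like(theta)
        for i, x in enumerate(CENTERS):
            for a in range(len(ACTIONS)):
                x_next, r, terminal = step(x, a)
                # No bootstrapping after a terminal transition.
                theta_new[i, a] = r if terminal else r + GAMMA * q_hat(theta, x_next).max()
        done = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if done:
            break
    return theta


def epsilon_greedy(theta, x, epsilon, rng):
    """With probability epsilon pick a random action (explore), else the greedy one (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(q_hat(theta, x)))


theta = approximate_q_iteration()
rng = np.random.default_rng(0)
greedy = [epsilon_greedy(theta, x, 0.0, rng) for x in CENTERS]
print(greedy)
```

With epsilon = 0 the selector is purely greedy, and here every center prefers moving right. A larger epsilon trades some immediate return for exploration. Exploration matters once the Q-values are learned from sampled transitions rather than computed from a known model, as they are in this sketch.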

About the Authors

Lucian Busoniu is a postdoctoral fellow at the Delft Center for Systems and Control of Delft University of Technology, in the Netherlands. He received his PhD degree (cum laude) in 2009 from the Delft University of Technology, and his MSc degree in 2003 from the Technical University of Cluj-Napoca, Romania. His current research interests include reinforcement learning and dynamic programming with function approximation, intelligent and learning techniques for control problems, and multi-agent learning.

Robert Babuska is a full professor at the Delft Center for Systems and Control of Delft University of Technology in the Netherlands. He received his PhD degree (cum laude) in Control in 1997 from the Delft University of Technology, and his MSc degree (with honors) in Electrical Engineering in 1990 from Czech Technical University, Prague. His research interests include fuzzy systems modeling and identification, data-driven construction and adaptation of neuro-fuzzy systems, model-based fuzzy control and learning control. He is active in applying these techniques in robotics, mechatronics, and aerospace.

Bart De Schutter is a full professor at the Delft Center for Systems and Control and at the Marine & Transport Technology department of Delft University of Technology in the Netherlands. He received the PhD degree in Applied Sciences (summa cum laude with congratulations of the examination jury) in 1996 from K.U. Leuven, Belgium. His current research interests include multi-agent systems, hybrid systems control, discrete-event systems, and control of intelligent transportation systems.

Damien Ernst received the MSc and PhD degrees from the University of Liège in 1998 and 2003, respectively. He is currently a Research Associate of the Belgian FRS-FNRS and is affiliated with the Systems and Modeling Research Unit of the University of Liège. From 2003 to 2006 he was a Postdoctoral Researcher of the FRS-FNRS at the University of Liège, and during this period he held visiting-researcher positions at CMU, MIT and ETH. He spent the academic year 2006-2007 as a professor at Supélec (France). His main research interests are in the fields of power system dynamics, optimal control, reinforcement learning, and design of dynamic treatment regimes.

Table of Contents

1. Introduction
1.1 The dynamic programming and reinforcement learning problem
1.2 Approximation in dynamic programming and reinforcement learning
1.3 About this book
2. An introduction to dynamic programming and reinforcement learning
2.1 Introduction
2.2 Markov decision processes
2.2.1 Deterministic setting
2.2.2 Stochastic setting
2.3 Value iteration
2.3.1 Model-based value iteration
2.3.2 Model-free value iteration and the need for exploration
2.4 Policy iteration
2.4.1 Model-based policy iteration
2.4.2 Model-free policy iteration
2.5 Policy search
2.6 Summary and discussion
3. Dynamic programming and reinforcement learning in large and continuous spaces
3.1 Introduction
3.2 The need for approximation in large and continuous spaces
3.3 Approximation architectures
3.3.1 Parametric approximation
3.3.2 Nonparametric approximation
3.3.3 Comparison of parametric and nonparametric approximation
3.3.4 Remarks
3.4 Approximate value iteration
3.4.1 Model-based value iteration with parametric approximation
3.4.2 Model-free value iteration with parametric approximation
3.4.3 Value iteration with nonparametric approximation
3.4.4 Convergence and the role of nonexpansive approximation
3.4.5 Example: Approximate Q-iteration for a DC motor
3.5 Approximate policy iteration
3.5.1 Value iteration-like algorithms for approximate policy evaluation
3.5.2 Model-free policy evaluation with linearly parameterized approximation
3.5.3 Policy evaluation with nonparametric approximation
3.5.4 Model-based approximate policy evaluation with rollouts
3.5.5 Policy improvement and approximate policy iteration
3.5.6 Theoretical guarantees
3.5.7 Example: Least-squares policy iteration for a DC motor
3.6 Finding value function approximators automatically
3.6.1 Basis function optimization
3.6.2 Basis function construction
3.6.3 Remarks
3.7 Approximate policy search
3.7.1 Policy gradient and actor-critic algorithms
3.7.2 Gradient-free policy search
3.7.3 Example: Gradient-free policy search for a DC motor
3.8 Comparison of approximate value iteration, policy iteration, and policy search
3.9 Summary and discussion
4. Approximate value iteration with a fuzzy representation
4.1 Introduction
4.2 Fuzzy Q-iteration
4.2.1 Approximation and projection mappings of fuzzy Q-iteration
4.2.2 Synchronous and asynchronous fuzzy Q-iteration
4.3 Analysis of fuzzy Q-iteration
4.3.1 Convergence
4.3.2 Consistency
4.3.3 Computational complexity
4.4 Optimizing the membership functions
4.4.1 A general approach to membership function optimization
4.4.2 Cross-entropy optimization
4.4.3 Fuzzy Q-iteration with cross-entropy optimization of the membership functions
4.5 Experimental study
4.5.1 DC motor: Convergence and consistency study
4.5.2 Two-link manipulator: Effects of action interpolation, and comparison with fitted Q-iteration
4.5.3 Inverted pendulum: Real-time control
4.5.4 Car on the hill: Effects of membership function optimization
4.6 Summary and discussion
5. Approximate policy iteration for online learning and continuous-action control
5.1 Introduction
5.2 A recapitulation of least-squares policy iteration
5.3 Online least-squares policy iteration
5.4 Online LSPI with prior knowledge
5.4.1 Online LSPI with policy approximation
5.4.2 Online LSPI with monotonic policies
5.5 LSPI with continuous-action, polynomial approximation
5.6 Experimental study
5.6.1 Online LSPI for the inverted pendulum
5.6.2 Online LSPI for the two-link manipulator
5.6.3 Online LSPI with prior knowledge for the DC motor
5.6.4 LSPI with continuous-action approximation for the inverted pendulum
5.7 Summary and discussion
6. Approximate policy search with cross-entropy optimization of basis functions
6.1 Introduction
6.2 Cross-entropy optimization
6.3 Cross-entropy policy search
6.3.1 General approach
6.3.2 Cross-entropy policy search with radial basis functions
6.4 Experimental study
6.4.1 Discrete-time double integrator
6.4.2 Bicycle balancing
6.4.3 Structured treatment interruptions for HIV infection control
6.5 Summary and discussion
Appendix A. Extremely randomized trees
A.1 Structure of the approximator
A.2 Building and using a tree
Appendix B. The cross-entropy method
B.1 Rare-event simulation using the cross-entropy method
B.2 Cross-entropy optimization
Symbols and abbreviations
Bibliography
List of algorithms
Index

User Reviews

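The title alone has a strongly academic feel, suggesting rigorous mathematical derivations and complex algorithm implementations. I had expected a tool-oriented book focused on building and tuning function approximators, leaning toward programming and specific frameworks. When I actually opened it, I found it goes well beyond that. The authors write with great care: rather than just listing formulas, they analyze the intrinsic connection between dynamic programming and reinforcement learning. The treatment of the Bellman equation is exceptionally thorough, and both classical value iteration and policy iteration are given solid theoretical grounding. The book reads less like a pure textbook and more like an in-depth intellectual walk with an experienced mentor. In the discussion of high-dimensional state spaces in particular, the authors do not simply rely on off-the-shelf deep learning frameworks. Instead, they devote considerable space to the theoretical challenges and possible solutions, which is invaluable for researchers who want a solid theoretical foundation. The book is well organized, moving from basic concepts to increasingly complex algorithms with each step well prepared. Reading it is smooth, and you feel you are building, step by step, a framework for understanding the whole field.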
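The reviewer's point that value iteration and policy iteration rest on the Bellman equation can be illustrated with exact policy iteration on a small finite MDP. Each round evaluates the current policy h by solving the linear Bellman equation V = R_h + gamma * P_h V, then improves h greedily. This is a minimal sketch, not taken from the book. The five-state chain, the array layout P[u, x, x'] and R[u, x], and gamma = 0.9 are assumptions.

```python
import numpy as np

# Hypothetical 5-state chain (an assumption, not an example from the book):
# action 0 moves left, action 1 moves right; entering state 4 pays 1, and state 4 is absorbing.
N, GAMMA, GOAL = 5, 0.9, 4

P = np.zeros((2, N, N))  # P[u, x, x'] = transition probability
R = np.zeros((2, N))     # R[u, x] = expected immediate reward
for x in range(N):
    for u, x_next in ((0, max(0, x - 1)), (1, min(N - 1, x + 1))):
        if x == GOAL:
            x_next = GOAL
        P[u, x, x_next] = 1.0
        R[u, x] = 1.0 if (x != GOAL and x_next == GOAL) else 0.0


def evaluate(policy):
    """Exact policy evaluation: solve (I - gamma * P_h) V = R_h for the given policy."""
    idx = np.arange(N)
    P_h = P[policy, idx]  # row x is P[policy[x], x, :]
    R_h = R[policy, idx]
    return np.linalg.solve(np.eye(N) - GAMMA * P_h, R_h)


def policy_iteration(policy):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    while True:
        v = evaluate(policy)
        q = R + GAMMA * P @ v  # q[u, x] = R[u, x] + gamma * sum_x' P[u, x, x'] * v[x']
        improved = q.argmax(axis=0)
        if np.array_equal(improved, policy):
            return policy, v
        policy = improved


policy, v = policy_iteration(np.zeros(N, dtype=int))  # start from "always move left"
print(policy, v.round(3))
```

The all-left starting policy never reaches the goal, so its values are all zero. Each improvement step extends the "move right" region by one state until the policy stops changing. At the absorbing goal both actions tie at zero, and `argmax` breaks the tie toward action 0.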

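The writing style is classical and rigorous, full of the appeal of mathematical derivation, yet it keeps a convincing logical coherence. Unlike fast-food introductory books that rush to cover every cutting-edge technique, this one seems more intent on getting to the roots of problems. The authors aim to give readers a firm grasp of the theoretical foundations of reinforcement learning. While reading, I often had to stop and work carefully through each definition and proof. This slowed my progress, but the knowledge it left behind is very solid. The background review of stochastic processes and Markov decision processes may look like familiar ground. However, the authors approach it from an unusual angle and tie these basic concepts closely to the later approximation problems, forming a coherent whole. For readers who want to study advanced topics such as algorithm convergence and asymptotic behavior, the theoretical depth here is hard to match in other textbooks.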

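What struck me most is the book's re-examination and modern reading of dynamic programming as a core idea. Many books on reinforcement learning introduce modern tools such as neural networks early on, leaving readers with a shallow understanding of the underlying decision process. This book does the opposite: it puts dynamic programming at the center and explains in detail its power for solving optimal control problems. The authors seem to stress that, whatever approximator is used later, understanding the principles of dynamic programming is the essential foundation. I especially appreciated how the authors compare Monte Carlo methods and TD learning with the classical dynamic programming framework and integrate them into it. This comparison highlights the strengths and weaknesses of each approach. More importantly, it shows how learning moves gradually from full reliance on a model to being model-free. The figures and examples are very well designed: they often capture the essence of a problem in the simplest way, without long and obscure mathematical language. As a result, beginners can quickly grasp the key points, and this pedagogical craftsmanship deserves praise.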
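The contrast the reviewer mentions between Monte Carlo methods and TD learning shows up directly in their update rules. The sketch below evaluates a fixed policy on a hypothetical four-state episodic chain, an assumption rather than an example from the book. Monte Carlo averages complete observed returns, while TD(0) bootstraps from its current estimate of the next state's value after every transition. Neither method uses the transition model. On this deterministic chain both reach the same values. Their differences in variance and in how information propagates would only show up once transitions or rewards are random.

```python
import numpy as np

# Hypothetical episodic chain (an assumption for illustration): under the fixed
# policy "always move right", states 0 -> 1 -> 2 -> 3 -> terminal, and only the
# last transition pays reward 1.
N, GAMMA = 4, 0.9


def run_episode():
    """Return a list of (state, reward, next_state); next_state None means terminal."""
    return [(x, 1.0 if x == N - 1 else 0.0, None if x == N - 1 else x + 1)
            for x in range(N)]


def monte_carlo(n_episodes):
    """Every-visit Monte Carlo: average the observed discounted returns (no bootstrapping)."""
    total, count = np.zeros(N), np.zeros(N)
    for _ in range(n_episodes):
        g = 0.0
        for x, r, _ in reversed(run_episode()):
            g = r + GAMMA * g  # discounted return from state x onward
            total[x] += g
            count[x] += 1
    return total / np.maximum(count, 1)


def td0(n_episodes, alpha=0.1):
    """TD(0): after every transition, move V(x) toward r + gamma * V(x') (bootstrapping)."""
    v = np.zeros(N)
    for _ in range(n_episodes):
        for x, r, x_next in run_episode():
            target = r if x_next is None else r + GAMMA * v[x_next]
            v[x] += alpha * (target - v[x])
    return v


v_mc, v_td = monte_carlo(10), td0(500)
print(v_mc.round(4), v_td.round(4))
```

`alpha` is the TD step size. With a constant step size on this deterministic chain, the TD estimates converge geometrically to the true values (0.729, 0.81, 0.9, 1.0), which Monte Carlo already recovers from the first episode.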

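As a practitioner who has spent many years in engineering, I usually care most about algorithm robustness and deployment efficiency, and this book gave me plenty of ideas on that front too. Although it leans toward theory, the authors do not avoid the practical pitfalls when discussing function approximators. For example, they offer original insights on choosing an approximator, bounding the approximation error, and avoiding convergence problems. Two discussions struck me as especially useful in practice: keeping policies smooth in high-dimensional spaces, and handling the bias-variance trade-off that function approximation introduces. A policy that is optimal in theory often fails in practice because of the approximator's limitations. The book seems to anticipate these problems and offers theoretical responses in advance, which gives me more confidence when designing experiments. It is not a book that teaches you how to write code but one that teaches you how to think. It helps you understand at a fundamental level why some methods work and others tend to fail.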

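Overall the book feels steady and substantial, a scholarly work grounded in classical theory with an eye on future challenges. Its greatest value is a stable theoretical framework. With it, readers can quickly place new algorithms and models, as they keep appearing, in their theoretical lineage and recognize their potential risks. In its treatment of function approximation, the book stresses the fundamental difference between linear and nonlinear approximation. It also shows how that difference affects the existence and uniqueness of solutions. This persistent attention to basic mathematical properties makes the arguments throughout the book airtight. It is an indispensable desk reference for scholars who already know some reinforcement learning but want to break through their current limits and do deeper research. It is not something you read once and shelve. It is a classic that rewards repeated study and yields different insights at different stages, and its careful attention to first principles ensures it a long academic life.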
