Mathematical Foundations of Reinforcement Learning - Lecture 2

November 17, 2025

20:42

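Bellman equation

How is the return calculated?

Directly from the definition: v1 denotes the return obtained when starting from state s1.

From the returns of the other states

These relations can also be written in matrix form.

This is the form of the Bellman equation. Though simple, it demonstrates the core idea: the value of one state relies on the values of other states.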
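
As a small illustration of the two ways of computing a return, the sketch below assumes a hypothetical deterministic cycle of four states, s1 -> s2 -> s3 -> s4 -> s1. The rewards and the discount rate gamma are made-up numbers, not values from the lecture. It computes v1 once from the definition (a truncated discounted sum) and once by solving the matrix relation v = r + gamma * P * v.

```python
import numpy as np

# Hypothetical setup (assumed for illustration, not taken from the lecture).
gamma = 0.9
r = np.array([0.0, 1.0, 1.0, 1.0])   # reward received when leaving s1..s4
P = np.array([[0, 1, 0, 0],          # P[i, j] = 1 if state i moves to state j
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)

def return_by_definition(start, steps=500):
    """Truncated discounted sum of rewards along the cycle."""
    g, discount, s = 0.0, 1.0, start
    for _ in range(steps):
        g += discount * r[s]
        discount *= gamma
        s = (s + 1) % 4              # deterministic move to the next state
    return g

# Method 2: each return depends on the next one, v = r + gamma * P v,
# which is a linear system (I - gamma * P) v = r.
v = np.linalg.solve(np.eye(4) - gamma * P, r)
v1_direct = return_by_definition(0)
print(v, v1_direct)
```

Both methods should agree on v1 up to the truncation error of the finite sum.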

State value

═══════════════════════════════════════════════════════════════════════════════════════════════════

Consider the following single-step process:

t, t + 1: discrete time instances
S_t: state at time t
A_t: the action taken at state S_t
R_{t+1}: the reward obtained after taking A_t (sometimes written R_t; either way it denotes the reward obtained after taking action A_t)
S_{t+1}: the state transited to after taking A_t

Note that S_t, A_t, R_{t+1} are all random variables, because the process is stochastic (governed by probability distributions).

This step is governed by the following probability distributions:

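S_t -> A_t is governed by the policy π(a|s); S_t, A_t -> R_{t+1} is governed by the reward probability p(r|s, a); S_t, A_t -> S_{t+1} is governed by the state transition probability p(s'|s, a).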

Now consider multiple steps, that is, a trajectory.

G_t is the discounted return of this trajectory.

G_t is also a random variable since R_{t+1}, R_{t+2}, ... are random variables.

The expectation (mean) of the random variable G_t is the state value.
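
In symbols, using the standard definition of the discounted return (the discount rate symbol γ, a number between 0 and 1, is assumed here because the formula itself is not reproduced in these notes):

```latex
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots, \\
v_\pi(s) &= \mathbb{E}[\, G_t \mid S_t = s \,].
\end{aligned}
```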

Remarks:

It is a function of s. Different states have different state values. It is a conditional expectation with the condition that the state starts from s.

It is based on the policy π. For a different policy, the state value may be different.

It represents the "value" of a state. If the state value is greater, then the policy is better because greater cumulative rewards can be obtained.

Q: What is the relationship between return and state value?

A: The state value is the mean of all possible returns that can be obtained starting from a state. If everything is deterministic, then the state value is the same as the return.

In short: a return belongs to a single trajectory, while the state value is the average of the returns over all trajectories.
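
The difference between a single return and the state value can be checked numerically. The sketch below assumes a hypothetical one-state process invented for illustration: at every step the policy picks "stay" (reward 0, remain in the state) or "exit" (reward 1, episode ends), each with probability 0.5. Each sampled trajectory gives one return; their average approximates the state value, which for this toy process can also be worked out by hand.

```python
import random

gamma = 0.9                 # assumed discount rate
rng = random.Random(0)      # fixed seed so the run is reproducible

def sample_return():
    """Return of one sampled trajectory of the hypothetical process."""
    g, discount = 0.0, 1.0
    while True:
        if rng.random() < 0.5:          # "exit": reward 1, episode ends
            return g + discount * 1.0
        discount *= gamma                # "stay": reward 0, keep going

returns = [sample_return() for _ in range(20000)]
estimate = sum(returns) / len(returns)   # mean of the returns

# By hand: v = 0.5 * (0 + gamma * v) + 0.5 * 1, so v = 0.5 / (1 - 0.5 * gamma)
exact = 0.5 / (1 - 0.5 * gamma)
print(estimate, exact)
```

Individual entries of `returns` differ from one trajectory to the next, while their mean settles near the hand-computed value.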

Bellman equation

════════════════════════════════════════════════════════════════════════════════════════════════════

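The state value function can be rewritten as the sum of two terms.

First term: E[R_{t+1} | S_t = s] is the expected reward at time t+1 given that the state at time t is s, that is, the immediate reward. It is obtained by summing over all actions available in state s (weighted by the policy π) and over the rewards that each action can produce.

Second term: the expected return from time t+1 onward given that the state at time t is s, that is, the future reward. It is obtained by summing the state values of the possible next states over all actions available in state s (weighted by the policy π) and over the next states each action can lead to (weighted by the state transition probability).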

This is the Bellman equation. It says that the state value of s equals the expectation (mean) of the immediate reward plus the expectation (mean) of the discounted future reward.
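
Written out, the two-term decomposition described above takes the following standard form. The notation v_π(s), γ, π(a|s), p(r|s, a) and p(s'|s, a) follows the rest of these notes; the derivation is reconstructed from the definitions rather than copied from the lecture slides.

```latex
\begin{aligned}
v_\pi(s) &= \mathbb{E}[G_t \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s] \\
&= \sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r
   + \gamma \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \\
&= \sum_{a} \pi(a \mid s) \Big[ \sum_{r} p(r \mid s, a)\, r
   + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big].
\end{aligned}
```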

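v_π(s) and v_π(s') are the state values to be computed. In other words, the state value of s is determined by the state values of other states, which is a form of bootstrapping.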

π(a|s) is a given policy. Solving the Bellman equation for a given policy π (to obtain its state values) is called policy evaluation.

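p(r|s, a) and p(s'|s, a) represent the dynamic model (the environment model), which consists of two probabilities.

p(r|s, a) is the probability of obtaining reward r when action a is taken in state s; p(s'|s, a) is the probability that the agent moves to the new state s' when action a is taken in state s.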

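Every state s has its own state value and therefore its own Bellman equation, so for a state set S with n states there are n Bellman equations. These n equations can be written in matrix form,

that is, the form shown in the figure below.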
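
The matrix form is commonly written v_π = r_π + γ P_π v_π, where entry s of r_π is the expected immediate reward under π and entry (s, s') of P_π is the probability of moving from s to s' under π. The sketch below shows one way to build r_π and P_π with NumPy. The two-state, two-action model and the array layouts (`pi[s, a]`, `P[s, a, s2]`, `R[s, a]`) are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical model with n = 2 states and m = 2 actions (all numbers assumed).
pi = np.array([[0.5, 0.5],          # pi[s, a] = pi(a|s)
               [1.0, 0.0]])
P = np.array([[[1.0, 0.0],          # P[s, a, s2] = p(s2|s, a)
               [0.0, 1.0]],
              [[0.0, 1.0],
               [1.0, 0.0]]])
R = np.array([[0.0, 1.0],           # R[s, a] = sum_r p(r|s, a) * r
              [2.0, 0.0]])

# [r_pi]_s = sum_a pi(a|s) * R[s, a]
r_pi = (pi * R).sum(axis=1)
# [P_pi]_{s, s2} = sum_a pi(a|s) * p(s2|s, a)
P_pi = np.einsum("sa,sat->st", pi, P)
print(r_pi)
print(P_pi)
```

Each row of `P_pi` is a probability distribution over next states, so it sums to 1.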

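There are two ways to solve this system of Bellman equations.

One is to compute the closed-form solution directly, which requires a matrix inverse.

The other is to solve it iteratively.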
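
Both approaches can be sketched in a few lines, assuming the matrix form v_π = r_π + γ P_π v_π. The closed form is v_π = (I - γ P_π)^{-1} r_π, and the iteration is v_{k+1} = r_π + γ P_π v_k starting from an arbitrary v_0. The numbers in `r_pi`, `P_pi` and `gamma` below are hypothetical.

```python
import numpy as np

# Hypothetical two-state example (values assumed for illustration).
gamma = 0.9
r_pi = np.array([0.5, 2.0])
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])

# Closed form: solve (I - gamma * P_pi) v = r_pi.
# np.linalg.solve avoids forming the inverse explicitly.
v_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Iterative: v_{k+1} = r_pi + gamma * P_pi v_k, starting from zeros.
v_iter = np.zeros(2)
for _ in range(1000):
    v_iter = r_pi + gamma * P_pi @ v_iter

print(v_closed, v_iter)
```

The closed form is exact but costs a linear solve over all n states; the iteration only needs matrix-vector products, which is usually the more practical choice when n is large.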

Action value

From state value to action value:

State value: the average return the agent can get starting from a state.
Action value: the average return the agent can get starting from a state and taking an action.

Relationship between action value and state value:
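
In the notation used above, with q_π(s, a) as the assumed symbol for the action value, the standard relations are:

```latex
\begin{aligned}
q_\pi(s, a) &= \mathbb{E}[\, G_t \mid S_t = s,\, A_t = a \,], \\
v_\pi(s) &= \sum_{a} \pi(a \mid s)\, q_\pi(s, a), \\
q_\pi(s, a) &= \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s').
\end{aligned}
```

The second line obtains a state value by averaging action values under the policy; the third line obtains an action value from the state values of the possible next states.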