Mathematical Foundations of Reinforcement Learning - 2

20251117

20:42

Bellman Equation

 

How do we compute the return?

  1. Directly from the definition: v1 denotes the return obtained starting from state s1

  2. Via the returns of other states

Writing these relations in matrix form,

 

This is one form of the Bellman equation. Though simple, it demonstrates the core idea: the value of one state relies on the values of other states.
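
As a sketch of the two computation routes above, assume for illustration four states s1, ..., s4, where each si transitions to s(i+1) (and s4 back to s1) with immediate reward ri and discount rate γ; then

\[
\begin{aligned}
v_1 &= r_1 + \gamma v_2, \\
v_2 &= r_2 + \gamma v_3, \\
v_3 &= r_3 + \gamma v_4, \\
v_4 &= r_4 + \gamma v_1,
\end{aligned}
\qquad\text{i.e.}\qquad
\mathbf{v} = \mathbf{r} + \gamma P \mathbf{v},
\]

where v = [v1, ..., v4]^T, r = [r1, ..., r4]^T, and P is the state transition matrix.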

 

State Value

═══════════════════════════════════════════════════════════════════════════════════════════════════

First, consider the following single-step process:

  • t, t+1: discrete time instants
  • St: the state at time t
  • At: the action taken at state St
  • Rt+1: the reward obtained after taking At (sometimes also written as Rt; either way, it denotes the reward received for taking action At)
  • St+1: the state transitioned to after taking At

 

Note that St, At, Rt+1 are all random variables, because this process is stochastic (governed by probability distributions).

This step is governed by the following probability distributions:

  • St → At is determined by the policy π
  • St, At → Rt+1 is determined by the reward distribution
  • St, At → St+1 is determined by the state transition distribution
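
Written out explicitly, using the notation π(a|s), p(r|s, a), and p(s'|s, a) that appears later in these notes, the three distributions are:

\[
\begin{aligned}
S_t \to A_t &: \; \pi(A_t = a \mid S_t = s), \\
S_t, A_t \to R_{t+1} &: \; p(R_{t+1} = r \mid S_t = s, A_t = a), \\
S_t, A_t \to S_{t+1} &: \; p(S_{t+1} = s' \mid S_t = s, A_t = a).
\end{aligned}
\]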

 

Now consider multiple steps, i.e., a trajectory:

Gt is the discounted return of this trajectory.

Gt is also a random variable, since Rt+1, Rt+2, ... are random variables.

 

The expectation (mean) of the random variable Gt is the state value.
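
In standard notation, with γ the discount rate, the two definitions read:

\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots,
\qquad
v_\pi(s) = \mathbb{E}[G_t \mid S_t = s].
\]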

Remarks:

  • It is a function of s: different states have different state values. It is a conditional expectation, with the condition that the state starts from s.
  • It is based on the policy π: for a different policy, the state value may be different.
  • It represents the "value" of a state: if the state value is greater, the policy is better, because greater cumulative rewards can be obtained.

 

Q: What is the relationship between return and state value?

A: The state value is the mean of all possible returns that can be obtained starting from a state. If everything (the policy, the reward, and the state transition) is deterministic, then the state value is the same as the return.

In other words, the state value is the average of the returns over all possible trajectories, whereas a return belongs to a single trajectory.

 

Bellman Equation

════════════════════════════════════════════════════════════════════════════════════════════════════

 

Rewrite the state value by splitting the return into an immediate-reward term and a future-reward term:
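
In standard notation, this decomposition is:

\[
G_t = R_{t+1} + \gamma G_{t+1}
\;\;\Longrightarrow\;\;
v_\pi(s) = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s].
\]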

First consider the first term. E[Rt+1|St=s] is the expected reward at time t+1 given that the state at time t is s, i.e., the expected immediate reward. It is obtained by summing over all actions possible at state s (weighted by the policy π) and, for each action, over the corresponding rewards (weighted by their probabilities).
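
As a formula (using the notation above):

\[
\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r.
\]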

 

 

Now consider the second term. It is the expected state value at time t+1 given that the state at time t is s, i.e., the expected future reward. It is obtained by summing over all actions possible at state s (weighted by the policy π) and, for each action, over the state values of the resulting next states (weighted by the state transition probabilities).
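
As a formula:

\[
\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s').
\]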

 

 

This is the Bellman equation. Its meaning: the state value of a state s equals the expected (average) immediate reward of the next step plus the discounted expected (average) future reward.
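
Putting the two terms together, the Bellman equation in elementwise form reads:

\[
v_\pi(s) = \sum_a \pi(a \mid s)\left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right],
\qquad \forall s \in \mathcal{S}.
\]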

 

vπ(s) and vπ(s') are the state values to be computed. In other words, the state value of s is determined by the state values of other states; this is a form of bootstrapping.

 

π(a|s) is a given policy. Solving the Bellman equation above for the state values of a given policy is called policy evaluation.

 

p(r|s, a) and p(s'|s, a) represent the dynamic model (the environment model), i.e., two probability distributions: p(r|s, a) is the probability of obtaining reward r when taking action a in state s, and p(s'|s, a) is the probability of the agent transitioning to the new state s' when taking action a in state s.

 

 

Each state s has a state value and therefore its own Bellman equation. For a state set S with n states, there are n such equations. Writing these n equations together gives the matrix-vector form below.
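
In standard notation, the matrix-vector form is:

\[
v_\pi = r_\pi + \gamma P_\pi v_\pi,
\]

where v_π stacks the state values v_π(s), r_π stacks the expected immediate rewards r_π(s) = Σ_a π(a|s) Σ_r p(r|s, a) r, and P_π is the state transition matrix under π with entries [P_π]_{ss'} = Σ_a π(a|s) p(s'|s, a).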

 

There are two ways to solve this system of Bellman equations.

One is to compute the closed-form solution directly; this requires inverting a matrix.

The other is to solve it iteratively, as sketched below.
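
In standard form, the closed-form solution is v_π = (I - γP_π)^{-1} r_π, and the iterative solution is v_{k+1} = r_π + γP_π v_k, which converges to v_π as k grows. A minimal NumPy sketch of both; the 3-state reward vector and transition matrix below are made-up illustration values, not taken from these notes:

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.0, 1.0, -1.0])      # expected immediate reward per state (illustrative)
P_pi = np.array([[0.0, 1.0, 0.0],      # state transition matrix under the policy (rows sum to 1)
                 [0.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0]])

# 1) Closed-form solution: solve (I - gamma * P_pi) v = r_pi, i.e. invert a matrix
v_closed = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# 2) Iterative solution: v_{k+1} = r_pi + gamma * P_pi v_k
v = np.zeros(3)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed)  # both give (approximately) the same state values
print(v)
```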

 

Action value

════════════════════════════════════════════════════════════════════════════════════════════════════

From state value to action value:

  •  State value: the average return the agent can get starting from a state.
  •  Action value: the average return the agent can get starting from a state and taking an action.

 

The relationship between action value and state value:
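
The formulas relating them (in standard notation, consistent with the Bellman equation above) are:

\[
v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a),
\qquad
q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s').
\]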

 

 

 
