Mathematical Foundations of Reinforcement Learning - Lecture 1
November 17, 2025
15:59
Basic Concepts
State
════════════════════════════════════════════════════════════════════════════════════════════════════

The agent occupies a state of the environment. For the grid-world environment, the state is simply the agent's location in the grid.
State space: the set of all possible states.

At each state the agent can take different actions. In the grid world, every state has 5 actions (move up, right, down, left, or stay unchanged). The set of all actions available at a state is called that state's action space.
In other words, the action space depends on the state: different states can have different action spaces, as in the sketch below.
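A minimal sketch (my own, not from the lecture) of how a state-dependent action space could be stored; the state indices and action names are hypothetical:

```python
# The 5 candidate actions of the grid world.
ACTIONS = ["up", "right", "down", "left", "stay"]

# A state-dependent action space A(s). In the lecture's grid world every state
# keeps all 5 actions, but in general the sets may differ from state to state.
action_space = {
    0: ACTIONS,                    # a cell where all 5 actions are available
    1: ["left", "right", "stay"],  # a hypothetical state with fewer actions
}

print(action_space[0])
```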
State transition
When taking an action, the agent may move from one state to another. Such a
process is called state transition.
A table can describe the state transitions of the grid-world environment.

However, a table can only represent deterministic state transitions; the more general description is probabilistic, i.e., the distribution p(s'|s,a). See the sketch below.
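A minimal sketch (my own, not from the lecture) contrasting a deterministic transition table with a stochastic transition distribution p(s'|s,a); the state indices, action names, and probabilities are hypothetical:

```python
import random

# Deterministic state transition: next_state = det_transition[(state, action)].
det_transition = {
    (0, "right"): 1,
    (0, "down"): 3,
    (0, "up"): 0,   # bumping into the boundary leaves the agent where it is
}

# Stochastic state transition: p(s' | s, a) as a distribution over next states.
stoch_transition = {
    (0, "right"): {1: 0.9, 0: 0.1},   # e.g. the move fails 10% of the time
}

def sample_next_state(state, action):
    """Sample s' ~ p(. | state, action)."""
    dist = stoch_transition[(state, action)]
    return random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

print(det_transition[(0, "right")])   # -> 1
print(sample_next_state(0, "right"))  # -> 1 with prob 0.9, 0 with prob 0.1
```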

Policy
════════════════════════════════════════════════════════════════════════════════════════════════════
Policy tells the agent what actions to take at a state.
Intuitive representation: The arrows demonstrate a policy.

The policy shown by such arrows is deterministic; the more general case is a stochastic policy, where π(a|s) gives the probability of taking action a at state s.

A table can represent both deterministic and stochastic policies: each row corresponds to a state and each entry is the probability of taking an action at that state.

In actual code, a policy is likewise represented as an array or matrix, as in the sketch below.
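A minimal sketch (my own, not from the lecture) of a policy stored as a matrix, where entry pi[s, a] = π(a|s); the numbers of states and actions and the probabilities are hypothetical:

```python
import numpy as np

# Stochastic policy as a matrix: pi[s, a] = pi(a | s); each row sums to 1.
pi = np.array([
    [1.0, 0.0, 0.0],   # state 0: deterministic (always action 0)
    [0.5, 0.5, 0.0],   # state 1: stochastic
    [0.1, 0.3, 0.6],   # state 2: stochastic
])
assert np.allclose(pi.sum(axis=1), 1.0)

rng = np.random.default_rng(0)

def sample_action(state):
    """Sample an action a ~ pi(. | state)."""
    return rng.choice(pi.shape[1], p=pi[state])

print(sample_action(1))  # 0 or 1, each with probability 0.5
```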
Reward
Reward is one of the most unique concepts of RL.
Reward: a real number we get after taking an action.
That is, the real number obtained after taking an action is called the reward of that action.
A table can represent deterministic rewards,

while stochastic rewards are again described with probabilities, i.e., p(r|s,a). See the sketch below.
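A minimal sketch (my own, not from the lecture) of a deterministic reward table next to a stochastic reward distribution p(r|s,a); the state indices are hypothetical, and the values only loosely follow the grid-world convention of the course (negative for hitting the boundary, +1 for entering the target, 0 otherwise):

```python
import random

# Deterministic reward: r = reward_table[(state, action)].
reward_table = {
    (0, "up"): -1.0,     # bumps into the boundary
    (0, "right"): 0.0,
    (5, "down"): 1.0,    # enters the target cell
}

# Stochastic reward: p(r | s, a) as a distribution over real values.
reward_dist = {
    (0, "right"): {0.0: 0.8, -1.0: 0.2},
}

def sample_reward(state, action):
    """Sample r ~ p(. | state, action)."""
    dist = reward_dist[(state, action)]
    return random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

print(reward_table[(5, "down")])   # -> 1.0
print(sample_reward(0, "right"))   # -> 0.0 with prob 0.8, -1.0 with prob 0.2
```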
Trajectory and return

A trajectory is a state-action-reward chain obtained by following a policy; the return is the sum of all the rewards collected along that trajectory (see the sketch below).
The return can be used to evaluate whether a policy is good or bad.
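A minimal sketch (my own, not from the lecture) of computing the return of one trajectory; the reward sequence is hypothetical:

```python
# Rewards collected along one trajectory obtained by following a policy,
# e.g. a path that reaches the target on the fourth step.
rewards = [0, 0, 0, 1]

# Return = sum of all rewards along the trajectory.
trajectory_return = sum(rewards)
print(trajectory_return)  # -> 1
```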
The discounted return is the weighted sum of all rewards along the trajectory:
discounted return = r1 + γ·r2 + γ²·r3 + ⋯, where γ ∈ (0, 1) is the discount rate.

Roles of introducing the discount rate γ:
1) the sum becomes finite (the return does not blow up to infinity);
2) it balances the far and near future rewards: if γ is close to 0, the discounted return is dominated by the rewards obtained in the near future; if γ is close to 1, it is dominated by the rewards obtained in the far future.
The discounted return can likewise be used to evaluate how good a policy is, as in the sketch below.
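A minimal sketch (my own, not from the lecture) of the discounted return and of how γ weights near vs. far rewards; the infinite trajectory is truncated for the demo and the reward sequence is hypothetical:

```python
def discounted_return(rewards, gamma):
    """G = r1 + gamma*r2 + gamma^2*r3 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A trajectory that keeps re-entering the target and collecting +1 forever;
# truncated here (for gamma < 1 the geometric series converges anyway).
rewards = [0, 0, 0] + [1] * 500

print(discounted_return(rewards, gamma=0.1))  # small gamma: near-future rewards dominate
print(discounted_return(rewards, gamma=0.9))  # large gamma: far-future rewards matter more
```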
Episode
When interacting with the environment following a policy, the agent may stop
at some terminal states. The resulting trajectory is called an episode (or a
trial).
An episode is usually assumed to be a finite trajectory. Tasks with episodes are
called episodic tasks.
Some tasks may have no terminal states, meaning the interaction with the
environment will never end. Such tasks are called continuing tasks.
In fact, we can treat episodic and continuing tasks in a unified mathematical
way by converting episodic tasks to continuing tasks. There are two options for doing so.
Option 1: treat the target state as a special absorbing state. Once the agent
reaches an absorbing state, it will never leave, and the consequent rewards are r = 0.
Option 2: treat the target state as a normal state with a policy. The agent
can still leave the target state and gains r = +1 every time it enters the target state.
We consider option 2 in this course so that we don't need to distinguish the
target state from the others and can treat it as a normal state. A sketch contrasting the two options follows.
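A minimal sketch (my own illustration) of the two options, using a tiny 1-D "grid" with hypothetical state indices; only the dynamics and reward at the target differ:

```python
# A tiny 1-D grid with 3 cells; the target is the last cell (indices hypothetical).
TARGET = 2

def move(state, action):
    """Plain dynamics: 'left'/'right' moves by one cell, clipped to the grid."""
    delta = {"left": -1, "right": +1, "stay": 0}[action]
    return min(max(state + delta, 0), 2)

def step_option1(state, action):
    """Option 1: the target is an absorbing state; once reached, the agent
    never leaves and all subsequent rewards are r = 0."""
    if state == TARGET:
        return TARGET, 0.0
    next_state = move(state, action)
    return next_state, (1.0 if next_state == TARGET else 0.0)

def step_option2(state, action):
    """Option 2: the target is a normal state; the agent may leave it and
    gains r = +1 every time it enters the target."""
    next_state = move(state, action)
    return next_state, (1.0 if next_state == TARGET else 0.0)

print(step_option1(TARGET, "left"))  # (2, 0.0): absorbed, reward stays 0
print(step_option2(TARGET, "left"))  # (1, 0.0): the agent can still leave the target
```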
MDP

A Markov decision process (MDP) puts these pieces together: 1) the Markov property (the state and reward at time t+1 depend only on the current state s_t and the action taken at it, written a_{t+1} in the course's notation, not on any earlier states or actions); 2) decision, i.e., the policy; 3) process, i.e., the probability distributions describing state transitions and rewards.
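Written out (using the course's indexing, where a_{t+1} denotes the action taken at state s_t), the Markov property reads:

```latex
p(s_{t+1} \mid a_{t+1}, s_t, \ldots, a_1, s_0) = p(s_{t+1} \mid a_{t+1}, s_t), \qquad
p(r_{t+1} \mid a_{t+1}, s_t, \ldots, a_1, s_0) = p(r_{t+1} \mid a_{t+1}, s_t).
```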