Nobu_Portfolio: Q-Learning vs. Policy Gradient: The Key Points, Explained with CartPole-v1

Tuesday, January 22, 2019

The source code is here: The Key Points of Policy Gradient

The REINFORCE algorithm is a Monte Carlo method, so it trains only after an episode has finished.

This is the difference from TD (Temporal Difference) methods, which train at every step.
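Training after the episode ends is what makes it possible to compute the full discounted return for every step. The following is a minimal sketch of that calculation, not the source's own code: the function name `discounted_returns` and the `gamma` value are assumptions, and any normalization the source applies to the rewards is left out.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for each step of one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# CartPole gives a reward of 1 for every step the pole stays up.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

A TD method would not need the whole reward list, because it updates from a single step plus an estimate of the next state's value.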

The CartPole-v1 state has four components: cart position, cart velocity, pole angle, and pole angular velocity.

There are two actions: push right and push left.

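The secret sauce of Policy Gradient:

1. model.add(Dense(self.action_size, activation='softmax', kernel_initializer='glorot_uniform'))

Because the last layer uses softmax activation, with an action size of 2 the network outputs one probability per action, for example [0.51 0.49]. (In Gym's CartPole, index 0 is push left and index 1 is push right.)

In DQN, the last layer's activation was linear rather than softmax, which is why it could output Q-values.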
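To make the difference between the two output layers concrete, here is a small standard-library sketch. The `raw_outputs` values are hypothetical, chosen only so that the softmax result matches the [0.51 0.49] example, and `softmax` is written by hand rather than taken from Keras.

```python
import math


def softmax(logits):
    """Turn raw network outputs into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


raw_outputs = [0.04, 0.0]  # hypothetical last-layer values before activation

# Linear activation (DQN): the values pass through unchanged and are read as Q-values.
q_values = raw_outputs
# Softmax activation (policy gradient): the values become action probabilities.
policy = softmax(raw_outputs)

print(q_values)
print([round(p, 2) for p in policy])  # [0.51, 0.49]
```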

2. Look at def get_action().

# using the output of policy network, pick action stochastically
def get_action(self, state):
    # get the action probabilities from the network
    policy = self.model.predict(state, batch_size=1).flatten()
    print("get_action policy", policy)
    # use those probabilities to pick an action at random
    return np.random.choice(self.action_size, 1, p=policy)[0]
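The stochastic choice can be tried without a network. The sketch below uses Python's standard `random.choices` in place of `np.random.choice`; the helper name `sample_action`, the fixed policy vector, and the seed are assumptions made for illustration.

```python
import random


def sample_action(policy, rng=random):
    """Pick an action index with the probability given by the policy vector."""
    return rng.choices(range(len(policy)), weights=policy, k=1)[0]


rng = random.Random(0)  # seeded only so the demo is repeatable
policy = [0.51, 0.49]
counts = [0, 0]
for _ in range(10_000):
    counts[sample_action(policy, rng)] += 1

# The split is close to 51% / 49%; the exact counts depend on the seed.
print(counts)
```

Unlike an argmax over Q-values, this keeps choosing the less likely action some of the time, which is how the policy explores.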

3. In main, training starts as soon as the episode ends.

if done:
    # every episode, agent learns from sample returns
    agent.train_model()

4. Inside def train_model(self):, model.fit() is called to perform backpropagation.

for i in range(episode_length):
    #copying states array to update_inputs
    update_inputs[i] = self.states[i] #start from state zero
    #print("update_inputs[{}]".format(i), update_inputs[i])
    #filling in squashed rewards into advantages for each action at each state
    advantages[i][self.actions[i]] = discounted_rewards[i]
    #print("advantages 2D", advantages[i][self.actions[i]])
#print("\n\nupdate_inputs final", update_inputs)
#print("\n\nadvantages final {}".format(advantages))
#(training data=states, targets=advantages which is like the action-value)
self.model.fit(update_inputs, advantages, epochs=1, verbose=0)  # <- this is the call
#clear out states, actions, rewards for next episode
self.states, self.actions, self.rewards = [], [], []
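The shape of the training targets is easier to see on a tiny example. This is a sketch with plain Python lists instead of NumPy arrays; the helper name `build_targets` and the three-step episode are hypothetical.

```python
def build_targets(actions, discounted_rewards, action_size):
    """One row per step: zero everywhere except the column of the action taken."""
    advantages = [[0.0] * action_size for _ in actions]
    for i, (action, ret) in enumerate(zip(actions, discounted_rewards)):
        advantages[i][action] = ret
    return advantages


actions = [1, 0, 1]                     # hypothetical episode: actions taken at each step
discounted_rewards = [1.75, 1.5, 1.0]   # hypothetical returns for those steps
for row in build_targets(actions, discounted_rewards, action_size=2):
    print(row)
# [0.0, 1.75]
# [1.5, 0.0]
# [0.0, 1.0]
```

Each row acts like a one-hot label scaled by the return, so only the output for the action actually taken contributes to the loss.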

5. The comment below explains the loss function. (A Japanese translation is planned.)

# Using categorical crossentropy as a loss is a trick to easily
# implement the policy gradient. Categorical cross entropy is defined
# H(p, q) = -sum(p_i * log(q_i)). For the action taken, a, you set
# p_a = advantage. q_a is the output of the policy network, which is
# the probability of taking the action a, i.e. policy(s, a).
# All other p_i are zero, thus we have H(p, q) = -A * log(policy(s, a)),
# so minimizing this loss maximizes A * log(policy(s, a)).
model.compile(loss="categorical_crossentropy", optimizer=Adam(lr=self.learning_rate))
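The trick can be checked numerically for a single step. The sketch below is a hand-written cross entropy, not Keras's implementation (which, among other things, clips the probabilities for numerical safety); the policy output, the action, and the advantage are hypothetical values.

```python
import math


def categorical_crossentropy(target, predicted):
    """Cross entropy H(p, q) = -sum(p_i * log(q_i)) for one sample."""
    return -sum(p * math.log(q) for p, q in zip(target, predicted))


policy = [0.51, 0.49]        # hypothetical network output for one state
action, advantage = 1, 1.5   # hypothetical action taken and its return

target = [0.0, 0.0]
target[action] = advantage   # same layout as one row of `advantages`

loss = categorical_crossentropy(target, policy)
expected = -advantage * math.log(policy[action])
print(math.isclose(loss, expected))  # True
```

Because every other target entry is zero, the sum collapses to the single term for the action taken, which is the quantity the policy gradient needs.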
