In the function `mc_control_epsilon_greedy`:

```python
# Find all (state, action) pairs we've visited in this episode
# We convert each state to a tuple so that we can use it as a dict key
sa_in_episode = set([(tuple(x[0]), x[1]) for x in episode])
for state, action in sa_in_episode:
    sa_pair = (state, action)
    # Find the first occurrence of the (state, action) pair in the episode
    first_occurence_idx = next(i for i, x in enumerate(episode)
                               if x[0] == state and x[1] == action)
    # Sum up all rewards since the first occurrence
    G = sum([x[2] * (discount_factor ** i)
             for i, x in enumerate(episode[first_occurence_idx:])])
    # Calculate average return for this state over all sampled episodes
    returns_sum[sa_pair] += G
    returns_count[sa_pair] += 1.0
    Q[state][action] = returns_sum[sa_pair] / returns_count[sa_pair]

# The policy is improved implicitly by changing the Q dictionary
return Q, policy
```
I think a line should be added before the last line:
```python
Q[state][action] = returns_sum[sa_pair] / returns_count[sa_pair]
# The policy is improved implicitly by changing the Q dictionary
policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)
return Q, policy
```
Otherwise the policy will not update.
@Ritz111 No, it will update. The policy is already updating as the Q values update, because the policy function selects actions according to the current contents of `Q` every time it is called.
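To make the mechanism concrete, here is a minimal, self-contained sketch of a closure-based epsilon-greedy policy in the style of the notebook's `make_epsilon_greedy_policy` (the exact body in the repo may differ slightly). Because `policy_fn` reads `Q` at call time rather than copying it, mutating `Q` in place is enough to change the policy's behavior:

```python
import numpy as np
from collections import defaultdict

def make_epsilon_greedy_policy(Q, epsilon, nA):
    """Return a function mapping an observation to action probabilities."""
    def policy_fn(observation):
        # Uniform exploration mass of epsilon spread over all nA actions
        A = np.ones(nA, dtype=float) * epsilon / nA
        # Greedy action according to the *current* contents of Q
        best_action = np.argmax(Q[observation])
        A[best_action] += 1.0 - epsilon
        return A
    return policy_fn

# Demo: the policy sees in-place updates to Q without being rebuilt
Q = defaultdict(lambda: np.zeros(2))
policy = make_epsilon_greedy_policy(Q, epsilon=0.1, nA=2)
print(policy("s"))   # [0.95, 0.05] -- greedy toward action 0
Q["s"][1] = 1.0      # update Q in place, as the control loop does
print(policy("s"))   # [0.05, 0.95] -- greedy toward action 1, no rebuild
```

This is also why the proposed extra line is redundant: calling `make_epsilon_greedy_policy(Q, ...)` again would just create a new closure over the same `Q` dictionary, producing identical behavior.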