Hello, thanks for sharing your code. Is it possible to use this for MDPs with negative reward states?
I've tried setting negative rewards inside setup_mdp() in example.py, e.g.:
```python
def setup_mdp():
    """
    Set-up our MDP/GridWorld
    """
    # create our world
    world = W.IcyGridWorld(size=5, p_slip=0.2)

    # set up the reward function
    reward = np.zeros(world.n_states)
    reward[-1] = 1.0
    reward[17] = -0.75
    reward[18] = -0.75
    reward[19] = -0.75

    # set up terminal states
    terminal = [24]

    return world, reward, terminal
```
-0.75 seems to be around the lowest value I can set; anything lower and running example.py results in an error:
```
Traceback (most recent call last):
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 141, in <module>
    main()
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 113, in main
    trajectories, expert_policy = generate_trajectories(world, reward, terminal)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 51, in generate_trajectories
    tjs = list(T.generate_trajectories(n_trajectories, world, policy_exec, initial, terminal))
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 128, in <genexpr>
    return (_generate_one() for _ in range(n))
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 126, in _generate_one
    return generate_trajectory(world, policy, s, final)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 77, in generate_trajectory
    action = policy(state)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 169, in <lambda>
    return lambda state: np.random.choice([*range(policy.shape[1])], p=policy[state, :])
  File "mtrand.pyx", line 956, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities are not non-negative
```
And even with the above setup_mdp, the IRL methods don't seem to produce negative reward estimates (see the colourbar I've added):
True rewards: [plot omitted]
Estimated rewards with maxent: [plot omitted]
Estimated rewards with causal maxent: [plot omitted]
Ah, this is kinda interesting, thanks for spotting this (needless to say, I didn't test with negative rewards...)
For the crash: What's happening here is that it's not the IRL-maxent algorithm itself that fails, but the "expert data" generation part of the example. As far as I understand, IRL-maxent (both causal and normal) should support negative rewards just fine (more on that later).
To generate the "expert data", I'm just running a normal value iteration, using that to compute a stochastic policy for the "expert" by computing Q(s, a) / V(s) (where V(s) = sum_a Q(s, a)), and using that policy in turn to create some trajectories by executing it (see here). The policy computation is where this goes wrong, because simply dividing those values is a bit nonsensical. For positive Q(s, a) and V(s) it happens to work out to a valid probability distribution, but as you've spotted: as soon as any of those values are negative, it doesn't.
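For illustration, here's a tiny made-up example (the numbers are arbitrary, not taken from the gridworld) of how that normalization produces invalid "probabilities" as soon as a Q-value goes negative:

```python
import numpy as np

# made-up Q-values for a single state; one of them is negative
q = np.array([0.5, -0.2, 0.3])

# naive normalization: Q(s, a) / V(s) with V(s) = sum_a Q(s, a)
p = q / q.sum()
print(p)  # [ 0.833... -0.333...  0.5 ] -- not a valid distribution

# np.random.choice then rejects it with
# "ValueError: probabilities are not non-negative"
np.random.choice(len(q), p=p)
```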
What this should have been doing in the first place is using a softmax to compute the probability distribution. I've fixed that in 918044c.
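As a rough sketch of the idea (not necessarily the exact code in 918044c; the function name and `temperature` parameter are just illustrative), the softmax version could look like this:

```python
import numpy as np

def stochastic_policy_from_q(q, temperature=1.0):
    """Turn a Q-table of shape (n_states, n_actions) into a stochastic
    policy via a per-state softmax over the actions.

    Unlike dividing by V(s) = sum_a Q(s, a), this always yields valid
    probabilities, regardless of the sign of the Q-values.
    """
    # subtract the per-state maximum for numerical stability
    z = (q - q.max(axis=1, keepdims=True)) / temperature
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```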
For recovering negative rewards: I guess theoretically, the algorithm should be able to do that, but it might be hard to "convince" it to do so. Essentially, it just tries to find some reward parameters that result in the same distribution of visited states as in the expert trajectories. And those parameters are generally not unique, meaning multiple parameter settings can result in the same state-visitation frequency (and policy).
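To make that concrete: the MaxEnt objective only ever compares feature expectations, so any reward parameters whose induced policy reproduces the expert's state-visitation frequencies are an equally good fit, including ones where the recovered reward never dips below zero. A rough sketch of that gradient (the names below are illustrative, not this repository's exact API):

```python
import numpy as np

def maxent_gradient(features, expert_feature_expectation, svf):
    """
    features                   : (n_states, n_features) state-feature matrix
    expert_feature_expectation : (n_features,) empirical feature expectation
                                 computed from the expert trajectories
    svf                        : (n_states,) expected state-visitation
                                 frequencies under the current reward estimate

    The gradient is zero whenever the induced visitation frequencies match
    the expert's -- the sign of the underlying reward never enters directly.
    """
    return expert_feature_expectation - features.T.dot(svf)
```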