Supporting MDPs with negative reward states? #4

Open
kierad opened this issue Jun 25, 2023 · 1 comment

kierad commented Jun 25, 2023

Hello, thanks for sharing your code. Is it possible to use this for MDPs with negative reward states?

I've tried setting negative rewards inside setup_mdp() in example.py, e.g.:

def setup_mdp():
    """
    Set-up our MDP/GridWorld
    """
    # create our world
    world = W.IcyGridWorld(size=5, p_slip=0.2)

    # set up the reward function
    reward = np.zeros(world.n_states)
    reward[-1] = 1.0    # positive reward at the terminal state
    # negative rewards at a few intermediate states
    reward[17] = -0.75
    reward[18] = -0.75
    reward[19] = -0.75

    # set up terminal states
    terminal = [24]

    return world, reward, terminal

-0.75 seems to be around the lowest value I can set; anything lower, and running example.py results in an error:

Traceback (most recent call last):
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 141, in <module>
    main()
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 113, in main
    trajectories, expert_policy = generate_trajectories(world, reward, terminal)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 51, in generate_trajectories
    tjs = list(T.generate_trajectories(n_trajectories, world, policy_exec, initial, terminal))
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 128, in <genexpr>
    return (_generate_one() for _ in range(n))
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 126, in _generate_one
    return generate_trajectory(world, policy, s, final)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 77, in generate_trajectory
    action = policy(state)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 169, in <lambda>
    return lambda state: np.random.choice([*range(policy.shape[1])], p=policy[state, :])
  File "mtrand.pyx", line 956, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities are not non-negative

And even with the above setup_mdp, the IRL methods don't seem to produce negative reward estimates (see the colourbar I've added):

True rewards:
[attached figure: reward_estimate_true]

Estimated rewards with maxent:
[attached figure: reward_estimate_maxent]

Estimated rewards with causal maxent:
[attached figure: reward_estimate_maxent_causal]

kierad changed the title from "Negative rewards" to "Supporting MDPs with negative reward states?" on Jun 25, 2023
qzed (Owner) commented Jun 25, 2023

Ah, this is kinda interesting, thanks for spotting this (needless to say, I didn't test with negative rewards...)

For the crash: what's happening here is that it's not the IRL-maxent algorithm itself that fails, but the "expert data" generation part of the example. As far as I understand, IRL-maxent (both causal and normal) should support negative rewards just fine (more on that later).

To generate the "expert data", I'm just running a normal value iteration, using that to compute a stochastic policy for the "expert" by computing Q(s, a) / V(s) (where V(s) = sum_a Q(s, a)), and then executing that policy to create some trajectories (see here). The policy computation is where this goes wrong, because just dividing those values is a bit nonsensical: for positive Q(s, a) and V(s) it does work out to a valid probability distribution, but as you've spotted, as soon as any of those values are negative, it doesn't.
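
To illustrate the failure mode with a minimal sketch (toy numbers, not the repo's actual code): as soon as one Q-value is negative, the division no longer yields a valid probability distribution, which is exactly what np.random.choice complains about.

import numpy as np

q = np.array([0.4, -0.2, 0.8])   # Q(s, a) for a single state; one entry is negative
v = q.sum()                      # V(s) = sum_a Q(s, a)
p = q / v                        # "policy" via division

print(p)                         # [ 0.4 -0.2  0.8] -- contains a negative entry
np.random.choice(len(p), p=p)    # ValueError: probabilities are not non-negative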

What this should have been doing in the first place is using a softmax to compute the probability distribution. I've fixed that in 918044c.
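
Roughly the idea, as a generic sketch with made-up names (the actual fix is in the commit above):

import numpy as np

def stochastic_policy_from_q(q_table):
    # softmax over actions per state; well-defined even for negative Q-values
    z = q_table - q_table.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

q_table = np.array([[ 0.4, -0.2,  0.8],
                    [-1.0, -0.5, -2.0]])
policy = stochastic_policy_from_q(q_table)
print(policy)                 # each row is a valid probability distribution
print(policy.sum(axis=1))     # [1. 1.]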

For recovering negative rewards: I guess theoretically, the algorithm should be able to do that, but it might be hard to "convince" it to do so. Essentially, it just tries to find some reward parameters that result in the same distribution of visited states as in the expert trajectories. And those parameters are generally not unique, meaning multiple sets of parameters can result in the same state visitation frequency (and policy).
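
One simple way to see that ambiguity (illustrative only, not tied to this repo): a softmax policy over actions doesn't change when all Q-values at a state are shifted by the same constant, so quite different value landscapes can produce identical policies and hence identical visitation statistics.

import numpy as np

def softmax(q):
    e = np.exp(q - q.max())      # stable softmax over one state's action values
    return e / e.sum()

q = np.array([1.0, -0.5, 0.3])
print(softmax(q))                # same action distribution ...
print(softmax(q + 10.0))         # ... after shifting every value by a constant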
