
[Bug]: Using the start value of Discrete spaces has no effect #2052

Open
JoshuaBluem opened this issue Dec 3, 2024 · 5 comments
Labels
  • documentation (Improvements or additions to documentation)
  • duplicate (This issue or pull request already exists)
  • good first issue (Good for newcomers)
  • help wanted (Help from contributors is welcomed)

Comments

@JoshuaBluem

🐛 Bug

For example, spaces.Discrete(3, start=-1) does not produce actions in {-1, 0, 1} but rather in {0, 1, 2}, in both the step function and the predict function.

To Reproduce

Code sample: environment with a Discrete action space that does not start from zero

from typing import Any, Optional, SupportsFloat
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

'''
Trains an agent to repeat a number.
The expected action is 11 if the current state is 1 and 10 if the state is 0.
The current state is chosen randomly.
Each episode consists of a single choice and terminates immediately afterwards.
'''

class CustomEnv(gym.Env):
    def __init__(self, render_mode='human'):
        super().__init__()
        self.observation_space = spaces.Discrete(2)  # Observation space (0 or 1)
        self.action_space = spaces.Discrete(2, start=10)  # Action space (10 or 11)
        self.current_state = np.random.choice([0, 1])  # Initial state (random 0 or 1)
        self.render_mode = render_mode

    def step(self, action) -> tuple[object, SupportsFloat, bool, bool, dict[str, Any]]:
        # Execute action and compute reward based on repeating numbers
        print(f"received action: {action}")
        reward = 1 if action - 10 == self.current_state else 0  # reward 1 if the shifted action matches the state
        self.current_state = np.random.choice([0, 1])  # draw a new state for the next step

        terminated = True   # each episode ends after a single choice
        truncated = False   # episodes are never cut short (e.g. by a time limit)

        return self.current_state, reward, terminated, truncated, {}

    def reset(self, *,
        seed: Optional[int] = None,
        options: Optional[dict[str, Any]] = None
    ) -> tuple[object, dict[str, Any]]:
        self.current_state = np.random.choice([0, 1]) # Reset to initial state
        return self.current_state, {}

    def render(self) -> str | list[str] | None:
        if self.render_mode == 'human':
            print(f"Current state: {self.current_state}")
        elif self.render_mode == 'ansi':
            return str(self.current_state)
        else:
            print(f"unknown renderer mode: {self.render_mode}")

env = DummyVecEnv([lambda: CustomEnv(render_mode='ansi')]) 
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=1000, reset_num_timesteps=False)

Relevant log output / Error message

The received actions should be 10's and 11's, not 0's and 1's, because the action space was defined as spaces.Discrete(2, start=10).

System Info

  • OS: Windows-10-10.0.19045-SP0 10.0.19045
  • Python: 3.11.9
  • Stable-Baselines3: 2.3.2
  • PyTorch: 2.4.0+cu118
  • GPU Enabled: True
  • Numpy: 1.26.4
  • Cloudpickle: 3.0.0
  • Gymnasium: 0.29.1
  • OpenAI Gym: 0.25.1

Checklist

  • My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal and working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.
@JoshuaBluem added the bug label Dec 3, 2024
@araffin added the duplicate and check the checklist labels Dec 3, 2024
@JoshuaBluem
Author

Hey @araffin, this is my first interaction with public git repositories. Could you help me and tell me what I am missing on this issue, given your comment "You have checked the required items in the checklist but you didn't do what's written ..."?
Also, since you flagged this issue as a duplicate, could you link the issue that this one duplicates?
I would really appreciate your help!! I just want to help this project.

@araffin
Member

araffin commented Dec 4, 2024

Hello,

A quick search in the issues will give you #913, and also the related #1295 and #1378.

We currently also have an explicit warning in our env checker:

def _check_non_zero_start(space: spaces.Space, space_type: str = "observation", key: str = "") -> None:
    """
    :param space: Observation or action space
    :param space_type: information about whether it is an observation or action space
        (for the warning message)
    :param key: When the observation space comes from a Dict space, we pass the
        corresponding key to have more precise warning messages. Defaults to "".
    """
    if isinstance(space, (spaces.Discrete, spaces.MultiDiscrete)) and not _starts_at_zero(space):
        maybe_key = f"(key='{key}')" if key else ""
        warnings.warn(
            f"{type(space).__name__} {space_type} space {maybe_key} with a non-zero start (start={space.start}) "
            "is not supported by Stable-Baselines3. "
            f"You can use a wrapper or update your {space_type} space."
        )
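For illustration (a sketch, not part of the original thread): running the env checker on the CustomEnv from the reproduction code above should hit exactly this code path, since its action space is Discrete(2, start=10).

from stable_baselines3.common.env_checker import check_env

# Assumes the CustomEnv class from the reproduction code above is in scope.
# Its action space is Discrete(2, start=10), so _check_non_zero_start warns
# that a non-zero start is not supported and suggests a wrapper or an
# updated action space.
check_env(CustomEnv(), warn=True)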

I would really appreciate your help!! I just want to help this project.

I appreciate your motivation and that's also why we have a contributing guide to get you started: https://github.com/DLR-RM/stable-baselines3/blob/master/CONTRIBUTING.md

Currently, as written in the other issues and in the env checker, there is a simple alternative, which is to use a wrapper (or to update the env to do the offset there).
However, looking at your PR, it seems to require fewer code changes than I expected. Your PR is currently also missing tests and doesn't seem to cover the off-policy algorithms.

@araffin
Member

araffin commented Dec 6, 2024

Currently, as written in the other issues and in the env checker, there is a simple alternative, which is to use a wrapper (or to update the env to do the offset there).

Sorry for not being clear enough. After looking again at the PR, I think it would be better to document what such a wrapper would look like rather than make the codebase more complex.

I guess we should add something like:

class ShiftWrapper(gym.Wrapper):
    """Allow using Discrete() action spaces with start != 0"""
    def __init__(self, env: gym.Env) -> None:
        super().__init__(env)
        assert isinstance(env.action_space, gym.spaces.Discrete)
        # Expose a zero-based action space to the agent
        self.action_space = gym.spaces.Discrete(env.action_space.n, start=0)

    def step(self, action: int):
        # self.env.action_space still carries the original non-zero start
        return self.env.step(action + self.env.action_space.start)

to our documentation, the same way we document how to use isaac sim with sb3.
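For example (a usage sketch, assuming CustomEnv and the imports from the reproduction code above), the wrapper would be applied before vectorizing the env:

env = DummyVecEnv([lambda: ShiftWrapper(CustomEnv(render_mode='ansi'))])
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=1000)
# PPO samples zero-based actions (0 or 1); ShiftWrapper adds the original
# start (10) before forwarding, so CustomEnv.step receives 10 or 11.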

@araffin added the documentation, good first issue, and help wanted labels and removed the bug and check the checklist labels Dec 6, 2024
@JoshuaBluem
Author

After looking again at the PR, I think it would be better to document what such a wrapper would look like rather than make the codebase more complex.

I am confident that everyone will perceive the current state of discrete action spaces in the project as a bug. If users are required to implement the ShiftWrapper themselves, it will likely be viewed as a workaround. If such a ShiftWrapper is truly required, it should already be implemented in the library rather than added individually by each user.

From my perspective, adding an additional wrapper does not improve readability, as its logic differs from the Box scaling logic. Box spaces already get scaled, and the primary change in my pull request is the "elif isinstance(..., (Discrete, MultiDiscrete))" block. Its logic largely mirrors the initial "if" block for Box spaces, using "scale" and "unscale". Therefore, anyone who understands the first "if" statement should also be able to understand the second. In fact, they might even understand the first one better upon encountering the same logic applied to a different space.

The handling of discrete action spaces in the scale and unscale methods also follows the same logic as the Box scaling approach: transforming between the action bounds and a standard range.
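To make the analogy concrete, here is a hypothetical sketch of the approach being described (not the actual PR code): SB3 policies scale Box actions between the env bounds and [-1, 1], and the same pattern, applied to a Discrete space with a non-zero start, would shift between the env's range and a zero-based one.

import numpy as np
from gymnasium import spaces

def scale_action(action_space, action):
    if isinstance(action_space, spaces.Box):
        # Existing Box logic in SB3: map [low, high] -> [-1, 1]
        low, high = action_space.low, action_space.high
        return 2.0 * ((action - low) / (high - low)) - 1.0
    elif isinstance(action_space, spaces.Discrete):
        # Hypothetical analogue: map {start, ..., start + n - 1} -> {0, ..., n - 1}
        return action - action_space.start
    return action

def unscale_action(action_space, scaled_action):
    if isinstance(action_space, spaces.Box):
        low, high = action_space.low, action_space.high
        return low + 0.5 * (scaled_action + 1.0) * (high - low)
    elif isinstance(action_space, spaces.Discrete):
        return scaled_action + action_space.start
    return scaled_action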

@araffin
Member

araffin commented Dec 11, 2024

I am confident that everyone will perceive the current state of discrete action spaces in the project as a bug.

SB3 intentionally doesn't support all features from Gymnasium.
This is a choice made for simplicity, to keep the project maintainable (see the SB3 blog post).
That's also why we have an env checker (see the docs, also linked in the issue template): to avoid any bad surprises for our users.

If users are required to implement the ShiftWrapper themselves, it will likely be viewed as a workaround.
From my perspective, adding an additional wrapper does not improve readability,

It is a workaround, but a simple one, and it is not the only solution.
A better solution is to have start=0 in your env and simply shift the action inside the step method (a two-line change, vs 10+ lines in SB3).
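A minimal sketch of that alternative, rewriting the CustomEnv from this issue so the offset lives inside the env (same task and imports as the reproduction code; only the action space line and the shift line change):

class CustomEnv(gym.Env):
    def __init__(self, render_mode='human'):
        super().__init__()
        self.observation_space = spaces.Discrete(2)
        self.action_space = spaces.Discrete(2)  # start=0, as SB3 expects
        self.current_state = np.random.choice([0, 1])
        self.render_mode = render_mode

    def step(self, action):
        action = action + 10  # shift into the {10, 11} range inside the env
        reward = 1 if action - 10 == self.current_state else 0
        self.current_state = np.random.choice([0, 1])
        return self.current_state, reward, True, False, {}

    # reset() and render() stay the same as in the original env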

I'm also sorry that you put work into code that might not be used in the end.
This is why we require opening an issue first: to discuss and agree on a solution before implementing anything (see the contributing guide and PR template).
