
Commit 09bc3af

Author: Vincent Moens (committed)
Commit message: Update
[ghstack-poisoned]
2 parents: 3f1aadc + 6b45e9b

38 files changed (+1257 -359 lines)

.github/scripts/td_script.sh

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 #!/bin/bash
 
-export TORCHRL_BUILD_VERSION=0.6.0
+export TORCHRL_BUILD_VERSION=0.6.1
 
 ${CONDA_RUN} pip install git+https://github.com/pytorch/tensordict.git -U

.github/scripts/version_script.bat

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 @echo off
-set TORCHRL_BUILD_VERSION=0.6.0
+set TORCHRL_BUILD_VERSION=0.6.1
 echo TORCHRL_BUILD_VERSION is set to %TORCHRL_BUILD_VERSION%

.github/workflows/wheels-legacy.yml

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ jobs:
        shell: bash
        run: |
          python3 -mpip install wheel
-         TORCHRL_BUILD_VERSION=0.6.0 python3 setup.py bdist_wheel
+         TORCHRL_BUILD_VERSION=0.6.1 python3 setup.py bdist_wheel
      - name: Upload wheel for the test-wheel job
        uses: actions/upload-artifact@v3
        with:

docs/source/reference/envs.rst

Lines changed: 1 addition & 0 deletions
@@ -845,6 +845,7 @@ to be able to create this other composition:
     TensorDictPrimer
     TimeMaxPool
     ToTensorImage
+    TrajCounter
     UnsqueezeTransform
     VC1Transform
     VIPRewardTransform
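
The only change here is the new TrajCounter entry in the transform list. As a rough usage sketch (an illustration, not taken from the diff; it assumes TrajCounter is re-exported from torchrl.envs like the other transforms and follows the standard Transform API, and the exact output key name is not confirmed by this diff):

>>> # Hypothetical sketch: append the new transform so rollouts carry a trajectory counter.
>>> from torchrl.envs import GymEnv, TransformedEnv, TrajCounter
>>> env = TransformedEnv(GymEnv("Pendulum-v1"), TrajCounter())
>>> rollout = env.rollout(3)
>>> # the rollout tensordict should now contain a trajectory-count entry;
>>> # check the TrajCounter documentation for the exact key name.
>>> print(rollout)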

docs/source/reference/modules.rst

Lines changed: 54 additions & 159 deletions
@@ -92,8 +92,7 @@ Some algorithms such as PPO require a probabilistic policy to be implemented.
 In TorchRL, these policies take the form of a model, followed by a distribution
 constructor.
 
-.. note::
-  The choice of a probabilistic or regular actor class depends on the algorithm
+.. note:: The choice of a probabilistic or regular actor class depends on the algorithm
   that is being implemented. On-policy algorithms usually require a probabilistic
   actor, off-policy usually have a deterministic actor with an extra exploration
   strategy. There are, however, many exceptions to this rule.
@@ -103,8 +102,12 @@ and outputs the parameters of a distribution, while the distribution constructor
 reads these parameters and gets a random sample from the distribution and/or
 provides a :class:`torch.distributions.Distribution` object.
 
->>> from tensordict.nn import NormalParamExtractor, TensorDictSequential
+>>> from tensordict.nn import NormalParamExtractor, TensorDictSequential, TensorDictModule
+>>> from torchrl.modules import SafeProbabilisticModule
+>>> from torchrl.envs import GymEnv
 >>> from torch.distributions import Normal
+>>> from torch import nn
+>>>
 >>> env = GymEnv("Pendulum-v1")
 >>> action_spec = env.action_spec
 >>> model = nn.Sequential(nn.LazyLinear(action_spec.shape[-1] * 2), NormalParamExtractor())
@@ -125,6 +128,7 @@ provides a :class:`torch.distributions.Distribution` object.
 To facilitate the construction of probabilistic policies, we provide a dedicated
 :class:`~torchrl.modules.tensordict_module.ProbabilisticActor`:
 
+>>> from torchrl.modules import ProbabilisticActor
 >>> policy = ProbabilisticActor(
 ...     model,
 ...     in_keys=["loc", "scale"],
@@ -154,69 +158,31 @@ of this action.
 Q-Value actors
 ~~~~~~~~~~~~~~
 
-Q-Value actors are a special type of policy that does not directly predict an action
-from an observation, but picks the action that maximised the value (or *quality*)
-of a (s,a) -> v map. This map can be a table or a function.
-For discrete action spaces with continuous (or near-continuous such as pixels)
-states, it is customary to use a non-linear model such as a neural network for
-the map.
-The semantic of the Q-Value network is hopefully quite simple: we just need to
-feed a tensor-to-tensor map that given a certain state (the input tensor),
-outputs a list of action values to choose from. The wrapper will write the
-resulting action in the input tensordict along with the list of action values.
+Q-Value actors are a type of policy that selects actions based on the maximum value
+(or "quality") of a state-action pair. This value can be represented as a table or a
+function. For discrete action spaces with continuous states, it's common to use a non-linear
+model like a neural network to represent this function.
 
->>> import torch
->>> from tensordict import TensorDict
->>> from tensordict.nn.functional_modules import make_functional
->>> from torch import nn
->>> from torchrl.data import OneHot
->>> from torchrl.modules.tensordict_module.actors import QValueActor
->>> td = TensorDict({'observation': torch.randn(5, 3)}, [5])
->>> # we have 4 actions to choose from
->>> action_spec = OneHot(4)
->>> # the model reads a state of dimension 3 and outputs 4 values, one for each action available
->>> module = nn.Linear(3, 4)
->>> qvalue_actor = QValueActor(module=module, spec=action_spec)
->>> qvalue_actor(td)
->>> print(td)
-TensorDict(
-    fields={
-        action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
-        action_value: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False),
-        chosen_action_value: Tensor(shape=torch.Size([5, 1]), device=cpu, dtype=torch.float32, is_shared=False),
-        observation: Tensor(shape=torch.Size([5, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
-    batch_size=torch.Size([5]),
-    device=None,
-    is_shared=False)
+QValueActor
+^^^^^^^^^^^
 
-Distributional Q-learning is slightly different: in this case, the value network
-does not output a scalar value for each state-action value.
-Instead, the value space is divided in a an arbitrary number of "bins". The
-value network outputs a probability that the state-action value belongs to one bin
-or another.
-Hence, for a state space of dimension M, an action space of dimension N and a number of bins B,
-the value network encodes a
-of a (s,a) -> v map. This map can be a table or a function.
-For discrete action spaces with continuous (or near-continuous such as pixels)
-states, it is customary to use a non-linear model such as a neural network for
-the map.
-The semantic of the Q-Value network is hopefully quite simple: we just need to
-feed a tensor-to-tensor map that given a certain state (the input tensor),
-outputs a list of action values to choose from. The wrapper will write the
-resulting action in the input tensordict along with the list of action values.
+The :class:`~torchrl.modules.QValueActor` class takes in a module and an action
+specification, and outputs the selected action and its corresponding value.
 
 >>> import torch
 >>> from tensordict import TensorDict
->>> from tensordict.nn.functional_modules import make_functional
 >>> from torch import nn
 >>> from torchrl.data import OneHot
 >>> from torchrl.modules.tensordict_module.actors import QValueActor
+>>> # Create a tensor dict with an observation
 >>> td = TensorDict({'observation': torch.randn(5, 3)}, [5])
->>> # we have 4 actions to choose from
+>>> # Define the action space
 >>> action_spec = OneHot(4)
->>> # the model reads a state of dimension 3 and outputs 4 values, one for each action available
+>>> # Create a linear module to output action values
 >>> module = nn.Linear(3, 4)
+>>> # Create a QValueActor instance
 >>> qvalue_actor = QValueActor(module=module, spec=action_spec)
+>>> # Run the actor on the tensor dict
 >>> qvalue_actor(td)
 >>> print(td)
 TensorDict(
@@ -229,122 +195,48 @@ resulting action in the input tensordict along with the list of action values.
     device=None,
     is_shared=False)
 
-Distributional Q-learning is slightly different: in this case, the value network
-does not output a scalar value for each state-action value.
-Instead, the value space is divided in a an arbitrary number of "bins". The
-value network outputs a probability that the state-action value belongs to one bin
-or another.
-Hence, for a state space of dimension M, an action space of dimension N and a number of bins B,
-the value network encodes a
-of a (s,a) -> v map. This map can be a table or a function.
-For discrete action spaces with continuous (or near-continuous such as pixels)
-states, it is customary to use a non-linear model such as a neural network for
-the map.
-The semantic of the Q-Value network is hopefully quite simple: we just need to
-feed a tensor-to-tensor map that given a certain state (the input tensor),
-outputs a list of action values to choose from. The wrapper will write the
-resulting action in the input tensordict along with the list of action values.
+This will output a tensor dict with the selected action and its corresponding value.
+
+Distributional Q-Learning
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Distributional Q-learning is a variant of Q-learning that represents the value function as a
+probability distribution over possible values, rather than a single scalar value.
+This allows the agent to learn about the uncertainty in the environment and make more informed
+decisions.
+In TorchRL, distributional Q-learning is implemented using the :class:`~torchrl.modules.DistributionalQValueActor`
+class. This class takes in a module, an action specification, and a support vector, and outputs the selected
+action and its corresponding value distribution.
+
 
 >>> import torch
 >>> from tensordict import TensorDict
->>> from tensordict.nn.functional_modules import make_functional
 >>> from torch import nn
 >>> from torchrl.data import OneHot
->>> from torchrl.modules.tensordict_module.actors import QValueActor
->>> td = TensorDict({'observation': torch.randn(5, 3)}, [5])
->>> # we have 4 actions to choose from
+>>> from torchrl.modules import DistributionalQValueActor, MLP
+>>> # Create a tensor dict with an observation
+>>> td = TensorDict({'observation': torch.randn(5, 4)}, [5])
+>>> # Define the action space
 >>> action_spec = OneHot(4)
->>> # the model reads a state of dimension 3 and outputs 4 values, one for each action available
->>> module = nn.Linear(3, 4)
->>> qvalue_actor = QValueActor(module=module, spec=action_spec)
->>> qvalue_actor(td)
+>>> # Define the number of bins for the value distribution
+>>> nbins = 3
+>>> # Create an MLP module to output logits for the value distribution
+>>> module = MLP(out_features=(nbins, 4), depth=2)
+>>> # Create a DistributionalQValueActor instance
+>>> qvalue_actor = DistributionalQValueActor(module=module, spec=action_spec, support=torch.arange(nbins))
+>>> # Run the actor on the tensor dict
+>>> td = qvalue_actor(td)
 >>> print(td)
 TensorDict(
     fields={
         action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
-        action_value: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False),
-        chosen_action_value: Tensor(shape=torch.Size([5, 1]), device=cpu, dtype=torch.float32, is_shared=False),
-        observation: Tensor(shape=torch.Size([5, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
+        action_value: Tensor(shape=torch.Size([5, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False),
+        observation: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
     batch_size=torch.Size([5]),
     device=None,
     is_shared=False)
 
-Distributional Q-learning is slightly different: in this case, the value network
-does not output a scalar value for each state-action value.
-Instead, the value space is divided in a an arbitrary number of "bins". The
-value network outputs a probability that the state-action value belongs to one bin
-or another.
-Hence, for a state space of dimension M, an action space of dimension N and a number of bins B,
-the value network encodes a :math:`\mathbb{R}^{M} \rightarrow \mathbb{R}^{N \times B}`
-map. The following example shows how this works in TorchRL with the :class:`~torchrl.modules.tensordict_module.DistributionalQValueActor`
-class:
-
->>> import torch
->>> from tensordict import TensorDict
->>> from torch import nn
->>> from torchrl.data import OneHot
->>> from torchrl.modules import DistributionalQValueActor, MLP
->>> td = TensorDict({'observation': torch.randn(5, 4)}, [5])
->>> nbins = 3
->>> # our model reads the observation and outputs a stack of 4 logits (one for each action) of size nbins=3
->>> module = MLP(out_features=(nbins, 4), depth=2)
->>> action_spec = OneHot(4)
->>> qvalue_actor = DistributionalQValueActor(module=module, spec=action_spec, support=torch.arange(nbins))
->>> td = qvalue_actor(td)
->>> print(td)
-TensorDict(
-    fields={
-        action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
-        action_value: Tensor(shape=torch.Size([5, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False),
-        observation: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
-    batch_size=torch.Size([5]),
-    device=None,
-    is_shared=False)
-
->>> import torch
->>> from tensordict import TensorDict
->>> from torch import nn
->>> from torchrl.data import OneHot
->>> from torchrl.modules import DistributionalQValueActor, MLP
->>> td = TensorDict({'observation': torch.randn(5, 4)}, [5])
->>> nbins = 3
->>> # our model reads the observation and outputs a stack of 4 logits (one for each action) of size nbins=3
->>> module = MLP(out_features=(nbins, 4), depth=2)
->>> action_spec = OneHot(4)
->>> qvalue_actor = DistributionalQValueActor(module=module, spec=action_spec, support=torch.arange(nbins))
->>> td = qvalue_actor(td)
->>> print(td)
-TensorDict(
-    fields={
-        action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
-        action_value: Tensor(shape=torch.Size([5, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False),
-        observation: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
-    batch_size=torch.Size([5]),
-    device=None,
-    is_shared=False)
-
->>> import torch
->>> from tensordict import TensorDict
->>> from torch import nn
->>> from torchrl.data import OneHot
->>> from torchrl.modules import DistributionalQValueActor, MLP
->>> td = TensorDict({'observation': torch.randn(5, 4)}, [5])
->>> nbins = 3
->>> # our model reads the observation and outputs a stack of 4 logits (one for each action) of size nbins=3
->>> module = MLP(out_features=(nbins, 4), depth=2)
->>> action_spec = OneHot(4)
->>> qvalue_actor = DistributionalQValueActor(module=module, spec=action_spec, support=torch.arange(nbins))
->>> td = qvalue_actor(td)
->>> print(td)
-TensorDict(
-    fields={
-        action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
-        action_value: Tensor(shape=torch.Size([5, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False),
-        observation: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
-    batch_size=torch.Size([5]),
-    device=None,
-    is_shared=False)
-
+This will output a tensor dict with the selected action and its corresponding value distribution.
 
 .. currentmodule:: torchrl.modules.tensordict_module
 
@@ -403,11 +295,10 @@ without shared parameters. It is mainly intended as a replacement for
 
 Domain-specific TensorDict modules
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. currentmodule:: torchrl.modules.tensordict_module
 
 These modules include dedicated solutions for MBRL or RLHF pipelines.
 
-.. currentmodule:: torchrl.modules.tensordict_module
-
 .. autosummary::
     :toctree: generated/
     :template: rl_template_noinherit.rst
@@ -553,12 +444,16 @@ Some distributions are typically used in RL scripts.
     OneHotCategorical
     MaskedCategorical
     MaskedOneHotCategorical
+    Ordinal
+    OneHotOrdinal
 
 Utils
 -----
-
 .. currentmodule:: torchrl.modules.utils
 
+The module utils include functionals used to do some custom mappings as well as a tool to
+build :class:`~torchrl.envs.TensorDictPrimer` instances from a given module.
+
 .. autosummary::
     :toctree: generated/
     :template: rl_template_noinherit.rst
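
The new Utils paragraph mentions a tool that builds :class:`~torchrl.envs.TensorDictPrimer` instances from a given module but gives no example. A minimal sketch follows, assuming the tool in question is torchrl.modules.utils.get_primers_from_module and that :class:`~torchrl.modules.LSTMModule` accepts the in_key/out_key shortcuts used here:

>>> # Hypothetical sketch: recover the primer a recurrent module needs so that its
>>> # hidden states are pre-allocated in collected data.
>>> from torchrl.modules import LSTMModule
>>> from torchrl.modules.utils import get_primers_from_module
>>> lstm = LSTMModule(input_size=3, hidden_size=16, in_key="observation", out_key="features")
>>> primer = get_primers_from_module(lstm)
>>> print(primer)  # a TensorDictPrimer that can be appended to the environment's transforms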

setup.py

Lines changed: 1 addition & 1 deletion
@@ -176,7 +176,7 @@ def _main(argv):
     if is_nightly:
         tensordict_dep = "tensordict-nightly"
     else:
-        tensordict_dep = "tensordict>=0.6.0"
+        tensordict_dep = "tensordict>=0.6.1"
 
     if is_nightly:
         version = get_nightly_version()
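
For context on how the bumped values travel through the build, a hypothetical sketch is shown below. It is not the actual torchrl setup.py (the real flag and helper names may differ); it only mirrors the nightly/release branching visible in the hunk above and the TORCHRL_BUILD_VERSION variable exported by the CI scripts earlier in this commit.

# Hypothetical sketch, not the real setup.py: select the package version and the
# tensordict dependency depending on whether this is a nightly build.
import os
from datetime import datetime, timezone

def pick_version_and_dep(is_nightly: bool):
    if is_nightly:
        # nightly wheels typically use a date-based version and track tensordict-nightly
        version = datetime.now(timezone.utc).strftime("%Y.%m.%d")
        tensordict_dep = "tensordict-nightly"
    else:
        # release wheels read the version exported by the CI scripts (0.6.1 in this commit)
        version = os.getenv("TORCHRL_BUILD_VERSION", "0.6.1")
        tensordict_dep = "tensordict>=0.6.1"
    return version, tensordict_dep

print(pick_version_and_dep(is_nightly=False))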

test/test_actors.py

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,6 @@
 import pytest
 import torch
 
-from mocking_classes import NestedCountingEnv
 from tensordict import TensorDict
 from tensordict.nn import CompositeDistribution, TensorDictModule
 from tensordict.nn.distributions import NormalParamExtractor
@@ -33,8 +32,10 @@
 
 if os.getenv("PYTORCH_TEST_FBCODE"):
     from pytorch.rl.test._utils_internal import get_default_devices
+    from pytorch.rl.test.mocking_classes import NestedCountingEnv
 else:
     from _utils_internal import get_default_devices
+    from mocking_classes import NestedCountingEnv
 
 
 @pytest.mark.parametrize(
