@@ -92,8 +92,7 @@ Some algorithms such as PPO require a probabilistic policy to be implemented.
In TorchRL, these policies take the form of a model, followed by a distribution
constructor.

- .. note::
-   The choice of a probabilistic or regular actor class depends on the algorithm
+ .. note:: The choice of a probabilistic or regular actor class depends on the algorithm
  that is being implemented. On-policy algorithms usually require a probabilistic
  actor, off-policy usually have a deterministic actor with an extra exploration
  strategy. There are, however, many exceptions to this rule.
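
As a rough sketch of the off-policy case mentioned in the note (not part of the original docs, and assuming the :class:`~torchrl.modules.EGreedyModule` API), a deterministic actor can be combined with an exploration strategy by chaining the two modules:

>>> from torch import nn
>>> from tensordict.nn import TensorDictSequential
>>> from torchrl.data import OneHot
>>> from torchrl.modules import EGreedyModule, QValueActor
>>> action_spec = OneHot(4)
>>> # a deterministic (greedy) actor over 4 discrete actions
>>> actor = QValueActor(module=nn.LazyLinear(4), spec=action_spec)
>>> # epsilon-greedy exploration layered on top of the greedy action choice
>>> explorative_policy = TensorDictSequential(
...     actor,
...     EGreedyModule(spec=action_spec, eps_init=1.0, eps_end=0.05),
... )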
@@ -103,8 +102,12 @@ and outputs the parameters of a distribution, while the distribution constructor
reads these parameters and gets a random sample from the distribution and/or
provides a :class:`torch.distributions.Distribution` object.

- >>> from tensordict.nn import NormalParamExtractor, TensorDictSequential
+ >>> from tensordict.nn import NormalParamExtractor, TensorDictSequential, TensorDictModule
+ >>> from torchrl.modules import SafeProbabilisticModule
+ >>> from torchrl.envs import GymEnv
>>> from torch.distributions import Normal
+ >>> from torch import nn
+ >>>
>>> env = GymEnv("Pendulum-v1")
>>> action_spec = env.action_spec
>>> model = nn.Sequential(nn.LazyLinear(action_spec.shape[-1] * 2), NormalParamExtractor())
@@ -125,6 +128,7 @@ provides a :class:`torch.distributions.Distribution` object.
To facilitate the construction of probabilistic policies, we provide a dedicated
:class:`~torchrl.modules.tensordict_module.ProbabilisticActor`:

+ >>> from torchrl.modules import ProbabilisticActor
>>> policy = ProbabilisticActor(
...     model,
...     in_keys=["loc", "scale"],
@@ -154,69 +158,31 @@ of this action.
Q-Value actors
~~~~~~~~~~~~~~

- Q-Value actors are a special type of policy that does not directly predict an action
- from an observation, but picks the action that maximised the value (or *quality*)
- of a (s,a) -> v map. This map can be a table or a function.
- For discrete action spaces with continuous (or near-continuous such as pixels)
- states, it is customary to use a non-linear model such as a neural network for
- the map.
- The semantic of the Q-Value network is hopefully quite simple: we just need to
- feed a tensor-to-tensor map that given a certain state (the input tensor),
- outputs a list of action values to choose from. The wrapper will write the
- resulting action in the input tensordict along with the list of action values.
+ Q-Value actors are a type of policy that selects actions based on the maximum value
+ (or "quality") of a state-action pair. This value can be represented as a table or a
+ function. For discrete action spaces with continuous states, it's common to use a non-linear
+ model like a neural network to represent this function.

- >>> import torch
- >>> from tensordict import TensorDict
- >>> from tensordict.nn.functional_modules import make_functional
- >>> from torch import nn
- >>> from torchrl.data import OneHot
- >>> from torchrl.modules.tensordict_module.actors import QValueActor
- >>> td = TensorDict({'observation': torch.randn(5, 3)}, [5])
- >>> # we have 4 actions to choose from
- >>> action_spec = OneHot(4)
- >>> # the model reads a state of dimension 3 and outputs 4 values, one for each action available
- >>> module = nn.Linear(3, 4)
- >>> qvalue_actor = QValueActor(module=module, spec=action_spec)
- >>> qvalue_actor(td)
- >>> print(td)
- TensorDict(
-     fields={
-         action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
-         action_value: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False),
-         chosen_action_value: Tensor(shape=torch.Size([5, 1]), device=cpu, dtype=torch.float32, is_shared=False),
-         observation: Tensor(shape=torch.Size([5, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
-     batch_size=torch.Size([5]),
-     device=None,
-     is_shared=False)
+ QValueActor
+ ^^^^^^^^^^^

- Distributional Q-learning is slightly different: in this case, the value network
- does not output a scalar value for each state-action value.
- Instead, the value space is divided in a an arbitrary number of "bins". The
- value network outputs a probability that the state-action value belongs to one bin
- or another.
- Hence, for a state space of dimension M, an action space of dimension N and a number of bins B,
- the value network encodes a
- of a (s,a) -> v map. This map can be a table or a function.
- For discrete action spaces with continuous (or near-continuous such as pixels)
- states, it is customary to use a non-linear model such as a neural network for
- the map.
- The semantic of the Q-Value network is hopefully quite simple: we just need to
- feed a tensor-to-tensor map that given a certain state (the input tensor),
- outputs a list of action values to choose from. The wrapper will write the
- resulting action in the input tensordict along with the list of action values.
+ The :class:`~torchrl.modules.QValueActor` class takes in a module and an action
+ specification, and outputs the selected action and its corresponding value.

>>> import torch
>>> from tensordict import TensorDict
- >>> from tensordict.nn.functional_modules import make_functional
>>> from torch import nn
>>> from torchrl.data import OneHot
>>> from torchrl.modules.tensordict_module.actors import QValueActor
+ >>> # Create a tensor dict with an observation
>>> td = TensorDict({'observation': torch.randn(5, 3)}, [5])
- >>> # we have 4 actions to choose from
+ >>> # Define the action space
>>> action_spec = OneHot(4)
- >>> # the model reads a state of dimension 3 and outputs 4 values, one for each action available
+ >>> # Create a linear module to output action values
>>> module = nn.Linear(3, 4)
+ >>> # Create a QValueActor instance
>>> qvalue_actor = QValueActor(module=module, spec=action_spec)
+ >>> # Run the actor on the tensor dict
>>> qvalue_actor(td)
>>> print(td)
TensorDict(
@@ -229,122 +195,48 @@ resulting action in the input tensordict along with the list of action values.
    device=None,
    is_shared=False)

- Distributional Q-learning is slightly different: in this case, the value network
- does not output a scalar value for each state-action value.
- Instead, the value space is divided in a an arbitrary number of "bins". The
- value network outputs a probability that the state-action value belongs to one bin
- or another.
- Hence, for a state space of dimension M, an action space of dimension N and a number of bins B,
- the value network encodes a
- of a (s,a) -> v map. This map can be a table or a function.
- For discrete action spaces with continuous (or near-continuous such as pixels)
- states, it is customary to use a non-linear model such as a neural network for
- the map.
- The semantic of the Q-Value network is hopefully quite simple: we just need to
- feed a tensor-to-tensor map that given a certain state (the input tensor),
- outputs a list of action values to choose from. The wrapper will write the
- resulting action in the input tensordict along with the list of action values.
+ This will output a tensor dict with the selected action and its corresponding value.
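
For instance (a short usage sketch based on the keys shown in the output above, not part of the original example), the selected action and its value can be read back from the tensordict:

>>> td["action"]               # one-hot action, shape [5, 4]
>>> td["chosen_action_value"]  # value of the selected action, shape [5, 1]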
+
+ Distributional Q-Learning
+ ^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ Distributional Q-learning is a variant of Q-learning that represents the value function as a
+ probability distribution over possible values, rather than a single scalar value.
+ This allows the agent to learn about the uncertainty in the environment and make more informed
+ decisions.
+ In TorchRL, distributional Q-learning is implemented using the :class:`~torchrl.modules.DistributionalQValueActor`
+ class. This class takes in a module, an action specification, and a support vector, and outputs the selected
+ action and its corresponding value distribution.
+

>>> import torch
>>> from tensordict import TensorDict
- >>> from tensordict.nn.functional_modules import make_functional
>>> from torch import nn
>>> from torchrl.data import OneHot
- >>> from torchrl.modules.tensordict_module.actors import QValueActor
- >>> td = TensorDict({'observation': torch.randn(5, 3)}, [5])
- >>> # we have 4 actions to choose from
+ >>> from torchrl.modules import DistributionalQValueActor, MLP
+ >>> # Create a tensor dict with an observation
+ >>> td = TensorDict({'observation': torch.randn(5, 4)}, [5])
+ >>> # Define the action space
>>> action_spec = OneHot(4)
- >>> # the model reads a state of dimension 3 and outputs 4 values, one for each action available
- >>> module = nn.Linear(3, 4)
- >>> qvalue_actor = QValueActor(module=module, spec=action_spec)
- >>> qvalue_actor(td)
+ >>> # Define the number of bins for the value distribution
+ >>> nbins = 3
+ >>> # Create an MLP module to output logits for the value distribution
+ >>> module = MLP(out_features=(nbins, 4), depth=2)
+ >>> # Create a DistributionalQValueActor instance
+ >>> qvalue_actor = DistributionalQValueActor(module=module, spec=action_spec, support=torch.arange(nbins))
+ >>> # Run the actor on the tensor dict
+ >>> td = qvalue_actor(td)
>>> print(td)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
-         action_value: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False),
-         chosen_action_value: Tensor(shape=torch.Size([5, 1]), device=cpu, dtype=torch.float32, is_shared=False),
-         observation: Tensor(shape=torch.Size([5, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
+         action_value: Tensor(shape=torch.Size([5, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False),
+         observation: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([5]),
    device=None,
    is_shared=False)

- Distributional Q-learning is slightly different: in this case, the value network
- does not output a scalar value for each state-action value.
- Instead, the value space is divided in a an arbitrary number of "bins". The
- value network outputs a probability that the state-action value belongs to one bin
- or another.
- Hence, for a state space of dimension M, an action space of dimension N and a number of bins B,
- the value network encodes a :math:`\mathbb{R}^{M} \rightarrow \mathbb{R}^{N \times B}`
- map. The following example shows how this works in TorchRL with the :class:`~torchrl.modules.tensordict_module.DistributionalQValueActor`
- class:
-
- >>> import torch
- >>> from tensordict import TensorDict
- >>> from torch import nn
- >>> from torchrl.data import OneHot
- >>> from torchrl.modules import DistributionalQValueActor, MLP
- >>> td = TensorDict({'observation': torch.randn(5, 4)}, [5])
- >>> nbins = 3
- >>> # our model reads the observation and outputs a stack of 4 logits (one for each action) of size nbins=3
- >>> module = MLP(out_features=(nbins, 4), depth=2)
- >>> action_spec = OneHot(4)
- >>> qvalue_actor = DistributionalQValueActor(module=module, spec=action_spec, support=torch.arange(nbins))
- >>> td = qvalue_actor(td)
- >>> print(td)
- TensorDict(
-     fields={
-         action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
-         action_value: Tensor(shape=torch.Size([5, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False),
-         observation: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
-     batch_size=torch.Size([5]),
-     device=None,
-     is_shared=False)
-
- >>> import torch
- >>> from tensordict import TensorDict
- >>> from torch import nn
- >>> from torchrl.data import OneHot
- >>> from torchrl.modules import DistributionalQValueActor, MLP
- >>> td = TensorDict({'observation': torch.randn(5, 4)}, [5])
- >>> nbins = 3
- >>> # our model reads the observation and outputs a stack of 4 logits (one for each action) of size nbins=3
- >>> module = MLP(out_features=(nbins, 4), depth=2)
- >>> action_spec = OneHot(4)
- >>> qvalue_actor = DistributionalQValueActor(module=module, spec=action_spec, support=torch.arange(nbins))
- >>> td = qvalue_actor(td)
- >>> print(td)
- TensorDict(
-     fields={
-         action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
-         action_value: Tensor(shape=torch.Size([5, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False),
-         observation: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
-     batch_size=torch.Size([5]),
-     device=None,
-     is_shared=False)
-
- >>> import torch
- >>> from tensordict import TensorDict
- >>> from torch import nn
- >>> from torchrl.data import OneHot
- >>> from torchrl.modules import DistributionalQValueActor, MLP
- >>> td = TensorDict({'observation': torch.randn(5, 4)}, [5])
- >>> nbins = 3
- >>> # our model reads the observation and outputs a stack of 4 logits (one for each action) of size nbins=3
- >>> module = MLP(out_features=(nbins, 4), depth=2)
- >>> action_spec = OneHot(4)
- >>> qvalue_actor = DistributionalQValueActor(module=module, spec=action_spec, support=torch.arange(nbins))
- >>> td = qvalue_actor(td)
- >>> print(td)
- TensorDict(
-     fields={
-         action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
-         action_value: Tensor(shape=torch.Size([5, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False),
-         observation: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
-     batch_size=torch.Size([5]),
-     device=None,
-     is_shared=False)
-
+ This will output a tensor dict with the selected action and its corresponding value distribution.
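
As a rough follow-up sketch (not from the original docs, and assuming the ``action_value`` entry holds log-probabilities over the bins), scalar Q-values can be recovered by averaging the support under the predicted distribution:

>>> support = torch.arange(nbins, dtype=torch.float)
>>> probs = td["action_value"].exp()                     # [5, nbins, 4]
>>> q_values = (probs * support.view(1, -1, 1)).sum(-2)  # [5, 4], expected value per action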

.. currentmodule:: torchrl.modules.tensordict_module
@@ -403,11 +295,10 @@ without shared parameters. It is mainly intended as a replacement for

Domain-specific TensorDict modules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ .. currentmodule:: torchrl.modules.tensordict_module

These modules include dedicated solutions for MBRL or RLHF pipelines.

- .. currentmodule:: torchrl.modules.tensordict_module
-
.. autosummary::
    :toctree: generated/
    :template: rl_template_noinherit.rst
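
As a hedged illustration of the MBRL-oriented modules (a sketch that assumes the :class:`~torchrl.modules.tensordict_module.WorldModelWrapper` API and that :class:`~torchrl.modules.MLP` concatenates multiple inputs), a world model can be assembled from a transition model and a reward model:

>>> from tensordict.nn import TensorDictModule
>>> from torchrl.modules import MLP, WorldModelWrapper
>>> # transition model: reads the current state and action, writes the next state
>>> transition_model = TensorDictModule(
...     MLP(out_features=8, depth=1), in_keys=["state", "action"], out_keys=["state"]
... )
>>> # reward model: reads the (predicted) state, writes the reward
>>> reward_model = TensorDictModule(
...     MLP(out_features=1, depth=1), in_keys=["state"], out_keys=["reward"]
... )
>>> world_model = WorldModelWrapper(transition_model, reward_model)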
@@ -553,12 +444,16 @@ Some distributions are typically used in RL scripts.
    OneHotCategorical
    MaskedCategorical
    MaskedOneHotCategorical
+     Ordinal
+     OneHotOrdinal

Utils
-----
-
.. currentmodule:: torchrl.modules.utils

+ The module utils include functions used for custom mappings, as well as a tool to
+ build :class:`~torchrl.envs.TensorDictPrimer` instances from a given module.
+
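As a hedged example (assuming the ``get_primers_from_module`` helper available in ``torchrl.modules``), the primer-building tool can extract the :class:`~torchrl.envs.TensorDictPrimer` transform that a recurrent module needs:

>>> from torchrl.modules import LSTMModule, get_primers_from_module
>>> lstm = LSTMModule(input_size=3, hidden_size=8, in_key="observation", out_key="features")
>>> # builds a TensorDictPrimer that initializes the recurrent state entries
>>> primer = get_primers_from_module(lstm)
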
.. autosummary::
    :toctree: generated/
    :template: rl_template_noinherit.rst