diff --git a/README.md b/README.md index 1078eba1..502a4d42 100644 --- a/README.md +++ b/README.md @@ -15,7 +15,8 @@ ------- ## Latest Updates -[08/09/2021] **v0.1.0**: Initial code and paper release +- [12/16/2021] **v0.2.0**: Modular observation modalities and encoders :wrench:, support for [MOMART](https://sites.google.com/view/il-for-mm/home) datasets :open_file_folder: +- [08/09/2021] **v0.1.0**: Initial code and paper release ------- @@ -24,10 +25,11 @@ Imitating human demonstrations is a promising approach to endow robots with various manipulation capabilities. While recent advances have been made in imitation learning and batch (offline) reinforcement learning, a lack of open-source human datasets and reproducible learning methods make assessing the state of the field difficult. The overarching goal of **robomimic** is to provide researchers and practitioners with: - a **standardized set of large demonstration datasets** across several benchmarking tasks to facilitate fair comparisons, with a focus on learning from human-provided demonstrations +- a **standardized set of large demonstration datasets** across several benchmarking tasks to facilitate fair comparisons, with a focus on learning from human-provided demonstrations (see [this link](https://arise-initiative.github.io/robomimic-web/docs/introduction/quickstart.html#supported-datasets) for a list of supported datasets) - **high-quality implementations of several learning algorithms** for training closed-loop policies from offline datasets to make reproducing results easy and lower the barrier to entry - a **modular design** that offers great flexibility in extending algorithms and designing new algorithms -This release of **robomimic** contains seven offline learning [algorithms](https://arise-initiative.github.io/robomimic-web/docs/modules/algorithms.html) and standardized [datasets](https://arise-initiative.github.io/robomimic-web/docs/introduction/results.html) collected across five simulated and three real-world multi-stage manipulation tasks of varying complexity. We highlight some features below: +This release of **robomimic** contains seven offline learning [algorithms](https://arise-initiative.github.io/robomimic-web/docs/modules/algorithms.html) and standardized [datasets](https://arise-initiative.github.io/robomimic-web/docs/introduction/results.html) collected across five simulated and three real-world multi-stage manipulation tasks of varying complexity. We highlight some features below (for a more thorough list of features, see [this link](https://arise-initiative.github.io/robomimic-web/docs/introduction/quickstart.html#features-overview)): - **standardized datasets:** a set of datasets collected from different sources (single proficient human, multiple humans, and machine-generated) across several simulated and real-world tasks, along with a plug-and-play [Dataset class](https://arise-initiative.github.io/robomimic-web/docs/modules/datasets.html) to easily use the datasets outside of this project - **algorithm implementations:** several high-quality implementations of offline learning algorithms, including BC, BC-RNN, HBC, IRIS, BCQ, CQL, and TD3-BC diff --git a/docs/api/robomimic.envs.rst b/docs/api/robomimic.envs.rst index ecc89605..94bfb690 100644 --- a/docs/api/robomimic.envs.rst +++ b/docs/api/robomimic.envs.rst @@ -20,6 +20,14 @@ robomimic.envs.env\_gym module :undoc-members: :show-inheritance: +robomimic.envs.env\_ig\_momart module +------------------------------------- + +.. 
automodule:: robomimic.envs.env_ig_momart + :members: + :undoc-members: + :show-inheritance: + robomimic.envs.env\_robosuite module ------------------------------------ diff --git a/docs/api/robomimic.utils.rst b/docs/api/robomimic.utils.rst index e3174be0..57160ef4 100644 --- a/docs/api/robomimic.utils.rst +++ b/docs/api/robomimic.utils.rst @@ -52,6 +52,14 @@ robomimic.utils.loss\_utils module :undoc-members: :show-inheritance: +robomimic.utils.macros module +----------------------------- + +.. automodule:: robomimic.utils.macros + :members: + :undoc-members: + :show-inheritance: + robomimic.utils.obs\_utils module --------------------------------- @@ -60,6 +68,14 @@ robomimic.utils.obs\_utils module :undoc-members: :show-inheritance: +robomimic.utils.python\_utils module +------------------------------------ + +.. automodule:: robomimic.utils.python_utils + :members: + :undoc-members: + :show-inheritance: + robomimic.utils.tensor\_utils module ------------------------------------ diff --git a/docs/images/modules.png b/docs/images/modules.png index 75fea8cd..d8fbd85d 100644 Binary files a/docs/images/modules.png and b/docs/images/modules.png differ diff --git a/docs/images/momart_bowl_in_sink.png b/docs/images/momart_bowl_in_sink.png new file mode 100644 index 00000000..08dfa6f7 Binary files /dev/null and b/docs/images/momart_bowl_in_sink.png differ diff --git a/docs/images/momart_dump_trash.png b/docs/images/momart_dump_trash.png new file mode 100644 index 00000000..f7f7fefe Binary files /dev/null and b/docs/images/momart_dump_trash.png differ diff --git a/docs/images/momart_grab_bowl.png b/docs/images/momart_grab_bowl.png new file mode 100644 index 00000000..dfede5d9 Binary files /dev/null and b/docs/images/momart_grab_bowl.png differ diff --git a/docs/images/momart_open_dishwasher.png b/docs/images/momart_open_dishwasher.png new file mode 100644 index 00000000..ae00551a Binary files /dev/null and b/docs/images/momart_open_dishwasher.png differ diff --git a/docs/images/momart_open_dresser.png b/docs/images/momart_open_dresser.png new file mode 100644 index 00000000..9a891db2 Binary files /dev/null and b/docs/images/momart_open_dresser.png differ diff --git a/docs/images/momart_table_cleanup_to_dishwasher_overview.png b/docs/images/momart_table_cleanup_to_dishwasher_overview.png new file mode 100644 index 00000000..2cdc7ce1 Binary files /dev/null and b/docs/images/momart_table_cleanup_to_dishwasher_overview.png differ diff --git a/docs/images/momart_table_cleanup_to_sink_overview.png b/docs/images/momart_table_cleanup_to_sink_overview.png new file mode 100644 index 00000000..387f0f6f Binary files /dev/null and b/docs/images/momart_table_cleanup_to_sink_overview.png differ diff --git a/docs/images/momart_table_setup_from_dishwasher_overview.png b/docs/images/momart_table_setup_from_dishwasher_overview.png new file mode 100644 index 00000000..c8cabcc4 Binary files /dev/null and b/docs/images/momart_table_setup_from_dishwasher_overview.png differ diff --git a/docs/images/momart_table_setup_from_dresser_overview.png b/docs/images/momart_table_setup_from_dresser_overview.png new file mode 100644 index 00000000..4f81c4cc Binary files /dev/null and b/docs/images/momart_table_setup_from_dresser_overview.png differ diff --git a/docs/images/momart_unload_dishwasher_to_dresser_overview.png b/docs/images/momart_unload_dishwasher_to_dresser_overview.png new file mode 100644 index 00000000..d2935fa2 Binary files /dev/null and b/docs/images/momart_unload_dishwasher_to_dresser_overview.png differ diff 
--git a/docs/index.rst b/docs/index.rst index f59e5408..5c51c748 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -13,6 +13,7 @@ Welcome to robomimic's documentation! introduction/overview introduction/installation introduction/quickstart + introduction/features introduction/advanced introduction/examples introduction/datasets @@ -25,6 +26,7 @@ Welcome to robomimic's documentation! modules/overview modules/dataset + modules/observations modules/algorithms modules/models modules/configs diff --git a/docs/introduction/datasets.md b/docs/introduction/datasets.md index 984cf4a1..565e75b6 100644 --- a/docs/introduction/datasets.md +++ b/docs/introduction/datasets.md @@ -30,15 +30,15 @@ data (group) - `dones` (dataset) - done signal, equal to 1 if playing the corresponding action in the state should terminate the episode. Shape (N,) where N is the length of the trajectory. - - `obs` (group) - group for the observation modalities. Each modality is stored as a dataset. + - `obs` (group) - group for the observation keys. Each key is stored as a dataset. - - `` (dataset) - the first observation modality. Note that the name of this dataset and shape will vary. As an example, the name could be "agentview_image", and the shape could be (N, 84, 84, 3). + - `` (dataset) - the first observation key. Note that the name of this dataset and shape will vary. As an example, the name could be "agentview_image", and the shape could be (N, 84, 84, 3). ... - `next_obs` (group) - group for the next observations. - - `` (dataset) - the first observation modality. + - `` (dataset) - the first observation key. ... @@ -169,6 +169,25 @@ For more details on how the released `demo.hdf5` dataset files were used to gene +## MOMART Datasets + +

+<!-- preview table of the MOMART kitchen tasks (see docs/images/momart_*_overview.png) omitted -->

+ +This repository is fully compatible with [MOMART](https://sites.google.com/view/il-for-mm/home) datasets, a large collection of long-horizon, multi-stage simulated kitchen tasks executed by a mobile manipulator robot. See [this link](https://sites.google.com/view/il-for-mm/datasets) for a breakdown of the MOMART dataset structure, guide on downloading MOMART datasets, and running experiments using the datasets. + + + ## D4RL Datasets This repository is fully compatible with most [D4RL](https://github.com/rail-berkeley/d4rl) datasets. See [this link](./results.html#d4rl) for a guide on downloading D4RL datasets and running D4RL experiments. diff --git a/docs/introduction/examples.md b/docs/introduction/examples.md index bea90f8b..779c62d0 100644 --- a/docs/introduction/examples.md +++ b/docs/introduction/examples.md @@ -153,4 +153,10 @@ Please see the [Config documentation](../modules/configs.html) for more informat ## Observation Networks Example -The example script in `examples/simple_obs_net.py` discusses how to construct networks for taking observation dictionaries as input, and that produce dictionaries as outputs. See [this section](../modules/models.html#observation-encoder-and-decoder) in the documentation for more details. \ No newline at end of file +The example script in `examples/simple_obs_net.py` discusses how to construct networks for taking observation dictionaries as input, and that produce dictionaries as outputs. See [this section](../modules/models.html#observation-encoder-and-decoder) in the documentation for more details. + + + +## Custom Observation Modalities Example + +The example script in `examples/add_new_modality.py` discusses how to (a) modify pre-existing observation modalities, and (b) add your own custom observation modalities with custom encoding. See [this section](../modules/models.html#observation-encoder-and-decoder) in the documentation for more details about the encoding and decoding process. \ No newline at end of file diff --git a/docs/introduction/features.md b/docs/introduction/features.md new file mode 100644 index 00000000..3aa8248f --- /dev/null +++ b/docs/introduction/features.md @@ -0,0 +1,41 @@ +# Features Overview + +## Summary + +In this section, we briefly summarize some key features and where you should look to learn more about them. + +1. **Datasets supported by robomimic** + - See a list of supported datasets [here](./features.html#supported-datasets).

+2. **Visualizing datasets** + - Learn how to visualize dataset trajectories [here](./datasets.html#view-dataset-structure-and-videos).

+3. **Reproducing paper experiments**
+   - Easily reproduce experiments from the following papers (a condensed command sketch follows this list):
+     - robomimic: [here](./results.html)
+     - MOMART: [here](https://sites.google.com/view/il-for-mm/datasets)

+4. **Making your own dataset** + - Learn how to make your own collected dataset compatible with this repository [here](./datasets.html#dataset-structure). + - Note that **all datasets collected through robosuite are also readily compatible** (see [here](./datasets.html#converting-robosuite-hdf5-datasets)).

+5. **Using filter keys to easily train on subsets of a dataset** + - See [this section](./datasets.html#filter-keys-and-train-valid-splits) on how to use filter keys.

+6. **Running hyperparameter scans easily** + - See [this guide](./advanced.html#using-the-hyperparameter-helper-to-launch-runs) on running hyperparameter scans.

+7. **Using pretrained models in the model zoo** + - See [this link](./model_zoo.html) to download and use pretrained models.

+8. **Getting familiar with configs** + - Learn about how configs work [here](../modules/configs.html).

+9. **Getting familiar with operations over tensor collections** + - Learn about using useful tensor utilities [here](../modules/utils.html#tensorutils).

+10. **Creating your own observation modalities** + - Learn how to make your own observation modalities and process them with custom network architectures [here](../modules/observations.html).

+11. **Creating your own algorithm** + - Learn how to implement your own learning algorithm [here](../modules/algorithms.html#building-your-own-algorithm).
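+
+As a condensed sketch of the "reproducing paper experiments" workflow from item 3 above, the commands below mirror the Quick Example on the [results page](./results.html); they assume the default dataset download location and the generated paper configs described there:
+
+```sh
+# default behavior for the download script - fetch the lift proficient-human low-dim dataset
+$ python download_datasets.py
+
+# generate json configs for running all paper experiments at robomimic/exps/paper
+$ python generate_paper_configs.py --output_dir /tmp/experiment_results
+
+# train using one of the generated configs
+$ python train.py --config ../exps/paper/core/lift/ph/low_dim/bc.json
+```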

+ +## Supported Datasets + +This is a list of datasets that we currently support, along with links on how to work with them. This list will be expanded as more datasets are made compatible with robomimic. + +- [robomimic](./results.html#downloading-released-datasets) +- [robosuite](./datasets.html#converting-robosuite-hdf5-datasets) +- [MOMART](./datasets.html#momart-datasets) +- [D4RL](./results.html#d4rl) +- [RoboTurk Pilot](./datasets.html#roboturk-pilot-datasets) \ No newline at end of file diff --git a/docs/introduction/overview.md b/docs/introduction/overview.md index 11d23051..cb257b83 100644 --- a/docs/introduction/overview.md +++ b/docs/introduction/overview.md @@ -15,11 +15,11 @@ Imitating human demonstrations is a promising approach to endow robots with various manipulation capabilities. While recent advances have been made in imitation learning and batch (offline) reinforcement learning, a lack of open-source human datasets and reproducible learning methods make assessing the state of the field difficult. The overarching goal of **robomimic** is to provide researchers and practitioners with: -- a **standardized set of large demonstration datasets** across several benchmarking tasks to facilitate fair comparisons, with a focus on learning from human-provided demonstrations +- a **standardized set of large demonstration datasets** across several benchmarking tasks to facilitate fair comparisons, with a focus on learning from human-provided demonstrations (see [this link](./features.html#supported-datasets) for a list of supported datasets) - **high-quality implementations of several learning algorithms** for training closed-loop policies from offline datasets to make reproducing results easy and lower the barrier to entry - a **modular design** that offers great flexibility in extending algorithms and designing new algorithms -This release of **robomimic** contains seven offline learning [algorithms](../modules/algorithms.html) and standardized [datasets](./results.html) collected across five simulated and three real-world multi-stage manipulation tasks of varying complexity. We highlight some features below: +This release of **robomimic** contains seven offline learning [algorithms](../modules/algorithms.html) and standardized [datasets](./results.html) collected across five simulated and three real-world multi-stage manipulation tasks of varying complexity. 
We highlight some features below (for a more thorough list of features, see [this link](./features.html#features-overview)): - **standardized datasets:** a set of datasets collected from different sources (single proficient human, multiple humans, and machine-generated) across several simulated and real-world tasks, along with a plug-and-play [Dataset class](../modules/datasets.html) to easily use the datasets outside of this project - **algorithm implementations:** several high-quality implementations of offline learning algorithms, including BC, BC-RNN, HBC, IRIS, BCQ, CQL, and TD3-BC diff --git a/docs/introduction/quickstart.md b/docs/introduction/quickstart.md index 63324182..3ac9efd3 100644 --- a/docs/introduction/quickstart.md +++ b/docs/introduction/quickstart.md @@ -108,6 +108,3 @@ Instead of storing the observations, which can consist of high-dimensional image ```sh python run_trained_agent.py --agent /path/to/model.pth --n_rollouts 50 --horizon 400 --seed 0 --dataset_path /path/to/output.hdf5 ``` - - - diff --git a/docs/introduction/results.md b/docs/introduction/results.md index 8e172965..b9d2363c 100644 --- a/docs/introduction/results.md +++ b/docs/introduction/results.md @@ -22,6 +22,24 @@ $ python train.py --config ../exps/paper/core/lift/ph/low_dim/bc.json See the [downloading released datasets](./results.html#downloading-released-datasets) section below for more information on downloading different datasets, and the [results on released datasets](./results.html#results-on-released-datasets) section below for more detailed information on reproducing different results from the study. +## Quick Example + +In this section, we show a simple example of how to reproduce one of the results from the study - the BC-RNN result on the Lift (Proficient-Human) low-dim dataset. + +```sh +# default behavior for download script - just download lift proficient-human low-dim dataset to robomimic/../datasets +$ python download_datasets.py + +# generate json configs for running all experiments at robomimic/exps/paper +$ python generate_paper_configs.py --output_dir /tmp/experiment_results + +# the training command can be found in robomimic/exps/paper/core.sh +# Training results can be viewed at /tmp/experiment_results (--output_dir when generating paper configs). +$ python train.py --config ../exps/paper/core/lift/ph/low_dim/bc.json +``` + +See the [downloading released datasets](./results.html#downloading-released-datasets) section below for more information on downloading different datasets, and the [results on released datasets](./results.html#results-on-released-datasets) section below for more detailed information on reproducing different results from the study. + ## Downloading Released Datasets Released datasets can be downloaded easily by using the `download_datasets.py` script. **This is the preferred method for downloading the datasets**, because the script will also set up a directory structure for the datasets that works out of the box with examples for reproducing some benchmark results with the repository. A few examples of using this script are provided below. diff --git a/docs/miscellaneous/contributing.md b/docs/miscellaneous/contributing.md index 0d2eb5db..aef40460 100644 --- a/docs/miscellaneous/contributing.md +++ b/docs/miscellaneous/contributing.md @@ -43,7 +43,7 @@ We also list additional suggested contributing guidelines that we adhered to dur - When creating new networks (e.g. 
subclasses of `Module` in `models/base_nets.py`), always sub-modules into a property called `self.nets`, and if there is more than one sub-module, make it a module collection (such as a `torch.nn.ModuleDict`). This is to ensure that the pattern `model.to(device)` works as expected with multiple levels of nested torch modules. As an example of nesting, see the `_create_networks` function in the `VAE` class (`models/vae_nets.py`) and the `MIMO_MLP` class (`models/obs_nets.py`). -- Do not use default mutable arguments -- they can lead to terrible bugs and unexpected behavior (see [this link](https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/) for more information). For this reason, in functions that expect optional dictionaries and lists (for example, the `visual_core_kwargs` argument in the `obs_encoder_factory` function, or the `layer_dims` argument in the `MLP` class constructor), we use a default argument of `visual_core_kwargs=None` or an empty tuple (since tuples are immutable) `layer_dims=()`. +- Do not use default mutable arguments -- they can lead to terrible bugs and unexpected behavior (see [this link](https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/) for more information). For this reason, in functions that expect optional dictionaries and lists (for example, the `core_kwargs` argument in the `obs_encoder_factory` function, or the `layer_dims` argument in the `MLP` class constructor), we use a default argument of `core_kwargs=None` or an empty tuple (since tuples are immutable) `layer_dims=()`. - Prefer `torch.expand` over `torch.repeat` wherever possible, for memory efficiency. See [this link](https://discuss.pytorch.org/t/expand-vs-repeat-semantic-difference/59789) for more details. diff --git a/docs/miscellaneous/references.md b/docs/miscellaneous/references.md index 1a7c00f9..325c1da9 100644 --- a/docs/miscellaneous/references.md +++ b/docs/miscellaneous/references.md @@ -8,6 +8,7 @@ A list of projects and papers that use **robomimic**. 
If you would like to add y ## Imitation Learning and Batch (Offline) Reinforcement Learning +- [Error-Aware Imitation Learning from Teleoperation Data for Mobile Manipulation](https://arxiv.org/abs/2112.05251) Josiah Wong, Albert Tung, Andrey Kurenkov, Ajay Mandlekar, Li Fei-Fei, Silvio Savarese, Roberto Martín-Martín - [Generalization Through Hand-Eye Coordination: An Action Space for Learning Spatially-Invariant Visuomotor Control](https://arxiv.org/abs/2103.00375) Chen Wang, Rui Wang, Danfei Xu, Ajay Mandlekar, Li Fei-Fei, Silvio Savarese - [Human-in-the-Loop Imitation Learning using Remote Teleoperation](https://arxiv.org/abs/2012.06733) Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Yuke Zhu, Li Fei-Fei, Silvio Savarese - [Learning Multi-Arm Manipulation Through Collaborative Teleoperation](https://arxiv.org/abs/2012.06738) Albert Tung, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Yuke Zhu, Li Fei-Fei, Silvio Savarese diff --git a/docs/modules/algorithms.md b/docs/modules/algorithms.md index 8f8cbfae..8fe954ac 100644 --- a/docs/modules/algorithms.md +++ b/docs/modules/algorithms.md @@ -141,7 +141,7 @@ config, _ = config_from_checkpoint(algo_name=algo_name, ckpt_dict=ckpt_dict) model = algo_factory( algo_name, config, - modality_shapes=ckpt_dict["shape_metadata"]["all_shapes"], + obs_key_shapes=ckpt_dict["shape_metadata"]["all_shapes"], ac_dim=ckpt_dict["shape_metadata"]["ac_dim"], device=device, ) diff --git a/docs/modules/dataset.md b/docs/modules/dataset.md index 7bbb4a95..8a7871ee 100644 --- a/docs/modules/dataset.md +++ b/docs/modules/dataset.md @@ -30,7 +30,7 @@ dataset = SequenceDataset( - `hdf5_path` - The absolute / relative path to the hdf5 file containing training demonstrations. See [datasets](../introduction/datasets.html) page for the expected data structure. - `obs_keys` - - A list of strings specifying which observation modalities to read from the dataset. This is typically read from the config file: our implementation pools observation keys from `config.observation.modalities.obs.low_dim` and `config.observation.modalities.obs.images`. + - A list of strings specifying which observation modalities to read from the dataset. This is typically read from the config file: our implementation pools observation keys from `config.observation.modalities.obs.low_dim` and `config.observation.modalities.obs.rgb`. - `dataset_keys` - Keys of non-observation data to read from a demonstration. Typically include `actions`, `rewards`, `dones`. - `seq_length` diff --git a/docs/modules/models.md b/docs/modules/models.md index 8423998b..dc13afac 100644 --- a/docs/modules/models.md +++ b/docs/modules/models.md @@ -33,7 +33,20 @@ class ResNet18Conv(ConvBase): return [512, out_h, out_w] ``` -## VisualCore +## EncoderCore +We create the `EncoderCore` abstract class to encapsulate any network intended to encode a specific type of observation modality (e.g.: `VisualCore` for RGB and depth observations, and `ScanCore` for range scanner observations. See below for descriptions of both!). When a new encoder class is subclassed from `EncoderCore`, it will automatically be registered internally in robomimic, allowing the user to directly refer to their custom encoder classes in their config in string form. 
For example, if the user specifies a custom `EncoderCore`-based class named `MyCustomRGBEncoder` to encode RGB observations, they can directly set this in their config:
+
+```python
+config.observation.encoder.rgb.core_class = "MyCustomRGBEncoder"
+config.observation.encoder.rgb.core_kwargs = ...
+```
+
+Any corresponding keyword arguments that should be passed to the encoder constructor should be specified in `core_kwargs` in the config. For more information on creating your own custom encoder, please see our [example script](../introduction/examples.html#custom-observation-modalities-example).
+
+Below, we provide descriptions of specific EncoderCore-based classes used to encode RGB and depth observations (`VisualCore`) and range scanner observations (`ScanCore`).
+
+
+### VisualCore
 We provide a `VisualCore` module for constructing custom vision architectures. A `VisualCore` consists of a backbone network that featurizes image input --- typically a `ConvBase` module --- and a pooling module that reduces the feature tensor into a fixed-sized vector representation. Below is a `VisualCore` built from a `ResNet18Conv` backbone and a `SpatialSoftmax` ([paper](https://rll.berkeley.edu/dsae/dsae.pdf)) pooling module.
 
 ```python
@@ -41,17 +54,40 @@ from robomimic.models.base_nets import VisualCore, ResNet18Conv, SpatialSoftmax
 vis_net = VisualCore(
     input_shape=(3, 224, 224),
-    visual_core_class="ResNet18Conv", # use ResNet18 as the visualcore backbone
-    visual_core_kwargs={"pretrained": False, "input_coord_conv": False}, # kwargs for the ResNet18Conv class
+    core_class="ResNet18Conv", # use ResNet18 as the visualcore backbone
+    core_kwargs={"pretrained": False, "input_coord_conv": False}, # kwargs for the ResNet18Conv class
     pool_class="SpatialSoftmax", # use spatial softmax to regularize the model output
     pool_kwargs={"num_kp": 32}, # kwargs for the SpatialSoftmax --- use 32 keypoints
     flatten=True, # flatten the output of the spatial softmax layer
-    visual_feature_dimension=64, # project the flattened feature into a 64-dim vector through a linear layer
+    feature_dimension=64, # project the flattened feature into a 64-dim vector through a linear layer
 )
 ```
 
 New vision backbone and pooling classes can be added by subclassing `ConvBase`.
+
+### ScanCore
+We provide a `ScanCore` module for constructing custom range finder architectures. `ScanCore` consists of a 1D Convolution backbone network (`Conv1dBase`) that featurizes a high-dimensional 1D input, and a pooling module that reduces the feature tensor into a fixed-sized vector representation. Below is an example of a `ScanCore` network with a `SpatialSoftmax` ([paper](https://rll.berkeley.edu/dsae/dsae.pdf)) pooling module.
+
+```python
+from robomimic.models.base_nets import ScanCore, SpatialSoftmax
+
+scan_net = ScanCore(
+    input_shape=(1, 120),
+    conv_kwargs={
+        "out_channels": [32, 64, 64],
+        "kernel_size": [8, 4, 2],
+        "stride": [4, 2, 1],
+    }, # kwarg settings to pass to individual Conv1d layers
+    conv_activation="relu", # use relu in between each Conv1d layer
+    pool_class="SpatialSoftmax", # use spatial softmax to regularize the model output
+    pool_kwargs={"num_kp": 32}, # kwargs for the SpatialSoftmax --- use 32 keypoints
+    flatten=True, # flatten the output of the spatial softmax layer
+    feature_dimension=64, # project the flattened feature into a 64-dim vector through a linear layer
+)
+```
+
+
 ## Randomizers
 
 Randomizers are `Modules` that perturb network inputs during training, and optionally during evaluation.
A `Randomizer` implements a `forward_in` and a `forward_out` function, which are intended to process the input and output of a neural network module. As an example, the `forward_in` function of a `CropRandomizer` instance perturbs an input image by taking a random crop of the image (as shown in th gif below). If the `CropRandomizer` is configured to take more than one random crop (`n_crops > 1`) of each input image, it will send all random crops through the image network and reduce the network output by average pooling the outputs along the `n_crops` dimension in the `forward_out` function. @@ -62,7 +98,7 @@ Randomizers are `Modules` that perturb network inputs during training, and optio ## Observation Encoder and Decoder - `ObservationEncoder` and `ObservationDecoder` are basic building blocks for dealing with observation dictionary inputs and outputs. They are designed to take in multiple streams of observation modalities as input (e.g. a dictionary containing images and robot proprioception signals), and output a dictionary of predictions like actions and subgoals. Below is an example of how to manually create an `ObservationEncoder` instance by registering observation modalities with the `register_modality` function. + `ObservationEncoder` and `ObservationDecoder` are basic building blocks for dealing with observation dictionary inputs and outputs. They are designed to take in multiple streams of observation modalities as input (e.g. a dictionary containing images and robot proprioception signals), and output a dictionary of predictions like actions and subgoals. Below is an example of how to manually create an `ObservationEncoder` instance by registering observation modalities with the `register_obs_key` function. ```python from robomimic.models.obs_nets import ObservationEncoder, CropRandomizer, MLP, VisualCore, ObservationDecoder @@ -76,33 +112,33 @@ camera1_shape = [3, 224, 224] image_randomizer = CropRandomizer(input_shape=camera2_shape, crop_height=200, crop_width=200) # We will use a reconfigurable image processing backbone VisualCore to process the input image modality -mod_net_class = "VisualCore" # this is defined in models/base_nets.py +net_class = "VisualCore" # this is defined in models/base_nets.py # kwargs for VisualCore network -mod_net_kwargs = { +net_kwargs = { "input_shape": camera1_shape, - "visual_core_class": "ResNet18Conv", # use ResNet18 as the visualcore backbone - "visual_core_kwargs": {"pretrained": False, "input_coord_conv": False}, + "core_class": "ResNet18Conv", # use ResNet18 as the visualcore backbone + "core_kwargs": {"pretrained": False, "input_coord_conv": False}, "pool_class": "SpatialSoftmax", # use spatial softmax to regularize the model output "pool_kwargs": {"num_kp": 32} } # register the network for processing the modality -obs_encoder.register_modality( - mod_name="camera1", - mod_shape=camera1_shape, - mod_net_class=mod_net_class, - mod_net_kwargs=mod_net_kwargs, - mod_randomizer=image_randomizer +obs_encoder.register_obs_key( + name="camera1", + shape=camera1_shape, + net_class=net_class, + net_kwargs=net_kwargs, + randomizer=image_randomizer ) # We could mix low-dimensional observation, e.g., proprioception signal, in the encoder proprio_shape = [12] -mod_net = MLP(input_dim=12, output_dim=32, layer_dims=(128,), output_activation=None) -obs_encoder.register_modality( - mod_name="proprio", - mod_shape=proprio_shape, - mod_net=mod_net +net = MLP(input_dim=12, output_dim=32, layer_dims=(128,), output_activation=None) 
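+# here we pass the instantiated network directly via `net`, instead of specifying a class name and kwargs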
+obs_encoder.register_obs_key(
+    name="proprio",
+    shape=proprio_shape,
+    net=net
 )
 ```
diff --git a/docs/modules/observations.md b/docs/modules/observations.md
new file mode 100644
index 00000000..51dd61d0
--- /dev/null
+++ b/docs/modules/observations.md
@@ -0,0 +1,20 @@
+# Observations
+
+**robomimic** natively supports multiple different observation modalities, and provides integrated support for modifying observations and adding your own custom ones.
+
+First, we highlight semantic distinctions when referring to different aspects of observations:
+
+- **Keys** are individual observations that are received from an environment / dataset. For example, `rgb_wrist`, `eef_pos`, and `joint_vel` could be keys, depending on the dataset / environment.
+- **Modalities** are different observation modes. For example, low dimensional states are considered a single mode, whereas RGB observations might be another mode. **robomimic** natively supports four modalities: `low_dim`, `rgb`, `depth`, and `scan`. Each modality owns its own set of observation keys.
+- **Groups** consist of potentially multiple modalities and multiple keys per modality, which are together passed to a learning model. For example, **robomimic** commonly uses three different groups: `obs`, which contains the normal observations passed to any model using these as inputs, and `goal` / `subgoal`, which means any specified modalities / keys correspond to a goal / subgoal to be learned.
+
+Observations are handled in the following way:
+1. Each observation key is processed according to its modality via the corresponding `Modality` class,
+2. All observations for a given modality are concatenated and passed through an `ObservationEncoder` for that modality,
+3. All processed observations over all modalities are concatenated together and passed to a learning network
+
+## Modifying and Adding Your Own Observation Modalities
+
+**robomimic** natively supports low dimensional (`low_dim`), RGB images (`rgb`), depth images (`depth`), and scan arrays (`scan`). The way each of these modalities is processed and encoded can be easily specified by modifying their respective `encoder` parameters in your `Config` class.
+
+You may want to specify your own custom modalities that get processed and encoded in a certain way (e.g., semantic segmentation, optical flow, etc.). This can also easily be done, and we refer you to our [example script](../introduction/examples.html#custom-observation-modalities-example) which walks through the process.
\ No newline at end of file
diff --git a/docs/modules/utils.md b/docs/modules/utils.md
index f6824c6b..c0420cd3 100644
--- a/docs/modules/utils.md
+++ b/docs/modules/utils.md
@@ -38,15 +38,15 @@ The library also supports nontrivial shape operations on the nested dict.
For ex ```python # create a new dimension at dim=1 and expand the dimension size to 10 x = TensorUtils.unsqueeze_expand_at(x, size=10, dim=1) -# x["image"].shape == torch.Size([1, 10, 3, 224, 224]) +# x["rgb"].shape == torch.Size([1, 10, 3, 224, 224]) # repeat the 0-th dimension 10 times x = TensorUtils.repeat_by_expand_at(x, repeats=10, dim=0) -# x["image"].shape == torch.Size([10, 10, 3, 224, 224]) +# x["rgb"].shape == torch.Size([10, 10, 3, 224, 224]) # gather the sequence dimension (dim=1) by some index x = TensorUtils.gather_sequence(x_seq, indices=torch.arange(10)) -# x["image"].shape == torch.Size([10, 3, 224, 224]) +# x["rgb"].shape == torch.Size([10, 3, 224, 224]) ``` In addition, `map_tensor` allows applying an arbitrary function to all tensors in a nested dictionary or list of tensors and returns the same nested structure. @@ -63,21 +63,21 @@ The complete documentation of `robomimic.utils.tensor_utils.py` is available [he - **initialize_obs_utils_with_obs_specs(obs_modality_specs)** - This function initialize a global registry of mapping between observation modality names and observation types e.g. which ones are low-dimensional, and which ones are images). For example, given an `obs_modality_specs` of the following format: + This function initialize a global registry of mapping between observation key names and observation modalities e.g. which ones are low-dimensional, and which ones are rgb images). For example, given an `obs_modality_specs` of the following format: ```python { "obs": { "low_dim": ["robot0_eef_pos", "robot0_eef_quat"], - "image": ["agentview_image", "robot0_eye_in_hand"], + "rgb": ["agentview_image", "robot0_eye_in_hand"], } "goal": { "low_dim": ["robot0_eef_pos"], - "image": ["agentview_image"] + "rgb": ["agentview_image"] } } ``` - The function will create a mapping between observation names such as `'agentview_image'` and observation types such as `'image'`. The registry is stored in `OBS_TYPE_TO_MODALITIES` and can be accessed globally. Utility functions such as `key_is_image()` rely on this global registry to determine observation types. + The function will create a mapping between observation names such as `'agentview_image'` and observation modalities such as `'rgb'`. The registry is stored in `OBS_MODALITIES_TO_KEYS` and can be accessed globally. Utility functions such as `key_is_obs_modality()` rely on this global registry to determine observation modalities. - **process_obs(obs_dict)** @@ -89,4 +89,4 @@ The complete documentation of `robomimic.utils.tensor_utils.py` is available [he - **normalize_obs(obs_dict, obs_normalization_stats)** - Normalize observations by computing the mean observation and std of each observation (in each dimension and modality), and normalizing unit mean and variance in each dimension. \ No newline at end of file + Normalize observations by computing the mean observation and std of each observation (in each dimension and observation key), and normalizing unit mean and variance in each dimension. \ No newline at end of file diff --git a/examples/add_new_modality.py b/examples/add_new_modality.py new file mode 100644 index 00000000..cf7fc876 --- /dev/null +++ b/examples/add_new_modality.py @@ -0,0 +1,214 @@ +""" +A simple example showing how to add custom observation modalities, and custom +observation networks (EncoderCore, ObservationRandomizer, etc.) as well. 
+We also show how to use your custom classes directly in a config, and link them to +your environment's observations +""" + +import numpy as np +import torch +import robomimic +from robomimic.models import EncoderCore, Randomizer +from robomimic.utils.obs_utils import Modality, ScanModality +from robomimic.config.bc_config import BCConfig +import robomimic.utils.tensor_utils as TensorUtils + + +# Let's create a new modality to handle observation modalities, which will be interpreted as +# single frame images, with raw shape (H, W) in range [0, 255] +class CustomImageModality(Modality): + # We must define the class string name to reference this modality with the @name attribute + name = "custom_image" + + # We must define two class methods: a processor and an unprocessor method. The processor + # method should map the raw observations (a numpy array OR torch tensor) into a form / shape suitable for learning, + # and the unprocess method should do the inverse operation + @classmethod + def _default_obs_processor(cls, obs): + # We add a channel dimension and normalize them to be in range [-1, 1] + return (obs / 255.0 - 0.5) * 2 + + @classmethod + def _default_obs_unprocessor(cls, obs): + # We do the reverse + return ((obs / 2) + 0.5) * 255.0 + + +# You can also modify pre-existing modalities as well. Let's say you have scan data that pads the ends with a 0, so we +# want to pre-process those scans in a different way. We can specify a custom processor / unprocessor +# method that will override the default one (assumes obs are a flat 1D array): +def custom_scan_processor(obs): + # Trim the padded ends + return obs[1:-1] + + +def custom_scan_unprocessor(obs): + # Re-add the padding + # Note: need to check type + return np.concatenate([np.zeros(1), obs, np.zeros(1)]) if isinstance(obs, np.ndarray) else \ + torch.concat([torch.zeros(1), obs, torch.zeros(1)]) + + +# Override the default functions for ScanModality +ScanModality.set_obs_processor(processor=custom_scan_processor) +ScanModality.set_obs_unprocessor(unprocessor=custom_scan_unprocessor) + + +# Let's now create a custom encoding class for the custom image modality +class CustomImageEncoderCore(EncoderCore): + # For simplicity, this will be a pass-through with some simple kwargs + def __init__( + self, + input_shape, # Required, will be inferred automatically at runtime + + # Any args below here you can specify arbitrarily + welcome_str, + ): + # Always need to run super init first and pass in input_shape + super().__init__(input_shape=input_shape) + + # Anything else should can be custom to your class + # Let's print out the welcome string + print(f"Welcome! {welcome_str}") + + # We need to always specify the output shape from this model, based on a given input_shape + def output_shape(self, input_shape=None): + # this is just a pass-through, so we return input_shape + return input_shape + + # we also need to specify the forward pass for this network + def forward(self, inputs): + # just a pass through again + return inputs + + +# Let's also create a custom randomizer class for randomizing our observations +class CustomImageRandomizer(Randomizer): + """ + A simple example of a randomizer - we make @num_rand copies of each image in the batch, + and add some small uniform noise to each. All randomized images will then get passed + through the network, resulting in outputs corresponding to each copy - we will pool + these outputs across the copies with a simple average. 
+ """ + def __init__( + self, + input_shape, + num_rand=1, + noise_scale=0.01, + ): + """ + Args: + input_shape (tuple, list): shape of input (not including batch dimension) + num_rand (int): number of random images to create on each forward pass + noise_scale (float): magnitude of uniform noise to apply + """ + super(CustomImageRandomizer, self).__init__() + + assert len(input_shape) == 3 # (C, H, W) + + self.input_shape = input_shape + self.num_rand = num_rand + self.noise_scale = noise_scale + + def output_shape_in(self, input_shape=None): + """ + Function to compute output shape from inputs to this module. Corresponds to + the @forward_in operation, where raw inputs (usually observation modalities) + are passed in. + + Args: + input_shape (iterable of int): shape of input. Does not include batch dimension. + Some modules may not need this argument, if their output does not depend + on the size of the input, or if they assume fixed size input. + + Returns: + out_shape ([int]): list of integers corresponding to output shape + """ + + # @forward_in takes (B, C, H, W) -> (B, N, C, H, W) -> (B * N, C, H, W). + # since only the batch dimension changes, and @input_shape does not include batch + # dimension, we indicate that the non-batch dimensions don't change + return list(input_shape) + + def output_shape_out(self, input_shape=None): + """ + Function to compute output shape from inputs to this module. Corresponds to + the @forward_out operation, where processed inputs (usually encoded observation + modalities) are passed in. + + Args: + input_shape (iterable of int): shape of input. Does not include batch dimension. + Some modules may not need this argument, if their output does not depend + on the size of the input, or if they assume fixed size input. + + Returns: + out_shape ([int]): list of integers corresponding to output shape + """ + + # since the @forward_out operation splits [B * N, ...] -> [B, N, ...] + # and then pools to result in [B, ...], only the batch dimension changes, + # and so the other dimensions retain their shape. + return list(input_shape) + + def forward_in(self, inputs): + """ + Make N copies of each image, add random noise to each, and move + copies into batch dimension to ensure compatibility with rest + of network. + """ + + # note the use of @self.training to ensure no randomization at test-time + if self.training: + + # make N copies of the images [B, C, H, W] -> [B, N, C, H, W] + out = TensorUtils.unsqueeze_expand_at(inputs, size=self.num_rand, dim=1) + + # add random noise to each copy + out = out + self.noise_scale * (2. * torch.rand_like(out) - 1.) + + # reshape [B, N, C, H, W] -> [B * N, C, H, W] to ensure network forward pass is unchanged + return TensorUtils.join_dimensions(out, 0, 1) + return inputs + + def forward_out(self, inputs): + """ + Pools outputs across the copies by averaging them. It does this by splitting + the outputs from shape [B * N, ...] -> [B, N, ...] and then averaging across N + to result in shape [B, ...] to make sure the network output is consistent with + what would have happened if there were no randomization. 
+ """ + + # note the use of @self.training to ensure no randomization at test-time + if self.training: + batch_size = (inputs.shape[0] // self.num_rand) + out = TensorUtils.reshape_dimensions(inputs, begin_axis=0, end_axis=0, + target_dims=(batch_size, self.num_rand)) + return out.mean(dim=1) + return inputs + + def __repr__(self): + """Pretty print network.""" + header = '{}'.format(str(self.__class__.__name__)) + msg = header + "(input_shape={}, num_rand={}, noise_scale={})".format( + self.input_shape, self.num_rand, self.noise_scale) + return msg + + +if __name__ == "__main__": + # Now, we can directly reference the classes in our config! + config = BCConfig() + config.observation.encoder.custom_image.core_class = "CustomImageEncoderCore" # Custom class, in string form + config.observation.encoder.custom_image.core_kwargs.welcome_str = "hi there!" # Any custom arguments, of any primitive type that is json-able + config.observation.encoder.custom_image.obs_randomizer_class = "CustomImageRandomizer" + config.observation.encoder.custom_image.obs_randomizer_kwargs.num_rand = 3 + config.observation.encoder.custom_image.obs_randomizer_kwargs.noise_scale = 0.05 + + # We can also directly use this new modality and associate dataset / observation keys with it! + config.observation.modalities.obs.custom_image = ["my_image1", "my_image2"] + config.observation.modalities.goal.custom_image = ["my_image2", "my_image3"] + + # Let's view our config + print(config) + + # That's it! Now we can pass this config into our training script, and robomimic will directly use our + # custom modality + encoder network diff --git a/examples/simple_obs_nets.py b/examples/simple_obs_nets.py index 17c0191c..9ef78187 100644 --- a/examples/simple_obs_nets.py +++ b/examples/simple_obs_nets.py @@ -7,76 +7,86 @@ from collections import OrderedDict import torch -from robomimic.models.obs_nets import ObservationEncoder, CropRandomizer, MLP, VisualCore, ObservationDecoder +from robomimic.models.obs_nets import ObservationEncoder, MLP, ObservationDecoder +from robomimic.models.base_nets import CropRandomizer import robomimic.utils.tensor_utils as TensorUtils +import robomimic.utils.obs_utils as ObsUtils def simple_obs_example(): obs_encoder = ObservationEncoder(feature_activation=torch.nn.ReLU) - # There are two ways to construct the network for processing a input modality. + # There are two ways to construct the network for processing a input observation. # 1. Construct through keyword args and class name # Assume we are processing image input of shape (3, 224, 224). 
camera1_shape = [3, 224, 224] - # We will use a reconfigurable image processing backbone VisualCore to process the input image modality - mod_net_class = "VisualCore" # this is defined in models/base_nets.py + # We will use a reconfigurable image processing backbone VisualCore to process the input image observation key + net_class = "VisualCore" # this is defined in models/base_nets.py # kwargs for VisualCore network - mod_net_kwargs = { + net_kwargs = { "input_shape": camera1_shape, - "visual_core_class": "ResNet18Conv", # use ResNet18 as the visualcore backbone - "visual_core_kwargs": {"pretrained": False, "input_coord_conv": False}, + "backbone_class": "ResNet18Conv", # use ResNet18 as the visualcore backbone + "backbone_kwargs": {"pretrained": False, "input_coord_conv": False}, "pool_class": "SpatialSoftmax", # use spatial softmax to regularize the model output "pool_kwargs": {"num_kp": 32} } - # register the network for processing the modality - obs_encoder.register_modality( - mod_name="camera1", - mod_shape=camera1_shape, - mod_net_class=mod_net_class, - mod_net_kwargs=mod_net_kwargs + # register the network for processing the observation key + obs_encoder.register_obs_key( + name="camera1", + shape=camera1_shape, + net_class=net_class, + net_kwargs=net_kwargs, ) - # 2. Alternatively, we could initialize the modality network outside of the ObservationEncoder + # 2. Alternatively, we could initialize the observation key network outside of the ObservationEncoder # The image doesn't have to be of the same shape camera2_shape = [3, 160, 240] - # We could also attach an observation randomizer to perturb the input modality before sending to the network + # We could also attach an observation randomizer to perturb the input observation key before sending to the network image_randomizer = CropRandomizer(input_shape=camera2_shape, crop_height=140, crop_width=220) # the cropper will alter the input shape - mod_net_kwargs["input_shape"] = image_randomizer.output_shape_in(camera2_shape) - mod_net = eval(mod_net_class)(**mod_net_kwargs) - - obs_encoder.register_modality( - mod_name="camera2", - mod_shape=camera2_shape, - mod_net=mod_net, - mod_randomizer=image_randomizer + net_kwargs["input_shape"] = image_randomizer.output_shape_in(camera2_shape) + net = ObsUtils.OBS_ENCODER_CORES[net_class](**net_kwargs) + + obs_encoder.register_obs_key( + name="camera2", + shape=camera2_shape, + net=net, + randomizer=image_randomizer, ) - # ObservationEncoder also supports weight sharing between modalities + # ObservationEncoder also supports weight sharing between keys camera3_shape = [3, 224, 224] - obs_encoder.register_modality( - mod_name="camera3", - mod_shape=camera3_shape, - share_mod_net_from="camera1" + obs_encoder.register_obs_key( + name="camera3", + shape=camera3_shape, + share_net_from="camera1", ) # We could mix low-dimensional observation, e.g., proprioception signal, in the encoder proprio_shape = [12] - mod_net = MLP(input_dim=12, output_dim=32, layer_dims=(128,), output_activation=None) - obs_encoder.register_modality( - mod_name="proprio", - mod_shape=proprio_shape, - mod_net=mod_net + net = MLP(input_dim=12, output_dim=32, layer_dims=(128,), output_activation=None) + obs_encoder.register_obs_key( + name="proprio", + shape=proprio_shape, + net=net, ) + # Before constructing the encoder, make sure we register all of our observation keys with corresponding modalities + # (this will determine how they are processed during training) + obs_modality_mapping = { + "low_dim": ["proprio"], + "rgb": 
["camera1", "camera2", "camera3"], + } + ObsUtils.initialize_obs_modality_mapping_from_dict(modality_mapping=obs_modality_mapping) + # Finally, construct the observation encoder obs_encoder.make() @@ -99,8 +109,8 @@ def simple_obs_example(): inputs = TensorUtils.to_device(inputs, torch.device("cuda:0")) obs_encoder.cuda() - # output from each modality network is concatenated as a flat vector. - # The concatenation order is the same as the modalities are registered + # output from each obs key network is concatenated as a flat vector. + # The concatenation order is the same as the keys are registered obs_feature = obs_encoder(inputs) print(obs_feature.shape) @@ -111,6 +121,10 @@ def simple_obs_example(): decode_shapes=OrderedDict({"action": (7,)}) ) + # Send to GPU if applicable + if torch.cuda.is_available(): + obs_decoder.cuda() + print(obs_decoder(obs_feature)) diff --git a/examples/simple_train_loop.py b/examples/simple_train_loop.py index 64af0287..9e65d902 100644 --- a/examples/simple_train_loop.py +++ b/examples/simple_train_loop.py @@ -77,13 +77,13 @@ def get_example_model(dataset_path, device): # default BC config config = config_factory(algo_name="bc") - # read config to set up metadata for observation types (e.g. detecting image observations) + # read config to set up metadata for observation modalities (e.g. detecting rgb observations) ObsUtils.initialize_obs_utils_with_config(config) # read dataset to get some metadata for constructing model shape_meta = FileUtils.get_shape_metadata_from_dataset( dataset_path=dataset_path, - all_modalities=sorted(( + all_obs_keys=sorted(( "robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos", @@ -95,7 +95,7 @@ def get_example_model(dataset_path, device): model = algo_factory( algo_name=config.algo_name, config=config, - modality_shapes=shape_meta["all_shapes"], + obs_key_shapes=shape_meta["all_shapes"], ac_dim=shape_meta["ac_dim"], device=device, ) @@ -107,8 +107,8 @@ def print_batch_info(batch): for k in batch: if k in ["obs", "next_obs"]: print("key {}".format(k)) - for mod in batch[k]: - print(" modality {} with shape {}".format(mod, batch[k][mod].shape)) + for obs_key in batch[k]: + print(" obs key {} with shape {}".format(obs_key, batch[k][obs_key].shape)) else: print("key {} with shape {}".format(k, batch[k].shape)) print("") diff --git a/examples/train_bc_rnn.py b/examples/train_bc_rnn.py index 7e892d79..205e7e46 100644 --- a/examples/train_bc_rnn.py +++ b/examples/train_bc_rnn.py @@ -21,41 +21,21 @@ import robomimic import robomimic.utils.torch_utils as TorchUtils import robomimic.utils.test_utils as TestUtils +import robomimic.utils.macros as Macros from robomimic.config import config_factory from robomimic.scripts.train import train -def get_config(dataset_path=None, output_dir=None, debug=False): +def robosuite_hyperparameters(config): """ - Construct config for training. + Sets robosuite-specific hyperparameters. Args: - dataset_path (str): path to hdf5 dataset. Pass None to use a small default dataset. - output_dir (str): path to output folder, where logs, model checkpoints, and videos - will be written. If it doesn't exist, the directory will be created. Pass - None to use a default directory in /tmp. - debug (bool): if True, shrink training and rollout times to test a full training - run quickly. 
- """ - - # handle args - if dataset_path is None: - # small dataset with a handful of trajectories - dataset_path = TestUtils.example_dataset_path() - - if output_dir is None: - # default output directory created in /tmp - output_dir = TestUtils.temp_model_dir_path() - - # make default BC config - config = config_factory(algo_name="bc") - - ### Experiment Config ### - config.experiment.name = "bc_rnn_example" # name of experiment used to make log files - config.experiment.validate = True # whether to do validation or not - config.experiment.logging.terminal_output_to_txt = False # whether to log stdout to txt file - config.experiment.logging.log_tb = True # enable tensorboard logging + config (Config): Config to modify + Returns: + Config: Modified config + """ ## save config - if and when to save checkpoints ## config.experiment.save.enabled = True # whether model saving should be enabled or disabled config.experiment.save.every_n_seconds = None # save model every n seconds (set to None to disable) @@ -87,23 +67,12 @@ def get_config(dataset_path=None, output_dir=None, debug=False): config.experiment.rollout.warmstart = 0 # number of epochs to wait before starting rollouts config.experiment.rollout.terminate_on_success = True # end rollout early after task success - - ### Train Config ### - config.train.data = dataset_path # path to hdf5 dataset - - # Write all results to this directory. A new folder with the timestamp will be created - # in this directory, and it will contain three subfolders - "log", "models", and "videos". - # The "log" directory will contain tensorboard and stdout txt logs. The "models" directory - # will contain saved model checkpoints. The "videos" directory contains evaluation rollout - # videos. - config.train.output_dir = output_dir # path to output folder - ## dataset loader config ## # num workers for loading data - generally set to 0 for low-dim datasets, and 2 for image datasets - config.train.num_data_workers = 0 # assume low-dim dataset + config.train.num_data_workers = 0 # assume low-dim dataset - # One of ["all", "low_dim", or None]. Set to "all" to cache entire hdf5 in memory - this is + # One of ["all", "low_dim", or None]. Set to "all" to cache entire hdf5 in memory - this is # by far the fastest for data loading. Set to "low_dim" to cache all non-image data. Set # to None to use no caching - in this case, every batch sample is retrieved via file i/o. # You should almost never set this to None, even for large image datasets. 
@@ -124,8 +93,8 @@ def get_config(dataset_path=None, output_dir=None, debug=False): # keys from hdf5 to load per demonstration, besides "obs" and "next_obs" config.train.dataset_keys = ( - "actions", - "rewards", + "actions", + "rewards", "dones", ) @@ -141,35 +110,36 @@ def get_config(dataset_path=None, output_dir=None, debug=False): ### Observation Config ### config.observation.modalities.obs.low_dim = [ # specify low-dim observations for agent - "robot0_eef_pos", - "robot0_eef_quat", - "robot0_gripper_qpos", + "robot0_eef_pos", + "robot0_eef_quat", + "robot0_gripper_qpos", "object", ] - config.observation.modalities.obs.image = [] # no image observations + config.observation.modalities.obs.rgb = [] # no image observations config.observation.modalities.goal.low_dim = [] # no low-dim goals - config.observation.modalities.goal.image = [] # no image goals + config.observation.modalities.goal.rgb = [] # no image goals # observation encoder architecture - applies to all networks that take observation dicts as input - config.observation.encoder.visual_core = 'ResNet18Conv' # ResNet backbone for image observations (unused if no image observations) - config.observation.encoder.visual_core_kwargs.pretrained = False # kwargs for visual core - config.observation.encoder.visual_core_kwargs.input_coord_conv = False + + config.observation.encoder.rgb.core_class = "VisualCore" + config.observation.encoder.rgb.core_kwargs.feature_dimension = 64 + config.observation.encoder.rgb.core_kwargs.backbone_class = 'ResNet18Conv' # ResNet backbone for image observations (unused if no image observations) + config.observation.encoder.rgb.core_kwargs.backbone_kwargs.pretrained = False # kwargs for visual core + config.observation.encoder.rgb.core_kwargs.backbone_kwargs.input_coord_conv = False + config.observation.encoder.rgb.core_kwargs.pool_class = "SpatialSoftmax" # Alternate options are "SpatialMeanPool" or None (no pooling) + config.observation.encoder.rgb.core_kwargs.pool_kwargs.num_kp = 32 # Default arguments for "SpatialSoftmax" + config.observation.encoder.rgb.core_kwargs.pool_kwargs.learnable_temperature = False # Default arguments for "SpatialSoftmax" + config.observation.encoder.rgb.core_kwargs.pool_kwargs.temperature = 1.0 # Default arguments for "SpatialSoftmax" + config.observation.encoder.rgb.core_kwargs.pool_kwargs.noise_std = 0.0 # Default arguments for "SpatialSoftmax" # observation randomizer class - set to None to use no randomization, or 'CropRandomizer' to use crop randomization - config.observation.encoder.obs_randomizer_class = None + config.observation.encoder.rgb.obs_randomizer_class = None # kwargs for observation randomizers (for the CropRandomizer, this is size and number of crops) - config.observation.encoder.obs_randomizer_kwargs.crop_height = 76 - config.observation.encoder.obs_randomizer_kwargs.crop_width = 76 - config.observation.encoder.obs_randomizer_kwargs.num_crops = 1 - config.observation.encoder.obs_randomizer_kwargs.pos_enc = False - - config.observation.encoder.visual_feature_dimension = 64 # images are encoded into feature vectors of this size - config.observation.encoder.use_spatial_softmax = True # use spatial softmax layer at end of conv layers - config.observation.encoder.spatial_softmax_kwargs.num_kp = 32 # kwargs for spatial softmax layer - config.observation.encoder.spatial_softmax_kwargs.learnable_temperature = False - config.observation.encoder.spatial_softmax_kwargs.temperature = 1.0 - config.observation.encoder.spatial_softmax_kwargs.noise_std = 0.0 + 
config.observation.encoder.rgb.obs_randomizer_kwargs.crop_height = 76
+    config.observation.encoder.rgb.obs_randomizer_kwargs.crop_width = 76
+    config.observation.encoder.rgb.obs_randomizer_kwargs.num_crops = 1
+    config.observation.encoder.rgb.obs_randomizer_kwargs.pos_enc = False
 
     ### Algo Config ###
@@ -192,10 +162,10 @@ def get_config(dataset_path=None, output_dir=None, debug=False):
     config.algo.gmm.num_modes = 5                       # number of GMM modes
     config.algo.gmm.min_std = 0.0001                    # minimum std output from network
     config.algo.gmm.std_activation = "softplus"         # activation to use for std output from policy net
-    config.algo.gmm.low_noise_eval = True               # low-std at test-time
+    config.algo.gmm.low_noise_eval = True               # low-std at test-time
 
     # rnn policy config
-    config.algo.rnn.enabled = True       # enable RNN policy
+    config.algo.rnn.enabled = True      # enable RNN policy
     config.algo.rnn.horizon = 10        # unroll length for RNN - should usually match train.seq_length
     config.algo.rnn.hidden_dim = 400    # hidden dimension size
     config.algo.rnn.rnn_type = "LSTM"   # rnn type - one of "LSTM" or "GRU"
@@ -203,6 +173,205 @@ def get_config(dataset_path=None, output_dir=None, debug=False):
     config.algo.rnn.open_loop = False   # if True, action predictions are only based on a single observation (not sequence) + hidden state
     config.algo.rnn.kwargs.bidirectional = False        # rnn kwargs
 
+    return config
+
+
+def momart_hyperparameters(config):
+    """
+    Sets momart-specific hyperparameters.
+
+    Args:
+        config (Config): Config to modify
+
+    Returns:
+        Config: Modified config
+    """
+    ## save config - if and when to save checkpoints ##
+    config.experiment.save.enabled = True                       # whether model saving should be enabled or disabled
+    config.experiment.save.every_n_seconds = None               # save model every n seconds (set to None to disable)
+    config.experiment.save.every_n_epochs = 3                   # save model every n epochs (set to None to disable)
+    config.experiment.save.epochs = []                          # save model on these specific epochs
+    config.experiment.save.on_best_validation = True            # save models that achieve best validation score
+    config.experiment.save.on_best_rollout_return = False       # save models that achieve best rollout return
+    config.experiment.save.on_best_rollout_success_rate = True  # save models that achieve best success rate
+
+    # epoch definition - if not None, set an epoch to be this many gradient steps, else the full dataset size will be used
+    config.experiment.epoch_every_n_steps = None                # each epoch is a full pass through the dataset
+    config.experiment.validation_epoch_every_n_steps = 10       # each validation epoch is 10 gradient steps
+
+    # envs to evaluate model on (assuming rollouts are enabled), to override the metadata stored in dataset
+    config.experiment.env = None                                # no need to set this (unless you want to override)
+    config.experiment.additional_envs = None                    # additional environments that should get evaluated
+
+    ## rendering config ##
+    config.experiment.render = False                            # render on-screen or not
+    config.experiment.render_video = True                       # render evaluation rollouts to videos
+    config.experiment.keep_all_videos = False                   # save all videos, instead of only saving those for saved model checkpoints
+    config.experiment.video_skip = 5                            # render video frame every n environment steps during rollout
+
+    ## evaluation rollout config ##
+    config.experiment.rollout.enabled = True                    # enable evaluation rollouts
+    config.experiment.rollout.n = 30                            # number of rollouts per evaluation
+    config.experiment.rollout.horizon = 1500                    # maximum number of env steps per rollout
+    config.experiment.rollout.rate = 3                          # 
do rollouts every @rate epochs
+    config.experiment.rollout.warmstart = 0                     # number of epochs to wait before starting rollouts
+    config.experiment.rollout.terminate_on_success = True       # end rollout early after task success
+
+    ## dataset loader config ##
+
+    # num workers for loading data - generally set to 0 for low-dim datasets, and 2 for image datasets
+    config.train.num_data_workers = 2           # use 2 data workers, since this is an image dataset
+
+    # One of ["all", "low_dim", or None]. Set to "all" to cache entire hdf5 in memory - this is
+    # by far the fastest for data loading. Set to "low_dim" to cache all non-image data. Set
+    # to None to use no caching - in this case, every batch sample is retrieved via file i/o.
+    # You should almost never set this to None, even for large image datasets.
+    config.train.hdf5_cache_mode = "low_dim"
+
+    config.train.hdf5_use_swmr = True           # used for parallel data loading
+
+    # if true, normalize observations at train and test time, using the global mean and standard deviation
+    # of each observation in each dimension, computed across the training set. See SequenceDataset.normalize_obs
+    # in utils/dataset.py for more information.
+    config.train.hdf5_normalize_obs = False     # no obs normalization
+
+    # if provided, demonstrations are filtered by the list of demo keys under "mask/@hdf5_filter_key"
+    config.train.hdf5_filter_key = None         # by default, use no filter key
+
+    # fetch sequences of length 50 from dataset for RNN training
+    config.train.seq_length = 50
+
+    # keys from hdf5 to load per demonstration, besides "obs" and "next_obs"
+    config.train.dataset_keys = (
+        "actions",
+        "rewards",
+        "dones",
+    )
+
+    # one of [None, "last"] - set to "last" to include goal observations in each batch
+    config.train.goal_mode = "last"             # use the last observation of each trajectory as the goal
+
+    ## learning config ##
+    config.train.cuda = True                    # try to use GPU (if present) or not
+    config.train.batch_size = 4                 # batch size
+    config.train.num_epochs = 31                # number of training epochs
+    config.train.seed = 1                       # seed for training
+
+
+    ### Observation Config ###
+    config.observation.modalities.obs.low_dim = [               # specify low-dim observations for agent
+        "proprio",
+    ]
+    config.observation.modalities.obs.rgb = [
+        "rgb",
+        "rgb_wrist",
+    ]
+
+    config.observation.modalities.obs.depth = [
+        "depth",
+        "depth_wrist",
+    ]
+    config.observation.modalities.obs.scan = [
+        "scan",
+    ]
+    config.observation.modalities.goal.low_dim = []             # no low-dim goals
+    config.observation.modalities.goal.rgb = []                 # no rgb image goals
+
+    ### Algo Config ###
+
+    # optimization parameters
+    config.algo.optim_params.policy.learning_rate.initial = 1e-4        # policy learning rate
+    config.algo.optim_params.policy.learning_rate.decay_factor = 0.1    # factor to decay LR by (if epoch schedule non-empty)
+    config.algo.optim_params.policy.learning_rate.epoch_schedule = []   # epochs where LR decay occurs
+    config.algo.optim_params.policy.regularization.L2 = 0.00            # L2 regularization strength
+
+    # loss weights
+    config.algo.loss.l2_weight = 1.0    # L2 loss weight
+    config.algo.loss.l1_weight = 0.0    # L1 loss weight
+    config.algo.loss.cos_weight = 0.0   # cosine loss weight
+
+    # MLP network architecture (layers after observation encoder and RNN, if present)
+    config.algo.actor_layer_dims = (300, 400)   # MLP layers between RNN layer and action output
+
+    # stochastic GMM policy
+    config.algo.gmm.enabled = True      # enable GMM policy - policy outputs GMM action distribution
+    config.algo.gmm.num_modes = 5       # number of GMM modes
+    config.algo.gmm.min_std = 0.01      # minimum std output from 
network + config.algo.gmm.std_activation = "softplus" # activation to use for std output from policy net + config.algo.gmm.low_noise_eval = True # low-std at test-time + + # rnn policy config + config.algo.rnn.enabled = True # enable RNN policy + config.algo.rnn.horizon = 50 # unroll length for RNN - should usually match train.seq_length + config.algo.rnn.hidden_dim = 1200 # hidden dimension size + config.algo.rnn.rnn_type = "LSTM" # rnn type - one of "LSTM" or "GRU" + config.algo.rnn.num_layers = 2 # number of RNN layers that are stacked + config.algo.rnn.open_loop = False # if True, action predictions are only based on a single observation (not sequence) + hidden state + config.algo.rnn.kwargs.bidirectional = False # rnn kwargs + + return config + + +# Valid dataset types to use +DATASET_TYPES = { + "robosuite": { + "default_dataset_func": TestUtils.example_dataset_path, + "hp": robosuite_hyperparameters, + }, + "momart": { + "default_dataset_func": TestUtils.example_momart_dataset_path, + "hp": momart_hyperparameters, + }, +} + + +def get_config(dataset_type="robosuite", dataset_path=None, output_dir=None, debug=False): + """ + Construct config for training. + + Args: + dataset_type (str): Dataset type to use. Valid options are DATASET_TYPES. Default is "robosuite" + dataset_path (str or None): path to hdf5 dataset. Pass None to use a small default dataset. + output_dir (str): path to output folder, where logs, model checkpoints, and videos + will be written. If it doesn't exist, the directory will be created. Pass + None to use a default directory in /tmp. + debug (bool): if True, shrink training and rollout times to test a full training + run quickly. + """ + assert dataset_type in DATASET_TYPES, \ + f"Invalid dataset type. Valid options are: {list(DATASET_TYPES.keys())}, got: {dataset_type}" + + # handle args + if dataset_path is None: + # small dataset with a handful of trajectories + dataset_path = DATASET_TYPES[dataset_type]["default_dataset_func"]() + + if output_dir is None: + # default output directory created in /tmp + output_dir = TestUtils.temp_model_dir_path() + + # make default BC config + config = config_factory(algo_name="bc") + + ### Experiment Config ### + config.experiment.name = f"{dataset_type}_bc_rnn_example" # name of experiment used to make log files + config.experiment.validate = True # whether to do validation or not + config.experiment.logging.terminal_output_to_txt = False # whether to log stdout to txt file + config.experiment.logging.log_tb = True # enable tensorboard logging + + ### Train Config ### + config.train.data = dataset_path # path to hdf5 dataset + + # Write all results to this directory. A new folder with the timestamp will be created + # in this directory, and it will contain three subfolders - "log", "models", and "videos". + # The "log" directory will contain tensorboard and stdout txt logs. The "models" directory + # will contain saved model checkpoints. The "videos" directory contains evaluation rollout + # videos. 
+    config.train.output_dir = output_dir    # path to output folder
+
+    # Load default hyperparameters based on dataset type
+    config = DATASET_TYPES[dataset_type]["hp"](config)
+
     # maybe make training length small for a quick run
     if debug:
@@ -248,10 +417,29 @@ def get_config(dataset_path=None, output_dir=None, debug=False):
         help="set this flag to run a quick training run for debugging purposes"
     )
 
+    # type
+    parser.add_argument(
+        "--dataset_type",
+        type=str,
+        default="robosuite",
+        choices=list(DATASET_TYPES.keys()),
+        help=f"Dataset type to use. This will determine the default hyperparameter settings to use for training. "
+             f"Valid options are: {list(DATASET_TYPES.keys())}. Default is robosuite."
+    )
+
     args = parser.parse_args()
 
+    # Optionally turn debug mode on
+    if args.debug:
+        Macros.DEBUG = True
+
     # config for training
-    config = get_config(dataset_path=args.dataset, output_dir=args.output, debug=args.debug)
+    config = get_config(
+        dataset_type=args.dataset_type,
+        dataset_path=args.dataset,
+        output_dir=args.output,
+        debug=args.debug
+    )
 
     # set torch device
     device = TorchUtils.get_torch_device(try_to_use_cuda=config.train.cuda)
diff --git a/robomimic/__init__.py b/robomimic/__init__.py
index 7f9ade24..f5d4c357 100644
--- a/robomimic/__init__.py
+++ b/robomimic/__init__.py
@@ -1,7 +1,10 @@
-__version__ = "0.1.0"
+__version__ = "0.2.0"
 
-# stores released dataset links and rollout horizons in global dictionary. Structure is given below:
+# stores released dataset links and rollout horizons in global dictionary.
+# Structure is given below for each type of dataset:
+
+# robosuite / real
 # {
 #   task:
 #       dataset_type:
@@ -14,6 +17,17 @@
 # }
 DATASET_REGISTRY = {}
 
+# momart
+# {
+#   task:
+#       dataset_type:
+#           url: link
+#           size: value
+#       ...
+#   ...
+# }
+MOMART_DATASET_REGISTRY = {}
+
 
 def register_dataset_link(task, dataset_type, hdf5_type, link, horizon):
     """
@@ -88,4 +102,55 @@ def register_all_links():
             link="http://downloads.cs.stanford.edu/downloads/rt_benchmark/can/paired/image.hdf5")
 
 
-register_all_links()
\ No newline at end of file
+def register_momart_dataset_link(task, dataset_type, link, dataset_size):
+    """
+    Helper function to register dataset link in global dictionary.
+    Also takes a @dataset_size parameter - this corresponds to the
+    size of the dataset, in GB.
+
+    Args:
+        task (str): name of task for this dataset
+        dataset_type (str): type of dataset (usually identifies the dataset source)
+        link (str): download link for the dataset
+        dataset_size (float): size of the dataset, in GB
+    """
+    if task not in MOMART_DATASET_REGISTRY:
+        MOMART_DATASET_REGISTRY[task] = {}
+    if dataset_type not in MOMART_DATASET_REGISTRY[task]:
+        MOMART_DATASET_REGISTRY[task][dataset_type] = {}
+    MOMART_DATASET_REGISTRY[task][dataset_type] = dict(url=link, size=dataset_size)
+
+
+def register_all_momart_links():
+    """
+    Record all dataset links in this function. 
+ """ + # all tasks, mapped to their [exp, sub, gen, sam] sizes + momart_tasks = { + "table_setup_from_dishwasher": [14, 14, 3.3, 0.6], + "table_setup_from_dresser": [16, 17, 3.1, 0.7], + "table_cleanup_to_dishwasher": [23, 36, 5.3, 1.1], + "table_cleanup_to_sink": [17, 28, 2.9, 0.8], + "unload_dishwasher": [21, 27, 5.4, 1.0], + } + + momart_dataset_types = [ + "expert", + "suboptimal", + "generalize", + "sample", + ] + + # Iterate over all combos and register the link + for task, dataset_sizes in momart_tasks.items(): + for dataset_type, dataset_size in zip(momart_dataset_types, dataset_sizes): + register_momart_dataset_link( + task=task, + dataset_type=dataset_type, + link=f"http://downloads.cs.stanford.edu/downloads/rt_mm/{dataset_type}/{task}_{dataset_type}.hdf5", + dataset_size=dataset_size, + ) + + +register_all_links() +register_all_momart_links() diff --git a/robomimic/algo/algo.py b/robomimic/algo/algo.py index c8d4cf09..1a4a835f 100644 --- a/robomimic/algo/algo.py +++ b/robomimic/algo/algo.py @@ -45,7 +45,7 @@ def algo_name_to_factory_func(algo_name): return REGISTERED_ALGO_FACTORY_FUNCS[algo_name] -def algo_factory(algo_name, config, modality_shapes, ac_dim, device): +def algo_factory(algo_name, config, obs_key_shapes, ac_dim, device): """ Factory function for creating algorithms based on the algorithm name and config. @@ -54,7 +54,7 @@ def algo_factory(algo_name, config, modality_shapes, ac_dim, device): config (BaseConfig instance): config object - modality_shapes (OrderedDict): dictionary that maps modality keys to shapes + obs_key_shapes (OrderedDict): dictionary that maps observation keys to shapes ac_dim (int): dimension of action space @@ -73,7 +73,7 @@ def algo_factory(algo_name, config, modality_shapes, ac_dim, device): algo_config=config.algo, obs_config=config.observation, global_config=config, - modality_shapes=modality_shapes, + obs_key_shapes=obs_key_shapes, ac_dim=ac_dim, device=device, **algo_kwargs @@ -92,7 +92,7 @@ def __init__( algo_config, obs_config, global_config, - modality_shapes, + obs_key_shapes, ac_dim, device ): @@ -106,7 +106,7 @@ def __init__( global_config (Config object): global training config - modality_shapes (OrderedDict): dictionary that maps modality keys to shapes + obs_key_shapes (OrderedDict): dictionary that maps observation keys to shapes ac_dim (int): dimension of action space @@ -119,36 +119,39 @@ def __init__( self.ac_dim = ac_dim self.device = device - self.modality_shapes = modality_shapes + self.obs_key_shapes = obs_key_shapes self.nets = nn.ModuleDict() - self._create_shapes(obs_config.modalities, modality_shapes) + self._create_shapes(obs_config.modalities, obs_key_shapes) self._create_networks() self._create_optimizers() assert isinstance(self.nets, nn.ModuleDict) - def _create_shapes(self, modalities, modality_shapes): + def _create_shapes(self, obs_keys, obs_key_shapes): """ Create obs_shapes, goal_shapes, and subgoal_shapes dictionaries, to make it - easy for this algorithm object to keep track of modality shapes. Each dictionary - maps modality to shape. + easy for this algorithm object to keep track of observation key shapes. Each dictionary + maps observation key to shape. 
            Args:
-            modalities (dict): dict of required modalities for this training run (usually
-                specified by the obs config), e.g., {"obs": ["image", "proprio"], "goal": ["proprio"]}
-            modality_shapes (dict): dict of modality shapes, e.g., {"image": [3, 224, 224]}
+            obs_keys (dict): dict of required observation keys for this training run (usually
+                specified by the obs config), e.g., {"obs": ["rgb", "proprio"], "goal": ["proprio"]}
+            obs_key_shapes (dict): dict of observation key shapes, e.g., {"rgb": [3, 224, 224]}
         """
         # determine shapes
         self.obs_shapes = OrderedDict()
         self.goal_shapes = OrderedDict()
         self.subgoal_shapes = OrderedDict()
-        for k in modality_shapes:
-            if "obs" in self.obs_config.modalities and k in (self.obs_config.modalities.obs.low_dim + self.obs_config.modalities.obs.image):
-                self.obs_shapes[k] = modality_shapes[k]
-            if "goal" in self.obs_config.modalities and k in (self.obs_config.modalities.goal.low_dim + self.obs_config.modalities.goal.image):
-                self.goal_shapes[k] = modality_shapes[k]
-            if "subgoal" in self.obs_config.modalities and k in (self.obs_config.modalities.subgoal.low_dim + self.obs_config.modalities.subgoal.image):
-                self.subgoal_shapes[k] = modality_shapes[k]
+
+        # We check across all modality groups (obs, goal, subgoal), and see if the inputted observation key exists
+        # across all modalities specified in the config. If so, we store its corresponding shape internally
+        for k in obs_key_shapes:
+            if "obs" in self.obs_config.modalities and k in [obs_key for modality in self.obs_config.modalities.obs.values() for obs_key in modality]:
+                self.obs_shapes[k] = obs_key_shapes[k]
+            if "goal" in self.obs_config.modalities and k in [obs_key for modality in self.obs_config.modalities.goal.values() for obs_key in modality]:
+                self.goal_shapes[k] = obs_key_shapes[k]
+            if "subgoal" in self.obs_config.modalities and k in [obs_key for modality in self.obs_config.modalities.subgoal.values() for obs_key in modality]:
+                self.subgoal_shapes[k] = obs_key_shapes[k]
 
     def _create_networks(self):
         """
@@ -424,7 +427,7 @@ def __init__(self, policy, obs_normalization_stats=None):
             policy (Algo instance): @Algo object to wrap to prepare for rollouts
 
             obs_normalization_stats (dict): optionally pass a dictionary for observation
-                normalization. This should map observation modality keys to dicts
+                normalization. This should map observation keys to dicts
                 with a "mean" and "std" of shape (1, ...) where ... is the default
                 shape for the observation. 
""" diff --git a/robomimic/algo/bc.py b/robomimic/algo/bc.py index 158e3e85..4853f75a 100644 --- a/robomimic/algo/bc.py +++ b/robomimic/algo/bc.py @@ -14,6 +14,7 @@ import robomimic.utils.loss_utils as LossUtils import robomimic.utils.tensor_utils as TensorUtils import robomimic.utils.torch_utils as TorchUtils +import robomimic.utils.obs_utils as ObsUtils from robomimic.algo import register_algo_factory_func, PolicyAlgo @@ -65,7 +66,7 @@ def _create_networks(self): goal_shapes=self.goal_shapes, ac_dim=self.ac_dim, mlp_layer_dims=self.algo_config.actor_layer_dims, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), ) self.nets = self.nets.float().to(self.device) @@ -244,7 +245,7 @@ def _create_networks(self): std_limits=(self.algo_config.gaussian.min_std, 7.5), std_activation=self.algo_config.gaussian.std_activation, low_noise_eval=self.algo_config.gaussian.low_noise_eval, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), ) self.nets = self.nets.float().to(self.device) @@ -336,7 +337,7 @@ def _create_networks(self): min_std=self.algo_config.gmm.min_std, std_activation=self.algo_config.gmm.std_activation, low_noise_eval=self.algo_config.gmm.low_noise_eval, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), ) self.nets = self.nets.float().to(self.device) @@ -356,8 +357,8 @@ def _create_networks(self): goal_shapes=self.goal_shapes, ac_dim=self.ac_dim, device=self.device, + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), **VAENets.vae_args_from_config(self.algo_config.vae), - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), ) self.nets = self.nets.float().to(self.device) @@ -466,8 +467,8 @@ def _create_networks(self): goal_shapes=self.goal_shapes, ac_dim=self.ac_dim, mlp_layer_dims=self.algo_config.actor_layer_dims, + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), **BaseNets.rnn_args_from_config(self.algo_config.rnn), - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), ) self._rnn_hidden_state = None @@ -566,8 +567,8 @@ def _create_networks(self): min_std=self.algo_config.gmm.min_std, std_activation=self.algo_config.gmm.std_activation, low_noise_eval=self.algo_config.gmm.low_noise_eval, + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), **BaseNets.rnn_args_from_config(self.algo_config.rnn), - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), ) self._rnn_hidden_state = None diff --git a/robomimic/algo/bcq.py b/robomimic/algo/bcq.py index ab20dfec..27123b7f 100644 --- a/robomimic/algo/bcq.py +++ b/robomimic/algo/bcq.py @@ -90,7 +90,7 @@ def _create_critics(self): mlp_layer_dims=self.algo_config.critic.layer_dims, value_bounds=self.algo_config.critic.value_bounds, goal_shapes=self.goal_shapes, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), ) # Q network ensemble and target ensemble @@ -115,8 +115,8 @@ def _create_action_sampler(self): ac_dim=self.ac_dim, device=self.device, goal_shapes=self.goal_shapes, + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), **VAENets.vae_args_from_config(self.algo_config.action_sampler.vae), - 
**ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), ) def _create_actor(self): @@ -131,7 +131,7 @@ def _create_actor(self): ac_dim=self.ac_dim, mlp_layer_dims=self.algo_config.actor.layer_dims, perturbation_scale=self.algo_config.actor.perturbation_scale, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), ) self.nets["actor"] = actor_class(**actor_args) @@ -848,7 +848,7 @@ def _create_action_sampler(self): min_std=self.algo_config.action_sampler.gmm.min_std, std_activation=self.algo_config.action_sampler.gmm.std_activation, low_noise_eval=self.algo_config.action_sampler.gmm.low_noise_eval, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), ) def _train_action_sampler_on_batch(self, batch, epoch, no_backprop=False): @@ -927,7 +927,7 @@ def _create_critics(self): value_bounds=self.algo_config.critic.value_bounds, num_atoms=self.algo_config.critic.distributional.num_atoms, goal_shapes=self.goal_shapes, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), ) # Q network ensemble and target ensemble diff --git a/robomimic/algo/cql.py b/robomimic/algo/cql.py index 7d6a3750..ef41812a 100644 --- a/robomimic/algo/cql.py +++ b/robomimic/algo/cql.py @@ -105,8 +105,8 @@ def _create_networks(self): goal_shapes=self.goal_shapes, ac_dim=self.ac_dim, mlp_layer_dims=self.algo_config.actor.layer_dims, + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), **actor_args, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), ) # Critics @@ -120,7 +120,7 @@ def _create_networks(self): mlp_layer_dims=self.algo_config.critic.layer_dims, value_bounds=self.algo_config.critic.value_bounds, goal_shapes=self.goal_shapes, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), ) net_list.append(critic) diff --git a/robomimic/algo/gl.py b/robomimic/algo/gl.py index c8bc5fc2..6b243b48 100644 --- a/robomimic/algo/gl.py +++ b/robomimic/algo/gl.py @@ -43,7 +43,7 @@ def __init__( algo_config, obs_config, global_config, - modality_shapes, + obs_key_shapes, ac_dim, device ): @@ -57,7 +57,7 @@ def __init__( global_config (Config object): global training config - modality_shapes (OrderedDict): dictionary that maps modality keys to shapes + obs_key_shapes (OrderedDict): dictionary that maps observation keys to shapes ac_dim (int): dimension of action space @@ -69,7 +69,7 @@ def __init__( algo_config=algo_config, obs_config=obs_config, global_config=global_config, - modality_shapes=modality_shapes, + obs_key_shapes=obs_key_shapes, ac_dim=ac_dim, device=device ) @@ -90,7 +90,7 @@ def _create_networks(self): input_obs_group_shapes=obs_group_shapes, output_shapes=self.subgoal_shapes, layer_dims=self.algo_config.ae.planner_layer_dims, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), ) self.nets = self.nets.float().to(self.device) @@ -155,7 +155,7 @@ def train_on_batch(self, batch, epoch, validate=False): # predict subgoal observations with goal network pred_subgoals = self.nets["goal_network"](obs=batch["obs"], goal=batch["goal_obs"]) - # compute loss as L2 error for each modality + # compute loss as 
L2 error for each observation key losses = OrderedDict() target_subgoals = batch["target_subgoals"] # targets for network prediction goal_loss = 0. @@ -268,8 +268,8 @@ def _create_networks(self): condition_shapes=self.obs_shapes, goal_shapes=self.goal_shapes, device=self.device, + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), **VAENets.vae_args_from_config(self.algo_config.vae), - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), ) self.nets = self.nets.float().to(self.device) @@ -508,7 +508,7 @@ def __init__( algo_config, obs_config, global_config, - modality_shapes, + obs_key_shapes, ac_dim, device, @@ -527,7 +527,7 @@ def __init__( global_config (Config object); global config - modality_shapes (OrderedDict): dictionary that maps input/output modality keys to shapes + obs_key_shapes (OrderedDict): dictionary that maps input/output observation keys to shapes ac_dim (int): action dimension @@ -544,7 +544,7 @@ def __init__( algo_config=algo_config.planner, obs_config=obs_config.planner, global_config=global_config, - modality_shapes=modality_shapes, + obs_key_shapes=obs_key_shapes, ac_dim=ac_dim, device=device ) @@ -553,7 +553,7 @@ def __init__( algo_config=algo_config.value, obs_config=obs_config.value, global_config=global_config, - modality_shapes=modality_shapes, + obs_key_shapes=obs_key_shapes, ac_dim=ac_dim, device=device ) diff --git a/robomimic/algo/hbc.py b/robomimic/algo/hbc.py index 500211f4..7c540454 100644 --- a/robomimic/algo/hbc.py +++ b/robomimic/algo/hbc.py @@ -46,7 +46,7 @@ def __init__( algo_config, obs_config, global_config, - modality_shapes, + obs_key_shapes, ac_dim, device, ): @@ -64,7 +64,7 @@ def __init__( global_config (Config object): global training config - modality_shapes (dict): dictionary that maps input/output modality keys to shapes + obs_key_shapes (dict): dictionary that maps input/output observation keys to shapes ac_dim (int): action dimension @@ -90,7 +90,7 @@ def __init__( algo_config=algo_config.planner, obs_config=obs_config.planner, global_config=global_config, - modality_shapes=modality_shapes, + obs_key_shapes=obs_key_shapes, ac_dim=ac_dim, device=device ) @@ -102,29 +102,26 @@ def __init__( self.actor_goal_shapes = OrderedDict(latent_subgoal=(self.planner.algo_config.vae.latent_dim,)) # only for the actor: override goal modalities and shapes to match the subgoal set by the planner - actor_modality_shapes = deepcopy(modality_shapes) - # make sure we are not modifying existing modality shapes + actor_obs_key_shapes = deepcopy(obs_key_shapes) + # make sure we are not modifying existing observation key shapes for k in self.actor_goal_shapes: - if k in actor_modality_shapes: - assert actor_modality_shapes[k] == self.actor_goal_shapes[k] - actor_modality_shapes.update(self.actor_goal_shapes) + if k in actor_obs_key_shapes: + assert actor_obs_key_shapes[k] == self.actor_goal_shapes[k] + actor_obs_key_shapes.update(self.actor_goal_shapes) - goal_modalities = {"low_dim": [], "image": []} + goal_obs_keys = {obs_modality: [] for obs_modality in ObsUtils.OBS_MODALITY_CLASSES.keys()} for k in self.actor_goal_shapes.keys(): - if ObsUtils.key_is_image(k): - goal_modalities["image"].append(k) - else: - goal_modalities["low_dim"].append(k) + goal_obs_keys[ObsUtils.OBS_KEYS_TO_MODALITIES[k]].append(k) actor_obs_config = deepcopy(obs_config.actor) with actor_obs_config.unlocked(): - actor_obs_config["goal"] = Config(**goal_modalities) + actor_obs_config["goal"] = Config(**goal_obs_keys) self.actor = 
policy_algo_class( algo_config=algo_config.actor, obs_config=actor_obs_config, global_config=global_config, - modality_shapes=actor_modality_shapes, + obs_key_shapes=actor_obs_key_shapes, ac_dim=ac_dim, device=device, ) @@ -295,9 +292,9 @@ def current_subgoal(self, sg): for k, v in sg.items(): if not self.algo_config.latent_subgoal.enabled: # subgoal should only match subgoal shapes if not using latent subgoals - assert v.shape[1:] == self.planner.subgoal_shapes[k] + assert list(v.shape[1:]) == list(self.planner.subgoal_shapes[k]) # subgoal shapes should always match actor goal shapes - assert v.shape[1:] == self.actor_goal_shapes[k] + assert list(v.shape[1:]) == list(self.actor_goal_shapes[k]) self._current_subgoal = { k : sg[k].clone() for k in sg } def get_action(self, obs_dict, goal_dict=None): diff --git a/robomimic/algo/iris.py b/robomimic/algo/iris.py index 80f7dacc..de79bd71 100644 --- a/robomimic/algo/iris.py +++ b/robomimic/algo/iris.py @@ -43,7 +43,7 @@ def __init__( algo_config, obs_config, global_config, - modality_shapes, + obs_key_shapes, ac_dim, device, ): @@ -61,7 +61,7 @@ def __init__( global_config (Config object): global training config - modality_shapes (OrderedDict): dictionary that maps input/output modality keys to shapes + obs_key_shapes (OrderedDict): dictionary that maps input/output observation keys to shapes ac_dim (int): action dimension @@ -89,7 +89,7 @@ def __init__( algo_config=algo_config.value_planner, obs_config=obs_config.value_planner, global_config=global_config, - modality_shapes=modality_shapes, + obs_key_shapes=obs_key_shapes, ac_dim=ac_dim, device=device ) @@ -98,19 +98,16 @@ def __init__( assert not algo_config.latent_subgoal.enabled, "IRIS does not support latent subgoals" # only for the actor: override goal modalities and shapes to match the subgoal set by the planner - actor_modality_shapes = deepcopy(modality_shapes) - # make sure we are not modifying existing modality shapes + actor_obs_key_shapes = deepcopy(obs_key_shapes) + # make sure we are not modifying existing observation key shapes for k in self.actor_goal_shapes: - if k in actor_modality_shapes: - assert actor_modality_shapes[k] == self.actor_goal_shapes[k] - actor_modality_shapes.update(self.actor_goal_shapes) + if k in actor_obs_key_shapes: + assert actor_obs_key_shapes[k] == self.actor_goal_shapes[k] + actor_obs_key_shapes.update(self.actor_goal_shapes) - goal_modalities = {"low_dim": [], "image": []} + goal_modalities = {obs_modality: [] for obs_modality in ObsUtils.OBS_MODALITY_CLASSES.keys()} for k in self.actor_goal_shapes.keys(): - if ObsUtils.key_is_image(k): - goal_modalities["image"].append(k) - else: - goal_modalities["low_dim"].append(k) + goal_modalities[ObsUtils.OBS_KEYS_TO_MODALITIES[k]].append(k) actor_obs_config = deepcopy(obs_config.actor) with actor_obs_config.unlocked(): @@ -120,7 +117,7 @@ def __init__( algo_config=algo_config.actor, obs_config=actor_obs_config, global_config=global_config, - modality_shapes=actor_modality_shapes, + obs_key_shapes=actor_obs_key_shapes, ac_dim=ac_dim, device=device ) diff --git a/robomimic/algo/td3_bc.py b/robomimic/algo/td3_bc.py index ef7a5238..fb8b21c6 100644 --- a/robomimic/algo/td3_bc.py +++ b/robomimic/algo/td3_bc.py @@ -94,7 +94,7 @@ def _create_critics(self): mlp_layer_dims=self.algo_config.critic.layer_dims, value_bounds=self.algo_config.critic.value_bounds, goal_shapes=self.goal_shapes, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), + 
encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), ) # Q network ensemble and target ensemble @@ -117,7 +117,7 @@ def _create_actor(self): goal_shapes=self.goal_shapes, ac_dim=self.ac_dim, mlp_layer_dims=self.algo_config.actor.layer_dims, - **ObsNets.obs_encoder_args_from_config(self.obs_config.encoder), + encoder_kwargs=ObsUtils.obs_encoder_kwargs_from_config(self.obs_config.encoder), ) self.nets["actor"] = actor_class(**actor_args) diff --git a/robomimic/config/base_config.py b/robomimic/config/base_config.py index 4b01a638..664736f1 100644 --- a/robomimic/config/base_config.py +++ b/robomimic/config/base_config.py @@ -211,49 +211,91 @@ def observation_config(self): "robot0_gripper_qpos", "object", ] - self.observation.modalities.obs.image = [] # specify image observations for agent - self.observation.modalities.goal.low_dim = [] # specify low-dim goal bservations to condition agent on - self.observation.modalities.goal.image = [] # specify image goal bservations to condition agent on - - # observation encoder architecture - applies to all networks that take observation dicts as input - self.observation.encoder.visual_core = 'ResNet18Conv' # visual core network backbone for image observations (unused if no image observations) - # kwargs for visual core class specified above - self.observation.encoder.visual_core_kwargs.pretrained = False - self.observation.encoder.visual_core_kwargs.input_coord_conv = False - self.observation.encoder.visual_core_kwargs.do_not_lock_keys() - - # observation randomizer class - set to None to use no randomization, or 'CropRandomizer' to use crop randomization - self.observation.encoder.obs_randomizer_class = None - - # kwargs for observation randomizers (for the CropRandomizer, this is size and number of crops) - self.observation.encoder.obs_randomizer_kwargs.crop_height = 76 - self.observation.encoder.obs_randomizer_kwargs.crop_width = 76 - self.observation.encoder.obs_randomizer_kwargs.num_crops = 1 - self.observation.encoder.obs_randomizer_kwargs.pos_enc = False - self.observation.encoder.obs_randomizer_kwargs.do_not_lock_keys() - - self.observation.encoder.visual_feature_dimension = 64 # images are encoded into feature vectors of this size - self.observation.encoder.use_spatial_softmax = True # whether to use spatial softmax layer at end of conv layers - - # kwargs for spatial softmax layer - self.observation.encoder.spatial_softmax_kwargs.num_kp = 32 - self.observation.encoder.spatial_softmax_kwargs.learnable_temperature = False - self.observation.encoder.spatial_softmax_kwargs.temperature = 1.0 - self.observation.encoder.spatial_softmax_kwargs.noise_std = 0.0 - self.observation.encoder.spatial_softmax_kwargs.do_not_lock_keys() - + self.observation.modalities.obs.rgb = [] # specify rgb image observations for agent + self.observation.modalities.obs.depth = [] + self.observation.modalities.obs.scan = [] + self.observation.modalities.goal.low_dim = [] # specify low-dim goal observations to condition agent on + self.observation.modalities.goal.rgb = [] # specify rgb image goal observations to condition agent on + self.observation.modalities.goal.depth = [] + self.observation.modalities.goal.scan = [] + self.observation.modalities.obs.do_not_lock_keys() + self.observation.modalities.goal.do_not_lock_keys() + + # observation encoder architectures (per obs modality) + # This applies to all networks that take observation dicts as input + + # =============== Low Dim default encoder (no encoder) =============== + 
self.observation.encoder.low_dim.core_class = None + self.observation.encoder.low_dim.core_kwargs = Config() # No kwargs by default + self.observation.encoder.low_dim.core_kwargs.do_not_lock_keys() + + # Low Dim: Obs Randomizer settings + self.observation.encoder.low_dim.obs_randomizer_class = None + self.observation.encoder.low_dim.obs_randomizer_kwargs = Config() # No kwargs by default + self.observation.encoder.low_dim.obs_randomizer_kwargs.do_not_lock_keys() + + # =============== RGB default encoder (ResNet backbone + linear layer output) =============== + self.observation.encoder.rgb.core_class = "VisualCore" + self.observation.encoder.rgb.core_kwargs.feature_dimension = 64 + self.observation.encoder.rgb.core_kwargs.flatten = True + self.observation.encoder.rgb.core_kwargs.backbone_class = "ResNet18Conv" + self.observation.encoder.rgb.core_kwargs.backbone_kwargs.pretrained = False + self.observation.encoder.rgb.core_kwargs.backbone_kwargs.input_coord_conv = False + self.observation.encoder.rgb.core_kwargs.backbone_kwargs.do_not_lock_keys() + self.observation.encoder.rgb.core_kwargs.pool_class = "SpatialSoftmax" # Alternate options are "SpatialMeanPool" or None (no pooling) + self.observation.encoder.rgb.core_kwargs.pool_kwargs.num_kp = 32 # Default arguments for "SpatialSoftmax" + self.observation.encoder.rgb.core_kwargs.pool_kwargs.learnable_temperature = False # Default arguments for "SpatialSoftmax" + self.observation.encoder.rgb.core_kwargs.pool_kwargs.temperature = 1.0 # Default arguments for "SpatialSoftmax" + self.observation.encoder.rgb.core_kwargs.pool_kwargs.noise_std = 0.0 # Default arguments for "SpatialSoftmax" + self.observation.encoder.rgb.core_kwargs.pool_kwargs.output_variance = False # Default arguments for "SpatialSoftmax" + self.observation.encoder.rgb.core_kwargs.pool_kwargs.do_not_lock_keys() + + # RGB: Obs Randomizer settings + self.observation.encoder.rgb.obs_randomizer_class = None # Can set to 'CropRandomizer' to use crop randomization + self.observation.encoder.rgb.obs_randomizer_kwargs.crop_height = 76 # Default arguments for "CropRandomizer" + self.observation.encoder.rgb.obs_randomizer_kwargs.crop_width = 76 # Default arguments for "CropRandomizer" + self.observation.encoder.rgb.obs_randomizer_kwargs.num_crops = 1 # Default arguments for "CropRandomizer" + self.observation.encoder.rgb.obs_randomizer_kwargs.pos_enc = False # Default arguments for "CropRandomizer" + self.observation.encoder.rgb.obs_randomizer_kwargs.do_not_lock_keys() + + # Allow for other custom modalities to be specified + self.observation.encoder.do_not_lock_keys() + + # =============== Depth default encoder (same as rgb) =============== + self.observation.encoder.depth = deepcopy(self.observation.encoder.rgb) + + # =============== Scan default encoder (Conv1d backbone + linear layer output) =============== + self.observation.encoder.scan = deepcopy(self.observation.encoder.rgb) + self.observation.encoder.scan.core_kwargs.pop("backbone_class") + self.observation.encoder.scan.core_kwargs.pop("backbone_kwargs") + + # Scan: Modify the core class + kwargs, otherwise, is same as rgb encoder + self.observation.encoder.scan.core_class = "ScanCore" + self.observation.encoder.scan.core_kwargs.conv_activation = "relu" + self.observation.encoder.scan.core_kwargs.conv_kwargs.out_channels = [32, 64, 64] + self.observation.encoder.scan.core_kwargs.conv_kwargs.kernel_size = [8, 4, 2] + self.observation.encoder.scan.core_kwargs.conv_kwargs.stride = [4, 2, 1] @property def use_goals(self): # whether the 
agent is goal-conditioned - return len(self.observation.modalities.goal.low_dim + self.observation.modalities.goal.image) > 0 + return len([obs_key for modality in self.observation.modalities.goal.values() for obs_key in modality]) > 0 @property - def all_modalities(self): + def all_obs_keys(self): + """ + This grabs the union of observation keys over all modalities (e.g.: low_dim, rgb, depth, etc.) and over all + modality groups (e.g: obs, goal, subgoal, etc...) + + Returns: + n-array: all observation keys used for this model + """ # pool all modalities - return sorted(tuple(set( - self.observation.modalities.obs.low_dim + - self.observation.modalities.obs.image + - self.observation.modalities.goal.low_dim + - self.observation.modalities.goal.image - ))) + return sorted(tuple(set([ + obs_key for group in [ + self.observation.modalities.obs.values(), + self.observation.modalities.goal.values() + ] + for modality in group + for obs_key in modality + ]))) diff --git a/robomimic/config/gl_config.py b/robomimic/config/gl_config.py index 7e507f6d..939103e6 100644 --- a/robomimic/config/gl_config.py +++ b/robomimic/config/gl_config.py @@ -67,18 +67,23 @@ def observation_config(self): "robot0_gripper_qpos", "object", ] - self.observation.modalities.subgoal.image = [] # specify image subgoal observations for agent to predict + self.observation.modalities.subgoal.rgb = [] # specify rgb image subgoal observations for agent to predict + self.observation.modalities.subgoal.depth = [] + self.observation.modalities.subgoal.scan = [] + self.observation.modalities.subgoal.do_not_lock_keys() @property - def all_modalities(self): + def all_obs_keys(self): """ Update from superclass to include subgoals. """ - return sorted(tuple(set( - self.observation.modalities.obs.low_dim + - self.observation.modalities.obs.image + - self.observation.modalities.subgoal.low_dim + - self.observation.modalities.subgoal.image + - self.observation.modalities.goal.low_dim + - self.observation.modalities.goal.image - ))) + # pool all modalities + return sorted(tuple(set([ + obs_key for group in [ + self.observation.modalities.obs.values(), + self.observation.modalities.goal.values(), + self.observation.modalities.subgoal.values(), + ] + for modality in group + for obs_key in modality + ]))) diff --git a/robomimic/config/hbc_config.py b/robomimic/config/hbc_config.py index dfa38b84..ae65c9b8 100644 --- a/robomimic/config/hbc_config.py +++ b/robomimic/config/hbc_config.py @@ -75,22 +75,22 @@ def use_goals(self): """ return len( self.observation.planner.modalities.goal.low_dim + - self.observation.planner.modalities.goal.image) > 0 + self.observation.planner.modalities.goal.rgb) > 0 @property - def all_modalities(self): + def all_obs_keys(self): """ Update from superclass to include modalities from planner and actor. 
""" - return sorted(tuple(set( - self.observation.planner.modalities.obs.low_dim + - self.observation.planner.modalities.obs.image + - self.observation.planner.modalities.subgoal.low_dim + - self.observation.planner.modalities.subgoal.image + - self.observation.planner.modalities.goal.low_dim + - self.observation.planner.modalities.goal.image + - self.observation.actor.modalities.obs.low_dim + - self.observation.actor.modalities.obs.image + - self.observation.actor.modalities.goal.low_dim + - self.observation.actor.modalities.goal.image - ))) + # pool all modalities + return sorted(tuple(set([ + obs_key for group in [ + self.observation.planner.modalities.obs.values(), + self.observation.planner.modalities.goal.values(), + self.observation.planner.modalities.subgoal.values(), + self.observation.actor.modalities.obs.values(), + self.observation.actor.modalities.goal.values(), + ] + for modality in group + for obs_key in modality + ]))) diff --git a/robomimic/config/iris_config.py b/robomimic/config/iris_config.py index 7c8d7a3a..c03328ce 100644 --- a/robomimic/config/iris_config.py +++ b/robomimic/config/iris_config.py @@ -76,26 +76,24 @@ def use_goals(self): """ return len( self.observation.value_planner.planner.modalities.goal.low_dim + - self.observation.value_planner.planner.modalities.goal.image) > 0 + self.observation.value_planner.planner.modalities.goal.rgb) > 0 @property - def all_modalities(self): + def all_obs_keys(self): """ Update from superclass to include modalities from value planner and actor. """ - return sorted(tuple(set( - self.observation.value_planner.planner.modalities.obs.low_dim + - self.observation.value_planner.planner.modalities.obs.image + - self.observation.value_planner.planner.modalities.subgoal.low_dim + - self.observation.value_planner.planner.modalities.subgoal.image + - self.observation.value_planner.planner.modalities.goal.low_dim + - self.observation.value_planner.planner.modalities.goal.image + - self.observation.value_planner.value.modalities.obs.low_dim + - self.observation.value_planner.value.modalities.obs.image + - self.observation.value_planner.value.modalities.goal.low_dim + - self.observation.value_planner.value.modalities.goal.image + - self.observation.actor.modalities.obs.low_dim + - self.observation.actor.modalities.obs.image + - self.observation.actor.modalities.goal.low_dim + - self.observation.actor.modalities.goal.image - ))) + # pool all modalities + return sorted(tuple(set([ + obs_key for group in [ + self.observation.value_planner.planner.modalities.obs.values(), + self.observation.value_planner.planner.modalities.goal.values(), + self.observation.value_planner.planner.modalities.subgoal.values(), + self.observation.value_planner.value.modalities.obs.values(), + self.observation.value_planner.value.modalities.goal.values(), + self.observation.actor.modalities.obs.values(), + self.observation.actor.modalities.goal.values(), + ] + for modality in group + for obs_key in modality + ]))) diff --git a/robomimic/envs/env_base.py b/robomimic/envs/env_base.py index 391f5be7..df13b3ef 100644 --- a/robomimic/envs/env_base.py +++ b/robomimic/envs/env_base.py @@ -13,6 +13,7 @@ class EnvType: """ ROBOSUITE_TYPE = 1 GYM_TYPE = 2 + IG_MOMART_TYPE = 3 class EnvBase(abc.ABC): diff --git a/robomimic/envs/env_gym.py b/robomimic/envs/env_gym.py index 94bef1a7..6cb1fa56 100644 --- a/robomimic/envs/env_gym.py +++ b/robomimic/envs/env_gym.py @@ -224,7 +224,7 @@ def create_for_data_processing(cls, env_name, camera_names, camera_height, camer 
obs_modality_specs = { "obs": { "low_dim": ["flat"], - "image": [], + "rgb": [], } } ObsUtils.initialize_obs_utils_with_obs_specs(obs_modality_specs) diff --git a/robomimic/envs/env_ig_momart.py b/robomimic/envs/env_ig_momart.py new file mode 100644 index 00000000..81dd312f --- /dev/null +++ b/robomimic/envs/env_ig_momart.py @@ -0,0 +1,395 @@ +""" +Wrapper environment class to enable using iGibson-based environments used in the MOMART paper +""" + +from copy import deepcopy +import numpy as np +import json + +import pybullet as p +import gibson2 +from gibson2.envs.semantic_organize_and_fetch import SemanticOrganizeAndFetch +from gibson2.utils.custom_utils import ObjectConfig +import gibson2.external.pybullet_tools.utils as PBU +import tempfile +import os +import yaml +import cv2 + +import robomimic.utils.obs_utils as ObsUtils +import robomimic.envs.env_base as EB + + +# TODO: Once iG 2.0 is more stable, automate available environments, similar to robosuite +ENV_MAPPING = { + "SemanticOrganizeAndFetch": SemanticOrganizeAndFetch, +} + + +class EnvGibsonMOMART(EB.EnvBase): + """ + Wrapper class for gibson environments (https://github.com/StanfordVL/iGibson) specifically compatible with + MoMaRT datasets + """ + def __init__( + self, + env_name, + ig_config, + postprocess_visual_obs=True, + render=False, + render_offscreen=False, + use_image_obs=False, + image_height=None, + image_width=None, + physics_timestep=1./240., + action_timestep=1./20., + **kwargs, + ): + """ + Args: + ig_config (dict): YAML configuration to use for iGibson, as a dict + + postprocess_visual_obs (bool): if True, postprocess image observations + to prepare for learning + + render (bool): if True, environment supports on-screen rendering + + render_offscreen (bool): if True, environment supports off-screen rendering. This + is forced to be True if @use_image_obs is True. + + use_image_obs (bool): if True, environment is expected to render rgb image observations + on every env.step call. Set this to False for efficiency reasons, if image + observations are not required. + + render_mode (str): How to run simulation rendering. Options are {"pbgui", "iggui", or "headless"} + + image_height (int): If specified, overrides internal iG image height when rendering + + image_width (int): If specified, overrides internal iG image width when rendering + + physics_timestep (float): Pybullet physics timestep to use + + action_timestep (float): Action timestep to use for robot in simulation + + kwargs (unrolled dict): Any args to substitute in the ig_configuration + """ + self._env_name = env_name + self.ig_config = deepcopy(ig_config) + self.postprocess_visual_obs = postprocess_visual_obs + self._init_kwargs = kwargs + + # Determine rendering mode + self.render_mode = "iggui" if render else "headless" + self.render_onscreen = render + + # Make sure rgb is part of obs in ig config + self.ig_config["output"] = list(set(self.ig_config["output"] + ["rgb"])) + + # Warn user that iG always uses a renderer + if (not render) and (not render_offscreen): + print("WARNING: iGibson always uses a renderer -- using headless by default.") + + # Update ig config + for k, v in kwargs.items(): + assert k in self.ig_config, f"Got unknown ig configuration key {k}!" 
+ self.ig_config[k] = v + + # Set rendering values + self.obs_img_height = image_height if image_height is not None else self.ig_config.get("obs_image_height", 120) + self.obs_img_width = image_width if image_width is not None else self.ig_config.get("obs_image_width", 120) + + # Get class to create + envClass = ENV_MAPPING.get(self._env_name, None) + + # Make sure we have a valid environment class + assert envClass is not None, "No valid environment for the requested task was found!" + + # Set device idx for rendering + # ensure that we select the correct GPU device for rendering by testing for EGL rendering + # NOTE: this package should be installed from this link (https://github.com/StanfordVL/egl_probe) + import egl_probe + device_idx = 0 + valid_gpu_devices = egl_probe.get_available_devices() + if len(valid_gpu_devices) > 0: + device_idx = valid_gpu_devices[0] + + # Create environment + self.env = envClass( + config_file=deepcopy(self.ig_config), + mode=self.render_mode, + physics_timestep=physics_timestep, + action_timestep=action_timestep, + device_idx=device_idx, + ) + + # If we have a viewer, make sure to remove all bodies belonging to the visual markers + self.exclude_body_ids = [] # Bodies to exclude when saving state + if self.env.simulator.viewer is not None: + self.exclude_body_ids.append(self.env.simulator.viewer.constraint_marker.body_id) + self.exclude_body_ids.append(self.env.simulator.viewer.constraint_marker2.body_id) + + def step(self, action): + """ + Step in the environment with an action + + Args: + action: action to take + + Returns: + observation: new observation + reward: step reward + done: whether the task is done + info: extra information + """ + obs, r, done, info = self.env.step(action) + obs = self.get_observation(obs) + return obs, r, self.is_done(), info + + def reset(self): + """Reset environment""" + di = self.env.reset() + return self.get_observation(di) + + def reset_to(self, state): + """ + Reset to a specific state + Args: + state (dict): contains: + - states (np.ndarray): initial state of the mujoco environment + - goal (dict): goal components to reset + Returns: + new observation + """ + if "states" in state: + self.env.reset_to(state["states"], exclude=self.exclude_body_ids) + + if "goal" in state: + self.set_goal(**state["goal"]) + + # Return obs + return self.get_observation() + + def render(self, mode="human", camera_name="rgb", height=None, width=None): + """ + Render + + Args: + mode (str): Mode(s) to render. Options are either 'human' (rendering onscreen) or 'rgb' (rendering to + frames offscreen) + camera_name (str): Name of the camera to use -- valid options are "rgb" or "rgb_wrist" + height (int): If specified with width, resizes the rendered image to this height + width (int): If specified with height, resizes the rendered image to this width + + Returns: + array or None: If rendering to frame, returns the rendered frame. Otherwise, returns None + """ + # Only robotview camera is currently supported + assert camera_name in {"rgb", "rgb_wrist"}, \ + f"Only rgb, rgb_wrist cameras currently supported, got {camera_name}." + + if mode == "human": + assert self.render_onscreen, "Rendering has not been enabled for onscreen!" + self.env.simulator.sync() + else: + assert self.env.simulator.renderer is not None, "No renderer enabled for this env!" 
+ + frame = self.env.sensors["vision"].get_obs(self.env)[camera_name] + + # Reshape all frames + if height is not None and width is not None: + frame = cv2.resize(frame, dsize=(height, width), interpolation=cv2.INTER_CUBIC) + return frame + + def resize_obs_frame(self, frame): + """ + Resizes frame to be internal height and width values + """ + return cv2.resize(frame, dsize=(self.obs_img_width, self.obs_img_height), interpolation=cv2.INTER_CUBIC) + + def get_observation(self, di=None): + """Get environment observation""" + if di is None: + di = self.env.get_state() + ret = {} + for k in di: + # RGB Images + if "rgb" in k: + ret[k] = di[k] + # ret[k] = np.transpose(di[k], (2, 0, 1)) + if self.postprocess_visual_obs: + ret[k] = ObsUtils.process_obs(obs=self.resize_obs_frame(ret[k]), obs_key=k) + + # Depth images + elif "depth" in k: + # ret[k] = np.transpose(di[k], (2, 0, 1)) + # Values can be corrupted (negative or > 1.0, so we clip values) + ret[k] = np.clip(di[k], 0.0, 1.0) + if self.postprocess_visual_obs: + ret[k] = ObsUtils.process_obs(obs=self.resize_obs_frame(ret[k])[..., None], obs_key=k) + + # Segmentation Images + elif "seg" in k: + ret[k] = di[k][..., None] + if self.postprocess_visual_obs: + ret[k] = ObsUtils.process_obs(obs=self.resize_obs_frame(ret[k]), obs_key=k) + + # Scans + elif "scan" in k: + ret[k] = np.transpose(np.array(di[k]), axes=(1, 0)) + + # Compose proprio obs + proprio_obs = di["proprio"] + + # Compute intermediate values + lin_vel = np.linalg.norm(proprio_obs["base_lin_vel"][:2]) + ang_vel = proprio_obs["base_ang_vel"][2] + + ret["proprio"] = np.concatenate([ + proprio_obs["head_joint_pos"], + proprio_obs["grasped"], + proprio_obs["eef_pos"], + proprio_obs["eef_quat"], + ]) + + # Proprio info that's only relevant for navigation + ret["proprio_nav"] = np.concatenate([ + [lin_vel], + [ang_vel], + ]) + + # Compose task obs + ret["object"] = np.concatenate([ + np.array(di["task_obs"]["object-state"]), + ]) + + # Add ground truth navigational state + ret["gt_nav"] = np.concatenate([ + proprio_obs["base_pos"][:2], + [np.sin(proprio_obs["base_rpy"][2])], + [np.cos(proprio_obs["base_rpy"][2])], + ]) + + return ret + + def sync_task(self): + """ + Method to synchronize iG task, since we're not actually resetting the env but instead setting states directly. + Should only be called after resetting the initial state of an episode + """ + self.env.task.update_target_object_init_pos() + self.env.task.update_location_info() + + def set_task_conditions(self, task_conditions): + """ + Method to override task conditions (e.g.: target object), useful in cases such as playing back + from demonstrations + + Args: + task_conditions (dict): Keyword-mapped arguments to pass to task instance to set internally + """ + self.env.set_task_conditions(task_conditions) + + def get_state(self): + """Get iG flattened state""" + return {"states": PBU.WorldSaver(exclude_body_ids=self.exclude_body_ids).serialize()} + + def get_reward(self): + return self.env.task.get_reward(self.env)[0] + # return float(self.is_success()["task"]) + + def get_goal(self): + """Get goal specification""" + # No support yet in iG + raise NotImplementedError + + def set_goal(self, **kwargs): + """Set env target with external specification""" + # No support yet in iG + raise NotImplementedError + + def is_done(self): + """Check if the agent is done (not necessarily successful).""" + return False + + def is_success(self): + """ + Check if the task condition(s) is reached. 
Should return a dictionary + { str: bool } with at least a "task" key for the overall task success, + and additional optional keys corresponding to other task criteria. + """ + succ = self.env.check_success() + if isinstance(succ, dict): + assert "task" in succ + return succ + return { "task" : succ } + + @classmethod + def create_for_data_processing( + cls, + env_name, + camera_names, + camera_height, + camera_width, + reward_shaping, + **kwargs, + ): + """ + Create environment for processing datasets, which includes extracting + observations, labeling dense / sparse rewards, and annotating dones in + transitions. + + Args: + env_name (str): name of environment + camera_names (list of str): list of camera names that correspond to image observations + camera_height (int): camera height for all cameras + camera_width (int): camera width for all cameras + reward_shaping (bool): if True, use shaped environment rewards, else use sparse task completion rewards + """ + has_camera = (len(camera_names) > 0) + + # note that @postprocess_visual_obs is False since this env's images will be written to a dataset + return cls( + env_name=env_name, + render=False, + render_offscreen=has_camera, + use_image_obs=has_camera, + postprocess_visual_obs=False, + image_height=camera_height, + image_width=camera_width, + **kwargs, + ) + + @property + def action_dimension(self): + """Action dimension""" + return self.env.robots[0].action_dim + + @property + def name(self): + """Environment name""" + return self._env_name + + @property + def type(self): + """Environment type""" + return EB.EnvType.IG_MOMART_TYPE + + def serialize(self): + """Serialize to dictionary""" + return dict(env_name=self.name, type=self.type, + ig_config=self.ig_config, + env_kwargs=deepcopy(self._init_kwargs)) + + @classmethod + def deserialize(cls, info, postprocess_visual_obs=True): + """Create environment with external info""" + return cls(env_name=info["env_name"], ig_config=info["ig_config"], postprocess_visual_obs=postprocess_visual_obs, **info["env_kwargs"]) + + @property + def rollout_exceptions(self): + """Return tuple of exceptions to except when doing rollouts""" + return (RuntimeError) + + def __repr__(self): + return self.name + "\n" + json.dumps(self._init_kwargs, sort_keys=True, indent=4) + \ + "\niGibson Config: \n" + json.dumps(self.ig_config, sort_keys=True, indent=4) diff --git a/robomimic/envs/env_robosuite.py b/robomimic/envs/env_robosuite.py index 2e1533f4..1f0ffc03 100644 --- a/robomimic/envs/env_robosuite.py +++ b/robomimic/envs/env_robosuite.py @@ -49,7 +49,7 @@ def __init__( # robosuite version check self._is_v1 = (robosuite.__version__.split(".")[0] == "1") if self._is_v1: - assert (robosuite.__version__.split(".")[1] == "2"), "only support robosuite v0.3 and v1.2+" + assert (int(robosuite.__version__.split(".")[1]) >= 2), "only support robosuite v0.3 and v1.2+" kwargs = deepcopy(kwargs) @@ -181,10 +181,10 @@ def get_observation(self, di=None): di = self.env._get_observations(force_update=True) if self._is_v1 else self.env._get_observation() ret = {} for k in di: - if ObsUtils.key_is_image(k): + if (k in ObsUtils.OBS_KEYS_TO_MODALITIES) and ObsUtils.key_is_obs_modality(key=k, obs_modality="rgb"): ret[k] = di[k][::-1] if self.postprocess_visual_obs: - ret[k] = ObsUtils.process_image(ret[k]) + ret[k] = ObsUtils.process_obs(obs=ret[k], obs_key=k) # "object" key contains object information ret["object"] = np.array(di["object-state"]) @@ -195,7 +195,8 @@ def get_observation(self, di=None): # ensures that we don't 
accidentally add robot wrist images a second time pf = robot.robot_model.naming_prefix for k in di: - if k.startswith(pf) and (k not in ret) and (not k.endswith("proprio-state")): + if k.startswith(pf) and (k not in ret) and \ + (not k.endswith("proprio-state")) and (k in ObsUtils.OBS_KEYS_TO_MODALITIES): ret[k] = np.array(di[k]) else: # minimal proprioception for older versions of robosuite @@ -329,13 +330,13 @@ def create_for_data_processing( if is_v1: image_modalities = ["{}_image".format(cn) for cn in camera_names] elif has_camera: - # v0.3 only had support for one image, and it was named "image" + # v0.3 only had support for one image, and it was named "rgb" assert len(image_modalities) == 1 - image_modalities = ["image"] + image_modalities = ["rgb"] obs_modality_specs = { "obs": { "low_dim": [], # technically unused, so we don't have to specify all of them - "image": image_modalities, + "rgb": image_modalities, } } ObsUtils.initialize_obs_utils_with_obs_specs(obs_modality_specs) diff --git a/robomimic/exps/templates/bc.json b/robomimic/exps/templates/bc.json index e908727d..99668b67 100644 --- a/robomimic/exps/templates/bc.json +++ b/robomimic/exps/templates/bc.json @@ -146,33 +146,116 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "goal": { "low_dim": [], - "image": [] + "rgb": [], + "depth": [], + "scan": [] } }, "encoder": { - "visual_core": "ResNet18Conv", - "visual_core_kwargs": { - "pretrained": false, - "input_coord_conv": false + "low_dim": { + "feature_dimension": null, + "core_class": null, + "core_kwargs": {}, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": {} }, - "obs_randomizer_class": null, - "obs_randomizer_kwargs": { - "crop_height": 76, - "crop_width": 76, - "num_crops": 1, - "pos_enc": false + "rgb": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } + }, + "depth": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } }, - "visual_feature_dimension": 64, - "use_spatial_softmax": true, - "spatial_softmax_kwargs": { - "num_kp": 32, - "learnable_temperature": false, - "temperature": 1.0, - "noise_std": 0.0 + "scan": { + "feature_dimension": 64, + "core_class": "ScanCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + }, + "conv_kwargs": { + "out_channels": [ + 32, + 64, + 64 + ], + "kernel_size": [ + 8, + 4, + 2 + ], + "stride": [ + 4, + 2, + 1 + ] + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + 
"learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } } } } diff --git a/robomimic/exps/templates/bcq.json b/robomimic/exps/templates/bcq.json index 7fa37159..f48ed964 100644 --- a/robomimic/exps/templates/bcq.json +++ b/robomimic/exps/templates/bcq.json @@ -182,33 +182,116 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "goal": { "low_dim": [], - "image": [] + "rgb": [], + "depth": [], + "scan": [] } }, "encoder": { - "visual_core": "ResNet18Conv", - "visual_core_kwargs": { - "pretrained": false, - "input_coord_conv": false + "low_dim": { + "feature_dimension": null, + "core_class": null, + "core_kwargs": {}, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": {} }, - "obs_randomizer_class": null, - "obs_randomizer_kwargs": { - "crop_height": 76, - "crop_width": 76, - "num_crops": 1, - "pos_enc": false + "rgb": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } }, - "visual_feature_dimension": 64, - "use_spatial_softmax": true, - "spatial_softmax_kwargs": { - "num_kp": 32, - "learnable_temperature": false, - "temperature": 1.0, - "noise_std": 0.0 + "depth": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } + }, + "scan": { + "feature_dimension": 64, + "core_class": "ScanCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + }, + "conv_kwargs": { + "out_channels": [ + 32, + 64, + 64 + ], + "kernel_size": [ + 8, + 4, + 2 + ], + "stride": [ + 4, + 2, + 1 + ] + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } } } } diff --git a/robomimic/exps/templates/cql.json b/robomimic/exps/templates/cql.json index 89647f66..07422d2b 100644 --- a/robomimic/exps/templates/cql.json +++ b/robomimic/exps/templates/cql.json @@ -129,33 +129,116 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "goal": { "low_dim": [], - "image": [] + "rgb": [], + "depth": [], + "scan": [] } }, "encoder": { - "visual_core": "ResNet18Conv", - "visual_core_kwargs": { - "pretrained": false, - "input_coord_conv": false + "low_dim": { + "feature_dimension": null, + "core_class": null, + "core_kwargs": {}, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": {} }, - "obs_randomizer_class": null, - "obs_randomizer_kwargs": { - "crop_height": 76, - "crop_width": 76, - "num_crops": 1, - "pos_enc": false + "rgb": { + 
"feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } + }, + "depth": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } }, - "visual_feature_dimension": 64, - "use_spatial_softmax": true, - "spatial_softmax_kwargs": { - "num_kp": 32, - "learnable_temperature": false, - "temperature": 1.0, - "noise_std": 0.0 + "scan": { + "feature_dimension": 64, + "core_class": "ScanCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + }, + "conv_kwargs": { + "out_channels": [ + 32, + 64, + 64 + ], + "kernel_size": [ + 8, + 4, + 2 + ], + "stride": [ + 4, + 2, + 1 + ] + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } } } } diff --git a/robomimic/exps/templates/gl.json b/robomimic/exps/templates/gl.json index 4b522c21..8198c172 100644 --- a/robomimic/exps/templates/gl.json +++ b/robomimic/exps/templates/gl.json @@ -118,11 +118,15 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "goal": { "low_dim": [], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "subgoal": { "low_dim": [ @@ -131,29 +135,110 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] } }, "encoder": { - "visual_core": "ResNet18Conv", - "visual_core_kwargs": { - "pretrained": false, - "input_coord_conv": false + "low_dim": { + "feature_dimension": null, + "core_class": null, + "core_kwargs": {}, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": {} }, - "obs_randomizer_class": null, - "obs_randomizer_kwargs": { - "crop_height": 76, - "crop_width": 76, - "num_crops": 1, - "pos_enc": false + "rgb": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } + }, + "depth": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 
76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } }, - "visual_feature_dimension": 64, - "use_spatial_softmax": true, - "spatial_softmax_kwargs": { - "num_kp": 32, - "learnable_temperature": false, - "temperature": 1.0, - "noise_std": 0.0 + "scan": { + "feature_dimension": 64, + "core_class": "ScanCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + }, + "conv_kwargs": { + "out_channels": [ + 32, + 64, + 64 + ], + "kernel_size": [ + 8, + 4, + 2 + ], + "stride": [ + 4, + 2, + 1 + ] + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } } } } diff --git a/robomimic/exps/templates/hbc.json b/robomimic/exps/templates/hbc.json index 09708924..db5dad57 100644 --- a/robomimic/exps/templates/hbc.json +++ b/robomimic/exps/templates/hbc.json @@ -165,11 +165,15 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "goal": { "low_dim": [], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "subgoal": { "low_dim": [ @@ -178,29 +182,110 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] } }, "encoder": { - "visual_core": "ResNet18Conv", - "visual_core_kwargs": { - "pretrained": false, - "input_coord_conv": false + "low_dim": { + "feature_dimension": null, + "core_class": null, + "core_kwargs": {}, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": {} }, - "obs_randomizer_class": null, - "obs_randomizer_kwargs": { - "crop_height": 76, - "crop_width": 76, - "num_crops": 1, - "pos_enc": false + "rgb": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } + }, + "depth": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } }, - "visual_feature_dimension": 64, - "use_spatial_softmax": true, - "spatial_softmax_kwargs": { - "num_kp": 32, - "learnable_temperature": false, - "temperature": 1.0, - "noise_std": 0.0 + "scan": { + "feature_dimension": 64, + "core_class": "ScanCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + }, + "conv_kwargs": { + "out_channels": [ + 32, + 64, + 64 + ], + "kernel_size": [ + 8, + 4, + 2 + ], + "stride": [ + 4, + 2, + 1 + ] + } + }, + "obs_randomizer_class": 
null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } } } }, @@ -213,33 +298,116 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "goal": { "low_dim": [], - "image": [] + "rgb": [], + "depth": [], + "scan": [] } }, "encoder": { - "visual_core": "ResNet18Conv", - "visual_core_kwargs": { - "pretrained": false, - "input_coord_conv": false + "low_dim": { + "feature_dimension": null, + "core_class": null, + "core_kwargs": {}, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": {} }, - "obs_randomizer_class": null, - "obs_randomizer_kwargs": { - "crop_height": 76, - "crop_width": 76, - "num_crops": 1, - "pos_enc": false + "rgb": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } + }, + "depth": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } }, - "visual_feature_dimension": 64, - "use_spatial_softmax": true, - "spatial_softmax_kwargs": { - "num_kp": 32, - "learnable_temperature": false, - "temperature": 1.0, - "noise_std": 0.0 + "scan": { + "feature_dimension": 64, + "core_class": "ScanCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + }, + "conv_kwargs": { + "out_channels": [ + 32, + 64, + 64 + ], + "kernel_size": [ + 8, + 4, + 2 + ], + "stride": [ + 4, + 2, + 1 + ] + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } } } } diff --git a/robomimic/exps/templates/iris.json b/robomimic/exps/templates/iris.json index 6f1748de..b28833a2 100644 --- a/robomimic/exps/templates/iris.json +++ b/robomimic/exps/templates/iris.json @@ -289,11 +289,15 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "goal": { "low_dim": [], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "subgoal": { "low_dim": [ @@ -302,29 +306,110 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] } }, "encoder": { - "visual_core": "ResNet18Conv", - "visual_core_kwargs": { - "pretrained": false, - "input_coord_conv": false + "low_dim": { + "feature_dimension": null, + "core_class": null, + "core_kwargs": {}, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": {} }, - 
"obs_randomizer_class": null, - "obs_randomizer_kwargs": { - "crop_height": 76, - "crop_width": 76, - "num_crops": 1, - "pos_enc": false + "rgb": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } }, - "visual_feature_dimension": 64, - "use_spatial_softmax": true, - "spatial_softmax_kwargs": { - "num_kp": 32, - "learnable_temperature": false, - "temperature": 1.0, - "noise_std": 0.0 + "depth": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } + }, + "scan": { + "feature_dimension": 64, + "core_class": "ScanCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + }, + "conv_kwargs": { + "out_channels": [ + 32, + 64, + 64 + ], + "kernel_size": [ + 8, + 4, + 2 + ], + "stride": [ + 4, + 2, + 1 + ] + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } } } }, @@ -337,33 +422,116 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "goal": { "low_dim": [], - "image": [] + "rgb": [], + "depth": [], + "scan": [] } }, "encoder": { - "visual_core": "ResNet18Conv", - "visual_core_kwargs": { - "pretrained": false, - "input_coord_conv": false + "low_dim": { + "feature_dimension": null, + "core_class": null, + "core_kwargs": {}, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": {} }, - "obs_randomizer_class": null, - "obs_randomizer_kwargs": { - "crop_height": 76, - "crop_width": 76, - "num_crops": 1, - "pos_enc": false + "rgb": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } }, - "visual_feature_dimension": 64, - "use_spatial_softmax": true, - "spatial_softmax_kwargs": { - "num_kp": 32, - "learnable_temperature": false, - "temperature": 1.0, - "noise_std": 0.0 + "depth": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + 
"crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } + }, + "scan": { + "feature_dimension": 64, + "core_class": "ScanCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + }, + "conv_kwargs": { + "out_channels": [ + 32, + 64, + 64 + ], + "kernel_size": [ + 8, + 4, + 2 + ], + "stride": [ + 4, + 2, + 1 + ] + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } } } } @@ -377,33 +545,116 @@ "robot0_gripper_qpos", "object" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "goal": { "low_dim": [], - "image": [] + "rgb": [], + "depth": [], + "scan": [] } }, "encoder": { - "visual_core": "ResNet18Conv", - "visual_core_kwargs": { - "pretrained": false, - "input_coord_conv": false + "low_dim": { + "feature_dimension": null, + "core_class": null, + "core_kwargs": {}, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": {} }, - "obs_randomizer_class": null, - "obs_randomizer_kwargs": { - "crop_height": 76, - "crop_width": 76, - "num_crops": 1, - "pos_enc": false + "rgb": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } + }, + "depth": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } }, - "visual_feature_dimension": 64, - "use_spatial_softmax": true, - "spatial_softmax_kwargs": { - "num_kp": 32, - "learnable_temperature": false, - "temperature": 1.0, - "noise_std": 0.0 + "scan": { + "feature_dimension": 64, + "core_class": "ScanCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + }, + "conv_kwargs": { + "out_channels": [ + 32, + 64, + 64 + ], + "kernel_size": [ + 8, + 4, + 2 + ], + "stride": [ + 4, + 2, + 1 + ] + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } } } } diff --git a/robomimic/exps/templates/td3_bc.json b/robomimic/exps/templates/td3_bc.json index a240d926..f8d2d0ca 100644 --- a/robomimic/exps/templates/td3_bc.json +++ b/robomimic/exps/templates/td3_bc.json @@ -114,33 +114,116 @@ 
"low_dim": [ "flat" ], - "image": [] + "rgb": [], + "depth": [], + "scan": [] }, "goal": { "low_dim": [], - "image": [] + "rgb": [], + "depth": [], + "scan": [] } }, "encoder": { - "visual_core": "ResNet18Conv", - "visual_core_kwargs": { - "pretrained": false, - "input_coord_conv": false + "low_dim": { + "feature_dimension": null, + "core_class": null, + "core_kwargs": {}, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": {} }, - "obs_randomizer_class": null, - "obs_randomizer_kwargs": { - "crop_height": 76, - "crop_width": 76, - "num_crops": 1, - "pos_enc": false + "rgb": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } }, - "visual_feature_dimension": 64, - "use_spatial_softmax": true, - "spatial_softmax_kwargs": { - "num_kp": 32, - "learnable_temperature": false, - "temperature": 1.0, - "noise_std": 0.0 + "depth": { + "feature_dimension": 64, + "core_class": "VisualCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } + }, + "scan": { + "feature_dimension": 64, + "core_class": "ScanCore", + "core_kwargs": { + "backbone_class": "ResNet18Conv", + "backbone_kwargs": { + "pretrained": false, + "input_coord_conv": false + }, + "conv_kwargs": { + "out_channels": [ + 32, + 64, + 64 + ], + "kernel_size": [ + 8, + 4, + 2 + ], + "stride": [ + 4, + 2, + 1 + ] + } + }, + "obs_randomizer_class": null, + "obs_randomizer_kwargs": { + "crop_height": 76, + "crop_width": 76, + "num_crops": 1, + "pos_enc": false + }, + "pool_class": "SpatialSoftmax", + "pool_kwargs": { + "num_kp": 32, + "learnable_temperature": false, + "temperature": 1.0, + "noise_std": 0.0 + } } } } diff --git a/robomimic/models/__init__.py b/robomimic/models/__init__.py index e69de29b..c3a0eb93 100644 --- a/robomimic/models/__init__.py +++ b/robomimic/models/__init__.py @@ -0,0 +1 @@ +from .base_nets import EncoderCore, Randomizer diff --git a/robomimic/models/base_nets.py b/robomimic/models/base_nets.py index fe8700b8..6be2a23b 100644 --- a/robomimic/models/base_nets.py +++ b/robomimic/models/base_nets.py @@ -9,6 +9,7 @@ import numpy as np import textwrap from copy import deepcopy +from collections import OrderedDict import torch import torch.nn as nn @@ -17,6 +18,14 @@ import robomimic.utils.tensor_utils as TensorUtils import robomimic.utils.obs_utils as ObsUtils +from robomimic.utils.python_utils import extract_class_init_kwargs_from_dict + + +CONV_ACTIVATIONS = { + "relu": nn.ReLU, + "None": None, + None: None, +} def rnn_args_from_config(rnn_config): @@ -115,6 +124,39 @@ def forward(self, inputs=None): return self.param +class Unsqueeze(Module): + """ + Trivial class that unsqueezes the input. 
Useful for including in a nn.Sequential network + """ + def __init__(self, dim): + super(Unsqueeze, self).__init__() + self.dim = dim + + def output_shape(self, input_shape=None): + assert input_shape is not None + return input_shape + [1] if self.dim == -1 else input_shape[:self.dim + 1] + [1] + input_shape[self.dim + 1:] + + def forward(self, x): + return x.unsqueeze(dim=self.dim) + + +class Squeeze(Module): + """ + Trivial class that squeezes the input. Useful for including in a nn.Sequential network + """ + + def __init__(self, dim): + super(Squeeze, self).__init__() + self.dim = dim + + def output_shape(self, input_shape=None): + assert input_shape is not None + return input_shape[:self.dim] + input_shape[self.dim+1:] if input_shape[self.dim] == 1 else input_shape + + def forward(self, x): + return x.squeeze(dim=self.dim) + + class MLP(Module): """ Base class for simple Multi-Layer Perceptrons. @@ -579,6 +621,82 @@ def output_shape(self, input_shape): return [self._output_channel, out_h, out_w] +class Conv1dBase(Module): + """ + Base class for stacked Conv1d layers. + + Args: + input_channel (int): Number of channels for inputs to this network + activation (None or str): Per-layer activation to use. Defaults to "relu". Valid options are + currently {relu, None} for no activation + conv_kwargs (dict): Specific nn.Conv1D args to use, in list form, where the ith element corresponds to the + argument to be passed to the ith Conv1D layer. + See https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html for specific possible arguments. + + e.g.: common values to use: + out_channels (list of int): Output channel size for each sequential Conv1d layer + kernel_size (list of int): Kernel sizes for each sequential Conv1d layer + stride (list of int): Stride sizes for each sequential Conv1d layer + """ + def __init__( + self, + input_channel=1, + activation="relu", + **conv_kwargs, + ): + super(Conv1dBase, self).__init__() + + # Get activation requested + activation = CONV_ACTIVATIONS[activation] + + # Make sure out_channels and kernel_size are specified + for kwarg in ("out_channels", "kernel_size"): + assert kwarg in conv_kwargs, f"{kwarg} must be specified in Conv1dBase kwargs!" + + # Generate network + self.n_layers = len(conv_kwargs["out_channels"]) + layers = OrderedDict() + for i in range(self.n_layers): + layer_kwargs = {k: v[i] for k, v in conv_kwargs.items()} + layers[f'conv{i}'] = nn.Conv1d( + in_channels=input_channel, + **layer_kwargs, + ) + if activation is not None: + layers[f'act{i}'] = activation() + input_channel = layer_kwargs["out_channels"] + + # Store network + self.nets = nn.Sequential(layers) + + def output_shape(self, input_shape): + """ + Function to compute output shape from inputs to this module. + + Args: + input_shape (iterable of int): shape of input. Does not include batch dimension. + Some modules may not need this argument, if their output does not depend + on the size of the input, or if they assume fixed size input. 
+ + Returns: + out_shape ([int]): list of integers corresponding to output shape + """ + channels, length = input_shape + for i in range(self.n_layers): + net = getattr(self.nets, f"conv{i}") + channels = net.out_channels + length = int((length + 2 * net.padding[0] - net.dilation[0] * (net.kernel_size[0] - 1) - 1) / net.stride[0]) + 1 + return [channels, length] + + def forward(self, inputs): + x = self.nets(inputs) + if list(self.output_shape(list(inputs.shape)[1:])) != list(x.shape)[1:]: + raise ValueError('Size mismatch: expect size %s, but got size %s' % ( + str(self.output_shape(list(inputs.shape)[1:])), str(list(x.shape)[1:])) + ) + return x + + """ ================================================ Pooling Networks @@ -794,12 +912,40 @@ def forward(self, x): raise Exception("unexpected agg type: {}".forward(self.agg_type)) +""" +================================================ +Encoder Core Networks (Abstract class) +================================================ +""" +class EncoderCore(Module): + """ + Abstract class used to categorize all cores used to encode observations + """ + def __init__(self, input_shape): + self.input_shape = input_shape + super(EncoderCore, self).__init__() + + def __init_subclass__(cls, **kwargs): + """ + Hook method to automatically register all valid subclasses so we can keep track of valid observation encoders + in a global dict. + + This global dict stores mapping from observation encoder network name to class. + We keep track of these registries to enable automated class inference at runtime, allowing + users to simply extend our base encoder class and refer to that class in string form + in their config, without having to manually register their class internally. + This also future-proofs us for any additional encoder classes we would + like to add ourselves. + """ + ObsUtils.register_encoder_core(cls) + + """ ================================================ Visual Core Networks (Backbone + Pool) ================================================ """ -class VisualCore(ConvBase): +class VisualCore(EncoderCore, ConvBase): """ A network block that combines a visual backbone network with optional pooling and linear layers. 
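The `EncoderCore.__init_subclass__` hook above registers every subclass through `ObsUtils.register_encoder_core`, so a custom core becomes addressable by its string name in the `core_class` field of the encoder config as soon as the class is defined. The Python sketch below illustrates this under stated assumptions: the class name `FlattenMLPCore`, its kwargs, and its flatten-plus-MLP projection are hypothetical and are not part of this change.

```python
import numpy as np
import torch.nn as nn

from robomimic.models.base_nets import EncoderCore


class FlattenMLPCore(EncoderCore):
    """Hypothetical encoder core: flatten the observation, then project it with a small MLP.

    Simply subclassing EncoderCore triggers ObsUtils.register_encoder_core, after which
    "core_class": "FlattenMLPCore" is a valid setting in the per-modality encoder config.
    """

    def __init__(self, input_shape, feature_dimension=64):
        super(FlattenMLPCore, self).__init__(input_shape=input_shape)
        self.feature_dimension = feature_dimension
        self.nets = nn.Sequential(
            nn.Flatten(start_dim=1, end_dim=-1),
            nn.Linear(int(np.prod(input_shape)), feature_dimension),
            nn.ReLU(),
        )

    def output_shape(self, input_shape=None):
        # output size is fixed by the final linear projection
        return [self.feature_dimension]

    def forward(self, inputs):
        return self.nets(inputs)
```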
@@ -807,67 +953,69 @@ class VisualCore(ConvBase): def __init__( self, input_shape, - visual_core_class, - visual_core_kwargs, + backbone_class, + backbone_kwargs, pool_class=None, pool_kwargs=None, flatten=True, - visual_feature_dimension=None, + feature_dimension=None, ): """ Args: input_shape (tuple): shape of input (not including batch dimension) - visual_core_class (str): class name for the visual core - visual_core_kwargs (dict): kwargs for the visual core + backbone_class (str): class name for the visual backbone network (e.g.: ResNet18) + backbone_kwargs (dict): kwargs for the visual backbone network pool_class (str): class name for the visual feature pooler (optional) + Common options are "SpatialSoftmax" and "SpatialMeanPool" pool_kwargs (dict): kwargs for the visual feature pooler (optional) flatten (bool): whether to flatten the visual feature - visual_feature_dimension (int): if not None, add a Linear layer to + feature_dimension (int): if not None, add a Linear layer to project output into a desired feature dimension """ - super(VisualCore, self).__init__() - self.input_shape = input_shape + super(VisualCore, self).__init__(input_shape=input_shape) self.flatten = flatten # add input channel dimension to visual core inputs - visual_core_kwargs = deepcopy(visual_core_kwargs) - visual_core_kwargs["input_channel"] = input_shape[0] + backbone_kwargs["input_channel"] = input_shape[0] + + # extract only relevant kwargs for this specific backbone + backbone_kwargs = extract_class_init_kwargs_from_dict(cls=eval(backbone_class), dic=backbone_kwargs, copy=True) # visual backbone - assert isinstance(visual_core_class, str) - if pool_class is not None: - assert isinstance(pool_class, str) - self.vis_core = eval(visual_core_class)(**visual_core_kwargs) + assert isinstance(backbone_class, str) + self.backbone = eval(backbone_class)(**backbone_kwargs) - assert isinstance(self.vis_core, ConvBase) + assert isinstance(self.backbone, ConvBase) - feat_shape = self.vis_core.output_shape(input_shape) - net_list = [self.vis_core] + feat_shape = self.backbone.output_shape(input_shape) + net_list = [self.backbone] # maybe make pool net if pool_class is not None: + assert isinstance(pool_class, str) # feed output shape of backbone to pool net if pool_kwargs is None: pool_kwargs = dict() - pool_kwargs = deepcopy(pool_kwargs) + # extract only relevant kwargs for this specific backbone pool_kwargs["input_shape"] = feat_shape - self.pool_net = eval(pool_class)(**pool_kwargs) - assert isinstance(self.pool_net, Module) + pool_kwargs = extract_class_init_kwargs_from_dict(cls=eval(pool_class), dic=pool_kwargs, copy=True) + self.pool = eval(pool_class)(**pool_kwargs) + assert isinstance(self.pool, Module) - feat_shape = self.pool_net.output_shape(feat_shape) - net_list.append(self.pool_net) + feat_shape = self.pool.output_shape(feat_shape) + net_list.append(self.pool) else: - self.pool_net = None + self.pool = None # flatten layer if self.flatten: net_list.append(torch.nn.Flatten(start_dim=1, end_dim=-1)) # maybe linear layer - self.visual_feature_dimension = visual_feature_dimension - if visual_feature_dimension is not None: + self.feature_dimension = feature_dimension + if feature_dimension is not None: assert self.flatten - linear = torch.nn.Linear(int(np.prod(feat_shape)), visual_feature_dimension) + linear = torch.nn.Linear(int(np.prod(feat_shape)), feature_dimension) net_list.append(linear) self.nets = nn.Sequential(*net_list) @@ -884,13 +1032,13 @@ def output_shape(self, input_shape): Returns: out_shape 
([int]): list of integers corresponding to output shape """ - if self.visual_feature_dimension is not None: + if self.feature_dimension is not None: # linear output - return [self.visual_feature_dimension] - feat_shape = self.vis_core.output_shape(input_shape) - if self.pool_net is not None: + return [self.feature_dimension] + feat_shape = self.backbone.output_shape(input_shape) + if self.pool is not None: # pool output - feat_shape = self.pool_net.output_shape(feat_shape) + feat_shape = self.pool.output_shape(feat_shape) # backbone + flat output if self.flatten: return [np.prod(feat_shape)] @@ -912,12 +1060,135 @@ def __repr__(self): indent = ' ' * 2 msg += textwrap.indent( "\ninput_shape={}\noutput_shape={}".format(self.input_shape, self.output_shape(self.input_shape)), indent) - msg += textwrap.indent("\nvisual_net={}".format(self.vis_core), indent) - msg += textwrap.indent("\npool_net={}".format(self.pool_net), indent) + msg += textwrap.indent("\nbackbone_net={}".format(self.backbone), indent) + msg += textwrap.indent("\npool_net={}".format(self.pool), indent) + msg = header + '(' + msg + '\n)' + return msg + + +""" +================================================ +Scan Core Networks (Conv1D Sequential + Pool) +================================================ +""" +class ScanCore(EncoderCore, ConvBase): + """ + A network block that combines a Conv1D backbone network with optional pooling + and linear layers. + """ + def __init__( + self, + input_shape, + conv_kwargs, + conv_activation="relu", + pool_class=None, + pool_kwargs=None, + flatten=True, + feature_dimension=None, + ): + """ + Args: + input_shape (tuple): shape of input (not including batch dimension) + conv_kwargs (dict): kwargs for the conv1d backbone network. Should contain lists for the following values: + out_channels (int) + kernel_size (int) + stride (int) + ... + conv_activation (str or None): Activation to use between conv layers. Default is relu. + Currently, valid options are {relu} + pool_class (str): class name for the visual feature pooler (optional) + Common options are "SpatialSoftmax" and "SpatialMeanPool" + pool_kwargs (dict): kwargs for the visual feature pooler (optional) + flatten (bool): whether to flatten the network output + feature_dimension (int): if not None, add a Linear layer to + project output into a desired feature dimension (note: flatten must be set to True!) 
+ """ + super(ScanCore, self).__init__(input_shape=input_shape) + self.flatten = flatten + self.feature_dimension = feature_dimension + + # Generate backbone network + self.backbone = Conv1dBase( + input_channel=1, + activation=conv_activation, + **conv_kwargs, + ) + feat_shape = self.backbone.output_shape(input_shape=input_shape) + + # Create netlist of all generated networks + net_list = [self.backbone] + + # Possibly add pooling network + if pool_class is not None: + # Add an unsqueeze network so that the shape is correct to pass to pooling network + self.unsqueeze = Unsqueeze(dim=-1) + net_list.append(self.unsqueeze) + # Get output shape + feat_shape = self.unsqueeze.output_shape(feat_shape) + # Create pooling network + self.pool = eval(pool_class)(input_shape=feat_shape, **pool_kwargs) + net_list.append(self.pool) + feat_shape = self.pool.output_shape(feat_shape) + else: + self.unsqueeze, self.pool = None, None + + # flatten layer + if self.flatten: + net_list.append(torch.nn.Flatten(start_dim=1, end_dim=-1)) + + # maybe linear layer + if self.feature_dimension is not None: + assert self.flatten + linear = torch.nn.Linear(int(np.prod(feat_shape)), self.feature_dimension) + net_list.append(linear) + + # Generate final network + self.nets = nn.Sequential(*net_list) + + def output_shape(self, input_shape): + """ + Function to compute output shape from inputs to this module. + + Args: + input_shape (iterable of int): shape of input. Does not include batch dimension. + Some modules may not need this argument, if their output does not depend + on the size of the input, or if they assume fixed size input. + + Returns: + out_shape ([int]): list of integers corresponding to output shape + """ + if self.feature_dimension is not None: + # linear output + return [self.feature_dimension] + feat_shape = self.backbone.output_shape(input_shape) + if self.pool is not None: + # pool output + feat_shape = self.pool.output_shape(self.unsqueeze.output_shape(feat_shape)) + # backbone + flat output + return [np.prod(feat_shape)] if self.flatten else feat_shape + + def forward(self, inputs): + """ + Forward pass through visual core. + """ + ndim = len(self.input_shape) + assert tuple(inputs.shape)[-ndim:] == tuple(self.input_shape) + return super(ScanCore, self).forward(inputs) + + def __repr__(self): + """Pretty print network.""" + header = '{}'.format(str(self.__class__.__name__)) + msg = '' + indent = ' ' * 2 + msg += textwrap.indent( + "\ninput_shape={}\noutput_shape={}".format(self.input_shape, self.output_shape(self.input_shape)), indent) + msg += textwrap.indent("\nbackbone_net={}".format(self.backbone), indent) + msg += textwrap.indent("\npool_net={}".format(self.pool), indent) msg = header + '(' + msg + '\n)' return msg + """ ================================================ Observation Randomizer Networks @@ -934,6 +1205,20 @@ class Randomizer(Module): def __init__(self): super(Randomizer, self).__init__() + def __init_subclass__(cls, **kwargs): + """ + Hook method to automatically register all valid subclasses so we can keep track of valid observation randomizers + in a global dict. + + This global dict stores mapping from observation randomizer network name to class. + We keep track of these registries to enable automated class inference at runtime, allowing + users to simply extend our base randomizer class and refer to that class in string form + in their config, without having to manually register their class internally. 
+ This also future-proofs us for any additional randomizer classes we would + like to add ourselves. + """ + ObsUtils.register_randomizer(cls) + def output_shape(self, input_shape=None): """ This function is unused. See @output_shape_in and @output_shape_out. @@ -1095,4 +1380,4 @@ def __repr__(self): header = '{}'.format(str(self.__class__.__name__)) msg = header + "(input_shape={}, crop_size=[{}, {}], num_crops={})".format( self.input_shape, self.crop_height, self.crop_width, self.num_crops) - return msg \ No newline at end of file + return msg diff --git a/robomimic/models/obs_nets.py b/robomimic/models/obs_nets.py index c0c98962..739d8a94 100644 --- a/robomimic/models/obs_nets.py +++ b/robomimic/models/obs_nets.py @@ -2,10 +2,10 @@ Contains torch Modules that help deal with inputs consisting of multiple modalities. This is extremely common when networks must deal with one or more observation dictionaries, where each input dictionary can have -modality keys of a certain type and shape. +observation keys of a certain modality and shape. -As an example, an observation could consist of a flat "robot0_eef_pos" modality, -and a 3-channel RGB "agentview_image" modality. +As an example, an observation could consist of a flat "robot0_eef_pos" observation key, +and a 3-channel RGB "agentview_image" observation key. """ import sys import numpy as np @@ -18,109 +18,76 @@ import torch.nn.functional as F import torch.distributions as D +from robomimic.utils.python_utils import extract_class_init_kwargs_from_dict import robomimic.utils.tensor_utils as TensorUtils import robomimic.utils.obs_utils as ObsUtils from robomimic.models.base_nets import Module, Sequential, MLP, RNN_Base, ResNet18Conv, SpatialSoftmax, \ - FeatureAggregator, VisualCore, Randomizer, CropRandomizer - - -def obs_encoder_args_from_config(obs_encoder_config): - """ - Generate a set of args used to create visual backbones for networks - from the obseration encoder config. - """ - return dict( - visual_feature_dimension=obs_encoder_config.visual_feature_dimension, - visual_core_class=obs_encoder_config.visual_core, - visual_core_kwargs=dict(obs_encoder_config.visual_core_kwargs), - obs_randomizer_class=obs_encoder_config.obs_randomizer_class, - obs_randomizer_kwargs=dict(obs_encoder_config.obs_randomizer_kwargs), - use_spatial_softmax=obs_encoder_config.use_spatial_softmax, - spatial_softmax_kwargs=dict(obs_encoder_config.spatial_softmax_kwargs), - ) + FeatureAggregator, VisualCore, Randomizer def obs_encoder_factory( obs_shapes, - visual_feature_dimension, - visual_core_class, - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, feature_activation=nn.ReLU, + encoder_kwargs=None, ): """ Utility function to create an @ObservationEncoder from kwargs specified in config. Args: - obs_shapes (OrderedDict): a dictionary that maps modality to + obs_shapes (OrderedDict): a dictionary that maps observation key to expected shapes for observations. 
- visual_feature_dimension (int): feature dimension to encode images into - - visual_core_class (str): specifies Visual Backbone network for encoding images - - visual_core_kwargs (dict): arguments to pass to @visual_core_class - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer - feature_activation: non-linearity to apply after each obs net - defaults to ReLU. Pass None to apply no activation. - """ - - ### TODO: clean this part up in the config and args to this function ### - if visual_core_kwargs is None: - visual_core_kwargs = dict() - visual_core_kwargs = deepcopy(visual_core_kwargs) - - if obs_randomizer_class is not None: - obs_randomizer_class = eval(obs_randomizer_class) - if obs_randomizer_kwargs is None: - obs_randomizer_kwargs = dict() - - # use a special class to wrap the visual core and pooling together - visual_core_kwargs_template = dict( - visual_core_class=visual_core_class, - visual_core_kwargs=deepcopy(visual_core_kwargs), - visual_feature_dimension=visual_feature_dimension - ) - if use_spatial_softmax: - visual_core_kwargs_template["pool_class"] = "SpatialSoftmax" - visual_core_kwargs_template["pool_kwargs"] = deepcopy(spatial_softmax_kwargs) - else: - visual_core_kwargs_template["pool_class"] = "SpatialMeanPool" + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should be + nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... 
+ """ enc = ObservationEncoder(feature_activation=feature_activation) - for k in obs_shapes: - mod_net_class = None - mod_net_kwargs = None - mod_randomizer = None - if ObsUtils.has_image([k]): - mod_net_class = "VisualCore" - mod_net_kwargs = deepcopy(visual_core_kwargs_template) - # need input shape to create visual core - mod_net_kwargs["input_shape"] = obs_shapes[k] - if obs_randomizer_class is not None: - mod_obs_randomizer_kwargs = deepcopy(obs_randomizer_kwargs) - mod_obs_randomizer_kwargs["input_shape"] = obs_shapes[k] - mod_randomizer = obs_randomizer_class(**mod_obs_randomizer_kwargs) - - enc.register_modality( - mod_name=k, - mod_shape=obs_shapes[k], - mod_net_class=mod_net_class, - mod_net_kwargs=mod_net_kwargs, - mod_randomizer=mod_randomizer + for k, obs_shape in obs_shapes.items(): + obs_modality = ObsUtils.OBS_KEYS_TO_MODALITIES[k] + enc_kwargs = deepcopy(ObsUtils.DEFAULT_ENCODER_KWARGS[obs_modality]) if encoder_kwargs is None else \ + deepcopy(encoder_kwargs[obs_modality]) + + for obs_module, cls_mapping in zip(("core", "obs_randomizer"), + (ObsUtils.OBS_ENCODER_CORES, ObsUtils.OBS_RANDOMIZERS)): + # Sanity check for kwargs in case they don't exist / are None + if enc_kwargs.get(f"{obs_module}_kwargs", None) is None: + enc_kwargs[f"{obs_module}_kwargs"] = {} + # Add in input shape info + enc_kwargs[f"{obs_module}_kwargs"]["input_shape"] = obs_shape + # If group class is specified, then make sure corresponding kwargs only contain relevant kwargs + if enc_kwargs[f"{obs_module}_class"] is not None: + enc_kwargs[f"{obs_module}_kwargs"] = extract_class_init_kwargs_from_dict( + cls=cls_mapping[enc_kwargs[f"{obs_module}_class"]], + dic=enc_kwargs[f"{obs_module}_kwargs"], + copy=False, + ) + + # Add in input shape info + randomizer = None if enc_kwargs["obs_randomizer_class"] is None else \ + ObsUtils.OBS_RANDOMIZERS[enc_kwargs["obs_randomizer_class"]](**enc_kwargs["obs_randomizer_kwargs"]) + + enc.register_obs_key( + name=k, + shape=obs_shape, + net_class=enc_kwargs["core_class"], + net_kwargs=enc_kwargs["core_kwargs"], + randomizer=randomizer, ) enc.make() @@ -129,9 +96,9 @@ def obs_encoder_factory( class ObservationEncoder(Module): """ - Module that processes inputs by modality and then concatenates the processed - modalities together. Each modality is processed with an encoder head network. - Call @register_modality to register modalities with the encoder and then + Module that processes inputs by observation key and then concatenates the processed + observation keys together. Each key is processed with an encoder head network. + Call @register_obs_key to register observation keys with the encoder and then finally call @make to create the encoder networks. """ def __init__(self, feature_activation=nn.ReLU): @@ -150,68 +117,60 @@ def __init__(self, feature_activation=nn.ReLU): self.feature_activation = feature_activation self._locked = False - def register_modality( + def register_obs_key( self, - mod_name, - mod_shape, - mod_net_class=None, - mod_net_kwargs=None, - mod_net=None, - mod_randomizer=None, - share_mod_net_from=None, + name, + shape, + net_class=None, + net_kwargs=None, + net=None, + randomizer=None, + share_net_from=None, ): """ - Register a modality that this encoder should be responsible for. + Register an observation key that this encoder should be responsible for. 
Args: - mod_name (str): modality name - mod_shape (int tuple): shape of modality - mod_net_class (str): name of class in base_nets.py that should be used - to process this modality before concatenation. Pass None to flatten - and concatenate the modality directly. - mod_net_kwargs (dict): arguments to pass to @mod_net_class - mod_net (Module instance): if provided, use this Module to process the modality + name (str): modality name + shape (int tuple): shape of modality + net_class (str): name of class in base_nets.py that should be used + to process this observation key before concatenation. Pass None to flatten + and concatenate the observation key directly. + net_kwargs (dict): arguments to pass to @net_class + net (Module instance): if provided, use this Module to process the observation key instead of creating a different net - mod_randomizer (Randomizer instance): if provided, use this Module to augment modalities + randomizer (Randomizer instance): if provided, use this Module to augment observation keys coming in to the encoder, and possibly augment the processed output as well - share_mod_net_from (str): if provided, use the same instance of @mod_net_class - as another modality. This modality must already exist in this encoder. - Warning: Note that this does not share the modality randomizer + share_net_from (str): if provided, use the same instance of @net_class + as another observation key. This observation key must already exist in this encoder. + Warning: Note that this does not share the observation key randomizer """ - assert not self._locked, "ObservationEncoder: @register_modality called after @make" - assert mod_name not in self.obs_shapes, "ObservationEncoder: modality {} already exists".format(mod_name) + assert not self._locked, "ObservationEncoder: @register_obs_key called after @make" + assert name not in self.obs_shapes, "ObservationEncoder: modality {} already exists".format(name) - if mod_net is not None: - assert isinstance(mod_net, Module), "ObservationEncoder: @mod_net must be instance of Module class" - assert (mod_net_class is None) and (mod_net_kwargs is None) and (share_mod_net_from is None), \ - "ObservationEncoder: @mod_net provided - ignore other net creation options" + if net is not None: + assert isinstance(net, Module), "ObservationEncoder: @net must be instance of Module class" + assert (net_class is None) and (net_kwargs is None) and (share_net_from is None), \ + "ObservationEncoder: @net provided - ignore other net creation options" - if share_mod_net_from is not None: + if share_net_from is not None: # share processing with another modality - assert (mod_net_class is None) and (mod_net_kwargs is None) - assert share_mod_net_from in self.obs_shapes - - if mod_net_class is not None: - # convert string into class - if sys.version_info.major == 3: - assert isinstance(mod_net_class, str) - else: - assert isinstance(mod_net_class, (str, unicode)) - mod_net_class = eval(mod_net_class) - - mod_net_kwargs = deepcopy(mod_net_kwargs) if mod_net_kwargs is not None else {} - if mod_randomizer is not None: - assert isinstance(mod_randomizer, Randomizer) - if mod_net_kwargs is not None: + assert (net_class is None) and (net_kwargs is None) + assert share_net_from in self.obs_shapes + + net_kwargs = deepcopy(net_kwargs) if net_kwargs is not None else {} + if randomizer is not None: + assert isinstance(randomizer, Randomizer) + if net_kwargs is not None: # update input shape to visual core - mod_net_kwargs["input_shape"] = mod_randomizer.output_shape_in(mod_shape) 
+ net_kwargs["input_shape"] = randomizer.output_shape_in(shape) - self.obs_shapes[mod_name] = mod_shape - self.obs_nets_classes[mod_name] = mod_net_class - self.obs_nets_kwargs[mod_name] = mod_net_kwargs - self.obs_nets[mod_name] = mod_net - self.obs_randomizers[mod_name] = mod_randomizer - self.obs_share_mods[mod_name] = share_mod_net_from + self.obs_shapes[name] = shape + self.obs_nets_classes[name] = net_class + self.obs_nets_kwargs[name] = net_kwargs + self.obs_nets[name] = net + self.obs_randomizers[name] = randomizer + self.obs_share_mods[name] = share_net_from def make(self): """ @@ -230,7 +189,7 @@ def _create_layers(self): for k in self.obs_shapes: if self.obs_nets_classes[k] is not None: # create net to process this modality - self.obs_nets[k] = self.obs_nets_classes[k](**self.obs_nets_kwargs[k]) + self.obs_nets[k] = ObsUtils.OBS_ENCODER_CORES[self.obs_nets_classes[k]](**self.obs_nets_kwargs[k]) elif self.obs_share_mods[k] is not None: # make sure net is shared with another modality self.obs_nets[k] = self.obs_nets[self.obs_share_mods[k]] @@ -307,9 +266,10 @@ def __repr__(self): header = '{}'.format(str(self.__class__.__name__)) msg = '' for k in self.obs_shapes: - msg += textwrap.indent('\nModality(\n', ' ' * 4) + msg += textwrap.indent('\nKey(\n', ' ' * 4) indent = ' ' * 8 msg += textwrap.indent("name={}\nshape={}\n".format(k, self.obs_shapes[k]), indent) + msg += textwrap.indent("modality={}\n".format(ObsUtils.OBS_KEYS_TO_MODALITIES[k]), indent) msg += textwrap.indent("randomizer={}\n".format(self.obs_randomizers[k]), indent) msg += textwrap.indent("net={}\n".format(self.obs_nets[k]), indent) msg += textwrap.indent("sharing_from={}\n".format(self.obs_share_mods[k]), indent) @@ -334,7 +294,7 @@ def __init__( ): """ Args: - decode_shapes (OrderedDict): a dictionary that maps observation modality to + decode_shapes (OrderedDict): a dictionary that maps observation key to expected shape. This is used to generate output modalities from the input features. @@ -382,9 +342,10 @@ def __repr__(self): header = '{}'.format(str(self.__class__.__name__)) msg = '' for k in self.obs_shapes: - msg += textwrap.indent('\nModality(\n', ' ' * 4) + msg += textwrap.indent('\nKey(\n', ' ' * 4) indent = ' ' * 8 msg += textwrap.indent("name={}\nshape={}\n".format(k, self.obs_shapes[k]), indent) + msg += textwrap.indent("modality={}\n".format(ObsUtils.OBS_KEYS_TO_MODALITIES[k]), indent) msg += textwrap.indent("net=({})\n".format(self.nets[k]), indent) msg += textwrap.indent(")", ' ' * 4) msg = header + '(' + msg + '\n)' @@ -405,14 +366,8 @@ class ObservationGroupEncoder(Module): def __init__( self, observation_group_shapes, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=True, - spatial_softmax_kwargs=None, feature_activation=nn.ReLU, + encoder_kwargs=None, ): """ Args: @@ -421,24 +376,25 @@ def __init__( the value should be an OrderedDict that maps modalities to expected shapes. 
- visual_feature_dimension (int): feature dimension to encode images into - - visual_core_class (str): specifies Visual Backbone network for encoding images - - visual_core_kwargs (dict): arguments to pass to @visual_core_class - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer - feature_activation: non-linearity to apply after each obs net - defaults to ReLU. Pass - None to apply no activation. + None to apply no activation. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... """ super(ObservationGroupEncoder, self).__init__() @@ -453,14 +409,8 @@ def __init__( for obs_group in self.observation_group_shapes: self.nets[obs_group] = obs_encoder_factory( obs_shapes=self.observation_group_shapes[obs_group], - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, feature_activation=feature_activation, + encoder_kwargs=encoder_kwargs, ) def forward(self, **inputs): @@ -531,19 +481,13 @@ class MIMO_MLP(Module): (including visual outputs). """ def __init__( - self, + self, input_obs_group_shapes, - output_shapes, + output_shapes, layer_dims, layer_func=nn.Linear, activation=nn.ReLU, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, + encoder_kwargs=None, ): """ Args: @@ -561,21 +505,22 @@ def __init__( activation: non-linearity per MLP layer - defaults to ReLU - visual_feature_dimension (int): feature dimension to encode images into - - visual_core_class (str): specifies Visual Backbone network for encoding images - - visual_core_kwargs (dict): arguments to pass to @visual_core_class - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... 
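# --- Example (editorial sketch, not part of this patch) ----------------------
# One possible encoder_kwargs dictionary in the nested per-modality form
# documented above. The concrete class names and numbers mirror the new
# config.observation.encoder.rgb.* defaults set in generate_paper_configs.py
# later in this patch; treat them as an example, not as required values.
encoder_kwargs = {
    "low_dim": {
        "core_class": None,                 # flatten low-dim keys and concatenate them directly
        "core_kwargs": {},
        "obs_randomizer_class": None,
        "obs_randomizer_kwargs": {},
    },
    "rgb": {
        "core_class": "VisualCore",
        "core_kwargs": {
            "feature_dimension": 64,
            "backbone_class": "ResNet18Conv",
            "backbone_kwargs": {"pretrained": False, "input_coord_conv": False},
            "pool_class": "SpatialSoftmax",
            "pool_kwargs": {"num_kp": 32, "learnable_temperature": False, "temperature": 1.0, "noise_std": 0.0},
        },
        "obs_randomizer_class": "CropRandomizer",
        "obs_randomizer_kwargs": {"crop_height": 76, "crop_width": 76, "num_crops": 1, "pos_enc": False},
    },
}

# The same dict is then passed wherever the old visual_core_* / spatial_softmax_*
# arguments used to go, e.g. (shapes are illustrative, and observation keys are
# assumed to be registered via ObsUtils.initialize_obs_utils_with_obs_specs):
#
#   from collections import OrderedDict
#   from robomimic.models.obs_nets import ObservationGroupEncoder
#
#   group_shapes = OrderedDict(obs=OrderedDict(robot0_eef_pos=(3,), agentview_image=(3, 84, 84)))
#   nets = ObservationGroupEncoder(
#       observation_group_shapes=group_shapes,
#       encoder_kwargs=encoder_kwargs,
#   )
# ------------------------------------------------------------------------------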
""" super(MIMO_MLP, self).__init__() @@ -591,13 +536,7 @@ def __init__( # Encoder for all observation groups. self.nets["encoder"] = ObservationGroupEncoder( observation_group_shapes=input_obs_group_shapes, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, + encoder_kwargs=encoder_kwargs, ) # flat encoder output dimension @@ -685,13 +624,7 @@ def __init__( mlp_activation=nn.ReLU, mlp_layer_func=nn.Linear, per_step=True, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, + encoder_kwargs=None, ): """ Args: @@ -713,23 +646,24 @@ def __init__( per_step (bool): if True, apply the MLP and observation decoder into @output_shapes at every step of the RNN. Otherwise, apply them to the final hidden state of the - RNN. - - visual_feature_dimension (int): feature dimension to encode images into - - visual_core_class (str): specifies Visual Backbone network for encoding images - - visual_core_kwargs (dict): arguments to pass to @visual_core_class - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer + RNN. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... """ super(RNN_MIMO_MLP, self).__init__() assert isinstance(input_obs_group_shapes, OrderedDict) @@ -744,13 +678,7 @@ def __init__( # Encoder for all observation groups. 
self.nets["encoder"] = ObservationGroupEncoder( observation_group_shapes=input_obs_group_shapes, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, + encoder_kwargs=encoder_kwargs, ) # flat encoder output dimension diff --git a/robomimic/models/policy_nets.py b/robomimic/models/policy_nets.py index e9605898..b9229e69 100644 --- a/robomimic/models/policy_nets.py +++ b/robomimic/models/policy_nets.py @@ -32,42 +32,37 @@ def __init__( obs_shapes, ac_dim, mlp_layer_dims, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: - obs_shapes (OrderedDict): a dictionary that maps modality to + obs_shapes (OrderedDict): a dictionary that maps observation keys to expected shapes for observations. ac_dim (int): dimension of action space. mlp_layer_dims ([int]): sequence of integers for the MLP hidden layers sizes. - visual_feature_dimension (int): feature dimension to encode images into. - - visual_core_class (str): specifies Visual Backbone network for encoding images. - - visual_core_kwargs (dict): arguments to pass to @visual_core_class. - - obs_randomizer_class (str): specifies observation randomizer class - - obs_randomizer_kwargs (dict): kwargs for observation randomizer (e.g., CropRandomizer) - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer. - - goal_shapes (OrderedDict): a dictionary that maps modality to + goal_shapes (OrderedDict): a dictionary that maps observation keys to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-observation key information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... 
""" assert isinstance(obs_shapes, OrderedDict) self.obs_shapes = obs_shapes @@ -91,13 +86,7 @@ def __init__( input_obs_group_shapes=observation_group_shapes, output_shapes=output_shapes, layer_dims=mlp_layer_dims, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, + encoder_kwargs=encoder_kwargs, ) def _get_output_shapes(self): @@ -132,18 +121,12 @@ def __init__( ac_dim, mlp_layer_dims, perturbation_scale=0.05, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: - obs_shapes (OrderedDict): a dictionary that maps modality to + obs_shapes (OrderedDict): a dictionary that maps observation keys to expected shapes for observations. ac_dim (int): dimension of action space. @@ -154,24 +137,25 @@ def __init__( lie in +/- @perturbation_scale. The final action output is equal to the original input action added to the output perturbation (and clipped to lie in [-1, 1]). - visual_feature_dimension (int): feature dimension to encode images into. - - visual_core_class (str): specifies Visual Backbone network for encoding images. - - visual_core_kwargs (dict): arguments to pass to @visual_core_class. - - obs_randomizer_class (str): specifies observation randomizer class - - obs_randomizer_kwargs (dict): kwargs for observation randomizer (e.g., CropRandomizer) - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer. - goal_shapes (OrderedDict): a dictionary that maps modality to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... 
""" self.perturbation_scale = perturbation_scale @@ -184,14 +168,8 @@ def __init__( obs_shapes=new_obs_shapes, ac_dim=ac_dim, mlp_layer_dims=mlp_layer_dims, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - visual_feature_dimension=visual_feature_dimension, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, goal_shapes=goal_shapes, + encoder_kwargs=encoder_kwargs, ) def forward(self, obs_dict, acts, goal_dict=None): @@ -229,14 +207,8 @@ def __init__( std_limits=(0.007, 7.5), low_noise_eval=True, use_tanh=False, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: @@ -276,24 +248,25 @@ def __init__( use_tanh (bool): if True, use a tanh-Gaussian distribution - visual_feature_dimension (int): feature dimension to encode images into. - - visual_core_class (str): specifies Visual Backbone network for encoding images. - - visual_core_kwargs (dict): arguments to pass to @visual_core_class. - - obs_randomizer_class (str): specifies observation randomizer class - - obs_randomizer_kwargs (dict): kwargs for observation randomizer (e.g., CropRandomizer) - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer. - goal_shapes (OrderedDict): a dictionary that maps modality to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... """ # parameters specific to Gaussian actor @@ -324,14 +297,8 @@ def softplus_scaled(x): obs_shapes=obs_shapes, ac_dim=ac_dim, mlp_layer_dims=mlp_layer_dims, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, goal_shapes=goal_shapes, + encoder_kwargs=encoder_kwargs, ) # If initialization weight was specified, make sure all final layer network weights are specified correctly @@ -441,14 +408,8 @@ def __init__( std_activation="softplus", low_noise_eval=True, use_tanh=False, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: @@ -474,24 +435,25 @@ def __init__( use_tanh (bool): if True, use a tanh-Gaussian distribution - visual_feature_dimension (int): feature dimension to encode images into. - - visual_core_class (str): specifies Visual Backbone network for encoding images. 
- - visual_core_kwargs (dict): arguments to pass to @visual_core_class. - - obs_randomizer_class (str): specifies observation randomizer class - - obs_randomizer_kwargs (dict): kwargs for observation randomizer (e.g., CropRandomizer) - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer. - goal_shapes (OrderedDict): a dictionary that maps modality to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... """ # parameters specific to GMM actor @@ -513,14 +475,8 @@ def __init__( obs_shapes=obs_shapes, ac_dim=ac_dim, mlp_layer_dims=mlp_layer_dims, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - visual_feature_dimension=visual_feature_dimension, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, goal_shapes=goal_shapes, + encoder_kwargs=encoder_kwargs, ) def _get_output_shapes(self): @@ -616,14 +572,8 @@ def __init__( rnn_num_layers, rnn_type="LSTM", # [LSTM, GRU] rnn_kwargs=None, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: @@ -642,24 +592,25 @@ def __init__( rnn_kwargs (dict): kwargs for the torch.nn.LSTM / GRU - visual_feature_dimension (int): feature dimension to encode images into. - - visual_core_class (str): specifies Visual Backbone network for encoding images. - - visual_core_kwargs (dict): arguments to pass to @visual_core_class. - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer. - goal_shapes (OrderedDict): a dictionary that maps modality to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... 
""" self.ac_dim = ac_dim @@ -691,13 +642,7 @@ def __init__( rnn_type=rnn_type, rnn_kwargs=rnn_kwargs, per_step=True, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, + encoder_kwargs=encoder_kwargs, ) def _get_output_shapes(self): @@ -797,14 +742,8 @@ def __init__( std_activation="softplus", low_noise_eval=True, use_tanh=False, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: @@ -831,6 +770,23 @@ def __init__( one of the GMM modes will be sampled (approximately) use_tanh (bool): if True, use a tanh-Gaussian distribution + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... """ # parameters specific to GMM actor @@ -856,14 +812,8 @@ def __init__( rnn_num_layers=rnn_num_layers, rnn_type=rnn_type, rnn_kwargs=rnn_kwargs, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, goal_shapes=goal_shapes, + encoder_kwargs=encoder_kwargs, ) def _get_output_shapes(self): @@ -1049,14 +999,8 @@ def __init__( prior_use_categorical=False, prior_categorical_dim=10, prior_categorical_gumbel_softmax_hard=False, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: @@ -1065,26 +1009,25 @@ def __init__( ac_dim (int): dimension of action space. - mlp_layer_dims ([int]): sequence of integers for the MLP hidden layers sizes. - - visual_feature_dimension (int): feature dimension to encode images into. - - visual_core_class (str): specifies Visual Backbone network for encoding images. - - visual_core_kwargs (dict): arguments to pass to @visual_core_class. - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer. - goal_shapes (OrderedDict): a dictionary that maps modality to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. 
+ Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... """ super(VAEActor, self).__init__() @@ -1118,14 +1061,8 @@ def __init__( prior_use_categorical=prior_use_categorical, prior_categorical_dim=prior_categorical_dim, prior_categorical_gumbel_softmax_hard=prior_categorical_gumbel_softmax_hard, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, goal_shapes=goal_shapes, + encoder_kwargs=encoder_kwargs, ) def encode(self, actions, obs_dict, goal_dict=None): diff --git a/robomimic/models/vae_nets.py b/robomimic/models/vae_nets.py index aa6b826d..ec230771 100644 --- a/robomimic/models/vae_nets.py +++ b/robomimic/models/vae_nets.py @@ -57,14 +57,8 @@ def __init__( param_obs_dependent, obs_shapes=None, mlp_layer_dims=(), - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: @@ -82,24 +76,25 @@ def __init__( mlp_layer_dims ([int]): sequence of integers for the MLP hidden layer sizes - visual_feature_dimension (int): feature dimension to encode images into - - visual_core_class (str): specifies Visual Backbone network for encoding images - - visual_core_kwargs (dict): arguments to pass to @visual_core_class - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer - goal_shapes (OrderedDict): a dictionary that maps modality to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... 
""" super(Prior, self).__init__() @@ -111,14 +106,8 @@ def __init__( net_kwargs = dict( obs_shapes=obs_shapes, mlp_layer_dims=mlp_layer_dims, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, goal_shapes=goal_shapes, + encoder_kwargs=encoder_kwargs, ) self._create_layers(net_kwargs) @@ -156,11 +145,7 @@ def _create_layers(self, net_kwargs): input_obs_group_shapes=obs_group_shapes, output_shapes=mlp_output_shapes, layer_dims=net_kwargs["mlp_layer_dims"], - visual_feature_dimension=net_kwargs["visual_feature_dimension"], - visual_core_class=net_kwargs["visual_core_class"], - visual_core_kwargs=net_kwargs["visual_core_kwargs"], - use_spatial_softmax=net_kwargs["use_spatial_softmax"], - spatial_softmax_kwargs=net_kwargs["spatial_softmax_kwargs"], + encoder_kwargs=net_kwargs["encoder_kwargs"], ) def sample(self, n, obs_dict=None, goal_dict=None): @@ -265,14 +250,8 @@ def __init__( gmm_learn_weights=False, obs_shapes=None, mlp_layer_dims=(), - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: @@ -304,24 +283,25 @@ def __init__( mlp_layer_dims ([int]): sequence of integers for the MLP hidden layer sizes - visual_feature_dimension (int): feature dimension to encode images into - - visual_core_class (str): specifies Visual Backbone network for encoding images - - visual_core_kwargs (dict): arguments to pass to @visual_core_class - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer - goal_shapes (OrderedDict): a dictionary that maps modality to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... 
""" self.device = device self.latent_dim = latent_dim @@ -371,14 +351,8 @@ def __init__( param_obs_dependent=param_obs_dependent, obs_shapes=obs_shapes, mlp_layer_dims=mlp_layer_dims, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, goal_shapes=goal_shapes, + encoder_kwargs=encoder_kwargs, ) def _create_layers(self, net_kwargs): @@ -564,14 +538,8 @@ def __init__( learnable=False, obs_shapes=None, mlp_layer_dims=(), - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ @@ -593,24 +561,25 @@ def __init__( mlp_layer_dims ([int]): sequence of integers for the MLP hidden layer sizes - visual_feature_dimension (int): feature dimension to encode images into - - visual_core_class (str): specifies Visual Backbone network for encoding images - - visual_core_kwargs (dict): arguments to pass to @visual_core_class - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer - goal_shapes (OrderedDict): a dictionary that maps modality to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... """ self.device = device self.latent_dim = latent_dim @@ -640,14 +609,8 @@ def __init__( param_obs_dependent=param_obs_dependent, obs_shapes=obs_shapes, mlp_layer_dims=mlp_layer_dims, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, goal_shapes=goal_shapes, + encoder_kwargs=encoder_kwargs, ) def _create_layers(self, net_kwargs): @@ -832,14 +795,8 @@ def __init__( prior_use_categorical=False, prior_categorical_dim=10, prior_categorical_gumbel_softmax_hard=False, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: @@ -936,27 +893,28 @@ def __init__( prior_categorical_gumbel_softmax_hard (bool): if True, use the "hard" version of Gumbel Softmax for reparametrization. Only used if @prior_use_categorical is True. 
- visual_feature_dimension (int): feature dimension to encode images into - - visual_core_class (str): specifies Visual Backbone network for encoding images - - visual_core_kwargs (dict): arguments to pass to @visual_core_class - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer - goal_shapes (OrderedDict): a dictionary that maps modality to expected shapes for goal observations. Goals are treates as additional conditioning inputs. They are usually specified separately because they have duplicate modalities as the conditioning inputs (otherwise they could just be added to the set of conditioning inputs). + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... """ super(VAE, self).__init__() @@ -1018,14 +976,8 @@ def __init__( self.prior_categorical_gumbel_softmax_hard = prior_categorical_gumbel_softmax_hard assert np.sum([self.prior_use_gmm, self.prior_use_categorical]) <= 1 - # for visual obs core - self._visual_feature_dimension = visual_feature_dimension - self._visual_core_class = visual_core_class - self._visual_core_kwargs = visual_core_kwargs if visual_core_kwargs is not None else OrderedDict() - self._obs_randomizer_class = obs_randomizer_class - self._obs_randomizer_kwargs = obs_randomizer_kwargs - self._use_spatial_softmax = use_spatial_softmax - self._spatial_softmax_kwargs = spatial_softmax_kwargs if spatial_softmax_kwargs is not None else OrderedDict() + # for obs core + self._encoder_kwargs = encoder_kwargs if self.prior_use_gmm: assert self.prior_learn, "GMM must be learned" @@ -1043,27 +995,16 @@ def _create_layers(self): """ self.nets = nn.ModuleDict() - # args for observation encoders - obs_enc_args = dict( - visual_feature_dimension=self._visual_feature_dimension, - visual_core_class=self._visual_core_class, - visual_core_kwargs=self._visual_core_kwargs, - obs_randomizer_class=self._obs_randomizer_class, - obs_randomizer_kwargs=self._obs_randomizer_kwargs, - use_spatial_softmax=self._use_spatial_softmax, - spatial_softmax_kwargs=self._spatial_softmax_kwargs, - ) - # VAE Encoder - self._create_encoder(obs_encoder_args=obs_enc_args) + self._create_encoder() # VAE Decoder - self._create_decoder(obs_encoder_args=obs_enc_args) + self._create_decoder() # VAE Prior. - self._create_prior(obs_encoder_args=obs_enc_args) + self._create_prior() - def _create_encoder(self, obs_encoder_args): + def _create_encoder(self): """ Helper function to create encoder. """ @@ -1091,10 +1032,10 @@ def _create_encoder(self, obs_encoder_args): input_obs_group_shapes=encoder_obs_group_shapes, output_shapes=encoder_output_shapes, layer_dims=self.encoder_layer_dims, - **obs_encoder_args + encoder_kwargs=self._encoder_kwargs, ) - def _create_decoder(self, obs_encoder_args): + def _create_decoder(self): """ Helper function to create decoder. 
""" @@ -1114,10 +1055,10 @@ def _create_decoder(self, obs_encoder_args): input_obs_group_shapes=decoder_obs_group_shapes, output_shapes=self.output_shapes, layer_dims=self.decoder_layer_dims, - **obs_encoder_args + encoder_kwargs=self._encoder_kwargs, ) - def _create_prior(self, obs_encoder_args): + def _create_prior(self): """ Helper function to create prior. """ @@ -1138,7 +1079,7 @@ def _create_prior(self, obs_encoder_args): obs_shapes=prior_obs_group_shapes["condition"], mlp_layer_dims=self.prior_layer_dims, goal_shapes=prior_obs_group_shapes["goal"], - **obs_encoder_args + encoder_kwargs=self._encoder_kwargs, ) else: self.nets["prior"] = GaussianPrior( @@ -1152,7 +1093,7 @@ def _create_prior(self, obs_encoder_args): obs_shapes=prior_obs_group_shapes["condition"], mlp_layer_dims=self.prior_layer_dims, goal_shapes=prior_obs_group_shapes["goal"], - **obs_encoder_args + encoder_kwargs=self._encoder_kwargs, ) def encode(self, inputs, conditions=None, goals=None): diff --git a/robomimic/models/value_nets.py b/robomimic/models/value_nets.py index 20829e74..c98fa7e4 100644 --- a/robomimic/models/value_nets.py +++ b/robomimic/models/value_nets.py @@ -27,18 +27,12 @@ def __init__( obs_shapes, mlp_layer_dims, value_bounds=None, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: - obs_shapes (OrderedDict): a dictionary that maps modality to + obs_shapes (OrderedDict): a dictionary that maps observation keys to expected shapes for observations. mlp_layer_dims ([int]): sequence of integers for the MLP hidden layers sizes. @@ -47,24 +41,25 @@ def __init__( that the network should be possible of generating. The network will rescale outputs using a tanh layer to lie within these bounds. If None, no tanh re-scaling is done. - visual_feature_dimension (int): feature dimension to encode images into. - - visual_core_class (str): specifies Visual Backbone network for encoding images. - - visual_core_kwargs (dict): arguments to pass to @visual_core_class. - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer. - - goal_shapes (OrderedDict): a dictionary that maps modality to + goal_shapes (OrderedDict): a dictionary that maps observation keys to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-observation key information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... 
""" self.value_bounds = value_bounds if self.value_bounds is not None: @@ -93,13 +88,7 @@ def __init__( input_obs_group_shapes=observation_group_shapes, output_shapes=output_shapes, layer_dims=mlp_layer_dims, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, + encoder_kwargs=encoder_kwargs, ) def _get_output_shapes(self): @@ -148,18 +137,12 @@ def __init__( ac_dim, mlp_layer_dims, value_bounds=None, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: - obs_shapes (OrderedDict): a dictionary that maps modality to + obs_shapes (OrderedDict): a dictionary that maps observation keys to expected shapes for observations. ac_dim (int): dimension of action space. @@ -170,24 +153,25 @@ def __init__( that the network should be possible of generating. The network will rescale outputs using a tanh layer to lie within these bounds. If None, no tanh re-scaling is done. - visual_feature_dimension (int): feature dimension to encode images into. - - visual_core_class (str): specifies Visual Backbone network for encoding images. - - visual_core_kwargs (dict): arguments to pass to @visual_core_class. - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer. - - goal_shapes (OrderedDict): a dictionary that maps modality to + goal_shapes (OrderedDict): a dictionary that maps observation keys to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-observation key information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... 
""" # add in action as a modality @@ -200,14 +184,8 @@ def __init__( obs_shapes=new_obs_shapes, mlp_layer_dims=mlp_layer_dims, value_bounds=value_bounds, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_class=obs_randomizer_class, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, goal_shapes=goal_shapes, + encoder_kwargs=encoder_kwargs, ) def forward(self, obs_dict, acts, goal_dict=None): @@ -235,14 +213,8 @@ def __init__( mlp_layer_dims, value_bounds, num_atoms, - visual_feature_dimension=64, - visual_core_class='ResNet18Conv', - visual_core_kwargs=None, - obs_randomizer_class=None, - obs_randomizer_kwargs=None, - use_spatial_softmax=False, - spatial_softmax_kwargs=None, goal_shapes=None, + encoder_kwargs=None, ): """ Args: @@ -260,24 +232,25 @@ def __init__( num_atoms (int): number of value atoms to use for the categorical distribution - which is the representation of the value distribution. - visual_feature_dimension (int): feature dimension to encode images into. - - visual_core_class (str): specifies Visual Backbone network for encoding images. - - visual_core_kwargs (dict): arguments to pass to @visual_core_class. - - obs_randomizer_class (str): specifies a Randomizer class for the input modality - - obs_randomizer_kwargs (dict): kwargs for the observation randomizer - - use_spatial_softmax (bool): if True, introduce a spatial softmax layer at - the end of the visual backbone network, resulting in a sharp bottleneck - representation for visual inputs. - - spatial_softmax_kwargs (dict): arguments to pass to spatial softmax layer. - goal_shapes (OrderedDict): a dictionary that maps modality to expected shapes for goal observations. + + encoder_kwargs (dict or None): If None, results in default encoder_kwargs being applied. Otherwise, should + be nested dictionary containing relevant per-modality information for encoder networks. + Should be of form: + + obs_modality1: dict + feature_dimension: int + core_class: str + core_kwargs: dict + ... + ... + obs_randomizer_class: str + obs_randomizer_kwargs: dict + ... + ... + obs_modality2: dict + ... 
""" # parameters specific to DistributionalActionValueNetwork @@ -290,13 +263,8 @@ def __init__( ac_dim=ac_dim, mlp_layer_dims=mlp_layer_dims, value_bounds=value_bounds, - visual_feature_dimension=visual_feature_dimension, - visual_core_class=visual_core_class, - visual_core_kwargs=visual_core_kwargs, - obs_randomizer_kwargs=obs_randomizer_kwargs, - use_spatial_softmax=use_spatial_softmax, - spatial_softmax_kwargs=spatial_softmax_kwargs, goal_shapes=goal_shapes, + encoder_kwargs=encoder_kwargs, ) def _get_output_shapes(self): diff --git a/robomimic/scripts/dataset_states_to_obs.py b/robomimic/scripts/dataset_states_to_obs.py index e9fbcd6c..008d955a 100644 --- a/robomimic/scripts/dataset_states_to_obs.py +++ b/robomimic/scripts/dataset_states_to_obs.py @@ -239,12 +239,14 @@ def dataset_states_to_obs(args): parser.add_argument( "--dataset", type=str, + required=True, help="path to input hdf5 dataset", ) # name of hdf5 to write - it will be in the same directory as @dataset parser.add_argument( "--output_name", type=str, + required=True, help="name of output hdf5 dataset", ) diff --git a/robomimic/scripts/download_datasets.py b/robomimic/scripts/download_datasets.py index d159c132..3ad45965 100644 --- a/robomimic/scripts/download_datasets.py +++ b/robomimic/scripts/download_datasets.py @@ -55,21 +55,6 @@ ALL_HDF5_TYPES = ["raw", "low_dim", "image", "low_dim_sparse", "low_dim_dense", "image_sparse", "image_dense"] -def make_dataset_dirs(base_dir): - """ - Create directory structure for all datasets. The datasets are organized into - subfolders by task (e.g. lift, can, square, transport, tool hang) and dataset types - (e.g. mg (machine generated), ph (proficient human), mh (multi-human)). - - Args: - base_dir (str): base dataset directory where all subfolders should be created - """ - for task in DATASET_REGISTRY: - for dataset_type in DATASET_REGISTRY[task]: - dataset_dir = os.path.join(base_dir, task, dataset_type) - os.makedirs(dataset_dir, exist_ok=True) - - if __name__ == "__main__": parser = argparse.ArgumentParser() @@ -122,11 +107,10 @@ def make_dataset_dirs(base_dir): args = parser.parse_args() - # make directory structure + # set default base directory for downloads default_base_dir = args.download_dir if default_base_dir is None: default_base_dir = os.path.join(robomimic.__path__[0], "../datasets") - make_dataset_dirs(base_dir=default_base_dir) # load args download_tasks = args.tasks @@ -163,6 +147,8 @@ def make_dataset_dirs(base_dir): if args.dry_run: print("\ndry run: skip download") else: + # Make sure path exists and create if it doesn't + os.makedirs(download_dir, exist_ok=True) FileUtils.download_url( url=DATASET_REGISTRY[task][dataset_type][hdf5_type]["url"], download_dir=download_dir, diff --git a/robomimic/scripts/download_momart_datasets.py b/robomimic/scripts/download_momart_datasets.py new file mode 100644 index 00000000..affecf11 --- /dev/null +++ b/robomimic/scripts/download_momart_datasets.py @@ -0,0 +1,161 @@ +""" +Script to download datasets used in MoMaRT paper (https://arxiv.org/abs/2112.05251). By default, all +datasets will be stored at robomimic/datasets, unless the @download_dir +argument is supplied. We recommend using the default, as most examples that +use these datasets assume that they can be found there. + +The @tasks and @dataset_types arguments can all be supplied +to choose which datasets to download. + +Args: + download_dir (str): Base download directory. Created if it doesn't exist. 
+        Defaults to datasets folder in repository - only pass in if you would
+        like to override the location.
+
+    tasks (list): Tasks to download datasets for. Defaults to table_setup_from_dishwasher task. Pass 'all' to
+        download all tasks - 5 total:
+            - table_setup_from_dishwasher
+            - table_setup_from_dresser
+            - table_cleanup_to_dishwasher
+            - table_cleanup_to_sink
+            - unload_dishwasher
+
+    dataset_types (list): Dataset types to download datasets for (expert, suboptimal, generalize, sample).
+        Defaults to expert. Pass 'all' to download datasets for all available dataset
+        types per task, or directly specify the list of dataset types.
+        NOTE: Because these datasets are huge, we will always print out a warning
+        that a user must respond yes to in order to acknowledge the data size (can be up to >100G for all tasks of a single type)
+
+Example usage:
+
+    # default behavior - just download expert table_setup_from_dishwasher dataset
+    python download_momart_datasets.py
+
+    # download expert datasets for all tasks
+    # (do a dry run first to see which datasets would be downloaded)
+    python download_momart_datasets.py --tasks all --dataset_types expert --dry_run
+    python download_momart_datasets.py --tasks all --dataset_types expert
+
+    # download all expert and suboptimal datasets for the table_setup_from_dishwasher and table_cleanup_to_dishwasher tasks
+    python download_momart_datasets.py --tasks table_setup_from_dishwasher table_cleanup_to_dishwasher --dataset_types expert suboptimal
+
+    # download the sample datasets
+    python download_momart_datasets.py --tasks all --dataset_types sample
+
+    # download all datasets
+    python download_momart_datasets.py --tasks all --dataset_types all
+"""
+import os
+import argparse
+
+import robomimic
+import robomimic.utils.file_utils as FileUtils
+from robomimic import MOMART_DATASET_REGISTRY
+
+ALL_TASKS = [
+    "table_setup_from_dishwasher",
+    "table_setup_from_dresser",
+    "table_cleanup_to_dishwasher",
+    "table_cleanup_to_sink",
+    "unload_dishwasher",
+]
+ALL_DATASET_TYPES = [
+    "expert",
+    "suboptimal",
+    "generalize",
+    "sample",
+]
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    # directory to download datasets to
+    parser.add_argument(
+        "--download_dir",
+        type=str,
+        default=None,
+        help="Base download directory. Created if it doesn't exist. Defaults to datasets folder in repository.",
+    )
+
+    # tasks to download datasets for
+    parser.add_argument(
+        "--tasks",
+        type=str,
+        nargs='+',
+        default=["table_setup_from_dishwasher"],
+        help="Tasks to download datasets for. Defaults to table_setup_from_dishwasher task. Pass 'all' to download all "
+             f"5 tasks, or directly specify the list of tasks. Options are any of: {ALL_TASKS}",
+    )
+
+    # dataset types to download datasets for
+    parser.add_argument(
+        "--dataset_types",
+        type=str,
+        nargs='+',
+        default=["expert"],
+        help="Dataset types to download datasets for (e.g. expert, suboptimal). Defaults to expert. Pass 'all' to "
+             "download datasets for all available dataset types per task, or directly specify the list of dataset "
+             f"types. 
Options are any of: {ALL_DATASET_TYPES}", + ) + + # dry run - don't actually download datasets, but print which datasets would be downloaded + parser.add_argument( + "--dry_run", + action='store_true', + help="set this flag to do a dry run to only print which datasets would be downloaded" + ) + + args = parser.parse_args() + + # set default base directory for downloads + default_base_dir = args.download_dir + if default_base_dir is None: + default_base_dir = os.path.join(robomimic.__path__[0], "../datasets") + + # load args + download_tasks = args.tasks + if "all" in download_tasks: + assert len(download_tasks) == 1, "all should be only tasks argument but got: {}".format(args.tasks) + download_tasks = ALL_TASKS + + download_dataset_types = args.dataset_types + if "all" in download_dataset_types: + assert len(download_dataset_types) == 1, "all should be only dataset_types argument but got: {}".format(args.dataset_types) + download_dataset_types = ALL_DATASET_TYPES + + # Run sanity check first to warn user if they're about to download a huge amount of data + total_size = 0 + for task in MOMART_DATASET_REGISTRY: + if task in download_tasks: + for dataset_type in MOMART_DATASET_REGISTRY[task]: + if dataset_type in download_dataset_types: + total_size += MOMART_DATASET_REGISTRY[task][dataset_type]["size"] + + # Verify user acknowledgement if we're not doing a dry run + if not args.dry_run: + user_response = input(f"Warning: requested datasets will take a total of {total_size}GB. Proceed? y/n\n") + assert user_response.lower() in {"yes", "y"}, f"Did not receive confirmation. Aborting download." + + # download requested datasets + for task in MOMART_DATASET_REGISTRY: + if task in download_tasks: + for dataset_type in MOMART_DATASET_REGISTRY[task]: + if dataset_type in download_dataset_types: + dataset_info = MOMART_DATASET_REGISTRY[task][dataset_type] + download_dir = os.path.abspath(os.path.join(default_base_dir, task, dataset_type)) + print(f"\nDownloading dataset:\n" + f" task: {task}\n" + f" dataset type: {dataset_type}\n" + f" dataset size: {dataset_info['size']}GB\n" + f" download path: {download_dir}") + if args.dry_run: + print("\ndry run: skip download") + else: + # Make sure path exists and create if it doesn't + os.makedirs(download_dir, exist_ok=True) + FileUtils.download_url( + url=dataset_info["url"], + download_dir=download_dir, + ) + print("") diff --git a/robomimic/scripts/generate_paper_configs.py b/robomimic/scripts/generate_paper_configs.py index a5a8e2b2..a455d406 100644 --- a/robomimic/scripts/generate_paper_configs.py +++ b/robomimic/scripts/generate_paper_configs.py @@ -72,24 +72,24 @@ def modify_config_for_default_low_dim_exp(config): ] # handle hierarchical observation configs if config.algo_name == "hbc": - mod_configs_to_set = [ + configs_to_set = [ config.observation.actor.modalities.obs, config.observation.planner.modalities.obs, config.observation.planner.modalities.subgoal, ] elif config.algo_name == "iris": - mod_configs_to_set = [ + configs_to_set = [ config.observation.actor.modalities.obs, config.observation.value_planner.planner.modalities.obs, config.observation.value_planner.planner.modalities.subgoal, config.observation.value_planner.value.modalities.obs, ] else: - mod_configs_to_set = [config.observation.modalities.obs] + configs_to_set = [config.observation.modalities.obs] # set all observations / subgoals to use the correct low-dim modalities - for mod_config in mod_configs_to_set: - mod_config.low_dim = list(default_low_dim_obs) - mod_config.image = [] 
+ for cfg in configs_to_set: + cfg.low_dim = list(default_low_dim_obs) + cfg.rgb = [] return config @@ -140,30 +140,33 @@ def modify_config_for_default_image_exp(config): "robot0_eef_quat", "robot0_gripper_qpos", ] - config.observation.modalities.obs.image = [ + config.observation.modalities.obs.rgb = [ "agentview_image", "robot0_eye_in_hand_image", ] config.observation.modalities.goal.low_dim = [] - config.observation.modalities.goal.image = [] + config.observation.modalities.goal.rgb = [] # default image encoder architecture is ResNet with spatial softmax - config.observation.encoder.visual_core = 'ResNet18Conv' - config.observation.encoder.visual_core_kwargs = Config() - config.observation.encoder.visual_feature_dimension = 64 - - config.observation.encoder.use_spatial_softmax = True - config.observation.encoder.spatial_softmax_kwargs.num_kp = 32 - config.observation.encoder.spatial_softmax_kwargs.learnable_temperature = False - config.observation.encoder.spatial_softmax_kwargs.temperature = 1.0 - config.observation.encoder.spatial_softmax_kwargs.noise_std = 0. - - # use crop randomization as well - config.observation.encoder.obs_randomizer_class = 'CropRandomizer' # observation randomizer class - config.observation.encoder.obs_randomizer_kwargs.crop_height = 76 - config.observation.encoder.obs_randomizer_kwargs.crop_width = 76 - config.observation.encoder.obs_randomizer_kwargs.num_crops = 1 - config.observation.encoder.obs_randomizer_kwargs.pos_enc = False + config.observation.encoder.rgb.core_class = "VisualCore" + config.observation.encoder.rgb.core_kwargs.feature_dimension = 64 + config.observation.encoder.rgb.core_kwargs.backbone_class = 'ResNet18Conv' # ResNet backbone for image observations (unused if no image observations) + config.observation.encoder.rgb.core_kwargs.backbone_kwargs.pretrained = False # kwargs for visual core + config.observation.encoder.rgb.core_kwargs.backbone_kwargs.input_coord_conv = False + config.observation.encoder.rgb.core_kwargs.pool_class = "SpatialSoftmax" # Alternate options are "SpatialMeanPool" or None (no pooling) + config.observation.encoder.rgb.core_kwargs.pool_kwargs.num_kp = 32 # Default arguments for "SpatialSoftmax" + config.observation.encoder.rgb.core_kwargs.pool_kwargs.learnable_temperature = False # Default arguments for "SpatialSoftmax" + config.observation.encoder.rgb.core_kwargs.pool_kwargs.temperature = 1.0 # Default arguments for "SpatialSoftmax" + config.observation.encoder.rgb.core_kwargs.pool_kwargs.noise_std = 0.0 + + # observation randomizer class - set to None to use no randomization, or 'CropRandomizer' to use crop randomization + config.observation.encoder.rgb.obs_randomizer_class = "CropRandomizer" + + # kwargs for observation randomizers (for the CropRandomizer, this is size and number of crops) + config.observation.encoder.rgb.obs_randomizer_kwargs.crop_height = 76 + config.observation.encoder.rgb.obs_randomizer_kwargs.crop_width = 76 + config.observation.encoder.rgb.obs_randomizer_kwargs.num_crops = 1 + config.observation.encoder.rgb.obs_randomizer_kwargs.pos_enc = False return config @@ -236,22 +239,22 @@ def modify_config_for_dataset(config, task_name, dataset_type, hdf5_type, base_d if task_name == "tool_hang_real": # side and wrist camera - config.observation.modalities.obs.image = [ + config.observation.modalities.obs.rgb = [ "image_side", "image_wrist", ] # 240x240 images -> crops should be 216x216 - config.observation.encoder.obs_randomizer_kwargs.crop_height = 216 - 
config.observation.encoder.obs_randomizer_kwargs.crop_width = 216 + config.observation.encoder.rgb.obs_randomizer_kwargs.crop_height = 216 + config.observation.encoder.rgb.obs_randomizer_kwargs.crop_width = 216 else: # front and wrist camera - config.observation.modalities.obs.image = [ + config.observation.modalities.obs.rgb = [ "image", "image_wrist", ] # 120x120 images -> crops should be 108x108 - config.observation.encoder.obs_randomizer_kwargs.crop_height = 108 - config.observation.encoder.obs_randomizer_kwargs.crop_width = 108 + config.observation.encoder.rgb.obs_randomizer_kwargs.crop_height = 108 + config.observation.encoder.rgb.obs_randomizer_kwargs.crop_width = 108 elif hdf5_type in ["image", "image_sparse", "image_dense"]: if task_name == "transport": @@ -266,7 +269,7 @@ def modify_config_for_dataset(config, task_name, dataset_type, hdf5_type, base_d ] # shoulder and wrist cameras per arm - config.observation.modalities.obs.image = [ + config.observation.modalities.obs.rgb = [ "shouldercamera0_image", "robot0_eye_in_hand_image", "shouldercamera1_image", @@ -274,13 +277,13 @@ def modify_config_for_dataset(config, task_name, dataset_type, hdf5_type, base_d ] elif task_name == "tool_hang": # side and wrist camera - config.observation.modalities.obs.image = [ + config.observation.modalities.obs.rgb = [ "sideview_image", "robot0_eye_in_hand_image", ] # 240x240 images -> crops should be 216x216 - config.observation.encoder.obs_randomizer_kwargs.crop_height = 216 - config.observation.encoder.obs_randomizer_kwargs.crop_width = 216 + config.observation.encoder.rgb.obs_randomizer_kwargs.crop_height = 216 + config.observation.encoder.rgb.obs_randomizer_kwargs.crop_width = 216 elif hdf5_type in ["low_dim", "low_dim_sparse", "low_dim_dense"]: if task_name == "transport": @@ -296,24 +299,24 @@ def modify_config_for_dataset(config, task_name, dataset_type, hdf5_type, base_d ] # handle hierarchical observation configs if config.algo_name == "hbc": - mod_configs_to_set = [ + configs_to_set = [ config.observation.actor.modalities.obs, config.observation.planner.modalities.obs, config.observation.planner.modalities.subgoal, ] elif config.algo_name == "iris": - mod_configs_to_set = [ + configs_to_set = [ config.observation.actor.modalities.obs, config.observation.value_planner.planner.modalities.obs, config.observation.value_planner.planner.modalities.subgoal, config.observation.value_planner.value.modalities.obs, ] else: - mod_configs_to_set = [config.observation.modalities.obs] + configs_to_set = [config.observation.modalities.obs] # set all observations / subgoals to use the correct low-dim modalities - for mod_config in mod_configs_to_set: - mod_config.low_dim = list(default_low_dim_obs) - mod_config.image = [] + for obs_key_config in configs_to_set: + obs_key_config.low_dim = list(default_low_dim_obs) + obs_key_config.rgb = [] return config @@ -722,7 +725,7 @@ def generate_experiment_config( algo_config_name = "bc" if algo_name == "bc_rnn" else algo_name config = config_factory(algo_name=algo_config_name) - # turn into default config for observation type (low-dim or image) + # turn into default config for observation modalities (e.g.: low-dim or rgb) config = modifier_for_obs(config) # add in config based on the dataset config = modify_config_for_dataset( @@ -991,13 +994,13 @@ def add_proprio(config): def remove_wrist(config): with config.observation.values_unlocked(): - old_image_mods = list(config.observation.modalities.obs.image) - config.observation.modalities.obs.image = [m for m in 
old_image_mods if "eye_in_hand" not in m] + old_image_mods = list(config.observation.modalities.obs.rgb) + config.observation.modalities.obs.rgb = [m for m in old_image_mods if "eye_in_hand" not in m] return config def remove_rand(config): with config.observation.values_unlocked(): - config.observation.encoder.obs_randomizer_class = None + config.observation.encoder.rgb.obs_randomizer_class = None return config obs_ablation_json_paths = Config() # use for convenient nested dict @@ -1079,8 +1082,8 @@ def change_mlp(config): def change_conv(config): with config.observation.values_unlocked(): - config.observation.encoder.visual_core = 'ShallowConv' - config.observation.encoder.visual_core_kwargs = Config() + config.observation.encoder.rgb.core_class = 'ShallowConv' + config.observation.encoder.rgb.core_kwargs = Config() return config def change_rnnd_low_dim(config): diff --git a/robomimic/scripts/get_dataset_info.py b/robomimic/scripts/get_dataset_info.py index ca4fd3fc..9349ed8a 100644 --- a/robomimic/scripts/get_dataset_info.py +++ b/robomimic/scripts/get_dataset_info.py @@ -116,9 +116,9 @@ for k in f["data/{}".format(ep)]: if k in ["obs", "next_obs"]: print(" key: {}".format(k)) - for mod in f["data/{}/{}".format(ep, k)]: - mod_shape = f["data/{}/{}/{}".format(ep, k, mod)].shape - print(" mod {} with shape {}".format(mod, mod_shape)) + for obs_k in f["data/{}/{}".format(ep, k)]: + shape = f["data/{}/{}/{}".format(ep, k, obs_k)].shape + print(" observation key {} with shape {}".format(obs_k, shape)) elif isinstance(f["data/{}/{}".format(ep, k)], h5py.Dataset): key_shape = f["data/{}/{}".format(ep, k)].shape print(" key: {} with shape {}".format(k, key_shape)) diff --git a/robomimic/scripts/playback_dataset.py b/robomimic/scripts/playback_dataset.py index aaf1469f..94799140 100644 --- a/robomimic/scripts/playback_dataset.py +++ b/robomimic/scripts/playback_dataset.py @@ -66,7 +66,15 @@ import robomimic.utils.obs_utils as ObsUtils import robomimic.utils.env_utils as EnvUtils import robomimic.utils.file_utils as FileUtils -from robomimic.envs.env_base import EnvBase +from robomimic.envs.env_base import EnvBase, EnvType + + +# Define default cameras to use for each env type +DEFAULT_CAMERAS = { + EnvType.ROBOSUITE_TYPE: ["agentview"], + EnvType.IG_MOMART_TYPE: ["rgb"], + EnvType.GYM_TYPE: ValueError("No camera names supported for gym type env!"), +} def playback_trajectory_with_env( @@ -150,7 +158,7 @@ def playback_trajectory_with_obs( first=False, ): """ - This function reads all "image" observations in the dataset trajectory and + This function reads all "rgb" observations in the dataset trajectory and writes them into a video. 
Args: @@ -181,9 +189,18 @@ def playback_dataset(args): # some arg checking write_video = (args.video_path is not None) assert not (args.render and write_video) # either on-screen or video but not both + + # Auto-fill camera rendering info if not specified + if args.render_image_names is None: + # We fill in the automatic values + env_meta = FileUtils.get_env_metadata_from_dataset(dataset_path=args.dataset) + env_type = EnvUtils.get_env_type(env_meta=env_meta) + args.render_image_names = DEFAULT_CAMERAS[env_type] + if args.render: # on-screen rendering can only support one camera assert len(args.render_image_names) == 1 + if args.use_obs: assert write_video, "playback with observations can only write to video" assert not args.use_actions, "playback with observations is offline and does not support action playback" @@ -195,7 +212,7 @@ def playback_dataset(args): dummy_spec = dict( obs=dict( low_dim=["robot0_eef_pos"], - image=[], + rgb=[], ), ) ObsUtils.initialize_obs_utils_with_obs_specs(obs_modality_specs=dummy_spec) @@ -331,8 +348,9 @@ def playback_dataset(args): "--render_image_names", type=str, nargs='+', - default=["agentview"], - help="(optional) camera name(s) / image observation(s) to use for rendering on-screen or to video", + default=None, + help="(optional) camera name(s) / image observation(s) to use for rendering on-screen or to video. Default is" + "None, which corresponds to a predefined camera for each env type", ) # Only use the first frame of each episode diff --git a/robomimic/scripts/run_trained_agent.py b/robomimic/scripts/run_trained_agent.py index 94a5bf7f..95bddb21 100644 --- a/robomimic/scripts/run_trained_agent.py +++ b/robomimic/scripts/run_trained_agent.py @@ -143,8 +143,8 @@ def rollout(policy, env, horizon, render=False, video_writer=None, video_skip=5, # Note: We need to "unprocess" the observations to prepare to write them to dataset. # This includes operations like channel swapping and float to uint8 conversion # for saving disk space. - traj["obs"].append(ObsUtils.unprocess_obs(obs)) - traj["next_obs"].append(ObsUtils.unprocess_obs(next_obs)) + traj["obs"].append(ObsUtils.unprocess_obs_dict(obs)) + traj["next_obs"].append(ObsUtils.unprocess_obs_dict(next_obs)) # break if done or if success if done or success: diff --git a/robomimic/scripts/train.py b/robomimic/scripts/train.py index 2438b037..45345639 100644 --- a/robomimic/scripts/train.py +++ b/robomimic/scripts/train.py @@ -62,7 +62,7 @@ def train(config, device): sys.stdout = logger sys.stderr = logger - # read config to set up metadata for observation types (e.g. detecting image observations) + # read config to set up metadata for observation modalities (e.g. 
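
# With the auto-fill added to playback_dataset.py above, --render_image_names may be omitted and
# a per-environment default camera is looked up in DEFAULT_CAMERAS. A hedged usage sketch (the
# dataset path is a placeholder, and the flag names are assumed to mirror the argparse attributes
# referenced in this script):
#
#   python robomimic/scripts/playback_dataset.py --dataset /path/to/demo.hdf5 \
#       --video_path /tmp/playback.mp4
#
# When --render is used instead of writing a video, the script asserts that exactly one camera
# name ends up in args.render_image_names.
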
detecting rgb observations) ObsUtils.initialize_obs_utils_with_config(config) # make sure the dataset exists @@ -75,7 +75,7 @@ def train(config, device): env_meta = FileUtils.get_env_metadata_from_dataset(dataset_path=config.train.data) shape_meta = FileUtils.get_shape_metadata_from_dataset( dataset_path=config.train.data, - all_modalities=config.all_modalities, + all_obs_keys=config.all_obs_keys, verbose=True ) @@ -114,7 +114,7 @@ def train(config, device): model = algo_factory( algo_name=config.algo_name, config=config, - modality_shapes=shape_meta["all_shapes"], + obs_key_shapes=shape_meta["all_shapes"], ac_dim=shape_meta["ac_dim"], device=device, ) @@ -125,7 +125,7 @@ def train(config, device): # load training data trainset, validset = TrainUtils.load_data_for_training( - config, obs_keys=shape_meta["all_modalities"]) + config, obs_keys=shape_meta["all_obs_keys"]) train_sampler = trainset.get_dataset_sampler() print("\n============= Training Dataset =============") print(trainset) diff --git a/robomimic/utils/dataset.py b/robomimic/utils/dataset.py index 501dbc33..03b7bd7b 100644 --- a/robomimic/utils/dataset.py +++ b/robomimic/utils/dataset.py @@ -127,7 +127,7 @@ def __init__( # only store low-dim observations obs_keys_in_memory = [] for k in self.obs_keys: - if not (ObsUtils.key_is_image(k)): + if ObsUtils.key_is_obs_modality(k, "low_dim"): obs_keys_in_memory.append(k) self.obs_keys_in_memory = obs_keys_in_memory @@ -336,12 +336,12 @@ def _aggregate_traj_stats(traj_stats_a, traj_stats_b): # with the previous statistics. ep = self.demos[0] obs_traj = {k: self.hdf5_file["data/{}/obs/{}".format(ep, k)][()].astype('float32') for k in self.obs_keys} - obs_traj = ObsUtils.process_obs(obs_traj) + obs_traj = ObsUtils.process_obs_dict(obs_traj) merged_stats = _compute_traj_stats(obs_traj) print("SequenceDataset: normalizing observations...") for ep in LogUtils.custom_tqdm(self.demos[1:]): obs_traj = {k: self.hdf5_file["data/{}/obs/{}".format(ep, k)][()].astype('float32') for k in self.obs_keys} - obs_traj = ObsUtils.process_obs(obs_traj) + obs_traj = ObsUtils.process_obs_dict(obs_traj) traj_stats = _compute_traj_stats(obs_traj) merged_stats = _aggregate_traj_stats(merged_stats, traj_stats) @@ -354,12 +354,12 @@ def _aggregate_traj_stats(traj_stats_a, traj_stats_b): def get_obs_normalization_stats(self): """ - Returns dictionary of mean and std for each observation modality if using + Returns dictionary of mean and std for each observation key if using observation normalization, otherwise None. Returns: obs_normalization_stats (dict): a dictionary for observation - normalization. This maps observation modality keys to dicts + normalization. This maps observation keys to dicts with a "mean" and "std" of shape (1, ...) where ... is the default shape for the observation. 
""" @@ -544,7 +544,7 @@ def get_obs_sequence_from_demo(self, demo_id, index_in_demo, keys, num_frames_to obs["pad_mask"] = pad_mask # prepare image observations from dataset - return ObsUtils.process_obs(obs) + return ObsUtils.process_obs_dict(obs) def get_dataset_sequence_from_demo(self, demo_id, index_in_demo, keys, seq_length=1): """ diff --git a/robomimic/utils/env_utils.py b/robomimic/utils/env_utils.py index 9c21f204..9c722e15 100644 --- a/robomimic/utils/env_utils.py +++ b/robomimic/utils/env_utils.py @@ -37,6 +37,9 @@ def get_env_class(env_meta=None, env_type=None, env=None): elif env_type == EB.EnvType.GYM_TYPE: from robomimic.envs.env_gym import EnvGym return EnvGym + elif env_type == EB.EnvType.IG_MOMART_TYPE: + from robomimic.envs.env_ig_momart import EnvGibsonMOMART + return EnvGibsonMOMART raise Exception("code should never reach this point") diff --git a/robomimic/utils/file_utils.py b/robomimic/utils/file_utils.py index 7079b02d..78cc5076 100644 --- a/robomimic/utils/file_utils.py +++ b/robomimic/utils/file_utils.py @@ -84,13 +84,13 @@ def get_env_metadata_from_dataset(dataset_path): return env_meta -def get_shape_metadata_from_dataset(dataset_path, all_modalities=None, verbose=False): +def get_shape_metadata_from_dataset(dataset_path, all_obs_keys=None, verbose=False): """ Retrieves shape metadata from dataset. Args: dataset_path (str): path to dataset - all_modalities (list): list of all modalities used by the model. If not provided, all modalities + all_obs_keys (list): list of all modalities used by the model. If not provided, all modalities present in the file are used. verbose (bool): if True, include print statements @@ -98,8 +98,8 @@ def get_shape_metadata_from_dataset(dataset_path, all_modalities=None, verbose=F shape_meta (dict): shape metadata. 
Contains the following keys: :`'ac_dim'`: action space dimension - :`'all_shapes'`: dictionary that maps observation modality string to modality shape - :`'all_modalities'`: list of all observation modalities used + :`'all_shapes'`: dictionary that maps observation key string to shape + :`'all_obs_keys'`: list of all observation modalities used :`'use_images'`: bool, whether or not image modalities are present """ @@ -117,24 +117,25 @@ def get_shape_metadata_from_dataset(dataset_path, all_modalities=None, verbose=F # observation dimensions all_shapes = OrderedDict() - if all_modalities is None: + if all_obs_keys is None: # use all modalities present in the file - all_modalities = [k for k in demo["obs"]] + all_obs_keys = [k for k in demo["obs"]] - for k in sorted(all_modalities): - all_shapes[k] = demo["obs/{}".format(k)].shape[1:] + for k in sorted(all_obs_keys): + initial_shape = demo["obs/{}".format(k)].shape[1:] if verbose: - print("obs modality {} with shape {}".format(k, all_shapes[k])) - - for k in all_shapes: - if ObsUtils.key_is_image(k): - all_shapes[k] = ObsUtils.process_image_shape(all_shapes[k]) + print("obs key {} with shape {}".format(k, initial_shape)) + # Store processed shape for each obs key + all_shapes[k] = ObsUtils.get_processed_shape( + obs_modality=ObsUtils.OBS_KEYS_TO_MODALITIES[k], + input_shape=initial_shape, + ) f.close() shape_meta['all_shapes'] = all_shapes - shape_meta['all_modalities'] = all_modalities - shape_meta['use_images'] = ObsUtils.has_image(all_modalities) + shape_meta['all_obs_keys'] = all_obs_keys + shape_meta['use_images'] = ObsUtils.has_modality("rgb", all_obs_keys) return shape_meta @@ -263,7 +264,7 @@ def policy_from_checkpoint(device=None, ckpt_path=None, ckpt_dict=None, verbose= algo_name, _ = algo_name_from_checkpoint(ckpt_dict=ckpt_dict) config, _ = config_from_checkpoint(algo_name=algo_name, ckpt_dict=ckpt_dict, verbose=verbose) - # read config to set up metadata for observation types (e.g. detecting image observations) + # read config to set up metadata for observation modalities (e.g. detecting rgb observations) ObsUtils.initialize_obs_utils_with_config(config) # env meta from model dict to get info needed to create model @@ -286,7 +287,7 @@ def policy_from_checkpoint(device=None, ckpt_path=None, ckpt_dict=None, verbose= model = algo_factory( algo_name, config, - modality_shapes=shape_meta["all_shapes"], + obs_key_shapes=shape_meta["all_shapes"], ac_dim=shape_meta["ac_dim"], device=device, ) @@ -368,7 +369,7 @@ def url_is_alive(url): return False -def download_url(url, download_dir): +def download_url(url, download_dir, check_overwrite=True): """ First checks that @url is reachable, then downloads the file at that url into the directory specified by @download_dir. @@ -380,6 +381,8 @@ def download_url(url, download_dir): Args: url (str): url string download_dir (str): path to directory where file should be downloaded + check_overwrite (bool): if True, will sanity check the download fpath to make sure a file of that name + doesn't already exist there """ # check if url is reachable. We need the sleep to make sure server doesn't reject subsequent requests @@ -390,6 +393,12 @@ def download_url(url, download_dir): fname = url.split("/")[-1] file_to_write = os.path.join(download_dir, fname) + # If we're checking overwrite and the path already exists, + # we ask the user to verify that they want to overwrite the file + if check_overwrite and os.path.exists(file_to_write): + user_response = input(f"Warning: file {file_to_write} already exists. 
Overwrite? y/n\n") + assert user_response.lower() in {"yes", "y"}, f"Did not receive confirmation. Aborting download." + with DownloadProgressBar(unit='B', unit_scale=True, miniters=1, desc=fname) as t: urllib.request.urlretrieve(url, filename=file_to_write, reporthook=t.update_to) diff --git a/robomimic/utils/macros.py b/robomimic/utils/macros.py new file mode 100644 index 00000000..c313b827 --- /dev/null +++ b/robomimic/utils/macros.py @@ -0,0 +1,6 @@ +""" +Set of global variables shared across robomimic +""" +# Sets debugging mode. Should be set at top-level script so that internal +# debugging functionalities are made active +DEBUG = False diff --git a/robomimic/utils/obs_utils.py b/robomimic/utils/obs_utils.py index 282d7895..134f976f 100644 --- a/robomimic/utils/obs_utils.py +++ b/robomimic/utils/obs_utils.py @@ -11,27 +11,144 @@ import robomimic.utils.tensor_utils as TU +# MACRO FOR VALID IMAGE CHANNEL SIZES +VALID_IMAGE_CHANNEL_DIMS = {1, 3} # depth, rgb # DO NOT MODIFY THIS! -# This keeps track of observation types - and is populated on call to @initialize_obs_utils_with_obs_specs. -# This will be a dictionary that maps observation type (e.g. low_dim, image) to a list of observation -# modalities under that observation type. -OBS_TYPE_TO_MODALITIES = None +# This keeps track of observation types (modalities) - and is populated on call to @initialize_obs_utils_with_obs_specs. +# This will be a dictionary that maps observation modality (e.g. low_dim, rgb) to a list of observation +# keys under that observation modality. +OBS_MODALITIES_TO_KEYS = None + +# DO NOT MODIFY THIS! +# This keeps track of observation types (modalities) - and is populated on call to @initialize_obs_utils_with_obs_specs. +# This will be a dictionary that maps observation keys to their corresponding observation modality +# (e.g. low_dim, rgb) +OBS_KEYS_TO_MODALITIES = None + +# DO NOT MODIFY THIS +# This holds the default encoder kwargs that will be used if none are passed at runtime for any given network +DEFAULT_ENCODER_KWARGS = None + +# DO NOT MODIFY THIS +# This holds the registered observation modality classes +OBS_MODALITY_CLASSES = {} + +# DO NOT MODIFY THIS +# This global dict stores mapping from observation encoder / randomizer network name to class. +# We keep track of these registries to enable automated class inference at runtime, allowing +# users to simply extend our base encoder / randomizer class and refer to that class in string form +# in their config, without having to manually register their class internally. +# This also future-proofs us for any additional encoder / randomizer classes we would +# like to add ourselves. +OBS_ENCODER_CORES = {"None": None} # Include default None +OBS_RANDOMIZERS = {"None": None} # Include default None + + +def register_obs_key(target_class): + assert target_class not in OBS_MODALITY_CLASSES, f"Already registered modality {target_class}!" + OBS_MODALITY_CLASSES[target_class.name] = target_class + + +def register_encoder_core(target_class): + assert target_class not in OBS_ENCODER_CORES, f"Already registered obs encoder core {target_class}!" + OBS_ENCODER_CORES[target_class.__name__] = target_class + + +def register_randomizer(target_class): + assert target_class not in OBS_RANDOMIZERS, f"Already registered obs randomizer {target_class}!" 
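
# The registries above map string names to classes so that configs can refer to encoder cores
# and randomizers by name. A hypothetical sketch of manual registration (MyEncoderCore is a
# made-up placeholder, not a robomimic class; as the comments above note, extending the shipped
# base encoder / randomizer classes registers them without this manual step):
import robomimic.utils.obs_utils as ObsUtils

class MyEncoderCore:  # placeholder class, for illustration only
    pass

ObsUtils.register_encoder_core(MyEncoderCore)
assert ObsUtils.OBS_ENCODER_CORES["MyEncoderCore"] is MyEncoderCore
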
+ OBS_RANDOMIZERS[target_class.__name__] = target_class + + +class ObservationKeyToModalityDict(dict): + """ + Custom dictionary class with the sole additional purpose of automatically registering new "keys" at runtime + without breaking. This is mainly for backwards compatibility, where certain keys such as "latent", "actions", etc. + are used automatically by certain models (e.g.: VAEs) but were never specified by the user externally in their + config. Thus, this dictionary will automatically handle those keys by implicitly associating them with the low_dim + modality. + """ + def __getitem__(self, item): + # If a key doesn't already exist, warn the user and add default mapping + if item not in self.keys(): + print(f"ObservationKeyToModalityDict: {item} not found," + f" adding {item} to mapping with assumed low_dim modality!") + self.__setitem__(item, "low_dim") + return super(ObservationKeyToModalityDict, self).__getitem__(item) + + +def obs_encoder_kwargs_from_config(obs_encoder_config): + """ + Generate a set of args used to create visual backbones for networks + from the observation encoder config. + + Args: + obs_encoder_config (Config): Config object containing relevant encoder information. Should be equivalent to + config.observation.encoder + + Returns: + dict: Processed encoder kwargs + """ + # Loop over each obs modality + # Unlock encoder config + obs_encoder_config.unlock() + for obs_modality, encoder_kwargs in obs_encoder_config.items(): + # First run some sanity checks and store the classes + for cls_name, cores in zip(("core", "obs_randomizer"), (OBS_ENCODER_CORES, OBS_RANDOMIZERS)): + # Make sure the requested encoder for each obs_modality exists + cfg_cls = encoder_kwargs[f"{cls_name}_class"] + if cfg_cls is not None: + assert cfg_cls in cores, f"No {cls_name} class with name {cfg_cls} found, must register this class before" \ + f"creating model!" + # encoder_kwargs[f"{cls_name}_class"] = cores[cfg_cls] + + # Process core and randomizer kwargs + encoder_kwargs.core_kwargs = dict() if encoder_kwargs.core_kwargs is None else \ + deepcopy(encoder_kwargs.core_kwargs) + encoder_kwargs.obs_randomizer_kwargs = dict() if encoder_kwargs.obs_randomizer_kwargs is None else \ + deepcopy(encoder_kwargs.obs_randomizer_kwargs) + + # Re-lock keys + obs_encoder_config.lock() + + return dict(obs_encoder_config) + + +def initialize_obs_modality_mapping_from_dict(modality_mapping): + """ + This function is an alternative to @initialize_obs_utils_with_obs_specs, that allows manually setting of modalities. + NOTE: Only one of these should be called at runtime -- not both! (Note that all training scripts that use a config) + automatically handle obs modality mapping, so using this function is usually unnecessary) + + Args: + modality_mapping (dict): Maps modality string names (e.g.: rgb, low_dim, etc.) to a list of observation + keys that should belong to that modality + """ + global OBS_KEYS_TO_MODALITIES, OBS_MODALITIES_TO_KEYS + + OBS_KEYS_TO_MODALITIES = ObservationKeyToModalityDict() + OBS_MODALITIES_TO_KEYS = dict() + + for mod, keys in modality_mapping.items(): + OBS_MODALITIES_TO_KEYS[mod] = deepcopy(keys) + OBS_KEYS_TO_MODALITIES.update({k: mod for k in keys}) def initialize_obs_utils_with_obs_specs(obs_modality_specs): """ - This function should be called before using any modality-specific + This function should be called before using any observation key-specific functions in this file, in order to make sure that all utility - functions are aware of the observation types (e.g. 
which ones - are low-dimensional, and which ones are images). It constructs - a dictionary that map observation type (e.g. low_dim, image) to - a list of observation modalities under that type. + functions are aware of the observation modalities (e.g. which ones + are low-dimensional, which ones are rgb, etc.). + + It constructs two dictionaries: (1) that map observation modality (e.g. low_dim, rgb) to + a list of observation keys under that modality, and (2) that maps the inverse, specific + observation keys to their corresponding observation modality. Input should be a nested dictionary (or list of such dicts) with the following structure: obs_variant (str): - obs_type (str): modalities (list) + obs_modality (str): observation keys (list) ... ... @@ -39,25 +156,27 @@ def initialize_obs_utils_with_obs_specs(obs_modality_specs): { "obs": { "low_dim": ["robot0_eef_pos", "robot0_eef_quat"], - "image": ["agentview_image", "robot0_eye_in_hand"], + "rgb": ["agentview_image", "robot0_eye_in_hand"], } "goal": { "low_dim": ["robot0_eef_pos"], - "image": ["agentview_image"] + "rgb": ["agentview_image"] } } - In the example, raw observations consist of low-dim and image types, with + In the example, raw observations consist of low-dim and rgb modalities, with the robot end effector pose under low-dim, and the agentview and wrist camera - images under image, while goal observations also consist of low-dim and image - types, with a subset of the raw observation modalities per type. + images under rgb, while goal observations also consist of low-dim and rgb modalities, + with a subset of the raw observation keys per modality. Args: obs_modality_specs (dict or list): A nested dictionary (see docstring above for an example) or a list of nested dictionaries. Accepting a list as input makes it convenient for situations where multiple modules may each have their own modality spec. """ - global OBS_TYPE_TO_MODALITIES + global OBS_KEYS_TO_MODALITIES, OBS_MODALITIES_TO_KEYS + + OBS_KEYS_TO_MODALITIES = ObservationKeyToModalityDict() # accept one or more spec dictionaries - if it's just one, account for this if isinstance(obs_modality_specs, dict): @@ -66,28 +185,50 @@ def initialize_obs_utils_with_obs_specs(obs_modality_specs): obs_modality_spec_list = obs_modality_specs # iterates over observation specs - obs_type_mapping = {} + obs_modality_mapping = {} for obs_modality_spec in obs_modality_spec_list: # iterates over observation variants (e.g. 
observations, goals, subgoals) - for obs_variant in obs_modality_spec: - for obs_type in obs_modality_spec[obs_variant]: - # add all modalities for each obs-type to the corresponding list in obs_type_mapping - if obs_type not in obs_type_mapping: - obs_type_mapping[obs_type] = [] - obs_type_mapping[obs_type] += obs_modality_spec[obs_variant][obs_type] + for obs_modalities in obs_modality_spec.values(): + for obs_modality, obs_keys in obs_modalities.items(): + # add all keys for each obs modality to the corresponding list in obs_modality_mapping + if obs_modality not in obs_modality_mapping: + obs_modality_mapping[obs_modality] = [] + obs_modality_mapping[obs_modality] += obs_keys + # loop over each modality, and add to global dict if it doesn't exist yet + for obs_key in obs_keys: + if obs_key not in OBS_KEYS_TO_MODALITIES: + OBS_KEYS_TO_MODALITIES[obs_key] = obs_modality + # otherwise, run sanity check to make sure we don't have conflicting, duplicate entries + else: + assert OBS_KEYS_TO_MODALITIES[obs_key] == obs_modality, \ + f"Cannot register obs key {obs_key} with modality {obs_modality}; " \ + f"already exists with corresponding modality {OBS_KEYS_TO_MODALITIES[obs_key]}" # remove duplicate entries and store in global mapping - OBS_TYPE_TO_MODALITIES = { obs_type : list(set(obs_type_mapping[obs_type])) for obs_type in obs_type_mapping } + OBS_MODALITIES_TO_KEYS = { obs_modality : list(set(obs_modality_mapping[obs_modality])) for obs_modality in obs_modality_mapping } print("\n============= Initialized Observation Utils with Obs Spec =============\n") - for obs_type in OBS_TYPE_TO_MODALITIES: - print("using obs type: {} with modalities: {}".format(obs_type, OBS_TYPE_TO_MODALITIES[obs_type])) + for obs_modality, obs_keys in OBS_MODALITIES_TO_KEYS.items(): + print("using obs modality: {} with keys: {}".format(obs_modality, obs_keys)) + + +def initialize_default_obs_encoder(obs_encoder_config): + """ + Initializes the default observation encoder kwarg information to be used by all networks if no values are manually + specified at runtime. + + Args: + obs_encoder_config (Config): Observation encoder config to use. + Should be equivalent to config.observation.encoder + """ + global DEFAULT_ENCODER_KWARGS + DEFAULT_ENCODER_KWARGS = obs_encoder_kwargs_from_config(obs_encoder_config) def initialize_obs_utils_with_config(config): """ - Utility function to parse config and call @initialize_obs_utils_with_obs_specs with the - correct arguments. + Utility function to parse config and call @initialize_obs_utils_with_obs_specs and + @initialize_default_obs_encoder_kwargs with the correct arguments. Args: config (BaseConfig instance): config object @@ -97,34 +238,31 @@ def initialize_obs_utils_with_config(config): config.observation.planner.modalities, config.observation.actor.modalities, ] + obs_encoder_config = config.observation.actor.encoder elif config.algo_name == "iris": obs_modality_specs = [ config.observation.value_planner.planner.modalities, config.observation.value_planner.value.modalities, config.observation.actor.modalities, ] + obs_encoder_config = config.observation.actor.encoder else: obs_modality_specs = [config.observation.modalities] + obs_encoder_config = config.observation.encoder initialize_obs_utils_with_obs_specs(obs_modality_specs=obs_modality_specs) + initialize_default_obs_encoder(obs_encoder_config=obs_encoder_config) -def key_is_obs_type(key, obs_type): +def key_is_obs_modality(key, obs_modality): """ - Check if observation key corresponds to a type @obs_type. 
+ Check if observation key corresponds to modality @obs_modality. Args: - key (str): modality name to check - obs_type (str): observation type - usually one of "low_dim" or "image" - """ - assert OBS_TYPE_TO_MODALITIES is not None, "error: must call ObsUtils.initialize_obs_utils_with_obs_config first" - return (key in OBS_TYPE_TO_MODALITIES[obs_type]) - - -def key_is_image(key): + key (str): obs key name to check + obs_modality (str): observation modality - e.g.: "low_dim", "rgb" """ - Check if observation key corresponds to image observation. - """ - return key_is_obs_type(key, obs_type="image") + assert OBS_KEYS_TO_MODALITIES is not None, "error: must call ObsUtils.initialize_obs_utils_with_obs_config first" + return OBS_KEYS_TO_MODALITIES[key] == obs_modality def center_crop(im, t_h, t_w): @@ -187,117 +325,152 @@ def batch_image_chw_to_hwc(im): return im.permute(start_dims + [s + 2, s + 3, s + 1]) -def process_obs(obs_dict): +def process_obs(obs, obs_modality=None, obs_key=None): """ - Process image observations in observation dictionary to prepare for network input. + Process observation @obs corresponding to @obs_modality modality (or implicitly inferred from @obs_key) + to prepare for network input. + + Note that either obs_modality OR obs_key must be specified! + + If both are specified, obs_key will override obs_modality Args: - obs_dict (dict): dictionary mappping observation modality to np.array or + obs (np.array or torch.Tensor): Observation to process. Leading batch dimension is optional + obs_modality (str): Observation modality (e.g.: depth, image, low_dim, etc.) + obs_key (str): Name of observation from which to infer @obs_modality + + Returns: + processed_obs (np.array or torch.Tensor): processed observation + """ + assert obs_modality is not None or obs_key is not None, "Either obs_modality or obs_key must be specified!" + if obs_key is not None: + obs_modality = OBS_KEYS_TO_MODALITIES[obs_key] + return OBS_MODALITY_CLASSES[obs_modality].process_obs(obs) + + +def process_obs_dict(obs_dict): + """ + Process observations in observation dictionary to prepare for network input. + + Args: + obs_dict (dict): dictionary mapping observation keys to np.array or torch.Tensor. Leading batch dimensions are optional. Returns: - new_dict (dict): dictionary where image modalities have been processsed by - @process_image + new_dict (dict): dictionary where observation keys have been processed by their corresponding processors """ - new_dict = { k : obs_dict[k] for k in obs_dict } # shallow copy - for k in new_dict: - if key_is_image(k): - new_dict[k] = process_image(new_dict[k]) - return new_dict + return { k : process_obs(obs=obs, obs_key=k) for k, obs in obs_dict.items() } # shallow copy -def process_image(image): +def process_frame(frame, channel_dim, scale): """ - Given image fetched from dataset, process for network input. Converts array - to float (from uint8), normalizes pixels to [0, 1], and channel swaps + Given frame fetched from dataset, process for network input. Converts array + to float (from uint8), normalizes pixels from range [0, @scale] to [0, 1], and channel swaps from (H, W, C) to (C, H, W). 
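
# A small end-to-end sketch of the key-based processing path defined above (the observation
# keys and array shapes are illustrative only):
import numpy as np
import robomimic.utils.obs_utils as ObsUtils

ObsUtils.initialize_obs_utils_with_obs_specs(obs_modality_specs={
    "obs": {
        "low_dim": ["robot0_eef_pos"],
        "rgb": ["agentview_image"],
    },
})
assert ObsUtils.key_is_obs_modality("agentview_image", obs_modality="rgb")

obs = {
    "agentview_image": np.zeros((84, 84, 3), dtype=np.uint8),
    "robot0_eef_pos": np.zeros(3, dtype=np.float32),
}
processed = ObsUtils.process_obs_dict(obs)
assert processed["agentview_image"].shape == (3, 84, 84)  # uint8 HWC -> float CHW in [0, 1]
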
Args: - image (np.array or torch.Tensor): image array + frame (np.array or torch.Tensor): frame array + channel_dim (int): Number of channels to sanity check for + scale (float): Value to normalize inputs by Returns: - processed_image (np.array or torch.Tensor): processed image + processed_frame (np.array or torch.Tensor): processed frame """ - assert image.shape[-1] == 3 # check for channel dimensions + # Channel size should either be 3 (RGB) or 1 (depth) + assert (frame.shape[-1] == channel_dim) + frame = TU.to_float(frame) + frame /= scale + frame = frame.clip(0.0, 1.0) + frame = batch_image_hwc_to_chw(frame) + + return frame - image = TU.to_float(image) - image /= 255. - image = batch_image_hwc_to_chw(image) - return image +def unprocess_obs(obs, obs_modality=None, obs_key=None): + """ + Prepare observation @obs corresponding to @obs_modality modality (or implicitly inferred from @obs_key) + to prepare for deployment. + Note that either obs_modality OR obs_key must be specified! -def unprocess_obs(obs_dict): + If both are specified, obs_key will override obs_modality + + Args: + obs (np.array or torch.Tensor): Observation to unprocess. Leading batch dimension is optional + obs_modality (str): Observation modality (e.g.: depth, image, low_dim, etc.) + obs_key (str): Name of observation from which to infer @obs_modality + + Returns: + unprocessed_obs (np.array or torch.Tensor): unprocessed observation """ - Prepare processed image observations for saving to dataset. Inverse of + assert obs_modality is not None or obs_key is not None, "Either obs_modality or obs_key must be specified!" + if obs_key is not None: + obs_modality = OBS_KEYS_TO_MODALITIES[obs_key] + return OBS_MODALITY_CLASSES[obs_modality].unprocess_obs(obs) + + +def unprocess_obs_dict(obs_dict): + """ + Prepare processed observation dictionary for saving to dataset. Inverse of @process_obs. Args: - obs_dict (dict): dictionary mappping observation modality to np.array or + obs_dict (dict): dictionary mapping observation keys to np.array or torch.Tensor. Leading batch dimensions are optional. Returns: - new_dict (dict): dictionary where image modalities have been processsed by - @unprocess_image + new_dict (dict): dictionary where observation keys have been unprocessed by + their respective unprocessor methods """ - new_dict = { k : obs_dict[k] for k in obs_dict } # shallow copy - for k in new_dict: - if key_is_image(k): - new_dict[k] = unprocess_image(new_dict[k]) - return new_dict + return { k : unprocess_obs(obs=obs, obs_key=k) for k, obs in obs_dict.items() } # shallow copy -def unprocess_image(image): +def unprocess_frame(frame, channel_dim, scale): """ - Given image prepared for network input, prepare for saving to dataset. - Inverse of @process_image. + Given frame prepared for network input, prepare for saving to dataset. + Inverse of @process_frame. Args: - image (np.array or torch.Tensor): image array + frame (np.array or torch.Tensor): frame array + channel_dim (int): What channel dimension should be (used for sanity check) + scale (float): Scaling factor to apply during denormalization Returns: - unprocessed_image (np.array or torch.Tensor): image passed through - inverse operation of @process_image + unprocessed_frame (np.array or torch.Tensor): frame passed through + inverse operation of @process_frame """ - assert image.shape[-3] == 3 # check for channel dimension - image = batch_image_chw_to_hwc(image) - image *= 255. 
- image = TU.to_uint8(image) - return image + assert frame.shape[-3] == channel_dim # check for channel dimension + frame = batch_image_chw_to_hwc(frame) + frame *= scale + return frame -def process_image_shape(image_shape): +def get_processed_shape(obs_modality, input_shape): """ - Given image shape in dataset, infer the network input shape. This accounts - for the channel swap to prepare images for torch training (see @process_image). + Given observation modality @obs_modality and expected inputs of shape @input_shape (excluding batch dimension), return the + expected processed observation shape resulting from process_{obs_modality}. Args: - image_shape (tuple or list): tuple or list of size 3 or 4, corresponding - to the image shape to process + obs_modality (str): Observation modality to use (e.g.: low_dim, rgb, depth, etc...) + input_shape (list of int): Expected input dimensions, excluding the batch dimension Returns: - processed_image_shape (tuple): image shape that would result from the - output of @process_image + list of int: expected processed input shape """ - if len(image_shape) == 3: - return image_shape[2], image_shape[0], image_shape[1] - elif len(image_shape) == 4: - return image_shape[0], image_shape[3], image_shape[1], image_shape[2] - else: - raise ValueError("cannot handle image shape {}".format(image_shape)) + return list(process_obs(obs=np.zeros(input_shape), obs_modality=obs_modality).shape) def normalize_obs(obs_dict, obs_normalization_stats): """ Normalize observations using the provided "mean" and "std" entries - for each observation modality. The observation dictionary will be + for each observation key. The observation dictionary will be modified in-place. Args: - obs_dict (dict): dictionary mappping observation modality to np.array or + obs_dict (dict): dictionary mapping observation key to np.array or torch.Tensor. Leading batch dimensions are optional. - obs_normalization_stats (dict): this should map observation modality keys to dicts + obs_normalization_stats (dict): this should map observation keys to dicts with a "mean" and "std" of shape (1, ...) where ... is the default shape for the observation. @@ -327,15 +500,16 @@ def normalize_obs(obs_dict, obs_normalization_stats): return obs_dict -def has_image(obs_keys): +def has_modality(modality, obs_keys): """ - Returns True if image modalities are present in the list of modalities. + Returns True if @modality is present in the list of observation keys @obs_keys. Args: - obs_key (list): list of modalities + modality (str): modality to check for, e.g.: rgb, depth, etc. + obs_keys (list): list of observation keys """ for k in obs_keys: - if key_is_image(k): + if key_is_obs_modality(k, obs_modality=modality): return True return False @@ -352,7 +526,7 @@ def repeat_and_stack_observation(obs_dict, n): each modality. Args: - obs_dict (dict): dictionary mappping observation modality to np.array or + obs_dict (dict): dictionary mapping observation key to np.array or torch.Tensor. Leading batch dimensions are optional. 
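
# get_processed_shape above infers post-processing shapes by pushing a dummy array through the
# modality's processor; a quick sketch (the input shapes are illustrative):
import robomimic.utils.obs_utils as ObsUtils

assert ObsUtils.get_processed_shape(obs_modality="rgb", input_shape=[84, 84, 3]) == [3, 84, 84]
assert ObsUtils.get_processed_shape(obs_modality="low_dim", input_shape=[7]) == [7]
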
n (int): number to repeat by @@ -521,3 +695,265 @@ def sample_random_image_crops(images, crop_height, crop_width, num_crops, pos_en ) return crops, crop_inds + + +class Modality: + """ + Observation Modality class to encapsulate necessary functions needed to + process observations of this modality + """ + # observation keys to associate with this modality + keys = set() + + # Custom processing function that should prepare raw observations of this modality for training + _custom_obs_processor = None + + # Custom unprocessing function that should prepare observations of this modality used during training for deployment + _custom_obs_unprocessor = None + + # Name of this modality -- must be set by subclass! + name = None + + def __init_subclass__(cls, **kwargs): + """ + Hook method to automatically register all valid subclasses so we can keep track of valid modalities + """ + assert cls.name is not None, f"Name of modality {cls.__name__} must be specified!" + register_obs_key(cls) + + @classmethod + def set_keys(cls, keys): + """ + Sets the observation keys associated with this modality. + + Args: + keys (list or set): observation keys to associate with this modality + """ + cls.keys = {k for k in keys} + + @classmethod + def add_keys(cls, keys): + """ + Adds the observation @keys associated with this modality to the current set of keys. + + Args: + keys (list or set): observation keys to add to associate with this modality + """ + for key in keys: + cls.keys.add(key) + + @classmethod + def set_obs_processor(cls, processor=None): + """ + Sets the processor for this observation modality. If @processor is set to None, then + the obs processor will use the default one (self.process_obs(...)). Otherwise, @processor + should be a function to process this corresponding observation modality. + + Args: + processor (function or None): If not None, should be function that takes in either a + np.array or torch.Tensor and output the processed array / tensor. If None, will reset + to the default processor (self.process_obs(...)) + """ + cls._custom_obs_processor = processor + + @classmethod + def set_obs_unprocessor(cls, unprocessor=None): + """ + Sets the unprocessor for this observation modality. If @unprocessor is set to None, then + the obs unprocessor will use the default one (self.unprocess_obs(...)). Otherwise, @unprocessor + should be a function to process this corresponding observation modality. + + Args: + unprocessor (function or None): If not None, should be function that takes in either a + np.array or torch.Tensor and output the unprocessed array / tensor. If None, will reset + to the default unprocessor (self.unprocess_obs(...)) + """ + cls._custom_obs_unprocessor = unprocessor + + @classmethod + def _default_obs_processor(cls, obs): + """ + Default processing function for this obs modality. + + Note that this function is overridden by self.custom_obs_processor (a function with identical inputs / outputs) + if it is not None. + + Args: + obs (np.array or torch.Tensor): raw observation, which may include a leading batch dimension + + Returns: + np.array or torch.Tensor: processed observation + """ + raise NotImplementedError + + @classmethod + def _default_obs_unprocessor(cls, obs): + """ + Default unprocessing function for this obs modality. + + Note that this function is overridden by self.custom_obs_unprocessor + (a function with identical inputs / outputs) if it is not None. 
+
+        Args:
+            obs (np.array or torch.Tensor): processed observation, which may include a leading batch dimension
+
+        Returns:
+            np.array or torch.Tensor: unprocessed observation
+        """
+        raise NotImplementedError
+
+    @classmethod
+    def process_obs(cls, obs):
+        """
+        Prepares an observation @obs of this modality for network input.
+
+        Args:
+            obs (np.array or torch.Tensor): raw observation, which may include a leading batch dimension
+
+        Returns:
+            np.array or torch.Tensor: processed observation
+        """
+        processor = cls._custom_obs_processor if \
+            cls._custom_obs_processor is not None else cls._default_obs_processor
+        return processor(obs)
+
+    @classmethod
+    def unprocess_obs(cls, obs):
+        """
+        Prepares an observation @obs of this modality for deployment.
+
+        Args:
+            obs (np.array or torch.Tensor): processed observation, which may include a leading batch dimension
+
+        Returns:
+            np.array or torch.Tensor: unprocessed observation
+        """
+        unprocessor = cls._custom_obs_unprocessor if \
+            cls._custom_obs_unprocessor is not None else cls._default_obs_unprocessor
+        return unprocessor(obs)
+
+    @classmethod
+    def process_obs_from_dict(cls, obs_dict, inplace=True):
+        """
+        Receives a dictionary of keyword mapped observations @obs_dict, and processes the observations with keys
+        corresponding to this modality. A copy will be made of the received dictionary unless @inplace is True
+
+        Args:
+            obs_dict (dict): Dictionary mapping observation keys to observations
+            inplace (bool): If True, will modify @obs_dict in place, otherwise, will create a copy
+
+        Returns:
+            dict: observation dictionary with processed observations corresponding to this modality
+        """
+        if not inplace:
+            # make a copy so that the caller's dictionary is left untouched
+            obs_dict = deepcopy(obs_dict)
+        # Loop over all keys and process the ones corresponding to this modality
+        for key, obs in obs_dict.items():
+            if key in cls.keys:
+                obs_dict[key] = cls.process_obs(obs)
+
+        return obs_dict
+
+
+class ImageModality(Modality):
+    """
+    Modality for RGB image observations
+    """
+    name = "rgb"
+
+    @classmethod
+    def _default_obs_processor(cls, obs):
+        """
+        Given image fetched from dataset, process for network input. Converts array
+        to float (from uint8), normalizes pixels from range [0, 255] to [0, 1], and channel swaps
+        from (H, W, C) to (C, H, W).
+
+        Args:
+            obs (np.array or torch.Tensor): image array
+
+        Returns:
+            processed_obs (np.array or torch.Tensor): processed image
+        """
+        return process_frame(frame=obs, channel_dim=3, scale=255.)
+
+    @classmethod
+    def _default_obs_unprocessor(cls, obs):
+        """
+        Given image prepared for network input, prepare for saving to dataset.
+        Inverse of @process_frame.
+
+        Args:
+            obs (np.array or torch.Tensor): image array
+
+        Returns:
+            unprocessed_obs (np.array or torch.Tensor): image passed through
+                inverse operation of @process_frame
+        """
+        return TU.to_uint8(unprocess_frame(frame=obs, channel_dim=3, scale=255.))
+
+
+class DepthModality(Modality):
+    """
+    Modality for depth observations
+    """
+    name = "depth"
+
+    @classmethod
+    def _default_obs_processor(cls, obs):
+        """
+        Given depth fetched from dataset, process for network input. Converts array
+        to float, clips values to [0, 1], and channel swaps
+        from (H, W, C) to (C, H, W).
+
+        Args:
+            obs (np.array or torch.Tensor): depth array
+
+        Returns:
+            processed_obs (np.array or torch.Tensor): processed depth
+        """
+        return process_frame(frame=obs, channel_dim=1, scale=1.)
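
# Because Modality subclasses register themselves via __init_subclass__, supporting a new
# observation modality only takes a class definition like the ones above. A hypothetical
# example (the "tactile" name and pass-through processing are illustrative, not part of
# robomimic):
class TactileModality(Modality):
    name = "tactile"

    @classmethod
    def _default_obs_processor(cls, obs):
        # leave observations unchanged
        return obs

    @classmethod
    def _default_obs_unprocessor(cls, obs):
        return obs
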
+ + @classmethod + def _default_obs_unprocessor(cls, obs): + """ + Given depth prepared for network input, prepare for saving to dataset. + Inverse of @process_depth. + + Args: + obs (np.array or torch.Tensor): depth array + + Returns: + unprocessed_obs (np.array or torch.Tensor): depth passed through + inverse operation of @process_depth + """ + return TU.to_uint8(unprocess_frame(frame=obs, channel_dim=1, scale=1.)) + + +class ScanModality(Modality): + """ + Modality for scan observations + """ + name = "scan" + + @classmethod + def _default_obs_processor(cls, obs): + return obs + + @classmethod + def _default_obs_unprocessor(cls, obs): + return obs + + +class LowDimModality(Modality): + """ + Modality for low dimensional observations + """ + name = "low_dim" + + @classmethod + def _default_obs_processor(cls, obs): + return obs + + @classmethod + def _default_obs_unprocessor(cls, obs): + return obs diff --git a/robomimic/utils/python_utils.py b/robomimic/utils/python_utils.py new file mode 100644 index 00000000..fa47b221 --- /dev/null +++ b/robomimic/utils/python_utils.py @@ -0,0 +1,73 @@ +""" +Set of general purpose utility functions for easier interfacing with Python API +""" +import inspect +from copy import deepcopy +import robomimic.utils.macros as Macros + + +def get_class_init_kwargs(cls): + """ + Helper function to return a list of all valid keyword arguments (excluding "self") for the given @cls class. + + Args: + cls (object): Class from which to grab __init__ kwargs + + Returns: + list: All keyword arguments (excluding "self") specified by @cls __init__ constructor method + """ + return list(inspect.signature(cls.__init__).parameters.keys())[1:] + + +def extract_subset_dict(dic, keys, copy=False): + """ + Helper function to extract a subset of dictionary key-values from a current dictionary. Optionally (deep)copies + the values extracted from the original @dic if @copy is True. + + Args: + dic (dict): Dictionary containing multiple key-values + keys (Iterable): Specific keys to extract from @dic. If the key doesn't exist in @dic, then the key is skipped + copy (bool): If True, will deepcopy all values corresponding to the specified @keys + + Returns: + dict: Extracted subset dictionary containing only the specified @keys and their corresponding values + """ + subset = {k: dic[k] for k in keys if k in dic} + return deepcopy(subset) if copy else subset + + +def extract_class_init_kwargs_from_dict(cls, dic, copy=False, verbose=False): + """ + Helper function to return a dictionary of key-values that specifically correspond to @cls class's __init__ + constructor method, from @dic which may or may not contain additional, irrelevant kwargs. + + Note that @dic may possibly be missing certain kwargs as specified by cls.__init__. No error will be raised. 
+ + Args: + cls (object): Class from which to grab __init__ kwargs that will be be used as filtering keys for @dic + dic (dict): Dictionary containing multiple key-values + copy (bool): If True, will deepcopy all values corresponding to the specified @keys + verbose (bool): If True (or if macro DEBUG is True), then will print out mismatched keys + + Returns: + dict: Extracted subset dictionary possibly containing only the specified keys from cls.__init__ and their + corresponding values + """ + # extract only relevant kwargs for this specific backbone + cls_keys = get_class_init_kwargs(cls) + subdic = extract_subset_dict( + dic=dic, + keys=cls_keys, + copy=copy, + ) + + # Run sanity check if verbose or debugging + if verbose or Macros.DEBUG: + keys_not_in_cls = [k for k in dic if k not in cls_keys] + keys_not_in_dic = [k for k in cls_keys if k not in list(dic.keys())] + if len(keys_not_in_cls) > 0: + print(f"Warning: For class {cls.__name__}, got unknown keys: {keys_not_in_cls} ") + if len(keys_not_in_dic) > 0: + print(f"Warning: For class {cls.__name__}, got missing keys: {keys_not_in_dic} ") + + return subdic \ No newline at end of file diff --git a/robomimic/utils/test_utils.py b/robomimic/utils/test_utils.py index c8b05a02..148fe331 100644 --- a/robomimic/utils/test_utils.py +++ b/robomimic/utils/test_utils.py @@ -57,6 +57,29 @@ def example_dataset_path(): return dataset_path +def example_momart_dataset_path(): + """ + Path to momart dataset to use for testing and example purposes. It should + exist under the tests/assets directory, and will be downloaded + from a server if it does not exist. + """ + dataset_folder = os.path.join(robomimic.__path__[0], "../tests/assets/") + dataset_path = os.path.join(dataset_folder, "test_momart.hdf5") + if not os.path.exists(dataset_path): + user_response = input("\nWARNING: momart test hdf5 does not exist! We will download sample dataset. " + "This will take 0.6GB space. Proceed? y/n\n") + assert user_response.lower() in {"yes", "y"}, f"Did not receive confirmation. Aborting download." + + print("\nDownloading from server...") + + os.makedirs(dataset_folder, exist_ok=True) + FileUtils.download_url( + url="http://downloads.cs.stanford.edu/downloads/rt_mm/sample/test_momart.hdf5", + download_dir=dataset_folder, + ) + return dataset_path + + def temp_model_dir_path(): """ Path to a temporary model directory to write to for testing and example purposes. diff --git a/robomimic/utils/train_utils.py b/robomimic/utils/train_utils.py index 695dfb5d..b25e969a 100644 --- a/robomimic/utils/train_utils.py +++ b/robomimic/utils/train_utils.py @@ -463,7 +463,7 @@ def save_model(model, config, env_meta, shape_meta, ckpt_path, obs_normalization ckpt_path (str): writes model checkpoint to this path obs_normalization_stats (dict): optionally pass a dictionary for observation - normalization. This should map observation modality keys to dicts + normalization. This should map observation keys to dicts with a "mean" and "std" of shape (1, ...) where ... is the default shape for the observation. 
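
# A short sketch of python_utils.extract_class_init_kwargs_from_dict (defined above): filter an
# over-complete kwargs dict down to what a constructor accepts (Widget is a made-up class used
# only for illustration):
from robomimic.utils.python_utils import extract_class_init_kwargs_from_dict

class Widget:
    def __init__(self, height, width):
        self.height = height
        self.width = width

kwargs = {"height": 76, "width": 76, "num_crops": 1}  # "num_crops" is irrelevant to Widget
widget = Widget(**extract_class_init_kwargs_from_dict(cls=Widget, dic=kwargs))
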
""" diff --git a/setup.py b/setup.py index c24292ec..40c23ab4 100644 --- a/setup.py +++ b/setup.py @@ -36,7 +36,7 @@ author="Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang", url="https://github.com/ARISE-Initiative/robomimic", author_email="amandlek@cs.stanford.edu", - version="0.1.0", + version="0.2.0", long_description=long_description, long_description_content_type='text/markdown' ) diff --git a/tests/test_bc.py b/tests/test_bc.py index 28451aa6..b8c83720 100644 --- a/tests/test_bc.py +++ b/tests/test_bc.py @@ -25,7 +25,7 @@ def get_algo_base_config(): # low-level obs (note that we define it here because @observation structure might vary per algorithm, # for example HBC) config.observation.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos", "object"] - config.observation.modalities.obs.image = [] + config.observation.modalities.obs.rgb = [] # by default, vanilla BC config.algo.gaussian.enabled = False @@ -46,20 +46,24 @@ def convert_config_for_images(config): config.train.num_data_workers = 0 config.train.batch_size = 16 - # replace object with image modality + # replace object with rgb modality config.observation.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos"] - config.observation.modalities.obs.image = ["agentview_image"] + config.observation.modalities.obs.rgb = ["agentview_image"] # set up visual encoders - config.observation.encoder.visual_core = 'ResNet18Conv' - config.observation.encoder.visual_core_kwargs = Config() - config.observation.encoder.obs_randomizer_class = None - config.observation.encoder.visual_feature_dimension = 64 - config.observation.encoder.use_spatial_softmax = True - config.observation.encoder.spatial_softmax_kwargs.num_kp = 32 - config.observation.encoder.spatial_softmax_kwargs.learnable_temperature = False - config.observation.encoder.spatial_softmax_kwargs.temperature = 1.0 - config.observation.encoder.spatial_softmax_kwargs.noise_std = 0. 
+ config.observation.encoder.rgb.core_class = "VisualCore" + config.observation.encoder.rgb.core_kwargs.feature_dimension = 64 + config.observation.encoder.rgb.core_kwargs.backbone_class = 'ResNet18Conv' # ResNet backbone for image observations (unused if no image observations) + config.observation.encoder.rgb.core_kwargs.backbone_kwargs.pretrained = False # kwargs for visual core + config.observation.encoder.rgb.core_kwargs.backbone_kwargs.input_coord_conv = False + config.observation.encoder.rgb.core_kwargs.pool_class = "SpatialSoftmax" # Alternate options are "SpatialMeanPool" or None (no pooling) + config.observation.encoder.rgb.core_kwargs.pool_kwargs.num_kp = 32 # Default arguments for "SpatialSoftmax" + config.observation.encoder.rgb.core_kwargs.pool_kwargs.learnable_temperature = False # Default arguments for "SpatialSoftmax" + config.observation.encoder.rgb.core_kwargs.pool_kwargs.temperature = 1.0 # Default arguments for "SpatialSoftmax" + config.observation.encoder.rgb.core_kwargs.pool_kwargs.noise_std = 0.0 + + # observation randomizer class - set to None to use no randomization, or 'CropRandomizer' to use crop randomization + config.observation.encoder.rgb.obs_randomizer_class = None return config @@ -232,7 +236,7 @@ def bc_rnn_gmm_modifier(config): image_modifiers = OrderedDict() for test_name in MODIFIERS: lst = test_name.split("-") - name = "-".join(lst[:1] + ["image"] + lst[1:]) + name = "-".join(lst[:1] + ["rgb"] + lst[1:]) image_modifiers[name] = make_image_modifier(MODIFIERS[test_name]) MODIFIERS.update(image_modifiers) @@ -241,11 +245,15 @@ def bc_rnn_gmm_modifier(config): @register_mod("bc-image-crop") def bc_image_crop_modifier(config): config = convert_config_for_images(config) - config.observation.encoder.obs_randomizer_class = 'CropRandomizer' # observation randomizer class - config.observation.encoder.obs_randomizer_kwargs.crop_height = 76 - config.observation.encoder.obs_randomizer_kwargs.crop_width = 76 - config.observation.encoder.obs_randomizer_kwargs.num_crops = 1 - config.observation.encoder.obs_randomizer_kwargs.pos_enc = False + + # observation randomizer class - using Crop randomizer + config.observation.encoder.rgb.obs_randomizer_class = "CropRandomizer" + + # kwargs for observation randomizers (for the CropRandomizer, this is size and number of crops) + config.observation.encoder.rgb.obs_randomizer_kwargs.crop_height = 76 + config.observation.encoder.rgb.obs_randomizer_kwargs.crop_width = 76 + config.observation.encoder.rgb.obs_randomizer_kwargs.num_crops = 1 + config.observation.encoder.rgb.obs_randomizer_kwargs.pos_enc = False return config diff --git a/tests/test_bcq.py b/tests/test_bcq.py index bde1e133..b8bd0835 100644 --- a/tests/test_bcq.py +++ b/tests/test_bcq.py @@ -25,7 +25,7 @@ def get_algo_base_config(): # low-level obs (note that we define it here because @observation structure might vary per algorithm, # for example HBC) config.observation.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos", "object"] - config.observation.modalities.obs.image = [] + config.observation.modalities.obs.rgb = [] # by default, vanilla BCQ config.algo.actor.enabled = True # perturbation actor @@ -46,20 +46,24 @@ def convert_config_for_images(config): config.train.num_data_workers = 0 config.train.batch_size = 16 - # replace object with image modality + # replace object with rgb modality config.observation.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos"] - 
    config.observation.modalities.obs.image = ["agentview_image"]
+    config.observation.modalities.obs.rgb = ["agentview_image"]

    # set up visual encoders
-    config.observation.encoder.visual_core = 'ResNet18Conv'
-    config.observation.encoder.visual_core_kwargs = Config()
-    config.observation.encoder.obs_randomizer_class = None
-    config.observation.encoder.visual_feature_dimension = 64
-    config.observation.encoder.use_spatial_softmax = True
-    config.observation.encoder.spatial_softmax_kwargs.num_kp = 32
-    config.observation.encoder.spatial_softmax_kwargs.learnable_temperature = False
-    config.observation.encoder.spatial_softmax_kwargs.temperature = 1.0
-    config.observation.encoder.spatial_softmax_kwargs.noise_std = 0.
+    config.observation.encoder.rgb.core_class = "VisualCore"
+    config.observation.encoder.rgb.core_kwargs.feature_dimension = 64
+    config.observation.encoder.rgb.core_kwargs.backbone_class = 'ResNet18Conv'  # ResNet backbone for image observations (unused if no image observations)
+    config.observation.encoder.rgb.core_kwargs.backbone_kwargs.pretrained = False  # kwargs for visual core
+    config.observation.encoder.rgb.core_kwargs.backbone_kwargs.input_coord_conv = False
+    config.observation.encoder.rgb.core_kwargs.pool_class = "SpatialSoftmax"  # Alternate options are "SpatialMeanPool" or None (no pooling)
+    config.observation.encoder.rgb.core_kwargs.pool_kwargs.num_kp = 32  # Default arguments for "SpatialSoftmax"
+    config.observation.encoder.rgb.core_kwargs.pool_kwargs.learnable_temperature = False  # Default arguments for "SpatialSoftmax"
+    config.observation.encoder.rgb.core_kwargs.pool_kwargs.temperature = 1.0  # Default arguments for "SpatialSoftmax"
+    config.observation.encoder.rgb.core_kwargs.pool_kwargs.noise_std = 0.0
+
+    # observation randomizer class - set to None to use no randomization, or 'CropRandomizer' to use crop randomization
+    config.observation.encoder.rgb.obs_randomizer_class = None

    return config

@@ -217,7 +221,7 @@ def bcq_vae_modifier_10(config):
image_modifiers = OrderedDict()
for test_name in MODIFIERS:
    lst = test_name.split("-")
-    name = "-".join(lst[:1] + ["image"] + lst[1:])
+    name = "-".join(lst[:1] + ["rgb"] + lst[1:])
    image_modifiers[name] = make_image_modifier(MODIFIERS[test_name])
MODIFIERS.update(image_modifiers)

@@ -226,11 +230,15 @@ def bcq_vae_modifier_10(config):
@register_mod("bcq-image-crop")
def bcq_image_crop_modifier(config):
    config = convert_config_for_images(config)
-    config.observation.encoder.obs_randomizer_class = 'CropRandomizer'  # observation randomizer class
-    config.observation.encoder.obs_randomizer_kwargs.crop_height = 76
-    config.observation.encoder.obs_randomizer_kwargs.crop_width = 76
-    config.observation.encoder.obs_randomizer_kwargs.num_crops = 1
-    config.observation.encoder.obs_randomizer_kwargs.pos_enc = False
+
+    # observation randomizer class - using Crop randomizer
+    config.observation.encoder.rgb.obs_randomizer_class = "CropRandomizer"
+
+    # kwargs for observation randomizers (for the CropRandomizer, this is size and number of crops)
+    config.observation.encoder.rgb.obs_randomizer_kwargs.crop_height = 76
+    config.observation.encoder.rgb.obs_randomizer_kwargs.crop_width = 76
+    config.observation.encoder.rgb.obs_randomizer_kwargs.num_crops = 1
+    config.observation.encoder.rgb.obs_randomizer_kwargs.pos_enc = False

    return config

diff --git a/tests/test_cql.py b/tests/test_cql.py
index 84839b39..a78c4bf2 100644
--- a/tests/test_cql.py
+++ b/tests/test_cql.py
@@ -25,7 +25,7 @@ def get_algo_base_config():
    # low-level obs (note that we define it here because @observation structure might vary per algorithm,
    # for example HBC)
    config.observation.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos", "object"]
-    config.observation.modalities.obs.image = []
+    config.observation.modalities.obs.rgb = []

    # by default, vanilla CQL
    config.algo.actor.bc_start_steps = 40  # BC training initially
@@ -48,20 +48,24 @@ def convert_config_for_images(config):
    config.train.num_data_workers = 0
    config.train.batch_size = 16

-    # replace object with image modality
+    # replace object with rgb modality
    config.observation.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos"]
-    config.observation.modalities.obs.image = ["agentview_image"]
+    config.observation.modalities.obs.rgb = ["agentview_image"]

    # set up visual encoders
-    config.observation.encoder.visual_core = 'ResNet18Conv'
-    config.observation.encoder.visual_core_kwargs = Config()
-    config.observation.encoder.obs_randomizer_class = None
-    config.observation.encoder.visual_feature_dimension = 64
-    config.observation.encoder.use_spatial_softmax = True
-    config.observation.encoder.spatial_softmax_kwargs.num_kp = 32
-    config.observation.encoder.spatial_softmax_kwargs.learnable_temperature = False
-    config.observation.encoder.spatial_softmax_kwargs.temperature = 1.0
-    config.observation.encoder.spatial_softmax_kwargs.noise_std = 0.
+    config.observation.encoder.rgb.core_class = "VisualCore"
+    config.observation.encoder.rgb.core_kwargs.feature_dimension = 64
+    config.observation.encoder.rgb.core_kwargs.backbone_class = 'ResNet18Conv'  # ResNet backbone for image observations (unused if no image observations)
+    config.observation.encoder.rgb.core_kwargs.backbone_kwargs.pretrained = False  # kwargs for visual core
+    config.observation.encoder.rgb.core_kwargs.backbone_kwargs.input_coord_conv = False
+    config.observation.encoder.rgb.core_kwargs.pool_class = "SpatialSoftmax"  # Alternate options are "SpatialMeanPool" or None (no pooling)
+    config.observation.encoder.rgb.core_kwargs.pool_kwargs.num_kp = 32  # Default arguments for "SpatialSoftmax"
+    config.observation.encoder.rgb.core_kwargs.pool_kwargs.learnable_temperature = False  # Default arguments for "SpatialSoftmax"
+    config.observation.encoder.rgb.core_kwargs.pool_kwargs.temperature = 1.0  # Default arguments for "SpatialSoftmax"
+    config.observation.encoder.rgb.core_kwargs.pool_kwargs.noise_std = 0.0
+
+    # observation randomizer class - set to None to use no randomization, or 'CropRandomizer' to use crop randomization
+    config.observation.encoder.rgb.obs_randomizer_class = None

    return config

@@ -106,7 +110,7 @@ def cql_gaussian_modifier(config):
image_modifiers = OrderedDict()
for test_name in MODIFIERS:
    lst = test_name.split("-")
-    name = "-".join(lst[:1] + ["image"] + lst[1:])
+    name = "-".join(lst[:1] + ["rgb"] + lst[1:])
    image_modifiers[name] = make_image_modifier(MODIFIERS[test_name])
MODIFIERS.update(image_modifiers)

@@ -115,11 +119,15 @@ def cql_gaussian_modifier(config):
@register_mod("cql-image-crop")
def cql_image_crop_modifier(config):
    config = convert_config_for_images(config)
-    config.observation.encoder.obs_randomizer_class = 'CropRandomizer'  # observation randomizer class
-    config.observation.encoder.obs_randomizer_kwargs.crop_height = 76
-    config.observation.encoder.obs_randomizer_kwargs.crop_width = 76
-    config.observation.encoder.obs_randomizer_kwargs.num_crops = 1
-    config.observation.encoder.obs_randomizer_kwargs.pos_enc = False
+
+    # observation randomizer class - using Crop randomizer
+    config.observation.encoder.rgb.obs_randomizer_class = "CropRandomizer"
+
+    # kwargs for observation randomizers (for the CropRandomizer, this is size and number of crops)
+    config.observation.encoder.rgb.obs_randomizer_kwargs.crop_height = 76
+    config.observation.encoder.rgb.obs_randomizer_kwargs.crop_width = 76
+    config.observation.encoder.rgb.obs_randomizer_kwargs.num_crops = 1
+    config.observation.encoder.rgb.obs_randomizer_kwargs.pos_enc = False

    return config

diff --git a/tests/test_hbc.py b/tests/test_hbc.py
index 83b0af4f..e5560696 100644
--- a/tests/test_hbc.py
+++ b/tests/test_hbc.py
@@ -24,13 +24,13 @@ def get_algo_base_config():
    # low-level obs (note that we define it here because @observation structure might vary per algorithm,
    # for example HBC)
    config.observation.planner.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos", "object"]
-    config.observation.planner.modalities.obs.image = []
+    config.observation.planner.modalities.obs.rgb = []

    config.observation.planner.modalities.subgoal.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos", "object"]
-    config.observation.planner.modalities.subgoal.image = []
+    config.observation.planner.modalities.subgoal.rgb = []

    config.observation.actor.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos", "object"]
-    config.observation.actor.modalities.obs.image = []
+    config.observation.actor.modalities.obs.rgb = []

    # by default, planner is deterministic prediction
    config.algo.planner.vae.enabled = False
diff --git a/tests/test_iris.py b/tests/test_iris.py
index 5b0c3fa5..126c5c28 100644
--- a/tests/test_iris.py
+++ b/tests/test_iris.py
@@ -24,16 +24,16 @@ def get_algo_base_config():
    # low-level obs (note that we define it here because @observation structure might vary per algorithm,
    # for example iris)
    config.observation.value_planner.planner.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos", "object"]
-    config.observation.value_planner.planner.modalities.obs.image = []
+    config.observation.value_planner.planner.modalities.obs.rgb = []

    config.observation.value_planner.planner.modalities.subgoal.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos", "object"]
-    config.observation.value_planner.planner.modalities.subgoal.image = []
+    config.observation.value_planner.planner.modalities.subgoal.rgb = []

    config.observation.value_planner.value.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos", "object"]
-    config.observation.value_planner.value.modalities.obs.image = []
+    config.observation.value_planner.value.modalities.obs.rgb = []

    config.observation.actor.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos", "object"]
-    config.observation.actor.modalities.obs.image = []
+    config.observation.actor.modalities.obs.rgb = []

    # by default, basic N(0, 1) prior for both planner VAE and BCQ cVAE
    config.algo.value_planner.planner.vae.enabled = True
diff --git a/tests/test_scripts.py b/tests/test_scripts.py
index 21889efc..30ed7f61 100644
--- a/tests/test_scripts.py
+++ b/tests/test_scripts.py
@@ -24,7 +24,7 @@ def get_checkpoint_to_test():
    """
    Run a quick training run to get a checkpoint. This function runs a basic bc-image
-    training run. Image modality is used for a harder test case for the run agent
+    training run. RGB modality is used for a harder test case for the run agent
    script, which will need to also try writing image observations to the rollout
    dataset.
    """
@@ -38,20 +38,25 @@ def image_modifier(conf):
        conf.train.num_data_workers = 0
        conf.train.batch_size = 16

-        # replace object with image modality
+        # replace object with rgb modality
        conf.observation.modalities.obs.low_dim = ["robot0_eef_pos", "robot0_eef_quat", "robot0_gripper_qpos"]
-        conf.observation.modalities.obs.image = ["agentview_image"]
+        conf.observation.modalities.obs.rgb = ["agentview_image"]

        # set up visual encoders
-        conf.observation.encoder.visual_core = 'ResNet18Conv'
-        conf.observation.encoder.visual_core_kwargs = Config()
-        conf.observation.encoder.obs_randomizer_class = None
-        conf.observation.encoder.visual_feature_dimension = 64
-        conf.observation.encoder.use_spatial_softmax = True
-        conf.observation.encoder.spatial_softmax_kwargs.num_kp = 32
-        conf.observation.encoder.spatial_softmax_kwargs.learnable_temperature = False
-        conf.observation.encoder.spatial_softmax_kwargs.temperature = 1.0
-        conf.observation.encoder.spatial_softmax_kwargs.noise_std = 0.0
+        conf.observation.encoder.rgb.core_class = "VisualCore"
+        conf.observation.encoder.rgb.core_kwargs.feature_dimension = 64
+        conf.observation.encoder.rgb.core_kwargs.backbone_class = 'ResNet18Conv'  # ResNet backbone for image observations (unused if no image observations)
+        conf.observation.encoder.rgb.core_kwargs.backbone_kwargs.pretrained = False  # kwargs for visual core
+        conf.observation.encoder.rgb.core_kwargs.backbone_kwargs.input_coord_conv = False
+        conf.observation.encoder.rgb.core_kwargs.pool_class = "SpatialSoftmax"  # Alternate options are "SpatialMeanPool" or None (no pooling)
+        conf.observation.encoder.rgb.core_kwargs.pool_kwargs.num_kp = 32  # Default arguments for "SpatialSoftmax"
+        conf.observation.encoder.rgb.core_kwargs.pool_kwargs.learnable_temperature = False  # Default arguments for "SpatialSoftmax"
+        conf.observation.encoder.rgb.core_kwargs.pool_kwargs.temperature = 1.0  # Default arguments for "SpatialSoftmax"
+        conf.observation.encoder.rgb.core_kwargs.pool_kwargs.noise_std = 0.0
+
+        # observation randomizer class - set to None to use no randomization, or 'CropRandomizer' to use crop randomization
+        conf.observation.encoder.rgb.obs_randomizer_class = None
+
        return conf

    config = TestUtils.config_from_modifier(base_config=config, config_modifier=image_modifier)