From f74704df6f46b16300fef68a72479025a4962f0b Mon Sep 17 00:00:00 2001
From: Kin Long Kelvin Lee
Date: Mon, 30 Sep 2024 10:57:00 -0700
Subject: [PATCH 01/11] docs: added note on using overfit_batches

Signed-off-by: Kin Long Kelvin Lee
---
 docs/source/best-practices.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/docs/source/best-practices.rst b/docs/source/best-practices.rst
index 194fab6d..3dd1c858 100644
--- a/docs/source/best-practices.rst
+++ b/docs/source/best-practices.rst
@@ -223,6 +223,20 @@ inspired by observations made in LLM training research, where the breakdown of
 assumptions in the convergent properties of ``Adam``-like optimizers causes large
 spikes in the training loss. This callback can help identify these occurrences.
 
+The ``devset``/``fast_dev_run`` approach detailed above is also useful for testing
+engineering/infrastructure (e.g. accelerator offload and logging), but not necessarily
+for probing training dynamics. Instead, we recommend using the ``overfit_batches``
+argument in ``pl.Trainer``:
+
+.. code-block:: python
+
+    import pytorch_lightning as pl
+
+    trainer = pl.Trainer(overfit_batches=100)
+
+
+This will disable shuffling in the training and validation splits (per the PyTorch Lightning
+documentation), and ensure that the same batches are reused every epoch.
+
 .. _e3nn documentation: https://docs.e3nn.org/en/latest/
 .. _IPEX installation: https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu

From e8a21aef212fa1786c0b6b3c72474c6d931ac919 Mon Sep 17 00:00:00 2001
From: Kin Long Kelvin Lee
Date: Mon, 30 Sep 2024 13:00:40 -0700
Subject: [PATCH 02/11] docs: added note on predict versus forward

Signed-off-by: Kin Long Kelvin Lee
---
 docs/source/inference.rst | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/docs/source/inference.rst b/docs/source/inference.rst
index 08032fcc..4b100d0e 100644
--- a/docs/source/inference.rst
+++ b/docs/source/inference.rst
@@ -4,6 +4,24 @@ Inference
 "Inference" can be a bit of an overloaded term, and this page is broken down into
 different possible downstream use cases for trained models.
 
+Task ``predict`` and ``forward`` methods
+----------------------------------------
+
+``matsciml`` tasks implement separate ``forward`` and ``predict`` methods. Both take a
+``BatchDict`` as input, and the latter wraps the former. The difference, however, is that
+``predict`` is intended primarily for inference: it will take care of reversing the
+normalization procedure (if normalization values were provided during training), *and*
+perhaps more importantly, will ensure that the exponential moving average weights are
+used instead of the training ones.
+
+In the special case of force prediction tasks (where forces are derivatives of the
+energy), you should only need to specify normalization ``kwargs`` for the energy: the
+scale value is taken automatically from the energy normalization and applied to the forces.
+
+In short, if you are writing functionality that requires unnormalized outputs (e.g. ``ase``
+calculators), please ensure you are using ``predict`` instead of ``forward`` directly.
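+
+A minimal usage sketch follows. Here ``task`` is assumed to be a trained ``matsciml``
+task instance and ``batch`` a ``BatchDict`` produced by one of the toolkit's
+datamodules; neither name is part of the API:
+
+.. code-block:: python
+
+    # denormalized outputs, computed with the exponential moving average weights
+    outputs = task.predict(batch)
+    # normalized (training-space) outputs from the current training weights;
+    # calling the task directly dispatches to ``forward``
+    raw_outputs = task(batch)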
+
+
 
 Parity plots and model evaluations
 ----------------------------------

From 896af346180b3f888d6bcdf915fef0f42fc03b69 Mon Sep 17 00:00:00 2001
From: Kin Long Kelvin Lee
Date: Mon, 30 Sep 2024 13:45:02 -0700
Subject: [PATCH 03/11] docs: adding training documentation

Signed-off-by: Kin Long Kelvin Lee
---
 docs/source/training.rst | 153 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 147 insertions(+), 6 deletions(-)

diff --git a/docs/source/training.rst b/docs/source/training.rst
index da4c6a7a..35375a3f 100644
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -1,14 +1,155 @@
 Training pipeline
 =================
 
-Training with the Open MatSci ML Toolkit utilizes—for the most part—the
-PyTorch Lightning abstractions.
+Task abstraction
+================
+
+The Open MatSciML Toolkit uses PyTorch Lightning abstractions to manage the flow
+of training: how data from a datamodule is consumed, which loss terms are calculated,
+and what gets logged are all defined in a base task class. From start to finish, this
+module will take in the definition of an encoding architecture (through the ``encoder_class``
+and ``encoder_kwargs`` keyword arguments), construct it, and in concrete task implementations,
+initialize the respective output heads for a set of provided or task-specific target keys.
+The ``encoder_kwargs`` specification makes things a bit more verbose, but it ensures
+that the hyperparameters are saved appropriately per the ``save_hyperparameters`` method
+in PyTorch Lightning.
+
+
+``BaseTaskModule`` API reference
+--------------------------------
+
+.. autoclass:: matsciml.models.base.BaseTaskModule
+   :members:
+
+
+Multi-task reference
+--------------------------------
+
+One core piece of ``matsciml`` functionality is the ability to compose multiple tasks
+together, (almost) seamlessly relative to the single-task case.
+
+.. important::
+   The ``MultiTaskLitModule`` is not written in a particularly friendly way at
+   the moment, and may be subject to a significant refactor later!
+
+
+.. autoclass:: matsciml.models.base.MultiTaskLitModule
+   :members:
+
+
+``OutputHead`` API reference
+----------------------------
+
+While there is a singular ``OutputHead`` definition, the blocks that constitute
+an ``OutputHead`` can be specified depending on the type of model architecture
+being used. The default stack is based on simple ``nn.Linear`` layers; however,
+for architectures like MACE, which may depend on preserving irreducible representations,
+the ``IrrepOutputBlock`` allows users to specify transformations per-representation.
+
+.. autoclass:: matsciml.models.common.OutputHead
+   :members:
+
+
+.. autoclass:: matsciml.models.common.OutputBlock
+   :members:
+
+
+.. autoclass:: matsciml.models.common.IrrepOutputBlock
+   :members:
+
 
 Task API reference
 ##################
 
-.. autosummary::
-   :toctree: generated
-   :recursive:
+Scalar regression
+-----------------
+
+This task is primarily designed for property prediction and adjacent problems: you can
+predict an arbitrary number of properties (one per output head), based on a shared
+embedding (i.e. one structure maps to a single embedding, which is used by each head).
+
+A special case for using this class would be in tandem (as a multitask setup) with
+the :ref:`gradfree_force` task, which treats energy/force prediction as two
+separate output heads, albeit with the same shared embedding.
+
+Please use continuous-valued loss metrics (e.g. ``nn.MSELoss``) for this task.
+
+
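+A minimal construction sketch is shown below. The encoder class and its keyword
+arguments are illustrative placeholders rather than part of the API; substitute
+whichever ``matsciml`` encoder and hyperparameters you actually use (the argument
+names follow the single-task examples in the repository):
+
+.. code-block:: python
+
+   from matsciml.models.base import ScalarRegressionTask
+
+   task = ScalarRegressionTask(
+       encoder_class=MyEncoder,  # placeholder: any matsciml-compatible encoder class
+       encoder_kwargs={"hidden_dim": 128},  # saved via ``save_hyperparameters``
+       task_keys=["band_gap"],  # one output head is created per target key
+   )
+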
+.. autoclass:: matsciml.models.base.ScalarRegressionTask
+   :members:
+
+
+Binary classification
+-----------------------
+
+This task, as the name suggests, uses the embedding to perform one or more binary
+classifications with a shared embedding. This can be something like a ``stability``
+label, as in the Materials Project. Keep in mind, however, that a special class
+exists for crystal symmetry classification.
+
+.. _crystal_symmetry:
+
+Crystal symmetry classification
+-------------------------------
+
+This task is a specialized class for what is essentially multiclass classification,
+where, given an embedding, we predict which crystal space group the structure belongs
+to, using ``nn.CrossEntropyLoss``. This can be a good pretraining task.
+
+
+.. note::
+   This task expects that your data includes a ``spacegroup`` target key.
+
+.. autoclass:: matsciml.models.base.CrystalSymmetryClassificationTask
+   :members:
+
+
+Force regression task
+---------------------
+
+This task implements energy/force regression, where an ``OutputHead`` is used to first
+predict the energy, which is then differentiated with respect to the input coordinates
+to obtain the forces. From a developer perspective, this task is mechanically quite
+different due to the need for manual ``autograd``, which is not normally part of
+PyTorch Lightning workflows.
+
+
+.. note::
+   This task expects that your data includes a ``force`` target key.
+
+.. autoclass:: matsciml.models.base.ForceRegressionTask
+   :members:
+
+
+.. _gradfree_force:
+
+Gradient-free force regression task
+-----------------------------------
+
+This task implements force prediction, albeit as a direct output head property
+prediction as opposed to the derivative of an energy value using ``autograd``.
+
+.. note::
+   This task expects that your data includes a ``force`` target key.
+
+.. autoclass:: matsciml.models.base.GradFreeForceRegression
+   :members:
+
+
+Node denoising task
+-------------------
+
+This task implements a powerful and increasingly popular pre-training strategy
+for graph neural networks. The premise is quite simple: the encoder is trained as a
+denoising autoencoder, taking in a perturbed structure and attempting to predict the
+noise added to the 3D coordinates.
+
+This task requires the following data transform. You can specify the scale of the
+noise added to the positions; intuitively, the larger the scale, the more difficult
+the task becomes.
+
+.. autoclass:: matsciml.datasets.transforms.pretraining.NoisyPositions
+   :members:
+
 
-   matsciml.models.base
+.. autoclass:: matsciml.models.base.NodeDenoisingTask
+   :members:

From a6b9b2c1c7d258faeb39d6cad309783a95ffa166 Mon Sep 17 00:00:00 2001
From: Kin Long Kelvin Lee
Date: Mon, 30 Sep 2024 15:12:04 -0700
Subject: [PATCH 04/11] docs: added note on target normalization

Signed-off-by: Kin Long Kelvin Lee
---
 docs/source/best-practices.rst | 74 ++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/docs/source/best-practices.rst b/docs/source/best-practices.rst
index 3dd1c858..293d8ec6 100644
--- a/docs/source/best-practices.rst
+++ b/docs/source/best-practices.rst
@@ -181,6 +181,80 @@ the accelerator.
 
 Training
 --------
 
+Target normalization
+^^^^^^^^^^^^^^^^^^^^
+
+Tasks can be provided with ``normalize_kwargs``, which are key/value mappings
+that specify the mean and standard deviation of a target; an example is given below.
+
+.. code-block:: python
+
+   Task(
+       ...,
+       normalize_kwargs={
+           "energy_mean": 0.0,
+           "energy_std": 1.0,
+       }
+   )
+
+The example above will normalize ``energy`` labelsm and can be substituted with
+any target key of interest (e.g. ``force``, ``bandgap``, etc.).
+
+Target loss scaling
+^^^^^^^^^^^^^^^^^^^
+
+A common practice is to scale some target losses relative to others (e.g. force over
+energy). To specify this, you can pass a ``task_loss_scaling`` dictionary to
+any task module, which maps target keys to a floating point value that will be used
+to multiply the corresponding target loss value before summation and backpropagation.
+
+.. code-block:: python
+
+   Task(
+       ...,
+       task_loss_scaling={
+           "energy": 1.0,
+           "force": 10.0,
+       }
+   )
+
+
+A related but alternative way to specify target scaling is to apply a *schedule* to
+the training loss contributions: essentially, this provides a way to smoothly ramp
+different targets up (or down), allowing for more complex training curricula.
+To achieve this, you will need to use the ``LossScalingScheduler`` callback,
+
+.. autoclass:: matsciml.lightning.callbacks.LossScalingScheduler
+   :members:
+
+
+To specify this callback, you must pass subclasses of ``BaseScalingSchedule`` as arguments.
+Each schedule type implements the functional form of a schedule, and currently
+there are two concrete schedules;
+
+.. autoclass:: matsciml.lightning.loss_scaling.BaseScalingSchedule
+   :members:
+   :inherited-members:
+
+
+Composed together, an example would look like this
+
+.. code-block:: python
+
+   import pytorch_lightning as pl
+   from matsciml.lightning.callbacks import LossScalingScheduler
+   from matsciml.lightning.loss_scaling import LinearScalingSchedule
+
+   scheduler = LossScalingScheduler(
+       LinearScalingSchedule("energy", initial_value=1.0, end_value=5.0, step_frequency="epoch")
+   )
+   trainer = pl.Trainer(callbacks=[scheduler])
+
+
+The stepping schedule is determined during ``setup`` (as training begins), where the callback will
+inspect ``Trainer`` arguments to determine how many steps will be taken. The ``step_frequency``
+just specifies how often the scaling value is updated.
+
+
 Quick debugging
 ^^^^^^^^^^^^^^^

From ea4d23f0985775122346143c8a1a78cb7eb8db35 Mon Sep 17 00:00:00 2001
From: Kin Long Kelvin Lee
Date: Mon, 30 Sep 2024 15:20:04 -0700
Subject: [PATCH 05/11] docs: correcting class reference to GradFreeForceRegression

Signed-off-by: Kin Long Kelvin Lee
---
 docs/source/training.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/training.rst b/docs/source/training.rst
index 35375a3f..5b95057b 100644
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -131,7 +131,7 @@ prediction as opposed to the derivative of an energy value using ``autograd``.
 
 .. note::
    This task expects that your data includes a ``force`` target key.
 
-.. autoclass:: matsciml.models.base.GradFreeForceRegression
+.. autoclass:: matsciml.models.base.GradFreeForceRegressionTask
    :members:

From aa49a96c7f2e1a0a5797a617a8e2923a1e6f8242 Mon Sep 17 00:00:00 2001
From: Kin Long Kelvin Lee
Date: Mon, 30 Sep 2024 15:58:29 -0700
Subject: [PATCH 06/11] docs: updating loss scaling docs

Signed-off-by: Kin Long Kelvin Lee
---
 docs/source/best-practices.rst | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/docs/source/best-practices.rst b/docs/source/best-practices.rst
index 293d8ec6..f80ab497 100644
--- a/docs/source/best-practices.rst
+++ b/docs/source/best-practices.rst
@@ -229,14 +229,7 @@ To achieve this, you will need to use the ``LossScalingScheduler`` callback,
 
 To specify this callback, you must pass subclasses of ``BaseScalingSchedule`` as arguments.
 Each schedule type implements the functional form of a schedule, and currently
-there are two concrete schedules;
-
-.. autoclass:: matsciml.lightning.loss_scaling.BaseScalingSchedule
-   :members:
-   :inherited-members:
-
-
-Composed together, an example would look like this
+there are two concrete schedules. Composed together, an example would look like this:
 
 .. code-block:: python
 
    import pytorch_lightning as pl
    from matsciml.lightning.callbacks import LossScalingScheduler
    from matsciml.lightning.loss_scaling import LinearScalingSchedule
 
    scheduler = LossScalingScheduler(
        LinearScalingSchedule("energy", initial_value=1.0, end_value=5.0, step_frequency="epoch")
    )
    trainer = pl.Trainer(callbacks=[scheduler])
 
 
 The stepping schedule is determined during ``setup`` (as training begins), where the callback will
 inspect ``Trainer`` arguments to determine how many steps will be taken. The ``step_frequency``
 just specifies how often the scaling value is updated.
 
 
+.. autoclass:: matsciml.lightning.loss_scaling.LinearScalingSchedule
+   :members:
+
+
+.. autoclass:: matsciml.lightning.loss_scaling.SigmoidScalingSchedule
+   :members:
+
+
 Quick debugging
 ^^^^^^^^^^^^^^^

From 0ceac7a442e15d197d54bfa0421ddb8ac7e9ea0f Mon Sep 17 00:00:00 2001
From: Kin Long Kelvin Lee
Date: Mon, 30 Sep 2024 16:01:43 -0700
Subject: [PATCH 07/11] docs: add missing binary classification API

Signed-off-by: Kin Long Kelvin Lee
---
 docs/source/training.rst | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/source/training.rst b/docs/source/training.rst
index 5b95057b..ab469273 100644
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -87,6 +87,9 @@ classifications with a shared embedding. This can be something like a ``stabilit
 label, as in the Materials Project. Keep in mind, however, that a special class
 exists for crystal symmetry classification.
 
+.. autoclass:: matsciml.models.base.BinaryClassificationTask
+   :members:
+
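+As with the scalar regression sketch earlier, a minimal construction might look like
+the following; the encoder and the ``is_stable`` target key are illustrative
+placeholders rather than part of the API:
+
+.. code-block:: python
+
+   from matsciml.models.base import BinaryClassificationTask
+
+   task = BinaryClassificationTask(
+       encoder_class=MyEncoder,  # placeholder: any matsciml-compatible encoder class
+       encoder_kwargs={"hidden_dim": 128},
+       task_keys=["is_stable"],  # one binary label is predicted per key
+   )
+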
 .. _crystal_symmetry:
 
 Crystal symmetry classification
 -------------------------------

From ecb2946f5dc5b7a3819ce7dedd16c691e5167139 Mon Sep 17 00:00:00 2001
From: Kin Long Kelvin Lee
Date: Mon, 30 Sep 2024 16:02:34 -0700
Subject: [PATCH 08/11] docs: removing erroneous header

Signed-off-by: Kin Long Kelvin Lee
---
 docs/source/training.rst | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/docs/source/training.rst b/docs/source/training.rst
index ab469273..edb962ae 100644
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -1,6 +1,3 @@
-Training pipeline
-=================
-
 Task abstraction
 ================
 

From 590948c590604e71f9501647e6457932a7f418a7 Mon Sep 17 00:00:00 2001
From: Kin Long Kelvin Lee
Date: Mon, 30 Sep 2024 16:05:03 -0700
Subject: [PATCH 09/11] docs: removing erroneous task API reference

Signed-off-by: Kin Long Kelvin Lee
---
 docs/source/training.rst | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/docs/source/training.rst b/docs/source/training.rst
index edb962ae..339d8be6 100644
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -55,9 +55,6 @@ the ``IrrepOutputBlock`` allows users to specify transformations per-representat
    :members:
 
 
-Task API reference
-##################
-
 Scalar regression
 -----------------

From 923164122cee903ace01d17db2d28f2b2b11d8c3 Mon Sep 17 00:00:00 2001
From: Kin Long Kelvin Lee
Date: Mon, 30 Sep 2024 16:32:34 -0700
Subject: [PATCH 10/11] docs: fixing labels typo

Signed-off-by: Kin Long Kelvin Lee
---
 docs/source/best-practices.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/best-practices.rst b/docs/source/best-practices.rst
index f80ab497..a619dedb 100644
--- a/docs/source/best-practices.rst
+++ b/docs/source/best-practices.rst
@@ -197,7 +197,7 @@ that specify the mean and standard deviation of a target; an example is given be
 
        }
    )
 
-The example above will normalize ``energy`` labelsm and can be substituted with
+The example above will normalize ``energy`` labels and can be substituted with
 any target key of interest (e.g. ``force``, ``bandgap``, etc.).

From 17d75825146bd8ebb9966184a80d0604a58ba2de Mon Sep 17 00:00:00 2001
From: Kin Long Kelvin Lee
Date: Mon, 30 Sep 2024 16:33:11 -0700
Subject: [PATCH 11/11] docs: removing statement about logging in fast_dev_run

Signed-off-by: Kin Long Kelvin Lee
---
 docs/source/best-practices.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/best-practices.rst b/docs/source/best-practices.rst
index a619dedb..4f0b5525 100644
--- a/docs/source/best-practices.rst
+++ b/docs/source/best-practices.rst
@@ -299,7 +299,7 @@ assumptions in the convergent properties of ``Adam``-like optimizers causes larg
 spikes in the training loss. This callback can help identify these occurrences.
 
 The ``devset``/``fast_dev_run`` approach detailed above is also useful for testing
-engineering/infrastructure (e.g. accelerator offload and logging), but not necessarily
+engineering/infrastructure (e.g. accelerator offload), but not necessarily
 for probing training dynamics. Instead, we recommend using the ``overfit_batches``
 argument in ``pl.Trainer``: