.. _experiment_tracking_apis:

########################
Experiment Tracking APIs
########################

.. figure:: ../../resources/experiment_tracking_diagram.png
    :height: 500px

To track training metrics such as accuracy, loss, or AUC, we need to log these metrics with one of the experiment tracking systems.
Here we will discuss the following topics:

- Logging metrics with MLflow, TensorBoard, or Weights & Biases
- Streaming metrics to the FL server
- Streaming metrics to FL clients

Logging metrics with MLflow, TensorBoard, or Weights & Biases
=============================================================

Integrate MLflow logging to efficiently stream metrics to the MLflow server with just three lines of code:

.. code-block:: python

    from nvflare.client.tracking import MLflowWriter

    mlflow = MLflowWriter()
    mlflow.log_metric("loss", running_loss / 2000, global_step)

In this setup, we use ``MLflowWriter`` instead of using the MLflow API directly.
This abstraction is important because it lets you flexibly redirect your logged metrics to any destination, which we discuss in more detail later.
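
For context, here is a minimal sketch of how these calls might sit inside a typical PyTorch-style training loop; the ``train_loader``, ``model``, ``criterion``, and ``optimizer`` objects are placeholders for your own training code:

.. code-block:: python

    from nvflare.client.tracking import MLflowWriter

    mlflow = MLflowWriter()
    global_step = 0

    for epoch in range(2):
        running_loss = 0.0
        for i, (inputs, labels) in enumerate(train_loader):
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            global_step += 1
            if i % 2000 == 1999:  # log the averaged loss every 2000 mini-batches
                mlflow.log_metric("loss", running_loss / 2000, global_step)
                running_loss = 0.0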

MLflow, TensorBoard, and Weights & Biases syntax all work to stream the collected metrics to any supported experiment tracking system.
Choosing ``TBWriter``, ``MLflowWriter``, or ``WandBWriter`` is a matter of preference, based on your existing code and requirements:

- ``MLflowWriter`` uses the MLflow API operation syntax ``log_metric()``
- ``TBWriter`` uses the TensorBoard ``SummaryWriter`` operation ``add_scalar()``
- ``WandBWriter`` uses the Weights & Biases API operation ``log()``

Here are the APIs:

.. code-block:: python

    class TBWriter(LogWriter):
        def add_scalar(self, tag: str, scalar: float, global_step: Optional[int] = None, **kwargs):
        def add_scalars(self, tag: str, scalars: dict, global_step: Optional[int] = None, **kwargs):

    class WandBWriter(LogWriter):
        def log(self, metrics: Dict[str, float], step: Optional[int] = None):

    class MLflowWriter(LogWriter):
        def log_param(self, key: str, value: any) -> None:
        def log_params(self, values: dict) -> None:
        def log_metric(self, key: str, value: float, step: Optional[int] = None) -> None:
        def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
        def log_text(self, text: str, artifact_file_path: str) -> None:
        def set_tag(self, key: str, tag: any) -> None:
        def set_tags(self, tags: dict) -> None:
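
As a quick sketch of how the three syntaxes line up (assuming ``TBWriter`` and ``WandBWriter`` are importable from ``nvflare.client.tracking``, the same module shown above for ``MLflowWriter``, and can be constructed without arguments like ``MLflowWriter``), the same scalar can be logged through any of the writers:

.. code-block:: python

    from nvflare.client.tracking import MLflowWriter, TBWriter, WandBWriter

    # the same "loss" value, logged through each writer's native syntax
    MLflowWriter().log_metric("loss", 0.42, step=100)
    TBWriter().add_scalar("loss", 0.42, global_step=100)
    WandBWriter().log({"loss": 0.42}, step=100)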

After you've modified the training code, you can use NVFlare's job configuration to configure the system to stream the logs appropriately.

Streaming metrics to the FL server
==================================

All metric key values are captured as events, with the flexibility to stream them to the most suitable destinations.
Let's add the ``ConvertToFedEvent`` component to convert these metric events to federated events so they will be sent to the server.

Add this component to ``config_fed_client.json``:

.. code-block:: json

    {
        "id": "event_to_fed",
        "name": "ConvertToFedEvent",
        "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
    }
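
For orientation, this component sits inside the ``components`` array of the client job configuration; a minimal sketch with the executors and other entries elided (the ``format_version`` field is assumed from NVFlare's standard job configuration layout):

.. code-block:: json

    {
        "format_version": 2,
        "components": [
            {
                "id": "event_to_fed",
                "name": "ConvertToFedEvent",
                "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
            }
        ]
    }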

If using the subprocess Client API with the ``ClientAPILauncherExecutor`` (rather than the in-process Client API with the ``InProcessClientAPIExecutor``),
we need to add the ``MetricRelay`` to fire fed events, a ``CellPipe`` for metrics, and an ``ExternalConfigurator`` for Client API initialization.

.. code-block:: yaml

    {
        id = "metric_relay"
        path = "nvflare.app_common.widgets.metric_relay.MetricRelay"
        args {
            pipe_id = "metrics_pipe"
            event_type = "fed.analytix_log_stats"
            read_interval = 0.1
        }
    },
    {
        id = "metrics_pipe"
        path = "nvflare.fuel.utils.pipe.cell_pipe.CellPipe"
        args {
            mode = "PASSIVE"
            site_name = "{SITE_NAME}"
            token = "{JOB_ID}"
            root_url = "{ROOT_URL}"
            secure_mode = "{SECURE_MODE}"
            workspace_dir = "{WORKSPACE}"
        }
    },
    {
        id = "config_preparer"
        path = "nvflare.app_common.widgets.external_configurator.ExternalConfigurator"
        args {
            component_ids = ["metric_relay"]
        }
    }
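
With this configuration in place, the subprocess training script only needs the usual Client API initialization; metrics logged through the writer are picked up by the ``MetricRelay`` over the ``CellPipe``. Here is a minimal sketch, with the actual training logic elided and the Client API calls (``init``, ``is_running``, ``receive``, ``send``) assumed from the standard ``nvflare.client`` module:

.. code-block:: python

    import nvflare.client as flare
    from nvflare.client.tracking import MLflowWriter

    flare.init()  # connects this subprocess to the pipes configured above
    mlflow = MLflowWriter()

    while flare.is_running():
        input_model = flare.receive()
        # ... local training happens here ...
        # this call is relayed by MetricRelay and fired as a
        # "fed.analytix_log_stats" event on the FL client
        mlflow.log_metric("loss", 0.42, step=input_model.current_round)
        flare.send(input_model)  # placeholder: echo the received model back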

On the server, configure the experiment tracking system in ``config_fed_server.conf`` using one of the following receivers.
Note that any of the receivers can be used regardless of which writer is used.

- ``MLflowReceiver`` for MLflow
- ``TBAnalyticsReceiver`` for TensorBoard
- ``WandBReceiver`` for Weights & Biases

For example, here we add the ``MLflowReceiver`` component to the components configuration array:

.. code-block:: json

    {
        "id": "mlflow_receiver_with_tracking_uri",
        "path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLflowReceiver",
        "args": {
            "tracking_uri": "file:///{WORKSPACE}/{JOB_ID}/mlruns",
            "kwargs": {
                "experiment_name": "hello-pt-experiment",
                "run_name": "hello-pt-with-mlflow",
                "experiment_tags": {
                    "mlflow.note.content": "markdown for the experiment"
                },
                "run_tags": {
                    "mlflow.note.content": "markdown describes details of experiment"
                }
            },
            "artifact_location": "artifacts"
        }
    }

Note that the ``args`` are user defined (``tracking_uri``, ``experiment_name``, tags, etc.) and are specific to the receiver being configured.

The MLflow tracking URI argument ``tracking_uri`` is ``None`` by default, which uses the MLflow default URL, ``http://localhost:5000``.
To make this accessible from another machine, change it to the correct URL, or point it to the ``mlruns`` directory in the workspace:

::

    tracking_uri = <the MLflow server endpoint URL>

::

    tracking_uri = "file:///{WORKSPACE}/{JOB_ID}/mlruns"

You can change other arguments such as ``experiment_name``, ``run_name``, tags (using Markdown syntax), and the artifact location.

Start the MLflow server with one of the following commands:

::

    mlflow server --host 127.0.0.1 --port 5000

::

    mlflow ui --port 5000

For more information with an example walkthrough, see the :github_nvflare_link:`FedAvg with SAG with MLflow tutorial <examples/hello-world/step-by-step/cifar10/sag_mlflow/sag_mlflow.ipynb>`.

Streaming metrics to FL clients
===============================

If streaming metrics to the FL server isn't preferred due to privacy or other concerns, users can alternatively stream metrics to the FL client.
In such cases, there's no need to add the ``ConvertToFedEvent`` component on the client side.
Additionally, since we're not streaming to the server side, there's no requirement to configure receivers in the server configuration.

Instead, to receive records on the client side, configure the metrics receiver in the client configuration rather than the server configuration.

For example, for TensorBoard, add this component to ``config_fed_client.conf``:

.. code-block:: json

    {
        "id": "tb_analytics_receiver",
        "name": "TBAnalyticsReceiver",
        "args": {"events": ["analytix_log_stats"]}
    }

Note that the ``events`` argument is ``analytix_log_stats``, not ``fed.analytix_log_stats``, indicating that this is a local event.

If using the ``MetricRelay`` component, we can similarly change the ``event_type`` value from ``fed.analytix_log_stats`` to ``analytix_log_stats`` to follow this convention.
We must then set the ``MetricRelay`` argument ``fed_event`` to ``false`` so it fires local events rather than the default fed events.

.. code-block:: yaml

    {
        id = "metric_relay"
        path = "nvflare.app_common.widgets.metric_relay.MetricRelay"
        args {
            pipe_id = "metrics_pipe"
            event_type = "analytix_log_stats"
            # how fast should it read from the peer
            read_interval = 0.1
            fed_event = false
        }
    },

Then, the metrics will stream to the client.
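
To inspect the client-side results, point the standard TensorBoard CLI at the directory where the receiver stores its event files; the ``tb_events`` folder name below is an assumption based on the receiver's default configuration:

::

    tensorboard --logdir=<client workspace>/<job id>/tb_events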