Torchmetrics integration for evaluation metrics (#504)
- Add torchmetrics to model.evaluate, a test for mAP and trainer.evaluate, 
and update docs to give a fuller sense of DeepForest evaluation philosophy
- move pycocotools to pip section of .yml
- tests multi log dict for torchmetrics
bw4sz authored Oct 9, 2023
1 parent ad45785 commit c8c7cb0
Showing 9 changed files with 113 additions and 16 deletions.
28 changes: 26 additions & 2 deletions deepforest/main.py
@@ -8,6 +8,7 @@
import pytorch_lightning as pl
from torch import optim
import numpy as np
from torchmetrics.detection import IntersectionOverUnion, MeanAveragePrecision

from deepforest import dataset, visualize, get_data, utilities, model, predict
from deepforest import evaluate as evaluate_iou
@@ -57,6 +58,11 @@ def __init__(self,
self.num_classes = num_classes
self.create_model()

# Metrics
self.iou_metric = IntersectionOverUnion(
class_metrics=True, iou_threshold=self.config["validation"]["iou_threshold"])
self.mAP_metric = MeanAveragePrecision()

#Create a default trainer.
self.create_trainer()

@@ -512,8 +518,7 @@ def training_step(self, batch, batch_idx):
return losses

def validation_step(self, batch, batch_idx):
"""Train on a loaded dataset
"""Evaluate a batch
"""
try:
path, images, targets = batch
@@ -529,12 +534,31 @@ def validation_step(self, batch, batch_idx):
# sum of regression and classification loss
losses = sum([loss for loss in loss_dict.values()])

self.model.eval()
preds = self.model.forward(images)

# Calculate intersection-over-union
self.iou_metric.update(preds, targets)
self.mAP_metric.update(preds, targets)

# Log loss
for key, value in loss_dict.items():
self.log("val_{}".format(key), value, on_epoch=True)

return losses

def on_validation_epoch_end(self):
output = self.iou_metric.compute()
self.log_dict(output)
self.iou_metric.reset()

output = self.mAP_metric.compute()

# Remove classes from output dict
output = {key: value for key, value in output.items() if not key == "classes"}
self.log_dict(output)
self.mAP_metric.reset()

def predict_step(self, batch, batch_idx):
batch_results = self.model(batch)

2 changes: 1 addition & 1 deletion deepforest_config.yml
@@ -32,4 +32,4 @@ validation:
root_dir:
#Intersection over union evaluation
iou_threshold: 0.4
- val_accuracy_interval: 5
+ val_accuracy_interval: 20
1 change: 1 addition & 0 deletions dev_requirements.txt
@@ -29,4 +29,5 @@ twine
torchvision>=0.13
xmltodict
comet_ml
pycocotools
imagecodecs
86 changes: 76 additions & 10 deletions docs/Evaluation.md
@@ -1,18 +1,84 @@
# Evaluation

Independent analysis of whether a model can generalize from training data to new areas is critical for creating a robust workflow.
- We stress that evaluation data must be different from training data, as neural networks have millions of parameters and can easily memorize thousands of samples. Therefore, while it would be rather easy to tune the model to get extremely high scores on the training data, it would fail when exposed to new images.
+ We stress that evaluation data must be different from training data, as neural networks have millions of parameters and can easily memorize thousands of samples. Avoid random train-test splits; instead, create test datasets that mimic downstream tasks. If you are predicting across temporal surveys or imaging platforms, your train-test split should reflect those partitions. Random sampling is almost never the right choice: biological data often has high spatial, temporal, or taxonomic correlation, which makes it easy for a model to appear to generalize on a held-out split while failing when pushed into new situations.

- To get an evaluation score, specify an annotations file in the same format as the training example above. The model will
+ DeepForest provides several evaluation metrics. There is no one-size-fits-all evaluation approach, and the user needs to consider which evaluation metric best fits the task. There is extensive information online about the evaluation of object detection networks. Our philosophy is to provide users with a range of statistics and visualizations. Always visualize results and trust your judgment; never be guided by a single metric.

### Further Reading

[MeanAveragePrecision in torchmetrics](https://medium.com/data-science-at-microsoft/how-to-smoothly-integrate-meanaverageprecision-into-your-training-loop-using-torchmetrics-7d6f2ce0a2b3)

[A general explanation of the mAP metric](https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173)

[Comparing Object Detection Models](https://www.comet.com/site/blog/compare-object-detection-models-from-torchvision/)

## Average Intersection over Union
DeepForest modules use torchmetrics' [IntersectionOverUnion](https://torchmetrics.readthedocs.io/en/stable/detection/intersection_over_union.html) metric. This calculates the average overlap between predictions and ground truth boxes. It can be considered a general indicator of model performance but is not sufficient on its own for model evaluation. There are many reasons predictions might overlap with ground truth; for example, a model that covers an entire image with boxes would have a high IoU but little practical utility.
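As an illustration, a minimal sketch of the torchmetrics IntersectionOverUnion metric on a single toy prediction/ground-truth pair (the boxes, scores, and labels below are made up for the example):

```
import torch
from torchmetrics.detection import IntersectionOverUnion

# One predicted box and one ground truth box in (xmin, ymin, xmax, ymax) format
preds = [{
    "boxes": torch.tensor([[10.0, 10.0, 50.0, 50.0]]),
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([0]),
}]
targets = [{
    "boxes": torch.tensor([[12.0, 12.0, 52.0, 52.0]]),
    "labels": torch.tensor([0]),
}]

metric = IntersectionOverUnion(class_metrics=True, iou_threshold=0.4)
metric.update(preds, targets)
print(metric.compute())  # e.g. {'iou': tensor(0.82...), 'iou/cl_0': tensor(0.82...)}
```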

## Mean-Average-Precision (mAP)
mAP is the standard COCO evaluation metric and the most common for comparing computer vision models. It is useful as a summary statistic. However, it has several limitations for an ecological use case.

1. It is not intuitive and is difficult to translate to ecological applications. See the links above for visual explanations of the mAP metric, which is essentially the area under the precision-recall curve averaged over a range of IoU thresholds.
2. The vast majority of biological applications use a fixed cutoff to decide whether an object of interest is present in an image. Perhaps in the future we will weight tree boxes by their confidence score, but currently we apply rules like "all predictions with a score > 0.4 are considered positive detections". This does not connect well with the mAP metric.
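For a concrete sense of the numbers reported below, a minimal sketch using torchmetrics MeanAveragePrecision on the same list-of-dictionaries format of boxes, scores, and labels (values are made up):

```
import torch
from torchmetrics.detection import MeanAveragePrecision

preds = [{
    "boxes": torch.tensor([[10.0, 10.0, 50.0, 50.0]]),
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([0]),
}]
targets = [{
    "boxes": torch.tensor([[12.0, 12.0, 52.0, 52.0]]),
    "labels": torch.tensor([0]),
}]

metric = MeanAveragePrecision()
metric.update(preds, targets)
result = metric.compute()
print(result["map"], result["map_50"], result["map_75"])
```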

## Precision and Recall at a set IoU threshold
This was the original DeepForest metric, set to an IoU of 0.4. This means that all predictions that overlap a ground truth box at IoU > 0.4 are true positives. As opposed to the torchmetrics above, it is intuitive and matches downstream ecological tasks. The drawback is that it is slow, coarse, and does not fully reward the model for having high confidence scores on true positives.

There is an additional difference between ecological object detection methods like tree crowns and traditional computer vision methods. Instead of a single or set of easily differentiated ground truths, we could have 60 or 70 objects that overlap in an image. How do you best assign each prediction to each ground truth?

DeepForest uses the [Hungarian matching algorithm](https://thinkautonomous.medium.com/computer-vision-for-tracking-8220759eee85) to assign predictions to ground truth based on maximum IoU overlap. This is slow compared to the methods above, so it isn't a good choice for running hundreds of times during model training; see config["validation"]["val_accuracy_interval"] for setting the frequency of the evaluation callback for this metric.
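A rough sketch of the idea, using scipy's linear_sum_assignment on a made-up IoU matrix; this illustrates the matching step only and is not the package's internal implementation:

```
import numpy as np
from scipy.optimize import linear_sum_assignment

# iou_matrix[i, j] = IoU between prediction i and ground truth j (made-up values)
iou_matrix = np.array([
    [0.80, 0.10, 0.00],
    [0.20, 0.60, 0.10],
    [0.00, 0.30, 0.05],
])

# Hungarian assignment maximizes total IoU (minimize the negative)
pred_idx, truth_idx = linear_sum_assignment(-iou_matrix)

iou_threshold = 0.4
matched_ious = iou_matrix[pred_idx, truth_idx]
true_positives = (matched_ious > iou_threshold).sum()

recall = true_positives / iou_matrix.shape[1]     # fraction of ground truth matched
precision = true_positives / iou_matrix.shape[0]  # fraction of predictions matched
print(recall, precision)  # 0.666..., 0.666...
```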

# Calculating Evaluation Metrics

## Torchmetrics and loss scores

These metrics are largely used during training to keep track of model performance. They are relatively fast and will be automatically run during training.

```
import os
from deepforest import main, get_data

m = main.deepforest()
csv_file = get_data("OSBS_029.csv")
root_dir = os.path.dirname(csv_file)
m.config["validation"]["csv_file"] = csv_file
m.config["validation"]["root_dir"] = root_dir
results = m.trainer.validate(m)
```
This creates a dictionary with the average IoU ('iou') as well as 'iou' for each class; here there is just one class, 'Tree'. These are followed by the COCO mAP scores; see Further Reading above for an explanation of the mAP-level scores. val_bbox_regression is the loss of the object detection box head, and val_classification is the loss of the object classification head.
```
classes 0.0
iou 0.6305446564807566
iou/cl_0 0.6305446564807566
map 0.04219449311494827
map_50 0.11141198128461838
map_75 0.025357535108923912
map_large -1.0
map_medium 0.05097917467355728
map_per_class -1.0
map_small 0.0
mar_1 0.008860759437084198
mar_10 0.03417721390724182
mar_100 0.08481013029813766
mar_100_per_class -1.0
mar_large -1.0
mar_medium 0.09436620026826859
mar_small 0.0
val_bbox_regression 0.5196723341941833
val_classification 0.4998389184474945
```

> **_Advanced tip:_** Users can set the frequency of pytorch lightning evaluation using kwargs passed to main.deepforest.create_trainer(), for example [check_val_every_n_epoch](https://lightning.ai/docs/pytorch/stable/common/trainer.html#check-val-every-n-epoch).
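For example, a minimal sketch assuming the model object m from the snippet above:

```
# Run the validation loop (torchmetrics and loss logging) every 5 epochs
m.create_trainer(check_val_every_n_epoch=5)
```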
## Recall and Precision at a fixed IoU Score
To get recall and precision at a set IoU threshold, pass an annotations file to the m.evaluate method.

```
m = main.deepforest()
csv_file = get_data("OSBS_029.csv")
root_dir = os.path.dirname(csv_file)
- results = model.evaluate(csv_file, root_dir, iou_threshold = 0.4)
+ results = m.evaluate(csv_file, root_dir, iou_threshold = 0.4)
```

The returned object is a dictionary containing the keys results, box_recall, box_precision, and class_recall. The results entry is a dataframe giving the intersection-over-union score and match status for each ground truth object.


```
results["results"].head()
prediction_id truth_id IoU image_path match
@@ -27,7 +93,6 @@ This dataframe contains a numeric id for each predicted crown in each image and

The recall is the proportion of ground truth that has a true positive match with a prediction based on the intersection-over-union threshold. The default threshold is 0.4 and can be changed with m.evaluate(iou_threshold=...).
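As a hypothetical example, if 7 of 10 ground truth boxes are overlapped by a prediction at IoU > 0.4, box recall is 0.7; if those 7 true positives come from 9 total predictions, box precision is 7/9 ≈ 0.78.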


```
results["box_recall"]
0.705
@@ -40,6 +105,7 @@ results["box_precision"]
0.781
```

### Worked example of calculating IoU and recall/precision values
To convert overlap between predicted and ground truth bounding boxes into recall and precision values, the most common approach is the intersection-over-union metric (IoU).
IoU is the area of overlap between the predicted box and the ground truth box divided by the area of their union.
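A minimal sketch of the arithmetic for two axis-aligned boxes (coordinates are made up for the example):

```
# Boxes as (xmin, ymin, xmax, ymax)
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return intersection / (area_a + area_b - intersection)

print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / (100 + 100 - 50) = 0.33
```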

@@ -140,8 +206,8 @@ result["class_recall"]
0 Tree 1.0 0.67033 61
```

- One important decision was how to average precision and recall across multiple images. Two reasonable options might be to take all predictions and all ground truth and compute the statistic on the entire dataset.
- This strategy makes more sense for evaluation data that are relatively homogenous across images. We prefer to take the average of per-image precision and recall. This helps balanace the dataset if some images have many trees, and other have few trees, such as when you are comparing multiple habitat types.
+ ### How to average evaluation metrics across images?
+ One important decision is how to average precision and recall across multiple images. Two reasonable options are to compute the statistic over all predictions and ground truth at once, or to average the per-image values. Computing over the entire dataset makes more sense for evaluation data that are relatively homogeneous across images. We prefer the average of per-image precision and recall, which balances the dataset when some images have many objects and others have few, such as when you are comparing multiple habitat types.
Users are welcome to calculate their own statistics directly from the results dataframe.
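For example, a minimal sketch of per-image recall using the match column of the results dataframe (column names as in the head() output below; treat this as illustrative rather than the package's own computation):

```
# Each row of results["results"] is a ground truth box; "match" is True
# when it was matched by a prediction at the chosen IoU threshold.
per_image_recall = results["results"].groupby("image_path")["match"].mean()
print(per_image_recall.mean())  # average of the per-image recall values
```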

```
@@ -154,9 +220,9 @@ result["results"].head()
34 34 4 0.595862 ... Tree OSBS_029.tif True
```

- ## Evaluating tiles
+ ### Evaluating tiles too large for memory

- The evaluation method uses deepforest.predict_image for each of the paths supplied in the image_path column. This means that the entire image is passed for prediction. This will not work for large images. The deepforest.predict_tile method does a couple things under hood that need to be repeated for evaluation.
+ The evaluation method uses deepforest.predict_image for each of the paths supplied in the image_path column, meaning the entire image is passed for prediction. This will not work for images too large to fit in memory. The deepforest.predict_tile method does a couple of things under the hood that need to be repeated for evaluation.

Pseudocode:
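A hedged sketch of one way to do this, assuming the deepforest.preprocess.split_raster helper; the exact argument names should be checked against the current preprocess module:

```
import os
from deepforest import main, preprocess

m = main.deepforest()
m.use_release()

# Split the large tile and its annotations into overlapping crops
crop_annotations = preprocess.split_raster(
    path_to_raster="large_tile.tif",
    annotations_file="large_tile_annotations.csv",
    patch_size=400,
    patch_overlap=0.05,
    base_dir="crops/",
)
crop_annotations.to_csv(os.path.join("crops", "crop_annotations.csv"), index=False)

# Evaluate on the crops, each of which fits in memory
results = m.evaluate(os.path.join("crops", "crop_annotations.csv"),
                     root_dir="crops/", iou_threshold=0.4)
```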

1 change: 1 addition & 0 deletions environment.yml
@@ -35,3 +35,4 @@ dependencies:
- sphinx-markdown-tables
- imagecodecs
- albumentations>=1.0.0
- pycocotools
1 change: 1 addition & 0 deletions setup.py
@@ -59,6 +59,7 @@
"Pillow>6.2.0",
"pandas",
"progressbar2",
"pycocotools",
"six",
"scipy>1.5",
"slidingwindow",
2 changes: 1 addition & 1 deletion tests/test_IoU.py
@@ -27,4 +27,4 @@ def test_compute_IoU(m, download_release, tmpdir):

result = IoU.compute_IoU(ground_truth, predictions)
assert result.shape[0] == ground_truth.shape[0]
- assert sum(result.IoU) > 10
+ assert sum(result.IoU) > 10
2 changes: 1 addition & 1 deletion tests/test_evaluate.py
@@ -93,4 +93,4 @@ def test_compute_class_recall(sample_results):
'size': [2, 1]
}).reset_index(drop=True)

- assert evaluate.compute_class_recall(sample_results).equals(expected_recall)
+ assert evaluate.compute_class_recall(sample_results).equals(expected_recall)
6 changes: 5 additions & 1 deletion tests/test_main.py
@@ -361,4 +361,8 @@ def test_over_score_thresh(m):
assert m.model.score_thresh == 0.8
assert not m.model.score_thresh == original_score_thresh


def test_iou_metric(m):
results = m.trainer.validate(m)
keys = ['val_classification', 'val_bbox_regression', 'iou', 'iou/cl_0']
for x in keys:
assert x in list(results[0].keys())
