
refactor: SummarizationAccuracyMetrics transform to handle multiple target outputs more efficiently #317

Merged: 41 commits, Aug 8, 2024

Conversation

@kirupang-code (Contributor) commented Jul 31, 2024

  • Edited the logic in SummarizationAccuracyMetrics so that the target_output field no longer has to be duplicated in order to be evaluated against each model_output. When there are multiple target_outputs, we take the max score from self.compute_metric.
  • Added a target_output_keys_provider parameter to support computing scores given a key that maps to a list of possible target outputs.
  • Updated the ordering of the subclass transforms' parameters to keep them consistent with their parent transforms; otherwise parameters get mixed up (e.g. input_keys + model_output_keys raised TypeError: can only concatenate str (not "list") to str). I also had to add a default value in BertScore because the parameter now follows one with a default value.
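The max-over-targets behavior described in the first bullet can be sketched roughly as follows. The function name, record keys, and the toy overlap metric are illustrative, not fmeval's actual API:

```python
from typing import Any, Callable, Dict, List

def score_against_targets(
    record: Dict[str, Any],
    model_output_key: str,
    target_output_keys_provider: str,
    compute_metric: Callable[[str, str], float],
) -> float:
    """Score one model output against every target output and keep the max."""
    model_output = record[model_output_key]
    targets: List[str] = record[target_output_keys_provider]
    # No need to duplicate the record once per target: one pass, keep the best.
    return max(compute_metric(target, model_output) for target in targets)

# Toy word-overlap metric, purely for demonstration.
def overlap(target: str, model_output: str) -> float:
    t, m = set(target.split()), set(model_output.split())
    return len(t & m) / len(t)

record = {"model_output": "the cat sat", "possible_targets": ["a cat sat", "dogs run"]}
best = score_against_targets(record, "model_output", "possible_targets", overlap)
# best is 2/3: "a cat sat" shares two of its three words with the model output.
```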

@xiaoyi-cheng xiaoyi-cheng changed the title fix: refactor SummarizationAccuracyMetrics transform to handle multiple target outputs more efficiently refactor: SummarizationAccuracyMetrics transform to handle multiple target outputs more efficiently Jul 31, 2024
@kirupang-code kirupang-code changed the title refactor: SummarizationAccuracyMetrics transform to handle multiple target outputs more efficiently WIP - refactor: SummarizationAccuracyMetrics transform to handle multiple target outputs more efficiently Jul 31, 2024
@kirupang-code kirupang-code changed the title WIP - refactor: SummarizationAccuracyMetrics transform to handle multiple target outputs more efficiently refactor: SummarizationAccuracyMetrics transform to handle multiple target outputs more efficiently Jul 31, 2024
src/fmeval/transforms/summarization_accuracy_metrics.py (review thread, outdated and resolved):
allow_duplicate_input_keys: bool,
target_output_keys: Optional[List[str]] = None,
Contributor:
Let's keep this as a positional argument so that 1) the argument order doesn't change, which would break things for existing users, and 2) the default configuration isn't invalid; currently, if you use the default options, you'll get a validation error.

For point 1), I really mean that we don't want the signatures of the concrete subclasses to change. Obviously, no one is directly calling this ABC's init method.

src/fmeval/transforms/summarization_accuracy_metrics.py (review thread, outdated and resolved):
bertscore_model: Union[BertscoreHelperModel, ActorHandle],
target_output_keys: Optional[List[str]] = None,
target_output_keys_provider: str = "",
bertscore_model: Union[BertscoreHelperModel, ActorHandle] = BertscoreHelperModel(BERTSCORE_DEFAULT_MODEL),
Contributor:

We need to put target_output_keys_provider as the last argument, or else existing calls to this method will break.
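The breakage being avoided here is the usual positional-argument hazard: inserting a parameter anywhere but last silently shifts every later argument for existing positional callers. A schematic illustration with simplified, hypothetical signatures:

```python
# Old signature (simplified): existing callers pass arguments positionally.
def old_init(target_output_keys, model_output_keys, bertscore_model):
    return {"targets": target_output_keys, "outputs": model_output_keys,
            "model": bertscore_model}

# Inserting the new parameter anywhere but last shifts every later argument.
def new_init(target_output_keys, target_output_keys_provider=None,
             model_output_keys=None, bertscore_model=None):
    return {"targets": target_output_keys, "provider": target_output_keys_provider,
            "outputs": model_output_keys, "model": bertscore_model}

old_call = old_init(["t"], ["m"], "model")
new_call = new_init(["t"], ["m"], "model")
# The unchanged positional call now binds ["m"] to target_output_keys_provider
# and "model" to model_output_keys -- no error, just silently wrong bindings.
```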

Contributor:

Discussed with @kirupang-code offline. We need to use this ordering (which is backwards-incompatible) in order for deserialization of these object instances to work properly (since we rely on a specific ordering of the arguments to __init__). The existing calls to BertScore.__init__ in our own code have been updated as necessary, but users who interact with BertScore objects directly will be impacted.

We should keep track of this before the next release.
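A minimal sketch of why deserialization pins the argument order, assuming (as the comment implies) that instances are rebuilt by replaying saved values positionally into __init__. The class and helper here are hypothetical stand-ins:

```python
class BertScoreLike:
    # Parameter order here is a contract: serialized state is replayed in order.
    def __init__(self, target_output_keys=None, model_output_keys=None,
                 bertscore_model=None, target_output_keys_provider=""):
        self.state = (target_output_keys, model_output_keys,
                      bertscore_model, target_output_keys_provider)

def deserialize(cls, saved_values):
    # Positional replay: correct only if __init__'s parameter order is stable.
    return cls(*saved_values)

obj = deserialize(BertScoreLike, (["t"], ["m"], "model-handle", "targets_key"))
```

If the parameter order changed between serialization and deserialization, the saved values would bind to the wrong attributes with no error raised, which is why the ordering change above had to be accepted despite being backwards-incompatible.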

output_keys,
allow_duplicates=allow_duplicate_input_keys,
)
self.target_output_keys = target_output_keys
self.model_output_keys = model_output_keys
self.target_output_keys_provider = str(target_output_keys_provider)
Contributor:
Why is this needed?


@@ -128,6 +128,31 @@ def test_bert_score_call_with_bertscore_model_object():
mock_bertscore_model.get_helper_scores.assert_called_once_with("Hello there!", "Hi")


def test_bert_score_call_with_target_output_keys_provider():
Contributor:

Please add a unit test where there are multiple target outputs, and verify that the max score is returned.
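A hedged sketch of the shape such a test could take; the helper and the exact-match metric are illustrative, not the real transform or BertScore model:

```python
def max_score_for(targets, model_output, metric):
    """Mirror of the transform's behavior: best score across all targets."""
    return max(metric(t, model_output) for t in targets)

def test_max_score_returned_for_multiple_targets():
    # Two candidate targets; only one matches the model output exactly,
    # so the max (1.0) should win over the non-matching target's 0.0.
    exact_match = lambda t, m: 1.0 if t == m else 0.0
    score = max_score_for(["hello there", "goodbye"], "hello there", exact_match)
    assert score == 1.0

test_max_score_returned_for_multiple_targets()
```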

@kirupang-code (Contributor, Author) commented Aug 2, 2024

Had to add # type: ignore to target_output_keys_provider: its type is Optional, but we validate that its value is not None before it is used. Discussed with Daniel. The errors without # type: ignore:

3. Mypy
=======
src/fmeval/transforms/summarization_accuracy_metrics.py:89:69: error: List item 0 has incompatible type "str | None"; expected "str"  [list-item]
src/fmeval/transforms/summarization_accuracy_metrics.py:111:25: error: Invalid index type "str | None" for "dict[str, Any]"; expected type "str"  [index]
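This is the usual friction with validated Optionals: mypy does not combine the two runtime checks, so the provider still types as "str | None" at the use site even though the code path guarantees it is a str. A minimal reproduction with hypothetical names:

```python
from typing import List, Optional

def resolve_target_keys(target_output_keys: Optional[List[str]],
                        target_output_keys_provider: Optional[str]) -> List[str]:
    if target_output_keys is None and target_output_keys_provider is None:
        raise ValueError("Provide target_output_keys or target_output_keys_provider.")
    if target_output_keys is not None:
        return target_output_keys
    # At runtime the provider cannot be None here, but mypy still sees
    # "str | None", hence the suppression used in the actual change:
    return [target_output_keys_provider]  # type: ignore[list-item]
```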

xiaoyi-cheng (Contributor) previously approved these changes Aug 6, 2024

@xiaoyi-cheng left a comment: lgtm

@xiaoyi-cheng xiaoyi-cheng merged commit bc5a15f into aws:main Aug 8, 2024
3 checks passed
3 participants