Rework metrics #376

LoannPeurey · 2022-06-13T12:41:21Z

The metrics pipeline needs more flexibility. The goal is to be able to include any metric we want in the resulting csv file, either from a list of available metrics or from a custom function that takes the annotations and the duration and returns a value to include in the result. Old pipelines should be kept, but they will work by simply passing the list of what was computed in the past to the new system.
The period pipeline is going to disappear, as --period will become an option available for every extraction rather than being restricted to that specific subcommand.
To pass all the wanted metrics as parameters, a file will be used (yml? csv? maybe both). At the end of the extraction, all the parameters used will be saved in a file.
This way of handling things will come with a performance hit: every metric is computed separately, so the previous optimizations that took advantage of dependencies or computed values in batch will no longer be possible. This needs to be evaluated.
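
To make the idea concrete, here is a minimal sketch of what "a list of available metrics or a custom function" could look like. Everything in it is an assumption made for illustration: the names `AVAILABLE_METRICS`, `compute_row`, `voc_count`, `voc_dur_ph` and `mean_segment_length`, and the plain list-of-dicts annotation format, are hypothetical and are not the ChildProject API.

```python
from typing import Callable, Dict, List, Union

# Hypothetical annotation format: one dict per segment, times in milliseconds.
Annotations = List[dict]
Metric = Callable[[Annotations, int], float]


def voc_count(annotations: Annotations, duration: int) -> float:
    """Number of segments (illustrative built-in metric)."""
    return float(len(annotations))


def voc_dur_ph(annotations: Annotations, duration: int) -> float:
    """Total segment duration in ms per hour of audio (illustrative)."""
    total = sum(seg["offset"] - seg["onset"] for seg in annotations)
    return total * 3_600_000 / duration if duration else float("nan")


# Hypothetical registry of metrics that can be requested by name.
AVAILABLE_METRICS: Dict[str, Metric] = {
    "voc_count": voc_count,
    "voc_dur_ph": voc_dur_ph,
}


def compute_row(requested: List[Union[str, Metric]],
                annotations: Annotations, duration: int) -> Dict[str, float]:
    """Compute each requested metric separately and collect one csv row."""
    row = {}
    for item in requested:
        if callable(item):
            # Custom function: same (annotations, duration) contract.
            name, func = item.__name__, item
        else:
            name, func = item, AVAILABLE_METRICS[item]
        row[name] = func(annotations, duration)
    return row


def mean_segment_length(annotations: Annotations, duration: int) -> float:
    """Custom metric supplied by the user (duration is accepted but unused)."""
    if not annotations:
        return float("nan")
    return sum(s["offset"] - s["onset"] for s in annotations) / len(annotations)


annotations = [{"onset": 0, "offset": 1000}, {"onset": 2000, "offset": 2500}]
row = compute_row(["voc_count", "voc_dur_ph", mean_segment_length],
                  annotations, duration=1_800_000)
```

Because each metric is called independently on the same annotations, nothing is shared between them, which is where the performance cost mentioned above would come from.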

LoannPeurey · 2022-06-23T14:38:57Z

Additional corrections to make:

Manage conflicts between cli arguments and parameters in the yml file (probably raise an error on conflict). => outdated, the yml file pipeline does not allow additional cli arguments
Manage name conflicts, i.e. multiple metrics having the same name (raise an error or warn?)
Drop support for python 3.6
check for correct metrics values returned between 0 and NaN
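
For the name-conflict item, one possible approach is to reject the metric list before any computation starts. The sketch below assumes the "raise error" option; `check_unique_names` is a hypothetical helper, not an existing function of the package.

```python
from collections import Counter
from typing import Iterable


def check_unique_names(names: Iterable[str]) -> None:
    """Raise ValueError if any metric name appears more than once."""
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError(
            "metric names used more than once: " + ", ".join(duplicates)
        )


# Passes silently: all names are distinct.
check_unique_names(["voc_count", "voc_dur_ph"])

# A duplicated name is reported explicitly.
try:
    check_unique_names(["voc_count", "voc_dur_ph", "voc_count"])
except ValueError as err:
    message = str(err)
```

Warning instead of raising would only require replacing the `raise` with `warnings.warn`, at the cost of one column silently overwriting the other in the output.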

LoannPeurey · 2022-07-11T15:19:11Z

check that recordings option works

ChildProject/pipelines/metrics.py

ChildProject/projects.py

ChildProject/utils.py

ChildProject/pipelines/metrics.py

ChildProject/projects.py

docs/source/metrics.rst

kalenkovich

New code and tests make sense to me. @LoannPeurey, would you like me to check anything specific in more detail?

ChildProject/pipelines/metrics.py

LoannPeurey · 2022-07-25T07:31:19Z

The next step is adding the LENA CTC and CVC metrics to our supported metrics list. Is it clear to you how we would do it?

kalenkovich · 2022-08-11T01:09:52Z

Something like this?

@metricFunction(args={}, cols={"lena_ctc"})
def lena_ctc(annotations: pd.DataFrame, max_distance_ms: int = 1, **kwargs):
    # pretty much a copy of the current ctc code
    ...

Then "lena_ctc" should be supplied in the input dataframe.

LoannPeurey · 2022-08-11T08:48:08Z

The main difference is that we don't have a "lena_ctc" column extracted from the .its files.
We store the columns:

lena_speaker
lena_block_number
lena_block_type
lena_conv_status
lena_response_count
lena_conv_turn_type
lena_conv_floor_type
utterances_count
utterances_length

So CTC should be computed from some of these columns, I think.
You can see what converted .its files look like here, for example, and the description of the column contents here.

I see that you have added the max_distance_ms argument. Does LENA allow changing that? If we can include it, it should be as a kwarg. The duration arg, however, should always be there, even if it is not used, because it is passed to all metrics.

So closer to:

@metricFunction(args={}, cols={"lena_conv_status", "lena_conv_turn_type"})  # put relevant columns
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
    max_distance_ms = kwargs["max_distance_ms"]
    # pretty much a copy of the current ctc code
    ...
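
The two conventions described above (duration always present in the signature, optional parameters passed as kwargs) together with a decorator that declares required columns can be sketched as follows. This is only an illustration under assumptions: `requires_columns` and `long_segment_rate` are hypothetical names, annotations are plain dicts rather than a DataFrame, and the real metricFunction decorator may behave differently.

```python
import functools


def requires_columns(*columns):
    """Hypothetical decorator: fail early if a required column is missing."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(annotations, duration, **kwargs):
            missing = [c for c in columns
                       if any(c not in seg for seg in annotations)]
            if missing:
                raise KeyError(f"{func.__name__} needs columns: {missing}")
            return func(annotations, duration, **kwargs)
        # Expose the declared columns so a pipeline could inspect them.
        wrapper.required_columns = set(columns)
        return wrapper
    return decorator


@requires_columns("segment_onset", "segment_offset")
def long_segment_rate(annotations, duration, **kwargs):
    """Segments per hour that last at least min_length_ms (illustrative)."""
    # Optional parameter read from kwargs, with a default when absent.
    min_length_ms = kwargs.get("min_length_ms", 0)
    kept = [s for s in annotations
            if s["segment_offset"] - s["segment_onset"] >= min_length_ms]
    return len(kept) * 3_600_000 / duration


segments = [
    {"segment_onset": 0, "segment_offset": 500},
    {"segment_onset": 1000, "segment_offset": 3000},
]
rate_all = long_segment_rate(segments, 3_600_000)
rate_long = long_segment_rate(segments, 3_600_000, min_length_ms=1000)
```

Reading the option with `kwargs.get` and a default keeps the metric callable when the parameter is not supplied, whereas `kwargs["max_distance_ms"]` would raise a KeyError in that case.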

kalenkovich · 2022-08-11T14:46:03Z

Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking whether I understood the new metrics logic 🤦 And obviously I misread the purpose of the columns argument of the decorator function 🤦 🤦 Also, I forgot that we were trying to get ctc as calculated by LENA, not to calculate ctc from the .its files the same way it is calculated from the VTC files. Hence the max_distance_ms: int = 1 argument. 🤦🤦🤦

As for the actual question, I would extract one more column from the .its files by adding the following to converters.py:

class ItsConverter(AnnotationConverter):
    ...
                lena_cumulative_ctc = conversation_info[2]

That field is empty except for segments that are the last part of a given conversational turn. It therefore has to be forward-filled and then zero-filled at the beginning. I would probably do it at the conversion stage with something like:

        df = pd.DataFrame(segments)
        
        # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes
        df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0)

        return df

After that, for any interval we would take max(lena_cumulative_ctc) - min(lena_cumulative_ctc):

@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0)
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
    """Conversational turn count (ctc) as calculated by LENA."""
    return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()
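
A small worked example may help to see the forward-fill and the max minus min computation together. It is a sketch on made-up data: the values, the `segment_onset` column used to select a window, and the `ctc_in_interval` helper are assumptions for illustration only.

```python
import pandas as pd

nan = float("nan")

# Made-up segments: the cumulative count is only set where a turn ends.
segments = pd.DataFrame({
    "segment_onset": [0, 1000, 2000, 3000, 4000, 5000],
    "lena_cumulative_ctc": [nan, nan, 1.0, nan, 2.0, nan],
})

# Forward-fill the holes, then zero-fill everything before the first turn.
segments["lena_cumulative_ctc"] = (
    segments["lena_cumulative_ctc"].ffill().fillna(0)
)


def ctc_in_interval(annotations: pd.DataFrame) -> float:
    """Turns counted inside an interval: max minus min of the cumulative count."""
    if annotations.empty:
        return 0.0
    col = annotations["lena_cumulative_ctc"]
    return float(col.max() - col.min())


# Select the segments starting in [1000, 5000) ms.
window = segments[
    (segments["segment_onset"] >= 1000) & (segments["segment_onset"] < 5000)
]
turns = ctc_in_interval(window)
```

One property worth keeping in mind with this approach: if the first segment of the window is itself the one that closes a turn, its cumulative value is already the minimum, so that turn is not included in the difference.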

ChildProject/pipelines/metricsFunctions.py

LoannPeurey · 2022-08-12T10:00:58Z

Since we have already merged this branch and moved on to new additions, I suggest you check out PR #394. @kalenkovich, I'll add you as a reviewer there, so check the changes if you want (your review will probably be useful). Just know that it will be merged, and hopefully released, at the beginning of next week regardless.

LoannPeurey · 2022-08-12T15:08:32Z

Understood. If we have to import a new column from the .its files, this will break backward compatibility and we will need to reimport most of our datasets.

kalenkovich · 2022-08-12T15:17:42Z

Isn't that a common situation where you need some extra information from the raw files? Just curious, because in this case, it is probably unnecessary since your calculation should yield the same numbers as mine.

LoannPeurey · 2022-08-12T15:23:16Z

I am not sure whether that has occurred in the past. I am far from being an expert on LENA's files, so I am not sure how exhaustive our import is. But for most other file types, I think the import aims at converting all the available information.

LoannPeurey added 4 commits June 1, 2022 17:38

first version of metrics with parameters, only superClass Metrics

9c33de1

redefine lenaMetrics class

e65dec8

prepare usual pipelines : custom lena aclew

18e0460

metrics.py was left out

4163f27

LoannPeurey self-assigned this Jun 13, 2022

LoannPeurey added the enhancement (New feature or request) and metrics labels Jun 13, 2022

LoannPeurey added 3 commits June 17, 2022 14:31

csv file input, verbose errors for missing columns

1dae06d

new dataset : organizing data

8fdef10

change yaml output to have the full list of metrics

f908d6b

LoannPeurey and others added 10 commits June 27, 2022 18:43

intersect time ranges, new yml pipeline

c242627

automated tests

ff6868a

fix tests

cc22c56

drop tests python 3.6 (support will be dropped)

8687900

metricsFunctions use decorators

7ec1bf0

Merge branch 'rework_metrics' of github.com:LAAC-LSCP/ChildProject in…

0b6943a

…to rework_metrics support 3.6 dropped

avoid division by 0 un lp metric

1d9da92

multiple minor improvements, requirement python to >=3.7

07013f3

update documentation

e41679d

fix doc generation

89de391

throw ValueError when given incorrect recording

0684f79

LoannPeurey mentioned this pull request Jul 15, 2022

Metrics pipeline performance #380

Open

LoannPeurey marked this pull request as ready for review July 15, 2022 13:59

LoannPeurey requested review from William-N-Havard and kalenkovich July 15, 2022 13:59