Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework metrics #376

Merged
merged 22 commits into from
Jul 25, 2022
Merged

Rework metrics #376

merged 22 commits into from
Jul 25, 2022

Conversation

LoannPeurey
Copy link
Contributor

@LoannPeurey LoannPeurey commented Jun 13, 2022

The metrics pipeline needs more flexibility, the goal is to have the possibility to include whatever metric we want in the resulting csv file (from a list of available metrics or from a custom made function taking annotations and duration and outputting a value to include in the result) Old pipelines should be kept, but will work by simply giving the list of what was done in the past to the new system.
The period pipeline is going to disappear as --period will become an option available for every extraction and not restricted to that specific subcommand.
In order to give all the metrics wanted as parameters, a file will be used (yml ? / csv ?, maybe both). At the end of the extraction, all the parameters used will be saved in a file.
This way of handling things will come with a performance hit (as every metric is computed separately, the previous optimizations taking advantage of dependencies or computing in batch the values will not be possible anymore) and this needs to be evaluated

  • core of the Metrics class
  • list of available metrics with their function (metricsFunctions.py)
  • LenaMetrics and AclewMetrics backward compatibility
  • Usage of custom functions passed to the Metrics class to allow for custom metrics
  • Define the format and content of parameters file
  • outputs a file containing the parameters used
  • performance comparison moved as an issue for later
  • Documentation
  • Tests
  • Turns count implementation later implementation

@LoannPeurey LoannPeurey self-assigned this Jun 13, 2022
@LoannPeurey LoannPeurey added enhancement New feature or request metrics labels Jun 13, 2022
@LoannPeurey
Copy link
Contributor Author

LoannPeurey commented Jun 23, 2022

Additional corrections to make :

  • Manage conflicts between cli arguments and parameters in the yml file (probably raise error if conflict). => outdated, yml file pipeline does not allow for additional cli arguments
  • Manage conflicts of name (ie multiple metrics would have the same name (raise error or warn ? )
  • drop support for python 3.6
  • check for correct metrics values returned between 0 and NaN

@LoannPeurey
Copy link
Contributor Author

LoannPeurey commented Jul 11, 2022

  • check that recordings option works

@LoannPeurey LoannPeurey marked this pull request as ready for review July 15, 2022 13:59
Copy link
Collaborator

@kalenkovich kalenkovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New code and tests make sense to me. @LoannPeurey, would you like me to check anything specific in more detail?

@LoannPeurey
Copy link
Contributor Author

New code and tests make sense to me. @LoannPeurey, would you like me to check anything specific in more detail?

Next step being the addition of the lena CTC CVC to our supported metrics list, is that clear for you how we do it?

@LoannPeurey LoannPeurey merged commit 4cc4f43 into master Jul 25, 2022
@kalenkovich
Copy link
Collaborator

New code and tests make sense to me. @LoannPeurey, would you like me to check anything specific in more detail?

Next step being the addition of the lena CTC CVC to our supported metrics list, is that clear for you how we do it?

Something like this?

@metricFunction(args={}, cols={"lena_ctc"})
def lena_ctc(annotations: pd.DataFrame, max_distance_ms: int = 1, **kwargs):
   <pretty much a copy of the current ctc code>

Then "lena_ctc" should be supplied in the input dataframe.

@LoannPeurey
Copy link
Contributor Author

The main difference is that we don't have a "lena_ctc" column extracted from its files.
We store the columns:

  • lena_speaker
  • lena_block_number
  • lena_block_type
  • lena_conv_status
  • lena_response_count
  • lena_conv_turn_type
  • lena_conv_floor_type
  • utterances_count
  • utterances_length

So CTC should be computed from some of this columns I think.
You can see what converted its files are from here for example. And the description of the column content here

I see that you have added the max_distance_ms argument, foes lena allow to change that? If we can include it, it should be as a kwarg, the duration arg however should alway be here, even if not used because it is passed to all metrics.

So closer to:

@metricFunction(args={}, cols={"lena_conv_status","lena_conv_turn_type" }) #put relevant columns
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
   max_distance_ms = kwargs["max_distance_ms"]
   <pretty much a copy of the current ctc code>

@kalenkovich
Copy link
Collaborator

Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the columns argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, the max_distance_ms: int = 1 argument. 🤦🤦🤦

As for the actual question, I would extract one more column from its by adding the following to converters.py

class ItsConverter(AnnotationConverter):
    ...
                lena_cumulative_ctc = conversation_info[2]

That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like

        df = pd.DataFrame(segments)
        
        # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes
        df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0)

        return df

After that, for any interval we would take max(lena_cumulative_ctc) - min(lena_cumulative_ctc):

@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0)
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
    """Conversational turn count (ctc) as calculated by LENA."""
    return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()

@LoannPeurey
Copy link
Contributor Author

Since we already merged this branch and went on to new additions, I suggest you check out the #394 PR, @kalenkovich I'll add you as a reviewer there, so check the changes if you want (and your review will probably be useful). Just know that this will be merged and hopefully released at the beginning of next week regardless.

@kalenkovich kalenkovich mentioned this pull request Aug 12, 2022
3 tasks
@LoannPeurey
Copy link
Contributor Author

Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the columns argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, the max_distance_ms: int = 1 argument. 🤦🤦🤦

As for the actual question, I would extract one more column from its by adding the following to converters.py

class ItsConverter(AnnotationConverter):
    ...
                lena_cumulative_ctc = conversation_info[2]

That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like

        df = pd.DataFrame(segments)
        
        # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes
        df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0)

        return df

After that, for any interval we would take max(lena_cumulative_ctc) - min(lena_cumulative_ctc):

@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0)
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
    """Conversational turn count (ctc) as calculated by LENA."""
    return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()

Understood, if we have to import a new column from the its, it means that this will break backward compatibility and we will need to reimport most of our datasets

@kalenkovich
Copy link
Collaborator

Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the columns argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, the max_distance_ms: int = 1 argument. 🤦🤦🤦
As for the actual question, I would extract one more column from its by adding the following to converters.py

class ItsConverter(AnnotationConverter):
    ...
                lena_cumulative_ctc = conversation_info[2]

That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like

        df = pd.DataFrame(segments)
        
        # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes
        df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0)

        return df

After that, for any interval we would take max(lena_cumulative_ctc) - min(lena_cumulative_ctc):

@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0)
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
    """Conversational turn count (ctc) as calculated by LENA."""
    return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()

Understood, if we have to import a new column from the its, it means that this will break backward compatibility and we will need to reimport most of our datasets

Isn't that a common situation where you need some extra information from the raw files? Just curious, because in this case, it is probably unnecessary since your calculation should yield the same numbers as mine.

@LoannPeurey
Copy link
Contributor Author

Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the columns argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, the max_distance_ms: int = 1 argument. 🤦🤦🤦
As for the actual question, I would extract one more column from its by adding the following to converters.py

class ItsConverter(AnnotationConverter):
    ...
                lena_cumulative_ctc = conversation_info[2]

That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like

        df = pd.DataFrame(segments)
        
        # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes
        df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0)

        return df

After that, for any interval we would take max(lena_cumulative_ctc) - min(lena_cumulative_ctc):

@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0)
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
    """Conversational turn count (ctc) as calculated by LENA."""
    return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()

Understood, if we have to import a new column from the its, it means that this will break backward compatibility and we will need to reimport most of our datasets

Isn't that a common situation where you need some extra information from the raw files? Just curious, because in this case, it is probably unnecessary since your calculation should yield the same numbers as mine.

I am not sure if that has occured in the past. I am far from being an expert on lena's file so I am not sure how exhaustive our importation is. But in most other file types, I think the importation aims at getting all the info converted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request metrics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants