-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework metrics #376
Rework metrics #376
Conversation
Additional corrections to make :
|
…to rework_metrics support 3.6 dropped
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New code and tests make sense to me. @LoannPeurey, would you like me to check anything specific in more detail?
Next step being the addition of the lena CTC CVC to our supported metrics list, is that clear for you how we do it? |
Something like this?
Then "lena_ctc" should be supplied in the input dataframe. |
The main difference is that we don't have a "lena_ctc" column extracted from its files.
So CTC should be computed from some of this columns I think. I see that you have added the So closer to: @metricFunction(args={}, cols={"lena_conv_status","lena_conv_turn_type" }) #put relevant columns
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
max_distance_ms = kwargs["max_distance_ms"]
<pretty much a copy of the current ctc code> |
Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the As for the actual question, I would extract one more column from its by adding the following to class ItsConverter(AnnotationConverter):
...
lena_cumulative_ctc = conversation_info[2] That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like df = pd.DataFrame(segments)
# lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes
df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0)
return df After that, for any interval we would take @metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0)
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
"""Conversational turn count (ctc) as calculated by LENA."""
return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min() |
Since we already merged this branch and went on to new additions, I suggest you check out the #394 PR, @kalenkovich I'll add you as a reviewer there, so check the changes if you want (and your review will probably be useful). Just know that this will be merged and hopefully released at the beginning of next week regardless. |
Understood, if we have to import a new column from the its, it means that this will break backward compatibility and we will need to reimport most of our datasets |
Isn't that a common situation where you need some extra information from the raw files? Just curious, because in this case, it is probably unnecessary since your calculation should yield the same numbers as mine. |
I am not sure if that has occured in the past. I am far from being an expert on lena's file so I am not sure how exhaustive our importation is. But in most other file types, I think the importation aims at getting all the info converted. |
The metrics pipeline needs more flexibility, the goal is to have the possibility to include whatever metric we want in the resulting csv file (from a list of available metrics or from a custom made function taking annotations and duration and outputting a value to include in the result) Old pipelines should be kept, but will work by simply giving the list of what was done in the past to the new system.
The period pipeline is going to disappear as
--period
will become an option available for every extraction and not restricted to that specific subcommand.In order to give all the metrics wanted as parameters, a file will be used (yml ? / csv ?, maybe both). At the end of the extraction, all the parameters used will be saved in a file.
This way of handling things will come with a performance hit (as every metric is computed separately, the previous optimizations taking advantage of dependencies or computing in batch the values will not be possible anymore) and this needs to be evaluated
performance comparisonmoved as an issue for laterTurns count implementationlater implementation