
Conversations/summary #469

Merged
merged 49 commits into master on Aug 1, 2024

Conversation

@LoannPeurey (Contributor) commented Apr 9, 2024

Extract summaries about conversations, using the 'conversations' type of annotation sets.

Steps:

  • Making it work:
    • Create the conversations extraction class
    • Create the necessary conversationFunctions
    • Implement the standard extraction class
    • Implement other classes (custom, other standards)
    • Implement the specification pipeline (using a parameter file)
  • Improvements (format and functionalities):
    • Unify usage of the term 'annotations' for stretches of annotated audio, and 'segments' for speaker segments (in _process_conversation and in conversationFunctions.py)
    • Remove iterrows() in _process_conversation
    • Rework the way extract works. Loading all the segments first was done to allow threading the retrieval of segments and then of conversations; I think it is better to retrieve the segments in a loop that submits conversations to the pool asynchronously with map_async, then, once everything is submitted, wait for the results and concatenate them.
  • Add tests for everything
  • Benchmark and optimize the extraction (perhaps implement some tests on efficiency goals for extraction speed)
  • Add a page to the documentation on how to use it
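The reworked extract step described in the list could be sketched roughly as below. This is a minimal sketch only, with a thread pool and hypothetical `retrieve_segments`/`process_conversation` stand-ins; the real pipeline's function names and pool type may differ:

```python
from multiprocessing.pool import ThreadPool

def process_conversation(segments):
    # stand-in for the per-conversation extraction work
    return {"n_segments": len(segments)}

def extract(conversation_ids, retrieve_segments, n_jobs=4):
    with ThreadPool(n_jobs) as pool:
        # retrieve segments in a loop, submitting each conversation to the
        # pool asynchronously as soon as its segments have been retrieved
        pending = [
            pool.apply_async(process_conversation, (retrieve_segments(cid),))
            for cid in conversation_ids
        ]
        # once everything is submitted, wait for the results and concatenate them
        return [p.get() for p in pending]
```

apply_async is used here for per-conversation submission; map_async over batches, as suggested above, follows the same submit-everything-then-wait-and-concatenate pattern.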

@lucasgautheron (Collaborator) commented Jun 6, 2024

Hi all,

I was curious about the implementation of the extraction of conversations, so I took a look at the PR.
It reminded me of something I wanted to say that applies to this PR, but not only to it: I'd like to suggest that we stop using pandas' iterrows() altogether.

For some reason, iterrows is much, MUCH slower than any other alternative (by one or two orders of magnitude!).

Personally, I like to use to_dict(orient="records") instead. This requires almost no further adaptation of the code, which is why I am allowing myself to make the suggestion (regardless of the magnitude of the performance issue, it is always worth abandoning iterrows, given that the cost of switching is so small).

This is important especially when looping over all segments of all annotations (potentially ~1M–10M rows for a corpus).
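To illustrate, the switch is usually a drop-in change (a toy example, not code from the PR):

```python
import pandas as pd

df = pd.DataFrame({"segment_onset": [0, 10, 20],
                   "segment_offset": [5, 15, 25]})

# slow: iterrows() constructs a pandas Series object for every row
durations = [row["segment_offset"] - row["segment_onset"]
             for _, row in df.iterrows()]

# fast: to_dict(orient="records") yields plain dicts with the same access pattern
durations_fast = [row["segment_offset"] - row["segment_onset"]
                  for row in df.to_dict(orient="records")]

assert durations == durations_fast == [5, 5, 5]
```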

If I may, an additional suggestion would be to reserve the variable names "annotation" and "annotations" only for "parts of recordings covered by some annotator" and to use "segments" only when referring to labelled/classified segments of speech (in other words, "segments" are what is contained within "annotations").

@LoannPeurey LoannPeurey marked this pull request as ready for review July 8, 2024 09:39
@LoannPeurey LoannPeurey self-assigned this Jul 8, 2024
@alix-bourree (Contributor)

Hi,

I ran the StandardConversations and CustomConversations classes on Joe's data without encountering any issues:

  • Israel-Haifa
  • babylogger-vs-lena-data
  • bergelson
  • cougar
  • lyon
  • nepal-havron
  • phonSES
  • png2019
  • quechua
  • timor-leste2022
  • tseltal2015
  • tsimane2017
  • tsimanec2018
  • warlaumont
  • winnipeg

I’ve also added some edge cases to test_conversations.py:

def test_empty_conversations(project, am):
    empty_segments = pd.DataFrame(columns=["segment_onset", "segment_offset", "speaker_type", "time_since_last_conv", "conv_count"])

    am.import_annotations(
        pd.DataFrame(
            [{"set": "empty_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, empty_segments),
    )

    std = StandardConversations(project, setname='empty_conv')
    results = std.extract()

    assert results.empty, "The result should be empty for an empty dataset"

def test_nan_values(project, am):
    nan_segments = pd.DataFrame({
        "segment_onset": [np.nan, 10, 20],
        "segment_offset": [5, np.nan, 25],
        "speaker_type": ["CHI", np.nan, "FEM"],
        "time_since_last_conv": [np.nan, 15, 5],
        "conv_count": [1, 1, 2]
    })

    am.import_annotations(
        pd.DataFrame(
            [{"set": "nan_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, nan_segments),
    )

    std = StandardConversations(project, setname='nan_conv')
    results = std.extract()

    assert not results.empty, "The result should not be empty for a dataset with NaN values"

def test_single_entry_conversation(project, am):
    single_segment = pd.DataFrame({
        "segment_onset": [0],
        "segment_offset": [5],
        "speaker_type": ["CHI"],
        "time_since_last_conv": [np.nan],
        "conv_count": [1]
    })

    am.import_annotations(
        pd.DataFrame(
            [{"set": "single_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, single_segment),
    )

    std = StandardConversations(project, setname='single_conv')
    results = std.extract()

    assert len(results) == 1, "The result should contain one conversation for a single entry dataset"

def test_incorrect_data_types(project, am):
    incorrect_types = pd.DataFrame({
        "segment_onset": ["0", "10", "20"],
        "segment_offset": ["5", "15", "25"],
        "speaker_type": ["CHI", "FEM", "MAN"],
        "time_since_last_conv": ["nan", "15", "5"],
        "conv_count": [1, 1, 2]
    })

    am.import_annotations(
        pd.DataFrame(
            [{"set": "incorrect_types_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, incorrect_types),
    )

    std = StandardConversations(project, setname='incorrect_types_conv')
    # the code should raise an exception for incorrect data types
    with pytest.raises(Exception):
        std.extract()

def test_unsorted_annotations(project, am):
    unsorted_segments = pd.DataFrame({
        "segment_onset": [20, 0, 10],
        "segment_offset": [25, 5, 15],
        "speaker_type": ["FEM", "CHI", "MAN"],
        "time_since_last_conv": [5, np.nan, 15],
        "conv_count": [2, 1, 1]
    })

    am.import_annotations(
        pd.DataFrame(
            [{"set": "unsorted_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, unsorted_segments),
    )

    std = StandardConversations(project, setname='unsorted_conv')
    results = std.extract()

    assert not results.empty, "The result should not be empty for unsorted annotations"

However, I encountered some failures:

======== short test summary info =========================================================================================
FAILED tests/test_conversations.py::test_empty_conversations - NameError: name 'grouper' is not defined
FAILED tests/test_conversations.py::test_nan_values - pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
FAILED tests/test_conversations.py::test_incorrect_data_types - TypeError: can only concatenate str (not "int") to str

It seems there might be some issues with handling exceptions for these cases. Let me know if you think it's necessary to address these exceptions.
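For the IntCastingNaNError failure specifically, one option (just a sketch of the mechanism, not necessarily how the fix should be made in the pipeline) is pandas' nullable Int64 dtype, which represents missing values as pd.NA instead of failing the cast:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 10.0, 20.0])

# s.astype(int) would raise IntCastingNaNError: NaN has no integer representation.
# The nullable "Int64" dtype keeps the missing value as <NA> instead.
s_int = s.astype("Int64")

assert s_int.isna().sum() == 1
assert s_int.dropna().tolist() == [10, 20]
```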

Otherwise, the checklist has been followed successfully!

@LoannPeurey (Contributor, Author)

Thank you @alix-bourree. I think, indeed, that we are always missing some coverage in testing, and this will help. Can you push those changes to the branch? I will give those errors a look to fix them.

@LoannPeurey LoannPeurey merged commit 8593f56 into master Aug 1, 2024
18 checks passed