
Conversations/summary #469

Merged
merged 49 commits into master on Aug 1, 2024

Conversation

@LoannPeurey (Contributor) commented Apr 9, 2024

Extract summaries about conversations, using the 'conversations' type of annotation sets.

Steps:

  • Making it work:
    • Create the conversations extraction class
    • Create the necessary conversationFunctions
    • Implement the standard extraction class
    • Implement other classes (custom, other standards)
    • Implement the specification pipeline (using a parameter file)
  • Improvements (format and functionalities):
    • Unify usage of the term 'annotations' for stretches of annotated audio, and 'segments' for speaker segments (in _process_conversation and in conversationFunctions.py)
    • Remove iterrows() in _process_conversation
    • Rework the way extract works. Loading all the segments first was done to allow threading the retrieval of segments and then of conversations; I think it is better to retrieve the segments in a loop that submits conversations to the pool asynchronously with map_async, then, once everything is submitted, wait for the results and concatenate them.
  • Add tests for everything
  • Benchmark and optimize the extraction (perhaps implement some tests on efficiency goals for extraction speed)
  • Add a page to the documentation on how to use it
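The reworked extract step described in the list could be sketched roughly as below. This is a minimal sketch only, with a thread pool and hypothetical `retrieve_segments`/`process_conversation` stand-ins; the real pipeline's function names and pool type may differ:

```python
from multiprocessing.pool import ThreadPool

def process_conversation(segments):
    # stand-in for the per-conversation extraction work
    return {"n_segments": len(segments)}

def extract(conversation_ids, retrieve_segments, n_jobs=4):
    with ThreadPool(n_jobs) as pool:
        # retrieve segments in a loop, submitting each conversation to the
        # pool asynchronously as soon as its segments have been retrieved
        pending = [
            pool.apply_async(process_conversation, (retrieve_segments(cid),))
            for cid in conversation_ids
        ]
        # once everything is submitted, wait for the results and concatenate them
        return [p.get() for p in pending]
```

apply_async is used here for per-conversation submission; map_async over batches, as suggested above, follows the same submit-everything-then-wait-and-concatenate pattern.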

@lucasgautheron (Collaborator) commented Jun 6, 2024

Hi all,

I was curious about the implementation of the extraction of conversations, so I took a look at the PR.
It reminded me of something I wanted to say that applies to this PR, but not only to it: I'd like to suggest that we stop using pandas' iterrows() altogether.

For some reason, iterrows is much, MUCH slower than any other alternative (by one or two orders of magnitude!).

Personally, I like to use to_dict(orient="records") instead. This requires almost no further adaptation of the code, which is why I am allowing myself to make the suggestion (regardless of the magnitude of the performance issue, it is always worth abandoning iterrows, given that the cost of switching is so small).

This is important especially when looping over all segments of all annotations (potentially ~1M–10M rows for a corpus).
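To illustrate, the switch is usually a drop-in change (a toy example, not code from the PR):

```python
import pandas as pd

df = pd.DataFrame({"segment_onset": [0, 10, 20],
                   "segment_offset": [5, 15, 25]})

# slow: iterrows() constructs a pandas Series object for every row
durations = [row["segment_offset"] - row["segment_onset"]
             for _, row in df.iterrows()]

# fast: to_dict(orient="records") yields plain dicts with the same access pattern
durations_fast = [row["segment_offset"] - row["segment_onset"]
                  for row in df.to_dict(orient="records")]

assert durations == durations_fast == [5, 5, 5]
```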

If I may, an additional suggestion would be to reserve the variable names "annotation" and "annotations" only for "parts of recordings covered by some annotator" and to use "segments" only when referring to labelled/classified segments of speech (in other words, "segments" are what is contained within "annotations").

@LoannPeurey LoannPeurey marked this pull request as ready for review July 8, 2024 09:39
@LoannPeurey LoannPeurey self-assigned this Jul 8, 2024
@alix-bourree (Contributor)

Hi,

I ran the StandardConversations and CustomConversations classes on Joe's data without encountering any issues:

  • Israel-Haifa
  • babylogger-vs-lena-data
  • bergelson
  • cougar
  • lyon
  • nepal-havron
  • phonSES
  • png2019
  • quechua
  • timor-leste2022
  • tseltal2015
  • tsimane2017
  • tsimanec2018
  • warlaumont
  • winnipeg

I’ve also added some edge cases to test_conversations.py:

def test_empty_conversations(project, am):
    empty_segments = pd.DataFrame(columns=["segment_onset", "segment_offset", "speaker_type", "time_since_last_conv", "conv_count"])

    am.import_annotations(
        pd.DataFrame(
            [{"set": "empty_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, empty_segments),
    )

    std = StandardConversations(project, setname='empty_conv')
    results = std.extract()

    assert results.empty, "The result should be empty for an empty dataset"

def test_nan_values(project, am):
    nan_segments = pd.DataFrame({
        "segment_onset": [np.nan, 10, 20],
        "segment_offset": [5, np.nan, 25],
        "speaker_type": ["CHI", np.nan, "FEM"],
        "time_since_last_conv": [np.nan, 15, 5],
        "conv_count": [1, 1, 2]
    })

    am.import_annotations(
        pd.DataFrame(
            [{"set": "nan_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, nan_segments),
    )

    std = StandardConversations(project, setname='nan_conv')
    results = std.extract()

    assert not results.empty, "The result should not be empty for a dataset with NaN values"

def test_single_entry_conversation(project, am):
    single_segment = pd.DataFrame({
        "segment_onset": [0],
        "segment_offset": [5],
        "speaker_type": ["CHI"],
        "time_since_last_conv": [np.nan],
        "conv_count": [1]
    })

    am.import_annotations(
        pd.DataFrame(
            [{"set": "single_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, single_segment),
    )

    std = StandardConversations(project, setname='single_conv')
    results = std.extract()

    assert len(results) == 1, "The result should contain one conversation for a single entry dataset"

def test_incorrect_data_types(project, am):
    incorrect_types = pd.DataFrame({
        "segment_onset": ["0", "10", "20"],
        "segment_offset": ["5", "15", "25"],
        "speaker_type": ["CHI", "FEM", "MAN"],
        "time_since_last_conv": ["nan", "15", "5"],
        "conv_count": [1, 1, 2]
    })

    am.import_annotations(
        pd.DataFrame(
            [{"set": "incorrect_types_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, incorrect_types),
    )

    std = StandardConversations(project, setname='incorrect_types_conv')
    # the code should raise an exception for incorrect data types
    with pytest.raises(Exception):
        std.extract()

def test_unsorted_annotations(project, am):
    unsorted_segments = pd.DataFrame({
        "segment_onset": [20, 0, 10],
        "segment_offset": [25, 5, 15],
        "speaker_type": ["FEM", "CHI", "MAN"],
        "time_since_last_conv": [5, np.nan, 15],
        "conv_count": [2, 1, 1]
    })

    am.import_annotations(
        pd.DataFrame(
            [{"set": "unsorted_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, unsorted_segments),
    )

    std = StandardConversations(project, setname='unsorted_conv')
    results = std.extract()

    assert not results.empty, "The result should not be empty for unsorted annotations"

However, I encountered some failures:

======== short test summary info =========================================================================================
FAILED tests/test_conversations.py::test_empty_conversations - NameError: name 'grouper' is not defined
FAILED tests/test_conversations.py::test_nan_values - pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
FAILED tests/test_conversations.py::test_incorrect_data_types - TypeError: can only concatenate str (not "int") to str

It seems there might be some issues with handling exceptions for these cases. Let me know if you think it's necessary to address these exceptions.
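For the IntCastingNaNError failure specifically, one option (just a sketch of the mechanism, not necessarily how the fix should be made in the pipeline) is pandas' nullable Int64 dtype, which represents missing values as pd.NA instead of failing the cast:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 10.0, 20.0])

# s.astype(int) would raise IntCastingNaNError: NaN has no integer representation.
# The nullable "Int64" dtype keeps the missing value as <NA> instead.
s_int = s.astype("Int64")

assert s_int.isna().sum() == 1
assert s_int.dropna().tolist() == [10, 20]
```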

Otherwise, the checklist has been followed successfully!

@LoannPeurey (Contributor, Author)

Thank you @alix-bourree. I think, indeed, that we are always missing some coverage in testing, and this will help. Can you push those changes to the branch? I will give those errors a look to fix them.

@LoannPeurey LoannPeurey merged commit 8593f56 into master Aug 1, 2024
18 checks passed