Conversations/summary #469
Conversation
update conversations summary
Hi all, I was curious about the implementation of the extraction of conversations, so I took a look at the PR. For some reason,

Personally, I like to use

This is important especially when looping over all segments of all annotations (potentially ~1M–10M rows for a corpus). If I may, an additional suggestion would be to reserve the variable names "annotation" and "annotations" for "parts of recordings covered by some annotator", and to use "segments" only when referring to labelled/classified segments of speech (in other words, "segments" are what is contained within "annotations").
…e usage of segments vs annotations
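The specific construct the comment above recommends is cut off in the thread. Purely as an illustration of why iteration style matters at the 1M–10M row scale mentioned, here is a minimal sketch contrasting a row-wise loop with the vectorized equivalent (this is a generic pandas idiom, not necessarily the commenter's suggestion; the column names are toy stand-ins):

```python
import pandas as pd

# Toy stand-in for a segments table (~1M-10M rows in a real corpus)
df = pd.DataFrame({"segment_onset": range(1000),
                   "segment_offset": range(5, 1005)})

# Row-wise iteration (what iterrows() does, only slower):
durations_loop = [row.segment_offset - row.segment_onset
                  for row in df.itertuples(index=False)]

# Vectorized equivalent, typically orders of magnitude faster at scale:
durations_vec = (df["segment_offset"] - df["segment_onset"]).tolist()

assert durations_loop == durations_vec  # same result, very different cost
```

At a thousand rows the difference is invisible; at ten million it dominates the runtime of the extraction.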
Hi, I ran the tests, and I've also added some edge cases to `tests/test_conversations.py`:

```python
from functools import partial

import numpy as np
import pandas as pd
import pytest

# `project` and `am` are fixtures from the test suite; `fake_vocs` and
# `StandardConversations` come from the package under test.


def test_empty_conversations(project, am):
    empty_segments = pd.DataFrame(
        columns=["segment_onset", "segment_offset", "speaker_type",
                 "time_since_last_conv", "conv_count"])
    am.import_annotations(
        pd.DataFrame(
            [{"set": "empty_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, empty_segments),
    )
    std = StandardConversations(project, setname='empty_conv')
    results = std.extract()
    assert results.empty, "The result should be empty for an empty dataset"


def test_nan_values(project, am):
    nan_segments = pd.DataFrame({
        "segment_onset": [np.nan, 10, 20],
        "segment_offset": [5, np.nan, 25],
        "speaker_type": ["CHI", np.nan, "FEM"],
        "time_since_last_conv": [np.nan, 15, 5],
        "conv_count": [1, 1, 2],
    })
    am.import_annotations(
        pd.DataFrame(
            [{"set": "nan_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, nan_segments),
    )
    std = StandardConversations(project, setname='nan_conv')
    results = std.extract()
    assert not results.empty, "The result should not be empty for a dataset with NaN values"


def test_single_entry_conversation(project, am):
    single_segment = pd.DataFrame({
        "segment_onset": [0],
        "segment_offset": [5],
        "speaker_type": ["CHI"],
        "time_since_last_conv": [np.nan],
        "conv_count": [1],
    })
    am.import_annotations(
        pd.DataFrame(
            [{"set": "single_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, single_segment),
    )
    std = StandardConversations(project, setname='single_conv')
    results = std.extract()
    assert len(results) == 1, "The result should contain one conversation for a single entry dataset"


def test_incorrect_data_types(project, am):
    incorrect_types = pd.DataFrame({
        "segment_onset": ["0", "10", "20"],
        "segment_offset": ["5", "15", "25"],
        "speaker_type": ["CHI", "FEM", "MAN"],
        "time_since_last_conv": ["nan", "15", "5"],
        "conv_count": [1, 1, 2],
    })
    am.import_annotations(
        pd.DataFrame(
            [{"set": "incorrect_types_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, incorrect_types),
    )
    std = StandardConversations(project, setname='incorrect_types_conv')
    # The code should raise an exception for incorrect data types
    with pytest.raises(Exception):
        std.extract()


def test_unsorted_annotations(project, am):
    unsorted_segments = pd.DataFrame({
        "segment_onset": [20, 0, 10],
        "segment_offset": [25, 5, 15],
        "speaker_type": ["FEM", "CHI", "MAN"],
        "time_since_last_conv": [5, np.nan, 15],
        "conv_count": [2, 1, 1],
    })
    am.import_annotations(
        pd.DataFrame(
            [{"set": "unsorted_conv",
              "raw_filename": "file.its",
              "time_seek": 0,
              "recording_filename": "sound.wav",
              "range_onset": 0,
              "range_offset": 30000000,
              "format": "csv",
              }]
        ),
        import_function=partial(fake_vocs, unsorted_segments),
    )
    std = StandardConversations(project, setname='unsorted_conv')
    results = std.extract()
    assert not results.empty, "The result should not be empty for unsorted annotations"
```

However, I encountered some failures:

```
======== short test summary info ========
FAILED tests/test_conversations.py::test_empty_conversations - NameError: name 'grouper' is not defined
FAILED tests/test_conversations.py::test_nan_values - pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
FAILED tests/test_conversations.py::test_incorrect_data_types - TypeError: can only concatenate str (not "int") to str
```

It seems there are some issues with how exceptions are handled for these cases. Let me know if you think it's necessary to address them; otherwise, the checklist has been followed successfully!
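The `IntCastingNaNError` in `test_nan_values` comes from a general pandas constraint: a float column containing NaN cannot be cast to a plain integer dtype. One common way to handle it, shown here as a sketch rather than the fix actually used in the PR, is pandas' nullable integer dtype:

```python
import numpy as np
import pandas as pd

# A float column holding NaN cannot be cast to a plain int dtype;
# onsets.astype(int) here would raise IntCastingNaNError.
onsets = pd.Series([np.nan, 10.0, 20.0])

# pandas' nullable "Int64" dtype keeps the missing value instead of raising:
nullable = onsets.astype("Int64")
print(nullable.isna().sum())  # prints 1: the NaN survives as pd.NA
```

Whether to coerce, drop, or propagate the missing values is a design decision for the pipeline; the point is only that the cast itself need not fail.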
Thank you @alix-bourree, I think indeed that we are always missing some coverage in testing, and this will help. Can you push those changes to the branch? I will give those errors a look to fix them.
Extract a summary about conversations, using the conversations type of annotation sets.

Steps:
- `iterrows()` in `_process_conversation`
- Rework the way `extract` works. Loading all the segments first was done so that retrieving the segments and then processing the conversations could be threaded. I think it is better to retrieve the segments in a loop that submits conversations to the pool asynchronously with `map_async`; when everything is submitted, wait for the results and concatenate them.