Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge Alistair's fixes with Isaac's #80

Open
wants to merge 3 commits into
base: corpus-3-release-fixes
Choose a base branch
from

Conversation

ibevers
Copy link
Contributor

@ibevers ibevers commented Oct 2, 2024

No description provided.

@satra
Copy link
Contributor

satra commented Oct 2, 2024

@alistairewj and @ibevers - why are we going away from the parallel execution? also pydra has a serial executor. if that didn't work something in the usage or assumptions is incorrect. let's fix things if they are broken now instead of the rework i'm seeing here.

@ibevers
Copy link
Contributor Author

ibevers commented Oct 2, 2024

@satra we are adding serial execution as a plan B for redundancy since parallel execution is not working with SenseLab's speech_to_text function currently.

We are working on debugging parallel execution. I believe the core challenge with parallel execution may be with the reliability of Whisper or SenseLab's speech_to_text function. Fabio told me that it does not work on shorter audio files. @alistair found that it did not work initially on every file, although I believe it worked when he retried it since he was able to get a complete set of transcripts. @alistairewj told me that the parallel execution is currently not working after adding error handling and retry logic for transcription.

@satra
Copy link
Contributor

satra commented Oct 2, 2024

this does not seem to be an issue of serial vs parallel (both of which pydra can handle without change of code). it's more a question of what happens when some file is not processed appropriately (resilience, something pydra handles internally as well by creating an exception/runtime error that is tracked at the task level). and perhaps also constructing the task and workflow properly. for example, having retries in a task.

@fabiocat93
Copy link
Contributor

@ibevers also, can you please open an issue on senselab describing the issue you are facing and reporting all steps to reproduce it? thanks!

@ibevers
Copy link
Contributor Author

ibevers commented Oct 2, 2024

Yes, I suppose talking about it as a serial/parallel issue does not really focus on the key problem. @fabiocat93 Yes, I will do that. Thank you!

@alistairewj
Copy link
Collaborator

alistairewj commented Oct 3, 2024

Yeah so I ran everything in serial because debugging the parallel execution was challenging. I fixed the issues with pipeline and its use with senselab. They were:

  1. If the language isn't specified, the transcription fails. I fixed this by specifying the language as English.
  2. Randomly HFModel fails to validate the data. I couldn't find any consistency as to why this would happen - just on a random record. I don't actually think this is a senselab issue. a try/except around the validation error handles this fine since it doesn't repeatedly happen. I wish I had a good explanation for it, but HF does some pretty elaborate things when it tries to load a model; for all I know it's simply an API call which fails and isn't propagated properly through the traceback.

For the parallel vs. serial processing, this was my choice really not Isaac's. I've been finding pydra very hard to use. The documentation is sparse, and so it wasn't easy for me to (quickly) get the debugging to work. Specifically I can't find where the Task object is documented with its methods - the docs only describe ShellCommandTask. There are not really any examples to learn from either. :(

The code to do the parallel processing is here:

# Get paths to every audio file.
audio_paths = get_audio_paths(bids_dir_path=bids_dir_path)
# Initialize the Pydra workflow.
ef_wf = pydra.Workflow(
name="ef_wf", input_spec=["audio_paths"], audio_paths=audio_paths, cache_dir=None
)
# Run wav_to_features for each audio file.
ef_wf.add(
wav_to_features(
name="features",
wav_path=ef_wf.lzin.audio_paths,
transcription_model_size=transcription_model_size,
with_sensitive=with_sensitive,
).split("wav_path", wav_path=ef_wf.lzin.audio_paths)
)
ef_wf.set_output(
{"audio_paths": ef_wf.lzin.audio_paths}
)
with pydra.Submitter(plugin="cf") as run:
run(ef_wf)
return ef_wf

This fails with this error:

Exception has occurred: AttributeError
'Workflow' object has no attribute 'ef_wf'
  File "/Users/alistairewj/git/b2aiprep/src/b2aiprep/prepare/prepare.py", line 243, in extract_features_workflow
    ef_wf.set_output(
  File "/Users/alistairewj/git/b2aiprep/src/b2aiprep/prepare/prepare.py", line 315, in prepare_bids_like_data
    extract_features_workflow(
  File "/Users/alistairewj/git/b2aiprep/src/b2aiprep/cli.py", line 111, in prepbidslikedata
    prepare_bids_like_data(
  File "/Users/alistairewj/git/b2aiprep/src/b2aiprep/cli.py", line 413, in <module>
    main()  # pylint: disable=no-value-for-parameter
    ^^^^^^
AttributeError: 'Workflow' object has no attribute 'ef_wf'

It seems like the name of the workflow (ef_wf) is being used as an attribute.. but I don't know why. I suspect this is an easy error to fix, but I'm not sure how!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants