What is the optimal size for the data to be processed (sbert_resolver_pipeline) in Spark NLP for Healthcare? #316

Can you try writing the raw clinical_note_df to disk as parquet, reading it back as parquet, and then transforming with the pipeline to get the resolutions? The most expensive part of your pipeline is the sbert_embedder stage, where the embeddings are collected from sbert. We just released much lighter versions of the sbert embedders tonight, but no compatible resolver has been released yet, so please try writing the raw clinical_note_df to disk as parquet first and then monitor your CPU usage while resolving (transform). See the sketch below.
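
A minimal sketch of that write → read → transform flow, assuming a fitted resolver PipelineModel (called `resolver_pipeline_model` here for illustration) and the `clinical_note_df` DataFrame from this thread; the parquet paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Write the raw notes to disk as parquet.
clinical_note_df.write.mode("overwrite").parquet("/tmp/clinical_notes_raw.parquet")

# 2. Read them back, so the resolver run starts from a materialized source
#    rather than recomputing the upstream lineage.
notes_df = spark.read.parquet("/tmp/clinical_notes_raw.parquet")

# 3. Transform with the full resolver pipeline and persist the resolutions,
#    monitoring CPU usage while this step runs.
resolutions_df = resolver_pipeline_model.transform(notes_df)
resolutions_df.write.mode("overwrite").parquet("/tmp/clinical_notes_resolved.parquet")
```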

If that doesn't help, can you try ending your pipeline right after the NER converter, then saving to parquet and sharing the timing again? That way we would know whether the NER part is also a blocker…
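
A hedged sketch of that diagnostic run: rebuild the pipeline but stop right after the NER converter, so the timing isolates the NER stages from the sbert embedder and resolver stages. The stage variable names below (document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter) are assumptions for illustration, not taken from the original post:

```python
import time
from pyspark.ml import Pipeline

# Pipeline truncated right after the NER converter: no sbert embedder, no resolver.
ner_only_pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,   # pipeline ends here
])

ner_only_model = ner_only_pipeline.fit(notes_df)

start = time.time()
ner_chunks_df = ner_only_model.transform(notes_df)
ner_chunks_df.write.mode("overwrite").parquet("/tmp/ner_chunks.parquet")
print(f"NER-only pipeline took {time.time() - start:.1f} s")
```

Comparing this timing against the full run shows how much of the total cost comes from NER versus the sbert embedding and resolution stages.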

Answer selected by JustHeroo