What is the optimal size for the data to be processed (sbert_resolver_pipeline) in Spark NLP for Healthcare? #316
-
I have got my pipeline running with …
Replies: 1 comment
-
Can you try writing the raw clinical_note_df to disk as parquet, reading it back as parquet, and then transforming with the pipeline to get the resolutions? The most expensive part of your pipeline is the sbert_embedder stage, where the embeddings are collected from sbert. We just released much lighter versions of the sbert embedders tonight, but no compatible resolver has been released yet, so please try writing the raw clinical_note_df to disk as parquet first and then monitor your CPU usage while resolving (transform).

If that doesn't help, can you try ending your pipeline right after the NER converter, saving to parquet, and sharing the timing again? That way we would know whether the NER part is also a blocker, and I may suggest another approach for the sbert_embedder step later on in a DM.
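In case it helps, here is a minimal sketch of that parquet round-trip, assuming an active `spark` session, the raw `clinical_note_df`, and an already fitted `pipeline_model`; the paths and variable names are just placeholders:

```python
# Hypothetical sketch, not the exact pipeline from the question.
raw_path = "/tmp/clinical_notes_raw.parquet"  # illustrative path

# 1) Materialize the raw notes on disk so the resolver transform starts
#    from parquet instead of a long upstream lineage.
clinical_note_df.write.mode("overwrite").parquet(raw_path)

# 2) Read the parquet back and run the resolver pipeline on it.
notes_df = spark.read.parquet(raw_path)
resolutions_df = pipeline_model.transform(notes_df)

# 3) Write the resolutions out as well, so the expensive sbert_embedder /
#    resolver stages execute exactly once while you watch CPU usage.
resolutions_df.write.mode("overwrite").parquet("/tmp/resolutions.parquet")
```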
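And a sketch of the second experiment, stopping right after the NER converter to isolate the NER timing; the stage variables here are assumptions, so swap in your own stages:

```python
import time

from pyspark.ml import Pipeline

# Assumed stage variables from the original pipeline; the point is to end
# the pipeline at ner_converter, before any sbert_embedder / resolver stage.
ner_only_pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
])

ner_only_model = ner_only_pipeline.fit(notes_df)

start = time.time()
ner_only_model.transform(notes_df) \
    .write.mode("overwrite").parquet("/tmp/ner_only.parquet")
print(f"NER-only transform + write took {time.time() - start:.1f} s")
```

If the NER-only run is already slow, the bottleneck is upstream of the resolver; if it is fast, the sbert_embedder/resolver stages are where the time goes.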