What is the optimal size for the data to be processed (sbert_resolver_pipeline) in Spark NLP for Healthcare? #316
-
I have got my pipeline running with …
Replies: 1 comment
-
Can you try writing the raw clinical_note_df to disk as parquet, reading it back as parquet, and then transforming with the pipeline to get the resolutions? The most expensive part of your pipeline is the sbert_embedder stage, where the embeddings are collected from sbert. We just released much lighter versions of the sbert embedders tonight, but no compatible resolver has been released yet, so please try writing the raw clinical_note_df to disk as parquet first and then monitor your CPU usage while resolving (transform).

If that doesn't help, can you try ending your pipeline right after the NER converter, saving to parquet, and sharing the timing again? That way we would know whether the NER part is also a blocker, and I may suggest another approach for the sbert_embedder step later on in a DM.
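In case it helps, here is a minimal sketch of that parquet round-trip, assuming an active `spark` session, the raw `clinical_note_df`, and an already fitted `pipeline_model`; the paths and variable names are just placeholders:

```python
# Hypothetical sketch, not the exact pipeline from the question.
raw_path = "/tmp/clinical_notes_raw.parquet"  # illustrative path

# 1) Materialize the raw notes on disk so the resolver transform starts
#    from parquet instead of a long upstream lineage.
clinical_note_df.write.mode("overwrite").parquet(raw_path)

# 2) Read the parquet back and run the resolver pipeline on it.
notes_df = spark.read.parquet(raw_path)
resolutions_df = pipeline_model.transform(notes_df)

# 3) Write the resolutions out as well, so the expensive sbert_embedder /
#    resolver stages execute exactly once while you watch CPU usage.
resolutions_df.write.mode("overwrite").parquet("/tmp/resolutions.parquet")
```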
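And a sketch of the second experiment, stopping right after the NER converter to isolate the NER timing; the stage variables here are assumptions, so swap in your own stages:

```python
import time

from pyspark.ml import Pipeline

# Assumed stage variables from the original pipeline; the point is to end
# the pipeline at ner_converter, before any sbert_embedder / resolver stage.
ner_only_pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
])

ner_only_model = ner_only_pipeline.fit(notes_df)

start = time.time()
ner_only_model.transform(notes_df) \
    .write.mode("overwrite").parquet("/tmp/ner_only.parquet")
print(f"NER-only transform + write took {time.time() - start:.1f} s")
```

If the NER-only run is already slow, the bottleneck is upstream of the resolver; if it is fast, the sbert_embedder/resolver stages are where the time goes.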