Problem Statement
Need to extract a linkage from NDC -> image file name from DailyMed XML.
Criteria for Success
Data mart for NDC -> image
Additional Information
I looked through DailyMed's SPL stylesheet.
I think there are some neat XML tricks we can learn from it, and my main takeaway is that if we can really understand how DailyMed crafts the XML templates for their website, that's the closest thing we have to a source of truth.
Specifically for the ObservationMedia handling, I think we are doing basically what DailyMed is doing, though there is some specialized processing on their end that may or may not be important.
Probably the bigger question is how we tackle the final piece: consuming the focused XML sections (gleaned / transformed / compiled using XSLT templates) from each pathway. See the sketch below.
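For illustration, here is a minimal sketch of consuming one of those XSLT pathways in Python with lxml. The file names are placeholders, not actual files in the repo:

```python
from lxml import etree

# Hypothetical focused XSLT template and a single SPL document:
# both file names are placeholders.
xslt_root = etree.parse("observation_media.xsl")
transform = etree.XSLT(xslt_root)

spl_doc = etree.parse("spl.xml")
focused = transform(spl_doc)  # the smaller, focused XML document

print(etree.tostring(focused, pretty_print=True).decode())
```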
If we need to OCR images, does that mean we need to unzip all the zip files to get the images out? I'm not sure how much storage space that would take up, but I assume it would be pretty large. Would it make more sense to OCR a hosted image instead of a local one? We could get the DailyMed image URL from the XML and point the OCR tool at that URL instead of a local file. There are also a lot of images that have nothing to do with labels (e.g. chemical structures or administration instruction diagrams) that we don't need to bother unzipping and/or OCR-ing.
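As a rough sketch of the hosted-image approach, assuming pytesseract/Pillow are installed and a Tesseract binary is available; the URL below is a made-up placeholder, not a verified DailyMed endpoint:

```python
import io

import requests
from PIL import Image
import pytesseract

# Placeholder URL: the real image URL would come from the observationMedia
# section of the SPL XML, not be hard-coded like this.
image_url = "https://dailymed.nlm.nih.gov/dailymed/image.cfm?name=example.jpg"

response = requests.get(image_url, timeout=30)
response.raise_for_status()

image = Image.open(io.BytesIO(response.content))
label_text = pytesseract.image_to_string(image)  # requires tesseract on PATH
print(label_text)
```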
If we leave everything zipped (as we do currently), we could spit out a smaller, more focused XML document that Python/pandas can parse pretty easily with XPath to create the columns of a dataframe. I am doing the equivalent of this currently in my branch (https://github.com/coderxio/sagerx/tree/jrlegrand/dailymed), but using SQL: the smaller XML document is stored in an xml column in Postgres, and dbt models then use SQL to do essentially what pandas would do, converting the smaller XML document into columns in one or more tables.
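For comparison, a sketch of the pandas equivalent: read the XML member straight out of the zip (no extraction) and let pandas.read_xml build the dataframe columns via XPath. The archive/member names and the XPath are illustrative; SPL XML is namespaced, hence the prefix mapping:

```python
import io
import zipfile

import pandas as pd

# Placeholder archive and member names.
with zipfile.ZipFile("spl_archive.zip") as zf:
    xml_bytes = zf.read("spl.xml")  # read the XML without extracting anything

# SPL XML lives in the urn:hl7-org:v3 namespace, so the XPath needs a
# prefix mapping. The XPath itself is illustrative.
df = pd.read_xml(
    io.BytesIO(xml_bytes),
    xpath=".//hl7:observationMedia",
    namespaces={"hl7": "urn:hl7-org:v3"},
)
print(df.head())
```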
Using pandas would mean these tables are materialized.
Using dbt means we can decide whether we want them to be materialized in the sagerx_lake schema (this might be a weird use of dbt - maybe they would end up as materialized staging tables in sagerx_dev), or whether we want them to be normal staging views in sagerx_dev.
I don't know what the performance or memory usage limitations would be for either option, but I assume it might be better to go the pandas route for memory reasons... not sure. I did run into an error (#238) when originally trying to load ALL SPLs, but things have changed since then, which may make that error moot.
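If memory does turn out to be the bottleneck when loading ALL SPLs, one option (regardless of pandas vs. dbt) is a streaming parse. A sketch using lxml's iterparse; the tag name is the standard SPL element, but the ID attribute lookup at the end is an assumption about the structure:

```python
from lxml import etree

def iter_media(path):
    """Yield observationMedia elements one at a time, freeing memory as we go."""
    context = etree.iterparse(path, tag="{urn:hl7-org:v3}observationMedia")
    for _, elem in context:
        yield elem
        # Standard iterparse memory pattern: clear the element and drop
        # already-processed siblings so the tree doesn't grow unbounded.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

for media in iter_media("spl.xml"):  # placeholder file name
    print(media.get("ID"))  # assumes an ID attribute on observationMedia
```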