CDC Health Topics #994

Open
Popolechien opened this issue May 10, 2024 · 13 comments
Assignees
Labels
Bug Something isn't working Medical Medical related Content Upstream For tickets which are waiting for an upstream modification (typically scrapper or target website) Zimit

Comments

@Popolechien
Collaborator

  • Website URL: https://www.cdc.gov/health-topics.html
  • License: Public domain
  • Desired ZIM Title: Center for Disease Control Health Topics
  • Desired ZIM Description: Find diseases and conditions
  • Desired ZIM Icon –png (URL or attach one): n/a
  • Language (ISO 639-3): eng
  • Is this a MediaWiki?: no
@Popolechien Popolechien added the Medical Medical related Content label May 10, 2024
@MrnateGeek

I do have a .png icon:
https://www.cdc.gov/TemplatePackage/3.0/images/Win8_tile_70x70.png

Sounds like someone doesn't know how to read the HTML code

@benoit74
Contributor

Looks like it is going to be tough to select only health topics, and I don't get why the rest of the information is not valuable.

I've created https://farm.openzim.org/recipes/www.cdc.gov_en_all for now, limited to 100 pages.

The site is also available in Spanish, so I've configured https://farm.openzim.org/recipes/www.cdc.gov_es_all as well, also limited to 100 pages for now.

@benoit74
Contributor

Just discussed with Popolechien: we will ZIM the whole website. It is too complex to isolate only the health topics, and #995 is on another domain, so there is nothing to worry about there.

@RavanJAltaie
Contributor

Recipe created
https://farm.openzim.org/recipes/cdc.gov_en_health-topics
I'll post the library link once ready

@benoit74
Contributor

@RavanJAltaie please read comments before duplicating effort.

The issue is already assigned to me, so I'm working on it. And I've already created the recipe, as mentioned a few comments above. You are wasting everyone's time here.

@benoit74
Contributor

Custom CSS configured and full run requested on https://farm.openzim.org/recipes/www.cdc.gov_en_all and https://farm.openzim.org/recipes/www.cdc.gov_es_all

@benoit74
Contributor

The ES version is mostly ready to review. I just had to tweak the CSS a bit more to remove even more search boxes (I don't recall exactly how many, but there are just so many!). I will update the issue once this is ready to review.

The EN version got interrupted by the huge page at https://www.cdc.gov/about/advisory-committee-director/meetings-archive.html. I consider this page not vital information, painful to crawl, and probably a huge contributor to ZIM size if crawled, so I decided to exclude it from the ZIM for now. Please speak up if you don't understand or disagree.

@RavanJAltaie
Contributor

@benoit74 I agree with the exclusion

@benoit74
Contributor

Regarding EN version, the result is quite huge: 36.09 GB

Exploring this ZIM a bit, I can say that:

  • 13.05 GB are used by YouTube videos
  • 3.84 GB are used by other videos hosted at www.cdc.gov/video
    • there are very few videos, and they seem pretty big (see detailed listing below)
  • 6.9 GB are reports/data files under www.cdc.gov/healthyyouth
    • lots of big ZIP / DAT files, which could be excluded

All that being said, it looks like it is going to be painful to curate the EN ZIM of cdc.gov down to an acceptable size, something like the 1.7 GB of Medline Plus.
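For a rough sense of scale, here is a small arithmetic sketch using only the figures quoted above. It assumes the three categories do not overlap and that excluding them would shrink the ZIM by exactly their reported sizes, which may not hold in practice.

```python
# Sizes in GB, copied from the breakdown in this comment.
total_gb = 36.09
identified_gb = {
    "YouTube videos": 13.05,
    "videos under www.cdc.gov/video": 3.84,
    "reports/data under www.cdc.gov/healthyyouth": 6.9,
}
target_gb = 1.7  # approximate size of the Medline Plus ZIM mentioned above

# Assumption: the categories are disjoint, so their sizes can simply be added.
identified_total = sum(identified_gb.values())
remaining_gb = total_gb - identified_total

print(f"identified: {identified_total:.2f} GB")
print(f"remaining after excluding all three: {remaining_gb:.2f} GB")
print(f"ratio to target: {remaining_gb / target_gb:.1f}x")
```

Under those assumptions, roughly 12.3 GB would remain even after dropping all three categories, still several times the Medline Plus figure.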

The ES version, on the other hand, seems quite OK at first look at https://dev.library.kiwix.org/#lang=&q=cdc, but in fact many pages are broken or missing.

Videos hosted at www.cdc.gov/video:

3.84 GB /www.cdc.gov/video
1.94 GB /www.cdc.gov/video/tuskegee
1.47 GB /www.cdc.gov/video/tuskegee/334468_Tuskegee_Study_Personal_Reflections_Edit_17_lowres.mp4
463.77 MB /www.cdc.gov/video/tuskegee/Webcast-Tuskegee-Remembrance-low-res.mp4
897.54 MB /www.cdc.gov/video/shepardawards
897.54 MB /www.cdc.gov/video/shepardawards/2022_Shepard_Awards_Edited_LR.mp4
502.71 MB /www.cdc.gov/video/climate-health
502.71 MB /www.cdc.gov/video/climate-health/eval
127.58 MB /www.cdc.gov/video/climate-health/eval/DEHSP-Video-4-SD.mp4
119.48 MB /www.cdc.gov/video/climate-health/eval/DEHSP-Video-5-SD.mp4
101.79 MB /www.cdc.gov/video/climate-health/eval/DEHSP-Video-3-SD.mp4
83.31 MB /www.cdc.gov/video/climate-health/eval/DEHSP-Video-1-SD.mp4
70.54 MB /www.cdc.gov/video/climate-health/eval/DEHSP-Video-2-SD.mp4
407.37 MB /www.cdc.gov/video/phgr
407.37 MB /www.cdc.gov/video/phgr/btd
359.83 MB /www.cdc.gov/video/phgr/btd/2019
359.83 MB /www.cdc.gov/video/phgr/btd/2019/309688_BtdAdolescentHealth.wmv
47.53 MB /www.cdc.gov/video/phgr/btd/2015
47.53 MB /www.cdc.gov/video/phgr/btd/2015/Btd_DASH.wmv
95.82 MB /www.cdc.gov/video/cdctv
95.82 MB /www.cdc.gov/video/cdctv/influenza
95.82 MB /www.cdc.gov/video/cdctv/influenza/NCIRD_P1Flu_Video01_HowDoesFluMakeYouSick_16x9_HIRES.mp4
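The per-directory totals in this listing can be cross-checked by rolling up the individual file sizes. The sketch below is not the tool that produced the listing (the thread does not say what did); it assumes decimal units (1 GB = 1000 MB) and takes the file sizes as printed, so totals may differ from the listing by a rounding step.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Leaf file sizes in MB, copied from the listing (1.47 GB taken as 1470 MB).
FILES_MB = {
    "/www.cdc.gov/video/tuskegee/334468_Tuskegee_Study_Personal_Reflections_Edit_17_lowres.mp4": 1470.0,
    "/www.cdc.gov/video/tuskegee/Webcast-Tuskegee-Remembrance-low-res.mp4": 463.77,
    "/www.cdc.gov/video/shepardawards/2022_Shepard_Awards_Edited_LR.mp4": 897.54,
    "/www.cdc.gov/video/climate-health/eval/DEHSP-Video-4-SD.mp4": 127.58,
    "/www.cdc.gov/video/climate-health/eval/DEHSP-Video-5-SD.mp4": 119.48,
    "/www.cdc.gov/video/climate-health/eval/DEHSP-Video-3-SD.mp4": 101.79,
    "/www.cdc.gov/video/climate-health/eval/DEHSP-Video-1-SD.mp4": 83.31,
    "/www.cdc.gov/video/climate-health/eval/DEHSP-Video-2-SD.mp4": 70.54,
    "/www.cdc.gov/video/phgr/btd/2019/309688_BtdAdolescentHealth.wmv": 359.83,
    "/www.cdc.gov/video/phgr/btd/2015/Btd_DASH.wmv": 47.53,
    "/www.cdc.gov/video/cdctv/influenza/NCIRD_P1Flu_Video01_HowDoesFluMakeYouSick_16x9_HIRES.mp4": 95.82,
}


def rollup(files_mb):
    """Return {directory: total MB of all files anywhere beneath it}."""
    totals = defaultdict(float)
    for path, size in files_mb.items():
        # Every ancestor directory of a file accumulates that file's size.
        for parent in PurePosixPath(path).parents:
            totals[str(parent)] += size
    return dict(totals)


totals = rollup(FILES_MB)
# Print the largest directories first, restricted to the video tree.
for directory, size in sorted(totals.items(), key=lambda kv: -kv[1]):
    if directory.startswith("/www.cdc.gov/video"):
        print(f"{size:9.2f} MB  {directory}")
```

With these inputs, the total for /www.cdc.gov/video comes out at about 3837 MB, consistent with the 3.84 GB reported.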

I honestly don't really know how to progress on this. Maybe we should just give up for now, considering that the website structure makes it too complex to reach our goal (extract only the important health topics, and not all the cdc.gov videos / data / ...), and come back in a few years to see if this has changed. Any other ideas?

@Popolechien
Collaborator Author

Yeah, I was going to suggest we go ahead and generate the zim file anyway but there truly is too much junk / internal comms. Let's park it for the time being and maybe make a note of reaching out to suggest that their next revamp better segregate content.

@benoit74 benoit74 added Bug Something isn't working Upstream For tickets which are waiting for an upstream modification (typically scrapper or target website) labels Oct 28, 2024
@john8952

john8952 commented Jan 7, 2025

I finished generating a ZIM of cdc.gov and wanted to share what I learned here.

As discovered above and in another issue, there is a large YouTube video at https://www.cdc.gov/about/advisory-committee-director/meetings-archive.html that breaks the crawl.

I ran a new crawl with autoPlay disabled, which resolved the issue with the YouTube video, but the crawler then seemed to break on an .mp4 file (also noted in the previously mentioned issue). The crawl produced a ZIM of ~160 GB, and I would guess the full one would be around 200 GB.

I modified the crawl to exclude .mp4 links. However, there was an .mp4 embedded at https://www.cdc.gov/coca/hcp/trainings/inflammatory-syndrome-children-mis-c.html which still broke the crawl. I was able to run the crawl by passing in the crawl YAML file while keeping the other parameters the same, then generate the ZIM with the WARCs from both crawls.

My final ZIM is 105.9 GB, generated with commands using the parameters below:

docker run --rm -v /path/to/zims:/output ghcr.io/openzim/zimit zimit --custom-css=https://drive.farm.openzim.org/zimit_custom_css/www.cdc.gov.css --description="Information of US Centers for Disease Control and Prevention" --exclude="(^https:\/\/(www\.cdc\.gov\/spanish\/|www\.cdc\.gov\/.*\/es\/|espanol\.cdc\.gov\/|www\.cdc\.gov\/about\/advisory-committee-director\/meetings-archive.html|.*\.mp4$))|(^http:\/\/(www\.cdc\.gov\/spanish\/|www\.cdc\.gov\/.*\/es\/|espanol\.cdc\.gov\/|www\.cdc\.gov\/about\/advisory-committee-director\/meetings-archive.html|.*\.mp4$))" --name="www.cdc.gov_en_all_novid" --title="US Center for Disease Control" --url=https://www.cdc.gov/ --zim-lang=eng --scopeType host --keep --behaviors autofetch,siteSpecific
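To sanity-check which URLs the --exclude pattern above would drop, it can be tried against a few sample URLs. This is only a sketch: it uses Python's `re.search`, on the assumption that it behaves like the crawler's own regex matching for this particular pattern, and the two `example.mp4` URLs are made up for illustration.

```python
import re

# The --exclude value from the command above, copied verbatim.
EXCLUDE = re.compile(
    r"(^https:\/\/(www\.cdc\.gov\/spanish\/|www\.cdc\.gov\/.*\/es\/|espanol\.cdc\.gov\/"
    r"|www\.cdc\.gov\/about\/advisory-committee-director\/meetings-archive.html|.*\.mp4$))"
    r"|(^http:\/\/(www\.cdc\.gov\/spanish\/|www\.cdc\.gov\/.*\/es\/|espanol\.cdc\.gov\/"
    r"|www\.cdc\.gov\/about\/advisory-committee-director\/meetings-archive.html|.*\.mp4$))"
)


def is_excluded(url):
    """True if the exclude pattern matches the URL (assumed search semantics)."""
    return EXCLUDE.search(url) is not None


samples = [
    # URLs taken from this thread
    "https://www.cdc.gov/health-topics.html",
    "https://www.cdc.gov/about/advisory-committee-director/meetings-archive.html",
    "https://www.cdc.gov/coca/hcp/trainings/inflammatory-syndrome-children-mis-c.html",
    # Hypothetical URLs, for illustration only
    "https://www.cdc.gov/video/example.mp4",
    "https://www.cdc.gov/video/example.mp4?download=1",
]

for url in samples:
    print("EXCLUDED" if is_excluded(url) else "kept    ", url)
```

Two things stand out under these assumptions. The MIS-C training page is not matched, since the pattern looks only at the page URL and not at media embedded in it, which is consistent with the crawl still breaking there. And because the `.mp4` alternative ends with `$`, a URL with a query string after `.mp4` would not be excluded.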

@benoit74
Contributor

benoit74 commented Jan 7, 2025

Thanks a lot @john8952

You should probably be able to exclude "embedded" mp4 files (and YouTube videos) with the --blockRules parameter, but this is not yet exposed by Zimit; you will have to wait for openzim/zimit#433 to be solved and released.

@john8952

john8952 commented Jan 7, 2025

Awesome thanks!


5 participants