CDC Health Topics #994
Comments
I do have a .png icon: Sounds like someone doesn't know how to read the HTML code |
Looks like it is going to be tough to select only health topics, and I don't get why the rest of the information is not valuable. I've created https://farm.openzim.org/recipes/www.cdc.gov_en_all for now, limited to 100 pages. The site is also available in Spanish, so I've configured https://farm.openzim.org/recipes/www.cdc.gov_es_all as well, also limited to 100 pages for now. |
Discussed just now with Popolechien: we will ZIM the whole website. It is too complex to isolate only health topics, and #995 is on another domain, so no worries there. |
Recipe created |
@RavanJAltaie please read comments before duplicating effort. The issue is already assigned to me, so I'm working on it. And I've already created the recipe, as mentioned a few comments above. You are wasting everyone's time here. |
Custom CSS configured and full run requested on https://farm.openzim.org/recipes/www.cdc.gov_en_all and https://farm.openzim.org/recipes/www.cdc.gov_es_all |
The ES version is mostly ready to review. I just had to tweak the CSS a bit more to remove even more search boxes (I don't recall exactly how many, but there are just so many!). I will update the issue once this is ready to review. The EN version got interrupted by the huge page at https://www.cdc.gov/about/advisory-committee-director/meetings-archive.html. I consider this page not vital information, painful to crawl, and probably a huge contributor to ZIM size if crawled, so I decided to exclude it from the ZIM for now. Please speak up if you don't understand / disagree. |
@benoit74 I agree with the exclusion |
Regarding the EN version, the result is quite huge: 36.09 GB. Exploring this ZIM a bit, I can say that:
All that being said, it means to me that it is going to be painful to curate the EN ZIM of cdc.gov to reach an acceptable size, something like the 1.7G of Medline Plus. The ES version, on the other hand, seems quite OK at first look at https://dev.library.kiwix.org/#lang=&q=cdc but in fact many pages are broken / missing. Videos hosted at www.cdc.gov/video:
I honestly don't really know how to progress on this. Maybe we should just give up for now, considering the website structure makes it too complex to reach our goal (extract only important health topics, and not all cdc.gov videos / data / ...), and come back in a few years to see if this has changed. Any other ideas? |
Yeah, I was going to suggest we go ahead and generate the zim file anyway but there truly is too much junk / internal comms. Let's park it for the time being and maybe make a note of reaching out to suggest that their next revamp better segregate content. |
I finished generating a ZIM of cdc.gov and wanted to share what I learned here. As discovered above and in another issue, there is a large YouTube video at https://www.cdc.gov/about/advisory-committee-director/meetings-archive.html that breaks the crawl. I ran a new crawl with autoPlay disabled, which resolved the issue with the YouTube video, but the crawler then seemed to break on an .mp4 file (also noted in the previously mentioned issue). That crawl produced a ZIM of ~160GB, and I would guess the full one would be around 200GB. I modified the crawl to exclude .mp4 links. However, there was an .mp4 embedded at https://www.cdc.gov/coca/hcp/trainings/inflammatory-syndrome-children-mis-c.html which still broke the crawl. I was able to run the crawl by passing in the crawl YAML file while keeping other parameters the same, then generate the ZIM with the WARCs from both crawls. My final ZIM is 105.9GB, using commands with the below parameters:
|
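To illustrate the kind of exclusion discussed above (skipping one known-problematic page and any link ending in .mp4), here is a minimal Python sketch. It is not the crawler configuration actually used in this issue: the names `EXCLUDED_PAGES`, `MP4_PATH`, and `should_skip` are hypothetical, and the rule set is an assumption chosen only to show the idea.

```python
import re
from urllib.parse import urlsplit

# Hypothetical exclusion rules, for illustration only.
# The single page listed here is the one reported as breaking the crawl.
EXCLUDED_PAGES = {
    "https://www.cdc.gov/about/advisory-committee-director/meetings-archive.html",
}

# Matches URL paths that end in ".mp4", regardless of letter case.
MP4_PATH = re.compile(r"\.mp4$", re.IGNORECASE)


def should_skip(url: str) -> bool:
    """Return True if the URL is an excluded page or points at an .mp4 file."""
    parts = urlsplit(url)
    # Compare without the query string or fragment.
    bare = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return bare in EXCLUDED_PAGES or bool(MP4_PATH.search(parts.path))
```

A filter like this only acts on URLs the crawler is about to visit. It would not catch a video embedded inside an otherwise allowed page, which matches the behaviour reported above for the MIS-C training page.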
Thanks a lot @john8952! You should probably be able to exclude "embedded" mp4 (and YouTube videos) with the |
Awesome thanks! |