how to parse attachments/files and download them! #3
Comments
thx. well, "aye" but the pkg doesn't do it yet. I've had the "scraping api" on my "todo" list for a while but haven't had the time to work on it. Ref: https://archive.org/help/aboutsearch.htm & https://archive.org/advancedsearch.php & https://archive.readme.io/docs/ Lemme see how much effort it'll take to add in support (paginated APIs on resource-constrained sites are so not-fun to work with). |
@hrbrmstr amazing, that would be great. I really believe this is what most people do with the archive: "How can I get that annoying old zip file that was available 3 years ago??"
When you get some time, it'd be 👍 If you could poke at the (just added) nascent "Scrape API" calls (https://github.com/hrbrmstr/wayback/blob/master/R/ia-scrape.R) and then let me know what extra helpers I should add to support the use case. |
sure of course. let me try that asap! |
Give https://github.com/hrbrmstr/wayback/blob/master/R/ia-retrieve.R |
@hrbrmstr that seems pretty neat, but I wonder if I explained correctly what I had in mind. Imagine that you are interested in the free CSVs from maxmind.com. Going to https://web.archive.org/web/*/http://maxmind.com/* (<-- add the star at the end) shows you ALL the links on the maxmind domain that were saved in the archive. You can see that there is a field where you can filter by type, say csv or pdf. This is hugely valuable because you can pull all the attachments from a website at once, but it is a real PITA because it has to be done manually. I wonder if your package can retrieve that information, or perhaps I have misunderstood what you did. Thanks!
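For readers who want to see what that wildcard listing looks like as an API call, here is a minimal Python sketch that builds the equivalent query against the public Wayback CDX server. It is not the package's implementation: `build_cdx_query` and its arguments are hypothetical names chosen for illustration, and only the endpoint and the `url`, `matchType`, `filter`, `collapse`, `fl`, `from`, `to` and `output` parameters come from the CDX server documentation.

```python
from urllib.parse import urlencode

# Public Wayback CDX endpoint (the same data the wildcard listing page shows).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(site, mime_regex=None, start=None, end=None, unique_urls=True):
    """Build a CDX query URL listing every capture under `site`.

    Hypothetical helper for illustration only; it builds the URL and does
    not perform any network request.
    """
    params = [
        ("url", site),
        ("matchType", "prefix"),   # same effect as the trailing star in the web UI
        ("output", "json"),
        ("fl", "timestamp,original,mimetype,statuscode"),
    ]
    if unique_urls:
        params.append(("collapse", "urlkey"))  # one row per distinct URL
    if mime_regex:
        # filter takes field:regex, e.g. mimetype:text/csv or mimetype:application/pdf
        params.append(("filter", f"mimetype:{mime_regex}"))
    if start:
        params.append(("from", start))  # e.g. "2015" or "20150101"
    if end:
        params.append(("to", end))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


query = build_cdx_query("maxmind.com", mime_regex="text/csv", start="2015", end="2018")
```

Leaving `mime_regex` unset returns every capture regardless of type, which is useful when the interesting files are served with an unexpected MIME type.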
AH! Gotcha. Let me see how that works memento/timemap-API-wise. Pretty sure I can rig up something. |
Looks like there's a "new-ish" CDX parameter used in that particular online query interface that I did not have support for in the package. I've added it to the Def let me know if I need to tweak this more and — if you have some time and wouldn't mind — please add yourself to the
Hrm. I just made that a bit better by also adding in support for filtering (like the web ux has). By default it only returns items with a
Now to work on the "download from that point in time" functionality. |
Looks like it's not much more than calling
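As an illustration of the "download from that point in time" step, the Python sketch below turns rows from a CDX JSON response into Wayback download URLs. It is an assumption-laden sketch rather than what the package does: `snapshot_url` and `rows_to_urls` are hypothetical helpers, the sample row is made up for the example, and the sketch relies on two documented behaviours, namely that CDX JSON output puts the field names in the first row and that the `id_` modifier after the timestamp returns the archived file without the Wayback page rewriting.

```python
def snapshot_url(timestamp, original, raw=True):
    """Return the Wayback URL for one capture.

    With raw=True the `id_` modifier asks for the archived bytes as they
    were captured, which is what you want for csv/pdf/zip downloads.
    """
    modifier = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{modifier}/{original}"


def rows_to_urls(cdx_json):
    """Map a CDX JSON response (header row + data rows) to download URLs."""
    if not cdx_json:
        return []  # an empty result set comes back as an empty list
    header, *rows = cdx_json
    ts_idx = header.index("timestamp")
    orig_idx = header.index("original")
    return [snapshot_url(row[ts_idx], row[orig_idx]) for row in rows]


# Made-up sample shaped like a CDX JSON response, for illustration only.
sample = [
    ["timestamp", "original", "mimetype", "statuscode"],
    ["20170102030405", "http://maxmind.com/example.csv", "text/csv", "200"],
]
urls = rows_to_urls(sample)
```

Each resulting URL can then be fetched with any HTTP client and written to disk; pausing between requests is a sensible courtesy on a resource-constrained service.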
Hi @hrbrmstr, sorry, yesterday I was super busy with the kiddos. I will try this tonight and let you know. And please, how can I pretend to have contributed to the package when you did all the work??? It is simply a pleasure to be able to share ideas and see them implemented so fast! Thanks!
@hrbrmstr the function While using the
Could we have an option to specify that we want everything? Indeed, once everything is downloaded locally it will be very easy to parse out the correct links. What do you think?
yep, just set the |
haha nice, thanks, I overlooked that default
Hi Harbour Master,
yet another brilliant package from you! I wonder if there is an easy way to pull all the files from an archived website in the wayback archive. For instance, something like "get all the .pdfs from all archives (in a given time range) from this website".
I do these kinds of queries manually on the wayback archive, and it is very time-consuming and annoying. Being able to do that programmatically with your package would be really nice.
What do you think?
Thanks!