how to parse attachments/files and download them! #3

Open
randomgambit opened this issue Sep 17, 2018 · 14 comments

@randomgambit

Hi Harbour Master,

yet another brilliant package from you! I wonder if there is an easy way to pull all the files from an archived website in the Wayback Machine. For instance, something like "get all the .pdfs from all snapshots of this website (in a given time range)".

I do these kinds of queries manually on the Wayback Machine, and it is very time-consuming and annoying. Being able to do that programmatically with your package would be really nice.

What do you think?
Thanks!

hrbrmstr self-assigned this Sep 17, 2018
@hrbrmstr (Owner) commented Sep 17, 2018

thx.

well, "aye" but the pkg doesn't do it yet. I've had the "scraping api" on my "todo" list for a while but haven't had the time to work on it. Ref: https://archive.org/help/aboutsearch.htm & https://archive.org/advancedsearch.php & https://archive.readme.io/docs/

Lemme see how much effort it'll take to add in support (paginated APIs on resource-constrained sites are so not-fun to work with).

@randomgambit (Author)

@hrbrmstr amazing, that would be great. I really believe this is what most people want to do with the archive: "How can I get that annoying old zip file that was available 3 years ago?"

@hrbrmstr (Owner)

When you get some time, it'd be 👍 if you could poke at the (just added) nascent "Scrape API" calls (https://github.com/hrbrmstr/wayback/blob/master/R/ia-scrape.R) and then let me know what extra helpers I should add to support the use case.
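
A minimal sketch of poking at it, assuming ia_scrape() takes an Internet Archive search query as its first argument; the query string and the idea of inspecting the returned columns are illustrative assumptions, not confirmed API docs:

library(wayback)

# sketch only: "maxmind" is an illustrative query; which columns come back
# (e.g. an item identifier) is an assumption about the nascent Scrape API
res <- ia_scrape("maxmind")
head(res)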

@randomgambit (Author)

Sure, of course. Let me try that ASAP!
Thanks!

@hrbrmstr (Owner)

Give ia_retrieve() a go. I think that might be what you were looking for (just added):

https://github.com/hrbrmstr/wayback/blob/master/R/ia-retrieve.R
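
A minimal usage sketch, assuming ia_retrieve() takes an Internet Archive item identifier and returns a data frame of that item's files with download links; the identifier and the "file"/"link" column names below are assumptions:

library(wayback)

# "examples" is a placeholder item identifier, not a real suggestion
files <- ia_retrieve("examples")
files

# assuming the listing exposes a download URL per file
download.file(files$link[1], destfile = files$file[1], mode = "wb")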

@randomgambit (Author)

@hrbrmstr that seems pretty neat, but I wonder if I explained correctly what I had in mind. Imagine that you are interested in the free CSVs from maxmind.com.

Now going to https://web.archive.org/web/*/http://maxmind.com/* (<-- note the star at the end) shows you ALL the links on the maxmind domain that were saved in the archive. You can see that there is a field where you can filter by type, say CSV or PDF.

This is hugely valuable because you can pull all the attachments from a website at once, but it is a real PITA because it has to be done manually. I wonder if your package can retrieve that information, or perhaps I have misunderstood what you did.

Thanks!

@hrbrmstr (Owner)

AH!

Gotcha. Let me see how that works memento/timemap-API-wise. Pretty sure I can rig up something.

@hrbrmstr (Owner)

Looks like there's a "new-ish" CDX parameter used in that particular online query interface that I did not have support for in the package. I've added it to the cdx_basic_query() function and (as noted below) I think it provides the assistance you were inquiring about.

Def let me know if I need to tweak this more and — if you have some time and wouldn't mind — please add yourself to the DESCRIPTION (a new person() item) as a contributor (ctb) as this was an immensely helpful suggestion and discussion.

library(wayback)
library(tidyverse)

cdx <- cdx_basic_query("http://maxmind.com/", "prefix")

filter(cdx, grepl("csv", original))
## # A tibble: 43 x 7
##    urlkey       timestamp           original         mimetype  statuscode digest     length
##    <chr>        <dttm>              <chr>            <chr>     <chr>      <chr>       <dbl>
##  1 com,maxmind… 2002-10-14 00:00:00 http://maxmind.… text/html 200        IFDVCDHMB…  2733.
##  2 com,maxmind… 2006-05-17 00:00:00 http://www.maxm… text/html 200        7HYYDOKDG…  1717.
##  3 com,maxmind… 2009-02-11 00:00:00 http://maxmind.… text/html 301        BCL36PMUW…   405.
##  4 com,maxmind… 2008-12-10 00:00:00 http://www.maxm… text/html 200        JZCCABPE7…  1962.
##  5 com,maxmind… 2009-10-18 00:00:00 http://www.maxm… text/pla… 200        WGT2VMJ6S…   957.
##  6 com,maxmind… 2003-08-15 00:00:00 http://www.maxm… text/html 404        U5A45F3Y2…   392.
##  7 com,maxmind… 2009-07-05 00:00:00 http://www.maxm… text/html 404        Y3VUK7LZQ…   413.
##  8 com,maxmind… 2009-07-05 00:00:00 http://www.maxm… text/html 404        Z4BTKJJPQ…   413.
##  9 com,maxmind… 2009-07-05 00:00:00 http://www.maxm… text/html 404        WXKDYKM67…   411.
## 10 com,maxmind… 2006-11-27 00:00:00 http://www.maxm… text/html 404        XVCYDXUBM…   421.
## # ... with 33 more rows

@hrbrmstr (Owner) commented Sep 18, 2018

Hrm. I just made that a bit better by also adding support for filtering (like the web UX has). By default it only returns items with a 200 status code.

library(wayback)
library(tidyverse)

cdx <- cdx_basic_query("http://maxmind.com/", "prefix")

(csv <- filter(cdx, grepl("\\.csv", original)))
## # A tibble: 9 x 7
##   urlkey                          timestamp           original                             mimetype  statuscode digest       length
##   <chr>                           <dttm>              <chr>                                <chr>     <chr>      <chr>         <dbl>
## 1 com,maxmind)/cityisporgsample.… 2009-10-18 00:00:00 http://www.maxmind.com:80/cityispor… text/pla… 200        WGT2VMJ6SRI… 9.57e2
## 2 com,maxmind)/download/geoip/cs… 2003-02-23 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200        2QUN23TUA24… 5.60e2
## 3 com,maxmind)/download/geoip/cs… 2003-02-23 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200        NTF247I5W5P… 7.86e2
## 4 com,maxmind)/download/geoip/da… 2006-01-11 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200        OFDELJCECME… 1.21e6
## 5 com,maxmind)/download/geoip/da… 2006-06-20 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200        3INKOCVKMG6… 1.16e6
## 6 com,maxmind)/download/geoip/da… 2007-11-11 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200        E2AT3XS3YLQ… 2.95e6
## 7 com,maxmind)/download/geoip/da… 2008-07-09 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200        4YRNZBZ4VFH… 3.76e6
## 8 com,maxmind)/download/geoip/da… 2008-08-13 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200        HG7GQQQZUV6… 3.85e6
## 9 com,maxmind)/download/geoip/mi… 2014-03-02 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200        MW7F7GGPJLG… 3.26e4

Now to work on the "download from that point in time" functionality.

@hrbrmstr (Owner)

Looks like it's not much more than calling read_memento():

dat <- read_memento(csv$original[9], as.POSIXct(csv$timestamp[9]), "raw")

readr::read_csv(dat, col_names = c("iso2c", "regcod", "name"))
## Parsed with column specification:
## cols(
##   iso2c = col_character(),
##   regcod = col_character(),
##   name = col_character()
## )
## # A tibble: 4,066 x 3
##    iso2c regcod name               
##    <chr> <chr>  <chr>              
##  1 AD    02     Canillo            
##  2 AD    03     Encamp             
##  3 AD    04     La Massana         
##  4 AD    05     Ordino             
##  5 AD    06     Sant Julia de Loria
##  6 AD    07     Andorra la Vella   
##  7 AD    08     Escaldes-Engordany 
##  8 AE    01     Abu Dhabi          
##  9 AE    02     Ajman              
## 10 AE    03     Dubai              
## # ... with 4,056 more rows
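
If you want the bytes on disk rather than parsed, a small sketch, assuming the "raw" return above is a raw vector (which is what readr::read_csv() is consuming); the output file name is arbitrary:

# write the retrieved snapshot straight to a local file
writeBin(dat, "maxmind-snapshot.csv")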

@randomgambit (Author)

Hi @hrbrmstr, sorry, yesterday I was super busy with the kiddos. I will try this tonight and let you know. And please, how can I pretend to have contributed to the package when you did all the work? It is simply a pleasure to be able to share ideas and see them implemented so fast!

Thanks!

@randomgambit (Author)

@hrbrmstr the function cdx_basic_query() looks pretty smooth. However, I wonder why searching on the archive website directly returns ~100k results:

[screenshot: Wayback Machine web UI showing the result count]

while using the API only returns 10,000:


cdx <- cdx_basic_query("https://imdb.com/", "prefix")

cdx
# A tibble: 10,000 x 7
   urlkey                          timestamp           original                            mimetype statuscode digest      length
   <chr>                           <dttm>              <chr>                               <chr>    <chr>      <chr>        <dbl>
 1 com,imdb)/                      1996-11-19 00:00:00 http://imdb.com:80/                 text/ht~ 200        XLXNEHRIAG~   1725
 2 com,imdb)/%23                   2006-05-30 00:00:00 http://www.imdb.com:80/%23          text/ht~ 200        CUH3KMB2GO~    837
 3 com,imdb)/%23imdb2.consumer.ho~ 2009-08-19 00:00:00 http://www.imdb.com:80/%23imdb2.co~ text/ht~ 200        AYZ5SY67IR~    688
 4 com,imdb)/%23imdb2.consumer.ho~ 2009-03-11 00:00:00 http://www.imdb.com:80/%23imdb2.co~ text/ht~ 200        AYZ5SY67IR~    673

Could we have an option to specify that we want everything? Indeed, once downloaded locally it will be very easy to parse the correct links.

What do you think?

@hrbrmstr (Owner) commented Sep 20, 2018

yep, just set the limit parameter to something higher than the 10K it's defaulted to ;-)
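
For example, a quick sketch: limit is the parameter named above, and 100000 is just an illustrative value.

library(wayback)

# raise the result cap above the 10K default
cdx <- cdx_basic_query("https://imdb.com/", "prefix", limit = 100000L)
nrow(cdx)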

@randomgambit (Author)

Haha, nice, thanks! I overlooked that default.
