Expose configuration file (archives.json) to be web accessible #127

Open
machawk1 opened this issue Mar 6, 2020 · 3 comments

Comments

@machawk1
Member

machawk1 commented Mar 6, 2020

Memento 1.0-RC8 exposes the list of aggregated archives on the /about endpoint and the primary web endpoint.

## Upstream Archives

1. [Archive.today](https://archive.today/)
2. [Portuguese Web Archive](https://arquivo.pt/)
3. [Perma Archive](https://perma.cc/)
4. [Stanford Web Archive](https://swap.stanford.edu/)
5. [BAnQ](https://waext.banq.qc.ca/)
6. [Archive-It](https://wayback.archive-it.org/)
7. [Icelandic Web Archive](https://wayback.vefsafn.is/)
8. [Bibliotheca Alexandrina Web Archive](https://web.archive.bibalex.org/)
9. [Internet Archive](https://web.archive.org/)
10. [Australian Web Archive](https://web.archive.org.au/)
11. [Library and Archives Canada](https://webarchive.bac-lac.gc.ca/)
12. [Library of Congress](https://webarchive.loc.gov/)
13. [UK National Archives Web Archive](https://webarchive.nationalarchives.gov/)
14. [National Records of Scotland](https://webarchive.nrscotland.gov.uk/)
15. [UK Web Archive](https://webarchive.org.uk/)
16. [UK Parliament Web Archive](https://webarchive.parliament.uk/)

It might be useful to expose the archives' respective endpoints. If an archive is disabled or "sleeping", it might also be useful to expose this information. From what I recall, the "disabled" status is present in the JSON file but the "sleeping" attribute that occurs after some number of failures is runtime generated, so that might be trickier.

Regardless, it would be useful to expose the archives.json file that is being used in the current instance.

@ibnesayeed
Member

> From what I recall, the "disabled" status is present in the JSON file but the "sleeping" attribute that occurs after some number of failures is runtime generated, so that might be trickier.

On the contrary, I think it will be trickier to report ignored archives (those disabled explicitly in the input archive list file), because we filter them off immediately after parsing the file and do not keep any records of those ignored archives in memory, as they will not contribute to the process for the entire uptime of the service. The runtime structure is easier to report; it simply requires marshaling the array into JSON.
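As a rough illustration of "marshaling the array into JSON", here is a minimal Go sketch. The `Archive` struct, its field names, and the `archivesJSON` helper are assumptions made up for this example and are not taken from the project's code. The URLs are the archive home pages listed above, not the actual aggregation endpoints.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Archive is a hypothetical runtime record for one upstream archive.
// Field names are assumptions for illustration only.
type Archive struct {
	ID      string `json:"id"`
	Name    string `json:"name"`
	URL     string `json:"url"`
	Dormant bool   `json:"dormant"` // runtime state, not present in the input file
}

// archivesJSON serializes the in-memory slice as-is, so the output reflects
// the runtime view (ignored entries were already dropped at load time).
func archivesJSON(archives []Archive) ([]byte, error) {
	return json.Marshal(archives)
}

func main() {
	archives := []Archive{
		{ID: "ia", Name: "Internet Archive", URL: "https://web.archive.org/"},
		{ID: "pt", Name: "Portuguese Web Archive", URL: "https://arquivo.pt/", Dormant: true},
	}
	out, err := archivesJSON(archives)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

An HTTP handler could write these bytes with a `Content-Type: application/json` header. That wiring is left out here because the routing of the real service is not described in this thread.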

I can think of adding more runtime attributes to the structure of each archive, such as:

  • Number of requests sent
  • Number of successful responses
  • Number of times an archive has gone dormant
  • The last time when an archive (only if currently dormant?) was put to sleep

Obviously, these counters will only keep track of the state for the uptime of the instance. If we also report the uptime of the instance and the total number of received requests under the /about endpoint, this will enable a nice time series visualization of the health of the instance and the upstream archives.
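The counters listed above could be kept in a small fixed-size structure per archive. The following Go sketch is one possible shape. The `Stats` and `Snapshot` types and their method names are hypothetical, and a mutex is used only as a simple way to keep concurrent updates safe.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Stats holds hypothetical per-archive counters. Memory use is constant:
// every update overwrites a value in place.
type Stats struct {
	mu            sync.Mutex
	requests      uint64
	successes     uint64
	dormantCount  uint64
	lastDormantAt time.Time // zero value means "never dormant"
}

// Snapshot is a lock-free copy of the counters, suitable for reporting.
type Snapshot struct {
	Requests      uint64
	Successes     uint64
	DormantCount  uint64
	LastDormantAt time.Time
}

// RecordRequest counts one request and, if it succeeded, one success.
func (s *Stats) RecordRequest(success bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.requests++
	if success {
		s.successes++
	}
}

// RecordDormant counts one dormant session and remembers when it started.
func (s *Stats) RecordDormant() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.dormantCount++
	s.lastDormantAt = time.Now()
}

// Snapshot returns the current values under the lock.
func (s *Stats) Snapshot() Snapshot {
	s.mu.Lock()
	defer s.mu.Unlock()
	return Snapshot{s.requests, s.successes, s.dormantCount, s.lastDormantAt}
}

func main() {
	started := time.Now() // instance start, used to report uptime
	var s Stats
	s.RecordRequest(true)
	s.RecordRequest(false)
	s.RecordDormant()
	snap := s.Snapshot()
	fmt.Printf("uptime=%s requests=%d successes=%d dormant=%d\n",
		time.Since(started).Round(time.Millisecond),
		snap.Requests, snap.Successes, snap.DormantCount)
}
```

Polling such a snapshot at regular intervals from outside the service would be enough to build the time series, without the instance itself storing any history.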

@machawk1
Member Author

machawk1 commented Mar 6, 2020

All of the extra information you suggested would be really useful, too.

What about also keeping a record of timestamps/ranges for which an archive was dormant? This seems like it might require a lot of bookkeeping.

> we filter them off immediately after parsing the file and do not keep any records

Couldn't we keep a record of when we filter them off to be reported later?

@ibnesayeed
Member

> What about also keeping a record of timestamps/ranges for which an archive was dormant? This seems like it might require a lot of bookkeeping.

That would be a serious memory leak, as the amount of memory needed to run the service would continue to rise indefinitely for the lifetime of the instance. Keeping counters is cheap because each value is replaced in place. Knowing the number of dormant sessions of each archive and the configurable dormant period, one can simply multiply the two to get the overall duration for which an archive was not being aggregated from.
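The multiplication described here is a one-liner. This Go sketch uses a hypothetical helper name and an assumed dormant period of 15 minutes, chosen only for the example. If an archive is dormant at the moment of reporting, its current session is counted in full, so the result is an estimate rather than an exact figure.

```go
package main

import (
	"fmt"
	"time"
)

// estimatedDormantTime approximates the total time an archive was skipped:
// number of dormant sessions times the configured dormant period.
func estimatedDormantTime(dormantCount uint64, dormantPeriod time.Duration) time.Duration {
	return time.Duration(dormantCount) * dormantPeriod
}

func main() {
	period := 15 * time.Minute // assumed value, not the service's default
	fmt.Println(estimatedDormantTime(4, period)) // prints "1h0m0s"
}
```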

> Couldn't we keep a record of when we filter them off to be reported later?

We could, but I do not see a compelling reason to do so. The purpose of the option to ignore selected archives before an instance is started is to allow the service maintainers to keep a record of things that they used in the past, anticipate using in the future, or have some private entries for testing. There is not a lot to offer by exposing such private information. The ignore attribute of the archives list is a way to say, "hey aggregator, don't bother about these and assume they don't exist".
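For readers unfamiliar with this load-time behaviour, the following Go sketch shows the general idea of dropping ignored entries right after parsing. Only the `ignore` attribute is mentioned in this thread. The `Entry` struct, the other field names, the `loadActive` helper, and the sample data are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry is a hypothetical shape for one item of the archive list file.
type Entry struct {
	ID     string `json:"id"`
	Name   string `json:"name"`
	Ignore bool   `json:"ignore"`
}

// loadActive parses the list and keeps only entries that are not ignored.
// Ignored entries are discarded here and never reach the runtime structure.
func loadActive(data []byte) ([]Entry, error) {
	var all []Entry
	if err := json.Unmarshal(data, &all); err != nil {
		return nil, err
	}
	active := make([]Entry, 0, len(all))
	for _, e := range all {
		if !e.Ignore {
			active = append(active, e)
		}
	}
	return active, nil
}

func main() {
	data := []byte(`[
		{"id": "ia", "name": "Internet Archive"},
		{"id": "test", "name": "Private test entry", "ignore": true}
	]`)
	active, err := loadActive(data)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, e := range active {
		fmt.Println(e.ID, e.Name)
	}
}
```

With this approach, anything serialized from the runtime list cannot leak private or test entries, which matches the intent described above.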
