not respecting showDupeCount=true; retry without --uniques-only #82

Open
reagle opened this issue Oct 3, 2024 · 3 comments
Comments

@reagle

reagle commented Oct 3, 2024

Hi, I'm new to the tool, and don't want to download empty files or files which haven't changed. I tried and got the following. I'm not sure what this means and why it doesn't work...?

❯ waybackpack http://reddit.com/r/self -d ~/Downloads/wayback-reddit --from-date 2008 --to-date 2009  --no-clobber --progress --uniques-only
Traceback (most recent call last):
  File "/Users/reagle/.pyenv/versions/3.12.5/bin/waybackpack", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/reagle/.pyenv/versions/3.12.5/lib/python3.12/site-packages/waybackpack/cli.py", line 142, in main
    snapshots = search(
                ^^^^^^^
  File "/Users/reagle/.pyenv/versions/3.12.5/lib/python3.12/site-packages/waybackpack/cdx.py", line 47, in search
    raise WaybackpackException(
waybackpack.cdx.WaybackpackException: Wayback Machine CDX API not respecting showDupeCount=true; retry without --uniques-only.
@jsvine
Owner

jsvine commented Oct 18, 2024

Thanks for your interest in waybackpack, @reagle. Here's what's happening:

  • The Wayback Machine's CDX API theoretically provides a way to check for (and thus skip over) duplicate content. If you pass --uniques-only, then waybackpack attempts to skip those dupes.
  • ... but the API hasn't always respected the relevant parameter, making it impossible for waybackpack to respect --uniques-only.
  • Because we don't want people who are expecting --uniques-only to get unexpected results when the feature doesn't work, we throw that error.
  • You can remove --uniques-only from your invocation, although that of course won't resolve the underlying issue (which is that you will end up downloading files that haven't changed).
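The check described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not waybackpack's actual code: it assumes the CDX API was queried with `output=json` and `showDupeCount=true`, so the parsed response is a header row followed by one row per snapshot, with a `dupecount` column when the parameter is honored. The names `filter_uniques` and `DupeCountMissing` are hypothetical.

```python
class DupeCountMissing(Exception):
    """Hypothetical error: the CDX response has no dupecount column."""


def filter_uniques(cdx_rows):
    """Keep only snapshots that the API reports as non-duplicates.

    cdx_rows is the parsed JSON response: a header row (list of field
    names) followed by one list per snapshot.
    """
    if len(cdx_rows) < 2:
        # Only a header (or nothing at all): no snapshots to filter.
        return []
    fields, rows = cdx_rows[0], cdx_rows[1:]
    if "dupecount" not in fields:
        # The API ignored showDupeCount=true, so duplicates cannot be
        # identified; fail loudly instead of silently returning everything.
        raise DupeCountMissing(
            "CDX response has no dupecount column; cannot filter duplicates."
        )
    snapshots = [dict(zip(fields, row)) for row in rows]
    # Assumption: a dupecount of 0 marks the first capture of that content.
    return [snap for snap in snapshots if int(snap["dupecount"]) == 0]
```

Raising instead of returning the unfiltered list mirrors the reasoning in the third bullet: someone who asked for unique snapshots should not quietly receive duplicates.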

@reagle
Author

reagle commented Oct 18, 2024

Okay, thank you. I'm not sure how often --uniques-only fails, but a nice feature for waybackpack would be to check for redundant files itself. That is, if the API returns a digest that matches an earlier page, don't write it to disk. If you didn't want to do that and that info is available, perhaps you could include it in the metadata of the HTML, so a wrapper could do it. I found myself wanting single-file results (and wanting to tweak default argument values), and so used this wrapper.

#!/usr/bin/env python3

"""Wrap waybackpack to copy files to a single directory."""

import argparse
import os
import shutil
import subprocess


def run_waybackpack(args):
    """Run waybackpack with the given arguments."""
    command = ["waybackpack", "--dir", args.dir, "--delay-retry", str(args.delay_retry)]
    if args.no_clobber:
        command.append("--no-clobber")
    if args.progress:
        command.append("--progress")
    command.extend(args.unknown)

    try:
        subprocess.run(command, check=True)
        print("Waybackpack command executed successfully.")
    except subprocess.CalledProcessError as e:
        print(f"Error executing waybackpack: {e}")
        return False
    return True


def process_files(base_dir):
    """Copy each nested .html file to a flat, path-derived name in base_dir."""
    for root, _, files in os.walk(base_dir):
        for file in files:
            if file.endswith(".html"):
                original = os.path.join(root, file)
                relative_path = os.path.relpath(original, base_dir)
                new_filename = relative_path.replace(os.sep, "_")
                new_file_path = os.path.join(base_dir, new_filename)
                # Files already at the top level (e.g. copies from an earlier
                # run) map onto themselves; copying them raises SameFileError.
                if new_file_path == original:
                    continue
                shutil.copy(original, new_file_path)
                print(f"Copied {original} to {new_file_path}")


def main():
    """Process arguments and call waybackpack and file processing."""
    parser = argparse.ArgumentParser(description="Waybackpack Wrapper")
    parser.add_argument(
        "--dir", type=str, default="wb", help="Directory for storing results"
    )
    parser.add_argument(
        "--delay-retry", type=int, default=15, help="Delay between retries"
    )
    parser.add_argument(
        "--no-clobber",
        action="store_true",
        default=True,
        help="Do not overwrite existing files",
    )
    parser.add_argument(
        "--progress", action="store_true", default=True, help="Show progress"
    )
    args, unknown = parser.parse_known_args()
    args.unknown = unknown

    if run_waybackpack(args):
        process_files(args.dir)


if __name__ == "__main__":
    main()
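The digest-based check suggested before the wrapper could be sketched along these lines. This is only an illustration under stated assumptions: each snapshot is a dict built from a CDX API row and carries the `digest` field that the API returns by default. The function name `drop_repeated_digests` is hypothetical and is not part of waybackpack.

```python
def drop_repeated_digests(snapshots):
    """Return snapshots in order, skipping any whose digest was already seen."""
    seen = set()
    kept = []
    for snap in snapshots:
        digest = snap["digest"]
        if digest in seen:
            # Same content as an earlier capture: skip the download.
            continue
        seen.add(digest)
        kept.append(snap)
    return kept
```

Note the design choice: a page that changes and later reverts to earlier content is dropped on the revert, because its digest has been seen before. Comparing each digest only with the previous snapshot's digest would keep such reverts.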

@Quuxplusone

Quuxplusone commented Jan 4, 2025

I got the same error. I vaguely understand the explanation that this is a problem with the Wayback Machine's own API, and waybackpack is doing a good thing by throwing an error instead of falling back to something the user (me) might not want to do after all. But I don't understand at all how this explanation maps onto the actual output of the waybackpack executable! What I see on my screen for this failure mode is:

$ waybackpack --raw --to-date 202401 --uniques-only --dir archive/ http://fq.math.ca/Scanned/28-3/andre-jeannin.pdf
Traceback (most recent call last):
  File "/Users/aodwyer/env/bin/waybackpack", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/aodwyer/env/lib/python3.12/site-packages/waybackpack/cli.py", line 142, in main
    snapshots = search(
                ^^^^^^^
  File "/Users/aodwyer/env/lib/python3.12/site-packages/waybackpack/cdx.py", line 47, in search
    raise WaybackpackException(
waybackpack.cdx.WaybackpackException: Wayback Machine CDX API not respecting showDupeCount=true; retry without --uniques-only.

...Ah, I get it, I had been parsing that message as "hocuspocus [is] not respecting showDupeCount=true; retry without --uniques-only", which confused me. I had in fact passed --uniques-only. So it was confusing to see an error message that claimed to apply only without --uniques-only.
But you had meant me to parse it as "hocuspocus [does] not [respect] showDupeCount=true; [please] retry without --uniques-only"! That is, the last part was a command to the user (me), not a description of the failure mode.

I suggest improving the error message in three ways:

  • Actually catch the Python exception and output a proper message to the command-line user; don't just dump a stacktrace.
  • Rephrase the first part in active voice: "--uniques-only requires the Wayback Machine CDX API to respect showDupeCount=true, but in this case it doesn't."
  • Rephrase the second part as a new sentence, imperative, on a second line of text: "Please try again without --uniques-only."

So the final fixed behavior would look like this mockup:

$ waybackpack --raw --to-date 202401 --uniques-only --dir archive/ http://fq.math.ca/Scanned/28-3/andre-jeannin.pdf
Error: --uniques-only requires the Wayback Machine CDX API to respect `showDupeCount=true`, but in this case it doesn't.
Please try again without --uniques-only.

(The phrase "in this case" is super vague, of course, but I don't have the knowledge to improve its specificity.)
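The first suggestion (catch the exception and print a message instead of a stack trace) might look something like the sketch below. It is not waybackpack's code: `WaybackpackException` is redefined locally as a stand-in so the example runs on its own, and `run_cli` and its `search` argument are hypothetical names for the CLI entry point and the CDX lookup.

```python
import sys


class WaybackpackException(Exception):
    """Stand-in for waybackpack's exception, so this sketch is self-contained."""


def run_cli(search, stderr=sys.stderr):
    """Call search() and turn a WaybackpackException into a readable error.

    Returns a process exit code: 0 on success, 1 on a handled error.
    """
    try:
        search()
    except WaybackpackException as exc:
        # Print only the message; the traceback is noise for CLI users.
        print(f"Error: {exc}", file=stderr)
        return 1
    return 0
```

A real entry point would then end with something like `sys.exit(run_cli(...))`, so the shell still sees a non-zero exit status when the lookup fails.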
