Merge pull request #20 from bellingcat/fix-cdx-overload
Add better rate limit protection (solves #19)
jclark1913 authored Nov 7, 2023
2 parents 4345abb + 4f6b097 commit 3f82361
Showing 8 changed files with 176 additions and 109 deletions.
41 changes: 24 additions & 17 deletions README.md
@@ -202,7 +202,7 @@ You can also clone and download the repo from github and use the tool locally.
3. Get a high-level overview:
```terminal
python main.py -h
python -m wayback_google_analytics.main -h
```
<p align="right">(<a href="#readme-top">back to top</a>)</p>
@@ -227,27 +227,28 @@ Options list (run `wayback-google-analytics -h` to see in terminal):
options:
-h, --help show this help message and exit
-i INPUT_FILE, --input_file INPUT_FILE
Enter a file path to a list of urls in a readable file
type (e.g. .txt, .csv, .md)
Enter a file path to a list of urls in a readable file type
(e.g. .txt, .csv, .md)
-u URLS [URLS ...], --urls URLS [URLS ...]
Enter a list of urls separated by spaces to get their
UA/GA codes (e.g. --urls https://www.google.com
Enter a list of urls separated by spaces to get their UA/GA
codes (e.g. --urls https://www.google.com
https://www.facebook.com)
-o {csv,txt,json,xlsx}, --output {csv,txt,json,xlsx}
Enter an output type to write results to file.
Defaults to json.
Enter an output type to write results to file. Defaults to
json.
-s START_DATE, --start_date START_DATE
Start date for time range (dd/mm/YYYY:HH:MM) Defaults
to 01/10/2012:00:00, when UA codes were adopted.
Start date for time range (dd/mm/YYYY:HH:MM) Defaults to
01/10/2012:00:00, when UA codes were adopted.
-e END_DATE, --end_date END_DATE
End date for time range (dd/mm/YYYY:HH:MM). Defaults
to None.
End date for time range (dd/mm/YYYY:HH:MM). Defaults to None.
-f {yearly,monthly,daily,hourly}, --frequency {yearly,monthly,daily,hourly}
Can limit snapshots to remove duplicates (1 per hr,
day, month, etc). Defaults to None.
Can limit snapshots to remove duplicates (1 per hr, day, month,
etc). Defaults to None.
-l LIMIT, --limit LIMIT
Limits number of snapshots returned. Defaults to -100
(most recent 100 snapshots).
Limits number of snapshots returned. Defaults to -100 (most
recent 100 snapshots).
-sc, --skip_current Add this flag to skip current UA/GA codes when getting archived
codes.
```

@@ -289,7 +290,15 @@ Ordered by code:

<p align="right">(<a href="#readme-top">back to top</a>)</p>

<!-- Limitations -->
## Limitations & Rate Limits

We recommend limiting your list of urls to ~10 and your max snapshot limit to <500 per query. Wayback Google Analytics has no hardcoded limit on how many urls or snapshots you can request, but large queries can cause 443 errors (rate limiting). Being rate limited can result in a temporary 5-10 minute ban from web.archive.org and the CDX api.

The app currently uses `asyncio.Semaphore()` along with delays between requests, but large queries or operations that take a long time can still result in a 443. Use your judgment and break large queries into smaller, more manageable pieces if you find yourself getting rate limited.
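
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: `fetch_snapshot`, `fetch_all`, `chunked`, and the `stats` dictionary are hypothetical names, and `asyncio.sleep` stands in for the real request and delay so that no network access is needed.

```python
import asyncio


async def fetch_snapshot(timestamp, semaphore, delay, stats):
    """Stand-in for one archive.org request (hypothetical; performs no network I/O)."""
    async with semaphore:  # only `max_concurrent` tasks may be inside this block at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(delay)  # placeholder for the request plus a short pause
        stats["active"] -= 1
    return timestamp


async def fetch_all(timestamps, max_concurrent=10, delay=0.01):
    """Run all fetches concurrently, bounded by a single shared semaphore."""
    # Create the semaphore inside the running coroutine and pass it down explicitly.
    semaphore = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch_snapshot(ts, semaphore, delay, stats) for ts in timestamps)
    )
    return results, stats["peak"]


def chunked(items, size):
    """Split a list (e.g. of urls) into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    results, peak = asyncio.run(fetch_all([str(n) for n in range(25)]))
    print(len(results), "snapshots fetched; peak concurrency:", peak)
```

Splitting a long url list with something like `chunked(urls, 10)` and running one batch at a time is one way to break a large query into smaller pieces, as suggested above.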


<p align="right">(<a href="#readme-top">back to top</a>)</p>

<!-- CONTRIBUTING -->
## Contributing
@@ -325,8 +334,6 @@ Distributed under the MIT License. See `LICENSE.txt` for more information.

<p align="right">(<a href="#readme-top">back to top</a>)</p>



<!-- CONTACT -->
## Contact

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "wayback-google-analytics"
version = "0.1.6"
version = "0.2.0"
description = "A tool for gathering current and historic google analytics ids from multiple websites"
authors = ["Justin Clark <[email protected]>"]
license = "MIT"
7 changes: 1 addition & 6 deletions tests/test_async_utils.py
@@ -62,16 +62,11 @@ async def mock_text_method():
@patch("wayback_google_analytics.async_utils.get_UA_code")
@patch("wayback_google_analytics.async_utils.get_GA_code")
@patch("wayback_google_analytics.async_utils.get_GTM_code")
@patch("wayback_google_analytics.async_utils.sem", new_callable=MagicMock())
async def test_get_codes_from_single_timestamp(
self, mock_sem, mock_GTM, mock_GA, mock_UA, mock_get
self, mock_GTM, mock_GA, mock_UA, mock_get
):
"""Does get_codes_from_single_timestamp return correct codes from a single archive.org snapshot?"""

# Mock semaphore
mock_sem.__aenter__.return_value = MagicMock()
mock_sem.__aexit__.return_value = MagicMock()

# Mock the response from the server
mock_response = MagicMock()

4 changes: 4 additions & 0 deletions tests/test_main.py
@@ -89,6 +89,7 @@ def test_setup_args_valid_args(self):
"daily",
"--limit",
"10",
"--skip_current",
]
args = setup_args()

@@ -102,6 +103,7 @@ def test_setup_args_valid_args(self):
self.assertEqual(args.end_date, "01/01/2013:12:00")
self.assertEqual(args.frequency, "daily")
self.assertEqual(args.limit, "10")
self.assertEqual(args.skip_current, True)

def test_setup_args_valid_args_shorthand(self):
"""Does setup_args return args if valid args provided using shorthand commands?"""
@@ -120,6 +122,7 @@ def test_setup_args_valid_args_shorthand(self):
"daily",
"-l",
"10",
"-sc",
]
args = setup_args()

@@ -133,3 +136,4 @@ def test_setup_args_valid_args_shorthand(self):
self.assertEqual(args.end_date, "01/01/2013:12:00")
self.assertEqual(args.frequency, "daily")
self.assertEqual(args.limit, "10")
self.assertEqual(args.skip_current, True)
20 changes: 11 additions & 9 deletions wayback_google_analytics/async_utils.py
@@ -3,9 +3,6 @@
from wayback_google_analytics.codes import get_UA_code, get_GA_code, get_GTM_code
from wayback_google_analytics.utils import get_date_from_timestamp, DEFAULT_HEADERS

# Semaphore to limit number of concurrent requests (10-15 appears to work fine. 20+ causes 443 error from web.archive.org)
sem = asyncio.Semaphore(10)


async def get_snapshot_timestamps(
session,
@@ -14,6 +11,7 @@ async def get_snapshot_timestamps(
end_date,
frequency,
limit,
semaphore=asyncio.Semaphore(10),
):
"""Takes a url and returns an array of snapshot timestamps for a given time range.
@@ -24,6 +22,7 @@ async def get_snapshot_timestamps(
end_date (str, optional): End date for time range.
frequency (str, optional): Can limit snapshots to remove duplicates (1 per hr, day, week, etc).
limit (int, optional): Limit number of snapshots returned.
semaphore: asyncio.Semaphore()
Returns:
Array of timestamps:
@@ -52,22 +51,24 @@ async def get_snapshot_timestamps(
pattern = re.compile(r"\d{14}")

# Use session to get timestamps
async with session.get(cdx_url, headers=DEFAULT_HEADERS) as response:
timestamps = pattern.findall(await response.text())
async with semaphore:
async with session.get(cdx_url, headers=DEFAULT_HEADERS) as response:
timestamps = pattern.findall(await response.text())

print("Timestamps from CDX api: ", timestamps)

# Return sorted timestamps
return sorted(timestamps)


async def get_codes_from_snapshots(session, url, timestamps):
async def get_codes_from_snapshots(session, url, timestamps, semaphore=asyncio.Semaphore(10)):
"""Returns an array of UA/GA codes for a given url using the Archive.org Wayback Machine.
Args:
session (aiohttp.ClientSession)
url (str)
timestamps (list): List of timestamps to get codes from.
semaphore: asyncio.Semaphore()
Returns:
{
@@ -103,7 +104,7 @@ async def get_codes_from_snapshots(session, url, timestamps):

# Get codes from each timestamp with asyncio.gather().
tasks = [
get_codes_from_single_timestamp(session, base_url, timestamp, results)
get_codes_from_single_timestamp(session, base_url, timestamp, results, semaphore)
for timestamp in timestamps
]
await asyncio.gather(*tasks)
@@ -120,21 +121,22 @@ async def get_codes_from_snapshots(session, url, timestamps):
return results


async def get_codes_from_single_timestamp(session, base_url, timestamp, results):
async def get_codes_from_single_timestamp(session, base_url, timestamp, results, semaphore=asyncio.Semaphore(10)):
"""Returns UA/GA codes from a single archive.org snapshot and adds it to the results dictionary.
Args:
session (aiohttp.ClientSession)
base_url (str): Base url for archive.org snapshot.
timestamp (str): 14-digit timestamp.
results (dict): Dictionary to add codes to (inherited from get_codes_from_snapshots()).
semaphore: asyncio.Semaphore()
Returns:
None
"""

# Use semaphore to limit number of concurrent requests
async with sem:
async with semaphore:
async with session.get(
base_url.format(timestamp=timestamp), headers=DEFAULT_HEADERS
) as response:
57 changes: 44 additions & 13 deletions wayback_google_analytics/main.py
@@ -66,20 +66,42 @@ async def main(args):
)
args.frequency = COLLAPSE_OPTIONS[args.frequency]

async with aiohttp.ClientSession() as session:
results = await get_analytics_codes(
session=session,
urls=args.urls,
start_date=args.start_date,
end_date=args.end_date,
frequency=args.frequency,
limit=args.limit,
)
print(results)
semaphore = asyncio.Semaphore(10)

# handle printing the output
if args.output:
write_output(output_file, args.output, results)
# Warn user if large request
if abs(int(args.limit)) > 500 or len(args.urls) > 9:
response = input(
f"""Large requests can lead to being rate limited by archive.org.\n\n Current limit: {args.limit} (Recommended < 500) \n\n Current # of urls: {len(args.urls)} (Recommended < 10, unless limit < 50)
Do you wish to proceed? (Yes/no)
"""
)
if response.lower() not in ("yes", "y"):
print("Request cancelled.")
exit()

try:
async with semaphore:
async with aiohttp.ClientSession() as session:
results = await get_analytics_codes(
session=session,
urls=args.urls,
start_date=args.start_date,
end_date=args.end_date,
frequency=args.frequency,
limit=args.limit,
semaphore=semaphore,
skip_current=args.skip_current,
)
print(results)

# handle printing the output
if args.output:
write_output(output_file, args.output, results)
except aiohttp.ClientError as e:
print(
"Your request was rate limited. Wait 5 minutes and try again, and consider reducing the limit and # of urls."
)


def setup_args():
@@ -91,6 +113,7 @@ def setup_args():
--end_date: End date for time range. Defaults to None.
--frequency: Can limit snapshots to remove duplicates (1 per hr, day, month, etc). Defaults to None.
--limit: Limit number of snapshots returned. Defaults to -100 (most recent 100 snapshots).
--skip_current: Add this flag to skip current UA/GA codes when getting archived codes.
Returns:
Command line arguments (argparse)
@@ -144,12 +167,20 @@ def setup_args():
default=-100,
help="Limits number of snapshots returned. Defaults to -100 (most recent 100 snapshots).",
)
parser.add_argument(
"-sc",
"--skip_current",
action="store_true",
help="Add this flag to skip current UA/GA codes when getting archived codes.",
)

return parser.parse_args()


def main_entrypoint():
args = setup_args()
asyncio.run(main(args))


if __name__ == "__main__":
main_entrypoint()
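
The two CLI changes in this file, the `--skip_current` flag and the large-request warning, can be sketched in isolation as below. This is a simplified illustration under stated assumptions: `build_parser` and `is_large_request` are hypothetical helper names that do not appear in the commit, and only the options needed for the example are defined. The thresholds (500 snapshots, more than 9 urls) are taken from the diff above.

```python
import argparse


def build_parser():
    """Reduced parser with only the options this example needs (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="wayback-google-analytics")
    parser.add_argument("-u", "--urls", nargs="+", default=[])
    parser.add_argument("-l", "--limit", default=-100)
    # store_true: the attribute is False unless the flag is present on the command line
    parser.add_argument("-sc", "--skip_current", action="store_true")
    return parser


def is_large_request(limit, urls, max_limit=500, max_urls=9):
    """Return True when a request exceeds the thresholds used for the warning prompt."""
    # The limit may arrive as a string and may be negative (e.g. -100 = most recent 100)
    return abs(int(limit)) > max_limit or len(urls) > max_urls


if __name__ == "__main__":
    args = build_parser().parse_args(["-u", "https://www.google.com", "-l", "10", "-sc"])
    print(args.skip_current, is_large_request(args.limit, args.urls))
```

Keeping the threshold check in a small pure function like this would make it testable without patching `input()`, which is a design choice of this sketch rather than something the commit does.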