Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data missing due to exceeding quota #36

Closed
dimbleby opened this issue Aug 17, 2024 · 9 comments · Fixed by #41
Closed

Data missing due to exceeding quota #36

dimbleby opened this issue Aug 17, 2024 · 9 comments · Fixed by #41

Comments

@dimbleby
Copy link

looking at the history, it seems as though the intention is to publish updated data on the first of each month.

Last update was 1st July.

@edgarrmondragon
Copy link

@hugovk
Copy link
Owner

hugovk commented Aug 18, 2024

Strange, still hitting the limit two weeks later:

google.api_core.exceptions.Forbidden: 403 Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas

That was for 7,000 projects.

@hugovk
Copy link
Owner

hugovk commented Aug 30, 2024

Google BigQuery gives 1 TB free quota per month.

Here's some data on past runs. The gaps indicate when switching to fetch data for more projects.

image

The 5k growth seemed fairly linear, so when switching to 8k I made a rough calculation that the 0.5 TB available would last about 40 months. Instead, since the turn of 2024 the growth has increased rapidly and is now over 1 TB for 8k.

@hugovk
Copy link
Owner

hugovk commented Aug 30, 2024

I've started the 90-day, $300 (€277.46) free trial to get things back on track and fetched the July data:

pypinfo -v --json --indent 0 --limit 8000 -sd 2024-07-02 -ed 2024-07-31 "" project > top-pypi-packages-30-days.json
jq -c . < top-pypi-packages-30-days.json > top-pypi-packages-30-days.min.json
jq -r '.rows[] | [.download_count, .project] | @csv' top-pypi-packages-30-days.json >> top-pypi-packages-30-days.csv

./deploy.sh

https://github.com/hugovk/top-pypi-packages/releases/tag/2024.08

This indeed is over 1 TB, it's 1.023 TB:

image

At some point whilst on the free trial, I'll do some experiments to pick a lower number of projects that's still under 1 TB, but it's concerning it grew to use ~0.5 TB in ~0.5 year.

Let's keep this open until that is done, and to check the September update goes out as planned on Sunday.

@hugovk
Copy link
Owner

hugovk commented Oct 13, 2024

Here's a couple of pypinfo commands, limiting to just the top 10 and 1 day, otherwise identical except for the --all flag.

--all means "Show downloads by all installers, not only pip".

By default, it only fetches downloads from pip, and that's what has been used in this repo (since day 1? Or perhaps since that flag was introduced? Would have to check. At least, the default has been been used here for a long time).

Only pip

❯ pypinfo --limit 10 --days 1 "" project
Served from cache: False
Data processed: 58.21 GiB
Data billed: 58.21 GiB
Estimated cost: $0.29
project download_count
boto3 37,251,744
aiobotocore 16,252,824
urllib3 16,243,278
botocore 15,687,125
requests 13,271,314
s3fs 12,865,055
s3transfer 12,014,278
fsspec 11,982,305
charset-normalizer 11,684,740
certifi 11,639,584
Total 158,892,247

All installers

pypinfo --all --limit 10 --days 1 "" project
Served from cache: False
Data processed: 46.63 GiB
Data billed: 46.63 GiB
Estimated cost: $0.23
project download_count
boto3 39,495,624
botocore 17,281,187
urllib3 17,225,121
aiobotocore 16,430,826
requests 14,287,965
s3fs 12,958,516
charset-normalizer 12,781,405
certifi 12,647,098
setuptools 12,608,120
idna 12,510,335
Total 168,226,197

So we can see the default pip-only costs an extra 25% data processed and data billed, and costs an extra 25% in dollars.

Unsurprisingly, the actual download counts are higher for all installers. The ranking has changed a bit, but I expect we're still getting more-or-less the same packages in the top thousands of results.

Queries

Here's the same commands, but run with --test: "--test simply prints the query".

pypinfo --test --limit 10 --days 1 "" project
SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
  AND details.installer.name = "pip"
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 10
pypinfo --test --all --limit 10 --days 1 "" project
SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 10

These queries are the same, except the default has an extra AND details.installer.name = "pip" condition.

It sounds reasonable it would cost more to do extra filtering work.

Installers

Let's look at the installers:

pypinfo --all --limit 100 --days 1 "" installer
Served from cache: False
Data processed: 29.49 GiB
Data billed: 29.49 GiB
Estimated cost: $0.15
installer_name download_count
pip 1,121,198,711
uv 117,194,833
requests 29,828,272
poetry 23,009,454
None 8,916,745
bandersnatch 6,171,555
setuptools 1,362,797
Bazel 1,280,271
Browser 1,096,328
Nexus 593,230
Homebrew 510,247
Artifactory 69,063
pdm 62,904
OS 13,108
devpi 9,530
conda 2,272
pex 194
Total 1,311,319,514

pip still by far the most popular, and unsurprising uv is up there too, with about 10% of pip's downloads.

The others are 10% of uv or less. A lot of them are mirroring services, that we wanted to exclude before.

I think given uv's importance, and my expectation that it will continue to take a bigger share of the pie, plus especially the extra cost for filtering by just pip, means that we should switch to fetching data for all downloaders. Plus the others don't account for that much of the pie.

Next time, I'll also do some comparison between different totals: should we keep 8,000, go back to 5,000 or some other number?

@hugovk hugovk changed the title no August release Data missing due to exceeding quota Nov 10, 2024
@hugovk
Copy link
Owner

hugovk commented Nov 10, 2024

Next, let's see how number of projects affects the cost. (This data fetched yesterday, 2024-11-09.)

Days = 1, limit = 1000...8000

Let's begin checking just one day for different limits: 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000. We'll use --all, so not filtering for just pip.

Here's the basic query:

❯ pypinfo --test --all -v --json --indent 0 --days 1 --limit 8000 "" project
Credentials location set to "/Users/hugo/Dropbox/bin/data/pypinfo-ea8f23d28f2b.json".
SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 8000

Let's use a helper script like this:

import os

days = 1

for limit in (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000):
    outfile = f"days-{days}-limit-{limit}-all.json"
    if os.path.exists(outfile):
        print(f"{outfile} exists, skipping")
        continue
    cmd = (
        f"pypinfo --all --json --indent 0 --days {days} --limit {limit} '' "
        f"project > {outfile}"
    )
    print(cmd)
    os.system(cmd)
image

Result: interestingly, the cost is the same for all limits (1000-8000): 0.31.

Days = 1, limit = 1000...8000, pip

Let's repeat but removing --all, so it's filters for only pip:

image

Result: cost again the same for all limits, and cost increased: 0.39.

Days = 30, limit = 8000...1000

Next, back to --all, repeat for 30 days, and start with 8000 first in case we were only paying for incremental changes:

days = 30

for limit in (8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000):
    ...
image

Result: Again, the cost is the same regardless of limit: 4.89

Days = 1, 30, limit = 1

Let's also check for limit = 1

Result: Same cost as above, 0.31 for 1 day, and 4.89 for 30 days.

Days = 1...30, limit = 1

Now we know the cost is the same regardless of the limit (at least up to 8000), let's see how the cost varies over the number of days.

import os

import os

# for limit in (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000):
for days in range(1, 31):
    for limit in (1,):
        outfile = f"days-{days}-limit-{limit}-all.json"
        if os.path.exists(outfile):
            print(f"{outfile} exists, skipping")
            continue

        cmd = (
            f"pypinfo --all --json --indent 0 --days {days} --limit {limit} '' "
            f"project > {outfile}"
        )
        print(cmd)
        os.system(cmd)

Fetch results from JSONs into CSV:

import json

print("days\testimated_cost\tbytes_billed\tbytes_processed")

for days in range(1, 31):
    for limit in (1,):
        outfile = f"days-{days}-limit-{limit}-all.json"

        with open(outfile) as f:
            data = json.load(f)

        out = [
            days,
            data["query"]["estimated_cost"],
            f"{data["query"]["bytes_billed"]:,}",
            f"{data["query"]["bytes_processed"]:,}",
        ]
        print("\t".join(map(str, out)))

Gives:

days	estimated_cost	bytes_billed	bytes_processed
1	0.31	67,423,436,800	67,423,031,967
2	0.49	106,777,542,656	106,776,936,726
3	0.66	144,589,193,216	144,588,849,934
4	0.84	183,362,387,968	183,362,056,051
5	0.94	205,844,905,984	205,844,089,492
6	1.05	230,423,527,424	230,423,239,221
7	1.20	263,483,031,552	263,482,009,301
8	1.37	301,120,618,496	301,119,942,981
9	1.56	340,849,065,984	340,848,563,030
10	1.73	379,594,997,760	379,594,187,471
11	1.91	419,779,575,808	419,778,888,347
12	2.02	442,795,819,008	442,795,580,469
13	2.11	463,927,771,136	463,927,460,363
14	2.28	500,829,257,728	500,828,317,940
15	2.47	541,425,926,144	541,424,988,775
16	2.64	580,075,388,928	580,075,257,785
17	2.83	621,216,268,288	621,215,221,560
18	3.01	660,446,642,176	660,445,832,190
19	3.13	686,769,045,504	686,768,435,724
20	3.24	710,389,268,480	710,388,303,159
21	3.41	747,798,265,856	747,798,009,119
22	3.59	789,304,049,664	789,303,120,956
23	3.78	830,582,292,480	830,581,546,594
24	3.97	871,580,565,504	871,579,537,799
25	4.14	908,331,057,152	908,330,312,456
26	4.25	933,782,093,824	933,781,258,405
27	4.35	956,512,075,776	956,511,698,396
28	4.52	992,670,121,984	992,669,344,697
29	4.70	1,032,543,272,960	1,032,542,495,766
30	4.89	1,073,809,981,440	1,073,809,418,894
image

Result: it's the number of days that affects cost.

Google: "The first 1 TiB of query data processed per month is free." 1 TiB = 1,099,511,627,776 bytes, and all of the values in the above table are below this limit. Note, this is with --all.

The last run in this repo, without --all, was billed at 1,376,442,646,528 bytes, over the limit:

"last_update": "2024-11-01 12:27:36",
"query": {
"bytes_billed": 1376442646528,
"bytes_processed": 1376442377548,
"cached": false,
"estimated_cost": "6.26"

So we can remove --all and we'll be okay... for now. But as we can see from #36 (comment), as time goes on, the line goes up.

We're at 1,073,809,981,440 and the limit is not far off at 1,099,511,627,776.

In the future, to stay under 1 TiB we'd need to reduce the number of days by some unclear amount.

Well, let's add --all for now: #39.

Here's the data collected from these tests: data.zip

@hugovk
Copy link
Owner

hugovk commented Nov 10, 2024

To repeat with the limit increasing by powers of ten, up to 1,000,000, getting data for all 531,022 on PyPI.

limit projects count estimated_cost bytes_billed bytes_processed
1 1 0.20 43,447,746,560 43,447,720,943
10 10 0.20 43,447,746,560 43,447,720,943
100 100 0.20 43,447,746,560 43,447,720,943
1000 1,000 0.20 43,447,746,560 43,447,720,943
8000 8,000 0.20 43,447,746,560 43,447,720,943
10000 10,000 0.20 43,447,746,560 43,447,720,943
100000 100,000 0.20 43,447,746,560 43,447,720,943
1000000 531,022 0.20 43,447,746,560 43,447,720,943
image

Result: Again, same cost for 1 project or 531,022 projects.

Next thought, perhaps the query can be improved:

SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 1000

Data fetched today, 2024-11-10: data2.zip

@hugovk
Copy link
Owner

hugovk commented Nov 10, 2024

Omitting the LIMIT is the same as fetching all, effectively the same as LIMIT 1000000, and the same cost (0.20):

SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC

@hugovk
Copy link
Owner

hugovk commented Nov 24, 2024

I've switched to fetching data for all installers: #39

And written this up at https://dev.to/hugovk/a-surprising-thing-about-pypis-bigquery-data-2g9o

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants