
Include .conda packages #45

Open
jakirkham opened this issue Aug 7, 2023 · 77 comments

Comments

@jakirkham

It would be helpful to include both .conda & .tar.bz2 packages, particularly as more of the former and fewer of the latter are produced. It may also help to track these separately to follow the transition to the newer format.

@jakirkham
Author

cc @beckermr @wolfv

@jezdez
Member

jezdez commented Sep 20, 2023

Looking into this with @cappadona

@dopplershift

@jezdez Did that go anywhere? I was working on collecting some download numbers for my library and right now 2023 shows minimal downloads due to the transition to .conda.

@jakirkham
Author

@jezdez did this issue get solved more broadly?

Saw the python packages were fixed recently: #41

Is there a path for fixing the other packages? Or did this already happen?

@cappadona

@jakirkham @dopplershift. Apologies for the delay.

We have not yet addressed .conda packages missing from this data set. This work is on our backlog, and we should be able to get this done in November. We will provide updates here, but please don't hesitate to reach out with questions.

@jakirkham
Author

Thanks Nick! 🙏

@cappadona

Hi @jakirkham @dopplershift. Quick update on the status of this issue.

We're working on finalizing a new pipeline that will source this public data set and include .conda packages moving forward. We expect to have it ready by the end of March 2024 and will post an update here when it is available.

@leofang

leofang commented Jan 4, 2024

Hi @cappadona Thanks for the update! Q: Would it be possible to also update the past statistics when the new pipeline is up?

@cappadona

@leofang At the moment we're not planning to replace any existing files in the bucket and only implement the fix for future data.

@jakirkham
Author

cc @aterrel @chenghlee (as we discussed this earlier)

@leofang

leofang commented Mar 1, 2024

Hi @cappadona @jezdez Friendly nudge for updates 🙂 This has impacted several statistics tracking tools and caused confusion. I've heard jabbering about "no one is using conda" as they looked at the download counts from, say, condastats, but it is simply not true.

@cappadona

Hi @leofang. Thanks for checking in. We are on track to include .conda packages in the dataset by the end of the month.

@jakirkham
Author

Just wanted to check in, @cappadona how are things looking here?

@wolfv

wolfv commented Mar 19, 2024

Still looks reaaaally flat: https://prefix.dev/channels/conda-forge/packages/aesara (picked a random package)

@jakirkham
Author

To be fair, Nick said end of the month originally. So end of next week

Though would be good to learn if that is still true or if this is likely to slip

@jakirkham
Author

@cappadona how are things looking?

@cappadona

@jakirkham Sorry I missed your earlier message. Thanks for checking in. We're looking good and the March 2024 data published to the s3 bucket later this week will include .conda packages.

I will post an update to this thread once the March data is available.

@jakirkham
Author

Thanks Nick! 🙏

@cappadona

Hi all. Quick update. We're just about there. Finalizing QA with the rest of the team, including a colleague who returns next week. Here are a couple examples for March 2024.

[Screenshot 2024-04-05 at 5:17:12 PM] [Screenshot 2024-04-05 at 5:20:53 PM]

@jakirkham
Author

Thanks Nick! 🙏

With numpy this includes some older versions like 1.9.2, are these coming from defaults? Asking as conda-forge jumped to numpy version 1.9.3 (in the 1.9 series). Or is this an amalgamation of different channel statistics?

aesara is only in conda-forge AFAIK. So am guessing the top sheet is based on conda-forge data. Is that right?

@cappadona

Hi @jakirkham. The screenshot is an aggregation of multiple channels, which are usually identified in the final dataset via the data_source column. I did confirm that conda-forge is the only data source for aesara.
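
For readers who want to split an aggregated view back out by channel, here is a minimal sketch of a per-channel breakdown. It assumes each record exposes `pkg_name` and `counts` fields alongside the `data_source` column mentioned above; those two field names, the `other-channel` value, and the sample rows are illustrative assumptions, not values taken from the dataset.

```python
from collections import defaultdict


def downloads_by_source(rows, pkg_name):
    """Sum download counts per data_source for a single package."""
    totals = defaultdict(int)
    for row in rows:
        if row["pkg_name"] == pkg_name:
            totals[row["data_source"]] += row["counts"]
    return dict(totals)


# Made-up rows for illustration only; real counts will differ.
sample_rows = [
    {"data_source": "conda-forge", "pkg_name": "numpy", "counts": 10},
    {"data_source": "other-channel", "pkg_name": "numpy", "counts": 4},
    {"data_source": "conda-forge", "pkg_name": "numpy", "counts": 5},
    {"data_source": "conda-forge", "pkg_name": "aesara", "counts": 3},
]

print(downloads_by_source(sample_rows, "numpy"))
print(downloads_by_source(sample_rows, "aesara"))
```

A package that appears under a single key in the result, as aesara does in the sample, is served by only one channel in that data.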

@jakirkham
Author

How are things looking @cappadona ?

@jakirkham
Author

@cappadona are there any updates here?

Also as a side note, users are also asking about March data in this issue: #51

@cappadona

Hi @jakirkham. Monthly and hourly data for March and April 2024, which includes .conda packages, are now available in the bucket.

Thank you all for your patience.

@jezdez
Member

jezdez commented May 6, 2024

@cappadona Do you think we could update the old files as well, since .conda files had been hosted for a while? Should we keep this ticket open until we fix that?

@wolfv

wolfv commented May 6, 2024

So just to get it right, the format of the parquet files changed?

@jakirkham
Author

Thanks Jannis! 🙏

Please let us know if you need more info from us or need us to test anything 🙂

@wolfv

wolfv commented Sep 6, 2024

The 1970 issues were actually issues in our code. Sorry about that!

@wolfv

wolfv commented Sep 6, 2024

We just fixed things on our end, but it appears that the pipeline to produce this data is not really working anymore?

The latest data is 2024-06...

@jezdez
Member

jezdez commented Sep 6, 2024

Huh, I'd check with @cappadona about it, he was working on an analysis

@cappadona

Hi all. We've been running some analysis on the dataset in response to everyone's feedback and will share our findings when this is complete.

In the interim, responding to some of the recent questions in this thread...


@wolfv

> We just fixed things on our end, but it appears that the pipeline to produce this data is not really working anymore?
>
> The latest data is 2024-06...

The latest data available in the s3 bucket is for 2024-05, which was made available in June. We have temporarily paused publishing new data until we complete the QA.

> The 1970 issues were actually issues in our code. Sorry about that!

Thank you. This is one issue that we haven't been able to reproduce.


@jakirkham @phwuil @nicrie

Notably:

False alarm -- addressed by Wolf

This is the main focus of our QA effort and we're tentatively planning to replace data beginning in 2022-06 to address the undercounting.

Temporarily paused publishing new data (see my response above)

We still need to dig into the download counter displayed on anaconda.org. I will also comment on each of those issues.

@wolfv

wolfv commented Sep 20, 2024

We've dropped the faulty data from our end. Any chance you are going to backfill data from the past? it looks pretty weird now, because some packages that had releases only have 1 measuring point.

@wolfv

wolfv commented Sep 20, 2024

Screenshot 2024-09-20 at 14 18 53

@wolfv

wolfv commented Sep 20, 2024

Lastly, while it appears you fixed the .conda, is it possible that .tar.bz2 are not accounted for anymore?

https://prefix.dev/channels/conda-forge/packages/_libgcc_mutex

Screenshot 2024-09-20 at 14 20 38

@cappadona

> We've dropped the faulty data from our end. Any chance you are going to backfill data from the past? it looks pretty weird now, because some packages that had releases only have 1 measuring point.

Hi Wolf, yes we are planning to backfill past data and we will be sharing details at this week's conda community sync.

@cappadona

Hi @wolfv I'm unable to reproduce this dropoff for _libgcc_mutex when using condastats

Screenshot 2024-09-23 at 11 20 20 AM

@wolfv

wolfv commented Sep 23, 2024

OK, then we might have an issue on our end again :) Thanks!

@jakirkham
Author

@cappadona is this working correctly for other channels?

Think it would be good to double check these are all handled correctly (others may have suggestions):

  • bioconda
  • defaults
  • nvidia
  • pytorch
  • rapidsai
  • rapidsai-nightly

@jakirkham
Author

Also worth noting RAPIDS is switching to publishing .conda packages. So we will want to make sure they are picked up in the statistics here

@h-vetinari

> > We've dropped the faulty data from our end. Any chance you are going to backfill data from the past? it looks pretty weird now, because some packages that had releases only have 1 measuring point.
>
> Hi Wolf, yes we are planning to backfill past data and we will be sharing details at this week's conda community sync.

Any updates on the backfill?

I tried to run the by-the-numbers binder again, and

dd.read_parquet("s3://anaconda-package-data/conda/hourly/2024/06/2024-06-*.parquet",storage_options={'anon': True})

returns an empty data frame, and so do all months after June (whereas the months up until May 2024 are fine).

I've loosened the match to

dd.read_parquet("s3://anaconda-package-data/conda/hourly/2024/06/*.parquet",storage_options={'anon': True})

and still nothing.
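
To see where the data stops, the same check can be repeated month by month. The following is a minimal sketch, assuming the hourly path layout shown in the calls above holds for every month; `hourly_glob`, `months_between`, and `count_rows` are hypothetical helper names, and how dask handles a glob that matches no files may vary by version.

```python
def hourly_glob(year, month):
    """Build the hourly parquet glob for one month (layout taken from the calls above)."""
    return (
        "s3://anaconda-package-data/conda/hourly/"
        f"{year:04d}/{month:02d}/{year:04d}-{month:02d}-*.parquet"
    )


def months_between(start, end):
    """Return (year, month) tuples from start to end, inclusive."""
    year, month = start
    months = []
    while (year, month) <= end:
        months.append((year, month))
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def count_rows(year, month):
    """Count rows for one month; needs dask, s3fs, and network access."""
    import dask.dataframe as dd  # imported lazily so the helpers above work without dask

    df = dd.read_parquet(hourly_glob(year, month), storage_options={"anon": True})
    return len(df)


# Print the globs for the months around the reported gap (no network needed).
for year, month in months_between((2024, 5), (2024, 7)):
    print(hourly_glob(year, month))
```

Calling `count_rows(year, month)` for each tuple would then show which months come back empty.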

@jakirkham
Author

Asked about this at the Conda community meeting earlier this week and it sounds like they are working through some issues

@wolfv

wolfv commented Dec 4, 2024

Any updates?

@cappadona

@jakirkham @h-vetinari @wolfv We are finalizing the work to generate new hourly and monthly data, beginning with 2022-06, that includes downloads for .conda packages. We plan to deliver this by the end of next week (2024-12-20).

@jakirkham
Author

@cappadona could you please let us know what the status is on the updated download statistics?

@cappadona

Hi @jakirkham. 2024 data is now available in the public bucket through November and includes .conda packages in the counts.

We are backfilling the remaining prior months (2022-06 to 2023-12) today with .conda package counts, and will post another update when that is complete.

@wolfv

wolfv commented Dec 23, 2024

oh my god, finally!

The graph looks a little better again, e.g. for bzip2 (https://prefix.dev/channels/conda-forge/packages/bzip2)

Screenshot 2024-12-23 at 13 51 59

@cappadona

The backfill of .conda package counts back to 2022-06 is complete. Thank you all for your patience and feedback.

@h-vetinari

Thank you very much! 🙏

Not to rain on the parade, but just to double-check: Regarding the graph that wolf showed, the trough between oct '23 and june '24 still looks quite suspicious...? 🤔

@h-vetinari

zlib also looks weird (handover from blue to orange is plausible, everything after that looks iffy)

image

@wolfv

wolfv commented Dec 23, 2024

Yeah we need to drop the faulty data from our database. Will take care of it shortly

@h-vetinari

If one goes and executes the by-the-numbers notebook linked from the conda-forge landing page (with some minor adaptations to update the loop over which years we're interested in), we get the following for 2021-2023:

[Image: Untitled]

Update for 2020-2024 (minus Dec. '24 data, which gets downloaded but fails processing):

Success 🥳

image

@jakirkham
Author

Happy New Year everyone! 🥳

Thank you Nick! 🙏

Think the next step will be for all of us to go through this data and make sure things are looking reasonable
