Dependency Graph #2

Open
Fazel94 opened this issue Dec 6, 2015 · 6 comments

Comments

@Fazel94

Fazel94 commented Dec 6, 2015

If you upload all the metadata, or just the dependencies, in some easy-to-use format like XML, JSON, or even a full MySQL database dump, I can build a dependency graph and thus answer the questions from your blog post.
I can implement an adaptation of PageRank or a similar algorithm to find the impact factor of packages.
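
A minimal sketch of what a PageRank-style impact factor could look like, assuming the dependency data is available as a mapping from each package to the packages it depends on. The function name `pagerank`, the damping factor, the iteration count, and the toy package names are all illustrative assumptions, not anything taken from the repository.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph maps each package to the list of packages it depends on, so
    rank flows from a package to its dependencies.
    """
    nodes = set(graph)
    for deps in graph.values():
        nodes.update(deps)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            deps = graph.get(node, [])
            if deps:
                # Split this package's rank among its dependencies.
                share = damping * rank[node] / len(deps)
                for dep in deps:
                    new_rank[dep] += share
            else:
                # A package without dependencies spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy data, not real PyPI dependencies.
example = {
    "webapp": ["requests", "six"],
    "scraper": ["requests"],
    "requests": ["six"],
}
ranks = pagerank(example)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(name, round(ranks[name], 3))
```

Packages that many others depend on, directly or indirectly, end up with the highest score, which is one possible reading of "impact factor".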

@MartinThoma
Owner

@Fazel94 Thank you for offering your help. Could you please tell me which blog article you refer to and which data I should upload?

@Fazel94
Author

Fazel94 commented Dec 6, 2015

Sorry,
Here is the post I'm talking about:
http://martin-thoma.com/analyzing-pypi-metadata/

I would be glad to mine the PyPI data, but I would prefer not to have to scrape PyPI myself.
What I mean is a database or a serialized version of the metadata (in whatever format is not a burden for you), especially the dependency list for each package, so I can build a dependency graph on it and maybe do a little frequent itemset counting to find out which packages people use together.
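
As a rough illustration of the simplest form of frequent itemset counting (pairs only), assuming each package's dependencies are available as a list of names. The function name `frequent_pairs`, the `min_support` threshold, and the sample lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(dependency_lists, min_support=2):
    """Count how often two packages appear together in a dependency list."""
    counts = Counter()
    for deps in dependency_lists:
        # Sort and deduplicate so each pair has one canonical form.
        for pair in combinations(sorted(set(deps)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


# Hypothetical dependency lists of four packages.
sample = [
    ["numpy", "scipy", "matplotlib"],
    ["numpy", "scipy"],
    ["requests", "six"],
    ["numpy", "matplotlib"],
]
pairs = frequent_pairs(sample)
print(pairs)
```

Larger itemsets would need an algorithm such as Apriori; counting pairs is only the first step.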

Thank you for your attention.

@MartinThoma
Owner

specially the dependency list for each package

There is no such thing as a dependency list for each package in the PyPI metadata. You could only download all the packages (completely), look for a requirements.txt and read that.
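
A small sketch of reading package names out of a requirements.txt, assuming the file only uses plain `name` or `name<op>version` lines. The function name `requirement_names` is hypothetical; option lines (such as `-r other.txt`) and URL requirements are simply skipped, and extras or environment markers are ignored.

```python
import re

# A project name starts with a letter or digit, followed by letters,
# digits, dots, underscores or hyphens.
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def requirement_names(text):
    """Return the lower-cased package names listed in requirements text."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-") or "://" in line:
            continue  # blank line, pip option, or URL requirement
        match = _NAME.match(line)
        if match:
            names.append(match.group(0).lower())
    return names


sample = """\
# web stack
requests>=2.0
Flask==0.10.1  # pinned
-r dev-requirements.txt
six
"""
names = requirement_names(sample)
print(names)
```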

I can upload the data. However, it is quite a bit. I'm currently running the script again. The scripts beginning with "c" are currently running and even a 7z-compressed csv version of the packages table is about 3 MB.

Would that still be of use to you? If you really want to build the dependency graph, you have to download quite a massive amount of data. Estimating with the query

SELECT sum(size)/1000000000 FROM `urls`

it is currently about 3.3GB. I can give you a better approximation tomorrow.

Where should I upload it?

@MartinThoma
Owner

Currently the download is at pyromancer and has reached 16.35GB.

I've added a script to check for imports in a package.

TODOs are:

  • apply that script to the latest versions of all packages in PyPI
  • analyze the setup.py

Done:

  • download the Python package
  • extract it
  • get the python files
  • insert the gathered data into the database
  • (add a new table to the database for dependencies)
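
For illustration, one way to check for imports in a Python file is the standard `ast` module. This is only a sketch under that assumption; the script in the repository may work differently, and the function name `imported_modules` is hypothetical. Note that `ast.parse` uses the grammar of the running interpreter, so files written for a different Python version can raise `SyntaxError`.

```python
import ast


def imported_modules(source):
    """Return the top-level module names imported by Python source code."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import inside the package itself.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


code = (
    "import os, sys\n"
    "import os.path\n"
    "from logging import getLogger\n"
    "from . import sibling\n"
)
found = imported_modules(code)
print(sorted(found))
```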

@MartinThoma MartinThoma changed the title from "I can help with Dependency graph." to "Dependency Graph" on Dec 6, 2015
@MartinThoma
Owner

Ok, I've just put some more work into it:

If you really want to make the dependency graph, you still have to:

  • implement the get_setup_packages in package_analysis.py
  • run ./package_analysis for all latest releases
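
One possible direction for analyzing a setup.py without executing it is to parse it statically and read `install_requires` only when it is written as a literal. This is a hedged sketch, not the repository's `get_setup_packages`; the function name `find_install_requires` is hypothetical, and setup scripts that compute their requirements dynamically are reported as unknown (`None`).

```python
import ast


def find_install_requires(setup_source):
    """Return the literal install_requires list of a setup() call, or None."""
    tree = ast.parse(setup_source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        # Accept both setup(...) and setuptools.setup(...).
        func = node.func
        name = getattr(func, "id", None) or getattr(func, "attr", None)
        if name != "setup":
            continue
        for keyword in node.keywords:
            if keyword.arg != "install_requires":
                continue
            try:
                value = ast.literal_eval(keyword.value)
            except ValueError:
                return None  # not a literal, e.g. a variable
            if isinstance(value, str):
                return [value]
            return list(value)
    return None


src = (
    "from setuptools import setup\n"
    'setup(name="demo", install_requires=["requests>=2.0", "six"])\n'
)
print(find_install_requires(src))
```

Because nothing from the package is executed, this avoids the security concern of running untrusted setup scripts, at the cost of missing dynamically built requirement lists.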

This will fill your database with all possible dependencies. Even if you don't implement get_setup_packages, it will probably add all dependencies. However, even with a VERY good internet connection, I expect this to take several days to run. One could parallelize the download of the packages, but that would still take many hours.
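
A sketch of how the downloads could be parallelized with a thread pool from the standard library. The names `fetch` and `download_all`, the worker count, and the URLs are assumptions for illustration; the demo at the end uses a stand-in fetcher so that it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one URL and return its content as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch_one=fetch, workers=8):
    """Fetch all URLs concurrently; map each URL to its bytes or its error."""

    def safe_fetch(url):
        try:
            return url, fetch_one(url)
        except Exception as error:  # keep going if one download fails
            return url, error

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(safe_fetch, urls))


# Offline demo with a stand-in fetcher instead of real downloads.
def fake_fetch(url):
    if url.endswith("bad"):
        raise IOError("download failed")
    return url.encode()


out = download_all(["pkg-a", "pkg-b", "pkg-bad"], fetch_one=fake_fetch, workers=2)
print(out)
```

Threads suit this job because the work is dominated by waiting on the network rather than by computation.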

@MartinThoma
Owner

@Fazel94 I've just set up the script to run over the complete PyPI database. That will take quite a while. And it currently ignores setuptools, which is a major issue (but it was too complicated to make a secure / fast implementation within just a couple of hours; you could add that, if you want).

How would you like to visualize the graph? It has 67582 nodes and a lot more than 4600 edges (I'm just downloading / building the graph... takes a while). You cannot use graphviz for that.

(By the way, do we know each other? Are you a student from KIT, too?)

So far, the most imported module is os, followed (not even close) by sys, logging, re ... and org. I guess that last one is an error? I have no idea where it comes from.
