
Large datasets #284

Open
s2tephen opened this issue Nov 2, 2016 · 1 comment

Comments

@s2tephen
Contributor

s2tephen commented Nov 2, 2016

See the big-data branch for upper bounds on dataset size and betweenness centrality estimation. These bounds were determined by running a series of tests on a variety of random graphs to roughly locate the sweet spot in the accuracy/runtime tradeoff. Ultimately I was able to get the runtime of the betweenness_centrality function under 90 seconds with error below two decimal places (< .01).
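The accuracy/runtime testing described above can be sketched with networkx's built-in pivot sampling, where `betweenness_centrality(G, k=...)` estimates centrality from `k` sampled source nodes instead of all of them. This is a minimal sketch of such a comparison harness, not the actual code from the big-data branch; the function name and thresholds here are illustrative.

```python
import time

import networkx as nx


def compare_betweenness(G, k, seed=42):
    """Compare exact betweenness centrality against a k-pivot estimate.

    Returns (max absolute error across nodes, exact/approx speedup).
    Illustrative helper, not from the big-data branch.
    """
    t0 = time.perf_counter()
    exact = nx.betweenness_centrality(G)  # all n sources: O(nm) on unweighted graphs
    t_exact = time.perf_counter() - t0

    t0 = time.perf_counter()
    # k pivot nodes sampled without replacement; k == n reproduces the exact values
    approx = nx.betweenness_centrality(G, k=k, seed=seed)
    t_approx = time.perf_counter() - t0

    max_err = max(abs(exact[v] - approx[v]) for v in G)
    return max_err, t_exact / t_approx


# Example run on a random graph of the kind used for the tuning tests
G = nx.gnp_random_graph(300, 0.05, seed=1)
err, speedup = compare_betweenness(G, k=60)
```

Sweeping `k` over graphs of varying size is one way to locate the error/runtime sweet spot the comment refers to.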

This estimation works in the backend logic/testing suite, but unfortunately still times out in production. I was able to get it working by further lowering the upper bounds: MAX_NODES = 2000 and MAX_V_X_E (nodes * edges) = 10000000. At that point, however, the betweenness centrality estimates are noticeably less accurate. I can't invest any more time in fine-tuning this, but it may be worth revisiting in the future. One possible solution would be to run the betweenness centrality algorithm in parallel, as in this example.
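The production guard described above amounts to a size gate in front of the centrality call. A minimal sketch, using the constants quoted in the comment; the function name and fallback behavior are hypothetical, not taken from the production codebase.

```python
import networkx as nx

# Constants quoted in the comment above
MAX_NODES = 2000
MAX_V_X_E = 10_000_000  # nodes * edges


def safe_betweenness(G, k=500, seed=0):
    """Run sampled betweenness centrality only if the graph fits under
    the production limits; return None otherwise (hypothetical policy).
    """
    n, e = G.number_of_nodes(), G.number_of_edges()
    if n > MAX_NODES or n * e > MAX_V_X_E:
        return None  # too large: skip the metric rather than time out
    # Cap the pivot count at n so small graphs still get exact values
    return nx.betweenness_centrality(G, k=min(k, n), seed=seed)


G = nx.gnp_random_graph(100, 0.1, seed=2)
bc = safe_betweenness(G)
```

With tighter bounds the sampled estimate covers a smaller fraction of sources, which is why accuracy drops as the comment notes.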

@rahulbot
Collaborator

rahulbot commented Nov 2, 2016

Maybe this is the final straw in the argument to host prod on the civic servers. @kanarinka what do you think?

@kanarinka kanarinka modified the milestone: v2.0 - Someday Jul 25, 2017
@rahulbot rahulbot removed this from the Someday milestone May 25, 2020