Sorting and searching based on updated and pushed #12

Comments
Expanding the number of Clojure libraries dewey can index sounds great. There are a few different changes I'm thinking about for the data pipeline, so I'm not exactly sure what the timeline for integrating this kind of change would look like.
Do you think that's close to the total number of 2 star repositories? If not, what do you think is the limiting factor? |
Yes, that's close to it; re-running it now gives 4477. There is one theoretical bug in the "algorithm": if 1000+ repositories have the exact same pushed_at, the iteration could be stopped prematurely. I think that theoretical chance is safe to rule out. Otherwise I don't know of any shortcomings of this method. I ran the whole thing now (removed stardev, which is updated now and then - I'm not sure how often) - gives |
So cool! How long did that take? Maybe that's an even better method. Also, in theory, once we have an up-to-date list, we would only need to query for libraries pushed since the last time the process was run. 89,245 is a lot of repositories to analyze every week, but it's probably not so bad if we only re-analyze libraries that have changed. |
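A minimal sketch of what that incremental query could look like, assuming the same search-repos-request helper used in the snippet further down and a stored timestamp from the previous run (both names are assumptions, not dewey's actual API):

```clojure
;; Sketch only: build a search request for repos pushed since the last run.
;; `search-repos-request` is the helper assumed in the code below;
;; `last-run` is wherever the previous run's timestamp was recorded.
(defn incremental-repos-request [last-run]
  (search-repos-request
   (str "language:clojure pushed:>=" last-run)))

;; e.g. (incremental-repos-request "2022-08-01")
```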
I'm getting about 30 repos/second:

```clojure
(defn find-clojure-repos []
  (iteration
   (with-retries
     (fn [{:keys [start-time cnt url pushed_at last-response] :as k}]
       (prn cnt (select-keys k [:url :pushed_at]))
       (let [start-time (or start-time (System/currentTimeMillis))
             req
             (cond
               ;; initial request
               (= cnt 0) (search-repos-request "language:clojure")
               ;; received next-url
               url (assoc base-request :url url)
               ;; received pushed_at timestamp
               pushed_at (search-repos-request (str "language:clojure pushed:<=" pushed_at))
               :else (throw (Exception. (str "Unexpected key type: " (pr-str k)))))]
         (rate-limit-sleep! last-response)
         (let [response (http/request (with-auth req))
               prev-items (into #{} (get-in last-response [:body :items] []))
               page-items (get-in response [:body :items] [])
               new-items (vec (remove (partial contains? prev-items) page-items))
               new-cnt (+ cnt (count new-items))
               spent-time-seconds (/ (max 1 (- (System/currentTimeMillis) start-time))
                                     1000)
               repos-per-second (/ new-cnt spent-time-seconds)]
           (println "Repos/second:" (format "%.1f" (double repos-per-second)))
           (-> response
               (assoc :cnt new-cnt)
               (assoc :start-time start-time)
               (assoc ::key k
                      ::request req)
               (assoc-in [:body :items] new-items))))))
   :kf
   (fn [{:keys [cnt] :as response}]
     (let [url (-> response :links :next :href)]
       (when-let [m (if url
                      {:url url}
                      (when-let [pushed_at (some-> response :body :items last :pushed_at)]
                        {:pushed_at pushed_at}))]
         (merge m
                (select-keys response [:cnt :start-time])
                {:last-response response}))))
   :initk {:cnt 0}))
```

90K repos / 30 repos per second = 3000 seconds => ~50 minutes to fetch it all. (Edit: forgive me if my math is way off, it's late and I didn't double check anything. But 50 minutes sounds reasonable.) It sounds like a good idea to avoid re-analyzing everything. |
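The helpers referenced above (with-retries, search-repos-request, base-request, with-auth, rate-limit-sleep!) aren't shown in this thread; the real definitions live in the linked code. Purely as an assumption to make the snippet self-contained, they might look something like this (clj-http is assumed here, though dewey may use a different HTTP client):

```clojure
(require '[clj-http.client :as http])

(def base-request
  {:method :get
   :as :json
   :query-params {:per_page 100}})

(defn search-repos-request [q]
  (assoc base-request
         :url "https://api.github.com/search/repositories"
         :query-params (assoc (:query-params base-request)
                              :q q
                              :sort "updated"
                              :order "desc")))

(defn with-auth [req]
  ;; token read from the environment; a minimal-scope token suffices for search
  (assoc-in req [:headers "Authorization"]
            (str "token " (System/getenv "GITHUB_TOKEN"))))

(defn rate-limit-sleep! [last-response]
  ;; GitHub returns x-ratelimit-remaining / x-ratelimit-reset headers;
  ;; sleep until the reset time when the remaining budget runs out.
  (when-let [remaining (some-> last-response :headers (get "x-ratelimit-remaining") parse-long)]
    (when (zero? remaining)
      (let [reset-ms (* 1000 (parse-long (get-in last-response [:headers "x-ratelimit-reset"])))
            wait (- reset-ms (System/currentTimeMillis))]
        (when (pos? wait)
          (Thread/sleep wait))))))

(defn with-retries
  "Wrap a step function so transient failures are retried a few times."
  [f]
  (fn [k]
    (loop [attempt 1]
      (let [result (try
                     {:ok (f k)}
                     (catch Exception e
                       (if (< attempt 3)
                         ::retry
                         (throw e))))]
        (if (= ::retry result)
          (do (Thread/sleep (* attempt 1000))
              (recur (inc attempt)))
          (:ok result))))))
```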
Do you have an enterprise account? I thought the normal rate limiting was around 5k/hr. |
I suspect I wasn't being clear enough. I don't have an enterprise account. For me it was evaluating (def all-repos (vec (find-clojure-repos))) that took 50 minutes. Does that make sense? |
Does this method include GitHub's rate limiting? If not, I'm trying to figure out how it gets the data so quickly while staying under GitHub's rate limit. |
Yes, I believe it does. Are you sure that

Here is the exact code that's running:

I believe I created the token with as few permissions as possible. Maybe that makes a difference? Does that help? And a sample output from the console:
|
What is 5k per hour? Is that 5k requests? If there are 90 000 repos and each request fetches ~100, that only needs 900 requests (and finishes in 50 minutes here). |
Oh, got it. I read 30 repos per second as 30 requests per second for some reason. It all makes sense now 👍 |
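A quick sanity check of that arithmetic (assuming ~100 results per search page and the ~5k requests/hour figure mentioned above):

```clojure
;; Rough estimate only: how many search requests a full crawl needs.
(let [repos 90000
      per-page 100                    ;; GitHub search returns up to 100 items per page
      requests (long (Math/ceil (/ repos (double per-page))))
      rate-limit-per-hour 5000]
  {:requests requests                            ;; => 900
   :within-hourly-limit? (< requests rate-limit-per-hour)})  ;; => true
```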
Incremental pull of updated repos is now available at: https://github.com/ivarref/dewey/blob/main/src/com/phronemophobic/dewey.clj

You can evaluate

It's something like an hour since I ran the update, and for me the console now outputs:

Edit: I'm not 100% sure this method is bulletproof. Can you see any problems with it? It's sorting by

Sorry about the formatting diff in the commit. Regards. |
This is really great stuff. I'm pretty excited about getting it integrated. This will expand the dataset quite a bit, which might require some additional changes. Just brainstorming a bit:
Most of dewey uses |
Hi again, and thanks! I discovered a bug when continuing from a

I changed the data format to be newline-delimited edn. git-lfs is one option for storing large files, and it's recommended by GitHub. Any thoughts on that? Each line/entry in

BTW: My original goal when looking at this code was to create something similar to https://github.com/phronmophobic/add-deps, but for the CLI and tools.deps/gitlibs. |
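For reference, a minimal sketch of reading and writing newline-delimited EDN (one map per line); the function names here are illustrative, not the ones used in the linked commit:

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Write one EDN map per line; appending keeps incremental updates cheap.
(defn append-entries! [path entries]
  (with-open [w (io/writer path :append true)]
    (doseq [entry entries]
      (.write w (pr-str entry))
      (.write w "\n"))))

;; Read the file back: one edn/read-string per line.
(defn read-entries [path]
  (with-open [r (io/reader path)]
    (into [] (map edn/read-string) (line-seq r))))
```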
I think that only applies to objects in a git repository. Currently, data dumps are only being uploaded as part of releases where we probably won't have to worry about the limit. The static analysis edn is already approaching 1gb. I think their docs say there isn't a specified limit. I also have an s3 bucket that I've been using for temporary storage, but would be open to using it for public data if it makes sense.
The reasoning is based on practical experience building ETL pipelines. Since storage is so cheap, I find that the easiest way to stay sane is to have a dumb step that only fetches data and, separately, steps that process the data. The benefit is that if you decide you want to update your transform based on more info, you can rerun the transform step without re-fetching data. Having a "dumb" fetch process also reduces the risk of bugs that cause data loss. There are reasons to combine the steps, but I don't think we're at the scale of data where trying to optimize for efficiency is that helpful.
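As a toy illustration of that fetch/transform split, reusing the newline-delimited EDN helpers sketched above (names and field choices here are made up, not dewey's actual pipeline):

```clojure
;; Step 1: "dumb" fetch -- persist whatever the API returned, untouched.
(defn fetch-step! [raw-path]
  (append-entries! raw-path (map :body (find-clojure-repos))))

;; Step 2: transform -- derived views can be recomputed at any time
;; from the raw dump, without hitting the API again.
(defn transform-step [raw-path]
  (->> (read-entries raw-path)
       (mapcat :items)
       (map #(select-keys % [:full_name :pushed_at :stargazers_count]))))
```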
I do have a GUI version based on the latest clojure alpha, https://github.com/phronmophobic/add-deps/blob/main/src/com/phronemophobic/add_deps2.clj. I still think the most annoying part is figuring out the best way to make the data available. My latest brainstorm is to try hosting a read-only db on s3 using xtdb or datomic. |
Hi again @phronmophobic, thanks for your input. I totally agree with your comments on storage: it is indeed cheap. I think a person who wants to develop/modify dewey can be expected to download a release and continue at

If

PS: Catching up with the latest data is fast. I executed the following locally:

That's two weeks of Clojure git(hub) changes in 19 seconds.

PS 2: a more human and incremental view of

(I'm personally not very familiar with babashka, and it's my first time using

Edit: I'll be busy for some time now, so I may not have the time to respond in a very timely manner. Hope that you will make progress, and comment here as you like, and of course you may use

Regards. |
Hi @phronmophobic, and thanks for many great libraries and code!

I was able to get 4431 repositories with 2 stars using this code:

and you'll notice that here pushed_at is used for searching and sort-by updated is used. The search is done using pushed:<= ... so that identical updated timestamps are supported (this also introduces the need to remove items from the previous request, in order to not loop forever).

Is this something that you would like to integrate into dewey?

Thanks and kind regards!
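To make the looping concern concrete, here is a small stand-alone sketch of that de-duplication idea with made-up data rather than real API responses: each new page is filtered against the previous page's items, so re-querying with pushed:<= the last seen timestamp can't keep returning the same items forever.

```clojure
;; Toy illustration of the dedup step, not the code from the linked commit.
;; Filtering each page against the previous page's items prevents an infinite
;; loop when the query boundary (pushed:<=) re-returns items already seen.
(let [prev-page [{:full_name "a/a" :pushed_at "2022-08-01T10:00:00Z"}
                 {:full_name "b/b" :pushed_at "2022-08-01T10:00:00Z"}]
      next-page [{:full_name "b/b" :pushed_at "2022-08-01T10:00:00Z"}   ;; overlap with prev-page
                 {:full_name "c/c" :pushed_at "2022-07-30T09:00:00Z"}]
      prev-items (set prev-page)
      new-items (vec (remove prev-items next-page))]   ;; a set works as a predicate here
  new-items)
;; => [{:full_name "c/c", :pushed_at "2022-07-30T09:00:00Z"}]
```

This doesn't remove the theoretical edge case mentioned earlier in the thread: if 1000+ repositories share the exact same pushed_at, the search window still can't reach the items beyond the result limit.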