Sorting and searching based on updated and pushed #12

Comments
Expanding the number of Clojure libraries dewey can index sounds great. There are a few different changes I'm thinking about for the data pipeline, so I'm not exactly sure what the timeline for integrating this kind of change would look like.
Do you think that's close to the total number of 2 star repositories? If not, what do you think is the limiting factor? |
Yes, that's close to it; re-running it now gives 4477. There is one theoretical bug in the "algorithm": if 1000+ repositories have the exact same pushed_at, the iteration could be stopped prematurely. I think that theoretical chance is safe to rule out. Otherwise I don't know of any shortcomings of this method. I ran the whole thing now (removed stardev, which is updated now and then - I'm not sure how often) - gives |
So cool! How long did that take? Maybe that's an even better method. Also, in theory, once we have an up-to-date list, we would only need to query for libraries pushed since the last time the process was run. 89,245 is a lot of repositories to analyze every week, but it's probably not so bad if we only re-analyze libraries that have changed. |
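A minimal sketch of what that incremental query could look like, assuming the same search-repos-request helper used in the snippet further down and a stored timestamp from the previous run (both names are assumptions, not dewey's actual API):

```clojure
;; Sketch only: build a search request for repos pushed since the last run.
;; `search-repos-request` is the helper assumed in the code below;
;; `last-run` is wherever the previous run's timestamp was recorded.
(defn incremental-repos-request [last-run]
  (search-repos-request
   (str "language:clojure pushed:>=" last-run)))

;; e.g. (incremental-repos-request "2022-08-01")
```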
I'm getting about 30 repos/second:

```clojure
(defn find-clojure-repos []
  (iteration
   (with-retries
     (fn [{:keys [start-time cnt url pushed_at last-response] :as k}]
       (prn cnt (select-keys k [:url :pushed_at]))
       (let [start-time (or start-time (System/currentTimeMillis))
             req
             (cond
               ;; initial request
               (= cnt 0) (search-repos-request "language:clojure")
               ;; received next-url
               url (assoc base-request :url url)
               ;; received pushed_at timestamp
               pushed_at (search-repos-request (str "language:clojure pushed:<=" pushed_at))
               :else (throw (Exception. (str "Unexpected key type: " (pr-str k)))))]
         (rate-limit-sleep! last-response)
         (let [response (http/request (with-auth req))
               prev-items (into #{} (get-in last-response [:body :items] []))
               page-items (get-in response [:body :items] [])
               new-items (vec (remove (partial contains? prev-items) page-items))
               new-cnt (+ cnt (count new-items))
               spent-time-seconds (/ (max 1 (- (System/currentTimeMillis) start-time))
                                     1000)
               repos-per-second (/ new-cnt spent-time-seconds)]
           (println "Repos/second:" (format "%.1f" (double repos-per-second)))
           (-> response
               (assoc :cnt new-cnt)
               (assoc :start-time start-time)
               (assoc ::key k
                      ::request req)
               (assoc-in [:body :items] new-items))))))
   :kf
   (fn [{:keys [cnt] :as response}]
     (let [url (-> response :links :next :href)]
       (when-let [m (if url
                      {:url url}
                      (when-let [pushed_at (some-> response :body :items last :pushed_at)]
                        {:pushed_at pushed_at}))]
         (merge m
                (select-keys response [:cnt :start-time])
                {:last-response response}))))
   :initk {:cnt 0}))
```

90K repos / 30 repos per second = 3000 seconds => ~50 minutes to fetch it all. (Edit: forgive me if my math is way off, it's late and I didn't double check anything. But 50 minutes sounds reasonable.) It sounds like a good idea to avoid re-analyzing everything. |
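The helpers referenced above (with-retries, search-repos-request, base-request, with-auth, rate-limit-sleep!) aren't shown in this thread; the real definitions live in the linked code. Purely as an assumption to make the snippet self-contained, they might look something like this (clj-http is assumed here, though dewey may use a different HTTP client):

```clojure
(require '[clj-http.client :as http])

(def base-request
  {:method :get
   :as :json
   :query-params {:per_page 100}})

(defn search-repos-request [q]
  (assoc base-request
         :url "https://api.github.com/search/repositories"
         :query-params (assoc (:query-params base-request)
                              :q q
                              :sort "updated"
                              :order "desc")))

(defn with-auth [req]
  ;; token read from the environment; a minimal-scope token suffices for search
  (assoc-in req [:headers "Authorization"]
            (str "token " (System/getenv "GITHUB_TOKEN"))))

(defn rate-limit-sleep! [last-response]
  ;; GitHub returns x-ratelimit-remaining / x-ratelimit-reset headers;
  ;; sleep until the reset time when the remaining budget runs out.
  (when-let [remaining (some-> last-response :headers (get "x-ratelimit-remaining") parse-long)]
    (when (zero? remaining)
      (let [reset-ms (* 1000 (parse-long (get-in last-response [:headers "x-ratelimit-reset"])))
            wait (- reset-ms (System/currentTimeMillis))]
        (when (pos? wait)
          (Thread/sleep wait))))))

(defn with-retries
  "Wrap a step function so transient failures are retried a few times."
  [f]
  (fn [k]
    (loop [attempt 1]
      (let [result (try
                     {:ok (f k)}
                     (catch Exception e
                       (if (< attempt 3)
                         ::retry
                         (throw e))))]
        (if (= ::retry result)
          (do (Thread/sleep (* attempt 1000))
              (recur (inc attempt)))
          (:ok result))))))
```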
Do you have an enterprise account? I thought the normal rate limiting was around 5k/hr. |
I suspect I wasn't being clear enough. I don't have an enterprise account. For me it was evaluating (def all-repos (vec (find-clojure-repos))) that took 50 minutes. Does that make sense? |
Does this method include GitHub's rate limiting? If not, I'm trying to figure out how it gets the data so quickly while staying under GitHub's rate limit. |
Yes, I believe it does. Are you sure that

Here is the exact code that's running:

I believe I created the token with as few permissions as possible. Maybe that makes a difference? Does that help? And a sample output from the console:
|
What is 5k per hour? Is that 5k requests? If there are 90 000 repos and each request fetches ~100, that only needs 900 requests (and finishes in 50 minutes here). |
Oh, got it. I read 30 repos per second as 30 requests per second for some reason. It all makes sense now 👍 |
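A quick sanity check of that arithmetic (assuming ~100 results per search page and the ~5k requests/hour figure mentioned above):

```clojure
;; Rough estimate only: how many search requests a full crawl needs.
(let [repos 90000
      per-page 100                    ;; GitHub search returns up to 100 items per page
      requests (long (Math/ceil (/ repos (double per-page))))
      rate-limit-per-hour 5000]
  {:requests requests                            ;; => 900
   :within-hourly-limit? (< requests rate-limit-per-hour)})  ;; => true
```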
Incremental pull of updated repos is now available at: https://github.com/ivarref/dewey/blob/main/src/com/phronemophobic/dewey.clj

You can evaluate

It's something like an hour since I ran the update, and for me the console now outputs:

Edit: I'm not 100% sure this method is bulletproof. Can you see any problems with it? It's sorting by

Sorry about the formatting diff in the commit. Regards. |
This is really great stuff. I'm pretty excited about getting it integrated. This will expand the dataset quite a bit, which might require some additional changes. Just brainstorming a bit:
Most of dewey uses |
Hi again, and thanks! I discovered a bug when continuing from a

I changed the data format to be newline-delimited edn. git-lfs is one option for storing large files, and it's recommended by GitHub. Any thoughts on that? Each line/entry in

BTW: My original goal when looking at this code was to create something similar to https://github.com/phronmophobic/add-deps, but for the CLI and tools.deps/gitlibs. |
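For reference, a minimal sketch of reading and writing newline-delimited EDN (one map per line); the function names here are illustrative, not the ones used in the linked commit:

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Write one EDN map per line; appending keeps incremental updates cheap.
(defn append-entries! [path entries]
  (with-open [w (io/writer path :append true)]
    (doseq [entry entries]
      (.write w (pr-str entry))
      (.write w "\n"))))

;; Read the file back: one edn/read-string per line.
(defn read-entries [path]
  (with-open [r (io/reader path)]
    (into [] (map edn/read-string) (line-seq r))))
```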
I think that only applies to objects in a git repository. Currently, data dumps are only being uploaded as part of releases where we probably won't have to worry about the limit. The static analysis edn is already approaching 1gb. I think their docs say there isn't a specified limit. I also have an s3 bucket that I've been using for temporary storage, but would be open to using it for public data if it makes sense.
The reasoning is based on practical experience building ETL pipelines. Since storage is so cheap, I find that the easiest way to stay sane is to have a dumb step that only fetches data and, separately, steps that process the data. The benefit is that if you decide you want to update your transform based on more info, you can rerun the transform step without re-fetching data. Having a "dumb" fetch process also reduces the risk of bugs that cause data loss. There are reasons to combine the steps, but I don't think we're at the scale of data where trying to optimize for efficiency is that helpful.
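As a toy illustration of that fetch/transform split, reusing the newline-delimited EDN helpers sketched above (names and field choices here are made up, not dewey's actual pipeline):

```clojure
;; Step 1: "dumb" fetch -- persist whatever the API returned, untouched.
(defn fetch-step! [raw-path]
  (append-entries! raw-path (map :body (find-clojure-repos))))

;; Step 2: transform -- derived views can be recomputed at any time
;; from the raw dump, without hitting the API again.
(defn transform-step [raw-path]
  (->> (read-entries raw-path)
       (mapcat :items)
       (map #(select-keys % [:full_name :pushed_at :stargazers_count]))))
```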
I do have a GUI version based on the latest clojure alpha, https://github.com/phronmophobic/add-deps/blob/main/src/com/phronemophobic/add_deps2.clj. I still think the most annoying part is figuring out the best way to make the data available. My latest brainstorm is to try hosting a read-only db on s3 using xtdb or datomic. |
Hi again @phronmophobic, thanks for your input. I totally agree with your comments on storage: it is indeed cheap. I think a person who wants to develop/modify dewey can be expected to download a release and continue at

If

PS: Catching up with the latest data is fast. I executed the following locally:

That's two weeks of Clojure git(hub) changes in 19 seconds.

PS 2: a more human and incremental view of

(I'm personally not very familiar with babashka, and it's my first time using

Edit: I'll be busy for some time now, so I may not have the time to respond in a very timely manner. Hope that you will make progress, and comment here as you like, and of course you may use

Regards. |
Hi @phronmophobic, and thanks for many great libraries and code!

I was able to get 4431 repositories with 2 stars using this code:

and you'll notice that here pushed_at is used for searching and sort-by updated is used. The search is done using pushed:<= ... so that identical updated timestamps are supported (this also introduces the need to remove items from the previous request, in order to not loop forever).

Is this something that you would like to integrate into dewey?

Thanks and kind regards!
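To make the looping concern concrete, here is a small stand-alone sketch of that de-duplication idea with made-up data rather than real API responses: each new page is filtered against the previous page's items, so re-querying with pushed:<= the last seen timestamp can't keep returning the same items forever.

```clojure
;; Toy illustration of the dedup step, not the code from the linked commit.
;; Filtering each page against the previous page's items prevents an infinite
;; loop when the query boundary (pushed:<=) re-returns items already seen.
(let [prev-page [{:full_name "a/a" :pushed_at "2022-08-01T10:00:00Z"}
                 {:full_name "b/b" :pushed_at "2022-08-01T10:00:00Z"}]
      next-page [{:full_name "b/b" :pushed_at "2022-08-01T10:00:00Z"}   ;; overlap with prev-page
                 {:full_name "c/c" :pushed_at "2022-07-30T09:00:00Z"}]
      prev-items (set prev-page)
      new-items (vec (remove prev-items next-page))]   ;; a set works as a predicate here
  new-items)
;; => [{:full_name "c/c", :pushed_at "2022-07-30T09:00:00Z"}]
```

This doesn't remove the theoretical edge case mentioned earlier in the thread: if 1000+ repositories share the exact same pushed_at, the search window still can't reach the items beyond the result limit.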