-
Notifications
You must be signed in to change notification settings - Fork 206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use crates.io API for search #1224
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a short check, some first thoughts in comments.
Sorry in advance for unnecessary questions or remarks, since new to the project and the language :)
If you want to can ignore the performance related questions, it was a long time since I worked in compiled languages, the comments only come from typical performance quickwins I always see in reviews.
While this improves #708 , we will get the issues open in the crates.io search (rust-lang/crates.io#654 , rust-lang/crates.io#310 , rust-lang/crates.io#2710 ) , but it's definitely better fixing them only in one place :)
src/web/releases.rs
Outdated
@@ -531,7 +532,7 @@ pub fn search_handler(req: &mut Request) -> IronResult<Response> { | |||
INNER JOIN releases | |||
ON crates.latest_version_id = releases.id | |||
WHERE github_stars >= 100 AND rustdoc_status = true | |||
OFFSET FLOOR(RANDOM() * 280) LIMIT 1", | |||
OFFSET FLOOR(RANDOM() * SELECT COUNT(*) FROM crates) LIMIT 1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can lead to an empty result:
- in the
WHERE
you filter for crates with > 100 stars and having docs , - the offset is based on
RANDOM
(between 0 and 1) and the count of all crates, so can be far higher (I assume) than the amount of crates with >100 stars.
to be 100% correct you would have to use the same joins and the same filters , but I guess some constant we divide by should be good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, could I do something like ORDER BY random() LIMIT 1
instead? Then I don't have to worry about knowing the number of crates in advance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ORDER BY random() is surprisingly expensive.
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=70591.52..70591.53 rows=1 width=36) (actual time=183.781..188.827 rows=1 loops=1)
-> Sort (cost=70591.52..70599.05 rows=3009 width=36) (actual time=183.780..188.825 rows=1 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 25kB
-> Gather (cost=5108.21..70576.48 rows=3009 width=36) (actual time=74.115..186.504 rows=5641 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Hash Join (cost=4108.21..69268.06 rows=752 width=28) (actual time=53.510..173.996 rows=1128 loops=5)
Hash Cond: (releases.id = crates.latest_version_id)
-> Hash Join (cost=1022.38..66135.98 rows=4426 width=21) (actual time=7.829..121.941 rows=15972 loops=5)
Hash Cond: ((releases.github_repo)::text = (github_repos.id)::text)
-> Parallel Seq Scan on releases (cost=0.00..64951.29 rows=61822 width=53) (actual time=0.028..85.406 rows=49939 loops=5)
Filter: rustdoc_status
Rows Removed by Filter: 12299
-> Hash (cost=998.16..998.16 rows=1937 width=32) (actual time=7.653..7.653 rows=1945 loops=5)
Buckets: 2048 Batches: 1 Memory Usage: 140kB
-> Seq Scan on github_repos (cost=0.00..998.16 rows=1937 width=32) (actual time=0.023..6.983 rows=1945 loops=5)
Filter: (stars >= 100)
Rows Removed by Filter: 25116
-> Hash (cost=2430.37..2430.37 rows=52437 width=15) (actual time=42.731..42.732 rows=52641 loops=5)
Buckets: 65536 Batches: 1 Memory Usage: 3034kB
-> Seq Scan on crates (cost=0.00..2430.37 rows=52437 width=15) (actual time=0.024..22.419 rows=52641 loops=5)
Planning time: 0.656 ms
Execution time: 189.117 ms
(24 rows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And for some reason a CTE is even more expensive!
cratesfyi=> explain analyze WITH random_crates AS (
SELECT crates.name,
releases.version,
releases.target_name
FROM crates
INNER JOIN releases ON crates.latest_version_id = releases.id
LEFT JOIN github_repos ON releases.github_repo = github_repos.id
WHERE github_repos.stars >= 100 AND rustdoc_status = true
)
SELECT name, version, target_name FROM random_crates
OFFSET FLOOR(RANDOM() * (SELECT COUNT(*) FROM random_crates))
LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=70642.69..70642.71 rows=1 width=1250) (actual time=387.544..390.924 rows=1 loops=1)
CTE random_crates
-> Gather (cost=5108.21..70568.96 rows=3009 width=28) (actual time=66.232..382.994 rows=5641 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Hash Join (cost=4108.21..69268.06 rows=752 width=28) (actual time=89.080..366.300 rows=1128 loops=5)
Hash Cond: (releases.id = crates.latest_version_id)
-> Hash Join (cost=1022.38..66135.98 rows=4426 width=21) (actual time=16.025..286.586 rows=15972 loops=5)
Hash Cond: ((releases.github_repo)::text = (github_repos.id)::text)
-> Parallel Seq Scan on releases (cost=0.00..64951.29 rows=61822 width=53) (actual time=0.023..219.517 rows=49939 loops=5)
Filter: rustdoc_status
Rows Removed by Filter: 12300
-> Hash (cost=998.16..998.16 rows=1937 width=32) (actual time=15.846..15.846 rows=1945 loops=5)
Buckets: 2048 Batches: 1 Memory Usage: 140kB
-> Seq Scan on github_repos (cost=0.00..998.16 rows=1937 width=32) (actual time=0.027..15.137 rows=1945 loops=5)
Filter: (stars >= 100)
Rows Removed by Filter: 25116
-> Hash (cost=2430.37..2430.37 rows=52437 width=15) (actual time=72.454..72.455 rows=52641 loops=5)
Buckets: 65536 Batches: 1 Memory Usage: 3034kB
-> Seq Scan on crates (cost=0.00..2430.37 rows=52437 width=15) (actual time=0.024..35.014 rows=52641 loops=5)
InitPlan 2 (returns $2)
-> Aggregate (cost=67.70..67.71 rows=1 width=8) (actual time=386.610..386.611 rows=1 loops=1)
-> CTE Scan on random_crates random_crates_1 (cost=0.00..60.18 rows=3009 width=0) (actual time=66.235..385.329 rows=5641 loops=1)
-> CTE Scan on random_crates (cost=0.00..60.18 rows=3009 width=1250) (actual time=0.001..0.650 rows=3640 loops=1)
Planning time: 1.095 ms
Execution time: 391.292 ms
(26 rows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to fix this in a separate PR I think; currently it gives an internal error because I forgot parentheses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds sensible to split this, yes :)
I also have other ideas how to do that query efficiently,
perhaps generate a random ID based on the current value in crates_id_seq
and just fetching the first crate matching our criteria with an id greater than this. This is not a perfect random distribution but should be good enough.
I forgot, I also did a short manual test which seemed fine. Also results without release with docs are filtered. I still need to check the template and see what should appear in which cases, but since we are returning the same struct, this should be fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this improves #708 , we will get the issues open in the crates.io search (rust-lang/crates.io#654 , rust-lang/crates.io#310 , rust-lang/crates.io#2710 ) , but it's definitely better fixing them only in one place :)
Thanks for looking these up :) rust-lang/crates.io#654 and rust-lang/crates.io#310 look related to the UI - we always look at max_version
, not latest_version
, so I wouldn't expect that to impact us. rust-lang/crates.io#2710 looks like a duplicate of rust-lang/crates.io#654.
This would really benefit from tests.
src/web/releases.rs
Outdated
@@ -531,7 +532,7 @@ pub fn search_handler(req: &mut Request) -> IronResult<Response> { | |||
INNER JOIN releases | |||
ON crates.latest_version_id = releases.id | |||
WHERE github_stars >= 100 AND rustdoc_status = true | |||
OFFSET FLOOR(RANDOM() * 280) LIMIT 1", | |||
OFFSET FLOOR(RANDOM() * SELECT COUNT(*) FROM crates) LIMIT 1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, could I do something like ORDER BY random() LIMIT 1
instead? Then I don't have to worry about knowing the number of crates in advance.
c2f371a
to
f029d0b
Compare
when this is done we could also include download-count, if we want: #222 |
This is a much smaller maintenance burden and fixes a lot of our bugs. TODO: fix tests
- Only make one queries, instead of one per each crate - Remove confusing filter_map and Option<Result> - Filter by crate name to reuse the database index
f029d0b
to
1b2fdd2
Compare
I just thought of a way to test this - we could use the staging crates.io instead and then publish all the crates we want to test with a stupid long prefix like |
For testing all the details like us handling certain responses coming from crates.io I think we should mock the HTTP calls (we have It feels to have tests rely on the staging database of another application. This would rather be integration tests, which we could have one or two of that just check if the API fails, but not all the details. Especially forcing a status-code like 500 or 503 would be tricky on staging :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super happy about the new HTTP request in-between; but that'll still allow us to focus on other things!
@jyn514 do you have any objections to me finishing this up? It looks like the only thing missing are the tests, right? |
@syphar not at all, go for it! :) |
With my crates.io hat, could you please change the pagination to use the For the implementation on the docs.rs side, we can probably use |
Closing in favor of #1521. |
This is a much smaller maintenance burden and fixes a lot of our bugs.
TODO: the tests are broken and I don't know a reasonable way to fix them.
Closes #1101. Closes #789. Should fix #708, although I'm not sure how to test it.