Use crates.io API for search #1224

jyn514 · 2020-12-27T20:18:27Z

This is a much smaller maintenance burden and fixes a lot of our bugs.
TODO: the tests are broken and I don't know a reasonable way to fix them.

Closes #1101. Closes #789. Should fix #708, although I'm not sure how to test it.

syphar

I did a short check, some first thoughts in comments.

Sorry in advance for unnecessary questions or remarks, since new to the project and the language :)

If you want to can ignore the performance related questions, it was a long time since I worked in compiled languages, the comments only come from typical performance quickwins I always see in reviews.

While this improves #708 , we will get the issues open in the crates.io search (rust-lang/crates.io#654 , rust-lang/crates.io#310 , rust-lang/crates.io#2710 ) , but it's definitely better fixing them only in one place :)

syphar · 2020-12-27T21:17:47Z

src/web/releases.rs

@@ -531,7 +532,7 @@ pub fn search_handler(req: &mut Request) -> IronResult<Response> {
                     INNER JOIN releases
                         ON crates.latest_version_id = releases.id
                     WHERE github_stars >= 100 AND rustdoc_status = true
-                     OFFSET FLOOR(RANDOM() * 280) LIMIT 1",
+                     OFFSET FLOOR(RANDOM() * SELECT COUNT(*) FROM crates) LIMIT 1",


This can lead to an empty result:

in the WHERE you filter for crates with > 100 stars and having docs ,

the offset is based on RANDOM (between 0 and 1) and the count of all crates, so can be far higher (I assume) than the amount of crates with >100 stars.

to be 100% correct you would have to use the same joins and the same filters , but I guess some constant we divide by should be good enough.

Hmm, could I do something like ORDER BY random() LIMIT 1 instead? Then I don't have to worry about knowing the number of crates in advance.

ORDER BY random() is surprisingly expensive.

QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=70591.52..70591.53 rows=1 width=36) (actual time=183.781..188.827 rows=1 loops=1) -> Sort (cost=70591.52..70599.05 rows=3009 width=36) (actual time=183.780..188.825 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB -> Gather (cost=5108.21..70576.48 rows=3009 width=36) (actual time=74.115..186.504 rows=5641 loops=1) Workers Planned: 4 Workers Launched: 4 -> Hash Join (cost=4108.21..69268.06 rows=752 width=28) (actual time=53.510..173.996 rows=1128 loops=5) Hash Cond: (releases.id = crates.latest_version_id) -> Hash Join (cost=1022.38..66135.98 rows=4426 width=21) (actual time=7.829..121.941 rows=15972 loops=5) Hash Cond: ((releases.github_repo)::text = (github_repos.id)::text) -> Parallel Seq Scan on releases (cost=0.00..64951.29 rows=61822 width=53) (actual time=0.028..85.406 rows=49939 loops=5) Filter: rustdoc_status Rows Removed by Filter: 12299 -> Hash (cost=998.16..998.16 rows=1937 width=32) (actual time=7.653..7.653 rows=1945 loops=5) Buckets: 2048 Batches: 1 Memory Usage: 140kB -> Seq Scan on github_repos (cost=0.00..998.16 rows=1937 width=32) (actual time=0.023..6.983 rows=1945 loops=5) Filter: (stars >= 100) Rows Removed by Filter: 25116 -> Hash (cost=2430.37..2430.37 rows=52437 width=15) (actual time=42.731..42.732 rows=52641 loops=5) Buckets: 65536 Batches: 1 Memory Usage: 3034kB -> Seq Scan on crates (cost=0.00..2430.37 rows=52437 width=15) (actual time=0.024..22.419 rows=52641 loops=5) Planning time: 0.656 ms Execution time: 189.117 ms (24 rows)

And for some reason a CTE is even more expensive!

cratesfyi=> explain analyze WITH random_crates AS ( SELECT crates.name, releases.version, releases.target_name FROM crates INNER JOIN releases ON crates.latest_version_id = releases.id LEFT JOIN github_repos ON releases.github_repo = github_repos.id WHERE github_repos.stars >= 100 AND rustdoc_status = true ) SELECT name, version, target_name FROM random_crates OFFSET FLOOR(RANDOM() * (SELECT COUNT(*) FROM random_crates)) LIMIT 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=70642.69..70642.71 rows=1 width=1250) (actual time=387.544..390.924 rows=1 loops=1) CTE random_crates -> Gather (cost=5108.21..70568.96 rows=3009 width=28) (actual time=66.232..382.994 rows=5641 loops=1) Workers Planned: 4 Workers Launched: 4 -> Hash Join (cost=4108.21..69268.06 rows=752 width=28) (actual time=89.080..366.300 rows=1128 loops=5) Hash Cond: (releases.id = crates.latest_version_id) -> Hash Join (cost=1022.38..66135.98 rows=4426 width=21) (actual time=16.025..286.586 rows=15972 loops=5) Hash Cond: ((releases.github_repo)::text = (github_repos.id)::text) -> Parallel Seq Scan on releases (cost=0.00..64951.29 rows=61822 width=53) (actual time=0.023..219.517 rows=49939 loops=5) Filter: rustdoc_status Rows Removed by Filter: 12300 -> Hash (cost=998.16..998.16 rows=1937 width=32) (actual time=15.846..15.846 rows=1945 loops=5) Buckets: 2048 Batches: 1 Memory Usage: 140kB -> Seq Scan on github_repos (cost=0.00..998.16 rows=1937 width=32) (actual time=0.027..15.137 rows=1945 loops=5) Filter: (stars >= 100) Rows Removed by Filter: 25116 -> Hash (cost=2430.37..2430.37 rows=52437 width=15) (actual time=72.454..72.455 rows=52641 loops=5) Buckets: 65536 Batches: 1 Memory Usage: 3034kB -> Seq Scan on crates (cost=0.00..2430.37 rows=52437 width=15) (actual time=0.024..35.014 rows=52641 loops=5) InitPlan 2 (returns $2) -> Aggregate (cost=67.70..67.71 rows=1 width=8) (actual time=386.610..386.611 rows=1 loops=1) -> CTE Scan on random_crates random_crates_1 (cost=0.00..60.18 rows=3009 width=0) (actual time=66.235..385.329 rows=5641 loops=1) -> CTE Scan on random_crates (cost=0.00..60.18 rows=3009 width=1250) (actual time=0.001..0.650 rows=3640 loops=1) Planning time: 1.095 ms Execution time: 391.292 ms (26 rows)

I'm going to fix this in a separate PR I think; currently it gives an internal error because I forgot parentheses.

sounds sensible to split this, yes :)

I also have other ideas how to do that query efficiently,

perhaps generate a random ID based on the current value in crates_id_seq and just fetching the first crate matching our criteria with an id greater than this. This is not a perfect random distribution but should be good enough.

src/web/releases.rs

syphar · 2020-12-27T22:18:18Z

I forgot, I also did a short manual test which seemed fine. Also results without release with docs are filtered.

I still need to check the template and see what should appear in which cases, but since we are returning the same struct, this should be fine.

jyn514

While this improves #708 , we will get the issues open in the crates.io search (rust-lang/crates.io#654 , rust-lang/crates.io#310 , rust-lang/crates.io#2710 ) , but it's definitely better fixing them only in one place :)

Thanks for looking these up :) rust-lang/crates.io#654 and rust-lang/crates.io#310 look related to the UI - we always look at max_version, not latest_version, so I wouldn't expect that to impact us. rust-lang/crates.io#2710 looks like a duplicate of rust-lang/crates.io#654.

This would really benefit from tests.

src/web/releases.rs

jyn514 · 2020-12-27T23:20:19Z

src/web/releases.rs

@@ -531,7 +532,7 @@ pub fn search_handler(req: &mut Request) -> IronResult<Response> {
                     INNER JOIN releases
                         ON crates.latest_version_id = releases.id
                     WHERE github_stars >= 100 AND rustdoc_status = true
-                     OFFSET FLOOR(RANDOM() * 280) LIMIT 1",
+                     OFFSET FLOOR(RANDOM() * SELECT COUNT(*) FROM crates) LIMIT 1",


Hmm, could I do something like ORDER BY random() LIMIT 1 instead? Then I don't have to worry about knowing the number of crates in advance.

syphar · 2020-12-29T12:54:58Z

when this is done we could also include download-count, if we want: #222

This is a much smaller maintenance burden and fixes a lot of our bugs. TODO: fix tests

- Only make one queries, instead of one per each crate - Remove confusing filter_map and Option<Result> - Filter by crate name to reuse the database index

jyn514 · 2021-06-24T03:16:03Z

I just thought of a way to test this - we could use the staging crates.io instead and then publish all the crates we want to test with a stupid long prefix like docs_rs_testing_please_do_not_use or something so people don't use it. That still hits the network, but it should at least be deterministic, and it gives us advance warning if crates.io changes their API.

syphar · 2021-06-24T06:17:03Z

I just thought of a way to test this - we could use the staging crates.io instead :slight_smile: and then publish all the crates we want to test with a stupid long prefix like docs_rs_testing_please_do_not_use or something so people don't use it. That still hits the network, but it should at least be deterministic, and it gives us advance warning if crates.io changes their API.

For testing all the details like us handling certain responses coming from crates.io I think we should mock the HTTP calls (we have mockito in our requirements since the gitlab integration was added).

It feels to have tests rely on the staging database of another application. This would rather be integration tests, which we could have one or two of that just check if the API fails, but not all the details.

Especially forcing a status-code like 500 or 503 would be tricky on staging :)

GuillaumeGomez

I'm not super happy about the new HTTP request in-between; but that'll still allow us to focus on other things!

syphar · 2021-09-24T12:37:51Z

@jyn514 do you have any objections to me finishing this up?

It looks like the only thing missing are the tests, right?

jyn514 · 2021-09-24T19:43:03Z

@syphar not at all, go for it! :)

pietroalbini · 2021-09-25T10:13:16Z

With my crates.io hat, could you please change the pagination to use the meta.next_page key instead of manually providing ?page=? crates.io might switch how pagination works in the future to improve performance, and using meta.next_page would ensure the code won't break (calling /api/v1/crates without a search query already defaults to another pagination method).

For the implementation on the docs.rs side, we can probably use "?paginate=" + base64(meta.next_page) as the query string for the search page.

jyn514 · 2021-10-27T10:05:14Z

Closing in favor of #1521.

jyn514 added the A-crate-search Area: A problem or feature request for searching crates label Dec 27, 2020

syphar requested changes Dec 27, 2020

View reviewed changes

jyn514 commented Dec 27, 2020

View reviewed changes

jyn514 force-pushed the burn-search-with-fire branch from c2f371a to f029d0b Compare December 28, 2020 16:19

jyn514 added the S-waiting-on-author Status: This PR is incomplete or needs to address review comments label Dec 29, 2020

jyn514 added 3 commits June 23, 2021 21:53

Shell out to crates.io for search

f7f4fb4

This is a much smaller maintenance burden and fixes a lot of our bugs. TODO: fix tests

Url-encode parameters properly

7b7dd55

Simplify and speed up database queries

1b2fdd2

- Only make one queries, instead of one per each crate - Remove confusing filter_map and Option<Result> - Filter by crate name to reuse the database index

jyn514 force-pushed the burn-search-with-fire branch from f029d0b to 1b2fdd2 Compare June 24, 2021 01:54

GuillaumeGomez approved these changes Sep 12, 2021

View reviewed changes

jyn514 mentioned this pull request Oct 8, 2021

http://docs.rs/releases/search (without a search query) gives a 500 internal error #1518

Closed

syphar mentioned this pull request Oct 16, 2021

use crates.io search to search crates #1521

Merged

jyn514 closed this Oct 27, 2021

jyn514 deleted the burn-search-with-fire branch November 13, 2021 13:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use crates.io API for search #1224

Use crates.io API for search #1224

jyn514 commented Dec 27, 2020

syphar left a comment

syphar Dec 27, 2020

jyn514 Dec 27, 2020

jyn514 Dec 27, 2020

jyn514 Dec 27, 2020 •

edited

Loading

jyn514 Dec 27, 2020

syphar Dec 28, 2020

syphar commented Dec 27, 2020

jyn514 left a comment

jyn514 Dec 27, 2020

syphar commented Dec 29, 2020

jyn514 commented Jun 24, 2021 •

edited

Loading

syphar commented Jun 24, 2021

GuillaumeGomez left a comment

syphar commented Sep 24, 2021

jyn514 commented Sep 24, 2021

pietroalbini commented Sep 25, 2021

jyn514 commented Oct 27, 2021

Use crates.io API for search #1224

Use crates.io API for search #1224

Conversation

jyn514 commented Dec 27, 2020

syphar left a comment

Choose a reason for hiding this comment

syphar Dec 27, 2020

Choose a reason for hiding this comment

jyn514 Dec 27, 2020

Choose a reason for hiding this comment

jyn514 Dec 27, 2020

Choose a reason for hiding this comment

jyn514 Dec 27, 2020 • edited Loading

Choose a reason for hiding this comment

jyn514 Dec 27, 2020

Choose a reason for hiding this comment

syphar Dec 28, 2020

Choose a reason for hiding this comment

syphar commented Dec 27, 2020

jyn514 left a comment

Choose a reason for hiding this comment

jyn514 Dec 27, 2020

Choose a reason for hiding this comment

syphar commented Dec 29, 2020

jyn514 commented Jun 24, 2021 • edited Loading

syphar commented Jun 24, 2021

GuillaumeGomez left a comment

Choose a reason for hiding this comment

syphar commented Sep 24, 2021

jyn514 commented Sep 24, 2021

pietroalbini commented Sep 25, 2021

jyn514 commented Oct 27, 2021

jyn514 Dec 27, 2020 •

edited

Loading

jyn514 commented Jun 24, 2021 •

edited

Loading