Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use crates.io API for search #1224

Closed
wants to merge 3 commits into from

Conversation

jyn514
Copy link
Member

@jyn514 jyn514 commented Dec 27, 2020

This is a much smaller maintenance burden and fixes a lot of our bugs.
TODO: the tests are broken and I don't know a reasonable way to fix them.

Closes #1101. Closes #789. Should fix #708, although I'm not sure how to test it.

@jyn514 jyn514 added the A-crate-search Area: A problem or feature request for searching crates label Dec 27, 2020
Copy link
Member

@syphar syphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a short check, some first thoughts in comments.

Sorry in advance for unnecessary questions or remarks, since new to the project and the language :)

If you want to can ignore the performance related questions, it was a long time since I worked in compiled languages, the comments only come from typical performance quickwins I always see in reviews.

While this improves #708 , we will get the issues open in the crates.io search (rust-lang/crates.io#654 , rust-lang/crates.io#310 , rust-lang/crates.io#2710 ) , but it's definitely better fixing them only in one place :)

@@ -531,7 +532,7 @@ pub fn search_handler(req: &mut Request) -> IronResult<Response> {
INNER JOIN releases
ON crates.latest_version_id = releases.id
WHERE github_stars >= 100 AND rustdoc_status = true
OFFSET FLOOR(RANDOM() * 280) LIMIT 1",
OFFSET FLOOR(RANDOM() * SELECT COUNT(*) FROM crates) LIMIT 1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can lead to an empty result:

  • in the WHERE you filter for crates with > 100 stars and having docs ,
  • the offset is based on RANDOM (between 0 and 1) and the count of all crates, so can be far higher (I assume) than the amount of crates with >100 stars.

to be 100% correct you would have to use the same joins and the same filters , but I guess some constant we divide by should be good enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, could I do something like ORDER BY random() LIMIT 1 instead? Then I don't have to worry about knowing the number of crates in advance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ORDER BY random() is surprisingly expensive.

                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=70591.52..70591.53 rows=1 width=36) (actual time=183.781..188.827 rows=1 loops=1)
   ->  Sort  (cost=70591.52..70599.05 rows=3009 width=36) (actual time=183.780..188.825 rows=1 loops=1)
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Gather  (cost=5108.21..70576.48 rows=3009 width=36) (actual time=74.115..186.504 rows=5641 loops=1)
               Workers Planned: 4
               Workers Launched: 4
               ->  Hash Join  (cost=4108.21..69268.06 rows=752 width=28) (actual time=53.510..173.996 rows=1128 loops=5)
                     Hash Cond: (releases.id = crates.latest_version_id)
                     ->  Hash Join  (cost=1022.38..66135.98 rows=4426 width=21) (actual time=7.829..121.941 rows=15972 loops=5)
                           Hash Cond: ((releases.github_repo)::text = (github_repos.id)::text)
                           ->  Parallel Seq Scan on releases  (cost=0.00..64951.29 rows=61822 width=53) (actual time=0.028..85.406 rows=49939 loops=5)
                                 Filter: rustdoc_status
                                 Rows Removed by Filter: 12299
                           ->  Hash  (cost=998.16..998.16 rows=1937 width=32) (actual time=7.653..7.653 rows=1945 loops=5)
                                 Buckets: 2048  Batches: 1  Memory Usage: 140kB
                                 ->  Seq Scan on github_repos  (cost=0.00..998.16 rows=1937 width=32) (actual time=0.023..6.983 rows=1945 loops=5)
                                       Filter: (stars >= 100)
                                       Rows Removed by Filter: 25116
                     ->  Hash  (cost=2430.37..2430.37 rows=52437 width=15) (actual time=42.731..42.732 rows=52641 loops=5)
                           Buckets: 65536  Batches: 1  Memory Usage: 3034kB
                           ->  Seq Scan on crates  (cost=0.00..2430.37 rows=52437 width=15) (actual time=0.024..22.419 rows=52641 loops=5)
 Planning time: 0.656 ms
 Execution time: 189.117 ms
(24 rows)

Copy link
Member Author

@jyn514 jyn514 Dec 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And for some reason a CTE is even more expensive!

cratesfyi=> explain analyze WITH random_crates AS (
  SELECT crates.name,
    releases.version,
    releases.target_name
  FROM crates
    INNER JOIN releases ON crates.latest_version_id = releases.id
    LEFT JOIN github_repos ON releases.github_repo = github_repos.id
  WHERE github_repos.stars >= 100 AND rustdoc_status = true
)
SELECT name, version, target_name FROM random_crates
OFFSET FLOOR(RANDOM() * (SELECT COUNT(*) FROM random_crates))
LIMIT 1;
                                                                     QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=70642.69..70642.71 rows=1 width=1250) (actual time=387.544..390.924 rows=1 loops=1)
   CTE random_crates
     ->  Gather  (cost=5108.21..70568.96 rows=3009 width=28) (actual time=66.232..382.994 rows=5641 loops=1)
           Workers Planned: 4
           Workers Launched: 4
           ->  Hash Join  (cost=4108.21..69268.06 rows=752 width=28) (actual time=89.080..366.300 rows=1128 loops=5)
                 Hash Cond: (releases.id = crates.latest_version_id)
                 ->  Hash Join  (cost=1022.38..66135.98 rows=4426 width=21) (actual time=16.025..286.586 rows=15972 loops=5)
                       Hash Cond: ((releases.github_repo)::text = (github_repos.id)::text)
                       ->  Parallel Seq Scan on releases  (cost=0.00..64951.29 rows=61822 width=53) (actual time=0.023..219.517 rows=49939 loops=5)
                             Filter: rustdoc_status
                             Rows Removed by Filter: 12300
                       ->  Hash  (cost=998.16..998.16 rows=1937 width=32) (actual time=15.846..15.846 rows=1945 loops=5)
                             Buckets: 2048  Batches: 1  Memory Usage: 140kB
                             ->  Seq Scan on github_repos  (cost=0.00..998.16 rows=1937 width=32) (actual time=0.027..15.137 rows=1945 loops=5)
                                   Filter: (stars >= 100)
                                   Rows Removed by Filter: 25116
                 ->  Hash  (cost=2430.37..2430.37 rows=52437 width=15) (actual time=72.454..72.455 rows=52641 loops=5)
                       Buckets: 65536  Batches: 1  Memory Usage: 3034kB
                       ->  Seq Scan on crates  (cost=0.00..2430.37 rows=52437 width=15) (actual time=0.024..35.014 rows=52641 loops=5)
   InitPlan 2 (returns $2)
     ->  Aggregate  (cost=67.70..67.71 rows=1 width=8) (actual time=386.610..386.611 rows=1 loops=1)
           ->  CTE Scan on random_crates random_crates_1  (cost=0.00..60.18 rows=3009 width=0) (actual time=66.235..385.329 rows=5641 loops=1)
   ->  CTE Scan on random_crates  (cost=0.00..60.18 rows=3009 width=1250) (actual time=0.001..0.650 rows=3640 loops=1)
 Planning time: 1.095 ms
 Execution time: 391.292 ms
(26 rows)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to fix this in a separate PR I think; currently it gives an internal error because I forgot parentheses.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds sensible to split this, yes :)

I also have other ideas how to do that query efficiently,

perhaps generate a random ID based on the current value in crates_id_seq and just fetching the first crate matching our criteria with an id greater than this. This is not a perfect random distribution but should be good enough.

@syphar
Copy link
Member

syphar commented Dec 27, 2020

I forgot, I also did a short manual test which seemed fine. Also results without release with docs are filtered.

I still need to check the template and see what should appear in which cases, but since we are returning the same struct, this should be fine.

Copy link
Member Author

@jyn514 jyn514 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this improves #708 , we will get the issues open in the crates.io search (rust-lang/crates.io#654 , rust-lang/crates.io#310 , rust-lang/crates.io#2710 ) , but it's definitely better fixing them only in one place :)

Thanks for looking these up :) rust-lang/crates.io#654 and rust-lang/crates.io#310 look related to the UI - we always look at max_version, not latest_version, so I wouldn't expect that to impact us. rust-lang/crates.io#2710 looks like a duplicate of rust-lang/crates.io#654.

This would really benefit from tests.

@@ -531,7 +532,7 @@ pub fn search_handler(req: &mut Request) -> IronResult<Response> {
INNER JOIN releases
ON crates.latest_version_id = releases.id
WHERE github_stars >= 100 AND rustdoc_status = true
OFFSET FLOOR(RANDOM() * 280) LIMIT 1",
OFFSET FLOOR(RANDOM() * SELECT COUNT(*) FROM crates) LIMIT 1",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, could I do something like ORDER BY random() LIMIT 1 instead? Then I don't have to worry about knowing the number of crates in advance.

@jyn514 jyn514 force-pushed the burn-search-with-fire branch from c2f371a to f029d0b Compare December 28, 2020 16:19
@syphar
Copy link
Member

syphar commented Dec 29, 2020

when this is done we could also include download-count, if we want: #222

@jyn514 jyn514 added the S-waiting-on-author Status: This PR is incomplete or needs to address review comments label Dec 29, 2020
jyn514 added 3 commits June 23, 2021 21:53
This is a much smaller maintenance burden and fixes a lot of our bugs.
TODO: fix tests
- Only make one queries, instead of one per each crate
- Remove confusing filter_map and Option<Result>
- Filter by crate name to reuse the database index
@jyn514 jyn514 force-pushed the burn-search-with-fire branch from f029d0b to 1b2fdd2 Compare June 24, 2021 01:54
@jyn514
Copy link
Member Author

jyn514 commented Jun 24, 2021

I just thought of a way to test this - we could use the staging crates.io instead and then publish all the crates we want to test with a stupid long prefix like docs_rs_testing_please_do_not_use or something so people don't use it. That still hits the network, but it should at least be deterministic, and it gives us advance warning if crates.io changes their API.

@syphar
Copy link
Member

syphar commented Jun 24, 2021

I just thought of a way to test this - we could use the staging crates.io instead :slight_smile: and then publish all the crates we want to test with a stupid long prefix like docs_rs_testing_please_do_not_use or something so people don't use it. That still hits the network, but it should at least be deterministic, and it gives us advance warning if crates.io changes their API.

For testing all the details like us handling certain responses coming from crates.io I think we should mock the HTTP calls (we have mockito in our requirements since the gitlab integration was added).

It feels to have tests rely on the staging database of another application. This would rather be integration tests, which we could have one or two of that just check if the API fails, but not all the details.

Especially forcing a status-code like 500 or 503 would be tricky on staging :)

Copy link
Member

@GuillaumeGomez GuillaumeGomez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super happy about the new HTTP request in-between; but that'll still allow us to focus on other things!

@syphar
Copy link
Member

syphar commented Sep 24, 2021

@jyn514 do you have any objections to me finishing this up?

It looks like the only thing missing are the tests, right?

@jyn514
Copy link
Member Author

jyn514 commented Sep 24, 2021

@syphar not at all, go for it! :)

@pietroalbini
Copy link
Member

With my crates.io hat, could you please change the pagination to use the meta.next_page key instead of manually providing ?page=? crates.io might switch how pagination works in the future to improve performance, and using meta.next_page would ensure the code won't break (calling /api/v1/crates without a search query already defaults to another pagination method).

For the implementation on the docs.rs side, we can probably use "?paginate=" + base64(meta.next_page) as the query string for the search page.

@jyn514
Copy link
Member Author

jyn514 commented Oct 27, 2021

Closing in favor of #1521.

@jyn514 jyn514 closed this Oct 27, 2021
@jyn514 jyn514 deleted the burn-search-with-fire branch November 13, 2021 13:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-crate-search Area: A problem or feature request for searching crates S-waiting-on-author Status: This PR is incomplete or needs to address review comments
Projects
None yet
5 participants