tp 02
Matt Jadud edited this page Dec 23, 2024 · 1 revision
The goal is to return relevant results.
However, the challenge is returning relevant results within the corpus given its size.
If NASA has 20,000 blog posts, and someone searches for the word "moon," their single query is going to hit thousands of posts. Let's say 3000.
We will return 10 results, and we want to return them quickly.
In this example, the searcher entered "moon" on the NASA website.
- Recency. It might be that a race of transforming robots was just discovered on the dark side of the moon. If this is the case, then a recent blog post is more relevant than a historical PDF about the Apollo moon landings.
- Current click-throughs. When other people searched for "moon," what did they ultimately click on? This may help guide relevance.
- WWW vs. PDF. Are PDFs more or less relevant than web pages? In theory, we're supposed to be supporting a digital-first experience with the WWW... so, how should we handle PDFs? Should they be... separate? Their own search? They tend to have high word density, and are therefore harder to rank in the same pool as WWW pages.
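The signals above could be combined in many ways. The following is a minimal sketch, not a settled design: the `Doc` fields, the 30-day half-life, the smoothing constants, and the equal weights are all assumptions made for illustration. It ranks web pages and PDFs in separate pools, which is only one possible answer to the PDF question.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    url: str
    published: date
    clicks: int        # times chosen after this query (hypothetical analytics)
    impressions: int   # times shown for this query (hypothetical analytics)
    is_pdf: bool


def recency_score(published: date, today: date, half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 for today, 0.5 after one half-life."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)


def click_through_rate(clicks: int, impressions: int) -> float:
    """Smoothed CTR, so rarely shown pages are not ranked on noise."""
    return (clicks + 1) / (impressions + 10)


def score(doc: Doc, today: date, w_recency: float = 0.5, w_clicks: float = 0.5) -> float:
    """Weighted sum of the two signals; the weights are placeholders."""
    return (w_recency * recency_score(doc.published, today)
            + w_clicks * click_through_rate(doc.clicks, doc.impressions))


def rank(docs: list[Doc], today: date, limit: int = 10) -> tuple[list[Doc], list[Doc]]:
    """Rank web pages and PDFs in separate pools, best first."""
    def by_score(doc: Doc) -> float:
        return score(doc, today)

    web = sorted((d for d in docs if not d.is_pdf), key=by_score, reverse=True)
    pdfs = sorted((d for d in docs if d.is_pdf), key=by_score, reverse=True)
    return web[:limit], pdfs[:limit]
```

Keeping the pools separate sidesteps the word-density problem at the cost of showing two result lists; merging them would need some normalization that this sketch does not attempt.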
If we have 20K posts, and 3K of them mention the moon, that means that it almost doesn't matter (absent more information) which posts we return if someone searches for the word "moon."
- Pre-compute the most common words. For a given corpus, we should generate the list of the 100 most common terms. We can do several things with that list.
- Re-order the terms. A common term on its own returns too many rows, so we could leave it out of the initial search. Instead, we can re-order the search so that uncommon terms narrow the space first, and then select from what remains.
- Subsample. Any search for a common word can 1) start with that subset of the corpus, and 2) apply probabilistic means (e.g. `TABLESAMPLE`) to reduce the space. If we're looking for 10 articles out of 3000, then quickly downsampling to 100-500 articles and ranking within that sample is as good as anything else (absent other techniques).
- ...
- Small world paradox. We are searching a small world. If we have an index for all of NASA, that doesn't necessarily interplay with the rest of the world. Therefore, we can't use strategies like Google's historic PageRank... because we don't know what points at these pages. It could be that there are 10 really excellent, highly-ranked pages at NASA... but we would never know.
- Scale. We're going to have a small footprint. We don't have the authority or budget to stand up huge clusters of machines. So, we have to optimize for the infrastructure and problem space. This is harder, in some ways, than being able to "just throw hardware at the problem." But, it should lead to better solutions.
- Lack of history. We don't have the analytics and history that we need to walk in the door and make things relevant for users. We can develop that analytical history, but it will take some time.
- Enlarge the world. We could pick 10-100 sites in the world, and use them as external indicators of relevance. That is, we could pick XYZ News Corp, Blogger Network of Space Awesome, and a handful of others, and crawl them as well. We could, in a word, create an external pool of sites that we watch, and use the links from those sites to help determine relevance. Is this... potentially biased? Of course. Could this be gamed? Of course. Could we use a rotating pool(s)? Yes. So, point being... there are ways to enlarge the world. It just takes time.
- Scale. We can't really get around this. We need to use the tools we have, and use them efficiently. Now, I do wonder if there's a way to use crowdsourced methods... that is, could we have a completely decentralized crawler? Where every agent fetches one page every day, but we have millions of agents? I suspect that would never clear ATO...
- Develop history. We'll want to focus our analytical efforts early on making sure we're collecting things that help us make the search better. This should overlap naturally with things people want to know anyway.
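As a rough illustration of the "enlarge the world" idea, the sketch below counts how many distinct watched sites link to each of our pages. It assumes a crawler has already produced a mapping from each external page to its outbound links; that input shape and the function name are hypothetical. Counting distinct referring sites rather than raw links is one simple way to blunt gaming by a single site, though it does nothing about bias in the choice of pool.

```python
from collections import defaultdict
from urllib.parse import urlparse


def external_link_scores(outbound_links: dict[str, list[str]],
                         our_domain: str) -> dict[str, int]:
    """Count the distinct watched sites that link to each of our pages.

    outbound_links maps a crawled external page URL to the URLs it links to.
    """
    referrers: dict[str, set[str]] = defaultdict(set)
    for source_url, targets in outbound_links.items():
        source_site = urlparse(source_url).netloc
        for target in targets:
            host = urlparse(target).netloc
            # Keep only links that land on our domain or its subdomains.
            if host == our_domain or host.endswith("." + our_domain):
                referrers[target].add(source_site)
    return {page: len(sites) for page, sites in referrers.items()}
```

A rotating pool could be handled by running this per pool and averaging the counts, but that is left out here.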