Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Adjust LinkContentFetcher run method, use ByteStream #5972

Merged
merged 13 commits into from
Oct 10, 2023

Conversation

vblagoje
Copy link
Member

@vblagoje vblagoje commented Oct 5, 2023

Why:

During web retriever pipeline setup and testing we realized a limitation in our LinkContentFetcher - it was only set up to fetch content from a single URL while our pipelines work with lists.

What:

We've revamped the LinkContentFetcher to now handle a list of URLs.

How can it be used:

The usage remains pretty much the same, but with the added flexibility. Now, instead of passing a single URL, you can pass a list of URLs to LinkContentFetcher

Testing:

We've beefed up our unit tests to cover this new functionality. Additionally, some hands-on manual testing was done to ensure that the fetcher behaves correctly with multiple URLs.

Notes For Reviewer:

Be gentle :-)

@vblagoje vblagoje requested a review from a team as a code owner October 5, 2023 06:58
@vblagoje vblagoje requested review from julian-risch and removed request for a team October 5, 2023 06:58
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Oct 5, 2023
@vblagoje vblagoje added the ignore-for-release-notes PRs with this flag won't be included in the release notes. label Oct 5, 2023
@github-actions github-actions bot added the topic:DX Developer Experience label Oct 5, 2023
@vblagoje vblagoje force-pushed the link_content_update branch from 949c428 to de23db6 Compare October 5, 2023 15:48
@github-actions github-actions bot removed the topic:DX Developer Experience label Oct 5, 2023
@vblagoje
Copy link
Member Author

vblagoje commented Oct 5, 2023

@julian-risch now after we sorted out #5933 this one should be good to review

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see this adapted to the new ByteStream data class. 👍 I have a couple of smaller change requests. We should also create release notes for this PR. Not only the return type changed but also the input is a list now and the fetching uses multithreading.

@vblagoje vblagoje requested a review from a team as a code owner October 6, 2023 12:38
@vblagoje vblagoje requested review from dfokina and removed request for a team October 6, 2023 12:38
@vblagoje
Copy link
Member Author

vblagoje commented Oct 6, 2023

Thanks for the review @julian-risch , please have another look

@vblagoje vblagoje added the 2.x Related to Haystack v2.0 label Oct 8, 2023
@vblagoje
Copy link
Member Author

vblagoje commented Oct 9, 2023

@julian-risch and @ZanSara a bit more context for you with these last few changes. During experiments with the actual web retriever pipeline, I realized that we needed to change the component to return the list of streams, where each stream has mime/content type in metadata rather than returning a dict of stream lists. Otherwise, this component would be similar to FileTypeRouter from here. That's why now you'll see changes in 65ef679

@julian-risch julian-risch removed the ignore-for-release-notes PRs with this flag won't be included in the release notes. label Oct 9, 2023
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good, just two change requests. The first one is a small change in many of the tests. Let's add checks for the correct content type value in metadata to all the tests.
Second, let's rename v2 in test_link_content_fetcher_multiple_different_content_types_v2 so that it better describes how the test differs from test_link_content_fetcher_multiple_different_content_types.

My understanding after also having a quick look at #5998 is that the reason for setting the content type in the metadata and having only a single output edge is that the splitting into multiple edges can already be handled by FileTypeRouter. Having a short usage example in the docstring where users can see that both components work well together would be great. If possible, an end-to-end test should test them in combination in a full pipeline when the PRs are merged.

assert "Introduction to Haystack" in document.text
assert document.metadata["url"] == HTML_URL
streams = fetcher.run([HTML_URL])["streams"]
assert "Haystack" in streams[0].data.decode("utf-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add assert stream.metadata["content_type"] == "text/html" here and for all other tests above and below with the corresponding content type.

@vblagoje
Copy link
Member Author

vblagoje commented Oct 9, 2023

Updated @julian-risch with 0cc9236 Note that imho it doesn't make sense to check content type for mocks so I didn't add additional content type checks everywhere

@julian-risch
Copy link
Member

Updated @julian-risch with 0cc9236 Note that imho it doesn't make sense to check content type for mocks so I didn't add additional content type checks everywhere

Still makes sense to me to do the checks because we're just mocking requests. The extraction of the content type from the response is could still be checked.

@julian-risch
Copy link
Member

julian-risch commented Oct 9, 2023

@vblagoje One more question that came up: If I have multiple input URLs, how do I know which ByteStream in the output corresponds to which URL? At the moment, I don't because there is no URL in the metadata, right?
What about storing the URL in the metadata too. Just as in the earlier version where you still had documents?

@vblagoje
Copy link
Member Author

vblagoje commented Oct 9, 2023

@vblagoje One more question that came up: If I have multiple input URLs, how do I know which ByteStream in the output corresponds to which URL? At the moment, I don't because there is no URL in the metadata, right? What about storing the URL in the metadata too. Just as in the earlier version where you still had documents?

Great idea, see dd56a64

@vblagoje
Copy link
Member Author

The logic in 9a69cc8 made the most sense to me. Perhaps we should return a list of successful and failed retrievals, wdyt?

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍

@vblagoje vblagoje merged commit 6a50123 into main Oct 10, 2023
20 checks passed
@vblagoje vblagoje deleted the link_content_update branch October 10, 2023 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.x Related to Haystack v2.0 topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants