Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support bytestream in Unstructured API #1082

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

alperkaya
Copy link
Contributor

Related Issues

Proposed Changes:

Unstructured API can also be called with Bytestream now in addition of Path.

How did you test it?

Added extra unit tests

Notes for the reviewer

Checklist

@alperkaya alperkaya requested a review from a team as a code owner September 12, 2024 12:46
@alperkaya alperkaya requested review from silvanocerza and removed request for a team September 12, 2024 12:46
@CLAassistant
Copy link

CLAassistant commented Sep 12, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added integration:unstructured-fileconverter type:documentation Improvements or additions to documentation labels Sep 12, 2024
@alperkaya alperkaya changed the title Support bytestream in Unstructured API feat: Support bytestream in Unstructured API Sep 12, 2024
@alperkaya
Copy link
Contributor Author

@vblagoje , I try to address your request in this PR.

@silvanocerza
Copy link
Contributor

@alperkaya can you please sign the CLA? Otherwise we can't merge this. :)

@alperkaya
Copy link
Contributor Author

@alperkaya can you please sign the CLA? Otherwise we can't merge this. :)

done ;)

for filepath, metadata in tqdm(
zip(all_filepaths, meta_list), desc="Converting files to Haystack Documents", disable=not self.progress_bar
zip(all_filepaths, meta_list[:len(all_filepaths)]), desc="Converting files to Haystack Documents"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong.

If you have a combination of paths and ByteStream in your sources and a list for meta you risk assigning the wrong meta to the wrong Document.

You can verify with something like this:

from pathlib import Path

from haystack.dataclasses import ByteStream
from haystack_integrations.components.converters.unstructured import (
    UnstructuredFileConverter,
)

converter = UnstructuredFileConverter()

sources = [
    "README.md",
    ByteStream(data=b"content", meta={"file_path": "some_file.md"}),
    Path(__file__),
    ByteStream(data=b"content", meta={"file_path": "yet_another_file.md"}),
    ByteStream(data=b"content", meta={"file_path": "my_file.md"}),
]

meta = [
    {"type": "str"},
    {"type": "ByteStream"},
    {"type": "Path"},
    {"type": "ByteStream"},
    {"type": "ByteStream"},
]

res = converter.run(sources=sources, meta=meta)

Also I notice that some meta fields are completely lost.

Copy link
Contributor

@silvanocerza silvanocerza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation has quite some problems, I would suggest taking a look at the existing converters we have in core.
https://github.com/deepset-ai/haystack/tree/main/haystack/components/converters

@alperkaya
Copy link
Contributor Author

Hi @silvanocerza,

I try to address your comment by checking the existing converters.

This version covers these cases without losing meta fields.

Case 1: Files with Meta as None
Case 2: ByteStreams with Meta as None
Case 3: Files with Meta as a Dictionary
Case 4: ByteStreams with Meta as a Dictionary
Case 5: Files with Meta as a List
Case 6: ByteStreams with Meta as a List
Case 7: Directory with Meta as a Dictionary
Case 8: Directory with Meta as a List (Should Fail)
Case 9: Combination of File Paths, ByteStreams, and Directory with Meta as a Dictionary

@silvanocerza
Copy link
Contributor

This is still not working as expected, I strongly suggest you copy the implementation from core.

@alperkaya
Copy link
Contributor Author

This is still not working as expected, I strongly suggest you copy the implementation from core.

Hi, in the core repo, the solution handles either file paths or bytestreams. However, in this PR, I’m managing not just file paths and bytestreams, but also directories, where I need to fetch all files without entering subfolders. Given these additional requirements, could we explore how to modify the core solution to meet this use case, or would an alternative approach be better suited here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration:unstructured-fileconverter type:documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unstructured: support ByteStream input in run method
3 participants