
Initial attempt at straightforward document processing script. #731

Draft: wants to merge 1 commit into main


alexaryn (Contributor)

This is a proposal/example, not meant to be checked in.

The basic idea here is not only to bypass Ray but also to avoid the lazy-evaluated pipeline abstraction. Instead, the script is written the way a typical programmer would expect to write it. The approach is synchronous rather than functional, and it allows different documents to be treated differently on the fly.
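As a minimal sketch of what "synchronous, with per-document decisions" can look like, the following uses a hypothetical `Doc` dataclass as a stand-in for a Sycamore Document. The class, its fields, and the `process` function are assumptions made for illustration and are not taken from the PR's diff.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Sycamore Document (not the real class)."""
    path: str
    text: str
    properties: dict = field(default_factory=dict)


def process(docs: list[Doc]) -> list[Doc]:
    """Process documents eagerly, one at a time, in a plain loop."""
    out = []
    for doc in docs:
        # Each document can take a different path, decided at run time.
        if not doc.text.strip():
            continue  # drop empty documents right away
        if doc.path.endswith(".md"):
            doc.properties["kind"] = "markdown"
        else:
            doc.properties["kind"] = "plain"
        doc.properties["n_words"] = len(doc.text.split())
        out.append(doc)
    return out


docs = [
    Doc("a.md", "# Title\nbody text"),
    Doc("b.txt", "   "),
    Doc("c.txt", "one two three"),
]
result = process(docs)
```

Nothing here is deferred: every statement runs when it is reached, so ordinary debugging and branching work as usual.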

Instead of DocSet, we deal with a list of Document. DocSet confuses people because it's not a set of documents.

One finding is that most existing transforms would be easier to use as simple functions. Then they could be the target of "map", either directly or via DocSet.
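A rough illustration of transforms written as plain functions, using Python's built-in `map`. The `Doc` class and the two transforms (`normalize_whitespace`, `add_length`) are hypothetical and do not correspond to existing Sycamore transforms.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Sycamore Document (not the real class)."""
    text: str
    properties: dict = field(default_factory=dict)


def normalize_whitespace(doc: Doc) -> Doc:
    """Collapse runs of whitespace into single spaces."""
    return Doc(text=" ".join(doc.text.split()), properties=dict(doc.properties))


def add_length(doc: Doc) -> Doc:
    """Record the text length as a property."""
    return Doc(text=doc.text, properties={**doc.properties, "length": len(doc.text)})


docs = [Doc("hello   world"), Doc("  a\tb  ")]

# Each transform takes one document and returns one document,
# so it can be passed directly to map.
cleaned = list(map(add_length, map(normalize_whitespace, docs)))
```

Because each transform is an ordinary one-argument function, the same function could be called directly on a single document or handed to any map-style API.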

This code represents the exercise of simplifying without modifying Sycamore. The function iterInputs would be intended as an addition to the Sycamore library. There are FIXME comments for how Sycamore could become easier to use directly.

The remaining piece would be a way to encapsulate common processing sequences into higher-level single calls. We could do this generally, or just provide some off-the-shelf ones. This may turn into an exercise in naming.
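One possible shape for the "general" option is a small helper that composes single-document functions into one callable. This is only a sketch: `pipeline`, `strip_text`, and `count_words` are hypothetical names, and documents are modeled as plain dicts to keep the example short.

```python
from functools import reduce
from typing import Callable

Transform = Callable[[dict], dict]


def pipeline(*steps: Transform) -> Transform:
    """Combine several single-document steps into one callable."""
    def run(doc: dict) -> dict:
        # Feed the output of each step into the next, left to right.
        return reduce(lambda d, step: step(d), steps, doc)
    return run


def strip_text(doc: dict) -> dict:
    return {**doc, "text": doc["text"].strip()}


def count_words(doc: dict) -> dict:
    return {**doc, "n_words": len(doc["text"].split())}


# An "off-the-shelf" sequence is then just a named composition.
clean_and_count = pipeline(strip_text, count_words)
result = clean_and_count({"text": "  three short words  "})
```

With a helper like this, the off-the-shelf sequences reduce to choosing good names for particular compositions.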

eric-anderson (Collaborator) left a comment

In general, I prefer the local_mode (ctx = sycamore.init(exec_mode=ExecMode.LOCAL)) approach for three reasons:

  1. If you need to scale up, it's easy to switch it over to Ray mode.
  2. We can in the future add multiprocessing support to get more speed.
  3. It preserves all of the metadata, so the reliability work will be able to happen.
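The first two points are about keeping the processing code independent of how it is executed. A generic sketch of that idea, using only the standard library and not Sycamore's API, is shown here; `run` and `word_count` are hypothetical names.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Iterable, Optional


def run(items: Iterable[str], fn: Callable[[str], int],
        executor: Optional[Executor] = None) -> list[int]:
    """Apply fn to every item, serially or on the given executor."""
    if executor is None:
        return [fn(item) for item in items]
    # Executor.map returns results in input order.
    return list(executor.map(fn, items))


def word_count(text: str) -> int:
    return len(text.split())


texts = ["one two", "three", "four five six"]

serial = run(texts, word_count)
with ThreadPoolExecutor(max_workers=2) as pool:
    pooled = run(texts, word_count, executor=pool)
```

The per-document function is unchanged between the two calls; only the executor differs. A `ProcessPoolExecutor` could be passed in the same way, since `word_count` is a top-level function.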

That said, there is clearly a need for some rayless thing as people are starting to use local mode before it's really ready, and you ended up writing this example.


###############################################################################

def iterInputs(inputs: list[str], aws_sess = None) -> Iterator[BinaryIO]:
Collaborator


You can replace all of this with
docs = BinaryScan(paths=inputs).local_source()
once https://github.com/aryn-ai/sycamore/pull/712/files is in.
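The body of `iterInputs` is not shown in this thread. For context, a generator with a similar signature could look roughly like the sketch below. It is an assumption, not the PR's code: the name `iter_local_inputs` is hypothetical, it handles local file paths only, and it leaves out the `aws_sess` parameter entirely.

```python
import os
import tempfile
from typing import BinaryIO, Iterator


def iter_local_inputs(inputs: list[str]) -> Iterator[BinaryIO]:
    """Yield an open binary stream for each local path, one at a time."""
    for path in inputs:
        # The file is closed when the consumer asks for the next item.
        with open(path, "rb") as stream:
            yield stream


# Small demonstration using temporary files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, data in [("a.bin", b"alpha"), ("b.bin", b"beta")]:
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    contents = [stream.read() for stream in iter_local_inputs(paths)]
```

Only one file is open at any moment, so the caller can stream through a long list of inputs without holding them all in memory.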

jonfritz (Contributor) commented Aug 28, 2024 via email
