Dseq for a range of integers / custom input splits from clojure code #9
It should be possible to take any existing […] Simplifying writing new […] I have been meaning to add an […] I'd like to stew for a bit on some ideas here. If you'd like to participate, we can use this ticket for design discussion.
Thanks, that helps in terms of understanding what dseqs are and aren't trying to do. Sorry for the possibly-slightly-silly question -- I guess it was bad luck that the first thing I tried to do struck on a complicated use case. I guess most people will be using Hadoop data as input, and hence an existing InputFormat, so I'm the odd one out here -- I want to use programmatically-generated values as input to my mappers, which can be generated lazily. Generating those values eagerly and serializing them as EDN would work, although perhaps not ideally if you're serializing a massive list of numbers which could be generated programmatically. I wonder if it'd be possible to have a way to create a dseq from a var which maps to a lazy seq of lazy seqs for the splits:
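[The code example here was lost in extraction. A plausible sketch of what was meant, where `dseq-from-var` is a hypothetical constructor and not an existing Parkour function:]

```clojure
;; Sketch only: `dseq-from-var` is hypothetical, not part of Parkour.
;; The var's value is a lazy seq of lazy seqs, one inner seq per split.
(def int-splits
  (map (fn [lo] (range lo (+ lo 1000)))
       (range 0 100000 1000)))

;; imagined usage:
;; (dseq-from-var #'int-splits)
```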
Although I guess you'd have to be a little careful about things like chunking in the lazy sequences, and perhaps careful to avoid sequential dependencies if you want to allow each split to be generated independently of the others. E.g. the following might be better than the example above:
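[The original example here was also stripped; a plausible sketch in the same spirit, deriving each split from its index alone so that realizing one split never forces another:]

```clojure
;; Each split is computed independently from its index `i`, avoiding
;; any sequential dependency between splits.
(def int-splits
  (for [i (range 100)]
    (lazy-seq (range (* i 1000) (* (inc i) 1000)))))
```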
Perhaps requiring a function which returns a vector of lazy sequences would make the requirement for laziness more explicit:
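[Again, the original block was stripped; a sketch of the idea: returning a vector makes the number of splits eager and explicit, while each split's contents stay lazy:]

```clojure
;; A vector of lazy seqs: the split count is realized up front,
;; but each split's elements are generated lazily on demand.
(defn int-splits []
  (mapv (fn [i] (lazy-seq (range (* i 1000) (* (inc i) 1000))))
        (range 100)))
```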
What I'm missing here is the details of when / where / how these InputFormat and InputSplit instances are serialized and passed around when a job runs on multiple nodes. It sounds like the InputSplits need to be serializable somehow (despite that not being part of the InputSplit interface). I guess that's the difficulty here and why you were thinking about:
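[For context on the serialization point: in the newer `org.apache.hadoop.mapreduce` API, `InputSplit` is an abstract class, and the default job-split serialization goes through `WritableSerialization`, so custom splits typically also implement `Writable`. A minimal illustrative sketch, with hypothetical names; real code would need an AOT-compiled class with a no-arg constructor, since Hadoop instantiates splits reflectively on the task side:]

```clojure
;; Sketch: a split carrying an integer range [lo, hi).
;; `proxy` is illustrative only -- see caveat about AOT above.
(defn int-range-split [lo hi]
  (let [state (atom [lo hi])]
    (proxy [org.apache.hadoop.mapreduce.InputSplit
            org.apache.hadoop.io.Writable] []
      (getLength [] (let [[lo hi] @state] (long (- hi lo))))
      (getLocations [] (into-array String []))
      (write [out] (let [[lo hi] @state]
                     (.writeLong out lo)
                     (.writeLong out hi)))
      (readFields [in] (reset! state [(.readLong in) (.readLong in)])))))
```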
Perhaps best if I leave this to the better informed :)
I think you're on the right track. Here's my breakdown (partially to clarify my own thoughts):
In the actual implementation in Hadoop, there are a few complications:
I believe we can simplify this significantly, via the following:
The one tricky thing left is the record readers. They're just inherently mutable, in a way which doesn't work well with Clojure idioms, and driven in all their mutability by code deep in Hadoop MapReduce. I'm going to mull on them a bit longer, and see if I can't think of a way to abstract them a bit more nicely from the Clojure perspective.
I've thought about it for a bit, and I don't see a way of completely hiding […]
Design changed slightly when faced with actual implementation, but check out PR #10 for my proposed implementation of the basic interface and some examples.
Ooh nice, thanks!
Parkour 0.6.2 is now released and includes the basic interface mentioned. This basic interface does not include providing access to the […]
I was looking for a way to create a dseq for a range of integers, which can be split amongst the mappers.
So far the closest I found was `parkour.io.mem/dseq`, but this only works in local mode. Is there a way to implement custom dseqs with custom logic for input splits from Clojure code? Perhaps I need to implement (and AOT-compile) a custom InputFormat class for this kind of thing, and implement a cstep which sets this class's name as `mapreduce.job.inputformat.class`, sets config for it, then somehow wraps that as a dseq? Or perhaps all the dseq stuff assumes file-based input?

Thought I'd mention this anyway as one of the "gaps" alluded to on that documentation ticket -- it feels like something which would be really simple in idiomatic Clojure code, and dseqs are advertised as one of the core abstractions of the library, but it's not at all clear how I'd go about creating one myself, given some logic I'd like to implement for how to generate keys as input for mappers. Looking at the protocol for dseq isn't very enlightening in this respect, as dseqs mainly appear to be wrappers for csteps.
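[To make the question concrete, a rough sketch of the cstep-based approach described above, assuming Parkour accepts a function of the `Job` as a config step and that `parkour.io.dseq/dseq` wraps a cstep; `example.IntegerListInputFormat` and its config keys are hypothetical:]

```clojure
(ns example.int-dseq
  (:require [parkour.io.dseq :as dseq]))

;; Sketch: wrap a (hypothetical) AOT-compiled InputFormat as a dseq
;; via a function cstep that mutates the job configuration.
(defn int-range-dseq [lo hi]
  (dseq/dseq
   (fn [^org.apache.hadoop.mapreduce.Job job]
     (doto (.getConfiguration job)
       (.set "mapreduce.job.inputformat.class"
             "example.IntegerListInputFormat")  ; hypothetical class
       (.setLong "example.int-range.lo" lo)     ; hypothetical keys
       (.setLong "example.int-range.hi" hi)))))
```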
(I did find at least one custom InputFormat for an integer range here: https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/IntegerListInputFormat.java , although I'm not sure why this particular class uses static fields for its config, or whether that would work with Parkour.)