Bug w/ reduce + projection + dups #267
Reduce is the only operator I can think of that can produce duplicates. Are there other operators that have this behavior? Reduce can be forced to always produce unique tuples if we always supply Hash.new as the initial value, so out comes a hash. This can be forced syntactically on the programmer by not having reduce accept an initial value at all. That is, coll.reduce {|hash, t| … } is the only syntax allowed.
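A plain-Ruby sketch of the point above (not Bud syntax): folding into a Hash does guarantee unique k-v pairs in the output, but it does not protect the computation from duplicates in the input. The tuples and block here are hypothetical illustrations.

```ruby
# Accumulating into a Hash: the result is necessarily a set of unique
# k-v pairs, since a Hash cannot hold duplicate keys.
input = [[:a, 1], [:a, 1], [:b, 2]]   # note the duplicate tuple
result = input.reduce(Hash.new(0)) do |hash, (key, val)|
  hash[key] += val
  hash
end
# result == {:a => 2, :b => 2} -- the duplicate input tuple was
# counted twice, which is exactly the bug under discussion.
```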
Well, projection can produce duplicates; reduce is sensitive to duplicates in its input. Reduce doesn't necessarily produce duplicates itself. I don't think that limiting reduce to start with an empty hash will change this behavior: I can still pass a code block to the reduce that produces different values depending on whether there are duplicates. Offhand, reduce is the only duplicate-sensitive operator I can think of (well, count is as well, but we already do dup elim before computing count -- i.e., we compute COUNT(DISTINCT ...) in SQL terms). So we could fix this by just doing dup elim on the input to reduce (unless it is already a known-distinct value like BudCollection).
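The proposed fix can be sketched in plain Ruby (a stand-in for Bud's reduce; the input data is hypothetical): eliminating duplicates on the input before folding makes the dup-sensitive block behave as intended.

```ruby
# Dup elim on the input to reduce, as suggested above.
input = [[:a, 1], [:a, 1], [:b, 2]]
result = input.uniq.reduce(Hash.new(0)) do |hash, (key, val)|
  hash[key] += val
  hash
end
# Each distinct tuple now contributes exactly once: {:a => 1, :b => 2}
```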
On Mar 29, 2012, at 8:50 AM, Neil Conway wrote:
True. Ugh, this is an issue with all code blocks, isn't it? Seems to me that all code blocks that are not fed to a collection (non-element) will have to be uniq'd.
Since the result of reduce is the hash object itself, which is a uniq'd set of k-v pairs, the code block can't force dups in the output. Without code blocks, the operations that always produce uniq'd output are group, scan and reduce. Meanwhile, join, sort, and each_with_index are sensitive to dups (dup input => dup output). So if there's a way to analyze code blocks for uniqueness, we may be able to identify edges in the dataflow where dup elimination may have to be introduced. --s
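A quick plain-Ruby illustration of "dup input => dup output" for one of the operators named above (toy data, not Bud collections):

```ruby
# each_with_index is duplicate-sensitive: a duplicated input element
# produces extra output tuples with distinct indexes.
with_dups    = [1, 1, 2].each_with_index.to_a  # [[1, 0], [1, 1], [2, 2]]
without_dups = [1, 2].each_with_index.to_a     # [[1, 0], [2, 1]]
```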
To fix reduce, shouldn't it be sufficient to just do duplicate elimination on its input? (Unless we can prove the input arrives from a known-to-be-dup-free source, like a scanner). In the case of join, I'd argue it isn't really duplicate sensitive: dup input => dup output, but you should get the same result regardless of where duplicate elimination is done (i.e., you can eliminate dups either from join input or output).
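The claim that join commutes with duplicate elimination can be checked with a plain-Ruby model (Array#product standing in for a cross join; the data is hypothetical):

```ruby
# Dedup-then-join should equal join-then-dedup for a join.
left  = [1, 1, 2]
right = [:x, :x]
a = left.uniq.product(right.uniq)  # eliminate dups on the inputs
b = left.product(right).uniq       # eliminate dups on the output
a.sort_by(&:inspect) == b.sort_by(&:inspect)  # => true
```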
Or rather, any duplicate-sensitive operator that is passed the output of a user-defined code block must do duplicate elimination. Otherwise redirecting the output of the code block through a scratch collection would change the semantics of the program. e.g., `t2` and `t3` have different contents in this example, but they shouldn't.
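The kind of discrepancy described here can be reconstructed in plain Ruby (a hypothetical stand-in: `uniq` models routing through a scratch collection, since Bud collections are sets):

```ruby
# Feeding a dup-sensitive reduce directly vs. through a set-valued
# "scratch" gives different answers, though the programs look equivalent.
src = [[:a, 1], [:a, 1]]
direct  = src.reduce(Hash.new(0)) { |h, (k, v)| h[k] += v; h }       # {:a => 2}
via_set = src.uniq.reduce(Hash.new(0)) { |h, (k, v)| h[k] += v; h }  # {:a => 1}
```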