Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

findall(::Fix2{typeof(in)}, array) and _findin make assumptions on finiteness #29124

Open
andyferris opened this issue Sep 11, 2018 · 9 comments
Labels
search & find The find* family of functions

Comments

@andyferris
Copy link
Member

When creating the predicate pred = in(something), it may be that something is not a finite, iterable collection, but that in(x, something) is still a fast operation. (In my case it's 3D points in a Sphere - fast to determine, but Sphere is an uncountable set).

In this case, the implementation of _findin(a, b) is not valid, because we assume b can be converted to a Set here.

This function is called by findall here, without consideration whether something (i.e. pred.x) is finite.

I would recommend changing this signature to the case where pred.x isa Union{AbstractArray, Tuple} for consistency, but I'm not sure if that will cause regressions for anyone?

@fredrikekre fredrikekre added the search & find The find* family of functions label Sep 11, 2018
@nalimilan
Copy link
Member

Before #24673, this method was called by findin, and it has the same problem. I guess it's somewhat more severe now that it's called by find.

If we restrict the signature, calls will fall back to the most generic findall method. Maybe that's OK, but it's a bit annoying that custom types with a finite length which may be convertible to Set would have to define a better method manually. Maybe we should check Base.IteratorSize?

@JeffBezanson
Copy link
Member

Since a Sphere is (probably?) not iterable, it shouldn't have to define IteratorSize. There is a slight impedance mismatch here, in that in is generically defined to check whether something is equal to an element of a collection, which is not quite the same meaning as being contained in a region.

@StefanKarpinski
Copy link
Member

I don't think that in need necessarily be restricted to being an element of a collection. Since a collection is discrete, in is naturally defined/implemented for it in terms of checking equality with each item the collection iterates. However, for non-discrete things, in could be defined/implemented in a more general way. I'm not sure how this relates to findin; presumably you don't want to iterate uncountably infinite objects like spheres.

@nalimilan
Copy link
Member

I'm not sure how this relates to findin; presumably you don't want to iterate uncountably infinite objects like spheres.

To be clear, this issue is about the x argument find(in(x), y). You don't need to iterate over x to know whether an entry from y is present in x.

@andyferris
Copy link
Member Author

Hmm... Is there not a fallback definition for IteratorSize to unknown?

@nalimilan
Copy link
Member

It falls back to HasLength unfortunately.

@andyferris
Copy link
Member Author

That seems... presumptuous.

@JeffBezanson
Copy link
Member

We've been over this. But actually, SizeUnknown means that something is at least iterable, and in practice consumers will assume it's finite, so Set(itr) would still work. There is no value for a type that's not iterable at all, and I guess technically there also isn't a value that means the type's finite-ness is unknown.

I think all we can do here is limit that method to some types like AbstractArray, and other uses might have to explicitly write in(Set(x)) which frankly is for the best. In fact, this could give better performance if the container is already an AbstractSet but not a Set, since it avoids an unnecessary copy.

@andyferris
Copy link
Member Author

andyferris commented Sep 11, 2018

Even for tuples and small AbstractArrays it might be slower to use a Set than not (I suppose I should do some benchmarks to back up this claim).

If all the user needs to write is in(Set(x)) to get a speed up, at least that is concise and it’s very visible what kind of algorithm will occur under the hood (hash lookup) and gives ultimate control to the user. Maybe even explicit is better than implicit in this case?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
search & find The find* family of functions
Projects
None yet
Development

No branches or pull requests

5 participants