Hypothesis not finding failing examples #77

Open · asmeurer opened this issue Aug 21, 2020 · 8 comments
asmeurer commented Aug 21, 2020

At #76 (particularly the most recent commit, 2006911), I implemented the logic for as_subindex wrong, but the hypothesis test for as_subindex just isn't finding a failing example. I even ran it with --max-examples=1000000, which took 2 hours, and it found nothing.

I tried reducing the broad ndindices strategies to strategies that I knew would encompass the bug. I got nothing until I narrowed it down to just

shapes = tuples(integers(0, 10)).filter(
             # numpy gives errors with empty arrays with large shapes.
             # See https://github.com/numpy/numpy/issues/15753
             lambda shape: prod([i for i in shape if i]) < 1000)

_integer_arrays = arrays(intp, shapes, elements=integers(0, 10))

Tuples_of_slices = tuples(one_of(
    slices(),
), min_size=2).filter(_doesnt_raise)

Tuples_of_integerarrays = tuples(one_of(
    _integer_arrays,
), min_size=2, max_size=32).filter(_doesnt_raise)

@given(Tuples_of_slices, Tuples_of_integerarrays, shapes)
def test_as_subindex_hypothesis(idx1, idx2, shape):
    ...

(compare that to the default ndindices and shapes strategies https://github.com/asmeurer/ndindex/blob/2006911d291f5af0681b64bd9ba6b96eed5d5eec/ndindex/tests/helpers.py). And even then I still had to run with --max-examples=1000.

A shrunk example that fails is

    idx1=(slice(None, 1, None), slice(None, 1, None))
    idx2=(array([0]), array([0]))
    shape=(1, 1)

@Zac-HD do you have any suggestions on how I can improve the situation here? I understand that the search space for my strategies is pretty big, but I would expect to at least get something useful out of it after running for a few hours. By the way, if you want to run the above test, you may have to install NumPy from git (sorry).

Is there maybe some way to tell hypothesis that certain things are more important than others? When I look at the verbose output, it seems to be trying a lot of things that aren't really going to make a difference for the most part, like very large integers. It also really likes producing empty tuples. Those are good to test, but they generally represent trivial cases, and in a lot of instances, if an empty tuple is a problem I would expect a larger example to shrink to it. I'm OK with it testing these things, but watching the verbose output, it feels like it cares about them a little bit too much. Or should I not focus too much on what hypothesis is doing internally?

I know I can add @examples, but that seems a lot like I would just be fooling myself into thinking hypothesis is useful, because the @examples don't actually influence the example generation from the strategies (they also don't shrink, which is kind of annoying). So an @example might cover a case I know to be wrong, but hypothesis wouldn't actually ever find cases that I don't know about.


Zac-HD commented Aug 22, 2020

Hmm. Unfortunately I don't have any silver-bullet style advice; I think this is just a case where failing examples are a very small fraction of your input space.

If you can work out how to summarise "goodness" of an example as a real number (or several), hypothesis.target() can help. Usually I'd also be excited about some coverage-driven stuff I'm working on, but I don't think it would actually help much in this case.

Otherwise... you can probably make examples somewhat more diverse by complicating your strategies. This usually isn't worth it, but it does kinda work - for example, using s | s.filter(not an empty tuple) | <s but with min_size=2> to boost the chance that s generates non-empty tuples. In general this is both harder and less useful than it seems, though; Hypothesis' heuristics are already pretty good.
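
A minimal sketch of that union trick, using a plain lists-based tuple strategy as a stand-in for the real ndindex strategies (the names here are illustrative):

from hypothesis import strategies as st

# Base strategy: tuples of small integers, including the empty tuple.
s = st.lists(st.integers(0, 10), max_size=5).map(tuple)

# Union the base strategy with a filtered variant and a min_size=2 variant so
# that non-empty tuples come up more often. The empty tuple is still possible,
# just less likely.
boosted = (
    s
    | s.filter(lambda t: t != ())
    | st.lists(st.integers(0, 10), min_size=2, max_size=5).map(tuple)
)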


asmeurer commented Aug 24, 2020

Thanks for responding.

Actually, now that I think about it, the problem might not be the empty tuples. I think the issue is that I have a lot of different cases for this particular function that raise NotImplementedError. I currently just ignore NotImplementedError in the test, so that it can "automatically" work once those cases become implemented. I actually used to assume(False) for NotImplementedError, but I removed it when hypothesis told me there were too many filtered tests. But perhaps I should have heeded that warning. I'm pretty sure that once I implement more cases, the current error that I wasn't finding would be found much more easily. Maybe I should make a separate test with a more restrictive set of strategies, which I know will be implemented. Is there a way to decorate the same test twice to generate two different tests?
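
For reference, one way to get two tests out of the same body is to keep the logic in an undecorated helper and apply @given to it twice; a sketch with hypothetical stand-in strategies:

from hypothesis import given, strategies as st

# Hypothetical stand-ins for the broad and restricted ndindex strategies.
broad_indices = st.integers()
restricted_indices = st.integers(0, 10)

def check_as_subindex(idx):
    # shared test logic would go here
    ...

# Applying @given as a plain function produces two separate test functions
# from the same body; pytest will collect and run both.
test_as_subindex_broad = given(broad_indices)(check_as_subindex)
test_as_subindex_restricted = given(restricted_indices)(check_as_subindex)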

Another question, which I think will actually be a bigger problem going forward: is there some way to tell hypothesis that two strategies should be "mostly" related to one another in some way? To be more concrete, if you have a boolean array index idx and index an array a with a given shape shape, a[idx] will raise an IndexError unless the shape of idx matches shape. In general, idx might be part of a tuple index, so it should match shape at the right place. I want to test the IndexError case, but in the typical case, the shape generated for idx should be related to shape. idx and shape would be generated from different strategies, like

# The actual strategies used in ndindex are a little bit more complicated than this

shapes = tuples(integers(0, 10))
boolean_arrays = arrays(bool_, shapes)

@given(boolean_arrays, shapes)
def test_booleanarray(idx, shape):
    ...

Maybe I can use target to come up with a rough measure of how close idx.shape is to shape. Are there any other clever ways to do this?
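
A rough sketch of what the target idea could look like, using standard hypothesis array_shapes/arrays strategies as stand-ins for the real ndindex ones; the closeness metric (number of matching leading dimensions) is just an illustration:

import numpy as np
from hypothesis import given, target
from hypothesis.extra import numpy as npst

shapes = npst.array_shapes(max_dims=4, max_side=10)
boolean_arrays = npst.arrays(np.bool_, shapes)

@given(boolean_arrays, shapes)
def test_booleanarray(idx, shape):
    # Score examples by how many leading dimensions of the index shape agree
    # with the array shape; Hypothesis tries to drive this score upwards.
    matching = sum(a == b for a, b in zip(idx.shape, shape))
    target(float(matching), label="matching leading dims")
    ...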

asmeurer added a commit to asmeurer/ndindex that referenced this issue Aug 24, 2020
See issue Quansight-Labs#77.

This implementation of checking if the array shapes are subsequences of the
test array shape doesn't seem to be fine grained enough for it to find a
simple bug, so I think it will need to be improved.

asmeurer commented Aug 24, 2020

I played with using target here #78. You can run the tests with -s to see the target values. I've added an intentional bug to see if it can find it. I first tried checking the ratio of boolean array shapes that matched a subset of the index array shapes, but that seemed to be too fine-grained. So instead I made the target negative when the boolean array shape mismatches (by how much it mismatches), and positive, scaled by its shape, when it matches. I tried giving more score to arrays with more dimensions and more arrays, to get those cases to appear more often.

I've had mixed results. I eventually did get it to find the intentional bug, but only after many examples. It isn't really clear to me if the targeting made it find the bug any faster (and I'm not sure how to test that, especially now that the failure is in the database). Based on the output of my print statements, it doesn't really seem to be smart enough to increase the target values.


Zac-HD commented Aug 25, 2020

For related shapes, you probably want some combination of st.shared(), npst.broadcastable_shapes(), and npst.mutually_broadcastable_shapes().
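
For illustration, a sketch of the shared() approach, deriving the boolean index's shape from a leading prefix of a shared array shape (the names and strategies here are illustrative, not the real ndindex ones):

import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra import numpy as npst

# Draw one shape per example and reuse it for both arguments via a shared key.
shared_shapes = st.shared(npst.array_shapes(max_dims=4, max_side=5), key="shape")

# A boolean index whose shape is some leading prefix of the shared shape, so
# that indexing with it is usually valid.
boolean_indices = shared_shapes.flatmap(
    lambda shape: npst.arrays(
        np.bool_,
        st.integers(0, len(shape)).map(lambda n: shape[:n]),
    )
)

@given(boolean_indices, shared_shapes)
def test_booleanarray_shared(idx, shape):
    a = np.arange(int(np.prod(shape))).reshape(shape)
    a[idx]  # a matching-prefix boolean index should not raise IndexError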

Yeah... targeting does tend to increase the score pretty well, but it only kicks in halfway through and arrays are a hard structure for it. Coming up with a metric that correlates well with bugs is often a lot harder than it sounds too... As to empirically checking the impact though, you can run a number of tests with @settings(database=None) (or use the phases setting to skip Phase.reuse) and --hypothesis-seed= 0, 1, 2, 3... to get reproducible runs, plus --hypothesis-show-statistics to see how long it took to find the bug. It's a manual process, but it does work eventually!

@asmeurer

Nice, shared is what I was looking for: I can generate an array shape and then pick a subsequence of it for the boolean array index shape. I'll play with this and see if it does a better job at generating useful examples. (As a side note, it might be helpful if that documentation page were categorized by type, so that meta-strategies like shared are in a separate list from the basic strategies.)

@asmeurer

I've been using this to test the effectiveness of the strategies (after introducing a bug that would only be found if a good example is generated).

from hypothesis import seed, settings, Phase
from ndindex.tests.test_newshape import test_newshape_hypothesis
import warnings
warnings.filterwarnings('error')  # turn warnings into errors

# Only run the generate and shrink phases (so the example database isn't
# reused), with 10000 examples per run.
func = settings(phases=[Phase.generate, Phase.shrink], max_examples=10000)(test_newshape_hypothesis)

# Run the test N times with different fixed seeds and count how many runs
# find a failure.
N = 20
failures = 0
for i in range(N):
    print(i)
    try:
        seed(i)(func)()
    except KeyboardInterrupt:
        break
    except BaseException as e:
        print(e)
        failures += 1
print(failures, N, failures/N)

The base run with no shared strategies gets 0 failures (I didn't attempt to see how many runs are needed until it finds one). With a shared shape I can get it up to 95%, which drops to 75% if I also allow non-shared shapes using one_of. That's still low enough that a normal test run probably won't find a bug, but high enough that I can be reasonably confident after a few runs with --max-examples=10000.

asmeurer added a commit to asmeurer/ndindex that referenced this issue Aug 26, 2020
…shape

This is achieved with the shared() strategy. Without this, boolean arrays
almost always don't match the array shape where they are indexed, except in
the scalar boolean case, meaning almost no boolean array behavior was properly
tested outside of IndexError behavior. We do still include random shapes to
test this behavior, as it doesn't reduce the number of good examples by much.

Fixes part of issue Quansight-Labs#77. See the discussion there for more information.
@asmeurer

I'm a little disappointed that hypothesis never found the failing example here #80, even after running millions of tests. I think the issue with the as_subindex tests is that there are so many cases that give NotImplementedError that if a tuple index contains any index that isn't implemented, the whole test is more or less skipped. I used to filter these, but I was getting filtering warnings, so I changed it to just ignore NotImplementedError. But I guess I shouldn't have ignored them. Trying to filter these cases isn't going to be easy, especially in a way that doesn't make hypothesis give warnings about it. The best solution is to just implement more of the missing cases, so that this isn't a problem any more.


Zac-HD commented Sep 18, 2020

The best solution is to just implement more of the missing cases, so that this isn't a problem any more.

Yep. The usual workaround in the meantime would be to restrict your slices strategy so that it only generates cases you know you can handle 😕
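
A hedged sketch of what such a restriction might look like, using st.builds with small bounds; the particular restrictions here (non-negative start/stop, step of None or 1) are only placeholders for whichever cases are actually implemented:

from hypothesis import strategies as st

# Only generate slices from a subset of cases assumed to be implemented:
# small non-negative starts/stops and a step of None or 1.
restricted_slices = st.builds(
    slice,
    st.none() | st.integers(0, 10),  # start
    st.none() | st.integers(0, 10),  # stop
    st.none() | st.just(1),          # step
)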
