
[MRG] fix quadruplets decision_function #217

Conversation

@wdevazelhes (Member) commented Jun 13, 2019

I just realized there was a problem with decision_function for quadruplets: it didn't work on lists of lists of lists (as opposed to 3D arrays). It made me realize that we don't have integration tests for these array-like inputs (although we had extensive unit tests of the check_input functions). Here is a fix. It adds significant test time (15s, because it tests all possibilities with reasonably sized datasets), but as with other similar tests, it's hard to find smaller datasets since the algorithms sometimes fail on them. I've opened a separate issue (#218) for reducing the time of tests, so if everything is OK apart from the time consideration, I guess we can merge.
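For context, here is a minimal sketch of the kind of smoke test being discussed, assuming metric_learn's LSML quadruplets learner; the test name and dataset are illustrative, not the actual code added in this PR:

import numpy as np
from metric_learn import LSML

def test_decision_function_accepts_array_like_quadruplets():
    rng = np.random.RandomState(42)
    X = rng.randn(30, 4)
    # build quadruplets as a plain list of lists of lists, not a 3D array
    quadruplets = [[list(X[i]), list(X[i + 1]), list(X[i + 2]), list(X[i + 3])]
                   for i in range(0, 24, 4)]
    lsml = LSML().fit(quadruplets)
    # the point of the smoke test: this call should not raise on array-like input
    scores = lsml.decision_function(quadruplets)
    assert scores.shape == (len(quadruplets),)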

@wdevazelhes requested review from bellet and perimosocordiae and removed request for bellet, June 13, 2019 13:47
@perimosocordiae (Contributor) commented:
As I understand it, this is just a "smoke test": it verifies that we don't throw any exceptions, but doesn't check the numeric results. If so, we can probably shrink down the input size significantly, as we don't really care about convergence behavior.

@wdevazelhes (Member, Author) commented:

> As I understand it, this is just a "smoke test": it verifies that we don't throw any exceptions, but doesn't check the numeric results. If so, we can probably shrink down the input size significantly, as we don't really care about convergence behavior.

I agree, done: I just kept 30 samples (below that, around 20 samples, there are problems with either LMNN or RCA). That makes the test last only 6s (instead of 15), so I guess it's OK for now? (And maybe later we can address this by making more adapted datasets that have the minimum number of samples needed to work for every algorithm.)

@bellet (Member) commented Jun 14, 2019:

Looks good but why do we need as many as 30 samples?

@bellet (Member) commented Jun 24, 2019:

> Looks good but why do we need as many as 30 samples?

Still puzzled by this ;-)

@wdevazelhes (Member, Author) commented:

> Looks good but why do we need as many as 30 samples?

With 20 samples, I get the following error for LMNN:

ValueError: not enough class labels for specified k (smallest class has 2)

I've tried using train_test_split to balance the subsampling with only 20 samples; let's see if Travis turns green.

(With only 10 samples, RCA throws:

ValueError: Unable to make 10 chunks of 2 examples each

)
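For reference, a minimal sketch of that LMNN failure mode, assuming metric_learn's LMNN with its default k=3; the data below are made up to reproduce the unbalanced-subsample situation:

import numpy as np
from metric_learn import LMNN

X = np.random.randn(20, 4)
# an unlucky, unbalanced subsample: the smallest class has only 2 members
y = np.array([0] * 9 + [1] * 9 + [2] * 2)

# with k=3, LMNN needs more than k samples per class, so this raises:
# ValueError: not enough class labels for specified k (smallest class has 2)
LMNN(k=3).fit(X, y)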

@wdevazelhes (Member, Author) commented:

Apparently we need 30 samples for the test to pass on all versions on CI; otherwise we hit the RCA problem.

@wdevazelhes (Member, Author) commented:

Note (TODO): I'll prevent that and try to use fewer samples by setting num_chunks= to a lower number than 100.
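A rough sketch of what that workaround could look like, assuming the RCA_Supervised signature of that release (num_chunks, chunk_size); the values are illustrative:

import numpy as np
from metric_learn import RCA_Supervised

X = np.random.randn(20, 4)
y = np.repeat([0, 1], 10)

# with num_chunks=100, 20 samples cannot supply enough chunks of 2 examples
# each; lowering num_chunks lets RCA run on a much smaller dataset
rca = RCA_Supervised(num_chunks=5, chunk_size=2)
rca.fit(X, y)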

@wdevazelhes (Member, Author) commented:
Done: if tests are green, I think we are good to merge

@wdevazelhes (Member, Author) commented:

Tests are green. If everybody agrees, we can merge.


# we subsample the data for the test to be more efficient
input_data, _, labels, _ = train_test_split(input_data, labels,
                                             train_size=20)
Review comment (Member):
maybe setting stratify=labels would make LMNN pass for train_size=10?

@wdevazelhes (Member, Author) replied:
You're right, I tried and it worked for 11 (though it didn't work for 10; I think that's because the choice of chunks removes a point once it's already included in a chunk, or something like that, and the dataset has 3 classes, but I would need to investigate more to be sure).
However, it fails for MLKR, since labels is an array of continuous values that cannot be passed as the "stratify" argument.
So I suggest that, to keep it simple, we keep it like that (with 20 samples and no stratify). What do you think?
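To illustrate that limitation, a small sketch assuming scikit-learn's train_test_split; the arrays are made up:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(40, 3)
y_classes = np.repeat([0, 1, 2, 3], 10)   # discrete labels (e.g. for LMNN)
y_continuous = np.random.randn(40)        # continuous targets (e.g. for MLKR)

# stratified subsampling keeps every class represented even for a small train_size
X_small, _, y_small, _ = train_test_split(X, y_classes, train_size=11,
                                          stratify=y_classes, random_state=0)

# with continuous targets every value is its own "class", so stratify raises
# a ValueError and cannot be used uniformly across all algorithms in the test
try:
    train_test_split(X, y_continuous, train_size=11, stratify=y_continuous)
except ValueError as e:
    print(e)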

Reply (Member):

Fair enough, let's merge

@bellet bellet merged commit 580d38d into scikit-learn-contrib:master Jun 25, 2019