Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistency in case of dataframe and distance matrix input #46

Closed
jnpsk opened this issue Jan 16, 2023 · 4 comments · Fixed by #72
Closed

Inconsistency in case of dataframe and distance matrix input #46

jnpsk opened this issue Jan 16, 2023 · 4 comments · Fixed by #72
Assignees
Labels
bug Something isn't working high priority This issue is a high priority relative to other open issues in progress This issue is being actively worked on

Comments

@jnpsk
Copy link

jnpsk commented Jan 16, 2023

This is not a project issue, but a suggestion to put some kind of warning in the distance matrix example in the README.

There is an example of using distance matrix as input for LoOP in the README. It shows how sklearn.neighbors.NearestNeighbors could be used to obtain distance matrix together with index matrix. It seems that this way the matricies also contain distance measures to a point itself, resulting to zero distance for the first nearest neighbor of every point.
On the other hand internal method _compute_distance_and_neighbor_matrix, used when data argument is specified, excludes the distances to a point itself and so giving different scores on same data.
I took a look into the test case, which allowes difference of 0.15 in scores vector, and thus the difference between 0.45 and 0.6 is considered negligible.
I think the output metrices of sklearn.neighbors.NearestNeighbors should be transformed first to be consistent with the internal algorithm.

@vc1492a
Copy link
Owner

vc1492a commented Jan 17, 2023

@jnpsk thank you for identifying and noting this behavior in this issue!

While it's not a project issue as you stated, I think it would be best to align the behavior of using your own distance matrix with that of the original Local Outlier Probabilities method, for consistency. Perhaps this can be resolved by adding one additional neighbor when using sklearn.neighbors.NearestNeighbors and truncating the distance vectors for each point to its closest neighbors (excluding itself).

This fix should be included in the next version pushed out.

@vc1492a vc1492a added bug Something isn't working high priority This issue is a high priority relative to other open issues labels Jan 17, 2023
@vc1492a
Copy link
Owner

vc1492a commented Feb 16, 2023

@jnpsk can you please comment on the version of SciPy you have installed? After running the test case as is with 120 observations, I receive a difference of 0.0 between the approach using the internal method _compute_distance_and_neighbor_matrix and the method using sklearn.neighbors.NearestNeighbors. Could you please past your reproducible example below?

@vc1492a vc1492a added the in progress This issue is being actively worked on label Feb 16, 2023
@jnpsk
Copy link
Author

jnpsk commented Feb 16, 2023

Having defined X_n120() function as in test cases, I can run the example as follows without raising the assert error running on SciPy 1.9.1.

data = X_n120()

from sklearn.neighbors import NearestNeighbors
from PyNomaly import loop
from sklearn.utils._testing import assert_almost_equal

# generate distance and neighbor indices
neigh = NearestNeighbors(metric='euclidean')
neigh.fit(data)
d, idx = neigh.kneighbors(data, n_neighbors=10, return_distance=True)

# fit loop using data and distance matrix
clf1 = loop.LocalOutlierProbability(data, n_neighbors=10)
clf2 = loop.LocalOutlierProbability(distance_matrix=d, neighbor_matrix=idx, n_neighbors=10)

scores1 = clf1.fit().local_outlier_probabilities
scores2 = clf2.fit().local_outlier_probabilities

# compare the agreement between the results
assert_almost_equal(scores1, scores2, decimal=1)

What I mean is that the d matrix has zeros in the first column, whereas the internal method does not include distances to point itself, resulting in slight inconsitence. It is not a big deal on that dataset combined with chosen number of n_neighbors = 10. But if the assert checks more decimal places, or lower number of neigbors is used (3), the two scores differs significantly.

@vc1492a vc1492a removed the in progress This issue is being actively worked on label Mar 13, 2023
@vc1492a vc1492a added this to the Address Existing Bug Fixes milestone Aug 19, 2024
@IroNEDR IroNEDR added the in progress This issue is being actively worked on label Oct 27, 2024
@vc1492a vc1492a self-assigned this Nov 1, 2024
@vc1492a vc1492a linked a pull request Nov 3, 2024 that will close this issue
@vc1492a
Copy link
Owner

vc1492a commented Nov 3, 2024

Resolved this ticket by updating the documentation in the readme.md and including a test in tests/test_loop.py that not only illustrates the consistency but also how to achieve identical results using both the internal method and NearestNeighbors up to 10 decimal places. Covered in #72.

@vc1492a vc1492a closed this as completed Nov 3, 2024
@vc1492a vc1492a mentioned this issue Nov 3, 2024
Merged
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working high priority This issue is a high priority relative to other open issues in progress This issue is being actively worked on
Projects
None yet
4 participants