Inconsistency in case of dataframe and distance matrix input #46

jnpsk · 2023-01-16T14:51:59Z

This is not a project issue, but a suggestion to put some kind of warning in the distance matrix example in the README.

There is an example of using distance matrix as input for LoOP in the README. It shows how sklearn.neighbors.NearestNeighbors could be used to obtain distance matrix together with index matrix. It seems that this way the matricies also contain distance measures to a point itself, resulting to zero distance for the first nearest neighbor of every point.
On the other hand internal method _compute_distance_and_neighbor_matrix, used when data argument is specified, excludes the distances to a point itself and so giving different scores on same data.
I took a look into the test case, which allowes difference of 0.15 in scores vector, and thus the difference between 0.45 and 0.6 is considered negligible.
I think the output metrices of sklearn.neighbors.NearestNeighbors should be transformed first to be consistent with the internal algorithm.

The text was updated successfully, but these errors were encountered:

vc1492a · 2023-01-17T15:46:58Z

@jnpsk thank you for identifying and noting this behavior in this issue!

While it's not a project issue as you stated, I think it would be best to align the behavior of using your own distance matrix with that of the original Local Outlier Probabilities method, for consistency. Perhaps this can be resolved by adding one additional neighbor when using sklearn.neighbors.NearestNeighbors and truncating the distance vectors for each point to its closest neighbors (excluding itself).

This fix should be included in the next version pushed out.

vc1492a · 2023-02-16T21:49:23Z

@jnpsk can you please comment on the version of SciPy you have installed? After running the test case as is with 120 observations, I receive a difference of 0.0 between the approach using the internal method _compute_distance_and_neighbor_matrix and the method using sklearn.neighbors.NearestNeighbors. Could you please past your reproducible example below?

jnpsk · 2023-02-16T22:15:23Z

Having defined X_n120() function as in test cases, I can run the example as follows without raising the assert error running on SciPy 1.9.1.

data = X_n120()

from sklearn.neighbors import NearestNeighbors
from PyNomaly import loop
from sklearn.utils._testing import assert_almost_equal

# generate distance and neighbor indices
neigh = NearestNeighbors(metric='euclidean')
neigh.fit(data)
d, idx = neigh.kneighbors(data, n_neighbors=10, return_distance=True)

# fit loop using data and distance matrix
clf1 = loop.LocalOutlierProbability(data, n_neighbors=10)
clf2 = loop.LocalOutlierProbability(distance_matrix=d, neighbor_matrix=idx, n_neighbors=10)

scores1 = clf1.fit().local_outlier_probabilities
scores2 = clf2.fit().local_outlier_probabilities

# compare the agreement between the results
assert_almost_equal(scores1, scores2, decimal=1)

What I mean is that the d matrix has zeros in the first column, whereas the internal method does not include distances to point itself, resulting in slight inconsitence. It is not a big deal on that dataset combined with chosen number of n_neighbors = 10. But if the assert checks more decimal places, or lower number of neigbors is used (3), the two scores differs significantly.

vc1492a · 2024-11-03T03:39:42Z

Resolved this ticket by updating the documentation in the readme.md and including a test in tests/test_loop.py that not only illustrates the consistency but also how to achieve identical results using both the internal method and NearestNeighbors up to 10 decimal places. Covered in #72.

vc1492a added bug Something isn't working high priority This issue is a high priority relative to other open issues labels Jan 17, 2023

vc1492a assigned vc1492a and medvidov Feb 16, 2023

vc1492a added the in progress This issue is being actively worked on label Feb 16, 2023

vc1492a removed the in progress This issue is being actively worked on label Mar 13, 2023

vc1492a unassigned medvidov and vc1492a Aug 19, 2024

vc1492a added this to the Address Existing Bug Fixes milestone Aug 19, 2024

vc1492a assigned IroNEDR Aug 19, 2024

IroNEDR added the in progress This issue is being actively worked on label Oct 27, 2024

vc1492a self-assigned this Nov 1, 2024

vc1492a mentioned this issue Nov 3, 2024

Fix/dataframe distance matrix inconsistency #72

Merged

vc1492a linked a pull request Nov 3, 2024 that will close this issue

Fix/dataframe distance matrix inconsistency #72

Merged

vc1492a closed this as completed Nov 3, 2024

vc1492a mentioned this issue Nov 3, 2024

Dev #73

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistency in case of dataframe and distance matrix input #46

Inconsistency in case of dataframe and distance matrix input #46

jnpsk commented Jan 16, 2023

vc1492a commented Jan 17, 2023

vc1492a commented Feb 16, 2023

jnpsk commented Feb 16, 2023

vc1492a commented Nov 3, 2024

Inconsistency in case of dataframe and distance matrix input #46

Inconsistency in case of dataframe and distance matrix input #46

Comments

jnpsk commented Jan 16, 2023

vc1492a commented Jan 17, 2023

vc1492a commented Feb 16, 2023

jnpsk commented Feb 16, 2023

vc1492a commented Nov 3, 2024