Different DBCV value from Matlab's implementation when calculating it on a dataset having all 0s cluster labels #9

Open
davidechicco opened this issue Jan 10, 2025 · 4 comments

Comments

@davidechicco

davidechicco commented Jan 10, 2025

Hi Felipe
Thanks again for fixing the issues in your software package in December.
I am writing because I found another case where your dbcv() function returns a result that differs from the Matlab implementation's.

I applied your dbcv() function to the dataset file DB1_with_307_clusters.csv, where the rightmost column contains the cluster labels, which are all zeros.

I used this piece of code:

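import pandas as pd
import dbcv

# Load the dataset; the rightmost column ("cluster") holds the labels.
data_file_name = 'DB1_with_307_clusters.csv'
df = pd.read_csv(data_file_name)
print("data file name read: ", data_file_name)

# Drop every row whose cluster label is 2 or greater.
df = df.drop(df[(df.cluster >= 2)].index)

# Feature columns. Note: `:-2` excludes the last two columns, not only the label column.
these_data = df.iloc[:,:-2]
# Cluster labels (last column).
these_clusters = df.iloc[:,-1]
this_dbcv = dbcv.dbcv(these_data, these_clusters, check_duplicates = False)
print("FelSiq/DBCV = ", this_dbcv)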

The result is 0.999, which is clearly wrong. A collaborator of mine applied the original DBCV Matlab implementation and obtained 0 as the result.

Can you please investigate this problem?

Thanks

-- Davide

@FelSiq
Owner

FelSiq commented Jan 10, 2025

Hi, Davide.

Could you please elaborate on the expected semantics of the label 0? Is it supposed to represent all noise or a legitimate cluster?

Please keep in mind that the Matlab implementation treats "0" as the noise label, so a result of 0.0 makes sense there, whereas my implementation treats, by default, "0" as a legitimate cluster and "-1" as the noise label, so a result of essentially 1.0 also makes sense.
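
To make the difference concrete, here is a minimal sketch in plain Python. It is not taken from either implementation, and `noise_fraction` is a hypothetical helper. It shows how the same all-zeros label vector reads under each convention: no noise at all when -1 is the noise label, and nothing but noise when 0 is the noise label.

```python
def noise_fraction(labels, noise_id):
    """Return the fraction of points that carry the noise label."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for lab in labels if lab == noise_id) / len(labels)


# Every point is labelled 0, as in the reported dataset.
labels = [0] * 10

# Convention described for this package: -1 marks noise, 0 is a real cluster.
python_view = noise_fraction(labels, noise_id=-1)

# Convention described for the Matlab code: 0 marks noise.
matlab_view = noise_fraction(labels, noise_id=0)

print(python_view, matlab_view)  # 0.0 1.0
```

Under the first reading the data form a single cluster with no noise; under the second every point is noise, which would be consistent with the two implementations disagreeing on this input when the labels are passed unchanged.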

Am I missing something?

Best regards,
Felipe.

@davidechicco
Author

Hi Felipe

Thanks for your reply. The labels in my file have the following meaning:
-1 noise cluster
0 first cluster
1 second cluster

We are aware of the differences in the Matlab implementation, so we added 1 to our labels when using the Matlab code. We changed the labels this way:
-1 --> 0 noise cluster
0 --> 1 first cluster
1 --> 2 second cluster
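
The relabelling above amounts to adding 1 to every label. A minimal sketch in plain Python, assuming integer labels with -1 as the only negative value; `to_matlab_labels` and `from_matlab_labels` are hypothetical helper names, not part of either package:

```python
def to_matlab_labels(labels):
    """Shift labels by +1: -1 (noise) -> 0, 0 -> 1, 1 -> 2, and so on."""
    shifted = []
    for lab in labels:
        if lab < -1:
            raise ValueError(f"unexpected label {lab}: expected -1 or a cluster id >= 0")
        shifted.append(lab + 1)
    return shifted


def from_matlab_labels(labels):
    """Inverse shift: 0 (noise) -> -1, 1 -> 0, 2 -> 1, and so on."""
    shifted = []
    for lab in labels:
        if lab < 0:
            raise ValueError(f"unexpected label {lab}: expected 0 or a cluster id >= 1")
        shifted.append(lab - 1)
    return shifted


original = [-1, 0, 1, 0]
for_matlab = to_matlab_labels(original)
print(for_matlab)                      # [0, 1, 2, 1]
print(from_matlab_labels(for_matlab))  # [-1, 0, 1, 0]
```

With this shift an all-zeros label vector becomes all ones on the Matlab side, so the points are treated there as members of the first cluster rather than as noise.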

So our Matlab code does not interpret our 0 labels as noise, but as elements of the first cluster, as it should.
We believe your Python function should produce 0 as the result, just like the Matlab implementation.
Can you please look into this issue?
Unfortunately, I do not have the Matlab code to share with you, because I do not have a Matlab license.

Thanks

-- Davide

@FelSiq
Owner

FelSiq commented Jan 10, 2025

Understood. I'll investigate this ASAP.

Thanks for the report.

@davidechicco
Author

Thanks Felipe.
While I'm at it, here's another case where we get a result different from Matlab's.

For this dataset file (DB1_with_774_clusters.csv), which has the same structure as the DB1_with_307_clusters.csv file mentioned earlier, we get:

  • DBCV = 0 with your Python code;
  • DBCV = 0.880161 with the Matlab implementation.

You may want to look into this one as well.

Thank you very much! :-)

davidechicco changed the title from "Wrong DBCV value when calculating it on a dataset having all 0s cluster labels" to "Different DBCV value from Matlab's implementation when calculating it on a dataset having all 0s cluster labels" on Jan 10, 2025

2 participants