Ensemble Clustering on different data transformations #16

Open
Kosisochi opened this issue Jul 26, 2021 · 1 comment

Kosisochi commented Jul 26, 2021

From the documentation, I can only see support for building a clustering ensemble by varying K, the clustering algorithm, and so on.
What I want to do is cluster different "views" of the same data as one ensemble, for example the Bag of Words, TF-IDF, and word embedding representations of the same data.
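To make the idea of "views" concrete, here is a minimal sketch in plain Python that builds a Bag of Words view and a TF-IDF view of the same three documents. The toy corpus, the function names, and the smoothed IDF formula are assumptions chosen for illustration; they are not taken from the post or from OpenEnsembles.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document becomes one row in every view.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

tokens = [d.split() for d in docs]
vocab = sorted({w for doc in tokens for w in doc})


def bow_view(tokens, vocab):
    """Raw term counts: one row per document, one column per vocabulary word."""
    rows = []
    for doc in tokens:
        counts = Counter(doc)
        rows.append([counts[w] for w in vocab])
    return rows


def tfidf_view(tokens, vocab):
    """Term frequency multiplied by a smoothed inverse document frequency."""
    n_docs = len(tokens)
    doc_freq = {w: sum(1 for doc in tokens if w in doc) for w in vocab}
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0 for w in vocab}
    rows = []
    for doc in tokens:
        counts = Counter(doc)
        rows.append([counts[w] / len(doc) * idf[w] for w in vocab])
    return rows


bow = bow_view(tokens, vocab)
tfidf = tfidf_view(tokens, vocab)

# Both views describe the same objects, so they have the same number of rows.
print(len(bow), len(tfidf))
```

What the views have to share is the set of rows (the documents). The number of columns can differ from view to view, for instance if an embedding view were added.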

I have tried creating a separate data object for each representation:

dataObj = oe.data(df2, list_x)
dataObj1 = oe.data(df3, list_xj)

dataObj.D["parent1"] = dataObj1
c = oe.cluster(dataObj)
K = 5
numIterations = 2
c_MV_arr = []
source_names = ['parent', 'parent1']
output_names = ["BOW", "TFIDF"]
for i in range(1,numIterations):
    name = 'kmeans_' + output_names[i]  
    c.cluster(source_names[i], 'kmeans', name, K, init = 'random', n_init = 1) 
    c_MV_arr.append(c.finish_majority_vote(threshold=0.5))

but I get this error:

TypeError: float() argument must be a string or a number, not 'data'
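A likely reading of this error, offered as an assumption rather than something confirmed by the library: the entries of D are expected to be numeric matrices, and the assignment above stores a whole data object there, so a later numeric conversion ends up calling float() on that object. The sketch below reproduces the same kind of failure with a hypothetical stand-in class; it does not use OpenEnsembles, and to_float_matrix is an invented helper.

```python
class data:
    """Hypothetical stand-in for an oe.data object: it keeps matrices in D."""

    def __init__(self, matrix):
        self.D = {"parent": matrix}


first = data([[1.0, 0.0], [0.0, 1.0]])
second = data([[0.5, 0.5, 0.0], [0.0, 0.2, 0.8]])

# Same pattern as in the post: an object is stored where a matrix is expected.
first.D["parent1"] = second


def to_float_matrix(value):
    """Mimic a numeric routine that converts every entry to float."""
    rows = value if isinstance(value, list) else [[value]]
    return [[float(x) for x in row] for row in rows]


# A real matrix converts without trouble.
ok = to_float_matrix(first.D["parent"])

# The stored object cannot be converted, which raises a TypeError.
try:
    to_float_matrix(first.D["parent1"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The exact wording of the message varies a little between Python versions, but it names the 'data' type in the same way as the traceback above.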

I also tried creating multiple cluster objects:

c = oe.cluster(dataObj)
c1 = oe.cluster(dataObj1)

but then I can't combine them: c.finish_majority_vote(threshold=0.5) only takes c into consideration, not c1.

Is it possible to use OpenEnsembles to cluster ensembles of different data?

Note: df2 and df3 have different feature dimensions, so I cannot do the transformation into BOW or TFIDF features inside the for loop. The second argument to oe.data(df, x) is the feature list (with the length of df2.columns or df3.columns), and it is initialized outside the for loop.

@Kosisochi
Author

To answer my own question, this is what I did:
I added the new data object to a list, d_arr, and then used the .merge() method to merge it into the first data object:

numRepeats = 1
d_arr=[]
for i in range(0,numRepeats):
    d_temp = oe.data(df3, list_xj)
    d_arr.append(d_temp)

transdict= dataObj.merge(d_arr)

Then I initialized a cluster object using the first data object and clustered each entry in turn:

c = oe.cluster(dataObj)
for name in dataObj.D.keys():
    # -----------------------------------
    c.cluster(name, "kmeans", "kmeans_" + name + "_", K, n_init=1)

Here, name refers to each data object in turn. The first data object is called "parent", the next one is called "parent_1", and so on.
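Once every view has been clustered, the label vectors can be combined even though the views have different feature dimensions, because each solution holds one label per object. The sketch below shows the general co-occurrence idea behind a majority vote in plain Python. It is an illustration under stated assumptions, not the OpenEnsembles implementation of finish_majority_vote; the function majority_vote and the three label vectors are hypothetical.

```python
def majority_vote(labelings, threshold=0.5):
    """Group objects that share a cluster in more than `threshold` of the labelings."""
    n = len(labelings[0])
    if any(len(labels) != n for labels in labelings):
        raise ValueError("every labeling must cover the same objects")

    # Union-find structure: each object starts in its own group.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Link two objects when enough labelings put them in the same cluster.
    for i in range(n):
        for j in range(i + 1, n):
            together = sum(1 for labels in labelings if labels[i] == labels[j])
            if together / len(labelings) > threshold:
                parent[find(j)] = find(i)

    # Relabel the groups as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Hypothetical cluster labels for six documents, one vector per view.
bow_labels = [0, 0, 0, 1, 1, 2]
tfidf_labels = [1, 1, 0, 0, 0, 2]
embedding_labels = [2, 2, 2, 0, 0, 1]

consensus = majority_vote([bow_labels, tfidf_labels, embedding_labels])
print(consensus)  # [0, 0, 0, 1, 1, 2]
```

Cluster numbers do not need to agree across views, since only the question "are these two objects together?" is counted. Here the third document joins the first two because two of the three views place it with them.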
