kernel crash for large dataset #1151

Open
nhpackard opened this issue Oct 10, 2024 · 0 comments

After the usual imports:

import pandas as pd
import numpy as np
import umap

I have:
Python version: 3.9.18
umap version: 0.5.6

My dataset has 5,400,000 points in a 768-dimensional space. The examples here use random data of the same dimensionality. With 540,000 points, UMAP seems to work fine:

%%time
rawdat = np.random.rand(540000,768)
dat = pd.DataFrame(rawdat,columns=[str(i) for i in range(rawdat.shape[1])])

ums = umap.UMAP(
    metric='cosine',   # default: 'euclidean'
    min_dist=0.1,      # default: 0.1
    n_neighbors=15,    # default: 15
    n_components=2,    # default: 2
    transform_seed=1,
    verbose=True,
).fit(dat)

yields

UMAP(angular_rp_forest=True, metric='cosine', transform_seed=1, verbose=True)
Thu Oct 10 13:46:34 2024 Construct fuzzy simplicial set
Thu Oct 10 13:46:35 2024 Finding Nearest Neighbors
Thu Oct 10 13:46:35 2024 Building RP forest with 42 trees
Thu Oct 10 13:46:52 2024 NN descent for 19 iterations
	 1  /  19
	 2  /  19
	 3  /  19
	 4  /  19
	 5  /  19
	 6  /  19
	 7  /  19
	 8  /  19
	 9  /  19
	 10  /  19
	 11  /  19
	 12  /  19
	 13  /  19
	Stopping threshold met -- exiting after 13 iterations
Thu Oct 10 13:47:15 2024 Finished Nearest Neighbor Search
Thu Oct 10 13:47:18 2024 Construct embedding
Epochs completed: 100%| 200/200 [00:23]
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Thu Oct 10 13:48:56 2024 Finished embedding
CPU times: user 25min 30s, sys: 2min 52s, total: 28min 23s
Wall time: 2min 26s

But for 5,400,000 data points,

%%time

rawdat = np.random.rand(5400000,768)
dat = pd.DataFrame(rawdat,columns=[str(i) for i in range(rawdat.shape[1])])

ums = umap.UMAP(
    metric='cosine',   # default: 'euclidean'
    min_dist=0.1,      # default: 0.1
    n_neighbors=15,    # default: 15
    n_components=2,    # default: 2
    transform_seed=1,
    verbose=True,
).fit(dat)

the Python kernel crashes after ~4 minutes, with this output:

UMAP(angular_rp_forest=True, metric='cosine', transform_seed=1, verbose=True)
Thu Oct 10 13:51:34 2024 Construct fuzzy simplicial set
Thu Oct 10 13:51:37 2024 Finding Nearest Neighbors
Thu Oct 10 13:51:39 2024 Building RP forest with 64 trees

never reaching "NN descent".
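
Since the notebook kernel dies without a traceback, one way to get more information is to run the same code as a plain script with Python's standard `faulthandler` module enabled, which dumps the Python-level tracebacks if the process hits a fatal signal such as SIGSEGV. A minimal sketch, assuming the crash is a native fault rather than the OS killing the process outright (in which case nothing would be printed); the helper name `enable_crash_traceback` is made up for illustration:

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Ask CPython to dump tracebacks for all threads on a fatal signal.

    `stream` must be a real file backed by a file descriptor;
    it defaults to the process's stderr.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

Calling `enable_crash_traceback()` at the top of the script, before the `umap.UMAP(...).fit(dat)` call, should be equivalent to launching it with `python -X faulthandler script.py`.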

Memory should not be a problem: my machine (a MacBook Pro) has 128 GB of RAM, and memory usage as reported by htop reaches only ~30 GB just before the kernel crashes.
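
As a back-of-the-envelope check on that figure: `np.random.rand` returns float64, so the raw array alone is 5,400,000 × 768 × 8 bytes, about 33.2 GB (30.9 GiB), which is in line with what htop shows. The sketch below does that arithmetic in plain Python. The helper `dense_array_bytes` is hypothetical, and it counts only one copy of the input, not whatever pandas, UMAP, or the nearest-neighbor search allocate on top. It also flags, purely as an observation, that the larger case has more than 2**31 - 1 elements while the smaller one does not; whether that has anything to do with the crash is not established here.

```python
def dense_array_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for one dense n_rows x n_cols array (float64 by default)."""
    return n_rows * n_cols * itemsize


GIB = 1024 ** 3
INT32_MAX = 2 ** 31 - 1
N_COLS = 768

for n_rows in (540_000, 5_400_000):
    n_elements = n_rows * N_COLS
    f64 = dense_array_bytes(n_rows, N_COLS, itemsize=8)
    f32 = dense_array_bytes(n_rows, N_COLS, itemsize=4)
    print(
        f"{n_rows:>9,} rows: float64 {f64 / GIB:5.1f} GiB, "
        f"float32 {f32 / GIB:5.1f} GiB, "
        f"elements > int32 max: {n_elements > INT32_MAX}"
    )
```

Under these assumptions, casting the input to float32 would roughly halve the footprint of the raw data, though that alone may not change the outcome.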
