Makes sure we don't pull the whole corpus into memory when training #23


Open
wants to merge 7 commits into master

Conversation

@dirkgr (Contributor) commented Jun 22, 2015

Explanation in the comments.

-for (final List<List<String>> batch : partitioned) {
-  tasks.add(createWorker(i, iter, batch));
+for (final List<List<String>> batch : batched) {
+  futures.add(ex.submit(createWorker(i, iter, batch)));
   i++;
 }
@dirkgr (Contributor, author) commented on the diff:

This for loop would pull the entire training data into memory, because every worker contains a batch, and all workers are instantiated before the first one starts working.
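
For reference, here is a minimal self-contained sketch of the eager pattern described above; the class and method names are stand-ins for illustration, not the project's actual API. Because every task captures its batch and all tasks are built before any of them run, the whole corpus ends up resident in memory at once.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;

    public class EagerBatchingSketch {
        // Stand-in for the trainer's createWorker: the returned task captures `batch`.
        static Callable<Void> createWorker(int id, List<List<String>> batch) {
            return () -> {
                // train on this batch of sentences
                return null;
            };
        }

        // Builds every worker up front; each task holds a reference to its batch,
        // so all batches are live in memory before any training starts.
        static List<Callable<Void>> buildAllWorkers(Iterable<List<List<String>>> batches) {
            List<Callable<Void>> tasks = new ArrayList<>();
            int i = 0;
            for (List<List<String>> batch : batches) {
                tasks.add(createWorker(i++, batch));
            }
            return tasks;
        }
    }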

@dirkgr commented Jul 7, 2015

Ping?

public NeuralNetworkModel train(final Iterable<List<String>> sentences) throws InterruptedException {
  final ListeningExecutorService ex =
      MoreExecutors.listeningDecorator(
          new ThreadPoolExecutor(config.numThreads, config.numThreads,
A reviewer (Contributor) commented:

Neat trick, but let's leave a comment explaining what we're trying to accomplish here. If I understand correctly, the overall idea is to have executor.submit block if there are no available threads, to avoid materializing the sentences in memory before they are needed. An ArrayBlockingQueue with CallerRunsPolicy is one way to accomplish this.

Any reason why the blocking queue starts with size 8 instead of config.numThreads?
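
For readers following along, a self-contained sketch of the executor configuration under discussion; the class name and parameters here are illustrative, while ThreadPoolExecutor, ArrayBlockingQueue, CallerRunsPolicy, and Guava's MoreExecutors.listeningDecorator come from the diff. The bounded queue plus CallerRunsPolicy means that once all threads are busy and the queue is full, the submitting thread runs the next batch itself, which throttles submission instead of materializing the whole corpus up front.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    import com.google.common.util.concurrent.ListeningExecutorService;
    import com.google.common.util.concurrent.MoreExecutors;

    public class BoundedExecutorSketch {
        // Fixed-size pool with a bounded work queue. When all threads are busy and
        // the queue is full, CallerRunsPolicy makes the submitting thread execute
        // the task itself, so submission is naturally throttled.
        public static ListeningExecutorService create(int numThreads, int queueSize) {
            return MoreExecutors.listeningDecorator(
                    new ThreadPoolExecutor(
                            numThreads, numThreads,
                            0L, TimeUnit.MILLISECONDS,
                            new ArrayBlockingQueue<Runnable>(queueSize),
                            new ThreadPoolExecutor.CallerRunsPolicy()));
        }
    }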

@dirkgr (author) replied:

That's correct. I'll add a comment.

The queue size could be config.numThreads, but it's not really connected to the number of processors. It's just connected to the amount of overhead there is in creating these threads. In principle, a queue size of 1 should do, but I tried that and it was slower. I'm worried that if I set it to the number of processors, I'll run out of memory on a machine with lots of cores.

@dirkgr (author) replied:

Actually, that's not correct. The queue size matters when the main thread is running a task due to the CallerRunsPolicy. So it is connected to the number of processors. I changed it.
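
A rough bound (an editorial sketch, not part of the PR) makes the connection to the processor count concrete: the pool can be processing numThreads batches, queueSize batches can be waiting in the ArrayBlockingQueue, and one more can be running on the main thread under CallerRunsPolicy, so sizing the queue to config.numThreads keeps roughly 2 * numThreads + 1 batches in memory regardless of corpus size.

    public class BatchBoundSketch {
        // Upper bound on batches materialized at once with the bounded-queue executor:
        // running in the pool + waiting in the queue + the one the main thread
        // runs itself under CallerRunsPolicy.
        static int maxBatchesInFlight(int numThreads, int queueSize) {
            return numThreads + queueSize + 1;
        }

        public static void main(String[] args) {
            // e.g. 8 threads and a queue of 8 -> at most 17 batches resident
            System.out.println(maxBatchesInFlight(8, 8));
        }
    }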

@dirkgr commented Jul 15, 2015

This might be a fix for #20.

@Hronom (Contributor) commented Jul 29, 2015

Any info about when this pull request will be accepted? These changes help me train 2.4 GB of data...

@Iakovenko-Oleksandr commented:

The fix is really useful! It took us 70+ GB of RAM to train a model without it. Now it's only about 10 GB. I wonder why such an essential improvement hasn't yet been added to master?

@dirkgr commented Aug 4, 2015

@wko27 had some concerns about the quality of the resulting vectors. @Hronom, @Iakovenko-Oleksandr, do you have any problems with your results?

@Iakovenko-Oleksandr commented:

What kind of problems? It really feels like something changed, but we still don't have any tools to evaluate the adequacy of the model... The closest vectors look more or less fine.
