Makes sure we don't pull the whole corpus into memory when training #23


Open
wants to merge 7 commits into master

Conversation

@dirkgr (Contributor) commented Jun 22, 2015

Explanation in the comments.

-for (final List<List<String>> batch : partitioned) {
-  tasks.add(createWorker(i, iter, batch));
+for (final List<List<String>> batch : batched) {
+  futures.add(ex.submit(createWorker(i, iter, batch)));
   i++;
 }
@dirkgr (Contributor, author) commented on the diff:

This for loop would pull the entire training data into memory, because every worker contains a batch, and all workers are instantiated before the first one starts working.
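
For reference, here is a minimal self-contained sketch of the eager pattern described above; the class and method names are stand-ins for illustration, not the project's actual API. Because every task captures its batch and all tasks are built before any of them run, the whole corpus ends up resident in memory at once.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;

    public class EagerBatchingSketch {
        // Stand-in for the trainer's createWorker: the returned task captures `batch`.
        static Callable<Void> createWorker(int id, List<List<String>> batch) {
            return () -> {
                // train on this batch of sentences
                return null;
            };
        }

        // Builds every worker up front; each task holds a reference to its batch,
        // so all batches are live in memory before any training starts.
        static List<Callable<Void>> buildAllWorkers(Iterable<List<List<String>>> batches) {
            List<Callable<Void>> tasks = new ArrayList<>();
            int i = 0;
            for (List<List<String>> batch : batches) {
                tasks.add(createWorker(i++, batch));
            }
            return tasks;
        }
    }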

@dirkgr commented Jul 7, 2015

Ping?

public NeuralNetworkModel train(final Iterable<List<String>> sentences) throws InterruptedException {
  final ListeningExecutorService ex =
      MoreExecutors.listeningDecorator(
          new ThreadPoolExecutor(config.numThreads, config.numThreads,
A reviewer (Contributor) commented:

Neat trick, but let's leave a comment explaining what we're trying to accomplish here. If I understand correctly, the overall idea is to have executor.submit block if there are no available threads, to avoid materializing the sentences in memory before they are needed. An ArrayBlockingQueue with CallerRunsPolicy is one way to accomplish this.

Any reason why the blocking queue starts with size 8 instead of config.numThreads?
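
For readers following along, a self-contained sketch of the executor configuration under discussion; the class name and parameters here are illustrative, while ThreadPoolExecutor, ArrayBlockingQueue, CallerRunsPolicy, and Guava's MoreExecutors.listeningDecorator come from the diff. The bounded queue plus CallerRunsPolicy means that once all threads are busy and the queue is full, the submitting thread runs the next batch itself, which throttles submission instead of materializing the whole corpus up front.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    import com.google.common.util.concurrent.ListeningExecutorService;
    import com.google.common.util.concurrent.MoreExecutors;

    public class BoundedExecutorSketch {
        // Fixed-size pool with a bounded work queue. When all threads are busy and
        // the queue is full, CallerRunsPolicy makes the submitting thread execute
        // the task itself, so submission is naturally throttled.
        public static ListeningExecutorService create(int numThreads, int queueSize) {
            return MoreExecutors.listeningDecorator(
                    new ThreadPoolExecutor(
                            numThreads, numThreads,
                            0L, TimeUnit.MILLISECONDS,
                            new ArrayBlockingQueue<Runnable>(queueSize),
                            new ThreadPoolExecutor.CallerRunsPolicy()));
        }
    }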

@dirkgr (author) replied:

That's correct. I'll add a comment.

The queue size could be config.numThreads, but it's not really connected to the number of processors. It's just connected to the amount of overhead there is in creating these threads. In principle, a queue size of 1 should do, but I tried that and it was slower. I'm worried that if I set it to the number of processors, I'll run out of memory on a machine with lots of cores.

@dirkgr (author) replied:

Actually, that's not correct. The queue size matters when the main thread is running a task due to the CallerRunsPolicy. So it is connected to the number of processors. I changed it.
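
A rough bound (an editorial sketch, not part of the PR) makes the connection to the processor count concrete: the pool can be processing numThreads batches, queueSize batches can be waiting in the ArrayBlockingQueue, and one more can be running on the main thread under CallerRunsPolicy, so sizing the queue to config.numThreads keeps roughly 2 * numThreads + 1 batches in memory regardless of corpus size.

    public class BatchBoundSketch {
        // Upper bound on batches materialized at once with the bounded-queue executor:
        // running in the pool + waiting in the queue + the one the main thread
        // runs itself under CallerRunsPolicy.
        static int maxBatchesInFlight(int numThreads, int queueSize) {
            return numThreads + queueSize + 1;
        }

        public static void main(String[] args) {
            // e.g. 8 threads and a queue of 8 -> at most 17 batches resident
            System.out.println(maxBatchesInFlight(8, 8));
        }
    }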

@dirkgr commented Jul 15, 2015

This might be a fix for #20.

@Hronom (Contributor) commented Jul 29, 2015

Any info about when this pull request will be accepted? These changes help me train 2.4 GB of data...

@Iakovenko-Oleksandr commented:

The fix is really useful! It took us 70+ GB of RAM to train a model without it. Now it's only about 10 GB. I wonder why such an essential improvement hasn't yet been added to master?

@dirkgr commented Aug 4, 2015

@wko27 had some concerns about the quality of the resulting vectors. @Hronom, @Iakovenko-Oleksandr, do you have any problems with your results?

@Iakovenko-Oleksandr commented:

What kind of problems? It really feels like something changed, but we still don't have any tools to evaluate the adequacy of the model... The closest vectors look more or less fine.
