
Commit

added paragraph about maybe using map-reduce in the future
astrieanna committed May 16, 2012
1 parent b53801f commit ea14d31
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion paper.tex
@@ -133,7 +133,9 @@ \section{Conclusions}

During the preprocessing stage we ran into system memory constraints because of Python's high memory usage during vector generation. Therefore, we used only a small subset of our input data: roughly 2.5 to 3 million words depending on the language. This wasted much of the corpora we had accumulated: 18.5 million words of German, 11.25 million words of Spanish, 91.9 million words of French, and 1.1 billion words of English.

-We spent most of our time optimizing and debugging our python processing loops. This code did pre-preprocessing (strip licenses and invalid chars) and preprocessing (tokenize, split, and stem) on all of our input data. Just changing the way it iterated over data gained as much as a 100x speedup. In the future, we should attempt to also outsource this processing to the GPU, but since everything would have to be done in C, string processing will be a nightmare.
+We spent most of our time optimizing and debugging our Python processing loops. This code performed pre-preprocessing (stripping licenses and invalid characters) and preprocessing (tokenizing, splitting, and stemming) on all of our input data. Just changing the way it iterated over the data gained as much as a 100x speedup. In the future, we should also outsource this processing to the GPU, but since everything would have to be done in C, string processing would be a nightmare.
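
The exact loop change is not shown in this diff, but a minimal Python sketch of the kind of iteration change that yields such speedups, assuming the corpora sit in plain-text files and a simple regex tokenizer stands in for the real tokenize/split/stem pipeline, is:

import re

TOKEN_RE = re.compile(r"[a-z]+")

def slow_preprocess(path):
    # Naive version: reads the whole corpus and materializes every
    # intermediate list in memory before any real work happens.
    text = open(path, encoding="utf-8", errors="ignore").read()
    cleaned = [line.lower() for line in text.splitlines()]
    tokens = []
    for line in cleaned:
        tokens.extend(TOKEN_RE.findall(line))
    return tokens

def fast_preprocess(path):
    # Streaming version: one pass over the file, one line in memory
    # at a time, tokens yielded lazily to the next pipeline stage.
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            for token in TOKEN_RE.findall(line.lower()):
                yield token

The streaming version touches each line once and holds only one line of the corpus in memory at a time, which also eases the memory constraints described above.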

+Building the co-occurrence matrices could also be parallelized. Since the co-occurrences are sentence-based, this would be easy to write using map-reduce. Each mapper could be given a sentence; it would produce all the co-occurrence pairs generated by that sentence. Each reducer would get all the pairs centered on one particular word; it would build the vector representing that word.
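
A minimal sketch of this map-reduce formulation in plain Python, where the context window size and the pair definition are assumptions and the shuffle step is simulated with an in-memory dictionary rather than a real map-reduce framework, could look like:

from collections import Counter, defaultdict

def cooccurrence_mapper(sentence, window=2):
    # Map step: one sentence in, (center_word, context_word) pairs out.
    words = sentence.split()
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                yield center, words[j]

def cooccurrence_reducer(center, contexts):
    # Reduce step: all pairs centered on one word in, that word's vector out.
    return center, Counter(contexts)

def build_vectors(sentences):
    # The shuffle step is simulated by grouping pairs per center word.
    grouped = defaultdict(list)
    for sentence in sentences:
        for center, context in cooccurrence_mapper(sentence):
            grouped[center].append(context)
    return dict(cooccurrence_reducer(c, ctx) for c, ctx in grouped.items())

# e.g. build_vectors(["the cat sat on the mat", "the dog sat too"])

Because each reducer sees only the pairs centered on a single word, every word vector can be built independently and therefore in parallel.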

Overall, we have shown that the use of GPUs can be very advantageous, and we are surprised by the lack of GPU-related Machine Translation literature. We will likely continue developing this in the future and have all of our source available at github.com/madmaze/parallel-mcca.

