There are two big sparse matrices M and N (each 100k × 10k); compute their product M × N using MapReduce. The matrices are stored in the following format:
M: <i><TAB><j><TAB><mij>
N: <j><TAB><k><TAB><njk>
If you want to set up a Hadoop cluster yourself, you can refer to my tutorial (https://zhuanlan.zhihu.com/p/58968191)
I use Hadoop streaming with Python to solve this problem. There are three MapReduce jobs in total.
Job 1
mapper: tag each record with 'm' or 'n' to tell the two matrices apart, and swap i and j in M so that both matrices are keyed by the shared index j. Output is
<j><TAB><m><TAB><i><TAB><mij>
<j><TAB><n><TAB><k><TAB><njk>
reducer: none (this is a map-only job)
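A minimal sketch of this mapper in Python. The helper name map_tag is mine, and the use of the mapreduce_map_input_file environment variable (which Hadoop Streaming sets to the input file path) to tell M-records from N-records assumes the two matrices live in files whose names distinguish them:

```python
import os
import sys


def map_tag(lines, source):
    """Tag records and re-key both matrices by the shared index j.

    M records arrive as i<TAB>j<TAB>mij  -> emit j<TAB>m<TAB>i<TAB>mij
    N records arrive as j<TAB>k<TAB>njk  -> emit j<TAB>n<TAB>k<TAB>njk
    """
    for line in lines:
        a, b, v = line.strip().split("\t")
        if source == "m":
            # swap i and j so that j becomes the shuffle key
            yield f"{b}\tm\t{a}\t{v}"
        else:
            # N is already keyed by j; just add the tag
            yield f"{a}\tn\t{b}\t{v}"


if __name__ == "__main__":
    # Assumption: the path of the M input file contains "M".
    path = os.environ.get("mapreduce_map_input_file", "")
    source = "m" if "M" in path else "n"
    for out in map_tag(sys.stdin, source):
        print(out)
```

Keeping the per-record logic in a generator makes the script easy to test locally by feeding it a list of lines instead of stdin.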
Job 2
mapper: identity mapper
reducer: for each join key j, take the Cartesian product of the M records and the N records and emit the partial products; output is
<i,k><TAB><mij*njk>
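A sketch of this reducer, again assuming the record layout produced by the first job (j, tag, index, value). Hadoop Streaming delivers the reducer input sorted by key, so grouping by j with itertools.groupby is enough:

```python
import sys
from itertools import groupby


def reduce_join(lines):
    """For each join key j, take the Cartesian product of the
    M-side and N-side records and emit partial products
    i,k<TAB>mij*njk."""
    parsed = (line.strip().split("\t") for line in lines)
    for j, group in groupby(parsed, key=lambda r: r[0]):
        m_side, n_side = [], []
        for _, tag, idx, val in group:
            (m_side if tag == "m" else n_side).append((idx, float(val)))
        # every (i, mij) pairs with every (k, njk) under the same j
        for i, mij in m_side:
            for k, njk in n_side:
                yield f"{i},{k}\t{mij * njk}"


if __name__ == "__main__":
    for out in reduce_join(sys.stdin):
        print(out)
```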
Job 3
mapper: identity mapper
reducer: sum the records with the same key (i,k)
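The final reducer can be sketched the same way: the shuffle sorts the partial products by cell (i,k), so a groupby-and-sum per key yields each entry of the result matrix. The helper name reduce_sum is my own:

```python
import sys
from itertools import groupby


def reduce_sum(lines):
    """Sum all partial products that share the same output cell (i,k)."""
    parsed = (line.strip().split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda r: r[0]):
        total = sum(float(v) for _, v in group)
        yield f"{key}\t{total}"


if __name__ == "__main__":
    for out in reduce_sum(sys.stdin):
        print(out)
```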