There are two big sparse matrices M and N (each 100k × 10k); compute their product M × N using MapReduce. The matrices are stored in the following format:
M: <i><TAB><j><TAB><mij>
N: <j><TAB><k><TAB><njk>
If you want to set up a Hadoop cluster yourself, you can refer to my tutorial (https://zhuanlan.zhihu.com/p/58968191)
I use Hadoop streaming with Python to solve this problem. There are three MapReduce jobs in total.
Job 1
mapper: tag each record with 'm' or 'n' to tell the two matrices apart, and swap i and j in M so that both matrices are keyed by the shared index j. Output is
<j><TAB><m><TAB><i><TAB><mij>
<j><TAB><n><TAB><k><TAB><njk>
reducer: none (this is a map-only job)
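A minimal sketch of this mapper in Python. The helper name map_tag is mine, and the use of the mapreduce_map_input_file environment variable (which Hadoop Streaming sets to the input file path) to tell M-records from N-records assumes the two matrices live in files whose names distinguish them:

```python
import os
import sys


def map_tag(lines, source):
    """Tag records and re-key both matrices by the shared index j.

    M records arrive as i<TAB>j<TAB>mij  -> emit j<TAB>m<TAB>i<TAB>mij
    N records arrive as j<TAB>k<TAB>njk  -> emit j<TAB>n<TAB>k<TAB>njk
    """
    for line in lines:
        a, b, v = line.strip().split("\t")
        if source == "m":
            # swap i and j so that j becomes the shuffle key
            yield f"{b}\tm\t{a}\t{v}"
        else:
            # N is already keyed by j; just add the tag
            yield f"{a}\tn\t{b}\t{v}"


if __name__ == "__main__":
    # Assumption: the path of the M input file contains "M".
    path = os.environ.get("mapreduce_map_input_file", "")
    source = "m" if "M" in path else "n"
    for out in map_tag(sys.stdin, source):
        print(out)
```

Keeping the per-record logic in a generator makes the script easy to test locally by feeding it a list of lines instead of stdin.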
Job 2
mapper: identity mapper
reducer: for each join key j, take the Cartesian product of the M records and the N records and emit the partial products; output is
<i,k><TAB><mij*njk>
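A sketch of this reducer, again assuming the record layout produced by the first job (j, tag, index, value). Hadoop Streaming delivers the reducer input sorted by key, so grouping by j with itertools.groupby is enough:

```python
import sys
from itertools import groupby


def reduce_join(lines):
    """For each join key j, take the Cartesian product of the
    M-side and N-side records and emit partial products
    i,k<TAB>mij*njk."""
    parsed = (line.strip().split("\t") for line in lines)
    for j, group in groupby(parsed, key=lambda r: r[0]):
        m_side, n_side = [], []
        for _, tag, idx, val in group:
            (m_side if tag == "m" else n_side).append((idx, float(val)))
        # every (i, mij) pairs with every (k, njk) under the same j
        for i, mij in m_side:
            for k, njk in n_side:
                yield f"{i},{k}\t{mij * njk}"


if __name__ == "__main__":
    for out in reduce_join(sys.stdin):
        print(out)
```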
Job 3
mapper: identity mapper
reducer: sum the records with the same key (i,k)
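The final reducer can be sketched the same way: the shuffle sorts the partial products by cell (i,k), so a groupby-and-sum per key yields each entry of the result matrix. The helper name reduce_sum is my own:

```python
import sys
from itertools import groupby


def reduce_sum(lines):
    """Sum all partial products that share the same output cell (i,k)."""
    parsed = (line.strip().split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda r: r[0]):
        total = sum(float(v) for _, v in group)
        yield f"{key}\t{total}"


if __name__ == "__main__":
    for out in reduce_sum(sys.stdin):
        print(out)
```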