-
Notifications
You must be signed in to change notification settings - Fork 706
Introduction to Matrix Library
Matrix.scala is a Scalding library that introduces the possibility of treating pipes as sparse matrices and to operate on them using standard matrix operations, such as matrix multiplication:
//Computing the innerproduct of matrix A
innerProd = A * A.transpose
The matrix constructor takes in a pipe containing triples that have the assumed semantics of (row index, column index, matrix value). Additionally, the user can specify the approximate dimensions of the matrix (number of rows, columns, non-zero values) and its skewness (if the distribution of values over the row/column keys is skewed or not). This additional information can help speed-up the computation and improve scalability of the resulting job.
The matrix row and column indexes can be of any type that is comparable. The usual cases are Int, Long, String. This means that labeled matrices are allowed. For example, we can create a matrix containing the number of users that like specific movie genres per geo without first reindexing the categorical fields to numerical ids:
//Loading the number of users interested in movie genres per geo from a Tsv source
val interestsMatrix = Tsv( args("input") ).read.toMatrix[String,String,Long]('geo, 'movie_genre, 'freq)
The value type decides what operations can be applied to the matrix.
- Minimally, in order to support addition, the value type T has to have the trait
Monoid[T]
(as defined in algebird/Monoid). - In order to support subtraction, the value type T has to have the trait
Group[T]
(algebird/Group). - For multiplication, the value type T has to have the trait
Ring[T]
([algebird/Ring] (https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Ring.scala)). - For division, the value type T has to have the trait
Field[T]
(algebird/Field).
The reason why this is a more powerful approach than to require for the value type to be Numeric is that all of the four operations: addition, subtraction, multiplication, division can be then extended to non-numeric types. For example:
- String addition can be defined as string concatenation
- List addition can be defined as list concatenation
- Set addition can be defined as set union
These operations can be stacked together: For example, Map addition can be defined as set union and the values in the Maps intersection could be aggregated using their own definition of addition. By allowing matrix values to be structured types we can work with higher-order tensors such as cubes or four-tensors with the same library.
For more information on algebraic structures see: Algebraic_structure(Wikipedia) and the pages on Monoid, Group, Ring, Field.
Graphs have a straightforward representation in Matrix library as adjacency matrices. We will use the library to compute the outdegrees of the nodes in the graph.
package com.twitter.scalding.examples
import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix
class GraphOutDegreeJob(args : Args) extends Job(args) {
import Matrix._
val adjacencyMatrix = Tsv( args("input"), ('user1, 'user2, 'rel) )
.read
.toMatrix[Long,Long,Double]('user1, 'user2, 'rel)
// each row i represents all of the outgoing edges from i
// by summing out all of the columns we get the outdegree of i
adjacencyMatrix.sumColVectors.write( Tsv( args("output") ) )
}
We convert a pipe of triples to a sparse matrix where element[i,j] represents and edge between row[i] and column[j]. We then sum the values of the columns together into a column vector that has the outdegree of node[i] at row[i].
- Read the Matrix API Reference: includes code snippets explaining different kinds of matrix functions (e.g., sumRowVectors, matrix product, element-wise product, diagonal, topRowElems) and much more.
- Go over the Matrix tutorials: the tutorials range from one-liners to more complex examples that show real applications of the Matrix functions to graph problems and text processing.
- Scaladocs
- Getting Started
- Type-safe API Reference
- SQL to Scalding
- Building Bigger Platforms With Scalding
- Scalding Sources
- Scalding-Commons
- Rosetta Code
- Fields-based API Reference (deprecated)
- Scalding: Powerful & Concise MapReduce Programming
- Scalding lecture for UC Berkeley's Analyzing Big Data with Twitter class
- Scalding REPL with Eclipse Scala Worksheets
- Scalding with CDH3U2 in a Maven project
- Running your Scalding jobs in Eclipse
- Running your Scalding jobs in IDEA intellij
- Running Scalding jobs on EMR
- Running Scalding with HBase support: Scalding HBase wiki
- Using the distributed cache
- Unit Testing Scalding Jobs
- TDD for Scalding
- Using counters
- Scalding for the impatient
- Movie Recommendations and more in MapReduce and Scalding
- Generating Recommendations with MapReduce and Scalding
- Poker collusion detection with Mahout and Scalding
- Portfolio Management in Scalding
- Find the Fastest Growing County in US, 1969-2011, using Scalding
- Mod-4 matrix arithmetic with Scalding and Algebird
- Dean Wampler's Scalding Workshop
- Typesafe's Activator for Scalding