MLlib is Spark's machine learning library. It can be used from Java, Scala or Python, and it runs on existing Hadoop clusters and data.

Spark excels at iterative computation, which lets MLlib run fast: up to 100 times faster than MapReduce. MLlib contains high-quality algorithms that leverage iteration and can yield better results than the one-pass approximations sometimes used on MapReduce.

MLlib currently includes, for example, linear SVM, logistic regression, classification and regression trees, k-means clustering, and recommendation using alternating least squares, along with a few more. Not all well-known algorithms are available, however, because some of them are not well suited to parallelization.
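
To make the role of iteration concrete, here is a minimal sketch of k-means clustering in plain Python. It is not MLlib code and does not use Spark; the function names `closest` and `kmeans` and the sample points are made up for illustration. It only shows why such an algorithm benefits from an engine that handles repeated passes over the same data well: every iteration rescans the full dataset.

```python
# Plain-Python sketch of k-means (Lloyd's algorithm); illustrative only, not MLlib's API.

def closest(point, centers):
    """Return the index of the center nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )

def kmeans(points, centers, max_iterations=20):
    """Refine the given initial centers until they stop moving or the limit is hit."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        # One full pass over the data: assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Recompute each center as the mean of its cluster; an empty cluster keeps its center.
        new_centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break  # converged: another pass would change nothing
        centers = new_centers
    return centers

# Hypothetical sample data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

On a cluster, the assignment step inside the loop is the part that would be spread across machines, and keeping the dataset in memory between passes is what makes many iterations affordable.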
