Hadoop, together with its filesystem HDFS, is open-source Apache software for distributed processing of Big Data.
R is the go-to tool of our Data Scientists. It's great for exploratory analysis and allows easy access to statistical, mathematical and machine learning functions.
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
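As a sketch of what HiveQL looks like, here is a query against a hypothetical `page_views` table (the table and column names are illustrative, not from the text); Hive compiles such statements into distributed jobs over the data in HDFS:

```sql
-- Hypothetical table of web log data stored in HDFS.
CREATE TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Familiar SQL-style aggregation: the ten most-viewed URLs.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```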
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
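The high-level language is called Pig Latin; the same top-ten aggregation as above might look like this (input file and field names are assumed for illustration):

```pig
-- Hypothetical tab-separated input file in HDFS.
views   = LOAD 'page_views.tsv' AS (user_id:chararray, url:chararray);
grouped = GROUP views BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(views) AS n;
ordered = ORDER counts BY n DESC;
top10   = LIMIT ordered 10;
DUMP top10;
```

Each statement defines a relation; Pig turns the whole script into a series of distributed jobs.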
Spark can run on Hadoop 2's YARN and can read any existing Hadoop data. It is designed to run programs faster by keeping intermediate data in memory rather than writing it to disk between steps. Spark's developers claim it runs up to 100 times faster than Hadoop MapReduce when data fits in memory, and up to 10 times faster on disk.
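A minimal Scala sketch of this style of processing, assuming a Spark cluster and illustrative HDFS paths: the `cache()` call is what keeps the computed dataset in memory for reuse.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    // Paths are illustrative; any HDFS-readable location works.
    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache()  // keep the result in memory for repeated use
    counts.saveAsTextFile("hdfs:///data/word_counts")
    sc.stop()
  }
}
```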
MLlib is Spark's machine learning library. It can be used from Java, Scala or Python, and it runs on existing Hadoop clusters and data.
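As a hedged sketch of the Scala API, clustering points read from HDFS with MLlib's k-means (the file path and parameter values are assumptions for illustration):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assumes an existing SparkContext `sc`; one comma-separated point per line.
val points = sc.textFile("hdfs:///data/points.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

// Cluster into 3 groups with at most 20 iterations.
val model = KMeans.train(points, k = 3, maxIterations = 20)
model.clusterCenters.foreach(println)
```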
Apache Mahout is a scalable machine learning library. It contains clustering, classification and collaborative filtering algorithms, many of which are implemented on top of Apache Hadoop using the MapReduce paradigm.