R

Statistics and mathematics

R was developed as statistical software, so of course it scores very high on statistics. We do not claim that every statistical formula can be found in one of its many packages, but R supports almost every statistical analysis you can think of.

The same holds for mathematics. Mathematics is a wide field, with applications you will never use in data analysis. Of course many very specific parts of math are not predefined in R, but you will most likely never need them. Most of the math you will use is prepackaged in R.

Computer science and machine learning

R is software with which you can do a lot once you have some programming skills, but claiming that it can solve anything for you on a computer would underestimate all other programming languages. For your analysis, though, it will almost always support what you need. You can load many types of files (Excel, XML, tabular data, plain text, ...), you can process your data in many ways (the original R way, SQL-like, Java-integrated, ...) and you can call R from other languages (e.g. Java). Whatever you can think of, there is probably a package for it.
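As a small illustration of loading and processing tabular data, here is a minimal sketch in base R. The data, the column names and the variable names (orders, region, amount) are made up for this example; with a real file you would pass a file path to read.csv() instead of inline text.

```r
# Hypothetical tabular data, inlined so the example is self-contained.
csv_text <- "region,product,amount
north,a,10
north,b,15
south,a,7
south,b,12"

# Load the data into a data frame.
orders <- read.csv(text = csv_text)

# Keep only the larger orders (comparable to a SQL WHERE clause) ...
big <- subset(orders, amount > 8)

# ... and total them per region (comparable to GROUP BY with SUM).
totals <- aggregate(amount ~ region, data = big, FUN = sum)
print(totals)
```

The same filter-and-aggregate step could also be written with one of the SQL-like or data-manipulation packages; base R is used here only to keep the sketch free of dependencies.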

Moreover, R is great for machine learning. Again, whatever you have in mind, you can probably find a package for it. For example naive Bayes, decision trees (or their extension, random forests), logistic regression (or its extension, generalized additive models), support vector machines and many more can easily be computed using R packages.
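To give an idea of how little code such a model needs, here is a sketch of a logistic regression using only base R and the built-in mtcars data set. The choice of data set and variables is ours and purely illustrative.

```r
# Logistic regression: transmission type (am: 0 = automatic, 1 = manual)
# explained by the weight of the car (wt).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of a manual transmission for each car.
probs <- predict(fit, type = "response")

# Coefficients, standard errors and fit statistics.
print(summary(fit))
```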

Communication and visualization

R has many ways to support your communication. For example, documents can be written in R Markdown. For presentations you have a wide choice: you can make beamer slides with R Markdown, create an R presentation or build your presentation with Slidify. All of these methods also allow you to create HTML files which you can publish on the site of R (or RStudio) or wherever you want.

R also supports graphics and plots, although it does not allow you to do really fancy things. R is great for making quick plots for an exploratory data analysis, but it is not recommended for your final data product. However, there is work in progress here: for example, with Shiny you can make reactive data products which you can export to HTML and make available to whoever you want.

Even though you can do a lot of interactive things with Shiny, you will not be able to create the fanciest layouts you can think of.
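The quick exploratory plots mentioned above typically take one line each. A minimal sketch with base graphics and the built-in mtcars data set (the data set and the titles are our own choice for illustration):

```r
# Distribution of fuel consumption (miles per gallon).
h <- hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")

# Relationship between the weight of a car and its fuel consumption.
plot(mpg ~ wt, data = mtcars, main = "mpg versus weight")
```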

R and big data

In the case of big data, one often refers to the three V's: volume, velocity and variety. R itself has no problem with variety thanks to its wide range of plugins. But R was not created to cope with big volumes, hence the velocity goes down rapidly when the volume goes up.

This is where Hadoop, the best-known big data tool at the moment, pops up. As you might expect, there are packages to link R with Hadoop. One of them is RHadoop. But in RHadoop, you have to write your mappers and reducers yourself. For example, logistic regression can be coded in one line using R; in RHadoop you have to write it yourself, and we were able to do it in just under 50 lines...
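To show why the hand-written version grows so quickly, here is a rough sketch of the idea in plain base R, without Hadoop and without the RHadoop API. It is not the code we wrote: the function names map_gradient and reduce_gradient, the in-memory "partitions", the step size and the number of iterations are all assumptions made for illustration. The one-line counterpart would be a call such as glm(am ~ wt, data = mtcars, family = binomial).

```r
# Sketch only: gradient descent for logistic regression, organised as a
# map step and a reduce step. Everything runs in memory on mtcars.

sigmoid <- function(z) 1 / (1 + exp(-z))

# Mapper: one partition returns its contribution to the gradient of the
# negative log-likelihood.
map_gradient <- function(part, beta) {
  X <- cbind(1, part$wt)                      # intercept + weight
  y <- part$am                                # 0/1 response
  as.vector(t(X) %*% (sigmoid(X %*% beta) - y))
}

# Reducer: partial gradients are simply added together.
reduce_gradient <- function(a, b) a + b

# Split the data into 4 hypothetical partitions.
parts <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

beta <- c(0, 0)                               # starting coefficients
step <- 0.1                                   # assumed step size, not tuned
for (i in 1:1000) {
  grads <- lapply(parts, map_gradient, beta = beta)  # map
  grad  <- Reduce(reduce_gradient, grads)            # reduce
  beta  <- beta - step * grad / nrow(mtcars)         # update
}
print(beta)
```

Even this toy version needs its own mapper, reducer and update loop, and it makes no attempt at convergence checks or standard errors, which glm() provides for free.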

Nowadays, most people are no longer writing MapReduce and are even moving away from Hadoop to more real-time big data tools such as Spark. Again there is a plugin from R to Spark (written in Scala). But again, at the moment, not many functions are predefined in SparkR. Naive Bayes in R can be done in one line; we wrote a distributed naive Bayes in SparkR in more or less 200 lines...

Still, we believe that for R users SparkR is definitely preferable to RHadoop. SparkR code can easily be run and tested locally, while RHadoop has to connect to HDFS, which makes it much slower and not well suited to trial-and-error coding.

Conclusion

R is a great data science tool. Most analyses you can think of are supported by and predefined in R. But when it comes to big data, the plugins for the big data tools are still immature. While Spark itself has a few predefined machine learning algorithms in MLlib, you still have to distribute these algorithms in SparkR yourself.

A data scientist will probably not be perfect in all the competences they should have. A mixed team that brings all the competences together might be preferable. In the same way, one program will probably never be perfect for all competences; it is better to pick the best of all the programs you (or your team) have mastered. But in the case of exploratory data analysis, R will be hard to beat!
