This week IBM hosted a workshop on Big Data Analytics at the IBM Innovation Center in Zürich. Thanks to Romeo Kienzler for showing us the latest developments in Hadoop, Pig, Jaql and Hive. Pig looks the most interesting at the moment due to its open source nature, but it seems that IBM is considering releasing an open source version of Jaql at some point.
The next set of tools under development aims to run analysis code from R directly within the Hadoop infrastructure. Although this sounds exciting, it reminds me of when, back in 2006, some people at CERN, including myself, started to develop something very similar for LHC analyses.
What we learned was that this kind of setup was not practical for complex big data analysis tasks. Understanding multi-dimensional variable spaces requires running hundreds of analysis variations, each of which should take no more than about 30 seconds to be practical. This means that the analysis has to be done in two steps:
- A pre-selection, running on the cluster within Hadoop, which discards most of the background data and computes condensed summary quantities, and
- a final selection, running on a laptop, which analyzes the pre-processed information, for example in R.
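A minimal sketch of this two-step split, in Python for illustration (the field names, cut values, and record structure are all hypothetical assumptions; in practice the pre-selection would run as a Hadoop job over the full dataset and the final selection locally, e.g. in R):

```python
def preselect(events, energy_cut=50.0):
    """Pre-selection step (would run on the cluster within Hadoop):
    discard most background events and keep only a few condensed
    quantities per surviving event."""
    for ev in events:
        if ev["energy"] > energy_cut:  # crude background rejection
            # emit a condensed record instead of the full event
            yield {"energy": ev["energy"],
                   "pt2": ev["px"] ** 2 + ev["py"] ** 2}

def final_selection(condensed, pt2_window=(100.0, 400.0)):
    """Final selection (would run on a laptop): small enough input
    that hundreds of cut variations each finish in seconds."""
    lo, hi = pt2_window
    return [c for c in condensed if lo <= c["pt2"] <= hi]

# Hypothetical raw events, standing in for the cluster-side dataset.
raw = [
    {"energy": 20.0, "px": 3.0, "py": 4.0},    # background, dropped early
    {"energy": 80.0, "px": 10.0, "py": 10.0},  # survives both steps
    {"energy": 95.0, "px": 30.0, "py": 5.0},   # passes pre-selection only
]
kept = final_selection(preselect(raw))
```

The point of the split is visible even in this toy: `preselect` touches every raw event once, while `final_selection` can be rerun cheaply with different windows on the small condensed sample.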
What is missing currently is a system that integrates these two steps, allowing for an efficient transfer (write and read) of information between them. I am curious to see how these developments continue.