Today I had a look at the Swiss public transport API provided via the Swiss Open Knowledge Foundation site. It gives access to all Swiss train schedules and connections. One needs just a little Python script to download the data.
I retrieved the list of all Swiss cities with at least 10K inhabitants from Wikipedia, considered for each city all connections to all the other Swiss cities, and calculated the average travel time (taking the fastest connection in each case).
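A minimal sketch of such a download script, assuming the transport.opendata.ch `/v1/connections` endpoint and its duration format (e.g. `"00d01:17:00"`); the function names are my own, not from the original script:

```python
import json
import re
import urllib.parse
import urllib.request

API = "http://transport.opendata.ch/v1/connections"  # assumed endpoint

def duration_minutes(duration):
    """Parse an API duration string like '00d01:17:00' into minutes."""
    days, hours, mins, secs = map(
        int, re.match(r"(\d+)d(\d+):(\d+):(\d+)", duration).groups()
    )
    return days * 24 * 60 + hours * 60 + mins + secs / 60

def fastest_connection_minutes(origin, dest):
    """Query the API and return the fastest travel time in minutes."""
    url = API + "?" + urllib.parse.urlencode({"from": origin, "to": dest})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return min(duration_minutes(c["duration"]) for c in data["connections"])
```

Averaging `fastest_connection_minutes(city, other)` over all other cities then gives the numbers below.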
Due to its central location and good infrastructure, Zürich has the lowest average travel time of 77 minutes. From Bern one needs already 10 minutes more on average; from St. Gallen it is 138 minutes, from Geneva 156 minutes, and from Lugano 228 minutes (see the complete list).
I used the OpenStreetMap library in R to display the average travel times on a map (click for a better view):
One can see nicely where the cities are located (I plotted only the bigger cities where the numbers were overlapping). There are of course two factors at play here: how centrally a city is located, and how fast its train connections are. By comparing nearby values one can get an idea of the relative importance of the two.
This week IBM hosted a workshop on Big Data Analytics at the IBM Innovation Center in Zürich. Thanks to Romeo Kienzler for showing us the latest developments in Hadoop, Pig, Jaql and Hive. Pig looks the most interesting at the moment due to its open source nature, but it seems that IBM is considering releasing an open source version of Jaql at some point.
The next set of tools under development aims to run analysis code from R directly within the Hadoop infrastructure. Although this sounds exciting, it reminds me of when, back in 2006, some people at CERN, including myself, started to develop something very similar for LHC analyses.
What we learned was that this kind of setup was not practical for complex big data analysis tasks. Understanding multi-dimensional variable spaces requires running hundreds of analysis variations, each of which should not take more than about 30 seconds to be practical. This means that the analysis has to be done in two steps:
- A pre-selection, running on the cluster within Hadoop, which discards most of the background data and calculates some more condensed quantities, and
- a final selection, running on the laptop to analyze the pre-processed information using R for example.
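The two-step split above can be sketched in Python. The record fields, cut values and mass window below are purely illustrative stand-ins, not the actual CERN analysis:

```python
def pre_select(records, energy_cut=5.0):
    """Step 1 (would run on the cluster): discard most background
    and keep only condensed quantities per event."""
    for rec in records:
        if rec["energy"] > energy_cut:  # crude illustrative background cut
            yield {"mass": rec["mass"], "energy": rec["energy"]}

def final_select(condensed, mass_window=(120.0, 130.0)):
    """Step 2 (runs on the laptop, fast to iterate): count events
    falling inside a mass window."""
    lo, hi = mass_window
    return sum(1 for ev in condensed if lo <= ev["mass"] <= hi)

# Toy data: only the first event survives both steps.
raw = [
    {"mass": 125.0, "energy": 7.2},
    {"mass": 91.2,  "energy": 8.1},   # outside the mass window
    {"mass": 126.5, "energy": 3.0},   # fails the pre-selection cut
]
print(final_select(pre_select(raw)))  # prints 1
```

The point of the split is that step 1 runs rarely and expensively, while step 2 is cheap enough to rerun hundreds of times with varied cuts.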
What is missing currently is a system that integrates these two steps, allowing for an efficient transfer (write and read) of information between them. I am curious to see how these developments continue.
I wanted to share with you one of my favorite graphics of all time. It shows the evolution of the statistical evidence of the experimental data (black dots) compared to the theoretical prediction (histograms) of what one would expect to see if there was no Higgs boson (red and purple) and if there is one (blue). (Click on it, and wait until you see the blue writing appear.)
For predicting this 50 years ago, Peter Higgs and François Englert received the Nobel Prize in 2013. Some people, like myself, like to believe that this is just the first of a whole group of Higgs bosons, which could open the door to a new world of supersymmetric particles.
To produce this plot, about 6 petabytes of data had to be analysed, distributed over a global network of 170 computing centers.
Data science, the use of data to understand complex systems and make predictions, has been around for some time. More data enables the application of more sophisticated methods of analysis, which allows understanding these systems with much higher precision and gaining completely new insights.
What is new now is that, due to the exponential growth of data produced by modern technology in every aspect of life, the data-scientific methods that were formerly used only at CERN, and later at places like Google and social media companies, are becoming accessible to normal businesses, entailing a boost in efficiency and productivity.
A data scientist has to master a large set of tools from the areas of programming, statistics and machine learning. New university courses (1, 2), which aim to provide this kind of knowledge in a short time, are currently being established.
However, the most important characteristic of a successful data scientist is having the mind of a researcher: being a curious, out-of-the-box thinker and problem-solving generalist. It is not uncommon that different people given the same data set will see different things in it. This ability to 'see' something where others do not comes only with talent and many years of experience.
This is exemplified in finance, the first non-scientific field to enter the area of big data. While it is easy to be fooled by the apparent randomness of the data, or by apparent patterns that do not persist into the future, there are some who are able to see systematic patterns in the same data, which today is publicly available to everyone, and build extremely profitable trading strategies out of it.
The same kind of data driven insight will become increasingly important in other areas of the business world and a determinant for success.
I am very excited about all the new developments in data science that are happening around us. The number of sources of openly available data is increasing fast. Businesses are collecting increasing amounts of data, enabling them to profit from big data analysis techniques in many new ways.
I have been doing large-scale, high-complexity data analysis for nearly ten years in particle physics, finance and economics. You can find out more about me here.
This blog is about the adventures I encounter in exploring the new sources of data that are being created. I will share the things I learn along the way here, so that more people can profit from them. Hopefully some of it can be put to good use.