As in quantum mechanics, statistical arbitrage strategies acknowledge that it is impossible to predict a specific event, but possible to predict the statistical average outcome if a fundamental relation exists between the observed and predicted quantities.

A well-known example of this in finance is pairs trading, which exploits the fact that some price series are very highly correlated (like bonds of different maturities, or indices of different European countries). A move in the DAX might induce a similar move in the EuroStoxx 50 (or the other way around). In the end, the name of the game is understanding multidimensional probability distributions, the natural habitat of the particle physicist.
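As a rough sketch of this idea (not the author's actual strategy), a minimal pairs-trading signal can be built from the z-score of the spread between two correlated price series. The window length, entry threshold and simple OLS hedge ratio below are illustrative assumptions:

```python
import numpy as np

def pair_signal(prices_a, prices_b, window=100, entry_z=2.0):
    """Z-score based pairs-trading signal (illustrative sketch only).

    `prices_a` and `prices_b` are arrays of prices of two highly
    correlated instruments (e.g. DAX and EuroStoxx 50 futures).
    Window, threshold and hedge-ratio estimate are assumptions,
    not parameters from the post.
    """
    a = np.asarray(prices_a, dtype=float)[-window:]
    b = np.asarray(prices_b, dtype=float)[-window:]
    beta = np.polyfit(b, a, 1)[0]   # hedge ratio from trailing window
    spread = a - beta * b           # residual, ideally mean-reverting
    z = (spread[-1] - spread.mean()) / spread.std()
    # bet on mean reversion when the spread deviates strongly
    if z > entry_z:
        return "short A / long B"
    if z < -entry_z:
        return "long A / short B"
    return "flat"
```

In practice the hedge ratio, the stationarity of the spread and the transaction costs all need far more careful treatment than this sketch suggests.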

Finance, where data is sparse (compared to particle physics), can profit from the application of sophisticated analysis methods that were developed in fields where data is more abundant.

On the other hand, performing analysis in finance may sharpen the data scientist’s abilities in two important ways:

- Since the signals are very weak and buried under a sea of randomness, a great deal of conviction and mastery of the analysis tools is needed before the signal becomes clearly visible.
- Live trading provides direct feedback, and any misjudgment in the analysis process will nearly always result in an immediate loss of money (at least when trading at high frequency).

The graph below shows the live performance of a strategy I worked on over a period of two and a half years.

The realized Information Ratio was 6.9, and the ratio of P&L to maximum drawdown was 62. The main strategy ran on a 1-second grid, while the execution strategy reacted as fast as possible to price changes and to changes in the order book of correlated instruments.
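For reference, these two performance figures can be computed from a daily P&L series as sketched below. The post does not state its exact conventions, so the square-root-of-252 annualization and the cumulative-P&L drawdown definition are assumptions:

```python
import numpy as np

def performance_stats(daily_pnl, periods_per_year=252):
    """Annualized Information Ratio and P&L-to-max-drawdown ratio.

    Conventions (annualization factor, drawdown on the cumulative
    P&L curve) are common choices assumed here for illustration.
    """
    pnl = np.asarray(daily_pnl, dtype=float)
    info_ratio = np.sqrt(periods_per_year) * pnl.mean() / pnl.std()
    equity = np.cumsum(pnl)                        # cumulative P&L curve
    drawdown = np.maximum.accumulate(equity) - equity
    return info_ratio, equity[-1] / drawdown.max()
```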

Latencies are not overly important here, but should be of the order of 10 milliseconds for Euronext, for instance (from signal generation to receiving the order execution confirmation).

But it is essential to have a good execution strategy to minimize slippage. The average profit per trade is usually less than the spread, which also means less than one tick in most cases. Thus, it is necessary to have a good broker. The total execution costs (fees + commissions) should not exceed 15 percent of the spread.
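A rough worked example of this cost budget, with all numbers illustrative and not taken from the post:

```python
# Assume a futures contract whose one-tick spread is worth 5 EUR per
# round trip (hypothetical numbers for illustration).
spread_eur = 5.0
max_costs = 0.15 * spread_eur   # budget: fees + commissions <= 15% of spread
avg_profit = 0.4 * spread_eur   # average profit per trade, below one spread
net_per_trade = avg_profit - max_costs
assert net_per_trade > 0        # the edge survives only if costs stay this small
```

With an average gross profit well below one spread, even a modest increase in fees or slippage can turn the strategy unprofitable, which is why the execution side matters so much.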

The stable long-term performance is mainly due to portfolio diversification effects, resulting from trading a large number of pairs of bond and index futures. I will discuss more details on indicator construction, practical portfolio construction and execution strategies in later posts.

I retrieved the list of all Swiss cities with at least 10,000 inhabitants from Wikipedia, considered for each city the connections to all other Swiss cities, and calculated the average travel time (taking the fastest connection in each case).

Due to its central location and good infrastructure, Zürich has the lowest average travel time of 77 minutes. From Bern one already needs 10 minutes more on average, from St. Gallen 138 minutes, Geneva 156 minutes, Lugano 228 minutes (see the complete list).
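The averaging step can be sketched as follows. The post used R and the real Swiss timetable; this is a minimal Python illustration, and the data structure and city names in the test are hypothetical:

```python
def average_travel_times(fastest_minutes):
    """Average travel time per city over its connections to all other
    cities. `fastest_minutes[(a, b)]` is the fastest connection in
    minutes, one entry per unordered city pair (hypothetical format,
    not the actual timetable data used in the post).
    """
    cities = {c for pair in fastest_minutes for c in pair}
    averages = {}
    for city in cities:
        times = [t for pair, t in fastest_minutes.items() if city in pair]
        averages[city] = sum(times) / len(times)
    # sort ascending, so the best-connected city comes first
    return dict(sorted(averages.items(), key=lambda kv: kv[1]))
```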

I used the OpenStreetMap library in R to display the average travel times on a map (*click for a better view*):

One can see nicely where the cities are located (I plotted only the bigger cities when numbers were overlapping). There are of course two factors which play a role here: how centrally a city is located, and how fast the train connection is. By comparing nearby values one can get an idea of the relative importance of the two.

This week IBM hosted a workshop on Big Data Analytics at the IBM Innovation Center in Zürich. Thanks to Romeo Kienzler for showing us the latest developments in Hadoop, Pig, Jaql and Hive. Pig looks the most interesting at the moment due to its open-source nature, but it seems that IBM is considering releasing an open-source version of Jaql at some point.

The next set of tools under development aims to run analysis code from R directly within the Hadoop infrastructure. Although this sounds exciting, it reminds me of when, back in 2006, some people at CERN, including myself, started to develop something very similar for LHC analyses.

What we learned was that this kind of setup was not practical for complex big data analysis tasks. Understanding multi-dimensional variable spaces requires running hundreds of analysis variations, each of which should take no more than about 30 seconds to be practical. This means that the analysis has to be done in two steps:

- A **pre-selection**, running on the cluster within Hadoop, which discards most of the background data and calculates some more condensed quantities, and
- a **final selection**, running on the laptop to analyze the pre-processed information, using R for example.
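The two-step scheme above can be sketched as follows; `events`, `keep`, `condense` and `cut` are hypothetical names standing in for a real Hadoop job and a local R analysis:

```python
def pre_selection(events, keep, condense):
    """Step 1, run on the cluster: discard most background events and
    keep only condensed per-event quantities (hypothetical sketch of
    the two-step scheme, not CERN or Hadoop code)."""
    return [condense(e) for e in events if keep(e)]

def final_selection(condensed, cut):
    """Step 2, run locally: cheap, repeatable analysis variations on
    the small pre-processed sample."""
    return [x for x in condensed if cut(x)]
```

The point of the split is that step 1 runs once over the full data set, while step 2 can be rerun hundreds of times in seconds on the condensed output.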

What is currently missing is a system that integrates these two steps, allowing for an efficient transfer (write and read) of information between them. I am curious to see how these developments continue.

For predicting this 50 years ago, Peter Higgs and François Englert received the Nobel Prize in 2013. Some people, like myself, like to believe that this is just the first of a whole family of Higgs bosons, which could open the door to a new world of supersymmetric particles.

To produce this plot, about 6 petabytes of data had to be analysed, distributed over a global network of 170 computer centers.

Data science, the use of data to understand complex systems and make predictions, has been around for some time. More data enables the application of more sophisticated methods of analysis, which allows us to understand these systems with much higher precision and to gain completely new insights.

What is new now is that, due to the exponential growth of data produced by modern technology in every aspect of life, the data science methods that were formerly used only at CERN, and later at places like Google and social media companies, are becoming accessible to normal businesses, entailing a boost in efficiency and productivity.

A data scientist has to master a large set of tools from the areas of programming, statistics and machine learning. New university courses (1, 2), which aim to provide this kind of knowledge in a short time, are currently being established.

However, the most important characteristic of a successful data scientist is having the mind of a researcher: being a curious, out-of-the-box thinker and a problem-solving generalist. It is not uncommon that different people given the same data set will see different things in it. This ability to ‘see’ something where others do not comes only with talent and many years of experience.

This is exemplified in finance, the first non-scientific field to enter the area of big data. While it is easy to be fooled by the apparent randomness of the data, or by apparent patterns that do not persist into the future, there are some who are able to see systematic patterns in the same data, which today is publicly available to everyone, and to build extremely profitable trading strategies out of it.

The same kind of data driven insight will become increasingly important in other areas of the business world and a determinant for success.

I am very excited about all the new developments in data science that are happening around us. The number of sources of openly available data is increasing fast. Businesses are collecting increasing amounts of data, enabling them to profit from big data analysis techniques in many new ways.

I have been doing large-scale, high-complexity data analysis for nearly ten years in particle physics, finance and economics. You can find out more about me here.

This blog is about the adventures I encounter while exploring the new sources of data that are being created. I will share the things I learn along the way, so that more people can profit from them. Hopefully some of it can be put to good use.

Have fun!
