The rapid pace of technical progress is producing huge volumes of data, which have to be filtered, processed and presented in an intelligible form. Data science therefore plays an important role in all institutions of the ETH Domain, and its development will be accelerated further in the future.

The internationally renowned specialist in medicine and bioinformatics Gunnar Rätsch, professor at ETH Zurich. (Photo: Kellenberger Kaminski Photographie)

Gunnar Rätsch, a professor at ETH Zurich, combines data science with biomedicine. Together with his team, he is developing an early warning system for patients in intensive care units, for instance. It is designed to trigger an alarm if renal failure is likely to occur within six hours unless countermeasures are taken. Rätsch and his team are using data which was collected at Inselspital Bern over ten years but which has never been analysed. These are multidimensional time series of physiological measurements that are performed on a regular basis – a wealth of data. "For the purposes of analysis, we are using techniques which have been developed in recent years, such as machine learning, in order to predict the next point in time, depending on what is happening or which treatment option is applied," explained the data scientist. This enables the likelihood of kidney failure to be calculated. "This work will enable us to make a practical contribution towards improving treatment," added Rätsch.
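The general principle behind such a system can be sketched in a few lines of Python: summarise the recent history of the vital signs as a feature vector and let a classifier estimate the probability that a critical event follows within the next six hours. Everything below (the simulated measurements, the features and the choice of model) is an illustrative assumption, not the actual early warning system.

```python
# Minimal sketch: estimate the risk of an event within the next six hours
# from windowed summaries of multivariate vital-sign time series.
# Synthetic data only; the real early-warning system is far more elaborate.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_patient(n_hours=48):
    """Simulate hourly creatinine and urine-output measurements."""
    creatinine = np.cumsum(rng.normal(0, 0.05, n_hours)) + 1.0
    urine = rng.normal(1.0, 0.3, n_hours) - 0.01 * creatinine
    return np.column_stack([creatinine, urine])

def window_features(series, t, width=6):
    """Summarise the last `width` hours before time t as a feature vector."""
    win = series[max(0, t - width):t]
    return np.concatenate([win.mean(axis=0), win.std(axis=0), win[-1]])

X, y = [], []
for _ in range(300):
    s = make_patient()
    for t in range(6, len(s) - 6):
        X.append(window_features(s, t))
        # Label: does mean creatinine rise sharply in the following 6 hours?
        y.append(int(s[t:t + 6, 0].mean() > s[t - 6:t, 0].mean() + 0.15))

X, y = np.array(X), np.array(y)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```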

Other projects he is involved in are concerned with the diagnosis and treatment of cancer, where personalised medical treatment is already applied. If a patient is diagnosed with a tumour, at some leading cancer centres it undergoes molecular analysis straight away and, depending on any changes that are discovered, is treated accordingly. In many places, however, this only happens once standard treatment has failed. "It is often too late for treatment at that stage," said Rätsch. "More research is needed here to confirm that the molecular analysis helps." In order to discover new correlations and hypotheses, the data scientist also follows unusual paths. For example, he and his team analyse clinical notes written by doctors and nursing staff about some 5000 patients at a hospital in New York, assessing whether there are correlations with certain changes in the patients' tumours. According to Rätsch, "it is very exciting. We often view things differently from the medical staff and recognise the technical possibilities."
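To give a flavour of how free-text notes can be related to a molecular finding, the short sketch below turns invented notes into bag-of-words features and fits a simple classifier; the notes, the "mutation present" labels and the model are placeholders, not the team's actual analysis of the New York data.

```python
# Minimal sketch: relate free-text clinical notes to a molecular finding.
# The notes and the "mutation present" labels here are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "patient tolerates therapy well, mild fatigue",
    "rapid progression despite treatment, new lesions",
    "stable disease, no new complaints",
    "severe side effects, therapy paused",
]
mutation_present = [0, 1, 0, 1]  # hypothetical molecular label per patient

# Bag-of-words features from the notes, then a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(notes, mutation_present)

# Words with the largest weights hint at possible correlations to follow up.
vec, clf = model.named_steps.values()
for word, weight in sorted(zip(vec.get_feature_names_out(), clf.coef_[0]),
                           key=lambda p: -abs(p[1]))[:5]:
    print(f"{word:12s} {weight:+.2f}")
```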

Particular importance is attached to data privacy in medical trials. In addition to laws that regulate access to patient files, there are also technical precautions. Data is therefore often pseudonymised for research purposes. In addition, access control ensures that the data is only accessible to researchers with a legitimate interest, and processing takes place on systems with special security features. ETH Zurich is currently developing a new high-performance computing system which is particularly suitable for medically sensitive data. "The data is saved in encrypted form on the hard disk," explained Gunnar Rätsch, "which means that we can rightly claim that the data from hospitals is secure with us."
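Pseudonymisation itself is conceptually straightforward; the sketch below replaces patient identifiers with keyed hashes so that the records of one patient remain linkable without revealing who they are. The key handling and record layout are assumptions made for illustration, not the procedure used at ETH Zurich or the hospitals.

```python
# Minimal sketch of pseudonymisation: direct identifiers are replaced by
# keyed hashes before data leaves the hospital; the key stays with the
# data controller, so researchers only ever see the pseudonyms.
# Real clinical pipelines add encryption at rest and strict access control.
import hmac, hashlib

SECRET_KEY = b"kept-by-the-hospital-not-the-researchers"  # illustrative only

def pseudonymise(patient_id: str) -> str:
    """Deterministic pseudonym so the records of one patient stay linkable."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

record = {"patient_id": "PAT-004711", "creatinine": 1.8, "ward": "ICU"}
research_record = {**record, "patient_id": pseudonymise(record["patient_id"])}
print(research_record)
```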

Processing raw data directly in situ

Anastasia Ailamaki, professor of Computer Science at EPFL, is also concerned with medical data. She and her team build infrastructure to support the analysis of data from patients with brain disorders in order to identify the biological causes of disease. This relates to her involvement in the "Human Brain Project", an EU FET Flagship Programme seeking to understand the human brain using computer-based models. Using new data management software called "RAW", which she developed, the computer scientist is able to deliver results without having to prepare the data beforehand. "RAW" accesses raw data directly in its original format and in its real-time state, automatically adapts to queries and delivers answers simply and efficiently. To distribute the popular software commercially, Anastasia Ailamaki founded the spin-off company "RAW Labs" in 2015, which has its offices in the EPFL Innovation Park.

"Many companies, like Facebook for example, only use ten percent of the existing data," explained Ailamaki. “As we do not know ahead of time which data will be useful, all the available data normally has to be cleansed using appropriate software and be loaded into the system before analysis becomes possible. The data scientist who is assigned this task, spends 80 percent of his well paid working time on these processes before he can perform an analysis." Her software, by contrast, automatically identifies the data needed for a specific query, locates it, delivers the result and saves it in order to respond to similar queries more quickly in future.   "We essentially write computer code which, in turn, generates other computer code and remembers it later," explains the scientist. 

Looking back over 100 years

Collecting and analysing huge volumes of data is a relatively new technique in medicine, as well as in many other fields of science. In environmental research, however, there is a long tradition of it. For instance, researchers from the Swiss Federal Institute for Forest, Snow and Landscape Research WSL work with data that is over a century old. Recording and archiving observations about forests, weather or snow depth was long considered tedious, scientifically unproductive work. In light of climate change and new ICT capabilities, this monitoring, which often used to attract wry smiles, is now highly topical and provides important indications of what we are likely to encounter in the future. Christoph Marty examines the snow cover and how it has changed in the past and will change in the future. "It is difficult to read a clear signal from short time series from recent decades; the chaos of the weather plays too significant a role," explained the scientist. "Trends only emerge from the background noise when you look at data over a longer period."
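Why record length matters can be shown with simulated snow-depth data: a small assumed downward trend hidden in large year-to-year variability is statistically invisible over two decades but unmistakable over a century. The figures below are made up for the illustration and are not WSL measurements.

```python
# Minimal sketch of why long series matter: the same small trend is lost in
# 20 years of noisy snow-depth data but clearly significant over 100 years.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
years = np.arange(1917, 2017)
trend = -0.3                            # assumed decline in mean snow depth, cm per year
noise = rng.normal(0, 25, years.size)   # year-to-year weather "chaos"
snow_depth = 120 + trend * (years - years[0]) + noise

for span in (20, 100):
    y, s = years[-span:], snow_depth[-span:]
    fit = linregress(y, s)
    print(f"last {span:3d} years: slope {fit.slope:+.2f} cm/yr, p = {fit.pvalue:.3f}")
```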

For example, the early winter of 2015, which brought hardly any snow to the ski resorts, was a rather rare event which the tourist industry will probably not have to contend with again for the next five to ten years. "However, based on the models which we feed with our data," explained Christoph Marty, "we can predict that these situations will occur with increasing frequency in the future." Past measurements of snow depth allow the computer models which Marty and his colleagues use to predict the impact of climate change on snow cover to be tested. The model calculations verified in this way clearly show that, in future, people will have to travel to regions at higher altitudes for winter experiences that require a continuous, heavy snow cover. The results are more complicated in the case of avalanches, however. "There are very likely to be fewer avalanches," revealed the WSL researcher, "although we may see avalanches in some winters on a scale that was rarely seen in the past."

When audiences applaud at the same time

Extracting scientific findings from the wealth of data: This increasingly requires collaboration between disciplines. Large volumes of data have long been processed in statistical physics, for instance to study the behaviour of interacting gas particles. Tried and tested methods and algorithms exist to describe and understand these systems. "Applying these techniques to living matter has become a new trend," explained Carlo Albert, a physicist who studies phytoplankton at Eawag. These are algae and bacteria which form the basis of the food chain in oceans and seas, but which can also pose a danger to people and animals if individual species multiply in vast quantities during an algae bloom.

Millions of plankton particles in our seas are detected using the very latest measuring techniques. Even though each of them has a separate existence and responds to changes, they resemble physical particles in many ways. "Universal, simple phenomena occur even in complex systems," explained Albert. "Take the applause at the end of a concert, for instance. The audience often suddenly becomes synchronised and starts clapping at the same time." In a first phase, the researchers used clustering algorithms to analyse the distribution of certain properties of the phytoplankton, such as the length, volume or pigmentation of the particles, in order to deduce laws. It became apparent that the distribution is very broad, which is characteristic of systems at what is known as the critical point, where small disturbances trigger a characteristic reaction. In a second phase, the researchers are seeking to ascertain whether this allows predictions to be made about algae blooms.
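The first phase described above can be sketched roughly as follows: cluster the particles by a few measured properties and examine how broadly those properties are distributed. The particle data, the choice of three clusters and the log scaling are illustrative assumptions, not Eawag's actual analysis.

```python
# Minimal sketch: cluster particles by a few measured properties and look at
# how broadly those properties are distributed. Synthetic particle data only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Columns: length (um), volume (um^3), pigmentation (a.u.); heavy-tailed sizes.
length = rng.lognormal(mean=2.0, sigma=0.8, size=5000)
volume = 0.5 * length ** 3 * rng.lognormal(0, 0.2, 5000)
pigment = rng.gamma(2.0, 1.0, 5000)
particles = np.column_stack([length, volume, pigment])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    np.log(particles))              # log scale tames the broad distributions

for k in range(3):
    sizes = length[labels == k]
    print(f"cluster {k}: {sizes.size:4d} particles, "
          f"median length {np.median(sizes):5.1f} um, "
          f"95th pct {np.percentile(sizes, 95):6.1f} um")
```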

Revealing the unexpected

The rapid pace of technical development has not only swept through areas of science like medicine and environmental research; it is also revolutionising data science itself, as data volumes grow exponentially. This is also the case at the Paul Scherrer Institute (PSI) with its major research facilities. "We have a huge mountain of data that has to be mined within a useful period of time," said Gabriel Aeppli, head of the Synchrotron and Nanotechnology Department at the PSI. Previously, people used to gather data, build a model and adapt it. In the expert's opinion, "you barely have any time for that nowadays. The data must be processed more quickly and more effectively to keep pace with data collection." Data mining, machine learning and deep learning are the buzzwords. Not only is the task done more quickly, you also discover things that are missed with model-based processing. "All the pixels that we look at contain things that are regarded as background noise and are ignored; however, modern information technology can reveal the unexpected," said Aeppli, who is also a professor at ETH Zurich and at EPFL.
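The contrast with model-based processing can be illustrated with a small anomaly-detection sketch: rather than fitting a physical model and discarding the rest as noise, an algorithm flags pixels that behave unexpectedly. The simulated detector frames and the choice of method are assumptions for illustration, not PSI software.

```python
# Minimal sketch of the model-free idea: instead of fitting a physical model
# and discarding the rest as noise, let an anomaly detector flag pixels that
# behave unexpectedly. Synthetic detector frames, not PSI data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
frames = rng.normal(100, 5, size=(200, 64))   # 200 frames x 64 pixels of "noise"
frames[120:140, 17] += 30                     # hidden feature in one pixel

# Describe each pixel by simple statistics across frames, then score them.
features = np.column_stack([frames.mean(axis=0), frames.std(axis=0),
                            frames.max(axis=0)])
scores = IsolationForest(random_state=0).fit(features).decision_function(features)
print("most anomalous pixels:", np.argsort(scores)[:3])
```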

"The new type of 'data science', with automation, standardisation and intelligible view of the results, have completely changed our working practices," believes Daniele Passerone, head of the "Atomistic Simulation" Group at Empa. Despite his initial reservations, he said that the machine reduces the workload so much that it leaves more time to drill down into the topic and to be creative. As the theoretical physicist pointed out, "the ideas are not automated".

New nanostructures from the computer

Researchers at Empa use computer simulations to develop new materials, such as strips of honeycomb-patterned carbon that are only one atom thick. They are seeking to manufacture new electronic components from graphene nanostrips like these. In order to give the nanomaterial the desired electronic properties, the researchers had the idea of replacing some carbon atoms with boron atoms. But how many boron atoms does it take? "The computer allows us to calculate the electronic properties of all possible nanostructures with one, two or three boron atoms," revealed Passerone, "but that requires an efficient way of processing large volumes of data."

Within the scope of the National Center of Competence in Research MARVEL, an automated, interactive infrastructure and database have been developed to perform this task; the project is headed by EPFL. "I act as the interface between MARVEL and Empa," explained Passerone. Once the question is defined as a workflow, the system processes it automatically by distributing the tasks to local computer clusters, to remote computers in the cloud or to supercomputers. This, for example, has led to the creation of a database with 1000 different graphene structures containing boron atoms. The system then searches for those structures which are actually suitable for electronic applications. Only then do the researchers test in experiments whether the theoretical predictions prove to be true.
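How such an automated workflow operates can be sketched schematically: enumerate the candidate structures, farm the calculations out to workers, collect the results in a database and screen it afterwards. The sketch below is a simplified stand-in; the placeholder band-gap function, the number of substitution sites and the database layout are invented, and this is not the actual MARVEL infrastructure.

```python
# Minimal sketch of an automated screening workflow (not the MARVEL stack):
# enumerate candidate structures, distribute the calculations to workers,
# and collect the results in a small database for later screening.
import itertools, sqlite3
from concurrent.futures import ProcessPoolExecutor

N_SITES = 12  # hypothetical number of substitutable carbon sites in the ribbon

def band_gap(boron_sites):
    """Placeholder for an expensive electronic-structure calculation."""
    return 0.1 * len(boron_sites) + 0.01 * sum(boron_sites)  # fake model

def candidates(max_boron=3):
    for n in range(1, max_boron + 1):
        yield from itertools.combinations(range(N_SITES), n)

if __name__ == "__main__":
    db = sqlite3.connect("structures.db")
    db.execute("CREATE TABLE IF NOT EXISTS results (sites TEXT, gap REAL)")
    structures = list(candidates())
    with ProcessPoolExecutor() as pool:
        for sites, gap in zip(structures, pool.map(band_gap, structures)):
            db.execute("INSERT INTO results VALUES (?, ?)", (str(sites), gap))
    db.commit()
    # Screen the database for structures in a useful band-gap window.
    hits = db.execute("SELECT sites, gap FROM results "
                      "WHERE gap BETWEEN 0.2 AND 0.3").fetchall()
    print(len(structures), "structures computed,", len(hits), "candidates kept")
```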

Swiss Data Science Center

Accelerating the use of modern data science in Switzerland: This is the objective of an "Initiative for Data Science" launched by the ETH Domain. As part of it, the "Swiss Data Science Center" at EPFL and ETH Zurich will start up in January 2017 with a budget of 30 million francs for four years. The Center is headed by Olivier Verscheure. "The first task that our platform will undertake is data incubation – how to obtain meaningful information from raw data, how to eliminate background noise and how to fill gaps."

If you wish to correlate health data with air pollution and traffic data nowadays, problems soon arise, because the data comes from different silos and is only understood by experts from the respective fields. "Anyone who is not well versed in air pollution will not know about the need to recalibrate the data from a CO2 sensor as a function of relative humidity," explained Verscheure. "Our collaboration with various teams enables us to introduce data records and sources of all kinds, which means that information from researchers in all disciplines and from Swiss industry can be evaluated."
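The CO2 example translates into a very small piece of code: a raw reading is corrected as a function of relative humidity before it is compared with anything else. The linear form and the coefficient below are invented for illustration; a real sensor would come with its own calibration curve.

```python
# Minimal sketch of the kind of domain knowledge Verscheure mentions: raw CO2
# readings are corrected as a function of relative humidity before any
# cross-domain analysis. The correction coefficient here is invented.
def recalibrate_co2(raw_ppm: float, relative_humidity: float) -> float:
    """Apply a (hypothetical) linear humidity correction to a CO2 reading."""
    HUMIDITY_COEFF = 0.15  # ppm of apparent CO2 per % relative humidity, assumed
    return raw_ppm - HUMIDITY_COEFF * relative_humidity

readings = [(412.0, 35.0), (430.0, 85.0)]   # (raw ppm, % relative humidity)
for raw, rh in readings:
    print(f"raw {raw:.0f} ppm at {rh:.0f}% RH -> corrected {recalibrate_co2(raw, rh):.1f} ppm")
```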

Modern techniques such as machine learning are applied in a second step. These reveal, for example, that there is a correlation between traffic flow and the weather, which would enable future traffic jams to be predicted as a function of previous congestion and weather forecasts. Calculations can also be performed for air pollution and the associated health risks. "Needless to say, data privacy is especially important as far as clinical data is concerned," said Verscheure. "We have to show how to guarantee data security."
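Such a prediction can be sketched as a simple regression of today's traffic volume on yesterday's volume and the rain forecast; the simulated figures and the single weather variable below are illustrative assumptions, far simpler than what the Center would actually use.

```python
# Minimal sketch of the correlation-then-prediction step: fit a regression of
# current traffic volume on yesterday's volume and the rain forecast.
# All numbers are simulated; real work would use many more features.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
days = 365
rain_mm = rng.gamma(1.5, 2.0, days)
traffic = 1000 + 40 * rain_mm + rng.normal(0, 50, days)
traffic[1:] += 0.3 * (traffic[:-1] - 1000)        # congestion carries over

X = np.column_stack([traffic[:-1], rain_mm[1:]])  # yesterday's traffic, today's rain
y = traffic[1:]
model = LinearRegression().fit(X, y)

tomorrow = model.predict([[traffic[-1], 6.0]])    # forecast: 6 mm of rain
print(f"predicted volume tomorrow: {tomorrow[0]:.0f} vehicles/hour")
```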

Better than permitted

But modern data processing methods also harbour risks. Machine learning and deep learning are already integrated into spam filters, image and face recognition, and the search engines used by Google and Facebook, and they have overtaken traditional technology in many areas of application within barely three years. "Deep learning works astonishingly well," said Edouard Bugnion, professor of Computer Science at EPFL, "much better than we think it should. A machine can pick the 50 most similar apples from a sample of one million. How it does that is the mystery of deep learning," the scientist explained.
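The "most similar apples" task is, at its core, a nearest-neighbour search over feature vectors that a deep network has learned; the sketch below uses random vectors in place of learned embeddings and a smaller collection than a million items.

```python
# Minimal sketch of the "50 most similar apples" task: represent each item by
# a feature vector (in practice produced by a deep network) and retrieve the
# nearest neighbours. Random vectors stand in for learned embeddings here.
import numpy as np

rng = np.random.default_rng(5)
n_items, dim = 100_000, 64           # scaled down from a million for the sketch
embeddings = rng.normal(size=(n_items, dim)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

query = embeddings[42]               # "this apple"
similarity = embeddings @ query      # cosine similarity to every item
top50 = np.argsort(-similarity)[:50]
print("most similar items:", top50[:10], "...")
```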

The models used are often so complicated that they are beyond comprehension. What is not a problem for anyone in the case of spam filters becomes problematic in other applications. "Should we as a society be concerned about how autonomous vehicles move?" asked Bugnion. "An error could prove to be fairly costly." Scientists also wish to understand how a research result has come about. And it could become particularly awkward if the doctor's machine recommends a particular drug without the doctor understanding why. Olivier Verscheure is convinced that "this would not work".

He therefore also hopes that the Center can help to bridge the gap between data scientists and scientific users. "This is a huge challenge," said the head of the Center. However, it is also important to make a great effort to train scientists, said Edouard Bugnion. EPFL and ETH Zurich are therefore now offering Master's degree courses in Data Science. EPFL professor Anastasia Ailamaki regards this as an "important step" that will make students aware of where their work will be useful in the long term.

Gabriel Aeppli has high hopes for a collaboration between the PSI and the new Swiss Data Science Center, for example in determining the structure of biomolecules. The Swiss Light Source (SLS), which is used to analyse the molecules, delivers what are known as interferograms. These show how the X-ray light waves are scattered by the atoms. Suitable software is required to calculate where the atoms are located and thus to determine the spatial structure of the molecules. "Up to now we have developed this software ourselves," said Aeppli. "We could accelerate this process if we could supplement our knowledge with the resources which will be provided by the ETH Domain in the future."