
Science transformed

In science, people tend to associate big data with particle physics and astronomy. But these are just the start. Big data and cloud computing are touching many other fields and promise a widespread transformation in learning and discovery, as Tony Hey reveals

Advertising feature

The emergence of computing in the past few decades has changed forever the pursuit of scientific exploration and discovery. Along with traditional experiment and theory, computer simulation is now an accepted “third paradigm” for science. Its value lies in exploring areas where solutions cannot be derived analytically and experiments are unfeasible, such as galaxy formation and climate modelling.

Researchers in many fields have been eager to capitalise on the innovations of computer scientists: new software tools and parallel supercomputers. This trend has accelerated as access to high-performance computing (HPC) clusters – servers linked up to behave as one – has widened and ever more software for parallel applications has become available. Process-heavy simulations that run on graphics-processing units are now common.

Computing is also allowing scientists to collaborate in new ways. In years gone by they would have communicated mainly at conferences and via publications. Today things are changing: networks, email, mobile devices, social media, shared storage, instant messaging, translation and video chat are letting scientists communicate with individuals and communities even when they are separated by time zones and large distances.

And there is another major shift going on, which US computer scientist Jim Gray described in 2007 as “the fourth paradigm” for scientific exploration and discovery. He could see that the collection, analysis, and visualisation of increasingly large amounts of data would change the very nature of science.

This is now happening. Over the past five years, the emergence of huge data sets and data-intensive science has fundamentally altered the way researchers work in virtually every scientific discipline. Biologists, chemists, physicists, astronomers, earth and social scientists are all benefiting from access to the tools and technologies that will integrate this “big data” into standard scientific methods and processes.

Researchers are increasingly capable of collecting vast quantities of data through computer simulations, low-cost sensor networks and highly instrumented experiments, creating a “data deluge”.

Instead of just running computer simulations on HPC clusters and supercomputers, scientists now need a different type of computing resource to store and process these massive data sets. Researchers are exploring the use of an increasingly attractive, complementary source of computing power – the cloud.

Particularly for the many scientists who may not have the time, access or expertise to use complex, in-house computing systems, cloud computing has opened another door to conducting data science on a large scale. With cloud computing, initial costs are minimised, the scaling of capacity is flexible, and access is more open and widespread. These characteristics look likely to generate an explosion of new understanding.


Genes in the cloud

The benefits of combining research with big data and the cloud have been most obvious in areas such as genome-wide association studies. Here, large amounts of data are used to identify potential links between a person’s genome and traits such as the propensity to develop a disease or a specific response to a drug. In such studies, genetic variants are collected from right across the genomes of large numbers of people with and without traits of interest. Then algorithms are used to pick out the associations.

The more people in a study, the weaker the signals that can be detected. Even so, analyses can be disrupted by hidden causes. For example, both the genetic variant and the trait may be associated with a person’s location. Such confounding factors need to be taken into account, which can be done with what’s called a “mixed model”. But there’s a problem: although such models work well on small data sets, the computer time and memory they need grow rapidly – polynomially – with the number of subjects. This makes large genome studies prohibitively expensive.

Recently, a team at Microsoft Research developed a new algorithm for running the mixed model whose computing requirements scale linearly – one-to-one – with the number of people in a study. This is a major step forward. Microsoft has installed the new algorithm – called FaST-LMM (Factored Spectrally Transformed Linear Mixed Model) – on its cloud platform, Windows Azure.
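The speed-up comes from a linear-algebra trick: factor the genetic similarity matrix once, rotate the data into its eigenbasis so the covariance becomes diagonal, and each SNP test then collapses to a cheap weighted regression. The sketch below illustrates that idea in Python with toy data, fixed variance components and invented names; it is not Microsoft’s FaST-LMM code, which also estimates the variance components by maximum likelihood and can build the similarity matrix from a subset of SNPs so that even the factorisation scales linearly with the number of people.

```python
# A minimal sketch (not FaST-LMM itself) of the "factored spectrally
# transformed" idea: factor the genetic similarity matrix once, rotate the
# data so its covariance becomes diagonal, then score each SNP with a cheap
# weighted regression. Sizes, data and variance components are toy values.

import numpy as np

rng = np.random.default_rng(0)
n_people, n_snps = 500, 2000

# Toy genotypes (0/1/2 minor-allele counts), centred per SNP, and a phenotype
# made of a weak polygenic signal plus noise.
G = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)
G -= G.mean(axis=0)
y = G @ rng.normal(0.0, 0.05, n_snps) + rng.normal(0.0, 1.0, n_people)

# Genetic similarity ("kinship") matrix and its one-off spectral factorisation.
K = G @ G.T / n_snps
S, U = np.linalg.eigh(K)          # K = U diag(S) U^T; expensive, but paid once

# Variance components are fixed here for brevity; the real method fits them
# by maximum likelihood.
sigma_g2, sigma_e2 = 0.5, 0.5
w = 1.0 / (sigma_g2 * S + sigma_e2)   # inverse variances in the rotated space

# Rotate the phenotype and all SNPs once; each association test is then a
# one-dimensional weighted least-squares fit, linear in the number of people.
y_rot = U.T @ y
G_rot = U.T @ G

def snp_score(x_rot):
    """Effect estimate and z-like score for one rotated SNP."""
    xtwx = np.sum(w * x_rot * x_rot)
    beta = np.sum(w * x_rot * y_rot) / xtwx
    return beta, beta * np.sqrt(xtwx)

scores = np.array([snp_score(G_rot[:, j])[1] for j in range(n_snps)])
print("strongest association score:", float(np.abs(scores).max()))
```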

“When I hear [the term] ‘big data’, I think of hundreds of thousands of individuals – and the DNA for those individuals, and the intricate algorithms we need to process all that data,” says David Heckerman, Distinguished Scientist at Microsoft Research in Los Angeles. Today, the process of sequencing an individual’s genome is relatively simple, but analysing the sequenced data is arduous and complex. That is where FaST-LMM comes into its own (Nature Methods, DOI: 10.1038/nmeth.1681 and DOI: 10.1038/nmeth.2037).

Heckerman’s team has applied the FaST-LMM machine-learning algorithms to various data sets provided by collaborators including the Wellcome Trust in Cambridge, UK. The Wellcome Trust 1 data set contains anonymised genetic data from about 2000 people for each of seven major medical conditions: bipolar disorder, coronary artery disease, hypertension, inflammatory bowel disease, rheumatoid arthritis, and diabetes types 1 and 2. It also contains a shared set of data for about 1300 healthy controls.

Novel insights

There has been extensive work on finding associations between diseases and genetic variants called single nucleotide polymorphisms (SNPs). Most analyses have looked at one SNP at a time. But interactions between two or more variants can have substantial effects on disease, so Heckerman and his team are searching for combinations of SNPs that make people more or less susceptible to these seven conditions. Although this technique dramatically increases the computing resources needed, the hope is that it will prove to be very powerful and reveal novel insights for creating treatments.
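A rough calculation shows why the combinatorial search is so much more demanding than a standard one-SNP-at-a-time scan. The SNP count and per-test cost below are illustrative assumptions, not figures from Heckerman’s study.

```python
# Back-of-the-envelope comparison of a single-SNP scan with an exhaustive
# pairwise search. Both input numbers are assumptions for illustration only.

from math import comb

n_snps = 500_000              # a typical genotyping-array scale (assumed)
per_test_seconds = 1e-3       # assumed cost of one linear-scaling test

single_tests = n_snps
pair_tests = comb(n_snps, 2)  # every unordered pair of SNPs

for label, tests in [("single-SNP scan", single_tests),
                     ("all SNP pairs", pair_tests)]:
    core_hours = tests * per_test_seconds / 3600
    print(f"{label}: {tests:.2e} tests, roughly {core_hours:,.1f} core-hours")
```

Under these toy assumptions the pairwise search needs hundreds of thousands of times more computing than the single-SNP scan, which is why elastic cloud capacity matters.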

Genome studies offer the hope of new treatments that will let people live longer, healthier lives

The SNPs are being stored in the cloud instead of on conventional hardware in a company or research lab. Instead of having to buy the costly infrastructure of an on-premises HPC cluster to analyse the data, researchers are using high-performance computing methods in the cloud. This makes economic sense because they only need the resources for a limited time. The cloud route is significantly cheaper – a fact that should not be underestimated, because it opens big data opportunities to a far wider range of researchers. Computing in the cloud can also make it easier for scientists to share their software and data with others. Using these techniques, Heckerman’s team aims to cut the time needed to make new discoveries and treatments.

Resource management is one of the primary questions to be answered with big data. It is not just a case of determining the scale of resources a project needs, but also of deciding how to configure them, all within the available budget. For example, running a large project on fewer machines might save on hardware costs but will make the project take longer.

For the Wellcome Trust project, the team’s resources included a combination of data storage and HPC servers, all on Windows Azure. This gave them access to tens of thousands of computer cores on which to run the FaST-LMM algorithm. “For this project, we would need to do about 1000 computer-years of work. With Windows Azure we got that work done in about 13 days,” Heckerman says. The analysis revealed a new set of relationships between genetic variants and coronary artery disease, which are now being followed up (Scientific Reports, DOI: 10.1038/srep01099).
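Those two quoted figures pin down the scale of the cluster involved. A quick back-of-the-envelope calculation, assuming the work parallelises perfectly, gives the core count they imply:

```python
# Rough arithmetic behind the quoted figures: how many cores are needed to
# squeeze about 1000 computer-years of work into roughly 13 days? The inputs
# come from the article; perfect parallel scaling is assumed.

compute_years = 1000
wall_clock_days = 13

cores = compute_years * 365 / wall_clock_days
print(f"about {cores:,.0f} cores running continuously")   # roughly 28,000 cores
```

The result, around 28,000 cores, is consistent with the “tens of thousands of computer cores” the team reports using.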

The Wellcome Trust project continues, but it represents just the beginning of a major shift in how research data are stored and analysed. “With the huge amount of data that’s coming online, we’re now able to find connections between our DNA and who we are that we could never have found before,” Heckerman says.

An easy route to global models

If understanding the human body represents a complex and intricate problem, consider the difficulties of getting to grips with the planet and its ecosystems. Any approach must operate at every level, from the microscopic to the global, from soil science and biodiversity to forest dynamics, carbon modelling, climate modelling and disease pandemics.

As head of the computational ecology and environmental science group at Microsoft Research in Cambridge, UK, Drew Purves leads an ambitious programme of research tackling fundamental problems in these areas. Its ultimate goal is to model and predict the future of all life on Earth, work that is critical to bridging the gap between science and effective environmental policy.

One priority is the development of analytical “pipelines” that connect data to models and then to predictions. Environmental science is an excellent proving ground for such big-data initiatives – it has a wide variety and large volumes of data, which need to be captured rapidly. However, the approaches and tools developed here can be applied more broadly in fields such as business, government and education.

Until now, there have been huge technical barriers to building good global ecological models

Climate scientists, for example, may wish to use data to produce a model to make predictions – how climate change will alter ecosystems, say. Alternatively, they may want to use a model’s predictions to create data, such as how changing ecosystems will influence further climate change.

“We know, fundamentally, how to build these models,” says Purves. “But the technical barriers at the moment are so high that it’s the domain of specialists, which in turn means that it’s only the world’s largest organisations that can afford to support that kind of data-to-prediction pipeline.”

So Purves and his team have created a new web-browser application, called Distribution Modeller, which aims to make models and big-data analysis accessible to more people. A researcher can load data into the system – let’s say annual wheat production figures for countries around the world – and then call up global surface temperature and rainfall figures using FetchClimate, a tool also developed by Purves’s team, which supplies historical and contemporary climate data from a series of scientific sources around the globe.

With the touch of a button, our researcher can create a model linking wheat production to surface temperature and rainfall. It is then possible to compare the model with what happens in the real world, and make predictions about what would happen to wheat harvests if temperature and rainfall change.

“The model can be run in Windows Azure on demand whenever and wherever it’s needed, without ever leaving the browser,” says Purves. Because all the data and the model are in the cloud, they can be shared and adapted by others. “We’re looking at examples that would have taken weeks, and now take about 3 to 5 minutes to begin with data and end up with a model.”
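As a rough illustration of the data-to-model-to-prediction pipeline described above (and not the actual Distribution Modeller or FetchClimate, whose interfaces are not shown here), the Python sketch below fits a simple model linking wheat yield to temperature and rainfall using invented figures, then reruns it under an assumed warmer, drier scenario.

```python
# A toy version of the data -> model -> prediction pipeline described above.
# This is not Distribution Modeller or FetchClimate: the per-country yield,
# temperature and rainfall values are invented, and a plain least-squares fit
# stands in for the real ecological models.

import numpy as np

# Invented observations: growing-season temperature (deg C), annual rainfall
# (mm) and wheat yield (tonnes per hectare) for six hypothetical countries.
temperature = np.array([12.0, 15.0, 18.0, 21.0, 24.0, 27.0])
rainfall    = np.array([650.0, 700.0, 600.0, 500.0, 420.0, 350.0])
yield_t_ha  = np.array([3.1, 3.4, 3.0, 2.4, 1.9, 1.4])

# Step 1: fit a model, yield ~ a*temperature + b*rainfall + c, by least squares.
X = np.column_stack([temperature, rainfall, np.ones_like(temperature)])
coef, *_ = np.linalg.lstsq(X, yield_t_ha, rcond=None)

# Step 2: run the model under an assumed scenario (2 deg C warmer, 10% less
# rain) and compare with the model's baseline predictions.
baseline = X @ coef
scenario = np.column_stack([temperature + 2.0, rainfall * 0.9,
                            np.ones_like(temperature)])
predicted = scenario @ coef
print("predicted change in yield (t/ha):", np.round(predicted - baseline, 2))
```

In Purves’s browser tool the equivalent steps run in Windows Azure, which is what makes the data and the fitted model easy to share and rerun.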

Tools like FetchClimate and Distribution Modeller go to the heart of Purves’s quest. “Ever since I’ve been in the group,” he says, “we’ve had this two-part mission: to do the fundamental research to generate the predictive models we need in environmental sciences on one hand, and, on the other hand, to look for opportunities to develop software that accelerates that work for us and therefore accelerates that kind of work for others.

“If we get that right, we might be able to have a much larger impact. With the support of colleagues across the company, I aim to raise our prototypes to the scope of the software that Microsoft produces in other areas. Then we could really have a dramatic impact on whole swathes of environmental science in a way that a traditional scientific career just basically can’t.”

Scientists have a lot of work ahead to develop a deep understanding of our wonderful and mysterious planet. Technology must play a central role in helping scientists to progress, whether they are interested in the oceans, deserts, Earth’s core or the atmosphere and beyond. Purves’s group at Microsoft Research is supporting that progress by offering its technology solutions, by pioneering entirely new kinds of scientific software developed by its own scientists, and by exploring which methods and tools we need to build next.

Big data challenges

Along with the many opportunities, data-intensive science will also bring complex challenges. Many scientists are concerned that the data deluge will make it increasingly difficult to find relevant data and to understand the context of shared data. Today, it is difficult for a person even to track and maintain their own health records. Now envision the magnitude, diversity and dispersed nature of the data generated by the life sciences, genetics and bioinformatics over the next 10 years.

The management of data presents increasingly difficult issues. How do international, multidisciplinary and often competitive groups of researchers address challenges related to data curation, the creation and use of metadata, ontologies and semantics, and still conform to the principles of security, privacy and data integrity? Given that government organisations and commercial corporations struggle with these issues already, imagine the challenges that a loosely connected community of researchers will face.

Finally, a sustainable economic model for data-intensive research will need to emerge. Researchers must have time and money to create, curate, store and share large data sets. Some of these costs may exceed what they now spend to use their data locally or share it with a small circle. In future, data sets will be treated more as a costly resource to be shared by many researchers. So it is no wonder that many government funding agencies are re-examining their criteria for providing public research funds and open access to the results of that research.

Like many scientists, Microsoft and Microsoft Research see amazing opportunities in cloud computing and big data. People who overcome these challenges and embrace the capabilities will see novel and diverse opportunities for exploration and discovery.

Tony Hey is Vice President of Microsoft Research Connections. He builds partnerships with scientists and scientific organisations around the world. Before starting at Microsoft Research he was an academic at the University of Southampton, UK, first in physics and then in computer science, ending his time there as Dean of Engineering and Applied Science

For more about big data, cloud computing, green IT, FetchData and Distribution Modeller go to newscientist.com/cloudup

Models of the environment that are easy to make and trustworthy should provide a bridge between science and policy

