big data, democratized analytics and deep context,

1

Big Data, Democratized Analytics and Deep Context will Change How We

Think and Do Development

Aniket Bhushan, Senior Researcher, The North-South Institute, [email protected]

Dec 21, 2011

International development as a field of research and practice has been more of a laggard than a

leader in using big data and powerful analytics. Much of the data is often dated or of poor

quality. Huge areas including those of the greatest interest remain entirely unmapped, data poor

or otherwise poorly understood.

This situation is changing faster than anyone predicted and the set of tools driving this evolution

represents one of the most important trends in international development. The proliferation of

mobile technologies, computing power and democratization of analytics within an open-source,

open-data chapeau will fundamentally change the way we think about and do development. I

provide a synopsis of the most impactful developments in three key areas: the base (data) layer,

the analysis layer, and the feedback (data) layer. Together these advances are changing how we

think about data in development, how we develop a deeper contextual understanding at a very

micro level without losing the ability to aggregate and generalize, and how we bring it all

together in a meaningful way to make better decisions.

mailto:[email protected]

2

Base layer: Big and Open

Priavte sector (proprietary)

Public sector data at

various levels

International institutions: World Bank,

IMF, UN Data

Analytics layer

Virtualization

Visualization

Feedback layer

"crowds"

Purposive - push

Anonymous - big

3

The Base Layer: Open Data and Big Data

How do we know what we know in the field of international development? What is the

information, or evidence base, who generates it and how? There are at least three main

collectors, sorters and repositories of development information: international institutions (UN,

World Bank, IMF etc), national and sub-national official public sector institutions and the private

sector. Change is afoot in each.

Take for instance the open data push by

international institutions such as the World Bank

and African Development Bank, or the UN’s Global

Pulse initiative. Opening up the World Bank’s

databank has made a huge amount of information

available to a wider range of stakeholders than ever

before, similarly Global Pulse is creating a platform

to harness new data streams. Groups such as

Development Gateway and Aid Data have sized the

opportunity to push openness further. A good example of what is made possible by this opening

is a tool like Development Loop which not only plots all World Bank and African Development

Bank projects at precise geographic locations across Africa but also overlays the same with

feedback sourced from the intended beneficiaries of the project or initiative.

This full circle or loop is a powerful reminder of

the importance of data transparency and universal

standards. Opening up aid data in a standardized

format will make geocoding a potent tool for real

transparency and accountability.

To appreciate what a game-changer open data can

be examine the current state of affairs. Research

conducted by UK based Publish What You Fund

recently in Uganda aimed at simply tallying up

financial resources available for development,

found that the government was unaware of the amount donors planned to spend in that year

(2006-07). The planned expenditure was more than double what the government was aware of;

indeed financial resources flowing into the country were far higher than had been estimated. Or

take another example, what the World Bank’s Chief Economist for Africa calls a “statistical

tragedy”. The majority of Africa’s population lives in countries that still use an outdated (1960s)

method of national income accounting used to generate fundamental data points such as Gross

Domestic Product (GDP). Ghana for instance only shifted to the 1993 UN system of national

http://www.unglobalpulse.org/

http://www.unglobalpulse.org/

http://www.aiddata.org/content/index/Maps/development-loop-app

http://www.publishwhatyoufund.org/resources/uganda/

http://blogs.worldbank.org/africacan/africa-s-statistical-tragedy

http://blogs.worldbank.org/africacan/africa-s-statistical-tragedy

4

accounts last year. When they did so they found their GDP was 62 percent higher than

previously thought, catapulting the country to ‘middle income’ status.

If even the most basic information often taken for granted is riddled with problems then what do

we really know? Data reliability is one issue but time-lag is another. Most of the data used in

international development is stale by the time it is called upon in decision making. The

information base we rely on in international development needs to be bolstered by building

bridges with new sources and data-streams.

Opening up proprietary private sector data and exposing it to the concerns in international

development in the coming years will be a game changer. To date international institutions have

made the most progress towards data openness, select public sector authorities (for instance

under the purview of the Open Government Partnership) are also making progress, but it is the

private sector – the main repository of “big data” – that is the holy grail. If you total all the data

collected by the US Library of Congress (one of the largest public sector repositories) it would

be about 235 terabytes as of April 2011. Wal-Mart processes and stores about 2500 terabytes per

hour! The big data revolution is changing business models fundamentally, and like capital and

labour data itself has become a commercial driver and firms like Google, Facebook, Twitters are

the first “data factories”, pioneers much like their antecedents in the industrial revolution.

This is truly big data and its growth has exploded off

the charts thanks in large part to the explosion of

mobile sensors (the most important of which is the

mobile phone but think also credit cards, laptops,

GPS and everything from radio-frequency to QR

codes), and the rapid democratization of high-power

analytics. Whether it is geocoded mobile phone data

modeled to track slum development or predict

microfinance loan defaults or provide weather-

indexed insurance to small farmers big data is

already a game-changer in development and we have

only begun to scratch the surface.

The Analytics Layer: Virtualization, Visualization are driving Democratization

Analytics is simply the collection of tools and techniques used to make sense of data. High-

power analytics proved a game-changer in the commercial sector and can now do the same in the

social sector. At its core analytics is about unearthing and understanding relationships and

patterns. Analytics helped retailers discover unlikely trends, most famously that customers who

came in to buy diapers also tended to buy beer! It can do the same for complex social systems.

http://www.tableausoftware.com/learn/whitepapers/big-data-revolution

http://www.tableausoftware.com/learn/whitepapers/big-data-revolution

http://www.tresata.com/blog/data-factory-say-that-again/

5

Developments in analytics have kept pace with the speed with which big data has grown.

Bringing this capacity to bear on development challenges such as food security and urbanization

is just getting started.

Before we look at some examples let us

put in perspective what we mean by the

explosion of mobile sensors. The mobile

phone is already the most celebrated

example; anyone who has visited Kenya

(or really any African country) has seen

vividly the power of mobile money. This

is old news. Anyone who has followed

elections in Kenya probably also heard of

Ushahidi, a locally developed open-source

platform to map reports of post-election

violence which went live in 2008. This is

also old news.

What is new is what is now made possible by bringing to bear requisite analytics on the huge

proliferation of mobile sensors. Mobile phones have grown from under 750million with more

than two thirds in developed countries to over 5billion with about four times as many in

developing countries as in the developed world. Of the 5billion about 1billion live on less than

$5 a day. The developing world is the leading driver of mobile big data. This includes voice,

text, financial, locational and positional information, which is now possible to overlay with the

base data layer described earlier (income, health, education and other indicators generated by

official sources) to produce new insights into real behaviour and complex incentive structures.

Sticking with Kenya, take the example of the

Engineering Social Systems lab. Coupling

terabytes of mobile phone data with Kenyan

census information ESS is modeling the growth

of slums to inform urban planners about where to

locate services such as water pumps and public

toilets. In Uganda the same group is developing

causal structures of food security, in Rwanda

they collected a sample of every phone-call over

a four year period, coupled with a random

survey, to analyze how different people react to the same economic shock. What is really

interesting is the way experimental initiatives are being brought out of the lab into real world

http://ushahidi.com/

http://www.hsph.harvard.edu/ess/

6

application. The shift that is taking place is fundamental; away from models governing theory to

models informed and built on real networks (see Reality Mining).

For long development analysis at best has been limited to correlations and inferences based on

correlations, for the first time big data coupled with high-power analytics is opening up the

possibility if not of entirely causal dynamics then at least more robust inferences. Our traditional

methods of inquiry have conditioned us to think in terms of generalizing on the basis of random

sampling, for the first time the proliferation of mobile sensors is making possible highly targeted

yet nonintrusive and anonymous inquiry.

And this is not another story about mobile phones alone. The point is the rapid emergence of

altogether new data-streams, in step with the development of analytical capacity to draw useful

inference out of them. Take Twitter for instance which generates information about the size of

the entire US Library of Congress in two weeks and together with Facebook has already shown

its efficacy during the Arab uprisings. At the heart of this evolution are open-source software

systems and tools that allow the simultaneous collection, categorization, and analysis of various

data types from Twitter hashtags to videos to positional data and machine IDs. Swift River

developed by Ushahidi is an example of a free open-source platform that enables rapid

simultaneous filtering and verification of real-time data from channels like Twitter, SMS, Email

and others. It also visualizes the information in dashboards that the average user can understand.

This is particularly powerful for monitoring immediate post-crisis developments when the

information flow suddenly increases but is also only useful if immediately analyzed.

Democratization of analytics driven by a

commitment to open-source is furthered by

virtualization of platforms and

visualization of information (to make it

engaging for the average user). An aspect

of virtualization is community or crowd

driven problem solving. Take for instance

Data Without Borders, a pro-bono data

scientist exchange. DWB organizes ‘data

dives’ to help NGOs, civil society

organizations and other who might not

have the time, capacity or inclination but

may be sitting on information useful for

purposes beyond their imagination, to make sense of their own information. At a recent data dive

DWB helped a human rights group that allows users to anonymously upload information about

violations get a better look at who was using the system and plot trends, without compromising

http://reality.media.mit.edu/

http://ushahidi.com/products/swiftriver-platform

http://datawithoutborders.cc/

7

anonymity. Using open-source tools (such as R) they also visualized the information to make it

easy to understand.

Here are some in a fast growing toolkit worth following:

Social network analysis: with the growing penetration of web 2.0 technologies, social

media is becoming the dominant communication channel for rapid exchange. Network

analysis includes a set of techniques used to characterize relationships among discrete

nodes in a graph or a network. In social network analysis, connections between

individuals in a community or organization are analyzed, e.g., how information travels, or

who has the most influence over whom. Examples of applications include identifying key

opinion leaders, and identifying bottlenecks in enterprise information flows. Two

important techniques within network theory and analysis are exploratory data analysis

(EDA) and link analysis. EDA is an approach for analyzing large datasets in summarized

formats according to their main characteristics. EDA came about in part as a reaction to

overdue emphasis in statistical fields on hypothesis testing or “confirmatory analysis”.

EDA emphasizes using the data to suggest hypotheses to test. It emphasizes testing

assumptions on which inference is based. Similarly link analysis focuses on analyzing

relationships among nodes through visualization methods, including network diagrams

and association matrices. Gephi is an interactive open source (java based) platform for

complex systems and network analyses.

Automated web-scraping: almost everything is on the web today, which means almost

everything has a hypertext (HTTP) related identity. Web scraping is an automated

technique for collecting information from the web. Web scraping transforms typically

unstructured data in HTML format into structured data that can be centralized in a

spreadsheet. Simply using MS Excel it is possible for instance to scrape tables and other

data in a variety of unstructured formats (including real-time, e.g. weather information on

world clock, or stock price info), and save them as a spreadsheet for further use. Web

scraping, though somewhat mired in legal and privacy issues, has the potential to

massively increase the volume of accessible information. For example UN Global Pulse

is running a project with Price Stats and the Billion Prices Project at MIT, investigating

daily bread prices across Latin American countries by scraping the web for online prices.

The end product is a demonstration, an e-bread index which tracks bread price inflation

real-time (daily), and can be compared contrasted or can complement the traditional

consumer price index (which is only published on a monthly basis). UN Global Pulse is

also pioneering Hunch Works, the first social network for hypothesis formation, evidence

gathering and decision making. Hunch Works allows researchers to connect with other

experts with complementary resources so that together they could quickly determine if

data signals are indications of deepening crisis and warrant further investigation.

http://gephi.org/

http://www.unglobalpulse.org/projects/comparing-global-prices-local-products-real-time-e-pricing-bread

http://www.unglobalpulse.org/technology/hunchworks

8

World Bank – Adept software platform: ADePT is a free (STATA based) platform

developed by the World Bank that automates and standardizes analysis. ADePT allows

complex statistical analyses, and direct pre-configured access to a range of micro level

data from the Bank’s and other sources. It is particularly useful for economic analysis and

particularly useful in developing countries where researchers may not have ready access

to otherwise expensive statistical packages.

Hadoop – distributed parallel analytics: is big data is the latest buzz word in IT then

Hadoop is a large part of the story behind it. Hadoop is a platform for distributing

problems, tasks, analyses across a number of servers, speeding up analysis and shrinking

the distance between data, analysis and result. Hadoop works behind things you know

well but you have never seen it or are likely to know it. For example one of the most

well-known implementations is Facebook, which brings in core data stored by you into

Hadoop clusters where it is reflected against your friends, their interest to suggest

recommendations back to you. High power distributed parallel analytics solutions like

Hadoop help square the big data circle, make it small, in real-time.

The Feedback Layer: Deep Context, Complex Microsystems, Real-Time Loops

The efficacy of the feedback layer is also new. This layer has two key aspects: the purposive or

push driven response and the big anonymous response (discussed above). Targeted

crowdsourcing has already come a long way. The Ushahidi experience in Kenya for instance also

worked for monitoring elections in Afghanistan and tracking emergencies including the cholera

outbreak during the Haiti earthquake. Mobile phone SMS platforms have been adapted to make

participatory budgeting more inclusive in hard to reach areas such as conflict-affected South-

Kivu in the Democratic Republic of Congo and results have been encouraging.

To understand how powerful the feedback layer can be consider the experience of the Mobile

Accord, which at the initiative of the World Bank’s World Development Report 2011, ran Geo

Poll an SMS based targeted polling in the DRC. The poll asked 10 sensitive questions including

about topics such as rape and violence against women in conflict zone. The survey produced

1.2million text responses and the outputs were turned into a video “DRC Speaks” which

captured people’s responses to questions about their experiences in their own words. This ended

up being one of the largest surveys ever conducted in the country.

http://go.worldbank.org/UDTL02A390

http://www.youtube.com/watch?v=1VtaWPAtHyA

9

There is a pattern here. In the base layer more and more (new and old) data is opening every day.

In the analytics layer experimental ideas are leaving the lab for real world application;

virtualization and visualization are helping foster new communities geared towards collaboration

and collective problem solving. Similarly in the feedback layer the tools are also democratizing.

Ushahidi has created a very easy to use version of their implementation called CrowdMap.

Anyone who knows how to set up an email account will be able to set up their own incident

mapping of whatever trend, alert or issue they are interested in getting feedback from the crowd

on. The service is already being used to track everything from citizen report cards assessing

corruption in India to the Syrian uprising to national emergencies on the tiny island of Samoa.

The feedback layer is also innovating in reverse – backwards from new to old media and formats

– something essential in international development where so much remains unmapped or in non-

digital formats. Take mapping. In the developing world there are still vast areas beyond GPS

coverage. But people in those communities have intimate knowledge of their surroundings. If

only there was a way to tap into that knowledge and build a bridge between that data source and

an open-source digital resource like Open Street Map. Walking Papers does precisely that.

Contributors can simply draw, by hand, on simple paper, say a map of their neighbourhood and

upload to Walking Papers, where a community specializes in taking that information and using it

to deepen Open Street Maps.

Looking Ahead

International development as a field of practice and research has tended to be a laggard in using

big data, powerful analytics and innovative sourcing techniques, and with good reasons.

However this is changing faster than anyone predicted and faster than most organizations that

‘do development’ are prepared for. The open data movement has widened access to a broad

range of basic contextual information. A similar push is needed to open private sector data in the

service of social good. Big data is beginning to have a big impact on how we think about

development challenges and this is because we have the ability to make it understandable like

never before. Powerful analytical tools and collaborative platforms are dramatically changing

https://crowdmap.com/mhi

http://india.cr/

http://india.cr/

https://syrianspring.crowdmap.com/main

https://alerts.crowdmap.com/main

http://walking-papers.org/

10

what is possible for even the most intractable challenges like understanding socioeconomic risks

and responses, dealing with urban planning, and better preparing for emergencies. For the first

time we have a feedback layer which has made possible deep and near real-time awareness of

what is working or not working where and why. Together big data, democratized analytics and

the ability to tap deep contexts will change the way we think and do development in the coming

years.

11

Bibliography Aid Data, Development Gateway, African Development Bank - Development Loop. Development Loop. n.d.

http://www.aiddata.org/content/index/Maps/development-loop-app (accessed Dec 21, 2011).

Courtney, Alexa, and David Kilcullen. "Big data, small wars, local insights: Designing for development with conflict-affected

communities." What Matters. McKinsey & Company, December 2, 2011.

Crowd Map. n.d. https://crowdmap.com/mhi.

Data Without Borders. n.d. http://datawithoutborders.cc/.

Davis, Steve, and Jonathan Bays. "Harnessing big data to address the world’s problems." What Matters. McKinsey & Company,

November 2, 2011.

DRC Speaks (Geo Poll). n.d. http://www.youtube.com/watch?v=1VtaWPAtHyA.

Eagle, Nathan. Reality Mining. Dataset: http://reality.media.mit.edu/, Massachusetts: Massachusetts Institute of Technology,

2009.

Engineering Social Systems. "Big Data for Social Good." Engineering Social Systems. n.d. http://ess.santafe.edu/bigdata.html

(accessed December 21, 2011).

Jay Chen, Trishank Karthik, Lakshminaryanan Subramanian. Contextual Information Portals. Online: http://ai-d.org/, Association

for the Advancement of Artificial Intelligence, n.d.

McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity. McKinsey & Company, 2011.

Metha, Abhishek. Big Data: Powering the Next Industrial Revolution. White paper:

http://www.tableausoftware.com/learn/whitepapers/big-data-revolution, Seattle: Tableau Software, n.d.

Priebatsch, Seth. The game layer on top of the world. Video online at:

http://www.ted.com/talks/lang/en/seth_priebatsch_the_game_layer_on_top_of_the_world.html, Boston: Ted Talks,

2010.

Publish What You Fund. "Aid Budgets in Uganda." Publish what you fund. n.d.

http://www.publishwhatyoufund.org/resources/uganda/ (accessed December 21, 2011).

Shantayanan Devarajan. "Africa’s statistical tragedy." Africa Can End Poverty (blog of World Bank, Africa Chief Economist).

October 6, 2011. http://blogs.worldbank.org/africacan/africa-s-statistical-tragedy (accessed October 6, 2011).

Swift River. n.d. http://ushahidi.com/products/swiftriver-platform.

UN Global Pulse. n.d. http://www.unglobalpulse.org/.

Ushahidi. n.d. http://ushahidi.com/.

Walking Papers. n.d. http://walking-papers.org/.

big data, democratized analytics and deep context,

Documents

data openness

aid data

data pooror

data reliability

base data layer

feedback data layer

big data revolution

opendata chapeau