Big Data, Democratized Analytics and Deep Context will Change How We
Think and Do Development
Aniket Bhushan, Senior Researcher, The North-South Institute
Dec 21, 2011
International development as a field of research and practice has been more of a laggard than a
leader in using big data and powerful analytics. Much of the data is dated or of poor
quality. Huge areas, including those of the greatest interest, remain entirely unmapped, data-poor
or otherwise poorly understood.
This situation is changing faster than anyone predicted and the set of tools driving this evolution
represents one of the most important trends in international development. The proliferation of
mobile technologies and computing power, and the democratization of analytics under an open-source,
open-data umbrella, will fundamentally change the way we think about and do development. I
provide a synopsis of the most impactful developments in three key areas: the base (data) layer,
the analysis layer, and the feedback (data) layer. Together these advances are changing how we
think about data in development, how we develop a deeper contextual understanding at a very
micro level without losing the ability to aggregate and generalize, and how we bring it all
together in a meaningful way to make better decisions.
[Figure: the three layers. Base layer (big and open): private sector (proprietary) data; public sector data at various levels; international institutions (World Bank, IMF, UN Data). Analytics layer: virtualization, visualization. Feedback layer: "crowds", both purposive (push) and anonymous (big).]
The Base Layer: Open Data and Big Data
How do we know what we know in the field of international development? What is the
information, or evidence base, and who generates it and how? There are at least three main
collectors, sorters and repositories of development information: international institutions (UN,
World Bank, IMF, etc.), national and sub-national public sector institutions, and the private
sector. Change is afoot in each.
Take for instance the open data push by
international institutions such as the World Bank
and African Development Bank, or the UN’s Global
Pulse initiative. Opening up the World Bank’s
databank has made a huge amount of information
available to a wider range of stakeholders than ever
before. Similarly, Global Pulse is creating a platform
to harness new data streams. Groups such as
Development Gateway and Aid Data have seized the
opportunity to push openness further. A good example of what this opening makes possible
is a tool like Development Loop, which not only plots all World Bank and African Development
Bank projects at precise geographic locations across Africa but also overlays them with
feedback sourced from the intended beneficiaries of each project or initiative.
This full circle or loop is a powerful reminder of
the importance of data transparency and universal
standards. Opening up aid data in a standardized
format will make geocoding a potent tool for real
transparency and accountability.
To appreciate what a game-changer open data can
be, examine the current state of affairs. Recent research
in Uganda by UK-based Publish What You Fund,
aimed simply at tallying up the
financial resources available for development,
found that the government was unaware of the amount donors planned to spend in the year studied
(2006-07). The planned expenditure was more than double what the government was aware of;
indeed financial resources flowing into the country were far higher than had been estimated. Or
take another example: what the World Bank’s Chief Economist for Africa calls a “statistical
tragedy”. The majority of Africa’s population lives in countries that still use an outdated (1960s)
method of national income accounting to generate fundamental data points such as Gross
Domestic Product (GDP). Ghana, for instance, only shifted to the 1993 UN system of national
accounts last year. When they did so they found their GDP was 62 percent higher than
previously thought, catapulting the country to ‘middle income’ status.
If even the most basic information, often taken for granted, is riddled with problems, then what do
we really know? Data reliability is one issue; time lag is another. Most of the data used in
international development is stale by the time it is called upon in decision making. The
information base we rely on in international development needs to be bolstered by building
bridges with new sources and data streams.
Opening up proprietary private sector data and exposing it to the concerns of international
development will be a game changer in the coming years. To date international institutions have
made the most progress towards data openness, and select public sector authorities (for instance
under the purview of the Open Government Partnership) are also making progress, but it is the
private sector – the main repository of “big data” – that is the holy grail. If you total all the data
collected by the US Library of Congress (one of the largest public sector repositories) it would
be about 235 terabytes as of April 2011. Wal-Mart processes and stores about 2500 terabytes per
hour! The big data revolution is changing business models fundamentally. Like capital and
labour, data itself has become a commercial driver, and firms like Google, Facebook and Twitter are
the first “data factories”, pioneers much like their antecedents in the industrial revolution.
This is truly big data, and its growth has exploded off
the charts thanks in large part to the proliferation of
mobile sensors (the most important of which is the
mobile phone, but think also of credit cards, laptops,
GPS and everything from radio-frequency tags to QR
codes) and the rapid democratization of high-power
analytics. Whether it is geocoded mobile phone data
modeled to track slum development, predict
microfinance loan defaults or provide weather-indexed
insurance to small farmers, big data is
already a game-changer in development and we have
only begun to scratch the surface.
The Analytics Layer: Virtualization and Visualization are Driving Democratization
Analytics is simply the collection of tools and techniques used to make sense of data. High-
power analytics proved a game-changer in the commercial sector and can now do the same in the
social sector. At its core analytics is about unearthing and understanding relationships and
patterns. Analytics helped retailers discover unlikely trends, most famously that customers who
came in to buy diapers also tended to buy beer! It can do the same for complex social systems.
Developments in analytics have kept pace with the speed with which big data has grown.
Bringing this capacity to bear on development challenges such as food security and urbanization
is just getting started.
Before we look at some examples let us
put in perspective what we mean by the
explosion of mobile sensors. The mobile
phone is already the most celebrated
example; anyone who has visited Kenya
(or really any African country) has seen
vividly the power of mobile money. This
is old news. Anyone who has followed
elections in Kenya has probably also heard of
Ushahidi, a locally developed open-source
platform for mapping reports of post-election
violence, which went live in 2008. This is
also old news.
What is new is what becomes possible when the requisite analytics are brought to bear on the huge
proliferation of mobile sensors. The number of mobile phones has grown from under 750 million, with more
than two thirds in developed countries, to over 5 billion, with about four times as many in
developing countries as in the developed world. Of the 5 billion, about 1 billion belong to people living on less than
$5 a day. The developing world is the leading driver of mobile big data. This includes voice,
text, financial, locational and positional information, which can now be overlaid on the
base data layer described earlier (income, health, education and other indicators generated by
official sources) to produce new insights into real behaviour and complex incentive structures.
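As a rough illustration of what overlaying mobile data on the base layer can mean in practice, the sketch below aggregates anonymized call records by district and joins them to official indicators. All names and figures (`CALL_RECORDS`, `CENSUS`, the districts and populations) are hypothetical placeholders, not data from any project mentioned here.

```python
from collections import Counter

# Hypothetical anonymized call records: (district, hashed caller id).
CALL_RECORDS = [
    ("district_a", "u1"), ("district_a", "u2"), ("district_a", "u1"),
    ("district_b", "u3"), ("district_b", "u4"),
]

# Hypothetical base-layer indicators from an official source.
CENSUS = {
    "district_a": {"population": 1000},
    "district_b": {"population": 500},
    "district_c": {"population": 200},
}


def overlay(call_records, census):
    """Join call counts per district onto the official indicators."""
    calls = Counter(district for district, _ in call_records)
    combined = {}
    for district, indicators in census.items():
        n_calls = calls.get(district, 0)  # districts with no records get 0
        combined[district] = {
            **indicators,
            "calls": n_calls,
            "calls_per_1000": round(1000 * n_calls / indicators["population"], 1),
        }
    return combined


layers = overlay(CALL_RECORDS, CENSUS)
for district, row in sorted(layers.items()):
    print(district, row)
```

Normalizing by population is only one possible choice; the point is that the two layers share a key (here, the district) on which they can be combined.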
Sticking with Kenya, take the example of the
Engineering Social Systems (ESS) lab. Coupling
terabytes of mobile phone data with Kenyan
census information, ESS is modeling the growth
of slums to inform urban planners about where to
locate services such as water pumps and public
toilets. In Uganda the same group is developing
causal structures of food security. In Rwanda
they collected a sample of every phone call over
a four-year period, coupled with a random
survey, to analyze how different people react to the same economic shock. What is really
interesting is the way experimental initiatives are being brought out of the lab into real world
application. The shift that is taking place is fundamental; away from models governing theory to
models informed and built on real networks (see Reality Mining).
For a long time development analysis has at best been limited to correlations and inferences based on
correlations. For the first time, big data coupled with high-power analytics is opening up the
possibility, if not of entirely causal dynamics, then at least of more robust inferences. Our traditional
methods of inquiry have conditioned us to think in terms of generalizing on the basis of random
sampling. For the first time, the proliferation of mobile sensors is making possible highly targeted
yet nonintrusive and anonymous inquiry.
And this is not another story about mobile phones alone. The point is the rapid emergence of
altogether new data streams, in step with the development of the analytical capacity to draw useful
inferences from them. Take Twitter, for instance, which generates information about the size of
the entire US Library of Congress in two weeks and, together with Facebook, has already shown
its efficacy during the Arab uprisings. At the heart of this evolution are open-source software
systems and tools that allow the simultaneous collection, categorization and analysis of various
data types, from Twitter hashtags to videos to positional data and machine IDs. Swift River,
developed by Ushahidi, is an example of a free open-source platform that enables rapid
simultaneous filtering and verification of real-time data from channels like Twitter, SMS, email
and others. It also visualizes the information in dashboards that the average user can understand.
This is particularly powerful for monitoring immediate post-crisis developments, when the
information flow suddenly increases but is useful only if analyzed immediately.
Democratization of analytics driven by a
commitment to open-source is furthered by
virtualization of platforms and
visualization of information (to make it
engaging for the average user). An aspect
of virtualization is community or crowd
driven problem solving. Take for instance
Data Without Borders, a pro-bono data
scientist exchange. DWB organizes ‘data
dives’ to help NGOs, civil society
organizations and others, who might not
have the time, capacity or inclination but
may be sitting on information useful for
purposes beyond their imagination, make sense of their own information. At a recent data dive
DWB helped a human rights group that allows users to anonymously upload information about
violations get a better look at who was using the system and plot trends, without compromising
anonymity. Using open-source tools (such as R) they also visualized the information to make it
easy to understand.
Here are some tools in a fast-growing toolkit worth following:
Social network analysis: with the growing penetration of web 2.0 technologies, social
media is becoming the dominant communication channel for rapid exchange. Network
analysis includes a set of techniques used to characterize relationships among discrete
nodes in a graph or a network. In social network analysis, connections between
individuals in a community or organization are analyzed, e.g., how information travels, or
who has the most influence over whom. Examples of applications include identifying key
opinion leaders, and identifying bottlenecks in enterprise information flows. Two
important techniques within network theory and analysis are exploratory data analysis
(EDA) and link analysis. EDA is an approach for analyzing large datasets in summarized
formats according to their main characteristics. EDA came about in part as a reaction to
undue emphasis in statistical fields on hypothesis testing or “confirmatory analysis”.
EDA emphasizes using the data to suggest hypotheses to test. It emphasizes testing
assumptions on which inference is based. Similarly link analysis focuses on analyzing
relationships among nodes through visualization methods, including network diagrams
and association matrices. Gephi is an interactive open-source (Java-based) platform for
complex systems and network analyses.
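To make the link-analysis idea concrete, here is a minimal sketch using only the Python standard library. It builds an association matrix from a list of contacts and ranks individuals by degree centrality, one simple proxy for who is best placed to spread information. The names and ties in `EDGES` are hypothetical, and a tool such as Gephi would offer far richer measures and visualization.

```python
from collections import defaultdict


def build_graph(edges):
    """Return an undirected adjacency map {node: set of neighbours}."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def association_matrix(graph):
    """Return (sorted nodes, 0/1 matrix); matrix[i][j] is 1 if i and j are linked."""
    nodes = sorted(graph)
    matrix = [[1 if b in graph[a] else 0 for b in nodes] for a in nodes]
    return nodes, matrix


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


# Hypothetical ties between members of a community.
EDGES = [("Amina", "Ben"), ("Amina", "Chidi"), ("Amina", "Dalia"), ("Dalia", "Esi")]

graph = build_graph(EDGES)
nodes, matrix = association_matrix(graph)
centrality = degree_centrality(graph)
most_connected = max(centrality, key=centrality.get)

print(nodes)
for row in matrix:
    print(row)
print("most connected:", most_connected)
```

Degree centrality is the simplest possible influence measure. Identifying bottlenecks, as described above, would normally call for a path-based measure such as betweenness, which is omitted here for brevity.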
Automated web scraping: almost everything is on the web today, which means almost
everything has an address reachable over the hypertext transfer protocol (HTTP). Web scraping is an automated
technique for collecting information from the web. It typically transforms
unstructured data in HTML format into structured data that can be centralized in a
spreadsheet. Simply using MS Excel it is possible, for instance, to scrape tables and other
data in a variety of unstructured formats (including real-time data, e.g. weather information on
a world clock, or stock price information), and save them as a spreadsheet for further use. Web
scraping, though somewhat mired in legal and privacy issues, has the potential to
massively increase the volume of accessible information. For example, UN Global Pulse
is running a project with Price Stats and the Billion Prices Project at MIT, investigating
daily bread prices across Latin American countries by scraping the web for online prices.
The end product is a demonstration, an e-bread index, which tracks bread price inflation
daily and can be compared and contrasted with, or can complement, the traditional
consumer price index (which is only published on a monthly basis). UN Global Pulse is
also pioneering Hunch Works, the first social network for hypothesis formation, evidence
gathering and decision making. Hunch Works allows researchers to connect with other
experts with complementary resources so that together they can quickly determine whether
data signals are indications of a deepening crisis and warrant further investigation.
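The scraping step described above, turning an HTML table into spreadsheet-ready rows, can be sketched with the Python standard library alone. The page content in `SAMPLE_HTML` and its prices are hypothetical, and the simple index at the end (first day = 100) is only an illustration, not the method used by the projects named above. A real scraper would fetch pages over HTTP and would need to respect each site's terms of use.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment listing a daily bread price.
SAMPLE_HTML = """
<table>
  <tr><th>date</th><th>price</th></tr>
  <tr><td>2011-12-01</td><td>2.00</td></tr>
  <tr><td>2011-12-02</td><td>2.10</td></tr>
  <tr><td>2011-12-03</td><td>2.20</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows

# Structured output: CSV text that any spreadsheet can open.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# Illustrative daily index with the first observed day set to 100.
base_price = float(rows[1][1])
index = [(date, round(100 * float(price) / base_price, 1)) for date, price in rows[1:]]

print(csv_text)
print(index)
```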
World Bank – ADePT software platform: ADePT is a free (Stata-based) platform
developed by the World Bank that automates and standardizes analysis. ADePT allows
complex statistical analyses and direct, pre-configured access to a range of micro-level
data from the Bank and other sources. It is particularly useful for economic analysis,
especially in developing countries where researchers may not have ready access
to otherwise expensive statistical packages.
Hadoop – distributed parallel analytics: if big data is the latest buzzword in IT then
Hadoop is a large part of the story behind it. Hadoop is a platform for distributing
problems, tasks and analyses across a number of servers, speeding up analysis and shrinking
the distance between data, analysis and result. Hadoop works behind things you know
well, though you have likely never seen it or known it was there. For example, one of the most
well-known implementations is Facebook, which brings core data stored by you into
Hadoop clusters, where it is compared against your friends and their interests to send
recommendations back to you. High-power distributed parallel analytics solutions like
Hadoop help square the big data circle, making it small, in real time.
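The idea of splitting one analysis across many servers can be illustrated with the map and reduce pattern that Hadoop's MapReduce model is built around. The sketch below is not Hadoop code and does not use its API: a local thread pool stands in for the cluster, and the `MESSAGES` being counted are hypothetical crowd reports.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_shards(records, n_shards):
    """Deal the records out into n_shards roughly equal parts."""
    return [records[i::n_shards] for i in range(n_shards)]


def map_shard(lines):
    """Map step: count words within a single shard, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def distributed_count(records, n_workers=3):
    """Run the map step on every shard in parallel, then reduce the partial counts."""
    shards = split_into_shards(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_shard, shards))
    return reduce(merge_counts, partials, Counter())


# Hypothetical text reports, e.g. incoming SMS messages.
MESSAGES = [
    "water pump broken",
    "road flooded",
    "water shortage",
    "pump repaired",
    "road blocked",
    "water truck arrived",
]

totals = distributed_count(MESSAGES)
print(totals.most_common(3))
```

Because each shard is counted without reference to the others, the same two functions would give the same totals whether the shards sit on three threads or, in a real deployment, on many separate machines.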
The Feedback Layer: Deep Context, Complex Microsystems, Real-Time Loops
The efficacy of the feedback layer is also new. This layer has two key aspects: the purposive or
push-driven response and the big anonymous response (discussed above). Targeted
crowdsourcing has already come a long way. The Ushahidi experience in Kenya, for instance, also
worked for monitoring elections in Afghanistan and tracking emergencies, including the cholera
outbreak that followed the Haiti earthquake. Mobile phone SMS platforms have been adapted to make
participatory budgeting more inclusive in hard-to-reach areas such as conflict-affected South
Kivu in the Democratic Republic of Congo, and results have been encouraging.
To understand how powerful the feedback layer can be, consider the experience of Mobile
Accord, which at the initiative of the World Bank’s World Development Report 2011 ran Geo
Poll, an SMS-based targeted poll, in the DRC. The poll asked 10 sensitive questions, including
on topics such as rape and violence against women in conflict zones. The survey produced
1.2 million text responses, and the outputs were turned into a video, “DRC Speaks”, which
captured people’s responses to questions about their experiences in their own words. This ended
up being one of the largest surveys ever conducted in the country.
There is a pattern here. In the base layer more and more (new and old) data is being opened every day.
In the analytics layer experimental ideas are leaving the lab for real world application;
virtualization and visualization are helping foster new communities geared towards collaboration
and collective problem solving. Similarly in the feedback layer the tools are also democratizing.
Ushahidi has created a very easy to use version of their implementation called CrowdMap.
Anyone who knows how to set up an email account will be able to set up their own incident
mapping of whatever trend, alert or issue they are interested in getting feedback from the crowd
on. The service is already being used to track everything from citizen report cards assessing
corruption in India to the Syrian uprising to national emergencies on the tiny island of Samoa.
The feedback layer is also innovating in reverse – backwards from new to old media and formats
– something essential in international development where so much remains unmapped or in non-
digital formats. Take mapping. In the developing world there are still vast areas beyond GPS
coverage. But people in those communities have intimate knowledge of their surroundings. If
only there was a way to tap into that knowledge and build a bridge between that data source and
an open-source digital resource like Open Street Map. Walking Papers does precisely that.
Contributors can simply draw by hand on plain paper, say a map of their neighbourhood, and
upload it to Walking Papers, where a community specializes in taking that information and using it
to deepen Open Street Map.
Looking Ahead
International development as a field of practice and research has tended to be a laggard in using
big data, powerful analytics and innovative sourcing techniques, and with good reason.
However this is changing faster than anyone predicted and faster than most organizations that
‘do development’ are prepared for. The open data movement has widened access to a broad
range of basic contextual information. A similar push is needed to open private sector data in the
service of social good. Big data is beginning to have a big impact on how we think about
development challenges and this is because we have the ability to make it understandable like
never before. Powerful analytical tools and collaborative platforms are dramatically changing
what is possible for even the most intractable challenges like understanding socioeconomic risks
and responses, dealing with urban planning, and better preparing for emergencies. For the first
time we have a feedback layer which has made possible deep and near real-time awareness of
what is working or not working where and why. Together big data, democratized analytics and
the ability to tap deep contexts will change the way we think and do development in the coming
years.
Bibliography
Aid Data, Development Gateway, African Development Bank - Development Loop. Development Loop. n.d.
http://www.aiddata.org/content/index/Maps/development-loop-app (accessed Dec 21, 2011).
Courtney, Alexa, and David Kilcullen. "Big data, small wars, local insights: Designing for development with conflict-affected
communities." What Matters. McKinsey & Company, December 2, 2011.
Crowd Map. n.d. https://crowdmap.com/mhi.
Data Without Borders. n.d. http://datawithoutborders.cc/.
Davis, Steve, and Jonathan Bays. "Harnessing big data to address the world’s problems." What Matters. McKinsey & Company,
November 2, 2011.
DRC Speaks (Geo Poll). n.d. http://www.youtube.com/watch?v=1VtaWPAtHyA.
Eagle, Nathan. Reality Mining. Dataset: http://reality.media.mit.edu/, Massachusetts: Massachusetts Institute of Technology,
2009.
Engineering Social Systems. "Big Data for Social Good." Engineering Social Systems. n.d. http://ess.santafe.edu/bigdata.html
(accessed December 21, 2011).
Jay Chen, Trishank Karthik, Lakshminaryanan Subramanian. Contextual Information Portals. Online: http://ai-d.org/, Association
for the Advancement of Artificial Intelligence, n.d.
McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity. McKinsey & Company, 2011.
Mehta, Abhishek. Big Data: Powering the Next Industrial Revolution. White paper:
http://www.tableausoftware.com/learn/whitepapers/big-data-revolution, Seattle: Tableau Software, n.d.
Priebatsch, Seth. The game layer on top of the world. Video online at:
http://www.ted.com/talks/lang/en/seth_priebatsch_the_game_layer_on_top_of_the_world.html, Boston: Ted Talks,
2010.
Publish What You Fund. "Aid Budgets in Uganda." Publish what you fund. n.d.
http://www.publishwhatyoufund.org/resources/uganda/ (accessed December 21, 2011).
Shantayanan Devarajan. "Africa’s statistical tragedy." Africa Can End Poverty (blog of World Bank, Africa Chief Economist).
October 6, 2011. http://blogs.worldbank.org/africacan/africa-s-statistical-tragedy (accessed October 6, 2011).
Swift River. n.d. http://ushahidi.com/products/swiftriver-platform.
UN Global Pulse. n.d. http://www.unglobalpulse.org/.
Ushahidi. n.d. http://ushahidi.com/.
Walking Papers. n.d. http://walking-papers.org/.