Big Data: applications, ethics, algorithms
Vladislav Shershulsky
Microsoft Rus
ALMADA, Moscow 2013-07-02
The form
• This is a one-hour presentation
• Thus, I have almost no chance to teach you something valuable, or to discuss any really new and sophisticated results
• This is simply an overview of several interesting techniques related to one important, and often underestimated, topic
The content
• The topic is “Computational ethics for Big Data apps”
• It is not about why people should behave ethically
• It is about the following:
• how to force people to behave ethically while dealing with our apps
• how to ensure that our Big Data apps are “moral by design”
• and what it technically means “to be digitally ethical”
The map
Processing and Data
“Upstream” and “Downstream”
Social Interactions
Consolidation and Aggregation
De- & Re- personalization
Data Leaking and Forensics
Category Theory
Traditional Ethics for Big Data
(New) Information Ethics
Categories on (Statistics Structure)
Universal Moral Grammar
Deontic Logic and Obligations
Algebra of conscience
Differential privacy
Predictive Analytics
I. Motivation: why should we care
Broadly acknowledged definition
Big data are high volume, high velocity, and/or high variety
information assets that require new forms of processing to enable enhanced
decision making, insight discovery and process optimization
Gartner, 2012
Big Data is essentially complex
We are usually not aware of the generating laws forming big
data sets, nor of the nature of the relations between them.
But their huge volume allows us to infer laws (regressions…) and
thus gives Big Data (within the limits of inductive reasoning)
some predictive capabilities.
P. Delort, 2011
A bit like natural objects / subjects of the material world
Big opportunity
For governments:
• Budget savings
• Transparency and responsibility
• Real insight into society
• Optimal decisions
Big opportunity
For people:
• Self organization
• Better experiences
• Intelligent environment
• Introspection
Big opportunity
For business:
• Converting products to services
• Expanded value chains
• New business models
• Educated targeting
From Product to Service
V = V0 + A∙N + B∙N²
Value for customer = immanent value + volume value + network value
(Diagram labels: On Premise / Off Premise; Big Data & BI; Clients, Employees, Partners; Mobility & Connectivity; Socialization of Business)
http://www.businesslogicsystems.com/Data%20Management
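The value formula V = V0 + A∙N + B∙N² can be illustrated with a tiny sketch; the coefficients below are hypothetical illustration values, not taken from the slide:

```python
def customer_value(n_users, v0=1.0, a=0.05, b=0.001):
    """Toy evaluation of V = V0 + A*N + B*N^2: immanent value plus a
    linear volume term plus a quadratic network (Metcalfe-style) term.
    The coefficients are hypothetical illustration values."""
    return v0 + a * n_users + b * n_users ** 2

# The quadratic network term dominates as the user base grows:
small = customer_value(10)     # ~1.6: mostly immanent value
large = customer_value(1000)   # ~1051.0: mostly network value
```

The point of the sketch: for a small user base the immanent value dominates; past a certain N, the network term B∙N² takes over, which is the economic argument for converting products into (networked) services.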
Big opportunity
For IT industry
• Next chance to change the world
• Step towards internet of everything
• Completely new markets
Big challenge
For people
• New lack of privacy
• Automated justice
• Need to understand
AKMs in your backyard
Big challenge
For business
• Hard to comply
• Easy to violate
• Unexpected backfire
• Need to defend sources
Target Predicts Pregnancy with Big Data
http://smallbusiness.yahoo.com/advisor/target-predicts-pregnancy-big-data-104057627.html
Why Netflix's Facebook app would be illegal
By Julianne Pepitone, @CNNMoneyTech, March 27
VPPA arose from strange circumstances surrounding the
failed Supreme Court nomination of Robert Bork. While
Bork's nomination hearings were taking place in 1987, a
freelance writer for the Washington City Paper talked a
video store clerk into giving him Bork's rental history.
Google facing legal threat from six
European countries over privacy
http://www.guardian.co.uk/technology/2013/apr/02/google-privacy-policy-legal-threat-europe
Big challenge
For government
• It is hard to be transparent
• It is easy to overuse
• Hard to defend sources
George Orwell, 1984
http://budget4me.ru/ob/faces/home
http://online.wsj.com/article/SB10001424052970203391104577124540544822220.html?mod=googlenews_wsj
http://www.wikileaks.org/
http://www.washingtonpost.com/investigations/us-intelligence-mining-data-from-nine-us-internet-companies-in-broad-secret-program/2013/06/06/3a0c0da8-cebf-11e2-8845-d970ccb04497_story.html
Big challenge
For IT industry
• Needs new hw and sw architecture to address scale
• Needs to know how to protect
• Needs to address extremely complicated usage scenarios
• Risk of over-restrictive regulation
Pro Contra
People: collective knowledge
Business: from disordered offerings to quality of life service
Government: know and address real needs of citizens
IT industry: change the world (again?)
People: final lack of privacy
Business: disruptive scenarios
Government: chance to miss everything
in a minute
IT industry: new approaches to hw and sw
architecture, addressing new challenges
II. Computational ethics and Big Data
Why ethics?
• Benefiting from opportunities and mitigating risks assumes careful handling of
digital assets of high business and personal value, both in known scenarios and in
completely new situations
• To proceed successfully one should follow some sort of fundamental principles –
clear and consistent
Ethics, also known as moral philosophy, is a branch of philosophy that
involves systematizing, defending and recommending concepts of right and wrong
conduct.
http://www.iep.utm.edu/ethics/
Big Data and traditional ethics
• Let’s take concepts from traditional ethics and examine how they should
apply to the digital world, and how they evolve under the influence
of Big Data capabilities
• Four Elements of Big-Data Ethics: Identity, Privacy, Ownership, Reputation
• Big Data is ethically neutral
• Personal data – not some specific data, but any data generated in the
course of a person’s activities
• Privacy interests, not always ultimate rights
• A responsible organization is an organization that is concerned both with
handling data in a way that aligns with its values and with being perceived
by others to handle data in such a manner.
Kord Davis. Ethics of Big Data:
Balancing Risk and Innovation.
O'Reilly Media, 2012
Big-Data ethics: Identity
• Identity (in philosophy), also called sameness, is whatever makes
an entity definable and recognizable (Wikipedia)
• Christopher Poole vs. Mark Zuckerberg: “prismatic” multi-identity
vs. mono-identity
• Some governments are concerned about identity on the Internet:
Italy and Belarus, say, require ID to obtain access
• Does Big Data allow us to re-construct identity?
Partially following Kord Davis, Ethics of Big Data.
Big-Data ethics: Privacy
• Privacy is the ability of an individual or group to seclude themselves or information about themselves and thereby reveal themselves selectively (Wikipedia).
• In 1993, the New Yorker published a cartoon whose caption read: “On the Internet, nobody knows you’re a dog.” At the time, this was funny because it was true. Today, in the age of big data, it is not only possible to know that you’re a dog, but also what breed you are, your favorite snacks, your lineage, and whether you’ve ever won any awards at a dog show.
• There are two issues. First, does privacy mean the same thing online and offline in the real world? Second, should individuals have a legitimate ability to control data about themselves, and to what degree?
Following Kord Davis, Ethics of Big Data.
Big-Data ethics: Ownership
• The degree of ownership we hold over specific information about us varies as widely as the distinction between privacy rights and privacy interests.
• Do we, in the offline world, “own” the facts about our height and weight? Does our existence itself constitute a creative act, over which we have copyright or other rights associated with creation?
• How do those offline rights and privileges, sanctified by everything from the Constitution to local, state, and Federal statutes, apply to the online presence of that same information?
• At the end of the day, we more and more pay for “free” online services by sharing our data with their providers
Following Kord Davis, Ethics of Big Data.
Big-Data ethics: Reputation
• As recently as 20 years ago, reputation consisted primarily of what people, specifically those who knew and frequently interacted with you, knew and thought about you. In some cases, a second-degree perception – that is, what the people who knew you said about you to the people who they knew – might influence one’s reputation.
• One of the biggest changes born from big data is that now the number of people who can form an opinion about what kind of person you are is exponentially larger than it was a few years ago.
• And further, your ability to manage or maintain your online reputation is growing farther and farther out of individual control.
• There are now entire companies whose business model is centered on “reputation management”
Following Kord Davis, Ethics of Big Data.
Benefits of ethics inquiry
• Faster consumer adoption by reducing fear of the unknown (how
are you using my data?)
• Reduction of friction from legislation, from a more thorough
understanding of constraints and requirements
• Increased pace of innovation and collaboration derived from a
sense of purpose generated by explicitly shared values
• Reduction in risk of unintended consequences from an overt
consideration of long-term, far-reaching implications of the use of
big-data technologies
Partially following Kord Davis, Ethics of Big Data.
But now we want to go even further
We need more formal theory(ies) and more practical instruments to incorporate ethical behavior into our products and services.
There are several important reasons to do so:
• We intend to ensure and enforce ethical behavior of our apps
• We hope that formal theory and algorithms will help us to find reasonable (at least not very self-contradictory) solutions in cases where our everyday life experience does not work
• We need to comply with complicated regulation and, at the same time, to check its consistency
• We have to be ready for disruptive changes in the off-line ethics adopted by society, under the influence of IT and of Big Data in particular
What to expect from computational ethics
• We need a set of formal rules to classify steps (state changes) as ethically “right” or “wrong” (actually there are a few more options), and to find the most acceptable steps in cases where our intuition provides no solution
• We need to apply these rules in a consistent way to people (subjects), to information objects (which in Big Data tasks have a certain level of autonomy due to complexity), and to collectives of subjects and objects (collectives as moral agents as well)
Computational ethics vs. Roboethics
Machine Ethics (or machine morality) is the part
of the ethics of artificial intelligence concerned
with the moral behavior of Artificial Moral Agents
(AMAs) (e.g. robots and other artificially
intelligent beings). Machine ethics is sometimes
referred to as computational ethics and
computational morality.
In contrast, roboethics is concerned with the moral
behavior of humans as they design, construct, use
and treat such beings.
Here we do not talk about any sort of robotics or
any similar professional code of conduct (i.e. it is
not about hacker or software developer ethics).
This area of research has recently drawn a lot of attention, due in part to Wendell Wallach of Yale
University urging the U.S. and other governments to address the proliferation of drones and similar AKMs.
A few useful approaches (we shall just mention them here and discuss them in more detail later)
Most fundamental concepts
• Information logic by Luciano Floridi. For many reasons this is a good background for the rest of the approaches.
• Information accountability. An alternative to the “all prohibited by default” approach in access management.
• Deontic logic. The field of logic concerned with obligation, permission, and related concepts, and, by extension, a formal system that attempts to capture the essential logical features of these concepts. A good (but not the most general) approach to expressing access/usage rules. More general modal logics are of interest as well.
A few useful approaches (we shall just mention them here and discuss them in more detail later)
Practical concepts and calculi
• Moral grammar. Still not all-embracing, but a very expressive way to describe and solve situations of moral choice.
• (Auto)reflexive multi-agent calculus of conflict and cooperation. The first model to describe the role of reflection in ethics, moral choice in conflicts, and the difference between intention and readiness. It introduced a simple classification of moral agents. Popular in Russia due to its expressiveness and predictive power, but rarely known abroad.
• Differential privacy. Means to maximize the accuracy of queries to statistical databases while minimizing the chances of identifying their records.
• Models, calculi, and languages to describe and derive ownership, access restrictions, and rules, including obligation-specific ones: Privacy-aware Role Based Access Control (P-RBAC), Web Ontology Language (OWL, OWL2), and eXtensible Access Control Markup Language (XACML).
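To make the differential-privacy idea concrete, here is a minimal sketch of the Laplace mechanism for a counting query (function names and parameters are illustrative, not from any particular library):

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via the inverse-CDF method."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon):
    # A counting query has sensitivity 1: adding or removing one person
    # changes the count by at most 1, so Laplace(1/epsilon) noise yields
    # epsilon-differential privacy for this query.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 38, 44, 31]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
# noisy is close to the true count (3), but the presence or absence of
# any single record is statistically masked.
```

Smaller epsilon means more noise and stronger privacy; the analyst trades query accuracy against the chance of identifying an individual record, which is exactly the balance the definition formalizes.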
A few “ethically neutral” but useful instruments
Category Theory
Category theory is a toolset for describing the general abstract structures in mathematics.
As opposed to set theory, category theory focuses not on elements x, y, ⋯ – called
objects – but on the relations between these objects: the (homo)morphisms between
them, f : x → y
Computation on encrypted data
The ability to perform some operations (such as searches or comparisons) on data encrypted
by another entity. An area of active research with some prominent results. Often referred to as
homomorphic encryption.
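To make “homomorphic” concrete: unpadded (textbook) RSA is multiplicatively homomorphic, which a few lines of Python can demonstrate. The parameters are tiny and insecure, for illustration only; fully homomorphic schemes supporting arbitrary computation are far more involved:

```python
# Textbook (unpadded) RSA is multiplicatively homomorphic:
# Enc(a) * Enc(b) mod n decrypts to a * b mod n.
# Toy parameters -- insecure, for illustration only.
p, q = 61, 53
n = p * q                    # 3233
e, d = 17, 2753              # public / private exponents

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 7, 12
c_product = (enc(a) * enc(b)) % n   # computed on encrypted data only
# dec(c_product) == 84 == a * b: the server multiplied values it never saw
```

The service provider can thus multiply values it cannot read, which is the core promise of computation on encrypted data.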
A few references
Information ethics.
• Luciano Floridi. Information Ethics: Its Nature and Scope. SIGCAS Computers and Society, Vol. 36, No. 3, September 2006, pp. 21-36.
Information accountability.
• Daniel Weitzner, Harold Abelson, Tim Berners-Lee, Joan Feigenbaum, James Hendler, Gerald Sussman. Information Accountability. Communications of the ACM, June 2008, pp. 82-87.
Deontic logic.
• Paul McNamara. Deontic Logic. Stanford Encyclopedia of Philosophy, 2006, 2010.
Moral grammar.
• John Mikhail. Moral Grammar and Intuitive Jurisprudence: A Formal Model of Unconscious Moral and Legal Knowledge. In The Psychology of Learning and Motivation: Moral Cognition and Decision Making. D. Medin, L. Skitka, C. W. Bauman, D. Bartels, eds., Vol. 50, Academic Press, 2009, pp. 27-100.
(Auto)reflexive multi-agent calculus of conflict and cooperation.
• Vladimir Lefebvre. Algebra of Conscience. Kluwer Academic Publishers, Dordrecht, Boston, 2001, 358 p.
• Vladimir Lefebvre, Thomas Reader. Reflexive IW Model II. Report No. AD-A399417; ARL-SR-114, Feb. 2002. Army Research Lab., Survivability/Lethality Analysis Directorate, White Sands Missile Range, NM, USA, 50 p.
Differential privacy.
• Cynthia Dwork. Differential Privacy: A Survey of Results. Microsoft Research, April 2008.
P-RBAC, OWL, XACML.
• Qun Ni, Dan Lin, Elisa Bertino, and Jorge Lobo. Conditional Privacy-Aware Role Based Access Control. ESORICS 2007, LNCS 4734, J. Biskup and J. Lopez (eds.), Springer-Verlag, Berlin, Heidelberg, 2007, pp. 72-89.
• For OWL see http://en.wikipedia.org/wiki/Web_Ontology_Language
• For XACML see XACML 3.0 – Committee Specification 01. OASIS (oasis-open.org). Retrieved 10 August 2010.
Category theory.
• http://ncatlab.org/nlab/show/category+theory#idea_10
• Saunders Mac Lane. Categories for the Working Mathematician. 2nd edition, Springer Verlag, 1998, 314 p.
Computation on encrypted data.
• Abdullatif Shikfa. Computation on Encrypted Data: Private Matching, Searchable Encryption and More. Bell Labs, Alcatel-Lucent, 2013.
Why Information Ethics
I will cover other instruments later while describing specific Big Data scenarios or operations, but
Information Ethics due to its universality needs discussion in advance.
Information Ethics is the theoretical foundation of applied Computer Ethics.
IE is an expansion of environmental ethics towards
1) a less anthropocentric concept of agent, which now includes also non-human (artificial)
and non-individual (distributed) entities; and
2) a less biologically biased concept of patient as a ‘centre of ethical worth’, which now
includes not only human life or simply life, but any form of existence.
3) a conception of environment that includes both natural and artificial (synthetic, man-
made) eco-systems.
IE is therefore: non-standard, patient-oriented, not agent-oriented, environmental, non-
anthropocentric but ontocentric, and based on the concepts of informational object/
infosphere/entropy rather than life/ecosystem/pain.
Luciano Floridi, Lecture @ Cambridge, 2006
Infosphere, patient oriented
Ecology, patient oriented
Bioethics, patient oriented
All-human, agent oriented
Brief history of ethics
Athenian citizens
Moral Patient
• Question: what is the lowest possible common set of attributes which
characterizes something as intrinsically valuable and an object of
respect, and without which something would rightly be considered
intrinsically worthless or even positively unworthy and therefore
rightly disrespectable in itself?
• Answer: the minimal condition of possibility of an object’s least
intrinsic worthiness is its abstract nature as an informational entity.
• Conclusion: all entities, interpreted as clusters of information, have a
minimal moral worth qua informational objects, and deserve to be
respected.
Luciano Floridi, Lecture @ Cambridge, 2006
Four Principles of Information Ethics
• IE determines what is morally right or wrong, what ought to be done,
what the duties, the “oughts” and the “ought nots” of a moral agent
are, by means of four basic principles:
0. entropy ought not to be caused in the infosphere (null law)
1. entropy ought to be prevented in the infosphere
2. entropy ought to be removed from the infosphere
3. the welfare of the infosphere ought to be promoted by extending it,
improving it and enriching it.
Luciano Floridi, Lecture @ Cambridge, 2006
Conclusions for Big Data
• When we recognise that not only humans (actors), but also (at least some)
information objects (patients) have rights that should be respected, and
that under certain conditions entropy minimisation is immoral, we shall
have to agree that our privacy rights are only privacy interests and should
be balanced with the rights of the information objects we created.
• Big Data (subsets, elements of BD) are complicated enough to satisfy, in
many cases, the criteria of moral patient(s).
• Thus our claim to ultimate privacy becomes questionable.
• Do we really always have a “Right to Forget” (i.e. to somehow kill
information objects)?
• At some point in the future, the information objects representing our identity
in the infosphere may become autonomous enough (the shadow free-will dilemma).
Conclusions for Big Data
Information object rights to respect
• If you consider this problem abstract, just think about recent attempts in several
countries to prohibit car dashboard cameras (video registrators) as potentially violating
the privacy of passengers in other cars.
• Or take into account that countries with over-restrictive regulation of Big Data
collection and usage are at risk of economic and social slowdown.
• Can we explain all this to legislators? Many of them are still at the stage of adopting
ecological thinking.
Is this practical? Yes. It helps to architect a big data management system in a consistent way:
• Your predictive conclusions have little value if you kill information objects at will
• Access mechanisms that respect information object rights help to avoid massive leaking
and full service blocking (“all is prohibited by default”)
III. Sources of Big Data
Humans as data sources
Per person per day (in “golden billion”)
• 50-200 e-mails
• 10-50 voice calls
• 1-100 SMS and twits
• 0.1 blog posts
• 1-20 financial transactions
• 3-30 search requests
• 10-30 articles, read on the
Internet
• 10 audio records
• 30-90 minutes of
TV/Video
• 20-200 appearances in
video monitoring cameras
• 1-100 geospatial “notches”
• 20-200 RFID checks
• 0.05-10 healthcare records
And at least 4.5 billion people have at least phones
(mostly wireless)
From The Human Face of Big Data by Rick Smolan. EMC inspired.
Things as data sources
The Internet of Things
World today
…and tomorrow
Subjects and objects (both actors and patients)
Humans, unique by default
• Proactively using ICT infrastructure (social networks, online services, etc.)
• Passively “catching sight” of monitoring and recording systems
Real-world identifiable objects
• Active networked devices
• Passive objects, identifiable by context
In-between
• Pets with RFIDs
• Animals in wild nature
• AI agents (if, and when, originating)
Abstract objects
• Pieces of information (code and data, i.e. objects) in ICT systems with their natural behavior and lifecycle
IV. Procedures and Scenarios
Processing
• Operations conducted at a centralized location, but in a highly
distributed and parallelized manner, to prepare Big Data for
consumption
• The most popular paradigm of Big Data processing is Map/Reduce
• Many experts consider Map/Reduce, or the math behind this
paradigm, to be the heart of Big Data
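A minimal sketch of the Map/Reduce pattern — word count, the canonical example — with plain Python standing in for a real framework such as Hadoop:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit <word, 1> pairs for every word in one chunk of input.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: summation is associative and commutative, so reducers can
    # run independently on any partition of the grouped values.
    return key, sum(values)

documents = ["big data big value", "big challenge"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts == {'big': 3, 'data': 1, 'value': 1, 'challenge': 1}
```

In a real cluster each `map_phase` call runs on a different node against a different chunk of the data; the shuffle is a network operation; the reducers never see the raw input.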
Pieces of data (information objects)
• Can relate/belong to one subject/object or can describe some
sort of relation between subjects and/or objects
Examples:
relates to individual entity: birth day, location,
relation: friend, call record, etc.
• Generally has structure <Key, Value>. Can contain one simple
value, well structured data, or complicated unstructured or semi-
structured (nested) set (tuple) of data (and associated code)
Examples: name, location, friend list, blog, etc.
“Upstream”
• Users supply some personally identifiable info to some centralized
location to further consume some sort of social, business or
governmental online service
Examples: Facebook, LinkedIn, gosuslugi.ru, etc.
• Devices supply some spatially/personally associated info to some
sort of centralized location for further analyses or transactional
services
Examples: mobile billing, smart power grids, weather forecast stations,
wind tunnel data stream acquisition system
“Downstream”
• User extracts some personally identifiable information related to
him (her) or to other persons to consume some sort of social,
business or governmental online service
Examples: Facebook, LinkedIn, gosuslugi.ru
• User or device (the same as above or different) extracts some
information supplied by networked devices and/or users to
consume or provide service
Examples: Internet shallow search (in most cases), mobile billing, smart
power grids, traffic advising systems
Social interaction
• In many cases, while consuming a social or transactional online service, users
or devices directly or indirectly interact with other users and/or devices (i.e.
with their virtual agents inside the Big Data set), and thus form complicated
meta-structures with a certain level of stability and specific meta-behavior
• It would be wrong to assume that only humans or higher animals
demonstrate social behavior. Devices and math structures (virtual agents)
with reflection (and auto-reflection) demonstrate it as well (at least we
have agreed to consider this social behavior)
Example: collaborative antipersonnel mines
• Far from all Big Data collections and services are social in nature
Predictive / inductive analytics
• A set of mathematical techniques aimed at obtaining non-obvious
knowledge from Big Data sets
• Includes:
• Mathematical statistics
• Game theory
• Multiagent models (physical, economic, econophysical, social…)
• Graph theory
• Machine learning
• Deep neural networks
• Linear and nonlinear programming
• …and more
Consolidation
• A service (service provider) collects all, or a substantial part, of the
relevant existing data representing / covering a specific sort of
information / subject matter
Examples: all suppliers of airplane spares, all citizens with medical or
armed forces experience, etc.
Aggregation
• A user, or a service on behalf of a special sort of (probably privileged) user,
performs an operation aimed at obtaining some common (intensive or
extensive) characteristics of a data set, or of a large subset of it, as a whole.
Examples: total capacity of airplane spares suppliers, national rare-earth
metal proven reserves, number of citizens with medical or armed
forces experience by region, etc.
• Does this operation produce really new knowledge?
De-personalization / de-identification
• Removal or hiding of personal or sensitive identifying
information from data sets or from query responses
• Should not block (not contradict) at least the most common (or all?)
aggregation operations
• Often required while producing Open Data
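A minimal de-identification sketch (the field names and generalization rules are hypothetical): drop direct identifiers, generalize quasi-identifiers, and keep the record usable for aggregation:

```python
# Illustration only -- real de-identification needs a formal model
# (k-anonymity, differential privacy, etc.), not ad hoc suppression.
RECORDS = [
    {"name": "Ivanov", "birth_year": 1975, "zip": "119333", "diagnosis": "flu"},
    {"name": "Petrov", "birth_year": 1978, "zip": "119421", "diagnosis": "asthma"},
]

def de_identify(record):
    return {
        # direct identifier ("name") is removed entirely;
        # quasi-identifiers are generalized so counts by age band
        # and region still work:
        "age_band": (record["birth_year"] // 10) * 10,   # 1975 -> 1970
        "zip_prefix": record["zip"][:3],                 # 119333 -> "119"
        "diagnosis": record["diagnosis"],
    }

anonymized = [de_identify(r) for r in RECORDS]
# anonymized[0] == {'age_band': 1970, 'zip_prefix': '119', 'diagnosis': 'flu'}
```

The re-identification slide that follows shows why this alone is not enough: generalized quasi-identifiers can still be joined against external open data sets.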
Re-personalization / re-identification
Use of sophisticated algorithms and additional (extended,
external, open… ) sets of data to re-construct personally
identifiable information
Recent example: usage of several openly available data sets
(purchasing, property, etc.) to reconstruct personal information
from anonymized medical records sets
Is it always illegal ?
Data theft / leaking
• There are many scenarios of data theft to take into consideration while
protecting an online service or an offline business IT system.
• But one is Big Data specific. Big Data processing assumes consolidation of
disparate data sources into a unified (even if physically distributed) warehouse,
and access to the whole set of data for execution of some essentially non-local
operation (such as building a reverse index).
• At least the early noSQL systems were designed in a way which potentially
could provide access to all data from one specific (internal and privileged)
account, thus creating the threat of stealing a large amount of data at a time.
Digital forensics
• Forensics – the use of science and technology (digital in our
case) to investigate and establish facts in criminal or civil
courts.
• In many cases utilizes capabilities not provided to ordinary
users.
• How should it be regulated ?
V. Math and (some) Legal
Math behind processing
• Map / Reduce effectiveness relies largely on the commutativity / associativity of its operations at each
step. This allows us to abstract from the details of operations happening at one computer node with one
chunk of data. Instead, most operations can be described in terms of “natural transformations” or
“mappings” (surprise), thus making category theory a natural language to talk about them.
Just for illustration purposes
• We can, potentially, benefit from using languages such as Haskell (ok, F# also, and, maybe, Python…)
• And we can look for practical problems already described in terms of category theory
Categories at a glance
http://www.cs.indiana.edu/cmcs/categories.pdf
Rosetta stone (new one)
• Category theory is nearly universal. It helps to streamline and unify concepts in various
fields – from string theory to process management. This was proclaimed in a well-known article.
Is something missed here?
Seems, yes. It is mathematical linguistics and its applications to search and information retrieval.
But what about ethics? We’ll see soon.
Digital behaviorism vs. the charm of simplicity, or bag of words vs. universal grammar
Peter Norvig
Director of Research
for Google
Noam Chomsky
Professor (Emeritus) at MIT,
"father of modern linguistics"
A slightly oversimplified account of the discussion after the MIT anniversary session:
Chomsky wishes for an elegant theory of intelligence and language that looks past human fallibility
to try to see simple structure underneath. Norvig, meanwhile, represents the new philosophy: truth by
statistics, and simplicity be damned.
Kevin Gold for tor.com
What is important here: Noam Chomsky really developed an elegant and practically useful universal grammar
theory, which declares that all languages are essentially similar and can be generated from a sort of
symmetry structure with a relatively small number of parameters. This leads to the conclusion that the ability to
learn grammar is “hard-wired into the brain”. Joachim Lambek proposed a type-enriched approach to
grammar structuring, which eventually evolved into categorial grammar models.
Evolution of linguistics (and cognitive science as well…)
Empiric linguistics
Early grammars / Early vocabularies
Formal grammars / Thesauruses and ontologies
Universal grammar / Human as a bag of senses
The minimalist program / Statistical bag-of-words approach
Categorical approaches / Deep search models
“Quantum” grammar
“Quantum” Linguistics
Merging the worlds
• The meaning of a word lies in its context. A statistical model of semantics can,
in large part, be reduced to clustering in an appropriate vector space of n-gram
probabilities.
• Some specific part of a sentence’s meaning depends on its formal structure;
the text is considered as an algebraic structure.
• And now let’s merge these mutually complementary approaches (through
composition) and use this new “space” to represent sentences with both
semantics and formal structure.
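The “meaning lies in context” side can be sketched as bag-of-words vectors compared by cosine similarity — a toy stand-in for clustering in an n-gram probability space:

```python
import math
from collections import Counter

def bow_vector(text):
    # Bag of words: word order and formal structure are deliberately lost.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v))

a = bow_vector("big data needs new ethics")
b = bow_vector("new ethics for big data")
c = bow_vector("category theory for mathematicians")
# cosine(a, b) == 0.8, cosine(a, c) == 0.0: shared context, not shared
# grammar, is what this model measures.
```

The categorical/grammar side of the merge would add precisely what this sketch discards: the algebraic structure of the sentence.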
When and why do we need such models?
Shallow Search
Deep Search
Translation
Regulation (natural language) → “translation” → ERP-readable (XBRL-like)
The “translation” requires:
• Formal structure of sentences
• Ontologies for terms with special meaning
• “Bags of words” for the rest of the vocabulary and to validate terms usage
What about pieces of data
• Generally <Key, Value>. But not so simple. Key and Value can be nested tuples. Primitive values can be large blobs of text or other media. Pieces of data can actually contain code or assume execution of some code.
• Traditional large-scale distributed file systems (like HDFS) targeting Big Data applications initially ignored the nature of individual information objects, both in terms of placement and access rights (i.e. security). Their core metaphor was enormous flat files. This is a really optimal approach for social networks, news search, and transactions.
• It seems that recent implementations (such as those for Storm, Caffeine, and Singularity) are more granular and, in some cases, respect individual object access requirements.
• In an ideal world all information objects should “live” in an infinite flat storage, interacting with each other and with external actors according to their contracts. A challenging goal in a multi-petabyte world.
What are the problems with “Upstream”
http://themainstreetanalyst.com/2013/01/18/most-popular-social-networking-sites-by-country-infographic/
It is huge.
Not only are there a lot of
business and state systems that
supply data about almost
everything to some warehouses,
but there are a lot of people
voluntarily supplying their data to
social networks. More than 1B
Facebook users.
Little chance to change the pattern
without disruption.
What are the problems with “Upstream”
Sharing personal data…
…by people, voluntarily: feature, not bug
People supply an enormous amount of data that allows them, their relatives,
businesses, state institutions, etc. to be identified and researched. Some of them
ignore or are not aware of the risks; most value the benefits of global socialization
higher than the damage caused by the loss of some privacy. I.e. they “pay” with their
personal data to receive social services.
This is the reality in which legislators, lawyers, security experts and IT pros have to live.
Rather than enforcing restrictive supply regulation, legislators could enforce
responsible usage. Not a simple approach either, of course.
What are the problems with “Upstream”
Sharing personal data…
…by devices: need moral justification
State monitoring systems are great instruments to improve national security, and
there are few if any objections against their implementation. Should people
always be explicitly notified when interacting with them?
Officials often miss the point when defending such systems. Just an example: “it was
illegal to track a person without permission from the authorities, but that there was
no law against tracking the property of a company, such as a SIM card.”
It seems this is about “crime scene investigation”. It does not require explicit
permission from potentially affected citizens. How far should we expand its scope:
the local crime scene? The city? The state? How long after the case? Before the case?
(Personal) data as a new currency (and tax)
People pay (invest) their data to obtain benefits
To (commercial) social services
• If they consider them valuable
• If they trust the providers, considering them
reputable and responsible
To state security, tax and other systems
• If they consider them effective
• If they trust the administration, considering it
legitimate and responsible
Lack of trust in this area can cause economic slowdown and social frustration
Supply and Identity
It has been mentioned frequently that the approach to identity in the digital world can evolve. People can either continue with their traditional approach to identity (which in the real world tends to be not only unique, but also single and “integral” for each subject), or switch to multiple partial, complementary or even contradicting identities.
Evidently, both social service providers and state institutions prefer the first approach. But the process is far from accomplished and can turn toward the second option in some geos or social groups if over-pressed by restrictive regulation in a situation of lack of mutual trust.
Also, we shall soon face more and more complicated strategies to misinform supply mechanisms, developed by criminals etc. Multiple SIM cards are only the beginning. What math should be used to re-construct identity in the case of untrusted records in a Big Data set?
What are the issues with “Downstream”
Generally speaking, there are two main problems (OK, there are more, but they are out of our scope):
• Who is allowed to retrieve results from Big Data?
• What is the quality of the results (are they complete and accurate)?
Who is allowed to retrieve results
Responsible usage
If you are an old-school IT security expert, your answer, most probably, would be: “nobody, never, nothing”. At least by default.
In a few cases (a highly bureaucratized, hierarchical, small-to-medium organization) it works. But hardly in the case of 1B+ users (and they are different: individuals of different ages, provider employees, businesses, state officials, devices, etc.).
The first thing we’d do is implement elements of a responsible approach to information, as Tim Berners-Lee and his colleagues recommend: just explicitly inform the potential user about usage regulation and potential liability.
It is easier said than done, as such an approach requires a mechanism to associate norms and regulations with pieces of information. See the slide “When we need such models” above, and, potentially, ontologies of usage patterns.
Who is allowed to retrieve results
Obligations and deontic logic
Nevertheless, whether society adopts a strict or a lightweight approach to access management, there will be areas of strict control due to national security, children’s rights protection, etc. We should be able to describe obligations associated with permissions. Some should precede the action; some are to be accomplished in its course or within a pre-defined timeframe after the action. Deontic logic, while not free of paradoxes, is a good background for this.
Deontic logic is that branch of symbolic logic that has been the most concerned with the contribution that the following notions make to what follows from what:
http://plato.stanford.edu/entries/logic-deontic/
• permissible (permitted)
• impermissible (forbidden)
• obligatory (duty, required)
• omissible (non-obligatory)
• optional
• ought
• must
• supererogatory (beyond the call of duty)
• indifferent / significant
• the least one can do
• better than / best / good / bad
• claim / liberty / power / immunity
Since the early 1980s, deontic logic has been used to formalize legal reasoning and legal corpora, access rights, database protection, as well as moral choice in IT (see J.-J.Ch. Meyer and R.J. Wieringa. Applications of deontic logic in computer science: A concise overview. In J.-J.Ch. Meyer and R.J. Wieringa, editors, Proceedings of the 1st International Workshop on Deontic Logic in Computer Science (DEON 1991), Amsterdam, The Netherlands, December 11-13, 1991, pages 15-43).
Deontic logic at a glance
John Mikhail, Universal Moral Grammar: Theory, Evidence, and the Future. Trends in Cognitive Sciences,
April 2007; Georgetown Public Law Research Paper No. 954398.
We shall discuss Mikhail’s works in a bit more detail soon.
Who is allowed to retrieve results
Obligations and ethical calculus
Consider the following sentences:
1. A informs B
2. A tells B that p
3. A lets B know that p
4. A shares his knowledge that p with B
5. A informs B about p
6. A sends a message to the effect that p to B
7. A's communications to B indicate that p
The general form of (1)-(7) can be rendered as:
• Agent A in informational context C sees to it that Agent B believes that p, or
• A informs B that X
Moral or legal constraints in information contexts may be expressed in general terms as follows:
• It is (not) obligatory or (not) permitted for A to see to it that B knows that p
Intellectual Property
(A) If John has an IP right in a particular piece of
information X, then Peter ought to have permission
from John to acquire, process or disseminate X.
Privacy and Data Protection
(B) If information X is about John and if Peter does not
have X then Peter is not permitted to acquire X without
John's consent. If he does have X, then he is not
permitted to process or disseminate it without John's
consent.
Equal Access
(C) If A is informed about X, then all ought to be informed
about X.
Responsibility and Information
(D) If John has an information responsibility regarding X,
then John has an obligation to see to it that specific
others have access to information X.
Jeroen van den Hoven and Gert-Jan Lokhorst. Deontic Logic and Computer-Supported Computer Ethics. 2002
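As a toy, non-authoritative illustration, constraints (A) and (B) can be encoded as an ordinary permission predicate. The agent names and the function signature are our assumptions; a serious system would use a deontic-logic engine rather than plain Boolean checks.

```python
def may_acquire(agent, info_owner=None, info_subject=None,
                has_ip_permission=False, has_consent=False):
    """May `agent` acquire a piece of information X?"""
    # (A) Intellectual property: acquiring X needs the right-holder's permission
    if info_owner is not None and info_owner != agent and not has_ip_permission:
        return False
    # (B) Privacy: information about a person needs that person's consent
    if info_subject is not None and info_subject != agent and not has_consent:
        return False
    return True

print(may_acquire("Peter", info_subject="John"))                    # False
print(may_acquire("Peter", info_subject="John", has_consent=True))  # True
print(may_acquire("John", info_subject="John"))                     # True: own data
```

Constraints (C) and (D) are obligations on the information holder rather than permissions on the acquirer, so they would attach to the holder's side of the exchange (cf. the obligation mechanism sketched earlier).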
What is the quality of the results
False positives in Big Data: Bonferroni’s principle
While preparing a report on Big Data one may face several problems:
• Not all appropriate <key, value> pairs are identified
• Some wrong pairs are selected
• And, most interestingly, some pairs can satisfy the search criteria not due to their nature, but simply due to the statistically large size of the data sample
Calculate the expected number of occurrences of the events you are looking for, on
the assumption that data is random. If this number is significantly larger than the number
of real instances you hope to find, then you must expect almost anything you find to be
bogus, i.e., a statistical artifact rather than evidence of what you are looking for. This
observation is the informal statement of Bonferroni’s principle.
In a situation like searching for terrorists, where we expect that there are few terrorists
operating at any one time, Bonferroni’s principle says that we may only detect terrorists by
looking for events that are so rare that they are unlikely to occur in random data.
Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman. Mining of Massive Datasets, CUP, 2010, 2012
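The principle is easy to see with a back-of-the-envelope computation in the spirit of the cited book's "evil-doers visit hotels" example. All numbers below are illustrative assumptions: we flag as "suspicious" any pair of people seen at the same hotel on two different days, and count how many such pairs pure chance produces.

```python
from math import comb

people = 10**9     # population under observation
days = 1000        # days of data
hotels = 10**5     # number of hotels
p_hotel = 0.01     # chance a given person visits some hotel on a given day

# Chance that two specific people are at the SAME hotel on a given day
p_meet = p_hotel * p_hotel / hotels          # = 1e-9

# Chance that the same pair meets at hotels on two DIFFERENT days
p_meet_twice = p_meet ** 2                   # = 1e-18

# Expected number of "suspicious" pairs if the data is purely random
expected = comb(people, 2) * comb(days, 2) * p_meet_twice
print(f"{expected:,.0f}")   # about 250,000 coincidental pairs
```

A quarter of a million bogus pairs, even if there is not a single real conspiracy in the data: any search criterion applied to Big Data must be rare enough to beat this baseline.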
What are the issues with social interaction
From access control to social interactions
People (and already some other information agents) establish social interactions: they fall into cooperation and conflicts, make moral choices and moral evaluations. If we want to build services processing data supplied by billions of people, and expect the services to behave in a “natural” way (either respond accordingly or make grounded predictions), we need to think about the implementation of ethical norms.
The area for applying ethical concepts in Big Data and social network applications is wide: from obligation-enhanced access control to ethical evaluation of autonomous decisions.
Why do we need ethical evaluation? It is almost impossible to predict and list all micro-scenarios of service usage: should the service provide to agent A info regarding agent B if it will impact decisions A can make regarding agent C? If A is a physician? If A is (or is not) his patient? In a case of emergency?
Do not forget that modern “autonomous artificial intellectual agents” normally invoke some predictive analytics over statistically large sets of data (Big Data) to make decisions. And these decisions should be ethical, or at least understandable to humans.
What are the issues with social interaction
Why ethical behavior is usually considered understandable
Let’s turn back to John Mikhail’s works (see Mikhail, John, Universal Moral Grammar: Theory, Evidence, and the Future. Trends in Cognitive Sciences, April 2007; Georgetown Public Law Research Paper No. 954398, and earlier).
A possible, and very attractive, answer is: “Because ethics, or at least some main ethical categories, are built into our brains”. So humans are “ethical by design”, as computers should be.
And there is nothing in common with creationism here…
Mikhail’s Universal Moral Grammar follows Chomsky’s Universal Language Grammar approach.
Five main questions of universal moral grammar
• What constitutes moral knowledge?
• How is moral knowledge acquired?
• How is moral knowledge put to use?
• How is moral knowledge physically realized in the brain?
• How did moral knowledge evolve in the species?
What are the issues with social interaction
A bit more about Universal Moral Grammar
UMG proposes a set of rules to process the transformation of the questioned situation: from initial description through structural analysis, temporal decomposition and deontic analysis to a final decision. UMG takes into account the different evaluation by humans of direct impact vs. side effects, etc.
What are the issues with analytics
Which analytics do we need?
Again, there is a sort of competition between statistical and
structural approaches. But in case of analytics it is evident, that we
need both (and, in some cases, multi-agent simulation as well).
A few methods/disciplines to mention:
• Mathematical statistics and statistical hypothesis testing
• Linear and nonlinear programming, optimization
• Neural networks (yes, finally we have workable deep neural networks!), multidimensional approximations
• Game theory
• Graph theory
• Coding theory
• Mathematical logic, various flavors
• Formal linguistics
• Statistical econophysics…
What are the issues with analytics
Why are they all different?
Let’s limit ourselves to one important problem: how to describe and analyze situations of cooperation and conflict and the different roles of actors, and how to explain that different actors behave differently in similar situations while remaining, at the same time, ethical (not violating their own moral rules).
This is a question from Mikhail again:
Moral diversity is far from only geo-cultural. It is mental and group-social as well.
What are the issues with analytics
In search of a language and calculus for compromise and conflict
There are a number of approaches aiming to describe these interactions, classify their participants, and predict their steps. Among others is the (auto)reflexive approach of Vladimir Lefebvre, developed in the 1970s and 1980s, first in the USSR and then in the USA.
Lefebvre proposed a (set of) simple ethical calculi based on Boolean logic, and found that there are two self-consistent ethical systems (see Lefebvre, V.A. Algebra of Conscience. D. Reidel, Holland, 1982).
First Ethical System:
• The end DOES NOT justify the means
• If there is a conflict between means and ends, one SHOULD be concerned
Second Ethical System:
• The end DOES justify the means
• If there is a conflict between means and ends, one SHOULD NOT be concerned

First Ethical System:
• A “saint” is non-aggressive, tends toward compromise, and has low self-evaluation
• A “hero” is non-aggressive, tends toward compromise, and has high self-evaluation
• A “philistine” is aggressive, tends toward conflict, and has low self-evaluation
• A “dissembler” is aggressive, tends toward conflict, and has high self-evaluation
Second Ethical System:
• A “saint” is aggressive, tends toward conflict, and has low self-evaluation
• A “hero” is aggressive, tends toward conflict, and has high self-evaluation
• A “philistine” is non-aggressive, tends toward compromise, and has low self-evaluation
• A “dissembler” is non-aggressive, tends toward compromise, and has high self-evaluation
Simple taxonomy for social interactions:
Lefebvre, V.A.: Algebra of Conscience. D. Reidel, Holland (1982)
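The core of the calculus can be sketched in Boolean terms. The mapping below is our illustrative reading of Lefebvre (1982), not a faithful reproduction of his full model: encoding 1 = good and 0 = evil, in the First Ethical System a compromise between good and evil is evaluated as evil (AND) and a confrontation of good with evil as good (OR), while the Second Ethical System swaps the two operations.

```python
GOOD, EVIL = 1, 0

def first_system(compromise: bool, a: int, b: int) -> int:
    # First Ethical System: compromise is AND, confrontation is OR
    return (a & b) if compromise else (a | b)

def second_system(compromise: bool, a: int, b: int) -> int:
    # Second Ethical System: the operations are swapped
    return (a | b) if compromise else (a & b)

# A compromise between good and evil:
print(first_system(True, GOOD, EVIL))    # 0: evaluated as evil
print(second_system(True, GOOD, EVIL))   # 1: evaluated as good
```

Both systems are internally consistent; they merely disagree on which Boolean operation models a compromise, which is exactly the "end justifies the means" split in the table above.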
What are the issues with analytics
Algebra of conscience
Again, notation from the theory of categories.
Big Data specific theft / leaking
Anatomy of (some of the) grand leaks
There are many scenarios of data theft to take into consideration while protecting an online service or an offline business IT system. But at least one is specific to Big Data. When people (businessmen and politicians) first became acquainted with the technology, they attempted to use it for predictive analysis of all non-quantitative sources at once. The idea was: “Let’s collect all evidence, whether reliable or not, and attempt to classify it, using the same corpus for verification of each statement”. Suppliers were instructed to input everything into such systems, from official documents to rumors, and let the computer separate the wheat from the chaff.
For the first time ever, there appeared huge warehouses of text documents, in one place, with unified access from one (internal and privileged, but single) system account. It was only a matter of time before massive leaks happened.
Any protection mechanisms? Physical security. Access to different file system chunks or information objects from different accounts, sophisticated access rights management, and, mostly in the near future, computation on encrypted data (homomorphic encryption).
The issues with de-/re-personalization
Differential privacy
Big Data is a huge, valuable asset. But how do we use it without violating privacy? Say, how do we arrange privacy-preserving sharing of Big Data between state and business? Just one example: a pharmacy chain requires statistical/demographic/etc. data to expand and tune its business, but is prohibited from accessing personal medical records.
The simplest answer is “open data”: statically privacy- (and other secret-) cleaned sets of data. A good start, but far from enough: you are mostly limited to predefined reports, and most of Big Data’s power is lost. The real question is how to ensure privacy dynamically, over the complete set of data, i.e., how to assure suppliers that participation in Big Data sets is not a privacy risk.
Good answer is –
Differential Privacy
There are more
sophisticated definitions/
requirements as well
Ilya Mironov, Omkant Pandey, Omer Reingold, and Salil Vadhan, Computational Differential Privacy, in Advances in Cryptology—CRYPTO 2009, Springer, 2009
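A minimal sketch of the classic Laplace mechanism for epsilon-differential privacy: a counting query is answered with noise calibrated to the query's sensitivity, so that any single person's presence or absence changes the output distribution by at most a factor of e^epsilon. The pharmacy data below is a made-up illustration.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from Laplace(0, scale)
    u = 0.0
    while u == 0.0:          # avoid log(0) below
        u = random.random()
    u -= 0.5                 # now u is in (-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0        # adding/removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon)

# Hypothetical query: how many patients in the sample bought drug "D"?
records = [{"drug": "D"}] * 130 + [{"drug": "E"}] * 70
noisy = private_count(records, lambda r: r["drug"] == "D", epsilon=0.5)
print(noisy)   # close to 130, but randomized on every call
```

Smaller epsilon means more noise and stronger privacy; the computational relaxations in the cited Mironov et al. paper weaken this information-theoretic guarantee to hold only against efficient adversaries.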
More to discuss
• Contract theory and ethics
• Econophysics and ethics
• Differential privacy and main laws of Information Ethics
• Applications