Big Data: applications, ethics, algorithms
Vladislav Shershulsky
Microsoft Rus
ALMADA, Moscow 2013-07-02
The form
• This is a one-hour presentation
• Thus, I have almost no chance to teach you something valuable, or to discuss any really new and sophisticated results
• This is simply an overview of several interesting techniques related to one important, and often underestimated, topic
The content
• The topic is “Computational ethics for Big Data apps”
• It is not about why people should behave ethically
• It is about the following:
• how to force people to behave ethically while dealing with our apps
• how to ensure that our Big Data apps are “moral by design”
• and what it technically means “to be digitally ethical”
The map
Processing and Data
“Upstream” and “Downstream”
Social Interactions
Consolidation and Aggregation
De- & Re- personalization
Data Leaking and Forensics
Category Theory
Traditional Ethics for Big Data
(New) Information Ethics
Categories on (Statistics Structure)
Universal Moral Grammar
Deontic Logic and Obligations
Algebra of conscience
Differential privacy
Predictive Analytics
I. Motivation: why should we care
Broadly acknowledged definition
Big data are high volume, high velocity, and/or high variety
information assets that require new forms of processing to enable enhanced
decision making, insight discovery and process optimization
Gartner, 2012
Big Data is essentially complex
We are usually not aware of the generating laws forming big
data sets, nor of the nature of the relations between them.
But their huge volume allows us to infer laws (regressions…) and
thus gives Big Data (within the limits of inductive reasoning)
some predictive capabilities.
P. Delort, 2011
A bit like natural objects / subjects of the material world
Big opportunity
For governments:
• Budget savings
• Transparency and responsibility
• Real insight into society
• Optimal decisions
Big opportunity
For people:
• Self organization
• Better experiences
• Intelligent environment
• Introspection
Big opportunity
For business:
• Converting products to services
• Expanded value chains
• New business models
• Educated targeting
From Product to Service
V = V0 + A∙N + B∙N²
Value for customer = immanent value + volume value + network value
(Diagram labels: On Premise / Off Premise; Big Data & BI; Clients, Employees, Partners; Mobility & Connectivity; Socialization of Business)
http://www.businesslogicsystems.com/Data%20Management
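The value formula V = V0 + A∙N + B∙N² can be illustrated with a tiny sketch; the coefficients below are hypothetical illustration values, not taken from the slide:

```python
def customer_value(n_users, v0=1.0, a=0.05, b=0.001):
    """Toy evaluation of V = V0 + A*N + B*N^2: immanent value plus a
    linear volume term plus a quadratic network (Metcalfe-style) term.
    The coefficients are hypothetical illustration values."""
    return v0 + a * n_users + b * n_users ** 2

# The quadratic network term dominates as the user base grows:
small = customer_value(10)     # ~1.6: mostly immanent value
large = customer_value(1000)   # ~1051.0: mostly network value
```

The point of the sketch: for a small user base the immanent value dominates; past a certain N, the network term B∙N² takes over, which is the economic argument for converting products into (networked) services.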
Big opportunity
For IT industry
• Next chance to change the world
• Step towards internet of everything
• Completely new markets
Big challenge
For people
• New lack of privacy
• Automated justice
• Need to understand
AKMs in your backyard
Big challenge
For business
• Hard to comply
• Easy to violate
• Unexpected backfire
• Need to defend sources
Target Predicts Pregnancy with Big Data
http://smallbusiness.yahoo.com/advisor/target-predicts-pregnancy-big-data-104057627.html
Why Netflix's Facebook app would be illegal
By Julianne Pepitone, @CNNMoneyTech, March 27
VPPA arose from strange circumstances surrounding the
failed Supreme Court nomination of Robert Bork. While
Bork's nomination hearings were taking place in 1987, a
freelance writer for the Washington City Paper talked a
video store clerk into giving him Bork's rental history.
Google facing legal threat from six
European countries over privacy
http://www.guardian.co.uk/technology/2013/apr/02/google-privacy-policy-legal-threat-europe
Big challenge
For government
• It is hard to be transparent
• It is easy to overuse
• Hard to defend sources
George Orwell, 1984
http://budget4me.ru/ob/faces/home
http://online.wsj.com/article/SB10001424052970203391104577124540544822220.html?mod=googlenews_wsj
http://www.wikileaks.org/
http://www.washingtonpost.com/investigations/us-intelligence-mining-data-from-nine-us-internet-companies-in-broad-secret-program/2013/06/06/3a0c0da8-cebf-11e2-8845-d970ccb04497_story.html
Big challenge
For IT industry
• Needs new hw and sw architecture to address scale
• Needs to know how to protect
• Needs to address extremely complicated usage scenarios
• Risk of over-restrictive regulation
Pro Contra
People: collective knowledge
Business: from disordered offerings to quality of life service
Government: know and address real needs of citizens
IT industry: change the world (again?)
People: final lack of privacy
Business: disruptive scenarios
Government: chance to miss everything
in a minute
IT industry: new approaches to hw and sw
architecture, addressing new challenges
II. Computational ethics and Big Data
Why ethics?
• Benefiting from opportunities and mitigating risks assumes careful handling of
digital assets of high business and personal value, both in known scenarios and in
completely new situations
• To proceed successfully one should follow some sort of fundamental principles –
clear and consistent
Ethics, also known as moral philosophy, is a branch of philosophy that
involves systematizing, defending and recommending concepts of right and wrong
conduct.
http://www.iep.utm.edu/ethics/
Big Data and traditional ethics
• Let’s take concepts from traditional ethics and examine how they should
apply to the digital world, and how they evolve under the influence
of Big Data capabilities
• Four Elements of Big-Data Ethics: Identity, Privacy, Ownership, Reputation
• Big Data is ethically neutral
• Personal data – not some specific data, but any data generated in the
course of a person’s activities
• Privacy interests, not always ultimate rights
• A responsible organization is an organization that is concerned both with
handling data in a way that aligns with its values and with being perceived
by others to handle data in such a manner.
Kord Davis. Ethics of Big Data:
Balancing Risk and Innovation.
O'Reilly Media, 2012
Big-Data ethics: Identity
• Identity (in philosophy), also called sameness, is whatever makes
an entity definable and recognizable (Wikipedia)
• Christopher Poole vs. Mark Zuckerberg: “prismatic” multi-identity
vs. mono-identity
• Some governments are concerned about identity on the Internet:
Italy and Belarus, say, require ID to obtain access
• Does Big Data allow us to re-construct identity?
Partially following Kord Davis, Ethics of Big Data.
Big-Data ethics: Privacy
• Privacy is the ability of an individual or group to seclude themselves or information about themselves and thereby reveal themselves selectively (Wikipedia).
• In 1993, the New Yorker published a cartoon whose caption read: “On the Internet, nobody knows you’re a dog.” At the time, this was funny because it was true. Today, in the age of big data, it is not only possible to know that you’re a dog, but also what breed you are, your favorite snacks, your lineage, and whether you’ve ever won any awards at a dog show.
• There are two issues. First, does privacy mean the same thing online and offline in the real world? Second, should individuals have a legitimate ability to control data about themselves, and to what degree?
Following Kord Davis, Ethics of Big Data.
Big-Data ethics: Ownership
• The degree of ownership we hold over specific information about us varies as widely as the distinction between privacy rights and privacy interests.
• Do we, in the offline world, “own” the facts about our height and weight? Does our existence itself constitute a creative act, over which we have copyright or other rights associated with creation?
• How do those offline rights and privileges, sanctified by everything from the Constitution to local, state, and Federal statutes, apply to the online presence of that same information?
• At the end of the day, we more and more pay for “free” online services by sharing our data with their providers
Following Kord Davis, Ethics of Big Data.
Big-Data ethics: Reputation
• As recently as 20 years ago, reputation consisted primarily of what people, specifically those who knew and frequently interacted with you, knew and thought about you. In some cases, a second-degree perception – that is, what the people who knew you said about you to the people who they knew – might influence one’s reputation.
• One of the biggest changes born from big data is that now the number of people who can form an opinion about what kind of person you are is exponentially larger than it was a few years ago.
• And further, your ability to manage or maintain your online reputation is growing farther and farther out of individual control.
• There are now entire companies whose business model is centered on “reputation management”
Following Kord Davis, Ethics of Big Data.
Benefits of ethics inquiry
• Faster consumer adoption by reducing fear of the unknown (how
are you using my data?)
• Reduction of friction from legislation, from a more thorough
understanding of constraints and requirements
• Increased pace of innovation and collaboration derived from a
sense of purpose generated by explicitly shared values
• Reduction in risk of unintended consequences from an overt
consideration of long-term, far-reaching implications of the use of
big-data technologies
Partially following Kord Davis, Ethics of Big Data.
But now we want to go even further
We need more formal theory(ies) and more practical instruments to incorporate ethical behavior into our products and services.
There are several important reasons to do so:
• We intend to ensure and enforce ethical behavior of our apps
• We hope that formal theory and algorithms will help us to find reasonable (at least not very self-contradictory) solutions in cases where our everyday life experience does not work
• We need to comply with complicated regulation and, at the same time, to check its consistency
• We have to be ready for disruptive changes in the off-line ethics adopted by society, under the influence of IT and of Big Data in particular
What to expect from computational ethics
• We need a set of formal rules to classify steps (state changes) as ethically “right” or “wrong” (actually there are a few more options), and to find the most acceptable steps in cases where our intuition provides no solution
• We need to apply these rules in a consistent way to people (subjects), to information objects (which in Big Data tasks have a certain level of autonomy due to complexity), and to collectives of subjects and objects (collectives as moral agents as well)
Computational ethics vs. Roboethics
Machine Ethics (or machine morality) is the part
of the ethics of artificial intelligence concerned
with the moral behavior of Artificial Moral Agents
(AMAs) (e.g. robots and other artificially
intelligent beings). Machine ethics is sometimes
referred to as computational ethics and
computational morality.
In contrast, roboethics is concerned with the moral
behavior of humans as they design, construct, use
and treat such beings.
Here we do not talk about any sort of robotics or
any similar professional code of conduct (i.e. it is
not about hacker or software developer ethics).
This area of research has recently drawn a lot of attention, due in part to Wendell Wallach of Yale
University urging the U.S. and other governments to address the proliferation of drones and similar AKMs.
A few useful approaches (we shall just mention them here and discuss them in more detail later)
Most fundamental concepts
• Information logic by Luciano Floridi. For many reasons this is a good background for the rest of the approaches.
• Information accountability. An alternative to the “all prohibited by default” approach in access management.
• Deontic logic. The field of logic concerned with obligation, permission, and related concepts, and, by extension, a formal system that attempts to capture the essential logical features of these concepts. A good (but not the most general) approach to expressing access/usage rules. More general modal logics are of interest as well.
A few useful approaches (we shall just mention them here and discuss them in more detail later)
Practical concepts and calculi
• Moral grammar. Still not all-embracing, but a very expressive way to describe and solve situations of moral choice.
• (Auto)reflexive multi-agent calculus of conflict and cooperation. The first model to describe the role of reflection in ethics, moral choice in conflicts, and the difference between intention and readiness. It introduced a simple classification of moral agents. Popular in Russia due to its expressiveness and predictive power, but rarely known abroad.
• Differential privacy. Means to maximize the accuracy of queries to statistical databases while minimizing the chances of identifying their records.
• Models, calculi, and languages to describe and derive ownership, access restrictions, and rules, including obligation-specific ones: Privacy-aware Role Based Access Control (P-RBAC), Web Ontology Language (OWL, OWL2), and eXtensible Access Control Markup Language (XACML).
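To make the differential-privacy idea concrete, here is a minimal sketch of the Laplace mechanism for a counting query (function names and parameters are illustrative, not from any particular library):

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via the inverse-CDF method."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon):
    # A counting query has sensitivity 1: adding or removing one person
    # changes the count by at most 1, so Laplace(1/epsilon) noise yields
    # epsilon-differential privacy for this query.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 38, 44, 31]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
# noisy is close to the true count (3), but the presence or absence of
# any single record is statistically masked.
```

Smaller epsilon means more noise and stronger privacy; the analyst trades query accuracy against the chance of identifying an individual record, which is exactly the balance the definition formalizes.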
A few “ethically neutral” but useful instruments
Category Theory
Category theory is a toolset for describing the general abstract structures in mathematics.
As opposed to set theory, category theory focuses not on elements x, y, ⋯ – called
objects – but on the relations between these objects: the (homo)morphisms between
them, f : x → y
Computation on encrypted data
The ability to perform some operations (such as searches or comparisons) on data encrypted
by another entity. An area of active research with some prominent results. Often referred to as
homomorphic encryption.
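To make “homomorphic” concrete: unpadded (textbook) RSA is multiplicatively homomorphic, which a few lines of Python can demonstrate. The parameters are tiny and insecure, for illustration only; fully homomorphic schemes supporting arbitrary computation are far more involved:

```python
# Textbook (unpadded) RSA is multiplicatively homomorphic:
# Enc(a) * Enc(b) mod n decrypts to a * b mod n.
# Toy parameters -- insecure, for illustration only.
p, q = 61, 53
n = p * q                    # 3233
e, d = 17, 2753              # public / private exponents

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 7, 12
c_product = (enc(a) * enc(b)) % n   # computed on encrypted data only
# dec(c_product) == 84 == a * b: the server multiplied values it never saw
```

The service provider can thus multiply values it cannot read, which is the core promise of computation on encrypted data.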
A few references
Information ethics.
• Luciano Floridi. Information Ethics: Its Nature and Scope. SIGCAS Computers and Society, Vol. 36, No. 3, September 2006, pp. 21-36.
Information accountability.
• Daniel Weitzner, Harold Abelson, Tim Berners-Lee, Joan Feigenbaum, James Hendler, Gerald Sussman. Information Accountability. Communications of the ACM, June 2008, pp. 82-87.
Deontic logic.
• Paul McNamara. Deontic Logic. Stanford Encyclopedia of Philosophy, 2006, 2010.
Moral grammar.
• John Mikhail. Moral Grammar and Intuitive Jurisprudence: A Formal Model of Unconscious Moral and Legal Knowledge. In The Psychology of Learning and Motivation: Moral Cognition and Decision Making. D. Medin, L. Skitka, C. W. Bauman, D. Bartels, eds., Vol. 50, Academic Press, 2009, pp. 27-100.
(Auto)reflexive multi-agent calculus of conflict and cooperation.
• Vladimir Lefebvre. Algebra of Conscience. Kluwer Academic Publishers, Dordrecht, Boston, 2001, 358 p.
• Vladimir Lefebvre, Thomas Reader. Reflexive IW Model II. Report No. AD-A399417; ARL-SR-114, Feb. 2002. Army Research Lab., Survivability/Lethality Analysis Directorate, White Sands Missile Range, NM, USA, 50 p.
Differential privacy.
• Cynthia Dwork. Differential Privacy: A Survey of Results. Microsoft Research, April 2008.
P-RBAC, OWL, XACML.
• Qun Ni, Dan Lin, Elisa Bertino, and Jorge Lobo. Conditional Privacy-Aware Role Based Access Control. ESORICS 2007, LNCS 4734, J. Biskup and J. Lopez (eds.), Springer-Verlag, Berlin, Heidelberg, 2007, pp. 72-89.
• For OWL see http://en.wikipedia.org/wiki/Web_Ontology_Language
• For XACML see XACML 3.0 – Committee Specification 01. OASIS (oasis-open.org). Retrieved 10 August 2010.
Category theory.
• http://ncatlab.org/nlab/show/category+theory#idea_10
• Saunders Mac Lane. Categories for the Working Mathematician. 2nd edition, Springer Verlag, 1998, 314 p.
Computation on encrypted data.
• Abdullatif Shikfa. Computation on Encrypted Data: Private Matching, Searchable Encryption and More. Bell Labs, Alcatel-Lucent, 2013.
Why Information Ethics
I will cover other instruments later while describing specific Big Data scenarios or operations, but
Information Ethics due to its universality needs discussion in advance.
Information Ethics is the theoretical foundation of applied Computer Ethics.
IE is an expansion of environmental ethics towards
1) a less anthropocentric concept of agent, which now includes also non-human (artificial)
and non-individual (distributed) entities; and
2) a less biologically biased concept of patient as a ‘centre of ethical worth’, which now
includes not only human life or simply life, but any form of existence.
3) a conception of environment that includes both natural and artificial (synthetic, man-
made) eco-systems.
IE is therefore: non-standard, patient-oriented, not agent-oriented, environmental, non-
anthropocentric but ontocentric, and based on the concepts of informational object/
infosphere/entropy rather than life/ecosystem/pain.
Luciano Floridi, Lecture @ Cambridge, 2006
Infosphere, patient oriented
Ecology, patient oriented
Bioethics, patient oriented
All-human, agent oriented
Brief history of ethics
Athenian citizens
Moral Patient
• Question: what is the lowest possible common set of attributes which
characterizes something as intrinsically valuable and an object of
respect, and without which something would rightly be considered
intrinsically worthless or even positively unworthy and therefore
rightly disrespectable in itself?
• Answer: the minimal condition of possibility of an object’s least
intrinsic worthiness is its abstract nature as an informational entity.
• Conclusion: all entities, interpreted as clusters of information, have a
minimal moral worth qua informational objects, and deserve to be
respected.
Luciano Floridi, Lecture @ Cambridge, 2006
Four Principles of Information Ethics
• IE determines what is morally right or wrong, what ought to be done,
what the duties, the “oughts” and the “ought nots” of a moral agent
are, by means of four basic principles:
0. entropy ought not to be caused in the infosphere (null law)
1. entropy ought to be prevented in the infosphere
2. entropy ought to be removed from the infosphere
3. the welfare of the infosphere ought to be promoted by extending it,
improving it and enriching it.
Luciano Floridi, Lecture @ Cambridge, 2006
Conclusions for Big Data
• When we recognise that not only humans (actors), but also (at least some)
information objects (patients) have rights that should be respected, and
that under certain conditions entropy minimisation is immoral, we shall
have to agree that our privacy rights are only privacy interests and should
be balanced with the rights of the information objects we created.
• Big Data (subsets, elements of BD) are complicated enough to satisfy, in
many cases, the criteria of moral patient(s).
• Thus our claim to ultimate privacy becomes questionable.
• Do we really always have a “Right to Forget” (i.e. to somehow kill
information objects)?
• At some point in the future, the information objects representing our identity
in the infosphere may become autonomous enough (the shadow free-will dilemma).
Conclusions for Big Data
Information object rights to respect
• If you consider this problem abstract, just think about recent attempts in several
countries to prohibit car dashboard cameras (video registrators) as potentially violating
the privacy of passengers in other cars.
• Or take into account that countries with over-restrictive regulation of Big Data
collection and usage are at risk of economic and social slowdown.
• Can we explain all this to legislators? Many of them are still at the stage of adopting
ecological thinking.
Is this practical? Yes. It helps to architect a big data management system in a consistent way:
• Your predictive conclusions have little value if you kill information objects at will
• Access mechanisms that respect information object rights help to avoid massive leaking
and full service blocking (“all is prohibited by default”)
III. Sources of Big Data
Humans as data sources
Per person per day (in “golden billion”)
• 50-200 e-mails
• 10-50 voice calls
• 1-100 SMS and twits
• 0.1 blog posts
• 1-20 financial transactions
• 3-30 search requests
• 10-30 articles, read on the
Internet
• 10 audio records
• 30-90 minutes of
TV/Video
• 20-200 appearances in
video monitoring cameras
• 1-100 geospatial “notches”
• 20-200 RFID checks
• 0.05-10 healthcare records
And at least 4.5 billion people have at least phones
(mostly wireless)
From The Human Face of Big Data by Rick Smolan. EMC inspired.
Things as data sources
The Internet of Things
World today
…and tomorrow
Subjects and objects (both actors and patients)
Humans, unique by default
• Proactively using ICT infrastructure (social networks, online services, etc.)
• Passively “catching sight” of monitoring and recording systems
Real-world identifiable objects
• Active networked devices
• Passive objects, identifiable by context
In-between
• Pets with RFIDs
• Animals in wild nature
• AI agents (if, and when, originating)
Abstract objects
• Pieces of information (code and data, i.e. objects) in ICT systems with their natural behavior and lifecycle
IV. Procedures and Scenarios
Processing
• Operations conducted at a centralized location, but in a highly
distributed and parallelized manner, to prepare Big Data for
consumption
• The most popular paradigm of Big Data processing is Map/Reduce
• Many experts consider Map/Reduce, or the math behind this
paradigm, to be the heart of Big Data
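A minimal sketch of the Map/Reduce pattern — word count, the canonical example — with plain Python standing in for a real framework such as Hadoop:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit <word, 1> pairs for every word in one chunk of input.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: summation is associative and commutative, so reducers can
    # run independently on any partition of the grouped values.
    return key, sum(values)

documents = ["big data big value", "big challenge"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts == {'big': 3, 'data': 1, 'value': 1, 'challenge': 1}
```

In a real cluster each `map_phase` call runs on a different node against a different chunk of the data; the shuffle is a network operation; the reducers never see the raw input.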
Pieces of data (information objects)
• Can relate/belong to one subject/object or can describe some
sort of relation between subjects and/or objects
Examples:
relates to individual entity: birth day, location,
relation: friend, call record, etc.
• Generally has structure <Key, Value>. Can contain one simple
value, well structured data, or complicated unstructured or semi-
structured (nested) set (tuple) of data (and associated code)
Examples: name, location, friend list, blog, etc.
“Upstream”
• Users supply some personally identifiable info to some centralized
location to further consume some sort of social, business or
governmental online service
Examples: Facebook, LinkedIn, gosuslugi.ru, etc.
• Devices supply some spatially/personally associated info to some
sort of centralized location for further analyses or transactional
services
Examples: mobile billing, smart power grids, weather forecast stations,
wind tunnel data stream acquisition system
“Downstream”
• User extracts some personally identifiable information related to
him (her) or to other persons to consume some sort of social,
business or governmental online service
Examples: Facebook, LinkedIn, gosuslugi.ru
• User or device (the same as above or different) extracts some
information supplied by networked devices and/or users to
consume or provide service
Examples: Internet shallow search (in most cases), mobile billing, smart
power grids, traffic advising systems
Social interaction
• In many cases, while consuming a social or transactional online service, users
or devices directly or indirectly interact with other users and/or devices (i.e.
with their virtual agents inside the Big Data set), and thus form complicated
meta-structures with a certain level of stability and specific meta-behavior
• It would be wrong to assume that only humans or higher animals
demonstrate social behavior. Devices and math structures (virtual agents)
with reflection (and auto-reflection) demonstrate it as well (at least we
have agreed to consider this social behavior)
Example: collaborative antipersonnel mines
• Far from all Big Data collections and services are social in nature
Predictive / inductive analytics
• A set of mathematical techniques aimed at obtaining non-obvious
knowledge from Big Data sets
• Includes:
• Mathematical statistics
• Game theory
• Multiagent models (physical, economic, econophysical, social…)
• Graph theory
• Machine learning
• Deep neural networks
• Linear and nonlinear programming
• …and more
Consolidation
• A service (service provider) collects all, or a substantial part, of the
relevant existing data representing / covering a specific sort of
information / subject matter
Examples: all suppliers of airplane spares, all citizens with medical or
armed forces experience, etc.
Aggregation
• A user, or a service on behalf of a special sort of (probably privileged) user,
performs an operation aimed at obtaining some common (intensive or
extensive) characteristics of a data set, or of a large subset of it, as a whole.
Examples: total capacity of airplane spares suppliers, national rare-earth
metal proven reserves, number of citizens with medical or armed
forces experience by region, etc.
• Does this operation produce really new knowledge?
De-personalization / de-identification
• Removal or hiding of personal or sensitive identifying
information from data sets or from query responses
• Should not block (not contradict) at least the most common (or all?)
aggregation operations
• Often required while producing Open Data
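A minimal de-identification sketch (the field names and generalization rules are hypothetical): drop direct identifiers, generalize quasi-identifiers, and keep the record usable for aggregation:

```python
# Illustration only -- real de-identification needs a formal model
# (k-anonymity, differential privacy, etc.), not ad hoc suppression.
RECORDS = [
    {"name": "Ivanov", "birth_year": 1975, "zip": "119333", "diagnosis": "flu"},
    {"name": "Petrov", "birth_year": 1978, "zip": "119421", "diagnosis": "asthma"},
]

def de_identify(record):
    return {
        # direct identifier ("name") is removed entirely;
        # quasi-identifiers are generalized so counts by age band
        # and region still work:
        "age_band": (record["birth_year"] // 10) * 10,   # 1975 -> 1970
        "zip_prefix": record["zip"][:3],                 # 119333 -> "119"
        "diagnosis": record["diagnosis"],
    }

anonymized = [de_identify(r) for r in RECORDS]
# anonymized[0] == {'age_band': 1970, 'zip_prefix': '119', 'diagnosis': 'flu'}
```

The re-identification slide that follows shows why this alone is not enough: generalized quasi-identifiers can still be joined against external open data sets.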
Re-personalization / re-identification
Use of sophisticated algorithms and additional (extended,
external, open… ) sets of data to re-construct personally
identifiable information
Recent example: usage of several openly available data sets
(purchasing, property, etc.) to reconstruct personal information
from anonymized medical records sets
Is it always illegal ?
Data theft / leaking
• There are many scenarios of data theft to take into consideration while
protecting an online service or an offline business IT system.
• But one is Big Data specific. Big Data processing assumes consolidation of
disparate data sources into a unified (even if physically distributed) warehouse,
and access to the whole set of data for execution of some essentially non-local
operation (such as building a reverse index).
• At least the early noSQL systems were designed in a way which potentially
could provide access to all data from one specific (internal and privileged)
account, thus creating the threat of stealing a large amount of data at a time.
Digital forensics
• Forensics – the use of science and technology (digital in our
case) to investigate and establish facts in criminal or civil
courts.
• In many cases utilizes capabilities not provided to ordinary
users.
• How should it be regulated ?
V. Math and (some) Legal
Math behind processing
• Map / Reduce effectiveness relies largely on the commutativity / associativity of its operations at each
step. This allows us to abstract from the details of operations happening at one computer node with one
chunk of data. Instead, most operations can be described in terms of “natural transformations” or
“mappings” (surprise), thus making category theory a natural language to talk about them.
Just for illustration purposes
• We can, potentially, benefit from using languages such as Haskell (ok, F# also, and, maybe, Python…)
• And we can look for practical problems already described in terms of category theory
Categories at a glance
http://www.cs.indiana.edu/cmcs/categories.pdf
Rosetta stone (new one)
• Category theory is nearly universal. It helps to streamline and unify concepts in various
fields – from string theory to process management. This was proclaimed in a well-known article.
Is something missed here?
Seems, yes. It is mathematical linguistics and its applications to search and information retrieval.
But what about ethics? We’ll see soon.
Digital behaviorism vs. the charm of simplicity, or bag of words vs. universal grammar
Peter Norvig
Director of Research
for Google
Noam Chomsky
Professor (Emeritus) at MIT,
"father of modern linguistics"
A slightly oversimplified account of the discussion after the MIT anniversary session:
Chomsky wishes for an elegant theory of intelligence and language that looks past human fallibility
to try to see simple structure underneath. Norvig, meanwhile, represents the new philosophy: truth by
statistics, and simplicity be damned.
Kevin Gold for tor.com
What is important here: Noam Chomsky really developed an elegant and practically useful universal grammar
theory, which declares that all languages are essentially similar and can be generated from a sort of
symmetry structure with a relatively small number of parameters. This leads to the conclusion that the ability to
learn grammar is “hard-wired into the brain”. Joachim Lambek proposed a type-enriched approach to
grammar structuring, which eventually evolved into categorial grammar models.
Evolution of linguistics (and cognitive science as well…)
Empiric linguistics
Early grammars / Early vocabularies
Formal grammars / Thesauruses and ontologies
Universal grammar / Human as a bag of senses
The minimalist program / Statistical bag-of-words approach
Categorical approaches / Deep search models
“Quantum” grammar
“Quantum” Linguistics
Merging the worlds
• The meaning of a word lies in its context. A statistical model of semantics can,
in large part, be reduced to clustering in an appropriate vector space of n-gram
probabilities.
• Some specific part of a sentence’s meaning depends on its formal structure;
the text is considered as an algebraic structure.
• And now let’s merge these mutually complementary approaches (through
composition) and use this new “space” to represent sentences with both
semantics and formal structure.
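The “meaning lies in context” side can be sketched as bag-of-words vectors compared by cosine similarity — a toy stand-in for clustering in an n-gram probability space:

```python
import math
from collections import Counter

def bow_vector(text):
    # Bag of words: word order and formal structure are deliberately lost.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v))

a = bow_vector("big data needs new ethics")
b = bow_vector("new ethics for big data")
c = bow_vector("category theory for mathematicians")
# cosine(a, b) == 0.8, cosine(a, c) == 0.0: shared context, not shared
# grammar, is what this model measures.
```

The categorical/grammar side of the merge would add precisely what this sketch discards: the algebraic structure of the sentence.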
When and why do we need such models?
Shallow Search
Deep Search
Translation
Regulation (natural language) → “translation” → ERP-readable (XBRL-like)
The “translation” requires:
• Formal structure of sentences
• Ontologies for terms with special meaning
• “Bags of words” for the rest of the vocabulary and to validate terms usage
What about pieces of data
• Generally <Key, Value>. But not so simple. Key and Value can be nested tuples. Primitive values can be large blobs of text or other media. Pieces of data can actually contain code or assume execution of some code.
• Traditional large-scale distributed file systems (like HDFS) targeting Big Data applications initially ignored the nature of individual information objects, both in terms of placement and access rights (i.e. security). Their core metaphor was enormous flat files. This is a really optimal approach for social networks, news search, and transactions.
• It seems that recent implementations (such as those for Storm, Caffeine, and Singularity) are more granular and, in some cases, respect individual object access requirements.
• In an ideal world all information objects should “live” in an infinite flat storage, interacting with each other and with external actors according to their contracts. A challenging goal in a multi-petabyte world.
What are the problems with “Upstream”
http://themainstreetanalyst.com/2013/01/18/most-popular-social-networking-sites-by-country-infographic/
It is huge.
Not only are there a lot of
business and state systems that
supply data about almost
everything to some warehouses,
but there are a lot of people
voluntarily supplying their data to
social networks. More than 1B
Facebook users.
Little chance to change the pattern
without disruption.
What are the problems with “Upstream”
Sharing personal data…
…by people, voluntarily: feature, not bug
People supply an enormous amount of data that allows them, their relatives,
businesses, state institutions, etc. to be identified and researched. Some of them
ignore or are not aware of the risks; most value the benefits of global socialization
higher than the damage caused by the loss of some privacy. I.e. they “pay” with their
personal data to receive social services.
This is the reality in which legislators, lawyers, security experts and IT pros have to live.
Rather than enforcing restrictive supply regulation, legislators could enforce
responsible usage. Not a simple approach either, of course.
What are the problems with “Upstream”
Sharing personal data…
…by devices: need moral justification
State monitoring systems are great instruments to improve national security, and
there are few if any objections against their implementation. Should people
always be explicitly notified when interacting with them?
Officials often miss the point when defending such systems. Just an example: “it was
illegal to track a person without permission from the authorities, but that there was
no law against tracking the property of a company, such as a SIM card.”
It seems this is about “crime scene investigation”. It does not require explicit
permission from potentially affected citizens. How far should we expand its scope:
the local crime scene? The city? The state? How long after the case? Before the case?
(Personal) data as a new currency (and tax)
People pay (invest) their data to obtain benefits
To (commercial) social services
• If they consider them valuable
• If they trust the providers, considering them
reputable and responsible
To state security, tax and other systems
• If they consider them effective
• If they trust the administration, considering it
legitimate and responsible
Lack of trust in this area can cause economic slowdown and social frustration
Supply and Identity
It has been mentioned frequently that the approach to identity in the digital world can evolve. People can either continue with their traditional approach to identity (which in the real world tends to be not only unique, but also single and “integral” for each subject), or switch to multiple partial, complementary or even contradicting identities.
Evidently, both social service providers and state institutions prefer the first approach. But the process is far from accomplished and can turn toward the second option in some geos or social groups if over-pressed by restrictive regulation in a situation of lack of mutual trust.
Also, we shall soon face more and more complicated strategies to misinform supply mechanisms, developed by criminals etc. Multiple SIM cards are only the beginning. What math should be used to re-construct identity in the case of untrusted records in a Big Data set?
What are the issues with “Downstream”
Generally speaking, there are two main problems (OK, there are more, but they are out of our scope):
• Who is allowed to retrieve results from Big Data?
• What is the quality of the results (are they complete and accurate)?
Who is allowed to retrieve results
Responsible usage
If you are an old-school IT security expert, your answer, most probably, would be: “nobody, never, nothing”. At least by default.
In a few cases (a highly bureaucratized, hierarchical, small-to-medium organization) it works. But hardly in the case of 1B+ users (and they are different: individuals of different ages, provider employees, businesses, state officials, devices, etc.).
The first thing we’d do is implement elements of a responsible approach to information, as Tim Berners-Lee and his colleagues recommend: just explicitly inform the potential user about usage regulation and potential liability.
It is easier said than done, as such an approach requires a mechanism to associate norms and regulations with pieces of information. See the slide “When we need such models” above, and, potentially, ontologies of usage patterns.
Who is allowed to retrieve results
Obligations and deontic logic
Nevertheless, whether society adopts a strict or a lightweight approach to access management, there will be areas of strict control due to national security, children’s rights protection, etc. We should be able to describe obligations associated with permissions. Some should precede the action; some are to be accomplished in its course or within a pre-defined timeframe after the action. Deontic logic, while not free of paradoxes, is a good background for this.
Deontic logic is that branch of symbolic logic that has been the most concerned with the contribution that the following notions make to what follows from what:
http://plato.stanford.edu/entries/logic-deontic/
• permissible (permitted)
• impermissible (forbidden)
• obligatory (duty, required)
• omissible (non-obligatory)
• optional
• ought
• must
• supererogatory (beyond the call of duty)
• indifferent / significant
• the least one can do
• better than / best / good / bad
• claim / liberty / power / immunity
Since the early 1980s, deontic logic has been used to formalize legal reasoning and legal corpora, access rights, database protection, as well as moral choice in IT (see J.-J.Ch. Meyer and R.J. Wieringa. Applications of deontic logic in computer science: A concise overview. In J.-J.Ch. Meyer and R.J. Wieringa, editors, Proceedings of the 1st International Workshop on Deontic Logic in Computer Science (DEON 1991), Amsterdam, The Netherlands, December 11-13, 1991, pages 15-43).
Deontic logic at a glance
John Mikhail, Universal Moral Grammar: Theory, Evidence, and the Future. Trends in Cognitive Sciences,
April 2007; Georgetown Public Law Research Paper No. 954398.
We shall discuss Mikhail’s works in a bit more detail soon.
Who is allowed to retrieve results
Obligations and ethical calculus
Consider the following sentences:
1. A informs B
2. A tells B that p
3. A lets B know that p
4. A shares his knowledge that p with B
5. A informs B about p
6. A sends a message to the effect that p to B
7. A's communications to B indicate that p
The general form of (1)-(7) can be rendered as:
• Agent A in informational context C sees to it that Agent B believes that p, or
• A informs B that X
Moral or legal constraints in information contexts may be expressed in general terms as follows:
• It is (not) obligatory or (not) permitted for A to see to it that B knows that p
Intellectual Property
(A) If John has an IP right in a particular piece of
information X, then Peter ought to have permission
from John to acquire, process or disseminate X.
Privacy and Data Protection
(B) If information X is about John and if Peter does not
have X then Peter is not permitted to acquire X without
John's consent. If he does have X, then he is not
permitted to process or disseminate it without John's
consent.
Equal Access
(C) If A is informed about X, then all ought to be informed
about X.
Responsibility and Information
(D) If John has an information responsibility regarding X,
then John has an obligation to see to it that specific
others have access to information X.
Jeroen van den Hoven and Gert-Jan Lokhorst. Deontic Logic and Computer-Supported Computer Ethics. 2002
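As a toy, non-authoritative illustration, constraints (A) and (B) can be encoded as an ordinary permission predicate. The agent names and the function signature are our assumptions; a serious system would use a deontic-logic engine rather than plain Boolean checks.

```python
def may_acquire(agent, info_owner=None, info_subject=None,
                has_ip_permission=False, has_consent=False):
    """May `agent` acquire a piece of information X?"""
    # (A) Intellectual property: acquiring X needs the right-holder's permission
    if info_owner is not None and info_owner != agent and not has_ip_permission:
        return False
    # (B) Privacy: information about a person needs that person's consent
    if info_subject is not None and info_subject != agent and not has_consent:
        return False
    return True

print(may_acquire("Peter", info_subject="John"))                    # False
print(may_acquire("Peter", info_subject="John", has_consent=True))  # True
print(may_acquire("John", info_subject="John"))                     # True: own data
```

Constraints (C) and (D) are obligations on the information holder rather than permissions on the acquirer, so they would attach to the holder's side of the exchange (cf. the obligation mechanism sketched earlier).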
What is the quality of the results
False positives in Big Data: Bonferroni’s principle
While preparing a report on Big Data one may face several problems:
• Not all appropriate <key, value> pairs are identified
• Some wrong pairs are selected
• And, most interestingly, some pairs can satisfy the search criteria not due to their nature, but simply due to the statistically large size of the data sample
Calculate the expected number of occurrences of the events you are looking for, on
the assumption that data is random. If this number is significantly larger than the number
of real instances you hope to find, then you must expect almost anything you find to be
bogus, i.e., a statistical artifact rather than evidence of what you are looking for. This
observation is the informal statement of Bonferroni’s principle.
In a situation like searching for terrorists, where we expect that there are few terrorists
operating at any one time, Bonferroni’s principle says that we may only detect terrorists by
looking for events that are so rare that they are unlikely to occur in random data.
Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman. Mining of Massive Datasets, CUP, 2010, 2012
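The principle is easy to see with a back-of-the-envelope computation in the spirit of the cited book's "evil-doers visit hotels" example. All numbers below are illustrative assumptions: we flag as "suspicious" any pair of people seen at the same hotel on two different days, and count how many such pairs pure chance produces.

```python
from math import comb

people = 10**9     # population under observation
days = 1000        # days of data
hotels = 10**5     # number of hotels
p_hotel = 0.01     # chance a given person visits some hotel on a given day

# Chance that two specific people are at the SAME hotel on a given day
p_meet = p_hotel * p_hotel / hotels          # = 1e-9

# Chance that the same pair meets at hotels on two DIFFERENT days
p_meet_twice = p_meet ** 2                   # = 1e-18

# Expected number of "suspicious" pairs if the data is purely random
expected = comb(people, 2) * comb(days, 2) * p_meet_twice
print(f"{expected:,.0f}")   # about 250,000 coincidental pairs
```

A quarter of a million bogus pairs, even if there is not a single real conspiracy in the data: any search criterion applied to Big Data must be rare enough to beat this baseline.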
What are the issues with social interaction
From access control to social interactions
People (and already some other information agents) establish social interactions: they fall into cooperation and conflicts, make moral choices and moral evaluations. If we want to build services processing data supplied by billions of people, and expect the services to behave in a “natural” way (either respond accordingly or make grounded predictions), we need to think about the implementation of ethical norms.
The area for applying ethical concepts in Big Data and social network applications is wide: from obligation-enhanced access control to ethical evaluation of autonomous decisions.
Why do we need ethical evaluation? It is almost impossible to predict and list all micro-scenarios of service usage: should the service provide to agent A info regarding agent B if it will impact decisions A can make regarding agent C? If A is a physician? If A is (or is not) his patient? In a case of emergency?
Do not forget that modern “autonomous artificial intellectual agents” normally invoke some predictive analytics over statistically large sets of data (Big Data) to make decisions. And these decisions should be ethical, or at least understandable to humans.
What are the issues with social interaction
Why ethical behavior is usually considered understandable
Let’s turn back to John Mikhail’s works (see Mikhail, John, Universal Moral Grammar: Theory, Evidence, and the Future. Trends in Cognitive Sciences, April 2007; Georgetown Public Law Research Paper No. 954398, and earlier).
A possible, and very attractive, answer is: “Because ethics, or at least some main ethical categories, are built into our brains”. So humans are “ethical by design”, as computers should be.
And there is nothing in common with creationism here…
Mikhail’s Universal Moral Grammar follows Chomsky’s Universal Language Grammar approach.
Five main questions of universal moral grammar
• What constitutes moral knowledge?
• How is moral knowledge acquired?
• How is moral knowledge put to use?
• How is moral knowledge physically realized in the brain?
• How did moral knowledge evolve in the species?
What are the issues with social interaction
A bit more about Universal Moral Grammar
UMG proposes a set of rules to process the transformation of the questioned situation: from initial description through structural analysis, temporal decomposition and deontic analysis to a final decision. UMG takes into account the different evaluation by humans of direct impact vs. side effects, etc.
What are the issues with analytics
Which analytics do we need?
Again, there is a sort of competition between statistical and
structural approaches. But in case of analytics it is evident, that we
need both (and, in some cases, multi-agent simulation as well).
A few methods/disciplines to mention:
• Mathematical statistics and statistical hypothesis testing
• Linear and nonlinear programming, optimization
• Neural networks (yes, finally we have workable deep neural networks!), multidimensional approximations
• Game theory
• Graph theory
• Coding theory
• Mathematical logic, various flavors
• Formal linguistics
• Statistical econophysics…
What are the issues with analytics
Why are they all different?
Let’s limit ourselves to one important problem: how to describe and analyze situations of cooperation and conflict and the different roles of actors, and how to explain that different actors behave differently in similar situations while remaining, at the same time, ethical (not violating their own moral rules).
This is a question from Mikhail again:
Moral diversity is far from only geo-cultural. It is mental and group-social as well.
What are the issues with analytics
In search of a language and calculus for compromise and conflict
There are a number of approaches aiming to describe these interactions, classify their participants, and predict their steps. Among others is the (auto)reflexive approach of Vladimir Lefebvre, developed in the 1970s and 1980s, first in the USSR and then in the USA.
Lefebvre proposed a (set of) simple ethical calculi based on Boolean logic, and found that there are two self-consistent ethical systems (see Lefebvre, V.A. Algebra of Conscience. D. Reidel, Holland, 1982).
First Ethical System:
• The end DOES NOT justify the means
• If there is a conflict between means and ends, one SHOULD be concerned
Second Ethical System:
• The end DOES justify the means
• If there is a conflict between means and ends, one SHOULD NOT be concerned

First Ethical System:
• A “saint” is non-aggressive, tends toward compromise, and has low self-evaluation
• A “hero” is non-aggressive, tends toward compromise, and has high self-evaluation
• A “philistine” is aggressive, tends toward conflict, and has low self-evaluation
• A “dissembler” is aggressive, tends toward conflict, and has high self-evaluation
Second Ethical System:
• A “saint” is aggressive, tends toward conflict, and has low self-evaluation
• A “hero” is aggressive, tends toward conflict, and has high self-evaluation
• A “philistine” is non-aggressive, tends toward compromise, and has low self-evaluation
• A “dissembler” is non-aggressive, tends toward compromise, and has high self-evaluation
Simple taxonomy for social interactions:
Lefebvre, V.A.: Algebra of Conscience. D. Reidel, Holland (1982)
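The core of the calculus can be sketched in Boolean terms. The mapping below is our illustrative reading of Lefebvre (1982), not a faithful reproduction of his full model: encoding 1 = good and 0 = evil, in the First Ethical System a compromise between good and evil is evaluated as evil (AND) and a confrontation of good with evil as good (OR), while the Second Ethical System swaps the two operations.

```python
GOOD, EVIL = 1, 0

def first_system(compromise: bool, a: int, b: int) -> int:
    # First Ethical System: compromise is AND, confrontation is OR
    return (a & b) if compromise else (a | b)

def second_system(compromise: bool, a: int, b: int) -> int:
    # Second Ethical System: the operations are swapped
    return (a | b) if compromise else (a & b)

# A compromise between good and evil:
print(first_system(True, GOOD, EVIL))    # 0: evaluated as evil
print(second_system(True, GOOD, EVIL))   # 1: evaluated as good
```

Both systems are internally consistent; they merely disagree on which Boolean operation models a compromise, which is exactly the "end justifies the means" split in the table above.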
What are the issues with analytics
Algebra of conscience
Again, notation from the theory of categories.
Big Data specific theft / leaking
Anatomy of (some of the) grand leaks
There are many scenarios of data theft to take into consideration while protecting an online service or an offline business IT system. But at least one is specific to Big Data. When people (businessmen and politicians) first became acquainted with the technology, they attempted to use it for predictive analysis of all non-quantitative sources at once. The idea was: “Let’s collect all evidence, whether reliable or not, and attempt to classify it, using the same corpus for verification of each statement”. Suppliers were instructed to input everything into such systems, from official documents to rumors, and let the computer separate the wheat from the chaff.
For the first time ever, there appeared huge warehouses of text documents, in one place, with unified access from one (internal and privileged, but single) system account. It was only a matter of time before massive leaks happened.
Any protection mechanisms? Physical security. Access to different file system chunks or information objects from different accounts, sophisticated access rights management, and, mostly in the near future, computation on encrypted data (homomorphic encryption).
The issues with de-/re-personalization
Differential privacy
Big Data is a huge, valuable asset. But how do we use it without violating privacy? Say, how do we arrange privacy-preserving sharing of Big Data between state and business? Just one example: a pharmacy chain requires statistical/demographic/etc. data to expand and tune its business, but is prohibited from accessing personal medical records.
The simplest answer is “open data”: statically privacy- (and other secret-) cleaned sets of data. A good start, but far from enough: you are mostly limited to predefined reports, and most of Big Data’s power is lost. The real question is how to ensure privacy dynamically, over the complete set of data, i.e., how to assure suppliers that participation in Big Data sets is not a privacy risk.
Good answer is –
Differential Privacy
There are more
sophisticated definitions/
requirements as well
Ilya Mironov, Omkant Pandey, Omer Reingold, and Salil Vadhan, Computational Differential Privacy, in Advances in Cryptology—CRYPTO 2009, Springer, 2009
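A minimal sketch of the classic Laplace mechanism for epsilon-differential privacy: a counting query is answered with noise calibrated to the query's sensitivity, so that any single person's presence or absence changes the output distribution by at most a factor of e^epsilon. The pharmacy data below is a made-up illustration.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from Laplace(0, scale)
    u = 0.0
    while u == 0.0:          # avoid log(0) below
        u = random.random()
    u -= 0.5                 # now u is in (-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0        # adding/removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon)

# Hypothetical query: how many patients in the sample bought drug "D"?
records = [{"drug": "D"}] * 130 + [{"drug": "E"}] * 70
noisy = private_count(records, lambda r: r["drug"] == "D", epsilon=0.5)
print(noisy)   # close to 130, but randomized on every call
```

Smaller epsilon means more noise and stronger privacy; the computational relaxations in the cited Mironov et al. paper weaken this information-theoretic guarantee to hold only against efficient adversaries.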
More to discuss
• Contract theory and ethics
• Econophysics and ethics
• Differential privacy and main laws of Information Ethics
• Applications