
© 2011 Columbia University

E6885 Network Science Lecture 13: Large-Scale Analysis and Advanced Network Analysis Applications

E 6885 Topics in Signal Processing -- Network Science

Ching-Yung Lin, Dept. of Electrical Engineering, Columbia University

December 12th, 2011

© 2010 Columbia University, E6885 Network Science – Lecture 13: Analysis of Network Flow Data

Course Structure

Class Number  Class Date  Topics Covered
1             09/12/11    Overview – Social, Information, and Cognitive Network Analysis
2             09/19/11    Network Representations and Characteristics
3             09/26/11    Network Partitioning, Clustering, and Use Case
4             10/03/11    Network Visualization, Sampling and Estimation
5             10/10/11    Network Modeling
6             10/17/11    Network Topology Inference
7             10/24/11    Dynamic Networks -- I
8             10/31/11    Dynamic Networks -- II
9             11/14/11    Final Project Proposal Presentation
10            11/21/11    Information Diffusion in Networks
11            11/28/11    Graphical Models and Analysis
12            12/05/11    Social and Economical Issues of Network Analysis
13            12/12/11    Large-Scale Network Processing System
14            12/19/11    Final Project Presentation

Scientific Challenges of Large-Scale and Real-Time Network Mining Infrastructure

• Speed-Up of Network Mining Algorithms (with Christos Faloutsos and U Kang)
• Network Sampling Theory (with Xifeng Yan)
• High Performance Computing for Network Analysis (with Ted Brown, Nitesh Chawla, Jaideep Srivastava, Ido Rosen, and David Hsu)
• Social Network Storage and Indexing (with Ted Brown)

Example: Centralities in Large Networks

“Who are the most important actors?”

Three centralities:
• Degree: number of neighbors
• Closeness: average shortest-path length
• Betweenness: number of times a node sits on a shortest path

Network                          Size                            Degree  Closeness  Betweenness
15th Century Florentine Family   |V| = 15, |E| = 19              Easy    Easy       Easy
Internet Web                     |V| = billions, |E| = billions  Easy    Hard       Hard

Complexities shown: O(|V|^3), O(|V|^2 log|V|), O(|E|)

Applications: measuring the financial company value; network attack monitoring
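The gap between "easy" and "hard" can be illustrated with a minimal Python sketch. It assumes an unweighted, undirected graph stored as a dictionary of neighbor sets; the toy graph and the function names are hypothetical and are not taken from the lecture. Degree needs one pass over the adjacency lists, while closeness (here the average shortest-path length, as defined above) needs a full breadth-first search from every node, which is what stops scaling at billions of nodes.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def degree(adj):
    """Degree centrality: one pass over the adjacency lists."""
    return {u: len(nbrs) for u, nbrs in adj.items()}

def avg_shortest_path(adj):
    """Average shortest-path length to the other reachable nodes.

    Runs one BFS per node, so the cost grows with |V| * (|V| + |E|).
    """
    result = {}
    for u in adj:
        dist = bfs_distances(adj, u)
        others = [d for v, d in dist.items() if v != u]
        result[u] = sum(others) / len(others) if others else float("inf")
    return result

# Hypothetical toy graph: path a-b-c-d with an extra leaf e attached to b.
toy = {
    "a": {"b"},
    "b": {"a", "c", "e"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": {"b"},
}

print(degree(toy))
print(avg_shortest_path(toy))
```

A lower average distance means a more central node under this definition; betweenness is left out of the sketch because it needs shortest-path counting on top of the same all-sources search.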

Large-Scale Graph Indexing and Network Management

[Figure: two-stage pipeline with steps numbered 1 to 5. Indexing stage: raw graph, graph after shuffle, zip-compressed blocks stored in graph DBs. User query stage: query vectors, unified query execution engine, resulting vectors.]

Graph Storage and Indexing

• 1. Block Formulation
  – Any partitioning algorithm (METIS, Disco, etc.) can be used
• 2. Block Compression
  – Compress each block using gzip.
• 3. Block Placement
  – Vertical: inefficient for out-neighbor query
  – Horizontal: inefficient for in-neighbor query
  – Grid: efficient for in/out-neighbor query (our choice)
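The three steps above can be sketched in a few lines of Python. This is an illustrative toy, not the GBase implementation: the block size, the JSON encoding of a block, and the function names are assumptions made for the example, and node ids are simply cut into equal ranges instead of being produced by a partitioner such as METIS. Each block holds the edges whose source falls in one id range and whose target falls in another; an out-neighbor query only decompresses the blocks in one grid row, and an in-neighbor query only those in one grid column.

```python
import gzip
import json

def build_grid(edges, block_size):
    """Group edges into (row, column) blocks and gzip each block separately."""
    blocks = {}
    for src, dst in edges:
        key = (src // block_size, dst // block_size)
        blocks.setdefault(key, []).append((src, dst))
    return {
        key: gzip.compress(json.dumps(sorted(block)).encode("utf-8"))
        for key, block in blocks.items()
    }

def out_neighbors(grid, node, block_size):
    """Decompress only the blocks in the node's grid row."""
    row = node // block_size
    found, touched = [], 0
    for (r, _c), blob in grid.items():
        if r != row:
            continue
        touched += 1
        for src, dst in json.loads(gzip.decompress(blob)):
            if src == node:
                found.append(dst)
    return sorted(found), touched

def in_neighbors(grid, node, block_size):
    """Decompress only the blocks in the node's grid column."""
    col = node // block_size
    found, touched = [], 0
    for (_r, c), blob in grid.items():
        if c != col:
            continue
        touched += 1
        for src, dst in json.loads(gzip.decompress(blob)):
            if dst == node:
                found.append(src)
    return sorted(found), touched

# Hypothetical directed toy graph on nodes 0..5, cut into a 3 x 3 grid.
edges = [(0, 1), (0, 4), (1, 2), (2, 3), (3, 0), (4, 5), (5, 0), (5, 3)]
grid = build_grid(edges, block_size=2)

print(len(grid), "blocks stored")
print(out_neighbors(grid, 0, block_size=2))
print(in_neighbors(grid, 0, block_size=2))
```

In this toy, both queries touch 3 of the 8 stored blocks. A purely row-wise or column-wise file layout would make one of the two directions cheap and force the other to scan every file, which is the trade-off listed under Block Placement.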

Graph Analytics Queries

• SQL-like queries
  – Global Queries: degree, pagerank, RWR, connected component
  – Targeted Queries: 1-step in-neighbors, 1-step out-neighbors, 1-step inout-neighbors, k-step neighbors, induced subgraph, egonet, k-core, cross-edges
• Query Execution Engine
  – Main tool: generalized matrix-vector multiplication
  – Grid Selection

Applications of existing GBase supported queries

Applications (columns): Browsing, Ranking, Finding Community, Anomaly Detection, Viz.
Queries (rows): Connected Component, Radius, PageRank, RWR, Induced Subgraph, K-Nh, K-Egonet, K-core, Cross-edges

Graph Mining on Hadoop

• Extract Graph Features at Different Levels – Scale Up & Speed
  – By Parallelism
  – By Scalable Algorithm Design

Level      Examples                 How-To
Local      degree, edge, weight     Ego-net
Sub-graph  community, role          Matrix factorization, partition
Global     PageRank, centralities   (Generalized) matrix-vector mul.

• Our other existing Hadoop-based functions
  – Large-scale graph analysis of up to 1B nodes & 7B edges [AAAI’11]
  – “Non-Negative Residual Matrix Factorization with Application to Graph Anomaly Detection”, SI of Best Papers of SIAM Data Mining Conf. 2011

Our algorithms

• We proposed two new centralities (‘effective closeness’ and ‘LineRank’), and efficient large-scale algorithms for billion-scale graphs

[Figure panels: Scalability Results; Effective Closeness vs. Closeness; Analysis of Real-World Graph]

For 2 billion edges:
- standard closeness: 30,000 years
- effective closeness: ~ 1 day!

1,000,000 times faster!

Challenges and Core Ideas

• Challenge 1: Scalability
  – Core Idea 1: Indexing at Block-level + Parallelism
• Challenge 2: Application Heterogeneity
  – Core Idea 2: Unified Query Execution Engine

Key Features of GBase:
- F1. Algorithms. Define common, core algorithms to satisfy various graph applications
- F2. Storage. Store and manage huge graphs in distributed settings to answer queries efficiently
- F3. Query Optimization. Exploit the storage and the general algorithm to answer queries quickly

Unified Query Run-Time Execution Engine in GBase

• Q: Given a graph, can we compute connected components, PageRank, Random Walk with Restart, and diameter/radius with one algorithm?
• A: Yes, by extending GIM-V for run-time SQL queries [GBase, KDD 2011]
  – Generalized Iterative Matrix-Vector Multiplication [Pegasus, ICDM 2009]
  – Extension of plain matrix-vector multiplication

Main Idea: Intuition

• Plain M-V multiplication: $M \times v = v'$
  – Weighted combination of colors
  – ~ Message passing

$$v'_4 = \sum_{i=1}^{4} m_{4,i}\, v_i$$

[Figure: M × v = v', with example weights 1, 1, and 0.1]

Main Idea

• Plain M-V multiplication: $M \times v = v'$

$$v'_i = \sum_{j=1}^{4} m_{i,j}\, v_j$$

Three implicit operations here:
• combine2: multiply $m_{i,j}$ and $v_j$ (message sending)
• combineAll: sum n multiplication results (message combination)
• assign: update $v'_i$

[Figure: M × v = v', with example weights 1, 1, and 0.1]
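To make the three implicit operations concrete, here is a minimal Python sketch that writes an ordinary matrix-vector product in exactly these terms. The function names follow the slide (`combine2`, `combineAll`, `assign`), but the code, the 4 x 4 matrix, and the vector are illustrative assumptions and are not taken from the Pegasus or GBase implementations.

```python
def combine2(m_ij, v_j):
    """Message sending: weight the value of node j by the matrix entry."""
    return m_ij * v_j

def combine_all(partials):
    """Message combination: sum the partial results that arrive at node i."""
    return sum(partials)

def assign(v_i_old, v_i_new):
    """Update: plain multiplication simply overwrites the old value."""
    return v_i_new

def matvec(M, v):
    """Compute v' = M x v using only the three operations above."""
    result = []
    for i, row in enumerate(M):
        partials = [combine2(m_ij, v[j]) for j, m_ij in enumerate(row) if m_ij != 0]
        result.append(assign(v[i], combine_all(partials)))
    return result

# Hypothetical 4 x 4 example.
M = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]
v = [1, 2, 3, 4]

print(matvec(M, v))  # [2, 4, 6, 3]
```

Keeping the three steps as separate functions is the point of the exercise: swapping any one of them changes the algorithm without touching the loop.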

Generalize MV to GIM-V

• GIM-V: Customizing the three operations leads to many algorithms

Operations  Standard MV  PageRank           RWR                     Diameter                       Con. Cmpt.
combine2    Multiply     Multiply with c    Multiply with c         Multiply bit-vector (approx.)  Multiply
combineAll  Sum          Sum with rj prob.  Sum with restart prob.  BIT-OR()                       MIN
assign      Assign       Assign             Assign                  BIT-OR()                       MIN
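A small Python sketch shows how one generic loop yields different algorithms once the three operations are swapped. It is a single-machine illustration under stated assumptions: the sparse matrix is a list of hypothetical `(i, j, m_ij)` entries, and the helper names `gimv_step` and `gimv_iterate` are made up for this example. Only the Standard MV and Con. Cmpt. columns are shown; PageRank, RWR, and diameter would need their own operation triples.

```python
def gimv_step(entries, v, combine2, combine_all, assign):
    """One generalized matrix-vector multiplication over sparse entries."""
    partials = [[] for _ in v]
    for i, j, m_ij in entries:
        partials[i].append(combine2(m_ij, v[j]))
    return [assign(v[i], combine_all(partials[i])) for i in range(len(v))]

def gimv_iterate(entries, v, combine2, combine_all, assign, max_iter=100):
    """Repeat the step until the vector stops changing."""
    for _ in range(max_iter):
        new_v = gimv_step(entries, v, combine2, combine_all, assign)
        if new_v == v:
            break
        v = new_v
    return v

# Standard MV: multiply, sum, overwrite.
mv = gimv_step(
    [(0, 1, 2), (1, 0, 3), (1, 1, 1)],
    [1, 10],
    combine2=lambda m, x: m * x,
    combine_all=sum,
    assign=lambda old, new: new,
)

# Connected components: every node starts with its own id as label and
# keeps the minimum label seen among itself and its neighbors.
undirected = [(0, 1), (1, 2), (3, 4)]  # node 5 is isolated
entries = [(a, b, 1) for a, b in undirected] + [(b, a, 1) for a, b in undirected]
labels = gimv_iterate(
    entries,
    list(range(6)),
    combine2=lambda m, x: m * x,
    combine_all=lambda xs: min(xs, default=float("inf")),
    assign=min,
)

print(mv)      # [20, 13]
print(labels)  # [0, 0, 0, 3, 3, 5]
```

Nodes that end with the same label belong to the same component, so the toy graph has three components.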

Fast Run-Time Query Algorithms for GIM-V

• Solution 1: Decrease the file size and shuffle time by Block Indexes
• Solution 2: Achieve sub-linear response time by Grid Placement and Selection

[Figure panels: Grid Placement; Grid Selection]

Performance

• Experiments done in M45 Hadoop cluster
  – Provided by Yahoo!
  – One of the top 50 supercomputers in the world
  – 500 nodes, 4000 cores, 3TB RAM, 1.5PB disk
• Data Sets

Performance

Example: Bayesian Network Model – LDA on Blue Genes

• 3-level hierarchical Bayesian model for content analysis; it is important to social network analysis research but presents computational challenges
• Large-scale content analysis, design of new architecture

Applications of LDA:
• Finding friends in blog data [1]
• Entity resolution [2]
• Mining source code [3]
• Generating email keywords [4]
• Mining graphs [5]

Speeding Up LDA

Parameter estimation
• Variational Expectation-Maximization (EM)
  – It is an alternating procedure, and each of the two steps has the potential to run in parallel
  – E-step to find optimizing values of variational parameters (used to compute posterior distribution of hidden variables)
  – M-step to find maximum likelihood estimates under posterior from E-step

Inference
• In E-step, for each iteration, data access (reference) could be in parallel

[Figure: E-step runs inference (Inf.) independently for word 1 ... word n ... word N over topic k; M-step runs MLE on the results]

Optimization to all (computing and I/O) nodes

Hardware architecture:
1. Instruction-level support for computation of 1st, 2nd, 3rd derivatives
2. Functional units for big but simple loops traversing rows and columns

Making LDA much faster on Hadoop or Blue Genes
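The alternating structure can be sketched in Python. To stay short, the sketch swaps in a much simpler model than LDA: a mixture of unigrams, where each document has a single topic. It is not variational LDA and not the lecture's implementation. What it does share with the description above is the shape of the computation: the E-step for one document depends only on the current parameters, so it can be mapped over documents independently, and the M-step then aggregates the results. All names and the toy corpus are hypothetical. Threads are used only to show the independence; in CPython a real speed-up would need processes or a cluster.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def e_step_one(doc, log_pi, log_beta):
    """Posterior over topics for one document; needs no other document."""
    scores = [
        log_pi[k] + sum(cnt * log_beta[k][w] for w, cnt in doc.items())
        for k in range(len(log_pi))
    ]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def m_step(docs, resp, vocab_size, smoothing=0.01):
    """Maximum-likelihood update of topic weights and word distributions."""
    num_topics = len(resp[0])
    pi = [sum(r[k] for r in resp) / len(docs) for k in range(num_topics)]
    beta = []
    for k in range(num_topics):
        counts = [smoothing] * vocab_size
        for doc, r in zip(docs, resp):
            for w, cnt in doc.items():
                counts[w] += r[k] * cnt
        total = sum(counts)
        beta.append([c / total for c in counts])
    return pi, beta

def em(docs, pi, beta, vocab_size, iterations=20, workers=4):
    """Alternate a parallel E-step and an aggregating M-step."""
    resp = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iterations):
            log_pi = [math.log(max(p, 1e-300)) for p in pi]
            log_beta = [[math.log(b) for b in row] for row in beta]
            resp = list(pool.map(lambda d: e_step_one(d, log_pi, log_beta), docs))
            pi, beta = m_step(docs, resp, vocab_size)
    return pi, beta, resp

# Hypothetical corpus: word id -> count, over a 4-word vocabulary.
docs = [{0: 3, 1: 2}, {0: 2, 1: 3}, {2: 3, 3: 2}, {2: 2, 3: 3}]
pi0 = [0.5, 0.5]
beta0 = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]

pi, beta, resp = em(docs, pi0, beta0, vocab_size=4)
print([round(r[0], 3) for r in resp])
```

The first two documents end up assigned to the first topic and the last two to the second, with every per-document posterior computed without reference to the other documents.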

Understanding Human

Goal: Novel System for Detecting and Predicting Behaviors, through large-scale social network analytics and data mining, for security applications to decrease insider threats such as colleague-shooting, suicide, data leakage, malware propagation, etc., or for commerce applications, such as marketing and sales.

[Figure: multimodality analysis for detecting and predicting.
Data: emails, chats, meetings, web page clicks, DB server logs, social media data.
Capture: feed subscription, social sensors, database access, click-stream capturer.
Analysis: graph analysis, behavior analysis, semantics analysis, psychological analysis.]

Beyond the traditional security framework: leverage IBM Research strengths in
1. Semantics: IBM Watson Q&A Framework,
2. Graphs and Machine Learning: SmallBlue (IBM Atlas) Social Network Analysis,
3. Large-Scale Processing: IBM BigInsights, and
4. Stream Processing: IBM InfoSphere Streams

Cybersecurity Applications of Social Network Analytics

• Hypothesis: Interaction among people is where the major activities are and is the major target for cybersecurity
• Security Applications:
  – Spam and Malware Propagation
  – Bipolar communications in Targeted Attack – Human vs. Malware
  – Data Exfiltration through Social Networks – finding sensitive data that have been leaked.
  – Entity Resolution by linking Social Media Networks and Organization Networks
• Detection and Early Warning of Anomalies:
  – Suspicious Communications
  – Collusive Behaviors
  – Inappropriate Online Social Network Postings

Breakthrough Needed: Science advances and an advanced multimodality analytics platform to model, learn, and predict relation networks and dynamic people behavior

Use scenario 1:
• Attacker / Spammer: near-star
• Normal: (1) clique-like, (2) two-way links

Use scenario 2: data leakage prediction; malware propagation prediction
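The near-star versus clique-like contrast in use scenario 1 can be expressed with two simple egonet features. The Python sketch below is only an illustration: the feature names, the toy communication graph, and the choice of features are assumptions, and the lecture does not specify how such patterns are scored. A near-star sender has many contacts that rarely reply and rarely talk to each other, whereas normal users tend to show two-way links inside a densely connected neighborhood.

```python
def egonet_features(out_adj, node):
    """Contact count, reciprocity, and neighbor density for one node.

    out_adj maps each sender to the set of recipients it has contacted.
    """
    out_nbrs = out_adj.get(node, set())
    in_nbrs = {u for u, targets in out_adj.items() if node in targets}
    nbrs = (out_nbrs | in_nbrs) - {node}

    # Reciprocity: share of contacts linked in both directions.
    two_way = len((out_nbrs & in_nbrs) - {node})
    reciprocity = two_way / len(nbrs) if nbrs else 0.0

    # Neighbor density: share of neighbor pairs that also talk to each other.
    ordered = sorted(nbrs)
    pairs = linked = 0
    for idx, a in enumerate(ordered):
        for b in ordered[idx + 1:]:
            pairs += 1
            if b in out_adj.get(a, set()) or a in out_adj.get(b, set()):
                linked += 1
    density = linked / pairs if pairs else 0.0

    return {
        "contacts": len(nbrs),
        "reciprocity": reciprocity,
        "neighbor_density": density,
    }

# Hypothetical data: "s" broadcasts to four strangers; w, x, y, z all
# exchange messages with one another.
graph = {
    "s": {"a", "b", "c", "d"},
    "w": {"x", "y", "z"},
    "x": {"w", "y", "z"},
    "y": {"w", "x", "z"},
    "z": {"w", "x", "y"},
}

print(egonet_features(graph, "s"))
print(egonet_features(graph, "w"))
```

In the toy data the broadcaster scores 0.0 on both reciprocity and neighbor density, while a member of the tightly knit group scores 1.0 on both; any decision threshold between those extremes would have to be learned from real data.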

Cognitive Network

• Cognitive Network, e.g.:
  – 30,000 nodes of dynamic brain MRI functional networks

New Type of EEG Detector and Signal Analysis

• Fundamental Research on EEG Signal Processing
• New Dry Sensing
  – Classifying Attention, Relaxation, etc.
  – Classifying Target – P300 signals
  – Classifying Visual Cortex Signals
• Breakthrough Non-Contact Sensing – suitable for everyday/normal use
• Cognitive Wireless Sensor becomes possible

[Figure: EEG wireless sensor developed by an IBM partner]

Our Multi-Modality Analysis Platform

• Task 0: Infrastructure
• Task 1: Network Analytics
• Task 2: Semantic Analytics
• Task 3: Multimodality Learning
• Task 4: Behavior Reasoning & Graphical Models
• Task 5: Interface

Application Example: Composite Social-Cognitive-Info Wireless Network

[Figure: devices connected over 3G links. Device side: EEG / audio signal detection, GPS / location detection, information display. Network side: status monitoring, visualization, personalized information recommendation, routing, etc.]

New Social Visual Sensor for Trustworthy Face Detection

• In fMRI and EEG studies of human faces, people “unconsciously” judge the trustworthiness of a novel face in 100 ms.
• The rating difference between “consensus trustworthy” faces and “consensus untrustworthy” faces is on average 0.65 points on a 5-point scale.
• Is it possible to detect the “trustworthy” rating of a human face automatically from visual signals?

[Figure panels: face response at the amygdala (Engell 2007); faces vs. descriptions (Todorov, 2007)]


Privacy Lesson

Perception > Policy > Law

Privacy laws worldwide

[Map legend: existing private-sector privacy laws; emerging private-sector privacy laws]

• European Union: European Data Protection Directive (1995)
• Canada: PIPEDA (2001 - 2004)
• U.S. – Sectoral:
  – Children’s Privacy; COPPA (1999)
  – Financial Sector; GLB (2001)
  – Health Sector; HIPAA (2002)
  – California Privacy (2005)
• Taiwan: Computer-Processed PD Protection Law (1995)
• South Korea: Info & Comm Network Util. & Info Protection Law (2000)
• Japan: Personal Data Protection Act (2005)
• APEC: Guidelines (2004)
• Russia: Federal law on Pers Data (January 2007)
• Australia: Privacy Amendment Act (2001)
• New Zealand: Privacy Act (1993)
• Chile: Protection of Private Life Law (1999)
• Argentina: Protection of PD Law (2000)
• Dubai: Data Protection Law (January 2007)

The two most important elements of privacy

– The claim of individuals, groups or institutions to define for themselves when, how and to what extent information about them is communicated to others
  [Michael, Privacy, in Harris and Joseph eds, The International Covenant on Civil and Political Rights and UK Law, 1995]
– The individual’s ability to control the circulation of information relating to him.
  [Miller, Assault on Privacy, 1971]

Privacy is a human right and a personal perception

For social software -- a root of conflict

(United Nations) Universal Declaration of Human Rights [1948]
Article 12: No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honor and reputation. Everyone has the right to the protection of the law against such interference or attacks.

(United Nations) Universal Declaration of Human Rights [1948]
Article 19: Everyone has the right to freedom of opinion and expression; this right includes freedom to hold opinions without interference and to seek, receive and impart information and ideas through any media and regardless of frontiers.

European Convention on Human Rights (ECHR) Article 10

• Everyone has the right to freedom of expression. This right shall include freedom to hold opinions and to receive and impart information and ideas without interference by public authority and regardless of frontiers. This article shall not prevent States from requiring the licensing of broadcasting, television or cinema enterprises.
• The exercise of these freedoms, since it carries with it duties and responsibilities, may be subject to such formalities, conditions, restrictions or penalties as are prescribed by law and are necessary in a democratic society, in the interests of national security, territorial integrity or public safety, for the prevention of disorder or crime, for the protection of health or morals, for the protection of the reputation or the rights of others, for preventing the disclosure of information received in confidence, or for maintaining the authority and impartiality of the judiciary.

Privacy Legislation -- Europe

• Three main legal instruments:
  – The European Convention on Human Rights (ECHR) Article 8, which protects the right to privacy [Rome, November 1950]
  – Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data [1995]
  – Directive 02/58/EC, concerning the Processing of Personal Data and the Protection of Privacy in the Electronic Communications Sector. [2002]

The European Convention on Human Rights [1950]

• Article 8:
  1. Everyone has the right to respect for his private and family life, his home and his correspondence.
  2. There shall be no interference by a public authority with the exercise of this right except such as is in accordance with the law and is necessary in a democratic society in the interests of national security, public safety or the economic well-being of the country, for the prevention of disorder or crime, for the protection of health or morals, or for the protection of the rights and freedoms of others.

Workplace privacy

• Company as legal person: it can sue and be sued
• Needs from employers’ viewpoint:
  1. Employer Liability for Employee Misuse of Company Owned Technology
  2. Protection of Trade Secrets and Avoidance of Corporate Defamation
  3. Discovery in Litigation
  4. Productivity of Personnel and Systems
• Needs from employees’ viewpoint:
  – Human Rights

Timing of Legislation

• Legislation mostly lags behind what has already happened and is a collective social action to seek a common-ground resolution to existing issues.
• Countries’ privacy laws were enacted in the technical era of:
  – Database technology – no concept of search engine
  – Web 1.0 – no concept of social software
  – Combating spam
  – Regulating the behavior of government or personal-data collection companies
  – Regulating marketing behavior

Definition of Personal Data in EU Directive 95/46/EC

• Article 2 (a):
  – Personal data shall mean
    • any information
        Both objective and subjective information about a person
        Irrespective of the technical medium
    • relating to
        Check ‘content’, ‘purpose’ or ‘result’
    • an identified or identifiable
        Based on the means likely reasonably to be used by the controller or by any other person to identify that person
        The particular context and circumstances play an important role
    • natural person
        About a living individual
        In principle, not including a ‘legal person’. However, member states may extend legislation to legal persons

Criteria for Making Data Processing Legitimate (in 95/46/EC)

• Article 7:
  – Personal data may be processed only if:
    • The data subject has unambiguously given his consent; or
    • for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract; or
    • for compliance with a legal obligation to which the controller is subject; or
    • in order to protect the vital interests of the data subject; or
    • for the performance of a task carried out in the public interest….; or
    • for the legitimate interests pursued by the controller or by the third party or parties to whom the data are disclosed, except where such interests are overridden by the interests for fundamental rights and freedoms of the data subject which require protection under Article 1(1) (i.e., fundamental rights and freedoms of natural persons).

controller shall mean the natural or legal person, public authority, agency or any other body which alone or jointly with others determines the purposes and means of the processing of personal data; where the purposes and means of processing are determined by national or Community laws or regulations….

Information Collection (in 95/46/EC)

• Article 11: Information where the data have not been obtained from the data subject
  – The controller or his representative must, at the time of undertaking the recording of personal data or, if a disclosure to a third party is envisaged, no later than the time when the data are first disclosed, provide the data subject with at least the following information:
    • The identity of the controller and of his representative;
    • The purpose of the processing
    • Any further information such as
        The categories of data concerned,
        The recipients or categories of recipients,
        The existence of the right of access to and the right to rectify the data concerning him

Types of personal data (in 95/46/EC)

• Article 8:
  – Shall prohibit the processing of these sensitive personal data:
    • Racial or ethnic origin
    • Political opinions
    • Religious or philosophical beliefs
    • Trade-union membership
    • Health or sex life
  – The above can be processed if
    • The data subject has given explicit consent, except where prohibited by law
    • For carrying out the obligations and specific rights of the controller in the field of employment law
    • To protect the vital interests of the data subject
    • In the course of its legitimate activities with appropriate guarantees by a foundation, association, or any other non-profit-seeking body… solely to the members of the body…
    • Data are manifestly made public by the data subject

Case Law

• Lindqvist v. Jonkoping [2002]
• Opinion of Advocate General:
  – Mrs. Lindqvist was a part-time, voluntary catechist in a parish in Sweden.
  – She set up a web page for the parish. The homepage of the Swedish church linked directly to it.
  – It contained information about the parish including: the names, and in some cases the full names, of other employees and herself; her colleagues’ jobs and hobbies; telephone numbers and other personal information;
  – Additionally, it was mentioned that one of her colleagues was a part-timer because she had health problems.
  – Mrs. Lindqvist did not notify her colleagues about the web page, nor did she inform the Datainspektionen (Information Commissioner for Sweden).

Questions in the Lindqvist v Jonkoping case

1. Is it in the scope of the Directive? Does it constitute the processing of personal data by automatic means to list on a webpage a number of persons with comments about their jobs and hobbies?

2. Can the act of loading such information onto a webpage be regarded as outside the scope of the Directive, under the exceptions?

3. Is information on a webpage stating that a named colleague has injured her foot and is on half-time on medical grounds personal data concerning health which may not be processed?

4. If a person in Sweden uses a computer to load personal data onto a webpage stored on a server not in Sweden, does that constitute a transfer of data to a third country?

5. Can the provisions of the Directive, in a case such as the above, be regarded as contradictory with the general principles of freedom of expression or other freedoms and rights, which are applicable within the EU and are enshrined in ECHR Article 10?

Processing of personal data and the protection of privacy in the electronic communications sector (02/58/EC)

• Regulates ISPs and telecommunication providers
• Specifically focuses on marketing
• Direct marketing email messages may be sent only to subscribers who have given their prior consent (“opt-in”). Prior permission is required for B2C communication covering all “natural persons”.
• For B2B communication, EU member states are free to make “opt-out” the minimum legislation.
• In the US, the CAN-SPAM Act allows direct marketing email messages to be sent to anyone, without permission, until the recipient explicitly requests that they cease (“opt-out”).

US vs. Europe

• The US relies on a mix of legislation, regulation, and self-regulation.
• The European Union relies on comprehensive legislation that requires creation of government data protection agencies, registration of databases with those agencies and in some instances prior approval before personal data processing may begin.
• Safe Harbor Program (US Dept. of Commerce and European Commission):
  – The EU would prohibit the transfer of personal data to non-European Union nations that do not meet the European “adequacy” standard for privacy

Safe Harbor Program – 7 requirements

• Notice
• Choice
• Onward Transfer
• Access
• Security
• Data Integrity
• Enforcement

IBM signed up on 8/15/2002


Questions?