
Page 1: Privacy Policy Evaluation

Privacy Policy Evaluation

A tool for automatic natural language privacy policy analysis

Page 2: Privacy Policy Evaluation

THE AIM AND THE LOOK AND FEEL

Page 3: Privacy Policy Evaluation

Purposes

• A tool that helps the user assess the privacy level offered by a website

• The tool should be able to:
  – analyze the privacy policy of a website (written in natural language)
  – verify its compliance with the Privacy Principles (PP), e.g. the ones specified by the EU
  – verify its compliance with the user's preferences

Page 4: Privacy Policy Evaluation

[Architecture diagram. Server side: the website publishes its natural language privacy policy and saves collected data to a database. User side: the end-user defines the user's privacy preferences; together with the EU Privacy Principles and Privacy Best Practices, these feed the stages TEXT ANALYSIS, GOALS EXTRACTION (producing a set of goals), PRIVACY EVALUATION and PRIVACY ENFORCEMENT; a POLICY GENERATOR produces XACML policies (SPARCLE).]

Page 5: Privacy Policy Evaluation

[Revised architecture diagram. Server side: the web server publishes its natural language privacy policy and saves data to a database. User side: the end-user defines the user's preferences and reads the results; the policy, the user's preferences and the EU Privacy Principles feed the stages TEXT MINING, PRINCIPLES COMPLIANCE CHECKING, COMPLIANCE RESULTS and PRIVACY ENFORCEMENT.]

Page 6: Privacy Policy Evaluation

User's Properties Settings

PET Settings – Select the controls you want to carry out on the privacy policy:

• NOTICE – verify the presence of statements regarding the way your information is used and collected
• CHOICE
• USE
• CORRECTION
• SECURITY
• ENFORCEMENT
• THIRD PARTIES

Page 7: Privacy Policy Evaluation

User's Preferences View

PET Settings – Select the controls you want to carry out on the privacy policy:

• NOTIFICATION
• PURPOSE SPECIFICATION – verify the presence of statements specifying the purposes for which the data will be collected
• CONSENT
• ACCESS AND CORRECTION
• SECURITY
• LIMITED DATA COLLECTION
• THIRD PARTIES
• LIMITED TIME RETENTION

Page 8: Privacy Policy Evaluation

Privacy Policy Results View

PET Results – Select the checks you want to carry out on the privacy policy:

• NOTIFICATION
• PURPOSE SPECIFICATION – "Our purpose is collecting non-personally identifying information to better understand how our visitors use our website."
• CONSENT
• ACCESS AND CORRECTION
• SECURITY
• LIMITED DATA COLLECTION
• THIRD PARTIES
• LIMITED TIME RETENTION
• WARNING – "We may use and disclose medical information about you without your consent"

Page 9: Privacy Policy Evaluation

Example of a real privacy policy for a hospital

• Example 1:
  – O'Connor (hospital) is the sole owner and user of all information collected on the Web site.
  – O'Connor does not sell, give, or disclose the information (statistical or personal) gathered on the Web site to any third party.

• Example 2:
  – We may use and disclose medical information about you without your consent if necessary for reviews preparatory to research, but none of your medical information would be removed from the Hospital in the course of such reviews.

Page 10: Privacy Policy Evaluation

(Key) Action list

Principle – Key words
• Notice – collect, use, require, include, ask, describe, notify
• Choice – opt-out, delete, remove, unsubscribe
• Use – purposes, aiming, using for
• Correction – modify, update, correct, access, change, review
• Security – security, encryption, SSL, https, password
• Enforcement – store, retention, control, protect
• Transfer to Third Parties – share, sell, provide, disclose, publish, aggregate, access, send, tell, transfer

Threats – Key words
• sell, disclose, publish, send, aggregate, identify, reveal, violate, tell, transfer

Page 11: Privacy Policy Evaluation

Example of goal

Principle – Sentence
• Notice – "We collect your credit card information for the time necessary to deliver the product"

Approaches
• Looking for key words (i.e. look for the verbs we are interested in)
• Full text analysis (i.e. analyze each sentence and verify to which goals, if any, it corresponds)
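A minimal sketch of the key-word lookup approach, using a subset of the key words from the table above; the dictionary contents and function names are illustrative, not part of the original tool:

```python
# Minimal sketch of the key-word lookup approach: each principle is mapped to
# some of the key words listed above, and a sentence is flagged for every
# principle whose key words it contains. Names and keyword lists are illustrative.
import re

PRINCIPLE_KEYWORDS = {
    "Notice": ["collect", "use", "require", "include", "ask", "describe", "notify"],
    "Choice": ["opt-out", "delete", "remove", "unsubscribe"],
    "Correction": ["modify", "update", "correct", "access", "change", "review"],
    "Security": ["security", "encryption", "ssl", "https", "password"],
    "Third Parties": ["share", "sell", "provide", "disclose", "publish", "transfer"],
}

def match_principles(sentence: str) -> list[str]:
    """Return the principles whose key words appear in the sentence."""
    words = set(re.findall(r"[a-z\-]+", sentence.lower()))
    return [p for p, kws in PRINCIPLE_KEYWORDS.items() if words & set(kws)]

print(match_principles("We collect your credit card information "
                       "for the time necessary to deliver the product"))
# -> ['Notice']
```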

Page 12: Privacy Policy Evaluation

THE ARCHITECTURE

Page 13: Privacy Policy Evaluation

The Architecture

[Architecture diagram with the following components: Privacy Evaluation GUI; Sentence Splitter; Text Mining / Machine Learning Module; Sentence Classification; Sentence Layout Organization. Inputs: the privacy policy document and the user's settings. Output: results shown to the user.]

Page 14: Privacy Policy Evaluation

TEXT MINING

Page 15: Privacy Policy Evaluation

Text Mining versus Data Mining

• Text mining is different from Data Mining. Why?
  – According to Witten [1], 'text mining' denotes any system that analyzes large quantities of natural language text and detects lexical or linguistic usage patterns
  – Data mining looks for patterns in data, while text mining looks for patterns in text
  – In data mining the information is hidden and automatic tools are needed to discover it, whereas in text mining the information is present and clear, not hidden at all
  – In text mining the problem is the human resource: it is impossible for humans to read the whole body of text by themselves
    • A good output from text mining is a comprehensible summary of the salient features in a large body of text

[1] I. H. Witten, “Text mining,” in Practical handbook of internet computing, Boca Raton, Florida: Chapman & Hall/CRC Press, 2005, pp. 1-22.

Page 16: Privacy Policy Evaluation

Bag of Words

• A document can be represented as a bag of words:
  – i.e. a list of all the words of the document with a counter of how often each word appears in the document
  – Some words (the stop words) are discarded
  – A query is seen as a set of words
  – The documents are consulted to verify which ones satisfy the query
  – A relevance-ranking technique allows us to assess the importance of each term with respect to a collection of documents or to a single document
    • this allows ranking each document according to the query

• In web search engines the query is usually composed of a few words, while in expert information retrieval systems queries may be far more complex
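A minimal sketch of the bag-of-words representation described above, assuming a simple tokenizer and an illustrative stop-word list:

```python
# Minimal bag-of-words sketch: tokenize, drop stop words, count occurrences.
# The stop-word list and tokenizer are illustrative, not the ones used by any
# particular tool mentioned in these slides.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "or", "we", "our", "your", "for"}

def bag_of_words(document: str) -> Counter:
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

doc = "We collect your data and we use the data to improve our services"
print(bag_of_words(doc))
# Counter({'data': 2, 'collect': 1, 'use': 1, 'improve': 1, 'services': 1})
```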

Page 17: Privacy Policy Evaluation

Text categorization (1)

• The assignment of documents (or parts of a document) to predefined categories according to their contents

• Automatic text categorization helps to detect the topics that a document covers
  – the dominant approach is to use machine learning to infer the categories automatically from a training set of pre-classified documents
    • the categories are simple labels with no associated semantics
    • in our case the categories will be the single privacy principles we are interested in

• Machine learning techniques for text categorization are based on rules and decision trees

Page 18: Privacy Policy Evaluation

Text categorization (2)

• Example for the principle "Transfer to Third Parties":
  – if (share & data) or (sell & data) or (third party & data)
  – then "THIRD PARTY" (see the code sketch at the end of this slide)

• Rules like this can be produced using standard machine learning techniques
• The training data contains a substantial sample of documents (paragraphs) for each category
• Each document (or paragraph) is a positive instance for the category it represents and a negative one for all the other categories
• Typical approaches extract features from each document and use the feature vectors as input to a scheme that learns how to classify documents
  – The key words can be the features and the word occurrences can be the feature values. Using the feature vectors it is possible to build a model for each category
  – the model predicts whether or not a category can be associated with a given document, based on the words in it and on their occurrences
  – With the bag of words the semantics is lost (only the words are accounted for), but experiments show that adding semantics does not significantly improve the results
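The hand-written rule from the example above, expressed as a small code sketch; the tokenization and helper names are illustrative, and a learned rule set would have the same shape:

```python
# Sketch of the example "Transfer to Third Parties" rule expressed as code.
# Tokenization and helper names are illustrative only.
def has(sentence_words: set, *terms: str) -> bool:
    """True if every term (possibly multi-word) occurs in the sentence."""
    return all(all(w in sentence_words for w in term.split()) for term in terms)

def third_party_rule(sentence: str) -> bool:
    words = set(sentence.lower().split())
    return (has(words, "share", "data")
            or has(words, "sell", "data")
            or has(words, "third party", "data"))

print(third_party_rule("We may sell aggregated data to partners"))   # True
print(third_party_rule("We encrypt your data with SSL"))             # False
```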

Page 19: Privacy Policy Evaluation

Approach to solve our problems

1. Category Definition
   – Define the categories we are interested in, e.g. the privacy principles
2. Annotation
   – Select a certain number of privacy policies (verify the correct number in the literature) and annotate the sentences (in the policies) related to each category
3. Bag of words
   – For each category define the bag of words (this can be an automated process)
   – e.g. security has: {https, login, password, security, protocol, …}
4. Training set
   – The annotated documents are our training set
5. Machine Learning
   – Apply machine learning algorithms to learn the classification rules
6. Usage
   – Use the classification rules to classify the sentences of new privacy policies into the right category

Page 20: Privacy Policy Evaluation

1. Category Definition (a)

• The categories are chosen as "privacy principles" adapted from
  – Directive 95/46/EC
  – Directive 2002/58/EC
  – and the OECD Guidelines

Page 21: Privacy Policy Evaluation

1. Category Definition (b)

Principle – Description

1. Notification
   • Notify the data subject of the identity of the data controller
   • Notify the data subject to whom his data are disclosed
   • Give the data subject information about the data categories collected

2. Purpose Specification
   • The data controller should specify for which purposes the data will be collected
   • Usage different from the original purposes can take place only after the data controller (or the third party) has obtained a new authorization from the user

3. Consent
   • The data subject should give his consent to the data collection

4. Access and Correction
   • The data subject has the right to access the data collected
   • The data subject has the right to modify the data collected
   • The data collected by the data controller should be kept up to date, and means to do that should be provided
   • Third parties to whom data have been disclosed should be notified of any rectification or erasure of the data

5. Security
   • The data controller must implement appropriate technical measures against unlawful use or data loss
   • The level of security should be appropriate to the risk associated with the processing and to the nature of the data to be protected
   • The data controller should take appropriate security measures (against unauthorized access or data loss) and inform the users about them (free of charge)
   • The data controller should inform the users of any special risk of a breach of security

6. Third Parties
   • The user should be fully informed when his data is transferred from one service provider to another
   • Any transfer of data can take place only if the user is well aware of it, and only for the use and purposes for which the data were collected

7. Cookies
   • Cookies are legitimate tools if the user is well informed of the information the cookie keeps and of the purposes for which such information is used
   • Users should have the opportunity to refuse cookies, and the service provider has the right to block some services in that case
   • The method used to give information, and to offer the possibility to refuse cookies, should be as user-friendly as possible

8. Limited Collection
   • The amount of personal data collected should be reduced to the minimum possible

9. Direct Marketing
   • SMS, emails or unsolicited communications for direct marketing can only be sent after the user has given his explicit consent
   • Users should be given the possibility to withdraw their consent to direct marketing at any time

10. Limited Time Retention
   • Data must be erased, or made anonymous, when it is no longer needed for the purposes for which it was collected
   • Data collectors must inform the user for how long the data is needed for the processing
   • The storage of data is allowed only for the time necessary to the management of the purpose

Page 22: Privacy Policy Evaluation

2. Annotation (a)

• One problem for the annotation step is verifying the quality of the annotators

• Usually, the authors of the experiment are also the ones annotating the documents for the training set

• Their quality has to be assessed by comparing their annotations with those of other users

• A way to do this is presented in the following slides

Page 23: Privacy Policy Evaluation

2. Annotation (b)

• Ask 4–5 persons to annotate the same privacy policy
  – For each sentence, annotate whether it is relevant for one of the ten principles (categories)

• Build a similarity matrix to verify how close the annotators are to each other
  – The similarity can be measured with an F-value (or a cosine-similarity measure)
    • Bad if < 0.5
    • More or less OK if between 0.5 and 0.75
    • Very good if > 0.75

Page 24: Privacy Policy Evaluation

2. Annotation (c)

       A0     A1     A2     A3     A4
A0     1      0.75   0.75   0.6    0.8
A1     …      1      0.8    …      …
A2     …      …      1      …      …
A3     …      …      …      1      …
A4     …      …      …      …      1

• If the annotations of A0 and A1 (i.e. the annotations of the authors of the experiment) are on average good, then the authors can be considered good annotators

• Let us call Fxy the value in the cell AxAy (the agreement coefficient)

Page 25: Privacy Policy Evaluation

2. Annotation (d)

• How can we compute Fxy, the agreement value between annotators Ax and Ay (for a given category C)?

           Ay: Yes   Ay: No
Ax: Yes       a         b
Ax: No        c         d

• If
  – a = number of paragraphs both Ax and Ay say belong to category C
  – b = number of paragraphs where Ax says belong to C and Ay says no
  – c = number of paragraphs where Ax says do not belong to C and Ay says yes
  – d = number of paragraphs where both Ax and Ay say do not belong to C

• then Fxy can be computed as (these formulas need to be verified, or better computations need to be found; a sketch with the standard definitions follows at the end of this slide):
  – Recall = a / (a + b) ?
  – Precision = a / (a + c) ?
  – F-value = ?
  – Kappa value = ?

• Maybe recall is the most useful formula in our case
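A sketch of one possible way to fill in the formulas above, using the standard definitions of precision, recall, F1 and Cohen's kappa over the a/b/c/d counts; taking Ay as the reference annotator is an assumption made here, not a decision taken in these slides:

```python
# Sketch of the agreement coefficient Fxy from the a/b/c/d counts above, using
# the standard definitions of precision, recall, F1 and Cohen's kappa.
# Ay is (arbitrarily) treated as the reference annotator; swapping Ax and Ay
# swaps precision and recall but leaves F1 and kappa unchanged.
def agreement(a: int, b: int, c: int, d: int) -> dict:
    n = a + b + c + d
    precision = a / (a + b) if (a + b) else 0.0   # Ax's "yes" confirmed by Ay
    recall = a / (a + c) if (a + c) else 0.0      # Ay's "yes" also found by Ax
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    observed = (a + d) / n                                        # raw agreement
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)  # chance agreement
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 1.0
    return {"precision": precision, "recall": recall, "F1": f1, "kappa": kappa}

# Illustrative counts: 12 paragraphs marked as category C by both annotators.
print(agreement(a=12, b=3, c=2, d=33))
```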

Page 26: Privacy Policy Evaluation

3. Bag of Words

• A set of words relevant for the document
  – Bag-of-words selection (a.k.a. feature selection) can be made by selecting all the words of a document (in our case a document is a sentence)
  – Each word of the bag of words should be given a weight
  – This process can be performed automatically

Page 27: Privacy Policy Evaluation

4. Training Set

• The training set is the set of privacy policies we annotate and to which we apply machine learning algorithms

• The policies have to be chosen according to one of two alternative criteria:
  – belonging to the same domain (e.g. e-health)
  – belonging to domains as different as possible

• The size of the training set needs to be defined

Page 28: Privacy Policy Evaluation

5. Machine Learning

• WEKA can be used for machine learning, and WordNet for the relationships between words

• Cyc is an ontology covering general world knowledge
  – not recommended here

• First of all, use the cosine similarity to make a sort of pre-test (a sketch follows below)

• Look at a standard decision tree learner, which should give rules as a result (??)
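A minimal sketch of the cosine-similarity pre-test suggested above, comparing a sentence's word-count vector with an illustrative bag of words per category; the category vocabularies and names are assumptions for the example:

```python
# Minimal sketch of the cosine-similarity pre-test: compare a sentence's
# word-count vector with the bag of words of each category and take the
# closest one. Category vocabularies and names are illustrative only.
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

CATEGORY_BAGS = {
    "Security": vectorize("security encryption ssl https password protect"),
    "Third Parties": vectorize("share sell disclose transfer third party partner"),
}

sentence = "All pages with personal data are protected by SSL encryption"
scores = {cat: cosine(vectorize(sentence), bag) for cat, bag in CATEGORY_BAGS.items()}
print(max(scores, key=scores.get), scores)   # -> Security scores highest
```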

Page 29: Privacy Policy Evaluation

Tools

• Weka
• TagHelper Tools
  – download the program + the documentation
  – Repeat the example shown in TagHelperManual.ppt to understand how it is possible to create a model starting from annotated documents
  – The file AccessMyrecordsPrivacyPolicy.xls is the annotation result for the privacy policy available at http://www.accessmyrecords.com/privacy_policy.htm
  – The model has been built on these annotations
  – You can test the model using the file EMC2.xls

Page 30: Privacy Policy Evaluation

MASTER THESIS REQUIREMENTS

Page 31: Privacy Policy Evaluation

Task 1 – Objectives

1. Literature study on text mining
   – I. H. Witten, "Text mining," in Practical Handbook of Internet Computing, Boca Raton, Florida: Chapman & Hall/CRC Press, 2005, pp. 1-22.
2. Elisa will provide Yuanhao with a defined set of categories (privacy principles) and guidelines to follow to associate each sentence of a privacy policy with a category
3. Yuanhao will select 2 privacy policies (e-health domain) and will proceed with the annotation of these policies according to the categories and guidelines indicated above
4. Elisa will also annotate these privacy policies
5. 4 other persons need to be chosen to annotate the 2 above-mentioned privacy policies (these persons will be provided with the same annotation guidelines mentioned above)

Page 32: Privacy Policy Evaluation

Task 1 – OUTPUT

• At the end of Task 1 the 6 annotators (Elisa, Yuanhao + 4) will provide annotated privacy policies in the following format:
  – an xls file with two columns. The first column indicates the category (e.g. choice, use, security) and the second column indicates the sentence of the privacy policy associated with that category
    • Note that every sentence of the privacy policy needs to be associated with a category
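A sketch of how such an annotation sheet could be read back into (category, sentence) pairs, assuming the two-column xls file is exported to CSV; the file name is illustrative:

```python
# Sketch of loading an annotation sheet into (category, sentence) pairs,
# assuming the two-column xls file described above has been exported to CSV.
# The file name and category labels are illustrative only.
import csv
from collections import defaultdict

def load_annotations(path: str) -> dict[str, list[str]]:
    """Return a mapping from category to the sentences annotated with it."""
    by_category: dict[str, list[str]] = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for category, sentence in csv.reader(f):
            by_category[category.strip()].append(sentence.strip())
    return by_category

annotations = load_annotations("AccessMyrecordsPrivacyPolicy.csv")
for category, sentences in annotations.items():
    print(category, len(sentences))
```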

Page 33: Privacy Policy Evaluation

TASK 2 – Objectives

• Validation of Elisa and Yuanhao as annotators
  – The method described in slides Annotation (c) and (d) needs to be applied
  – The annotators need to be compared
  – The method to do that needs to be formalized
    • i.e. a measure to verify the agreement level between two annotators needs to be found (e.g. precision, recall, F-measure)

Page 34: Privacy Policy Evaluation

Task 2 - OUTPUT

• Elisa and Yuanhao are demonstrated to be valid annotators

• They can start to annotate the training set of privacy policies

Page 35: Privacy Policy Evaluation

Task 3 – Objectives

• Annotate the privacy policies for the training set
• 10 privacy policies (in the e-health domain) may be sufficient (find support in the literature)
• Build up the model using machine learning techniques

OUTPUT
The model for the privacy policy categorization

HIGH PRIORITY

Page 36: Privacy Policy Evaluation

Task 3: Suggestions

• Read the paper (very useful):
  – F. Sebastiani, "Machine learning in automated text categorization".
• The TagHelper tool can be useful (it is an open source tool)
  – Read the paper for technical details about it (not everything is interesting for us):
    • C. Rosé et al., "Analyzing collaborative learning processes automatically: Exploiting the advances of computational linguistics in computer-supported collaborative learning"
  – Look at the classes FeatureSelector and EnglishTokenizer to see how the tool treats stop words and how it creates its own bag of words
  – etc/English.stp contains the file with the English stop words
  – This system uses WEKA classifiers
• Rainbow
  – a program that performs statistical text classification. It is based on the Bow library. For more information about obtaining the source and citing its use, see the Bow home page.
• Verify whether other tools are suitable to execute the following steps (a sketch of the first three steps follows below):
  – Stop-word elimination
  – Stemming
  – Bag-of-words (feature) extraction
  – Word reduction (if necessary)
  – Machine learning model (classifier construction: verify which algorithm is better to use)
• Also, start writing the report about this so I can read it once back

HIGH PRIORITY
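A minimal sketch of the first three preprocessing steps listed above (stop-word elimination, stemming, bag-of-words extraction); the stop-word list and the naive suffix-stripping stemmer are simplistic stand-ins for what a real tool such as TagHelper (with etc/English.stp) actually does:

```python
# Minimal sketch of stop-word elimination, stemming and bag-of-words extraction.
# The stop-word list and the naive suffix-stripping "stemmer" are deliberately
# simplistic stand-ins for a real preprocessing pipeline.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "or", "we", "your", "for", "be", "will"}

def naive_stem(word: str) -> str:
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def features(sentence: str) -> Counter:
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)

print(features("Your data will be disclosed to trusted third parties"))
# Counter({'data': 1, 'disclos': 1, 'trust': 1, 'third': 1, 'parti': 1})
```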

Page 37: Privacy Policy Evaluation

TASK 4 – Objectives

• Apply the model (output of task 3) to new privacy policies

• Measure and document the quality of the model
  – Precision
  – Recall
  – F-value

HIGH PRIORITY

Page 38: Privacy Policy Evaluation

TASK 5 - Objectives

• Build up the GUI able to analyze privacy policies and show the different categories (principles) addressed by each policy
• The GUI may be web-based
• The user should have the possibility to set the categories (privacy principles) he is interested in verifying
• The GUI should be based on the look & feel shown in the very first slides of this presentation