josé miguel melo

FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Anonymization of clinical data

José Miguel Melo

Mestrado Integrado em Engenharia Informática e Computação

Supervisor: Gabriel David

July 23, 2017

Anonymization of clinical data

José Miguel Melo

Mestrado Integrado em Engenharia Informática e Computação

Approved in oral examination by the committee:

Chair: Doctor Rui CamachoExternal Examiner: Doctor Paulo OliveiraSupervisor: Doctor Gabriel David

July 23, 2017

Abstract

Over the past years, with the progress of technology, the amount of data being collected by the ITsystems has exponentially grown. By using data mining techniques, this data can be analyzed tofind trends and statistics, which are really useful for all companies and industries. The analysis anddata sharing for studies became a large industry, with a great impact in all sectors. However, withthis comes the concern with individual privacy - there is a huge amount of data which is privateand should not be public in any circumstances - so, it is highly needed to find a solution to shareand analyse data while protecting privacy. Nevertheless, it is truly important to take into accountperformance issues as the anonymization process should not hinder the normal functioning of theoperational system.

The focus goes to clinical data, which allows medical researchers to learn trends, statistics andrelations between certain clinical attributes, such as correlations between gender and a specificdisease. These studies and data analysis are very important as they can bring great benefits andknowledge in healthcare. However, maintaining individual privacy is crucial.

To balance the need for rigorous data and the requirement of privacy protection, a study ondata anonymization is done and some models and algorithms are discussed. This study allows topropose and develop a practical way to efficiently anonymize clinical data. With this solution, theuser can quickly and easily anonymize a given MongoDB dataset, through the provision of a setof configurations. The anonymization is done resorting to well known models and algorithms toprotect privacy, associated with specific clinical criteria, restrictions and hierarchies. At the end ofthe anonymization, an anonymized version of the subset is obtained that meets the selected privacymodel, balancing enough privacy versus keeping research value.

The solution is evaluated in terms of performance and some optimizations are proposed andimplemented to solve some limitations imposed by the implemented prototype. By using a subsetof clinical data that needs to be anonymized, the solution’s applicability is validated for researchpurposes.

i

Resumo

Nos últimos anos, com o avanço da tecnologia, a quantidade de dados armazenados pelos sistemasde informação tem vindo a crescer exponencialmente. Recorrendo a técnicas de data mining, estesdados podem ser analisados para encontrar tendências e estatísticas, que são de grande utilidadepara todas as empresas e industrias. Assim, a análise e a partilha de informação para estudostornou-se uma indústria com grande impacto em todos os setores. No entanto, com isto surgem aspreocupações com a privacidade individual - muita informação é privada e não deve ser tornadapublica em nenhuma circustância. Portanto, surge a necessidade de encontrar uma solução parapartilhar a informação, protegendo a privacidade. Esta solução deve ter em conta problemas deperformance para não comprometer o normal funcionamento do sistema.

O foco são os dados clínicos, que permite aos investigadores clínicos encontrar novas tendên-cias, estatísticas e relações entre atributos clínicos, como doenças e género. Estes estudos são deextrema importância por trazerem benefícios e conhecimentos na área da saúde. No entanto, aprivacidade individual é crucial.

Para balancear a necessidade de informação rigorosa para contextos de pesquisa e o requisitode privacidade, é feito um estudo sobre anonimização de dados e alguns modelos e algoritmos sãoapresentados. Este estudo possibilita a proposta e desenvolvimento de uma solução para anoni-mizar dados clinicos de uma forma prática e eficiente. Com esta solução, o utilizador conseguerápida e facilmente anonimizar uma base de dados MongoDB através de uma simples config-uração. A anonimização é feita recorrendo a modelos de privacidade e algoritmos conhecidos,aliados a hierarquias, critérios e restrições clinicas. No final do processo, é obtida uma versãoanonimizada da base de dados que cumpre o modelo de privacidade selecionado, balanceandoprivacidade versus valor de pesquisa.

A solução é avaliada em termos de performance e algumas otimizações são propostas e im-plementadas para resolver algumas limitações impostas pelo protótipo implementado. A aplica-bilidade da solução para contextos de pesquisa é ainda validada utilizando uma base de dados realque contém dados clinicos a ser anonimizados.

iii

Acknowledgements

First of all, I would like to thank my adviser Gabriel David for the interest shown in the disserta-tion and for all the suggestions given, which showed essential to achieve the best results on thisdissertation.

I would also like to express my gratitude to my course colleagues and teachers for all theknowledge that they could provide me during these years as student of Faculdade de Engenhariada Universidade do Porto. This knowledge will be trully important and useful for my future.

I would like to express my gratitude to Glintt, and in particular to my adviser Pedro Rocha, forall the support and help given during the dissertation, which showed essential to accomplish thebest results on this dissertation and without which it would be a lot harder to finish it.

In addition, a thank to all my closest friends for being there during all these years, for sup-porting me in all my decisions and for encouraging me to be better and to learn the most I can asstudent.

Finally, I would like to express a special thank to my family for supporting me during theseyears, for being there in all good and bad moments and without whom it would be a lot harder toachieve all the goals I achieve during these last years.

José Miguel Melo

v

“Arguing that you don’t care about privacy because you have nothing to hide is no different thansaying you don’t care about free speech because you have nothing to say. ”

Edward Snowden

vii

Contents

1 Introduction 11.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Motivation and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Anonymize Data for Privacy 52.1 Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2 Privacy Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2.1 k-anonymity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2.2 (δ , k) anonymity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.3 km anonymity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.4 Bayes-Optimal Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.2.5 `-diversity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.2.6 t-closeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.2.7 δ -Presence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3 Data models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.3.1 Numeric data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.3.2 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.3.3 Set-valued data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.4 Approaches to anonymity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.4.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.4.2 Suppression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.4.3 Perturbation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.5 Quality Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.5.1 Discernibility metric (DM) . . . . . . . . . . . . . . . . . . . . . . . . . 172.5.2 Classification metric (CM) . . . . . . . . . . . . . . . . . . . . . . . . . 182.5.3 Precision (Prec) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.5.4 Normalized Certainty Penalty (NCP) . . . . . . . . . . . . . . . . . . . 18

2.6 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.6.1 DataFly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.6.2 Optimal Lattice Anonymization (OLA) . . . . . . . . . . . . . . . . . . 202.6.3 Incognito . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.6.4 Flash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.7 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242.7.1 UTD Anonymization Toolbox . . . . . . . . . . . . . . . . . . . . . . . 242.7.2 ARX - Powerful Data Anonymization Tool . . . . . . . . . . . . . . . . 25

2.8 Clinical Data Anonymization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

ix

CONTENTS

2.8.1 Recoding Identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.8.2 Names, Contact information and Identifiers . . . . . . . . . . . . . . . . 312.8.3 Age and Date of Birth . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.8.4 Other Dates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322.8.5 Medical dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3 Solution 333.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.1.1 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.2.1 Anonymization Process . . . . . . . . . . . . . . . . . . . . . . . . . . 353.2.2 Anonymization GUI app . . . . . . . . . . . . . . . . . . . . . . . . . . 453.2.3 Anonymization web service . . . . . . . . . . . . . . . . . . . . . . . . 46

3.3 Anonymization Process Results . . . . . . . . . . . . . . . . . . . . . . . . . . 483.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4 Optimizations 514.1 Streams for Memory Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.1.1 Data Processment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.1.2 Save Results to MongoDB . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.2 Clustering Pre-processment Phases . . . . . . . . . . . . . . . . . . . . . . . . . 544.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5 Results 635.1 Information loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.1.1 Prescribed Medications Collection . . . . . . . . . . . . . . . . . . . . . 635.1.2 Consumptions Collections . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705.2.1 Privacy Model Config Impact . . . . . . . . . . . . . . . . . . . . . . . 705.2.2 Solution - Initial version . . . . . . . . . . . . . . . . . . . . . . . . . . 725.2.3 Solution - Streaming version . . . . . . . . . . . . . . . . . . . . . . . . 755.2.4 Solution - Clusters version . . . . . . . . . . . . . . . . . . . . . . . . . 775.2.5 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

6 Web and Desktop Application 896.1 Web application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

6.1.1 View process information loss . . . . . . . . . . . . . . . . . . . . . . . 916.1.2 View useful analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.2 Desktop application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

7 Conclusions and Future Work 957.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

References 99

x

CONTENTS

A MongoDB Collections Structure 103A.0.1 Prescribed Medications . . . . . . . . . . . . . . . . . . . . . . . . . . . 103A.0.2 Consumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

xi

CONTENTS

xii

List of Figures

2.1 Categorical data generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.2 Date generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.3 Vehicle generalization until suppression . . . . . . . . . . . . . . . . . . . . . . 152.4 Datafly pseudocode (Source: [Swe02a] ) . . . . . . . . . . . . . . . . . . . . . . 192.5 Lattice of generalizations and generalization strategy (orange trace). . . . . . . . 212.6 Incognito pseudocode (Source: [LDR05]). . . . . . . . . . . . . . . . . . . . . . 222.7 Lattice evolution for Incognito algorithm (Source: [KPE+12]). . . . . . . . . . . 222.8 Flash algorithm lattice example (Source: [KPE+12]). . . . . . . . . . . . . . . . 232.9 Outer loop of the Flash algorithm (Source: [KPE+12] . . . . . . . . . . . . . . . 232.10 CheckPath(Path, Heap) (Source: [KPE+12] . . . . . . . . . . . . . . . . . . . . 242.11 FindPath(Node) (Source: [KPE+12] . . . . . . . . . . . . . . . . . . . . . . . . 242.12 Configuration file, taken from UTD Anonymization Toolbox manual (Source: [KIK16]) 262.13 ARX main frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262.14 ARX workflow (Source: [arxc] . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.15 ARX hierarchy wizard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.16 (1) Attribute properties configuration. (2) Privacy models selection. (3) Utility

metrics and suppression limit. . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.17 ARX solution exploring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.18 ARX anonymization result analysis . . . . . . . . . . . . . . . . . . . . . . . . 292.19 ARX tool API uml (Source: [arxb]) . . . . . . . . . . . . . . . . . . . . . . . . 30

3.1 Anonymization process flow chart . . . . . . . . . . . . . . . . . . . . . . . . . 363.2 Connect to mongo server and gather data flowchart (left). Parse dates from query

object flowchart (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383.3 Consumptions and Prescribed Medications representations after processment. . . 413.4 GUI anonymization process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453.5 Web service component diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . 473.6 Memory usage for 20k records (left) and memory consumption distribution (right). 49

4.1 Memory usage for 20k (left) and 400k (right) records. [SH] . . . . . . . . . . . . 524.2 Memory usage by object for 400k records. [SH] . . . . . . . . . . . . . . . . . . 524.3 Clustering architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.4 Client-cluster functioning and interaction flowchart. . . . . . . . . . . . . . . . . 56

5.1 Information loss vs k-anonymity for 20k (left) and 600k (right) records. . . . . . 675.2 Memory usage for 20k records (left) and memory consumption distribution (right).

[SH] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735.3 Memory usage tendency charts. . . . . . . . . . . . . . . . . . . . . . . . . . . 745.4 Memory consumption distribution for 20k records. [SH] . . . . . . . . . . . . . 76

xiii

LIST OF FIGURES

6.1 Create new connection (left). Start anonymization (right). . . . . . . . . . . . . . 906.2 Anonymization ended notification. . . . . . . . . . . . . . . . . . . . . . . . . . 906.3 List of results (left). Results for specific anonymization process (right). . . . . . 906.4 Information loss view. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 916.5 Web application dashboard with provided analytics. . . . . . . . . . . . . . . . . 926.6 Connect (left) and create the configuration file (right). . . . . . . . . . . . . . . . 936.7 Create query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 936.8 Preview and export results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

xiv

List of Tables

2.1 Patient’s diseases table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2 2-anonymity patient’s diseases table . . . . . . . . . . . . . . . . . . . . . . . . 72.3 Original glucose table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.4 22-anonymity glucose table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.5 Patient’s diseases table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.6 2-diversity patient’s diseases table . . . . . . . . . . . . . . . . . . . . . . . . . 112.7 Income table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.8 Anonymized income table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.9 Original table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.10 2-anonymization table with local recoding . . . . . . . . . . . . . . . . . . . . . 152.11 2-anonymization table with global recoding . . . . . . . . . . . . . . . . . . . . 152.12 Weight table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.13 Perturbed weight table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.1 Non-hierarchical structure version of Listing 3.2 . . . . . . . . . . . . . . . . . . 393.2 Results for an initial version of the anonymization process . . . . . . . . . . . . 48

4.1 Anonymization process analysis after optimization of data processment and saveresults phases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.2 Anonymization process analysis after optimization of data processment, save re-sults and validation phases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.3 Anonymization process analysis after clustering architecture implementation. . . 60

5.1 Prescribed Medications anonymization - Information loss for 20k, 100k and 600krecords using k=2, k=3 and k=5. . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.2 Prescribed Medications anonymization - Information loss for 20k, 100k and 600krecords using `=2, `=3 and `=5. . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.3 Consumptions anonymization - Information loss for 20k, 100k and 600k recordsusing k=2, k=3 and k=5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.4 Elapsed time for distinct values of k and `. . . . . . . . . . . . . . . . . . . . . . 715.5 Memory usage for distinct values of k and `. . . . . . . . . . . . . . . . . . . . . 715.6 Performance results for the initial version of the anonymization process using k=2. 725.7 Performance results for the initial version of the anonymization process using `=2. 735.8 Memory usage for 20k records using `=2. [SH] . . . . . . . . . . . . . . . . . . 735.9 Memory usage evolution with dataset size. . . . . . . . . . . . . . . . . . . . . . 735.10 Performance results for the streaming version of the anonymization process using

k=2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755.11 Memory usage evolution with dataset size. . . . . . . . . . . . . . . . . . . . . . 765.12 Memory usage tendency chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

xv

LIST OF TABLES

5.13 Performance using cluster based solution. . . . . . . . . . . . . . . . . . . . . . 775.14 Memory usage evolution with dataset size. . . . . . . . . . . . . . . . . . . . . . 785.15 Memory usage tendency chart for cluster based solution. . . . . . . . . . . . . . 785.16 Elapsed time comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795.18 Dataset size limitation comparison. . . . . . . . . . . . . . . . . . . . . . . . . . 795.17 Memory usage comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805.19 Quality metrics returned from anonymization applied to anonymized version of

Prescribed Medications dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 815.20 k and ` returned from anonymization Prescribed Medications dataset. . . . . . . 82

xvi

Abbreviations

EMD Earth Mover’s DistanceQID Quasi-IdentifierPrec Precision metricDM Discernibility MetricCM Classification MetricNCP Normalized Certainty PenaltyHIPAA Health Insurance Portability and Accountability ActRDBMS Relational Database Management SystemGUI Graphical User InterfaceUML Unified Modeling LanguageEHR Electronic Health RecordEMR Electronic Medical RecordOLA Optimal Lattice AnonymizationJSON JavaScript Object NotationPRMD Privacy Model

xvii

Chapter 1

Introduction

1.1 Context

The data that is routinely collected in most current organizations can be analyzed and used for

research to find trends and statistics. These trends and statistics help researchers and industries to

evolve and to find new possible solutions, improvements and support the decision of several layers

of Public Administration.

Healthcare is one of the main industries when talking about data sharing for research. The

analysis of clinical data, resorting to data mining methods, brings several advantages for the in-

dustry and the Health Administration. It helps finding correlations between metrics and attributes,

which may lead to new discovers, new possible causes for some diseases, and much other useful

information.

However, with all this data sharing comes the concern with individual privacy, which is granted

by law and of great importance. So, it is highly needed to find a way of sharing data while

preserving privacy. To achieve this, it is required to find ways to anonymize data, that balances

privacy and data utility.

From the already said, it is easily understood that this dissertation is focused on database

security and privacy, since it addresses ways of anonymizing clinical data to enable data sharing

for research purposes.

This dissertation will be developed along with Glintt - Healthcare Solutions. Glintt is one of

the biggest technological companies in Portugal and has a focus on healthcare solutions, in which

it is a national leader.

1.2 Motivation and Goals

The amount of clinical data is enormous and the analysis of the data collected brings great benefits

to all of us. For the pharmaceutical industry, this data is useful as it allows to gain access and

1

Introduction

knowledge about how effective some drugs are in the treatment of some diseases, it allows to find

out possible secondary effects that were not supposed and possible correlations of these effects

with attributes (age, genre, ...), and many other useful information. For clinical research in general,

there is an infinite amount of knowledge that can be collected from the analysis of clinical data. For

example, it may allow to learn possible causes and correlations of diseases with other factors, the

evolution of diseases in our society and why the evolution is happening, or even why a disease is

more common in some locations. For our society, this data is extremely important as the analysis

of data allows to gain more knowledge and, as consequence, to improve everyone’s healthcare.

Final, but not least, it is also important for companies that collect data, as it is an important income

source that allows this data to continue being collected and used to study and gather knowledge.

However, individual privacy is very important in our society and clinical data can not be provided

for research purposes if researchers can identify individuals. Consequently, it is clear the necessity

to find the best solution to de-identify individuals in a data set. However, this is not as simple and

as linear as it may seem, because there is a huge amount of external information with which

clinical data can be associated with to re-identify individuals. This way, it is essential to find the

best anonymization method to the problem.

Another important aspect is that, the more encryption and anonymization, the more informa-

tion loss. Therefore, it is important to find the optimal solution for an anonymization problem,

which is the optimal balance between information loss and amount of anonymization.

Also, data must be always available for medical and clinical entities. So, it is also important to

take attention to performance and find a way to not compromise the system while encryption and

anonymization process is running.

Electronic Health Records (EHR) include a wide range of data, such as age, diseases, medica-

tions, prescriptions, and many other. So, it is clear that exist many types of data - numeric, names,

... - and it is important that the solution is scalable and flexible enough to handle all these types of

data.

The method to solve this problem that was chosen is to create a solution that enables to load

a dataset, configure hierarchies for each type of data and identify all required quasi-identifiers,

identifiers and sensitive data. After that, the system will process the dataset and return a new

anonymized one. At the end of this process, data should be ready for sharing with minimal privacy

risk.

1.3 Dissertation Structure

Besides this introduction, this dissertation is divided into 6 more chapters.

In this chapter some concepts and existing work will be covered, which will help in the creation

of a solution for clinical data anonymization.

On chapter 2 some important concepts as well as existing work on the field of anonymization

are covered and analysed. Based on this analysis, on chapter 3 a solution is proposed to solve the

problem of clinical data anonymization and a first analysis on an initial version of this solution is

2

Introduction

done. In order to improve this initial version, on chapter 4 some optimizations are analysed and

proposed.

With the solution implemented, on chapter 5 a study on the impact of privacy models in the

process of anonymization as well as an analysis and comparison of all implemented versions is

done.

On chapter 6 the web application and the desktop application that resulted from the proposed

solution are presented and explained in more detail.

Finally, on chapter 7 final conclusions on this dissertation are presented and some future work

proposed.

3

Introduction

4

Chapter 2

Anonymize Data for Privacy

With the growth of technology and data collection, came the necessity of sharing data for research

purposes. However, with this, came the concerns about individual privacy and the necessity of

anonymizing data.

This topic has been under intense research in the last years in order to try to minimize the risks

of breaking individual privacy.

In this chapter some concepts and existing work will be covered, which will help in the creation

of a solution for clinical data anonymization.

2.1 Principles

First of all, it is important to clarify the concept of data anonymization. Data anonymization is a

process in which a database that contains information about real people is converted into a new

database that contains the same information, but in which is not possible to identify any specific

person. [Rag13]

Three important characteristics that describe an anonymization problem are privacy model,

data model and quality metrics. [Pod11]

Privacy models are conditions that a database must satisfy to be considered anonymized.

The process of data anonymization always implies loss of information. There are several

methods that allow to anonymize a dataset but the best method for a problem is the one that less

decreases data utility due to information loss. Quality metrics are used to evaluate the quality

of the data generated by the anonymization process. These metrics are used to choose the best

version of the anonymization process. [Pod11, LGD12].

There are three types of disclosure, which privacy data publishing aims to protect against:

[KS+13]

1. attribute disclosure, in which sensitive information is discovered.

2. identity disclosure, in which an individual is associated to a row in the anonymized database.

5


3. membership disclosure, in which it is determined if an individual is member of the anonymized

dataset.

2.1.1 Basic Definitions

In this section some important definitions will be presented in order to better understand the next

sections.

In anonymization problems, and more in the specific case of clinical data anonymization, there

are 3 special types of attributes that must be well defined in datasets: sensitive attributes, identifiers

and quasi-identifiers. Identifiers, defined at 2.1.2, are attributes that identify an individual without

the need of external information, such as names. Quasi-identifiers (QID), defined at 2.1.3, are

attributes that can identify an individual if external information is available and used. An example

of a QID is a diagnosis code, which does not directly identify a patient but if it is a rare diagnosis,

then an attacker may infer to whom the diagnosis correspond. A QID can lead to the correct

association of a record with an individual, also known as identity disclosure. [BRK+13]

Sensitive attributes, defined at 2.1.1, are attributes that are not supposed to be linked to an

individual, such as a rare disease. [GDL15]

It is also important to keep in mind the Definition 2.1.4 of equivalence class as it is an important

part of most privacy models that will be covered in Section 2.2.

Definition 2.1.1 Sensitive attributes

An attribute that must be kept secret in the anonymized database. [SXF10]

Definition 2.1.2 Identifiers

An attribute which explicitly identifies record owner. These attributes are typically excluded

from the dataset. [SXF10]

Definition 2.1.3 Quasi-identifier (QID)

An attribute that can identify an individual in the database when combined with external data.

[KS+13]

Definition 2.1.4 Equivalence class

Group of rows in which the quasi-identifiers in the selected group of columns have exactly the

same values. [EEDI+09]

2.2 Privacy Models

During the last years, with the growth of data and the concerns about individuals privacy, multiple

privacy models have been proposed and developed to minimize individual’s privacy risk.

The most well-known models are:

• k-anonymity [Swe02b] and some variations of it

6


• `-diversity [MKGV07, LLV07]

• t-closeness [LLV07]

• o-Presence [NC10]

These models will be analysed and explained in the following sections.

2.2.1 k-anonymity

k-anonymity was introduced by Latanya Sweeney in 2002 to solve the problem of producing

a release of data that contains useful data about individuals while those individuals cannot be

identified. [Swe, Swe02b].

This privacy model tells that each tuple of QIDs must appear a minimum of k times in order

to achieve anonymity [BFW+10, Tho07, AFK+06]

Definition 2.2.1 k-Anonymity

A table is said to achieve k-anonymity if every tuple of quasi-identifiers appear at least k times

in that table. [Swe02b]

k-Anonymity is the most common and the most used privacy model because the results are sat-

isfactory for most problems and it is the most practical and the easiest to achieve in most scenarios.

In order to return good results, quasi-identifiers must be well defined by data publishers.

For better understanding of this model, an example will be shown and explained next. The

example is based on the example given by Benjamin C. M. Fung et al. [BFW+10].

Location Sex Age DiseasePorto Male 34 Cancer

Manchester Male 39 CancerLondon Male 38 DiabetesPorto Male 35 Diabetes

Lisbon Female 24 HepatitisLisbon Female 28 DiabetesTable 2.1: Patient’s diseases table

Location Sex Age DiseasePortugal Male [30-40] CancerEngland Male [30-40] CancerEngland Male [30-40] DiabetesPortugal Male [30-40] DiabetesPortugal Female [20-30) HepatitisPortugal Female [20-30) Diabetes

Table 2.2: 2-anonymity patient’s diseases table

7


Example 2.2.1 Suppose Table 2.1 was intended to be provided for research purposes.

In this table, if an attacker knows that someone lives in Porto and is 34 years old, then the

attacker will infer that person has Cancer (row 1).

To prevent this to happen, Table 2.1 was anonymized with k-Anonymity model, resulting Table

2.2. This new table is 2-Anonymous.

In this anonymized table we notice that every tuple of quasi-identifier - Location, Age and Job

- appears at least k times. The tuples are:

• (Portugal, Male, [30-40])

• (Portugal, Female, [20-30))

• (England, Male, [30-40])

If we analyse this table we notice that some generalizations on quasi-identifiers were made:

1. Location - For 2-Anonymity, we needed at least 2 equal values for attribute location, which

is not fulfilled in the initial table due to London and Manchester. So, this quasi-identifier

is generalized according to an hierarchy, defined by the data publisher. In this case, it is

generalized to Portugal and England.

2. Sex - For 2-Anonymity, we needed at least 2 equal values for attribute sex. This is fulfilled

in the initial table, so there is no need to make any changes to this attribute.

3. Age - In the initial table, we had the patient’s exact age. For 2-Anonymity, we needed at

least 2 equal values for the age, which is not fulfilled in the initial table. So, this QID is

generalized, creating groups: [20,30) and [30,40].

Now, after anonymizing the table, each distinct group of QID appears at least 2 times. So, if an

attacker knows that someone lives in Porto and is 34 years old, there will be at least 2 possibilities

and so, it is not as simple to find out who has which disease.

To achieve k-Anonymity some approaches are possible. The most well known are:

1. Generalization

2. Suppression

These approaches are better explained in Section 2.4.

2.2.1.1 Attacks against k-anonymity

Although k-anonymity is the most common and the most used privacy model, it is vulnerable to

attacks, even when k is high and the quasi-identifiers are selected with accuracy and care.

These attacks can be summarized into the following:

8


1. Unsorted matching attack - Records are saved into the database sequentially. This way, the

order in which the tuples appear on the database can be used as an attack. This type of attack

is easily solved by shuffling the anonymized table. [Swe02b]

2. Complementary release attack - Some generalizations of attributes may have the same value

or meaning as the initial value, which makes them similar. When this happens, joining the

anonymized table with an external information will lead to easily find correlations. Exam-

ple: If the quasi-identifier full date is generalized into the year, then the attacker may still

find out a relation between the anonymized table and an external table because attributes are

similar in both tables. [Swe02b]

3. Homogeneity Attack - Records with similar tuples of QIDs may have the same sensitive

attribute.

Example: Suppose an attacker is able to find out which rows in the table may correspond to

an individual. Now, suppose one of the attributes in the table is Disease and all rows that

may correspond to that individual have the same value for Disease. The attacker infered with

100% sure the disease the person has, even though he/she doesn’t know the exact record.

[MKGV07]

4. Background knowledge attack - An attacker may have external information about an indi-

vidual that removes possible rows for that individual. [MKGV07]

Example: Suppose an attacker knows that an individual corresponds to one of three rows

in the anonymized database. From those 3 rows, a column Disease only has two distinct

values. So, the attacker infers that the individual has one of two diseases. However, the

attacker also knows that the individual does not have one of the diseases. So, it is only left

one disease and with this background knowledge, the attacker finds out the disease a specific

individual has.

2.2.2 (δ , k) anonymity

(δ , k) anonymity is a simple model that extends k-anonymity, described on Section 2.2.1. In

this model, no sensitive value may have a frequency greater than δ within an equivalence class.

[Pod11]

2.2.3 km anonymity

As well as (δ , k) anonymity, km anonymity is a simple model that extends k-anonymity, described

on Section 2.2.1. This model intends to ensure that it is required more than m information values

about an individual to identify less than k records. [GAZ+14]

As an example of this model, consider the following table:

9


Name Glucose ValuesJohn 75, 80, 100, 95

Joshua 150, 144Mary 80, 95, 125Peter 144, 130, 125

Andrew 75, 130, 125Table 2.3: Original glucose table

Now suppose an attacker knows that John had the following glucose values: 75mg/dl and

100mg/dl. Even if all names are removed from Table 2.3, the attackers knows exactly which

record corresponds to John. To prevent from this to happen, consider the 22-anonymity Table 2.4.

In this new table, even with those two informations about John, the attacker can not identify which

record corresponds to John, having 3 possibilities.

Name Glucose Values* [70-99], [70-99], [100-130], [70-99]* [130-150], [130-150]* [70-99], [70-99], [100-130]* [130-150], [130-150], [100-130]* [70-99][130-150], [100-130]Table 2.4: 22-anonymity glucose table

2.2.4 Bayes-Optimal Privacy

Every attacker has different levels of knowledge that can be useful to discover sensitive informa-

tion. Bayes-Optimal Privacy is based on conditional probabilities, which are used to represent the

knowledge of an attacker. This model has the concept of prior and posterior belief, which are the

beliefs of an attacker knowing a sensitive value. In order to guarantee Bayes-Optimal Privacy the

difference between prior and posterior belief must be low. [Ker13, Pod11]

2.2.5 `-diversity

`-diversity was proposed by Machanavajjhala et al. in 2007 and it is based on the Bayes-Optimal

Privacy (explained in Section 2.2.4).

This principle, defined at Definition 2.2.2, tells that there must be a minimum of ` distinct

values on sensitive attribute for every group of QIDs. When this is achieved, a table is said `-

diverse. [MKGV07]

k-anonymity is satisfied by this privacy model with k = `. This happens because every tuple of

quasi-identifiers must contain ` distinct sensitive values and as consequence every tuple appear at

least ` times. [BFW+10]

Definition 2.2.2 A table is `-diverse if every equivalence class in that table has a minimum of `

distinct values on sensitive attributes. [MKGV07]

10


In a `-diverse table, it is needed at least `-1 samples of information about an individual to

discover the row that corresponds to that individual. [Pod11]

Age Sex Disease37 Male Heart Disease34 Female Cancer31 Male HIV24 Female Cancer26 Male Heart Disease

Table 2.5: Patient’s diseases table

Age Sex Disease[30-40] * Heart Disease[30-40] * Cancer[30-40] * HIV[20-30) * Cancer[20-30) * Heart Disease

Table 2.6: 2-diversity patient’s diseases table

Example 2.2.2 Consider the patients records shown in Table 2.5 and the corresponding 2-diverse

table in Table 2.6.

The sensitive attribute is Disease and there are 3 possible values for this attribute - "Heart

Disease", "Cancer" and "HIV".

In the 2-diverse version, 2 equivalence classes were created:

1. Age: [30-40] and Sex: *

2. Age: [20-30) and Sex: *

Each of these equivalence classes contain at least 2 distinct sensitive values:

1. First equivalence class contains "Heart Disease", "Cancer" and "HIV"

2. Second equivalence class contains "Heart Disease" and "Cancer"

This satisfies `-diversity model with ` = 2.

2.2.5.1 Instantiations

Machanavajjhala et al. proposed three variations of `-Diversity:

1. Distinct `-diversity - this definition is the simplest one and just ensures that at least ` distinct

values appear in each equivalence class. [LLV07]

2. Recursive (c-`)-diversity [LLV07] - this variation tries to decrease the difference between

the frequency most common values appear and the frequency less common values appear.

3. Entropy `-diversity - this variation takes advantage of the concept of Entropy and it tells

that a table is Entropy `-diverse when each equivalence class has an entropy greater than

log(`).

11


2.2.5.2 Limitations

`-diversity has some weaknesses and limitations that are easily understood. [LLV07]

First of all, `-diversity may be hard to accomplish in some cases. In a dataset with huge

amount of records and a sensitive attribute which has a value that rarely appears, `-diversity is

really hard to achieve and the information loss would be large. [LLV07]

Second, `-diversity can not prevent attribute disclosure. Even with ` distinct values on

each equivalence class for the sensitive attribute, it is not ensured that the values present in the

sensitive attribute are not similar or related. If they are similar, then the attacker may infer a valid

disclosure. For example, imagine that all sensitive values for an equivalence class are related to

a heart disease. Although all values are distinct, the attacker can infer with 100% sure that an

individual has heart problems.

2.2.6 t-closeness

t-closeness, proposed by Ninghui Li et al. in 2007 [LLV07], can be considered as an improvement

of `-diversity.

This model intends to reduce the level of detail of sensitive attributes. To achieve this, the

distribution of a sensitive atrribute within an equivalence class and within the entire table must not

have a distance of more than t.[LLV07] To calculate this distance between the two distributions, it

is used the earth mover’s distance (EMD). [RTG00, arxd]

As in the previous models, for a table to be t-closeness, every classes must accomplish t-

closeness.

This privacy model is difficult to achieve in some situations and it is even not needed to protect

data in most cases as other simpler privacy models are enough.

2.2.7 δ -Presence

δ -Presence evaluates the probability of an individual to be in the generalized table. A dataset is

δ -present when this probability is within the range (δmin , δmax). [NC10]

This model is intended to prevent membership disclosure.

2.3 Data models

Data models are another key part when choosing an algorithm for data anonymization. In what

concerns anonymization, data can be of two main types - Numeric, Categorical. Recently, ap-

peared the concept of Set-valued as data model.

12


2.3.1 Numeric data

An attribute is considered numeric data for anonymization purposes when it there is strong order

among elements. An example of numeric data is Income (Tables 2.7 and 2.8 ).

Person Id Income1 30K2 35K3 27K4 23K5 38K

Table 2.7: Income table

Person Id Income1 [30K - 40K]2 [30K - 40K]3 [20K - 30K]4 [20K - 30K]5 [30K - 40K]

Table 2.8: Anonymized income table

2.3.2 Categorical data

An attribute is considered categorical data for anonymization purposes when there is no reasonable

order among elements. An example of categorical data is Disease. This type of data is often

arranged hierarchically and can be described at different levels of generality. An example of

generalization of this type of data can be seen at Figure 2.1.

Figure 2.1: Categorical data generalization

2.3.3 Set-valued data

Set-valued data tells that an individual is associated to a set of items. A logical record has the form

(indId, {item1, . . ., itemm}) and the representation is conceptually easy: Pindividual = item1, item2,

item3. [HN09, TMK08]

2.4 Approaches to anonymity

In order to achieve anonymity on a database, there are three main approaches:

1. Generalization

2. Suppression

3. Perturbation

13


2.4.1 Generalization

Generalization is one of the approaches to anonymize data. This approach consists on replacing

values with more general values. These generalized values are a set of values that contain the initial

value. [KT12] In other words, there is a hierarchy of values where each value has a corresponding

value in the next generalization level. When a generalization is done, a value is replaced by its

correspondent in the next most general level. [TMK11]

Generalizations tend to have less information loss when there are more levels in their hierarchy.

This is easily understandable as more levels mean that each level has greater precision.

In Figure 2.2 there is an example of a generalization hierarchy where the most generalized

level is [1960-1980] and the less generalized is a singular year.

Figure 2.2: Date generalization

Generalization can be done using two distinct methods: global recoding and local recod-ing. Global recoding maintains more consistency along the dataset, while local recoding enables

anonymization with smaller loss of information.

Local recoding generalizes individual tuples. In this model, each column is generalized inde-

pendently to its hierarchicaly generalization. For example, if the weight 65kg is present in several

records of the database, it may be generalized to [60-65], to another value or even not suffer

changes. [KT12, XWP+06]

Global recoding generalizes every individual to a unique value of the hierarchicaly gener-

alization. In other words, each value in the dataset is generalized the same way. [KT12] For

example, if the weight 65kg appears in several records, it will be generalized the same way in all

records.

These two models for generalization are represented in Table 2.10 and 2.11. Let’s focus on the

value 65 for attribute weight. We notice that:

1. in Table 2.10 it was generalized into [60-65] (row with Id = 1) and into [65-70] (row with

Id = 2). This is possible because it was used local recoding model.

14


2. in Table 2.11 it was only generalized into [60-70]. This happens because it was used global

recoding model and, so, all values in the table were generalized the same way.

Id Weight1 652 653 604 575 696 55

Table 2.9: Original table

Id Weight1 [60-65]2 [65-70]3 [60-65]4 [55-60]5 [65-70]6 [55-60]

Table 2.10: 2-anonymization table withlocal recoding

Id Weight1 [60-70]2 [60-70]3 [60-70]4 [55-60]5 [60-70]6 [55-60]

Table 2.11: 2-anonymization table withglobal recoding

2.4.2 Suppression

Suppression consists in entirely removing a value from the dataset. This can be seen as a general-

ization, where the value is generalized to suppressed. This occurs when each value generalization

increments until no more generalization can occur and the value needs to be suppressed to not

include it in the database. [Swe02a]

In Figure 2.3 there is an example of an hierarchy where the last level of generalization is

suppression. Suppression happens when generalization is not sufficient to achieve k-anonymity.

Figure 2.3: Vehicle generalization until suppression

15


2.4.3 Perturbation

Data perturbation is a technique in which data in a record is changed to an insignificant value. This

approach makes harder to re-identify any individual as data is perturbated and changed. Also, even

if an individual record is re-identified, there is no certainty that the data on that record corresponds

to the initial data.

Micro-aggregation is a method that allows perturbation. This method consists in grouping

values in small aggregations and replacing those values with the average of the small aggregation

the value is in. The records that belong to the same aggregation will be represented with the same

value in the released dataset. [ASMN12]

Id Weight1 902 873 854 685 666 697 508 509 52

Table 2.12: Weight table

Example 2.4.1 The above table presents the weight of 9 people. Using the method micro-aggregation

for data perturbation, 3 groups are created: [1-3], [4-6] and [7-9]. With these 3 groups created,

the average for the weight is calculated for each group: [87.33], [67.67] and [50.67]. Then, the

perturbated table looks like the following:

Id Weight1 87.332 87.333 87.334 67.675 67.676 67.677 50.678 50.679 50.67

Table 2.13: Perturbed weight table

There are several other techniques for perturbation, which depend on many factors - level of

anonymity required, type of data, ... -, such as: [ASMN12]

16


1. Data swapping - consists in altering records by swapping values with selected pairs of

records (swap pairs),

2. Adding noise - consists in adding a random value to all values in the attribute to be protected.

3. Post-Randomization Method (PRAM) - for protecting categorical variables. This method

swaps the values for selected variables based on a transition matrix, which specifies the

probabilities to swap a value with another value. An example of this technique is better

explained in Example 2.4.2. [GKWW98, TMK13]

Example 2.4.2 PRAM

Suppose the variable disease has 4 possible values: "cancer", "hiv", "diabetes" and "heart

disease". A possible transition matrix could be

Ai, j =

0.2 0.6 0.1 0.1

0.8 0.2 0 0

0.1 0.5 0.2 0.2

0 0.3 0.5 0.2

The transition matrix has the probabilities of each variable being swapped by another. From

the matrix it is possible to notice that pcancer,diabetes = 0.1; phiv,diabetes = 0, which means that "hiv"

will not be swapped by "diabetes" in any situation.

2.5 Quality Metrics

As explained in the previous section, anonymization is reached by generalizing and suppressing

data. However, with this generalization and suppression, datasets tend to lose quality and infor-

mation. It is important not to forget about the quality of generated data and try to find a balance

between anonymization and information loss.

Quality metrics of anonymization are an important part of this process and can be expressed

in the form of information loss and utility. [RK13] This section discusses and shows up some of

these quality metrics.

2.5.1 Discernibility metric (DM)

Discernibility was proposed by Bayardo et al in 2005 [BA05] and penalizes for suppressed rows.

[KPK15]

The discernibility metric penalizes each record based on how many records are indistinguish-

able from it in the anonymized dataset. The dataset total penalization is equal to the sum of the

penalization of each record. The penalty applied to a tuple is:

Penalty(r) =

j, Unsuppressed tuple is present in an equivalence class with size j

|D|, Suppressed tuple. |D| is dataset’s size(2.1)

17


Penalties on this metric are applied based on the equivalence classes size and not based on in-

formation loss. It supposes that the information loss will be proportional to the size of equivalence

classes. [Pod11]

2.5.2 Classification metric (CM)

Each record is assigned a class label as an additional attribute. It evaluates and tries to ensure that

generalization and anonymization does not weaken too much the distinction of existing classes

using QID.

Classification metric, according to Iyengar, was proposed as: [Byu07, BKBL07]

CM = ∑allrows

Penalty(r)N

(2.2)

Where:

r = specific record

N = number of records.

Penalty(r) may assume one of the following values:[Byu07, BKBL07]

Penalty(r) =

1, r 6= majority class in the equivalence group

1, r is suppressed

0, other situations

(2.3)

This metric is suitable when the anonymized data is used to train a classifier. [GKKM07]

2.5.3 Precision (Prec)

The Precision (Prec) metric was introduced by Sweeney [Swe02a] with the aim to evaluate the

information loss. This information loss is evaluated taking into account the amount of modification

of a table caused by anonymization.

This metric consists in calculating the ratio between number of generalizations made and total

of possible generalizations. This ratio gives the information loss for a specific variable. [EEDI+09,

Swe02a]

2.5.4 Normalized Certainty Penalty (NCP)

This metric, which was proposed by Jian Xu et al. [XWP+06], is based on the approximation of

generalized entries to the original ones, which allows to evaluate the information loss caused by

anonymization process.

18


As it is based on the approximation of entries to the initial values, it separates the definition

into the two main types of data models: numeric and categorical. Each of these main types has a

specific way to be evaluated.

2.6 Algorithms

Anonymization is a topic of intense investigation and a wide range of algorithms to achieve

anonymization have already been proposed. These algorithms use privacy models, explained at

Section 2.2, to check whether a data set is anonymized or not. The most recent algorithms also

take into account the quality of generated data.In order to find the best balance between informa-

tion loss and anonymization, they use quality metrics to evaluate the quality of data and select the

best solution according to that metric.

In this section, some algorithms used to achieve k-anonymity will be stated.

2.6.1 DataFly

Datafly was proposed by Latanya Sweeney and aims to provide anonymity in clinical data [Swe98].

Figure 2.4: Datafly pseudocode (Source: [Swe02a] )

This algorithm achieves k-anonymity using global recoding as generalization model and it is

based on greedy algorithms. It has few steps, which are demonstrated in Figure 2.4 [Swe02a]:

1. Construct list of frequencies for QID, where each element corresponds to at least one tuple

on the database.

2. While there are elements that occur k or more times in the list, the attribute with more

distinct values in that list is generalized.

3. After generalization loop ends, any element on the list that occurs less than k times is sup-

pressed.

4. At the end of the algorithm, an anonymized table is constructed and returned.

19


Datafly has good performance when compared to other algorithms and solutions. However,

this algorithm performs unnecessary generalizations and as so, it does not provide an optimal

solution even the solution satisfies k-anonymity. This is one of the biggest problems with Datafly

as it may generate a solution with high information loss. [Swe02a, EEDI+09, ARMCM14]

2.6.2 Optimal Lattice Anonymization (OLA)

Generalization hierarchies for the QIDs tuples can be represented as a lattice, in which each node

represents a possible version of the database The optimal solution for the anonymization prob-

lem is within the lattice, corresponding to one of the nodes. Figure 2.5 represents a lattice of

generalizations for QID <d, e, f>.

As more generalization is done and as we go up on lattice levels, the suppression percentage

reduces. Although it seems confusing at first glance, it is easy to understand. When we are at level

0 of the lattice, there is no generalization. In this case, the quantity of distinct values is enormous

and a lot of suppression is needed in order to achieve k-anonymity. However, at level 3 of a lattice

some generalization has already been made. In this new case, there is not as much distinct values

as in the first scenario and it is easier to achieve k-anonymity without much suppression.

Another thing that is important to understand is that suppression only affects a single record,

while generalization affects many records at the same time. For this reason, suppression is pre-

ferred to generalization. However, suppressing all records increases information loss and, due to

this situation, it is important to impose a limit on suppression. This limit will be referenced as

MaxSup. [EEDI+09]

A node in the lattice will be globally optimal if the amount of suppression is less than MaxSup,

if it fulfills k-anonymity requirements and if it has the best value for the selected quality metric.

OLA algorithm, proposed by El Emam et al. [EEDI+09], aims to find the optimal solution for

the anonymization problem, which corresponds to a node in the lattice - the node with minimal

information loss. This algorithm consists in three steps, described in the article published by its

author [EEDI+09]:

1. Find all nodes that achieve k-anonymity on each generalization strategy. This is done by

using binary search method.

2. The node with lower level in the lattice and that achieves k-anonymity is selected as k-

minimal node in the generalization strategy. The worst case occurs when the only node that

achieves k-anonymity corresponds to suppression (higher level). From example in Figure

2.5, suppose nodes <d2, e1, f0> and <d1, e1, f0>, which are in the same generalization

strategy, are k-anonymous. As <d1, e1, f0> is below <d2, e1, f0> in the lattice, it means that

<d1, e1, f0> will have less information loss. In this generalization strategy, k-minimal node

is <d1, e1, f0>.

20


Figure 2.5: Lattice of generalizations and generalization strategy (orange trace).

3. After having all k-minimal nodes, they are compared with each other based on the quality

metric selected. This allows to compare the information loss of each node. The optimal

solution is the node with less information loss.

In order to minimize computational costs and improve performance, the algorithm imple-

ments predictive tagging. This predictive tagging consists in tagging whether or not a node is

k-anonymous after being analysed. Suming to this, in a generalization strategy, all nodes above a

k-anonymous node are marked as k-anonymous in the lattice. This tagging allows to not evaluate

several times the same node and as consequence, improve algorithm’s performance. [EEDI+09]

2.6.3 Incognito

Incognito algorithm was developed and proposed by LeFevre et al. in 2005. [LDR05] This

anonymization algorithm also uses the concept of lattice to find the optimal solution, exposed

in Section 2.6.2.

Incognito creates lattices for all possible combinations of QIDs, starting from single-attributes

subset and stopping with lattices of all QID. [LDR05] I.e., suppose we have 3 QIDs - a, b and c.

Incognito would start by creating a lattice for each a, b and c. After analysing these lattices, 3 new

ones would be created and analysed: (a, b), (a,c) and (b,c). And so on until no more combinations

are possible. At the end, the k-minimal nodes are compared and the best one is returned.

Similarly to OLA algorithm, Incognito uses tags to prevent nodes from being evaluated several

times. When a node is k-anonymous in a generalization strategy, it is tagged as so and all nodes

with higher height in the lattice are also tagged as k-anonymous.

Another optimization this algorithm takes advantage of occurs when a node does not achieve

k-anonymity on smaller subsets. If a node does not achieve k-anonymity on smaller subsets, then

21


Figure 2.6: Incognito pseudocode (Source: [LDR05]).

for larger subsets that node will also not achieve k-anonymity. This means that some nodes in

lattices of larger subsets of QID can be suppressed and so, the computational effort for evaluating

it is less than if all nodes were needed to be evaluated. [EEDI+09, LDR05]

For evaluating the lattice, Incognito algorithm starts from the bottom nodes and moves up-

wards using Breadth-First Search. Similarly to OLA, as it passes on the nodes, it tags them to

prevent repeated evaluations.

The pseudo-code, taken from the article published by its author [LDR05], for this algorithm

can be seen in Figure 2.6 and the evolution of the algorithm on the lattice can be seen on Figure

2.7, taken from [KPE+12].

Figure 2.7: Lattice evolution for Incognito algorithm (Source: [KPE+12]).

22


Figure 2.8: Flash algorithm lattice example (Source: [KPE+12]).

2.6.4 Flash

Flash algorithm was proposed by Kohlmayer et al.. This anonymization algorithm uses the same

concept of lattice as Incognito (Section 2.6.3) and OLA (Section 2.6.2). According to its author,

this algorithm goes over the lattice using a bottom-up breadth-first approach. [KPE+12]

This algorithm uses a greedy depth-first strategy and the lattice is traversed vertically. It uses

predictive tagging to reduce the number of nodes to be examined. Flash algorithm iterates through

every node and finds the path from that node to the next node that only has tagged successors. If

that condition is not fulfilled, the path will be from the node in question to the top node. The cre-

ated path is checked using binary search so that anonymous nodes are tagged and non anonymous

are added to the heap. After the path check is done, the algorithm continues to next iteration using

the heap nodes.

The algorithm ends when top level is the starting node. The algorithm pseudo-code can be

found in Figures 2.9, 2.10 and 2.11, retrieved from its original article. [KPE+12]

Figure 2.9: Outer loop of the Flash algorithm (Source: [KPE+12]

23


Figure 2.10: CheckPath(Path, Heap) (Source:[KPE+12]

Figure 2.11: FindPath(Node) (Source:[KPE+12]

2.7 Tools

Currently, there are already some tools to solve the problem of anonymization. In this section,

some tools will be addressed:

• PARAT [Inc] is the leading de-identification software. However, it is closed source and the

available information to the public is limited.

• Open Anonymizer [ope] is an open source tool which uses k-anonymity to protect sensitive

data by generalizing data records.

• µ-Argus [uar] is a non-commercial software that implements many techniques. However,

this tool is closed-source.

• UTD Anonymization Toolbox [utd] is a toolbox with several anonymization methods imple-

mented.

• ARX [arxd] is an open source graphical user interface (GUI) software for data anonymiza-

tion.

From the list of tools presented above, UTD Anonymization Toolbox and ARX will be covered

in more detail nextly, as they are free and open source and may be important for future work on

this project.

2.7.1 UTD Anonymization Toolbox

UTD Anonymization Toolbox is a compilation of several anonymization methods, implemented by

UT Dallas Data Security and Privacy Lab. As it is public, this toolbox can be used and extended

by researchers for their own purposes. [KIK16]

Currently, it supports 3 privacy models by implementing 6 different anonymization methods:

[utd]

• Datafly

• Incognito

24


• Incognito with `-diversity

• Incognito with t-closeness

• Anonymity

• Mondrian Multidimensional k-anonymity

2.7.1.1 Input, Output and Configuration

This toolbox not only has the implementation of the anonymization methods, but also allows to

anonymize some dataset if desired.

In order to anonymize data, the input needs to follow some rules and some configuration may

be needed.

As input, the toolbox only supports unstructured text files, such as CSV files. It is needed to

specify the input file organization in the configuration file.

The output and input format are, by default, the same.

The configuration is done using a XML structured file and starts with a root node config. The

root node has at most 5 children, which have all the information required to handle the provided

input: [KIK16]

1. input

2. output

3. List of identifiers (represented as id) - optional parameter

4. List of QIDs (represented as qid)

5. List of sensitive attributes (represented as sens) - optional parameter

An example of a configuration file, taken from this tool manual, can be seen in Figure 2.12.

2.7.1.2 Extending

This toolbox can be extended in order to add new methods. Every method extends the abstract

class Anonymizer. To extend this toolbox and implement new anonymization methods, it is simply

required to extend Anonymizer class and represent data according to the documentation. [KIK16]

This is useful for researchers as they can add their own methods and implementations to the

toolbox, using the already existing structure.

2.7.2 ARX - Powerful Data Anonymization Tool

ARX is an open source software for data anonymization. With an intuitive cross-platform GUI,

as seen in Figure 2.13, this tool can handle and anonymize large datasets. It also has a public

25


Figure 2.12: Configuration file, taken from UTD Anonymization Toolbox manual (Source:[KIK16])

API, which may be useful for other developers when implementing other software that require

anonymization.

Figure 2.13: ARX main frame

This tools has several features and supports: [arxa]

• Risk-based anonymization.

26


• Well known privacy models and quality metrics

• Data transformation using generalization, suppression and microaggregation.

• Data utility analysis.

2.7.2.1 Anonymization Process

The anonymization process centers in finding a balance between privacy and data utility for re-

search. With this concern, ARX tool is based on a 3-step data anonymization process, represented

in Figure 2.14: (1) configuring the model, (2) exploring the results and (3) comparing and ana-

lyzing the output and input data. Each step of this process will be explained in more detail in the

following sections.

Figure 2.14: ARX workflow (Source: [arxc]

2.7.2.2 Importing Data and Configuring

Currently, ARX supports importing three types of data: (1) character-separated values files (also

known as CSV), (2) Microsof Excel spreadsheets and (3) relational database management systems

(RDBMSs).

According to the documentation provided, before anonymizing the imported data, it is impor-

tant and necessary to configure at least: [PKLK14, arxa]

1. generalization hierarchies - hierarchies for each attribute should be created manually. The

tool has a feature to help the creation of those hierarchies, which is represented in Figure

2.15.

2. attribute properties - assign a type to each attribute. The type can be quasi-identifier, iden-

tifier, sensitive and insensitive.

3. privacy criteria - select the privacy model to use from the list of supported models.

4. utility metrics - select the metric to measure data utility from the list of available metrics.

5. suppression limit - maximum number of records that can be suppressed. It is recommended

to use 100% for this limit.

27


Figure 2.15: ARX hierarchy wizard

Attribute properties assignment, privacy criteria, utility metrics and suppression limit selection

are represented in Figure 2.16.

Figure 2.16: (1) Attribute properties configuration. (2) Privacy models selection. (3) Utility met-rics and suppression limit.

2.7.2.3 Solution Exploration

After a solution space has been classified, ARX tool allows the user to browse the lattice created

for the anonymization problem by using the exploration perspective, which is shown in Figure

2.17. This is useful to understand which and why a node was selected and applied by the tool in

the anonymization process.

28


Figure 2.17: ARX solution exploring

2.7.2.4 Transformed data analysis

Although ARX automatically finds the optimal solution according to the selected metric for data

utility, it provides a useful feature to analyse anonymized data and find out the utility of the trans-

formed data. [PKLK14]

This feature, which is represented in Figure 2.18, allows to compare statistical properties,

frequency distributions of values and many other metrics between input and output datasets.

Figure 2.18: ARX anonymization result analysis

2.7.2.5 Access via API

Another important part of this tool is the public API it provides. This API aims to provide

anonymization methods to other software system and facilitate the implementation of software

that needs anonymization.

In Figure 2.19, taken from the documentation [arxb], a Unified Model Language (UML) class

diagram for the API can be seen, with seven packages:

1. Data utility metrics - access available quality metrics

2. Solution space - access the solution space (lattice, nodes in the lattice,...)

29


3. Utility analysis - calculate utility of anonymization

4. Risk analysis - evaluate the risk of anonymization

5. Privacy criteria - access available privacy models

6. Data import & specification - import data from available input data types

7. Hierarchy creation - access to methods to create hierarchies.

These packages are accessed through the core classes ARXConfiguration, ARXAnonymizer

and ARXResult. Through all these packages that integrate the API, it is possible to use all features

provided by the GUI.

Figure 2.19: ARX tool API uml (Source: [arxb])

2.7.2.6 Limitations

Although it is a complete tool, with several privacy models and analysis features implemented, it

has some limitations that must be referenced.

Firstly, as input data it only supports CSV files, Microsoft Excel and RDBMS. NoSQL databases

are currently getting more and more common in the world of technology. Consequently, this is

a notorious limitation of this tool as it does not provide support for NoSQL databases, such as

MongoDB.

30


Secondly, according to Gkoulalas-Divanis [GDL15], as it uses globally-optimal search strat-

egy, it can only handle small search spaces. This limitation, however, is relative as it also depends

on several factors, such as hierarchies size.

In third place, it currently does not implement methods to support set-valued data, such as

km-anonymity. This is not, however, a big problem as most of anonymization problems can be

solved using the available models.

In fourth and last place, although it is possible to anonymize clinical data with this tool, it

requires high level of configuration to achieve good results. This may be an obstacle as it re-

quires good knowledge about clinical data and even with this knowledge, it may be complicated

to achieve the best and desired results as the configurations and restrictions may not be easy to

implement using the available GUI.

2.8 Clinical Data Anonymization

The HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule is a set of rules

needed to be taken into account when anonymization of clinical datasets is done. Two basic

methods for datasets anonymization are defined by this rule. The first requires the suppression of

a set of attributes. The second balances the quality of data and anonymization, using methods such

as k-anonymity. [PKLK14]

In order to anonymize clinical data and based on the requirements of the HIPPA Privacy Rule,

TransCelerate BioPharma Inc. proposed some models to handle specific types of data which will

be shown in the following section. [Inc13, oHS+03]

2.8.1 Recoding Identifiers

Individuals have an identifier associated to them in the database, which may allow to identify

the individual in case of knowledge of this identifier in the database. A possible approach to

anonymize this type of data is by recoding it. This can be done by generating random identifiers

which are not possible to reverse.

When recoding identifiers, it is important to use the same new identifiers across all datasets in

order to maintain relationship in the database. [Inc13]

2.8.2 Names, Contact information and Identifiers

According to the HIPAA Privacy Rule, all attributes that directly link a record to an individual

should be removed or set to blank. [oHS+03]

2.8.3 Age and Date of Birth

According to HIPAA Privacy Rule, individual’s date of birth and ages above 89 may compromise

anonymity. This way, date of birth should be removed and ages above 89 should be aggregated

into a single category - "above 89". [oHS+03]

31


2.8.4 Other Dates

Dates not related to age can also compromise anonymity. In order to guarantee anonymization,

these dates should be generalized or be replaced with a new date. This new date is calculated using

an offset. [Inc13] For example, suppose the date 1Jan2017 was in the database. To anonymize this

data using an offset of 90 days, a possible new date could be 18Fev2017.

2.8.5 Medical dictionaries

Medical dictionaries correspond to specific clinical data, such as names of diseases and names of

drugs. Although it does not directly identify an individual, it may lead to a correct identification

of an individual when associated to external data.

In order to reduce the risk of identity disclosure, this type data must be generalized during the

process of anonymization.

2.9 Summary

In this chapter, several concepts as well as related work were introduced within the context of

encryption and anonymization of data in general and clinical data in more specific. This helped to

clearly understand the topic of this dissertation.

We conclude that there is a large amount of research on this topic and that there are already

many privacy models and algorithms to achieve anonymity. Some scientific project and third-party

tools have already been implemented to help to solve this privacy problem. These tools allow to

anonymize relational databases, missing on the support to MongoDB databases, which are being

increasingly used.

Regarding clinical data, there are some data hierarchies, restrictions and data types (see Sec-

tion 2.8) that must be configured and taken into account when performing anonymization. Some

restrictions on data, which must be enforced when sharing data, were already defined by HIPAA.

From this, we can conclude that, in order to anonymize clinical data with the already existing

tools, it is required high level of configuration in order to implement these restrictions and add all

needed hierarchies.

32

Chapter 3

Solution

On Chapter 1 it is possible to conclude that sharing clinical data for research purposes brings

several benefits to pharmaceutical industry and clinical research in general. However, data sharing

brings several and critical problems that must be solved in order to allow the continuing of clinical

data share. Among these problems, the most important one, and in which all problems are based

on, is privacy disclosure.

In previous chapter, the state of the art on anonymization of data was analyzed. Given this,

it is possible to conclude that many models and algorithms already exist in the fields of data

anonymization. Some tools also have already been developed. However, some of these tools are

not free and the others are not focused on clinical data and require a lot of configuration, which is

not easy. Another point that is a big obstacle is the type of database those tools support - none of

them support MongoDB, which is in huge expansion.

With this in mind, in this chapter, a tool for anonymizing clinical data will be presented,

with the objective of providing the user with an easy to use interface that allows to quickly and

effectively anonymize clinical data that is stored in a MongoDB database. This tool will take into

account specific clinical data hierarchies and restrictions (e.g. diseases and drugs).

3.1 Requirements

As specified on Section 1.2, this dissertation aims to create a solution that makes clinical data

sharing possible. From a database, the solution must create a new anonymized database.

The solution must be applied to a Mongo database and should be scalable enough to be inde-

pendent of the collection structure. All documents inside the MongoDB collection must have the

same well defined structured in order to be anonymized. The results must be stored into a new

MongoDB collection with the same structure as the non-anonymized collection.

Another important requirement is the amount of data the solution can handle. The system in

which the process will run has limited amount of memory and processing units. So, the amount

of data that can be stored into memory during the process is limited. The solution must be able to

handle the maximum amount of data.

33

Solution

To reduce the amount of data, the anonymization process will run in a monthly basis. This is a

way of clustering the process without losing too much information. However, in order to find out

correlations between data, it must be possible to associate data inside the database. This means

that there must be a way to tell that two different records correspond to the same individual, but

without compromising its privacy.

In terms of speed and time of operation, it is essential to maintain a balance between the

amount of resources required by the system and the processing time. This means that the solution

should have good response time but this response time must be balanced with the resources needed

to complete the process.

In order to handle different collections, the solution must be easily configured by the database

administrator. This configuration must contain information about the attributes used by the anonymiza-

tion process, such as QID and sensitive attributes.

3.1.1 Database

As already specified before, the anonymization process will be applied to a MongoDB database

and more in specific to two different collections: Prescribed Medications and Consumptions.

Documents inside a collection must have a fixed structure over that entire collection.

3.1.1.1 Indexes

As specified in Section 3.1, the anonymization process will run with a monthly periodicity. This

means that the query to fetch the collection to anonymize will only use the creation date field. For

this reason, it is recommended to add an index to the attribute DateReg. This attribute contains the

date the record was created at. Both collections have this attribute.

3.2 Architecture

From the requirements analysis, three main modules were defined for a better final solution:

1. Anonymization process itself - receives as parameter a configuration file, a MongoDB

collection and the destination collection. This module fetches the collection, anonymizes it

and stores the anonymized version in a new mongo collection. Returns a result string with

information about the process (quality metrics, elapsed time, ...).

2. Anonymization GUI app - a simple GUI that allows to: (1) connect to a MongoDB col-

lection, (2) create a configuration file with an easy to use GUI, (3) start the anonymization

process, (4) preview the anonymized version and (5) export results to a MongoDB collec-

tion.

3. Anonymization web service - allows the administrator to automate the anonymization and

view some useful analytics. By using this module, the anonymization will run automatically

every month or week and the results can be later analysed using the web dashboard.

34

Solution

These three modules are explained and analysed in more detail in the next sections.

3.2.1 Anonymization Process

The requirements analysis and the analysis of the state of the art allowed to better understand the

problem and to define a better solution to solve the problem of clinical data anonymization.

On Section 2.7, some open source tools for anonymizing data were analysed. From these

tools, ARX implements several privacy models - such as k-anonymity and `-diversity - and Flash

algorithm, which uses the concept of lattice and it is the best in performance from all the anal-

ysed algorithms (see Section 2.7.2). This tool provides a Java API to access these implemented

privacy models and algorithm. This is useful for the solution as it allows to reduce the effort on

implementing the already existing algorithm and privacy models.

The API receives as input the dataset already processed, the required hierarchies and the con-

figuration of all attributes and privacy models. Then, it applies the algorithm to the dataset and

returns the anonymized version of the dataset. Before using the API, it is required to process

the MongoDB collection in order to create a dataset that can be used by the provided algorithm.

The API accepts a table in the form of an array, which is not compatible with the initial mon-

godb structure. It is also required to create the hierarchies and the configuration to be used by the

algorithm.

Taking this into account, the anonymization process is divided into eight steps: (1) gather data;

(2) process data; (3) initialize attributes; (4) anonymize data; (5) convert to the initial structure;

(6) save anonymized dataset, (7) quality metrics calculation and (8) k-anonymity and `-diversity

calculation.

These steps are represented as a flowchart in Fig. 3.1 and will be explained in more detail in

the next sections.

As input, the anonymization process receives the MongoDB connection information, the con-

figuration file location and the destination collection information.

Both the input and output MongoDB connection information contain the Host, Database and

Collection name.

The configuration is a JSON file and contains all required information for the anonymization

process. This configuration file is better covered on Section 3.2.1.1.

At the end of the entire process, it is returned a status message with information about the

process, such as time taken to anonymize, number of records anonymized, final k-anonymity and

`-diversity and quality metrics.

3.2.1.1 Configuration File

The configuration file, which comes in JSON format, is essential for the anonymization process

and it is this configuration file that contains information about all attributes and privacy models to

be used.

The configuration file contains several required fields (see Listing 3.1):

35

Solution

Figure 3.1: Anonymization process flow chart

1. qid - an object with all QID attributes. Each attribute must be linked with an hierarchy type

(see Section 3.2.1.4).

2. identifiers, sensitive and insensitive - All identifier, sensitive and insensitive attributes. The

format is the same as the qid but the value of each attribute is ignored - can be empty string.

Every sensitive attribute MUST be also configured in the anonymizationConfig as `-diverse.

3. excluded - attributes that are not needed in the anonymized dataset and that must be ex-

cluded from it.

4. suppress - attributes that must be completely suppressed in the anonymized dataset.

5. pseudonyms - attributes that must link records inside the database but which value should

be replaced by a non real value.

6. whereClause - MongoDB find clause that must be used to fetch collection from database.

7. anonymizationConfig - object that contains the privacy models to be used and its values.

An example of a configuration file can be seen in Listing 3.1.

36

Solution

1 {2 " e x c l u d e d " : {" _ i d " : " " } ,3 " s e n s i t i v e s " : {4 " P r e s c r i p t i o n . Code " : " "5 } ,6 " i n s e n s i t i v e s " : {7 } ,8 " q i d " : {9 " P a t i e n t . B i r t h Y e a r " : "YEAR"

10 } ,11 " i d e n t i f i e r s " : {12 " P a t i e n t . Name" : " "13 } ,14 " s u p p r e s s " : {15 " P a t i e n t . Genre " : " "16 } ,17 " pseudonyms " : {18 " P a t i e n t . Code " : " "19 } ,20 " whereClause " : "{ d a t e : { $ g t : ’2017−01−01 ’ , $ l t : ’2017−05−01 ’}}" ,21 " a n o n y m i z a t i o n C o n f i g " : {22 " p r i v a c y M o d e l s " : {23 " l−d i v e r s i t y " : {24 " P r e s c r i p t i o n . Code " : "2"25 } ,26 " k−anonymi ty " : "2"27 }28 } ,29 " c o l l e c t i o n s " : [ ]30 }

Listing 3.1: Example of a configuration file.

3.2.1.2 Gather Data

Firstly, it is required to get the collection to anonymize from a mongo server. To get the collection,

the process first connects to the database using the information passed as argument (host, database

and collection).

When the connection to the database server is established, the process connects to the database

in which the collection to anonymize is stored.

With the connection completed, the process finally parses the query passed on the configura-

tion file to create a valid query object. This object is then used by the mongo driver to fetch data.

During this parsing, an important point is the existance of dates. Dates in MongoDB are usually

ISODate objects and can not be compared with a simple date string. However, in the configuration

file it is not possible to pass an object and, for that reason, dates need to have special attention. For

this reason, when parsing, dates are identified and replaced by its corresponding ISODate object.

Finished the parsing and created the query filter object, the query is run and the results retrieved

into a list of mongo documents.

This process can be seen in the flowchart in Fig. 3.2.

This step takes advantage of an already existing Java driver, developed by MongoDB, that

allows to interact with a mongo database. [Mon]

37

Solution

Figure 3.2: Connect to mongo server and gather data flowchart (left). Parse dates from queryobject flowchart (right).

3.2.1.3 Process Data

From the previous step it is returned a list of mongo documents, which is an hierarchical object

similar to a JSON object. The API provided by ARX tool receives a multidimensional array of

strings, which is the representation of a table. The hierarchical object received from Mongo is

incompatible with this API representation of data. For this reason, it is required to process the data

received and serialize the list of mongo documents into a structure equal to the one used by the

API (array of strings).

In this phase of the anonymization process, the hierarchies are removed and a non-hierarchical

structure - equal to an SQL database table - is created. To achieve this, each column’s name in the

new data structure is the path to the attribute in the hierarchical structure. As an example, suppose

a mongo document equal to the one in Listing 3.2. In this example, we notice that the path to

patient’s gender is Visit -> Patient -> Gender -> Gender. This attribute will result in a column with

name Visit.Patient.Gender.Gender in the non-hierarchical structure. By applying this rule to all

attributes in the initial structure, it is created a non-hierarchical structure like the one in Table 3.1.

It is important to keep in mind that this is possible because one of the requirements for this

solution is that every mongo document in a collection must have exactly the same hierarchical

structure (see Section 3.1).

1 {2 {3 " _ i d " : O b j e c t I d ( " i d " ) ,4 " V i s i t " : {5 " P a t i e n t " : {6 " Gender " : {7 " Gender " : "F"8 } ,9 " B i r t h Y e a r " : {

38

Solution

10 " BeginYear " : 198911 }12 }13 } ,14 " P r e s c r i b i n g P h y s i c i a n " : {15 " P h y s i c i a n S p e c i a l t y " : "MEDICINA GERAL"16 } ,17 " P r o d u c t " : {18 " P r o d u c t I d " : "1"19 }20 }

Listing 3.2: Part of Prescribed Medications document.

_id Visit.Patient.Gender.Gender Visit.Patient.BirthYear.BeginYear PrescribingPhysician.PhysicianSpecialty Product.ProductIdObjectId("id") F 1989 MEDICINA GERAL 1

Table 3.1: Non-hierarchical structure version of Listing 3.2

During this phase of processment, not only the hierarchical structure is converted into a non-

hierarchical but also some adjustments and rules are applied apriori (see Algorithm 1):

1. suppression of attributes - some attributes must be completely suppressed, independently

of the rest of the attributes values. In order to reduce computational requirements during the

anonymization and in order to reduce the size of data, these attributes are removed during

this processment phase and are added after the anonymization completion. When adding

back these attributes, values are replaced with suppression character (*).

2. replacement with pseudonyms - another important aspect is that the anonymized data will

be used for research purposes. For this reason and as already specified in the requirements,

there must be a way to associate different records to the same individual inside the database,

while keeping anonymization. This can be achieved using pseudonyms.

Pseudonyms could be a randomly generated string that is associated to a value. Each

pseudonym is then stored with its association into a database for later usage in future

anonymizations. However, this solution is not safe because pseudonyms and its associa-

tions are stored in a database. If there is a leak of the database that contains these relations,

the anonymized database is compromised. To solve this issue, pseudonyms are generated

using encryption. By using an encryption algorithm, decrypting the pseudonym and finding

the real value is complicated and requires considerable computational resources. To make

the decryption even harder, it is used a salt, which is a string that is appended to the initial

value before the encryption is done. So, it is used SHA256 algorithm with a salt to encrypt

the real value and therefore generate the pseudonym.

Algorithm 1 contains the pseudo-code for this processing step. The algorithm receives a

mongo document (similar to a JSON object) and a path to that document; returns the processed

document. To generate the processed document, it iterates through the input document attributes:

1. If the attribute is to suppress, then it continues to next attribute

39

Solution

2. If the attribute is a sub-document, it adds the attribute name to the path and calls the al-

gorithm with the sub-document and the new path. This call to the algorithm returns the

processed sub-document which is then added to the final document.

3. If the attribute is not a sub-document, then the attribute name is added to the path, the value

for that attribute is calculated - if attribute is tagged as pseudonym, it generates pseudonym;

if it is not tagged as pseudonym, attribute value remains equal - and the key-value pair is

added to the final document.

The algorithm is called for every document in the list.

At this stage, we already have a list of documents without hierarchies. To create the required

structure - array of strings - it is just required to fetch the values for every attribute in each docu-

ment and add it to the array of strings.

Fig. 3.3 contains the representation of an already processed document for Consumptions and

PrescribedMedications.

Algorithm 1: processDocumentData: collectionDocument : Documentpath: StringResult: processedCollectionDocument : DocumentprocessedCollectionDocument : Documentfor documentAttribute : collectionDocument.keySet() do

if isToSuppress(documentAttribute) thencontinue;

endif isSubDocument(documentAttribute) then

path += documentAttribute;Document subCollectionDocument =

processDocument(collectionDocument.get(documentAttribute))for String s : subCollectionDocument.keySet() do

processedCollectionDocument.put(s, subCollectionDocument.get(s));end

endelse

path += documentKeyString valueForAttribute = collectionDocument.get(documentAttribute)if isPseudonym(documentAttribute) then

valueForAttribute = generatePseudonym(valueForAttribute)endprocessedCollectionDocument.put(path , valueForAttribute)

endend

3.2.1.4 Initialize attributes

After data processing is done, all attributes are initialized as Sensitive, Insensitive, QIDs and Iden-

tifier. These attributes initialization is defined in the configuration file provided to the anonymiza-

tion process.

40

Solution

Figure 3.3: Consumptions and Prescribed Medications representations after processment.

Each QID is associated to an hierarchy type. The process already has several predefined

hierarchies available:

1. Y EAR - Hierarchy for years. This hierarchy is created dynamically, starting from a base

year and creating groups of 5 years until current year is reached. Every level above this

initial one is created by joining two groups from the level below.

2. GENERAL - Replace each char in a string by *, from right to left.

3. DIAGNOSE_DESCRIPT ION - Diagnoses description hierarchy.

4. DIAGNOSE_CODE - Diagnoses codes hierarchy.

5. SPECIALTY _DESCRIPT ION - Specialty description hierarchy.

6. SPECIALTY _CODE - Specialty codes hierarchy.

7. NONE - No hierarchy specified. QID that contain this type of hierarchy can only be sup-

pressed.

Diagnose and specialty related hierarchies are stored in a CSV file. Every line of that file

corresponds to a distinct diagnose or specialty and levels of the hierarchy are separated by ";".

Sensitive attributes are used for `-diversity privacy model. For this reason, every sensitive

attribute must contain a value ` associated to it.

41

Solution

3.2.1.5 Anonymize Data

With the data structure created and the attributes initialized, the process can already call the API

provided by ARX tool. This API implements Flash algorithm, explained in Section 2.6.4, to

anonymize the dataset using the configuration and hierarchies provided. The API receives an

object with the data, hierarchies and its associations, and receives some configurations - privacy

models to be used, max suppression limit, ... From this input, the API generates the lattice based

on the hierarchies and applies the algorithm to that lattice. As result, it returns an object with an

array of strings that represents the data anonymized. It is also returned some information about

the algorithm process, such as generalization level applied to each QID.

3.2.1.6 Convert to the Initial Structure

At this point, the process already contains an anonymized data structure. However, this data

structure is not hierarchical. When the process finishes, it is expected to have the same hierarchical

structure as the initial one and it is expected to export those results to a new mongo collection. So,

it is required to convert back the data structure to the initial hierarchical structure.

In this step of the anonymization process, the hierarchical structure for the anonymized dataset

is created. This is achieved by reversing the name of each column, which is the path to that attribute

in the hierarchical structure. For every row in the anonymized dataset, a new mongo document is

created by reversing the column names and setting the value for that column in the corresponding

attribute.

Algorithm 2 contains the pseudo-code for the convertion to the initial structure. This algorithm

receives a document with the same representation as the anonymized array and returns a document

with the hierarchical representation. The algorithm iterates through every column of the document

and splits the name by character ".":

1. If the length equals 1, then it does not contain a sub-document and it adds the key-value pair

to the final document.

2. If length is greater than 1, then it gets the current document name - index 0 on the resulting

array from split operation - and creates a new document only with the columns that contain

the current document. This document represents the sub-document for the attribute. Repeats

the algorithm for the sub-document until no sub-documents are present. From this recursive

method, the initial sub-document will result in its hierarchical representation and is then

added to the final document.

At the end of the process, it is returned the document with the hierarchy created. Using the

example given in Section 3.2.1.3, from Table 3.1 this algorithm returns a document as shown in

Listing 3.2.

The algorithm is applied to every row in the anonymized dataset and the documents returned

are stored into a list of documents.

42

Solution

Algorithm 2: documentFromArrayOfPathsData: row : DocumentResult: finalRow : DocumentfinalRow : Documentfor columnName : row.keySet() do

if size of columnName.split(".") == 1 thenfinalRow.put(columnName, row.get(columnName)); continue;

endString currentDocument = column.split(".")[0]Document subRow = new Document for subColumnName : row.keySet() do

if subColumnName.contains(currentDocument) thenString subColumnPath = subColumnName.replace(currentDocument + ".", "");subRow.put(subColumnPath, row.get(subColumn));

endendfinalRow.put(currentDocument, documentFromArrayOfPaths(subRow));

end

3.2.1.7 Save anonymized dataset

After having the collection anonymized and with the required structure, the process exports the

results to a mongo collection. To export the results it is used the mongo driver already mentioned

on Section 3.2.1.2, which receives the list of documents and stores it in the destination collection.

3.2.1.8 Metrics calculation

After storing the results into a mongo collection, the process calculates the information loss. The

information loss is calculated using three metrics: (1) Loss information metric, (2) Precision (Prec)

and (3) Discernibility.

These metrics are implemented using the models explained in Section 2.5.

3.2.1.9 k and ` calculation

At the end of the process, it is calculated the k-anonymity and the `-diversity of the final dataset.

This calculation is a way of validating and ensuring the API provided by ARX tool correctly

anonymized the dataset.

In order to calculate k-anonymity (Algorithm 3), it is created an HashMap with QIDs and its

frequency in the dataset. k-anonymity corresponds to the lower frequency between QIDs frequen-

cies.

In order to calculate `-diversity (Algorithm 4), it is created an HashMap with all QIDs and the

values that it contains for the sensitive attribute. `-diversity corresponds to the lower number of

distinct values for the sensitive attributes. The algorithm returns -1 if the dataset is empty and -2

if the dataset was completely suppressed.

43

Solution

Algorithm 3: k-anonymityData: anonymizedCollectionWithoutHierarchy : DocumentResult: k : intHashMap<String, Integer> tuplesFrequency =

getTuplesFrequency(anonymizedCollectionWithoutHierarchy);int k = -1; for tuple : tuplesFrequency.keySet() do

if tuplesFrequency.get(tuple) < minK OR minK == -1 thenk = tuplesFrequency.get(tuple);

endend

Algorithm 4: `-diversityData: anonymizedCollectionWithoutHierarchy : DocumentResult: l : intHashMap<String, HashMap<String,Integer» tuplesWithSensitivesFrequency =

getTuplesWithSensitivesFrequency(anonymizedCollectionWithoutHierarchy);int l = -1; for tuple : tuplesWithSensitive.keySet() do

if allSuppressed(tuple) thencontinue;

endif l == -1 OR tuplesWithSensitive.get(tuple).keySet().size() < l then

l = tuplesWithSensitive.get(tuple).keySet().size()end

endif l == -1 AND tuplesWithSensitive.keySet().size() > 0 then

l == -2end

44

Solution

3.2.2 Anonymization GUI app

This module of the solution corresponds to a desktop application with a simple user interface that

allows to easily create a configuration file for a collection anonymization and to easily start the

anonymization process, explained in Section 3.2.2.

The usage of the GUI application starts with the connection to a MongoDB collection by pro-

viding Host, Database and Collection information. After connecting to MongoDB, the application

gets the collection structure by fetching and parsing the first element on the collection. When the

structure is parsed, it is shown to the user as a JSON tree.

To configure, the user selects each element on the tree, selects the type of attribute and specifies

the hierarchies for those which are QIDs. Every time an attribute configuration is changed, this

change is stored into a configuration object.

When configuration is done, it can be saved to a JSON file to be loaded later on.

The GUI also allows to start the anonymization process using the configuration created and to

see the resulting dataset, after which it can be exported to the anonymized MongoDB collection.

This process is shown in diagram of Fig. 3.5

Figure 3.4: GUI anonymization process.

45

Solution

3.2.3 Anonymization web service

The anonymization process will run periodically. In order to minimize the cost of running it

periodically, to automate the process and to easily manage the anonymization process, a web

service was also created. This web service allows to create a connection to a MongoDB, to set the

periodicity the anonymization will occur and to automatically call the anonymization process.

The web service also contains a dashboard in which it is possible to view the entire history

of anonymizations done, to view analytics about the process and to evaluate the results obtained

from each anonymization.

This module of the solution contains several sub-parts:

1. cronjob that runs every day. The cronjob checks every day which connections must be

anonymized by comparing the date last anonymization was done with the periodicity se-

lected by the user. The connections that are ready for anonymization are anonymized by

calling the process explained in Section 3.2.2.

2. command to call the anonymization process. This command implements the Command

pattern and allows to call the anonymization process with the needed parameters and con-

figuration file. The result returned by the process is stored into the web service database and

used to create useful analytics - average of data loss, average of amount of data, ...

3. database that stores all information required to connect and anonymize a MongoDB collec-

tion.

4. API that supports the web dashboard.

5. user interface that links with the API. It is the part the user interacts with, allowing to create

new connections, to view provided analytics and to view anonymization results in a user

friendly way.

6. notification system that allows to send notifications to user when anonymization process

ends. This notification system takes advantage of Pusher, which is a service for creating

realtime applications. [Ltd].

46

Solution

Figure 3.5: Web service component diagram.

47

Solution

3.3 Anonymization Process Results

The anonymization process will run monthly and the average of records to anonymize is 400.000.

It is required to handle two distinct datasets: Prescribed Medications and Consumptions (Section

3.1.1).

In an initial phase of the development and in order to analyse the capabilities and limitations

of this solution, a few tests were taken. The results for these tests are important to understand if

the solution is already good enough or if it needs attention in some aspects and optimizations.

In this initial phase, as the main goal is to analyse the performance and the limitations that

the solution contains, only k-anonymity was taken into account and the anonymization was only

applied to Prescribed Medications collection. Two different k values were used: k=2 and k=4.

The results for these tests are shown in Table 3.2. This table shows the Memory RAM con-

sumption and the time taken to anonymize the collection. From the results it is instantaneously

noticed that the main issue is the amount of RAM that is required to process the dataset. The

amount of RAM required is too high because it is storing into memory thousands of documents

and each document contains more than 100 attributes. So, the amount of data that it stores in

memory is too large.

In Fig. 3.6 it is presented on the left a chart with memory usage over time for 20.000 records.

From the analysis of this chart we notice that memory consumption is increasing until a certain

point. If we compare it with the times retrieved from Table 3.2, we notice that it stops increasing

when the export to mongo step starts. Also, if we analyse the architecture in more detail we

notice that a list with the documents is being stored and then a new array of strings with the

non-hierarchical data structure is also stored. This last array is used to anonymize and then the

structure is set back to the original one. The diagram on the right side of Fig. 3.6 shows exactly

that the memory is consumed mostly by the list with documents and by the array of strings -

java.util.LinkedHashMap$Entry, char[] and String.

With this first analysis, it is easily noticeable that the main issue and the main limitation of this

initial solution is the amount of Memory RAM it is required to complete the process of anonymiza-

tion. For this reason, we must focus on this limitation, which compromises the accomplishment of

the requirements. On next chapter this issue will be discussed in more detail and some optimiza-

tions will be presented and implemented.

# Records Memory RAM (k=2) Time (s) (k=2) Memory RAM (k=4) Time (s) (k=4)

20k 1.75Gb

Pre-processment = 4.21Anonymization = 0.4Convert back = 2.80Export = 0.47Validation = 0.13

1.8Gb


100k >6Gb Out of Memory >6Gb Out of Memory600k >6Gb Out of Memory >6Gb Out of Memory

Table 3.2: Results for an initial version of the anonymization process

48

Solution

Figure 3.6: Memory usage for 20k records (left) and memory consumption distribution (right).

3.4 Summary

In this chapter, the requirements for a solution that intends to anonymize clinical data were exposed

and then, the solution itself was exposed and explained in more detail.

Although ARX tool allows to anonymize data, it contains several obstacles when anonymizing

MongoDB collection, which is a requirement for the solution. However, it has available an open

source Java API which is useful for our solution by providing the implementation of the algorithm

and privacy models. This is useful as it reduces the effort of re-implementing the already existing

algorithms in our solution.

The solution is divided into three main modules which are linked together: (1) the anonymiza-

tion process itself - from a MongoDB collection, it creates the corresponding anonymized col-

lection using several built-in hierarchies -, (2) a web service that intends to make the process

accessible, automate and to take more advantage out of the process by providing the administrator

with analytics and (3) a desktop application that allows to create configuration files and access

the anonymization process with an user friendly interface. By implementing these three modules

it is intended to facilitate administrators job regarding anonymization of collections for research

purposes.

A preliminar test and analysis to the process of anonymization allowed to find some limitations

and issues in the designed architecture and implementation. From these limitations, the most

significant and the one which requires more focus is memory consumption. This limitation can

compromise the well-functioning of the process for bigger datasets, which is a requirement.

49

Solution

50

Chapter 4

Optimizations

On Chapter 5 the results for the implemented solution were exposed and analysed. From this

analysis and in order to fulfill the performance requirements, some issues appeared in terms of

performance and several optimizations had to be made to minimize the cost of the anonymization

process, especially in terms of memory usage and amount of data the process is able to handle

with. These optimizations resulted in another version of the solution.

In this chapter, the issues and the optimizations done to minimize those issues will be covered

in more detail.

4.1 Streams for Memory Usage

Memory usage increases mainly in two parts of the process: Data Processment and Convert back

to hierarchical structure to export to mongo.

These two parts will be analysed in more detail in this section.

4.1.1 Data Processment

The anonymization process will run every month and the average of data collected per month is

400.000 records. This means that the solution must be capable of handling a large amount of data.

Another requirement for the solution is that it must have good performance in terms of Memory

and CPU usage as the system in which the solution will run has limited resources available to the

process.

At the first stage of the development, data retrieved from MongoDB is being stored into a list

of documents. This list is then used during the phase of Processment of data to create a dataset that

can be used by the algorithm provided by ARX tool API. At the end of the anonymization, both the

anonymized and non-anonymized dataset were kept in Memory. This allowed faster processment

during phase of convertion to initial structure and export of results to Mongo collection. However,

it brings a big issue: Memory consumption is too high and compromises the well-functioning of

the anonymization for bigger datasets!

51

Optimizations

Figure 4.1: Memory usage for 20k (left) and 400k (right) records. [SH]

In Fig. 4.1, two charts with memory usage during the process of anonymization are shown.

On the left, it is presented the chart for 20.000 records; on the right, it is presented the chart

for 400.000 records. In both charts, it is easily noticed that the amount of memory that is used

during the process is too high and, in the second case, the process did not finish due to the lack of

resources. At this point, it is important to find out when and which objects are consuming more

memory to understand how we can reduce this consumptions and make it possible to fulfill the

requirements.

In Fig. 4.2, it is shown an histogram with information about memory consumption and number

of instances for each object in the process. From the analysis of this histogram, we can check that it

is a java.util.LinkedHashMap$Entry associated with char[] and java.lang.String that is consuming

the most memory. These elements are the storage of the entire collection in memory and they are

the cause for the process not to finish with a big dataset. In this case, the anonymization results in

out of memory exception.

Figure 4.2: Memory usage by object for 400k records. [SH]

4.1.1.1 Optimizations

With the issue well defined, it is important to search for solutions that can minimize it and decrease

the amount of Memory usage. There are several possible solutions to solve or minimize this issue:

52

Optimizations

1. clusters - one solution could be the creation of smaller clusters that are anonymized inde-

pendently and joined at the end. Although this solution may solve the problem of memory

usage, it has a big problem: by creating clusters, the size of the dataset is smaller and as con-

sequence the information loss is higher. As already said before, quality of generated data is

an important point and should not be forgotton. Another counter-part for this solution is that

anonymization will run every month. This is already a way of creating clusters with the size

of a month in the entire collection. If this solution is implemented, the monthly clustering

is reduced to weeks or even days. This is not the desired behaviour for the solution.

2. unneeded attributes removal - remove some attributes that are not required during the

process of anonymization reduces the size of each object and, as consequence, of the entire

dataset. With this solution, the collection size may be reduced to more than half the initial

size as most of the attributes are not essential for the anonymization process.

3. indexing attributes names - replace the name of each attribute with an index or id. This

solution reduces the size of the string used by the attribute name and, as consequence, the

memory usage of the entire list. Although it reduces the total size used, the reduction is not

significant.

4. use of streams - during the processment of data, use streams instead of a list with all data.

With this solution, data will only be stored after all processment has been done and just dur-

ing the anonymization process. With this solution, memory usage has a significant decrease

but it takes more time to complete the process of anonymization. Although the increase

of elapsed time in the processment of data phase, the decrease of total memory usage is

preferrable as it is the main limitation for the solution to succeed.

From the above possible optimizations and its analysis, it was chosen to join two of them: useof streams and unneeded attributes removal.

Instead of a list of Documents, it is created a stream for mongo results. This stream is then

used during the processment step to generate the data structure to be passed to the API. Unneeded

attributes removal is done during the step of data processing. With this optimization, the algorithm

that processes hierarchical data retrieved from mongo (Algorithm 1) needs to add a new rule:

"if the attribute is to exclude, it must continue to next attribute without adding it to the final

collection". These two optimizations allow to significantly decrease the amount of memory used

during these phases of the process. In Section 4.3, the results achieved from these optimizations

are shown.

4.1.2 Save Results to MongoDB

Java MongoDB driver allows to insert many documents at the same time. For doing that, it is

required an ArrayList with the Documents to insert into the MongoDB collection.

53

Optimizations

In the phase Convert to the Initial Structure of the anonymization process, the anonymized

collection is converted back to the initial MongoDB structure. This collection with the final hier-

archical structure is being stored in a list. After finishing the process of convertion, the list was

exported to the destination MongoDB collection using the insertMany method provided by Java

MongoDB driver.

As in Section 4.1, this process requires large amount of memory to store the entire list of

documents before exporting it. As a consequence and similarly to section before, with the growth

of datasets, the memory required was also increasing exponentially.

4.1.2.1 Optimizations

As in the previous issue, there are several possible solutions to solve or minimize this issue:

1. clusters - one solution would be to create clusters to save the anonymized collection. In

contrast to the previous issue, this solution doesn’t compromise the quality of generated

data as it already uses anonymized data. This solution can reduce significantly the memory

required.

2. use of streams - during the phase Convert to the Initial Structure, data is stored in a list that

is saved in runtime memory. Saving to a stream could be a possibility that can save a lot

of memory usage. By using this solution, the elapsed time increases but the total memory

usage decreases significantly. As memory usage is the most limitation for the solution and

for the process of anonymization due to the size of the datasets, it is a good choice to increase

the elapsed time of the process while reducing the memory consumption.

In this case, both solutions are feasible and can be linked together to achieve better results. For

this issue, the optimization chosen is to convert the results to a stream and then iterate through that

stream to export the anonymized documents to the destination MongoDB collection. This means

that steps 5 and 6 (Convert to the initial structure and save anonymized dataset) occur in parallel.

4.2 Clustering Pre-processment Phases

In Section 4.1, some options for optimizing memory usage during anonymization were analysed.

From these optimizations, it was chosen to implement streams to process data and remove some

unneeded attributes. With this solution, memory usage has a significant decrease as data will only

be stored in memory once - during anonymization itself - and in all other steps it will be used

streams which does not store data directly into memory.

This solution minimizes the problem of handling bigger datasets but reduces the speed of

the anonymization. For this case, it is preffered because speed does not compromise the well

functioning of the solution while memory does.

In order to reduce the time to process data, parallel computing was considered. At the stages of

pre-processing - gather data and process data -, data does not need to be considered as a block to

54

Optimizations

decrease information loss. For this reason, clustering the pre-processment over several machines

is a good solution to increase speed and performance.

This optimization requires some adjustments and changes to the initial architecture as the pre-

processment and gather of data is not done in a single process but using several processes running

in parallel, in multiple machines. This is possible using Sockets, which are used to transfer data

between client and clusters. The basic functioning of this optimization is:

1. Client sends requests to clusters asking for pre-processment. Message includes mongo con-

nection information, query to run, configuration used, page to anonymize and number of

elements in that page. It waits for response from all clusters before continuing.

2. Cluster receives message from client, parses it and calls the data pre-processment method.

The data pre-processment fetches data from mongo collection and creates a collection object

with that data. Then, it converts this collection, which contains an hierarchical structure, into

a non-hierarchical representation of it. This pre-processment is similar to the one used in

the initial solution and uses streams to reduce the amount of memory used. After converting

the mongo collection to a non-hierarchical structure, it returns to the client an array with the

same structure as the one used by the API. In order to reduce the time lost sending data to

the client, data is compressed before sending.

3. Client receives the array with all data already processed. It joins data received from all

clusters into the same structure. After all clusters answered, it continues the process as

before.

Fig. 4.3 contains the representation of this solution. In this solution, we have a client - re-

sponsible for the anonymization - and several clusters - responsible for data pre-processment. The

client connects to the available clusters and asks them to fetch and pre-process part of the entire

data. At the end, client joins all data received from all clusters.

The client creates a thread for each cluster connection. In Fig. 4.4, it is shown a flow chart

with the functioning of a client thread, of a cluster and the messages passed through sockets

between these two components. Each client thread connects to a cluster, sends a request and waits

for the response with the processed dataset. Each cluster waits until new client connects after

which it receives the request, fetches data from mongo collection, processes it and returns the new

processed dataset to the client. After this process, the cluster waits again for new client connection.

With this optimization it is expected that the data processing time reduces as the processing

is splitted by all clusters. However, it is important to notice that it is also required some time to

send data from clusters to client. So, in order to increase performance, clusters need to have good

performance to compensate the time lost in data transfer.

It is also expected that memory consumption reduces because the data processing takes advan-

tage of recursivity to create a non-hierarchical representation of the initial hierarchical structure.

Recursivity uses more memory as elements need to remain on the heap during the process. With

this solution, memory used by recursivity will be partitioned by all clusters.

55

Optimizations

Figure 4.3: Clustering architecture.

Figure 4.4: Client-cluster functioning and interaction flowchart.

56

Optimizations

It is also expected that this solution not only impacts the initial steps but also the subsequent

ones and the entire process. Using this new architecture, data will be stored into an array which

will be used in the next steps. For this reason, validation and export to mongo will iterate over an

array instead of a stream of documents that is not in memory. This makes the process run faster

than before.

57

Optimizations

4.3 Results

The optimization proposed to Data processment and Save Results to MongoDB resulted in a new

version of the solution. In this new version, Convertion to Initial Structure and Save anonymized

dataset phases occur in parallel. In order to test the optimizations done, it was used a 2-anonymity

configuration for 20k, 100k and 600k records. During these tests it was used an Intel Core i7

Quadcore with 6Gb RAM machine.

Table 4.1 presents the results in terms of RAM consumptions and elapsed time for this version.

From this table, we notice that the time for each stage had a significant increase when comparing

to the first version. This increase is justified by the fact of using stream, which is slower than

storing all data into memory before processing. However, the amount of RAM that is required

reduced, specially when datasets become larger.

Another point that is noticeable is that Validation phase requires more memory RAM than the

rest of the process. This happens because validation is not using streams. It first converts the

data to initial structure, stores it into memory and then validates the results. Although this version

requires less memory than the previous - as it only stores into memory the list of anonymized

documents during validation step -, validation still requires a lot more memory than the rest of

the process. For example, in a 600k records dataset, the entire process completed except from

validation. This shows the difference in memory usage between validation and all other steps.

For this reason, validation also requires some optimizations and, as well as other steps, streams

will be applied. Table 4.2 presents the results achieved from the implementation of streams in

validation step. As expected, steps before validation did not suffer changes in terms of speed

and memory usage. Validation step had a significant gain in its memory usage, decreasing up to

50% less for 600k records. In terms of speed, there is no difference when compared to the first

optimized version.

From both tables, it is possible to notice that the pre-processment phase - gather data and

process data - takes too long to conclude. For 600k records it takes almost three minutes to finish.

To reduce the time of pre-processment, it was implemented clustering over several machines.

Table 4.3 presents the results achieved from this clustering architecture implementation. To create

this table it was used 2 cluster machines with a Dual-core processor and 6Gb memory ram. From

these results it is possible to notice an increase on the overall performance. Although elapsed time

for pre-processment phase reduced a bit, it did not reduce as much as it was expected. However,

this may have several causes: data is taking too long to be transferred from clusters to client;

clusters do not have as good CPU performance as the client used in previous versions; number of

clusters are not enough to notice an increase of performance. These causes will be better analysed

in next chapter.

Export and validation phases noticed an enormous increase in terms of performance using

clusters solution, such that validation passed from 284 seconds to 4 seconds. These results are due

to the usage of an array instead of streams during these phases. Also, in terms of memory usage,

there is a significant decrease. This is justified by the fact that the pre-processment, which uses

58

Optimizations

recursivity and consumes lot of memory, is splitted by the clusters and only one array with the

entire data is stored in the client.

# Records Time (s) RAM

20k

Pre-processment = 7.83Anonymization = 0.6Export + Convert-initial = 13.34Validation = 8.74

Process: 2.3 GbValidation: 2.3 Gb

100k


Process: 2.5 GbValidation: 3 Gb

600k

Pre-processment = 172.44Anonymization = 1.56Export + Convert-initial = 352.96Validation = —-

Process: 3.6 GbValidation: >6 Gb

Table 4.1: Anonymization process analysis after optimization of data processment and save resultsphases.


20k



100k



600k



Table 4.2: Anonymization process analysis after optimization of data processment, save resultsand validation phases.

59

Optimizations


20k



100k



600k



Table 4.3: Anonymization process analysis after clustering architecture implementation.

60

Optimizations

4.4 Summary

In this chapter, the most significant issues that showed up in terms of performance and system

requirements are covered in detail and optimizations to solve them are exposed. From those issues,

the most significant one is the RAM memory required by the anonymization process. This big

issue occurs essentially at three points of the process: Process Data, Convert to the Initial Structure

and Save anonymized dataset.

With this big performance issue, several possible optimizations are analysed for each point in

which the issue is more notorious. This analysis led to the creation of three different versions of


A first version without optimizations, which uses RAM memory to save all collection copies

created during the every step of the process. From this version, the most notorious aspect is the

amount of memory it needs to store data. All data retrieved from mongo and all copies of this data

are stored in memory during the anonymization process, which leads to high levels of memory

consumption. So, a limitation in terms of dataset size is noticed and it compromises the minimum

required dataset size. For this reason, this limitation needs special attention as it compromises the

well-functioning of the anonymization.

A second version that removes unneeded attributes and uses streams to process and save col-

lections instead of RAM Memory. By removing unneeded attributes, the dataset becomes smaller

and consequently memory consumption reduces. This removal associated to the use of streams to

process data, allows to only use memory with data ready to anonymize. These changes allowed to

significantly reduce the amount of memory used during the process.

A third version which uses a clustering architecture to split the data processment phase by

several machines (clusters). This third version allows to reduce (1) the time to process data, as it

is splitted by several machines which work together for the same end, and also (2) the memory

usage, as it is only needed one array with the dataset in the entire process and recursivity is split-

ted by several machines. This last optimization had great impact in the anonymization process

performance.

61

Optimizations

62

Chapter 5

Results

On Chapter 3, a solution for the problem of clinical data anonymization was defined. The solution

tries to be as general as possible in order to handle several types of collections, independently of

data types to be anonymized.

In this chapter, the initial solution and its optimized versions are tested using distinct configu-

rations, different k and ` values and several sizes for the datasets. In the context of this dissertation,

two datasets are used: Prescribed Medications and Consumptions (Section 3.1.1). By using these

two collections, it is ensured the solution’s adaptability.

5.1 Information loss

In this section information loss and number of total suppressions will be analysed and compared

with the selected privacy model and dataset size. For this study, several configurations and differ-

ent sizes for the dataset were used.

5.1.1 Prescribed Medications Collection

Prescribed Medications collection contains information about medical prescriptions and the fields

that must be considered due to its importance and relevance are:

1. Patient related information - mainly location, birth year and type and diagnostic.

2. Product related information

3. Prescription date

Taking this into account, six configurations were used for this dataset: three using k-anonymity,

another three using `-diversity.

5.1.1.1 k-anonymity

For this privacy model, three distinct k were used: k=2, k=3 and k=5. The following configuration

for each attribute type was used:

63

Results

1. Excluded - All attributes IsAnonymized and with substring "Suppressed". All days, minutes

and seconds contained in Date objects.

2. Sensitives - no sensitive attribute was set as it is k-anonymity, not `-diversity.

3. Identifiers - Patient gender and type.

4. Pseudonyms - Prescribing physician and patient code. These were defined as they are

needed to link information inside the database.

5. Insensitives - These attributes are not generalized. For research purposes, prescribed prod-

ucts are the main information on this dataset. For this reason, all product information is set

as insensitive.

6. QID - Prescribing physician specialty code and description - using SPECIALTY _CODE

and SPECIALTY _DESCRIPT ION hierarchies provided by the anonymization process.

Diagnostic code and description - using DIAGNOSE_CODE and DIAGNOSE_DESCRIPT ION

hierarchies provided by the anonymization process.

Patient birth year - using Y EAR hierarchy.

7. Suppress - All other information that can directly identify and individual.

By using this configuration, it is returned an anonymized dataset with documents similar to

the one in Listing 5.1. This document was generated using k=2 and 100k records. By analysing

this document in more detail, we notice that all identifiers and suppressed attributes assume the

value *, all pseudonyms contain a string which is an encrypted version of the initial value using

SHA-256 algorithm and all insensitive attributes are left untouched. QIDs are generalized in order

to fullfill the k requirement. In this case, birth year was generalized to a 5 years range, while

diagnostic and physician specialty did not suffer any generalization.

1 {2 " _ i d " : O b j e c t I d ( " 5 9 0 c7d318208c307bac38e19 " ) ,3 " A c t io n " : "∗" ,4 " V i s i t " : {5 " P a t i e n t " : {6 " L o c a t i o n " : {7 " D i s t r i c t D e s c r i p t i o n " : "∗" ,8 " D i s t r i c t C o d e " : "∗" ,9 " CountryCode " : " 5 9 5 0 " ,

10 " C o u n t r y D e s c r i p t i o n " : " P o r t u g a l "11 } ,12 " Code " : "639 a 2 0 4 2 b 1 6 5 f 2 f d f f e 5 4 b 3 e d e 9 0 7 a b a 7 0 f 3 f 9 d 1 7 e e 5 2 6 9 1 c 3 1 c a 7 9 f 5 c 8 5 1 f 7 4 " ,13 " Type " : "∗" ,14 " Gender " : {15 " Gender " : "∗"16 } ,17 " B i r t h Y e a r " : {18 " BeginYear " : " [ 2 0 1 5 , 2020]"19 }20 } ,21 " Ep i sode " : "1274786" ,22 " EpisodeType " : " I n t e r n a m e n t o s "23 } ,24 " C o s t C e n t e r " : {

64

Results

25 " Code " : "∗" ,26 " D e s c r i p t i o n " : "∗"27 } ,28 " I n n e r I d " : "∗" ,29 " P r e s c r i p t i o n N u m b e r " : "∗" ,30 " V a l i d a t i o n R e s u l t " : "∗" ,31 " O b s o l e t e " : "∗" ,32 " Valence " : {33 " Code " : "∗" ,34 " D e s c r i p t i o n " : "∗"35 } ,36 " V a l i d " : "∗" ,37 "SCN" : "∗" ,38 " A p p l i c a t i o n I d " : "HBPPP " ,39 " T r a c k i n g I d " : "99477745808B4DAF8B1FA197B18ABDB5 " ,40 " I n s t i t u t i o n " : {41 " Code " : "∗" ,42 " O r g a n i z a t i o n " : {43 " Code " : "∗"44 }45 } ,46 " P r e s c r i b i n g P h y s i c i a n " : {47 " P h y s i c i a n S p e c i a l t y C o d e " : " 4 0 " ,48 " P h y s i c i a n S p e c i a l t y " : " P e d i a t r i a " ,49 " Code " : " a550a16e49be9f5a17a64413a5b923523a7e384870afdc72dc252a07eba f3ba9 "50 } ,51 " P r o d u c t " : {52 " P r o d u c t I d " : " 6 7 7 3 " ,53 " Produc tCode " : "100004348" ,54 " D e s c r i p t i o n " : "NUTRICAO PARENTERICA P / NEONATO" ,55 "CHNM" : " " ,56 " Dose " : " 0 " ,57 " MeasureUni t " : " " ,58 " A d m i n i s t r a t i o n R o u t e D e s c " : " " ,59 " P r e s e n t a t i o n F o r m " : "EMB" ,60 "GFT" : "110203" ,61 "ACT" : " " ,62 " Med ica lDev ice " : "N" ,63 " Family " : " 1 0 " ,64 " F a m i l y D e s c r i p t i o n " : " P r o d u t o s Farmac \ ^ e u t i c o s " ,65 " Pha rmaceu t i c a lFo rmDesc " : "SOL INJ " ,66 " A d m i n i s t r a t i o n R o u t e " : " " ,67 " P r e s e n t a t i o n F o r m D e s c " : "EMBALAGEM" ,68 " P h a r m a c e u t i c a l F o r m " : "SOL INJ "69 } ,70 " Dosage " : {71 " Frequency " : "PERFUSAO" ,72 " BeginDate " : {73 " Year " : " 2 0 1 5 " ,74 " Month " : "3"75 }76 } ,77 " D i a g n o s t i c " : {78 " D i a g n o s t i c C o d e " : {79 " Code " : "76524"80 } ,81 " D i s e a s e s C l a s s i f i c a t i o n " : "∗" ,82 " D e s c r i p t i o n " : "27−28 SEMANAS COMPLETAS DE GESTACAO"83 } ,84 " L i n e I d e n t i f i e r " : "∗" ,85 " C r e a t i o n D a t e " : {86 " Year " : " 2 0 1 5 " ,87 " Month " : "3"88 }89 }

Listing 5.1: Prescribed Medication anonymized document.

In Table 5.1 it is shown the resulting information loss for each k configuration and for each

dataset size.

As expected, for each k, the information loss decreases with the increase of the dataset size. As

example, when using k = 2, the information loss decreased from 2.5% for 20k records to 1.4% for

65

Results

# Records k = 2 k = 3 k = 5

20kLoss Information = 0.025Prec = 0.045Num. Suppressions = 237

Loss Information = 0.039Prec = 0.059Num. Suppressions = 521








Table 5.1: Prescribed Medications anonymization - Information loss for 20k, 100k and 600krecords using k=2, k=3 and k=5.

600k records. From the table, it is also possible to see that information loss is directly proportional

to k. This means that when k increases, so does information loss. This is justified by the fact that

for a higher k value, more generalization is required to fulfill the requirement.

Another important point to notice is that the number of records that are fully suppressed main-

tains similar for bigger datasets. This means that the ratio of total suppressions has a significant

decrease when using bigger datasets. As example, for k=2, there are 237 total suppressions when

anonymizing 20k records and 159 when anonymizing 600k records. This means that, in the first

scenario, the ratio for fully suppressed records is 1.2% and, in the second scenario, the ratio is

0.026%. As suppression of an entire record implies losing all information on that record, it must

be reduced.

For 600k records it is easily noticed that the information loss is almost the same for all k

values. This means that generated anonymized datasets are not too distinct from each other for all

the three k versions. Charts on Fig. 5.1 allow to better notice this statement. Looking at them,

we notice that, for 20k records (left chart), k has a significant impact in the anonymized dataset

and the information loss highly depends on it. However, for 600k records that is not true and

information loss is almost constant for every k. With this in mind, we can infer that for bigger

datasets k=2 is preferable as the dataset is almost equivalent to the one generated using higher k

and it has less suppression.

5.1.1.2 `-diversity

For this privacy model, three distinct ` were used: `=2, `=3 and `=5. The following configuration

for each attribute type was used:

1. Sensitives - For Prescribed Medications dataset two sensitive attributes were identified: di-

agnostic code and diagnostic description. These are sensitives because diagnostics must not

be linked to any individual and can identify them in some situations, such as a rare disease.

2. QID - Prescribing physician specialty code and description - use SPECIALTY _CODE and

SPECIALTY _DESCRIPT ION hierarchies provided by the anonymization process.

66

Results

Figure 5.1: Information loss vs k-anonymity for 20k (left) and 600k (right) records.

Patient birth year - uses Y EAR hierarchy.

In Table 5.2 it is shown the resulting information loss for each ` configuration and for each

dataset size.

From this table and as expected, information loss (1) decreases with the increase of the dataset

size and (2) increases with the increase of `. Analysing in more detail each cell of the table, we

notice that the evolution of information loss and amount of suppression is similar to the previous

k-anonymity configuration. However, loss and suppression is a lot higher than for k-anonymity

alone. This results confirm what was expected. `-diversity ensures that each group of QIDs

has a minimum of ` distinct values on sensitive attributes. So, the best scenario occurs when k-

anonymity also fulfills `-diversity, which does not occur for this collection. As consequence, more

generalization and more suppression is required to achieve `-diversity and so more information

loss.

Regarding the amount of suppression required, we notice that in contrast to what happens with

k-anonymity, `-diversity is not linear and does not reduce with the increase of total records. This

happens because `-diversity does not consider only QIDs tuples, but also sensitive values. This

means that not only one tuple of QIDs must appear more than ` times, but also for each tuple there

must appear at least ` distinct values on sensitives. So, it is harder to achieve these results even if

more records are considered.

5.1.2 Consumptions Collections

Consumptions collection contains information about patient’s consumptions during a clinical episode.

The fields that must be considered due to its importance and relevance are:

1. Patient related information - mainly location, birth year and type and diagnostic.

2. Transaction related information - such as type, date and amount.

3. Information related to the product that was consumed.

67

Results

# Records ` = 2 ` = 3 ` = 5










Table 5.2: Prescribed Medications anonymization - Information loss for 20k, 100k and 600krecords using `=2, `=3 and `=5.

For this dataset, only four k anonymity configuration with 3 dataset sizes are tested. The

following configuration for each attribute type was used:

1. Sensitives - no sensitive attribute was set as it is k-anonymity, not `-diversity.

2. Identifiers - Patient gender and type; transaction number.

3. Pseudonyms - Patient and institution code. These were defined as they are needed to link

information inside the database.

4. Insensitives - These attributes are not generalized. For research purposes, consumptions

are the main information on this dataset. For this reason, all product information is set as

insensitive.

5. QID - Patient birth year - using Y EAR hierarchy.

Product code and Product CHNM - using GENERAL hierarchy.

This configuration results in an anonymized dataset with documents similar to the one in List-

ing 5.2. This document was generated using k=5 and 100k records. As on Prescribed Medications

collection, all identifiers and suppressed attributes assume the value *, all pseudonyms are an en-

crypted representation of the initial value and all insensitive attributes are left untouched. QIDs

are generalized in order to fullfill the k. In this case, birth year was generalized to a 10 years range

and all other QID did not suffer any generalization.

1 {2 " _ i d " : O b j e c t I d ("59145381 a1d19d184a2a090b " ) ,3 " O b s o l e t e " : "∗" ,4 " A c t io n " : "∗" ,5 " Valence " : {6 " Code " : "∗" ,7 " D e s c r i p t i o n " : "∗"8 } ,9 " C o s t C e n t e r " : {

10 " Code " : "∗" ,11 " D e s c r i p t i o n " : "∗"12 } ,

68

Results

13 " V i s i t " : {14 " Ep i sode " : "∗" ,15 " P a t i e n t " : {16 " Code " : "4 c4de4ea3796cb62fe18bf9d970fc9193a015b169822a2018b2903f f9e95d228 " ,17 " Type " : "∗" ,18 " Gender " : {19 " Gender " : "∗"20 } ,21 " B i r t h Y e a r " : {22 " BeginYear " : " [ 1 9 3 0 , 1940]"23 } ,24 " L o c a t i o n " : {25 " CountryCode " : " 5 9 5 0 " ,26 " C o u n t r y D e s c r i p t i o n " : " P o r t u g a l "27 }28 } ,29 " EpisodeType " : " I n t e r n a m e n t o s "30 } ,31 " I n n e r I d " : "∗" ,32 " T r a n s a c t i o n " : {33 " Brand " : "∗" ,34 " Transac t ionNumber " : {35 " Number " : "∗"36 } ,37 " U n i t P r i c e " : " 0 . 0 4 " ,38 " Value " : "−0.08" ,39 " Date " : {40 " Year " : "∗" ,41 " Month " : "∗"42 } ,43 " Amount " : "−2" ,44 " Un i t " : "CMP" ,45 " T r a n s a c t i o n T y p e " : " 1 " ,46 " T r a n s a c t i o n T y p e D e s c " : " S a i d a Unidose " ,47 " T r a n s a c t i o n T y p e D e t a i l " : "SU"48 } ,49 " S e r v i c e " : {50 " Code " : "∗" ,51 " D e s c r i p t i o n " : "∗"52 } ,53 "SCN" : "∗" ,54 " I n s t i t u t i o n " : {55 " Code " : "2 fd09410f893477de881d85a6d4ba231272126d3609e65b10e6eade5b4273514 " ,56 " O r g a n i z a t i o n " : {57 " Code " : " f0fc1a01a04787265230d3d7eab3217165301d4458c774bb4a638a69ae9d43dd "58 }59 } ,60 " P r o d u c t " : {61 " P r o d u c t I d " : "∗" ,62 " Produc tCode " : "100001158" ,63 " D e s c r i p t i o n " : "∗" ,64 "CHNM" : "10079754" ,65 " Dose " : " 6 0 0 " ,66 " MeasureUni t " : "MG" ,67 " A d m i n i s t r a t i o n R o u t e D e s c " : "ORAL" ,68 " P r e s e n t a t i o n F o r m " : "CMP" ,69 "GFT" : " 1 1 . 3 . 2 . 3 " ,70 "ACT" : "10044575" ,71 " Med ica lDev ice " : "N" ,72 " Family " : " 1 0 " ,73 " F a m i l y D e s c r i p t i o n " : " P r o d u t o s F a r m a c e u t i c o s " ,74 " Pha rmaceu t i c a lFo rmDesc " : "COMP. " ,75 " A d m i n i s t r a t i o n R o u t e " : "ORAL" ,76 " P r e s e n t a t i o n F o r m D e s c " : "CMP" ,77 " P h a r m a c e u t i c a l F o r m " : "COMP. "78 }79 }

Listing 5.2: Consumptions anonymized document.

In Table 5.3 it is shown the resulting information loss for each k configuration and for each

dataset size. As expected, these results are similar to the results for Prescribed Medications col-

lections: higher k, higher information loss; lower dataset size, higher information loss.

69

Results

# Records k = 2 k = 3 k = 5

20k Loss Information = 0.05Prec = 0.07

Loss Information = 0.07Prec = 0.09








Table 5.3: Consumptions anonymization - Information loss for 20k, 100k and 600k records usingk=2, k=3 and k=5.

5.2 Performance

Nextly, the three versions of this solution are tested and evaluated in terms of performance. With

this in mind and in order to better analyse the process, all steps specified on Chapter 3 will be

analysed in terms of speed and memory consumption. Privacy model configuration and its impact

in terms of performance will also be analysed in this chapter.

During performance tests it is used an Intel Core i7 Quadcore with 6Gb RAM machine and

only Prescribed Medications collection is considered. It is only used one collection because both

collections are similar in terms of size and so performance variation will be similar on all collec-

tions.

5.2.1 Privacy Model Config Impact

In this section, the impact of privacy models configuration will be analyzed in terms of perfor-

mance. For this, two metrics will be considered:

1. Speed

2. Memory usage

For both studies the same dataset - Prescribed Medications -, the same number of records to

anonymized - 600k records - and the same version of the process - cluster-based - will be used. It

was chosen 600k records because it is large enough to notice differences in terms of performance.

5.2.1.1 Processing time

To analyse if privacy model configuration affects processing time, 4 distinct values are used for

each privacy model (PRMD).

Table 5.4 presents the elapsed time for the process using different configurations. From this

table, we notice that the duration of the process does not vary with the configuration used.

So, we conclude that privacy model configuration does not have significant impact in the

processing time.

70

Results

5.2.1.2 Memory usage

To analyse if privacy model configuration affects memory usage, 4 distinct values are used for

each privacy model (PRMD).

Table 5.5 presents the memory usage during the process using different configurations. From

this table, we notice that memory usage is similar for all configurations (both k and `).

So, we infer that privacy model configuration does not have significant impact in memory

usage.

PRMD / Value 2 3 4 5k 326s 328s 327s 328s` 327s 328s 328s 329s

Table 5.4: Elapsed time for distinct values of k and `.

PRMD / Value 2 3 4 5k 2.4Gb 2.4Gb 2.4Gb 2.4Gb` 2.4Gb 2.4Gb 2.4Gb 2.4Gb

Table 5.5: Memory usage for distinct values of k and `.

Concluding, privacy model configuration does not impact neither processing speed neither

memory usage.

71

Results

5.2.2 Solution - Initial version

The initial version stores all data into memory which restricts the size of the dataset. Therefore, it

is expected that this version has high level of memory consumption.

From the configuration impact analysis (see Section 5.2.1), it is possible to notice that the

privacy model does not have a considerable impact in terms of performance. For this reason, only

two configurations are used to test the performance (k=2 and `=2).

Table 5.6 contains the performance for this initial version using k=2 as configuration. Three

dataset sizes were used: 20k, 100k and 600k records.

From these results, the most notorious aspect is the Out of Memory status for 100k and 600k

records. The entire non-anonymized and anonymized dataset are stored into memory with and

without hierarchical structure. This totals four copies of the dataset being store into memory. For

a dataset with more than 50 columns, as both the required collections have, it consumes much

memory. Another point is that the pre-processment phase takes advantage of recursivity to go

through the mongo document hierarchy and convert to a non-hierarchical structure. Recursivity

also consumes a considerable amount of memory as references to objects need to keep alive in

memory. These two factors combined significantly increase the amount of memory that is needed

to conclude the process.

In Fig. 5.2 two charts with memory usage during the process of anonymization are presented.

These charts show what was said just before. First, from the analysis of memory consumption

distribution (chart on the right), we notice that RAM is being used mainly by String and Linked-

HashMap$Entry. These elements correspond to the datasets that are stored during the process of

anonymization. Second, from the analysis of the memory usage over time (chart on the left) we

notice that memory usage increases at the begining, decreases a bit and at the end it increases

again. Linking the evolution of RAM usage with the process phases, we can infer that the process

is wasting more memory during the pre-processment phase and during the convert back phase.

These two phases have a thing in common, they both use recursivity and create a new processed

dataset.

# Records Memory RAM Time (s)

20k 1.75Gb


100k >6Gb Out of Memory600k >6Gb Out of Memory

Table 5.6: Performance results for the initial version of the anonymization process using k=2.

72

Results

Figure 5.2: Memory usage for 20k records (left) and memory consumption distribution (right).[SH]

Table 5.7 and Fig. 5.8 present the results using `=2 configuration. As expected the results for

this configuration are similar to the results for k configuration. It confirms that the configuration

does not have any significant impact in terms of performance.


20k 1.75Gb


100k >6Gb Out of Memory600k >6Gb Out of Memory

Table 5.7: Performance results for the initialversion of the anonymization process using`=2.

Table 5.8: Memory usage for 20krecords using `=2. [SH]

5.2.2.1 Limitations

The most limiting factor of the anonymization process is the size of the dataset. For this reason, it

is important to find out the maximum dataset size the solution can handle. To achieve this, several

sizes were tested to create a tendency line that allows to better estimate the maximum size. This

estimation is done with a worst case scenario, by using a dataset with more than 60 fields which

can be considered a big dataset.

# Records Memory RAM20k 1.75Gb40k 2.9Gb60k 4.2Gb80k 5.9Gb

Table 5.9: Memory usage evolution with dataset size.

Table 5.9 presents the memory consumptions variation with the dataset size. This table allowed

to create a tendency chart (Fig. 5.3) to better estimate the number of records the initial version

73

Results

Figure 5.3: Memory usage tendency charts.

can handle. The resulting chart has a tendency line with the equation y = 1,38x + 0,2. From this

trendline we can estimate that the maximum dataset size is aproximately 84k records. This study

allows to estimate that in this version it is required aproximately 1Gb RAM for every 14k records.

This consumption is too high because there are four copies of the collection being stored, the

recursivity and also all hierarchies.

74

Results

5.2.3 Solution - Streaming version

A first optimized version consists in using streams to process data instead of storing it into mem-

ory before processing. With this optimization it is expected that memory consumption decreases

significantly.

As well as previous version, only two configurations are used to test this version’s perfor-

mance: k=2 and `=2.

The results for this version are presented in Table 5.10. Three dataset sizes were used: 20k,

100k and 600k records.

In this optimized version, it is noticeable a significant decrease in terms of memory consump-

tion, specially for bigger datasets. This happens because the dataset, which was wasting a majority

of the memory required by the process, is being processed using a stream. With this decrease, the

requirement of handling at least the average size of datasets to be anonymized each month (aprox.

400.000 records) is easily fulfilled.

In terms of speed, with the usage of streams it is expected the elapsed time to increase as

accessing data in streams is slower than in memory. When comparing the initial version with this

new optimized version, we notice exactly this. For 20k records, the total elapsed time increased

from 9 seconds to 27 seconds. This difference in time is essentially due to the phases that now

use streamings - pre-processment, convert back, export and validation. For bigger datasets, the

difference would be greater.


20k 2.2Gb

Pre-processment = 6.32Anonymization = 0.16Convert back + Export = 12.48Metrics = 0.03Validation = 8.68

100k 2.5Gb


600k 3.6Gb

Pre-processment = 164.23Anonymization = 1.47Export = 358.45Metrics = 0.19Validation = 272.30

Table 5.10: Performance results for the streaming version of the anonymization process using k=2.

In Fig. 5.4 the memory usage distribution is presented. From this chart, we notice the number

of instances of each object had a big decrease. This confirms what was expected: by using streams,

the dataset is only stored in memory on anonymization phase of the process. So, we have only 2

persistent copies of the dataset: anonymized and non-anonymized.

75

Results

Figure 5.4: Memory consumption distribution for 20k records. [SH]

5.2.3.1 Limitations

As in the first version, the most limiting factor is the RAM that is required to complete the process

and anonymize the entire dataset. With this in mind, it is important to find out the maximum

amount of records this version can handle. As well as in previous version, the maximum size

depends on the number of fields each document in the collection has.

Table 5.11 presents the memory consumption variation with the dataset size. By creating a

tendency line chart from this table, it is possible to estimate the number of records that can be

handled. The generated tendency line, shown in Fig. 5.12, has the equation y = 0.2257x + 2,26.

From this trendline we can estimate that the maximum dataset size is aproximately 1,66M records.

This study allows to estimate that in this version it is required aproximately 200Mb RAM for every

100k records.

# Records Memory RAM100k 2.5Gb200k 2.7Gb300k 2.9Gb400k 3.2Gb500k 3.4Gb600k 3.6Gb

Table 5.11: Memory usage evolutionwith dataset size. Table 5.12: Memory usage tendency chart.

76

Results

5.2.4 Solution - Clusters version

Another approach to this solution intends to optimize the process by reducing the elapsed time and

memory usage. This approach is based on clustering the pre-processment and is analysed in more

detail in this section.

To test the performance by using this approach, the clusters must have similar specifications

and must be similar to the machines used to test previous versions. Otherwise, the results cannot

be directly compared as they can differ due to the machine used. For this reason, all used cluster

machines are an Intel Core i7 QuadCore with 6Gb of RAM.

This cluster-based solution was tested using 2, 3 and 4 clusters. Clustering occurs on pre-

processment phase. As consequence, it is expected that the time wasted on this phase reduces

when more clusters are added to the process. Also, it is expected that RAM usage also reduces on

the client as pre-processment, which uses recursivity, is splitted by the clusters.

2 Clusters 3 Clusters# Records Time (s) RAM Time (s) RAM

20k


1.3Gb


1.3Gb

100k


1.7Gb


1.65Gb

600k


2.5Gb


2.4Gb

4 ClustersTime (s) RAM

20k


1.2Gb

100k


1.7Gb

600k


2.4Gb

Table 5.13: Performance using cluster based solution.

77

Results

The results for these tests are presented in Table 5.13.

In terms of speed of operation, we notice a significant improvement. With 2 clusters the time

spent on pre-processment phase is 6.82 seconds, while with 4 clusters it is 2.62 seconds. This

means that the elapsed time had a decrease of 61.58%. Not only this step has been affected by

clustering but also export and validation phases, which had a decrease of aproximately 42% when

compared to the version that uses streams. This happens because data received from clusters is

stored into an array of strings instead of a stream, which is used in subsequent phases (without the

need to create another copy of it). This way, the iteration over the collection during export and

validation is a lot faster than if it was done using a stream. Although these steps are also influenced

by this cluster based solution, they are not influenced by the number of clusters that are running.

So, even there are more clusters running, the elapsed time in these steps remains similar.

In terms of memory used, which was one of the most limiting factors on the first version of

this solution, a significant improve is noticed. From Table 5.13 we notice an increase of 0.7Gb

from 100k records to 600k records. This is a good improve when compared to version that uses

streams (in this case, the increase was 1.1Gb) and even better when compared to first version (gets

out of memory). This reduce in the amount of memory used is justified by the fact that recursivity

is splitted over the clusters and only one copy of the dataset is stored.

5.2.4.1 Limitations

# Records Memory RAM100k 1.7Gb200k 1.8Gb300k 1.95Gb400k 2.1Gb500k 2.25Gb600k 2.4Gb

Table 5.14: Memory usage evolutionwith dataset size.

Table 5.15: Memory usage tendency chart for cluster basedsolution.

To estimate the maximum number of records that can be handled by this solution, it was tested

the solution using different sizes for the dataset. As well as in previous version, the maximum size

depends on the number of fields each document in the collection has and a worst case scenario

was used on this estimation (dataset with more than 60 fields). The results are presented in Table

5.14 and a tendency line in Fig. 5.15. From the tendency line equation - y = 0.1429x + 1.533 - and

assuming that the available RAM is 6Gb, we estimate the maximum size is 3.1M records. From the

78

Results

table, we can infer that, for every 100k, it is required an average of 140Mb RAM. By comparing

these values with previous version, we conclude there was a decrease of 30% in memory usage

(200Mb against 140Mb).

5.2.5 Comparison

In this section, the three versions are directly compared in terms of speed and resources.

5.2.5.1 Processing time

Speed of operation is one important aspect to take into account. Table 5.16 presents the elapsed

time for each version. On this table, one of the most notorious aspects is the Out of Memory in

initial version for more than 100k records. However, if we focus on 20k records, the initial version

is the best in terms of performance, with a duration of 10s. It is even faster than cluster-based

solution (at least with up to 4 clusters). For this reason, initial version is the best choice when

anonymizing small datasets.

For bigger datasets, and now comparing streams version with cluster-based version, we notice

a big difference between these two versions. Cluster-based solution is notoriously faster than

stream version. However, it has as disadvantage: the need of clusters to anonymize the dataset. For

this reason, cluster-based solution is recommended for bigger datasets when there is the possibility

to create clusters. Otherwise, stream version is the way to go.

If we look in more detail to the time spent in every step of the process for each version (Tables

5.9, 5.11 and 5.13), we notice the main weak point in cluster-based solution is when converting

back and exporting the anonymized dataset. If we compare the time spent in this step with the

initial version, we quickly conclude that this is the factor that makes the initial version faster than

the cluster-based version (when up to 4 clusters). For bigger datasets, this performance issue is

even more notorious as the majority of time is lost during this phase (more than 65%).

# Records Initial Version Streams version Cluster-based Version2 Clusters 3 Clusters 4 Clusters

20k 10s 29s 16s 14s 12s100k Out of Memory 130s 56s 51s 49s600k Out of Memory 798s (13 min) 327s (approx. 5min20s) 308s (approx. 5 min) 296s (approx. 4min50s)

Table 5.16: Elapsed time comparison.

5.2.5.2 Memory Usage

Memory usage is one of the most important aspects to take into account as it is the most limiting

factor for the anonymization process.

Initial Version Streams version Cluster-based Version84k 1.66M 3.1M

Table 5.18: Dataset size limitation comparison.

79

Results

# Records Initial Version Streams version Cluster-based Version20k 1.7Gb 2.2Gb 1.2Gb100k Out of Memory 2.5Gb 1.7Gb600k Out of Memory 3.6Gb 2.4Gb

Table 5.17: Memory usage comparison.

Table 5.17 presents the memory usage for each version and Table 5.18 the limitation in terms

of dataset size. These tables show more clearly the evolution and the improvement done since

initial version, passing from (1) out of memory for 100k records to wasting only 2.4Gb for 600k

records and (2) maximum of 84 thousand records to 3.1 million records. This means an increase

of 3590% on the dataset size that can be handled.

Concluding, the above said allows us to conclude that there has been an enormous improve in

terms of performance since the initial version. Although there has been a big improve, all versions

have their own advantages and can be used in different situations: (1) the first version can be used

for small datasets, as it is fast and does not require clusters, (2) the second version is recommended

for bigger datasets when no clusters available and (3) the third version is recommended if there

are clusters available as it is fast and can handle bigger datasets than the other versions.

5.3 Validation

Due to cross information, anonymization process can not ensure with 100% certainty that the

resulting dataset does not compromise individual privacy. It can only ensure that using a given

configuration, the privacy model is ensured within the anonymized collection. For this reason,

validating that an anonymized dataset does not compromise individual privacy is not trivial and

we can only validate if the privacy model is fulfilled. However, our solution takes advantage of an

API that contains the implementation of some privacy models and a well-known anonymization

algorithm. Therefore, it is essential to ensure and validate that the resulting dataset fulfills the

selected privacy model. With this in mind, k-anonymity and `-diversity were implemented to be

used after the anonymization finishes. This allows to validate that the privacy model is guaranteed

in the anonymized dataset.

In order to validate the anonymization process results, three tests have been done:

1. Test Description: apply the anonymization process to an already anonymized dataset, using

exactly the same configuration file.

Expected Result: resulting dataset must be equal to the input dataset and quality metrics

must equal 0.

2. Test Description: calculate k-anonymity and `-diversity for the anonymized dataset.

Expected Result: value for k and ` is greater than the one passed in the configuration file.

80

Results

MongoDB collection # Records Loss Metric Precision MetricNon-anonymized 100k 2.4% 4.4%Anonymized v1 100k 9.60E-6 9.78E-5Anonymized v2 100k 0.0 0.0

Table 5.19: Quality metrics returned from anonymization applied to anonymized version of Pre-scribed Medications dataset.

3. Test Description: apply some questions used for research purposes. Compare answers for

non-anonymized and anonymized datasets.

Expected Result: the results for this test are not too concise. However, it is expected the

answers to be satisfatory.

Apart from these tests to validate the results, unit tests were implemented to ensure, under

supervisioned environment, the algorithms implemented and the anonymization process result.

5.3.0.1 Anonymize already anonymized dataset

A dataset that has been anonymized must fulfill the required properties to be considered anonymized.

This properties are passed via configuration file. Therefore, if an anonymized dataset is passed to

the anonymization process, the resulting dataset must be equal to the initial dataset (which is al-

ready anonymized). Also, the quality metrics must equal 0 as there is no difference between initial

and resulting dataset.

This verification allows to ensure that the anonymization is done right and that the anonymized

dataset is anonymized correctly. It also allows to validate quality metrics calculation because it

must return 0.

In order to check the equality of both collections, a MongoDB script was created, which

receives as input the two collections to compare. The script starts by comparing the size of the

datasets. If they have different sizes, it returns false. After that, it iterates through each document

on the first collection and checks if the second collection contains that document. If it is not found,

returns false. If it reaches the end of the collection, it returns true.

Three sequential anonymizations were done, starting with a non-anonymized collection and

ending by anonymizing an already anonymized collection. Quality metrics calculated in each

iteration are presented in Table 5.19.

As expected, for the first anonymization, which was applied to a non-anonymized dataset, the

calculated information loss is greater than 0% for both metrics. From these results, we conclude

that some generalizations and suppressions were applied to the input dataset and that it was not

anonymized before the process.

After having the first anonymized dataset, another anonymization was done. This time, the

input dataset is the anonymized one. The expected result is to obtain information loss = 0%.

Although the value was almost 0% (9.6E-6 for Loss Metric), it was not exactly 0%. It means

some generalization or suppression was done, in contrast to what was expected. As the value was

81

Results

Configuration # Records k-anonymity `-diversityk = 2 100k 2 1k = 4 100k 4 1` = 2 100k 2515 2` = 4 100k 4962 4

Table 5.20: k and ` returned from anonymization Prescribed Medications dataset.

too low, another anonymization was applied to this second anonymized version to check if it was

tending to 0. This time, the value corresponded to the expected: 0. So, with these results, we

confirm the anonymized dataset is correctly anonymized.

5.3.0.2 k-anonymity and `-diversity calculation

As our solution uses the API provided by ARX tool, another validation that must be taken into

account is the calculation of k and `. Therefore, at the end of the process, both privacy models are

calculated on the resulting dataset.

In order to correctly validate k and `, both the dataset and the configuration files must be the

same for all tests. Two configuration files were used: one to test k-anonymity, another to test

`-diversity. As dataset, it was used Prescribed Medications collection with 100k records.

For k-anonymity validation, it is expected that the resulting k equals the one passed on the

configuration file. For `-diversity validation, it is expected that (1) the resulting ` equals the one

passed on the configuration file and that (2) k is greater or equal than `.

The results for this validation test are presented in Table 5.20. From these results, we conclude

that (1) for k configuration, the resulting k equals the input value and (2) for ` configuration, the

resulting ` equals the input value and the resulting k is greater than `.

These results allow to validate the API privacy model calculation and also that the output

dataset complies with the requirements set by the admin.

5.3.0.3 Research questions

In order to validate if data is suitable for research purposes, some questions were made to the

anonymized dataset and its results analysed. By comparing the answers given by both datasets

(anonymized and non-anonymized), it is possible to have better perception of quality of generated

data. These questions are representative of those used by researchers.

In this section, the anonymized datasets are generated using k=2.

First Question: How many distinct prescribed medications are there?

This question allows to find out how many different medications were prescribed. The com-

parison between initial and anonymized dataset is useful to find out if generalizations were done

and if specificity was lost during anonymization process. If specificity is lost, then the number of

distinct prescribed medications on the anonymized dataset is less than on the initial dataset.

82

Results

To answer this question, the following query was applied to Prescribed Medications collection:

1 db . c o l l e c t i o n . d i s t i n c t (PRODUCT_CODE) . l e n g t h

The results were satisfatory, such that the result was equal for both anonymized and non-

anonymized dataset. This means that for research purposes, there is no information loss for this

question.

Second Question: How many distinct diagnostics are there?

As well as previous question, this question allows to find out if diagnostics specificity was lost

during anonymization process.

To answer this question, the following query was applied to Prescribed Medications collection:

1 db . c o l l e c t i o n . d i s t i n c t ( DIAGNOSTIC_ATTRIBUTE ) . l e n g t h

As well as for the first question, the results were equal for both collections, which means there

is no information loss for research purposes in this question.

Third Question: Lowest price for each product consumption.

This question is used to find out the lowest price of a product and is useful for researchers and

pharmaceutical industry to take knowledge of the price applied by competitors.

To answer this question, the following query was applied to Consumptions collection.

1 db . c o l l e c t i o n . a g g r e g a t e ( [ { " $group " : {" _ i d " : " $PRODUCT_ATTRIBUTE" , " p r i c e " : { $min : "$TRANSACTION_UNITPRICE_ATTR " } } } , { "$ p r o j e c t " : {" p r o d u c t " : " $_ id " , " _ i d " : 0 , " p r i c e " : 1}} ,{ $match : { " p r i c e " : { $ne : " 0 " } } } ] )

For the non-anonymized dataset, the results are:

1 { " p r i c e " : " 3 7 . 5 7 " , " p r o d u c t " : "5" }2 { " p r i c e " : " 0 . 6 1 " , " p r o d u c t " : "6" }3 { " p r i c e " : " 1 . 1 5 " , " p r o d u c t " : "7" }4 { " p r i c e " : " 4 7 7 . 9 5 " , " p r o d u c t " : "3" }5 . . . .

For the anonymized dataset, the results are:

1 { " p r i c e " : " 2 5 3 . 8 8 " , " p r o d u c t " : "1" }2 { " p r i c e " : " 1 0 4 2 . 5 3 " , " p r o d u c t " : "2" }3 { " p r i c e " : " 4 7 7 . 9 5 " , " p r o d u c t " : "3" }4 { " p r i c e " : " 2 . 0 9 " , " p r o d u c t " : "4" }5 . . . .

As the result list is too large, only some representative elements are shown.

83

Results

From the analysis of these results, we notice the anonymized dataset can answer the question

with the same precision as the initial dataset. If we compare the results in more detail, we notice

product "3" returns the same value in both datasets. This complies with the previous said and al-

lows us to infer that the anonymized table can be used to fetch satisfatory results for this question.

Fourth Question: Products consumption variations over time.

This question allows researchers to learn when a product is frequently more consumed, which

is useful to optimize production and prevent shortage of resources.


1 db . c o l l e c t i o n . a g g r e g a t e ( [ { " $group " : {" _ i d " : {" month " : " $TRANSACTION_MONTH" , " y e a r " : "$TRANSACTION_YEAR" , " day " : "$TRANSACTION_DAY" , " p r o d u c t " : " $PRODUCT_FIELD " } , " c o u n t " : { $sum : 1}}} ,{" $ p r o j e c t " : {" d a t e " : " $_ id " , " _ i d " : 0 , " c o u n t" : 1 } } , { $match : {" d a t e . p r o d u c t " : " PRODUCT_ID_VALUE " } } ] )


1 { " c o u n t " : 4 , " d a t e " : { " month " : 10 , " y e a r " : 2014 , " day " : 22 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }2 { " c o u n t " : 8 , " d a t e " : { " month " : 10 , " y e a r " : 2014 , " day " : 3 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }3 { " c o u n t " : 6 , " d a t e " : { " month " : 10 , " y e a r " : 2014 , " day " : 16 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }4 { " c o u n t " : 4 , " d a t e " : { " month " : 10 , " y e a r " : 2014 , " day " : 14 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }5 { " c o u n t " : 2 , " d a t e " : { " month " : 10 , " y e a r " : 2014 , " day " : 7 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }6 { " c o u n t " : 5 , " d a t e " : { " month " : 10 , " y e a r " : 2014 , " day " : 31 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }7 { " c o u n t " : 5 , " d a t e " : { " month " : 10 , " y e a r " : 2014 , " day " : 29 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }8 { " c o u n t " : 4 , " d a t e " : { " month " : 10 , " y e a r " : 2014 , " day " : 17 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }9 { " c o u n t " : 4 , " d a t e " : { " month " : 9 , " y e a r " : 2014 , " day " : 30 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }

10 { " c o u n t " : 6 , " d a t e " : { " month " : 9 , " y e a r " : 2014 , " day " : 29 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }11 { " c o u n t " : 3 , " d a t e " : { " month " : 9 , " y e a r " : 2014 , " day " : 24 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }12 { " c o u n t " : 5 , " d a t e " : { " month " : 9 , " y e a r " : 2014 , " day " : 22 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }13 { " c o u n t " : 5 , " d a t e " : { " month " : 9 , " y e a r " : 2014 , " day " : 19 , " p r o d u c t " : "PRODUCT_ID_VALUE" } }1415 . . .


1 { " c o u n t " : 114 , " d a t e " : { " month " : " 1 " , " y e a r " : " 2 0 1 4 " , " p r o d u c t " : "PRODUCT_ID_VALUE" } }2 { " c o u n t " : 100 , " d a t e " : { " month " : " 2 " , " y e a r " : " 2 0 1 4 " , " p r o d u c t " : "PRODUCT_ID_VALUE" } }3 { " c o u n t " : 97 , " d a t e " : { " month " : " 3 " , " y e a r " : " 2 0 1 4 " , " p r o d u c t " : "PRODUCT_ID_VALUE" } }4 { " c o u n t " : 117 , " d a t e " : { " month " : " 4 " , " y e a r " : " 2 0 1 4 " , " p r o d u c t " : "PRODUCT_ID_VALUE" } }5 { " c o u n t " : 102 , " d a t e " : { " month " : " 5 " , " y e a r " : " 2 0 1 4 " , " p r o d u c t " : "PRODUCT_ID_VALUE" } }6 { " c o u n t " : 102 , " d a t e " : { " month " : " 6 " , " y e a r " : " 2 0 1 4 " , " p r o d u c t " : "PRODUCT_ID_VALUE" } }7 { " c o u n t " : 108 , " d a t e " : { " month " : " 7 " , " y e a r " : " 2 0 1 4 " , " p r o d u c t " : "PRODUCT_ID_VALUE" } }8 { " c o u n t " : 110 , " d a t e " : { " month " : " 8 " , " y e a r " : " 2 0 1 4 " , " p r o d u c t " : "PRODUCT_ID_VALUE" } }9 { " c o u n t " : 107 , " d a t e " : { " month " : " 9 " , " y e a r " : " 2 0 1 4 " , " p r o d u c t " : "PRODUCT_ID_VALUE" } }

10 { " c o u n t " : 110 , " d a t e " : { " month " : " 1 0 " , " y e a r " : " 2 0 1 4 " , " p r o d u c t " : "PRODUCT_ID_VALUE" } }

In contrast to previous questions, we notice the results are different when applying the query

to anonymized and non-anonymized dataset. From the analysis of both results, we notice the non-

anonymized dataset has high level of precision - daily variation. The anonymized dataset does not

have as high level of precision - the best precision it can handle is monthly variations.

So, we conclude that the anonymized dataset can answer this question for periods of time of

more than 1 month. As these results are used to increase/decrease production according to average

84

Results

consumption and production is usually controlled on a monthly basis, the results retrieved from

the anonymized dataset can be considered satisfatory.

Fifth Question: For how long has a patient consumed a medication?

This question allows to find the average time to use a medication and compare it with other

similar products.


1 db . c o l l e c t i o n . a g g r e g a t e ( [ { $match : {"PRODUCT_FIELD " : "PRODUCT_VALUE" , "TRANSACTION_MONTH_FIELD " : 1 , " PATIENT_ID_FIELD " : "PATIENT_ID_VALUE " , "TRANSACTION_YEAR_FIELD" : 2 0 1 3 } } , { $group : { " _ i d " : { day : "$TRANSACTION_DAY_FIELD" , month : "$TRANSACTION_MONTH_FIELD" , y e a r : "$TRANSACTION_YEAR_FIELD" , p a t i e n t : " $PATIENT_ID_FIELD " , p r o d u c t : "$PRODUCT_FIELD" } , c o u n t : {$sum : 1 } } } , { $ s o r t : { c o u n t : −1}}] , { a l lowDiskUse : t r u e } )


1 { " _ i d " : { " day " : 28 , " month " : 1 , " y e a r " : 2013 , " p a t i e n t " : "PATIENT_ID_VALUE " , " p r o d u c t " : "PRODUCT_VALUE" } , "c o u n t " : 13 }






7 . . .


1 { " _ i d " : { " month " : " 1 " , " y e a r " : " 2 0 1 3 " , " p a t i e n t " : "e129d1f9c34294f91676a1b32ea785a48be63b8d398cc70617fd4c4fbc471a07 " , " p r o d u c t " : "PRODUCT_VALUE" } , " c o u n t " : 79 }

2 { " _ i d " : { " month " : " 1 " , " y e a r " : " 2 0 1 3 " , " p a t i e n t " : "f f3 f1d13d842d32525a389e569b4d28694df3cebcb682c545dd78caae7636a32 " , " p r o d u c t " : "PRODUCT_VALUE" } , " c o u n t " : 28 }

3 { " _ i d " : { " month " : " 1 " , " y e a r " : " 2 0 1 3 " , " p a t i e n t " : "c99c1c79de4c4d280f5a0d901ae6a0d263b57d3563a2e38bf2f9064a76b5539b " , " p r o d u c t " : "PRODUCT_VALUE" } , " c o u n t " : 27 }

4 { " _ i d " : { " month " : " 1 " , " y e a r " : " 2 0 1 3 " , " p a t i e n t " : "42bb7f05c75302db7d0ae7bb7c4d9d297522b95c864cc7c1464cd58f0a2b2723 " , " p r o d u c t " : "PRODUCT_VALUE" } , " c o u n t " : 22 }

5 . . . .

From the analysis of both results, we notice there is some information loss in the anonymized

dataset. In the non-anonymized dataset, it is possible to find out how many days a patient con-

sumed a medication. In the anonymized dataset, this is not possible as days are suppressed. For

this reason, results are grouped in a monthly basis.

Another important thing to notice is that in the non-anonymized dataset the patient identifi-

cation is shown and completely available. For this reason, in the non-anonymized dataset it is

possible to directly identify an individual. In contrast, in the anonymized dataset, the patient iden-

tification is encrypted, allowing to find relations inside the database while making it impossible to

directly identify the individual.

85

Results

Concluding, although less precision and more information loss, the anonymized version allows

to answer this question for a large portion of the cases. However, if it is required to answer with

more precision, this anonymized version does not allow it and privacy may be compromised.

5.4 Summary

In this chapter, all versions of the solution have been submitted to a wide range of tests to evaluate

them in terms of performance as well as in terms of limitations.

First, information loss is an important aspect when talking about data anonymization. With

this in mind, several privacy model configurations and dataset sizes were used to study its impact

on information loss. Regarding dataset size, we concluded that information loss is inversely pro-

portional to the size of datasets (bigger datasets imply lower information loss). Regarding privacy

configuration, it showed obvious the difference in information loss between k-anonymity and `-

diversity, such that `-diversity implies much more information loss than k-anonymity. This is not

strange as `-diversity can be seen as an extension of k-anonymity with more requirements.

Second, performance is also an important factor to analyze and find out how well the solution

works and its limitations. With this in mind, all versions have been analysed in terms of memory

usage and speed of operation, as well as its limitations in terms of dataset size and memory. With

these tests, we concluded that (1) initial version is faster because it stores everything into memory.

However, it requires a high amount of memory, which means it can not handle big datasets. (2)

First optimized version (with streams) solves the problem of memory consumption, reducing it

in a high factor. However, it takes more time to complete the process as data is processed using

streams instead of in memory. (3) Second optimized version (cluster-based) improves speed of

operation as processment is splitted over several clusters - more clusters mean better speed results

- and improves memory consumption, as it splits recursivity over clusters and the process receives

the dataset already processed. Concluding, the solution with better balance speed vs memory

usage is the cluster-based one. However, it requires clusters to run. For this reason, if no clusters

are available, we can conclude that initial version is preferrable for smaller datasets as it is faster

than streaming version, while second version is preferrable for bigger datasets as it can handle

more data even it takes more time to conclude.

Finally, results were validated by ensuring k-anonymity and `-diversity is present in the anonymized

dataset and by ensuring that if an anonymized dataset is passed to the process, the return equals the

input. With these validations, we can conclude and ensure the API algorithm works as expected

and that the results correspond to the desired. In order to have more perception on the applicability

of data for research purposes, the results for some real questions were analyzed. From this anal-

ysis, we concluded the results were satisfatory for all questions, allowing to validate the dataset

applicability for research purposes.

In conclusion, the resulting solution allows to easily handle both required collections and

shows good scalability being only dependent on the configuration file and not on the collection

to anonymize. In terms of data capacity, the solution showed more than enough to handle the

86

Results

average of data that is anonymized each month. In terms of speed of operation, a balance with the

amount of memory required by the process was made. The results showed more than sufficient for

the context in which the solution must run (monthly basis) and fulfills the expectations. However,

there is still some space of improvement on the future on this line of performance.

87

Results

88

Chapter 6

Web and Desktop Application

In previous chapter, results for anonymization process module are presented. This module is used

by other two modules: web and desktop application. In this chapter, the resulting work for these

two modules are shown and explained in more detail.

6.1 Web application

The web application intends to automate the process of anonymization and provide the user with

some useful analytics.

In this section, it is presented the web application workflow and how it is useful for the admin-

istrator to control the process of anonymization.

The core functionalities of the web service are:

1. Create and start anonymization process.

2. View anonymization results.

3. View process information loss.

4. View useful analytics.

6.1.0.1 Create and start anonymization process

In order to anonymize a collection, it is required to initially create a new connection. To create a

new connection it is required to provide some information: (1) a name for the connection, (2) host

to connect, (3) collection to anonymize, (4) database in which the collection is stored, (5) database

and collection to store the results and (6) date field used to run periodically the anonymization.

After creating the connection, the administrator can chose to automatically anonymize the col-

lection with a specified periodicity or manually anonymize it by clicking a single button. Before

the anonymization process starts, the administrator must upload a configuration file, which can be

created using the desktop app. The connection can also be removed and edited.

These functionalities are represented in Fig. 6.1, which is a screenshot of the application.

89


Figure 6.1: Create new connection (left). Start anonymization (right).

Figure 6.2: Anonymization ended notification.

When the anonymization process ends, a notification is sent to the client with the anonymiza-

tion result - failure or success (Fig. 6.2). This notification takes advantage of Pusher, which is a

service for realtime applications.

6.1.0.2 View anonymization results

Anonymization process returns as result some information related to the process: (1) elapsed

time, (2) k-anonymity and `-diversity calculation result, (3) quality metrics, (4) number of records

anonymized and (5) anonymization status code. These results can be seen later on by the admin-

istrator.

Figure 6.3: List of results (left). Results for specific anonymization process (right).

90


Fig. 6.3 presents the results view. On top of the page, there is a list with all anonymizations

done and the corresponding result. Below that list, the results for a specific anonymization process

are shown: (1) preview of the resulting dataset, (2) information loss and (3) k-anonymity and

`-diversity values.

This page is useful as it allows the administrator to view in more detail the results for a specific

anonymization. First, the preview allows to view some documents of the anonymized dataset and

have more perception of differences between initial and final versions. Second, k-anonymity and

`-diversity allow to validate the selected privacy model is fulfilled. Third, the information loss

allows to evaluate how good is the dataset created for research purposes.

6.1.1 View process information loss

An important aspect of the anonymization process is information loss. The web application pro-

vides the user with an useful view which allows to analyze information loss. Fig. 6.4 shows how

this view looks like. From this figure, we notice two main information are provided: (1) aver-

age information loss for two distinct metrics (Information loss and Precision) and (2) chart that

represents the relation between information loss and number of records anonymized.

These two informations are important because it allows to understand how information loss

is affected by dataset size and estimate the information loss for a specific size. Also, it allows to

have a good perception on the overall information loss and easily notice when some adjustments

are required to improve this aspect.

Figure 6.4: Information loss view.

6.1.2 View useful analytics

The web application also provides some general analytics, which are shown in the dashboard (Fig.

6.5). These analytics provide information about:

1. Number of sucessful anonymization as well as failures.

2. Last and next anonymization date - useful to control when anonymization process occurs.

3. Average of elapsed time and records anonymized - allows the administrator to have knowl-

edge about the number of records being stored in the database during the specified period

91


Figure 6.5: Web application dashboard with provided analytics.

(month or week). It also allows the administrator to predict the time taken to anonymize a

collection.

4. Usage over time - number of anonymizations done, each month, over last year. If it differs

from other months and it is not supposed, the administrator quickly notices something wrong

occured.

5. Anonymizations over time - number of failed and succeeded anonymizations, each month,

over last year. This chart allows the administrator to quickly notice when an error occurs on

an anonymization. This is useful to keep under control errors and unexpected behaviors.

6. Time taken and number of records anonymized over time - allows the administrator to eval-

uate the evolution of records being stored in the database over time. Also allows to quickly

notice an irregular variation in time on the anonymization process.

6.2 Desktop application

The desktop application intends to facilitate the creation of a configuration file and to easily access


This module provides a simple interface with some basic functionalities:

1. Connect to a mongo collection and view document hierarchy (Fig. 6.6) - it is required in-

formation about the host, database and collection. After connecting, the hierarchy is shown

under "Document Structure" section. From this hierarchy, it is possible to create the config-

uration file by selecting each attribute and choosing the correct type, hierarchy and configu-

ration.

2. Create, save and load a configuration file - after configuring each attribute in the document,

the configuration file can be saved into a JSON file and loaded later on.

3. Create mongo query (Fig. 6.7) - a query to fetch part of the collection can be specified. This

query is saved into the configuration file.

92


4. Start anonymization process.

5. Preview and export results (Fig. 6.8) - after the anonymization ends, it is shown a preview

of the resulting dataset under "Anonymization Result" section. Results can be exported to a

JSON file or to a Mongo collection.

Figure 6.6: Connect (left) and create the configuration file (right).

Figure 6.7: Create query.

Figure 6.8: Preview and export results.

93


94

Chapter 7

Conclusions and Future Work

Sharing data for research purposes brings several benefits to all industries as it allows to find trends

and statistics, which are useful to all of us. In particular, healthcare is one of the main industries

when talking about data sharing.

Due to its importance, the problem of clinical data sharing for research purposes while keeping

privacy disclosure has been covered. Several tools, privacy models and concepts on the field of

anonymization have been discussed and analysed in detail, which allowed to propose and imple-

ment a solution for clinical data anonymization.

In the first part of this dissertation, an analysis on the state of the art has been done and several

concepts, algorithms, models and tools have been presented for protecting privacy while allowing

data share for research purposes. Also, different types of clinical data have been analysed to better

understand which hierarchies must be created and how fields must anonymized to ensure privacy.

This first part looked essential to understand how the problem of anonymization could be solved

and allowed to propose a possible solution to correctly anonymize clinical data.

In the second part, a solution has been proposed based on the analysis done on the state of

the art. This solution takes advantage of an API provided by ARX anonymization tool, which

has the implementation of Flash algorithm. As input, the solution receives a mongo collection,

which is processed to create a data structure and hierarchies that are compatible with the provided

API. At the end of the process, the anonymized version of the dataset is exported to a new mongo

collection. At this point, with a first implementation done, an initial test has been done and some

weak points were found. These weak points allowed to propose some optimizations to the initial

solution. These optimizations passed from using streams to process data, allowing to reduce the

amount of memory that is used by the process, and from using clusters associated with streams,

which allowed to reduce memory usage and improve speed of operation.

The anonymization will be done with a monthly frequency. For this reason, an web applica-

tion has been proposed to automate the process and take control of the process by providing the

user with some useful analytics. These analytics allow the administrator to evaluate important

aspects of the anonymization process, such as information loss, average of time taken and average

95


of dataset size. A desktop application was also created intending to facilitate the creation of con-

figuration files and to enable the user to run the anonymization process as a standalone application.

At the end of the dissertation, the solution has been tested in more detail. First, impact of

privacy models configuration and dataset size in information loss has been studied. After that,

performance has been studied in detail by analysing each version of the solution and by comparing

them in terms of memory usage, speed of operation and maximum dataset size. At the end, the

solution has been validated to ensure the resulting dataset is complying with the requirements

passed to the anonymization process. These tests applied to the process allowed to validate the

solution and to ensure all requirements specified on Chapter 3 were accomplished.

Concluding, the research on the topic of clinical data anonymization allowed to better define

the problem: despite the importance of clinical data research, privacy is crucial. Defined the prob-

lem and in order to mitigate the requirements specified, a solution was proposed and implemented

to allow this data sharing while keeping privacy.

7.1 Future Work

The solution proposed during this dissertation raised some questions and several aspects which

should be pursued.

Firstly, although the available hierarchies can handle both collections specified on the require-

ments, some other hierarchies may still be needed to anonymize other collections. For this reason,

more pre-defined hierarchies should be added and those which already exist should be improved

to increase specificity and decrease information loss. In this line of work, a possible solution to

automatize the process of learning new clinical dictionaries and hierarchies would be to implement

some machine learning algorithms that allows the process to improve its results day by day.

Secondly, in terms of performance, there is still space for improvement. In first place, current

solution is taking too long to convert back and export the anonymized collection to mongo. This

phase is currently taking more than 65% of the elapsed time. So, it is obvious the need to find

possibilities to solve this "problem". In second place, although the requirement "the solution

must handle at least the average of records stored each month" is fulfilled, reducing memory

consumption is also important. In this line of work, a possible improvement and future work could

be to replace recursivity with an iterative approach during convertion of hierarchical structure into

a non-hierarchical structure.

Third, mongo files are usually stored in plain text. Consequently, if an attacker accesses the

initial collection, he has access to all data without much effort. This is forbidden and it is now

required these files to be encrypted in order to prevent from this to happen. However, this encryp-

tion must not change the results retrieved from queries, which should come unencrypted. With

this in mind, another important work that should be done in the future is to find a feasible solution

to encrypt these files and fulfill this requirement.

Finally, during this dissertation only the anonymization process was analysed and not the entire

system in which it must operate on. A next step could be to analyse the system as a whole and find

96


ways and technologies to create a sustentable workflow to populate the database and process data.

This is essential to find the best way of integrating this solution with the already existing solutions

on the company. Here, it would be good to implement and use Big Data technologies to provide

better results and to work on a scalable and distributed system.

97


98

References

[AFK+06] Gagan Aggarwal, Tomás Feder, Krishnaram Kenthapadi, Samir Khuller, Rina Pan-igrahy, Dilys Thomas, and An Zhu. Achieving anonymity via clustering. In Pro-ceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Prin-ciples of database systems, pages 153–162. ACM, 2006.

[ARMCM14] Vanessa Ayala-Rivera, Patrick McDonagh, Thomas Cerqueus, and Liam Murphy.A systematic comparison and evaluation of k-anonymization algorithms for practi-tioners. Transactions on Data Privacy, 7(3):337–370, 2014.

[arxa] Arx data anonymization tool. http://arx.deidentifier.org. cited 24 Jan2017.

[arxb] Arx data anonymization tool - api. http://arx.deidentifier.org/api/.cited 24 Jan 2017.

[arxc] Arx data anonymization tool - overview. http://arx.deidentifier.org/overview/. cited 24 Jan 2017.

[arxd] Arx data anonymization tool - privacy models. http://arx.deidentifier.org/overview/privacy-criteria/. cited 20 Jan 2017.

[ASMN12] Dr R Sugumar Anbazhagan, R Sugumar, M Mahendran, and R Natarajan. An effi-cient approach for statistical anonymization techniques for privacy preserving datamining. International Journal of Advanced Research in Computer and Communi-cation Engineering, 1(7):482–485, 2012.

[BA05] Roberto J Bayardo and Rakesh Agrawal. Data privacy through optimal k-anonymization. In Data Engineering, 2005. ICDE 2005. Proceedings. 21st In-ternational Conference on, pages 217–228. IEEE, 2005.

[BFW+10] CM Benjamin, M Fung, KE Wang, R Chen, and PS Yu. Privacy-preservingdata publishing: A survey of recent developments. ACM Computing Surveys,42(4):141–153, 2010.

[BKBL07] Ji-Won Byun, Ashish Kamra, Elisa Bertino, and Ninghui Li. Efficient k-anonymization using clustering techniques. In International Conference onDatabase Systems for Advanced Applications, pages 188–200. Springer, 2007.

[BRK+13] Korra Sathya Babu, Nithin Reddy, Nitesh Kumar, Mark Elliot, and Sanjay KumarJena. Achieving k-anonymity using improved greedy heuristics for very large rela-tional databases. Transactions on Data Privacy, 6(1):1–17, 2013.

99

http://arx.deidentifier.org

http://arx.deidentifier.org/api/

http://arx.deidentifier.org/overview/

http://arx.deidentifier.org/overview/

http://arx.deidentifier.org/overview/privacy-criteria/

http://arx.deidentifier.org/overview/privacy-criteria/

REFERENCES

[Byu07] Ji-Won Byun. Toward privacy-preserving database management systems—Accesscontrol and data anonymization. ProQuest, 2007.

[EEDI+09] Khaled El Emam, Fida Kamal Dankar, Romeo Issa, Elizabeth Jonker, DanielAmyot, Elise Cogo, Jean-Pierre Corriveau, Mark Walker, Sadrul Chowdhury,Regis Vaillancourt, et al. A globally optimal k-anonymity method for the de-identification of health data. Journal of the American Medical Informatics As-sociation, 16(5):670–682, 2009.

[GAZ+14] Olga Gkountouna, Sotiris Angeli, Athanasios Zigomitros, Manolis Terrovitis, andYannis Vassiliou. K m-anonymity for continuous data using dynamic hierarchies.In International Conference on Privacy in Statistical Databases, pages 156–169.Springer, 2014.

[GDL15] Aris Gkoulalas-Divanis and Grigorios Loukides. Medical data privacy handbook.Springer, 2015.

[GKKM07] Gabriel Ghinita, Panagiotis Karras, Panos Kalnis, and Nikos Mamoulis. Fast dataanonymization with low information loss. In Proceedings of the 33rd internationalconference on Very large data bases, pages 758–769. VLDB Endowment, 2007.

[GKWW98] José Gouweleeuw, Peter Kooiman, Leon Willenborg, and Paul P de Wolf. The postrandomisation method for protecting microdata. 1998.

[HN09] Yeye He and Jeffrey F Naughton. Anonymization of set-valued data via top-down,local generalization. Proceedings of the VLDB Endowment, 2(1):934–945, 2009.

[Inc] Privacy Analytics Inc. About parat de-identification software. http://www.privacyanalytics.ca/software/parat/. cited 23 Jan 2017.

[Inc13] TransCelerate BioPharma Inc. Data de-identification and anonymization of indi-vidual patient data in clinical studies – a model approach, 2013.

[Ker13] Michael Kern. Anonymity: A formalization of privacy-l-diversity. In Seminarpaper, Technische Universität München. Citeseer, 2013.

[KIK16] Murat Kantarcioglu, Ali Inan, and Mehmet Kuzu. Ut dallas anonymization toolbox,2016.

[KPE+12] Florian Kohlmayer, Fabian Prasser, Claudia Eckert, Alfons Kemper, and Klaus AKuhn. Flash: efficient, stable and optimal k-anonymity. In Privacy, Security, Riskand Trust (PASSAT), 2012 International Conference on and 2012 InternationalConfernece on Social Computing (SocialCom), pages 708–717. IEEE, 2012.

[KPK15] Florian Kohlmayer, Fabian Prasser, and Klaus A Kuhn. The cost of quality: Im-plementing generalization and suppression for anonymizing biomedical data withminimal information loss. Journal of biomedical informatics, 58:37–48, 2015.

[KS+13] Nagendra Kumar S et al. Sensitive attributes based privacy preserving in data min-ing using k-anonymity. International Journal of Computer Applications, 84(13):1–6, 2013.

[KT12] Batya Kenig and Tamir Tassa. A practical approximation algorithm for optimalk-anonymity. Data Mining and Knowledge Discovery, 25(1):134–168, 2012.

100

http://www.privacyanalytics.ca/software/parat/

http://www.privacyanalytics.ca/software/parat/

REFERENCES

[LDR05] Kristen LeFevre, David J DeWitt, and Raghu Ramakrishnan. Incognito: Efficientfull-domain k-anonymity. In Proceedings of the 2005 ACM SIGMOD internationalconference on Management of data, pages 49–60. ACM, 2005.

[LGD12] Grigorios Loukides and Aris Gkoulalas-Divanis. Utility-preserving transactiondata anonymization with low information loss. Expert systems with applications,39(10):9764–9777, 2012.

[LLV07] Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness: Privacybeyond k-anonymity and l-diversity. In Data Engineering, 2007. ICDE 2007. IEEE23rd International Conference on, pages 106–115. IEEE, 2007.

[Ltd] Pusher Ltd. Pusher | leader in realtime technologies. https://pusher.com.cited 10 Apr 2017.

[MKGV07] Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and MuthuramakrishnanVenkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactionson Knowledge Discovery from Data (TKDD), 1(1):3, 2007.

[Mon] Inc MongoDB. Java mongodb driver. https://docs.mongodb.com/ecosystem/drivers/java/. cited 21 Apr 2017.

[NC10] Mehmet Ercan Nergiz and Chris Clifton. δ -presence without complete worldknowledge. IEEE Transactions on Knowledge and Data Engineering, 22(6):868–883, 2010.

[oHS+03] US Department of Health, Human Services, et al. Protecting personal health in-formation in research: Understanding the hipaa privacy rule. NIH Publication,(03-5388), 2003.

[ope] Open anonymizer. https://sourceforge.net/projects/openanonymizer/. cited 23 Jan 2017.

[PKLK14] Fabian Prasser, Florian Kohlmayer, Ronald Lautenschläger, and Klaus A Kuhn.Arx-a comprehensive tool for anonymizing biomedical data. In AMIA Annual Sym-posium Proceedings, volume 2014, page 984. American Medical Informatics As-sociation, 2014.

[Pod11] Benjamin Podgursky. Practical k-Anonymity on large datasets. PhD thesis, Van-derbilt University, 2011.

[Rag13] Balaji Raghunathan. The Complete Book of Data Anonymization: From Planningto Implementation. CRC Press, 2013.

[RK13] K Venkata Ramana and V Valli Kumari. Graph based local recoding for dataanonymization. International Journal of Database Management Systems, 5(4):1,2013.

[RTG00] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distanceas a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.

[SH] Jiri Sedlacek and Tomas Hurka. Visualvm. https://visualvm.github.io.cited 15 May 2017.

101

https://pusher.com

https://docs.mongodb.com/ecosystem/drivers/java/

https://docs.mongodb.com/ecosystem/drivers/java/

https://sourceforge.net/projects/openanonymizer/

https://sourceforge.net/projects/openanonymizer/

https://visualvm.github.io

REFERENCES

[Swe] Latanya Sweeney. Database security: k-anonymity. http://latanyasweeney.org/work/kanonymity.html.

[Swe98] Latanya Sweeney. Datafly: A system for providing anonymity in medical data. InDatabase Security XI, pages 356–381. Springer, 1998.

[Swe02a] Latanya Sweeney. Achieving k-anonymity privacy protection using generalizationand suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):571–588, 2002.

[Swe02b] Latanya Sweeney. k-anonymity: A model for protecting privacy. InternationalJournal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570,2002.

[SXF10] Pu Shi, Li Xiong, and Benjamin Fung. Anonymizing data with quasi-sensitiveattribute values. In Proceedings of the 19th ACM international conference on In-formation and knowledge management, pages 1389–1392. ACM, 2010.

[Tho07] Dilys Thomas. Algorithms and architectures for data privacy. PhD thesis, StanfordInfoLab, 2007.

[TMK08] Manolis Terrovitis, Nikos Mamoulis, and Panos Kalnis. Privacy-preservinganonymization of set-valued data. Proceedings of the VLDB Endowment, 1(1):115–125, 2008.

[TMK11] Manolis Terrovitis, Nikos Mamoulis, and Panos Kalnis. Local and global recodingmethods for anonymizing set-valued data. The VLDB Journal—The InternationalJournal on Very Large Data Bases, 20(1):83–106, 2011.

[TMK13] Matthias Templ, Bernhard Meindl, and Alexander Kowarik. Introduction to statis-tical disclosure control (sdc). Project: Relative to the testing of SDC algorithmsand provision of practical SDC, data analysis OG, 2013.

[uar] u-argus. http://neon.vb.cbs.nl/casc/mu.htm. cited 20 July 2017.

[utd] Utd anonymization toolbox. UTDallasDataSecurityandPrivacyLab. cited23 Jan 2017.

[XWP+06] Jian Xu, Wei Wang, Jian Pei, Xiaoyuan Wang, Baile Shi, and Ada Wai-Chee Fu.Utility-based anonymization for privacy preservation with less information loss.Acm Sigkdd Explorations Newsletter, 8(2):21–30, 2006.

102

http://latanyasweeney.org/work/kanonymity.html

http://latanyasweeney.org/work/kanonymity.html

http://neon.vb.cbs.nl/casc/mu.htm

UT Dallas Data Security and Privacy Lab

Appendix A

MongoDB Collections Structure

A.0.1 Prescribed Medications

1 {2 " _ i d " : O b j e c t I d ( ) ,3 " A p p l i c a t i o n I d " : " 1 " ,4 " T r a c k i n g I d " : "5A512123KAK1234KSXN8198JAN " ,5 " I n s t i t u t i o n " : {6 " Code " : " 2 " ,7 " O r g a n i z a t i o n " : {8 " Code " : "3"9 }

10 } ,11 " V i s i t " : {12 " P a t i e n t " : {13 " Code " : " 2 " ,14 " Type " : "A" ,15 " Gender " : {16 " Gender " : "M" ,17 " I sAnon imized " : "F"18 } ,19 " B i r t h Y e a r " : {20 " BeginYear " : 1900 ,21 " EndYear " : 1900 ,22 " I sAnon imized " : "F"23 } ,24 " L o c a t i o n " : {25 " CountryCode " : " 3 " ,26 " C o u n t r y D e s c r i p t i o n " : " P o r t u g a l " ,27 " D i s t r i c t C o d e " : " 1 " ,28 " D i s t r i c t D e s c r i p t i o n " : "PORTO" ,29 " I sAnon imized " : "F"30 } ,31 " L o c a t i o n C o u n t r y " : {32 " CountryCode " : " 3 " ,33 " C o u n t r y D e s c r i p t i o n " : " P o r t u g a l " ,34 " D i s t r i c t C o d e " : "ANON" ,35 " D i s t r i c t D e s c r i p t i o n " : "ANON" ,36 " I sAnon imized " : "T"37 } ,38 " Gen de rSup re s sed " : {39 " Gender " : "ANON" ,40 " I sAnon imized " : "T"41 } ,42 " B i r t h Y e a r I n t e r v a l " : {43 " BeginYear " : 1890 ,44 " EndYear " : 1910 ,45 " I sAnon imized " : "T"46 } ,47 " L o c a t i o n S u p r e s s e d " : {48 " CountryCode " : "ANON" ,49 " C o u n t r y D e s c r i p t i o n " : "ANON" ,

103


50 " D i s t r i c t C o d e " : "ANON" ,51 " D i s t r i c t D e s c r i p t i o n " : "ANON" ,52 " I sAnon imized " : "T"53 } ,54 " B i r t h Y e a r S u p r e s s e d " : {55 " BeginYear " : −9999999 ,56 " EndYear " : −9999999 ,57 " I sAnon imized " : "T"58 } ,59 " CodeSupressed " : "ANON"60 } ,61 " Ep i sode " : " 1 " ,62 " EpisodeType " : " U r g e n c i a s " ,63 " I s P a t i e n t A n o n i m i z e d " : "F " ,64 " E p i s o d e S u p r e s s e d " : "ANON"65 } ,66 " P r e s c r i b i n g P h y s i c i a n " : {67 " P h y s i c i a n S p e c i a l t y C o d e " : " 2 0 0 " ,68 " P h y s i c i a n S p e c i a l t y " : " Medic ina G e r a l " ,69 " Code " : " 1 " ,70 " I sAnon imized " : "F " ,71 " P h y s i c i a n S p e c i a l t y P a r e n t " : " E s p e c i a l i d a d e Medica " ,72 " P h y s i c i a n S p e c i a l t y C o d e P a r e n t " : " " ,73 " CodeSupressed " : "ANON" ,74 " P h y s i c i a n S p e c i a l t y C o d e S u p r e s s e d " : "ANON"75 } ,76 " P r o d u c t " : {77 " P r o d u c t I d " : " 1 " ,78 " Produc tCode " : " 1 2 " ,79 " D e s c r i p t i o n " : " Ben−u−Ron " ,80 "CHNM" : " 3 2 " ,81 " Dose " : " 1 " ,82 " MeasureUni t " : "UN" ,83 " A d m i n i s t r a t i o n R o u t e D e s c " : "INAL . " ,84 " P r e s e n t a t i o n F o r m " : "AMP" ,85 "GFT" : " 5 . 1 . 1 " ,86 "ACT" : "123123" ,87 " Med ica lDev ice " : "N" ,88 " Family " : " 2 3 " ,89 " F a m i l y D e s c r i p t i o n " : " P r o d u t o s F a r m a c e u t i c o s " ,90 " Pha rmaceu t i c a lFo rmDesc " : "SOL . RESP . " ,91 " A d m i n i s t r a t i o n R o u t e " : "INAL . " ,92 " P r e s e n t a t i o n F o r m D e s c " : "AMPOLA" ,93 " P h a r m a c e u t i c a l F o r m " : "SOL . RESP . "94 } ,95 " Dosage " : {96 " Frequency " : " 6 / 6 H" ,97 " BeginDate " : {98 " Year " : 2010 ,99 " Month " : 1 ,

100 " Day " : 1 ,101 " Hour " : 1 ,102 " Minutes " : 1 ,103 " Seconds " : 0 ,104 " WeekYear " : 1 ,105 " I sAnon imized " : "F"106 } ,107 " D u r a t i o n " : 3 ,108 " BeginDateMonth " : {109 " Year " : 2010 ,110 " Month " : 1 ,111 " Day " : 0 ,112 " Hour " : 0 ,113 " Minutes " : 0 ,114 " Seconds " : 0 ,115 " WeekYear " : 0 ,116 " I sAnon imized " : "T"117 }118 } ,119 " S e r v i c e " : {120 " Code " : "123123" ,121 " D e s c r i p t i o n " : " U r g e n c i a "122 } ,123 " Valence " : {124 " Code " : n u l l ,125 " D e s c r i p t i o n " : n u l l126 } ,

104


127 " C o s t C e n t e r " : {128 " Code " : n u l l ,129 " D e s c r i p t i o n " : n u l l130 } ,131 " D i a g n o s t i c " : {132 " D i a g n o s t i c C o d e " : {133 " Code " : n u l l ,134 " I sAnon imized " : "F"135 } ,136 " D i s e a s e s C l a s s i f i c a t i o n " : n u l l ,137 " D e s c r i p t i o n " : n u l l138 } ,139 " L i n e I d e n t i f i e r " : "123123" ,140 " C r e a t i o n D a t e " : {141 " Year " : 2010 ,142 " Month " : 1 ,143 " Day " : 1 ,144 " Hour " : 1 ,145 " Minutes " : 0 ,146 " Seconds " : 0 ,147 " WeekYear " : 1 ,148 " I sAnon imized " : "F"149 } ,150 " P r e s c r i p t i o n N u m b e r " : "ANON" ,151 " I n n e r I d " : "111111" ,152 "SCN" : NumberLong (1111111) ,153 " O b s o l e t e " : f a l s e ,154 " A c t io n " : "E " ,155 " V a l i d " : f a l s e ,156 " V a l i d a t i o n R e s u l t " : n u l l ,157 " V a l e n c e S u p r e s s e d " : {158 " Code " : "ANON" ,159 " D e s c r i p t i o n " : "ANON"160 } ,161 " C o s t C e n t e r S u p r e s s e d " : {162 " Code " : "ANON" ,163 " D e s c r i p t i o n " : "ANON"164 } ,165 " S e r v i c e S u p r e s s e d " : {166 " Code " : "ANON" ,167 " D e s c r i p t i o n " : "ANON"168 } ,169 " V i s i t S u p r e s s e d " : {170 " P a t i e n t " : {171 " Code " : "ANON" ,172 " Type " : "ANON" ,173 " Gender " : {174 " Gender " : "ANON" ,175 " I sAnon imized " : "T"176 } ,177 " B i r t h Y e a r " : {178 " BeginYear " : −9999999 ,179 " EndYear " : −9999999 ,180 " I sAnon imized " : "T"181 } ,182 " L o c a t i o n " : {183 " CountryCode " : "ANON" ,184 " C o u n t r y D e s c r i p t i o n " : "ANON" ,185 " D i s t r i c t C o d e " : "ANON" ,186 " D i s t r i c t D e s c r i p t i o n " : "ANON" ,187 " I sAnon imized " : "T"188 } ,189 " L o c a t i o n C o u n t r y " : {190 " CountryCode " : "ANON" ,191 " C o u n t r y D e s c r i p t i o n " : "ANON" ,192 " D i s t r i c t C o d e " : "ANON" ,193 " D i s t r i c t D e s c r i p t i o n " : "ANON" ,194 " I sAnon imized " : "T"195 } ,196 " Gen de rSup re s sed " : {197 " Gender " : "ANON" ,198 " I sAnon imized " : "T"199 } ,200 " B i r t h Y e a r I n t e r v a l " : {201 " BeginYear " : −9999995 ,202 " EndYear " : −9999990 ,203 " I sAnon imized " : "T"

105


204 } ,205 " L o c a t i o n S u p r e s s e d " : {206 " CountryCode " : "ANON" ,207 " C o u n t r y D e s c r i p t i o n " : "ANON" ,208 " D i s t r i c t C o d e " : "ANON" ,209 " D i s t r i c t D e s c r i p t i o n " : "ANON" ,210 " I sAnon imized " : "T"211 } ,212 " B i r t h Y e a r S u p r e s s e d " : {213 " BeginYear " : −9999999 ,214 " EndYear " : −9999999 ,215 " I sAnon imized " : "T"216 }217 } ,218 " Ep i sode " : "ANON" ,219 " EpisodeType " : "ANON" ,220 " I s P a t i e n t A n o n i m i z e d " : "T"221 } ,222 " Crea t ionDa teMon th " : {223 " Year " : 2010 ,224 " Month " : 1 ,225 " Day " : 0 ,226 " Hour " : 0 ,227 " Minutes " : 0 ,228 " Seconds " : 0 ,229 " WeekYear " : 0 ,230 " I sAnon imized " : "T"231 } ,232 " D i a g n o s t i c P a r e n t " : {233 " D i a g n o s t i c C o d e " : n u l l ,234 " D i s e a s e s C l a s s i f i c a t i o n " : n u l l ,235 " D e s c r i p t i o n " : n u l l236 } ,237 " D i a g n o s t i c P a r e n t P a r e n t " : n u l l ,238 " De ta i lComputed " : t r u e ,239 " P r e s c r i b i n g P h y s i c i a n S u p r e s s e d " : {240 " P h y s i c i a n S p e c i a l t y C o d e " : "ANON" ,241 " P h y s i c i a n S p e c i a l t y " : "ANON" ,242 " Code " : "ANON" ,243 " I sAnon imized " : "T " ,244 " P h y s i c i a n S p e c i a l t y P a r e n t " : "ANON" ,245 " P h y s i c i a n S p e c i a l t y C o d e P a r e n t " : n u l l ,246 " CodeSupressed " : "ANON"247 } ,248 " DateReg " : ISODate ( ) ,249 " S t a t u s " : "S " ,250 "OBS" : n u l l251 }

Listing A.1: Prescribed Medication document structure.

A.0.2 Consumptions

1 {2 " _ i d " : O b j e c t I d ( ) ,3 " A p p l i c a t i o n I d " : " 1 2 3 " ,4 " T r a c k i n g I d " : "1AKSJNDAKSNDK11KJ2N3KJ12N3KJN " ,5 " I n s t i t u t i o n " : {6 " Code " : " 1 " ,7 " O r g a n i z a t i o n " : {8 " Code " : "C1"9 }

10 } ,11 " V i s i t " : {12 " P a t i e n t " : n u l l ,13 " Ep i sode " : n u l l ,14 " EpisodeType " : n u l l ,15 " I s P a t i e n t A n o n i m i z e d " : "F"16 } ,17 " P r o d u c t " : {

106


18 " P r o d u c t I d " : " 1 " ,19 " Produc tCode " : " 1 " ,20 " D e s c r i p t i o n " : " Ben u Ron " ,21 "CHNM" : "10000000" ,22 " Dose " : " 1 " ,23 " MeasureUni t " : "MG" ,24 " A d m i n i s t r a t i o n R o u t e D e s c " : " I .V . " ,25 " P r e s e n t a t i o n F o r m " : "AMP" ,26 "GFT" : " 3 . 4 . 4 . 2 . 1 " ,27 "ACT" : n u l l ,28 " Med ica lDev ice " : "N" ,29 " Family " : " 1 0 " ,30 " F a m i l y D e s c r i p t i o n " : " P r o d u t o s F a r m a c e u t i c o s " ,31 " Pha rmaceu t i c a lFo rmDesc " : "SOL . INJ . " ,32 " A d m i n i s t r a t i o n R o u t e " : " I .V . " ,33 " P r e s e n t a t i o n F o r m D e s c " : "AMPOLA" ,34 " P h a r m a c e u t i c a l F o r m " : "SOL . INJ . "35 } ,36 " S e r v i c e " : {37 " Code " : " 1 " ,38 " D e s c r i p t i o n " : " G e r a l "39 } ,40 " Valence " : {41 " Code " : " 1 " ,42 " D e s c r i p t i o n " : " G e r a l "43 } ,44 " C o s t C e n t e r " : {45 " Code " : n u l l ,46 " D e s c r i p t i o n " : n u l l47 } ,48 " T r a n s a c t i o n " : {49 " Transac t ionNumber " : {50 " Number " : " 1 " ,51 " I sAnon imized " : "F " ,52 " N u m b e r V i s i b i l i t y " : "123"53 } ,54 " U n i t P r i c e " : " 1 0 . 0 0 " ,55 " Value " : "−10.00" ,56 " Date " : {57 " Year " : 2013 ,58 " Month " : 1 ,59 " Day " : 1 ,60 " Hour " : 0 ,61 " Minutes " : 0 ,62 " Seconds " : 0 ,63 " WeekYear " : 1 ,64 " I sAnon imized " : "F"65 } ,66 " Amount " : "−1" ,67 " Un i t " : "AMP" ,68 " Lot " : n u l l ,69 " E x p i r a t i o n D a t e " : n u l l ,70 " Brand " : n u l l ,71 " S t a t u s " : "OK" ,72 " T r a n s a c t i o n T y p e " : " 1 " ,73 " T r a n s a c t i o n T y p e D e s c " : " T r a n s a c a o " ,74 " T r a n s a c t i o n T y p e D e t a i l " : "T " ,75 " DateDay " : {76 " Year " : 2013 ,77 " Month " : 1 ,78 " Day " : 1 ,79 " Hour " : 0 ,80 " Minutes " : 0 ,81 " Seconds " : 0 ,82 " WeekYear " : 0 ,83 " I sAnon imized " : "T"84 } ,85 " DateMonth " : {86 " Year " : 2013 ,87 " Month " : 1 ,88 " Day " : 0 ,89 " Hour " : 0 ,90 " Minutes " : 0 ,91 " Seconds " : 0 ,92 " WeekYear " : 0 ,93 " I sAnon imized " : "T"94 } ,

107


95 " DateWeek " : {96 " Year " : 2013 ,97 " Month " : 0 ,98 " Day " : 0 ,99 " Hour " : 0 ,

100 " Minutes " : 0 ,101 " Seconds " : 0 ,102 " WeekYear " : 1 ,103 " I sAnon imized " : "T"104 }105 } ,106 " I n n e r I d " : "10000000000000000000" ,107 "SCN" : NumberLong ( 2 ) ,108 " O b s o l e t e " : f a l s e ,109 " A c t io n " : "E " ,110 " V a l e n c e S u p r e s s e d " : {111 " Code " : "ANON" ,112 " D e s c r i p t i o n " : "ANON"113 } ,114 " C o s t C e n t e r S u p r e s s e d " : {115 " Code " : "ANON" ,116 " D e s c r i p t i o n " : "ANON"117 } ,118 " S e r v i c e S u p r e s s e d " : {119 " Code " : "ANON" ,120 " D e s c r i p t i o n " : "ANON"121 } ,122 " V i s i t S u p r e s s e d " : {123 " P a t i e n t " : n u l l ,124 " Ep i sode " : "ANON" ,125 " EpisodeType " : "ANON" ,126 " I s P a t i e n t A n o n i m i z e d " : "T"127 } ,128 " De ta i lComputed " : t r u e ,129 " V a l i d " : f a l s e ,130 " V a l i d a t i o n R e s u l t " : n u l l ,131 " DateReg " : ISODate ( ) ,132 " S t a t u s " : "S " ,133 "OBS" : n u l l134 }

Listing A.2: Consumptions document structure.

108