free text keystroke dynamics


  • An Examination of Keystroke Dynamics for Continuous User Authentication

    by

    Eesa Alsolami

    Bachelor of Science (Computer Science), KAU, Saudi Arabia 2002
    Master of Information Technology (QUT) 2008

    Thesis submitted in accordance with the regulations for
    Degree of Doctor of Philosophy

    Information Security Institute
    Science and Engineering Faculty

    Queensland University of Technology

    August 2012

  • Keywords

    Continuous biometric authentication, continuous authentication system, user-independent threshold, keystroke dynamics, user typing behavior, feature selection.


  • Abstract

    Most current computer systems authorise the user at the start of a session and do not detect whether the current user is still the initial authorised user, a substitute user, or an intruder pretending to be a valid user. Therefore, a system is needed that continuously checks the identity of the user throughout the session, and does so effectively without being intrusive to the end user. Such a system is called a continuous authentication system (CAS).

    Researchers have applied several approaches to CAS, and most of these techniques are based on biometrics. These continuous biometric authentication systems (CBAS) rely on user traits and characteristics. One of the main types of biometric is keystroke dynamics, which has been widely tried and accepted for providing continuous user authentication. Keystroke dynamics is appealing for many reasons. First, it is less obtrusive, since users will be typing on the computer keyboard anyway. Second, it does not require extra hardware. Finally, keystroke dynamics remain available after the authentication step at the start of the computer session.

    Currently, there is insufficient research on CBAS based on keystroke dynamics. To date, most of the existing schemes ignore the continuous authentication scenarios, which might affect their practicality in different real-world applications. Also, contemporary keystroke-dynamics CBAS approaches use character sequences as features that are representative of user typing behavior, but their feature selection criteria do not guarantee features with strong statistical significance, which may lead to a less accurate statistical representation of the user. Furthermore, their selected features do not inherently incorporate user typing behavior. Finally, the existing CBAS that are based on keystroke dynamics typically depend on pre-defined user-typing models for continuous authentication. This dependency restricts the systems to authenticating only known users whose typing samples have been modelled.


    This research addresses the limitations of the existing CBAS schemes described above by developing a generic model to better identify and understand the characteristics and requirements of each type of CBAS and continuous authentication scenario. Also, the research proposes four statistical-based feature selection techniques that select features with the highest statistical significance and encompass different user typing behaviors, so that they represent user typing patterns effectively. Finally, the research proposes a user-independent threshold approach that is able to authenticate a user accurately without needing any predefined user typing model a priori. We also enhance the technique to detect an impostor or intruder who may take over at any point during the computer session.


  • Contents

    Keywords
    Abstract
    Table of Contents
    List of Figures
    List of Tables
    List of Abbreviations
    Declaration
    Previously Published Material

    1 Introduction
        1.1 Motivation
        1.2 Research Objectives
        1.3 Research Questions
        1.4 Research Outcomes
        1.5 Research Significance
        1.6 Thesis Outline

    2 Background
        2.1 User Authentication in Computer Security
            2.1.1 Biometric Authentication
        2.2 Typist Authentication
            2.2.1 Static Typist Authentication
            2.2.2 Continuous Typist Authentication
        2.3 Machine Learning in Typist Authentication
            2.3.1 Supervised Learning
            2.3.2 Unsupervised Learning
        2.4 Anomaly Detection Techniques
        2.5 Related Techniques
            2.5.1 Feature Selection
            2.5.2 Application of Threshold Analysis
            2.5.3 Time Series Analysis
        2.6 Research Challenges
        2.7 Summary

    3 Model for Continuous Biometric Authentication
        3.1 Continuous Biometric Authentication System (CBAS)
        3.2 CBAS Model
            3.2.1 Subjects
            3.2.2 Sensor
            3.2.3 Feature extractions
            3.2.4 Detector
            3.2.5 Biometric database
            3.2.6 Response unit
        3.3 Continuous Authentication Scenarios
        3.4 Existing CBAS
            3.4.1 Class 1
            3.4.2 Class 2
            3.4.3 Limitations in the current CBAS
        3.5 A new class for CBAS
        3.6 Conclusion

    4 Dataset Analysis
        4.1 Predefined or free text?
        4.2 Experimental Setting
        4.3 Dataset Description
        4.4 Data Preprocessing
        4.5 Preliminary Analysis
        4.6 Experimental Framework
        4.7 Evaluation Methodology
        4.8 Summary

    5 User-Representative Feature Selection for Keystroke Dynamics
        5.1 Introduction
        5.2 Proposed feature selection techniques
            5.2.1 Most frequently typed n-graph selection
            5.2.2 Quickly-typed n-graph selection
            5.2.3 Time-stability typed n-graph selection
            5.2.4 Time-variant typed n-graph selection
        5.3 Evaluation Methodology
            5.3.1 Selecting candidate features
            5.3.2 Evaluate candidate features (obtained by feature selection techniques)
                5.3.2.1 K-means algorithm
                5.3.2.2 Assigning users
                5.3.2.3 Cluster evaluation criterion
        5.4 Experiments
            5.4.1 Experimental settings
            5.4.2 Experimental results
            5.4.3 Comparison with existing feature selection techniques
        5.5 Comparing fixed and dynamic features on different data sizes
        5.6 Summary

    6 User-independent Threshold
        6.1 Motivation
        6.2 A user-independent threshold system
        6.3 Designing and evaluating user-independent threshold
        6.4 Approach
        6.5 Experiment
            6.5.1 Data set
            6.5.2 Experimental method
        6.6 Results
            6.6.1 Distance measures
            6.6.2 Various number of keystrokes (data size)
            6.6.3 Feature type
            6.6.4 Feature amount
        6.7 Comparing to user-dependent threshold
            6.7.1 Experimental Methodology
            6.7.2 Empirical Analysis
        6.8 Discussion and limitations
        6.9 Summary

    7 Typist Authentication based on user-independent threshold
        7.1 Motivation
        7.2 Typist Authentication System
        7.3 Change detection
        7.4 Time Series Analysis and Attack Detection
            7.4.1 Sliding window (non-overlapping)
            7.4.2 Sliding Window (overlapping)
        7.5 Experiment
            7.5.1 Data set
            7.5.2 Experimental method
            7.5.3 Experimental results
        7.6 Discussions and Limitations
        7.7 Summary

    8 Conclusion and Future Directions
        8.1 Summary of Contributions
        8.2 Future Directions
            8.2.1 Application of the Proposed Technique to Different Datasets
            8.2.2 Application of the Proposed Technique to Different Biometric Sources
            8.2.3 Improvements to the Proposed User-Independent Threshold
            8.2.4 Detection of Impostor in Real time
        8.3 Concluding Remarks

    A Characteristics of users typing data

    Appendix-Dataset-Details

    Bibliography

  • List of Figures

    2.1 Transforming a keystroke pattern into a timing vector when a user inputs a string AB: the duration and interval times are measured in milliseconds

    3.1 Continuous Biometric Authentication System Model
    3.2 Characteristics of CBAS

    5.1 Evaluation methodology for feature selection techniques
    5.2 Comparison between proposed feature selection techniques based on # of selected features cumulatively
    5.3 Comparison between different statistics and distances for the most frequent 2-graphs technique based on individual group of features
    5.4 Comparison between different statistics for the most frequent 2-graphs technique based on # of selected features cumulatively
    5.5 Comparison between the most and least frequent 2-graphs based on individual group of features
    5.6 Comparison between the most and least frequent 2-graphs based on # of selected features cumulatively
    5.7 Comparing fixed and dynamic features over different data sizes

    6.1 Equal error rate
    6.2 Inconsistency of threshold among different group of users
    6.3 Varying the data size for group 1 of users
    6.4 Consistency of the threshold for different size of data
    6.5 Consistency of threshold among different group of users

    7.1 Overview of the proposed typist authentication system
    7.2 Sliding window (not overlapping)
    7.3 Sliding window (overlapping)
    7.4 Comparing the accuracy of detection between two different authorised users

  • List of Tables

    2.1 The accuracy of static typist authentication techniques
    2.2 The accuracy of continuous typist authentication techniques

    3.1 Requirements of different scenarios of CBA
    3.2 The differences between the first and second classes and the new class

    4.1 Characteristics of users typing data
    4.2 Most frequent feature in the dataset
    4.3 Avg time of most frequent characteristic for different users
    4.4 Average time of ER for all of the user samples

    5.1 Comparison of Italian words and most frequent 2-graphs
    5.2 Comparison between common 2-graphs and most frequent 2-graphs

    6.1 Comparing accuracy for different measurements
    6.2 Comparing EER for different size of data
    6.3 Comparing accuracy for different feature type
    6.4 Comparing accuracy for different amount of feature set
    6.5 Comparative analysis of our approach (user-independent) and current schemes (user-dependent)
    6.6 Comparing accuracy between user-dependent and user-independent

    7.1 Accuracy of detection when varying the size of impostor data in one window

    A.1 Characteristics of users typing data
    A.2 Data distribution for 10 users

  • List of Abbreviations

    CAS         Continuous Authentication System
    CBAS        Continuous Biometric Authentication System
    CA          Continuous Authentication
    1 Sample    700 to 900 characters
    FAR         False Acceptance Rate
    FRR         False Reject Rate
    FP          False Positive
    FN          False Negative
    IDS         Intrusion Detection System
    CUSUM       Cumulative Sum
    GLR         Generalized Likelihood Ratio
    GP Dataset  Gunetti and Picardi Dataset
    AVG         Average
    STD         Standard deviation
    EER         Equal Error Rate
    ROC         Receiver Operating Characteristics
    TSW         Testing Window
    AUW         Authenticated Window
    SNO         Sliding Window Not Overlapping
    SO          Sliding Window Overlapping

  • Declaration

    The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

    Signed: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Date: . . . . . . . . . . . . . . . . . . . . .


  • Previously Published Material

    The following papers have been published or presented, and contain material based on the content of this thesis.

    Al solami, Eesa, Boyd, Colin, Clark, Andrew and Khandoker, Asadul Islam. Continuous biometric authentication: Can it be more practical? In: 12th IEEE International Conference on High Performance Computing and Communications, 1-3 September 2010, Melbourne.

    Al solami, Eesa, Boyd, Colin, Clark, Andrew and Ahmed, Irfan. User representative feature selection for keystroke dynamics. In Sabrina De Capitani di Vimercati and Pierangela Samarati, editors, International Conference on Network and System Security, Università degli Studi di Milano, Milan, 2011.

    Al solami, Eesa, Boyd, Colin, Ahmed, Irfan, Nayak, Richi, and Marrington, Andrew. User-independent threshold for continuous user authentication based on keystroke dynamics. The Seventh International Conference on Internet Monitoring and Protection, 27 May - 1 June 2012, Stuttgart, Germany.


  • Acknowledgments

    Praise and thanks be to Allah for his help in accomplishing this work.

    I would like to express my deep and sincere gratitude and appreciation to my principal supervisor, Prof. Colin Boyd, for his continued support and guidance throughout my PhD. I could not have imagined having a better adviser for my PhD study. Thank you very much, Colin.

    I sincerely thank my co-adviser, Assoc. Prof. Richi Nayak, for her encouragement, insightful comments and support. Indeed, Richi was a valuable addition to my supervision team who provided a great deal of good ideas and suggestions on my research.

    Lastly, I wish to thank my entire extended family for their understanding, support and guidance. I am heartily thankful to my parents for their encouragement, guidance and support. Also, and most importantly, I wish to thank my wife and my daughters, Shahad, Raghad and Yara. Without them I could not have completed this work.


  • Chapter 1

    Introduction

    A breach of information security can affect not only a single user's work but also the economic development of companies, and even the national security of a country. Such breaches are the focus of research into unauthorised access attacks on computers, which are the second greatest source of financial loss according to the 2006 CSI/FBI Computer Crime and Security Survey [31]. Attacks on computer systems can be undertaken at the network, system and user levels [62].

    Most information security research undertaken in recent years is concerned with system and network-level attacks. However, there is a lack of research on attacks at the user level. User-level attacks include the impostor or intruder who takes over from the valid user either at the start of a computer session or during the session. Depending on the risks in a particular environment, a single, initial authentication might be insufficient to guarantee security. It may also be necessary to perform continuous authentication to prevent user substitution after the initial authentication step. The impact of an intruder taking over during a session is the same as that of any kind of false representation at the beginning of a session. Most current computer systems authorise the user at the start of a session and do not detect whether the current user is still the initial authorised user, a substitute user, or an intruder pretending to be a valid user. Therefore, a system that continuously checks the identity of the user throughout the session is necessary. Such a system is called a continuous authentication system.

    The majority of existing continuous authentication systems are built around biometrics. These continuous biometric authentication systems (CBAS) rely on user traits and characteristics. There are two major forms of biometrics: those based on physiological attributes and those based on behavioural characteristics. The physiological type includes biometrics based on stable body traits, such as fingerprint, face, iris and hand, which are considered to be more robust and secure. However, they are also considered to be more intrusive and expensive, and they require regular equipment replacement [86]. On the other hand, behavioural biometrics include learned movements such as handwritten signatures, keyboard dynamics (typing), mouse movements, gait and speech. Collecting these biometrics is less obtrusive and does not require extra hardware.

    Recently, keystroke dynamics has gained popularity as one of the main sources of behavioural biometrics for providing continuous user authentication. Keystroke dynamics is appealing for many reasons [32]:

    - It is less obtrusive, since users will be typing on the computer keyboard anyway.

    - It does not require extra hardware.

    - Keystroke dynamics exist and are available after the authentication step at the start of the computer session.

    Analysing typing data has proved to be very useful in distinguishing between users and can be used as a form of biometric authentication. Various types of analysis have been carried out on users' typing data, to find features that are representative of user typing behaviour and to detect an impostor or intruder who may take over from the valid user during a session.

    This research extends previous work on continuous user authentication systems based on keystroke dynamics by developing a new flexible technique that authenticates users, and by automating this technique so that it continuously authenticates users over the course of a computer session without the need for any predefined user typing model a priori. Also, the technique introduces new features that represent the user typing behaviour effectively. The motivation for this research is provided in Section 1.1. Outcomes achieved by this research are identified in Section 1.4, and the organisation of this thesis is described in Section 1.6.


    1.1 Motivation

    This thesis focuses on developing automatic analysis techniques for continuous user authentication systems based on keystroke dynamics, with the goal of detecting an impostor or intruder who may take over a valid user session. The main motivation of this research is the need for a flexible system that can authenticate users without depending on a pre-defined typing model of a user. This research is motivated by:

    - the relative absence of research on CBAS that utilise biometric sources in general, and keystroke dynamics specifically.

    - the absence of a suitable model that considers continuous authentication scenarios, identifying and understanding the characteristics and requirements of each type of CBAS and continuous authentication scenario.

    - the need for new feature selection techniques that represent user typing behaviour, guaranteeing that frequently typed features are selected and that the selected features inherently reflect user typing behavior.

    - the lack of an automated CBAS based on keystroke dynamics that is low in computational resource requirements and thus suitable for real-time detection.

    1.2 Research Objectives

    Following from the previous section, the objectives addressed in this thesis are:

    1. To develop a generic model for identifying and understanding the characteristics and requirements of each type of CBAS and continuous authentication scenario. (Chapter 3)

    2. To identify optimum features that represent user typing behaviour, guaranteeing that frequently typed features are selected and that the selected features inherently reflect user typing behaviour. (Chapter 5)

    3. To discover whether a pre-defined typing model of a user is necessary for successful authentication. (Chapter 6)

    4. To minimise the delay for an automatic CBAS to detect intruders. (Chapter 7)

    1.3 Research Questions

    This thesis addresses two main research questions.

    1. "What is a suitable model for identifying and understanding the characteristics and requirements of each type of CBAS and continuous authentication scenario?"

    In this thesis we develop a generic model for most continuous authentication scenarios and CBAS. The model is developed based on the detection capabilities of both continuous authentication scenarios and CBAS, to better identify and understand the characteristics and requirements of each type of scenario and system. This model pursues two goals: the first is to describe the characteristics and attributes of existing CBAS, and the second is to describe the requirements of different continuous authentication scenarios.

    From the model that we have developed, we found that the main characteristic of most existing CBAS is that they depend on pre-defined user-typing models for authentication. However, in some scenarios and cases, it is impractical or impossible to obtain a pre-defined typing model of all users in advance (before detection time). Therefore, the following question will be addressed in this thesis:

    2. "Can users be authenticated without depending on a pre-defined typing model of a user? If so, how?"

    In this thesis we develop a novel continuous authentication mechanism that is based on keystroke dynamics and is not dependent on a pre-defined typing model of a user. It is able to automatically detect the impostor or intruder in real time. The accuracy of a CBAS is measured in terms of its detection accuracy rate. The aim is to maximise the detection accuracy, that is, the percentage of impostors or intruders masquerading as genuine users that are detected, and to minimise the false alarm rate, that is, the percentage of genuine users that are identified as impostors or intruders. In order to distinguish between a genuine user and an impostor in one computer session, we incorporate distance measure techniques in our approach.
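The two rates defined above can be made concrete with a short Python sketch. It is illustrative only: the function name and the boolean encoding of the labels and decisions are assumptions made for this example, not part of the thesis.

```python
def detection_and_false_alarm_rates(is_impostor, flagged):
    """Return (detection rate, false alarm rate) as fractions in [0, 1].

    is_impostor[i] is True when sample i truly comes from an impostor.
    flagged[i] is True when the system reported sample i as an impostor.
    Both encodings are assumptions made for this illustration.
    """
    if len(is_impostor) != len(flagged):
        raise ValueError("labels and decisions must have the same length")

    impostors = sum(1 for label in is_impostor if label)
    genuine = len(is_impostor) - impostors

    # Impostor samples that the system correctly flagged.
    detected = sum(1 for label, flag in zip(is_impostor, flagged) if label and flag)
    # Genuine samples that the system wrongly flagged.
    false_alarms = sum(1 for label, flag in zip(is_impostor, flagged) if not label and flag)

    detection_rate = detected / impostors if impostors else 0.0
    false_alarm_rate = false_alarms / genuine if genuine else 0.0
    return detection_rate, false_alarm_rate
```

With three impostor samples of which two are flagged, and three genuine samples of which one is flagged, the sketch gives a detection rate of 2/3 and a false alarm rate of 1/3.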

    Three further sub-questions arise from the previous question:

    1. What are the optimum features that are representative of user typing behavior? To address this sub-question, we propose four statistical-based feature selection techniques. The first technique selects the most frequently occurring features. The other three consider different user typing behaviors by selecting: n-graphs that are typed quickly; n-graphs that are typed with consistent time; and n-graphs that have large time variance among users.

    2. How accurately can a user-independent threshold determine whether two typing samples in a user session belong to the same user? To address this sub-question, we examine four variables that influence the accuracy of the threshold and directly affect how the user samples are handled for authentication: distance type, number of keystrokes, feature type and amount of features.

    3. Can we automatically detect an impostor who takes over from a valid user during a computer session, and how much typing data does a system need to detect the impostor? To answer this sub-question, we first need to answer sub-questions 1 and 2, and then use the answers to propose the automated system. For automated detection, a sliding window mechanism is used and the optimum size of the window is determined.
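The sliding window mechanism mentioned in sub-question 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name, the `size` and `step` parameters, and the choice to drop a trailing partial window are hypothetical and are not taken from the thesis.

```python
def sliding_windows(keystrokes, size, step=None):
    """Yield consecutive windows of `size` items from a keystroke sequence.

    step == size (the default) gives non-overlapping windows;
    step < size gives overlapping windows.
    A trailing partial window is dropped (an assumption of this sketch).
    """
    if step is None:
        step = size
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    for start in range(0, len(keystrokes) - size + 1, step):
        yield keystrokes[start:start + size]


# Non-overlapping: each keystroke belongs to exactly one window.
non_overlapping = list(sliding_windows(list(range(6)), size=3))
# Overlapping: the window advances by one keystroke at a time.
overlapping = list(sliding_windows(list(range(5)), size=3, step=1))
```

A smaller step lets a detector re-evaluate the session more often, at the cost of comparing more windows.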

    1.4 Research Outcomes

    By addressing the research objectives and research questions, this thesis makes a number of contributions, including:

    1. A generic model is proposed for most continuous authentication scenarios and CBAS. The model of CBAS is proposed based on their detection capabilities, to better identify and understand the characteristics and requirements of each type of scenario and system. This model pursues two goals: the first is to describe the characteristics and attributes of existing CBAS, and the second is to describe the requirements of different scenarios of CBAS. The research results were published in:

    Al solami, Eesa, Boyd, Colin, Clark, Andrew and Khandoker, Asadul Islam. Continuous biometric authentication: Can it be more practical? In: 12th IEEE International Conference on High Performance Computing and Communications, 1-3 September 2010, Melbourne.

    2. We propose four statistical-based feature selection techniques that are representative of user typing behavior. The selected features have high statistical significance for user representation and also inherently reflect user typing behavior. The first is simply the most frequently typed n-graphs; it selects a certain number of highly occurring n-graphs. The other three encompass users' different typing behaviors:

    (a) The quickly-typed n-graph selection technique obtains n-graphs that are typed quickly. The technique computes the average typing time of each n-graph, representing its usual typing time, and then selects the n-graphs with the lowest typing time.

    (b) The time-stability typed n-graph selection technique selects the n-graphs that are typed with consistent time. The technique computes the standard deviation of each n-graph, representing the variance from its average typing time, and then selects the n-graphs with the least variance.

    (c) The time-variant typed n-graph selection technique selects the n-graphs that are typed with noticeably different times. The technique computes the standard deviation of each n-graph among all users, representing the variance from its average typing time, and then selects the n-graphs with large variance.

    The research results were published in:

    Al solami, Eesa, Boyd, Colin, Clark, Andrew and Ahmed, Irfan. User representative feature selection for keystroke dynamics. In Sabrina De Capitani di Vimercati and Pierangela Samarati, editors, International Conference on Network and System Security, Università degli Studi di Milano, Milan, 2011.

    3. A proposed user-independent threshold approach that can distinguish a user accurately without needing any predefined user typing model a priori. The threshold can be fixed across a whole set of users in order to authenticate users without requiring a pre-defined typing model for each user. The research results were published in:

    Al solami, Eesa, Boyd, Colin, Ahmed, Irfan, Nayak, Richi, and Marrington, Andrew. User-independent threshold for continuous user authentication based on keystroke dynamics. The Seventh International Conference on Internet Monitoring and Protection, 27 May - 1 June 2012, Stuttgart, Germany.

    4. The design of an automatic system that is capable of authenticating users based on the user-independent threshold.
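The four feature selection techniques listed in outcome 2 can be sketched in Python roughly as follows. This is one possible reading of the descriptions above, not the thesis implementation: the data layout (a mapping from user to n-graph to a list of typing times in milliseconds), the function names, the use of the population standard deviation, and the alphabetical tie-breaking are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def pooled(samples):
    """Merge all users' timings into {ngraph: [time, ...]}."""
    merged = {}
    for timings in samples.values():
        for ngraph, times in timings.items():
            merged.setdefault(ngraph, []).extend(times)
    return merged


def top_k(scores, k, largest):
    """Return k n-graphs by score; ties are broken alphabetically."""
    key = (lambda g: (-scores[g], g)) if largest else (lambda g: (scores[g], g))
    return sorted(scores, key=key)[:k]


def most_frequent(samples, k):
    # Most frequently typed n-graphs: highest occurrence count.
    return top_k({g: len(t) for g, t in pooled(samples).items()}, k, largest=True)


def quickly_typed(samples, k):
    # Quickly-typed n-graphs: lowest average typing time.
    return top_k({g: mean(t) for g, t in pooled(samples).items()}, k, largest=False)


def time_stable(samples, k):
    # Time-stability n-graphs: lowest standard deviation of typing time.
    return top_k({g: pstdev(t) for g, t in pooled(samples).items()}, k, largest=False)


def time_variant(samples, k):
    # Time-variant n-graphs: largest spread of per-user average times.
    per_user_means = {}
    for timings in samples.values():
        for ngraph, times in timings.items():
            per_user_means.setdefault(ngraph, []).append(mean(times))
    return top_k({g: pstdev(m) for g, m in per_user_means.items()}, k, largest=True)


# Hypothetical timings in milliseconds for two users.
samples = {
    "u1": {"th": [100, 110, 105], "he": [80, 82], "in": [150]},
    "u2": {"th": [200, 210], "he": [81, 79], "in": [152]},
}
```

On this toy data, "th" is both the most frequent and the most time-variant 2-graph, "he" is the most quickly typed, and "in" is the most time-stable.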

    1.5 Research Significance

    This research advanced the knowledge in the area of CBAS by linking the contin-uous authentication scenarios with the relevant continuous biometric authentica-tion schemes. It identifies and understands the characteristics and requirements ofeach type of CBAS and continuous authentication scenarios. The research helpsto choose the right accuracy measurements for the relevant scenario or situation.

    Furthermore, the research established a novel approach based on a user-independentthreshold without needing to build user-typing models. The new approach helpsto allow building new practical systems in a systemic way that can be used foruser authentication and impostor detection during the entire session without theneed for any predefined user typing model a-priori. The new system can be ap-plicable in some cases where it is impractical or impossible to gain the predefinedtyping model of all users in advance (before detection time). Examples are in anopen-setting scenario and in an unrestricted environment such as a public locationwhere any user can use the system. For instance, consider a computer that hasa guest account in a public location. In this instance, any user can interact withthe system. Naturally, no pre-defined typing model for the user would be availableprior to the commencement of the session.

    Additionally, the implications of this method extend beyond typist authentication; it is generic and might be applied to any biometric source such as mouse movements, gait or speech. Another important implication is that, unlike the existing schemes, our method can distinguish two unknown user samples and decide whether they are from the same user or not. This might help in forensic investigation applications where there are two different typing samples and the investigator wants to decide whether they relate to one user or to two different users.

    1.6 Thesis Outline

    The rest of this thesis is organised as follows:

    Chapter 2: Background This chapter gives an overview of biometrics, and introduces the concepts that will be used to describe typist authentication in subsequent chapters. Also, this chapter surveys related work, covering existing techniques for continuous user authentication that are based on keystroke dynamics.

    Chapter 3: A proposed model for CBAS using biometric This chapter proposes a generic model for most continuous authentication scenarios and CBAS. This model has two goals: the first is to describe the characteristics and attributes of existing CBAS; and the second is to describe the requirements of different scenarios of CBAS. Also, we identify the main issues and limitations of existing CBAS, observing that all of the issues are related to the training data. Finally, we consider a new application for CBAS without requiring any training data either from intruders or from valid users in order to make the CBAS more practical.

    Chapter 4: Dataset Analysis This chapter describes and analyses in depth the dataset that we used in our research. The analysis includes the data pre-processing that prepared the data for further analysis. Also, this chapter provides preliminary experiments that show the dataset is reliable and can be used for our research. Furthermore, this chapter provides an overview of the experimental and evaluation methodology used in this thesis.

    Chapter 5: User-Representative Feature Selection for Keystroke Dynamics This chapter explores which typing patterns can be used on a continuous basis for user authentication. The chapter proposes four statistical-based feature selection techniques that mitigate limitations of existing approaches. The first technique selects the most frequently occurring features. The other three consider different user typing behaviors by selecting: n-graphs that are typed quickly; n-graphs that are typed with consistent time; and n-graphs that have large time variance among users. We further substantiate our results by comparing the proposed techniques with three existing approaches (popular Italian words, common n-graphs, and least frequent n-graphs). Finally, the chapter analyses and compares fixed and dynamic features.

    Chapter 6: User-independent threshold for continuous user authentication based on keystroke dynamics This chapter proposes a user-independent threshold approach that can distinguish a user accurately without the need for any predefined user typing model a priori. The chapter examines four different variables that can directly manipulate the user samples for authentication, in order to see the influence of these factors on the accuracy of the user-independent threshold: distance type, number of keystrokes, feature type and number of features.

    Chapter 7: Typist authentication system based on user-independent threshold This chapter presents a system design that shows the user-independent threshold can work in a practical way. In particular, the chapter has two aims. The first is to identify the minimum data needed from an impostor before they can be detected. The second is to identify the point where the impostor takes over from the genuine user in a computer session.

    Chapter 8: Conclusion and Future Work Conclusions and directions for future research are presented in this chapter.


  • Chapter 2

    Background

    The goal of this thesis, as described in chapter 1, is to design and develop techniques for detecting an impostor who may take over from the authenticated user during a computer session, using keystroke dynamics. This chapter provides an overview of the authentication concept and of different types of authentication methods, focusing on typist authentication methods. Also, the chapter gives an overview of the current anomaly detection techniques that can be applied to our research problem, with emphasis on the techniques that are used in this thesis.

    This chapter is organized as follows. Section 2.1 provides an overview of authentication methods. Section 2.2 discusses in detail the current typist authentication schemes, including static and continuous typist authentication. Section 2.3 provides an overview of the current anomaly detection techniques. Section 2.4 presents previous research related to the work described in chapters 5 to 7. Later, in section 2.5, research challenges associated with the analysis of keystroke dynamics for continuous user authentication are discussed. Finally, the chapter is summarized in section 2.6.

    2.1 User Authentication in Computer Security

    Authentication is the process of checking the identity of someone or something. User authentication is a means of identifying the user and verifying that the user is allowed to access some restricted environments or services. Security research has determined that, for a positive identification, it is preferable that elements from at least two, and preferably all three, factors be verified [17]. The three factors (classes) and some of the elements of each factor are:

    the object factors: Something the user has (e.g., ID card, security token, smart card, phone, or cell phone)

    the knowledge factors: Something the user knows (e.g., a password, passphrase, personal identification number (PIN), or digital signature)

    the inherent factors: Something the user is or does (e.g., fingerprint, retinal pattern, DNA sequence (there are assorted definitions of what is sufficient), signature, face, voice, unique bio-electric signals, or other biometric identifier).

    Any authentication system includes several fundamental elements that need to be in place [95]:

    the initiators of activity on a target system, normally a user or a group of users that need to be authenticated.

    distinctive traits or attributes that distinguish a particular user or group from others, such as knowledge of a secret password.

    a proprietor, or an administrator working on the proprietor's behalf, who is responsible for the system being used and relies on automatic authentication to differentiate authorized users from other users.

    an authentication mechanism to verify the distinguishing characteristic of the user or group of users, such as object factors, knowledge factors and inherent factors.

    some privilege that an access control mechanism grants when the authentication of the user succeeds, and that the same mechanism denies if authentication of the user fails.

    As mentioned at the start of this section, the third class of positive identification factors is the inherent factors. Biometrics is based on inherent factors and, since our research is focused on biometric authentication, we limit our discussion to biometric authentication in the next sub-section.


    2.1.1 Biometric Authentication

    Biometrics is the automatic recognition of a person using distinguishing traits [101]. Biometrics can be physical or behavioural. A physical biometric measures a static physical trait that does not change, or changes only slightly, over time. It is related to the shape of the human body, as in fingerprint, face, hand and palm geometry, and iris recognition. A behavioural biometric measures the characteristics of a person by how they behave or act, as in speaker and voice recognition, signature verification, keystroke dynamics, and mouse dynamics.

    The advantage of physical biometrics is high accuracy compared to behavioural biometrics. In contrast, physical biometric devices need to be deployed, which leads to limitations such as high implementation cost. Behavioural biometrics is one of the popular methods for the continuous authentication of a person, but it can produce insufficient accuracy because behaviour is unstable and can change over time.

    Identification and verification are the goals of both biometric techniques: biometric identification involves determining who a person is; biometric verification involves determining if a person is who they say they are [43]. Physical biometric authentication is the most foolproof form of personal identification and the hardest to forge or spoof, compared to traditional authentication methods like user name and password.

    The operation of biometric identification or authentication technologies has four stages [41]:

    Capture This stage is used in the registration phase and also in the identification or verification phases. The system can capture a physical or behavioural sample.

    Extraction Distinctive patterns or features are extracted from the sample by selecting the optimum features that represent the user effectively, and then a profile is created for each user.

    Comparison The profile is then compared with a new sample in the testing phase.

    Match/Non Match The system then decides whether the features in the profile in the database match the features of the new sample in the testing phase.

    Jain et al. [1] presented seven essential properties or features of a biometric measure:


    Universality Everyone should have the measure.

    Uniqueness The measure distinguishes each user from all others, which means that no two people should have the same value.

    Permanence The measure should be consistent over time. However, a behavioural biometric changes slightly with time as the user learns and improves his or her skills for accomplishing tasks.

    Collectability The measure should be quantitatively measurable during data collection.

    Performance The system should be accurate. Identification accuracy for most of the biometric sources is lower than verification accuracy [103].

    Acceptability Most people should be willing to accept the measure. However, the measure might be objected to by some people for ethical or privacy reasons.

    Circumvention The measure should not be easily fooled. However, once such knowledge is available, fabrication may be very easy. Therefore, it is important to keep the collected biometric data secure.

    In the next section, we describe in detail typist authentication as one of the behavioural biometric types.

    2.2 Typist Authentication

    Most of the features of a biometric measure listed above are represented in keystroke dynamics, or typing. Jain et al. [1] mentioned that typing behaviour has low permanence and performance, and medium collectability, acceptability and circumvention. We think all seven properties are represented in keystroke dynamics, that is, in using the user's typing behaviour for authentication. Universality means that every user can type on the computer, except for users with a disability that prevents typing. Collectability means that the keyboard is able to collect and extract the user's data, even though each keyboard has different specifications that may affect the quality of the typing data. Uniqueness means that each user types differently from other users, so that no two people have the same typing behaviour. Permanence means that a user's typing behaviour is normally consistent over time, and several studies [44, 32, 21, 28, 48] conclude that the keystroke rhythm often has characteristics or features that represent consistent patterns of user typing behavior with time. However, a user's typing behaviour may change slightly with time as the user's typing skills change. Acceptability means that the keyboard is not an intrusive instrument and is acceptable to most users. Performance means that the accuracy of keystroke dynamics is normally high enough to represent the user typing behaviour effectively. Circumvention means that it is difficult to copy another user's typing style. In this thesis we consider and evaluate some of these properties of typing behaviour, including universality, uniqueness, permanence and performance.

    Typing as a behavioural biometric for authentication has been used for several years. By analysing users' typing patterns, several studies [44, 32, 21, 28, 48] conclude that the keystroke rhythm often has characteristics or features that represent consistent patterns of user typing behavior. Therefore it can be used for user authentication. In chapter 5, we present an extensive analysis to find the features that represent users' typing patterns effectively; these features are then used for user authentication, which is discussed in chapters 6 and 7.

    The input to a typist authentication or keystroke dynamics system is a stream of key events and the time at which each one occurs. Each event is either a key-press or a key-release. Most typist authentication techniques make use of the time between pairs of events, typically the digraph time or keystroke duration.

    The digraph time is the time interval between the first and the last of n subsequent key-presses. It is sometimes called keystroke latency or inter-keystroke interval.

    The keystroke duration is the time between the key-press and key-release for a single key. This is sometimes known as the key-down time, dwell time or hold time [30].
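
    The two timing features defined above can be sketched in code. The following is a minimal illustration, assuming a hypothetical event format of (timestamp in milliseconds, key, "press" or "release"); the function names and the sample event stream are illustrative only and are not taken from the thesis.

```python
def dwell_times(events):
    """Keystroke duration: time between the press and release of one key."""
    pending = {}  # key -> timestamp of its most recent unmatched press
    durations = []
    for timestamp, key, kind in events:
        if kind == "press":
            pending[key] = timestamp
        elif kind == "release" and key in pending:
            durations.append((key, timestamp - pending.pop(key)))
    return durations


def ngraph_times(events, n=2):
    """Time between the first and last of n subsequent key-presses.

    With n=2 this is the digraph time (keystroke latency).
    """
    presses = [(t, key) for t, key, kind in events if kind == "press"]
    result = []
    for i in range(len(presses) - n + 1):
        window = presses[i:i + n]
        label = "".join(key for _, key in window)
        result.append((label, window[-1][0] - window[0][0]))
    return result


# Hypothetical event stream for typing "the".
events = [
    (0.0, "t", "press"), (90.0, "t", "release"),
    (150.0, "h", "press"), (230.0, "h", "release"),
    (310.0, "e", "press"), (380.0, "e", "release"),
]
print(dwell_times(events))      # [('t', 90.0), ('h', 80.0), ('e', 70.0)]
print(ngraph_times(events, 2))  # [('th', 150.0), ('he', 160.0)]
```

    Measuring press-to-press keeps the n-graph time independent of how long each key is held, which is why the two features can be used separately or together.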

    There are two main types of keystroke analysis: static and dynamic (or continuous) analysis. Static keystroke analysis means that the analysis is performed on the same predefined text for all the individuals under observation. Most of the literature on keystroke analysis falls within this category [100, 57, 46, 11, 12, 69, 10]. The intended application of static analysis is at login time, in combination with other traditional authentication methods.

    Continuous analysis involves continuous monitoring of keystrokes and is intended to be executed during the entire session, after the initial authentication step. Keystroke analysis performed after the initial authentication step should deal with the typing rhythms of whatever is entered by the users. This means that the system should deal with free text. In the next two sub-sections we describe in more detail the existing schemes for both types of keystroke dynamics.

    2.2.1 Static Typist Authentication

    Static authentication involves authenticating users through stable methods like user name and password. Behavioural static authentication is a static authentication method that determines how the user acts and behaves with the authentication system; for example, how a user name and password are typed. This method is used as an additional authentication method and to overcome some limitations of traditional authentication methods. Keystroke dynamics and mouse dynamics are the main examples of behavioural static authentication.

    Keystroke dynamics analyses the typing patterns of users. Using keystroke dynamics as an authentication method is derived from handwriting recognition, which analyses handwriting movements. Table 2.1 summarises a few techniques that will be discussed in this section. These techniques are evaluated by two measurements: the false rejection rate (FRR), when the system incorrectly rejects an access attempt by an authorised user, and the false acceptance rate (FAR), when the system incorrectly accepts an access attempt by an unauthorised user.
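
    As an illustration of the two measurements, the sketch below computes FAR and FRR from a list of authentication attempts. The input format, a pair (is_genuine, accepted) per attempt, is an assumption made for this example.

```python
def far_frr(attempts):
    """Return (FAR, FRR) as percentages.

    attempts: iterable of (is_genuine, accepted) boolean pairs.
    FAR = impostor attempts accepted / all impostor attempts.
    FRR = genuine attempts rejected / all genuine attempts.
    """
    genuine = [accepted for is_genuine, accepted in attempts if is_genuine]
    impostor = [accepted for is_genuine, accepted in attempts if not is_genuine]
    frr = 100.0 * sum(not a for a in genuine) / len(genuine) if genuine else 0.0
    far = 100.0 * sum(impostor) / len(impostor) if impostor else 0.0
    return far, frr


# Hypothetical outcomes: 4 genuine attempts (1 rejected),
# 5 impostor attempts (1 accepted).
attempts = (
    [(True, True)] * 3 + [(True, False)]
    + [(False, False)] * 4 + [(False, True)]
)
print(far_frr(attempts))  # (20.0, 25.0)
```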

    In 1980, Gaines et al. [29] were the first to use keystroke dynamics as an authentication method. They conducted an experiment with six users. Each participant was asked to retype two samples, and the two samples were collected four months apart. Each sample contained three paragraphs of varying lengths. They used specific digraphs that occurred in the paragraphs as features, collecting and analysing the keystroke latency timing. The five most frequent digraphs that appeared as distinguishing features were in, io, no, on and ul. Then, they compared latencies between the two sessions to see whether the mean values were the same in both sessions. The limitation of this experiment was that the data sample was too small to give reliable results. Also, no automated classification algorithm was used to distinguish between the participants, but the results were claimed to be very encouraging.

    Umphress and Williams [100] asked 17 participants to type two samples. One typing sample, used for training, included about 1400 characters, and a second sample, used for testing, included about 300 characters. They represented the features by grouping the characters into words and then calculating the time of the first 6 digraphs for each word. The classifier was based on statistical techniques, with the condition that each digraph must fall within 0.5 standard deviations of its mean to be considered valid. Their system obtained an FRR of 12% and an FAR of 6%.

    Reference | FAR (%) | FRR (%) | Sample content | Method
    Gaines et al. [29] | 0.00 | 0.00 | 6000 characters | Manual
    Umphress and Williams [100] | 12.00 | 6.00 | 1400 characters for training and 300 characters for testing | Statistical
    Leggett and Williams [57] | 5.5 | 5 | 1000 words | Manual
    Joyce and Gupta [46] | 16.36 | 0.25 | User name, a password and the last names, eight times for training and five times for testing | Statistical
    Bleha et al. [11] | 8.1 | 2.80 | Name and fixed phrase | Bayes
    Brown and Rogers [12] | 21.2 | 12.0 | First and last names | Neural network
    Obaidat and Sadoun [69, 70] | 0.00 | 0.00 | User names 225 times | Neural network
    Furnell et al. [28] | 26 | 15 | 4400 characters | Statistical
    Bergadano et al. [10] | 0.00 | 0.14 | 683 characters | Nearest Neighbour
    Sang et al. [88] | 0.2 | 0.1 | Alphabetic password and numeric password | SVMs

    Table 2.1: The accuracy of static typist authentication techniques

    Later, an experiment was conducted by Leggett and Williams [57], who invited 17 programmers to type approximately 1000 words. The experiment was similar to that of Gaines et al. [29], but with the condition that the user was accepted if more than 60% of the comparisons were valid. The results demonstrated that the FAR was 5.5% and the FRR was 5%.
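
    The statistical decision rules described above can be sketched as follows. This is a simplified illustration that combines two rules from the text: a digraph comparison is valid if the observed latency lies within k standard deviations of the profile mean (k = 0.5 in the Umphress and Williams description), and the user is accepted if more than 60% of the comparisons are valid (the Leggett and Williams condition). Combining them in one function, the dictionary-based profile format and the sample values are assumptions for this example, not the authors' implementations.

```python
def digraph_is_valid(observed, mean, std, k=0.5):
    """A comparison is valid if the latency is within k std devs of the mean."""
    return abs(observed - mean) <= k * std


def accept_sample(sample, profile, k=0.5, min_valid_fraction=0.6):
    """Accept if more than min_valid_fraction of shared digraphs are valid.

    sample:  dict digraph -> observed latency (ms)
    profile: dict digraph -> (mean latency, standard deviation)
    """
    shared = [d for d in sample if d in profile]
    if not shared:
        return False  # nothing to compare, so do not accept
    valid = sum(digraph_is_valid(sample[d], *profile[d], k=k) for d in shared)
    return valid / len(shared) > min_valid_fraction


# Hypothetical profile and test samples.
profile = {"th": (100.0, 20.0), "he": (120.0, 20.0), "in": (90.0, 10.0)}
genuine = {"th": 105.0, "he": 125.0, "in": 92.0}
impostor = {"th": 150.0, "he": 170.0, "in": 92.0}
print(accept_sample(genuine, profile))   # True
print(accept_sample(impostor, profile))  # False
```

    A tighter k lowers the FAR at the cost of a higher FRR, which is consistent with the different thresholds reported by the studies in this section.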

    Joyce and Gupta [46] recorded keystrokes during the log-in process, in which users typed their user name, a password, and their last name. 33 users participated in this experiment; they typed eight times to build a historical profile, and five times for testing. The classifier was based on a statistical approach. It required that a digraph fall within 1.5 standard deviations of its reference mean to belong to a valid user. The results demonstrated that the FAR was 16.36% and the FRR was 0.25%.

    Bleha et al. [11] used the same approach that was proposed by Joyce and Gupta [46], and they used digraph latencies as a feature to distinguish between samples of legal users and intruders. The experiment invited 14 participants as valid users and 25 as impostor users to create their profiles. The classification method was based on a Bayes classifier using the digraph times. Results show that the FAR was 8.1% and the FRR was 2.8%.

    Brown and Rogers [12] were the first to use the keystroke duration as a feature to distinguish between the samples of authenticated users and impostors. The experiment divided the participants into two groups (21 in the first group and 25 in the second group), and they were asked to type their first and last names. A neural network was applied in this experiment to classify the data, and the results show a 0.0% false negative rate and 12.0% FRR in the first group and 21.2% FAR in the second group.

    Furnell et al. [28] used digraph latencies as the representative feature. Thirty users were invited to type the same text of 2200 characters twice in order to build their profiles. For intruder profiles, the users were asked to type two different texts of 574 and 389 characters. Digraph latencies were computed by statistical analysis, and the results show that in the first 40 keystrokes of the testing sample, the FRR was 15% and the FAR was 26%.

    Obaidat and Sadoun [69, 70] used the keystroke duration and latency together as a feature to distinguish between the samples of authenticated users and impostors. The experiment invited 15 users to type their names 225 times each day over a period of eight weeks to build their profiles. A neural network was used to classify the user samples. The results showed that both FAR and FRR were zero.

    Bergadano et al. [10] used a single text of 683 characters for 154 participants, and they considered typing errors and the intrinsic variability of typing as a feature that can distinguish users. They used the degree of disorder in trigraph latencies as a dissimilarity metric, and a statistical method for classification that computes the average differences between the units in the array. The results show that the FAR was 4.0% and the FRR was 0.01%. This method is suitable for the authentication of users at log-in, but it is not applicable to continuous authentication because it requires predefined data.
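
    The degree of disorder mentioned above can be illustrated with a short sketch. The assumption here, based on how this measure is commonly described rather than on details given in this chapter, is that the n-graphs shared by two samples are sorted by latency in each sample, the disorder is the sum of the distances between each n-graph's positions in the two orderings, and the result is normalised by the maximum possible disorder for an array of that length. The data values are hypothetical.

```python
def degree_of_disorder(reference, sample):
    """Normalised disorder in [0, 1] between two typing samples.

    reference, sample: dict n-graph -> mean latency (ms).
    Only n-graphs present in both samples are compared.
    """
    shared = [g for g in reference if g in sample]
    n = len(shared)
    if n < 2:
        return 0.0  # a single element cannot be out of order
    ref_order = sorted(shared, key=lambda g: reference[g])
    smp_order = sorted(shared, key=lambda g: sample[g])
    smp_pos = {g: i for i, g in enumerate(smp_order)}
    disorder = sum(abs(i - smp_pos[g]) for i, g in enumerate(ref_order))
    # Maximum disorder: n^2 / 2 for even n, (n^2 - 1) / 2 for odd n.
    max_disorder = (n * n) // 2
    return disorder / max_disorder


# Hypothetical trigraph latencies.
reference = {"the": 300.0, "and": 350.0, "ing": 280.0, "ion": 400.0}
sample = {"the": 310.0, "and": 290.0, "ing": 330.0, "ion": 420.0}
print(degree_of_disorder(reference, sample))  # 0.5
```

    Because only the relative ordering of latencies matters, a user who types uniformly slower or faster than usual still yields a low disorder against their own reference sample.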

    Sang et al. [88] conducted the same experiment as Obaidat and Sadoun [69, 70] (duration and latency together) but with a different technique. The technique used a support vector machine (SVM) to classify ten user profiles, and the results demonstrated that this technique was the best for classifying the data of user profiles, with more accurate results of 0.02% FAR and 0.1% FRR.

    All of the previous techniques show that static typist authentication has had great success and can be used to distinguish different users effectively. They show that static typist authentication has different features that can represent the user typing behaviour, and that these features can be used for user authentication. In the next section, we will see whether continuous typist authentication has different features that can be used effectively for user authentication, as in static typist authentication.

    2.2.2 Continuous Typist Authentication

    Continuous typist authentication using dynamic or free text applies when users are free to type whatever they want and keystroke analysis is performed on the available information. Continuous typist authentication using dynamic or free text is much closer to a real world situation than using static text. The literature on keystroke analysis of free text is fairly limited. This section describes most continuous typist authentication techniques. They are summarised in Table 2.2.

    Reference | FAR (%) | FRR (%) | Accuracy (%) | Sample content | Method
    Monrose and Rubin [64] | - | - | 23 | Few predefined and free sentences | Euclidean distance and weighted probability
    Dowland et al. [22] | - | - | 50 | Normal activity on computers running Windows NT | Statistical method
    Dowland et al. [21] | - | - | 60 | Normal activity on specific applications such as Word | Statistical method
    Bergadano et al. [9] | 0 | 5.36 | - | Two different texts, each 300 characters long | Distance measure
    Nisenson et al. [68] | 1.13 | 5.25 | - | Task responses; each user typed between 1866 and 2551 keystrokes | LZ78
    Gunetti and Picardi [32] | 3.17 | 0.03 | - | Artificial emails; each user typed 15 samples; each sample contains 700 to 900 keystrokes | Nearest Neighbour
    Hu et al. [40] | 3.17 | 0.03 | - | 19 users, each providing 5 free-text typing samples | k-nearest neighbor
    Bertacchini et al. [8] | - | - | - | 62 different users typed 66 samples in Spanish | k-medoids
    Hempstalk et al. [38] | - | - | - | Real-world emails: 150 email samples and 607 email samples | Gaussian density estimation
    Janakiraman and Sim [44] | - | - | - | 22 users collected data from their daily email activity, ranging from 30,000 keystrokes to 2 million keystrokes | Based on a common list of fixed strings

    Table 2.2: The accuracy of continuous typist authentication techniques

    Monrose and Rubin [64] conducted an experiment on 31 users by collecting typing samples over about 7 weeks. Users ran the experiment from their own computers at their convenience. They had to type predefined sentences from a list of available phrases and/or enter a few sentences that were not predefined and completely free. It is unknown how many characters a user had to type to build a profile. The features that represent the user behaviour were the mean latency and standard deviation of digraphs, as well as the mean duration and standard deviation of keystrokes. To filter the user profile, they compared each latency and duration with its respective mean, and any values more than a threshold number of standard deviations above the mean (that is, outliers) were removed from the user profile. Testing samples were processed in the same way by removing outliers, and so turned into testing profiles to be compared to the reference profiles using three different distance measures:

    Euclidean distance

    Euclidean distance with calculation of the mean, and

    standard deviation time of latency and duration of digraph.

    The last experiment used Euclidean distance and added weights to digraphs. The accuracy is about 23% correct classification in the best case (that is, using the weighted probability measure).
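
    The profile filtering and the plain Euclidean comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the outlier threshold is left as a required parameter `max_std` because its value is not given in this text, the value 2.0 used in the example is arbitrary, and the weighted variant is not shown.

```python
import math
from statistics import mean, pstdev


def build_profile(timings, max_std):
    """Build a profile of mean times after removing outliers.

    timings: dict feature -> list of observed times (ms).
    Values more than max_std standard deviations above the
    feature mean are treated as outliers and dropped.
    """
    profile = {}
    for feature, values in timings.items():
        mu = mean(values)
        sd = pstdev(values)
        kept = [v for v in values if v <= mu + max_std * sd]
        profile[feature] = mean(kept)
    return profile


def euclidean_distance(p, q):
    """Euclidean distance over the features present in both profiles."""
    shared = set(p) & set(q)
    return math.sqrt(sum((p[f] - q[f]) ** 2 for f in shared))


# Hypothetical digraph latencies; the 1000 ms pause is an outlier.
timings = {"th": [100.0] * 9 + [1000.0], "he": [120.0, 120.0, 120.0]}
reference = build_profile(timings, max_std=2.0)
print(reference)  # {'th': 100.0, 'he': 120.0}
print(euclidean_distance(reference, {"th": 103.0, "he": 124.0}))  # 5.0
```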

    Dowland et al. [22] applied different data mining algorithms and statistical techniques to data samples from four users in order to distinguish between authenticated users and impostors. The users were observed for some weeks during their normal activity on computers running Windows NT. This means that there was no constraint on how the users used the computer. User profiles were built from the mean and standard deviation of digraph latency, and only digraphs typed less frequently by all the users in all samples were considered. To filter the user profile, there were two thresholds: any times less than 40ms or greater than 750ms were discarded. The results demonstrated a 50% correct classification rate. The same experiment was refined by Dowland et al. [21]. It included some application information for PowerPoint, Internet Explorer, Word, and Messenger. The experiment collected the data of eight users over three months and the results demonstrated that the FRR was 40%.

    Bergadano et al. [9] calculated a new measure, which was the time between the depression of the first key and the depression of the second key for each two characters in a sequence. Forty users were invited to build historical profiles by typing two different samples of text. Each text contained 300 characters and the participants were asked to type 137 samples. 90 new users were invited to build testing files by typing the second sample only. The mean distance was computed between the unknown sample and each sample of a user's profile, and the mean distance was also computed between the unknown sample and each user's profile, in order to classify the unknown sample. The authors applied a supervised learning scheme for improving the false negative rate, computing the mean and standard deviation between every sample in a user's profile and every sample in a different user's profile. Results demonstrated that the FRR was reduced to 5.36% and the FAR was zero.

    A longer experiment by Dowland and Furnell [23] collected about 3.5 million keystrokes from 35 users over three months. The sample content collected from users was based on global logging, which includes all possible typist behaviour.

    Nisenson et al. [68] collected free text samples from five users acting as normal users and 30 users behaving as attackers. The sample content was either an open answer to a question, copy-typing, or a block of free typing. The time differentials were calculated from the typing data and used as a user feature. Each normal user typed between 1866 and 2551 keystrokes. Attackers were asked to answer two open-ended questions and were required to type the specific sentence "To be or not to be. That is the question." Also, they were allowed to type free text of between 597 and 660 keystrokes. Then, the authors trained an LZ78-based classifier on these features. The system attained an FRR of 5.25% and an FAR of 1.13%.

    Gunetti and Picardi [32] used free text samples in their experiment, inviting 205 participants, and they used the same technique that Bergadano et al. [9] proposed in their work based on static typist authentication, discussed in the previous section. They created profiles for each user based on their typing characteristics in free text. The authors performed a series of experiments using the degree of disorder to measure the distance between the test sample and reference samples from every other user in their database. The samples are transformed into a list of n-graphs, sorted by their average times. To classify a new sample, it is compared with each existing sample in terms of both relative and absolute timing. Only digraphs that appear in both reference and unknown samples are used for classification. The Gunetti study [32] achieves very high accuracy when there are many registered users.

    Many researchers have used clustering algorithms in order to authenticate users. Hu et al. [40] applied a technique similar to that proposed by Gunetti and Picardi [32]. 19 users participated in this experiment, each providing five typing samples. Another 17 users provided 27 typing samples, which were used as impostor data. Typing environment conditions were not controlled in this data collection. They proposed a k-nearest neighbor classification algorithm in which an input needs to be authenticated only against a limited number of user profiles within a cluster. The main difference between the algorithm proposed by Hu et al. [40] and the method of Gunetti and Picardi (GP method) is that the authentication process of the proposed algorithm takes place within a cluster, while the GP method needs to go through the entire database. Also, for a user profile X, the k-nearest neighbor classification algorithm uses only its representative profile in the authentication process, while the GP method needs to compare with every sample of each user profile. They used the clustering algorithm to make a cluster for each user. First, each user provides several training samples, which are used to build the user's profile: a representative user profile is built by averaging the vectors from all training samples provided. Second, the k-nearest neighbour method is applied to cluster the representative profiles based on the distance measure. Finally, the authentication algorithm for new text is executed only on the user's corresponding cluster. The success of the proposed algorithm depends upon the threshold value, which is dependent on the registered users in the system. Moreover, because the proposed algorithm classifies and authenticates only users who are already registered in the system, the system is less effective when new users interact with it. The experiment shows that the proposed k-nearest neighbor classification algorithm can achieve the same level of FAR and FRR performance as the GP method. However, the proposed approach improved the authentication efficiency by up to 66.7% compared to the GP method.

    Bertacchini et al. [8] ran their experiment on a single dataset containing keystroke data of 66 labeled users typing in Spanish. The dataset contains a total of 66 samples, one sample per typing session, representing 62 different users. They also used clustering to classify users by making a cluster for each user, so the number of clusters is based on the number of users in the dataset. The proposed algorithm is Partitioning Around Medoids (PAM), an implementation of k-medoids first proposed by Kaufman and Rousseeuw [49]. It is a partitioning technique that clusters a data set of n objects into k clusters, where k is known a priori. It has the advantage over k-means that it does not need a definition of the mean, as it works on dissimilarities; for this reason, any arbitrary distance measure can be used. It is also more robust to noise and outliers than k-means because it minimises a sum of dissimilarities instead of a sum of squared Euclidean distances. The proposed approach worked successfully on the users registered in the system, but it failed when new users were added to the system.
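
    The k-medoids idea described above can be illustrated with a minimal sketch. This is not the implementation used by Bertacchini et al. and it is not the full PAM swap procedure; it is a simplified alternating variant (assign points to the closest medoid, then re-pick each medoid), and the names k_medoids, assign and dissim are assumptions introduced here for illustration. It shows the property mentioned in the text: only a dissimilarity function is needed, never a mean.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def assign(points: Sequence[T], medoids: List[int],
           dissim: Callable[[T, T], float]) -> List[int]:
    """Label each point with the index (0..k-1) of its closest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dissim(p, points[medoids[c]]))
        for p in points
    ]


def k_medoids(points: Sequence[T], k: int, dissim: Callable[[T, T], float],
              max_iter: int = 100, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Simplified alternating k-medoids (hypothetical helper, not full PAM).

    Returns the medoid indices and a cluster label for every point.
    """
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        labels = assign(points, medoids, dissim)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the previous medoid if a cluster lost all its points.
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest total dissimilarity
            # to the other members of its cluster.
            best = min(members,
                       key=lambda i: sum(dissim(points[i], points[j])
                                         for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, dissim)
```

    Because the cluster centre is always an actual data point, a single extreme sample cannot drag the centre away from the bulk of a user's samples in the way it can shift a mean.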

    Hempstalk et al. [38] collected typing input from real-world emails. They collected about 3000 emails over three months and then processed them into two final datasets, containing 150 and 607 email samples respectively. They then created profiles only for valid users, based on their typing characteristics in free text. They performed a series of experiments using Gaussian density estimation techniques, applying and extending an existing classification algorithm to the one-class classification problem, which describes only the valid user's typing data. Hempstalk et al. applied a density estimator algorithm to generate a representative density for the valid user's data in the training phase, and then combined the predictions of this representative density and the class probability model to predict new test cases.

    Janakiraman and Sim [44] replicated the work of Gunetti and Picardi [32]. They conducted the same experiment again but, unlike Gunetti and Picardi, they used the usual digraph and trigraph latencies directly as features. Twenty-two users were invited to take part in the experiment over two weeks. Some of the users were skilled typists who could type without looking at the keyboard. The other users were unskilled typists, but were still familiar with the keyboard as they had used it for many years. The users came from different backgrounds, including Chinese, Indian and European origin, and all were fluent in English. Keystrokes were logged as users went about their daily work of using email, surfing the web, creating documents, and so on. The data collected per user ranged from 30,000 keystrokes to 2 million keystrokes. In total, 9.5 million keystrokes were recorded for all users. However, they did not report their findings in their paper.

    One of the main limitations of the Gunetti and Picardi approach [32] is its high verification error rate, which causes a scalability issue. Gunetti and Picardi proposed a classical n-graph-based keystroke verification method (the GP method), which can achieve a low False Acceptance Rate (FAR). However, the GP method suffers from a high False Rejection Rate (FRR) and a severe scalability issue. Thus, GP is not a feasible solution for applications such as cloud computing, where scalability is a major concern. To overcome GP's shortcomings, Xi et al. [102] developed the latest keystroke dynamics scheme for user verification. To reduce the high FRR, they designed a new correlation measure using an n-graph equivalent feature (nGdv) that enables more accurate recognition of genuine users. Moreover, they proposed correlation-based hierarchical clustering to address the scalability issue. The experimental results show that nGdv-C can produce a much lower FRR while achieving almost the same level of FAR as the GP method.
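
    Since FAR and FRR are the figures by which these schemes are compared, a short sketch of how they are typically computed may help. The function name far_frr, the use of a distance score and the "accept when distance is at most the threshold" rule are assumptions made here for illustration; none of the cited papers is being reproduced.

```python
from typing import Sequence, Tuple


def far_frr(genuine_distances: Sequence[float],
            impostor_distances: Sequence[float],
            threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) for a distance-based verifier.

    Assumption: a sample is accepted when its distance to the claimed
    user's profile is less than or equal to `threshold`.
    """
    # FRR: fraction of genuine attempts that were wrongly rejected.
    false_rejects = sum(1 for d in genuine_distances if d > threshold)
    # FAR: fraction of impostor attempts that were wrongly accepted.
    false_accepts = sum(1 for d in impostor_distances if d <= threshold)
    far = false_accepts / len(impostor_distances)
    frr = false_rejects / len(genuine_distances)
    return far, frr
```

    Raising the threshold accepts more samples, which lowers FRR and raises FAR; lowering it does the opposite. This trade-off is why a method can show a low FAR and a high FRR at the same time.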

    All of the previous techniques show that continuous typist authentication, like static typist authentication, has had great success and can be used to distinguish users effectively. They show that features can be obtained from the typing data that represent the user's typing behaviour, and that these features can be used for successful user authentication. However, the features extracted from the typing data in continuous typist authentication do not guarantee strong statistical significance and do not inherently incorporate user typing behaviour. Furthermore, one of the main limitations of the previous continuous typist authentication techniques is that they require the users' data to be available in advance. In principle, the requirement of collecting the users' data in advance restricts the systems to authenticating only known users whose typing samples have been modelled. In some cases, it is impractical or impossible to obtain a pre-defined typing model of all users in advance (before detection time). It should be possible to distinguish users without a pre-defined typing model, which would make the system more practical.

    In the next section, we present some related pioneering work in both supervised and unsupervised typist authentication. Supervised and unsupervised methods will help to link the existing continuous typist authentication schemes with the relevant setting or scenario.

    2.3 Machine Learning in Typist Authentication

    There are two main types of settings (or scenarios) for continuous typist authentication. A continuous authentication scenario might be conducted either in an open-setting environment or a closed-setting environment [79]. Two main machine learning methods are used to represent these two settings. The first is supervised learning (closed-setting environment), where the profiles of authenticated users and possible impostors are available in advance, such as in a computer-based exam scenario. This type of setting might be considered a restricted environment, in that the environment should be under access control to stop any user who is not registered in the system. The second is unsupervised learning (open-setting environment), where the profile of the impostor is not available, such as in an online banking scenario. However, an open-setting environment might also arise when the profiles of both the impostor and the valid users are not available, such as in a computer-based TOEFL exam scenario.

    2.3.1 Supervised Learning

    Supervised approaches can only detect previously known typing models and are unable to detect unknown typing models. Predefined typing models of both the valid user and possible impostors are required to construct models in order to assign observations to one of the two classes. In this case, the identity of the user who initiates the session is known, as well as the identities of all possible impostors or colluders. The characteristics of supervised learning using typing behaviour are summarised below.

    This approach requires the continuous authentication scenario to be in a closed setting and restricted environment. The environment should prevent any user not registered in the system from gaining access.

    The identity of an authorised user (who initiates a session) is known, and this user has a pre-defined typing model registered in the database in advance.

    The unauthorised user, such as an impostor or intruder, would try to claim the identity of an authorised user throughout the session and, it is assumed, will have a pre-defined typing model registered in the database in advance.

    The unauthorised user in this approach might be an adversary or a colluder. The adversarial user may be deliberately acting maliciously towards the authorised user. This may happen when the authenticated user is harmed by a malicious person or forgets to log off at the end of the session. In this case, the malicious person may conduct some actions or events on behalf of the authorised user. Alternatively, the colluding user may be invited by the valid user to complete an action on behalf of the user, for example a TOEFL exam. The victim in this case would be the system operator or the owner of the application.

    The labelled normal data from the authorised users and anomalous data from possible impostors should be used in order to build the detection model. This approach is similar to a multi-class classifier that learns to differentiate between all classes in the training data. This classifier is then used to predict the class of an unseen instance by matching it to the closest known class.

    There are many existing supervised learning techniques for building models of normal behaviour when pre-defined typing models are available for all users. Neural networks (NNs) [19, 84, 12], decision trees (DTs) [80], support vector machines (SVMs) [88], nearest neighbour [32], and supervised statistical models [59, 10] are well-known supervised learning techniques.

    In general, supervised measures produce more accurate results than unsupervised approaches because the pre-defined typing models have labelled data (i.e., examples of both normal and anomalous behaviour). However, supervised techniques are only able to detect users with a known pre-defined typing model and cannot detect an unknown user. In the next section, we discuss studies that have used unsupervised techniques to detect impostors who do not have a pre-defined typing model.

    2.3.2 Unsupervised Learning

    In contrast to supervised learning approaches, unsupervised learning methods do not require a pre-defined typing model of the user. In this case, we assume that no pre-defined typing model is available for any user, authorised or not, at the beginning of the session. We do, however, assume that the user who initiates the session is authorised to do so. The challenge here is to build a typing model of this authorised user while, at the same time, trying to decide whether or not the session has been taken over by an impostor. A summary of the characteristics of unsupervised learning using typing behaviour follows.

    This approach requires the continuous authentication scenario to be in the open setting and in a non-restricted environment. The environment should be in a public location so that any user can use the computer system.

    While the identity of an (authorised) user who initiates a session may be known, no pre-defined typing model for this user is available prior to the commencement of the session.

    Similarly, the pre-defined typing model for the unauthorised user is not available or cannot be collected before the prediction time or testing time.

    The unauthorised user in this approach would be an adversarial user, and the victim in this case would be the end user.

    No labels for either normal or anomalous data are available for building the detection model in this approach.

    In this approach, the system determines whether the typing data in the testing phase is related to one user or two users by trying to identify any significant change within the typing data. Many unsupervised learning methods have been applied to anomaly detection: clustering [105, 90, 65], time series analysis [82, 4] and threshold analysis [42, 78] are well-known unsupervised learning techniques.

    In this thesis, we propose a novel unsupervised impostor detection approach, which uses a combination of threshold analysis and time series analysis to detect an impostor who may take over from the authorised user during the computer session (discussed in chapters 6 and 7). In the next section, we discuss related work on anomaly detection techniques, which are needed to find the data patterns related to an impostor.

    2.4 Anomaly Detection Techniques

    Anomaly detection is the detection of patterns in given data that do not conform to recognised normal behaviour [14]. These non-conforming patterns are referred to by two main terms in the context of anomaly detection: anomalies and outliers. Anomalies usually occur when the testing patterns do not match the historical or normal patterns. Anomalies occur for different reasons, such as malicious activity or the breakdown of a system. Anomaly detection faces several challenges, such as:

    Distinguishing the normal behaviour region from the anomalous behaviour region is very difficult.

    The intruder may adapt his activity so that it appears to be normal behaviour when in reality it is anomalous.

    Distinguishing between noisy data and anomalous data is a major issue.

    There are several applications of anomaly detection.

    Intrusion Detection: refers to the detection of malicious activity in a computer system [76]. The challenge for anomaly detection in this domain is the huge volume of data that the techniques need to deal with. Denning [20] classifies intrusion detection systems into host-based and network-based intrusion detection systems. Host-based intrusion detection systems monitor all or part of the dynamic behaviour and the state of a computer system. The dynamic behaviour in this domain can be profiled at different levels, such as the program level or the user level. The techniques of this domain have to model sequences of data or calculate the similarity between sequences. The main cause of anomalies in this domain is an outside attacker who wants to obtain unauthorised access to the network in order to steal information or disrupt the network. A CBAS based on keystroke dynamics can be thought of as a kind of intrusion detection application. An intrusion detection system monitors a series of events with the aim of detecting unauthorised activities. In the case of a CBAS with keystroke dynamics, the unauthorised activity is a person acting as an impostor by taking over the authenticated session of another (valid) user.

    Fraud Detection: refers to detecting criminal activities in business organisations such as banks, credit card companies and stock markets. Fawcett and Provost [24] were the first to introduce the term "activity monitoring" to fraud detection in this domain. They build a profile for each customer and monitor the profiles to detect any anomalies. One application in this domain is credit cards, where the challenge is detecting unauthorised credit card usage, which requires online detection of fraud as soon as the fraudulent transaction takes place.

    Now, we give a brief definition of, and the assumptions behind, some anomaly detection techniques:

    Classification based anomaly detection techniques: are used to learn a model from training data and then classify a test instance into one of the classes using the learnt model. Different anomaly detection techniques use different classification methods to build classifiers, for example neural network based, Bayesian network based, support vector machine based, and rule based methods. The assumption of these techniques is that a classifier that can discriminate between normal and anomalous classes can be learnt in the given feature space. These techniques can be classified into two categories [14]: multi-class classification, which assumes the training data belong to multiple normal classes; and one-class classification, which assumes the training data belong to only one class. The advantage of these techniques, especially multi-class techniques, is that they can use effective algorithms that discriminate between instances belonging to different classes. Furthermore, the comparison between a historical profile and a testing profile is very fast for these techniques. However, these techniques have some limitations, such as their dependence on the availability of labels for the different normal classes.

    Nearest neighbour based anomaly detection techniques: require a distance measure between two data instances. For continuous attributes, as in our research, the Euclidean distance is a well-known measure. The assumption of these techniques is that normal data instances lie in dense neighbourhoods, while anomalies are not close to their neighbours. For example, user typing is based on the premise that a user often types in a similar fashion (that is, the distance between samples from the same user is small) and different users often type differently (that is, the distance between samples from different users is large). Otey et al. [72] proposed a distance measure for data that include categorical and continuous attributes, treating the two separately. One of the advantages of these techniques is that they are unsupervised by nature and make no assumptions regarding the distribution of the data. However, these techniques have some limitations; for example, they sometimes miss anomalies, especially if the data have normal instances that do not have enough close neighbours or anomalies that do have enough close neighbours [14].

    Clustering based anomaly detection techniques: clustering is used to group similar data instances into clusters [97]. The clustering techniques are categorised into three groups depending on three assumptions. The first assumption is that normal data instances belong to a cluster while anomalies do not belong to any cluster. The drawback of techniques that depend on this assumption is that they focus on finding clusters instead of finding anomalous data. The second assumption is that normal data instances are close to the centre of a cluster while anomalies are far from it. These techniques have been applied in different areas, such as intrusion detection [83] and sequence data [13]. The third assumption is that normal data instances belong to large, dense clusters while anomalies belong to small or sparse clusters. One of these techniques, proposed by He et al. [37], is called the cluster-based local outlier factor (CBLOF). The technique takes into account the size of the cluster and the distance of the data to the centre of the cluster.
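
    The nearest neighbour assumption described above (samples from the same user are close, samples from different users are far apart) can be sketched as a simple anomaly score. This is an illustrative sketch only: the function names, the choice of the distance to the k-th nearest neighbour as the score, and the example feature vectors are assumptions introduced here, not a method taken from the cited work.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_anomaly_score(sample: Sequence[float],
                      reference: Sequence[Sequence[float]],
                      k: int = 3) -> float:
    """Distance from `sample` to its k-th nearest reference vector.

    A larger score means the sample lies further from the reference
    (normal) data and is therefore more likely to be anomalous.
    """
    if not reference:
        raise ValueError("reference set must not be empty")
    distances = sorted(euclidean(sample, r) for r in reference)
    # Fall back to the furthest neighbour if fewer than k are available.
    return distances[min(k, len(distances)) - 1]
```

    For example, with reference vectors of mean key duration and mean interval in milliseconds, a new vector from the same typist would be expected to score lower than one from a different typist; a threshold on the score then separates the two cases.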

    Our research needs an anomaly detection technique in order to detect the unauthorised activity of a person acting as an impostor by taking over the authenticated session of another (valid) user. The characteristics of our problem are similar to those of clustering-based anomaly detection. In our case, we need to check whether the typing data in a single session can be divided into two different clusters or forms one cluster. Time series analysis can be one of the applications of anomaly detection based on clustering. In section 2.5.3, we give more details about the time series analysis technique.

    2.5 Related Techniques

    The previous sections have established the background for the thesis as a whole. In this section, we review the research related to the work described in chapters 5 to 7.

    2.5.1 Feature Selection

    Feature selection has been an active research area in different domains, including the pattern recognition, statistics, and data mining communities [56]. The main idea of feature selection is to choose a subset of input variables by removing features with little or no predictive information. In order to select features from the typing data, it is important to understand the different features of the data. The main aim is to determine the best attributes that can be extracted from the user typing data. Raw keystroke dynamics data, such as key events and timestamps, cannot be used directly by an anomaly detector. Instead, sets of timing features that can help to differentiate between users are extracted from the raw keystroke dynamics data.

    Figure 2.1 illustrates how a string AB can be represented as a vector of four times. This information can be used to extract other keystroke characteristics, such as the average, standard deviation, maximum and minimum of the times, from the previous five extracted time vectors. The timing features of keystroke dynamics include:

    Duration, which is defined as the time between the pressing of a key and its release.

    Interval, which is defined as the time between two key presses, the time between the release of a key and the press of the next key, or the time between the releases of two successive keys; and the rate of typing, that is, the average number of characters per minute or second. All of the previous features have been extracted by different researchers [98, 12, 69].

    Figure 2.1: Transforming a keystroke pattern into a timing vector when a user inputs a string AB. The duration and interval times are measured in milliseconds.
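
    The duration and interval features listed above can be computed directly from press and release timestamps. The following is a minimal sketch; the KeyEvent record, the field names and the example timestamps are assumptions made here for illustration and are not taken from Figure 2.1.

```python
from typing import Dict, List, NamedTuple, Sequence


class KeyEvent(NamedTuple):
    """One keystroke with press and release timestamps in milliseconds."""
    key: str
    press_ms: int
    release_ms: int


def timing_features(events: Sequence[KeyEvent]) -> Dict[str, List[int]]:
    """Extract duration and the three interval types for successive keys."""
    features: Dict[str, List[int]] = {
        "duration": [e.release_ms - e.press_ms for e in events],
        "press_to_press": [],
        "release_to_press": [],
        "release_to_release": [],
    }
    for first, second in zip(events, events[1:]):
        features["press_to_press"].append(second.press_ms - first.press_ms)
        # This value is negative when the second key is pressed before
        # the first key has been released.
        features["release_to_press"].append(second.press_ms - first.release_ms)
        features["release_to_release"].append(second.release_ms - first.release_ms)
    return features
```

    For a user typing AB with A held from 0 ms to 90 ms and B held from 150 ms to 230 ms, the sketch yields durations of 90 and 80, a press-to-press interval of 150, a release-to-press interval of 60 and a release-to-release interval of 140.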

    Continuous authentication approaches based on keystroke dynamics use the sequences of characters that users type during a session as distinguishing features. Since users can type characters in any sequence during a session, continuous authentication approaches require the selection of multiple features that are representative of user typing behaviour. The n-graph is a popular feature among existing continuous authentication schemes. It is the time interval between the first and the last of n subsequent key-presses. Existing approaches use n-graph (feature) selection techniques to obtain user-representative features.
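
    The n-graph definition above can be made concrete with a short sketch that slides a window of n key-presses over a typing session and records the time between the first and the last press. The function name ngraph_durations and the (key, press time) input format are assumptions introduced here for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def ngraph_durations(presses: Sequence[Tuple[str, int]],
                     n: int = 2) -> Dict[str, List[int]]:
    """Map each n-graph to the list of its observed durations.

    `presses` is a sequence of (key, press_time_ms) pairs in typing order.
    The duration of an n-graph is the time between the first and the last
    of its n consecutive key-presses.
    """
    durations: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(presses) - n + 1):
        window = presses[i:i + n]
        graph = "".join(key for key, _ in window)
        durations[graph].append(window[-1][1] - window[0][1])
    return dict(durations)
```

    Because free text contains each n-graph a different number of times, the resulting lists have different lengths, which is one reason why the choice of which n-graphs to keep as features matters.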

    Dowland et al. [21] collected the typing samples of five users by monitoring their regular computer activities, without imposing any particular constraints on them, such as asking them to type a predefined set of words. They selected the features (2-graphs only) that occurred the least number of times across the collected typing samples. They used keystroke latency, which is the elapsed time between the release of the first key and the press of the second key. They built user profiles by computing the mean and standard deviation of the 2-graph latencies. They achieved correct acceptance rates in the range of 60%.
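
    A profile of the kind described above, holding the mean and standard deviation of each 2-graph latency, could be built as in the following sketch. The function name build_profile, the input format and the decision to skip 2-graphs with fewer than two observations are assumptions made here; the text does not say how Dowland et al. handled such cases or how they compared new samples against the profile.

```python
import statistics
from typing import Dict, List, Tuple


def build_profile(latencies: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Return {digraph: (mean latency, sample standard deviation)}.

    Digraphs with fewer than two observations are skipped, because a
    sample standard deviation is undefined for a single value.
    """
    profile: Dict[str, Tuple[float, float]] = {}
    for digraph, values in latencies.items():
        if len(values) < 2:
            continue
        profile[digraph] = (statistics.mean(values), statistics.stdev(values))
    return profile
```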

    Unlike Dowland et al., Gunetti and Picardi [32] avoided using 2-graph and 3-graph latencies directly as features. Instead, they used the latencies to determine the relative ordering of different 2-graphs and 3-graphs. They extracted the 2-graphs and 3-graphs that are common to two samples and devised a distance metric to measure the distance between the two orderings of these 2-graphs and 3-graphs. In order to identify the user of an unknown sample, they compared it with all the samples of the users by computing the distance between them. The user whose sample has the least distance is deemed to be the user of the unknown sample. They reported 95% accuracy.
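
    A distance between two orderings of shared n-graphs can be sketched as a normalised "degree of disorder": each shared n-graph is ranked by duration in both samples, and the rank displacements are summed and divided by the largest possible sum. The text above does not give the exact formula used by Gunetti and Picardi, so the sketch below should be read as one plausible form of such a relative-ordering distance; the function name and the example values are assumptions.

```python
from typing import Dict


def relative_order_distance(sample_a: Dict[str, float],
                            sample_b: Dict[str, float]) -> float:
    """Normalised disorder between the duration orderings of two samples.

    Each sample maps an n-graph to its (mean) duration. Returns 0.0 when
    the shared n-graphs are ordered identically and 1.0 when one ordering
    is the reverse of the other.
    """
    shared = sorted(set(sample_a) & set(sample_b))
    n = len(shared)
    if n < 2:
        raise ValueError("at least two shared n-graphs are required")
    # Rank shared n-graphs by duration (ties broken alphabetically).
    order_a = sorted(shared, key=lambda g: (sample_a[g], g))
    order_b = sorted(shared, key=lambda g: (sample_b[g], g))
    position_b = {g: i for i, g in enumerate(order_b)}
    disorder = sum(abs(i - position_b[g]) for i, g in enumerate(order_a))
    # Maximum disorder of n items: n^2/2 if n is even, (n^2 - 1)/2 if odd.
    max_disorder = (n * n) // 2
    return disorder / max_disorder
```

    Because only the ordering is used, the distance is unaffected if a user types uniformly faster or slower in one session than in another, which is the motivation for preferring relative ordering to raw latencies.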

    Janakiraman and Sim [44] selected popular English words such as "the", "or", "to" and "you" as features. They showed that many fixed strings qualify as good candidates and identified the user as soon as they typed any of the fixed strings. Janakiraman and Sim demonstrated that these words can be used to discriminate between users effectively.

    While the previously selected features exploit the occurrences of n-graphs, their selection criteria do not guarantee features with strong statistical significance, which may lead to less accurate statistical user representation. Furthermore, the selected features do not inherently incorporate user typing behaviour. We believe that the existing feature selection techniques do not represent the user typing behaviour effectively.

    Our approach of selecting features based on statistical techniques is similar to that of Manning et al. [60]. In their study, Manning et al. [60] suggested different statistical techniques for text mining to find the importance of a term in a document. While in our approach we use the statistical techniques to fin