
Automatic security labelling prototype system architecture

Alan Magar

The scientific or technical validity of this Contract Report is entirely the responsibility of the contractor and the contents do not necessarily have the approval or endorsement of Defence R&D Canada.

Defence R&D Canada – Ottawa

Contract Report DRDC Ottawa CR 2011-134

October 2011


Contract Scientific Authority

Original signed by David Brown

David Brown

Defence Scientist

Approved by

Original signed by Julie Lefebvre

Julie Lefebvre

Head / NIO Section

Approved for release by

Original signed by Chris McMillan

Chris McMillan

Head / Document Review Panel

© Her Majesty the Queen in Right of Canada, as represented by the Minister of National Defence, 2011

© Sa Majesté la Reine (en droit du Canada), telle que représentée par le ministre de la Défense nationale, 2011


Abstract

This report proposes a design for a system capable of determining and applying a security label to unstructured data. Policy-based Labelling Automation TechnologY Prototype for Unlabelled Sources (PLATYPUS) is a proposed system that will be capable of digesting the vast quantities of unstructured, unlabelled content and, using content analysis, determining the sensitivity of the information and assigning it the appropriate security label.

Résumé

Ce rapport propose la conception d’un système capable de déterminer le niveau de sensibilité de données non structurées et d’y apposer l’étiquette de sécurité appropriée. Le système Policy-based Labelling Automation TechnologY Prototype for Unlabelled Sources (PLATYPUS) (prototype de technologie d’automatisation de l’étiquetage basé sur la politique) est un système proposé qui pourra digérer d’importantes quantités de contenu non structuré et non étiqueté et, à l’aide de l’analyse du contenu, déterminer la sensibilité de l’information et y apposer la bonne étiquette de sécurité.


Executive summary

Automatic security labelling prototype system architecture Alan Magar; DRDC Ottawa CR 2011-134; Defence R&D Canada – Ottawa; October 2011.

Department of National Defence (DND) networks are typically system high networks containing a great deal of unlabelled data. Since all personnel with access to the network are appropriately cleared, the situation, although not ideal, has long been considered acceptable. However, this situation is starting to shift, first with the increased requirement for information sharing between security domains and more recently with systems supporting multi-caveat separation and multi-level access. All of these initiatives are predicated on data being appropriately labelled so that access to the data can be controlled and proper handling of the data can be enforced. Unfortunately, the process of properly labelling the vast quantities of unlabelled data in DND repositories is onerous.

Policy-based Labelling Automation TechnologY Prototype for Unlabelled Sources (PLATYPUS) is a proposed system that will be capable of digesting the vast quantities of unstructured, unlabelled content and, using content analysis, determining the sensitivity of the information and assigning it the appropriate security label. It accomplishes this through the following five services:

External Labelling Service – The External Labelling Service is the external interface through which users, applications or services submit data to be labelled;

Orchestration Service – PLATYPUS is a collection of completely independent web services. The business logic, and specifically the manner in which unlabelled data is routed between the web services, is provided by the Orchestration Service. By separating the business logic from the individual web services, the overall flexibility of the solution is increased, allowing it to support additional use cases;

Data Manipulation Service – The Data Manipulation Service is responsible for preparing the data for content analysis. This includes scanning the data for threats, determining the data type and converting the data to a common data format;

Classification Service – The Classification Service leverages content analysis and contextual information (e.g., data, system, user) as prescribed in the overarching policy in order to determine the security classification of the data; and

Label Creation Service – The Label Creation Service takes the security classification of the data, as determined by the Classification Service, and applies the appropriate security label and markings. The security label and markings are cryptographically bound to the data using a digital signature.
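The cryptographic binding described above can be sketched in a few lines. This is a minimal illustration only: the function names are hypothetical, and a keyed HMAC stands in for the asymmetric digital signature (and standards-based label format) a real deployment would require.

```python
import hashlib
import hmac
import json

def bind_label(data: bytes, label: dict, key: bytes) -> dict:
    """Bind a security label to data with a keyed signature.

    Sketch only: a production system would use an asymmetric digital
    signature so that any party holding the public key can verify the
    binding; an HMAC keeps this example dependency-free.
    """
    canonical_label = json.dumps(label, sort_keys=True).encode()
    digest = hmac.new(key, canonical_label + data, hashlib.sha256)
    return {"label": label, "signature": digest.hexdigest()}

def verify_binding(data: bytes, bound: dict, key: bytes) -> bool:
    """Recompute the binding and compare in constant time."""
    expected = bind_label(data, bound["label"], key)["signature"]
    return hmac.compare_digest(expected, bound["signature"])
```

Because the signature covers both the label and the data, altering either invalidates the binding, which is what allows the label to serve as a trustworthy basis for access mediation.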

In addition to the logical design, the report also includes a prototype design. The prototype design takes the logical design and details how to implement it using primarily open source components, albeit supplemented with a limited amount of custom code. While the selection of most open source products was relatively straightforward, the selection of the data classifier was more complicated. Not only is it the key component of PLATYPUS, but there appeared to be two viable candidates for use in this capacity: Apache Mahout and RapidMiner. Consequently, an options analysis was conducted of these two solutions. After considerable analysis it was determined that RapidMiner was the more mature offering and, at this point in time, better suited for inclusion in the prototype.

Based on the prototype design documented in this report it is recommended that Defence Research and Development Canada (DRDC) proceed with the prototype development in order to prove the viability of the technology and the design. Assuming that this next phase of the project is successful, DRDC should consider its integration in the Secure Access Management for Secret Operational Networks (SAMSON) project.


Sommaire

Automatic security labelling prototype system architecture Alan Magar; DRDC Ottawa CR 2011-134; R & D pour la défense Canada – Ottawa; octobre 2011.

Les réseaux du ministère de la Défense nationale (MDN) sont typiquement des réseaux exploités en mode « system high » qui contiennent beaucoup de données non étiquetées. Puisque tout le personnel qui a accès au réseau possède l’habilitation de sécurité appropriée, la situation, bien qu’elle ne soit pas idéale, est depuis longtemps considérée comme acceptable. Cependant, cela commence à changer, à cause de la demande croissante d’échange d’information entre différents domaines de sécurité et plus récemment à cause des systèmes qui prennent en charge la séparation multi-mentions et l’accès multi-niveaux. Toutes ces initiatives sont basées sur un étiquetage approprié des données, afin que l’accès à ces données puisse être contrôlé et qu’une manipulation adéquate des données puisse être mise en place. Malheureusement, le processus visant à étiqueter adéquatement une importante quantité de données non étiquetées se trouvant dans les dépôts du MDN est onéreux.

Le système PLATYPUS est un système proposé qui pourra traiter d’importantes quantités de contenu non structuré et non étiqueté et, à l’aide de l’analyse du contenu, pourra déterminer la sensibilité de l’information et lui attribuer l’étiquette de sécurité appropriée. Le système accomplit cette tâche à l’aide des cinq services suivants :

Service d’étiquetage externe (External Labelling Service) – Le service d’étiquetage externe est l’interface externe à laquelle les utilisateurs, les applications ou les services envoient les données qui doivent être étiquetées;

Service d’orchestration (Orchestration Service) – PLATYPUS est une collection de services Web entièrement autonomes. La logique opérationnelle, et surtout la manière dont les données non étiquetées sont acheminées entre les services Web, est fournie par le service d’orchestration. Le fait de séparer la logique opérationnelle des services Web distincts augmente la souplesse globale de la solution, lui permettant de prendre en charge des cas d’utilisation supplémentaires;

Service de manipulation des données (Data Manipulation Service) – Le service de manipulation des données est chargé de préparer les données pour l’analyse du contenu. Cela comprend l’exploration des données pour les menaces, la détermination du type de données et la conversion des données dans un format de données commun;

Service de classification (Classification Service) – Le service de classification tire parti de l’analyse du contenu et de l’information contextuelle (p. ex., données, système, utilisateur) conformément à la politique générale, afin de déterminer la classification de sécurité des données;

Service de création d’étiquettes (Label Creation Service) – Le service de création d’étiquettes reprend la classification de sécurité des données, déterminée par le service de classification, et appose l’étiquette de sécurité et les mentions de sécurité appropriées. L’étiquette et les mentions de sécurité sont liées de façon cryptographique aux données à l’aide d’une signature numérique.


En plus de la conception logique, le rapport comprend également la conception d’un prototype. La conception du prototype reprend la conception logique et expose en détail la façon de la mettre en œuvre à l’aide de composantes provenant principalement de logiciels libres, quoique complétée par une quantité limitée de code sur mesure. Bien que la sélection de la plupart des produits libres se soit faite de façon simple, la sélection du classificateur de données a été plus compliquée. C’est non seulement la composante clé du système PLATYPUS, mais en plus, il y avait deux solutions valables pouvant être utilisées pour cette capacité : Apache Mahout et RapidMiner. Par conséquent, une analyse des options offertes par ces deux solutions a été effectuée. Après une analyse poussée, il a été déterminé que RapidMiner était l’offre la plus mûre et qu’il s’adaptait mieux au prototype à ce moment-là.

Selon la conception du prototype documentée dans ce rapport, il est recommandé que Recherche et développement pour la défense Canada (RDDC) poursuive le développement du prototype, afin de prouver la viabilité de la technologie et de la conception. Si l’on part du principe que cette prochaine étape du projet réussira, RDDC devrait penser à son intégration au projet Gestion de l’accès protégé aux réseaux opérationnels secrets (SAMSON).


Table of Contents

Abstract
Résumé
Executive Summary
Sommaire
Table of Contents
1.0 Introduction
1.1 Background
1.2 Purpose
1.3 Scope
1.4 Assumptions
1.5 Document Structure
2.0 Logical Design
2.1 Overview
2.2 Use Cases
2.3 External Labelling Service
2.3.1 Authentication
2.3.2 Validation
2.3.3 Context Retrieval
2.4 Orchestration Service
2.5 Data Manipulation Service
2.5.1 Threat Detection
2.5.2 Data Identification
2.5.3 Context Retrieval (data)
2.5.4 Data Conversion
2.6 Classification Service
2.6.1 Data Classifier
2.6.2 Data Classification Algorithms
2.6.3 Policy-based Classification
2.7 Label Creation Service
2.7.1 Security Labelling
2.7.2 Security Marking
2.7.3 Cryptographic Binding
3.0 Options Analysis
3.1 Overview
3.2 Options
3.2.1 Apache Mahout
3.2.2 RapidMiner
3.3 Methodology
3.4 Analysis
3.4.1 Phase 1 – Research
3.4.2 Phase 2 – Development
3.4.3 Phase 3 – Operations
3.5 Results
4.0 Prototype Design
4.1 Overview
4.2 Scope
4.3 Architecture
4.4 External Labelling Service
4.4.1 ELS Web Service
4.4.2 Samba File Server
4.4.3 Validation Application
4.4.4 System & User Context Application
4.5 Orchestration Service
4.5.1 Logical Flow
4.5.2 Samba Client
4.5.3 Apache ODE
4.6 Data Manipulation Service
4.6.1 Threat Detection
4.6.2 Data Identification
4.6.3 Data Context Retrieval
4.6.4 Data Conversion
4.7 Classification Service
4.7.1 Data Classifier
4.7.2 Policy-based Classification
4.8 Label Creation Service
4.8.1 Security Labelling
4.8.2 Security Marking
4.8.3 Cryptographic Binding
5.0 Next Steps
5.1 Building the Prototype
5.2 Integrating with SAMSON
5.3 Contextual User Profile
6.0 Conclusions & Recommendations
References
Acronyms & Abbreviations
Annex A – Context
Data Context
System Context
User Context
Annex B – Open Source Data Classification
Annex C – PLATYPUS Policy Languages
Business Process Execution Language (BPEL)
eXtensible Access Control Markup Language (XACML)
XML Security Policy Information File (SPIF)


1.0 Introduction

1.1 Background

Much of the information stored on Government of Canada (GC) information systems does not contain a security label explicitly denoting the classification and handling restrictions of the information. This is problematic as information lacking a security label in a system high environment is handled at the highest classification level of the system, thereby severely inhibiting information sharing. In some cases, the information may be inappropriately handled due to its lack of a security label thereby potentially exposing sensitive information to compromise. A security label cryptographically bound to the data explicitly denotes the sensitivity of the data and can be used to mediate access to the data and enforce proper handling of the data.

What is required is a system that is capable of automatically determining, or aiding in the determination of, the sensitivity of both structured/semi-structured data (data stored in databases and document management systems, including eXtensible Markup Language (XML)) and unstructured data (email, web content, Portable Document Format (PDFs), Office documents, audio, video) using the context of the data and content analysis. Once the sensitivity of the data has been determined the data can be assigned an appropriate security label that is cryptographically bound to the data. This security label can then serve as the basis for access mediation and handling.
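As a rough illustration of combining content analysis with context, the following sketch scores a document against a hypothetical keyword list and adjusts the result using contextual information. All terms, weights, and thresholds here are invented for illustration; an operational classifier would instead be trained on labelled corpora, as discussed in the SCALE work [Reference 1].

```python
# Hypothetical keyword weights; a real classifier would be trained,
# not hand-written.
SENSITIVE_TERMS = {"secret": 3, "operational": 2, "deployment": 2, "exercise": 1}

def score_content(text: str) -> int:
    """Crude content analysis: sum the weights of sensitive terms."""
    return sum(SENSITIVE_TERMS.get(word, 0) for word in text.lower().split())

def classify(text: str, context: dict) -> str:
    """Combine the content score with context to select a label."""
    score = score_content(text)
    # Context can raise the score, e.g. data drawn from an
    # operational repository (a hypothetical context attribute).
    if context.get("source_repository") == "operations":
        score += 2
    if score >= 5:
        return "SECRET"
    if score >= 2:
        return "PROTECTED"
    return "UNCLASSIFIED"
```

The point of the sketch is the shape of the decision, not the scoring method: content analysis produces a signal, context shifts it, and policy thresholds map the result to a label.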

1.2 Purpose

The purpose of this report is to propose a prototype architecture capable of accepting unlabelled data and, by leveraging a number of services, determining and applying the appropriate security label to the data.

1.3 Scope

While this report will touch on various aspects of security label determination and application, it is not intended to address concepts related to data classification in detail. These are addressed in more detail in the related research effort, Security classification using automated learning (SCALE) [Reference 1].

1.4 Assumptions

This report assumes that the reader is familiar with the subjects of data security labelling and content analysis. Readers unfamiliar with these two subjects are encouraged to consult Investigation of Technologies and Techniques for Labelling Information Objects to Support Access Management [Reference 2] and Security classification using automated learning (SCALE) [Reference 1].


1.5 Document Structure

This document consists of the following sections:

Section 1 – Introduction: provides a general introduction to the document;

Section 2 – Logical Design: details the logical design for the prototype;

Section 3 – Options Analysis: presents the results of a detailed options analysis conducted to select the most suitable open source content analysis product;

Section 4 – Prototype Design: details a prototype design based on the logical design described in Section 2 and the content analysis product selected in Section 3;

Section 5 – Next Steps: proposes subsequent research steps based on the results of this report;

Section 6 – Conclusions & Recommendations: summarizes the conclusions and recommendations derived from the development of this report, as well as the recommended path forward;

Section 7 – References: identifies the reference material that was used in the development of this report;

Section 8 – Abbreviations and Acronyms: provides the long form for all of the acronyms used throughout the report;

Annex A – Context: describes the concepts of data, system and user context used throughout this report;

Annex B – Open Source Data Classification: lists alternative open source content analysis products currently available; and

Annex C – Prototype Policy Languages: discusses the policy languages proposed for use in the prototype.


2.0 Logical Design

2.1 Overview

PLATYPUS is a collection of web services that will enable users, applications and services to submit unlabelled data to be automatically classified and labelled. The prototype, which is illustrated in Figure 1, consists of the following five services:

External Labelling Service – The External Labelling Service, as the name implies, is an externally accessible service intended to be leveraged by clients, whether human or service oriented, with a requirement to determine the security classification of a given piece of data and apply the appropriate security label to the data;

Orchestration Service – The Orchestration Service is responsible for coordinating all communications between the web services and ensuring that data is appropriately routed between them;

Data Manipulation Service – The Data Manipulation Service is responsible for determining the file type of the data, converting it to a standard file format that is supported by the Classification Service and identifying/mitigating any threats found in the data;

Classification Service – The Classification Service utilizes content analysis and contextual information as prescribed in the overarching policy in order to determine the security classification of the data; and

Label Creation Service – The Label Creation Service is responsible for applying the appropriate security markings, creating an appropriate security label and cryptographically binding it to the data.

Figure 1 – PLATYPUS Logical Design
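The flow in Figure 1 can be sketched as a simple chain, with each service modelled as an independent function and the Orchestration Service owning the routing. The names and behaviours below are illustrative stand-ins for the web services described above, not the actual implementation.

```python
# Each PLATYPUS service is modelled as an independent function; the
# Orchestration Service owns the routing between them.
def data_manipulation(item):
    # Stand-in for threat scanning, type identification and conversion.
    item["converted"] = True
    return item

def classification(item):
    # Stand-in for content analysis against the overarching policy.
    item["classification"] = "UNCLASSIFIED"
    return item

def label_creation(item):
    # Stand-in for label creation and cryptographic binding.
    item["label"] = {"classification": item["classification"]}
    return item

PIPELINE = [data_manipulation, classification, label_creation]

def orchestrate(item):
    """Route an unlabelled item through the service chain in order."""
    for service in PIPELINE:
        item = service(item)
    return item
```

Because the services never call one another directly, the chain itself is data that the Orchestration Service can rearrange, which is what makes the alternate use cases in Section 2.2 cheap to support.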


2.2 Use Cases

The business logic has been separated from the web services in order to support a variety of business models and use cases. While the unlabelled data use case will be supported initially, it is envisioned that other use cases outlined below could easily be supported. Alternate use cases include the following:

Unlabelled Data Use Case – The unlabelled data use case is the base use case. In this use case a user or a web service submits one or more unlabelled files to the External Labelling Service. The user would submit the file(s) through a Samba file share, whereas the web service client would use a web service interface. PLATYPUS would then determine the appropriate security label and return the labelled data through the same interface from which it was submitted. This use case is most appropriate for the bulk labelling of data;

Chat Use Case – The chat use case would stream all of the communications between the participating parties through PLATYPUS to ensure that the security label assigned to the chat session accurately reflects the dialog taking place. In cases where the dialog is determined to be more sensitive than the security label indicates, a number of actions could take place. These actions could range from a security officer being notified, to a warning being displayed to the chat participants, to the chat session being terminated. In this use case there would be no need to send chat communications through the Data Manipulation Service or the Label Creation Service. Rather, chat communications could be routed directly to the Classification Service for near real-time analysis;

Email Use Case – The email use case would route all outgoing emails through PLATYPUS to ensure that the security label assigned to the email, and its attachments, accurately reflects its sensitivity. In cases where the email is determined to be more sensitive than the security label indicates, a number of actions could take place. These actions could range from a security officer being notified to a warning email being returned to the sender. In this use case there would be no need to send email communications through the Label Creation Service. However, email would need to be routed to the Data Manipulation Service as well as the Classification Service;

Label Suggestion Use Case – The label suggestion use case is a variation of the unlabelled data use case. In this use case users are not looking to have the data labelled. Instead, they are looking for advice on the appropriate label to apply to a given piece of data. In all likelihood this use case would be supported from within an existing application (e.g., email, word processor) or accessible through a web interface. Rather than returning the labelled data, PLATYPUS would merely return the results from the Classification Service. The user could then decide whether or not to have the data labelled; and

Cross Domain Solution (CDS) Use Case – Depending on the particular instantiation of the CDS, it will need to perform a subset of the functions provided by PLATYPUS prior to transferring data between security domains. For example, a low to high CDS is primarily concerned with the unauthorized transfer of malicious code from the low security domain to the high security domain. Consequently, it might leverage the PLATYPUS threat detection capability.1 While a high to low CDS is concerned to some extent with the unauthorized transfer of malicious code, its primary concern is data leakage. Consequently, it could leverage PLATYPUS services for threat detection, including the detection of hidden content, data identification, data conversion, as well as data classification. Ultimately the data classifier would be used to determine if the sensitivity of the data to be transferred exceeded the classification of the low domain.

2.3 External Labelling Service The External Labelling Service, as the name implies, is the externally accessible web service through which users and systems submit data to be labelled. The External Labelling Service, which is illustrated in Figure 2, performs the following functions:

Authentication; Validation; and Context Retrieval (user or system).

Figure 2 – External Labelling Service

2.3.1 Authentication Depending on the environment in which PLATYPUS is deployed, user and/or system authentication may be a prerequisite for access to the External Labelling Service. Authentication would help mitigate misuse of the service, including denial of service attempts and attempts to introduce malicious code into the system. User and/or system authentication would also provide additional context information that could prove useful in classifying the data. Lastly, mutual authentication would make it more difficult for attackers to stand up fraudulent labelling systems. It is envisioned that Security Assertion Markup Language (SAML) will be used for authentication to PLATYPUS.

1 This capability would need to be enhanced to protect against threats specifically targeted at the CDS.


2.3.2 Validation PLATYPUS includes a threat detection service in order to ascertain whether any deliberate or inadvertent attempts have been made to conceal content or to target the system through the use of malicious code. In order to help mitigate this threat the data can also be digitally signed by a valid entity. The digital signature may be used to ensure the integrity of the submitted data such that attempts to hide data or malicious code in the data can be detected. The digital signature could also be used to attest to the identity of the user or process submitting the data. While these safeguards are important for security labelling, they are potentially critical for other business use cases such as CDS. Within PLATYPUS the digital signature could also be used to provide system or user context information. If this optional capability is leveraged, then Secure/Multipurpose Internet Mail Extensions (S/MIME) or XML Digital Signature (XMLDSig) could be used to digitally sign data submitted to PLATYPUS.
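The validation flow can be sketched as follows. This is a minimal illustration, using an HMAC over the submitted bytes with a shared key as a stand-in for the S/MIME or XMLDSig signatures named above; the key and function names are assumptions, not part of the PLATYPUS design.

```python
import hashlib, hmac

# Integrity-validation sketch. PLATYPUS would use S/MIME or XMLDSig with
# public-key certificates; an HMAC stands in here so the control flow can
# be shown without a PKI.

SHARED_KEY = b"demo-key"  # assumption: placeholder for real credentials

def sign(data: bytes) -> str:
    return hmac.new(SHARED_KEY, data, hashlib.sha256).hexdigest()

def validate(data: bytes, signature: str) -> bool:
    # Any modification of the data after signing is detected here
    return hmac.compare_digest(sign(data), signature)

sig = sign(b"submitted document")
print(validate(b"submitted document", sig))  # True
print(validate(b"tampered document", sig))   # False
```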

2.3.3 Context Retrieval The External Labelling Service will also serve to retrieve user and/or system context information that would then be passed on to the Classification Service for use in the security classification determination process. While some of the context information could be retrieved from the authentication/validation process, other information might need to be retrieved from a repository (e.g., identity repository). A full discussion on context is included in Annex A. All readers are strongly encouraged to read this section of the report.

2.4 Orchestration Service The right music played by the right instruments at the right time in the right combination: that’s good orchestration.2

Web service technologies (e.g., XML, SOAP, Web Service Description Language (WSDL), Universal Description Discovery and Integration (UDDI)) provide a mechanism with which to describe, locate and invoke web services. However, these services alone do not provide the business process that dictates how the web services interact with one another at the message level and at the execution level, including error handling. The Orchestration Service, illustrated in Figure 3, will be responsible for routing data through the various web services and ensuring that each web service has the requisite information it needs to perform its function.

Relevant standards for orchestration include the World Wide Web Consortium (W3C) Web Services Choreography Description Language (WS-CDL) and OASIS Business Process Execution Language (BPEL). It is envisioned that one or both will be used to provide orchestration within PLATYPUS.
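As a rough illustration of what the orchestration logic must do (sequential routing plus error handling), the following sketch chains stub services. The service names, the (ok, payload) convention and the pipeline order are assumptions for the sketch, not the actual BPEL or WS-CDL process definition.

```python
# Hypothetical sketch of Orchestration Service control flow, assuming each
# PLATYPUS service is exposed as a callable returning (ok, payload).

def orchestrate(data, services):
    """Route data through services in order; stop and report on first error."""
    for name, service in services:
        ok, data = service(data)
        if not ok:
            return {"status": "error", "failed_service": name, "detail": data}
    return {"status": "ok", "result": data}

# Toy services standing in for the real web services
def data_manipulation(d):
    return (True, d.strip())

def classification(d):
    return (True, {"text": d, "classification": "UNCLASSIFIED"})

def label_creation(d):
    d["label"] = "<ConfidentialityLabel>%s</ConfidentialityLabel>" % d["classification"]
    return (True, d)

pipeline = [("DataManipulation", data_manipulation),
            ("Classification", classification),
            ("LabelCreation", label_creation)]
print(orchestrate("  some text  ", pipeline)["status"])  # ok
```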

2 Leonard Bernstein, composer


Figure 3 - Orchestration Service

2.5 Data Manipulation Service Previous experience with content analysis products has demonstrated that these products can only process data in specific data formats. Consequently, all data will need to be converted to a supported data format so that it can be analyzed accordingly. This is the role of the Data Manipulation Service. In other words, it performs the pre-processing required to prepare the data for the Classification Service. Specifically, the Data Manipulation Service, illustrated in Figure 4, performs the following functions:

Threat Detection; Data Identification; Context Retrieval (data); and Data Conversion.


Figure 4 - Data Manipulation Service

2.5.1 Threat Detection Data submitted for security labelling will need to be cleansed of any threats that may adversely affect the operation of PLATYPUS or the accuracy of the data labelling process. Of particular concern is malicious code that may alter the operation, and ultimately the security label determination, of PLATYPUS. However, hidden content is also a concern, especially if PLATYPUS is to be used in other business cases such as CDS. PLATYPUS is not intended to provide a general virus scanning service, although it may have some overlapping functionality. Specific threats that will need to be mitigated include the following:

Denial of Service - Denial of service is a category of attack whereby improperly structured data is sent with the intent of slowing down or disabling the service so that valid requests are denied;

Embedded code - Embedded code is a category of attack whereby executable content and commands are intermingled with data in order to compromise the system in some manner;

External reference attacks - External reference attacks are a category of attack whereby a Uniform Resource Identifier (URI) redirects the user to malicious remote content; and

Concealed/hidden content attacks – Concealed/hidden content is problematic in terms of security labelling because this type of content can contain sensitive information that affects the overall security classification of the data. If the concealed/hidden content is not identified and/or removed, then the classification label ultimately assigned to the data may not be accurate. An inaccurate security label can lead to mishandling, and even compromise, of the data. Consequently, it is very important that concealed/hidden content be identified and/or removed so that any security label assigned accurately represents the contents of the data being labelled.
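A minimal pre-screening pass over submitted data might look like the following sketch. The patterns, the size limit and the finding messages are illustrative assumptions, not the actual PLATYPUS threat detection rules.

```python
import re

# Illustrative pre-screening checks for the Threat Detection function.

MAX_BYTES = 10 * 1024 * 1024  # reject oversized submissions (denial of service)
URI_RE = re.compile(r'\b(?:https?|ftp|file)://\S+', re.IGNORECASE)
SCRIPT_RE = re.compile(r'<script\b', re.IGNORECASE)

def screen(data: bytes):
    """Return a list of findings; an empty list means no threat detected."""
    findings = []
    if len(data) > MAX_BYTES:
        findings.append("oversized input (possible denial of service)")
    text = data.decode("utf-8", errors="replace")
    if SCRIPT_RE.search(text):
        findings.append("embedded executable content")
    for uri in URI_RE.findall(text):
        findings.append("external reference: " + uri)
    return findings
```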


Note – Example: Microsoft Compound Binary/Document File Format (MCBFF or MCDFF)

The Microsoft Compound Binary File Format (MCBFF), sometimes referred to as the Microsoft Compound Document File Format (MCDFF), is the container format used in Microsoft Office 2003 to contain Office-specific data. This container is extremely verbose as Microsoft uses fixed sector sizes that contain unused space. It is for this reason that a Microsoft Word 2003 document containing the text “Hello world.” is 19.5 KB while the text file equivalent is 12 bytes. This container file system is problematic from a security labelling perspective due to the ease with which hidden data can be added to the container that is unknown to the container’s file system. This hidden data can take the form of video, documents, pictures, etc. Given that there are no standards to specify what is valid container content and what isn’t, this hidden data is all but impossible to detect.

2.5.2 Data Identification In order for data to be converted the data type must first be identified. Given the myriad of data formats in existence, this is no small feat. Once the data has been identified the Data Manipulation Service will determine if it is capable of converting it. If so, the data will be processed by the data conversion process. If not, then an error message will be returned. Likewise, if the data identification process is unable to identify the data then an error message will need to be returned. It is envisioned that PLATYPUS will originally support text-based data formats such as American Standard Code for Information Interchange (ASCII), MCBFF, Microsoft Office Open XML, HyperText Markup Language (HTML), PDF, XML, etc.
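Data identification by leading "magic" bytes can be sketched as follows; the signature table covers only a few of the formats named above, and the plain-text fallback behaviour is an assumption for the sketch.

```python
# A minimal data-identification sketch using leading "magic" bytes.

SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "MCBFF (Office 2003)"),
    (b"PK\x03\x04", "ZIP container (e.g., Office Open XML)"),
    (b"<?xml", "XML"),
]

def identify(data: bytes):
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    # Fall back to treating decodable input as plain text
    try:
        data.decode("ascii")
        return "ASCII text"
    except UnicodeDecodeError:
        return None  # unidentified: an error message would be returned

print(identify(b"%PDF-1.7 ..."))  # PDF
```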

2.5.3 Context Retrieval (data) The Data Manipulation Service will be responsible for extracting intrinsic metadata from the data. It will also be responsible for determining syntactic context information about the data. Lastly, it may be required to retrieve extrinsic metadata from an external repository. These concepts are discussed in more detail in Annex A.

2.5.4 Data Conversion All data will need to be converted to a data format supported by the Classification Service. In order to facilitate threat detection (Section 2.5.1) it is envisioned that all data will be converted to XML. XML was selected for a number of reasons. First and foremost, XML is a widely used standard. Second, XML is relatively benign compared to most other file formats thereby providing decreased exposure to a range of threats.
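A minimal sketch of the conversion step for plain text follows; the XML envelope and element names are invented for illustration, as the real service would use a defined schema.

```python
from xml.sax.saxutils import escape

# Sketch of the Data Conversion step: wrap extracted text in a simple XML
# envelope with markup characters escaped, reducing exposure to threats
# carried by richer file formats.

def to_xml(text: str, fmt: str) -> str:
    return ('<document sourceFormat="%s"><content>%s</content></document>'
            % (escape(fmt, {'"': "&quot;"}), escape(text)))

print(to_xml("Hello <world> & co.", "ASCII"))
```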

2.6 Classification Service The Classification Service is responsible for determining the sensitivity of the data using a number of available inputs and according to an overarching policy. The security classification of a given piece of data will be determined in part by comparing the data to prototypical documents and determining whether it matches a previously observed pattern. The Classification Service, which is illustrated in Figure 5, consists of the following components:


Data Classifier; Data Classification Algorithms; and Policy-based Classification.

Figure 5 - Classification Service

2.6.1 Data Classifier Classification systems are a form of machine learning that uses learning algorithms to provide a way for computers to make decisions based on experience and in the process to emulate certain forms of human decision making.3

The Data Classifier is the key component of PLATYPUS. While all other components are an integral part of the solution, in all likelihood the majority of data to be classified will be unstructured data with relatively little context information. Consequently, the data classifier will be essential in order to determine the security classification of the data.

In the case of PLATYPUS the data classifier will be used to determine the appropriate classification of a given piece of data based on the use of learning algorithms and training data. The data classifier, as illustrated in Figure 6, consists of two separate processes. The first is the training process, in which training documents with known classifications (reference decisions) are fed into the training algorithms in order to produce the data classification model. This process, which is outside of the scope of this report, is the subject of a separate research initiative that will attempt to optimize a suitable data classification model for PLATYPUS. Specifically, this involves the use of statistical natural language processing and machine learning techniques to determine the sensitivity of data. Initial results from this research initiative can be found in Security classification using automated learning (SCALE) [Reference 1]. The second process is the actual data classification process, in which text to be classified is fed into the data classification model in order to determine its classification. The results of the data classification process will then be used as additional training examples with reference decisions. It is anticipated that this feedback mechanism will improve the overall accuracy of the data classification model.

3 Mahout in Action [Reference 3]
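The two processes in Figure 6 can be illustrated with a deliberately tiny Naive Bayes classifier: training on documents with reference decisions, then classifying new text. The training documents and labels below are invented, and this toy stands in for the kNN/NB/SVM experiments reported in SCALE [Reference 1].

```python
import math
from collections import Counter, defaultdict

# A deliberately tiny multinomial Naive Bayes text classifier sketching
# the training and classification processes of Figure 6.

class NaiveBayes:
    def train(self, docs):  # docs: list of (text, label) reference decisions
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for text, label in docs:
            self.label_counts[label] += 1
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def classify(self, text):
        words = text.lower().split()
        best, best_lp = None, float("-inf")
        total = sum(self.label_counts.values())
        for label in self.label_counts:
            lp = math.log(self.label_counts[label] / total)  # prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:  # add-one smoothed word likelihoods
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes()
nb.train([("budget figures fiscal year", "CLASSIFIED"),
          ("press release public event", "UNCLASSIFIED")])
print(nb.classify("fiscal budget numbers"))  # CLASSIFIED
```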

Note – Multi-Stage Classifier

The data classifier may need to leverage a multi-stage classifier in order to effectively classify data. This multi-stage classifier would consist of two phases: clustering and classification. Clustering techniques would be used to determine the topical domain(s) of the document within a pre-defined taxonomy, whereas the data classifier would be used to ascertain whether the data was classified or not. It is anticipated that documents would not necessarily belong to a unique topical domain but may overlap multiple domains. The clustering algorithm would determine the correlation of the document with each topical domain and this result would be used to select the applicable classification model(s).
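The two-phase idea in the note can be sketched as follows, with a stubbed clustering step assigning topical-domain correlations and a per-domain classification model consulted for each correlated domain. All domain names and models here are illustrative assumptions.

```python
# Sketch of the multi-stage classifier: a (stubbed) clustering step
# assigns topical-domain weights, then the matching per-domain
# classification model(s) are consulted.

def topical_domains(text):
    # Stub: correlation of the document with each pre-defined domain
    domains = {"finance": 0.0, "operations": 0.0}
    for w in text.lower().split():
        if w in ("budget", "fiscal"):
            domains["finance"] += 1
        if w in ("mission", "deployment"):
            domains["operations"] += 1
    return {d: s for d, s in domains.items() if s > 0}

def classify(text, models):
    # A document may overlap multiple topical domains, so every
    # correlated domain's model is consulted.
    return {d: models[d](text) for d in topical_domains(text)}

models = {"finance": lambda t: "CLASSIFIED",
          "operations": lambda t: "UNCLASSIFIED"}
print(classify("budget and mission planning", models))
```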

Figure 6 - Data Classifier 4

2.6.2 Data Classification Algorithms A significant amount of additional research will need to be conducted into the data classification algorithms in order to determine which ones are most effective in terms of determining the security classification of data. Furthermore, it is envisioned that subsequent research efforts will look at the possibility of leveraging a multi-stage classifier, concept mapping techniques, and cross-domain inference in order to improve the accuracy of the results.

Security classification using automated learning (SCALE) [Reference 1] conducted a number of tests using three different machine learning algorithms: k-Nearest Neighbour (kNN), Naïve Bayes (NB) and Support Vector Machines (SVM). It was the conclusion of this report that there truly is no silver bullet when choosing a machine learner; depending upon specific system goals, some learners are more suitable than others. It is envisioned that the classification software selected for PLATYPUS must be capable of supporting a wide variety of machine learning algorithms in order to accommodate different requirements and the results of future research.

4 This figure is based on a diagram found in Mahout in Action [Reference 3]

2.6.3 Policy-based Classification Policy-based classification, illustrated in Figure 7, refers to the ability to take various inputs and, based on the relevant policy, apply them so that the most accurate security classification can be achieved. Inputs will include the results of the data classification effort, the context information and even environmental variables and temporal data. The data classification effort could be the result of a single classification process or a multi-stage classifier. The context information will consist of the data, user, and system context information discussed in Annex A. Environmental variables could include such things as location, time and even threat level. Temporal data consists of a time period attached to the data that may affect its security classification. For example, some data (e.g., budget information) is classified for a certain period of time at which point it becomes unclassified.

It is envisioned that the policy-based classification used within PLATYPUS will initially be fairly basic in that weights will be assigned to the various inputs. However, these input weights could automatically be adjusted based on the level of confidence of the data classification process. For example, if the data classifier determines the classification of the data with a high degree of confidence, then the weight of this input could be increased correspondingly. Similarly, if the data classifier determines the classification of the data with a low degree of confidence, then the weight of this input could be decreased significantly. In terms of context information, the presence of certain metadata fields may result in an increased weighting, whereas the inability to obtain sufficient system context would decrease the weighting of this input.
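One possible sketch of this confidence-scaled weighting follows. The classification levels, base weights and 0-based level scale are invented for illustration; the actual policy would be expressed in XACML rather than code.

```python
# Illustrative weighted combination of classification inputs, with the
# data-classifier weight scaled by its confidence as described above.

LEVELS = ["UNCLASSIFIED", "PROTECTED", "CLASSIFIED"]  # assumed ordering

def combine(inputs, confidence):
    # inputs: {source: (level, base_weight)}
    score, total = 0.0, 0.0
    for source, (level, weight) in inputs.items():
        if source == "data_classifier":
            weight *= confidence  # low confidence -> less influence
        score += weight * LEVELS.index(level)
        total += weight
    return LEVELS[round(score / total)]

result = combine({"data_classifier": ("CLASSIFIED", 1.0),
                  "context": ("UNCLASSIFIED", 0.5)},
                 confidence=0.9)
print(result)  # PROTECTED
```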

It is anticipated that the classification policy will be defined using eXtensible Access Control Markup Language (XACML). This is discussed in more detail in Annex C.

Figure 7 - Policy-based Classification


2.7 Label Creation Service The Label Creation Service is responsible for taking the security classification determined by the Classification Service and translating it into a security label and security markings according to policy. The Label Creation Service will support security labels through the use of a Security Policy Information File (SPIF). The SPIF will be used to specify valid security labels and security markings and the manner in which they should be applied. It will also allow the system to adapt to security labelling changes without having to hard code these changes. Additional information on the XML SPIF can be found in Annex C. The security label is then cryptographically bound to the data. The Label Creation Service, which is illustrated in Figure 8, performs the following functions:

Security Labelling; Security Marking; and Cryptographic Binding.

Figure 8 - Label Creation Service

2.7.1 Security Labelling The Security Labelling component will include an XML label as a separate data element rather than incorporating it into the data itself. This approach, which is consistent with the work done by the North Atlantic Treaty Organization (NATO) Research Task Group on XML in Cross Domain Security Solutions 5, will allow the Label Creation Service to support a wide variety of data formats. The NATO Profile for the XML Confidentiality Label Syntax specifies a recommended syntax for machine readable labels. It is anticipated that the Label Creation Service would comply with this syntax.

For non-XML data it is envisioned that PLATYPUS will adopt a different strategy. The non-XML data will be included in a zip file and the security label will be included as metadata in the archive. This approach allows PLATYPUS to support a wide variety of data formats without having to design specific security labels for each data format.

5 A Proposal for an XML Confidentiality Label and Related Binding of Metadata to Data Objects [Reference 4]


2.7.2 Security Marking The Security Marking component will be responsible for adding human readable security marking information to XML data as specified in the SPIF. The NATO Profile for the XML Confidentiality Label Syntax [Reference 5] specifies a recommended syntax for human readable labels. It is anticipated that the Label Creation Service would comply with this syntax. For non-XML data, no security marking information will be added unless this functionality can be automatically leveraged through a third-party product (e.g., Titus Labs Document Classification).

2.7.3 Cryptographic Binding In order for the security label to be used for access control and information flow decisions the security label must be cryptographically bound to the data. The NATO Profile for the Binding of Metadata to Data Objects [Reference 6] proposes the semantics for the binding of metadata to XML using XMLDSig. For non-XML data, an XML digital signature of the zip file containing the data will be included in the archive file.
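The non-XML packaging described above might be sketched as follows, with a SHA-256 digest entry standing in for the XML digital signature of [Reference 6]; the file names inside the archive are assumptions for the sketch.

```python
import hashlib, io, zipfile

# Sketch of the non-XML strategy: package the data and its security label
# in a zip archive together with a digest entry binding the two. A real
# system would include an XMLDSig signature instead of a bare digest.

def bind(data: bytes, label_xml: str) -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("content.bin", data)
        z.writestr("label.xml", label_xml)
        z.writestr("digest.txt",
                   hashlib.sha256(data + label_xml.encode()).hexdigest())
    return buf.getvalue()

archive = bind(b"report body", "<ConfidentialityLabel>SECRET</ConfidentialityLabel>")
with zipfile.ZipFile(io.BytesIO(archive)) as z:
    print(sorted(z.namelist()))  # ['content.bin', 'digest.txt', 'label.xml']
```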


3.0 Options Analysis

3.1 Overview The purpose of this section of the report is to select a product to be used as the data classifier for PLATYPUS. To accomplish this, an options analysis comparing two leading open source machine learning products will be conducted. Based on previous efforts (see note below) it was decided that open source machine learning solutions are comparable to their commercial counterparts. Furthermore, it was determined that in some cases these open source solutions were better supported than their commercial brethren. The paper The Need for Open Source Software in Machine Learning [Reference 7] makes an argument for the use of open source software in machine learning.

The Machine Learning Open Source Software (MLOSS)6 organization lists 278 open source machine learning solutions. These solutions, which are also referred to as data mining frameworks and natural language processing toolkits, were then whittled down to just two products. The two products selected, Apache Mahout and RapidMiner, were chosen based on the reading and research conducted over the course of the previous year. Apache Mahout is a relatively new entrant into the space but it seemingly has the weight of Apache, Google and IBM behind it. RapidMiner is an established veteran in this space and was recently voted the most used data mining/analytic tool in a KDnuggets poll.7

This section consists of the following sub-sections: Options; Methodology; Analysis; and Results.

Note – Previous Efforts

Two initiatives were conducted that examined the effectiveness of data classifiers. Both of these initiatives leveraged prototypical documents to train the data classifier. However, the initiatives differed in that one was conducted using an open source solution while the other leveraged a commercial solution. Security classification using automated learning (SCALE) [Reference 1] documents the results of the open source data classifier, whereas [Reference 8] documents the results of a commercial data classifier. While both initiatives achieved comparable results, it was determined that the open source solution was easier to work with and, somewhat surprisingly, better supported than its commercial counterpart.

6 http://mloss.org/software/
7 http://www.kdnuggets.com/polls/2010/data-mining-analytics-tools.html


3.2 Options

This section will provide a brief introduction of the two solutions being analyzed as part of the options analysis. The two solutions are Apache Mahout and RapidMiner.

3.2.1 Apache Mahout Apache Mahout was started by members of the Apache Lucene community as a subproject in January 2008. The intent of Apache Mahout was to implement the ten machine learning libraries documented in Map-Reduce for Machine Learning on Multicore [Reference 9] using Hadoop. Version 0.1 of Apache Mahout was released in April 2009. This version was quickly followed by versions 0.2, 0.3 and 0.4 in November 2009, March 2010 and October 2010 respectively. Version 0.5 of Apache Mahout is currently under development with fourteen developers contributing to the project. Apache Mahout, which is now an Apache project in its own right, is written primarily in Java (83%).

3.2.2 RapidMiner RapidMiner, which was originally called YALE (Yet Another Learning Environment), originated in the Artificial Intelligence Unit of the University of Dortmund in 2001. Although the code remains open source, the university spun off Rapid-I in 2007 in order to develop and support RapidMiner. RapidMiner is available in a free Community Edition as well as an Enterprise Edition. The two are functionally equivalent; however, the Enterprise Edition includes professional support provided by Rapid-I, the company responsible for development and maintenance of both versions of RapidMiner. The current version of the software, version 5.0, is written primarily in Java (86%). While RapidMiner has been hosted by SourceForge since 2004, the Rapid-I site is a better place to find information.

3.3 Methodology The product selected must be capable of being leveraged across all three phases of the project. These include a research phase, a development phase and eventually an operational phase. Consequently, the options analysis will be conducted using an analysis of the compliance of the two open source data classification solutions in terms of seven evaluation factors that cross all three phases. Since this is a paper-based exercise we have kept the evaluation factors fairly high-level. There is no sense having very specific evaluation factors if we cannot adequately assess the products due to a lack of available information.

Evaluation factors are assigned either five or ten points depending on their perceived importance. The more important evaluation factors have been assigned a greater weighting (10 points) while the less important evaluation factors have been assigned a lesser weighting (5 points). Three evaluation factors were assigned a score out of ten while four were assigned a score out of five, for a total of 50 points. The scoring, which is dependent on the degree of compliance, is as follows: five (or 10) points for excellent, four (or 8) points for good, three (or 6) points for average, two (or 4) points for below average and one (or 2) point for poor.

8 Additional information on Apache Mahout can be found at http://mahout.apache.org/, https://www.ibm.com/developerworks/java/library/j-mahout/index.html and in Apache Mahout in Action [Reference 3].
9 Additional information on RapidMiner can be found at http://rapid-i.com and http://sourceforge.net/projects/rapidminer/.

The seven evaluation factors, which can be seen in Table 1, are as follows:

Phase 1 - Research

1. Algorithms – This evaluation factor will be used to assess the depth and breadth of the data classification algorithms supported by the solution. While consideration will be given to the overall roadmap for the implementation of data classification algorithms in the product, the primary focus will be on algorithms that are currently implemented in the solution. This evaluation factor will also take into account the product’s ability to support a multi-stage classifier;

2. Usability – This evaluation factor will be used to assess the overall ease of use of the solution. It will take into account the ease with which the product can be installed, datasets can be manipulated and the range of functionality provided through a Graphical User Interface (GUI);

Phase 2 - Development

3. Integration – This evaluation factor will be used to assess the ease with which the product can be integrated into PLATYPUS. Specifically, this evaluation factor will examine the Application Programming Interfaces (API) and the degree to which the functionality required by PLATYPUS can be accessed through them;

4. Support – This evaluation factor will be used to assess the level of support for the product. Support includes such things as documentation, forums, mailing lists, books, newsletters, training, user conferences, webinars, etc. In terms of documentation, this evaluation factor will be used to assess the depth and breadth of the documentation available for the solution;

Phase 3 - Operational

5. Licensing – This evaluation factor will be used to assess the software license governing the use of the product and the degree with which it supports the product’s inclusion in PLATYPUS;

6. Maturity – This evaluation factor will be used to assess the level of maturity of the product. It will also be used to examine the level of adoption of the product and the overall viability of the solution going forward; and

7. Scalability – This evaluation factor will be used to assess the overall scalability and performance of the solution. As part of this, this evaluation factor will examine support for distributed computing and cloud computing environments.
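The weighting and scoring scheme described in Section 3.3 amounts to a simple lookup. The sketch below assumes ratings map linearly to points, with the 10-point factors doubling the 5-point scale, as the scoring description above implies.

```python
# The scoring scheme above expressed as a lookup: ratings map to points
# scaled by the factor's weight (5 or 10).

RATING = {"excellent": 5, "good": 4, "average": 3, "below average": 2, "poor": 1}

def score(rating: str, weight: int) -> int:
    # weight is 5 or 10; a 10-point factor doubles the base rating
    return RATING[rating] * (weight // 5)

print(score("good", 10))   # 8
print(score("average", 5)) # 3
```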


Note – While both products will be installed in order to gauge their overall usability and capabilities, the Options Analysis is primarily a paper-based exercise.

Phase                   Evaluation Factor   Apache Mahout   RapidMiner
Phase 1 – Research      Algorithms          /10             /10
                        Usability           /5              /5
Phase 2 – Development   Integration         /10             /10
                        Support             /10             /10
Phase 3 – Operations    Licensing           /5              /5
                        Maturity            /5              /5
                        Scalability         /5              /5
                        TOTAL               /50             /50

Table 1 - Options Analysis Methodology

3.4 Analysis The options analysis was conducted according to the methodology outlined in Section 3.3. The analysis portion is included in this section of the report.

3.4.1 Phase 1 – Research The research phase has been underway for some time and in all likelihood will proceed in parallel with the development phase. The intent is that research findings can be incorporated into the prototype at any point in the project. In terms of the options analysis, the research phase consists of the following evaluation factors: algorithms and usability.

3.4.1.1 Algorithms Given that the research phase of the project is ongoing no specific algorithms have been selected for use. Furthermore, a combination of algorithms may be required to achieve optimum results. For example, clustering algorithms may be needed in order to determine the closest relevant topic whereas classification (learning) algorithms may be required to determine the level of classification of the data.

Attempting to evaluate products based on the algorithms that they support is quite subjective. Rather than simply looking at the number of algorithms supported it was decided to base the evaluation on the respective product’s ability to support key algorithms. Consequently, to aid in assigning a numeric score to this evaluation factor a survey paper entitled Top 10 Algorithms in Data Mining [Reference 10] was used as the basis of the assessment. The paper lists the ten most influential data mining algorithms in the research community. The ten algorithms cover classification, clustering, statistical analysis and link mining.


3.4.1.1.1 Apache Mahout

The problem with the Apache Mahout algorithms is stated in the first sentence of the Apache Mahout Algorithms wiki page: "This section contains links to information, examples, use cases, etc. for the various algorithms we intend to implement."10 Of the approximately twenty algorithms listed, ten of which are classification algorithms, only a subset have actually been implemented. Furthermore, many of those implemented made it to a first draft, at which point further development seems to have ceased. Classification algorithms include the following:

Bayesian Classifiers 11 - A first draft of a NaiveBayes implementation (MAHOUT-9) was completed. However, no additional progress has been made since June 2008. Similarly, a first draft of a Complementary Naïve Bayes classifier (MAHOUT-60) was developed but no work has followed since August 2008;

Logistic Regression 12 - A Mahout implementation of Logistic Regression using Stochastic Gradient Descent (SGD) has been undertaken. However, there does not seem to be a downloadable implementation at this point in time;

Neural Network 13 - Although the development team is looking to leverage MAHOUT-228 to aid in the development of this classification algorithm, little progress has been made thus far;

Perceptron and Winnow – At this point in time neither algorithm seems to have been implemented in Apache Mahout;

Random Forests – There are three implementations of random forest algorithms in Apache Mahout: Random Forests Reference Implementation (MAHOUT-122), In-memory mapreduce Random Forests (MAHOUT-140) and PartialData mapreduce Random Forests (MAHOUT-145);

Restricted Boltzmann Machines – An implementation of Restricted Boltzmann Machines (MAHOUT-375) was developed for Apache Mahout as part of the Google Summer of Code 2010 project; and

Support Vector Machines (SVMs) – There are two implementations of SVMs in Apache Mahout: Implementation of sequential SVM solver based on Pegasos (MAHOUT-232) and Linear SVM for Mahout (MAHOUT-334).

The Mahout team has referenced Map-Reduce for Machine Learning on Multicore [Reference 9] as a roadmap of sorts for some of the algorithms that they have yet to implement.
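To make the classifier family discussed above concrete, the following is a toy multinomial Naive Bayes text classifier with Laplace smoothing. It is not Mahout's implementation (nor the Complementary variant from Reference 11); the training documents and labels are invented purely for illustration.

```python
# Hedged sketch: a minimal multinomial Naive Bayes text classifier,
# included only to illustrate the algorithm family; not the Mahout code.
import math
from collections import Counter, defaultdict

def train_nb(labelled_docs):
    """labelled_docs: list of (label, [tokens]). Returns a model tuple."""
    class_docs = defaultdict(int)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labelled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab, len(labelled_docs)

def predict_nb(model, tokens):
    class_docs, word_counts, vocab, n = model
    best, best_lp = None, float("-inf")
    for label in class_docs:
        lp = math.log(class_docs[label] / n)           # class prior
        total = sum(word_counts[label].values())
        for t in tokens:
            # Laplace smoothing over the shared vocabulary
            lp += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [("SECRET", ["troop", "movement", "plan"]),
        ("UNCLASSIFIED", ["cafeteria", "menu", "plan"])]
model = train_nb(docs)
print(predict_nb(model, ["troop", "plan"]))
```

A production classifier would be trained on a large labelled corpus and would normally weigh many more features, but the per-class log-probability accumulation shown here is the core of the technique.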

10 https://cwiki.apache.org/MAHOUT/algorithms.html
11 The implementation of this algorithm is based on the paper Tackling the Poor Assumptions of Naïve Bayes Text Classifiers [Reference 11].
12 The implementation of this algorithm is based on the paper Logistic Regression for Data Mining and High-Dimensional Classification [Reference 12].
13 The implementation of this algorithm is based on the paper Map-Reduce for Machine Learning on Multicore [Reference 9].


3.4.1.1.2 RapidMiner

RapidMiner has an extensive number of algorithms to choose from. Specifically, it supports 113 algorithms in the following categories:

Classification and Regression (52);
Attribute Weighting (21);
Clustering and Segmentation (11);
Association and Item Set Mining (5);
Correlation and Dependency Computation (8);
Similarity Computation (4); and
Model Application (12).

In terms of Classification and Regression algorithms, RapidMiner supports the following: Lazy Modeling (2); Bayesian Modeling (2); Tree Induction (8); Rule Induction (5); Neural Net Training (2); Function Fitting (7); Logistic Regression (2); Support Vector Modeling (7); Discriminant Analysis (3); and Meta Modeling (14).

3.4.1.1.3 Results

RapidMiner supports significantly more algorithms, including classification algorithms, than Apache Mahout does. Furthermore, RapidMiner's algorithm support is fairly clear-cut (either supported or not supported), whereas Apache Mahout's collection includes many algorithms that are only partially implemented. From Table 2 we can see that Apache Mahout supports four of the top ten algorithms, whereas RapidMiner supports eight. These counts form the basis of the algorithm scores assigned to the respective products.

Evaluation Factor                 Type of Algorithm   Apache Mahout   RapidMiner
1. C4.5                           Classification      0 (14)          1
2. k-means                        Clustering          1               1
3. SVMs                           Classification      1               1
4. Apriori                        Association         0 (15)          1
5. Expectation-Maximization (EM)  Clustering          0 (16)          1
6. PageRank                       Link mining         0               0
7. AdaBoost                       Classification      0               1
8. k-nearest Neighbor (kNN)       Classification      1               1
9. Naïve Bayes                    Classification      1               1
10. CART                          Classification      0               0
TOTAL                                                 4/10            8/10

Table 2 - Algorithm Support

14 While Mahout supports an implementation of Random Decision Forests in which you can plug in a variety of tree building algorithms, there is currently no implementation of C4.5 in Mahout.
15 The source code for the Apriori algorithm was apparently completed (MAHOUT-108) but the contributor seems to have failed to post it. However, this effort has apparently been superseded by FP-Growth (MAHOUT-157).
16 MAHOUT-4 and MAHOUT-28 were attempts to implement EM in Mahout. However, neither effort seems to have been successful.

3.4.1.2 Usability

3.4.1.2.1 Apache Mahout

Apache Mahout differs from a traditional product in that it does not include a GUI through which its functionality can be easily accessed. Rather, it is a developer framework in which users need to include other packages and write applications capable of leveraging the appropriate Java classes. Specifically, Apache Mahout requires Java, an Integrated Development Environment (IDE; e.g., Eclipse), a package manager (e.g., Maven, included in some IDEs) and Hadoop.

3.4.1.2.2 RapidMiner

In contrast, RapidMiner is almost completely GUI-based, although there is a command line (batch) mode for automated large-scale applications, as well as Java classes. The GUI, which is extremely intuitive, provides access to the complete range of functionality provided by the product. RapidMiner is extremely easy to install and get working. Furthermore, software updates are installed automatically.

3.4.1.2.3 Results

Apache Mahout does not seem to be targeted at the casual user; it is currently aimed at mathematicians and programmers who are not afraid to get their hands dirty. In contrast, RapidMiner seems to have been developed to be as usable as possible. That the latest version of RapidMiner included a completely revised GUI demonstrates the commitment that Rapid-I has towards usability.

Evaluation Factor   Apache Mahout   RapidMiner
Usability           2/5             4/5

3.4.2 Phase 2 - Development

The development phase of PLATYPUS will see the development of the prototype in a virtual environment. In terms of the options analysis, the development phase consists of the following evaluation factors: integration and support.


3.4.2.1 Integration

3.4.2.1.1 Apache Mahout

Apache Mahout provides a single deployment option: its Java API. Fortunately, this is likely the preferred approach for integration with PLATYPUS. The Apache Mahout Java API is fully documented at https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/. There is a section on the Mahout wiki under Installation/Setup entitled "Integrating Mahout into an Application"; somewhat ironically, however, the web page that this links to is empty. Apache Mahout support for input formats is a bit confusing. It seems to support only a specifically formatted file type. However, Apache Mahout in Action [Reference 3] refers to org.apache.mahout.classifier.bayes.XMLInputFormat for reading an XML file in a MapReduce17 operation, yet the XMLInputFormat class seems to be absent from the published Apache Mahout Java API.
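Given the uncertainty around XMLInputFormat, one possible workaround is to flatten XML documents to plain text outside of Mahout before training or classification. The sketch below uses only the Python standard library; the element names and sample document are hypothetical.

```python
# Hedged sketch: pre-converting an XML document to plain text so that a
# text-only classifier input format can be used. Element names are invented.
import xml.etree.ElementTree as ET

def xml_to_text(xml_string):
    """Concatenate all text content of an XML document into one string."""
    root = ET.fromstring(xml_string)
    return " ".join(t.strip() for t in root.itertext() if t.strip())

sample = "<document><title>Exercise Plan</title><body>Move at dawn.</body></document>"
print(xml_to_text(sample))
```

This keeps the classifier's input pipeline independent of whichever XML input classes the product does or does not ship.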

3.4.2.1.2 RapidMiner

RapidMiner provides a number of deployment options. Specifically, it can be deployed as a stand-alone application accessible through the GUI, or it can be invoked through either the command line or the Java API without using the GUI. While research and configuration may be facilitated by using the GUI, integration with PLATYPUS would likely leverage the Java API. The RapidMiner Java API is fully documented at http://rapid-i.com/api/rapidminer-5.1/index.html. A tutorial on integrating RapidMiner into an application can be found at http://rapid-i.com/wiki/index.php?title=Integrating_RapidMiner_into_your_application. RapidMiner also supports a variety of data sources including Excel, Access, Oracle, IBM DB2, Microsoft SQL, Sybase, Ingres, MySQL, Postgres, SPSS, dBase and text files (Comma-Separated Values (CSV) files). It also apparently supports text documents and web pages in ASCII, PDF and HTML. Lastly, RapidMiner uses an internal XML representation in order to support a standardized interchange format.
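For example, extracted document text could be staged as a CSV file, one of the tabular formats RapidMiner can import. The column names and rows below are hypothetical, chosen only to show the shape of such a dataset.

```python
# Hedged sketch: writing extracted document text and labels to CSV, a
# format RapidMiner can import. Column names and data are invented.
import csv
import io

rows = [("doc1.txt", "troop movement plan", "SECRET"),
        ("doc2.txt", "cafeteria menu plan", "UNCLASSIFIED")]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["filename", "text", "label"])   # header row
writer.writerows(rows)
print(buf.getvalue())
```

In practice the buffer would be a file on disk (or a database table, which RapidMiner also reads) rather than an in-memory string.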

3.4.2.1.3 Results

Integrating either product into PLATYPUS should be fairly straightforward, as both solutions provide a fairly well documented Java API that can be leveraged. However, support for XML, which is the preferred import format for the prototype, is somewhat unclear for both products. Where RapidMiner beats Apache Mahout is in terms of deployment options, integration documentation and the internal XML representation that facilitates exchange between research and development efforts.

Evaluation Factor   Apache Mahout   RapidMiner
Integration         6/10            8/10

17 MapReduce refers to a Google software framework for performing distributed computing on large datasets using system clusters.


3.4.2.2 Support

Support is extremely important to the development of a prototype. Lack of available support can often limit what can be accomplished with a product, and limited support can sometimes prevent use of a product entirely.

3.4.2.2.1 Apache Mahout

Support for Apache Mahout is primarily confined to the Apache Mahout wiki (https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki), which is maintained by a team of volunteer developers. This team consists of about fourteen people directly involved in the project. In addition to the wiki, there are some Apache Mahout presentations given at ApacheCon conferences. There is also a recent book, Apache Mahout in Action [Reference 3], published by Manning Publications.

There do not seem to be any courses, webinars or training provided for Apache Mahout. Furthermore, professional services seem to be limited to those provided by a few individuals working directly on the project. Lastly, Apache Mahout documentation can best be described as anemic, consisting only of a Javadoc listing the packages and classes.

IBM, which includes Apache Mahout in some of their products, has some information on Apache Mahout. It can be found at http://www.ibm.com/developerworks/java/library/j-mahout/.

3.4.2.2.2 RapidMiner

Support for RapidMiner is primarily located at the Rapid-I website (http://rapid-i.com/), which is maintained by a private company. However, some additional information can be found at the RapidMiner SourceForge website (http://sourceforge.net/projects/rapidminer/). In terms of support, the following resources are available for RapidMiner:

Blog – Rapid-I hosts a RapidMiner blog that seems to get updated every few days. The blog discusses various aspects of data mining and provides examples of how to better use the product as well;

Conferences - Because Rapid-I is based in Dortmund, Germany, most of the conferences that it participates in are located in Europe. The most recent conference was the RapidMiner Community Meeting and Conference (RCOMM 2010). The conference proceedings containing the papers presented at RCOMM 2010 are available on the Rapid-I website;

Courses – The Rapid-I website has an extensive number of courses available for purchase. These include introductory courses, advanced courses, Text/Web Mining courses, Sentiment/Opinion Analysis courses, Time Series Analysis and Forecasting courses, and Applications of Data Mining courses;

Documentation – The RapidMiner documentation consists of an Installation Guide, a User Manual and a Javadoc. The online installation guide is pretty basic but, given the simplicity with which RapidMiner installs, that is all that is required. The RapidMiner user manual is a 110-page PDF that provides a high-level overview of most of the capabilities of RapidMiner including fundamental terms, design, analysis processes, display, and the repository. The Javadoc lists the packages and classes that constitute RapidMiner 5.0;

Newsletter – RapidMiner users can subscribe to a newsletter published by Rapid-I;

User Forum – The RapidMiner user forum consists of five fairly active discussion groups containing hundreds of posts each. The discussion groups are as follows: Getting Started, Data Mining/ETL/BI Processes, Problems and Support, Feature Requests and Development;

Videos – The Rapid-I website has a number of video tutorials on how to accomplish various tasks within RapidMiner; and

Webinars – The Rapid-I website has a number of webinars available. While most are available for purchase, there is a freely available webinar entitled "Short Introduction into Data Mining with RapidMiner". The other webinars fall into the introductory, advanced and Applications of Data Mining categories.

3.4.2.2.3 Results

RapidMiner combines the best of both worlds: an open source product backed by a commercial company. Similar open source initiatives, CentOS and OpenAM, are backed by commercial companies (RedHat and ForgeRock respectively). This approach has proven quite effective in the past.

Purely open source initiatives sometimes lack the structure and discipline of commercially backed open source. Consequently, roadmaps are not strictly adhered to and release dates tend to be somewhat nebulous. In addition, the quality control of some components of the solution is not always up to the same level. At this point in time Apache Mahout seems to fall into this category.

Evaluation Factor   Apache Mahout   RapidMiner
Support             3/10            8/10

3.4.3 Phase 3 - Operations

The operations phase of PLATYPUS will commence when the solution is deployed to an operational environment. In terms of the options analysis, the operations phase consists of the following evaluation factors: licensing, maturity and scalability.

3.4.3.1 Licensing


3.4.3.1.1 Apache Mahout

Apache Mahout falls under the Apache License, Version 2.0, the details of which can be found at http://www.apache.org/licenses/LICENSE-2.0.html. The Apache license is completely open. The code can be redistributed in original or modified form, and there is no obligation to disclose the source code. All that is required is that a copy of the license be provided and that the copyright, patent, trademark and attribution notices from the originating file be retained.

3.4.3.1.2 RapidMiner

There are two versions of RapidMiner: an enterprise edition and an open source edition. The enterprise edition has a closed source (commercial) license. For the purposes of licensing we will be considering the open source edition, which is governed by the Affero General Public License (AGPL). Details of the AGPL can be found at http://www.gnu.org/licenses/agpl.html. The AGPL differs from the GNU General Public License (GPL) in that it includes provisions for using the software over a computer network. This provision requires that the complete source code be made available to any network user.

3.4.3.1.3 Results

The key difference between the two open source licenses boils down to whether users are obligated to disclose the source code. The AGPL includes additional provisions requiring that the source code be made available, whereas the Apache 2.0 license includes no such provisions. Aside from that variation, both open source licenses encourage the use and redistribution of the open source solutions.

Evaluation Factor   Apache Mahout   RapidMiner
Licensing           5/5             4/5

3.4.3.2 Maturity

3.4.3.2.1 Apache Mahout

Apache Mahout, which began as a subproject of Apache Lucene, originated in January 2008. Version 0.1 of Apache Mahout was released in April 2009, version 0.2 in November 2009, version 0.3 in March 2010 and version 0.4 in October 2010. Based on its release schedule thus far, it looks as if the developers of Apache Mahout are targeting an incremental version every six months. Apache Mahout is now an Apache project in its own right, maintained by a team of volunteer developers. The core team consists of fourteen committers, of which eleven seem to regularly contribute to the project.

According to the open source network Ohloh (www.ohloh.net), over the last twelve months Apache Mahout has seen a substantial increase in activity. This is probably a good sign that interest in this project is rising, and that the open source community has embraced this project.18

3.4.3.2.2 RapidMiner

RapidMiner originated as YALE in the Artificial Intelligence Unit of the University of Dortmund in 2001. Most recently, version 4.6 was released in October 2009, while version 5.0 was released at the end of 2009. Since 2001 there have been more than 500,000 downloads of RapidMiner.

According to Ohloh, the first lines of source code were added to RapidMiner (YALE): Java Data Mining in 2002. This is a relatively long time for an open source project to stay active, and can be a very good sign. A long source control history like this one shows that the project has enough merit to hold contributors' interest for a long time. It might indicate a mature and relatively bug-free code base, and can be a sign of an organized, dedicated development team.19

3.4.3.2.3 Results

There is little doubt that RapidMiner is a much more mature offering than Apache Mahout, due primarily to the date of its inception and the current release number. Furthermore, a KDnuggets poll of 912 voters in May 2010 asked which data mining/analytic tools respondents had used in the past 12 months in a real project (not just evaluation). RapidMiner received the most votes of any product: 345, or 37.8% of the votes. Apache Mahout did not receive a single vote.

Evaluation Factor   Apache Mahout   RapidMiner
Maturity            2/5             5/5

3.4.3.3 Scalability

3.4.3.3.1 Apache Mahout

One of the primary design goals of Apache Mahout is scalability. Apache Mahout has been designed from the beginning to be scalable to datasets consisting of millions of items. It accomplishes this through the use of Hadoop, an open-source Java-based implementation of the MapReduce distributed computing framework popularized and used internally at Google. Hadoop facilitates independent processing so that work can be conducted in parallel across a number of systems. Instructions for running Mahout on Amazon Elastic Compute Cloud (EC2) using a Hadoop cluster can be found at https://cwiki.apache.org/MAHOUT/mahout-on-amazon-ec2.html.
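The MapReduce pattern that Hadoop distributes across a cluster can be illustrated with a single-process simulation of the classic word-count example. This is only a sketch of the programming model, not of Hadoop itself.

```python
# Hedged sketch: a single-process simulation of the map / shuffle / reduce
# phases that Hadoop runs in parallel across cluster nodes.
from collections import defaultdict

def map_phase(records):
    """Map: emit (word, 1) for every word in every input record."""
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
print(counts)
```

Because each map call and each reduce call is independent, the framework can spread them across machines, which is the property Mahout exploits for large datasets.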

3.4.3.3.2 RapidMiner

18 http://www.ohloh.net/p/mahout/factoids/3096941
19 http://www.ohloh.net/p/5510


According to Rapid-I, RapidMiner has been used with datasets consisting of “several hundreds of millions of tuples” with the caveat that “not every operator/every process can be applied on such large datasets” and that it “might still fit in memory (at least on a 16 Gb machine) but it is probably better to work on the database as long as possible”. In version 5 of the product, RapidMiner has included new extensions for parallel processing. In addition, the Enterprise Edition of the product is capable of using multicore machines and provides several parallelized algorithms.

3.4.3.3.3 Results

While both products seem to support extremely large datasets and some level of parallelization, Apache Mahout has been designed from the beginning with scalability in mind.

Evaluation Factor   Apache Mahout   RapidMiner
Scalability         5/5             3/5

3.5 Results

The results of the Options Analysis can be seen in Table 3. RapidMiner was the clear winner of the Options Analysis, largely due to the maturity of the product.

Apache Mahout is in the early stages of development, and this is reflected in such things as its usability, support and algorithm coverage. In its current incarnation, Apache Mahout is a research tool being developed and, to a lesser extent, used by mathematicians. However, over time Apache Mahout will likely develop into a full-featured product with a much wider user base.

RapidMiner is a mature product offering that has been developed over the past decade and is used in many real-world applications. Consequently, it includes much of the functionality one would expect in a mature product offering. In addition, RapidMiner is well supported due to the presence of a commercial company behind the open source product.

Based on the results of the Options Analysis, RapidMiner will be used as the data classification product in PLATYPUS.

Phase                   Evaluation Factor   Apache Mahout   RapidMiner
Phase 1 – Research      Algorithms          4/10            8/10
                        Usability           2/5             4/5
Phase 2 – Development   Integration         6/10            8/10
                        Support             3/10            8/10
Phase 3 – Operations    Licensing           5/5             4/5
                        Maturity            2/5             5/5
                        Scalability         5/5             3/5
TOTAL                                       27/50           40/50

Table 3 - Options Analysis Results
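The totals in Table 3 can be recomputed directly from the per-factor scores reported in the preceding subsections, as the short sketch below shows.

```python
# Sketch: recomputing the Table 3 totals from the per-factor scores
# (Apache Mahout score, RapidMiner score) reported in the options analysis.
scores = {
    "Algorithms":  (4, 8),
    "Usability":   (2, 4),
    "Integration": (6, 8),
    "Support":     (3, 8),
    "Licensing":   (5, 4),
    "Maturity":    (2, 5),
    "Scalability": (5, 3),
}
mahout_total = sum(m for m, r in scores.values())
rapidminer_total = sum(r for m, r in scores.values())
print(mahout_total, rapidminer_total)
```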




4.0 Prototype Design

4.1 Overview

This section describes the prototype design. It takes the logical design detailed in Section 2 and provides a design from which the prototype can be developed. Specifically, this section addresses the following aspects of the prototype design:

Scope;
Architecture;
External Labelling Service;
Orchestration Service;
Data Manipulation Service;
Classification Service; and
Label Creation Service.

Note – Although some products were installed in a virtualized environment for preliminary testing, this was primarily a paper-based effort.

4.2 Scope

It is hoped that at some point a complete automatic labelling capability will be developed and deployed within the operational environment. However, it is anticipated that this will be an iterative process, of which this research/design document and the proposed prototype are important first steps. Consequently, while the prototype must be able to demonstrate key capabilities, it must also be achievable with moderate resources. As a result, the prototype must be appropriately scoped as follows:

Supported File Types – For the prototype, file types will be limited to Microsoft Word documents (both binary and Office XML formats), OpenOffice.org Writer documents, PDFs and text files. The supported file types must be consistent with the file types that are supported by the Data Conversion service. All file types will ultimately be converted to text files for processing by the Data Classifier service;

Supported Digital Signatures – For the prototype, all files can be digitally signed using an XML-based digital signature. It is worth noting that XML files can be digitally signed using either an enveloped, enveloping or detached XML digital signature, whereas non-XML data must be digitally signed using a detached XML digital signature;

User/System/Data Context – For the prototype, all three types of context information will be used. User and system context information will be retrieved from a Lightweight Directory Access Protocol (LDAP) directory, whereas data context information will consist of extracted metadata. More advanced types of context information, such as that from a contextual user profile20, are outside the scope of this prototype; and

Error Checking – The prototype will support rudimentary error checking. If a web service encounters an error (e.g., invalid digital signature, invalid data type) that prevents it from performing its task it will return an error code to the Orchestration Service which will write the information to a file that will be stored on the Samba file server.
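The detached XML digital signatures mentioned in the scope above carry, among other things, a digest of the signed data. The sketch below shows only that digest step for a non-XML file; a real XMLDSig signature additionally requires canonicalization, a SignedInfo block and a key-based signature value, none of which are shown here.

```python
# Hedged sketch: computing the base64-encoded SHA-1 digest that a detached
# XML signature's Reference would carry for a non-XML file. This is only
# the digest step, not a complete (or secure) XMLDSig implementation.
import base64
import hashlib

def reference_digest(data: bytes) -> str:
    """Base64-encoded SHA-1 digest, as carried in a DigestValue element."""
    return base64.b64encode(hashlib.sha1(data).digest()).decode("ascii")

print(reference_digest(b"example file contents"))
```

Verification on the PLATYPUS side would recompute this digest over the received file and compare it with the value in the signature before trusting the submission.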

Note – Securing a Production System

Completely securing PLATYPUS is outside of the scope of the prototype. However, PLATYPUS is designed in such a way as to facilitate security when and if it is deployed operationally. Furthermore, PLATYPUS was designed with the intention that it would operate in a system high environment for the unlabelled data use case. Consequently, if it is to be deployed in a multi-level environment then modifications would need to be made. The following steps should be taken to secure PLATYPUS for deployment in an operational environment:

Secure External Interfaces – PLATYPUS was designed so that there is only a single externally accessible interface. Consequently, this interface should be protected and access to this interface carefully controlled. Both a network firewall and an application firewall should be used to protect this interface. Likewise, if any other PLATYPUS interfaces are made externally available then they too should be protected in a similar manner;

Secure External Communications – All external communications to and from PLATYPUS should be appropriately secured. If there are intermediary systems the use of persistent security in the form of XML digital signatures and encryption should be considered;

Harden Systems – All PLATYPUS systems should be appropriately hardened. This involves removing unnecessary services, disabling unnecessary accounts and appropriately configuring the system firewall. Given that all five PLATYPUS services are hosted on a common platform, this should simplify this process;

Authentication of Users – PLATYPUS requires the authentication of users (and services) in order to access its functionality. Obviously, this should be employed in an operational environment as well. Furthermore, effort should be made to strengthen the authentication required for privileged access to PLATYPUS components;

Validation of Data – PLATYPUS requires that data submitted for labelling be digitally signed by a valid user or service. This helps to prevent attackers from submitting malicious code that could be used to target PLATYPUS; and

20 The contextual user profile is described in Annex A.


Threat Detection – PLATYPUS includes a threat detection capability to ensure that data does not contain malicious code and has not been engineered to cause the system harm. However, PLATYPUS has not been engineered to be resistant to phishing attacks that purport to submit valid data for labelling in order to ascertain its sensitivity.

4.3 Architecture

This section describes the architecture for the prototype. Specifically, it provides three different views of the prototype architecture. The first, illustrated in Figure 9, is a high-level architecture detailing some of the key components. The second, illustrated in Figure 10, is a more detailed architecture showing the five core services, the sub-components that comprise those services and all interactions between components. This view of the architecture will be used throughout this section of the report to describe the individual services and their components. The third, illustrated in Figure 11, is a virtualized architecture depicting how the prototype will be deployed within the DRDC lab environment.

The PLATYPUS architecture has the following characteristics:

Service Oriented Architecture (SOA) – PLATYPUS is an SOA consisting of largely independent web services. None of the web services contains any business logic, as this is handled by a separate service: the Orchestration Service. As a result, these web services could be made accessible to other applications within the organization if so desired;

Open Source – In order to minimize the cost and level of effort required to prototype PLATYPUS, it was decided to leverage open source software as much as possible while minimizing the use of Commercial-Off-The-Shelf (COTS) software and custom development;

Python & Java – Where possible it was decided to use, and develop, code in Java or Python due to their relative ease of use and the availability of suitable libraries. While the use of Python was preferred for its ease of use and readability, it is not always possible to find the necessary libraries in Python. Furthermore, certain aspects of the project, notably web services and interfacing with Java-based applications, lend themselves to the use of Java. It should be noted that Python 2.7 is recommended for backwards compatibility purposes; version 3.x does not allow the use of existing modules, which could prove problematic given the reliance of the project on existing Python code;21

Standards - Every effort was made to ensure that PLATYPUS is completely standards based. This includes the protocols used for communication (HTTP, SOAP, Server Message Block (SMB)/Common Internet File System (CIFS)), web services (XML, WSDL, XMLDSig), and policies (XACML, BPEL, XML SPIF);

21 While the intent is to use Python code to interface with Python applications/libraries and Java code to interface with Java applications/libraries, there may be a requirement to access Python applications from Java and vice versa. If this is necessary then either Jython (http://www.jython.org/) or Jepp (http://jepp.sourceforge.net/) can be used to access Python from Java, and either JPype (http://jpype.sourceforge.net/) or Jcc (http://lucene.apache.org/pylucene/jcc/) can be used to access Java from Python.


Common Platform – All five systems hosting PLATYPUS services have the same base configuration consisting of Ubuntu Linux (version 10.10), a Java Runtime Engine (JRE) and the Apache Tomcat application server. The other two systems, the Client and the Support Services Server, are Windows-based (Windows XP and Windows 2003 respectively) in order to approximate the DND client-server environment. The Support Services Server will consist of the necessary services to support the prototype. These will include Microsoft Active Directory, Microsoft Internet Information Server (IIS), Entrust Authority Security Manager and Entrust Authority Enrollment Server for Web; and

Virtualization – The prototype will be built using Virtual Machines (VMs). These VMs will ultimately be deployed in the DRDC lab. It is anticipated that the seven VMs can be built on a single VMware ESXi system.22

Figure 9 - High-Level Architecture

22 VMware ESXi is a bare-metal hypervisor, meaning that the hypervisor runs directly on the hardware. Since there is no requirement for a host operating system, performance is improved. This approach should allow PLATYPUS to run on a single system, thus simplifying prototyping. ESXi runs on a number of hardware platforms. Given the relatively large number of VMs that will be concurrently running, it is recommended that the system have a minimum of 8 GB of Random Access Memory (RAM). To determine whether a system is compatible with VMware ESXi, it is recommended that DRDC check the VMware Compatibility Guide at http://www.vmware.com/resources/compatibility/search.php.


Figure 10 - Detailed Architecture

Figure 11 - Virtualized Architecture

4.4 External Labelling Service


The External Labelling Service, illustrated in Figure 12, is the service through which clients (users, applications or services) interact in order to submit data to be labelled. Specifically, it will consist of the following components:

External Labelling Service (ELS) Web Service;
Samba File Server;
Validation Application; and
System & User Context Application.

Figure 12 - External Labelling Service

4.4.1 ELS Web Service The ELS Web Service is a web service interface through which data can be submitted to PLATYPUS for labelling. Once data has been received by the ELS Web Service it is moved to the “To be Labelled File Share”, where it is treated exactly like data received directly through the Samba interface.

Note – HTTPS

For the purpose of the prototype, SOAP with attachments over HTTP will be supported in order to access this interface. In an operational environment, HTTPS should be used. Furthermore, a WSDL will be used to fully describe how to leverage this service. This comment is applicable to all PLATYPUS web services.
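As a rough sketch of what a submission to this interface could look like, the following Python fragment builds a SOAP envelope that carries the file as a base64-encoded element (a simplification of SOAP with attachments; element names such as SubmitData are hypothetical and would in practice be dictated by the ELS WSDL):

```python
import base64
from xml.etree import ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace; the real one would come from the ELS WSDL
ELS_NS = "urn:platypus:els"

def build_label_request(filename, payload):
    """Build a minimal SOAP request carrying a file as a base64 element.

    This inlines the data rather than using a true SwA MIME multipart,
    which is sufficient to illustrate the shape of the message.
    """
    envelope = ET.Element("{%s}Envelope" % SOAP_NS)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_NS)
    request = ET.SubElement(body, "{%s}SubmitData" % ELS_NS)
    name = ET.SubElement(request, "{%s}FileName" % ELS_NS)
    name.text = filename
    data = ET.SubElement(request, "{%s}FileData" % ELS_NS)
    data.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(envelope)

message = build_label_request("report.doc", b"file contents")
```

The resulting message would then be POSTed to the ELS endpoint (over HTTPS in an operational deployment).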


4.4.2 Samba File Server Samba23 will be installed on the External Labelling Service system and used primarily for two purposes: 1) as a means for clients to submit data to be labelled; and 2) as a means for PLATYPUS to return labelled data to the client.24 Windows users will use their Active Directory credentials to authenticate to the Samba file server.25 SMB/CIFS is the protocol used between the user workstation and the Samba file server. In an operational environment this link would be secured using Secure Sockets Layer (SSL).

The Samba file server will have three different shared directories serving three different purposes. The shared directories are as follows:

To be Labelled File Share – Having authenticated, Windows users will copy data to be labelled to the external “To be Labelled File Share”. Files can be copied to this share individually or many at a time. Once the data has been processed by the ELS it will be moved to the “To be Processed File Share” and deleted from the “To be Labelled File Share”;

To be Processed File Share – The “To be Processed File Share” is an internal file share only available to PLATYPUS. Once the data has been processed by ELS, it will automatically be moved to this file share for processing by the Orchestration Service; and

Labelled File Share – The “Labelled File Share” is an internal/external file share accessible by both the Orchestration Service and Windows clients. Once data has been successfully labelled it is stored in this file share where it can then be accessed by the user or web client that submitted it for labelling.

The Samba file server will be configured to audit the identity of Windows users who move files to the “To be Labelled File Share”. This identity information will be used as user context information by PLATYPUS. To configure this level of auditing the Samba audit module will need to be installed.
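As an illustrative smb.conf fragment (share name and path are placeholders, not the prototype's actual configuration), the full_audit VFS module can be enabled on the external share to log which user wrote each file:

```
[ToBeLabelled]
    path = /srv/platypus/tobelabelled
    writeable = yes
    ; log successful writes/renames, tagged with user name and client address
    vfs objects = full_audit
    full_audit:prefix = %u|%I
    full_audit:success = write rename
    full_audit:failure = none
    full_audit:facility = LOCAL5
    full_audit:priority = NOTICE
```

The audit records written to syslog would then supply the user identity consumed by the System & User Context Application.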

4.4.3 Validation Application Regardless of whether the data arrived via the web service or the Samba file server, it may include a digital signature which will need to be validated. The digital signature was included by either the user or the web service client. The Validation Application is responsible for monitoring the “To be Labelled File Share” for new data and validating the digital signature. Data whose digital signature validates successfully is passed on to the System & User Context Module, and the copy of the data on the “To be Labelled File Share” is deleted. If a digital signature does not validate successfully, an error text file outlining the failed validation is written to the Labelled File Share.

23 Additional information on Samba can be found at http://www.samba.org/.
24 The Samba file server will be used to return all labelled data regardless of whether it was submitted through the Samba or web service interface. This is being done to simplify the prototype architecture.
25 Instructions for integrating Samba with Active Directory can be found at http://wiki.samba.org/index.php/Samba_&_Active_Directory.


The Python module pyxmldsig26 will be used to validate the digital signature on the data. The following code will need to be included in the ELS Validation Application in order to invoke the pyxmldsig module:

import pyxmldsig

# Load the CA certificate and the module's own certificate (PEM format)
xdsig2 = pyxmldsig.Xmldsig()
xdsig2.load_certs(['cacert.pem', 'myx509cert.pem'])

# verify_xmlstring() returns True when the signature validates
assert xdsig2.verify_xmlstring(signed_xml1) == True
assert xdsig2.verify_xmlstring(signed_xml2) == True

As can be seen from the code, the ELS Validation Module requires a copy of the Certification Authority (CA) certificate and its own certificate27, both in Privacy Enhanced Mail (PEM) format.

4.4.4 System & User Context Application This component will be used to extract the system and user context. For the purpose of the prototype, the identity of the system and/or user will be obtained from the authentication and validation processes. The System & User Context Application will use either the user's or the web service client's identity information to retrieve system or user context information from the Microsoft Active Directory on the Support Services system. This context information will be written to an XML-based context file and stored with the data in the “To be Processed File Share”.
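As a sketch of the last step, the following Python fragment writes retrieved context attributes to an XML context file (the element names are illustrative only, not a defined schema):

```python
from xml.etree import ElementTree as ET

def write_context_file(identity, attributes, path):
    """Write an XML-based context file for a piece of submitted data.

    identity   -- user or system identity from authentication/validation
    attributes -- dict of context values retrieved from Active Directory
    """
    root = ET.Element("context")
    ET.SubElement(root, "identity").text = identity
    for name, value in sorted(attributes.items()):
        ET.SubElement(root, "attribute", name=name).text = value
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```

The resulting file would be stored alongside the data on the “To be Processed File Share” for the Orchestration Service to pick up.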

4.5 Orchestration Service The Orchestration Service, which is illustrated in Figure 13, is the glue that connects all of the other PLATYPUS services. Specifically, it provides the business logic that has been purposely removed from all of the other independent services. This section will examine the following aspects of the Orchestration Service:

Logical Flow;
Samba Client; and
Apache Orchestration Director Engine (ODE).

26 Additional information on the pyxmldsig module can be found at http://www.decalage.info/python/pyxmldsig. If there are any problems leveraging pyxmldsig within APSLP, there is an alternative (xmldsig) available at https://github.com/andrewdyates/xmldsig. If for some reason it is preferable to validate the digital signature using a Java API, then the Java XML Digital Signature API can be leveraged. It is available at http://java.sun.com/developer/technicalArticles/xml/dig_signature_api/.
27 Given that the Validation Application will not be digitally signing anything, there is no requirement for its own certificate other than the fact that when a keystore is created the certificate is one of the components that will typically be included.


Figure 13 - Orchestration Service

4.5.1 Logical Flow The logical flow, illustrated in Figure 14, will be used to define the process for the Orchestration Service. It consists of the following steps:

1) The client initiates the security labelling process either by sending the data to be labelled to the ELS Web Service or by copying the data to the To be Labelled File Share on the External Labelling Service system;

2) The client is authenticated by the ELS Web Service or by the Samba file server. If the authentication process fails then the security labelling attempt also fails;28

3) The Validation Application attempts to validate the digital signature on the data. If the digital signature validation process fails then the security labelling attempt also fails;

4) The System & User Context Application extracts the client and/or system identity information from the authentication and validation processes. It also communicates with the central identity store in order to retrieve additional system and/or user context information;

28 For the Samba file server, the authentication step (#2) will actually take place prior to sending the data to be labelled (#1). However, since the two flows have been combined, it has been listed in this order.


5) The Samba client on the Orchestration Service system, which periodically polls the To be Processed File Share, detects new data (and context information) and submits it to the Orchestration Service;

6) The Orchestration Service initiates a new labelling process for this data;

7) The Orchestration Service sends the data to the Threat Detection Web Service. The Threat Detection Web Service scans the data for malicious code and sends the results back to the Orchestration Service. If malicious code is detected then the security labelling attempt fails;

8) The Orchestration Service sends the data to the Data Identification Web Service. The Data Identification Web Service determines the file type and sends the results back to the Orchestration Service. If the file type cannot be determined or is unsupported then the Data Identification process fails and the security labelling attempt also fails;

9) The Orchestration Service sends the data to the Data Context Web Service. The Data Context Web Service extracts any metadata and returns it to the Orchestration Service;

10) The Orchestration Service sends the data to the Data Conversion Web Service along with instructions for converting it. In some cases the conversion process might be a two-step process. The Data Conversion Web Service returns the converted data. If the data cannot be converted to the desired file format then the Data Conversion process fails and the security labelling attempt also fails;

11) The Orchestration Service sends the converted data to the Data Classifier Web Service. The Data Classifier Web Service determines the recommended classification for the data;

12) The Orchestration Service sends the recommended classification to the Policy-based Classification Web Service along with any context information (e.g., system, user, data). The Policy-based Classification Web Service returns the appropriate security classification for the data;

13) The Orchestration Service sends the data, along with the appropriate security classification, to the Security Labelling Web Service. The Security Labelling Web Service returns the labelled data to the Orchestration Service;

14) The Orchestration Service sends the labelled data to the Security Marking Web Service. The Security Marking Web Service returns the marked and labelled data to the Orchestration Service;

15) The Orchestration Service sends the marked and labelled data to the Cryptographic Binding Web Service. The Cryptographic Binding Web Service returns the digitally signed, marked and labelled data back to the Orchestration Service;


16) The Orchestration Service places the digitally signed, marked and labelled data in the Labelled File Share; and

17) The client retrieves the digitally signed, marked and labelled data from the Labelled File Share.

Figure 14 - Logical Flow

4.5.2 Samba Client The Samba Client will be used to retrieve unlabelled data from the “To be Processed File Share” on the External Labelling Service system and place labelled data on the “Labelled File Share” once the labelling process has completed successfully.
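In outline, the retrieval side might be implemented as a simple polling loop such as the sketch below (the real client would of course read the share over SMB/CIFS, and submit() is a placeholder for handing data to the Orchestration Service):

```python
import os
import time

def poll_once(share_dir, seen):
    """Return the files that have appeared in share_dir since the last poll.

    seen is the set of filenames already submitted for labelling.
    """
    current = set(os.listdir(share_dir))
    new_files = sorted(current - seen)
    seen.update(new_files)
    return new_files

def poll_forever(share_dir, submit, interval=5):
    """Periodically poll the 'To be Processed File Share' for new data."""
    seen = set()
    while True:
        for name in poll_once(share_dir, seen):
            submit(os.path.join(share_dir, name))
        time.sleep(interval)
```

Each detected data file (and its accompanying XML context file) would then trigger a new labelling process within the Orchestration Service.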


4.5.3 Apache ODE Based on a high-level examination of the market, it was decided to use Apache ODE29 as the core component of the Orchestration Service.30 The latest version of Apache ODE is 1.3.5, and it is distributed as a Web application ARchive (WAR) file (ode.war) that can be deployed using Apache Tomcat.

The steps required to enable orchestration within PLATYPUS include the following:

Define the process – A BPEL process is used to specify the order in which web services are invoked. The PLATYPUS BPEL process, which is an XML file, will be based on the logical flow depicted in Figure 14. As can be seen from this diagram some of the flow is conditional. Fortunately BPEL fully supports conditional behavior;

Build the process – In order to build the process, the schemas, the WSDLs for each of the nine web services and the BPEL process file must be copied into a Java ARchive (JAR) file. The BPEL process file will be built using the Apache ODE interface; and

Build and deploy the project – The JAR file needs to be added to a project within Apache ODE. The project is then ready to be deployed.

Because the business logic is distinct from the PLATYPUS web services, the other use cases described in Section 2.2 can be supported by developing the corresponding processes using Apache ODE.
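To give a feel for the process definition, the skeleton below shows (in heavily abbreviated, non-deployable form, with hypothetical partner link and operation names) how the conditional flow of Figure 14 could map onto WS-BPEL constructs:

```xml
<process name="PlatypusLabelling"
         targetNamespace="urn:platypus:bpel"
         xmlns="http://docs.oasis-open.org/wsbpel/2.0/process/executable">
  <sequence>
    <!-- Steps 5/6: new data arrives and a labelling process is initiated -->
    <receive partnerLink="client" operation="label" createInstance="yes"/>
    <!-- Step 7: threat detection -->
    <invoke partnerLink="threatDetection" operation="scan"/>
    <!-- Conditional behaviour: continue only if no malicious code found -->
    <if>
      <condition>$scanResult.clean</condition>
      <sequence>
        <invoke partnerLink="dataIdentification" operation="identify"/>
        <!-- ...remaining invocations per Figure 14 (steps 8 through 15)... -->
      </sequence>
      <else>
        <exit/>
      </else>
    </if>
    <!-- Step 16: return the labelled data -->
    <reply partnerLink="client" operation="label"/>
  </sequence>
</process>
```

A real process would additionally declare partner links, variables and message exchanges against the nine WSDLs packaged in the JAR.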

4.6 Data Manipulation Service The Data Manipulation Service, illustrated in Figure 15, is actually a collection of web services that have logically been grouped together. The Data Manipulation Service includes the following web services:

Threat Detection;
Data Identification;
Data Context Retrieval; and
Data Conversion.

29 Additional information on Apache ODE can be found at http://ode.apache.org.
30 There are a number of other open source and commercial alternatives that were briefly explored. These include Oracle's (formerly Sun's) BPEL Service Engine within OpenESB, the ActiveBPEL engine, Microsoft BizTalk and Oracle BPEL Process Manager. A list of Java-based open source solutions can also be found at http://java-source.net/open-source/workflow-engines.


Figure 15 - Data Manipulation Service

4.6.1 Threat Detection

For the prototype it is envisaged that the Threat Detection service will consist initially of a single threat detection filter. Follow-on research can examine supporting additional threat detection filters, including file type specific filters.31 The Threat Detection capability will consist of the following three components:

Threat Detection Web Service;
pyClamd; and
ClamAV.

Note – ExeFilter32

ExeFilter is another possibility for use as a threat detection filter for PLATYPUS. It is an open-source tool and Python framework that can be used to improve protection against malicious content in data. Specifically, it targets active content in most common file formats such as Office documents, PDF, HTML and XML. It performs a number of scans including malicious code scanning using ClamAV and pyClamd. It is also capable of white-listing specific data formats and excluding all unsupported data formats.

31 Perhaps for a more advanced prototype it might be worth considering a commercial threat detection capability such as the Tresys File Sanitization Tool (FiST): http://www.tresys.com/file-sanitization-tool.php.
32 Additional information on ExeFilter can be found at http://www.decalage.info/exefilter and http://adullact.net/projects/exefilter/.

4.6.1.1 Threat Detection Web Service The Threat Detection Web Service is a service through which data can be submitted by the Orchestration Service for threat detection.

4.6.1.2 pyClamd pyClamd will be used to interface with ClamAV from Python.33 If for any reason the pyClamd module is unsuitable, there is also pyClamAV, which is a Python binding to the ClamAV libraries written in C.34
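A minimal sketch of the intended usage is shown below; the functions follow pyClamd's documented module-level interface, and the result format (a dict for infected files, None for clean ones) is an assumption based on that documentation, so it should be verified against the installed version:

```python
# A running clamd daemon is required for scanning; the import is guarded
# so the helper below can still be exercised without it.
try:
    import pyclamd
    pyclamd.init_network_socket('localhost', 3310)  # default clamd TCP port
except ImportError:
    pyclamd = None

def interpret_scan(result):
    """Map a pyClamd scan result to (clean, detail).

    pyClamd is assumed to return None for clean files and a dict mapping
    the scanned path to the detection details for infected ones.
    """
    if result is None:
        return (True, None)
    path, detail = next(iter(result.items()))
    return (False, detail)
```

The Threat Detection Web Service would call pyclamd.scan_file() on the submitted data and report the interpreted result back to the Orchestration Service.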

4.6.1.3 ClamAV The prototype will use Clam AntiVirus (ClamAV), an open source anti-virus toolkit for UNIX.35 In 2007 the ClamAV open source project was acquired by the security vendor Sourcefire; however, Sourcefire has committed to providing versions of ClamAV that are open source. Also in 2007, security vendor Untangle ran a limited antivirus “bakeoff” in which ClamAV ranked second, ahead of both Symantec Norton and McAfee.36

4.6.2 Data Identification

Once it has been determined that the data does not pose a threat, it will be sent to the Data Identification Web Service in order to positively identify the data type. The Data Identification capability will consist of the following three components:

Data Identification Web Service;
Python-magic; and
File Magic.

4.6.2.1 Data Identification Web Service The Data Identification Web Service is the service through which data can be submitted by the Orchestration Service for data identification.

33 Additional information on pyClamd can be found at http://www.decalage.info/python/pyclamd.
34 Additional information on pyClamAV can be found at http://xael.org/norman/python/pyclamav/.
35 Additional information on ClamAV can be found at http://www.clamav.net/lang/en/.
36 The results of this “bakeoff” can be found at http://advosys.ca/viewpoints/2007/08/clamav-beats-mcafee-and-norton/.


4.6.2.2 Python-magic Python-magic will be used to interface the data identification Python code with File Magic. Python-magic can be installed as follows: apt-get install python-magic. Sample Python code37 to interface with Python-magic is as follows:

import magic

# Open a magic cookie and load the default magic definition file
ms = magic.open(magic.MAGIC_NONE)
ms.load()

# Identify a file directly by path
filetype = ms.file("/path/to/some/file")
print filetype

# Identify a file from an in-memory buffer
f = file("/path/to/some/file", "r")
buffer = f.read(4096)
f.close()
filetype = ms.buffer(buffer)
print filetype

ms.close()

4.6.2.3 File Magic File Magic will be used to identify the file type. It does this by comparing the binary “fingerprint” of the data with a definition file that specifies header/footer and content binary patterns for a variety of file types. The definition file is located at /usr/share/file/magic on Ubuntu. The Linux man page for file can be found at http://linux.die.net/man/1/file. If File Magic is unable to identify the file type, an error will be returned to the Orchestration Service, which will in turn create an error file and store it on the Labelled File Share. If the file type is identifiable, the Data Identification Service will return this value to the Orchestration Service, which will then process the data based on the identified file type.

4.6.3 Data Context Retrieval

Once the data type has been determined the Orchestration Service will send unlabelled data to the Data Context Retrieval service in order to extract the metadata. The Data Context Retrieval capability will consist of the following three components:

Data Context Retrieval (DCR) Web Service;
DCR Interface; and
Metadata Extraction Tool.

4.6.3.1 Data Context Retrieval Web Service

37 The sample Python code was downloaded from http://www.gavinj.net/2007/05/python-file-magic.html.


The Data Context Retrieval Web Service is the service through which the data context (metadata) can be extracted from data sent by the Orchestration Service.

4.6.3.2 DCR Interface A Java interface will be used between the Data Context Retrieval Web Service and the Metadata Extraction Tool. Specifically, it will take the data from the web service and make the appropriate calls to the Metadata Extraction Tool. It will then return the extracted metadata to the web service.

4.6.3.3 Metadata Extraction Tool The Metadata Extraction Tool, which was developed by the National Library of New Zealand, extracts metadata from data and then outputs it in an XML format.38 The metadata extracted is limited to semantic information about the data, including author, creation date, etc. The tool includes a number of adapters that are capable of extracting metadata from a variety of data types including images (BMP, GIF, JPEG and TIFF), Office documents (MS Word, Word Perfect, Open Office, MS Works, MS Excel, MS PowerPoint and PDF), audio/video (WAV, MP3, BWF, FLAC), markup languages (HTML, XML) and Internet files (ARC). The Metadata Extraction Tool is written in Java and XML and can be used through a Microsoft Windows GUI or a UNIX command-line interface. The most recent version of the tool is version 3.5GA.

Note – Hachoir 39

In terms of a metadata extraction tool for use in PLATYPUS, the initial focus was on Hachoir, a generic framework for binary file manipulation written in Python. The purpose of Hachoir is to serve as a framework for examining files, of which it supports over sixty file types. The framework can be extended; for example, the program hachoir-metadata can be used to extract metadata. However, while Hachoir in general, and hachoir-metadata specifically, showed considerable promise, the framework seems to be focused primarily on multimedia files including music, pictures and video. Consequently, it was decided to pursue an alternate approach and keep Hachoir as a backup solution.

4.6.4 Data Conversion

38 Additional information on the Metadata Extraction Tool can be found at http://sourceforge.net/projects/meta-extractor/ and http://meta-extractor.sourceforge.net/. 39 Additional information on Hachoir can be found at http://pypi.python.org/pypi/hachoir-core.


Once the metadata has been extracted, the original data and its DI will be sent to the Data Conversion service in order to convert it to a format supported by the Data Classifier. The Data Conversion service consists of the following components:

Data Conversion Web Service (JODConverter); and
Open Office.

4.6.4.1 Data Conversion Web Service (JODConverter) Fortunately there are a few options for accessing the data conversion capabilities of OpenOffice.org. Each of these options will require that OpenOffice.org be running on the Data Manipulation Service system. The Java OpenDocument Converter (JODConverter) is capable of leveraging OpenOffice.org in order to convert data between different document formats.40 It can be used as a Java library embedded in a Java application, as a command line tool invoked by a script, as a web application and as a web service. There is also a Python version, the Python OpenDocument Converter (PyODConverter), which is a Python script capable of automating document conversion from the command line using OpenOffice.org.41 It is envisioned that JODConverter will be used as a web service. An online guide providing instructions for accomplishing this can be found at http://www.artofsolving.com/node/15.

JODConverter supports any formats supported by OpenOffice.org. A relatively complete list can be found at http://www.artofsolving.com/opensource/jodconverter/guide/supportedformats. All data will need to be converted to a file format supported by the data classifier in the Classification Service. It is envisioned that the plain text (.txt) data format will ultimately be used.42

4.6.4.2 Open Office It is anticipated, at least initially, that PLATYPUS will be used primarily for the labelling of documents. Consequently a data conversion program that is capable of converting document formats, including binary formats, to a common data format is required. After a considerable amount of research it was determined that the most effective open source tool would in fact be OpenOffice.org.43 OpenOffice.org is an open source office suite that runs on all major operating systems. The project originated within Sun Microsystems and, since its acquisition in 2010, continues to be sponsored by Oracle.

40 Additional information on JODConverter can be found at http://www.artofsolving.com/opensource/jodconverter and at http://sourceforge.net/projects/jodconverter/.
41 Additional information on PyODConverter can be found at http://www.artofsolving.com/opensource/pyodconverter.
42 OpenOffice.org 3.3 was installed in order to test some of its data conversion capabilities. It is worth noting that the import, and subsequent conversion, of PDF files is not supported directly out of the box. It requires the installation of the Oracle PDF Import Extension, which is available at http://extensions.services.openoffice.org/project/pdfimport.
43 Additional information on OpenOffice.org can be found at http://www.openoffice.org/.


4.7 Classification Service

The Classification Service, illustrated in Figure 16, will be leveraged by the Orchestration Service to ultimately determine the sensitivity of the data and the appropriate security label for the data. It will consist of the two following independent web services:

Data Classifier; and
Policy-based Classification.

Figure 16 – Classification Service

4.7.1 Data Classifier

The Data Classifier capability will consist of the following three components:

Data Classifier Web Service;
Interface; and
RapidMiner.

4.7.1.1 Data Classifier Web Service The Data Classifier Web Service is the service through which the converted data can be submitted by the Orchestration Service in order to determine its classification.

4.7.1.2 Interface The data classifier being used within PLATYPUS is written entirely in Java and provides a Java API that allows RapidMiner to be invoked from Java applications. Basically, the Java API will be leveraged to input the converted data file, invoke the appropriate data classification algorithms through a process, and obtain the results.

4.7.1.3 RapidMiner Based on the analysis conducted in Section 3, RapidMiner was selected as the data classifier. It is worth noting, however, that PLATYPUS does not intend to increase the effectiveness of the data classification efforts; that is a separate research effort within DRDC. Because RapidMiner supports the XML-based exchange of processes, PLATYPUS can incorporate the latest research efforts relatively seamlessly. The RapidMiner implementation in PLATYPUS will use the same data classification algorithms, training set and processes dictated by this parallel research effort. These changes will be incorporated into the PLATYPUS instantiation of RapidMiner through the administrative console interface.

In terms of supported input file formats, RapidMiner supports a wide variety, but surprisingly not XML. These include Excel files, SPSS files, and data sets from well-known databases such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, Sybase and dBase. It also accepts sparse file formats such as SVMlight and mySVM, as well as standard data mining and learning scheme formats such as CSV, ARFF and C4.5. As mentioned in Section 4.6.4 of the report, the text file format will be used.

4.7.2 Policy-based Classification

The Policy-based Classification capability will consist of the following three components:

Policy-based Classification Web Service;
Interface; and
XACML Policy Engine.

4.7.2.1 Policy-based Classification Web Service The Policy-based Classification Web Service is the service through which the security label of the data will be determined.

4.7.2.2 Interface At this point in time it is envisioned that an interface will be required to take the XACML request submitted to the web service and forward it to the XACML Policy Engine. In return, the interface would receive the XACML response, which it would pass to the web service for return to the Orchestration Service. In reality, however, the interface may be subsumed by the XACML Policy Engine, which would handle the XACML requests/responses directly.

4.7.2.3 XACML Policy Engine A Policy Decision Point (PDP) is required to take a number of inputs (e.g., context, data classification, temporal data) and, based on the relevant policy, make an overall determination as to the sensitivity and corresponding security label of the data. Initially the policies will be quite rudimentary as one would expect from a prototype. Specifically, the various inputs will be given weights in determining the overall classification of the data. However, as the prototype matures, and definitely prior to deployment, it is envisioned that the PDP will need to be able to support fairly complex policies. Consequently, it was decided to use a policy engine capable of supporting a full-featured policy language such as XACML.
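The rudimentary weighting scheme described above could be sketched as follows (the level names, input names and integer weights are all placeholders, not actual PLATYPUS policy):

```python
LEVELS = ["UNCLASSIFIED", "PROTECTED A", "PROTECTED B", "SECRET"]

def weighted_label(inputs, weights):
    """Combine per-input recommended levels into an overall label.

    inputs  -- dict mapping an input name (e.g. 'data_classifier',
               'user_context') to an index into LEVELS
    weights -- dict mapping the same names to integer weights
    """
    total = sum(weights[name] for name in inputs)
    score = sum(weights[name] * level for name, level in inputs.items())
    # Round the weighted average up so the result errs on the high side
    index = -(-score // total) if total else 0
    return LEVELS[min(index, len(LEVELS) - 1)]
```

A mature deployment would replace this arithmetic with full XACML policy evaluation, as discussed below.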

After a brief amount of searching it was determined that there are two primary open source XACML policy engines available for use: Sun's (Oracle's) XACML engine and XEngine. XEngine44 is a high-performance open source XACML policy engine originating in academia. The developers claim that it is orders of magnitude faster than the Sun implementation, which served as a reference implementation soon after XACML was developed. While XEngine seems to have been updated as recently as last year, it seems to lack sufficient documentation and community involvement. In contrast, while the Sun implementation45 has not been updated in a few years,46 there is a considerable amount of documentation and community support. Consequently, the Sun implementation will be used as the policy engine for PLATYPUS.

The PDP consists of finder modules which are used to access policies and to retrieve attributes. The PLATYPUS PDP will be quite simple, at least initially, given that access policies will be stored locally and attributes will be sent to the PDP for policy mediation. The PDP would include the following modules:

attrFinder – This module would enable the policy engine to retrieve attributes, specifically the context information that will be used in policy evaluation;

FilePolicyModule – This module would enable the policy engine to access policies as files. It would be used to access the locally stored XACML policy; and

CurrentEnvModule – This module could be used to provide some of the temporal attributes, specifically current time, date and dateTime, used in the policy evaluation.

Although not planned initially, the intent is to provide a Policy Administration Point (PAP) customized for PLATYPUS. This GUI would provide an interface through which the XACML policies could be easily modified or updated.

44 Additional information on XEngine is available at http://sourceforge.net/projects/xacmlpdp/. There is also a paper entitled XEngine: A Fast and Scalable XACML Policy Evaluation Engine [Reference 13].
45 Additional information on the Sun implementation can be found at http://sunxacml.sourceforge.net/.
46 It should be noted that the current version of XACML is 2.0 which was released in 2005. Consequently, there has probably been no need to update the Sun implementation since then. Version 3.0 of XACML is currently being drafted.


Note – Reason Code

The intent is to include a reason code within PLATYPUS in order to provide justification for having selected a particular security label. Initially, the reason code will be quite limited and be used to indicate which input (data classifier, data context, system context, user context, environmental variable or temporal data) most contributed to the security label determination. It is envisioned that the reason code could eventually be incorporated into the label suggestion use case so that the user is provided with the suggested security label and justification for why that security label was selected.
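A minimal sketch of such a reason code, under the same hypothetical weighting scheme (the input names and weights are illustrative, not PLATYPUS code): the reason is simply the input whose weighted contribution to the overall score is largest.

WEIGHTS = {"data_classifier": 0.5, "data_context": 0.2,
           "system_context": 0.1, "user_context": 0.1, "temporal": 0.1}

def reason_code(votes):
    """Return the input that contributed most to the weighted decision,
    as justification for the selected security label."""
    contributions = {src: WEIGHTS[src] * level for src, level in votes.items()}
    return max(contributions, key=contributions.get)

print(reason_code({"data_classifier": 3, "data_context": 1,
                   "system_context": 0, "user_context": 2, "temporal": 0}))

In the label suggestion use case, this value would accompany the suggested label so the user sees both the label and why it was chosen.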

4.8 Label Creation Service

The Label Creation Service, illustrated in Figure 17, is actually a collection of three web services that have been logically grouped together. These services take the results of the Classification Service and turn them into the appropriate security label, markings and cryptographic binding. The Label Creation Service includes the following components:

Security Labelling Service;
Security Marking Service; and
LCS Signing Application.

Figure 17 - Label Creation Service


Note – Paragraph-level Labelling & Marking

Although the initial version of PLATYPUS will not support paragraph-level labelling and marking, it is worth exploring how future iterations could potentially address this shortcoming. Assuming that paragraph-level labelling and marking requires a cryptographic binding just as document labelling and marking does, PLATYPUS would be restricted to data formats (e.g., XML) which facilitate this level of granularity.

In terms of determining the appropriate security label for a paragraph, the document would need to be broken down into paragraphs (i.e., elements), with each of the elements treated as its own entity by PLATYPUS. In other words, PLATYPUS would determine the appropriate classification, label the paragraph, mark the paragraph and then cryptographically bind the security label to the paragraph. In this case, either an enveloped or an enveloping security label and signature could be used.

When PLATYPUS had done this for each of the elements, it would reconstruct the whole document from its components. It could either determine the appropriate classification of the document from its component parts or send the document through PLATYPUS as a whole. As long as the resulting security label equaled or exceeded the security label of every one of its component parts, it could be considered a valid security label for the document.
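The per-paragraph scheme above can be sketched as follows. This is a toy illustration, not PLATYPUS code: the classifier is a stand-in keyword lookup, and the element and attribute names are invented.

import xml.etree.ElementTree as ET

LEVELS = ["UNCLASSIFIED", "SECRET"]

def classify(text):
    # Stand-in for the real content classifier.
    return "SECRET" if "missile" in text.lower() else "UNCLASSIFIED"

doc = ET.fromstring(
    "<document>"
    "<para>Routine administrative update.</para>"
    "<para>Missile test scheduled for June.</para>"
    "</document>")

# Label each paragraph independently, as its own entity.
for para in doc.findall("para"):
    para.set("label", classify(para.text))

# The document label is valid if it equals or exceeds every part's label.
doc_label = max((p.get("label") for p in doc.findall("para")), key=LEVELS.index)
doc.set("label", doc_label)
print(doc.get("label"))

The cryptographic binding step, omitted here, would then sign each labelled element and the reassembled document.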

4.8.1 Security Labelling

Once the appropriate security label has been determined for the data, it will need to be applied. The Orchestration Service will send the unlabelled data to the Security Labelling Service to accomplish this. The Security Labelling capability will consist of the following two components:

Security Labelling Web Service; and
Security Labelling Application.

4.8.1.1 Security Labelling Web Service

The Security Labelling Web Service is the service through which data can be submitted by the Orchestration Service for labelling.

4.8.1.2 Security Labelling Application

The Security Labelling Application will need to be developed for the prototype. It will develop an appropriate security label based on the security label guidance provided and according to the XML SPIF. As detailed in Section 2.5.1, the security label will conform to the NATO Profile for the XML Confidentiality Label Syntax. XML-based data will have the security label included in the data, whereas non-XML data will be contained in a zip file with the security label included as metadata in the archive.
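A hedged sketch of the non-XML case: place the data in a zip archive and carry the security label alongside it as a metadata entry. The file names and label format below are assumptions for illustration, not the NATO label syntax.

import io
import zipfile

label_xml = '<confidentialityLabel policy="demo">SECRET</confidentialityLabel>'
payload = b"binary report contents"

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("report.bin", payload)   # the non-XML data itself
    zf.writestr("label.xml", label_xml)  # the security label as archive metadata

# A consumer reads the label back out of the archive.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    print(zf.read("label.xml").decode())

Keeping the label as a separate entry leaves the original data untouched while still travelling with it.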

4.8.2 Security Marking

Once the data has been appropriately labelled it will need to be marked in such a way as to provide human readable guidance as to the sensitivity of the data. The Security Marking capability will consist of the two following components:

Security Marking Web Service; and
Security Marking Application.

4.8.2.1 Security Marking Web Service

The Security Marking Web Service is the service through which data can be submitted by the Orchestration Service for marking.

4.8.2.2 Security Marking Application

The Security Marking Application will need to be developed for the prototype. It will appropriately mark the data based on the assigned security label and according to the XML SPIF. XML data will be marked according to the NATO Profile for the XML Confidentiality Label Syntax [Reference 5]. Non-XML data will not be marked for the initial prototype.

4.8.3 Cryptographic Binding

The Cryptographic Binding capability will serve to bind the security label and markings to the data. It will consist of the two following components:

Cryptographic Binding Web Service; and
Cryptographic Binding Application.

4.8.3.1 Cryptographic Binding Web Service

The Cryptographic Binding Web Service is called by the Orchestration Service in order to complete the security labelling process by including a cryptographic binding.


4.8.3.2 Cryptographic Binding Application

As with the Validation Program (Section 4.4.3), the Python module pyxmldsig will be used. However, in this case it will be used to digitally sign the data, thereby cryptographically binding the security label and markings to the data. The following code will need to be included in the Cryptographic Binding Application in order to invoke the pyxmldsig module:

import pyxmldsig

xdsig = pyxmldsig.Xmldsig(key_file='mykey.pem', cert_file='myx509cert.pem',
                          password='mypassword')

signed_xml1 = xdsig.sign_file('myfile.xml')
signed_xml2 = xdsig.sign_file(pyxmldsig.TEMPLATE_WITH_CERT)

As can be seen from the code, it requires that the Cryptographic Binding Application have a copy of its own keys and certificate, both in PEM format.


5.0 Next Steps

While there is a considerable amount of future research that could be conducted, there are three logical follow-ups to this report. Consequently, this section will examine building the prototype, integrating with SAMSON and contextual user profiles.

5.1 Building the Prototype

Table 4 lists the tasks required to build the prototype. Given the broad range of skills required, it is envisioned that the prototype could be built by a relatively small team of senior personnel. Specifically, the team would be composed of the following Subject Matter Experts (SMEs):

Orchestration SME;
Java & Web Services SME;
Python SME;
XACML SME; and
Prototyping SME.

Table 4 - Prototype Build Tasks

Tasks

Build VMs – Installation & Configuration
Client (XP, Java)
Support Services (W2003, Entrust PKI, AD)
Automated Security Labelling Service (Base Install, Samba (AD authentication))
Orchestration Service (Base Install, Apache ODE, MySQL)
Data Manipulation Service (Base Install, ClamAV, Metadata Extraction Tool, OpenOffice.org)
Classification Service (Base Install, RapidMiner, XACML Policy Engine)
Label Creation Service (Base Install)

Develop Web Services
ELS Web Service
Threat Detection Web Service
Data Identification Web Service
Data Context Retrieval Web Service
Data Conversion Web Service
Data Classifier Web Service
Policy-based Classification Web Service
Security Labelling Web Service
Security Marking Web Service
Cryptographic Binding Web Service

Detailed Design
Apache ODE (including BPEL policy)
Metadata Extraction Tool
OpenOffice.org
RapidMiner
XACML Policy Engine (including XACML policy)
XML SPIF

Develop Python & Java Code
Validation Application (pyxmldsig)
System-User Context Application
Polling Application
pyClamd Application
python-magic Application
Java Application (metadata extraction)
JODConverter Application
RapidMiner Application
XACML Policy Engine Application
Security Labelling Application
Security Marking Application
Cryptographic Binding Application (pyxmldsig)

Testing
Test Plan
Testing
Test Results

Detailed Design Document

Project Management
Team Meetings
DRDC Progress Reports & Meetings

Installation & Demonstration at DRDC Labs

5.2 Integrating with SAMSON

SAMSON is a DRDC project intended to demonstrate multi-caveat separation (Canadian Eyes Only (CEO), Canadian U.S. (CANUS)) in an operational Secret network. Depending upon the success of the prototype described in Section 5.1, it may be desirable to integrate PLATYPUS into SAMSON in order to provide an automatic security labelling capability within the SAMSON demonstration project.

5.3 Contextual User Profile

It is envisaged that the accuracy of the security labelling results for users could be improved through the use of contextual user profiles. As detailed in Annex A, a contextual user profile is constructed by monitoring the activity of the user on his system and then building a contextual profile from this information. A research/prototype effort could examine approaches to building a contextual user profile, likely based on related work on the semantic web, and prototype them. Specifically, it would be of interest to know the effect of using contextual user profiles on the accuracy of security labelling results.


6.0 Conclusions & Recommendations

PLATYPUS is a proposed system that will be capable of digesting vast quantities of unstructured, unlabelled content and, using content analysis, determining the sensitivity of the information and assigning it the appropriate security label. It accomplishes this through the provision of the following five services:

External Labelling Service – The External Labelling Service is the external interface through which users, application or services submit data to be labelled;

Orchestration Service – PLATYPUS is a collection of completely independent web services. The business logic, and specifically the manner in which unlabelled data is routed between the web services, is provided by the Orchestration Service. By separating the business logic from the individual web services the overall flexibility of the solution is increased, allowing it to support additional use cases;

Data Manipulation Service – The Data Manipulation Service is responsible for preparing the data for content analysis. This includes scanning the data for threats, determining the data type and converting the data to a common data format;

Classification Service – The Classification Service leverages content analysis and contextual information (e.g., data, system, user) as prescribed in the overarching policy in order to determine the security classification of the data; and

Label Creation Service – The Label Creation Service takes the security classification of the data, as determined by the Classification Service, and applies the appropriate security label and markings. The security label and markings are cryptographically bound to the data using a digital signature.

In addition to the logical design, the report also includes a prototype design. The prototype design takes the logical design and details how to implement it using primarily open source components, albeit supplemented with a limited amount of custom code. While the selection of most open source products was relatively straightforward, the selection of the data classifier was more complicated. Not only is it the key component of PLATYPUS, but there were two viable candidates for the role: Apache Mahout and RapidMiner. Consequently, an options analysis was conducted of these two solutions. After considerable analysis it was determined that RapidMiner was the more mature offering and, at this point in time, better suited for inclusion in the prototype.

Based on the prototype design documented in this report, it is recommended that DRDC proceed with the prototype development in order to prove the viability of the technology and the design. Assuming that this next phase of the project is successful, DRDC should consider its integration into the SAMSON project.


References

[Reference 1] Brown, D. and Charlebois, D., Security classification using automated learning (SCALE), (DRDC Ottawa TM 2010-215) Defence R&D Canada – Ottawa, December 2010.

[Reference 2] Magar, A., Investigation of Technologies and Techniques for Labelling Information Objects to Support Access Management, (DRDC Ottawa CR 2005-166), Defence R&D Canada – Ottawa, November 2005.

[Reference 3] Owen, S. et al., Mahout in Action, Manning Publications, 2011.

[Reference 4] Oudkerk, S., Bryant, I., Eggen A., and Haakseth, R., A Proposal for an XML Confidentiality Label and Related Binding of Metadata to Data Objects, NATO RTO-MP-IST-091, pp. 22-1 – 22-10, November 2010.

[Reference 5] Eggen, A. (editor), XML Confidentiality Label Syntax, XML in Cross-domain Security Solutions, NATO RTO IST-068, Annex F, 2010. (preprint)

[Reference 6] Eggen, A. (editor), Binding of Metadata to Data Objects, XML in Cross-domain Security Solutions, NATO RTO IST-068, Annex G, 2010. (preprint)

[Reference 7] Sonnenburg, S. et al., The Need for Open Source Software in Machine Learning, Journal of Machine Learning Research 8, pp. 2443-2466, October 2007.

[Reference 8] Magar, A., private correspondence with D. Brown, April 2010.

[Reference 9] Chu, C., et al., Map-Reduce for Machine Learning on Multicore, Computer Science Department, Stanford University, 2007.

[Reference 10] Wu, X. et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems vol. 14, pp. 1-37, 2008.

[Reference 11] Rennie, J. et al., Tackling the Poor Assumptions of Naïve Bayes Text Classifiers, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 2003.

[Reference 12] Komarek, P., Logistic Regression for Data Mining and High-Dimensional Classification, Department of Math Sciences, Carnegie Mellon University, Technical Report TR-04-34, 2004.

[Reference 13] Liu et al., XEngine: A Fast and Scalable XACML Policy Evaluation Engine, SIGMETRICS’08, June 2008.


[Reference 14] Challam, V., Contextual Information Retrieval Using Ontology Based User Profiles, M.Sc. Thesis, Jawaharlal Nehru Technological University, Hyderabad, India, 2004.

[Reference 15] Heath, T., et al., Supporting User Tasks and Context: Challenges for Semantic Web Research, in Proceedings of ESWC2005 Workshop on End-User Aspects of Semantic Web, 2005.

[Reference 16] Heath, T., et al., Uses of Contextual Information to Support Online Tasks, in Proceedings of 1st AKT Doctoral Symposium, pp. 107-113, Knowledge Media Institute, The Open University, 2005.

[Reference 17] White, R., et al., Predicting User Interests from Contextual Information, SIGIR’09, Boston MA, July 2009.


Acronyms & Abbreviations

AGPL Affero General Public License
API Application Programming Interface
ASCII American Standard Code for Information Interchange
AV AntiVirus
Balie Baseline Information Extraction
BPEL Business Process Execution Language
CA Certification Authority
CANUS Canadian U.S.
CDS Cross Domain Solution
CEO Canadian Eyes Only
CIFS Common Internet File System
COTS Commercial-Off-The-Shelf
DCMI Dublin Core Metadata Initiative
DCR Data Context Retrieval
DI Data Identification
DSML Directory Services Markup Language
EC2 Elastic Compute Cloud
ELS External Labelling Service
EM Expectation Maximization
FiST File Sanitization Tool
Gate General Architecture for Text Engineering
GC Government of Canada
GPL General Public License
GUI Graphical User Interface
HTML HyperText Markup Language
IDE Integrated Development Environment
IIS Internet Information Server
IP Internet Protocol
JAR Java ARchive
JOD Java OpenDocument
JRE Java Runtime Environment


kNN k-Nearest Neighbour
LDAP Lightweight Directory Access Protocol
LLDP Link Layer Discovery Protocol
LMCD Labelled Marked Cryptographically Bound
MAC Machine Address Code
MCBFF Microsoft Compound Binary File Format
MCDFF Microsoft Compound Document File Format
MED Media Endpoint Discovery
MLOSS Machine Learning Open Source Software
NAC Network Access Control
NaCTeM National Center for Text Mining
NATO North Atlantic Treaty Organization
NB Naïve Bayes
NLTK Natural Language Toolkit
OASIS Organization for the Advancement of Structured Information Standards
ODE Orchestration Director Engine
PAP Policy Administration Point
PDF Portable Document Format
PDP Policy Decision Point
PEM Privacy Enhanced Mail
PLATYPUS Policy-based Labelling Automated TechnologY Prototype for Unlabelled Sources
RAM Random Access Memory
RCOMM RapidMiner Community Meeting and Conference
S/MIME Secure/Multipurpose Internet Mail Extensions
SAML Security Assertion Markup Language
SAMSON Secure Access Management for Secret Operational Networks
SELinux Security Enhanced Linux
SGD Stochastic Gradient Descent
SMB Server Message Block
SME Subject Matter Expert
SOA Service Oriented Architecture
SPIF Security Policy Information File
SPML Service Provisioning Markup Language
SSL Secure Sockets Layer


SVM Support Vector Machines
TCG Trusted Computing Group
TD Threat Detection
TNC Trusted Network Connect
UDDI Universal Description Discovery and Integration
URI Uniform Resource Identifier
VM Virtual Machine
W3C World Wide Web Consortium
WAR Web Application aRchive
WS Web Services
WS-CDL Web Services – Choreography Description Language
WSDL Web Services Description Language
XACML eXtensible Access Control Markup Language
XML eXtensible Markup Language
XMLDSig XML Digital Signature
XMPP eXtensible Messaging and Presence Protocol
YALE Yet Another Learning Environment


Annex A – Context

Two previous data classification initiatives, Security classification using automated learning (SCALE) [Reference 1] and [Reference 8], found that the security classification of data could be determined with approximately 80% accuracy if a sufficient quantity of prototypical documents was available with which to train the machine learner. It is envisioned that this accuracy rate could be significantly improved if contextual information47 could be factored into the classification decision process. Contextual information, illustrated in Figure 18, consists of the following three types of information:

Data Context – Data context refers to additional information that can be inferred, extracted or accompanies the data to be classified. Metadata created by the user falls into this category;

System Context – System context refers to information pertaining to the system from which the labelling request originated or on which the data was stored; and

User Context – User context refers to information that can be determined about either the user submitting the data for classification or the author of the data.

Figure 18 - Contextual Information

47 Merriam-Webster defines context as 1) : the parts of a discourse that surround a word or passage and can throw light on its meaning 2) : the interrelated conditions in which something exists or occurs : ENVIRONMENT, SETTING.


Data Context

The context of the data can be extremely useful in determining the security classification of the data, either directly or by helping to determine the topic, which ultimately aids in the classification process. Data context can itself be sub-divided into syntactic and semantic information. Syntactic refers to information that can be discerned from the appearance of the data or is computable from the data itself (e.g., data type, size).

In contrast, semantic information tends to be descriptive in nature and is typically stored in the form of metadata. Metadata can be stored intrinsically or extrinsically. In the case of intrinsic metadata, the metadata is for all intents and purposes part of the data itself. In the case of extrinsic metadata, the metadata is typically stored in an external repository. A link to the metadata repository may be included with the data.

While not all data will have associated metadata, some will. The Dublin Core Metadata Initiative (DCMI)48, which is an open organization engaged in the development of interoperable metadata standards, defines in excess of twenty elements in their metadata profile. These elements include such things as source, relation, coverage, creator, publisher, contributor, rights, date, audience, provenance, etc. which are directly applicable in terms of classifying data.
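As an illustration of how such metadata could feed the classification process, the following sketch pulls a few Dublin Core elements out of an XML metadata block with the standard library. The sample record is invented; the DCMI element-set namespace URI is the published one.

import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
sample = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:creator>A. Magar</dc:creator>'
    '<dc:date>2011-10-01</dc:date>'
    '<dc:rights>Protected B</dc:rights>'
    '</record>')

root = ET.fromstring(sample)
# Elements such as creator, date and rights bear directly on classification.
metadata = {el.tag.split('}')[1]: el.text
            for el in root if el.tag.startswith('{' + DC)}
print(metadata)

A classifier could weight fields such as rights or provenance directly, rather than relying on content analysis alone.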

System Context

The amount of information that can be discerned about the system, either the originating system or the requesting one, is dependent on the capabilities and cooperation of that system. The Trusted Computing Group (TCG)49, which is a not-for-profit organization formed to develop, define and promote open, vendor-neutral, industry standards for trusted computing building blocks and software interfaces across multiple platforms, defines five classes of endpoints that provide varying amounts of system context information. The higher the class of endpoint, the more system context information is available. The five classes are as follows:

1) Completely unresponsive – A completely unresponsive system will only provide information that is externally observable. This information includes the Machine Address Code (MAC), the Internet Protocol (IP) address and any behavioral information that can be obtained;

2) Link Layer Discovery Protocol – Media Endpoint Discovery (LLDP-MED) – LLDP is a link-layer protocol that advertises device information (e.g., system name, port), device capabilities and media specific configuration information to other devices on the network;

3) Unauthenticated endpoint – An unauthenticated endpoint will provide invalid system credentials, likely a system name and password or X.509 certificate;

4) Authenticated endpoint with unverifiable integrity – An authenticated endpoint will provide valid system credentials, likely a system name and password, but no integrity information; and

48 http://dublincore.org/
49 http://www.trustedcomputinggroup.org/


5) Authenticated endpoint with Trusted Network Connect (TNC) client – This type of authenticated endpoint provides valid system credentials, likely a system name and password, as well as integrity information. It is based on this information that the Network Access Control (NAC) solution can dynamically assign access privileges and even quarantine endpoints that don’t meet the minimum specifications.

In terms of the DND operational environment, either (2) or (4) are likely the most prevalent. While (5) may eventually be adopted this is unlikely to occur in the immediate future.

User Context

User context can be divided into two main categories: an explicit user profile and a contextual user profile.

The explicit user profile is basically a digital representation of the user's identity. It contains such pertinent information as role, nationality, security clearance, etc. This information is usually easily retrievable from a central identity repository using LDAP/Directory Services Markup Language (DSML) or by querying an identity service using a protocol such as the Service Provisioning Markup Language (SPML).

The contextual user profile is more difficult to obtain and in many cases may be impossible to obtain. It is usually constructed by monitoring the activity of the user on his system and then building a contextual profile from this information. The contextual user profile is typically built using non-invasive approaches, which monitor what the user is working on without interfering. The contextual user profile can be built by using a combination of the types of documents that the user works on, search text, social networks, services or third parties they trust, etc. Contextual profiles are closely related to work on the semantic web.50 However, instead of using the contextual profile to improve user search results, PLATYPUS would leverage it to improve data classification results. Related research includes the following:

Contextual Information Retrieval Using Ontology Based User Profiles [Reference 14];
Supporting User Tasks and Context: Challenges for Semantic Web Research [Reference 15];
Uses of Contextual Information to Support Online Tasks [Reference 16]; and
Predicting User Interests from Contextual Information [Reference 17].
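The non-invasive profile building described above can be caricatured in a few lines: count the topics of documents the user has recently worked on and expose the dominant interests. The topic labels and activity history below are invented for illustration only.

from collections import Counter

def build_profile(activity):
    """activity: (document_topic, action) pairs observed on the user's system."""
    return Counter(topic for topic, _ in activity)

history = [("logistics", "edit"), ("operations", "read"),
           ("operations", "edit"), ("operations", "search")]
profile = build_profile(history)
print(profile.most_common(1))

A real profile would draw on richer signals (search text, trusted services, social context) and likely an ontology, as in the research cited above; the classifier would then bias its decision toward the user's dominant topics.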

50 The semantic web is a term coined by World Wide Web Consortium (W3C) director Tim Berners-Lee to refer to a Web in which information is described in a manner understandable by computers. For example, in the semantic web computers would understand the relationships between things and consequently be able to differentiate between similar sounding but fundamentally different concepts.


Annex B – Open Source Data Classification

Listed below are some of the other open source data classification projects that were considered.

ABNER http://pages.cs.wisc.edu/~bsettles/abner/

BANNER http://banner.sourceforge.net/

Baseline Information Extraction (Balie)51 http://balie.sourceforge.net/

The Dragon Toolkit - http://dragon.ischool.drexel.edu/default.asp

Ephyra http://www.ephyra.info/

General Architecture for Text Engineering (Gate) http://gate.ac.uk/

JULIE http://www.julielab.de/

KNIME http://www.knime.org/

LingPipe http://alias-i.com/lingpipe/

MALLET http://mallet.cs.umass.edu/index.php/Main_Page

MaltParser http://maltparser.org/

MinorThird http://sourceforge.net/apps/trac/minorthird/wiki

MontyLingua http://web.media.mit.edu/~hugo/montylingua/

MorphAdorner http://morphadorner.northwestern.edu/morphadorner/

National Center for Text Mining (NaCTeM) http://www.nactem.ac.uk/

Natural Language Toolkit (NLTK) http://www.nltk.org/

OpenFst http://openfst.org/

OpenNLP http://opennlp.sourceforge.net/

R http://cran.stat.ucla.edu/web/views/NaturalLanguageProcessing.html and http://www.jstatsoft.org/v25/i05

RASP http://www.informatics.sussex.ac.uk/research/groups/nlp/rasp/

51 Balie was created at the University of Ottawa in April 2004.


SecondString http://secondstring.sourceforge.net/

SimMetrics http://sourceforge.net/projects/simmetrics/

Stanford Parser http://nlp.stanford.edu/software/lex-parser.shtml

Weka http://www.cs.waikato.ac.nz/ml/weka/


Annex C – PLATYPUS Policy Languages

This section will provide a high level overview of the various extensible languages mentioned in the PLATYPUS architecture. These include the following:

BPEL;
XACML; and
XML SPIF.

Business Process Execution Language (BPEL)

BPEL52, which is short for Web Services Business Process Execution Language (WS-BPEL), originated as a joint effort between BEA, IBM and Microsoft in 2002. SAP and Siebel Systems joined slightly later. It is now both an Organization for the Advancement of Structured Information Standards (OASIS) standard and an industry standard. BPEL can be used to specify the order with which web services are invoked as part of a business process. BPEL, which is XML-based, interoperates with other web services standards such as WSDL, UDDI and SOAP. It is worth noting that there is a complementary effort called BPELJ53 in which BPEL and Java are combined. BPELJ enables Java code to be included in BPEL process definitions.

Listed below is an example of a BPEL process document.54

<process name="HelloWorld"
         targetNamespace="http://jbpm.org/examples/hello"
         xmlns:tns="http://jbpm.org/examples/hello"
         xmlns:bpel="http://schemas.xmlsoap.org/ws/2003/03/business-process/"
         xmlns="http://schemas.xmlsoap.org/ws/2003/03/business-process/">
  <partnerLinks>
    <!-- establishes the relationship with the caller agent -->
    <partnerLink name="caller"
                 partnerLinkType="tns:Greeter-Caller"
                 myRole="Greeter" />
  </partnerLinks>
  <variables>
    <!-- holds the incoming message -->
    <variable name="request" messageType="tns:nameMessage" />
    <!-- holds the outgoing message -->
    <variable name="response" messageType="tns:greetingMessage" />
  </variables>
  <sequence name="MainSeq">
    <!-- receive the name of a person -->
    <receive name="ReceiveName" operation="sayHello"
             partnerLink="caller" portType="tns:Greeter"
             variable="request" createInstance="yes" />
    <!-- compose a greeting phrase -->
    <assign name="ComposeGreeting">
      <copy>
        <from expression="concat('Hello, ', bpel:getVariableData('request', 'name'), '!')" />
        <to variable="response" part="greeting" />
      </copy>
    </assign>
    <!-- send greeting back to caller -->
    <reply name="SendGreeting" operation="sayHello"
           partnerLink="caller" portType="tns:Greeter"
           variable="response" />
  </sequence>
</process>

52 Additional information on BPEL can be found at http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsbpel.
53 Additional information on BPELJ can be found at http://www.ibm.com/developerworks/library/specification/ws-bpelj/.
54 The BPEL example, and a good overview on creating BPEL processes, is available at http://docs.jboss.com/jbpm/bpel/v1.1/userguide/tutorial.hello.html.

eXtensible Access Control Markup Language (XACML)

XACML55 is an OASIS standard that is now widely considered the de facto standard for access control policy languages. The current version of XACML is version 2.0; version 3.0 is under development.

Listed below is an example of an XACML policy.56

<?xml version="1.0" encoding="UTF-8"?>
<Policy xmlns="urn:oasis:names:tc:xacml:1.0:policy"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:oasis:names:tc:xacml:1.0:policy cs-xacml-schema-policy-01.xsd"
        PolicyId="urn:oasis:names:tc:xacml:1.0:conformance-test:IIA002:policy"
        RuleCombiningAlgId="urn:oasis:names:tc:xacml:1.0:rule-combining-algorithm:deny-overrides">
  <Description>
    Policy for Conformance Test IIA002.
  </Description>
  <Target>
    <Subjects> <AnySubject/> </Subjects>
    <Resources> <AnyResource/> </Resources>
    <Actions> <AnyAction/> </Actions>
  </Target>
  <Rule RuleId="urn:oasis:names:tc:xacml:1.0:conformance-test:IIA002:rule"
        Effect="Permit">
    <Description>
      A subject with a role attribute of "Physician" can read or write
      Bart Simpson's medical record.
    </Description>
    <Target>
      <Subjects>
        <Subject>
          <SubjectMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
            <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">Physician</AttributeValue>
            <SubjectAttributeDesignator
                AttributeId="urn:oasis:names:tc:xacml:1.0:example:attribute:role"
                DataType="http://www.w3.org/2001/XMLSchema#string"/>
          </SubjectMatch>
        </Subject>
      </Subjects>
      <Resources>
        <Resource>
          <ResourceMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:anyURI-equal">
            <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#anyURI">http://medico.com/record/patient/BartSimpson</AttributeValue>
            <ResourceAttributeDesignator
                AttributeId="urn:oasis:names:tc:xacml:1.0:resource:resource-id"
                DataType="http://www.w3.org/2001/XMLSchema#anyURI"/>
          </ResourceMatch>
        </Resource>
      </Resources>
      <Actions>
        <Action>
          <ActionMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
            <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">read</AttributeValue>
            <ActionAttributeDesignator
                AttributeId="urn:oasis:names:tc:xacml:1.0:action:action-id"
                DataType="http://www.w3.org/2001/XMLSchema#string"/>
          </ActionMatch>
        </Action>
        <Action>
          <ActionMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
            <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">write</AttributeValue>
            <ActionAttributeDesignator
                AttributeId="urn:oasis:names:tc:xacml:1.0:action:action-id"
                DataType="http://www.w3.org/2001/XMLSchema#string"/>
          </ActionMatch>
        </Action>
      </Actions>
    </Target>
  </Rule>
</Policy>

55 Additional information on XACML is available at http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xacml.
56 This example was taken from http://wiki.oasis-open.org/xacml/PolicyExamples.
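As a rough illustration of how such a policy is consumed, the sketch below matches a request against the rule target of the XACML example above. This is a toy matcher written for this example only, not a conformant XACML Policy Decision Point (condition evaluation, rule- and policy-combining algorithms, and indeterminate results are all omitted), and the embedded policy is a trimmed copy of the example with the MatchId and designator details removed.

```python
# Toy illustration only: matches a request against the rule target of the
# XACML example above. Not a conformant XACML PDP -- conditions, combining
# algorithms, and indeterminate results are all omitted.
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xacml:1.0:policy"}

# Trimmed copy of the example policy (rule target only).
POLICY = """\
<Policy xmlns="urn:oasis:names:tc:xacml:1.0:policy">
  <Rule RuleId="example:rule" Effect="Permit">
    <Target>
      <Subjects><Subject><SubjectMatch>
        <AttributeValue>Physician</AttributeValue>
      </SubjectMatch></Subject></Subjects>
      <Resources><Resource><ResourceMatch>
        <AttributeValue>http://medico.com/record/patient/BartSimpson</AttributeValue>
      </ResourceMatch></Resource></Resources>
      <Actions>
        <Action><ActionMatch><AttributeValue>read</AttributeValue></ActionMatch></Action>
        <Action><ActionMatch><AttributeValue>write</AttributeValue></ActionMatch></Action>
      </Actions>
    </Target>
  </Rule>
</Policy>
"""

def rule_permits(policy_xml: str, role: str, resource: str, action: str) -> bool:
    """True if the rule's target matches the request and its effect is Permit."""
    rule = ET.fromstring(policy_xml).find("x:Rule", NS)
    target = rule.find("x:Target", NS)

    def matches(match_tag: str, value: str) -> bool:
        # Collect the AttributeValue elements under the given match element.
        path = f".//x:{match_tag}/x:AttributeValue"
        return any(v.text == value for v in target.findall(path, NS))

    return (rule.get("Effect") == "Permit"
            and matches("SubjectMatch", role)
            and matches("ResourceMatch", resource)
            and matches("ActionMatch", action))

record = "http://medico.com/record/patient/BartSimpson"
print(rule_permits(POLICY, "Physician", record, "read"))    # True
print(rule_permits(POLICY, "Physician", record, "delete"))  # False
print(rule_permits(POLICY, "Nurse", record, "read"))        # False
```

A real deployment would delegate this work to a PDP implementation rather than ad hoc XML matching; the sketch only shows the shape of the decision a PDP makes.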

XML Security Policy Information File (SPIF)

XML SPIF57 is a policy file format that explicitly denotes the manner in which security labels and security markings should be applied to data. The standard was developed by xmlspif.org in 2009 and is supported by approximately nine messaging and guard vendors. By abstracting the security policy into a SPIF, the policy definition is kept separate from the product that enforces or supports it. With a centrally defined SPIF format supported by all products, only one SPIF needs to be created by the organization.58
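Unlike the preceding sections, no SPIF example was included, so the sketch below is a notional illustration of how a labelling component could read a classification hierarchy from a SPIF. The embedded fragment is an assumption: the element and attribute names are modelled loosely on the xmlspif.org schema and the classification values are illustrative only, not an excerpt from a real policy file.

```python
# Notional sketch: reading a classification hierarchy from a SPIF-like
# document. Element/attribute names and values below are assumptions
# modelled loosely on the xmlspif.org schema, for illustration only.
import xml.etree.ElementTree as ET

NS = {"spif": "http://www.xmlspif.org/spif"}

SPIF_XML = """\
<spif xmlns="http://www.xmlspif.org/spif">
  <securityClassifications>
    <securityClassification name="UNCLASSIFIED" hierarchy="1"/>
    <securityClassification name="PROTECTED A" hierarchy="2"/>
    <securityClassification name="SECRET" hierarchy="3"/>
  </securityClassifications>
</spif>
"""

def classification_levels(spif_xml: str) -> dict:
    """Map each classification name to its numeric rank in the hierarchy."""
    root = ET.fromstring(spif_xml)
    return {
        c.get("name"): int(c.get("hierarchy"))
        for c in root.findall(".//spif:securityClassification", NS)
    }

def dominates(a: str, b: str, levels: dict) -> bool:
    """True if label a is at least as sensitive as label b."""
    return levels[a] >= levels[b]

levels = classification_levels(SPIF_XML)
print(levels["SECRET"])                            # 3
print(dominates("SECRET", "PROTECTED A", levels))  # True
```

Because every product consumes the same SPIF, the hierarchy used for comparisons like `dominates` is defined once, centrally, which is the benefit described above.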

57 Additional information on XML SPIF is available at http://www.xmlspif.org/. 58 http://www.isode.com/whitepapers/why-spif.html
