
Faculty of Bioscience Engineering

Academic year 2015-2016

An ontology based query engine for querying biological sequences

Jim Clauwaert

Promoter: Prof. Dr. ir. Wim Van Criekinge

Tutor: Martijn Devisscher

Master's dissertation submitted to obtain the degree of Master of Science in Bioscience Engineering: Cell and Gene Biotechnology

Foreword

This thesis came about during the academic year 2015-2016 as the final project for my master's degree in bio-engineering. In many ways, this year has been very heavy. Even though working on my thesis often meant not working on something else I should have been working on, I have enjoyed researching the subject and am satisfied when looking back at the work invested. This feeling of comfort is due to many external influences that have guided and supported me. I wish to extend my gratitude to the people that have stood by my side throughout the last year.

First, I would like to thank Martijn Devisscher, my tutor and the spiritual father of boinq. The rich experience obtained through my work on boinq and The Semantic Web is mainly attributed to the positive working environment he created. I have been given both the responsibility and the trust to handle important parts of the boinq program. This gave me not only the opportunity, but also the ability to think for myself and introduce solutions when these presented themselves. Through weekly appointments, I was able to follow up on and discuss work, and get directions when no path was obvious. Through these elements I feel that I was able to contribute to the creation of boinq, and that my input was of value. This has been both my strongest motivation and the most fulfilling aspect of my thesis. I also extend my gratitude to the BioBix group: specifically, to my promoter, Prof. Wim Van Criekinge, for helping to make this thesis a possibility, and to Prof. Tim de Meyer and dr. Gerben Menschaert, for helping me define a use case and assisting me throughout. I want to thank my family for supporting me all these years. I want to thank my friends for being awesome in general. A special thanks to Meaghan Blanchard, for being the first helping hand when correcting and revising my work, and being there for whatever reason.

Gent, 2016

Jim Clauwaert


Table of Contents

Foreword

1 Abstract

2 Introduction

3 The Semantic Web
  3.1 Introduction
  3.2 What is The Semantic Web?
  3.3 RDF
    3.3.1 Structure of RDF
    3.3.2 Vocabularies of RDF
  3.4 Linked Data
    3.4.1 Inference
    3.4.2 Linked databases
  3.5 RDF data management
    3.5.1 RDF formats
    3.5.2 Triplestores
  3.6 SPARQL
    3.6.1 SPARQL syntax

4 Boinq
  4.1 Introduction
  4.2 Design
    4.2.1 Data unification
    4.2.2 Data organization
  4.3 Comparison to other frameworks
    4.3.1 Biological query building
    4.3.2 Semantic access to sequence information
  4.4 Material and methods

5 Genomic Data Implementation
  5.1 Introduction
  5.2 Genomic Data
    5.2.1 Browser Extensible Data format
    5.2.2 Generic Feature Format
    5.2.3 Variant Call Format
    5.2.4 Sequence Alignment/Map format
  5.3 Data integration into The Semantic Web
    5.3.1 Overview
    5.3.2 Basic data model
    5.3.3 Vocabularies
    5.3.4 Data models
    5.3.5 Metadata
    5.3.6 Practical implementation
  5.4 Evaluation
    5.4.1 sparql-bed and sparql-vcf
    5.4.2 Big data files
    5.4.3 JBrowse

6 Biological research in RDF
  6.1 Introduction
  6.2 A biomarker for colon cancer
    6.2.1 Introduction
    6.2.2 Material and methods
    6.2.3 Results
  6.3 Discussion
    6.3.1 Methods
    6.3.2 Results

7 Conclusion and Future Prospects

A Code Examples

B Tables

C Figures

List of Acronyms


B

Boinq Bio ontology integrated query platform
BED Browser Extensible Data

C

CDS Coding DNA Sequence
CNV Copy Number Variations
CIMP CpG Island Methylator Phenotype
CRC Colorectal Cancer
CTD Comparative Toxicogenomics Database

D

DBMS Database Management Systems
DDBJ DNA Databank of Japan
DKO Double Knock-Out

E

EMBL-EBI The European Bioinformatics Institute

G

GDA Gene Disease Association
GFF/GFF3 General Feature Format
GFVO Genomic Feature and Variation Ontology
GMOD Generic Model Organism Database
GRC Genome Reference Consortium
GTF Gene Transfer Format

I

IRI Internationalized Resource Identifier

J

JSON-LD JavaScript Object Notation for Linked Data

M

MeSH Medical Subject Headings

N

NCBI National Center for Biotechnology Information
NCI National Cancer Institute
NHGRI National Human Genome Research Institute

O

OWL Web Ontology Language

R

RDF Resource Description Framework
RDFS Resource Description Framework Schema

S

SKOS Simple Knowledge Organization System
SNP Single Nucleotide Polymorphism
SO Sequence Ontology
SPARQL SPARQL Protocol and RDF Query Language
STS Spring Tool Suite

T

TCGA The Cancer Genome Atlas

U

UniProt The Universal Protein Resource
URI Uniform Resource Identifier
URL Uniform Resource Locator

V

VCF Variant Call Format

W

W3C World Wide Web Consortium
WT Wild Type
WWW World Wide Web

X

XML Extensible Markup Language
XSD XML Schema Definition

1 Abstract

English version The Semantic Web is an enhancement of the World Wide Web with a focus on providing a standardized framework for exchanging data. This allows for a web of data not limited by applications and data formats. Technologies created for The Semantic Web have increasingly been adopted by public databases. Boinq is a web platform that aims to connect the researcher to biological databases based upon semantic web technologies. One design goal is the ability to manage and implement custom data into the data framework of The Semantic Web.

Data integration of four different data formats has been realized with the creation of custom data structures and converters. Integrated data covers varying levels of high-throughput sequencing data, represented in the BED, GFF, VCF, and SAM formats. It has been shown that the use of The Semantic Web offers a fast way to select and combine data from public databases. Obstacles preventing a widespread use of the technology still exist, including the level of knowledge needed about The Semantic Web and the databases used, a lack of tools to manage and analyze data from a semantic environment, and the incomplete state of several public databases.


Dutch version The Semantic Web is an advanced version of the World Wide Web with a focus on creating a standardized environment for distributing data. It realizes a web of data that is not limited by the diversity of applications and data formats. Technologies created for The Semantic Web are being adopted with increasing interest by public databases. Boinq is a web application that strives to make biological databases built on semantic technologies more accessible to the researcher. One of the goals of the project is to create functionality that can bring custom data into a semantic environment and manage it there.

The integration of four different data formats has been made possible by the creation of custom data structures and converters. Integrated data covers various levels of high-throughput sequencing data, as found in the BED, GFF, VCF and SAM formats. It has been shown that the use of The Semantic Web offers a fast option for selecting and combining data from public databases. Obstacles to a general use of The Semantic Web still exist, however. These include the high level of knowledge required about The Semantic Web and the datasets used, the applications available for the management and analysis of data, and the incomplete status of public datasets.

2 Introduction

Since Tim Berners-Lee invented the World Wide Web in 1989, he has continuously worked on defining and improving its construction [79]. In 1994, he founded the World Wide Web Consortium (W3C), an organization focused on generating specifications, guidelines, software and tools to improve the internet. In 2004, W3C defined the specifications of the Resource Description Framework (RDF) in its first iteration. RDF was created as a guideline and framework to optimize data interchange throughout an ever-growing web, a first step towards The Semantic Web. The specifications for RDF 1.1, the second iteration, followed in 2014 [76].

In 1965, the first nucleic acid sequence, that of a transfer RNA, was determined by Robert W. Holley and his colleagues. It was the catalyst for a boom in genetic sequence data that has continued to grow exponentially since 1995. In 2007, cost reductions in genome sequencing allowed for another significant boost in new data generation. The vast influx of genomic data has given rise to many different databases and formats, hindering the cataloging, processing and researching of data across different sources. Due to the further development and maturing of the technologies created for The Semantic Web, an increasing investment in the adoption of these technologies for genomic databases has been realized. Although semantic web integration has so far been adopted by only some databases, major bioinformatics institutes, including EMBL-EBI, are making a conscious effort to expand this technology.

Boinq [25] is a web platform that aims to serve as a connection between the researcher and The Semantic Web. It is designed to manage an RDF environment used for the integration and manipulation of data. The integration of custom genomic data into an RDF dataset is investigated during this study. Specifically, a data conversion tool for common formats such as BED, GFF, VCF and SAM has been created, and the file converters have been integrated into the functionality of boinq. A newly designed RDF structure has been outlined and elaborated on for each of the supported formats. Further design goals of boinq include a graphical interface at the front end of the program, through which files can be uploaded and relevant information retrieved; a server that converts supported data formats into triples; and a data and metadata structure in RDF, constructed following W3C standards.

To review the possibilities of the current state of public biological databases and the implementation of the converter, a case study has been defined. For this, the expression and methylation data retrieved from wild type and double knockout cancer cells (HCT116) have been analysed. With the use of custom user data and data retrieved from The Semantic Web, a list of candidate biomarkers was selected for further analysis. The complete process, done in an RDF environment, has been reviewed in the last section.


3 The Semantic Web

3.1 Introduction

The Semantic Web is the concept of an idealized data network designed by W3C. One of the main features of this web of data is the use of standardized data formats and exchange protocols that allow quick and easy access for its users. The many data formats of today's web, controlled by the different applications used over the web, inhibit a straightforward way to request, link, process and display information. The use of a standardized format will place all of this data in a shared web, under one set of rules.

3.2 What is The Semantic Web?

A typical example used to illustrate the problem with today's web is the fact that connecting related information found on the web often takes an unnecessary amount of work. Suppose a student is searching for a place to stay close to his university. A common tool to find the distance between two places is Google Maps. However, it doesn't feature every place for rent on the map. Instead, the student will have to search for different websites featuring places for rent, find their addresses and copy/paste this information into Google Maps. The only way to find out the distance from each place to his university is by transferring information manually. Why does today's web force every student encountering this problem to do the same task over and over again? Why are the addresses of these places not automatically linked with tools such as Google Maps? This would enable the service to highlight all places of interest with a simple selection tool on the map.

The problem preventing map tools from acquiring information, as in this example, is the many formats in which the desired addresses are to be found. Common possible formats are standard HTML code, Word or Excel documents. Furthermore, there is no unified way to tag an address as a place that is for rent. Finding these places would require tools that search whole files for keywords, and even then, there will be uncertainty as to whether or not the information is correct.

Data found on the web is often only linked with one another by human language. When one finds an address listed underneath the pictures of a house with an e-mail address, one will know who to contact if one is interested in seeing the house. However, although a human can interpret and link information stored on the web, a computer cannot interpret it in the same way. There would be no direct way to request information on the web by giving an address as input.

The use of a Semantic Web offers a more accessible workspace for many applications. The integration of various data formats into one furthermore enables an easier and quicker way to find possible relations in data. Resources for academic purposes and the cataloging of contents found at a particular web site, page, or digital library are just two possible examples [72]. Herein, data can be accessed by using a general web architecture. The relations of information to one another are defined in a standard way, allowing data to be shared and reused across applications, and to be processed automatically by tools as well as manually. Data can be related to one another in both ways.

The Semantic Web aims to be a unification of all data, a place that offers a more accessible workspace for its users and applications. With the unification of all data also comes the need for a single database of definitions. Data that is stored in different formats by various applications or tools is defined by the rules and environment of the application itself. To illustrate this, take the example of a website database that lists the Uniform Resource Locator (URL) of stored websites under the definition of 'address'. A map tool could represent street addresses under the same definition. A user accessing these different databases will know in which environment the listed data is defined, and will thus have no problem processing the use of the same word 'address' and understanding its different meanings. A problem arises when this user accesses a database combining both of the previous datasets, where no thought was put into the varying meanings of the word 'address'. Referring to the first example, a student querying for an email address using the home address of a location might encounter problems in finding the required information, depending on how the search tool handles the situation. Thus, creating a single environment where all data is defined and related to one another has to be supported by the creation of an organized structure that defines the boundaries between the different ontologies of a word.

Ontology is a word of Greek origin, first used in philosophical studies. It defines the nature and being of an entity. It is a more complete and distinct notion of the concept given to an entity than the word it is defined by. The study and formulation of ontologies is important in the setup of The Semantic Web, as it is a necessary element in structuring the data it stores.

A semantic web, just like the World Wide Web, is built from different databases. It contains information about movies, religions, chemicals, family trees and many other things. This information is hosted by multiple servers around the world, each with its own subject. The creation of semantic web technologies, such as the Resource Description Framework (RDF), enables interrelating data amongst different datasets. The collection of public linked datasets is known as Linked Open Data [75]. Figure 3.1 gives an overview of datasets published in Linked Open Data format, known as 'The Linking Open Data Cloud' (LOD Cloud). Each node in this cloud diagram represents a distinct data set published as Linked Data. The arcs indicate that resources are shared between two data sets [23].

3.3 RDF

The set of rules and specifications defining the framework of The Semantic Web is known as the Resource Description Framework (RDF). The first iteration of this data model was published in 2004 as RDF 1.0. The second, current iteration, known as RDF 1.1, was released in 2014 [78].

As the name suggests, RDF is the framework that aims to create the environment or universe for data in which the theoretical aspects of a semantic web are to be upheld. RDF was introduced by W3C as a means to extend the existing World Wide Web into The Semantic Web. Its creators expected a fast evolution of the Web once the specifications of the RDF model were released, but even today, the vast majority of websites still have not adopted this model [11].

Information can be derived from or about anything. Examples are the data output of a scientific experiment, the formulation of a philosophical concept, the chemical properties of substances or even an abstract concept. As outlined in the previous section about The Semantic Web, the construction of RDF is mainly focused on a seamless sharing of information across different datasets without the loss or change of meaning. Other uses of RDF, stated by W3C, include:

• Adding machine-readable information to Web pages using, for example, the popular schema.org vocabulary, enabling them to be displayed in an enhanced format on search engines or to be automatically processed by third-party applications.

• Enriching a dataset by linking it to third-party datasets. For example, a dataset about paintings could be enriched by linking them to the corresponding artists in Wikidata, therefore giving access to a wide range of information about them and related resources.

• Interlinking API feeds, making sure that clients can easily discover how to access more information.

• Using the datasets currently published as Linked Data, for example building aggregations of data around specific topics.

• Building distributed social networks by interlinking RDF descriptions of people across multiple Web sites.

• Providing a standards-compliant way for exchanging data between databases.

• Interlinking various datasets within an organisation, enabling cross-dataset queries to be performed using SPARQL, a query language used for RDF databases.

Figure 3.1: Linked datasets of the Semantic Web. The pink nodes at the bottom right denote databases concerning life sciences.

3.3.1 Structure of RDF

The general focus of the design is to combine generality and precision. By providing an ontologically neutral environment, the RDF framework is able to support the expression of data about any topic. An RDF dataset is built out of triples that can be organized into different graphs. A graph is an optional feature of a dataset which can contain part of the data. Graphs are very common amongst large datasets and can offer several advantages. One example of using graphs is to separate the data and metadata of a dataset. For genome data it is common to use different graphs as a means of separating data by species. The use of graphs can also improve performance when querying the database.

3.3.1.1 Triple

The core structure of data in RDF is called a triple. A triple is a single statement about a resource. As the name suggests, it is built up out of three parts: the subject, the predicate and the object. The subject denotes the resource; the predicate denotes a property, trait or aspect of the resource, expressing a relationship between the subject and the object; the object denotes the value of that property. Subjects and objects are represented by nodes in an RDF graph. Predicates, shown as arcs, connect these nodes and express the relationship. Code Example 3.1 shows the layout of a basic triple in the commonly used Turtle format. The end of a statement is marked by a full stop or period. Triples are in some cases written over multiple lines, so it is important to know that a period, and not a newline, signifies the end of a triple [77].

Code Example 3.1: Structure of a triple

<subject> <predicate> <object> .
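As a minimal sketch of this structure (not the representation used by boinq), a triple can be modelled in Python as a three-field record, and a dataset as a mapping from graph names to sets of triples. The class name `Triple`, the method `to_turtle` and the graph names `data` and `metadata` are assumptions made for illustration only.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A single RDF statement: subject, predicate and object."""
    subject: str
    predicate: str
    object: str

    def to_turtle(self) -> str:
        # A statement is closed by a period, not by the end of the line.
        return f"{self.subject} {self.predicate} {self.object} ."


example = Triple("<subject>", "<predicate>", "<object>")

# Hypothetical dataset layout: graph name -> set of triples stored in that graph.
dataset = {
    "data": {example},
    "metadata": set(),
}

print(example.to_turtle())
```

Storing each graph as a set reflects that a graph is an unordered collection of statements in which a repeated triple carries no extra meaning.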

Three different types of nodes exist: IRIs, blank nodes and literals.

IRIs Internationalized Resource Identifiers are a generalization of Uniform Resource Identifiers (URIs), as they allow a broader range of Unicode characters. URIs and IRIs are both used when talking about resource identifiers. This thesis will generally only refer to IRIs to avoid confusion, although both options are valid. URIs are commonly known for their subtype, the URL, as it is used extensively to navigate the World Wide Web. IRIs, on the other hand, are used as identifiers and, although it is recommended, are not necessarily accessible through the web. An IRI can sometimes be used as a web address at which a definition can be found. IRIs can identify both resources (nodes) and properties (arcs). The referent is the resource denoted by the IRI, which can be the subject, the predicate and/or the object of the triple. Code Example 3.2 contains three IRIs which complete a triple. The triple in this example expresses a specific street address belonging to a specific house. The IRIs used are constructed specifically for the example.

Code Example 3.2: A triple containing three IRIs

<http://a/house> <http://located/on> <http://a/street/address> .

Literals These are values such as strings, numbers and booleans. They are represented by a value in lexical form, optionally followed by an IRI identifying the data type of the value. The two elements are separated by two carets. Although the addition of a data type identifier is optional, it is considered good practice: through the identification of the data type, literals can easily be extracted and processed. The lexical form is always enclosed in quotes. The literal value is the resource denoted by the literal; a literal can only appear as the object of a triple. Code Example 3.3 is a variation on the previous triple. The street address is this time expressed as a string. Although the use of literals offers an easy interpretation to the user or program, it has one main limitation: a literal is not a unique identifier and can thus not be used as a subject or predicate.

Code Example 3.3: A triple containing two IRIs and a literal

<http://a/house> <http://located/on> "Coupure Links 653"^^<http://www.w3.org/2001/XMLSchema#string> .

Blank nodes These are anonymous nodes that are used when a resource is known to exist but has no identifier of its own. The existence of a blank node is known from its relation to other nodes or from the logical existence of the entity, e.g. the unknown melting temperature of a chemical substance. They are expressed as '_:', followed by a locally unique identifier. Blank nodes can appear as both the subject and the object of a triple. Code Example 3.4 gives an example of a triple including all three different kinds of nodes.


Code Example 3.4: A triple containing a blank node, an IRI and a literal

_:ahouse <http://relation/gives/street/address> "Coupure Links 653"^^<http://www.w3.org/2001/XMLSchema#string> .
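The three node types and the positions in which each may appear can be summarized in a short Python sketch. It is an illustration under simplifying assumptions: terms are classified purely by their leading characters as written in Code Examples 3.2 to 3.4, and the names `node_kind`, `ALLOWED` and `is_valid_triple` are hypothetical rather than part of any RDF library.

```python
def node_kind(term: str) -> str:
    """Classify a term by the notation used in Code Examples 3.2 to 3.4."""
    if term.startswith("<") and term.endswith(">"):
        return "iri"
    if term.startswith("_:"):
        return "blank"
    if term.startswith('"'):
        return "literal"
    raise ValueError(f"unrecognised term: {term!r}")


# Positions in which each kind of node may appear in a triple.
ALLOWED = {
    "subject": {"iri", "blank"},
    "predicate": {"iri"},
    "object": {"iri", "blank", "literal"},
}


def is_valid_triple(subject: str, predicate: str, obj: str) -> bool:
    """Check that every term is of a kind permitted in its position."""
    terms = {"subject": subject, "predicate": predicate, "object": obj}
    return all(node_kind(term) in ALLOWED[pos] for pos, term in terms.items())


XSD_STRING = "<http://www.w3.org/2001/XMLSchema#string>"
address = f'"Coupure Links 653"^^{XSD_STRING}'

# The triple of Code Example 3.4: blank node, IRI, literal.
print(is_valid_triple("_:ahouse", "<http://relation/gives/street/address>", address))
# A literal in subject position is rejected.
print(is_valid_triple(address, "<http://located/on>", "<http://a/house>"))
```

A real parser would also validate the IRI syntax and the lexical form of the literal; the sketch only captures the positional rules.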

From this point forward, in line with the general theme of this thesis, examples are taken from bioinformatics. The aim is to build up the understanding required for the main part of this research through the use of more closely related examples.

3.3.2 Vocabularies of RDF

Typically, different vocabularies are used to structure and provide semantic meaning to an RDF dataset. A vocabulary consists of a list of definitions, each of which can formalize a class or property (object or subject) or a relation (predicate). A list of definitions can also be referred to as an ontology. The term ontology is typically used for a more general or abstract collection of definitions. The term vocabulary, on the other hand, tends to be used for a list of definitions about more specific subjects. Both terms are in common use, with no clear line separating the correct usage of either.

The creation of vocabularies is open to everyone, and vocabularies are thus found at many different locations on the web. As an RDF database needs to be able to store data about every subject, it is important that definitions are unique and well defined. Terms in vocabularies are defined through IRIs. Since vocabularies can be created by anybody, a huge number of lists have been published. IRIs offer a high degree of variability, and can thus easily be chosen to be unique. Public vocabularies can feature domain names (URL) for each IRI or make the IRI directly resolvable. Since the meaning of an IRI can be hard to derive from its string, it is often useful to be able to quickly retrieve its definition. BioPortal is a web service that collects information about the available ontologies with a biological background. As the service allows users to browse a variety of vocabularies by keyword, it facilitates the search for specific ontologies [51].

Code Example 3.5 is an example of a triple stored in the public Ensembl triplestore. Not being familiar with a vocabulary beforehand can make it difficult to understand the meaning of an IRI. The subject denotes a resource from the Ensembl database. The predicate defines the instantiation of a class type. The object defines a class that features the exact position of a base pair. Thus, the object specifies what type of resource the subject is. In this example, the triple states that the resource identified by <http://rdf.ebi.ac.uk/resource/ensembl/77/chromosome:GRCh38:4:6000001:0> refers to an exact position of a base pair.

Code Example 3.5: A triple from the Ensembl database 1

<http://rdf.ebi.ac.uk/resource/ensembl/77/chromosome:GRCh38:4:6000001:0>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
    <http://biohackathon.org/resource/faldo#ExactPosition> .

Both the predicate and the object have a hash sign (#) in their name. The hash sign is used to separate the vocabulary identifier (http://biohackathon.org/resource/faldo) and the reference to a specific element of the vocabulary (ExactPosition). The first substring is named the namespace IRI, as it is the same for every definition listed by the vocabulary. The second part is called the pointer, as it points to a specific section in the vocabulary list. The retrieval of a resource on the World Wide Web through its IRI is called dereferencing.
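The split between namespace IRI and pointer can be sketched in a few lines of Python. The function name `split_iri` is hypothetical, and the sketch assumes a vocabulary that uses the hash sign as separator; vocabularies whose namespace ends in a slash would need a different rule. The hash sign is kept as part of the namespace, matching the way namespace IRIs are usually written in prefix declarations.

```python
def split_iri(iri: str) -> tuple[str, str]:
    """Split a hash-style IRI into (namespace IRI, pointer)."""
    namespace, hash_sign, pointer = iri.partition("#")
    if not hash_sign:
        raise ValueError(f"IRI has no '#' separator: {iri}")
    # Keep the '#' with the namespace so that namespace + pointer restores the IRI.
    return namespace + hash_sign, pointer


namespace, pointer = split_iri("http://biohackathon.org/resource/faldo#ExactPosition")
print(namespace)
print(pointer)
```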

It is important to note that no exact position is given by the triple. The triple only identifies the type of the subject. To state the exact position, another predicate is used, namely http://biohackathon.org/resource/faldo#position. Code Example 3.6 contains the triple assigning a value to the subject. Thus, through the use of the correct predicate, both the membership of a class (Code Example 3.5) and information about that resource (Code Example 3.6) can be stated. The subject used is an example of an IRI built up out of identity-specific parameters, such as the location of the base pair on the chromosome. Further information that can be directly derived from the string includes the species, reference genome, chromosome and strand specification.

Code Example 3.6: A triple from the Ensembl database 2

<http://rdf.ebi.ac.uk/resource/ensembl/77/chromosome:GRCh38:4:6000001:0>
    <http://biohackathon.org/resource/faldo#position>
    "6000001"^^<http://www.w3.org/2001/XMLSchema#int> .
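To illustrate how the two statements of Code Examples 3.5 and 3.6 together describe one resource, the following Python sketch groups triples by subject and predicate. It is a simplified illustration, not how a triplestore stores data: the helper name `describe` is hypothetical, IRIs are plain strings, and the xsd:int literal is represented by a Python int.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FALDO = "http://biohackathon.org/resource/faldo#"
SUBJECT = "http://rdf.ebi.ac.uk/resource/ensembl/77/chromosome:GRCh38:4:6000001:0"

# The statements of Code Examples 3.5 and 3.6 as (subject, predicate, object) tuples.
triples = [
    (SUBJECT, RDF_TYPE, FALDO + "ExactPosition"),
    (SUBJECT, FALDO + "position", 6000001),
]


def describe(statements):
    """Group statements as {subject: {predicate: [objects]}}."""
    description = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in statements:
        description[subject][predicate].append(obj)
    return description


resource = describe(triples)[SUBJECT]
print(resource[RDF_TYPE])            # class membership of the resource
print(resource[FALDO + "position"])  # value attached to the same resource
```

Because both triples share the same subject IRI, the class membership and the position end up in the same description without any further bookkeeping.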

IRIs referring to definitions from vocabularies can be quite long. As they are used often in RDF data formats, naive data formats take up a lot of storage. It is undesirable to have the full version of every IRI repeated each time it is used. To solve this problem, namespace prefixes are introduced.

3.3.2.1 Namespace prefix

A namespace prefix is a short label that, by convention, is associated with and substituted for a longer namespace IRI used in a vocabulary. Table 3.2 gives an overview of the most common namespaces that will recur throughout this thesis.

Table 3.2: A list of namespace prefixes used to substitute their longer namespace IRI variants. Listed vocabularies will be used often throughout this thesis.

Namespace prefixes
Namespace prefix   Namespace IRI
rdf                http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs               http://www.w3.org/2000/01/rdf-schema#
obo                http://purl.obolibrary.org/obo/
dcterms            http://purl.org/dc/terms/
sio                http://semanticscience.org/resource/
faldo              http://biohackathon.org/resource/faldo#
so                 http://purl.obolibrary.org/obo/
gfvo               http://biointerchange.org/gfvo/
ensembl            http://rdf.ebi.ac.uk/resource/ensembl/
xsd                http://www.w3.org/2001/XMLSchema#
void               http://rdfs.org/ns/void#
tcga               http://tcga.deri.ie/schema/

Namespace prefixes are found in exported triplestore data formats and in the SPARQL query language. Code Example 3.7 applies namespace prefixes to Code Example 3.5. Because the example is so short, no clear improvement can be seen. Namespace prefixes are used to decrease the number of characters in exported databases, which can contain millions of triples, and to simplify the notation when using SPARQL. A further elaboration on SPARQL is given in Section 3.6.

Code Example 3.7: The use of namespace prefixes in Turtle format

#HEADER
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix faldo: <http://biohackathon.org/resource/faldo#> .
#BODY
<http://rdf.ebi.ac.uk/resource/ensembl/77/chromosome:GRCh38:4:6000001:0> rdf:type faldo:ExactPosition .
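The substitution performed by a namespace prefix can be sketched in a few lines of Python. This is only an illustration and not part of any RDF library: the helper names expand and shorten are hypothetical, and the prefix table holds just the two entries of Table 3.2 that are needed here.

```python
# Prefix table taken from Table 3.2 (only the two entries needed here).
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "faldo": "http://biohackathon.org/resource/faldo#",
}


def expand(term, prefixes=PREFIXES):
    """Expand a prefixed name such as 'rdf:type' into a full IRI.

    Hypothetical helper: full IRIs written as <...> are returned without
    the angle brackets; unknown prefixes raise a KeyError.
    """
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]
    prefix, local_name = term.split(":", 1)
    return prefixes[prefix] + local_name


def shorten(iri, prefixes=PREFIXES):
    """Replace a known namespace IRI at the start of `iri` by its prefix."""
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace):
            return prefix + ":" + iri[len(namespace):]
    return "<" + iri + ">"  # no known namespace: keep the full IRI


print(expand("rdf:type"))
print(shorten("http://biohackathon.org/resource/faldo#ExactPosition"))
```

A parser applies the first function when reading a file with prefixes, a serializer applies the second when exporting one.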


RDF and RDF Schema (RDFS) The most basic vocabularies used to form the structure of an RDF dataset. They are used to classify the semantic meaning of objects, and can almost always be found in the construction of a dataset or vocabulary. The first version of RDFS was published in 1998 by W3C. Table 3.3 lists the seven most important constructs of RDFS, and probably of the whole Semantic Web. The use of these definitions covers an important part of the creation of a cohesive dataset.

Table 3.3: Most important constructs of the RDF Schema language. C and P denote classes and properties, respectively; I denotes an instance. The table is copied from [78].

RDF schema vocabulary
Construct        Syntactic form               Description
Class            C rdf:type rdfs:Class        C is an RDF class
Property         P rdf:type rdf:Property      P is an RDF property
type             I rdf:type C                 I is an instance of C
subClassOf       C1 rdfs:subClassOf C2        C1 is a subclass of C2
subPropertyOf    P1 rdfs:subPropertyOf P2     P1 is a sub-property of P2
domain           P rdfs:domain C              domain of P is C
range            P rdfs:range C               range of P is C

Web Ontology Language (OWL) OWL is a language designed by W3C for the creation of ontologies. It is used to define properties of, and relationships between, the terms of ontologies. OWL can be considered an extension of RDFS, and has been designed to further extend the accessibility of web resources to automated processes [70].

Simple Knowledge Organization System (SKOS) SKOS is a vocabulary used for the knowledge organization of ontologies or concepts. SKOS has been designed by W3C for the broader purpose of indexing and classifying data structures [71].

Ontology for Biomedical Investigations (OBI) A vocabulary created by The OBO Foundry as a collaborative, international effort to annotate biomedical protocols, instrumentation and data generated in research. The vocabulary has just over 3000 definitions [8].

DCMI Metadata Terms (dcterms) A vocabulary created by the Dublin Core Metadata Initiative (DCMI). It features specifications for a wide array of metadata terms. Although the vocabulary lists only 102 definitions, it is extensively used by other public vocabularies [27].

Semanticscience Integrated Ontology (SIO) The Semanticscience foundation is the creator of the SIO vocabulary. It provides definitions giving a rich description of objects, processes and their attributes. Ensembl and DisGeNET are two examples of projects that have integrated the SIO vocabulary.

Feature Annotation Location Description Ontology (FALDO) The FALDO vocabulary is made to describe the position of sequence features in the genome. It can be used to annotate regions that are described in different file formats, including the General Feature Format (GFF3), Variant Call Format (VCF) and BED format. It does not itself contain vocabulary to describe the features of those regions [15].

Sequence Ontology (SO) The SO vocabulary was originally created by the Gene Ontology Consortium. It aims to be a collection of ontologies used to describe and annotate features of a biological sequence. SO has a wide variety of contributors, such as the GMOD community and the Sanger Institute [29].

Genomic Feature and Variation Ontology (GFVO) The GFVO vocabulary was created to aid the conversion of non-RDF genome analysis data into RDF resources. The vocabulary is created by BioInterchange, which builds tools for the conversion of genome data into RDF data. GFVO has been used in the creation of a VCF data structure [9].

Ensembl/Uniprot The European Bioinformatics Institute (EMBL-EBI) made dedicated vocabularies for the conversion of its databases into a linked RDF data structure. EMBL-EBI provides many well-known databases for genome data which are featured as Linked Data, including Ensembl, UniProt, ChEMBL and Expression Atlas [22].

XML Schema Definition (XSD) The XSD schema consists of terms which are used to describe the structure and datatypes of Extensible Markup Language (XML) documents. In RDF, it is typically used in conjunction with literals to specify their datatype.

The Vocabulary of Interlinked Datasets (VoID) Another creation of W3C is the VoID vocabulary. It focuses on definitions concerned with the metadata of RDF datasets. VoID descriptions cover uses ranging from data discovery and cataloging to archiving of datasets [73].

The Cancer Genome Atlas (TCGA) The Cancer Genome Atlas project was created to get a deeper understanding of the molecular basis of cancers through genome analysis. The data has also been published as RDF data, for which the TCGA vocabulary was created. TCGA contains a collection of annotated genomes, and information about cancer patients and their treatment process. All data is anonymous [81].

3.4 Linked Data

The Semantic Web is a web of data, a unified structure supporting all data through the use of Semantic technologies such as RDF. To achieve this goal, a standardized data format was introduced and backed up by vocabularies to support the structure necessary for interpretation of data resources. But the aim of The Semantic Web is not simply to link data within a dataset. The framework described is likewise able to support relationships between data from different datasets. The collection of interrelated datasets is known as Linked Data. It enables integration and reasoning across multiple datasets.

For a dataset to be considered Linked Data, it has to be built out of logical constructs that can be interpreted by semantic web tools. These constructs include the RDFS vocabulary, which was later extended with the creation of OWL and SKOS. These tools can be seen as vocabularies that are used to define other vocabularies, as they define relationships between definitions. These connections between definitions can subsequently create new logical links between data. This process is referred to as inference [75].

3.4.1 Inference

Inference is the creation and discovery of new relations as a result of automatic procedures by The Semantic Web. The creation of these new relationships between resources is a result of the logic expressed through core semantic web technologies such as RDFS, OWL and SKOS. Vocabularies are in fact hierarchical constructs that create a classification of the resources. Through the existence of logical relations amongst the definitions of a vocabulary, inference is possible. To give a better view on how these connections are formed, an example of RDFS inference is given, using constructs from Table 3.3.

Code Example 3.8: A database introducing inference.

#HEADER
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix obo: <http://purl.obolibrary.org/obo/> .

#BODY
obo:SO_0000704 rdf:type rdfs:Class .
obo:SO_0000704 rdfs:subClassOf obo:SO_0001411 .

Table 3.5: The descriptions of the definitions used in Code Example 3.8. The descriptions were found using the IRI as a web address.

Vocabulary definitions
IRI               Description
obo:SO_0000704    A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions.
obo:SO_0001411    A region defined by its disposition to be involved in a biological process.

Code Example 3.8 defines the ontology term used to annotate a gene, using the SO vocabulary. The term is defined to be a class (and is thus used as a subject or object), and is a subclass of obo:SO_0001411. Through inference, every member of the class obo:SO_0000704 is also processed as a member of the class obo:SO_0001411. This is an extremely simple example, and more complex relationships can exist between different ontologies. The RDFS vocabulary, of which some constructs are given in Table 3.3, is, next to OWL and RDF, the main building block for creating relations within or across different vocabularies. It is important to point out that the structure defining the relations between different ontologies is defined in the vocabulary files themselves. Vocabularies are exported under .rdf or .owl extensions using an XML format.
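The subclass rule described above can be sketched in plain Python. The sketch is a simplification under stated assumptions: triples are plain tuples of prefixed names rather than objects of an RDF library, only the rdfs:subClassOf rule is implemented, and the instance ex:gene1 is hypothetical (it does not occur in Code Example 3.8).

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Return the input triples plus every rdf:type triple implied by
    rdfs:subClassOf, repeating until nothing new is found."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS_OF]
        for instance, p, cls in list(inferred):
            if p != RDF_TYPE:
                continue
            for sub, sup in subclass_pairs:
                new_triple = (instance, RDF_TYPE, sup)
                if sub == cls and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


data = {
    ("obo:SO_0000704", RDF_TYPE, "rdfs:Class"),
    ("obo:SO_0000704", SUBCLASS_OF, "obo:SO_0001411"),
    # Hypothetical instance, not part of Code Example 3.8.
    ("ex:gene1", RDF_TYPE, "obo:SO_0000704"),
}

# Only the triples that were not stated explicitly.
new_triples = infer_types(data) - data
print(new_triples)
```

The single new triple states that ex:gene1 is also a member of obo:SO_0001411, although this was never written down explicitly. The loop repeats so that chains of subclasses are followed as well.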

Inferencing can be more than just the creation of new links between resources. Constructs exist that define that two resources with specific overlapping properties can be considered identical. This can be used to merge properties from resources existing as two nodes, to fill in blank nodes or to conclude that two resources are one and the same. As vocabularies are used for many public databases, the creation of new relationships is possible over different datasets [69]. In a similar way, inconsistencies between datasets can be detected if theoretically overlapping nodes show differences.

Figure 3.2: Zoomed view on the linked datasets of the Semantic Web. The pink nodes denote databases about life sciences.

3.4.2 Linked databases

Linked Data is also a term used to describe a recommended practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using IRIs and RDF. A database must uphold certain key aspects to be considered part of Linked Data, as discussed by Tim Berners-Lee [12]:

• Use URIs/IRIs as names for things

• Use HTTP URIs/IRIs so that people can look up those names.

• Provide useful information, using the standards (RDF/RDFS/..., SPARQL) when someonelooks up a URI/IRI.

• Include links to other URIs/IRIs, so that people can discover more things.

Linked Data that is open to the public is called Linked Open Data. Figure 3.2 gives a close-up view of the life science databases of the Linked Open Data Cloud [23]. The links between the datasets show that data from one node has direct connections with data from another node. The specific requirement for a link to be represented between two databases is that at least 50 triples have an IRI originating from the other database. An arrow pointing to a database means that the database from which the arrow emanates has triples referring to the database at which the arrow points. Each dataset has a SPARQL endpoint, which is a specific web address to which SPARQL queries can be sent. The following list contains a set of important contributors to Linked Data in the life science domain. Every dataset has its own set of vocabularies, which are needed to successfully navigate through its data. SPARQL endpoints are listed in Table 3.7.

Bio2RDF The largest network of Linked Data for life sciences is constructed by the Bio2RDF project. It was created by an individual party with a first release in 2010. The network is a collection of multiple datasets converted to homogeneous RDF resources. The third release of Bio2RDF, dating from July 2014, consisted of a total of 11 billion triples across 35 datasets. Featured datasets include dbSNP, NCBI, OMIM, GenAge, PubMed and LSR. The Linked Data is not kept up to date with the featured datasets (last update: 2014). Bio2RDF makes use of multiple SPARQL endpoints to browse different datasets [17]. Although Bio2RDF converts a variety of public databases, it has not been used in this thesis, as the datasets are not maintained or checked for errors.

Ensembl Ensembl is a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute. It features the assembled annotations of genomes from multiple species. The project also includes comparative genomics, variations and regulatory data [30]. The datasets are maintained and updated by the Ensembl team and are thus a viable source of information.

The Universal Protein Resource (UniProt) EMBL-EBI, in collaboration with the Swiss Institute of Bioinformatics (SIB), created UniProt. The database is focused on assembling protein knowledge and annotation data. It consists of different parts: the UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef) and UniProt Archive (UniParc). Protein data can be listed under two distinct datasets: TrEMBL and Swiss-Prot. TrEMBL is a collection of computationally analyzed and unreviewed data. After this data has been reviewed and annotated, it is listed in Swiss-Prot [6]. The datasets are maintained and updated by the UniProt team and are thus a viable source of information.

Other EBI Resources EMBL-EBI features a variety of other datasets in RDF. All of these feature a public SPARQL endpoint and web platform [32]. These include BioModels, BioSamples, ChEMBL, Expression Atlas and Reactome. Most of these datasets are still in development; they are thus subject to downtime, do not yet feature a complete dataset, and can be expected to change in data structure.

The Cancer Genome Atlas (TCGA) TCGA is a coordinated effort to collect and further knowledge of the molecular basis of cancer. It is a collection of large-scale genome sequences. The project is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). The database is split up into a public dataset and a licensed private dataset. The private dataset connects the anonymous public data to personal data from the individuals analyzed.


DisGeNET Human gene-disease associations (GDA) obtained from different sources are collected and published by DisGeNET, which also features an RDF store. The project aims to collect all information about genetic diseases found on different levels. The stored data is split up into three important compartments: curated data, predicted data and data from literature. Curated data is the integration of GDA found on expert sites such as UniProt, The Comparative Toxicogenomics Database (CTD) and ClinVar. Predicted data comes from GDA found in rats and mice that are predicted to exist in humans. The literature data is a collection of GDA found by text mining of publications. DisGeNET furthermore offers a dataset combining the previous three collections through a scoring system, which can be found on their site [34].

Table 3.7: A list of some important SPARQL endpoints in the field of life sciences. Distinct data collections can be separated by using different SPARQL endpoints (e.g. TCGA and Bio2RDF) or by using different graphs (e.g. DisGeNET).

SPARQL endpoints
Database                 SPARQL endpoint
Bio2RDF (PubMed)         http://pubmed.bio2rdf.org/sparql
Ensembl                  http://wwwdev.ebi.ac.uk/rdf/services/ensembl/sparql
UniProt                  http://sparql.uniprot.org/
TCGA (Bladder Cancer)    http://vmlion14.deri.ie/node42/8082/sparql
DisGeNET                 http://rdf.disgenet.org/lodestar/sparql
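To illustrate what sending a query to one of these web addresses means, the sketch below builds the request URL defined by the SPARQL protocol for a GET request, in which the query text is passed as the URL-encoded parameter query. The helper name build_query_url and the query itself are illustrative assumptions; the endpoint is the UniProt row of Table 3.7. Nothing is sent over the network, as the listed endpoints may have moved or gone offline.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query_url(endpoint, query):
    """Build the GET URL for a SPARQL query: the query text is
    URL-encoded and passed as the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


ENDPOINT = "http://sparql.uniprot.org/"  # UniProt row of Table 3.7
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"  # illustrative query

url = build_query_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

Opening such a URL in a browser, or with any HTTP client, is what a SPARQL query editor does behind the scenes.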

3.4.2.1 Schema of data structure

Linked Open Data is used for many databases. Furthermore, it is a community that keeps growing as more and more projects become involved in the conversion of data into the RDF model. A problem with the current status of The Semantic Web is that only a small group of people is familiar with its concepts. The use of the SPARQL query language offers many advantages, which are further discussed in Section 3.6. But to use SPARQL effectively, an understanding of the underlying structure of the data is needed. Since the conversion of datasets into an RDF framework is facilitated by the use and sometimes creation of specific vocabularies, there is no unified way in which databases are structured. To help the community gain an understanding of how data is structured in an RDF environment, schemas representing the links between data are published.

Figure 3.3 is a representation of the data structure of the Linked TCGA database. The data is built around a central node, which refers to a specific case (anonymous patient) for which data is available. The arrows indicate relations with different classes or IRIs (circles) and values or literals (squares). The predicate connecting two nodes is written along the arc. In Section 3.6 SPARQL queries are constructed to illustrate how data can be obtained. These queries were constructed using this schema.

3.5 RDF data management

RDF data is commonly stored and accessed through triplestores, which are tools for RDF data management. Triplestores can offer many functions, such as the creation of a SPARQL endpoint for the data that is stored. Navigation tools can also be integrated, providing a way to navigate through data by presenting all nodes a specific resource is connected with. A variety of commercial and non-commercial triplestores are available, competing with each other to attain the best performance, storage capacity and functionality. The demand for these tools is increasing, as good data management is key to companies dealing in Big Data. Facebook and Google are just two examples of companies that have based their own technologies and query languages upon the principles of The Semantic Web to manage their data.


Figure 3.3: TCGA schema representing the structure of the Linked Data. Circles and squares represent IRIs and literals, respectively. Predicates linking two nodes together are displayed on the arcs. The schema is taken from their site [49].

3.5.1 RDF formats

The simplest storage method is the storage of RDF data as flat files accessible through the web. The following list covers the most common formats used to display and export RDF data. All of these are text-based formats, constructed to be readable and editable by humans [78].

N-Triples The most basic format to distribute triples is N-Triples. The format does not support namespace prefixes. Triples displayed in the first part of this chapter (Code Examples 3.1-3.6) are examples of N-Triples. The N-Triples format is not common anymore, as it has been superseded by the Turtle format. The file extension for RDF data stored as N-Triples is .nt.

Turtle The Turtle format is an extension of N-Triples which introduces support for namespace prefixes, lists and shorthands. The Turtle format uses the file extension .ttl. Databases can contain billions of triples, which show a high amount of repetition in their IRIs. By using the Turtle format instead of the N-Triples format, significantly smaller files can be obtained when exporting datasets. Some changes introduced in Turtle are shown in Code Example 3.9. @base on line 1 works much like a namespace prefix: it lists an IRI against which every relative IRI in the dataset (one that does not start with a scheme such as http:) is resolved. Lines 6-8 provide a shorthand for a set of triples with the same subject. When a statement ends with ";", the subject is implicitly repeated for the next predicate and object. Line 7 furthermore shows a shorthand for rdf:type, which can be substituted by the shorter a.


Code Example 3.9: The Turtle format

1 @base <http://ugent.resources.be/bioinformatics/rdf-database/> .
2 @prefix obo: <http://purl.obolibrary.org/obo/> .
3 @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
4 @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
5
6 <a/gene/object#1>
7     a obo:SO_0000704 ;
8     rdfs:label "BRCA1"^^xsd:string .

TriG The TriG format is considered an extension of Turtle, introducing the ability to specify multiple graphs in the RDF dataset. Multiple graphs can be used to create divisions between large chunks of data in the dataset. It is common practice to specify the metadata of the featured data in a separate graph. The graph syntax is introduced in Code Example 3.10. The TriG format uses the file extension .trig.

Code Example 3.10: The TriG format

@base <http://ugent.resources.be/bioinformatics/rdf-database/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Metadata about the dataset lives in its own graph.
GRAPH <graph/containing/metadata> {
    <this/dataset>
        a void:Dataset ;
        dcterms:title "The Trig Format"^^xsd:string ;
        dcterms:modified "2015-12-13"^^xsd:date .
}
# The actual data is stored in a second graph.
GRAPH <graph/containing/data> {
    <a/gene/object#1>
        a obo:SO_0000704 ;
        rdfs:label "BRCA1"^^xsd:string .
}

N-Quads N-Quads is an extension of the N-Triples format that introduces a fourth element to every line. The extra element bears the name of the graph to which the triple statement belongs. The file extension for the N-Quads format is .nq.

JavaScript Object Notation for Linked Data (JSON-LD) The JSON-LD format is used to represent Linked Data in JSON format. JSON is a format used in web-based programming and services. Moving from JSON to JSON-LD allows web-based resources to be added to the RDF framework with minimal changes. The file extension for the JSON-LD format is .jsonld.

RDF/XML RDF data can also be represented by means of XML syntax. RDF/XML was the first format in which RDF was introduced; it was published as a W3C Recommendation in 1999. N-Triples followed two years later, in 2001. The file extension for the RDF/XML format is .rdf.
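The difference between the two line-based formats of this list, N-Triples and N-Quads, can be made concrete with a short Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, literals are passed as (lexical form, datatype IRI) tuples, and escaping of special characters is not handled. The triple is the one from Code Example 3.6; the graph name is invented for illustration.

```python
def term(value):
    """Serialize one term: IRIs in angle brackets, literals passed as
    (lexical form, datatype IRI) tuples. Escaping is not handled."""
    if isinstance(value, tuple):
        lexical, datatype = value
        return '"%s"^^<%s>' % (lexical, datatype)
    return "<%s>" % value


def to_ntriples(s, p, o):
    """One N-Triples line: subject, predicate, object, full stop."""
    return " ".join([term(s), term(p), term(o), "."])


def to_nquads(s, p, o, graph):
    """One N-Quads line: as N-Triples, with the graph name as fourth element."""
    return " ".join([term(s), term(p), term(o), term(graph), "."])


subject = "http://rdf.ebi.ac.uk/resource/ensembl/77/chromosome:GRCh38:4:6000001:0"
predicate = "http://biohackathon.org/resource/faldo#position"
obj = ("6000001", "http://www.w3.org/2001/XMLSchema#int")
graph = "http://example.org/graph/data"  # hypothetical graph name

print(to_ntriples(subject, predicate, obj))
print(to_nquads(subject, predicate, obj, graph))
```

Both formats spell out every IRI in full on every line, which is exactly the repetition that Turtle and TriG avoid through prefixes and shorthands.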

3.5.2 Triplestores

Triplestores are Database Management Systems (DBMS) for RDF data, used to store, access and manage data. They have the ability to store trillions of triples, spread over multiple datasets. A key feature of any triplestore is the creation of a SPARQL endpoint, which can be public or private. A SPARQL endpoint is an access point through which the stored data can be queried. These SPARQL endpoints are public for all of the databases discussed previously. With the steady and exponential increase of collected information, making data analysis tools as performant as possible has been the main focus of many DBMS. The UniProt Linked Data, to give an example, counted a staggering 18,957,878,319 triples in the release of December 2015. Applications for RDF data have the following key functions:

RDF Parser/Serializer The RDF parser is able to transform data from one format into the RDF data model and vice versa. RDF parsers need to be able to interpret all the RDF formats to have a functional triplestore. RDF serializers are necessary to convert stored data into different RDF files.

RDF Store The RDF store is able to store and retrieve data according to the principles of The Semantic Web. An important property of the RDF store is the performance of processing SPARQL queries. At the core of a performant triplestore is the indexing of data.

RDF Query Engine The RDF query engine interprets and executes SPARQL queries. After translating the prompted query, the triplestore is able to retrieve the fitting information.

Linked Open Data has the advantage that datasets are available for download. Acquiring data to load into one's own triplestore can offer advantages such as performance gains and offline availability. Most triplestores are web applications that can easily be run and accessed on external servers. The applications featured all have a graphical user interface through which SPARQL queries can be built and datasets can be managed. The following server applications were used throughout this thesis:

Fuseki Fuseki is the SPARQL server of Apache Jena, a free and open source Java framework for building Semantic Web and Linked Data applications. Fuseki can be run as a stand-alone server or as a web application via Apache Tomcat [5].

Blazegraph Blazegraph is a triplestore released by Systap, featuring both commercial and open-source licensing. Customers can pay for support and development subscriptions that offer up-to-date releases and hotfixes before they are released as open source [65]. Blazegraph was awarded the Big Data Startup Award in 2015 for their innovative work on GPU-accelerated graph analytics [66]. It offers the same basic functionalities as Fuseki, but has been found to run faster with larger datasets.

3.6 SPARQL

SPARQL is a recursive acronym for SPARQL Protocol And RDF Query Language. It was created by W3C as a standardized query language for RDF data, in response to the several different RDF query languages that were available in the early days of The Semantic Web. SPARQL integrates a collection of different functionalities, part of which will be discussed in the following section.

SPARQL bears many resemblances to the triple layout used in the Turtle, N-Triples and TriG formats discussed earlier. It is a powerful tool to retrieve information, discover new relationships (inferencing) and compare data over multiple datasets. One of the main benefits of The Semantic Web, with its homogenized web of data, is the ability for queries to retrieve connections between nodes that can be distant from one another. Through the usage of SPARQL, it becomes easy to create, for example, a list of houses for rent in a certain area. Extra information, such as telephone numbers or rent prices, can be added to the request with a slight change to the query [74].

3.6.1 SPARQL syntax

SPARQL elements will be explained using the public TCGA database. It offers an opportunity for a simple query buildup, as the data structure is mainly built upon the TCGA vocabulary. Queries are performed on the database for bladder cancer. The SPARQL endpoint used can be found in Table 3.7. SPARQL endpoints sometimes feature a user interface with a SPARQL query editor. The following examples can be copied into the query editor that is shown when accessing the endpoint through a browser. Figure 3.4 gives a close-up of the part of the data structure that is going to be used. The predicates used are displayed on the arcs connecting two nodes. The central green circle is the unique IRI every patient is given.

Figure 3.4: Zoomed view on the TCGA schema representing the RDF data structure. Circles and squares represent IRIs and literals, respectively. Predicates linking two nodes together are displayed on the arcs. The schema is taken from their site [49].

3.6.1.1 Example 1

Code Example 3.11 is a SPARQL query for the TCGA database which lists the ID, gender and vital status upon analysis of a patient. The results of the query can be found in Table 3.9. The query can be broken down into multiple elements.

Code Example 3.11: SPARQL Ex. 1, Endpoint: http://vmlion14.deri.ie/node42/8082/sparql

1 PREFIX tcga: <http://tcga.deri.ie/schema/>
2
3 SELECT DISTINCT ?patientIRI ?barcode ?gender ?status
4 WHERE {
5     ?patientIRI
6         tcga:gender ?gender ;
7         tcga:bcr_patient_barcode ?barcode ;
8         tcga:vital_status ?status .
9 }
10 LIMIT 4

PREFIX - The PREFIX clause is used for the declaration of a namespace prefix that will be used in the body of the query.

SELECT - The response of the query is specified through the SELECT clause. Elements of interest can only be variables. Variables are defined as strings starting with a question mark.

DISTINCT - An additional parameter given to the SELECT clause that removes duplicate elements or duplicate combinations of elements.

WHERE - The body of the query is contained by the WHERE clause. Without further clauses, the body is evaluated against the data of the local database, i.e. the data linked to the SPARQL endpoint. The body is always enclosed in curly brackets; the WHERE keyword itself is optional.

LIMIT - To limit the number of results to a given integer, LIMIT is used. This statement comes after the body of the query. LIMIT is mainly used to spare the triplestore from processing heavy queries when not all possible results are needed.

Lines 5-8 of Code Example 3.11 form the body of the query. SPARQL queries are built from triple patterns, which mix defined elements and undefined elements (variables). Every combination of values in the data that matches the patterns is a result, stored in the variables. This means that the subjects and objects of all triples having tcga:bcr_patient_barcode as a predicate are stored in ?patientIRI and ?barcode, respectively. Conceptually, the patterns are evaluated sequentially, so the values bound to a variable by one pattern carry over to the following patterns. A result is only kept if every triple pattern finds a match, i.e. none of the variables is left without a value. This also means that once a variable contains a set of results, these can only be reduced by subsequent patterns. A shorthand can be used for triples sharing the same subject, as seen in the example. LIMIT is used to finish the query; without it, the result would contain the information of all patients in the dataset.
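The matching logic described above can be imitated with a small Python sketch. It is a conceptual model only, not the algorithm of any actual triplestore, which would rely on indexes instead of nested loops. The function names are hypothetical, the first two patients are taken from Table 3.9, and the third patient (EXAMPLE-0000, without a vital status) is invented to show how an incomplete match is dropped.

```python
def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`.
    Returns the extended binding, or None if they do not match."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):            # variable
            if result.setdefault(p_term, t_term) != t_term:
                return None                   # conflicts with earlier value
        elif p_term != t_term:                # constant must be identical
            return None
    return result


def evaluate(patterns, triples):
    """Evaluate triple patterns one after another; a solution survives
    only if every pattern finds a matching triple."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


P1 = "http://tcga.deri.ie/TCGA-HD-8314"
P2 = "http://tcga.deri.ie/TCGA-CQ-5333"
P3 = "http://tcga.deri.ie/EXAMPLE-0000"  # hypothetical, no vital status
triples = [
    (P1, "tcga:gender", "MALE"),
    (P1, "tcga:bcr_patient_barcode", "TCGA-HD-8314"),
    (P1, "tcga:vital_status", "Alive"),
    (P2, "tcga:gender", "MALE"),
    (P2, "tcga:bcr_patient_barcode", "TCGA-CQ-5333"),
    (P2, "tcga:vital_status", "Dead"),
    (P3, "tcga:gender", "FEMALE"),
    (P3, "tcga:bcr_patient_barcode", "EXAMPLE-0000"),
]
patterns = [
    ("?patientIRI", "tcga:gender", "?gender"),
    ("?patientIRI", "tcga:bcr_patient_barcode", "?barcode"),
    ("?patientIRI", "tcga:vital_status", "?status"),
]

results = evaluate(patterns, triples)
for row in results:
    print(row["?barcode"], row["?gender"], row["?status"])
```

The third patient matches the first two patterns but not the last one, and therefore does not appear in the result.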

Table 3.9: The result given by Code Example 3.11.

SPARQL result for Code Example 3.11
patientIRI                          barcode         gender    status
http://tcga.deri.ie/TCGA-HD-8314    TCGA-HD-8314    MALE      Alive
http://tcga.deri.ie/TCGA-CQ-5333    TCGA-CQ-5333    MALE      Dead
http://tcga.deri.ie/TCGA-BB-8601    TCGA-BB-8601    MALE      Alive
http://tcga.deri.ie/TCGA-CV-5435    TCGA-CV-5435    MALE      Dead

3.6.1.2 Example 2

Code Example 3.12 is a SPARQL query that lists the ID, latest vital status and optionally the days until death (after first diagnosis) of all male patients. The results of the query can be found in Table 3.11. Three new clauses are presented in the query.

Code Example 3.12: SPARQL Ex. 2

1 PREFIX tcga: <http://tcga.deri.ie/schema/>
2
3 SELECT DISTINCT ?patientIRI ?barcode ?vital ?death {
4 SERVICE <http://vmlion14.deri.ie/node42/8082/sparql> {
5     ?patientIRI
6         tcga:gender "MALE" ;
7         tcga:bcr_patient_barcode ?barcode ;
8         tcga:follow_up [tcga:vital_status ?vital] .
9
10     OPTIONAL{?patientIRI tcga:follow_up [tcga:days_to_death ?death] }
11
12 }}
13 ORDER BY ?death
14 LIMIT 4

SERVICE - The SERVICE clause is used to access external datasets. The use of SERVICE enables the user to submit queries from a SPARQL tool of choice to any SPARQL endpoint that is open for public querying. The main advantage of SERVICE comes with the ability to call upon multiple SPARQL endpoints in one query, enabling data analysis over multiple datasets. The clause comes with the declaration of a SPARQL endpoint. The query executed on the endpoint is defined within curly brackets.

OPTIONAL - Variables and statements defined within the optional field won't restrict the result set if no data is available. A blank field is returned when no value can be found. OPTIONAL triples are placed between curly brackets.

ORDER BY - Results can be ordered by a variable through the modifier ORDER BY. By default, the results are displayed in ascending order. To return the values in descending order, DESC() is added, e.g. ORDER BY DESC(?death). The use of ORDER BY requires the engine to retrieve all possible results in a dataset, even when LIMIT is used in the query.

Code Example 3.12 introduces a new shorthand construct on line 8 and line 10. Square brackets are useful for navigating through nodes when one wants to link distant variables with one another. The square brackets take the place of an intermediate node and contain the predicate and object that are linked to that node. The queried data can be restricted by defining triples with fixed values, as shown on line 6. The SPARQL language features a multitude of other clauses, e.g. filters using regular expressions and graph selection, which enhance the functionality and power of the language. The full documentation is available online [74].
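The behaviour of OPTIONAL, ORDER BY and LIMIT can be imitated in Python to show why ordering needs the complete result set. This is a conceptual sketch, not how a SPARQL engine is implemented: the function names are hypothetical, four patients are taken from Table 3.11, and the patient EXAMPLE-0000 is invented. None marks a variable without a value, which SPARQL places first in ascending order.

```python
# Required part of the query: barcode -> latest vital status.
# "EXAMPLE-0000" is a hypothetical patient; the others are from Table 3.11.
vital = {
    "EXAMPLE-0000": "Dead",
    "TCGA-CV-5435": "Dead",
    "TCGA-CQ-5333": "Dead",
    "TCGA-CV-5432": "Alive",
    "TCGA-CQ-5334": "Dead",
}
# Optional part: only deceased patients have a days_to_death value.
days_to_death = {
    "EXAMPLE-0000": 3000,
    "TCGA-CV-5435": 2318,
    "TCGA-CQ-5333": 341,
    "TCGA-CQ-5334": 128,
}


def optional_join(required, optional):
    """Keep every required row; ?death stays None when no value exists."""
    return [
        {"barcode": b, "vital": v, "death": optional.get(b)}
        for b, v in required.items()
    ]


def order_by_death(rows):
    """Ascending order with unbound (None) values first."""
    return sorted(rows, key=lambda r: (r["death"] is not None, r["death"] or 0))


rows = optional_join(vital, days_to_death)

# Sort all rows first, then keep four of them.
sorted_then_limited = order_by_death(rows)[:4]
# Truncating before sorting keeps a different set of patients.
limited_then_sorted = order_by_death(rows[:4])

print([r["barcode"] for r in sorted_then_limited])
print([r["barcode"] for r in limited_then_sorted])
```

Only the first variant reproduces the rows of Table 3.11. Cutting the result set before sorting would return whichever four patients happen to be found first, which is why the engine has to collect all results before applying the limit.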

Table 3.11: The result given by Code Example 3.12.

SPARQL result for Code Example 3.12
patientIRI                          barcode         vital status    days to death
http://tcga.deri.ie/TCGA-CV-5432    TCGA-CV-5432    Alive
http://tcga.deri.ie/TCGA-CQ-5334    TCGA-CQ-5334    Dead            128
http://tcga.deri.ie/TCGA-CQ-5333    TCGA-CQ-5333    Dead            341
http://tcga.deri.ie/TCGA-CV-5435    TCGA-CV-5435    Dead            2318

4 Boinq

4.1 IntroductionThe continuous development of technologies created for The Semantic Web have made it a morefinished and attractive asset each year. Because of this, an increased interest has been shownin adapting biological datasets into an RDF structure. A considerable amount of data is alreadyfeatured by a variety of databases, albeit data requested over The Semantic Web is still lacking.

Boinq is an open source platform that leverages The Semantic Web to share, organize and combine sequence based information. It furthermore aims to be a tool through which user data can be injected onto The Semantic Web. A boinq installation can be used locally, and is intended to interoperate with other public endpoints.

4.2 Design

The boinq platform intends to ease the organization of genome annotations irrespective of the original data format. Organizing these data includes importing from widely used file formats, recombining, and uploading into a general purpose triplestore that can be exposed as a SPARQL endpoint. The platform should be able to recombine local data with data from public databases, and should leverage existing sources of sequence based information and existing ontologies. The design requirements of the tool were as follows:

• The system should be accessible as a web platform. This allows easy communication between different endpoints for data sharing. Web platforms also allow multiple users to access and use one instance.

• The platform should be able to use a triplestore of choice and not impose a given framework or technology. Because triplestores differ in functionality, a given triplestore might not be optimal for every user. Triplestores are furthermore the key software in ensuring fast and optimized data storage and retrieval; software that is fully functional today can lack features, be slow or become outdated, and thus become obsolete in the future.

• The platform should import and convert genomic data into a triplestore. The Semantic Web is built on the RDF framework, which means that the data should be represented as triples. Implemented functionalities for data analysis are largely adapted to this data structure.

• Recombining data from different sources should be possible without manual query writing. Boinq wants to bring The Semantic Web to the user without requiring extensive knowledge of the matter. Using The Semantic Web unassisted requires know-how on a variety of matters, including SPARQL, RDF and the RDF schemas of remote endpoints.

• The platform should support visualization of stored information. To further user-friendliness and functionality, boinq aims to provide a graphical visualization of stored or queried data. Genomic data is conventionally represented graphically, as such a representation offers an easy way to browse and interpret the data. Creating this functionality requires considerable work, and options to integrate third party tools are considered in Section 5.4.3.

4.2.1 Data unification

The integration of data into an RDF environment has been realized and adapted in boinq as discussed in Chapter 5. In general, two approaches are available. Either a custom vocabulary is created that defines the structure of the data, or existing vocabularies are used and combined. Although a completely custom data structure gives the advantage of complete adaptability to the functionality of boinq, it would not share structural elements with genomic data stored in public RDF stores. To ensure the alignment of the data structure with public efforts, the use of existing vocabularies is preferred.

Furthermore, the conversion of data comes with the effort to create a standard schema that is adaptable to every data format. It is advantageous to follow a unified approach for usability and for the simplification of automated processing. Figure 4.1 gives a representation of the schema that was introduced to support a unified data structure. In general, every entry of a file represents a feature of a certain type with an identity/label and attributes. Attributes can also pertain to the entry instead of the feature. Features can hold relations with other features introduced by other entries. Every feature is bound to a location. More information about the different formats, data implementation and conversion is given in Chapter 5. To unify the representation of features as triples, the following principles were adopted:

• Entry nodes are introduced to add information to a feature that is inherent to the file format rather than a biological attribute.

• Attributes pertaining to the file entry are separated from attributes pertaining to the feature it represents.

• rdf/rdfs is used to describe the entity of a node.

• Location on the reference is represented by FALDO terms.

• Genomic feature types, attributes and relations are represented by SO terms.
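The principles above can be illustrated with a minimal Python sketch that turns one feature into triples. The FALDO terms (location, Region, begin, end, position, reference, ExactPosition and the strand position classes) and the SO accession 0000704 (gene) are existing vocabulary terms, but the http://example.org/ IRIs and the function names are assumptions made for illustration only. The sketch does not reproduce the schema of Figure 4.1 and omits the entry node.

```python
FALDO = "http://biohackathon.org/resource/faldo#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
XSD = "http://www.w3.org/2001/XMLSchema#"
SO = "http://purl.obolibrary.org/obo/SO_"
EX = "http://example.org/"  # hypothetical base IRI, not boinq's namespace


def feature_triples(feature_id, so_accession, label, reference, begin, end, forward=True):
    """Return (subject, predicate, object) tuples for one feature.

    IRIs are plain strings; literals start with a double quote.
    """
    feature = EX + "feature/" + feature_id
    region = feature + "/location"
    strand = FALDO + ("ForwardStrandPosition" if forward else "ReverseStrandPosition")
    triples = [
        (feature, RDF + "type", SO + so_accession),      # feature type as an SO term
        (feature, RDFS + "label", '"%s"' % label),       # label via rdfs
        (feature, FALDO + "location", region),           # location via FALDO
        (region, RDF + "type", FALDO + "Region"),
        (region, FALDO + "begin", region + "/begin"),
        (region, FALDO + "end", region + "/end"),
    ]
    for name, position in (("begin", begin), ("end", end)):
        node = region + "/" + name
        triples += [
            (node, RDF + "type", FALDO + "ExactPosition"),
            (node, RDF + "type", strand),
            (node, FALDO + "position", '"%d"^^<%sinteger>' % (position, XSD)),
            (node, FALDO + "reference", EX + "reference/" + reference),
        ]
    return triples


def to_ntriples(triples):
    """Serialize the tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        obj = o if o.startswith('"') else "<%s>" % o
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines)


triples = feature_triples("EDEN", "0000704", "EDEN", "ctgA", 1050, 9000)
print(to_ntriples(triples))
```

Keeping the location in its own FALDO region node means that a location query does not need to know the file format a feature was imported from.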

4.2.2 Data organization

Boinq offers a way to manage data stored in a triplestore. Data can be introduced that is obtained from both the user and The Semantic Web. To properly differentiate between different sets of data, a hierarchical structure has been introduced in the organization of data as available to the user. Different SPARQL endpoints, used as a source of annotation information, are attributed to a data source. This can be referred to as a work space. A data source consists of different tracks, a track being a collection of features organized along a common theme. Data sources are used to hold all the data from, e.g., one project. Within a data source, tracks divide the data from a project into different graphs. An example is to store the genomic data from different organisms in different tracks. Other examples are the allocation of the varying data types found in the different data formats, or the allocation of data obtained during different steps of a research project. Metadata of the different tracks is stored as triples in a single graph for every data source. Metadata consists of information about the original and converted data, and about the capabilities of the endpoint. These include the operations that can be conducted for data analysis. A custom vocabulary has been created to handle the high level, domain specific abstractions used by boinq's query builder.

Data analysis through recombination with the use of SPARQL and Linked Open Data is the main functionality of boinq. A graphical approach to the recombination of data from different tracks is being worked on. To represent the network architecture of The Semantic Web, tracks are represented by nodes that can be dragged onto a diagram. Nodes are therefore sources of features. These can be filtered by applying criteria that depend on the endpoint, and are customizable by modifying the metadata for the track. The following criteria are supported:

Figure 4.1: A representation of the general data structure as implemented in the RDF framework. Every connection of a node to another node represents a triple. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the type of the entity they are connected to; orange nodes represent objects linked over multiple entries.

Location - Limit features to a certain location, such as a reference sequence, strand or genomic region. Location filters can be either explicit or derived from the location of a feature in the database.

FeatureType - Limit features to a certain type; metadata includes the available types in the track.

MatchTerm - Limit features to those linked to a given term from a target ontology. Metadata includes a path expression linking the feature entity to the term and information about the target terms.

MatchInteger - Limit features to those linked to an integer value that matches a given value or lies within a given range. Metadata includes a path expression from the feature entity to the integer value.

MatchDecimal - Similar to the integer match, but for decimal values rather than integer values.

MatchString - Limit features to those linked to a certain string value. Options are either exact matches or regular expressions matching a substring. Both are case-insensitive. Metadata includes a path expression from the feature entity to the string value.
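As an illustration of how criteria like MatchInteger and MatchString could translate into SPARQL, the following Python sketch builds FILTER fragments using the standard SPARQL 1.1 functions regex, lcase and str. It is only a sketch under stated assumptions: the function names and the variable names are hypothetical, and it is not the query builder of boinq, which is written in Java on top of Apache Jena.

```python
def sparql_string(value):
    """Escape a Python string as a SPARQL string literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def match_integer(var, exact=None, low=None, high=None):
    """FILTER for an integer that equals a value or lies within a range."""
    if exact is not None:
        return "FILTER (?%s = %d)" % (var, exact)
    parts = []
    if low is not None:
        parts.append("?%s >= %d" % (var, low))
    if high is not None:
        parts.append("?%s <= %d" % (var, high))
    if not parts:
        raise ValueError("give an exact value or at least one bound")
    return "FILTER (%s)" % " && ".join(parts)


def match_string(var, text, regex=False):
    """Case-insensitive FILTER: exact match or regular expression."""
    if regex:
        # The "i" flag makes the regular expression case-insensitive.
        return 'FILTER regex(str(?%s), %s, "i")' % (var, sparql_string(text))
    # Exact match: compare lower-cased forms of both sides.
    return "FILTER (lcase(str(?%s)) = %s)" % (var, sparql_string(text.lower()))


print(match_integer("score", low=10, high=50))
# FILTER (?score >= 10 && ?score <= 50)
print(match_string("name", "EDEN"))
# FILTER (lcase(str(?name)) = "eden")
print(match_string("name", "^ed", regex=True))
# FILTER regex(str(?name), "^ed", "i")
```

In a full query, the variable would first be bound through the path expression stored in the track metadata, after which the FILTER restricts the bound values.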

4.3 Comparison to other frameworks

Boinq has functionalities that are similar to those of existing projects. The analysis of independent projects can advance the development of boinq, and these projects are briefly discussed here. Some examples are used both for the implementation of data and for the evaluation of the constructed framework.

4.3.1 Biological query building

Frameworks exist to help build SPARQL queries for biological data. Biogateway [3] presents example SPARQL queries as easily adaptable templates. SPARQLGraph [61] is a service that allows the construction of integrative queries over biological databases using a GUI. These tools are very useful for advancing the use of The Semantic Web among bioinformaticians, yet they do not focus on the specific use case of managing sequence based information. Indeed, boinq does not primarily aim to generate SPARQL queries, but rather uses SPARQL as a tool to reorganize sequence annotations.

The mentioned tools could be used to exploit sequence based information, such as that offered by a boinq endpoint, in refined integrative queries.

4.3.2 Semantic access to sequence information

sparql-bed [13] and sparql-vcf [14] are tools for direct SPARQL querying of the BED and VCF formats, respectively. These tools only make use of a location schema in the RDF framework using FALDO. Each feature is represented by its entry, and no information beyond its location can be extracted. Thus, no queries involving attributes or labels of features can be performed. Both tools are useful for performing a quick command line query; however, they do not provide a SPARQL endpoint, nor do they allow for any data management.

Integration of sequence based information is also a goal of Biointerchange [10], a paid tool developed by Codamono. The service supports the integration of VCF and GFF files into an RDF framework. Data integration has been constructed such that conversion is possible both from GFF/VCF to RDF and back. No features that enhance the connectivity of integrated data with The Semantic Web have been published.

4.4 Material and methods

Research and work done on the creation of custom data structures for next generation sequencing has been directly applied in the development of boinq. Thus, a fully functional tool for the conversion of the four data formats has been created. Development was carried out using the following tools, in a Windows 7 environment:

Development environment. Coding on boinq has mainly been performed using Spring Tool Suite (STS), an Eclipse based environment for the development of programs and tools. STS offers a multitude of functionalities for building, debugging and coding, and is documented for development with a JHipster stack. Version 3.7.1 has been used during development.

Architecture and Libraries. Boinq has been built using the JHipster stack [36], enabling the use of state of the art components and best practices for web applications. It eases the deployment of industry standard frameworks and provides complex functionality like security, caching and logging out of the box. In the server application, several technologies are combined. HTSJDK [59] and Jannovar are used for handling flat file access. Apache Jena [4] is used for programmatic query building, as a SPARQL and SPARQL/Update client, and for generating Java classes as shorthand for ontology terms. Quartz is used for asynchronous job handling. Elda is currently being implemented to offer the triplestore data as resolvable URIs. Boinq uses a local database to store information necessary for the web application, such as known data sources and tracks, users and credentials, and stores metadata and actual data in a triplestore using SPARQL 1.1.

Vocabularies. To allow for seamless communication between the platform and the triplestore, two independent vocabularies have been constructed and implemented. Ontology building was performed using the Protege software [50]. Protege is a free, open source ontology building tool with a variety of functionalities that enable an intelligent framework. The created ontologies serve two main purposes: structuring data and implementing metadata. These vocabularies are named the 'format' and 'track' vocabulary, respectively. The format vocabulary is used in the construction of a data structure from varying data formats. The track vocabulary is used to construct the metadata according to the functionality of boinq. Practical uses of these vocabularies are laid out in Chapter 5. The vocabularies can be viewed and downloaded from the GitHub repository [24]. Protege version 5.0.0 beta 17 has been used to construct the ontologies.

Triplestores. As elaborated in Section 3.5.2, Blazegraph and Fuseki were both used for development and research purposes. Blazegraph versions 1.3.4 to 2.1.0 and Fuseki version 2 have been used. Blazegraph was run on both a local machine and a server.

Server. A remote version of boinq and Blazegraph (2.1.0) was run on a server for the execution of the use case and to verify correct functioning with big data files. The server was provided by Genohm and is a virtualized CentOS version 6.7 machine with eight cores (Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz) and 32 GB of RAM assigned.

Programming languages. Boinq has mainly been written in Java (server-side) and JavaScript (client-side), with a local installation of Java 8 Update 60. Python version 3.5.0 and SQLite version 3.11.1 were used for data file manipulation, analysis and conversion. Distribution curves of the methylation and expression data (as explained in Chapter 6) have been created in RStudio version 0.99.896 with a local installation of R for Windows version 3.2.4.

Version management. GitHub has been used for the management of different versions and the creation of a remote backup. A local installation of Git version 2.5.1 is used. The repository is located at https://github.com/Kleurenprinter/boinq2.

5 Genomic Data Implementation

5.1 Introduction

The introduction of second generation sequencing technologies, such as Roche 454 and Illumina, has induced exponential growth in the acquisition of genomic data. These technologies have caused a rapid reduction in the cost per megabase of sequencing information [46]. In 2007, the National Center for Biotechnology Information (NCBI) began the collection of raw sequencing data from these platforms. NCBI was followed by the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). In 2009, The Sequence Read Archive (SRA) was established as a central database to help the research community gain access to sequencing data for scientific purposes [38]. The total amount of raw sequence data contained in the SRA is currently more than 3.6 petabases.

Since the development of second generation sequencing, the amount of genomic data has historically doubled every seven months [64]. Figure 5.1 shows the growth of genomic data, the total number of human genomes sequenced and the data capacity. Several projects, such as the 1000 Genomes project, TCGA and the Exome Aggregation Consortium (ExAC), contributed large amounts of data to the scientific community. Future growth of data is displayed according to several predictions: the historical growth rate, representing a doubling of all data every seven months; an estimate given by Illumina [56]; and the doubling of data capacity every two years, as stated by Moore's Law [64].

5.2 Genomic Data

Genomic data is a heterogeneous collection of data. A variety of elements, such as the Coding DNA Sequence (CDS), translated proteins and variations in nucleotides, can be retrieved. Acquired genomic data goes through different stages of processing. It requires sorting through multiple levels of data from Next Generation Sequencing (NGS), containing probabilities, calculations and parameters.

The high variety of genomic data is reflected in the existence of specialized public databases and data formats. Although an effort has been made to list references from one database to another, the mining of these datasets requires, in most cases, user-made scripts.

Data acquisition and distribution by end-users are highly heterogeneous and are performed with different data formats, coding languages and tools. Considering that analytical studies of genomic data can exceed the capabilities of modern computers, the improvement of current techniques and the creation of new tools for faster and specialized data analysis have received a great deal of attention. Several data formats exist, each created and designed according to different standards and design goals.

Figure 5.1: A plot portraying the growth of DNA sequencing, showing both the annual sequencing capacity (right) and the total number of human genomes. Important contributors, such as the 1000 Genomes, TCGA and ExAC projects, are shown at their respective launch dates. Three future growth estimates are given: red following the historical growth rate, orange following the Illumina estimate and blue following Moore's Law. [64]

With the advent of public databases supporting RDF data, an opportunity arises to remove the boundaries that exist when calling upon data from multiple public databases. In order to make Linked Open Data accessible and useful to end-users, the conversion of the most commonly used file formats has been integrated in boinq. Since most data is not available in RDF, a multitude of converters have been introduced. This feature helps users convert their data, sparing them from having to find third-party tools. The conversion of data by boinq also offers the advantage that the RDF data is created according to a data structure design that fits the logic of the data analysis queries constructed by boinq. The current version of boinq integrates the conversion of five different file formats: BED, GFF3, GTF, SAM and BAM. These formats will be discussed here.

5.2.1 Browser Extensible Data format

The UCSC Genome Browser is a web tool displaying annotations and features mapped across the length of a specified chromosome. It is developed and maintained by the Genome Bioinformatics Group of the University of California Santa Cruz (UCSC) [62]. The database contains assembled genomes of all sequenced species. The Browser Extensible Data (BED) format has been developed to represent genomic features and annotations, displayed through this web tool, in a concise and flexible way. It is a tab delimited text format that supports up to twelve columns, of which only the first three are obligatory [68]. Code Example 5.1 gives a representation of the data structure in a BED format. A more complete description of the BED format can be found on the UCSC Genome Browser website (https://genome.ucsc.edu/FAQ/FAQformat.html#format1).

Code Example 5.1: Example of a BED file

browser hide all
chr2 178707289 178707561 Hs.666133 0 + 178707289 178707561 0 3 85,177,8, 0,87,264,
chr1 178709699 178711955 Hs.377257 0 + 178709699 178711955 0 1 2256, 0,
chr1 178711404 178712057 Hs.688767 0 - 178711404 178712057 0 1 653, 0,
chr2 178777793 178778272 Hs.541631 0 - 178777793 178778272 0 1 479, 0,
chr2 178908612 178916376 Hs.318775 0 + 178908612 178916376 0 4 2644,3067,464,1588, 0,2645,5712,6176,

The three required fields are, in order:

1. chrom - The chromosome of the given region. Chromosomes and contigs are typically denoted with the prefix 'chr' and 'ctg', respectively, although the prefix can be omitted.

2. chromStart - The start position of the feature. Base pair counting starts at 0: the 0-based coordinate system numbers the positions between nucleotides.

3. chromEnd - The end position of the feature. Because positions are numbered between nucleotides, the interval is half-open: the base at index chromEnd is not part of the feature.

An additional nine fields can be added. Fields cannot be skipped: when a field is used, all preceding fields must be filled in as well.

4. name - The name, label or ID under which a feature is commonly specified.

5. score - Features displayed in the Genome Browser are given a gray value. This is stored in the BED file as a score ranging from 0 to 1000. This field is often used to store experimentally derived information about a feature.

6. strand - A value being either ’+’ or ’-’, representing that the annotation is found on the forwardor backward strand, respectively.

7. thickStart - The coordinate at which the Genome Browser displays the feature as a solidrectangle.

8. thickEnd - The coordinate at which the Genome Browser stops displaying the feature as asolid rectangle.

9. itemRgb - The RGB color value that is used as an alternative to the gray-value score.

10. blockCount - The number of sub-elements in a feature, e.g. the number of exons in a gene.

11. blockSizes - The size of the specified sub-elements.

12. blockStarts - The start positions of the specified sub-elements, relative to chromStart. They are listed in the same order as the sizes in the blockSizes field.
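The twelve fields listed above map naturally onto a small parser. The following Python sketch is an illustration only: the function names are hypothetical, fields are split on any whitespace (so names containing spaces are not supported), and it is not the converter used by boinq. It also shows how the 0-based, half-open block coordinates translate into 1-based, inclusive positions.

```python
BED_FIELDS = ("chrom", "chromStart", "chromEnd", "name", "score", "strand",
              "thickStart", "thickEnd", "itemRgb", "blockCount",
              "blockSizes", "blockStarts")
INT_FIELDS = {"chromStart", "chromEnd", "score", "thickStart", "thickEnd", "blockCount"}
LIST_FIELDS = {"blockSizes", "blockStarts"}


def parse_bed_line(line):
    """Parse one BED feature line into a dict; return None for non-feature lines."""
    if not line.strip() or line.startswith(("browser", "track", "#")):
        return None
    values = line.split()
    if len(values) < 3:
        raise ValueError("a BED line needs at least chrom, chromStart and chromEnd")
    record = {}
    for field, value in zip(BED_FIELDS, values):
        if field in INT_FIELDS:
            record[field] = int(value)
        elif field in LIST_FIELDS:
            # Lists may carry a trailing comma, hence the "if v".
            record[field] = [int(v) for v in value.split(",") if v]
        else:
            record[field] = value
    return record


def blocks_one_based(record):
    """Absolute block coordinates as 1-based, inclusive (start, end) pairs."""
    starts = record.get("blockStarts", [0])
    sizes = record.get("blockSizes", [record["chromEnd"] - record["chromStart"]])
    return [(record["chromStart"] + start + 1, record["chromStart"] + start + size)
            for start, size in zip(starts, sizes)]


line = ("chr2 178707289 178707561 Hs.666133 0 + 178707289 178707561 "
        "0 3 85,177,8, 0,87,264,")
record = parse_bed_line(line)
print(blocks_one_based(record))
# [(178707290, 178707374), (178707377, 178707553), (178707554, 178707561)]
```

The end of the last block equals chromEnd, which reflects that a half-open 0-based end coordinate and a 1-based inclusive end coordinate have the same numerical value.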

5.2.2 Generic Feature Format

The Generic Feature Format (GFF) is a tab delimited text format used for storing DNA, RNA and protein features. GFF files are broadly used for exporting genomic data from public databases such as UniProt and Ensembl. The file format gives a concise representation of genome data from specific regions. Different versions of GFF are in use, the latest and most complete being GFF3. GFF3 is an extension of the GFF2 format, developed to address its predecessor's shortcomings [31]. The Gene Transfer Format (GTF) is a format derived from GFF. It was developed between the creation of GFF2 and GFF3, and is therefore sometimes referred to as GFF2.5. The format is highly similar to GFF3, with only small variations. One example is the different naming of features in the attribute list.

GFF3 files feature a fixed number of nine columns in which the information is represented. Columns and rows are separated by tabs and newlines, respectively. Empty fields are denoted with a period. Code Example 5.2 gives a representation of a generic GFF file. A more complete description of GFF3 with some examples can be found on several websites, e.g. http://www.sequenceontology.org/gff3.shtml.


Code Example 5.2: Example of a GFF file

##gff-version 3.2.1
##sequence-region ctgA 1 1497228
ctgA example gene 1050 9000 . + . ID=EDEN;Name=EDEN;Note=Protein Kinase
ctgA example mRNA 1050 9000 . - . ID=EDEN.1;Parent=EDEN;Name=EDEN.1
ctgA example five_prime_UTR 1050 1200 . + . Parent=EDEN.1
ctgA example CDS 1201 1500 . + 0 Parent=EDEN.1
ctgA example CDS 3000 3902 . + 0 Parent=EDEN.1
ctgA example CDS 5000 5500 . + 0 Parent=EDEN.1
ctgA example CDS 7000 7608 . + 0 Parent=EDEN.1
ctgA example three_prime_UTR 7609 9000 . + . Parent=EDEN.1
ctgA example mRNA 1050 9000 . + . ID=EDEN.2;Parent=EDEN;Name=EDEN.2
ctgA example five_prime_UTR 1050 1200 . + . Parent=EDEN.2
ctgA example CDS 1201 1500 . + 0 Parent=EDEN.2
ctgA example CDS 5000 5500 . + 0 Parent=EDEN.2
ctgA example CDS 7000 7608 . + 0 Parent=EDEN.2
ctgA example three_prime_UTR 7609 9000 . + . Parent=EDEN.2
...

The header of GFF files can contain information about the version of the format and other metadata. Header lines start with a double hash sign (##) and are not required. The nine required fields of the body of the file are, in order:

1. seqname - The name given to the chromosome or scaffold. Chromosomes and contigs are typically denoted with the prefix 'chr' and 'ctg', respectively, although the prefix can be omitted.

2. source - The name of the database, project or program that annotated the given feature.

3. feature - The type of the feature, e.g. Gene, Exon.

4. start - The start position of the feature. Base pair counting starts at 1: the 1-based coordinate system numbers the nucleotides directly.

5. end - The end position of the feature.

6. score - The score of a given feature. A variety of scores can be chosen from, such as an E-valuefor sequence similarity or a P-value for gene prediction.

7. strand - A value being either ’+’ or ’-’, representing that the annotation is found on the forwardor backward strand, respectively.

8. phase - A value of '0', '1' or '2', which can also be interpreted as the frame of the feature. A value of '0' indicates that a codon starts at the first base of the feature, a value of '1' that it starts at the second base, and a value of '2' that it starts at the third base. The phase is used for Coding DNA Sequence (CDS) features.

9. attribute - A list of values separated by a semicolon defining additional information about thefeature. The length of the list is undefined.

ID - Indicates the ID of a feature. An ID, unlike a name or label, is a unique value within the GFF file. IDs are used as the reference name when connecting features through Parent or Derives_from.

Name - Carries the label of the feature. It does not have to be a unique value.

Alias - A secondary or alternative name of a feature. Genes commonly have multiple labels.

Parent - Indicates the relationship between two features: the feature is part of the feature it points to. The value given with Parent is always the ID of another feature.

Target - Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The Target field is followed by four fields: target ID, start, end and, optionally, strand.

Gap - Linked to the Target field, the Gap field contains a CIGAR string to indicate the gaps in the alignment.

Derives_from - A temporal relationship of one feature to another. This field is used to distinguish the structural relation given through the Parent field from a temporal relation, as needed for polycistronic genes.

Note - A comment given on the feature.

Dbxref - Contains a database cross reference.

Is_circular - A flag indicating whether the feature is circular. It can be used for features representing a bacterial genome.
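The nine columns and the attribute list described above can be read with a few lines of Python. This is a minimal sketch with hypothetical function names, not the converter used by boinq; it assumes tab-separated input and decodes the URL-style escaping that GFF3 uses for reserved characters in attribute values. The last function shows the shift from the 1-based GFF3 coordinates to a 0-based, half-open BED interval.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqname", "source", "feature", "start", "end",
                "score", "strand", "phase", "attribute")


def parse_attributes(text):
    """Split the ninth column into a dict of tag -> list of values."""
    attributes = {}
    if text == ".":
        return attributes
    for pair in text.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Multiple values for one tag are separated by commas.
        attributes[tag] = [unquote(v) for v in value.split(",")]
    return attributes


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for header and comment lines."""
    if not line.strip() or line.startswith("#"):
        return None
    values = line.rstrip("\n").split("\t")
    if len(values) != 9:
        raise ValueError("a GFF3 feature line has exactly nine columns")
    record = dict(zip(GFF3_COLUMNS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["score"] = None if record["score"] == "." else float(record["score"])
    record["phase"] = None if record["phase"] == "." else int(record["phase"])
    record["attribute"] = parse_attributes(record["attribute"])
    return record


def to_bed_interval(record):
    """Convert 1-based inclusive coordinates to a 0-based, half-open interval."""
    return record["seqname"], record["start"] - 1, record["end"]


line = "ctgA\texample\tmRNA\t1050\t9000\t.\t+\t.\tID=EDEN.1;Parent=EDEN;Name=EDEN.1"
record = parse_gff3_line(line)
print(record["attribute"]["Parent"])  # ['EDEN']
print(to_bed_interval(record))        # ('ctgA', 1049, 9000)
```

Storing every attribute value as a list keeps tags with a single value and tags with several values, such as a feature with multiple parents, in the same shape.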

5.2.3 Variant Call Format

The Variant Call Format (VCF) was introduced with the appearance of large scale genotyping and sequencing projects. It was initially specified by the 1000 Genomes Project, an international study launched in 2008 that aimed to create a database of human genetic variation on a global scale. The project aimed to offer a more complete understanding of the effect on the phenotype of genomic differences such as Single Nucleotide Polymorphisms (SNP), Copy Number Variations (CNV) and structural variations. Since VCF was created to store the results of whole genome sequencing, VCF files are often very large, containing several millions of reads [19]. VCF portrays nucleotide variations found in HTS data compared to a reference genome.

The Variant Call Format is, just like the BED and GFF formats, a tab delimited text file format, listing all genetic variations of a sequenced genome. Unlike the BED and GFF formats, no complete sequences have to be listed. VCF permits the creation of custom fields, as there are no strict rules on the number of columns. Custom fields are specified in the header in a predefined format. The new elements introduced with every new version and the addition of customized fields ensure the appropriate amount of information for every read. Although a broad customization of the format is possible, the first eight fields are fixed. Empty fields are denoted with a period [1]. The extension and maintenance of the VCF file format has been integrated with SAMtools, a suite of programs for interacting with high-throughput sequencing data [60]. The official and more complete description of the latest VCF format (version 4.3, visited December 2015) can be found on the GitHub repository of the SAMtools project (http://samtools.github.io/hts-specs/VCFv4.3.pdf).

Code Example 5.3: Example of a VCF file

##fileformat=VCFv4.0
##fileDate=20100610
##source=glfTools v3
##reference=1000GenomesPilot-NCBI36
##phasing=NA
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Mapped Reads">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=NUYR,Description="Variant in non-unique Y region">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Depth">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number alleles in called genotypes">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
Y 27284 rs2058276 T C 32 . AC=2;AN=2;DB;DP=182 GT:GQ:DP 0|0:48:1
Y 27342 . G A,C 31 . AC=1;AN=2;DP=196;NS=63 GT:GQ:DP 0|2:48:3
Y 27432 . C T 25 . AC=1;AN=2;DP=275;NS=66 GT:GQ:DP 0|0:3:6
Y 27467 . A G 34 . AC=2;AN=2;DP=179;NS=64 GT:GQ:DP 1|1:2:7
Y 27779 . T A 67 . AC=1;AN=2;DP=225;NS=67 GT:GQ:DP 1|0:48:4
Y 27825 rs2075640 A G 38 . AC=1;AN=2;DB;DP=254;H2;NS=66 GT:GQ:DP 0|0:17:2
Y 27837 . G A 51 . AC=1;AN=2;DP=217;NS=67 GT:GQ:DP 0|1:48:3
...

The header can hold a variety of information, with no restrictions on its length or format. The only rule is that entries start with a double hash sign ('##'). The header is meant to be interpreted by the user, providing metadata about the file and specifications of the information given with every read. Except for fileformat, which is mandatory, all fields are optional. Common fields include, but are not limited to:

fileformat - The version of the file format.

fileDate - The date on which the file was created.

source - The source from which the data was retrieved.

contig - A specification of the species and the assembly used.

INFO - The INFO fields specify and describe the keys used to attach results or values of tests and attributes to reads.

FILTER - The FILTER fields specify and describe the filters used for the quality control of reads.

FORMAT - The FORMAT fields specify and describe the keys used to attach results or values of tests and attributes to samples.

The body of the format consists of a minimum of eight fixed fields, given in sequential order:

1. CHROM - The name given to the chromosome on which the read was registered. Chromosomes and contigs are typically denoted with the prefix 'chr' and 'ctg', respectively, although the prefix can be omitted.

2. POS - The position on the reference genome. VCF files are ordered by increasing position number and use a 1-based coordinate system.

3. ID - The identifier used for the given base pair or set of base pairs. SNPs featured in the dbSNP database are commonly annotated with an rs number.

4. REF - The reference base, which can be one of A, C, G, T or N. Multiple bases are permitted.

5. ALT - The alternate base. This can be a comma separated list when multiple alternate alleles were called. Strings made up of A, C, G, T, N and * are permitted; '*' is used for alleles that are missing due to upstream deletions.

6. QUAL - A quality score for the assertions made in ALT, expressed as a Phred-scaled quality score.

7. FILTER - Reads can be evaluated by filters that qualify them based on their Phred score. 'PASS' is given to reads that passed all filters. The codes of all filters that the read failed are listed in a semicolon separated list, e.g. q10 means that the quality (Phred score) of the site is below 10. Filters can be described in the header.

8. INFO - A field listing additional information. The information, separated by semicolons, is attributed to keys that are defined in the header. Although custom keys can be added by the user, a large list of keys has been defined by the format. Attribute keys can be described in the header. These include, but are not limited to:

AA - ancestral allele
AC - allele count in genotypes, for each ALT allele, in the same order as listed
AF - allele frequency for each ALT allele, in the same order as listed; used when estimated from primary data, not from called genotypes
AN - total number of alleles in called genotypes
DB - dbSNP membership
DP - combined depth across multiple samples
H2 - membership in HapMap 2
H3 - membership in HapMap 3
MQ - RMS mapping quality
NS - number of samples
1000G - membership in 1000 Genomes
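As an illustration of the INFO column and the Phred-scaled QUAL value described above, the following Python sketch parses an INFO string and converts a Phred score into an error probability using Q = -10 log10(P). The function names are hypothetical and the sketch is not part of boinq. Values are kept as strings, since their type is only declared in the header.

```python
def parse_info(text):
    """Parse the INFO column into a dict; flags (keys without a value) map to True."""
    info = {}
    if text == ".":
        return info
    for item in text.split(";"):
        key, separator, value = item.partition("=")
        # A key without "=" is a flag such as DB or H2.
        info[key] = value.split(",") if separator else True
    return info


def phred_to_error_probability(qual):
    """Invert Q = -10 * log10(P): the probability that the call is wrong."""
    return 10 ** (-qual / 10)


info = parse_info("AC=1;AN=2;DB;DP=254;H2;NS=66")
print(info["AC"], info["DB"], info["DP"])
# ['1'] True ['254']
print(phred_to_error_probability(30))
```

A QUAL of 30 therefore corresponds to a probability of about one in a thousand that the assertion made in ALT is wrong.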

Additionally, genotype information might be present. This data is preceded by a FORMAT column, which specifies how every item in the subsequent columns is to be interpreted. Keys used in the FORMAT column are generally specified in the header of the file. Genotype information can be spread over multiple columns, each representing one of the samples the data has been obtained from. Commonly used items are:

GT - Genotype specification of the sample. ’0’ is used for the reference sequence and ’1’, ’2’, ... for the alternative sequences, in the order listed in ALT. The values for different alleles are separated by ’/’ or ’|’, meaning that the genotype is unphased or phased, respectively.

DP - The read depth at the position of the specified sample.

PL - The phred-scaled genotype likelihoods rounded to the closest integer.

GP - The phred-scaled genotype posterior probabilities.

GQ - The conditional genotype quality.

HQ - The haplotype qualities, two phred scores separated by a comma.

EC - A comma separated list of expected alternate allele counts.
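The column layout described above can be illustrated with a minimal Python sketch. It is not the parser used by boinq; the function name `parse_vcf_line`, the sample names and the example line are assumptions made for illustration only, and header parsing is left out.

```python
def parse_vcf_line(line, sample_names):
    """Split one tab-delimited VCF data line into its fixed and optional fields."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = cols[:8]

    # INFO: semicolon-separated items; flags such as DB have no '=value' part.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value.split(",") if sep else True

    record = {
        "CHROM": chrom,
        "POS": int(pos),  # 1-based position
        "ID": None if vid == "." else vid,
        "REF": ref,
        "ALT": [] if alt == "." else alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": [] if filt == "." else filt.split(";"),
        "INFO": info_dict,
        "SAMPLES": {},
    }

    # FORMAT (column 9) names the colon-separated items of every sample column.
    if len(cols) > 9:
        keys = cols[8].split(":")
        for name, column in zip(sample_names, cols[9:]):
            record["SAMPLES"][name] = dict(zip(keys, column.split(":")))
    return record


# Illustrative line with hypothetical sample names.
line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
        "\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51")
record = parse_vcf_line(line, ["sample1", "sample2"])
print(record["INFO"]["AF"], record["SAMPLES"]["sample2"]["GT"])
```

All values are kept as strings except POS and QUAL; a fuller parser would use the header definitions to assign the correct type to each INFO and FORMAT key.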

5.2.4 Sequence Alignment/Map format

The Sequence Alignment/Map (SAM) format is a data format introduced and used by SAMtools. It stores the alignments of short DNA sequence reads, retrieved from HTS, to a reference genome and is used for the post-processing of this data. This low-level data is commonly used for the calculation of variants. SAMtools can be used to perform a variety of operations on the format, including, but not limited to, alignment viewing, sorting, indexing, data conversion and extraction. Since SAM files can be up to several tens of gigabytes in size, the Binary Alignment/Map (BAM) format was introduced, which is a more compact alternative to the SAM format [40].

Code Example 5.4: Example of a SAM file

@SQ SN:ref LN:45
@SQ SN:ref2 LN:40
r001 163 ref 7 30 8M4I4M1D3M = 37 39 TTAGATAAAGAGGATACTG * XX:B:S,12561
r002 0 ref 9 30 1S2I6M1P1I1P1I4M2I * 0 0 AAAAGATAAGGGATAAA * H0:i:12
r003 0 ref 9 30 5H6M * 0 0 AGCTAA *
r004 0 ref 16 30 6M14N1I5M * 0 0 ATAGCTCTCAGC *
r003 16 ref 29 30 6H5M * 0 0 TAGGC *
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *
x1 0 ref2 1 30 20M * 0 0 aggttttataaaacaaataa *
x2 0 ref2 2 30 21M * 0 0 ggttttataaaacaaataatt *
x3 0 ref2 6 30 9M4I13M * 0 0 ttataaaacAAATaattaagtctaca *
x4 0 ref2 10 30 25M * 0 0 CaaaTaattaagtctacagagcaac *
x5 0 ref2 12 30 24M * 0 0 aaTaattaagtctacagagcaact *
x6 0 ref2 14 30 23M * 0 0 Taattaagtctacagagcaacta *

The SAM format is a TAB-delimited text format consisting of an optional header section and an alignment section. The header consists of lines starting with ’@’ and holds metadata about the alignment data. Each alignment entry consists of 11 mandatory fields and one optional field. Although all mandatory fields have to be present, a placeholder value can be used to indicate that no information is available. This placeholder differs per field. The official and more complete description of the SAM/BAM format (version 1.0, visited December 2015) can be found on the GitHub repository of the SAMtools project (https://samtools.github.io/hts-specs/SAMv1.pdf). The eleven required fields are:

1. QNAME - Query template name. Reads having an identical QNAME are regarded to come from the same template.

2. FLAG - The FLAG is an integer value composed of 12 bits, each of them conveying information about the read depending on its 0/1 value.

3. RNAME - Reference sequence name. This field gives the name of the reference sequence to which a given read is aligned.

4. POS - The leftmost mapping position of the first matching base on the reference sequence. BAM uses a 0-based coordinate system while SAM uses a 1-based coordinate system.

5. MAPQ - Mapping quality. MAPQ gives the quality of the alignment between the given sequence and the reference sequence. The score is Phred-scaled and derived from the probability that the mapping position is wrong.

6. CIGAR - The CIGAR string is a compact way of representing how different parts of the given sequence align with the reference sequence.

7. RNEXT - Reference sequence name of the alignment of the next read in the template. This field is used for reads that belong to a template with multiple segments, such as paired-end reads. ’*’ denotes that the information is unavailable, and ’=’ means that the reference sequence name is the same as the one given in RNAME.

8. PNEXT - Position of the alignment of the next read in the template. This field is used for reads that belong to a template with multiple segments. ’0’ is used when the information is unavailable.

9. TLEN - Signed observed template length. No position

10. SEQ - Segment sequence. SEQ contains the read from the HTS which is aligned to the reference genome. A ’*’ denotes that the sequence is not stored and a ’=’ denotes that a base is identical to the reference base.

11. QUAL - Base quality. Gives a string of Phred-scaled scores calculated from the probability that a given base in SEQ is wrong. The string has the same length as the SEQ string. When no information is available, a ’*’ value is used.

The optional field follows the TAG:TYPE:VALUE format. The TAG is a unique two-character string that serves as the key identifying which information is given. Lower-case keys are reserved for end users, who can define the description of a given tag in the header. The TYPE is a single case-sensitive letter defining the format of the information given in the VALUE field. The VALUE field contains the data, which can be either a single value or a vector of values, as defined by the TYPE field.
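As a minimal sketch of how the fields described above could be read, the following Python fragment splits one alignment line and decodes the FLAG bits. The bit values follow the SAM specification, but the short labels in `FLAG_BITS`, the function name `parse_sam_line` and the output layout are assumptions chosen for illustration; this is not the boinq converter.

```python
# Bit values as defined by the SAM specification; the labels are illustrative.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse one tab-delimited SAM alignment line (not a header line)."""
    cols = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(FIELDS, cols[:11]):
        record[name] = int(value) if name in INT_FIELDS else value

    # Every set bit in FLAG contributes one label.
    record["FLAG_NAMES"] = [label for bit, label in FLAG_BITS.items()
                            if record["FLAG"] & bit]

    # Optional fields follow the TAG:TYPE:VALUE layout.
    record["TAGS"] = {}
    for field in cols[11:]:
        tag, value_type, value = field.split(":", 2)
        record["TAGS"][tag] = (value_type, value)
    return record


line = "r001\t163\tref\t7\t30\t8M4I4M1D3M\t=\t37\t39\tTTAGATAAAGAGGATACTG\t*\tXX:B:S,12561"
record = parse_sam_line(line)
print(record["FLAG_NAMES"])
```

For the first entry of Code Example 5.4, the FLAG value 163 decomposes into 1 + 2 + 32 + 128, so four labels are reported.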

5.2.4.1 CIGAR

CIGAR notation is a shorthand description of how two sequences are aligned. Commonly used values are ’M’ for match, ’I’ for insertion (to the reference) and ’D’ for deletion (from the reference). The interpretation of the CIGAR string of the first entry of Code Example 5.4, ’8M4I4M1D3M’, is given in Code Example 5.5. The asterisk is used for bases that are unknown. A full list of the CIGAR characters is given in the official SAM documentation.

Code Example 5.5: Interpretation of CIGAR string

Entry: r001 163 ref 7 30 8M4I4M1D3M = 37 39 TTAGATAAAGAGGATACTG XX:B:S,12561

POS: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
REF: * * * * * * T T A G A T A A G A T A * C T G
SEQ: T T A G A T A A A G A G G A T A C T G
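The bookkeeping behind a CIGAR string can be sketched in a few lines of Python. The operation letters and the rule for which operations consume reference or read bases come from the SAM specification; the function names are hypothetical and the sketch only computes lengths, not the aligned sequences themselves.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs, e.g. [(8, 'M'), (4, 'I')]."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings that contain anything besides well-formed operations.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return ops


def alignment_lengths(cigar):
    """Return (bases covered on the reference, bases used from the read)."""
    ops = parse_cigar(cigar)
    ref_len = sum(length for length, op in ops if op in CONSUMES_REFERENCE)
    query_len = sum(length for length, op in ops if op in CONSUMES_QUERY)
    return ref_len, query_len


pos = 7  # 1-based POS of entry r001
ref_len, query_len = alignment_lengths("8M4I4M1D3M")
end = pos + ref_len - 1  # rightmost reference position covered by the read
print(ref_len, query_len, end)
```

For ’8M4I4M1D3M’ the read contributes 19 bases, which equals the length of the SEQ field, while 16 reference positions are covered, so the alignment starting at position 7 ends at position 22.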

5.3 Data integration into The Semantic Web

Making RDF data around the web accessible to the end user involves several obstacles, one of which is the lack of RDF data usage outside of The Semantic Web. For users to process their own data using external RDF data sources, the integration of their data into the structure of the RDF schema is the first obstacle. The correct integration of data brings forth many challenges, requiring a good understanding of the RDF data and knowledge of existing ontologies and structures. There are currently no public tools available that offer the integration of genomic data from common file formats. To make sure boinq is an accessible and attractive tool, file converters for data integration have been implemented. boinq integrates the conversion of GFF, GTF, BAM, SAM, and BED files, a collection of commonly used formats that covers multiple levels of experimental data.


5.3.1 Overview

The implementation of the different file converters in boinq, with the usage and exploration of correct ontologies that build up to a fitting data structure, has been a constantly evolving subject with many iterations and changes. The implementation of genomic data coming from existing data formats has only been documented by the Genomic Feature and Variation Ontology (GFVO). GFVO is an open-source vocabulary created by Codamono as part of the BioInterchange project, a private and paid software tool made for the integration of GFF and VCF files into the RDF data structure. The GFVO vocabulary has been created as a stand-alone ontology library that supports the complete integration of data from the GFF and VCF file formats into an RDF schema. GFVO contains the many ontologies needed to define the attributes that are generally found in the VCF and GFF formats.

The creation of a custom vocabulary has advantages, such as the liberty to customize ontologies and relations at will. No work has to be invested in the study of existing ontologies, as is the case when these are integrated and combined with different vocabularies. The integration of multiple vocabularies into one data structure needs careful assessment to verify whether the given ontologies are meant to be used in the environment that one created. To retrieve and combine ontologies from existing vocabularies, a substantial amount of effort has to be made to ensure that the ontologies are used correctly, by analyzing their definitions and the subclasses they belong to.

The main limitation of using GFVO for our purpose lies in the fact that it is a vocabulary designed to represent the data as structured by the specific file format. This differs from the design goal for the data integration by boinq. In contrast to BioInterchange, boinq does not implement the conversion of data from the RDF framework back to other data formats. One of the foundations in the construction of The Semantic Web is the elimination of barriers that exist between datasets due to the different formats in which data is implemented. Different data formats for genomic data have been created to divide and group specific information together. A data format has no function other than to support a specific structure in which data is easily representable. Since all data in The Semantic Web is structured using the same triple construct, representing the format through custom ontologies serves no purpose.

The data integration created for boinq aims to follow a model that works for any file format, using different vocabularies created by the community to cover the different categories of data retrieved from popular formats for the storage of genomic information. The search for suitable ontologies has not always been successful, as the active community is still very limited. For lack of an alternative, the GFVO vocabulary was partially integrated for VCF.

The extension of existing vocabularies by customization is not an option, as these custom ontologies would not be supported by the original creators. The only way to make sure that newly created ontologies will be supported is to get them accepted into the next version of a publicly published and maintained vocabulary. As this process would take many months and revisions, no work has been invested into it.

Next to the usage of the RDF framework for the implementation and analysis of data, boinq also uses it for the integration and retrieval of data needed for the functionality of boinq itself. The exchange of information between the server and the triplestore works through automated queries hard-coded into the boinq software. When data is converted into the boinq triplestore, metadata is generated and stored in a predefined graph. This metadata can therefore be accessed and queried by both the user and the software. Data stored in this graph includes:

• A list linking local regions with external ones. This is explained in Section 5.3.6.2.

• Specifics about the different tracks (graphs) in which data can be found.

• Specifics about the source files the data was converted from.

• Specifics about the various feature types stored in each track/file.

• Parsed/unparsed headers from converted data files.

A custom vocabulary is created to link the metadata stored in this graph. As this information is collected in an environment created to serve boinq, the use of a custom vocabulary is appropriate.

5.3.2 Basic data model

The basic information embedded in every data format consists of the description of a feature and the location at which it is positioned, as shown in Figure 5.2. Feature entities, as well as any object defined in a triple, can be given a custom IRI. The IRI is a unique pointer to the object and does not define its characteristics. Although IRIs do not have to contain any information, they have been chosen to be built from a logical set of strings. This helps with the identification of different aspects of an object, and the IRI can even carry a certain amount of information that is also defined explicitly through triples.

Figure 5.2: Basic structure of the model found in every data format. A feature of the genome is described and bound to a specific region.

The model represented in Figure 5.2 has proven to fit the standards and design goals explained in the previous section. By converting different data formats into one model that represents the features and locations as independent objects, linked only by their relationships, it removes the differences between these objects that stem from the characteristics of their source files. Data describing a feature is defined with triples starting from the feature node. Data describing the elements of the location node, identified as a chromosomal region, is defined with triples connected to that node.

Figure 5.2 shows an ideal model describing genome features and their locations on the chromosome. The implementation of data gave rise to several issues, which required an expansion of the given model. Problems arise when dealing with data linked to a feature that cannot be considered an attribute of that feature. An example is the RGB color assigned to a feature in a BED file. The color is not an attribute of the feature, but rather a customization used specifically for the representation of the feature in the UCSC browser. Another issue arises with attributes linked to a feature for which no ontology is found, which is common in more complex and newer formats such as the BAM/SAM format. In addition, the converter has not yet been expanded to translate all fields into processed information; an example is the FLAG bit values given in the SAM format.

To solve these problems, and to be able to implement the above-mentioned information, the model was expanded with an extra layer, the entry, as shown in Figure 5.3. The entry node ties together all information that does not fit into the previous model. Unlike the feature and location nodes, the entry node does not adopt community-made ontologies and is introduced as a repository to store unprocessed data, or data related to the entry of a file rather than to a genomic feature. The expansion also makes it possible to assign metadata to the entry. The introduction of this extra layer creates a clear separation between the standard genomic mapping and the custom mapping, as the two parts use different vocabularies.

Figure 5.3: Basic structure of the model found in every data format. A feature of the genome is described and bound to a specific region.


5.3.3 Vocabularies

A list of all used vocabularies is displayed in Table 5.2. The Resource Description Framework (RDF) and Resource Description Framework Schema (RDFS) vocabularies are used to describe the different objects. RDF and RDFS are the groundwork for every other vocabulary and are therefore always used. Ontologies used are:

rdf:type - A predicate to link an object to a class. By default, the identity of an object is defined this way.

rdf:value - Used to link a literal to an object. A literal can hold different kinds of data, such as strings, integers and booleans.

rdfs:label - Used to label an object. The label is also a literal and can best be seen as the name an object carries.

rdfs:comment - Used to link comments to an object. The comment value is typically a string and can be about anything.

rdfs:description - Used to give a description of an object. Unlike the comment field, the description value is specifically used to describe the object.

Dublin Core terms (DCterms) and the Simple Knowledge Organization System (SKOS) are both introduced to extend the labeling of objects. Features can carry different labels or identifiers. An identifier is a unique pointer within the database in which the object is listed. The primary label, alternative labels and identifier are represented with rdfs:label, skos:altLabel and dcterms:identifier, respectively.

The XML Schema Definition (XSD) vocabulary is used to define the data type of literals. It is one of the main building blocks for creating semantic data. The characterization of literals through the XSD vocabulary has been explained in Chapter 2.

An early misconception in the first iterations of the data structure was the misuse of certain ontologies. An ontology, next to carrying a definition, is also bound to a strict rule defining the circumstances of its usage. Specifically, ontologies can be used to define classes, datatype properties, and object properties. Datatype properties are properties of an object expressed through a literal. Object properties are predicates that define links between two different objects. It is important to identify the type of ontology used. To exemplify, consider the ontology Score, which is defined as a class. Figure 5.4 A shows the incorrect use of the Score ontology, where it is used as a predicate linking the feature directly to the literal score value. As Score is a class, it should instead be used to define the type of a separate object, which carries the value through rdf:value. This object is then linked to the feature using an object property, such as hasAttribute. The existence of a datatype property for score is in theory possible. Yet, the relations carried by predicates are kept as general as possible. Because of this, the predicates in a vocabulary usually make up only a small part of the total number of ontologies.

The direct assignment of data to an object, as displayed in model A, has the advantage of keeping the model simple. It also requires a minimal number of triples to define a relationship: model B uses three triples where model A uses one. A database adopting model B would use considerably more storage than a database adopting model A. Even though the use of an extensive model such as model B seems unnecessary, it is common practice for data to be structured this way. It offers the possibility to give specifications about the attribute objects or to expand them in any other way. By associating extra information with these objects, better filtering criteria can furthermore be constructed.
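The difference between the two models can be made concrete with a small Python sketch that represents triples as plain tuples. The prefix `ex:` and all IRIs are hypothetical placeholders, not terms from the vocabularies used by boinq.

```python
def model_a(feature, score):
    """Model A: the score is attached directly to the feature (one triple)."""
    return [(feature, "ex:score", score)]


def model_b(feature, score):
    """Model B: a separate attribute object of class Score carries the value."""
    attribute = f"{feature}/score"  # hypothetical IRI for the attribute object
    return [
        (feature, "ex:hasAttribute", attribute),
        (attribute, "rdf:type", "ex:Score"),
        (attribute, "rdf:value", score),
    ]


def features_with_score_above(triples, threshold):
    """Filter on the attribute objects, then walk back to their features."""
    score_nodes = {s for s, p, o in triples
                   if p == "rdf:type" and o == "ex:Score"}
    high = {s for s, p, o in triples
            if s in score_nodes and p == "rdf:value" and o > threshold}
    return sorted(s for s, p, o in triples
                  if p == "ex:hasAttribute" and o in high)


triples = model_b("ex:gene1", 42.0) + model_b("ex:gene2", 3.5)
print(len(model_a("ex:gene1", 42.0)), len(model_b("ex:gene1", 42.0)))
print(features_with_score_above(triples, 10))
```

Because the score lives on its own node in model B, further triples (for example a unit or the method used to obtain the score) could be attached to that node without touching the feature itself.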

Another vocabulary used is the Feature Annotation Location Description Ontology (FALDO). FALDO consists of a comprehensive collection of ontologies for defining regions bound to genome features, which is what it is used for here. The GFVO vocabulary, as previously discussed, offers the possibility to integrate complex elements from the VCF format into the model. The Sequence Ontology (SO) is a collection of ontologies defining sequence features, attributes and relationships as found in genome annotation. Lastly, the Format Ontology, constructed specifically for boinq, supports the addition of the unprocessed and custom data found in each entry.

Figure 5.4: Differences between the wrong (A) and correct (B) use of a given ontology to specify data. Model A uses the ontology for score as a predicate to link a literal to an object. Model B correctly uses the score ontology to specify the identity of an object through the rdf:type predicate, and gives that object the value of the score through rdf:value. IRI objects are represented as circles; literals as squares.

Table 5.2: Vocabularies used to create the data structure into which the discussed file formats are converted.

Used vocabularies
Vocabulary                                          Namespace prefix
Resource Description Framework                      rdf
Resource Description Framework Schema               rdfs
Dublin Core terms                                   dcterms
Simple Knowledge Organization System                skos
XML Schema Definitions                              xsd
Feature Annotation Location Description Ontology    faldo
Genomic Feature and Variation Ontology              gfvo
Sequence Ontology                                   obo
Format Ontology                                     format

5.3.4 Data models

5.3.4.1 Location

The first model created is the data structure representing information about the location of a given feature. The structure of the data determining the regional attributes of a feature is identical for all data formats. The FALDO vocabulary was used to implement the information into an RDF framework, as it is a vocabulary that has been adopted by external databases, such as Ensembl and UniProt. Figure 5.5 gives a representation of the final model. faldo:reference points to a literal defining the chromosome or contig on which the region is located. faldo:begin and faldo:end are both object properties connecting the region to the objects specifying the begin position and the end position, respectively. faldo:position links the integer value of the position, as found in the file, to such an object. The 1-based coordinate system is used for all data. Since BAM and BED both use the 0-based coordinate system, adjustments are made to their start positions as the data is implemented. The objects pointed to by faldo:begin and faldo:end are both of the type faldo:ExactPosition, meaning that, as the name suggests, the position given through faldo:position is the exact position. Features can be located on either the forward or the reverse strand, annotated by faldo:ForwardStrandPosition and faldo:ReverseStrandPosition, respectively. If no information about the strand is present, neither type is used.

Figure 5.5: The schema used to describe the implementation of the data annotating the region of a feature into the RDF framework. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green circles are used to represent the type of the object they are connected to.

The IRIs of the location, begin and end entities all carry the information that is formally expressed through triples. An important matter when creating thousands or even millions of IRIs is to make sure that entities which are meant to be separate are not merged in the triplestore because they received an identical IRI. Since the data is represented through triples, failing to create a unique IRI will merge the incoming and outgoing links of both data entities under the same subject IRI. On the other hand, it is sometimes useful or necessary to be able to merge identical mappings, such as a specific region, under one data object. For example, it is possible for two features to be mapped to the same region of an identical chromosome or contig. A specific example is the mapping of millions of experimental reads to a reference genome; another is when two or more genes carry different labels but in fact represent an identical gene. By building up the IRI for a specific region from the information necessary to specify that exact region, the features of the previous examples are automatically linked to the same object.

IRIs used for the location, begin and end nodes (as represented in Figure 5.5) are built from the species, assembly, contig/chromosome, begin and end position, and strand. The strand value is not always specified in the data format, in which case the tail of the IRI is omitted. The species of the genomic data can be communicated through the client, which offers the input of several variables before conversion. If no value is given, the field is set to ”Unknown”. The specific configuration of the nodes is as follows:

Location : http://www.boinq.org/resource/species/assembly/contig:begin-end:strand

Begin : http://www.boinq.org/resource/species/assembly/contig:begin:strand

End : http://www.boinq.org/resource/species/assembly/contig:end:strand
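The three patterns above can be sketched as a small Python helper. The function name `location_iris` and the way the strand is passed in are assumptions for illustration; only the IRI layout itself is taken from the patterns listed above.

```python
BASE = "http://www.boinq.org/resource"


def location_iris(species, assembly, contig, begin, end, strand=None):
    """Build the IRIs of the location, begin and end nodes of one region."""
    species = species or "Unknown"  # fallback when no species was supplied
    prefix = f"{BASE}/{species}/{assembly}/{contig}"
    # The tail of the IRI is omitted when the format gives no strand.
    tail = "" if strand is None else f":{strand}"
    return {
        "location": f"{prefix}:{begin}-{end}{tail}",
        "begin": f"{prefix}:{begin}{tail}",
        "end": f"{prefix}:{end}{tail}",
    }


# Two features mapped to the same region receive the same location IRI,
# so their triples end up attached to one shared node in the triplestore.
first = location_iris("homo_sapiens", "GRCh38", "1", 1050, 9000, strand=1)
second = location_iris("homo_sapiens", "GRCh38", "1", 1050, 9000, strand=1)
print(first["location"])
print(first["location"] == second["location"])
```

Since the IRI is fully determined by its inputs, no lookup table is needed to detect that a region has been seen before; the triplestore merges identical IRIs on its own.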

The inclusion of both the species and the assembly in the IRI is necessary to distinguish between different species or multiple samples of a specific species. This information is in many cases found in the header of the flat format file. Because the necessary information isn’t always present, the user is prompted to input both the species and the assembly of the data before the data can be processed. Code Example 5.6 gives the generated triples implementing the first line of Code Example 5.2 into the model given in Figure 5.6.

Code Example 5.6: The implementation of the first entry of Code Example 5.2 into the FALDO schema.

<http://www.boinq.org/resource/homo_sapiens/GRCh38/1:1050-9000:1>
    a <http://biohackathon.org/resource/faldo#Region> ;
    <http://biohackathon.org/resource/faldo#begin>
        <http://www.boinq.org/resource/homo_sapiens/GRCh38/1:1050:1> ;
    <http://biohackathon.org/resource/faldo#end>
        <http://www.boinq.org/resource/homo_sapiens/GRCh38/1:9000:1> ;
    <http://biohackathon.org/resource/faldo#reference>
        <http://www.boinq.org/resource/homo_sapiens/GRCh38/1> .

<http://www.boinq.org/resource/homo_sapiens/GRCh38/1:1050:1>
    a <http://biohackathon.org/resource/faldo#ExactPosition> ,
      <http://biohackathon.org/resource/faldo#ForwardStrandPosition> ;
    <http://biohackathon.org/resource/faldo#position>
        1050 .

<http://www.boinq.org/resource/homo_sapiens/GRCh38/1:9000:1>
    a <http://biohackathon.org/resource/faldo#ExactPosition> ,
      <http://biohackathon.org/resource/faldo#ForwardStrandPosition> ;
    <http://biohackathon.org/resource/faldo#position>
        9000 .
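As a rough sketch of how the triples of Code Example 5.6 could be produced programmatically, the Python fragment below builds them as (subject, predicate, object) tuples. It only covers a region on the forward strand, exactly as in the example; the function name is hypothetical and no RDF library is used, so serialization to Turtle is left out.

```python
FALDO = "http://biohackathon.org/resource/faldo#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def forward_region_triples(reference_iri, begin, end):
    """Return the FALDO triples describing one forward-strand region."""
    region = f"{reference_iri}:{begin}-{end}:1"
    triples = [
        (region, RDF_TYPE, FALDO + "Region"),
        (region, FALDO + "reference", reference_iri),
    ]
    for predicate, position in (("begin", begin), ("end", end)):
        node = f"{reference_iri}:{position}:1"
        triples.extend([
            (region, FALDO + predicate, node),
            (node, RDF_TYPE, FALDO + "ExactPosition"),
            (node, RDF_TYPE, FALDO + "ForwardStrandPosition"),
            (node, FALDO + "position", position),
        ])
    return triples


reference = "http://www.boinq.org/resource/homo_sapiens/GRCh38/1"
triples = forward_region_triples(reference, 1050, 9000)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

The result contains ten triples: four with the region as subject and three for each of the two position nodes, matching the statements in Code Example 5.6.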

5.3.4.2 BED

Figure 5.6: The schema used to describe the implementation of BED data into the RDF framework. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the type of the entity they are connected to.

The BED file format is a relatively simple format to implement, and no complex data parsing is required to extract relevant information. Data represented in BED is mainly limited to the mapping of a feature and its sub-features to a genomic region. Additional information can be given through the optional fields [score] and [itemRgb]. The RGB value is specific to the file format and has no direct connection to the entity of a feature, whereas [score] is often used for experimental values. For this reason, the RGB and score values are mapped to the entry and feature objects, respectively. No ontologies exist to define RGB colors, and these have thus been created. BED files do not contain information about the identity of the features or sub-features contained in the data file. To obtain this information, the user is prompted before the conversion to assign types to the feature and sub-features of the file. A representation of the data model is given in Figure 5.6.

• Three values ranging from 0 to 255 are stored in [itemRgb]. The data, connected to an object with the type format:RGBvalue, is represented as a string containing the array of integers.

• Values implemented in the FALDO schema discussed in Section 5.3.4.1 are retrieved from [chrom], [chromStart], [chromEnd] and [strand].

• The variable extracted from [name] is linked to the feature through rdfs:label. The value is not a unique identifier, and dcterms:identifier is therefore not used.

• The score of a feature, retrieved from [score], is considered an attribute of the feature. Objects specifying attributes of another object are linked using obo:so-xp.obo#has_quality. obo:SO_0001685 is used as the object type, as it refers to an experimentally obtained score. The values of attributes, and thus the data in the [score] field, are linked to the object using rdf:value.

• Information stored in [thickStart] and [thickEnd] is not implemented, as these fields are believed to store no relevant information.

• The values from [blockCount], [blockSizes] and [blockStarts] are used to determine the sub-features of an entry. Features and sub-features have a two-sided link expressed through obo:so-xp.obo#part_of and obo:so-xp.obo#has_integral_part. As these entities have different genomic regions, location mappings are created for each object.
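The derivation of sub-feature regions from the block fields can be sketched as follows. The function name and the example values are assumptions for illustration; the sketch relies on the BED convention that [chromStart] is 0-based, that the end coordinate is exclusive, and that [blockStarts] are offsets relative to [chromStart].

```python
def bed_block_regions(chrom_start, block_sizes, block_starts):
    """Return 1-based, inclusive (begin, end) pairs for every BED block."""
    # BED allows a trailing comma in both lists, e.g. "567,488,".
    sizes = [int(x) for x in block_sizes.rstrip(",").split(",")]
    offsets = [int(x) for x in block_starts.rstrip(",").split(",")]
    regions = []
    for size, offset in zip(sizes, offsets):
        zero_based_start = chrom_start + offset
        # Only the start shifts by one when going from 0-based, half-open
        # to 1-based, inclusive coordinates; the end value stays the same.
        regions.append((zero_based_start + 1, zero_based_start + size))
    return regions


# Illustrative entry: chromStart=1000, chromEnd=5000, two blocks.
regions = bed_block_regions(1000, "567,488,", "0,3512,")
print(regions)
```

Each returned pair could then be given its own location mapping, with the sub-feature linked to its parent feature through the part_of and has_integral_part relations described above.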

5.3.4.3 GFF

Figure 5.7: The schema used to describe the implementation of GFF data into the RDF framework. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the type of the entity they are connected to, orange nodes represent objects linked over multiple entries.

The GFF schema has more elements than the BED schema, and more complex parsing code has been implemented to convert all relations. Unlike in the BED format, relations between features are spread over different entries. These are defined in the attributes field. The value of [Parent] points to an ID given in another entry. The converter processes the file line by line, and the IDs linked to feature nodes are recorded throughout the process, so that [Parent] can be resolved to the ID value of a previous entry. Nodes displayed in orange signify that the nodes are connected to objects from different entries. The [Target] attribute is used in specific cases where the entry specifies a ’match’ entity. It contains the ID of a genomic entity, followed by a start, end and strand value. [Gap] can be used when the alignment is not perfect, and contains a CIGAR string to specify this alignment. Both the [Target] and [Gap] fields have been considered too complex to parse, and their literal string values are therefore linked to the entry node. A representation of the data model is given in Figure 5.7.

• The values of [Target], [Gap] and [Source] are stored as attributes of the entry node.

• Values implemented in the FALDO schema discussed in Section 5.3.4.1 are retrieved from [seqname], [start], [end] and [strand].

• The feature type is given by [feature]. A complete set of supported features is given in Table B.2.

• Connections between features are annotated by [Parent] or [Derives_from], and expressed through obo:so-xp.obo#part_of and obo:so-xp.obo#has_integral_part.

• Item attributes are [score], [phase] and [Is_circular], denoted by obo:SO_0001685, obo:SO_0000717 and obo:SO_0000988/obo:SO_0000987, respectively.

• A variety of fields are directly linked to the feature node:

[Note] using rdfs:comment.

[Name] using rdfs:label.

[Alias] using skos:altLabel.

[Dbxref] using rdfs:seeAlso.

[ID] using dcterms:identifier.
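The mapping of the directly linked attributes can be sketched in Python as shown below. The sketch assumes the GFF3 attribute syntax (semicolon-separated key=value pairs, comma-separated multiple values, percent-encoded special characters); the function names, the feature IRI and the attribute string are hypothetical, and attributes such as [Parent] that need resolution across entries are simply skipped here.

```python
from urllib.parse import unquote

# Attribute keys that are linked directly to the feature node.
DIRECT_PREDICATES = {
    "Note": "rdfs:comment",
    "Name": "rdfs:label",
    "Alias": "skos:altLabel",
    "Dbxref": "rdfs:seeAlso",
    "ID": "dcterms:identifier",
}


def parse_attributes(column):
    """Parse the ninth GFF3 column into a dict of key -> list of values."""
    attributes = {}
    for item in column.strip().split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        attributes[key] = [unquote(v) for v in value.split(",")]
    return attributes


def direct_feature_triples(feature_iri, column):
    """Create one triple per value for every directly linked attribute."""
    triples = []
    for key, values in parse_attributes(column).items():
        predicate = DIRECT_PREDICATES.get(key)
        if predicate is None:
            continue  # e.g. Parent, Target and Gap are handled elsewhere
        for value in values:
            triples.append((feature_iri, predicate, value))
    return triples


column = "ID=mrna0001;Name=sonic hedgehog;Parent=gene0001;Alias=shh,SHH"
triples = direct_feature_triples("ex:feature/mrna0001", column)
for triple in triples:
    print(triple)
```

Keeping the key-to-predicate mapping in a single dictionary makes it straightforward to see, and to extend, which attributes are attached to the feature node.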

5.3.4.4 VCF

The VCF schema is the most extensive and is built upon the GFVO vocabulary. As no other ontologies are available for the many data entities a VCF file consists of, the choice was simple. As discussed in Section 5.3.1, the structure and usage of the GFVO vocabulary have some disadvantages and involve design choices that differ from our own. GFVO is an almost completely independent set of ontologies, meaning that, except for some broadly used terms such as rdf:type and a partial integration of FALDO, custom ontologies were created and preferred over existing ones. Some examples are:

gfvo:value instead of rdf:value

gfvo:Label instead of rdfs:label

gfvo:Comment instead of rdfs:comment

gfvo:Identifier instead of dcterms:identifier

gfvo:hasAttribute instead of obo:so-xp.obo#has quality

gfvo:value is furthermore the ontology of choice when appropriating a literal value to an object.This means that other ontologies, such as gfvo:Label and gfvo:Identifier aren’t used as predi-cates. These are instead used to annotate the type of object, appointed to with gfvo:hasAttribute,gfvo:hasAttribute and gfvo:hasIdentifier, respectively.

This design choice is not a random one and deserves consideration. The designation of data relatedto an object has different ways of implementation in the RDF framework. Data affiliated to an objecthas been implemented both directly, eg. using rdf:label and indirectly, e.g. creation of attributeobjects carrying a value. gfvo:value is a neutral ontology, and doesn’t carry any meaning except forthe fact that it is used to point towards a literal. The relationship of that value to the central featureis always defined within the type of object it is bound to. The GFVO vocabulary is thus designed tobe consistent on these levels of data appropriation. Ontologies such as label, identity or commentdefine the classes of these objects to which a literal is bound using gfvo:value.


To implement the VCF schema, some conflicts had to be resolved first. Replacing some of the ontologies with the ones used in previous schemas would bring overall consistency. Yet it is not desirable to implement only part of the GFVO structure, as it was not designed to be used with other vocabularies. Thus, the decision was made to integrate data both ways. Specifically, data elements supported by the general schema, adopted by each format, are present twice, following both the GFVO structure and the general structure. An exception to the rule is gfvo:Label. Using this ontology requires three triples where rdfs:label requires one, which would increase the overall amount of data (i.e. created triples) more than two-fold. Where rdfs:label is used, rdf:value is also used instead of gfvo:value. It was also found that combining these two structural differences results in an overcomplicated data structure.
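The difference in triple count between the two labelling styles can be made concrete with a small sketch. It is illustrative only: triples are modelled as plain Python tuples, the node names (`feature:1`, `feature:1/attribute_1`) and the label value are hypothetical, and the GFVO pattern is written down as it is described in this section rather than taken from the boinq source code.

```python
# Triples are modelled as plain (subject, predicate, object) tuples.

def direct_label(feature, text):
    """Direct style: a single triple, the literal hangs off the feature."""
    return [(feature, "rdfs:label", text)]


def gfvo_label(feature, text, index=1):
    """GFVO style: an intermediate attribute node typed gfvo:Label."""
    node = f"{feature}/attribute_{index}"  # hypothetical node name
    return [
        (feature, "gfvo:hasAttribute", node),
        (node, "rdf:type", "gfvo:Label"),
        (node, "gfvo:value", text),
    ]


direct = direct_label("feature:1", "my_variant")
indirect = gfvo_label("feature:1", "my_variant")
print(len(direct), len(indirect))
```

For every labelled feature, the indirect style costs two additional triples, which is the overhead that motivated keeping rdfs:label for labels.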

Figure 5.8: The schema used to describe the implementation of VCF data into the RDF framework. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the type of the entity it is connected to, orange


Two unique sets of entities are created before and during the conversion. The first set consists of the filters: tests listed in the [FILTER] field of an entry when the probability of the evidence is lower than a set percentage. Filter properties are usually defined in the header, which is parsed and loaded before the body is converted. The converter is then able to create new Filter entities as they are encountered in the entries. The second set, the samples, for example 'NA0001' in Code Example 5.3, is created in the same way. A representation of the data model is given in Figure 5.8. Filter and Sample objects are displayed as orange circles.
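The creation of Filter and Sample entities on first encounter can be sketched as a small registry. This is a minimal illustration, not the boinq implementation: the class name `EntityRegistry` is hypothetical, the base IRIs are those of Table 5.5, and skipping `PASS` and `.` is an assumption about how unfiltered entries are treated.

```python
class EntityRegistry:
    """Hands out one IRI per distinct name, numbered in order of first use."""

    def __init__(self, base_iri):
        self.base_iri = base_iri
        self._iris = {}

    def get_or_create(self, name):
        # A new IRI is only minted the first time a name is seen.
        if name not in self._iris:
            self._iris[name] = f"{self.base_iri}{len(self._iris) + 1}"
        return self._iris[name]

    def __len__(self):
        return len(self._iris)


filters = EntityRegistry("http://www.boinq.org/resource/filter#")

# [FILTER] values of three consecutive entries (hypothetical data);
# several filters in one field are separated by semicolons.
for field in ["q10", "PASS", "q10;s50"]:
    for name in field.split(";"):
        if name not in ("PASS", "."):  # assumption: no entity for passed entries
            filters.get_or_create(name)
```

The same registry can be reused for samples by passing the sample base IRI, so that every entry mentioning 'NA0001' is linked to one and the same Sample object.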

• Values implemented in the FALDO schema discussed in Section 5.3.4.1 are retrieved from [CHROM] and [POS].

• dcterms:identifier is used to point towards the dbSNP identifier value stored in [ID]. The object is assigned gfvo:Identifier as type.

• The feature types are not given by any field, but can be deduced from the data listed in the entry. A complete set of supported features is given in Table B.1.

• Filters and samples are identified using the types gfvo:VariantCalling and gfvo:BiologicalEntity, respectively. Connections to the feature are made using gfvo:isRefutedBy and gfvo:hasSource. [FILTER] and [SAMPLE] are integrated in the labels of these objects.

• Data stored in [REF], [ALT] and [QUAL] are qualified as attributes and thus linked as objects to the feature using gfvo:hasAttribute. The ontologies of the types are gfvo:ReferenceSequence, gfvo:SequenceVariant and gfvo:PhredScore, respectively. The Phred score value is bound as an attribute to the reference sequence attribute object.

• The varying keys used by [INFO] are also implemented using gfvo:hasAttribute. A full list of the object types found in [INFO] is given in Table B.3. The field values are stored on these objects using rdf:value. Keys used by the [INFO] field are stored on the objects using rdfs:label. To increase flexibility, data from this field that is assigned to an unknown key is still stored on an object without a type. This still allows the user to query the data using the label value of the key linked to that object. One exception to this rule is the implementation of gfvo:ExternalReference. These objects have no stored value, as the database for which an external reference exists is given through the label of the key itself.

Figure 5.9: The schema used to describe the implementation of the FORMAT fields of the VCF schema. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the type of the entity it is connected to.

• The design of the schema is largely based upon the construction of the vocabulary itself. This is notable when implementing the [FORMAT] fields into the schema. A correct use of the vocabulary was shown on the official site of BioInterchange. However, it had been taken down when visited in April 2016. A central node, linked to the feature node with gfvo:hasEvidence, is connected to both the sample and the genotype attributes. The genotype node furthermore has a complex design, of which the schema can be found in Figure 5.9. The genotype node has two elements: the first and last part, which are linked with gfvo:hasFirstPart and gfvo:hasLastPart, respectively. The value and type of these objects are obtained from the values linked to the GT key, retrieved from [FORMAT]. The same data is used to determine the homozygosity or heterozygosity of the genotype, which is annotated using gfvo:hasQuality. The Phred scores from the determination of the haplotypes, linked to the HQ key, are attached to the haplotype object using gfvo:PhredScore. Only two other attributes of [FORMAT] have been implemented, due to the limitations of a third-party parser. These attributes are the conditional genotype quality, linked to the GQ key, and the coverage of the genotype, using the key DP. The object types are gfvo:ConditionalGenotypeQuality and gfvo:Coverage, respectively.
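How the first part, the last part and the zygosity follow from a GT value can be illustrated as follows. The sketch assumes a diploid genotype written in the standard VCF notation (allele indices separated by `/` or `|`); the function name and the returned dictionary keys are hypothetical and only mirror the node names of Figure 5.9.

```python
import re


def describe_genotype(gt):
    """Split a diploid GT value such as '0|1' or '1/1' into its two alleles."""
    first, last = re.split(r"[|/]", gt)  # raises ValueError if not diploid
    if "." in (first, last):
        zygosity = None  # missing call, zygosity cannot be determined
    elif first == last:
        zygosity = "homozygous"
    else:
        zygosity = "heterozygous"
    return {
        "first_part": first,
        "last_part": last,
        "phased": "|" in gt,
        "zygosity": zygosity,
    }


print(describe_genotype("0|1"))
```

The allele indices refer to [REF] (index 0) and the comma-separated values of [ALT] (index 1 and up), which is how the type of each part can be derived from the same entry.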

5.3.4.5 SAM

The SAM schema was the last to be created. The SAM format has a large number of data entities for which no ontologies are available. The integration of data into a valid structure has only been partially realized. Following the same reasoning as in previous examples, fields containing unparsable data have been linked to the entry node. Values have been implemented as they are extracted from the source. The objects are identified according to the labels of the fields they are extracted from. The SAM format contains data stored in complex formats. Both the limitations of available packages for data parsing and the complexity of data integration and processing are the reason that the current model is incomplete. Although the implementation integrates the most general data features, further adjustments might help to transfer all data coupled to the entry node to data coupled to the feature node.

Figure 5.10: The schema used to describe the implementation of SAM data into the RDF framework. Namespace prefixes are defined in the bottom right corner. IRI objects are represented as circles; literals as squares. Green nodes define the type of the entity it is connected to, orange


Every entry features an alignment. This alignment is derived from statistical probabilities when evaluating the position of a read or query on a reference sequence. Since one read does not necessarily correspond to exactly one alignment, an independent Read object is implemented in the schema. Unlike the Filter and Sample objects from the VCF schema, the mapping of the Read objects is continuously updated during the conversion. Because a SAM file can contain millions of alignments, a mapping is only kept of those reads for which [RNEXT] or [PNEXT] contains a value. An alignment furthermore consists of one or several blocks of sequence that match the reference sequence exactly. The locations of the matched sequences are extracted from the CIGAR string that is given in every alignment. The locations of both the Feature and Match objects are based upon the coordinate system of the reference sequence. A representation of the data model is given in Figure 5.10.
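The extraction of match locations from a CIGAR string can be sketched as below. This is a simplified illustration rather than the converter's code: the function name is hypothetical, coordinates are 1-based and inclusive, and every `M`, `=` or `X` operation is treated as one Match block, although strictly an `M` operation may also cover mismatching bases.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
CONSUMES_REFERENCE = set("MDN=X")


def match_blocks(pos, cigar):
    """Return (start, end) reference intervals of the aligned blocks."""
    blocks = []
    ref = pos  # current position on the reference
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":
            blocks.append((ref, ref + length - 1))
        if op in CONSUMES_REFERENCE:
            ref += length
    return blocks


# An alignment starting at [POS] 100: 5 aligned bases, a 2 base deletion
# from the reference, then 3 more aligned bases.
print(match_blocks(100, "5M2D3M"))
```

Insertions and soft-clipped bases do not move the reference position, which is why only reference-consuming operations update `ref`.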

• Unprocessed data, bound as attributes to the Entry node, comprise [FLAG], [RNEXT], [PNEXT] and [QUAL]. The ontology terms bound to these objects are equal to their respective names.

• Values implemented in the FALDO schema discussed in Section 5.3.4.1 are retrieved from [RNAME], [POS], [TLEN] and [CIGAR]. These give the locations of the Feature and Match objects.

• Labels given to the reads, stored in [QNAME], are used to label the Read objects.

• The Phred quality score of the alignment and its nucleotide sequence, retrieved from [MAPQ] and [SEQ], are both stored as attributes with the types obo:SO_0001686 and obo:SO_0001683, respectively. Shorter nucleotide sequences which are matched are also linked as an attribute of the Match object.

• The quality values of each base in an alignment sequence, retrieved from [QUAL], could in principle be integrated. This would be done by creating a set of objects representing individual nucleotides for every nucleotide sequence. A score attribute could subsequently be added. Since this would add six triples for every base, a literal (string) containing these values was added to the respective objects instead, using rdfs:comment.

• The extreme customization and variability of the [TAG] field, combined with limitations of the parser, have prevented a complete integration of the [TAG] values into the model. Thus, [TAG] is bound to the entry node as a string.

5.3.5 Metadata

Metadata created during conversion is converted and implemented into the RDF framework. To keep the triples extracted from genomic data apart from the metadata created, separate graphs are used. The collection of metadata stored in boinq is furthermore separated according to the dataset in which the data is generated. Metadata created in varying tracks within the same dataset is registered in the same graph. Data implemented into the RDF framework can be partitioned into three parts:

• data created with the conversion of the header of a file

• data created with the conversion of the body of a file

• data created after conversion of a file

Both the header information and the data created after conversion are treated as metadata, and are thus stored as triples in the same graph. After integration of the data, a variety of properties and details is saved about the original data, the conversion operation and the newly created data. Triples containing metadata are linked to an object representing the uploaded file, which acts as the central node. The graph in which the metadata is stored is the collection of all metadata in a dataset. Practically, information about the dataset, the tracks it contains, and other items is all collected in the same environment. More information about which data can be retrieved is given in Chapter 3.

The structure of the metadata is determined by the custom 'track' vocabulary. This vocabulary has been built because no public vocabulary was considered adequate. The 'track' vocabulary creates a data structure that fits the functionality of boinq and enables easy communication of data. Future development of boinq will enable different boinq instances to communicate with each other. The querying of metadata from remote servers is going to be a core part of a properly working network that is able to share and query data over all connected boinq instances.

The information stored in the header is stored as metadata of a file. The conversion of the BED, GFF, VCF and SAM headers is currently implemented in a simple way, where every header entry is stored as a string. A complete parsing of each object has not been achieved. This is largely due to the complexity of some headers and the limitations of the parser. To ensure that no information is lost after conversion, the complete and unprocessed header is copied. This should enable retrieval of data if necessary. Relevant information found in the header can be the identification of the species and reference assembly the data is linked to. This information has to be given by the user before the conversion starts. This is the most reliable approach, as this information is not obligatory in the header but is needed for the creation of reference IRIs. A collection of other data is stored, as shown in Table 5.4. These ontologies are class identifiers for objects bound to the file node using track:hasAttribute. The File object is defined using track:File and is linked from the Dataset object using track:holds. The specific use of the stored count values, such as track:EntryCount, is further explained in Section 5.3.6.


Table 5.4: The different ontologies used for the creation of metadata linked with the conversion of a data file.

Custom metadata ontologies with their definitions

Ontology IRI            Definition
track:HeaderBED         A header from a BED file.
track:HeaderGFF         A header from a GFF file.
track:HeaderVCF         A header from a VCF file.
track:HeaderSAM         A header from a SAM file.
track:ConversionDate    A string containing the date, time and timezone indicating the start of the conversion, e.g. "Sun Mar 13 17:21:18 CET 2016".
track:FileName          The file name of a file. This contains the file extension.
track:FileExtension     The file extension of a file.
track:User              The login identifier of the account under which the file was uploaded.
track:EntryCount        The total amount of Entry objects generated during conversion.
track:FeatureCount      The total amount of Feature objects generated during conversion.
track:TripleCount       The total amount of Triples generated during conversion.
track:FilterCount       The total amount of Filter objects generated during conversion (VCF).
track:SampleCount       The total amount of Sample objects generated during conversion (VCF).
track:ReadCount         The total amount of Read objects generated during conversion (SAM).

5.3.6 Practical implementation

5.3.6.1 IRIs

The creation of many triples and objects (blue and orange nodes) raises the question of IRI naming. The naming of objects was partially discussed when describing the FALDO schema, but has not yet been addressed for the other created objects. An important question is how certain objects should be named. An unlimited number of options is possible, with no prescribed conventions such as a specific length or character set. In general, it is possible to construct IRIs in a way that offers certain advantages. An example of this is the way location IRIs have been constructed. Encoding information about the object in its IRI can offer the following advantages, some of which have already been discussed in Section 5.3.4.1:

• human interpretation about the identity/type of the object

• human interpretation about the information an object holds, including the information held byobjects that are defined upstream or downstream

• human interpretation about the level an object is identified on

• a better view on the data structure when browsing through IRIs

• the prevention of creating duplicate objects for entities that are in essence the same

In practice, it was not feasible to name objects according to their labels or identifiers, unlike in controlled databases such as Ensembl, where the handles of the features are constructed from the identifier of the object (e.g. http://rdf.ebi.ac.uk/resource/ensembl/ENSG00000139618). Features contained in the processed data formats in many cases carry a label rather than an identifier. Labels can be customized to the user's wishes and are not guaranteed to be unique. To prevent the merging of unrelated objects, the adopted method has to produce unique IRIs for every new entry. Two options are available: one is to create an IRI using a random string generator, the other is to keep a count of every type of object to prevent duplicates.

The use of a random string generator has a few disadvantages. Unique random strings hold no information whatsoever, so properties of the object cannot be identified by inspecting the IRI. These strings are furthermore very long. Using a simple integer to keep objects apart has been the chosen method for the naming of IRIs. Although more complicated to implement, it offers all the above-mentioned advantages with the exception of the ability to identify properties of the object itself. It is not feasible to keep track of every specific type of object identified in the schemas. Instead, the broader identity of objects has been tracked, such as the numbering of Entry and Feature objects, following the naming of the blue nodes in the different schemas. Table 5.5 lists the base IRIs used for these objects. Other objects, such as Attribute and Evidence nodes, are always tied to a central object. For this reason their IRI is defined as an extension of the IRI of the object they are tied to. An overview of all these extra components is given in Table 5.7. No extra hashtag is used, as multiple hashtags in an IRI cause conflicts when using namespace prefixes. Specifically, combining a namespace prefix with a hashtag gives an error, e.g. prefix:1/atr#1. Counts of components need only be registered if multiple instances of that component can exist on that specific node. This can only happen for attributes and samples (evidence). Other components, referring to the genotype, haplotype and the alleles, do not need a count integer.

Table 5.5: Base IRIs used with the creation of object IRIs during conversion. Counts are placed behind the hashtag.

Base components in the creation of IRIs

Object type    Base IRI
Entry          http://www.boinq.org/resource/entry#count
Feature        http://www.boinq.org/resource/feature#count
Sample         http://www.boinq.org/resource/sample#count
Filter         http://www.boinq.org/resource/filter#count
Read           http://www.boinq.org/resource/read#count

Table 5.7: Components to construct IRIs during conversion. The namespace 'feature' is used, replacing 'http://www.boinq.org/resource/feature#'. The examples are given for the conversion of a VCF file.

Sub-components used for the creation of IRIs

Object type    IRI component       Example
Attribute      /attribute_count    feature:1/attribute_1
Evidence       /evidence_count     feature:1/evidence_1
Genotype       /genotype           feature:1/evidence_1/genotype
Haplotype      /haplotype          feature:1/evidence_1/haplotype
First Part     /first_part         feature:1/evidence_1/genotype/first_part
Last Part      /last_part          feature:1/evidence_1/genotype/last_part
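The counting scheme of Tables 5.5 and 5.7 can be sketched as a small factory. The class and function names are hypothetical and the underscore between a component and its count is an assumption; the sketch only shows how top-level counters and per-node counters can be combined, and how a later upload can continue from stored totals.

```python
from collections import defaultdict

BASE = "http://www.boinq.org/resource/"


class IriFactory:
    def __init__(self, start_counts=None):
        # start_counts lets a later upload continue where an earlier one stopped
        self.counts = defaultdict(int, start_counts or {})

    def new(self, kind):
        """Mint the next top-level IRI, e.g. .../feature#1."""
        self.counts[kind] += 1
        return f"{BASE}{kind}#{self.counts[kind]}"

    def child(self, parent, component):
        """Extend a parent IRI with a numbered component such as attribute_1."""
        key = (parent, component)  # counted separately for every parent node
        self.counts[key] += 1
        return f"{parent}/{component}_{self.counts[key]}"


def single(parent, component):
    """Components that occur at most once per node need no count."""
    return f"{parent}/{component}"


factory = IriFactory({"feature": 10})  # ten features already exist in the track
feature = factory.new("feature")
evidence = factory.child(feature, "evidence")
genotype = single(evidence, "genotype")
print(genotype)
```

Because child counters are keyed on the parent IRI, attribute numbering restarts at 1 for every feature, matching the examples in Table 5.7.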

Different data files can be uploaded to the same track, so a correct count of the various objects has to be kept. This information is stored in the metadata section, as explained in Section 5.3.5. The total count of every object is furthermore saved in the local database of the server. After the metadata of a conversion has been added to the triplestore, it is queried and stored by the server. Code Example 5.7 displays the generic query used by the server to retrieve an array of counts. The elements of the array are summed, resulting in the total count.


Code Example 5.7: Query used by the server to retrieve the total amount of created features in a track.

PREFIX track: <http://www.boinq.org/iri/ontologies/track#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?featureCount
WHERE {
  # every file node held by the dataset graph
  <http://www.boinq.org/iri/graph/local#1> track:holds ?fileNode .
  ?fileNode track:hasAttribute ?attribute .
  # keep only the attributes that store a feature count
  ?attribute a track:FeatureCount ;
             rdf:value ?featureCount .
}
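The summation of the returned counts can be sketched as follows. The snippet assumes the endpoint answers in the standard SPARQL 1.1 JSON results format; the response shown is hypothetical data for a track to which two files were uploaded, and the function name is not part of boinq.

```python
def total_count(sparql_json, variable="featureCount"):
    """Sum one numeric variable over all result rows of a SELECT query."""
    bindings = sparql_json["results"]["bindings"]
    return sum(int(row[variable]["value"]) for row in bindings)


# Hypothetical response: one row per uploaded file
response = {
    "head": {"vars": ["featureCount"]},
    "results": {
        "bindings": [
            {"featureCount": {"type": "literal", "value": "120"}},
            {"featureCount": {"type": "literal", "value": "35"}},
        ]
    },
}

print(total_count(response))
```

The resulting total is the value from which the numbering of new Feature objects would continue on the next upload to the same track.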

5.3.6.2 Mapping

The reference mapping of the genomic data is bound to the location node using the faldo:reference predicate, as discussed in Section 5.3.4.1. This IRI is a unique identifier which expresses the species, assembly and chromosome. The Genome Reference Consortium (GRC) is the organization responsible for constructing assemblies for every organism commonly used for data annotation. Version naming is straightforward, with version numbers reflecting both the major and the minor (patch) version of the release. At the time of writing, the latest reference assembly for human is named 'GRCh38.p7', being the 38th version, on which 7 minor updates have been performed.

Although the same reference assemblies are used by all the major databases, no unique IRIs have been adopted for an RDF environment. Indeed, every database has its own way of denoting the chromosome, reference assembly and species the data is mapped on. Specifying the correct reference IRI is essential when using federated queries based upon locations. If this is not done, results will be incorrect.

To connect data with external databases, links have been defined in the metadata section of the triplestore. In practice, a central and custom IRI is used that is specific to the boinq environment. Since the reference IRIs are created by the boinq converter in a predictable way, it is possible to define these relationships before any data is uploaded. Before the conversion of a custom file, the user is given the option to select the species and assembly from a predefined list. Only organisms and assemblies that have been mapped in the metadata can be used as input. From this data, the IRI is formed. For example, if the user maps their data on the GRCh38 assembly of Homo sapiens, the IRI created when reading the X chromosome from the entry is <http://www.boinq.org/resource/Homo_sapiens/GRCh38/X>. As the form of these IRIs is known, links with the corresponding references in external databases can be created.

5.3.6.3 Parsers

The transformation of data from flat files to triples happens in different stages. The files are first uploaded to the server, after which they are fed to a parser line by line. The parsers are selected according to the file extension and are different for each format. The parser then returns the entry as an object containing all information. Extraction of relevant data from this object is possible with the use of predefined functions that are part of the package the parser is distributed with.
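The selection of a parser by file extension can be sketched as a simple lookup. The two parsing functions below are deliberately minimal stand-ins written for this illustration; they are not the library parsers used by boinq, and all names are hypothetical.

```python
import os


def parse_bed_line(line):
    """Minimal stand-in for a BED parser: the three mandatory columns."""
    chrom, start, end, *rest = line.rstrip("\n").split("\t")
    return {"chrom": chrom, "start": int(start), "end": int(end), "rest": rest}


def parse_vcf_line(line):
    """Minimal stand-in for a VCF parser: the first five columns."""
    chrom, pos, ident, ref, alt, *rest = line.rstrip("\n").split("\t")
    return {"chrom": chrom, "pos": int(pos), "id": ident,
            "ref": ref, "alt": alt.split(","), "rest": rest}


PARSERS = {".bed": parse_bed_line, ".vcf": parse_vcf_line}


def select_parser(filename):
    extension = os.path.splitext(filename)[1].lower()
    try:
        return PARSERS[extension]
    except KeyError:
        raise ValueError(f"no parser registered for {extension!r}") from None


def convert(filename, lines):
    """Feed the body of a file to the selected parser, one line at a time."""
    parse = select_parser(filename)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header and empty lines are not entries
        yield parse(line)


entries = list(convert("example.bed", ["1\t29553\t31097\tRP11-34P13.3\n"]))
print(entries[0]["start"])
```

Supporting an additional format then only requires registering one more entry in `PARSERS`.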

It was not necessary to write parsers ourselves; they have been retrieved from existing libraries. The parsers used have been replaced, updated and customized to fit our needs. Most tools are still under active development and are not feature-complete.

HTSJDK - The HTSJDK Java library is constructed and maintained as part of Samtools [59]. Samtools is an umbrella organization encompassing several projects on a variety of tools, designed for the creation and manipulation of next generation sequencing data. The HTSJDK repository is very active, with over 60 contributors to the code base. From September 2015 to April 2016, eight releases were distributed, ranging from version 1.139 to 2.2.1. The HTSJDK package contains a variety of parsers, including the ones used for the conversion of BED, VCF and SAM/BAM files. Due to the simplicity of the BED format, work on the parser is finished. Most of the work on the library has concerned the extension and completion of functionalities surrounding the SAM and VCF formats. The incomplete parts of the VCF parser limit the functionality of the converter only in that the parser is able to extract a limited number of attributes linked to the [FORMAT] field. The SAM parser is the least complete and has undergone the most apparent changes over the last year. Limitations are mostly centered on the parsing of the attributes or [TAG] field.

Jannovar - The parser used for GFF files has been retrieved from the Jannovar library. Jannovar is a project used for the annotation of VCF files in the analysis of disease-gene discoveries [35]. It identifies all transcripts affected by base variants stored in VCF files. Jannovar is created to be used both as an application and as a library. The only limitation of the parser is due to the inflexible way feature types are recognized and stored. Instead of extracting the feature type string from an entry to a central object, a limited and hard-coded list of feature types is predefined. Supported features are listed in Table B.2. To offer a solution for boinq, the list has been extended and submitted for inclusion in the next release.

5.3.6.4 Client

Before data can be converted, it is uploaded by the user through the client. The client has been written using AngularJS. Boinq has been designed to integrate functionalities of The Semantic Web.

Figure C.1 gives a representation of the different variables requested before conversion. The options shown are those selected for the conversion of Code Example 5.1. As explained in Section 5.3.6.2, the input of the species and assembly is restricted to a predefined list. This list is stored in the metadata section of the triplestore and is obtained by querying it. Assemblies can only be chosen once the species has been defined. The contig prefix is a variable implemented to handle prefixes used in the contig field. It is important that no prefixes, e.g. chr or Chrom, end up in the reference IRI, as boinq would otherwise not be able to retrieve the reference IRIs used by other databases. The species and assembly fields are obligatory; the contig prefix is optional.
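The role of the three input fields in forming the reference IRI can be sketched as follows. The function name and parameter names are hypothetical; the IRI layout follows the example given in Section 5.3.6.2, and replacing spaces in the species name by underscores is an assumption based on that example.

```python
BASE = "http://www.boinq.org/resource/"


def reference_iri(species, assembly, contig, contig_prefix=""):
    """Build the reference IRI for one contig name read from an entry."""
    # Strip the user-supplied prefix (e.g. 'chr') so that 'chrX' and 'X'
    # end up on the same reference IRI.
    if contig_prefix and contig.startswith(contig_prefix):
        contig = contig[len(contig_prefix):]
    return f"{BASE}{species.replace(' ', '_')}/{assembly}/{contig}"


print(reference_iri("Homo sapiens", "GRCh38", "chrX", contig_prefix="chr"))
```

Without the prefix option, a file using 'chrX' would be mapped to a different IRI than a file using 'X', and location-based queries across both files would silently miss results.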

The selection of the feature types is another example of the close interaction between boinq and the triplestore. Figure C.2 shows the interface created for the selection of a feature type. The Sequence Ontology vocabulary contains all of the possible options. An interactive list has been realized through SPARQL by querying the vocabulary, which is uploaded to the triplestore when boinq is started. A search bar is implemented, as the list contains several hundred possible options.

The properties of every track can be reviewed through the client. The information is retrieved by querying the triplestore, as outlined in Section 5.3.5. Figure C.3 shows the properties of the track after uploading the BED file displayed in Code Example 5.1.

5.3.6.5 Code

The code written for the functionalities implemented in boinq can be reviewed on GitHub. The contributed code is mostly in Java and JavaScript. The repository is situated at https://github.com/Kleurenprinter/boinq2.

5.4 Evaluation

5.4.1 sparql-bed and sparql-vcf

sparql-bed and sparql-vcf are tools created by Jerven Bolleman, creator of the FALDO vocabulary, to directly query BED and VCF files for their locations. It is important to note that the tools do not integrate a full mapping of the files in RDF and are only focused on implementing location data into a FALDO schema. The tools proved unable to handle large files and complex queries, and the given examples are thus limited to small files. To check the correct parsing and implementation of the FALDO schema by boinq, we compared the results of identical queries executed on both sparql-bed/sparql-vcf and the boinq triplestore. Due to limitations of sparql-vcf, only a very basic query could be executed. The queries and results are featured in Code Example A.1 and Code Example A.2.

The code used for the translation of location data into the FALDO schema has been constructed such that it is used for the conversion of all data formats. For this reason, the correct implementation of the schema can be assumed for all supported data formats. The BED and VCF files can be found at https://github.com/samtools/htsjdk/blob/master/src/test/resources/htsjdk/tribble/bed/unsorted.bed and https://github.com/samtools/htsjdk/blob/master/src/test/resources/htsjdk/variant/dbsnp_135.b37.1000.vcf, respectively. Corresponding queries all returned the same values. This is a clear indication that the conversion of the location data is implemented correctly.

5.4.2 Big data files

To confirm that big data files do not pose a problem for boinq, a VCF file of approximately 1 GB containing 4,686,454 unique entries has been successfully converted using boinq. The data file was provided confidentially and can thus not be shared. The data contains SNPs from the first human chromosome. Code Examples A.4 and A.5 show queries to find the genes and exons the SNPs are situated in. A local version of the Ensembl endpoint was used, as the official database experienced downtime. The Ensembl database has served as a guide in the creation of the data structures, and the two thus bear many resemblances.

The Ensembl database uses the FALDO vocabulary. The constructed query compares the position of each SNP to the positions of the requested features. SNPs contained within a specific gene or exon are listed. In both Code Examples, nested queries are used. These are always executed first and return a set of variables that is used in the outer query. For the queries featured, the locations of genes and exons are retrieved first. Specifically, this returned 1766 matches for genes and 28808 for exons; the queries took approximately 8 and 50 hours, respectively. The data can be used to search for genes with a significantly higher frequency of SNPs. Further steps to analyse this data have not been taken, due to the size of the database.
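The positional comparison performed by these queries amounts to an interval containment test. The sketch below restates that logic in plain Python for illustration; in boinq the comparison is expressed in SPARQL over the FALDO positions. The function name and all coordinates are hypothetical, positions are 1-based and inclusive, and strand is ignored.

```python
def snps_within(snps, features):
    """Map each feature name to the SNP positions that fall inside it.

    snps:     iterable of (reference, position)
    features: dict name -> (reference, begin, end)
    """
    hits = {name: [] for name in features}
    for reference, position in snps:
        for name, (f_ref, begin, end) in features.items():
            # A SNP matches only on the same reference and within the bounds.
            if reference == f_ref and begin <= position <= end:
                hits[name].append(position)
    return hits


# Hypothetical coordinates, for illustration only
genes = {"geneA": ("1", 100, 200), "geneB": ("1", 500, 900)}
snps = [("1", 150), ("1", 200), ("1", 300), ("2", 150)]
hits = snps_within(snps, genes)
print(hits)
```

This naive version compares every SNP with every feature, which gives an impression of why the corresponding queries on millions of entries run for hours.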

5.4.3 JBrowse

JBrowse is a web client created for the visualization of genome annotations. It is built with JavaScript and HTML5, with Perl-based data formatting tools [16]. JBrowse is an open source software tool maintained by the Generic Model Organism Database (GMOD) community. The GMOD project features a collection of software tools for managing, visualising and storing genetic data.

Boinq aims to implement a visualisation tool for RDF data through the integration of existing software. Version 1.10.0 of JBrowse is the first version featuring data visualisation through connectivity with a SPARQL endpoint. This feature was mainly designed during a Biohackathon and is therefore limited to basic functionalities. By downloading a region from the UCSC Genome Browser in BED format, converting it with boinq and displaying it using JBrowse, the correct functioning of both boinq and JBrowse can be tested. Figure C.4 gives the region as shown in the UCSC browser. Figure C.5 gives the result as displayed by JBrowse. The configuration file of JBrowse is shown in Code Example A.3.

Code Example 5.8: The BED file exported from the UCSC Browser and uploaded to the boinq triplestore

1 29553 31097 RP11-34P13.3 0 + 29553 31097 0 3 486,104,122 0,1010,1422
1 30365 30503 MIR1302-9 0 + 30365 30503 0
1 34553 36081 FAM128A 0 - 34553 36081 0 3 621,205,361 0,723,1167

6 Biological research in RDF

6.1 Introduction

Research is not simply done through single rounds of data comparison. Many research objectives require the analysis of data over multiple iterations, done in an environment that makes it possible to manage data created and obtained at different steps of the research. To make The Semantic Web a useful asset for research purposes, the management of RDF data is necessary. Now that a way has been introduced to integrate data from next generation sequencing into a local RDF environment, researchers can compare custom data with public databases. This chapter is an exploratory study of how technologies from The Semantic Web can be used to handle research objectives. For this, a use case was constructed. The use case serves as an example and gives only the initial steps of a more in-depth study. No validation or statistical proof is generated.

6.2 A biomarker for colon cancer

6.2.1 Introduction

Colorectal cancer (CRC) is the third most frequently diagnosed cancer in the United States and United Kingdom and the second leading cause of cancer death in the western world [45]. CRC has a high survival rate when diagnosed at an early stage: 93% and 83% in the first five years for stage I and II, respectively. Yet survival drops drastically for late stage cancer, with 60% for stage III and only 8% for stage IV [52]. Invasive techniques such as colonoscopy offer significant improvements in the detection of CRC, although the cost, risk and inconvenience have caused compliance rates of the method to be low [83]. The use of biomarkers for the detection of CRC is a promising method, as there is a high need for non-invasive early stage screening techniques.

Epigenetic alterations of the genome and remodeling of chromatin have been shown to play an important role in the development of cancer [28]. Epigenetic changes are known to occur from the early development of cancer, such as the creation of aberrant methylation patterns, often present in the promoter regions. These have been shown to result in the silencing of tumor suppressor genes or the activation of oncogenes [57]. A change in methylation patterns furthermore results in a wide range of different expression patterns in tumor cells compared to healthy cells [7]. Detection methods exist for both methylation and expression patterns of DNA. These are of high interest in today's search for a selective and sensitive set of biomarkers for the early detection of cancer.

The CpG island methylator phenotype (CIMP), caused by methylation-driven transcriptional regulation, is a fingerprint of methylation patterns used to define the state and type of CRC [33]. At this point, there is no standardized panel of methylation markers or methylation detection technique. Fingerprints for differential RNA expression are currently being created, proposed and validated. Nevertheless, a robust multigene signature has not yet been defined [37]. The detection of DNA, RNA and protein markers is possible through stool samples. Markers are present in stool because of leakage, exfoliation, or secretion of CRC [54]. Since the process can also occur in nonneoplastic cells, and stool contains genetic material from a varying number of sources, the tests can have limited sensitivity and specificity. Despite the numerous discoveries and methodological advances, CRC research has not yet yielded a novel molecular biomarker suitable for population-wide screening purposes, a problem that exists due to the high variability of current screening tests [33].

The silencing of DNMT1 and DNMT3b, two important methyltransferases, in cancer cells has been shown to reduce methylation by more than 95%. This also resulted in a reduction of other detrimental effects typical of cancer cells, including a suppression of growth and a demethylation of repeated sequences [57]. By comparing expression data and methylation data from both a wild type (WT) and double knockout (DKO) variant of a CRC cell, it is possible to select genes with high differential expression and methylation. If these genes are possible effectors of factors contributing to CRC, differential patterns in methylation and expression data are to be expected between healthy and cancer cells.

6.2.2 Material and methods

The HCT116 cell line is an immortal cancer cell line of the colon [44]. Differential expression and methylation data is available from a DKO and WT variant. In the DKO, DNMT1 and DNMT3b have been knocked out. Variable expression and methylation data of the whole genome is obtained. The methylation data is generated from Reduced Representation Bisulfite Sequencing (RRBS); values are given for promoter methylation, expressed as the difference in average percentage between the DKO and WT. Exon expression data is retrieved from mRNA-Seq. The data representing the differential expression patterns is expressed as the log2 fold change of the DKO compared to the WT.
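As an illustration of the two quantities described above, the following minimal Python sketch computes a log2 fold change and a difference in average methylation percentage. The function names, the direction of the subtraction (DKO minus WT, assumed from the negative values reported for demethylated promoters) and the example numbers are hypothetical and are not taken from the thesis data.

```python
import math


def log2_fold_change(dko_expression, wt_expression):
    """Return log2(DKO / WT); positive values mean higher expression in the DKO."""
    if dko_expression <= 0 or wt_expression <= 0:
        raise ValueError("expression values must be positive")
    return math.log2(dko_expression / wt_expression)


def methylation_difference(dko_percentages, wt_percentages):
    """Return mean(DKO) - mean(WT) of promoter methylation percentages (assumed direction)."""
    dko_mean = sum(dko_percentages) / len(dko_percentages)
    wt_mean = sum(wt_percentages) / len(wt_percentages)
    return dko_mean - wt_mean


# Hypothetical values for a single exon, for illustration only.
fc = log2_fold_change(dko_expression=40.0, wt_expression=10.0)
diff = methylation_difference([10.0, 20.0], [80.0, 90.0])
print(fc, diff)
```

With this convention, a negative methylation difference indicates a promoter that lost methylation in the DKO, and a positive fold change indicates an exon that gained expression.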

Data, available in BED format, is imported into the triplestore using the developed functionalities. The two BED files containing a list of exons for which promoter methylation and expression data are given (169137 unique entries) were converted into two separate tracks (graphs) using a boinq and Blazegraph (v2.1.0) installation on a dedicated server (CentOS version 6.7, Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz with 8 cores, 32 GB RAM). All data is labeled with the Ensembl exon identifier. The first step to find genes of interest is the selection of exons with a significant difference in both their expression and methylation data. For this, an analysis of the distribution of the data is done in R. The distribution of expression data is given using the absolute fold change. A first selection of genes is made using this data, as we are only interested in the genes with both differential methylation and expression values.

An essential element in the use of an RDF environment for data analysis is the INSERT operation. INSERT is a variation on SELECT, in which the variables selected through the main body of the query can be used for the creation of new triples in the triplestore. Indeed, by storing the results of a query in a new graph, it is possible to iterate in smaller steps and thus decrease the need for elaborate queries that are heavy to process. Another advantage is that external databases need only be queried once if the data is used multiple times, giving faster results. Every triplestore has both a query and an update endpoint. The update endpoint, through which data can be sent to manipulate data inside the triplestore, is also used by queries containing the INSERT operation.
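The interaction with an update endpoint can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the endpoint URL is borrowed from the Blazegraph address used elsewhere in this work and may differ per installation, the graph IRI under example.org is hypothetical, and the request follows the form-encoded `update` parameter of the SPARQL 1.1 Protocol. The request is only built, not sent, so the sketch runs without a triplestore.

```python
import urllib.parse
import urllib.request

# Assumed endpoint URL; adapt to the local triplestore installation.
UPDATE_ENDPOINT = "http://localhost:9999/blazegraph/namespace/boinq/sparql"

# Hypothetical INSERT that copies labels into a new result graph.
INSERT_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT { GRAPH <http://example.org/graph/step1> { ?s rdfs:label ?label } }
WHERE  { ?s rdfs:label ?label }
"""


def build_update_request(endpoint, update):
    """Build (but do not send) an HTTP POST carrying a SPARQL update."""
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_update_request(UPDATE_ENDPOINT, INSERT_QUERY)
# urllib.request.urlopen(request) would submit the update to a running triplestore.
```

Keeping each step of a workflow in its own INSERT request of this kind makes it possible to inspect the intermediate graph before the next step is run.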

To select a candidate biomarker from the data, a comparison is made between in-house expression and methylation data from the WT and DKO and data found in public databases. For this, relevant LOD from The Semantic Web is used. By comparing data from other studies, more evidence can be obtained that a given gene is of interest. The following queries are executed directly through the triplestore, as no interface for query building has been implemented yet in boinq.

Ensembl The Ensembl database, just like all other databases maintained by EMBL-EBI, features an LOD version of its database. Although the RDF database is well structured and maintained, it has not yet reached a full implementation of the complete Ensembl dataset. An example is the absence of CDS data. However, it does contain all the data concerning genes, exons and their locations. Furthermore, due to the use of different kinds of unique identifiers by different databases, an extensive list of identifiers is given for every entity. This is, as will become apparent in the following examples, necessary when comparing data over different datasets. Code Example A.7 features the query used for the curation of data of the selected exons. An installation of a local Ensembl database is used, as the public endpoint experienced long periods of downtime. Having retrieved genes for every exon, the data now consists of a selection of genes that are potential biomarkers for colon cancer. To find out whether the selected genes overlap with other research on colon cancer, data is retrieved from DisGeNET, Expression Atlas and TCGA.

DisGeNET DisGeNET is a database connecting research with diseases and genes. A simplified schema of their data structure is shown in Figure 6.1. The complete schema, which is too large to fit on a page, is given at http://www.disgenet.org/ds/DisGeNET/html/images/disgenet-rdf-schema-125.png. Through the use of SPARQL, any number of nodes can be defined to select data that adheres to it. As we are interested in the selection of genes that are recognized to be associated with CRC, an ontology term used for this disease was selected. DisGeNET uses the Medical Subject Headings (MeSH) vocabulary for the identification of diseases. A quick search resulted in the retrieval of http://id.nlm.nih.gov/mesh/D003110 for the identification of colonic neoplasms. Through the query given in Code Example A.8, 4424 genes with associations with CRC were retrieved. DisGeNET retrieves Gene Disease Association (GDA) data from different sources, through curation, prediction and literature. To make a distinction between the level of evidence for every GDA, a score is granted ranging from 0 to 1. Data retrieved from DisGeNET, Expression Atlas and TCGA are stored in separate graphs.

Figure 6.1: A simplified structure of the DisGeNET data structure [26].

Expression Atlas The Expression Atlas database is a collection of expression data gathered in a variety of experiments. In line with the retrieval of data from DisGeNET, all data has been retrieved from cells associated with CRC. Expression Atlas uses a different vocabulary for identifying diseases, called the Experimental Factor Ontology. The term http://www.ebi.ac.uk/efo/EFO_0000365, identified as colorectal adenocarcinoma, is best suited for CRC. P-values are given for each observation of differential expression. The p-values have been adjusted using the false discovery rate correction for multiple testing. Code Example A.9 shows the query used for the import of data.

TCGA An obvious choice is the screening of colon cancer data from TCGA. Public endpoints to access this data are poorly maintained and several are not working (May 2016). RDF dumps of their data are available, yet instructions for the import of data into the triplestore could not be retrieved. As the data is spread over hundreds of separate files, no data from TCGA could be used for this research.


Table 6.2: Data imported into a local RDF environment using the in-house data and LOD. The total gene count of the in-house data is after selection of genes with significant expression and methylation differences. A two by two comparison of genes shared across datasets is given.

Genes present and shared in every dataset
Dataset            #Genes   ∩ DisGeNET   ∩ Expression Atlas
In-house data        1833          404                  823
DisGeNET             4424                              2352
Expression Atlas    11077         2352

Table 6.2 gives an overview of the number of genes in every dataset and the gene entries that are shared. Genes are specified by their NCBI identifier. Once all data is retrieved, the data can be analyzed.
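The pairwise comparison summarized in Table 6.2 amounts to counting shared identifiers between sets of genes. A minimal Python sketch of that counting step is given below; the function name and the gene identifiers are hypothetical placeholders and do not reproduce the numbers in the table.

```python
from itertools import combinations


def pairwise_shared(datasets):
    """Map each pair of dataset names to the number of gene identifiers they share."""
    return {
        (a, b): len(datasets[a] & datasets[b])
        for a, b in combinations(datasets, 2)
    }


# Hypothetical gene identifiers, for illustration only.
datasets = {
    "In-house data": {"g1", "g2", "g3", "g4"},
    "DisGeNET": {"g2", "g3", "g5"},
    "Expression Atlas": {"g3", "g5", "g6"},
}
shared = pairwise_shared(datasets)
for pair, count in shared.items():
    print(pair, count)
```

Using one common identifier scheme for all datasets, here the NCBI identifier, is what makes such a set comparison meaningful.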

6.2.3 Results

Figure 6.2 shows the distribution of the differential methylation and expression data. A local minimum in the methylation data can be observed. Since other methyltransferase proteins exist, it is to be expected that not every exon promoter region is demethylated. This is clearly shown in Figure 6.2, where the local minimum separates a distribution of unchanged methylation values and a distribution with changed methylation values. The local minimum is situated at -22.73. To select only the most significant changes in expression data, we want to select all values situated in the tail of the distribution. A minimum absolute fold change of 1 has not been chosen, as this selects more than one third (33.9281%) of all exons. An absolute fold change of 2, which selects approximately 11% (10.984%) of exons, is deemed a suitable cut-off for which only the most significant values are selected. Differential methylation values lower than -22.73 and an absolute expression fold change larger than 2 are the chosen criteria. The query is shown in Code Example A.6.
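The two selection criteria can be sketched in Python as a simple filter. The cut-off values are the ones reported above; the record layout, the exon names and the example values are hypothetical, and the fold change is assumed to be on the log2 scale described in the material and methods.

```python
METHYLATION_CUTOFF = -22.73  # local minimum of the methylation distribution
FOLD_CHANGE_CUTOFF = 2.0     # minimum absolute (log2) fold change


def select_exons(records):
    """Return exon identifiers that pass both the methylation and expression criteria."""
    return [
        exon_id
        for exon_id, fold_change, methylation in records
        if methylation < METHYLATION_CUTOFF and abs(fold_change) > FOLD_CHANGE_CUTOFF
    ]


# Hypothetical records: (exon identifier, fold change, methylation difference).
records = [
    ("exonA", 3.1, -60.0),   # passes both criteria
    ("exonB", 0.5, -60.0),   # expression change too small
    ("exonC", -2.4, -30.0),  # passes both criteria (downregulated)
    ("exonD", 4.0, -5.0),    # methylation essentially unchanged
]
selected = select_exons(records)
print(selected)
```

Taking the absolute value keeps both strongly upregulated and strongly downregulated exons in the selection.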

Figure 6.2: The distribution of differential methylation (left) and expression (right). The absolute value of the fold change is taken for the expression data.

Of interest are the genes with the largest differences in expression. Code Example A.10 gives the query used to retrieve a list ordered by highest expression. The regulation of those genes in Expression Atlas and their scores in DisGeNET are also given if available. The top ten results are shown in Table 6.4.

Table 6.4: The selection of ten genes for which methylation impacts expression the most.

Results obtained through Code Example A.10
Label       FC        Methylation   P-value     Regulation   Score
TIMP3       12.3750   -37.7740      2.7132E-4   UP           2.7144E-4
IL32        11.5849   -59.4912      2.4937E-5   UP           1.2027E-1
PAGE5       10.7696   -62.9020
NPTX2       10.4534   -85.5591      5.1579E-3   UP
HDGFRP3     10.3798   -87.2696      4.6599E-4   UP
COL4A1      10.3092   -87.2447      3.1173E-3   UP           0.3600
LINC00667   10.2738   -82.6889
LINC00221   10.2313   -84.0186
SOHLH2      10.1901   -75.1615      8.9690E-8   DOWN
ZNF140      10.1346   -45.3473      3.0980E-3   UP

Candidate biomarkers are genes that show significantly different expression and/or methylation patterns in developing CRC compared to healthy cells. As oncogenes are typically upregulated in cancer cells compared to healthy cells, an interest is shown in upregulated expression patterns. Knocking out methyltransferase activity has been shown to reduce factors contributing to tumor growth [57]. Thus, it is to be expected that oncogenes are downregulated while tumor suppressor genes are upregulated in DKO compared to WT. As only genes are selected with a promoter methylation percentage that has gone down, genes showing less expression are bound to be suppressed by other mechanisms, such as tumor suppressor genes. For this reason, the following criteria will focus on genes that are downregulated in DKO. Furthermore, as data is obtained from Expression Atlas, candidate oncogenes can be selected by an upregulation of the genes in cancer cells compared to healthy cells. Code Example A.11 gives the query to list the genes ordered by their expression values. Results are filtered to only show genes with data from Expression Atlas for which upregulation of the gene has been observed. P-values and an optional score from DisGeNET are also shown. As we are looking for new candidate biomarkers, it is not necessary to have the gene registered in DisGeNET. Only the ten genes with the highest downward differential expression and meeting the proposed criteria are discussed. These are shown in Table 6.6.
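The filtering and ordering logic described above can be sketched in Python on rows that have already been retrieved. This is only an illustration of the criteria, not the query itself; the function name, the field names and the example rows are hypothetical.

```python
def rank_candidates(rows, limit=10):
    """Keep genes downregulated in the DKO and reported as UP in cancer cells,
    ordered from the most negative fold change upwards."""
    kept = [
        row for row in rows
        if row["fold_change"] < 0 and row.get("regulation") == "UP"
    ]
    return sorted(kept, key=lambda row: row["fold_change"])[:limit]


# Hypothetical rows; a missing "regulation" key means no Expression Atlas data.
rows = [
    {"label": "geneA", "fold_change": -5.0, "regulation": "UP", "score": 0.24},
    {"label": "geneB", "fold_change": -7.0, "regulation": "UP", "score": None},
    {"label": "geneC", "fold_change": 9.0, "regulation": "UP", "score": 0.10},
    {"label": "geneD", "fold_change": -6.0, "regulation": "DOWN", "score": None},
    {"label": "geneE", "fold_change": -3.0, "score": None},
]
ranked = [row["label"] for row in rank_candidates(rows)]
print(ranked)
```

The DisGeNET score is carried along but not used as a filter, in line with the aim of finding candidates that are not yet registered there.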

Table 6.6: The selection of ten genes with highest differential expression, which are downregulated, have a significant methylation difference and are known to be upregulated in CRC.

Results obtained through Code Example A.11
Label     FC        Methylation   P-value      Score
GRIN2B    -7.1794   -31.3310      1.4233E-4    0.2400
EHF       -6.9060   -46.1462      1.4683E-16
TIAM1     -6.3210   -50.6652      8.4553E-4
DSC3      -6.0103   -42.6699      3.433E-2
EPAS1     -5.3557   -35.1357      1.3316E-3    0.3600
ZNF462    -5.3033   -73.7690      3.4391E-2
SLCO1B3   -5.0295   -35.0112      2.617E-3     0.1200
ZBTB20    -4.4870   -87.0630      3.0116E-2
PDE10A    -4.1334   -83.6438      2.4624E-2    2.7144E-4
THOC2     -4.1334   -31.1591      6.9420E-3
HOXB8     -4.0678   -56.4980      2.2630E-5

6.3 Discussion

6.3.1 Methods

Each file, consisting of approximately 200,000 entries, took about five minutes to convert, with both graphs containing about three million triples at the end of the conversion. Information from The Semantic Web was easily retrieved using public endpoints. No federated query ran for longer than five minutes, whereas local queries only took a couple of seconds. It is important to point out that the use of simple queries is essential for a fast-paced process. Indeed, when trying to combine triples from several endpoints in one query, the processing time increases exponentially. Retrieving data from external endpoints should always happen in separate steps. For this, data can be downloaded to specific graphs. This method ensures that data retrieval is as efficient as possible. Downloaded data can afterwards be retrieved from a local database using the same query construct. Combining local data instead of external data also brings the advantage of performance.

The combination of query elements, which defines the order in which data is retrieved, is of importance when working with (federated) queries. For example, Code Example A.7 retrieves the genes from the Ensembl database using the exon labels located in the user database. As the number of exons had already been restricted to include only the most significant ones, the processing requirements were still acceptable. Yet, if one wants to retrieve the genes for every exon listed in the original dataset, another approach is necessary. The exon label is a literal, and is therefore heavy for the triplestore to process. Unlike an IRI, no specific object handler is given. To process a literal, the query engine is required to evaluate the alignment of the string for every literal that matches the criteria of the query. In this example, Ensembl would have to align several hundred thousand exon labels with the several hundred thousand exon labels stored in the Ensembl RDF dataset. Public endpoints are configured to restrict the amount of processing power they give to incoming queries, and will thus reject the query. A possible approach is querying and storing all the necessary data from the Ensembl database, after which the alignment of strings can be performed by the local triplestore. An easy way to split queries into several parts is to iterate over the different chromosomes.
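Splitting one large retrieval into one query per chromosome can be sketched in Python as follows. This is a minimal sketch, assuming the chromosome IRI pattern and exon class used in Code Examples A.4 and A.5; the list of chromosome names and the function name are illustrative assumptions, and the generated strings would still have to be submitted to an endpoint one by one.

```python
from string import Template

# Assumed set of human chromosome names.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]

# Query pattern modelled on Code Examples A.4 and A.5; $chromosome is filled in per query.
QUERY_TEMPLATE = Template("""\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT ?exon ?label
WHERE {
  ?exon a obo:SO_0000147 ;
        rdfs:label ?label ;
        faldo:location ?location .
  ?location faldo:reference [ rdfs:subClassOf
      <http://rdf.ebi.ac.uk/resource/ensembl/9606/chromosome:$chromosome> ] .
}
""")


def queries_per_chromosome(chromosomes=CHROMOSOMES):
    """Return one SPARQL query string per chromosome name."""
    return {c: QUERY_TEMPLATE.substitute(chromosome=c) for c in chromosomes}


queries = queries_per_chromosome()
print(len(queries))
```

Each partial result can then be stored in the local triplestore, after which the matching of exon labels is done locally.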

The construction of queries can be extensive and built from a large number of elements. Nevertheless, queries are always assembled following the specific data structure of the dataset. In this use case the retrieval of data from these datasets has been realized using the same query patterns, as shown in Code Examples A.7 through A.11. The combinations and placement of elements and extra elements such as filters and optional values have their specific logic. Because of this, it is possible to create a more user-friendly method for the creation of queries. An example of such a tool is SPARQLGraph [61].

6.3.2 Results

A quick observation of both result sets shows no relation between the methylation and expression values. Furthermore, only part of the genes retrieved have entries in DisGeNET. As we are looking for new candidate biomarkers, this is not in itself a problem. Results from Table 6.4 show only upregulated genes, with fold changes all larger than 10. This can be expected from a comparison between two cells of which one has significantly less methylation than the other. In that same table, data from Expression Atlas almost exclusively show that the selected genes are found to be upregulated in experiments comparing cancer cells and healthy cells. As these genes show the highest amounts of additional expression in DKO, it could be expected that the selected genes are tumor suppressor genes. Yet, this is contradicted by data retrieved from Expression Atlas. A literature review of the given elements is in order, and might give us conclusive evidence of the specific role of these genes.

In short, TIMP3 has been shown to be an important tumor suppressor gene [43] which is inhibited by miRNA-191 in the development of CRC [55]. IL-32 has been reported to reduce tumor growth when overexpression is induced in CRC [53]. Yet, other studies have also suggested IL-32 to be overexpressed in CRC [84]. NPTX2 has been shown to inhibit pancreatic cancer growth [86] and SOHLH2 ovarian cancer growth [85]. HDGFRP3 has been reported as a potential angiogenic factor that helps tumor growth. No data about relations between the expression or methylation of long intergenic non-coding RNA 667 and 221 and any cancer type have been found. COL4A1 has been identified as a gene which has shown consistent methylation in CRC [47]. Literature on ZNF140 is also lacking.

The set of results featured in Table 6.4 gave no clear or one-sided view on the expression of the selected genes and their characteristics in the development of CRC. Although most genes showed relevance to some extent to the development of cancers, no clear relation can be shown between the role of the given genes and their differential expression and methylation values.


Table 6.6 features a set of candidate genes that encompass a wide range of expression values, with an FC ranging from -7.1794 to -4.0678. Due to the small number of genes actually having a lowered expression in the DKO, a larger portion of values with an FC smaller than minus two is selected. A significant lowering of the methylation percentage is not expected to coincide with the downregulation of the genes. There are many possible reasons for this, but it is most likely due to the higher expression of tumor suppressor genes. Due to the construction of the query, no contradictions exist between values obtained from Expression Atlas and the in-house data. Reviewing literature sources reveals uniform evidence.

In short, GRIN2B and ZNF462 are suggested to be prone to mutations in CRC [2] [82]. Low expression of EHF has been shown to induce tumorigenic potential of prostate cancer cells. Yet, another study shows that the knockdown of EHF inhibited the proliferation, invasion and tumorigenesis of ovarian cancer cells, where a correlation was found between the survival time of the patient and expression of EHF [18]. The downregulation of TIAM1 has recently been identified to help suppress gastric cancer invasion and growth [42]. DSC3 has been reported to be downregulated in CRC [21], and has been shown to have tumor suppressor activity [20]. High levels of mRNA from EPAS1 have been shown to correlate with relapse and mortality of CRC [48]. A variant of SLCO1B3 has been shown to be expressed in both colon and pancreatic cancer [67]. ZBTB20 has been linked with the promotion of non-small cell lung cancer through repression of FoxO1 [87]. ZBTB20 has also been documented to have increased expression in hepatocellular carcinoma, and is linked with poor prognosis [80]. PDE10A has been reported to be highly expressed and to play an important role in the development of CRC [39], and is recognized as a novel target for inhibition in the prevention of CRC [41]. High levels of HOXB8 expression are documented, with direct correlations to patient survival [63]. Overexpression of the gene is furthermore directly linked to the immortalization of tumor cells [58]. Although not all genes have been specifically linked to colon cancer, specific roles in other cancers point to a possible involvement in CRC. The data retrieved from the HCT116 cell line, literature studies, and data found on Expression Atlas affirm the involvement of specific genes represented in Table 6.6 in the development of CRC.

Although a promising set of results has been given, the selection of a specific biomarker cannot be made. This is because an adequate biomarker does not depend directly on the degree to which expression and methylation values differ. Instead, it displays consistency over a large variety of cancer and healthy cells, retrieved from different patients. This is important as one wants to find a biomarker with a high selectivity and specificity. Furthermore, tests produced will evaluate a set of biomarkers to further the specificity and selectivity of the test. Evaluation of multiple genes is thus required. Due to the criteria composed, genes have been selected that have shown significant expression and methylation patterns retrieved from in-house data and Expression Atlas. Many of the retrieved genes have been reported to show distinct roles in the development of cancer. Unlike the results displayed in Table 6.4, Table 6.6 lists genes with higher expression in CRC, which can thus be used in tests measuring RNA expression. To evaluate whether selected genes can be detected through protein expression, possible post-transcriptional silencing should be evaluated. Thus, the next step in the selection of a potential set of biomarkers is the evaluation of their expression and methylation values over hundreds of samples.

7 Conclusion and Future Prospects

The Semantic Web has, since its creation in 2003, increasingly grown into a technology that fulfills the design goals it was created for. It is followed by a growing community of public databases, which have been gradually adopting the construction of RDF databases. This study has realized a data structure for the integration of biological data formats into a local RDF environment. Through this, researchers can use their own data for analysis on The Semantic Web. The BED and GFF data structures are the most complete, with nearly all elements implemented using community-accepted ontologies. Both the VCF and SAM implementations lack completeness. One of the major challenges to completely integrate all featured elements into RDF is the creation or retrieval of correct ontologies. Only ontologies that explain the entity of a value outside the context of the data format are adequate for use. Only when this design goal is followed is it possible to create an environment where all data from different formats can be seamlessly combined. The schemas created in this study have been translated into a data converter implemented in boinq. As boinq aims to be an open source tool to serve research, it is necessary to share the proposed implementations with the active community, and create an environment where the further creation and development of NGS data structures is community-based.

The implementation of the functionality is one step closer to making The Semantic Web an easily accessible asset for the analysis and management of data. It has furthermore been shown that an RDF environment can be used to iterate through different steps of a workflow. This requires a high level of knowledge, and cannot be performed by people unfamiliar with SPARQL or the data structures of LOD databases. This obstacle can be overcome with the creation of a tool that makes query building for external databases intuitive. An example of a project aiming towards this is SPARQLGraph [61]. When integrated in boinq, additional functionalities are required, with a focus on query building that enables an iterative process of data collection and analysis in a local RDF environment. A feature-complete boinq could thus become one of the first applications that enables a practical use of The Semantic Web for bioinformatics purposes.


A Code Examples

PREFIXES USED
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo1: <http://purl.obolibrary.org/obo/so-xp.obo#>
PREFIX obo2: <http://purl.obolibrary.org/obo/>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>

Code Example A.1: A set of queries executed on the triplestore and sparql-bed.

#Feature count (boinq), Result: ?count = 15
SELECT (COUNT(?p) AS ?count)
WHERE{?p <http://biohackathon.org/resource/faldo#location> ?o. }

#Feature count (sparql-bed), Result: ?count = 15
SELECT (COUNT(?p) AS ?count)
WHERE{?p rdf:type <http://biohackathon.org/resource/faldo#Region>.}

#Location interval (boinq), Result: ?count = 5
SELECT (COUNT(?feature) AS ?count)
WHERE {?feature faldo:location ?p .
       ?p faldo:begin [faldo:position ?x];
          faldo:end [faldo:position ?y].
       FILTER(?y<178999999 && ?x>178908612)}

#Location interval (sparql-bed), Result: ?count = 5
SELECT (COUNT(?p) AS ?count)
WHERE { ?p faldo:begin [faldo:position ?x];
           faldo:end [faldo:position ?y].
        FILTER(?y<178999999 && ?x>178908612)}

Code Example A.2: A query executed on the triplestore and sparql-vcf.

#Feature count (boinq), Result: ?count = 99
SELECT (COUNT(?p) AS ?count)
WHERE{ ?p <http://biohackathon.org/resource/faldo#location> ?o. }

#Feature count (sparql-vcf), Result: ?count = 99
SELECT (COUNT(?p) AS ?count)
WHERE { ?p ?o <http://biohackathon.org/resource/vcf#Feature>}


Code Example A.3: The configuration of the trackList.json file loaded by JBrowse. The SPARQL query must be written on one single line.

...
{
  "label": "SPARQLGene",
  "key": "SPARQL Genes",
  "style" : {
    "className" : "gene",
    "histScale" : 2,
    "featureCss" : "background-color: #66F; height: 8px",
    "histCss" : "background-color: #88F",
    "height" : "500"
  },
  "storeClass": "JBrowse/Store/SeqFeature/SPARQL",
  "type": "JBrowse/View/Track/HTMLFeatures",
  "urlTemplate": "http://localhost:9999/blazegraph/namespace/boinq/sparql",
  "queryTemplate": "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> prefix xsd: <http://www.w3.org/2001/XMLSchema#> prefix obo: <http://purl.obolibrary.org/obo/> prefix faldo: <http://biohackathon.org/resource/faldo#> SELECT ?start ?end ?strand (?feature as ?uniqueID) ?name WHERE{ ?feature faldo:location ?location . OPTIONAL{ ?feature rdfs:label ?name}. ?location faldo:begin [faldo:position ?start]; faldo:end [faldo:position ?end; rdf:type ?strandpos] . FILTER(?strandpos = faldo:ForwardStrandPosition || ?strandpos= faldo:ReverseStrandPosition). BIND(xsd:integer(IF(?strandpos = faldo:ForwardStrandPosition,1,IF(?strandpos = faldo:ReverseStrandPosition,-1,1))) as ?strand)}"
},
...

Code Example A.4: The SPARQL query used for the retrieval of the genes on which SNPs are located.

SELECT ?object ?idx ?id ?label
WHERE {
  {
    SELECT ?begin ?end ?gene ?id ?label {
      SERVICE<http://localhost:8080/bigdata/namespace/Ensembl> {
        ?gene dcterms:description ?desc ;
              dcterms:identifier ?id ;
              rdfs:label ?label ;
              a ?type ;
              faldo:location ?location .
        ?location faldo:reference [rdfs:subClassOf
            <http://rdf.ebi.ac.uk/resource/ensembl/9606/chromosome:1>].

        ?location faldo:begin [faldo:position ?begin] .
        ?location faldo:end [faldo:position ?end] .
      }}
  }
  ?object <http://www.biointerchange.org/gfvo#Identifier> ?idx;
          faldo:location ?location.
  ?location faldo:begin [faldo:position ?beginx] ;
            faldo:end [faldo:position ?endx] .
  FILTER(?endx<?end && ?beginx>?begin)
}


Code Example A.5: The SPARQL query used for the retrieval of the exons on which SNPs are located.

SELECT ?object ?idx ?exon ?label ?genelabel
WHERE {
  {
    SELECT ?exon ?label ?genelabel ?begin ?end {
      SERVICE<http://localhost:8080/bigdata/namespace/Ensembl> {
        ?exon a obo:SO_0000147 ;
              rdfs:label ?label ;
              faldo:location ?location .
        ?location faldo:reference [rdfs:subClassOf
            <http://rdf.ebi.ac.uk/resource/ensembl/9606/chromosome:1>].

        ?gene obo:SO_has_part ?exon ;
              dcterms:description ?desc ;
              rdfs:label ?genelabel .

        ?location faldo:begin [faldo:position ?begin] .
        ?location faldo:end [faldo:position ?end] .
      }}
  }
  ?object <http://www.biointerchange.org/gfvo#Identifier> ?idx;
          faldo:location ?location.
  ?location faldo:begin [faldo:position ?beginx] ;
            faldo:end [faldo:position ?endx] .
  FILTER(?endx<?end && ?beginx>?begin)
}

Code Example A.6: The query used for the selection of exons with significant difference in methylation and expression.

INSERT{
  GRAPH<http://www.boinq.org/iri/graph/resultset_selected_data> {
    ?o rdfs:label ?s
  }
}
WHERE{
  GRAPH <http://www.boinq.org/iri/graph/local#1_111> {
    #EXPRESSION DATA
    ?o rdfs:label ?s ;
       obo1:has_quality [rdf:value ?expresion] .
  }

  GRAPH <http://www.boinq.org/iri/graph/local#1_112> {
    #METHYLATION DATA
    ?w rdfs:label ?s ;
       obo1:has_quality [rdf:value ?methylation] .
  }
  # the filter uses the ?methylation variable bound in the methylation graph above
  FILTER(?methylation<"-22.73007"^^xsd:decimal && (?expresion>"4.0"^^xsd:decimal ||
         ?expresion<"0.25"^^xsd:decimal))
}


Code Example A.7: The query used for the curation of gene data for every exon.

INSERT{
  GRAPH<http://www.boinq.org/iri/graph/resultset_selected_data> {
    ?o rdfs:label ?s.
    ?o obo1:part_of ?gene.
    ?gene rdfs:label ?label;
          rdf:type ?genetype ;
          rdfs:seeAlso ?link .
    ?link a <http://identifiers.org/ncbigene> .
  }
}
WHERE{
  GRAPH<http://www.boinq.org/iri/graph/resultset_selected_data> {
    ?o rdfs:label ?s
  }
  SERVICE<http://localhost:8080/bigdata/namespace/Ensembl/sparql>{
    ?exon dcterms:identifier ?s.
    ?transcript obo2:SO_has_part ?exon;
                obo2:SO_transcribed_from ?gene.
    ?gene rdfs:label ?label ;
          rdf:type ?genetype .
    OPTIONAL{
      {?gene <http://rdf.ebi.ac.uk/terms/ensembl/DEPENDENT> ?link } UNION {
       ?gene <http://rdf.ebi.ac.uk/terms/ensembl/DIRECT> ?link } .
      ?link a ?type .}
  }}

Code Example A.8: The query used to retrieve gene disease associations with their relevant genes and scores from DisGeNET.

INSERT{
  GRAPH<http://www.boinq.org/iri/graph/resultset_coloncancer_genes_disgenet> {
    ?gene rdfs:label ?title .
    ?gda sio:SIO_000628 ?gene .
    ?gda sio:SIO_000216 ?scoreIRI .
    ?scoreIRI sio:SIO_000300 ?score .
  }
}
WHERE{
  {
    # every variable used in the INSERT template has to be projected by the subquery
    SELECT ?gene ?title ?gda ?scoreIRI ?score WHERE{
      SERVICE<http://rdf.disgenet.org/sparql/>{
        ?disease skos:exactMatch <http://id.nlm.nih.gov/mesh/D003110> .
        ?gda sio:SIO_000628 ?gene,?disease ;
             sio:SIO_000216 ?scoreIRI .
        ?gene rdf:type ncit:C16612 ;
              sio:SIO_000205 [dcterms:title ?title] .
        ?scoreIRI sio:SIO_000300 ?score .
      }}
  }}


Code Example A.9: The query used to retrieve expression data with their relevant p-values from data featuring CRC cells on Expression Atlas.

INSERT {
  GRAPH<http://www.boinq.org/iri/graph/resultset_coloncancer_genes_disgenet> {
    ?gda sio:SIO_000628 ?geneID .
    ?gda sio:SIO_000216 ?scoreIRI .
    ?scoreIRI sio:SIO_000300 ?score .
  }
}
WHERE{
  GRAPH<http://www.boinq.org/iri/graph/resultset_coloncancer_genes_expressionAtlas_pE0>{
    ?value atlasterms:pValue ?pvalue .
    ?value atlasterms:isMeasurementOf ?probe ;
           rdfs:label ?expressionValue .
    ?probe atlasterms:dbXref ?uniprot .
    ?geneID sio:SIO_010078 ?uniprot .
  }
}

Code Example A.10: The query used to retrieve the genes that are downregulated with relevantp-values and DisGeNET scores. Genes are ordered by fold change in decreasing order.

SELECT DISTINCT ?label ?expression ?methylation ?pvalue ?score
WHERE {
  GRAPH <http://www.boinq.org/iri/graph/resultset_selected_data> {
    ?gene rdf:type obo2:SO_0000704 ;
          rdfs:seeAlso ?geneID .
    ?geneID a <http://identifiers.org/ncbigene> .
    ?exon obo1:part_of ?gene ;
          rdfs:label ?exonlabel .
    ?gene rdfs:label ?label .
  }
  OPTIONAL {
    GRAPH <http://www.boinq.org/iri/graph/resultset_coloncancer_genes_disgenet> {
      ?gda sio:SIO_000628 ?geneID .
      ?gda sio:SIO_000216 ?scoreIRI .
      ?scoreIRI sio:SIO_000300 ?score .
    }
  }
  OPTIONAL {
    GRAPH <http://www.boinq.org/iri/graph/resultset_coloncancer_genes_expressionAtlas_pE0> {
      ?value atlasterms:pValue ?pvalue .
      ?value atlasterms:isMeasurementOf ?probe ;
             rdfs:label ?expressionValue .
      ?probe atlasterms:dbXref ?uniprot .
      ?geneID sio:SIO_010078 ?uniprot .
    }
  }
  GRAPH <http://www.boinq.org/iri/graph/local#1_114> {
    ?exon1 rdfs:label ?exonlabel ;
           obo1:has_quality [ rdf:value ?expression ] .
  }
  GRAPH <http://www.boinq.org/iri/graph/local#1_115> {
    ?exon2 rdfs:label ?exonlabel ;
           obo1:has_quality [ rdf:value ?methylation ] .
  }
  FILTER regex(str(?expressionValue), "UP")
}
ORDER BY ?expression


Code Example A.11: The query used to retrieve the genes that are downregulated with relevant p-values and DisGeNET scores. Genes are ordered by fold change in decreasing order.

SELECT DISTINCT ?label ?expression ?methylation ?pvalue ?score
WHERE {
  GRAPH <http://www.boinq.org/iri/graph/resultset_selected_data> {
    ?gene rdf:type obo2:SO_0000704 ;
          rdfs:seeAlso ?geneID .
    ?geneID a <http://identifiers.org/ncbigene> .
    ?exon obo1:part_of ?gene ;
          rdfs:label ?exonlabel .
    ?gene rdfs:label ?label .
  }
  OPTIONAL {
    GRAPH <http://www.boinq.org/iri/graph/resultset_coloncancer_genes_disgenet> {
      ?gda sio:SIO_000628 ?geneID .
      ?gda sio:SIO_000216 ?scoreIRI .
      ?scoreIRI sio:SIO_000300 ?score .
    }
  }
  OPTIONAL {
    GRAPH <http://www.boinq.org/iri/graph/resultset_coloncancer_genes_expressionAtlas_pE0> {
      ?value atlasterms:pValue ?pvalue .
      ?value atlasterms:isMeasurementOf ?probe ;
             rdfs:label ?expressionValue .
      ?probe atlasterms:dbXref ?uniprot .
      ?geneID sio:SIO_010078 ?uniprot .
    }
  }
  GRAPH <http://www.boinq.org/iri/graph/local#1_114> {
    ?exon1 rdfs:label ?exonlabel ;
           obo1:has_quality [ rdf:value ?expression ] .
  }
  GRAPH <http://www.boinq.org/iri/graph/local#1_115> {
    ?exon2 rdfs:label ?exonlabel ;
           obo1:has_quality [ rdf:value ?methylation ] .
  }
}
ORDER BY DESC (abs(?expression))
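Code Examples A.10 and A.11 differ in their final ordering clause: the first sorts on the raw expression value (ascending, the SPARQL default), the second on its absolute value in descending order. The following minimal Python sketch illustrates the difference. The gene labels and fold-change values are hypothetical and do not come from the thesis data.

```python
# Hypothetical result rows: (gene label, expression fold change).
rows = [("GENE_A", -3.2), ("GENE_B", 0.4), ("GENE_C", 2.5), ("GENE_D", -0.9)]


def order_ascending(rows):
    """Mimic ORDER BY ?expression: the most negative fold change comes first."""
    return sorted(rows, key=lambda row: row[1])


def order_by_magnitude(rows):
    """Mimic ORDER BY DESC(abs(?expression)): the largest change comes first,
    regardless of its direction."""
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


ascending = order_ascending(rows)
by_magnitude = order_by_magnitude(rows)
```

With the first ordering, strongly downregulated genes lead the list; with the second, strongly up- and downregulated genes are interleaved by the size of the change.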

B Tables

Table B.1: The different feature types found in VCF files.

VCF schema: feature type ontologies

| Type | Ontology IRI | Definition |
|---|---|---|
| INDEL | http://purl.obolibrary.org/obo/SO_1000032 | A sequence alteration which included an insertion and a deletion, affecting 2 or more bases. |
| MIXED | http://purl.obolibrary.org/obo/SO_0000667 | The sequence of one or more nucleotides added between two adjacent nucleotides in the sequence. |
| | http://purl.obolibrary.org/obo/SO_0000159 | The point at which one or more contiguous nucleotides were excised. |
| MNP | http://purl.obolibrary.org/obo/SO_0001013 | A multiple nucleotide polymorphism with alleles of common length bigger than 1, for example AAA/TTT. |
| NO_VARIATION | http://purl.obolibrary.org/obo/SO_0000347 | A match against a nucleotide sequence. |
| SNP | http://purl.obolibrary.org/obo/SO_0000694 | SNPs are single base pair positions in genomic DNA at which different sequence alternatives exist in normal individuals in some population(s), wherein the least frequent variant has an abundance of 1 |
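As an illustration of how the mapping in Table B.1 could be applied during conversion, the sketch below stores it as a Python lookup table. This is not boinq's implementation. The function name `so_iris_for` is hypothetical, and the assumption that MIXED maps to both the insertion and the deletion term follows the layout of the table.

```python
SO = "http://purl.obolibrary.org/obo/"

# Feature types from Table B.1 mapped to their Sequence Ontology IRIs.
# MIXED is assumed to map to two terms (insertion and deletion).
VCF_TYPE_TO_SO = {
    "INDEL": [SO + "SO_1000032"],
    "MIXED": [SO + "SO_0000667", SO + "SO_0000159"],
    "MNP": [SO + "SO_0001013"],
    "NO_VARIATION": [SO + "SO_0000347"],
    "SNP": [SO + "SO_0000694"],
}


def so_iris_for(vcf_type):
    """Return the SO IRIs for a VCF feature type, or an empty list if unknown.

    The lookup ignores case and accepts spaces in place of underscores.
    """
    key = vcf_type.strip().upper().replace(" ", "_")
    return VCF_TYPE_TO_SO.get(key, [])
```

Returning a list keeps the single-term and two-term cases uniform, so a caller can emit one `rdf:type` statement per IRI without special handling for MIXED.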


Table B.2: Recognized fields for feature types in the GFF format

GFF schema: feature type ontologies

| Field value | Ontology IRI | Definition |
|---|---|---|
| CDS | http://purl.obolibrary.org/obo/SO_0000316 | A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon. |
| GENE | http://purl.obolibrary.org/obo/SO_0000704 | A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions. |
| MRNA | http://purl.obolibrary.org/obo/SO_0000234 | Messenger RNA is the intermediate molecule between DNA and protein. It includes UTR and coding sequences. It does not contain introns. |
| CDNA | http://purl.obolibrary.org/obo/SO_0000756 | DNA synthesized by reverse transcriptase using RNA as a template. |
| OPERON | http://purl.obolibrary.org/obo/SO_0000178 | A group of contiguous genes transcribed as a single (polycistronic) mRNA from a single regulatory region. |
| PROMOTOR | http://purl.obolibrary.org/obo/SO_0000167 | A regulatory region composed of the TSS(s) and binding sites for TF complexes of the basal transcription machinery. |
| TF_BINDING_SITE | http://purl.obolibrary.org/obo/SO_0000235 | A region of a nucleotide molecule that binds a Transcription Factor or Transcription Factor complex. |
| THREE_PRIME_UTR | http://purl.obolibrary.org/obo/SO_0000205 | A region at the 3’ end of a mature transcript (following the stop codon) that is not translated into a protein. |
| FIVE_PRIME_UTR | http://purl.obolibrary.org/obo/SO_0000204 | A region at the 5’ end of a mature transcript (preceding the initiation codon) that is not translated into a protein. |
| INTRON | http://purl.obolibrary.org/obo/SO_0000188 | A region of a primary transcript that is transcribed, but removed from within the transcript by splicing together the sequences (exons) on either side of it. |
| EXON | http://purl.obolibrary.org/obo/SO_0000147 | A region of the transcript sequence within a gene which is not removed from the primary RNA transcript by RNA splicing. |
| TRANSCRIPT | http://purl.obolibrary.org/obo/SO_0000673 | An RNA synthesized on a DNA or RNA template by an RNA polymerase. |
| REGION | http://purl.obolibrary.org/obo/SO_0000001 | A sequence feature with an extent greater than zero. A nucleotide region is composed of bases and a polypeptide region is composed of amino acids. |
| START_CODON | http://purl.obolibrary.org/obo/SO_0000318 | First codon to be translated by a ribosome. |
| STOP_CODON | http://purl.obolibrary.org/obo/SO_0000319 | In mRNA, a set of three nucleotides that indicates the end of information for protein synthesis. |
| NCRNA | http://purl.obolibrary.org/obo/SO_0000655 | An RNA transcript that does not encode for a protein rather the RNA molecule is the gene product. |
| TRNA | http://purl.obolibrary.org/obo/SO_0000253 | Transfer RNA (tRNA) molecules are approximately 80 nucleotides in length. Their secondary structure includes four short double-helical elements and three loops (D, anti-codon, and T loops). Further hydrogen bonds mediate the characteristic L-shaped molecular structure. Transfer RNAs have two regions of fundamental functional importance: the anti-codon, which is responsible for specific mRNA codon recognition, and the 3’ end, to which the tRNA’s corresponding amino acid is attached (by aminoacyl-tRNA synthetases). |
| RRNA | http://purl.obolibrary.org/obo/SO_0000252 | RNA that comprises part of a ribosome, and that can provide both structural scaffolding and catalytic activity. |
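To show how a recognized field value from Table B.2 might translate into RDF, the sketch below builds an N-Triples `rdf:type` statement for a feature. It is an assumption-laden illustration rather than boinq's conversion code: the function `type_triple` and the example feature IRI are hypothetical, and only a subset of the table is included.

```python
SO = "http://purl.obolibrary.org/obo/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# A subset of Table B.2; the remaining rows follow the same pattern.
GFF_TYPE_TO_SO = {
    "GENE": "SO_0000704",
    "EXON": "SO_0000147",
    "TRANSCRIPT": "SO_0000673",
    "CDS": "SO_0000316",
    "FIVE_PRIME_UTR": "SO_0000204",
    "THREE_PRIME_UTR": "SO_0000205",
}


def type_triple(feature_iri, gff_type):
    """Return an N-Triples rdf:type statement for a feature.

    Returns None when the GFF feature type is not in the lookup table.
    """
    key = gff_type.strip().upper().replace(" ", "_")
    term = GFF_TYPE_TO_SO.get(key)
    if term is None:
        return None
    return "<{}> <{}> <{}{}> .".format(feature_iri, RDF_TYPE, SO, term)


# Hypothetical feature IRI, used only for illustration.
example = type_triple("http://example.org/feature/1", "exon")
```

Unrecognized types return `None` rather than raising, which leaves the caller free to skip the feature or fall back to a generic type such as REGION.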


Table B.3: Object types created for every key found in the [INFO] field, used in the VCF schema. The namespace prefix gfvo is used to replace 'http://www.biointerchange.org/gfvo#'.

VCF schema: attribute type ontologies

| Key value | Ontology IRI | Definition |
|---|---|---|
| AA | gfvo:AncestralSequence | Denotes an ancestral allele of a feature. May be used to denote the "ancestral allele" ("AA" additional information) of VCF formatted files. |
| AC | gfvo:AlleleCount | Count of a specific allele in genotypes. Encodes for "AC" additional information in VCF files. |
| AF | gfvo:AlleleFrequency | Proportion of a particular gene allele in a gene pool or genotype. Encodes for "AF" additional information in VCF files. |
| AN | gfvo:TotalNumberOfAlleles | Total number of alleles in called genotypes. Encodes for "AN" additional information in VCF files. |
| BQ | gfvo:BaseQuality | Root mean square base quality. Accounts for "BQ" additional information in VCF files. |
| DB, H2, H3, 1000G | gfvo:ExternalReference | A cross-reference to associate an entity to a representation in another database. Encodes for the "Dbxref" attribute in GFF3 and GVF. Can be used to describe the contents of the "source" column in GTF files. Captures the "genome-build" pragma, "source-method", "attribute-method", "phenotype-description", and "phased-genotypes" structured pragmas in GVF. Accounts for the "assembly" and "pedigreeDB" information fields, and "DB", "H2", "H3", "1000G" additional information in VCF. |
| DP | gfvo:Coverage | Number of nucleic acid sequence reads for a particular genomic locus (a region or single base pair). Accounts for "DP" additional information in VCF files. |
| MQ | gfvo:MappingQuality | Root mean square mapping quality. Encodes values of the "MQ" additional information in VCF files. |
| MQ0 | gfvo:NumberOfReads | Number of reads supporting a particular feature or variant. Can encode for "MQ0" additional information in VCF files, if additional annotations are provided to denote a mapping quality of zero for the given count. In GVF files, the class accounts for the "Variant reads" attribute. |
| NS | gfvo:SampleCount | Number of samples in the dataset. Encodes for "NS" additional information in VCF files. |
| SB | gfvo:Note | A note is a short textual description about an entity. It provides a formal or semi-formal description of an entity, as opposed to a "Comment". Encodes for the "sample-description" pragma and "Comment" key/value pairs in structured attributes in GVF. Captures "Description" key/value pairs in information fields and "SB" information field in VCF. |
| SOMATIC | gfvo:SomaticCell | The somatic feature class captures information about genomic sequence features arising from somatic cells. Encodes for "genomic-source" pragma in GVF and "SOMATIC" additional information in VCF. |
| VALIDATED | gfvo:ExperimentalMethod | An experimental method is a procedure that yields an experimental outcome (result). Experimental methods can be in vivo, in vitro or in silico procedures that are well described and can be referenced. Encodes for "source" column contents of GFF3, GTF, and GVF file formats as well as the "CHROM" column in VCF. Can be used to describe the "capture-method" pragma in GVF; it can describe "VALIDATED" additional information in VCF. |
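The key-to-class mapping of Table B.3 can be sketched as a lookup applied to a raw VCF INFO string. The code below is a minimal illustration under stated assumptions, not the way boinq performs the conversion: the function `classify_info` and the sample INFO string are hypothetical, and keys absent from the table are simply skipped.

```python
GFVO = "http://www.biointerchange.org/gfvo#"

# INFO keys from Table B.3 mapped to the local name of their gfvo class.
INFO_KEY_TO_GFVO = {
    "AA": "AncestralSequence",
    "AC": "AlleleCount",
    "AF": "AlleleFrequency",
    "AN": "TotalNumberOfAlleles",
    "BQ": "BaseQuality",
    "DB": "ExternalReference",
    "H2": "ExternalReference",
    "H3": "ExternalReference",
    "1000G": "ExternalReference",
    "DP": "Coverage",
    "MQ": "MappingQuality",
    "MQ0": "NumberOfReads",
    "NS": "SampleCount",
    "SB": "Note",
    "SOMATIC": "SomaticCell",
    "VALIDATED": "ExperimentalMethod",
}


def classify_info(info_field):
    """Map each recognized key of a VCF INFO string to (gfvo class IRI, value).

    Flag keys without '=' get the value None; unrecognized keys are skipped.
    """
    result = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, _, value = entry.partition("=")
        if key in INFO_KEY_TO_GFVO:
            result[key] = (GFVO + INFO_KEY_TO_GFVO[key], value if value else None)
    return result


# Hypothetical INFO column; "XX" stands for a key that Table B.3 does not list.
annotations = classify_info("NS=3;DP=14;AF=0.5;DB;XX=1")
```

Flag-style keys such as DB carry no value in VCF, so the sketch records them with `None` instead of an empty string.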


C Figures

Figure C.1: The options offered by boinq before converting a supported file format. Mapping options, which set the type of the main feature and sub-feature, are visible only when converting BED files. The options shown are those selected for the conversion of Code Example 5.1.


Figure C.2: Boinq offers a browser for the selection of a specific feature type. This browser supports all types given by the Sequence Ontology vocabulary. A search bar is present.

Figure C.3: The track properties as displayed by boinq after the conversion of Code Example 5.1.


Figure C.4: A genomic region on the first chromosome of the Homo sapiens assembly (GRCh38) as displayed in the UCSC browser. This exact region has been downloaded in BED format and imported into an RDF triplestore using boinq.

Figure C.5: JBrowse displaying the genomic region downloaded from the UCSC browser. The specific BED file is featured in Code Example 5.8. JBrowse can access the SPARQL endpoint of the triplestore using the configuration shown in Code Example A.3.


References

[1] 1000Genomes. VCF (Variant Call Format) version4.3. http://samtools.github.io/hts-specs/VCFv4.3.pdf. Online; accessed Dec 25 2015.

[2] H. Alakus, M. L. Babicky, P. Ghosh, S. Yost, K. Jepsen,Y. Dai, A. Arias, M. L. Samuels, E. S. Mose, R. B.Schwab, M. R. Peterson, A. M. Lowy, K. A. Frazer,and O. Harismendy. Correction: Genome-wide mutationallandscape of mucinous carcinomatosis peritonei of appen-diceal origin. Genome Med, 6(7):53, 2014.

[3] E. Antezana, W. Blonde, M. Egana, A. Rutherford,R. Stevens, B. De Baets, V. Mironov, and M. Kuiper.BioGateway: a semantic systems biology tool for the lifesciences. BMC Bioinformatics, 10 Suppl 10:S11, 2009.

[4] Apache Jena. Apache Jena. https://jena.apache.org/.Online; accessed December 12 2015.

[5] Apache Jena. Fuseki: serving RDF data over HTTP. https://jena.apache.org/documentation/serving_data/. Online; accessed December 12 2015.

[6] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker,B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang,R. Lopez, M. Magrane, M. J. Martin, D. A. Natale,C. O’Donovan, N. Redaschi, and L. S. Yeh. UniProt:the Universal Protein knowledgebase. Nucleic Acids Res.,32(Database issue):D115–119, Jan 2004.

[7] C. Balch, J. S. Montgomery, H. I. Paik, S. Kim, S. Kim,T. H. Huang, and K. P. Nephew. New anti-cancer strate-gies: epigenetic therapies and biomarkers. Front. Biosci.,10:1897–1931, 2005.

[8] A. Bandrowski et al. The Ontology for Biomedical Inves-tigations. PLoS ONE, 11(4):e0154556, 2016.

[9] J. Baran, B. S. Durgahee, K. Eilbeck, E. Antezana,R. Hoehndorf, and M. Dumontier. GFVO: the GenomicFeature and Variation Ontology. PeerJ, 3:e933, 2015.

[10] J. Baran, B. S. Durgahee, K. Eilbeck, E. Antezana,R. Hoehndorf, and M. Dumontier. GFVO: the GenomicFeature and Variation Ontology. PeerJ, 3:e933, 2015.

[11] T. Berners-Lee. The semantic web. Scientific AmericanMagazin, 17, 5 2001.

[12] T. Berners-Lee. Linked data design issues. http://www.w3.org/DesignIssues/LinkedData, 2009. Online; ac-cessed December 10 2015.

[13] J. Bolleman. sparql-bed Github. https://github.com/JervenBolleman/sparql-bed. Online; accessed April2016.

[14] J. Bolleman. sparql-vcf Github. https://github.com/JervenBolleman/sparql-vcf. Online; accessed April2016.

[15] J. Bolleman, C. J. Mungall, F. Strozzi, J. Barran, M. Du-montier, R. J. P. Bonnal, R. Buels, R. Hoendorf, T. Fuji-sawa, T. Katayama, and P. J. A. Cock. Faldo: A semanticstandard for describing the location of nucleotide and pro-tein feature annotation. bioRxiv, 2014.

[16] R. Buels, E. Yao, C. M. Diesh, R. D. Hayes, M. Munoz-Torres, G. Helt, D. M. Goodstein, C. G. Elsik, S. E. Lewis,L. Stein, and I. H. Holmes. JBrowse: a dynamic web plat-form for genome visualization and analysis. Genome Biol.,17(1):66, 2016.

[17] A. Callahan, J. Cruz-Toledo, and M. Dumontier.Ontology-Based Querying with Bio2RDF’s Linked OpenData. J Biomed Semantics, 4 Suppl 1:S1, Apr 2013.

[18] Z. Cheng, J. Guo, L. Chen, N. Luo, W. Yang, and X. Qu.Knockdown of EHF inhibited the proliferation, invasionand tumorigenesis of ovarian cancer cells. Mol. Carcinog.,55(6):1048–1059, Jun 2016.

[19] L. Clarke et al. The 1000 Genomes Project: data man-agement and community access. Nat. Methods, 9(5):459–462, May 2012.

[20] T. Cui, Y. Chen, L. Yang, T. Knosel, O. Huber,M. Pacyna-Gengelbach, and I. Petersen. The p53 tar-get gene desmocollin 3 acts as a novel tumor suppres-sor through inhibiting EGFR/ERK pathway in human lungcancer. Carcinogenesis, 33(12):2326–2333, Dec 2012.

[21] T. Cui, Y. Chen, L. Yang, T. Knosel, K. Zoller, O. Huber,and I. Petersen. DSC3 expression is regulated by p53, andmethylation of DSC3 DNA is a prognostic marker in hu-man colorectal cancer. Br. J. Cancer, 104(6):1013–1019,Mar 2011.

[22] F. Cunningham et al. Ensembl 2015. Nucleic Acids Res.,43(Database issue):D662–669, Jan 2015.

[23] R. Cyganiak and A. Jentzsch. The Linking Open Datacloud diagram. http://lod-cloud.net/, 2014. Online;accessed October 11 2015.

[24] M. Devisscher. Boinq, Github. https://github.com/mr-tijn/boinq2/tree/master/ontologies.

[25] M. Devisscher, T. D. Meyer, W. V. Criekinge, andP. Dawyndt. An ontology based query engine for queryingbiological sequences. EMBnet.journal, 19(B), 2013.

[26] DisGeNET. disgenet2r: an R package to ex-plore the molecular underpinnings of human dis-eases. http://www.disgenet.org/ds/DisGeNET/html/dissemination/disgenet2r-JBI-valencia-2016.pdf.

[27] Dublin Core Metadata Initiative. DCMI MetadataTerms. http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#. Online; accessed December10 2015.

[28] A. M. Dworkin, T. H. Huang, and A. E. Toland. Epi-genetic alterations in the breast: Implications for breastcancer detection, prognosis and treatment. Semin. Can-cer Biol., 19(3):165–171, Jun 2009.

[29] K. Eilbeck, S. E. Lewis, C. J. Mungall, M. Yandell,L. Stein, R. Durbin, and M. Ashburner. The SequenceOntology: a tool for the unification of genome annota-tions. Genome Biol., 6(5):R44, 2005.

[30] EMBI-EBI. About the Ensembl Project. http://www.ensembl.org/info/about/index.html, 2015. Online;accessed December 12 2015.

[31] EMBL-EBI. GFF/GTF File Format - Definition andsupported options. http://www.ensembl.org/info/website/upload/gff.html. Online; accessed December13 2015.

[32] EMBL-EBI. EMBL-EBI RDF Platform. https://www.ebi.ac.uk/rdf/, 2015. Online; accessed December 122015.

79

80 REFERENCES

[33] M. Gonzalez-Pons and M. Cruz-Correa. Colorectal Can-cer Biomarkers: Where Are We Now? Biomed Res Int,2015:149014, 2015.

[34] B. I. Group. DisGeNET Database Information. http://www.disgenet.org/web/DisGeNET/menu/dbinfo, 2015.Online; accessed December 12 2015.

[35] Jannovar. Jannovar Home Page. http://charite.github.io/jannovar/. Online; accessed April 15 2016.

[36] JHipster. JHipster Home Page, Github. http://jhipster.github.io/. Online; accessed April 16 2016.

[37] S. W. Jiang, J. Li, K. Podratz, and S. Dowdy. Applica-tion of DNA methylation biomarkers for endometrial can-cer management. Expert Rev. Mol. Diagn., 8(5):607–616,Sep 2008.

[38] Y. Kodama, M. Shumway, and R. Leinonen. The SequenceRead Archive: explosive growth of sequencing data. Nu-cleic Acids Res., 40:D54–56, Jan 2012.

[39] K. Lee, A. S. Lindsey, N. Li, B. Gary, J. Andrews, A. B.Keeton, and G. A. Piazza. ÃŐÂš-catenin nuclear translo-cation in colorectal cancer cells is suppressed by PDE10Ainhibition, cGMP elevation, and activation of PKG. On-cotarget, 7(5):5353–5365, Feb 2016.

[40] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan,N. Homer, G. Marth, G. Abecasis, and R. Durbin. TheSequence Alignment/Map format and SAMtools. Bioin-formatics, 25(16):2078–2079, Aug 2009.

[41] N. Li, K. Lee, Y. Xi, et al. Phosphodiesterase 10A: a noveltarget for selective inhibition of colon tumor cell growthand ÃŐÂš-catenin-dependent TCF transcriptional activity.Oncogene, 34(12):1499–1509, Mar 2015.

[42] Z. Li, X. Yu, Y. Wang, J. Shen, W. K. Wu, J. Liang,and F. Feng. By downregulating TIAM1 expression,microRNA-329 suppresses gastric cancer invasion andgrowth. Oncotarget, 6(19):17559–17569, Jul 2015.

[43] H. Lin, Y. Zhang, H. Wang, D. Xu, X. Meng, Y. Shao,C. Lin, Y. Ye, H. Qian, and S. Wang. Tissue inhibitorof metalloproteinases-3 transfer suppresses malignant be-haviors of colorectal cancer cells. Cancer Gene Ther.,19(12):845–851, Dec 2012.

[44] T. Lindgren, T. Stigbrand, A. Raberg, K. Riklund, L. Jo-hansson, and D. Eriksson. Genome wide expression anal-ysis of radiation-induced DNA damage responses in iso-genic HCT116 p53+/+ and HCT116 p53-/- colorectalcarcinoma cell lines. Int. J. Radiat. Biol., 91(1):99–111,Jan 2015.

[45] K. W. Marshall, S. Mohr, F. E. Khettabi, N. Nossova,S. Chao, W. Bao, J. Ma, X. J. Li, and C. C. Liew. Ablood-based biomarker panel for stratifying current riskfor colorectal cancer. Int. J. Cancer, 126(5):1177–1186,Mar 2010.

[46] E. Merrill, S. Corlosquet, P. Ciccarese, T. Clark, andS. Das. Semantic Web repositories for genomics data us-ing the eXframe platform. J Biomed Semantics, 5(Suppl1 Proceedings of the Bio-Ontologies Spec Interest G):S3,2014.

[47] S. M. Mitchell, J. P. Ross, H. R. Drew, T. Ho, G. S.Brown, N. F. Saunders, K. R. Duesing, M. J. Buckley,R. Dunne, I. Beetson, K. N. Rand, A. McEvoy, M. L.Thomas, R. T. Baker, D. A. Wattchow, G. P. Young,T. J. Lockett, S. K. Pedersen, L. C. Lapointe, and P. L.Molloy. A panel of genes methylated with high frequencyin colorectal cancer. BMC Cancer, 14:54, 2014.

[48] N. Mohammed, M. Rodriguez, V. Garcia, J. M. Garcia,G. Dominguez, C. Pena, M. Herrera, I. Gomez, R. Diaz,B. Soldevilla, A. Herrera, J. Silva, and F. Bonilla. EPAS1mRNA in plasma from colorectal cancer patients is asso-ciated with poor outcome in advanced stages. Oncol Lett,2(4):719–724, Jul 2011.

[49] NIH. The cancer genome atlas; program overview. http://cancergenome.nih.gov/abouttcga/overview. On-line; accessed December 12 2015.

[50] N. F. Noy, M. Crubezy, R. W. Fergerson, H. Knublauch,S. W. Tu, J. Vendetti, and M. A. Musen. ProtÃľgÃľ-2000:an open-source ontology-development and knowledge-acquisition environment. AMIA Annu Symp Proc, page953, 2003.

[51] N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai, M. Dorf,N. Griffith, C. Jonquet, D. L. Rubin, M. A. Storey, C. G.Chute, and M. A. Musen. BioPortal: ontologies and in-tegrated data resources at the click of a mouse. NucleicAcids Res., 37(Web Server issue):W170–173, Jul 2009.

[52] J. B. O’Connell, M. A. Maggard, and C. Y. Ko. Coloncancer survival rates with the new American Joint Com-mittee on Cancer sixth edition staging. J. Natl. CancerInst., 96(19):1420–1425, Oct 2004.

[53] J. H. Oh et al. IL-32 inhibits cancer cell growth throughinactivation of NFB and STAT3 signals. Oncogene,30(30):3345–3359, Jul 2011.

[54] N. K. Osborn and D. A. Ahlquist. Stool screening for col-orectal cancer: molecular approaches. Gastroenterology,128(1):192–206, Jan 2005.

[55] S. Qin, Y. Zhu, F. Ai, Y. Li, B. Bai, W. Yao, and L. Dong.MicroRNA-191 correlates with poor prognosis of colorec-tal carcinoma and plays multiple roles by targeting tissueinhibitor of metalloprotease 3. Neoplasma, 61(1):27–34,2014.

[56] A. Regalado. A. emtech: Illumina says 228,000 humangenomes will be sequenced this year, mit technology re-view.

[57] I. Rhee, K. E. Bachman, B. H. Park, K. W. Jair, R. W.Yen, K. E. Schuebel, H. Cui, A. P. Feinberg, C. Lengauer,K. W. Kinzler, S. B. Baylin, and B. Vogelstein. DNMT1and DNMT3b cooperate to silence genes in human cancercells. Nature, 416(6880):552–556, Apr 2002.

[58] M. Salmanidis, G. Brumatti, N. Narayan, B. D. Green,J. A. van den Bergen, J. J. Sandow, A. G. Bert, N. Silke,R. Sladic, H. Puthalakath, L. Rohrbeck, T. Okamoto,P. Bouillet, M. J. Herold, G. J. Goodall, A. M. Jabbour,and P. G. Ekert. Hoxb8 regulates expression of microR-NAs to control cell death and differentiation. Cell DeathDiffer., 20(10):1370–1380, Oct 2013.

[59] Samtools. HTSJDK repository, Github. https://github.com/samtools/htsjdk. Online; accessed April 15 2016.

[60] Samtools. Samtools. http://www.htslib.org/. Online;accessed Dec 24 2015.

[61] D. Schweiger, Z. Trajanoski, and S. Pabinger. SPAR-QLGraph: a web-based platform for graphically queryingbiological Semantic Web databases. BMC Bioinformatics,15:279, 2014.

[62] M. L. Speir, A. S. Zweig, et al. The UCSC GenomeBrowser database: 2016 update. Nucleic Acids Res.,44(D1):D717–725, Jan 2016.

[63] H. T. Stavnes et al. HOXB8 expression in ovarian serouscarcinoma effusions is associated with shorter survival. Gy-necol. Oncol., 129(2):358–363, May 2013.

[64] Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell,C. Zhai, M. J. Efron, R. Iyer, M. C. Schatz, S. Sinha, andG. E. Robinson. Big Data: Astronomical or Genomical?PLoS Biol., 13(7):e1002195, Jul 2015.

REFERENCES 81

[65] SYSTAP. Blazegraph Licensing. https://www.blazegraph.com/services/blazegraph-licensing/,2015. Online; accessed December 13 2015.

[66] SYSTAP. Mapgraph Technology is now in Blaze-graph GPU. https://www.blazegraph.com/mapgraph-technology/, 2015. Online; accessed December 12 2015.

[67] N. Thakkar, K. Kim, E. R. Jang, S. Han, K. Kim,D. Kim, N. Merchant, A. C. Lockhart, and W. Lee. Acancer-specific variant of the SLCO1B3 gene encodes anovel human organic anion transporting polypeptide 1B3(OATP1B3) localized mainly in the cytoplasm of colonand pancreatic cancer cells. Mol. Pharm., 10(1):406–416,Jan 2013.

[68] UCSC. Frequently Asked Questions: Data File Formats.https://genome.ucsc.edu/FAQ/FAQformat.html. On-line; accessed December 24 2015.

[69] W3C. Inference. http://www.w3.org/standards/semanticweb/inference. Online; accessed December 102015.

[70] W3C. OWL 2 Web Ontology Language Primer (Sec-ond Edition). http://www.w3.org/TR/2012/REC-owl2-primer-20121211/. Online; accessed October 11 2015.

[71] W3C. SKOS Primer. https://www.w3.org/TR/2009/NOTE-skos-primer-20090818/. Online; accessed April15 2016.

[72] W3C. W3C Semantic Web Frequently Asked Questions.http://www.w3.org/2001/sw/SW-FAQ#swonbrowser,2009. Online; accessed Sept 15 2015.

[73] W3C. Describing Linked Datasets with the VoID Vocab-ulary. http://www.w3.org/TR/void/, 2011. Online; ac-cessed December 11 2015.

[74] W3C. SPARQL Query Language for RDF. http://www.w3.org/TR/sparql11-overview/, 2013. Online; ac-cessed December 13 2015.

[75] W3C. Linked Data: What is Linked Data. http://www.w3.org/standards/semanticweb/data, 2014. On-line; accessed October 11 2015.

[76] W3C. RDF. http://www.w3.org/RDF, 2014. Online; ac-cessed Sept 14 2015.

[77] W3C. RDF 1.1 N-Triples: A line-based syntax for an RDFgraph. http://www.w3.org/TR/n-triples/, 2014. On-line; accessed October 11 2015.

[78] W3C. RDF 1.1 Primer: W3C Working Group Note 24June 2014. http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/, 2014. Online; accessed Sept 15 2015.

[79] W3C. Tim Berners-Lee Biography. http://www.w3.org/People/Berners-Lee/#Bio, 2015. Online; accessed 14Sept 2015.

[80] Q. Wang, Y. X. Tan, Y. B. Ren, L. W. Dong, Z. F. Xie,L. Tang, D. Cao, W. P. Zhang, H. P. Hu, and H. Y. Wang.Zinc finger protein ZBTB20 expression is increased in hep-atocellular carcinoma and associated with poor prognosis.BMC Cancer, 11:271, 2011.

[81] J. N. Weinstein et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet., 45(10):1113–1120,Oct 2013.

[82] J. L. Wilding, S. McGowan, Y. Liu, and W. F. Bod-mer. Replication error deficient and proficient colorec-tal cancer gene expression differences caused by 3’UTRpolyT sequence deletions. Proc. Natl. Acad. Sci. U.S.A.,107(49):21058–21063, Dec 2010.

[83] S. Winawer, R. Fletcher, D. Rex, J. Bond, R. Burt, J. Fer-rucci, T. Ganiats, T. Levin, S. Woolf, D. Johnson, L. Kirk,S. Litin, and C. Simmang. Colorectal cancer screeningand surveillance: clinical guidelines and rationale-Updatebased on new evidence. Gastroenterology, 124(2):544–560, Feb 2003.

[84] Y. Yang, Z. Wang, Y. Zhou, X. Wang, J. Xiang, andZ. Chen. Dysregulation of over-expressed IL-32 in col-orectal cancer induces metastasis. World J Surg Oncol,13:146, 2015.

[85] H. Zhang, C. Hao, Y. Wang, S. Ji, X. Zhang, W. Zhang,Q. Zhao, J. Sun, and J. Hao. Sohlh2 inhibits human ovar-ian cancer cell invasion and metastasis by transcriptionalinactivation of MMP9. Mol. Carcinog., Jul 2015.

[86] L. Zhang, J. Gao, L. Li, Z. Li, Y. Du, and Y. Gong. Theneuronal pentraxin II gene (NPTX2) inhibit proliferationand invasion of pancreatic cancer cells in vitro. Mol. Biol.Rep., 38(8):4903–4911, Nov 2011.

[87] J. G. Zhao, K. M. Ren, and J. Tang. Zinc finger pro-tein ZBTB20 promotes cell proliferation in non-small celllung cancer through repression of FoxO1. FEBS Lett.,588(24):4536–4542, Dec 2014.
