
BACHELOR OF COMPUTER SCIENCE (HONOURS) - UNIVERSITY OF SOUTH AUSTRALIA

Curating Biomedical Literature using Text Mining

Research Proposal

Samuel O’Malley

110015053 – OYMSJ001

31st of May 2012

Supervisor: Professor Jiuyong Li

Associate Supervisor: Dr Jixue Liu


Abstract

Biomedical literature is increasing exponentially and manual curation processes are not recording the facts fast enough. Advances in natural language processing and text mining enable computers to assist in the curation process by categorising data into meaningful groups so that curators only see the literature they are looking for. These tools can also be powerful enough to curate the data automatically, without any human input. A few solutions currently exist for automatically discovering protein-protein interactions from biomedical literature; however, there is a clear lack of tools for microRNA literature. MicroRNA research is increasing as deep sequencing technology becomes cheaper and interest in microRNA grows. MicroRNA recognition is challenging because of the large number of synonyms and the large number of species referred to in the literature. The research proposed here will provide a solution to microRNA recognition and attempt to automatically extract information from biomedical literature abstracts and generate a structured database of facts.


Contents

1. Introduction
1.1 Background and Motivation
1.2 Research Question
1.2.1 microRNA Entity Recognition
1.2.2 microRNA Relationship Detection
1.3 Justification
2. Literature Review
2.1 Text Mining
2.1.1 Definition of Text Mining
2.1.2 Entity Recognition
2.1.3 Information Retrieval
2.1.4 Information Extraction
2.2 Mining Biomedical Literature
2.2.1 DRENDA Disease Related Enzyme information Database
2.2.2 Gene Name Normalisation
2.2.3 BioPPIExtractor: Protein - Protein Interaction Extractor
2.2.4 Biolexicon
2.2.5 miRCancer
3. Methodology
3.1 Data Acquisition
3.2 Pre-processing
3.3 Entity Recognition
3.4 Relationship Analysis
3.5 Results Analysis
3.6 Expected Results
4. Project Schedule
5. Summary
6. References


List of Figures

Figure 1: Schematic Illustration of DRENDA workflow (Sohngen, Chang & Schomburg 2011)

Figure 2: Process flow diagram

List of Tables

Table 1: Structured Database Output – randomly chosen examples


1. Introduction

1.1 Background and Motivation

MicroRNA are short, single-stranded molecules of non-coding RNA which inhibit protein production in our cells. They occur naturally in the body and can potentially cure a disease or condition. Current microRNA research is aimed at discovering the links between different microRNA and protein production. Researchers also aim to artificially introduce microRNA into cells to reduce problem proteins and so potentially cure cancers or other diseases (Xie 2010; Liu et al. 2012; Selth et al. 2012; Zhang et al. 2012).

MicroRNA research, measured by the number of published articles and journals, is increasing considerably as the technology becomes cheaper and it becomes easier to discover new microRNA. Although their existence was discovered in 1993 by the American molecular biologist Victor Ambros, the technology used to discover and sequence new microRNA has only been widely available for a few short years (Roads 2010). Given this volume of new research, the data needs to be represented in a structured format in order to be useful. Currently the databases used to store this information are curated manually by teams of domain experts; however, these databases do not adequately reflect the current state of research, and no single researcher can keep up with all of the literature in their field (Jensen, Saric & Bork 2006).

Literature mining tools are becoming essential for researchers, enabling them to narrow the information down to only the relevant publications and potentially discover new information. Automatic curation methods using text mining have already been developed for other fields in biology, such as protein-protein interactions; however, these methods cannot be directly applied to microRNA due to limitations discussed later in this proposal.

1.2 Research Question

The overall aim of this research is to determine an effective technique for extracting information about microRNA interactions from biomedical literature. This research can be split into two problems:

1. Recognising microRNA occurrences and removing ambiguity.

2. Determining the relationship between co-mentioned microRNA and other biological entities.


1.2.1 microRNA Entity Recognition

This research will endeavour to accurately detect occurrences of microRNA in biomedical literature. This is challenging because each microRNA has many synonyms and its name can be ambiguous.

1.2.2 microRNA Relationship Detection

MicroRNA can occur in the same sentence as many different types of biological terms. Relationship detection will take the microRNA and the other biological term and analyse the relationship between them in order to classify the information as meaningful or not. An example relationship would be “MicroRNA (A) inhibits Gene (B) production”, where A and B are a microRNA and a gene name respectively.
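As a rough illustration of this idea, the sketch below classifies a sentence by looking at the words that join the two entity mentions. It is only a minimal sketch under stated assumptions: the verb list `RELATION_VERBS`, the function names and the example sentences (with placeholder entities `miR-X` and `GENE1`) are all hypothetical, and a real classifier would need a far richer model than a verb lookup.

```python
import re

# Hypothetical list of "joining" verbs; not taken from any published resource.
RELATION_VERBS = {"inhibits", "represses", "targets", "regulates", "suppresses"}


def joining_words(sentence, mirna, gene):
    """Return the lower-cased words between the microRNA and gene mentions.

    Returns None if either mention is missing or the gene precedes the microRNA.
    """
    start = sentence.find(mirna)
    end = sentence.find(gene)
    if start == -1 or end == -1 or end < start:
        return None
    between = sentence[start + len(mirna):end]
    return re.findall(r"[a-z]+", between.lower())


def classify(sentence, mirna, gene):
    """Label a co-mention as 'Meaningful' if a known joining verb links the pair."""
    words = joining_words(sentence, mirna, gene)
    if words and RELATION_VERBS.intersection(words):
        return "Meaningful"
    return "Not meaningful"


# Made-up sentences with placeholder entity names.
print(classify("miR-X inhibits GENE1 production in tumour cells.", "miR-X", "GENE1"))
print(classify("miR-X and GENE1 were both measured.", "miR-X", "GENE1"))
```

The binary labels mirror the meaningful / not meaningful classes used elsewhere in this proposal; the verb lookup itself is an illustrative stand-in.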

1.3 Justification

This research has similarities to current research in other fields of biomedical text mining, such as protein-protein interaction detection and gene name normalisation (Crim, McDonald & Pereira 2005; Sun et al. 2009; Gerold, Simon & Fabio 2011; Xia et al. 2011). However, because microRNA is a relatively new field of research, there is a clear lack of tools for assisting in curating microRNA information from biomedical literature. This research will adapt and extend existing tools for similar biology fields and apply them to microRNA, as discussed in the literature review in Section 2.2.


2. Literature Review

2.1 Text Mining

This section provides an overview of Text Mining research and current applications.

2.1.1 Definition of Text Mining

Data mining is the endeavour of discovering previously unknown information from data. Text

mining is a subset of data mining with the ultimate aim of discovering new information from

free text literature. The three parts of text mining are Entity Recognition (ER), Information

Retrieval (IR) and Information Extraction (IE).

2.1.2 Entity Recognition

Entity Recognition (ER) is a subset of text mining aimed at recognising important entities in

free-text. For our research this includes recognising microRNA and gene names in biomedical

literature. Some challenges presented in ER research include disambiguating entity names and

normalisation.

2.1.3 Information Retrieval

Information Retrieval (IR) encompasses advanced queries which go beyond simple keyword

searches. IR includes entity recognition and clustering algorithms to provide better results to a

user’s query.

2.1.4 Information Extraction

Information Extraction (IE) goes one step beyond IR in that instead of providing results to a

query, it extracts facts from the literature and returns these instead of the full-text.

2.2 Mining Biomedical Literature

This section will provide an overview of the more specific field of text mining biomedical

literature.

2.2.1 DRENDA: Disease Related Enzyme information Database

DRENDA is a system developed by Sohngen, Chang and Schomburg (2011) for detecting and

classifying disease-related enzyme information.


Figure 1: Schematic Illustration of DRENDA workflow (Sohngen, Chang & Schomburg 2011)

From the DRENDA workflow diagram in Figure 1 we see that the system uses the BRENDA database (BRaunschweig ENzyme Database) and the MeSH database (MEdical Subject Headings) as dictionaries for entity recognition. Literature is obtained by crawling PubMed and extracting abstracts, and initial pre-processing such as sentence splitting is applied. A training corpus is used to train the SVM (Support Vector Machine) algorithm, which generates a classification model. Sentences with co-occurring disease and enzyme mentions are extracted and this SVM classification model is applied to them. The result is a set of classified sentences, which is evaluated using a test corpus. Sentences evaluated as correctly classified are added to the DRENDA database as facts.
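The sentence-splitting and co-occurrence steps of this workflow can be sketched in a few lines. This is not the DRENDA implementation: the two small dictionaries stand in for BRENDA and MeSH, the splitter is deliberately naive, the example abstract is made up, and the SVM classification step is left out entirely.

```python
import re

# Tiny hypothetical dictionaries standing in for BRENDA (enzymes) and MeSH (diseases).
ENZYMES = {"catalase", "lipase"}
DISEASES = {"diabetes", "pancreatitis"}


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def cooccurring_sentences(abstract):
    """Keep only sentences that mention at least one enzyme and one disease."""
    hits = []
    for sentence in split_sentences(abstract):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words & ENZYMES and words & DISEASES:
            hits.append(sentence)
    return hits


# Made-up abstract text for illustration only.
example = ("Serum lipase is elevated in pancreatitis. "
           "Catalase was also measured. Diabetes was excluded.")
print(cooccurring_sentences(example))
```

Only the first sentence of the example survives the filter, because it is the only one in which an enzyme term and a disease term occur together.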

This system cannot be directly applied to microRNA literature; however the workflow can be

followed closely. Before this system can be extended for microRNA literature, an appropriate

microRNA dictionary resource must be identified. The evaluation methods used by Sohngen

et al. are very thorough and evaluate multiple pre-processing methods in order to determine

the best ones.

2.2.2 Gene Name Normalisation

A problem with biomedical literature is that each entity has many different names and there are complex naming conventions which might not be faithfully followed. Naming conventions include capitalisation to represent different species, but this convention might not be followed if the context of the literature makes it clear what species is being discussed. Sun et al. (2009) present a multi-level disambiguation framework for gene name normalisation. The authors show that human genes have on average 5.5 synonyms for each identifier. While a human reader would understand these using contextual clues, a machine has a much harder time doing so.

Sun et al. introduce a context awareness algorithm to disambiguate species amongst the different synonyms used in the literature. For example, if the majority of genes mentioned in a document are human genes, then we can reasonably assume that any ambiguous gene names in the document are also human genes.
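The majority-species heuristic in this example can be sketched as follows. This is a simplification for illustration only and is not the framework of Sun et al.; the function name and the input format (a list of gene name and species pairs, with `None` marking an ambiguous mention) are assumptions.

```python
from collections import Counter


def resolve_species(mentions):
    """Assign the document's most common species to ambiguous gene mentions.

    mentions: list of (gene_name, species) pairs, where species is None
    when the species could not be determined from the name alone.
    """
    known = [species for _, species in mentions if species is not None]
    if not known:
        # Nothing to vote with, so leave the mentions unchanged.
        return list(mentions)
    majority = Counter(known).most_common(1)[0][0]
    return [(name, species if species is not None else majority)
            for name, species in mentions]


# Illustrative input: three unambiguous mentions and one ambiguous one.
doc = [("TP53", "human"), ("BRCA1", "human"), ("Trp53", "mouse"), ("p53", None)]
print(resolve_species(doc))
```

When two species are equally common, `Counter.most_common` returns the one seen first, so a real system would need an explicit tie-breaking rule.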

The authors use a maximum entropy model and binary classes of meaningful and not

meaningful to disambiguate gene names. This algorithm is similar to Crim, McDonald and

Pereira’s algorithm (2005) except it uses more contextual cues to disambiguate gene names.

2.2.3 BioPPIExtractor: Protein-Protein Interaction Extractor

This system extracts protein-protein interactions from biomedical literature, using syntactic grammar parsers to better understand the relationship between two proteins (Yang, Lin & Wu 2009). The system was manually evaluated for precision and recall, and was found to perform better than two other leading systems, BioRAT (Corney et al. 2004) and IntEx (Silberztein 2000).

2.2.4 Biolexicon

The Biolexicon is a large-scale lexical resource of biological terms (Thompson et al. 2011). It

combines multiple data sources into one large resource which can be used at multiple stages

of the text mining process. This system uses its vast knowledge of biological terms to

discover new textual variants which do not occur in the database resources.

Although this system is very useful, it has no knowledge of microRNA entities. It can nevertheless assist our efforts in microRNA detection because it has knowledge of biology-specific verbs such as “retro-regulate” which do not occur in a standard dictionary (BOOTStrep Bio-Lexicon 2012).

2.2.5 miRCancer

miRCancer is a comprehensive database of microRNA expression profiles in human cancers based on experimental results (Xie 2010). Essentially this framework is specifically designed to uncover relationships between microRNA and cancers in biomedical literature.

A limitation of this system is that the relationship between the microRNA and the cancer is not detected or analysed. This would result in false positives or unimportant data in the miRCancer database.


3. Methodology

The following diagram (Figure 2) represents the process flow that our program will take. The order reflects the stages of the text mining process and will closely match the structure of the software.

Figure 2: Process flow diagram

3.1 Data Acquisition

Data will be acquired from the PubMed open access database and will only include titles and abstracts. There are two reasons for only extracting titles and abstracts. Firstly, titles and abstracts are freely available and do not require any complex PDF processing, which reduces the complexity and processing time of our algorithm. Secondly, the work by Wei and Collier (2011) suggests that most of the important terms are mentioned in the abstract and title, and repeated with more detail in the Introduction, Results and Conclusion sections. This suggests that if there are no occurrences of microRNA in the title or abstract then the full paper is not worth reading. To future-proof our research, all abstracts will be stored in a MySQL database and paired with the permanent URL in order to allow full-text downloads at a later date.
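The storage step might look roughly like the sketch below. To keep the example self-contained it uses Python's built-in `sqlite3` module as a stand-in for MySQL; the table layout, the function names and the `https://pubmed.ncbi.nlm.nih.gov/<PMID>/` URL pattern are assumptions made for illustration, and the crawling step itself is not shown.

```python
import sqlite3


def make_store():
    """Create an in-memory database with one table of abstracts (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE abstracts ("
        "pmid TEXT PRIMARY KEY, title TEXT NOT NULL, "
        "abstract TEXT NOT NULL, url TEXT NOT NULL)"
    )
    return conn


def save_abstract(conn, pmid, title, abstract):
    """Store one record together with its permanent URL and return that URL."""
    # Assumed URL pattern for the permanent PubMed record of an article.
    url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
    conn.execute(
        "INSERT OR REPLACE INTO abstracts VALUES (?, ?, ?, ?)",
        (pmid, title, abstract, url),
    )
    conn.commit()
    return url


conn = make_store()
# Placeholder identifier and text, not a real article.
save_abstract(conn, "12345", "Example title", "Example abstract text.")
print(conn.execute("SELECT pmid, url FROM abstracts").fetchall())
```

Keying the table on the PubMed identifier means a repeated crawl overwrites a record instead of duplicating it.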

[Figure 2 (process flow), stages in order: 1. Data Acquisition (crawl PubMed database, extract abstracts); 2. Pre-processing (tokenisation, stop word removal); 3. Entity Recognition (miRBase dictionary, disambiguation); 4. Relationship Analysis (classify relationship based on joining words); 5. Results Analysis (precision, recall).]


3.2 Pre-processing

Common text mining pre-processing tasks will be applied to our data. Firstly, tokenisation will be applied to separate each sentence into tokens (words without any punctuation). Then commonly occurring English language words, called stop words, will be removed. MicroRNA and other medical entities will then be removed from the sentence in order to avoid confusing the classification algorithm. Completely removing medical entities has been shown to perform better in classification tasks than replacing them with a generic word (Sohngen, Chang & Schomburg 2011).
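A minimal sketch of these pre-processing steps is given below. The stop word list is a tiny hypothetical sample, the entity mentions are assumed to have been located already, and the example sentence uses placeholder entity names. Note one deliberate deviation from the order described above: the sketch strips entity mentions before tokenising, so that hyphenated names are removed whole and are not broken into fragments by the tokeniser.

```python
import re

# Tiny hypothetical stop word list; a real list would be much longer.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "is", "by", "to"}


def preprocess(sentence, entities):
    """Remove entity mentions, tokenise, lower-case and drop stop words."""
    # Strip known entity mentions first so names like 'miR-X' disappear whole.
    for entity in entities:
        sentence = sentence.replace(entity, " ")
    # Tokens are runs of letters and digits; punctuation is discarded.
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return [token for token in tokens if token not in STOP_WORDS]


# Made-up sentence with placeholder entity names.
print(preprocess("Expression of GENE1 is inhibited by miR-X in the liver.",
                 ["miR-X", "GENE1"]))
```

What remains after these steps is the context around the entities, which is the part of the sentence a relationship classifier would work from.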

3.3 Entity Recognition

The miRBase database will be used to facilitate microRNA entity recognition (Kozomara & Griffiths-Jones 2011). This database contains manually curated microRNA information, including deep sequencing data and the unique sequence of nucleotides which makes up each microRNA. The most useful information contained in this database is the set of synonyms used to refer to individual microRNA and a unique identifier which can be used to refer to a microRNA without any ambiguity. Various microRNA databases were evaluated for biomedical applications and miRBase was shown to be an extensive resource valuable for annotation tasks (Tan Gana, Victoriano & Okamoto 2012).
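Dictionary-based recognition of this kind could be sketched as below. The synonym table is a hand-written, hypothetical stand-in for one derived from miRBase, the regular expression covers only a few simple name patterns, and species disambiguation is ignored by mapping every variant straight to a single identifier.

```python
import re

# Hypothetical synonym table: lower-cased surface form -> canonical identifier.
# A real table would be built from miRBase; these rows are illustrative only.
SYNONYMS = {
    "mir-150": "hsa-mir-150",
    "mirna-150": "hsa-mir-150",
    "microrna-150": "hsa-mir-150",
    "hsa-mir-150": "hsa-mir-150",
}

# Candidate mentions such as 'miR-150', 'microRNA-150' or 'hsa-mir-150'.
MENTION = re.compile(r"\b(?:hsa-)?(?:mir|mirna|microrna)-\d+[a-z]?\b", re.IGNORECASE)


def recognise(text):
    """Return (surface form, canonical identifier) pairs for known mentions."""
    found = []
    for match in MENTION.finditer(text):
        identifier = SYNONYMS.get(match.group(0).lower())
        if identifier is not None:
            found.append((match.group(0), identifier))
    return found


# Made-up sentence; 'miR-999' matches the pattern but is not in the table.
print(recognise("Levels of miR-150 and microRNA-150 rose, but miR-999 did not."))
```

Mentions that match the pattern but are missing from the table are dropped here; in practice they would be worth logging as candidate new synonyms.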

3.4 Relationship Analysis

3.5 Results Analysis

Precision and recall are the standard measures for evaluating text mining algorithms. However, there is no gold standard available for microRNA literature, so a manual analysis will need to be performed. A small test dataset will be compiled manually and used to evaluate our algorithm. Precision and recall can be used to compare different algorithms even across different fields, which means that our algorithm can be compared to existing algorithms which do not relate to microRNA. This is useful because there is currently very little research into automatic microRNA curation.
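For reference, precision is the fraction of extracted facts that are correct, and recall is the fraction of correct facts that were extracted. The sketch below computes both against a manually compiled test set; the function name and the representation of facts as pairs are assumptions, and the example pairs are placeholders.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted facts against a gold test set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Placeholder (microRNA, entity) pairs: 4 predicted, 3 in the test set, 2 shared.
predicted = {("mir-a", "gene-1"), ("mir-b", "gene-2"),
             ("mir-c", "gene-3"), ("mir-d", "gene-4")}
gold = {("mir-a", "gene-1"), ("mir-b", "gene-2"), ("mir-e", "gene-5")}
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example two of the four predictions are correct (precision 0.50) and two of the three gold facts are found (recall 0.67).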


3.6 Expected Results

Table 1: Structured Database Output – randomly chosen examples

MicroRNA        Entity                                  Class
hsa-mir-150     alpha-1-B glycoprotein                  Meaningful
hsa-mir-7a-1    apoptosis-associated tyrosine kinase    Meaningful

If the research is successful, the outcome will be a structured database in which each record contains a microRNA, another biological entity (initially gene names, later expanding to include diseases and other entities) and a classification (see Table 1). At this stage the classification is binary, either meaningful or not meaningful; however, after analysis of the returned data we might need to introduce further classes.


4. Project Schedule

This section outlines the proposed high level schedule of the research project.

Date         Task
February     Literature Review
March
April
May          Research Proposal
June
July         Data Acquisition (Section 3.1); Pre-processing (Section 3.2)
August
September    Entity Recognition (Section 3.3); Relationship Analysis (Section 3.4)
October      Testing and Evaluation (Section 3.5); Preparation of Thesis
November


5. Summary

This research project aims to combine computing power with biomedical domain knowledge to assist in the process of curating microRNA literature. Even though no algorithm will be infallible or able to replace the manual curation process completely, the added speed of computer processing will greatly assist the curator’s task. A challenge addressed in this research is recognising microRNA entities and their variations in biomedical literature.


6. References

BOOTStrep Bio-Lexicon 2012, The National Centre for Text Mining - University of Manchester, <http://www.nactem.ac.uk/biolexicon/>.

Corney, DPA, Buxton, BF, Langdon, WB & Jones, DT 2004, 'BioRAT: extracting biological information from full-length papers', Bioinformatics, vol. 20, no. 17, November 22, 2004, pp. 3206-3213.

Crim, J, McDonald, R & Pereira, F 2005, 'Automatically annotating documents with normalized gene lists', BMC Bioinformatics, vol. 6, no. Suppl 1, p. S13.

Gerold, S, Simon, C & Fabio, R 2011, 'Detection of interaction articles and experimental methods in biomedical literature', BMC Bioinformatics, vol. 12, no. Suppl 8, p. S13.

Jensen, LJ, Saric, J & Bork, P 2006, 'Literature mining for the biologist: from information retrieval to biological discovery', Nat Rev Genet, vol. 7, no. 2, pp. 119-129.

Kozomara, A & Griffiths-Jones, S 2011, 'miRBase: integrating microRNA annotation and deep-sequencing data', Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157.

Liu, J, Gao, J, Du, Y, Li, Z, Ren, Y, Gu, J, Wang, X, Gong, Y, Wang, W & Kong, X 2012, 'Combination of plasma microRNAs with serum CA19-9 for early detection of pancreatic cancer', Int J Cancer, vol. 131, no. 3, Aug 1, pp. 683-691.

Roads, RE 2010, Progress in Molecular and Subcellular Biology, Springer, Shreveport LA.

Selth, LA, Townley, S, Gillis, JL, Ochnik, AM, Murti, K, Macfarlane, RJ, Chi, KN, Marshall, VR, Tilley, WD & Butler, LM 2012, 'Discovery of circulating microRNAs associated with human prostate cancer using a mouse model of disease', Int J Cancer, vol. 131, no. 3, Aug 1, pp. 652-661.

Silberztein, M 2000, 'INTEX: an FST toolbox', Theoretical Computer Science, vol. 231, no. 1, pp. 33-46.

Sohngen, C, Chang, A & Schomburg, D 2011, 'Development of a classification scheme for disease-related enzyme information', BMC Bioinformatics, vol. 12, no. 1, p. 329.

Sun, C-J, Wang, X-L, Lin, L & Liu, Y-C 2009, 'A Multi-level Disambiguation Framework for Gene Name Normalization', Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197.


Tan Gana, NH, Victoriano, AFB & Okamoto, T 2012, 'Evaluation of online miRNA resources for biomedical applications', Genes to cells : devoted to molecular & cellular mechanisms, vol. 17, no. 1, pp. 11-27.

Thompson, P, McNaught, J, Montemagni, S, Calzolari, N, del Gratta, R, Lee, V, Marchi, S, Monachini, M, Pezik, P, Quochi, V, Rupp, C, Sasaki, Y, Venturi, G, Rebholz-Schuhmann, D & Ananiadou, S 2011, 'The BioLexicon: a large-scale terminological resource for biomedical text mining', BMC Bioinformatics, vol. 12, no. 1, p. 397.

Wei, Q & Collier, N 2011, 'Towards classifying species in systems biology papers using text mining', BMC Research Notes, vol. 4, no. 1, p. 32.

Xia, N, Lin, H, Yang, Z & Li, Y 2011, 'Combining multiple disambiguation methods for gene mention normalization', Expert Systems With Applications, vol. 38, no. 7, pp. 7994-7999.

Xie, B 2010, 'miRCancer: a microRNA-Cancer Association Database and Toolkit Based on Text Mining'.

Yang, Z, Lin, H & Wu, B 2009, 'BioPPIExtractor: A protein–protein interaction extraction system for biomedical literature', Expert Systems With Applications, vol. 36, no. 2, pp. 2228-2233.

Zhang, J, Zhao, H, Gao, Y & Zhang, W 2012, 'Secretory miRNAs as novel cancer biomarkers', Biochim Biophys Acta, vol. 1826, no. 1, Aug, pp. 32-43.
