Parallel Text Mining for Large Text Processing

Firat Tekiner 1,3, Yoshimasa Tsuruoka 2,4, Jun'ichi Tsujii 2,3, Sophia Ananiadou 2,3, John Keane 3

1:School of Computing, Engineering and Physical Sciences, UCLAN, Preston, UK.

2:National Centre for Text Mining, Manchester, UK

3:School of Computer Science, University of Manchester, UK

4:School of Information Science, Japan Advanced Institute of Science and Technology (JAIST), Ishikawa, Japan

[email protected]

Abstract — There is an urgent need to develop new text mining solutions using High Performance Computing (HPC) and grid environments to tackle the exponential growth in textual data. Problem sizes grow daily as new text documents are added. The aim of this work is therefore to lay the foundations for mining large text datasets (i.e. full text articles) in reasonable timeframes. Labelling sequence data, such as part-of-speech (POS) tagging, chunking (shallow parsing) and named entity recognition, is one of the most important tasks in Text Mining. This work focuses on the state-of-the-art GENIA tagger and STEPP parser: GENIA is a POS tagger specifically tuned for biomedical text and STEPP is a full parser. Parallel versions of GENIA and STEPP have been developed and their performance compared on a number of different architectures. The focus has been particularly on scalability: scaling to 512 processors has been achieved. Furthermore, a parallel text mining framework has been proposed that enables scaling to 10000 processors for massively parallel Text Mining applications. Processing times for the given datasets have been reduced dramatically, from over 70 days to hours (approaching a three-order-of-magnitude reduction). The parallel implementation uses the Message Passing Interface (MPI) to achieve portable code. The resulting parallel applications have been tested on a number of architectures, using the entire collection of MEDLINE abstracts together with 125000 full text articles.

Keywords: HPC, Text Mining, NLP, e-science, Parallel Computing, MapReduce.

1. INTRODUCTION

The continuing rapid growth of data and knowledge expressed in the scientific literature has spurred huge interest in text mining (TM). It is widely acknowledged that individual researchers cannot easily keep up with the literature in their domain, and knowledge silos further prevent integration and cross-disciplinary knowledge sharing [1].

The expansion to new domains and the increase in scale will massively increase the amount of data to be processed by TM applications (from gigabytes to terabytes and beyond). This work investigates approaches using high performance computing (HPC) to tackle the problem of data deluge for large-scale TM applications. Although TM applications are data independent, handling large amounts of textual data is an issue due to the problem sizes under consideration. Each of the steps in the TM pipeline adds further information to the initial raw text, so data size increases as processing progresses through the pipeline.

Text mining efforts have focused primarily on abstracts, because information density is greatest in the abstract compared to other sections of the text [2]. In addition, access to abstracts is easier and considerably smaller amounts of storage and computational resources are needed [3][4]. However, processing full text articles instead of abstracts will allow researchers to discover more, and deeper, hidden relationships from text that were previously unknown. Recent studies have shown that an abstract is on average only 3% of the length of the entire article [5] and includes only 20% of the useful information that can be learned from the text [6][7].

TM applications are widely applied in the biology domain and it is clear these applications will benefit from processing the additional information available in full text documents [8][9]. MEDLINE is a huge database of around 17 million references to articles. The collection of MEDLINE abstracts contains around 1.7 billion words and is around 7GB (compressed) in size before processing. The output generated after processing the MEDLINE abstracts using our TM tools is around 400GB. We expect the output of processing full MEDLINE articles to be in the order of tens of terabytes. This work is a first step towards creating a general TM framework to enable large scale text mining. The aim is to create a simple framework and apply it to a suite of TM applications based on state-of-the-art TM approaches that exploit a number of HPC and Grid architectures in order to process and handle terabytes of text in reasonable time, with a focus on scalability.

This work focuses on the applications of tagging and parsing using HPC and Grid environments. It demonstrates how text mining applications have been scaled linearly to 512 processing cores by using a parallel processing framework. The applications developed are portable as they are based on the de facto standard Message Passing Interface (MPI); as evidence, they have already run on three different HPC and Grid architectures without modification.

However, when scaling to a larger number of processors, data and work distribution will also be an issue and more sophisticated load distribution models will need to be investigated. Due to the unstructured nature of the data available, this will become a major issue. In this work, we lay the foundations of a parallel TM framework which should enable processing of terabytes of text in reasonable time. The framework resembles MapReduce [36], but it is simpler and can scale to 10000s of processors. The paper presents and discusses the associated challenges, the progress to date and the future work needed to handle full papers.

The paper is organised as follows: Section 2 discusses the TM pipeline and tagger; Section 3 focuses on the parallel implementation and the execution environment; this is followed by analysis of experimental results in Section 4; Section 5 presents the parallel text mining framework; finally, Section 6 presents conclusions and discusses future work.

2. BACKGROUND

TM normally involves sequential processing of documents and the data generated from those documents. First, the documents are processed by natural language processing (NLP) techniques to analyse the linguistic structures of the sentences. The documents are then passed to an information extraction (IE) engine which generates data by semantically analysing the documents. NLP is becoming increasingly important for accurate information extraction/retrieval from scientific literature [10]. The role of NLP in TM is to provide the tools in the IE phase with sophisticated linguistic analyses. Often this is done by annotating documents with information such as sentence boundaries, part-of-speech tags and deep semantic parsing results, which can then be read by the IE tools. After this stage, the data generated during the IE phase is analysed by the data mining component which generates further data. Once generated, this data is indexed to be queried and visualised by a user client. This process is described within the text mining application framework shown in Fig. 1.

Fig. 1: Text mining application framework (current status with abstract processing)

Each of the above steps in the TM pipeline adds further information to the initial raw text and data size increases as processing proceeds through the pipeline. The data generated after every step is either saved to disk to be used in the future or passed to the next step for further processing. Our work focuses only on data parallel approaches, as task parallel approaches in this area do not provide the desired speed-up [11]. Dynamic work distribution approaches and master/slave models (particularly task farming approaches) appear ideally suited for use on highly parallel and grid resources, employing data parallelism to process unbalanced data sets (the length and structure of the sentences are unknown before processing starts) [12][13]. Furthermore, the I/O requirements and I/O usage of each stage will need to be balanced in order to achieve a close to optimal outcome. Therefore, in this work we apply a master/slave approach to parallelism, which is discussed in greater detail in Section 3.
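To make the task-farming idea concrete, the following is a minimal sketch in C of an MPI master loop of the kind described above. It is our own illustration rather than the authors' code: the message tags, buffer sizes and the next_abstract() helper (here a trivial stub) are assumptions, and error handling is omitted.

#include <mpi.h>
#include <string.h>

#define TAG_WORK 1
#define TAG_DONE 2
#define TAG_STOP 3

/* Hypothetical stand-in for the real input reader: returns the next abstract
   as a NUL-terminated string, or NULL when the input is exhausted. */
static const char *next_abstract(void)
{
    static const char *samples[] = { "First abstract .", "Second abstract .", NULL };
    static int i = 0;
    return samples[i] ? samples[i++] : NULL;
}

/* Master loop: hand the next abstract to whichever worker reports back first,
   so a slow abstract on one worker does not stall the others. */
void master(int nworkers)
{
    static char result[1 << 20];            /* processed text from a worker */
    MPI_Status st;
    int active = 0;

    for (int w = 1; w <= nworkers; w++) {   /* prime every worker once */
        const char *a = next_abstract();
        if (!a) break;
        MPI_Send(a, (int)strlen(a) + 1, MPI_CHAR, w, TAG_WORK, MPI_COMM_WORLD);
        active++;
    }
    while (active > 0) {                    /* dynamic dispatch */
        MPI_Recv(result, sizeof result, MPI_CHAR, MPI_ANY_SOURCE, TAG_DONE,
                 MPI_COMM_WORLD, &st);
        /* ... append 'result' to the output file here ... */
        const char *a = next_abstract();
        if (a) {
            MPI_Send(a, (int)strlen(a) + 1, MPI_CHAR, st.MPI_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD);
        } else {
            MPI_Send("", 1, MPI_CHAR, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
            active--;
        }
    }
}

The essential point is the MPI_ANY_SOURCE receive: the next work unit goes to whichever worker finishes first, which absorbs the unpredictable per-abstract processing time noted above.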

The data and computation intensive processes include parsing, tagging and indexing of documents, which are critical processes in TM and are usually carried out periodically and incrementally. For tagging, we use the GENIA tagger, which achieves an accuracy of 97-98% when applied to biology texts, and for parsing, we use the STEPP parser [14][31]. The Enju and STEPP parsers, together with the GENIA tagger, are built using state-of-the-art disambiguation models and efficient decoding algorithms [15][16][17]. It has been estimated that the sequential tagging and parsing of MEDLINE (17 million abstracts; 1.7 billion words; 7GB compressed before processing, 400GB after processing) would take around 8 years on a PC, assuming 1 second to process one sentence.
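As a rough cross-check of this estimate (our own arithmetic based on the line counts reported later in Table 2, not a figure produced by the tools): the full abstract collection contains roughly 130 million sentence lines, so at 1 second per sentence a single pass takes about 1.3 x 10^8 seconds, i.e. roughly 4 years, and tagging plus parsing (two passes over the data) comes to approximately 8 years.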

Since a huge amount of biomedical knowledge is described in the literature, automatic IE from biomedical documents is increasingly important in this domain. For extracting information from text, many NLP techniques can be employed. For example, a simple approach to extracting information about protein-protein interactions would involve scanning the text for particular verbs and neighbouring noun phrases by applying linguistic patterns over words and their part-of-speech tags. A more sophisticated way would be to use parsers to deeply analyse the syntactic and semantic relations among the entities in the sentences. More information about the GENIA POS tagger can be found in [32].
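As a concrete illustration of such a pattern-based approach, the following C sketch is our own example, not code from the paper; the interaction verb list, the Penn-style POS tag conventions and the nearest-noun heuristic are simplifying assumptions.

#include <stdio.h>
#include <string.h>

/* Hypothetical list of interaction verbs used for the illustration. */
static int is_interaction_verb(const char *w)
{
    const char *verbs[] = { "interacts", "binds", "activates", "inhibits", NULL };
    for (int i = 0; verbs[i]; i++)
        if (strcmp(w, verbs[i]) == 0) return 1;
    return 0;
}

static int is_noun(const char *pos) { return strncmp(pos, "NN", 2) == 0; }

/* Given a sentence as parallel arrays of words and POS tags (as produced by a
   tagger such as GENIA), report the nearest noun on each side of an
   interaction verb as a candidate protein-protein interaction. */
void extract_ppi(const char **words, const char **pos, int n)
{
    for (int v = 0; v < n; v++) {
        if (strncmp(pos[v], "VB", 2) != 0 || !is_interaction_verb(words[v]))
            continue;
        int left = -1, right = -1;
        for (int i = v - 1; i >= 0; i--) if (is_noun(pos[i])) { left = i; break; }
        for (int i = v + 1; i < n; i++) if (is_noun(pos[i])) { right = i; break; }
        if (left >= 0 && right >= 0)
            printf("candidate interaction: %s %s %s\n",
                   words[left], words[v], words[right]);
    }
}

int main(void)
{
    const char *words[] = { "RAD51", "interacts", "with", "BRCA2" };
    const char *pos[]   = { "NN",    "VBZ",       "IN",   "NN"    };
    extract_ppi(words, pos, 4);  /* prints: candidate interaction: RAD51 interacts BRCA2 */
    return 0;
}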

STEPP is a full parser which analyses the phrase structure of a sentence and provides useful input for many kinds of high-level NLP such as summarisation [33], pronoun resolution [35] and IE [34]. Further details of the parser are beyond the scope of this work; it is described in detail in [31].

Recent investigations have looked at tagging/parsing all MEDLINE abstracts; this took 9 days using around 350 CPUs on two clusters with a grid shell (by dividing the data and processing it sequentially on different processors) [18][19][20]. That approach is based on the GXP grid shell environment, which requires secure shell (ssh) access to processing nodes. It has proven to be a promising approach to parallel text processing on a PC cluster. However, it is not possible to use this approach on high performance and grid services due to restricted access to the individual hosts of a cluster. Moreover, a generic and portable approach that is capable of processing TBs of text data is needed.

For indexing we use the Lucene or Cheshire information retrieval engines. Previous studies have shown that indexes can be created in a data parallel fashion where indexed files are written to separate physical files [21][22]. This should assist the implementation of advanced I/O optimisation techniques and improve overall system performance [23][24].

In this work our focus is on part-of-speech (POS) tagging and full parsing, which are usually two of the first steps used in language-based text mining. This process is used to add appropriate linguistic knowledge to text in order to assist further analysis by other tools. Knowing the lexical class of a word makes it much easier to perform deeper linguistic analysis such as parsing [25][26][27].

The first stage of this process involves tokenising the text by splitting it into a sequence of single word units and punctuation. This includes splitting of hyphenation, parentheses, quotations and contractions, which can otherwise cause errors with POS tagging algorithms. At this point it is possible to introduce linguistic stemming into the annotation, which predicts the base form of a word to assist in later analysis or searching [28].
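A deliberately naive tokenisation step of this kind could look as follows. This is an illustrative sketch only; real biomedical tokenisers such as the one used by GENIA apply far more rules (e.g. for contractions and chemical names), and this version may print empty lines between adjacent separators.

#include <ctype.h>
#include <stdio.h>

/* Characters that are split off into their own tokens in this sketch. */
static int is_split_char(int c)
{
    return c == '(' || c == ')' || c == '"' || c == '\'' ||
           c == '-' || c == ',' || c == '.' || c == ';' || c == ':';
}

/* Print one token per line: words are kept together, punctuation is isolated. */
void tokenise(const char *s)
{
    for (const char *p = s; *p; p++) {
        if (isspace((unsigned char)*p))
            putchar('\n');                 /* token boundary */
        else if (is_split_char(*p))
            printf("\n%c\n", *p);          /* punctuation as its own token */
        else
            putchar(*p);
    }
    putchar('\n');
}

int main(void)
{
    /* "IL-2 (interleukin-2)" becomes IL / - / 2 / ( / interleukin / - / 2 / ) ... */
    tokenise("IL-2 (interleukin-2) activates T-cells.");
    return 0;
}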

In order to ensure high accuracy it is recommended that any tagging software used is trained on annotated texts from the same domain as the target documents. As this process occurs early in the TM chain, any errors at this stage may grow cumulatively, so it is important to have a POS tagger that is highly accurate [27][29].

3. IMPLEMENTATION AND ENVIRONMENT

The parallel implementation of the GENIA tagger and the STEPP parser works as follows. Abstracts are initially cleaned, prepared and stored as an ASCII text file. Then a rule-based sentence splitter developed in-house is applied to separate the sentences. Each sentence is written to a new line and a blank line is inserted between abstracts to mark the end of each abstract. This process is not computationally expensive and is completed in less than a minute for a hundred thousand abstracts. Therefore, in order to retain the interoperability and portability of the tools, we have not integrated it into the GENIA tagger. Fig. 2 shows an abstract view of how the parallel implementation of the GENIA tagger works. The same process is applied to the STEPP parser, where GENIA's output is used as its input.

Once the data is cleaned and prepared, the master node reads the cleaned and split abstracts, packs them into groups of sentences (i.e. entire abstracts) and sends them to the slave nodes. The built-in MPI_Pack operations have not been used, as it would be inconvenient to pack and separate sentences this way. Moreover, due to the number of operations and the packing/unpacking penalties associated with MPI_Pack, it would not be as computationally efficient as packing the sentences using a combination of separator characters [30].
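A minimal sketch of this packing step is shown below, assuming a single-byte sentinel between sentences and a plain MPI_Send of the flat character buffer; the delimiter byte, tag value and buffer size are illustrative choices, not taken from the paper.

#include <mpi.h>
#include <string.h>

#define SENT_DELIM '\x1f'   /* ASCII unit separator, assumed not to occur in the text */
#define TAG_WORK   1

/* Join n sentences into buf separated by SENT_DELIM; returns the packed
   length including the terminating NUL. */
int pack_abstract(const char **sentences, int n, char *buf, int bufsize)
{
    int len = 0;
    for (int i = 0; i < n; i++) {
        int slen = (int)strlen(sentences[i]);
        if (len + slen + 2 > bufsize) break;   /* sketch: silently truncate */
        memcpy(buf + len, sentences[i], slen);
        len += slen;
        buf[len++] = SENT_DELIM;
    }
    buf[len] = '\0';
    return len + 1;
}

/* Send one packed abstract to a worker with a plain MPI_Send; no MPI_Pack. */
void send_abstract(const char **sentences, int n, int worker)
{
    static char buf[1 << 20];
    int len = pack_abstract(sentences, n, buf, sizeof buf);
    MPI_Send(buf, len, MPI_CHAR, worker, TAG_WORK, MPI_COMM_WORLD);
}

On the worker side the buffer is simply split back on the same delimiter (see the worker-side sketch later in this section).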

System:        Bluegene/L | Bluegene/P | Cray XT4 | Cluster | HPCx (DEISA)
CPU:           700 MHz IBM | 850 MHz IBM | 2.8 GHz AMD Opteron | 3.0 GHz Intel Xeon | 1.5 GHz IBM Power5
Memory/Core:   1GB | 2GB | 3GB | 1GB | 2GB
Interconnect:  1.4GB/s | 3.4GB/s | 7.6GB/s | 1GB/s | 3GB/s
I/O System:    GPFS | GPFS | Parallel Lustre | RAID disk | GPFS
SMP Nodes:     None | 4 | 4 | 4 | 16
No. of Cores:  2048 | 4096 | 11328 | 128 | 1536
O/S, Compiler: Linux, xlC/C++ v8.0 | Linux, xlC/C++ v8.0 | Linux, PGI v7.04 | Linux, Intel v10.0 | IBM AIX, IBM xlC/C++

Table 1: HPC Architectures

The parallel implementation of the GENIA tagger uses MPI, which provides portability. It has been observed that the application scales up to 512 processors on a number of architectures. For testing purposes, we used a Bluegene/L, a Bluegene/P, a Cray XT4 and a cluster (see Table 1).

Each of these architectures differs in terms of processing capability, interconnect and available file system. Due to the amount of I/O performed, a number of approaches have been investigated, and it appears that, for a text mining application, it is best to use a single I/O node. This is because the application is data parallel and has a high processing time per data element. However, we believe that this approach is also the reason for being unable to scale beyond 512 processors. Therefore, we propose a hierarchical approach where multiple master/worker nodes will be used to scale to thousands of processors.

Fig. 2: Text mining applications' parallel implementation

Scaling up to 512 processors has been achieved on the architectures with parallel file systems; on the RAID-based system scaling was limited to 128 processors. After investigation and profiling of the applications, it became apparent that writing huge amounts of text to disk via standard output (the screen) had become a bottleneck. This is because standard I/O is not optimised for parallel disk performance: although printing to the screen is equivalent to piping to a file through a Unix pipe, only one of the many I/O nodes of the parallel file system is used. On the Cray XT4, which has 32 I/O nodes, performance was 10-fold slower as a result. The Bluegene/L and P are significantly slower overall, due to the slower processors used in Bluegene systems.

Each slave node loads the probabilistic models that are obtained by training the application on annotated data, as described in Section 2. This usually takes around 30 seconds at the beginning of processing and does not need to be done again for that run (it is independent of the size of the data to be processed). The slave nodes then wait for data from the master node (in reality data is sent/received via non-blocking MPI calls, so the data has already been transferred to the slave nodes' buffers during the model loading at the beginning). When a slave node receives an abstract, it splits the abstract into sentences to process (each abstract is packed with a special character by the master rather than going through the sentence splitting process again). This is needed as the tagger and parser work per sentence.
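The worker side could be sketched as follows, under the same assumptions as the packing sketch above; load_model() and tag_sentence() are hypothetical stand-ins for the real GENIA calls, and the model file name is illustrative.

#include <mpi.h>
#include <string.h>

#define SENT_DELIM "\x1f"   /* same sentinel as the packing sketch */
#define TAG_DONE   2
#define TAG_STOP   3

void load_model(const char *path);                      /* hypothetical */
void tag_sentence(const char *sent, char *out, int n);  /* hypothetical */

void worker(void)
{
    static char in[1 << 20], out[1 << 22], tagged[1 << 16];
    MPI_Status st;
    MPI_Request req = MPI_REQUEST_NULL;

    load_model("genia.model");   /* ~30 s, done once per run, independent of data size */

    for (;;) {
        MPI_Recv(in, sizeof in, MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP) break;

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* 'out' may still be in flight */
        out[0] = '\0';
        for (char *s = strtok(in, SENT_DELIM); s; s = strtok(NULL, SENT_DELIM)) {
            tag_sentence(s, tagged, sizeof tagged);
            strcat(out, tagged);             /* one tagged sentence per line */
            strcat(out, "\n");
        }
        /* Non-blocking return: the next abstract can be received and processed
           while this result is still being sent back to the master. */
        MPI_Isend(out, (int)strlen(out) + 1, MPI_CHAR, 0, TAG_DONE,
                  MPI_COMM_WORLD, &req);
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}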

The tagging process used within the GENIA tagger involves a deterministic classification of the words within the sentence. Each word is classified using a maximum entropy model that uses information from its two adjacent neighbours. The order of tagging is determined in such a way that high-confidence classifications are performed first. Once this process is completed, POS tagging of the sentence is finished and processing of the next sentence within the abstract starts.
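The easiest-first idea can be sketched as follows; this is our simplified illustration of the strategy described in [15], not GENIA's actual implementation, and best_tag_prob() is a hypothetical stand-in for the maximum entropy model.

#include <string.h>

#define MAXLEN 512   /* sketch assumes sentences of at most MAXLEN words */

/* Hypothetical model call: returns the probability of the best tag for word i
   given the words and the (possibly still empty) tags of its neighbours, and
   writes that best tag into best_tag. */
double best_tag_prob(const char **words, char tags[][16], int n, int i,
                     char *best_tag);

/* Easiest-first tagging: at every step, commit the single most confident
   decision over all still-untagged positions, so that harder decisions can
   later condition on their already-tagged neighbours. */
void easiest_first_tag(const char **words, int n, char tags[][16])
{
    int done[MAXLEN] = {0};
    for (int i = 0; i < n; i++) tags[i][0] = '\0';

    for (int step = 0; step < n; step++) {
        int best_i = -1;
        double best_p = -1.0;
        char cand[16], best_cand[16];

        for (int i = 0; i < n; i++) {
            if (done[i]) continue;
            double p = best_tag_prob(words, tags, n, i, cand);
            if (p > best_p) { best_p = p; best_i = i; strcpy(best_cand, cand); }
        }
        strcpy(tags[best_i], best_cand);   /* fix the most confident tag first */
        done[best_i] = 1;
    }
}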


Once the processing of an abstract is completed, the processed text is sent back to the master node using a non-blocking send operation. The slave process can therefore continue processing the next abstract without waiting for the send to complete. As on average each slave cannot have more than 5 abstracts waiting to be tagged, there is no danger of exceeding the available memory.

The data granularity of full text articles is coarser than that of abstracts. The size of a full text article is on average thirty times that of an abstract, and in some cases almost a hundred times larger (e.g. long papers of 30+ pages). The approach taken in this work takes this into consideration and handles full text articles: when full text articles are processed, each paragraph is packed and sent to the slaves for processing rather than the entire article.

4. RESULTS AND DISCUSSION

Figs. 3 and 4 show the performance characteristics of the GENIA tagger and the STEPP parser on a number of architectures. It can be seen from the figures that in each case the application has scaled and the processing time has been reduced dramatically. In other words, the application performs better (takes less time to process the given data) as a greater number of processors is used.

In addition, the different platforms have shown different characteristics, and in general the cluster has outperformed the other architectures. There are two main reasons for this. Firstly, the individual processing cores of the cluster are faster than those of the other architectures, and the architecture enables an embarrassingly parallel approach with no dependency between processors.

Fig. 3: GENIA Scaling

There is enough bandwidth to supply a large number of processors, as data transfer times are lower than processing times for a given portion of the data. Secondly, the performance difference is due to the availability of hash maps on many of the systems: both applications rely heavily on hash tables to speed up processing, and the lack of them degrades performance considerably.

Fig. 4: STEPP Scaling

Further details regarding the architectures used for this work can be found in Table 1. During this work, input data files of varying size and type were used. The input text varied from 10000 abstracts to the entire MEDLINE collection (12 million abstracts). In addition, 125000 full text articles were also processed (Table 2).

DataSet                  | Lines     | Bytes
10000 Abstracts          | 96420     | 13177477
100000 Abstracts         | 964205    | 131774775
1 Million Abstracts      | 9642050   | 1317747750
Entire MEDLINE Abstracts | 129559448 | 12220578650
PMC 125000 articles      | 41648507  | 6002031815

Table 2: Input data

Processing could be parallelised at the sentence level, which would make it easy to distribute the tasks. However, early experiments showed that when small amounts of data are sent too frequently (i.e. the master distributes data sentence by sentence), overall processing time is poorer than when larger chunks of data (i.e. whole abstracts) are sent. This is due to poor bandwidth utilisation (each MPI message carries a header) and to the latency incurred by establishing communication between sender and receiver for every message.

The time taken to process a specific abstract/text is unknown until processing starts, as it depends not only on the number and length of the sentences but also on their structure. On the other hand, when the problem size was increased ten-fold, from around ten thousand abstracts to a hundred thousand abstracts, the time taken to process was observed to increase ten-fold. Although the size of a text dataset depends on the length of its abstracts, we can assume that on average an abstract contains a certain number of words and sentences; this is reflected in the experimental results shown in Figs. 3 and 4.

Figs. 3 and 4 also show the speedup gained when the resources are increased: as the number of processors is increased, processing time is reduced correspondingly, so the application scales linearly. This was an expected result given the data independent nature of the application, the size of the dataset and the limited number of processors used.

Table 3 shows processing results for 125000 PMC full text articles and 8 million MEDLINE abstracts (the entire collection of MEDLINE abstracts available to us). It would have taken over 74 days to process both the PMC full text articles (125000) and the MEDLINE abstracts using a single processor; in contrast, it took less than 1 day to process (tag and parse) all of this data on 112 processors.

Application  | Dataset       | Time on 112 processors (seconds) | Time on 112 processors | Estimated single-processor time
STEPP parser | PMC           | 14634 | 4 hrs 3 mins    | ~18.5 days
STEPP parser | MEDLINE abst. | 19254 | 5 hrs 21 mins   | ~24.5 days
GENIA tagger | PMC           | 10452 | 2 hrs 54 mins   | ~13.5 days
GENIA tagger | MEDLINE abst. | 13877 | 3 hrs 51 mins   | ~17.5 days
Total        |               |       | Less than 1 day | ~74 days

Table 3: 112 processor cluster large test runs

In addition, we initially adopted another strategy in which all the worker nodes read from and wrote to the disk themselves rather than returning results to the master. The performance of this approach was, as expected, very poor, since the performance of the serial code is the limiting factor. Therefore, there is no need to further optimise I/O at the current stage, where scaling to 512 processors has been achieved. On the other hand, as the results show, network and I/O performance becomes a bottleneck when processing goes beyond 512 processors. Therefore, our aim is to create a hierarchical master/worker approach where there is one master performing I/O for every 512 processors.

5. PARALLEL TEXT MINING FRAMEWORK

Another aim of this work has been to create a general parallel TM framework which can be applied to a number of TM approaches (Fig. 5). The framework developed here can be applied to data parallel applications to achieve high levels of scalability. In addition, portability has been achieved by using the MPI parallel programming paradigm; the application has already run on three different HPC and Grid architectures, covering both loosely and closely coupled systems.

MapReduce [36] provides a methodology similar to the one developed here. It defines map and reduce functions to process large amounts of data, where all the inputs and outputs are coordinated by master processes. The framework developed here does not do this; instead, it collects results as they are processed and writes the data back to disk using the masters. Furthermore, the TM tasks themselves are computationally expensive, so our approach can afford to simplify the entire process. Simplification is achieved by creating a hierarchical master/worker approach, where groups of masters and workers are coordinated by a higher-level processor (i.e. Master 0 in Fig. 5). This allows scalability to 10000s of processors to be achieved.

I/O is handled by the masters at Level 1; this creates a naive but inevitable form of I/O parallelism, as inputs can be read in parallel and outputs can be written in parallel. This approach works well in environments where a parallel file system exists. Furthermore, it allows computation to be overlapped with communication, so higher utilisation is achieved.
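One way such a hierarchy could be set up with plain MPI is sketched below. This is our own illustration of the layout in Fig. 5, not the implemented framework; the group size of 512 and the rank-0-per-group convention are assumptions based on the discussion in Section 4.

#include <mpi.h>
#include <stdio.h>

#define GROUP_SIZE 512   /* assumed: one Level-1 I/O master per 512 cores */

int main(int argc, char **argv)
{
    int world_rank, world_size, group_rank, group_size;
    MPI_Comm group_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Colour ranks into contiguous groups; each group gets its own communicator. */
    int colour = world_rank / GROUP_SIZE;
    MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &group_comm);
    MPI_Comm_rank(group_comm, &group_rank);
    MPI_Comm_size(group_comm, &group_size);

    if (world_rank == 0) {
        /* Level 0: overall control (Master 0 in Fig. 5). */
        printf("level 0 control: %d ranks in %d groups\n",
               world_size, (world_size + GROUP_SIZE - 1) / GROUP_SIZE);
    }
    if (group_rank == 0) {
        /* Level 1: group master. Reads its own share of the raw input, farms
           abstracts out over group_comm and writes its own output file, so
           reads and writes happen in parallel across groups. */
        printf("group %d: I/O master for %d ranks\n", colour, group_size);
    } else {
        /* Level 2: ordinary worker, as in the single-master version, but
           communicating only with its group master over group_comm. */
    }

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}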

[Figure: three-level hierarchy. Level 0 (control): Master 0. Level 1 (I/O): Masters 1 to n, each reading raw input text and writing processed output. Level 2 (processing): worker groups, each containing workers 0 to m attached to one master.]

Fig. 5: Data parallel Application Framework

In this approach, no additional data structures are used, unlike in MapReduce. This reduces complexity, as the only information a worker needs to know is which master node it is associated with.

6. CONCLUDING REMARKS

In this work, the parallel TM application developed has achieved linear scalability. Scaling up to 512 processors has been achieved using data independent approaches, and a hundred thousand abstracts have been processed in less than 5 minutes, whereas serial processing would take around 8 hours. Furthermore, processing the entire dataset available to us would take around 74 days on a single processor, whereas it took less than a day using 112 processors with the proposed parallel framework. This was an expected result due to the data independent behaviour of the algorithms under consideration.

However, in order to process problem sizes at least ten thousand times larger than those used here, more sophisticated approaches to work and data distribution are needed. In addition, a parallel implementation of the parsing process is underway, which should allow us to combine these processes using HPC and Grid platforms to process terabytes of text documents under one framework.

7. REFERENCES

1: S. Ananiadou and J. McNaught, "Introduction to Text Mining in Biology", in S. Ananiadou and J. McNaught (Eds), Text Mining for Biology and Biomedicine, pp. 1-12, Artech House Books, 2006.

2: L. Shi and F. Campagne, "Building a protein name dictionary from full text: a machine learning term extraction approach", BMC Bioinformatics, 6:88, 2005.

3: E. P. G. Martin, E. G. Bremer, M. C. Guerin, C. DeSesa and O. Jouve, "Analysis of Protein/Protein Interactions Through Biomedical Literature: Text Mining of Abstracts vs. Text Mining of Full Text Articles", KELSI 2004, pp. 96-108.

4: P. K. Shah, C. Perez-Iratxeta and M. A. Andrade, "Information extraction from full text scientific articles: Where are the keywords?", BMC Bioinformatics, 4:20, 2003.

5: D. P. A. Corney, B. F. Buxton, W. B. Langdon and D. T. Jones, "BioRAT: extracting biological information from full-length papers", Bioinformatics, Oxford Journals, 20(17):3206-3213, 2004.

6: J. Natarajan, D. Berrar, W. Dubitzky, C. Hack, Y. Zhang, C. DeSesa, J. R. Van Brocklyn and E. G. Bremer, "Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line", BMC Bioinformatics, 7:373, 2006.

7: M. J. Schuemie, M. Weeber, B. J. A. Schijvenaars, E. M. van Mulligen, C. C. van der Eijk, R. Jelier, B. Mons and J. A. Kors, "Distribution of information in biomedical abstracts and full-text publications", Bioinformatics, 20:2597-2604, 2004.

8: M. Hilario, A. Mitchell, J.-H. Kim, P. Bradley and T. Attwood, "Classifying Protein Fingerprints", PKDD 2004, Pisa, Italy.

9: S. Ananiadou, D. B. Kell and J. Tsujii, "Text mining and its potential applications in systems biology", Trends in Biotechnology, 24(12), 11 October 2006.

10: Y. Miyao, T. Ohta et al., "Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases", Coling/ACL, Sydney, Australia, Association for Computational Linguistics, 2006.

11: T. Ninomiya, K. Torisawa and J. Tsujii, "An Agent-based Parallel HPSG Parser for Shared-memory Parallel Machines", Journal of Natural Language Processing, 8(1):21-48, January 2001, ISSN 1340761.

12: X. Qin, "Performance Comparisons of Load Balancing Algorithms for I/O-Intensive Workloads on Clusters", Journal of Network and Computer Applications, July 2006.

13: H. Gonzalez-Velez, "Self-adaptive skeletal task farm for computational grids", Parallel Computing, 32(7-8):479-490, September 2006.

14: T. Matsuzaki, Y. Miyao and J. Tsujii, "Efficient HPSG Parsing with Supertagging and CFG-filtering", Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, January 2007.

15: Y. Tsuruoka and J. Tsujii, "Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data", Proceedings of HLT/EMNLP 2005, pp. 467-474.

16: Y. Miyao and J. Tsujii, "Probabilistic Disambiguation Models for Wide-Coverage HPSG Parsing", Proceedings of ACL-2005, pp. 83-90, 2005.

17: Y. Miyao, T. Ohta, K. Masuda, Y. Tsuruoka, K. Yoshida, T. Ninomiya and J. Tsujii, "Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases", Proceedings of COLING-ACL 2006, pp. 1017-1024.

18: K. Taura, "GXP: An interactive shell for the grid environment", Proc. IWIA 2004, pp. 59-67, 2004.

19: Y. Miyao, T. Ohta, K. Masuda, Y. Tsuruoka, K. Yoshida, T. Ninomiya and J. Tsujii, "Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases", Proceedings of COLING-ACL 2006, Sydney, Australia, pp. 1017-1024, July 2006.

20: T. Ninomiya, Y. Tsuruoka, Y. Miyao, K. Taura and J. Tsujii, "Fast and scalable HPSG parsing", Traitement automatique des langues (TAL), 46(2), 2006.

21: E. Meij and M. de Rijke, "Deploying Lucene on the Grid", SIGIR 2006 Workshop on Open Source Information Retrieval (OSIR 2006), 2006.

22: R. Sanderson and R. Larson, "Indexing and Searching Tera-Scale Grid-Based Digital Libraries", InfoScale 2006.

23: P. Braam, "I/O for 25,000 Clients - Lustre & Red Storm", CUG 2007, Seattle, May 2007.

24: W. Yu, J. Vetter, H. S. Oral and R. Barrett, "Efficiency Evaluation of Cray XT Parallel IO Stack", CUG 2007, Seattle, May 2007.

25: K. Toutanova, D. Klein, C. Manning and Y. Singer, "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network", pp. 467-474, 2003.

26: Y. Tsuruoka, Y. Tateishi et al., "Developing a robust part-of-speech tagger for biomedical text", Advances in Informatics, Proceedings 3746:382-392, 2005.

27: K. Yoshida, "Ambiguous Part-of-Speech Tagging for Improving Accuracy and Domain Portability of Syntactic Parsers", Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, 2007.

28: D. A. Hull, "Stemming algorithms: A case study for detailed evaluation", Journal of the American Society for Information Science, 47(1):70-84, 1996.

29: A. Yakushiji, Y. Miyao et al., "Biomedical information extraction with predicate-argument structure patterns", First International Symposium on Semantic Mining in Biomedicine, 2005.

30: http://www.llnl.gov/computing/tutorials/mpi_performance/

31: Y. Tsuruoka, J. Tsujii and S. Ananiadou, "Fast Full Parsing by Linear-Chain Conditional Random Fields", Proceedings of the 12th Conference of the European Chapter of the ACL, pp. 790-798, Athens, Greece, 30 March - 3 April 2009.

32: Y. Tsuruoka, Y. Tateishi, J.-D. Kim, T. Ohta, J. McNaught, S. Ananiadou and J. Tsujii, "Developing a Robust Part-of-Speech Tagger for Biomedical Text", Advances in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382-392, 2005.

33: K. Knight and D. Marcu, "Statistics based summarization - step one: Sentence compression", Proceedings of AAAI/IAAI, pp. 703-710, 2000.

34: Y. Miyao, R. Saetre, K. Sagae, T. Matsuzaki and J. Tsujii, "Task-oriented evaluation of syntactic parsers and their representations", Proceedings of ACL-08:HLT, pp. 46-54, 2008.

35: X. Yang, J. Su and C. L. Tan, "Kernel-based pronoun resolution with structured syntactic features", Proceedings of COLING/ACL, pp. 41-48, 2006.

36: J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, 51(1), January 2008.
