
Proceedings of the International Multiconference on Computer Science and Information Technology

Volume 5 (2010)


M. Ganzha, M. Paprzycki (editors)

ISSN 1896-7094

Polskie Towarzystwo Informatyczne, Oddział Górnośląski, ul. Raciborska 3, 40-074 Katowice

ISBN 978-83-60810-27-9

IEEE Computer Society Press, 10662 Los Vaqueros Circle, Los Alamitos, CA 90720, USA

TEXnical editor: Aleksandr Denisiuk

Proceedings of the International Multiconference on Computer Science and Information Technology

October 18-20, 2010. Wisła, Poland

Volume 5 (2010)

Dear Reader, it is our pleasure to present to you the Proceedings of the 2010 International Multiconference on Computer Science and Information Technology (IMCSIT), which took place in Wisła, Poland, on October 18-20, 2010. IMCSIT 2010 was co-located with the XXVI Autumn Meeting of the Polish Information Processing Society (PIPS).

IMCSIT is the result of an evolutionary process. In 2005 a Scientific Session took place during the XXI Autumn Meeting of PIPS and consisted of 27 refereed presentations. After this relative success (we advertised the Session very late in the year), we decided to expand and extend it into a full-blown conference while continuing cooperation (co-location) with the Autumn Meetings of PIPS. As a result of steady growth, in 2010 IMCSIT consisted of the following events (the Proceedings are organized into sections that correspond to each of them):

5th International Symposium Advances in Artificial Intelligence and Applications (AAIA'10),

Workshop on Agent Based Computing: from Model to Implementation VII (ABC:MI'10),

International Workshop on Advances in Business ICT (ABICT'10),

Computer Aspects of Numerical Algorithms (CANA'10),

Computational Linguistics - Applications (CLA'10),

10th International Multidisciplinary Conference on e-Commerce and e-Government (ECOM&EGOV'10),

International Symposium on E-Learning - Applications (EL-A'10),

6th Workshop on Large Scale Computations on Grids and 1st Workshop on Scalable Computing in Distributed Systems (LaSCoG-SCoDiS'10),

2nd International Workshop on Medical Informatics and Engineering (MI&E'10),

3rd International Symposium on Multimedia - Applications and Processing (MMAP'10),

International Workshop on Real Time Software (RTS'10),

4th International Workshop on Secure Information Systems (SIS'10),

International Symposium on Technologies for Social Advancement (T4SA'10),

Workshop on Ad-Hoc Wireless Networks (WAHOC'10),

Workshop on Computational Optimization (WCO'10).

Each of these events had its own Organizing and Program Committee (listed in these Proceedings). We would like to express our warmest gratitude to members of all of them for their hard work in attracting and later refereeing 201 submissions.

Maria Ganzha, Conference Chair, Systems Research Institute Polish Academy of Sciences, Warsaw, Poland, and Gdańsk University, Gdańsk, Poland.

Marcin Paprzycki, Systems Research Institute Polish Academy of Sciences, Warsaw and Management Academy, Warsaw, Poland.


TABLE OF CONTENTS

5th International Symposium Advances in Artificial Intelligence and Applications:

Call For Papers 1

A Breast Cancer Classifier based on a Combination of Case-Based Reasoning and Ontology Approach 3

Essam AbdRabou, AbdEl-Badeeh Salem

Using data mining for assessing diagnosis of breast cancer 11

Medhat Mohamed Ahmed Abdelaal, Muhamed Wael Farouq, Hala Abou Sena, Abdel-Badeeh Mohamed Salem

Advanced scale-space invariant, low detailed feature recognition from images - car brand recognition 19

Štefan Badura, Stanislav Foltán

Evaluation of Clustering Algorithms for Polish Word Sense Disambiguation 25

Bartosz Broda, Wojciech Mazur

Generation of First-Order Expressions from a Broad Coverage HPSG Grammar 33

Ravi Coote, Andreas Wotzlaw

PSO based modeling of Takagi-Sugeno fuzzy motion controller for dynamic object tracking with mobile platform 37

Meenakshi Gupta, Laxmidhar Behera, Venkatesh K.S.

Hierarchical Object Categorization with Automatic Feature Selection 45

Md. Saiful Islam, Andrzej Sluzek

Selecting the best strategy in a software certification process 53

Waldemar Koczkodaj, Vova Babiy, Agnieszka D. Bogobowicz, Ryszard Janicki, Alan Wassyng

Extrapolation of Non-Deterministic Processes Based on Conditional Relations 59

Juliusz Kulikowski

Reasoning in RDF graphic formal system with quantifiers 67

Alena Lukasova, Marek Vajgl, Martin Žáček

Coevolutionary Algorithm For Rule Induction 73

Pawel Myszkowski

Evolutionary Algorithm in Forex trade strategy generation 81

Pawel Myszkowski, Adam Bicz

Emotion-based Image Retrieval - an Artificial Neural Network Approach 89

Katarzyna Agnieszka Olkiewicz, Urszula Markowska-Kaczmar

Automatic Visual Class Formation using Image Fragment Matching 97

Mariusz Paradowski, Andrzej Śluzek

Learning taxonomic relations from a set of text documents 105

Mari-Sanna Paukkeri, Alberto Perez Garcia-Plaza, Sini Pessala, Timo Honkela

Metric properties of populations in artificial immune systems 113

Zbigniew Pliszka, Olgierd Unold

The development features of the face recognition system 121

Rauf Sadykhov, Igor Frolov

Multiscale Segmentation Based On Mode-Shift Clustering 129

Wojciech Tarnawski, Lukasz Miroslaw, Roman Pawlikowski, Krzysztof Ociepa

Relational database as a source of ontology creation 135

Zdenka Telnarova

Emotional Speech Analysis using Artificial Neural Networks 141

Jana Tuckova, Martin Sramka

Usage of reflection in .NET to inference of knowledge base 149

Marek Vajgl

On the evaluation of the linguistic summarization of temporally focused time series using a measure of informativeness 155

Anna Wilbik, Janusz Kacprzyk

Workshop on Agent Based Computing: from Model to Implementation VII:

Call For Papers 163

Java-based Mobile Agent Platforms for Wireless Sensor Networks 165

Francesco Aiello, Alessio Carbone, Giancarlo Fortino, Stefano Galzarano

BeesyBees - Efficient and Reliable Execution of Service-based Workflow Applications for BeesyCluster using Distributed Agents 173

Paweł Czarnul, Mariusz Matuszek, Michał Wójcik, Karol Zalewski

A Technique based on Recursive Hierarchical State Machines for Application-level Capture of Agent Execution State 181

Giancarlo Fortino, Francesco Rango

Reorganization in Massive Multiagent Systems 189

Henry Hexmoor

Effectiveness of Solving Traveling Salesman Problem Using Ant Colony Optimization on Distributed Multi-Agent Middleware 197

Sorin Ilie, Costin Badica

Selected Security Aspects of Agent-based Computing 205

Mariusz Matuszek, Piotr Szpryngier

Agent-Oriented Modelling for Simulation of Complex Environments 209

Inna Shvartsman, Kuldar Taveter, Merle Parmak, Merik Meriste

Improving Fault-Tolerance of Distributed Multi-Agent Systems with Mobile Network-Management Agents 217

Dejan Mitrović, Zoran Budimac, Mirjana Ivanović, Milan Vidaković

Argumentative agents 223

Francesca Toni

An agent based planner for including network QoS in scientific workflows 231

Zhiming Zhao, Paola Grosso, Ralph Koning, Jeroen van der Ham, Cees de Laat


International Workshop on Advances in Business ICT:

Call For Papers 239

A method for consolidating application landscapes during the post-merger-integration phase 241

Andreas Freitag, Florian Matthes, Christopher Schulz

Hybridization of Temporal Knowledge for Economic Environment Analysis 249

Maria Antonina Mach

Independent Operator of Measurements as a Virtual Enterprise on the Energy Market 255

Bożena Ewa Matusiak

A Two-level algorithm of time series change detection based on a unique deviations similarity method 259

Tomasz Pełech-Pilichowski, Jan T. Duda

STRATEGOS: A case-based approach to strategy making in SME 265

Jerzy Surma

Support of the E-business by business intelligence tools and data quality improvement 271

Milena Tvrdíková, Ondřej Koubek

Computer Aspects of Numerical Algorithms:

Call For Papers 279

The experimental analysis of GMRES convergence for solution of Markov chains 281

Beata Bylina, Jarosław Bylina

On the Numerical Analysis of Stochastic Lotka-Volterra Models 289

Tugrul Dayar, Linar Mikeev, Verena Wolf

Finite Element Approximate Inverse Preconditioning using POSIX threads on multicore systems 297

George A. Gravvanis, P. I. Matskanidis, K. M. Giannoutakis, E. A. Lipitakis

On the implementation of public keys algorithms based on algebraic graphs over finite commutative rings 303

Michał Klisowski, Vasyl Ustimenko

Analysis of Pseudo-Random Properties of Nonlinear Congruential Generators with Power of Two Modulus by Numerical Computing of the b-adic Diaphony 309

Ivan Lirkov, Stanislava Stoilova

Assembling Recursively Stored Sparse Matrices 317

Michele Martone, Salvatore Filippone, Marcin Paprzycki, Salvatore Tucci

Use of Hybrid Recursive CSR/COO Data Structures in Sparse Matrices-Vector Multiplication 327

Michele Martone, Salvatore Filippone, Paweł Gepner, Marcin Paprzycki, Salvatore Tucci

Higher order FEM numerical integration on GPUs with OpenCL 337

Przemysław Płaszewski, Krzysztof Banaś, Paweł Maciół

Parallelization of SVD of a Matrix-Systolic Approach 343

Halil Snopce, Ilir Spahiu

Solving a Kind of BVP for ODEs on heterogeneous CPU + CUDA-enabled GPU Systems 349

Przemyslaw Stpiczynski, Joanna Potiopa

Computational Linguistics - Applications:

Call For Papers 355

Using Self Organizing Map to Cluster Arabic Crime Documents 357

Meshrif Alruily, Aladdin Ayesh, Abdulsamad Al-Marghilani

Quality Benchmarking Relational Databases and Lucene in the TREC4 Adhoc Task Environment 365

Ahmet Arslan, Ozgur Yilmazel

Parallel, Massive Processing in SuperMatrix - a General Tool for Distributional Semantic Analysis of Corpus 373

Bartosz Broda, Damian Jaworski, Maciej Piasecki

Development of a Voice Control Interface for Navigating Robots and Evaluation in Outdoor Environments 381

Ravi Coote

The Role of the Newly Introduced Word Types in the Translations of Novels 389

Maria Csernoch

SyMGiza++: A Tool for Parallel Computation of Symmetrized Word Alignment Models 397

Marcin Junczys-Dowmunt, Arkadiusz Szał

Semi-Automatic Extension of Morphological Lexica 403

Tobias Kaufmann, Beat Pfister

Automatic Extraction of Arabic Multi-Word Terms 411

Khalid Al Khatib, Amer Badarneh

"Beautiful picture of an ugly place". Exploring photo collections using opinion and sentiment analysis of user comments 419

Slava Kisilevich, Christian Rohrdantz, Daniel Keim

LEXiTRON-Pro Editor: An Integrated Tool for developing Thai Pronunciation Dictionary 429

Supon Klaithin, Patcharika Chootrakool, Krit Kosawat

Automatic Detection of Prominent Words in Russian Speech 435

Daniil Kocharov

Computing trees of named word usages from a crowdsourced lexical network 439

Mathieu Lafourcade, Alain Joubert

RefGen: a Tool for Reference Chains Identification 447

Laurence Longo, Amalia Todirascu

Is Shallow Semantic Analysis Really That Shallow? A Study on Improving Text Classification Performance 455

Przemysław Maciołek, Grzegorz Dobrowolski

PerGram: A TRALE Implementation of an HPSG Fragment of Persian 461

Stefan Müller, Masood Ghayoomi

WordnetLoom: a Graph-based Visual Wordnet Development Framework 469

Maciej Piasecki, Michał Marcińczuk, Adam Musiał, Radosław Ramocki, Marek Maziarz

Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata 477

Tommi Pirinen, Krister Lindén

The Polish Cyc lexicon as a bridge between Polish language and the Semantic Web 485

Aleksander Pohl

Tools for syntactic concordancing 493

Violeta Seretan, Eric Wehrli

Effective natural language parsing with probabilistic grammars 501

Paweł Skórzewski

Finding Patterns in Strings using Suffixarrays 505

Herman Stehouwer, Menno Van Zaanen

Entity Summarisation with Limited Edge Budget on Knowledge Graphs 513

Marcin Sydow, Mariusz Pikuła, Ralf Schenkel, Adam Siemion

Multiple Noun Expression Analysis: An Implementation of Ontological Semantic Technology 517

Julia Taylor, Victor Raskin, Maxim Petrenko, Christian F. Hempelmann

A web-based translation service at the UOC based on Apertium 525

Luis Villarejo, Mireia Farrus, Gema Ramírez, Sergio Ortiz

Tools and Methodologies for Annotating Syntax and Named Entities in the National Corpus of Polish 531

Jakub Waszczuk, Katarzyna Głowińska, Agata Savary, Adam Przepiórkowski

TREF - TRanslation Enhancement Framework for Japanese-English 541

Bartholomäus Wloka, Werner Winiwarter

Matura Evaluation Experiment Based on Human Evaluation of Machine Translation 547

Aleksandra Wojak, Filip Graliński

German subordinate clause word order in dialogue-based CALL. 553

Magdalena Wolska, Sabrina Wilske

Polish Phones Statistics 561

Bartosz Ziolko, Jakub Galka

APyCA: Towards the Automatic Subtitling of Television Content in Spanish 567

Aitor Álvarez, Arantza del Pozo, Andoni Arruti

10th International Multidisciplinary Conference on e-Commerce and e-Government:

Call For Papers 575

Trusted Data in IBM's MDM: Accuracy Dimension 577

Przemyslaw Pawluk

Multicriteria Evaluation of DVB-RCS Satellite Internet Performance Used for e-Government and e-Learning Purposes 585

Andrzej M. J. Skulimowski

INFOMAT-E - public information system for people with sight and hearing dysfunctions 593

Michał Socha, Wojciech Górka, Adam Piasecki, Beata Sitek

Bidirectional voting and continuous voting concepts as possible impact of Internet use on democratic voting process 599

Jacek Wachowicz

The Double Jeopardy Phenomenon and the Electronic Distribution of Information 605

Urszula Świerczyńska-Kaczor, Artur Borcuch, Paweł Kossecki

International Symposium on E-Learning - Applications:

Call For Papers 609

Simple Blog Searching Framework Based on Social Network Analysis 611

Iwona Dolińska

6th Workshop on Large Scale Computations on Grids and 1st Workshop on Scalable Computing in Distributed Systems:

Call For Papers 619

Exploratory Programming in the Virtual Laboratory 621

Eryk Ciepiela, Daniel Harężlak, Joanna Kocot, Tomasz Bartyński, Maciej Malawski, Tomasz Gubała

Modelling, Optimization and Execution of Workflow Applications with Data Distribution, Service Selection and Budget Constraints in BeesyCluster 629

Paweł Czarnul

Multi-level Parallelization with Parallel Computational Services in BeesyCluster 637

Paweł Czarnul

Managing large datasets with iRODS - a performance analyses 647

Denis Hünich, Ralph Müller-Pfefferkorn

Service level agreements for job control in high-performance computing 655

Roland Kübert, Stefan Wesner

A Modeling Language Approach for the Abstraction of the Berkeley Open Infrastructure for Network Computing (BOINC) Framework 663

Christian Benjamin Ries, Thomas Hilbig, Christian Schröder

Degisco Green Methodologies in Desktop Grids 671

Bernhard Schott, Ad Emmen

Resource Fabrics: the next level of grids and clouds 677

Lutz Schubert, Matthias Assel, Stefan Wesner

2nd International Workshop on Medical Informatics and Engineering:

Call For Papers 685

Agile methodology and development of software for users with specific disorders 687

Rostislav Fojtik

3rd International Symposium on Multimedia - Applications and Processing:

Call For Papers 693

An Hypergraph Object Oriented Model for Image Segmentation and Annotation 695

Eugen Ganea, Marius Brezovan

Classification of Image Regions Using the Wavelet Standard Deviation Descriptor 703

Sönke Greve, Marcin Grzegorzek, Carsten Saathoff, Dietrich Paulus

High Capacity Colored Two Dimensional Codes 709

Antonio Grillo, Alessandro Lentini, Marco Querini, Giuseppe F. Italiano

Region-based Measures for Evaluation of Color Image Segmentation 717

Andreea Iancu, Bogdan Popescu, Marius Brezovan, Eugen Ganea

Undetectable Spread-time Stegosystem Based on Noisy Channels 723

Valery Korzhik, Guillermo Morales-Luna, Ksenia Loban, Irina Marakova-Begoc

Building Personalized Interfaces by Data Mining Integration 729

Marian Cristian Mihaescu

A Graphical Interface for Evaluating Three Graph-Based Image Segmentation 735

Gabriel Mihai, Alina Doringa, Liana Stanescu

Basic Consideration of MPEG-2 Coded File Entropy and Lossless Re-encoding 741

Kazuo Ohzeki, Yuanyu Wei, Eizaburo Iwata, Ulrich Speidel

Analyzes of the processing performances of a Multimedia Database 749

Cosmin Stoica Spahiu

Constructive Volume Modeling 755

Mihai Tudorache, Mihai Popescu, Razvan Tanasie

Real-Time Embedded Fault Detection Estimators in a Satellite's Reaction Wheels 759

Nicolae Tudoroiu, Eshan Sobhani-Tehrani, Kash Khorasani, Tiberiu Letia, Roxana-Elena Tudoroiu

Application of optimal settings of the LMS adaptive filter for speech signal processing 767

Jan Vaňuš, Vítězslav Stýskala

Obfuscation Methods with Controlled Calculation Amounts and Table Function 775

Yuanyu Wei, Kazuo Ohzeki

International Workshop on Real Time Software:

Call For Papers 781

Computationally effective algorithms for 6DoF INS used for miniature UAVs 783

Jan Floder

Supervisory control and real-time constraints 791

Wojciech Grega

Integration of Scheduling Analysis into UML Based Development Processes Through Model Transformation 797

Matthias Hagner, Ursula Goltz

Laboratory real-time systems to facilitate automatic control education and research 805

Krzysztof Kołek, Andrzej Turnau, Krystyn Hajduk, Paweł Piątek, Mariusz Pauluk, Dariusz Marchewka, Adam Piłat, Maciej Rosół, Przemysław Gorczyca

Methods of Computer-Assisted Manual Control of Wheeled Robots 813

Viktor Michna, Petr Wagner, Jiri Kotzian

Software and hardware in the loop component for an IEC 61850 Co-Simulation platform 817

Haffar Mohamad, Thiriet Jean Marc

Real-time controller design based on NI Compact-RIO 825

Maciej Rosół, Adam Piłat, Andrzej Turnau

Intelligent Car Control and Recognition Embedded System 831

Vilem Srovnal Jr., Zdenek Machacek, Radim Hercik, Roman Slaby, Vilem Srovnal

4th International Workshop on Secure Information Systems:

Call For Papers 837

A Security Model for Personal Information Security Management Based on Partial Approximative Set Theory 839

Zoltán Csajbók

Social Engineering-Based Attacks - Model and New Zealand Perspective 847

Lech Janczewski, Lingyan (Ren) Fu

International Symposium on Technologies for Social Advancement:

Global Mobile Applications For Monitoring Health 855

Tapsie Giridher Giridher, Anita Wasilewska, Jennifer Wong

A Study on the Expectations and Actual Satisfaction about Mobile Handset before and after Purchase 861

JIBum Jung, seungpyo Hong

Workshop on Ad-Hoc Wireless Networks:

Call For Papers 867

Wireless Transceiver for Control of Mobile Embedded Devices 869

Jan Kordas, Petr Wagner, Jiri Kotzian

Efficient Coloring of Wireless Ad Hoc Networks With Diminished Transmitter Power 873

Krzysztof Krzywdziński

Fast Construction of Broadcast Scheduling and Gossiping in Dynamic Ad Hoc Networks 879

Krzysztof Krzywdziński

Workshop on Computational Optimization:

Call For Papers 885

ACO with semi-random start applied on MKP 887

Stefka Fidanova, Pencho Marinov, Krassimir Atanassov

On the Probabilistic min spanning tree problem 893

Boria Nicolas, Murat Cécile, Paschos Vangelis

Efficient Portfolio Optimization with Conditional Value at Risk 901

Wlodzimierz Ogryczak, Tomasz Sliwinski

Enhanced Competitive Differential Evolution for Constrained Optimization 909

Josef Tvrdik, Radka Polakova

5th International Symposium Advances in Artificial Intelligence and Applications

CELEBRATING 75TH BIRTHDAY OF PROFESSOR LEONARD BOLC

The AAIA'10 will bring together researchers, developers, practitioners, and users to present their latest research, results, and ideas in all areas of artificial intelligence. We hope that theory and successful applications presented at the AAIA'10 will be of interest to researchers and practitioners who want to know about both theoretical advances and latest applied developments in Artificial Intelligence. As such AAIA'10 will provide a forum for the exchange of ideas between theoreticians and practitioners to address the important issues.

Papers related to theories, methodologies, and applications in science and technology in this theme are especially solicited. Topics covering industrial issues/applications and academic research include, but are not limited to:

Knowledge management
Decision Support System
Approximate Reasoning
Fuzzy modeling and control
Data Mining
Web Mining
Machine learning
Combining multiple knowledge sources in an integrated intelligent system
Neural Networks
Evolutionary Computation
Artificial Immune Systems
Ant Systems in Applications
Natural Language processing
Image processing and understanding (interpretation)
Applications in Bioinformatics
Hybrid Intelligent Systems
Granular Computing
Architectures of intelligent systems
Robotics
Real-world applications of Intelligent Systems

INTERNATIONAL PROGRAMME COMMITTEE

Janos Abonyi, University of Pannonia, Hungary
Hans Jorgen Andersen, Aalborg University, Denmark
Anna Bartkowiak, Wroclaw University, Poland
Shlomo Berkovsky, CSIRO, Australia
Ryszard Choras, Institute of Telecommunications, Poland
Krzysztof Cios, Virginia Commonwealth University, USA
Alfredo Cuzzocrea, University of Calabria, Italy
Claudio De Stefano, University of Cassino, Italy
Jeremiah Da Deng, University of Otago, New Zealand
Krzysztof Goczyla, Gdansk University of Technology, Poland
Amr Goneid, Computer Science Dept., American University in Cairo, Egypt
Min Henderson, University of Virginia, USA
Zdzislaw Hippe, University of Information Technology and Management in Rzeszow, Poland
Elzbieta Hudyma, Wroclaw University of Technology, Poland
Jerzy W. Jaromczyk, University of Kentucky, USA
Piotr Jedrzejowicz, Gdynia Maritime University, Poland
Jerzy Jozefczyk, Wroclaw University of Technology, Poland
Janusz Kacprzyk, Systems Research Institute of the Polish Academy of Sciences, Poland
Radosław Katarzyniak, Wrocław University of Technology, Poland
Przemyslaw Kazienko, Wroclaw University of Technology, Poland
Vojislav Kecman, Virginia Commonwealth University, USA
Etienne Kerre, University of Gent, Belgium
Jacek Kluska, Rzeszow University of Technology, Poland
Yiannis Kompatsiaris, Informatics and Telematics Institute, Greece
Jozef Korbicz, University of Zielona Gora, Poland
Jerzy Korczak, Wroclaw University of Economics, Poland
Witold Kosinski, Polish-Japanese Institute of Information Technology, Poland
Adam Krzyzak, Concordia University, Canada
Juliusz Lech Kulikowski, Institute of Computer Science of the Polish Academy of Sciences, Poland
Lukasz Kurgan, University of Alberta, Canada
Halina Kwasnicka, Wroclaw University of Technology, Poland
Serguei Levachkine, National Polytechnic Institute, Mexico
Rory Lewis, University of Colorado at Colorado Springs, USA
Joo-Hwee Lim, Institute for Infocomm Research, A*STAR, Singapore
Jie Lu, University of Technology Sydney, Australia
Abdel-Badeeh M. Salem, Ain Shams University, Egypt
Jacek Mandziuk, Warsaw University of Technology, Poland
Urszula Markowska-Kaczmar, Wroclaw University of Technology, Poland
Zbigniew Michalewicz, University of Adelaide, Australia
Santiago M. Mola, Universidad Politécnica de Valencia, Spain
Pawel Myszkowski, Wroclaw University of Technology, Poland
Tapio Pahikkala, University of Turku, Finland
Mariusz Paradowski, Wroclaw University of Technology, Poland
Witold Pedrycz, University of Alberta, Canada
James Peters, University of Manitoba, Canada
Sheela Ramanna, University of Winnipeg, Canada
Zbigniew Ras, University of North Carolina, USA
Paolo Rosso, Universidad Politécnica Valencia, Spain
Gunter Saake, Otto-von-Guericke-Universität, Germany
Jerzy Sas, Wroclaw University of Technology, Poland
Christelle Scharff, Pace University, USA
Roman Slowinski, Poznan University of Technology, Poland
Andrzej Sluzek, Nanyang Technological University, Singapore
Janusz Sobecki, Wroclaw University of Technology, Poland
Siergey Subbotin, Zaporozhye National Technical University, Ukraine
Jerzy Swiatek, Wroclaw University of Technology, Poland
Piotr Szczepaniak, Technical University of Lodz, Poland
Stan Szpakowicz, SITE, University of Ottawa, Canada
Ryszard Tadeusiewicz, AGH University of Science and Technology, Poland
Li-Shiang Tsay, North Carolina A&T State University, USA
Josef Tvrdik, University of Ostrava, Czech Republic
Angelina Tzacheva, Univ. of South Carolina, USA
Anita Wasilewska, Stony Brook University, NY, USA
Daniela Zaharie, West University of Timisoara, Romania
Wojciech Ziarko, University of Regina, Canada

ORGANIZING COMMITTEE

Halina Kwasnicka, Urszula Markowska-Kaczmar, Wrocław University of Technology, Poland

A Breast Cancer Classifier based on a Combination of Case-Based Reasoning and Ontology Approach

Essam Amin M.Lotfy Abdrabou

Ph.D Candidate

Faculty of Computer and Information Sciences

Ain Shams University, Abbassia, 11566, Cairo, EGYPT

(+202) 26330636

Email: [email protected]

AbdEl-Badeeh M. Salem

Professor

Faculty of Computer and Information Sciences

Ain Shams University, Abbassia, 11566, Cairo, EGYPT

(+202) 26844284

Email: [email protected]

Abstract: Breast cancer is the second most common form of cancer amongst females and also the fifth most common cause of cancer deaths worldwide. For this particular type of malignancy, early detection is the best form of cure, and hence timely and accurate diagnosis of the tumor is extremely vital. Extensive research has been carried out on automating the critical diagnosis procedure, and various machine learning algorithms have been developed to aid physicians in optimizing the decision task effectively. In this research, we present a breast cancer classification model based on a combination of ontology and case-based reasoning to classify breast cancer tumors as either malignant or benign. This classification system makes use of clinical data. Two ontology-based, object-oriented CBR frameworks are used: jCOLIBRI and myCBR. A breast cancer diagnostic prototype is built. During prototyping, we examine the use and functionality of the two frameworks.

Index Terms: Case-Based Reasoning, Case-Based Reasoning Frameworks, CBR, CBR Frameworks, jCOLIBRI, myCBR, Breast Cancer

I. INTRODUCTION

Breast cancer classification, diagnosis and prediction techniques have been a widely researched area in medical informatics over the past decade. Several articles have been published that try to classify breast cancer data sets using various techniques such as fuzzy logic, support vector machines, Bayesian classifiers, decision trees and neural networks. Classification accuracy as high as 98.8% has been achieved using a learning algorithm combining simulated annealing with the perceptron algorithm. Another study involving fuzzy modeling and cooperative co-evolution achieved an accuracy of 98.98% on the widely studied Wisconsin breast cancer database [16].

This research applies a new technique in the field of breast cancer classification. It combines ontology and case-based reasoning by using ontology-based, object-oriented case-based reasoning frameworks. Two frameworks are examined in building the classifier. One is the open source jCOLIBRI [5] system, developed by the GAIA group, which provides a framework for building CBR systems based on state-of-the-art software engineering techniques. The other is the novel open source CBR tool myCBR [24], developed at the German Research Center for Artificial Intelligence (DFKI). The objective of this classifier is to classify a patient's tumor as benign or malignant based on the patient's electronic record.

This paper is organized in four sections. Section 1 is this introduction. Section 2 gives a theoretical background on breast cancer, ontology, CBR and object-oriented frameworks. Section 3 illustrates the implementation of the breast cancer classifier on the two frameworks. Finally, Section 4 discusses the results and concludes.

II. THEORETICAL BACKGROUND

A. Breast Cancer

Breast cancer is the form of cancer that either originates in the breast or is primarily present in the breast cells. The disease occurs mostly in women, but a small population of men is also affected by it. Breast cancer is the most common form of cancer amongst the female population as well as the most common cause of cancer deaths [25]. Early detection of breast cancer saves many thousands of lives each year. Many more could be saved if patients were offered accurate, timely analysis of their particular type of cancer and the available treatment options. Since breast tumors, whether malignant or benign, share structural similarities, it is an extremely tedious and time-consuming task to differentiate them manually. As seen in Figure 1, to an untrained eye there is no visually significant difference between the fine needle biopsy images of a malignant and a benign tumor. Accurate classification is very important because the cytotoxic drugs administered during treatment are potent enough to be life threatening or to lead to another cancer. Laboratory analysis or biopsy of the tumor is a manual, time-consuming yet accurate method of prediction. It is, however, prone to human error, creating a need for an automated system that provides a faster and more reliable method of diagnosis and prediction for patients.

Fig. 1. Fine needle biopsies of breast. Malignant (left) and Benign (right) [25]

Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 3-10. ISBN 978-83-60810-27-9, ISSN 1896-7094. 978-83-60810-27-9/09/$25.00 © 2010 IEEE

B. Ontology

An ontology is a formal, explicit description of the concepts in a domain of discourse (classes, sometimes called concepts), the properties of each concept that describe its various features and attributes (slots, sometimes called roles or properties), and the restrictions on slots (facets, sometimes called role restrictions). An ontology together with a set of individual instances of classes constitutes a knowledge base. In reality, there is a fine line where the ontology ends and the knowledge base begins [8].
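The vocabulary of classes, slots, facets and instances can be made concrete with a small sketch. The Python below is illustrative only and is not taken from jCOLIBRI, myCBR or any ontology tool: the `Slot` and `OntologyClass` types, the `Tumor` class and its slot names and value ranges are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Slot:
    """A property of a class; the remaining fields are its facets."""
    name: str
    value_type: type
    allowed: tuple = ()               # facet: permitted values (empty = any)
    minimum: Optional[float] = None   # facet: lower bound
    maximum: Optional[float] = None   # facet: upper bound

    def accepts(self, value) -> bool:
        """Check a candidate value against every facet of this slot."""
        if not isinstance(value, self.value_type):
            return False
        if self.allowed and value not in self.allowed:
            return False
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


@dataclass
class OntologyClass:
    """A concept in the domain, described by its slots."""
    name: str
    slots: dict = field(default_factory=dict)

    def add_slot(self, slot: Slot) -> None:
        self.slots[slot.name] = slot

    def instantiate(self, **values) -> dict:
        """Create an individual instance, enforcing every facet."""
        for slot_name, value in values.items():
            if slot_name not in self.slots:
                raise KeyError(f"{self.name} has no slot {slot_name!r}")
            if not self.slots[slot_name].accepts(value):
                raise ValueError(f"{value!r} violates a facet of {slot_name!r}")
        return {"class": self.name, **values}


# Ontology: one class with two slots (names and ranges are hypothetical).
tumor = OntologyClass("Tumor")
tumor.add_slot(Slot("clump_thickness", int, minimum=1, maximum=10))
tumor.add_slot(Slot("diagnosis", str, allowed=("benign", "malignant")))

# Knowledge base: the ontology plus individual instances of its classes.
knowledge_base = [tumor.instantiate(clump_thickness=5, diagnosis="benign")]
```

In this sketch the "fine line" mentioned above is the last statement: everything before it describes the domain, and the list of instances is what turns the description into a knowledge base.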

C. Case-Based Reasoning

In case-based reasoning (CBR) systems, expertise is embodied in a library of past cases rather than being encoded in classical rules. Each case typically contains a description of the problem, plus a solution and/or the outcome. The knowledge and reasoning process used by an expert to solve the problem is not recorded, but is implicit in the solution. To solve a current problem, the problem is matched against the cases in the case base, and similar cases are retrieved. The retrieved cases are used to suggest a solution that is reused and tested for success. If necessary, the solution is then revised. Finally, the current problem and the final solution are retained as part of a new case.
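The matching and reuse steps just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the retrieval method of jCOLIBRI or myCBR: the attribute names, the toy cases, the range-normalized similarity measure and the majority vote over the k most similar cases are all choices made for the example.

```python
from collections import Counter


def similarity(query, case, ranges):
    """Mean per-attribute similarity in [0, 1]; 1.0 means identical."""
    total = 0.0
    for name, (low, high) in ranges.items():
        gap = abs(query[name] - case["problem"][name])
        total += 1.0 - gap / (high - low)
    return total / len(ranges)


def retrieve(query, case_base, ranges, k=3):
    """Return the k cases most similar to the query."""
    ranked = sorted(case_base,
                    key=lambda case: similarity(query, case, ranges),
                    reverse=True)
    return ranked[:k]


def reuse(retrieved):
    """Suggest the solution that most retrieved cases agree on."""
    votes = Counter(case["solution"] for case in retrieved)
    return votes.most_common(1)[0][0]


# Toy case base: invented values, not clinical data.
case_base = [
    {"problem": {"clump_thickness": 2, "cell_size": 1}, "solution": "benign"},
    {"problem": {"clump_thickness": 3, "cell_size": 2}, "solution": "benign"},
    {"problem": {"clump_thickness": 8, "cell_size": 9}, "solution": "malignant"},
    {"problem": {"clump_thickness": 9, "cell_size": 7}, "solution": "malignant"},
]
ranges = {"clump_thickness": (1, 10), "cell_size": (1, 10)}

query = {"clump_thickness": 7, "cell_size": 8}
retrieved = retrieve(query, case_base, ranges, k=3)
suggested = reuse(retrieved)
```

For this query the two malignant cases are the nearest neighbours, so the vote over three retrieved cases suggests "malignant". Revision and retention are left out of the sketch.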

The CBR process can be represented by a schematic cycle, as shown in Figure 2 [1].

Fig. 2. The CBR Cycle

Representation: Given a new situation, generate appropriate semantic indices that will allow its classification and categorization. This usually implies a standard indexing vocabulary that the CBR system uses to store historical information and problems. The vocabulary must be rich enough to be expressive, but limited enough to allow efficient recall [2].

Retrieval: Given a new, indexed problem, retrieve the best past cases from memory. This requires answering three questions: What constitutes an appropriate case? What are the criteria of closeness or similarity between cases? How should cases be indexed? Part of the index must be a description of the problem that the case solved, at some level of abstraction. Part of the case, though, is also the knowledge gained from solving the problem represented by the case. In other words, cases should also be indexed by some elements of their solution [11].

Adaptation: Modify the old solutions to conform to the new situation, resulting in a proposed solution. With the exception of trivial situations, the solution recalled will not immediately apply to the new problem, usually because the old and the new problem are slightly different. CBR researchers have developed and used various adaptation techniques [11].

Validation: After the system checks a solution, it must

evaluate the results of this check. If the solution is acceptable,

based on some domain criteria, the CBR system is done with

reasoning. Otherwise, the case must be modified again, and

this time the modifications will be guided by the results of the

solution's evaluation [11].

Update: If the solution fails, explain the failure and learn

it, to avoid repeating it. If the solution succeeds and warrants

retention, incorporate it into the case memory as a successful

solution and stop. The CBR system must decide if a successful

new solution is sufficiently different from already-known solu-

tions to warrant storage. If it does warrant storage, the system

must decide how the new case will be indexed, on which level

of abstraction it will be saved, and where it will be put in the

case-base organization [11].

Retaining the case is the process of incorporating whatever

is useful from the new case into the case library. This involves

deciding what information to retain and in what form to retain

it; how to index the case for future retrieval; and integrating

the new case into the case library.

D. CBR Object-Oriented Frameworks

The concept of object-oriented frameworks was introduced in the late 1980s. A framework has been defined as a set of classes that embodies an abstract design for solutions to a family of related problems and supports reuse at a larger granularity than classes [9].

The goal of a framework is to capture a set of concepts

related to a domain and the way they interact. In addition, a

framework is in control of a part of the program activity and

calls specific application code by dynamic method binding.

A framework can be viewed as an incomplete application

where the user only has to specialize some classes to build

the complete application [9].
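The idea of a framework that controls part of the program flow and calls application code through dynamic binding can be illustrated with a small sketch. The class names `Reasoner` and `ExactMatchReasoner` are hypothetical and do not come from jCOLIBRI, myCBR, or [9]. The base class plays the role of the framework, and the subclass is the specialization written by the user.

```python
from abc import ABC, abstractmethod

class Reasoner(ABC):
    """Framework side: fixes the control flow of one reasoning step."""

    def run(self, query):
        # The framework owns this sequence and calls back into user code.
        candidates = self.retrieve(query)
        return self.adapt(query, candidates)

    @abstractmethod
    def retrieve(self, query):
        """Must be supplied by the application."""

    def adapt(self, query, candidates):
        # Default hook: reuse the first candidate unchanged.
        return candidates[0] if candidates else None

class ExactMatchReasoner(Reasoner):
    """Application side: specializes only what differs."""

    def __init__(self, cases):
        self.cases = cases  # mapping from problem to solution

    def retrieve(self, query):
        return [self.cases[query]] if query in self.cases else []

reasoner = ExactMatchReasoner({"fever": "rest", "cut": "bandage"})
answer = reasoner.run("cut")
```

The user never calls `retrieve` directly. The inherited `run` method decides when it is invoked, which is the sense in which the framework is an incomplete application.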

Frameworks allow the reuse of both code and design for a class of problems, enabling non-experts to write complex applications quickly. Frameworks also allow the development of prototypes that can later be extended by specialization or composition. Once understood, a framework can be applied in a wide range of domains and can be enhanced by adding new components [9].

Using frameworks to develop new applications helps improve software quality. It improves programmer productivity as well as the quality, performance, and reliability of the software. It also enhances extensibility by providing the methods that allow applications to extend the framework's stable interfaces [20]. Figure 3 shows the difference in effort between developing an application from scratch and using a framework [15].

Fig. 3. Development Effort Reduction by using Frameworks

CBR researchers agree that the best way to satisfy the increasing demand for CBR applications is to develop frameworks. Recently, several efforts within the CBR community have produced CBR frameworks [20]. This paper focuses on two of them: jCOLIBRI, developed by the GAIA group, and myCBR, developed by the DFKI group.

III. EXPERIMENTS

A. Breast Cancer Classifications

Breast cancer has become the number one cause of cancer deaths amongst women. Once a breast tumor is detected, it can be classified as benign (not cancerous tissue) or malignant (cancerous tissue). In this study, the two CBR frameworks are compared by developing a CBR application that classifies a breast tumor as benign or malignant. The Wisconsin breast cancer data set was used to build the case bases. It was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg [14]. Samples arrived in the data set periodically as Dr. Wolberg reported his clinical cases. The data set contains 699 instances (as of 15 July 1992). Each record contains ten attributes plus the class attribute. Table I shows the attributes and their possible values. Of the instances, 65.5% belong to the benign class and 34.5% to the malignant class. Sixteen instances are incomplete (an attribute value is missing) and have been excluded from the data set.
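As a quick check of the figures above, the following sketch filters out incomplete records and recomputes the counts implied by the quoted percentages. The three rows are made-up examples in the layout of Table I, and the use of "?" as the missing-value marker is an assumption about the file format rather than something stated in this paper.

```python
import csv
import io

# Made-up rows in the assumed layout: id, nine 1-10 attributes, class.
# "?" as the missing-value marker is an assumption about the file format.
raw = """\
1001,5,1,1,1,2,1,3,1,1,2
1002,8,4,5,1,2,?,7,3,1,4
1003,10,7,7,6,4,10,4,1,2,4
"""

rows = list(csv.reader(io.StringIO(raw)))
complete = [r for r in rows if "?" not in r]  # drop records with a missing value

# Figures quoted in the text.
total, incomplete = 699, 16
remaining = total - incomplete        # records left after exclusion
benign = round(0.655 * total)         # instances in the benign class
malignant = round(0.345 * total)      # instances in the malignant class
```

With the quoted numbers, 683 complete records remain, and the percentages correspond to 458 benign and 241 malignant instances out of 699.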

TABLE I
WISCONSIN BREAST CANCER DATASET

No.   Attribute                      Possible Value
1     Sample code number             id number
2     Clump Thickness                1–10
3     Uniformity of Cell Size        1–10
4     Uniformity of Cell Shape       1–10
5     Marginal Adhesion              1–10
6     Single Epithelial Cell Size    1–10
7     Bare Nuclei                    1–10
8     Bland Chromatin                1–10
9     Normal Nucleoli                1–10
10    Mitoses                        1–10
11    Class                          2 for benign, 4 for malignant

B. jCOLIBRI

1) Overview: jCOLIBRI is an evolution of the COLIBRI architecture [7], which consisted of a library of problem solving methods (PSMs) for solving the tasks of a knowledge-intensive CBR system, along with an ontology, CBROnto [8], of common CBR terminology. COLIBRI was prototyped in LISP using LOOM as the knowledge representation technology. This prototype served as a proof of concept and was very useful, but it was not suitable for non-expert users. The GAIA group therefore started to develop a new, complete framework named jCOLIBRI. The name stands for Cases and Ontology Libraries Integration for Building Reasoning Infrastructures. The CBR ontology assumes a vocabulary common to any CBR system. In jCOLIBRI, the ontology is not represented as a separate source. Instead, all CBR concepts are mapped onto classes and interfaces of the framework. The classes that represent the concepts of the ontology serve as templates to which new CBR types should be added. They also provide the tasks and the abstract interface of the methods. The design of the jCOLIBRI framework comprises a hierarchy of Java classes plus a number of XML files. The framework is organized around the following elements [2]:

Tasks and methods: The tasks supported by the framework and the methods that solve them are all stored in a set of XML files.

Case-base: Different connectors are defined to support several types of case persistence, from the file system to a database.

Cases: A number of interfaces and classes are included in the framework to provide an abstract representation of cases that supports any type of actual case structure.

Problem solving methods: The actual code that supports the methods included in the framework.

jCOLIBRI comes in two major releases, version 1 and version 2. According to the tutorial [19], version 2 is a new implementation that follows a new and clear architecture divided into two layers: one oriented to developers and the other oriented to designers. Unfortunately, the only available distribution of version 2 is the developer-oriented one, which is outside the scope of this paper. jCOLIBRI version 1 is the first release of the framework. It includes a complete Graphical

ESSAM ABDRABOU, ABDEL-BADEEH SALEM: A BREAST CANCER CLASSIFIER 5

(a) Patient Case Definition in jCOLIBRI

(b) Managing Connectors in jCOLIBRI

(c) Configuration of Tasks in jCOLIBRI


(d) jCOLIBRI Retrieval

Fig. 4. Implementation in jCOLIBRI

User Interface (GUI) that guides the user through the design of a CBR system. This version is recommended for non-developer users who want to create CBR systems without writing any code, which is exactly the scope of this study. Version 1 was therefore selected to implement the required application. Downloading jCOLIBRI is easy; it can be obtained from the web page of the GAIA group. It comes as a compressed distribution that can be extracted to obtain the full package. A ready-made batch file (we used the MS Windows platform) can be invoked directly to run jCOLIBRI. A Java Virtual Machine must be installed before running the batch file. Invoking the batch file opens the first screen of the framework GUI.

2) Implementation: With the help of the provided multimedia tutorials and the jCOLIBRI GUI, users can go through five steps to implement and deploy a CBR system. These steps are:

• Definition of case structures
• Building the case-base
• Managing similarity measures
• Configuring the behavior of the CBR process
• Testing and deploying the CBR application

Definition of Case Structures: Using the jCOLIBRI GUI, users can create the case structure by defining the simple and compound attributes that describe the cases, together with their types, weights, and similarity measures, which are chosen from a library of existing similarity functions and parameters. The case structure can be saved to and loaded from an XML file. Figure 4(a) shows the definition of the patient case parameters.

Building the case-base: jCOLIBRI introduces the concept of Connectors, around which case persistence is built. Connectors are objects that know how to access and retrieve cases from the storage media and return those cases to the CBR system in a uniform way. Connectors therefore provide an abstraction mechanism that allows users to load cases from different storage sources in a transparent way [24] [21]. The predefined connectors work with plain text files, XML files, or relational databases. The graphical interface helps map the defined case structure to the tables and columns of the storage scheme. Figure 4(b) shows how the patient case structure is mapped to columns in a text file containing the Wisconsin data set patient records.
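The connector idea can be sketched as a small abstraction layer. The code below is not jCOLIBRI's API: the names `Connector`, `TextFileConnector`, `InMemoryConnector` and `load_case_base` are hypothetical and only illustrate how different storage sources can return cases in one uniform shape.

```python
from abc import ABC, abstractmethod
import csv
import io

class Connector(ABC):
    """Hypothetical interface: every source returns cases as dicts."""

    @abstractmethod
    def retrieve_all(self):
        """Return all stored cases as a list of dicts."""

class TextFileConnector(Connector):
    def __init__(self, stream, columns):
        self.stream = stream      # any file-like object with comma-separated rows
        self.columns = columns    # attribute name for each column position

    def retrieve_all(self):
        return [dict(zip(self.columns, row)) for row in csv.reader(self.stream)]

class InMemoryConnector(Connector):
    def __init__(self, cases):
        self.cases = cases

    def retrieve_all(self):
        return list(self.cases)

def load_case_base(connector):
    # The caller never needs to know where the cases come from.
    return connector.retrieve_all()

columns = ["id", "clump_thickness", "class"]
text_source = io.StringIO("1001,5,2\n1002,8,4\n")
from_text = load_case_base(TextFileConnector(text_source, columns))
from_memory = load_case_base(
    InMemoryConnector([{"id": "1001", "clump_thickness": "5", "class": "2"}])
)
```

The `columns` list plays the role of the mapping between the case structure and the columns of the text file that the GUI helps to define.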

Managing similarity measures: When two cases are compared, local similarity functions are used to compare simple attribute values. Global similarity functions are linked to compound attributes and are used to combine the similarities of the collected attributes into a single similarity value. Finally, the similarity of two cases is computed as the similarity of their description concepts. The available similarity measures are listed in a configuration file and can be managed using the GUI. Since our problem is simple, we kept the default similarity measures assigned by jCOLIBRI.

Configuring the behavior of the CBR process: As introduced above, jCOLIBRI formalizes CBR knowledge using a CBR ontology (CBROnto), a knowledge-level description of the CBR tasks, and a library of reusable Problem Solving Methods (PSMs) [21]. Tasks are configured interactively by choosing, from the library of reusable methods, one that is suitable for solving the selected task. The constraints of the selected task are tracked during the configuration process so that only the methods applicable in the given context are offered to users. In our comparison we focus only on the retrieval task. Figure 4(c) shows the configured tasks in the breast cancer application.

Testing and deploying the CBR application: The CBR application is finished when all the tasks have been configured. Users can test the system from inside the graphical interface. The first task of the CBR system (the Obtain query task) obtains the query that is used to retrieve the most similar cases. Figure 4(d) shows the GUI after a query. We tested the 16 records that were excluded from the data set because of a missing value. Only two were misclassified. The documentation states that the developed CBR application can be deployed by generating a code template containing most of the code required to run the system as an independent application. We tried this process, but it failed completely.

C. myCBR

1) Overview: myCBR is an open-source plug-in for the open-source ontology editor Protégé [6]. Protégé is based on Java, is extensible, and provides a plug-and-play environment that makes it a flexible base for rapid prototyping and application development [4]. Protégé [4] allows defining classes and attributes in an object-oriented way. Furthermore, it manages instances of these classes, which myCBR interprets as cases [22]. So the handling of vocabulary and case base is already provided by Protégé. The myCBR plug-in provides several editors to define similarity measures for an ontology and a retrieval interface for testing [24]. As the main goal of myCBR is to minimize the effort for building CBR applications that require knowledge-intensive similarity measures, myCBR


(a) Wisconsin Dataset in a CSV File

(b) Patient Case Data Representation in myCBR

(c) Retrieval of a Case Query with a Missing Attribute Value


(d) Breast Cancer as a Stand-Alone Application

Fig. 5. Implementation in myCBR

provides comfortable GUIs for modeling various kinds of attribute-specific similarity measures and for evaluating the resulting retrieval quality. To also reduce the effort of the preceding step of defining an appropriate case representation, it includes tools for generating the case representation automatically from existing raw data [22]. Both the novice and the expert knowledge engineer are supported during the development of a myCBR project through intelligent support approaches and advanced GUI functionality [22].

Obtaining myCBR requires two downloads. The first is the myCBR plug-in files, which are available directly from the myCBR web page. The second is the Protégé ontology editor, which is available from the Protégé web page. Downloading Protégé is not an easy task: users need to read some of the material on the site to be able to select the suitable version. Since myCBR is a plug-in for Protégé, users need to install Protégé first. A Java Virtual Machine must be installed before proceeding with the installation; alternatively, users may download the version that includes Java. To install the myCBR plug-in, users copy the myCBR plug-in files into Protégé's plug-ins directory. Then, when starting Protégé and creating new projects, users enable the myCBR plug-ins from the configuration menu of Protégé. After the myCBR plug-in is installed and activated, the Protégé user interface is extended with additional tabs that give access to the myCBR modules.

After a CBR application has been developed using the Protégé plug-in, myCBR can also be used as a stand-alone Java module to be integrated into arbitrary applications, for example JSP-based web applications. In this application phase, the retrieval engines of myCBR just read the XML files of the project generated using the plug-in interface and perform the similarity-based retrieval [24].

For Protégé manuals and tutorials, users may consult the documentation section of the Protégé web site. Among other things, users may find the Protégé User's Guide, a "getting started" tutorial, and information on ontology development. The manual for myCBR is available on its web page in HTML and PDF versions. The manual covers installation and various usage issues. No multimedia tutorials are available for myCBR.

2) Implementation: Four steps are required to develop a CBR application:

• Generation of case representations
• Modeling similarity measures
• Testing of retrieval functionality
• Implementation of a stand-alone application

Generation of case representations: One powerful feature of myCBR is the ease of creating the case representation with the CSV data import module [24]. Users can import data instances into an existing Protégé class or create a new class that suits their raw data. Figure 5(a) shows how the Wisconsin dataset is arranged in a CSV file. myCBR also allows slots to be added manually using Protégé. Figure 5(b) shows the myCBR screen after importing the dataset into a new class, Patient, whose instances are used as query and case values in the retrieval step.
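To make the idea of generating a case representation from raw data concrete, the sketch below derives slot names and a simple slot type from a CSV file with a header row. It is not the myCBR import module. The function `import_cases`, the two type labels, and the column names are hypothetical and chosen only for illustration.

```python
import csv
import io

def import_cases(stream):
    """Derive slot names and types from a CSV with a header row."""
    reader = csv.DictReader(stream)
    rows = list(reader)
    slots = {}
    for name in reader.fieldnames:
        values = [row[name] for row in rows]
        # Assumed rule: a slot is an integer slot if every value parses as int.
        is_int = all(v.lstrip("-").isdigit() for v in values)
        slots[name] = "integer" if is_int else "symbol"
    # Convert the values of integer slots; keep everything else as text.
    cases = [
        {k: int(v) if slots[k] == "integer" else v for k, v in row.items()}
        for row in rows
    ]
    return slots, cases

data = io.StringIO("Id,ClumpThickness,BareNuclei,Class\n1001,5,1,2\n1002,8,?,4\n")
slots, cases = import_cases(data)
```

In this toy input the column containing a "?" entry is typed as a symbol slot, which shows why incomplete records need separate handling before numeric similarity measures can be applied.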

Modeling of similarity measure: myCBR follows the local-global approach, which divides the similarity definition into a set of local similarity measures for each attribute, a set of attribute weights, and a global similarity measure for calculating the final similarity value. This means that, for an attribute-value based case representation consisting of n attributes, the similarity between a query q and a case c may be calculated as follows:

$$Sim(q, c) = \sum_{i=1}^{n} w_i \cdot sim_i(q_i, c_i) \qquad (1)$$

Here, $sim_i$ and $w_i$ denote the local similarity measure and the weight of attribute i, and Sim represents the global similarity measure [24]. The dataset used in this experiment is simple, so we leave the similarity measure definition at the myCBR default. We only change the weight values of the Id and Class slots from one to zero. However, users may consult the myCBR tutorial for more options in defining local and global similarity measures.
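Equation (1) can be illustrated with a short sketch. The local measure used here, a linear function of the distance on the 1 to 10 range, is an assumption made for the example and is not necessarily the myCBR default. The attribute names are also hypothetical. The zero weights for Id and Class mirror the setting described above.

```python
def local_sim(q, c, lo=1, hi=10):
    # Assumed local measure for the 1-10 attributes: linear in the distance.
    return 1.0 - abs(q - c) / (hi - lo)

def global_sim(query, case, weights):
    # Equation (1): weighted sum of the local similarities.
    # Attributes with weight zero are skipped, so they never influence the score.
    return sum(w * local_sim(query[a], case[a]) for a, w in weights.items() if w)

weights = {"Id": 0, "ClumpThickness": 1, "CellSize": 1, "Class": 0}
query = {"Id": 7, "ClumpThickness": 5, "CellSize": 1, "Class": 2}
case = {"Id": 9999, "ClumpThickness": 5, "CellSize": 4, "Class": 4}
score = global_sim(query, case, weights)
```

Setting the weights of Id and Class to zero keeps the identifier and the class label itself from affecting which cases are considered similar. Dividing the sum by the total weight would give a value between 0 and 1, but that normalization is not part of Equation (1) as written.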

Testing of retrieval functionality: myCBR includes an easy-to-use GUI for performing retrievals and for analyzing the corresponding results. By providing similarity highlighting and explanation functionality, myCBR supports the efficient analysis of the outcome of the similarity computation. We tested the 16 records that were excluded from the dataset because of a missing value. Only two were misclassified. Figure 5(c) shows one of these queries after retrieving the most similar cases. An alternative way of performing case retrieval is to use an existing case as the query. This was also tested and gives a similar result, as shown in Figure 5(d).
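The test described above, in which records with a missing attribute value are used as queries, can be sketched as follows. The policy of ignoring the missing attribute during the comparison is an assumption made for this example and is not claimed to be what myCBR or jCOLIBRI does internally. The attribute names and the two stored cases are made up.

```python
def sim(query, case):
    # Compare only the attributes that the query actually has (None = missing).
    known = [a for a, v in query.items() if v is not None]
    return sum(1.0 - abs(query[a] - case[a]) / 9 for a in known) / len(known)

def classify(query, case_base):
    # Label the query with the class of its most similar stored case.
    best = max(case_base, key=lambda c: sim(query, c["attrs"]))
    return best["cls"]

case_base = [
    {"attrs": {"thickness": 2, "nuclei": 1, "mitoses": 1}, "cls": "benign"},
    {"attrs": {"thickness": 9, "nuclei": 10, "mitoses": 3}, "cls": "malignant"},
]
query = {"thickness": 8, "nuclei": None, "mitoses": 2}
label = classify(query, case_base)

# Figures from the text: 16 test queries, 2 of them misclassified.
accuracy = (16 - 2) / 16
```

With two misclassifications among the 16 queries, 14 are classified correctly, which corresponds to an accuracy of 87.5% on this small test set.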

Implementation of stand-alone application: myCBR can also

be used as a stand-alone Java module, to be integrated in

arbitrary applications. In this application phase, the retrieval

engines of myCBR just read the XML files of the created

project generated using the plug-in interface and perform the

similarity-based retrieval. Figure 5(d) shows the breast cancer

stand-alone application.

IV. DISCUSSION AND CONCLUSION

In this paper, we examined two object-oriented, ontology-based CBR frameworks: jCOLIBRI, developed by the GAIA group, and myCBR, developed by the DFKI group. A breast cancer classifier was built with each of the two frameworks.

During the implementation of the breast cancer diagnostic application using jCOLIBRI, we found that jCOLIBRI is user-friendly and efficient for developing an application quickly. The classifier was successful in classifying the selected data set, with two misclassifications among the 16 test records. During the implementation of the breast cancer classifier using myCBR, we noticed that myCBR is truly a tool for rapid prototyping of a new CBR application. In seconds, users may have a running stand-alone CBR application by using the CSV import feature. myCBR is intelligent enough to build the case structure and the case base by parsing the provided CSV file. myCBR avoids reinventing the wheel by having the development of a new CBR application take place inside Protégé. The classifier was successful in classifying the selected data set, again with two misclassifications among the 16 test records.

In conclusion, the two CBR frameworks are very useful for developing a CBR-based breast cancer classifier, which can play an important role in early detection of the disease so that the right medication can be used to save lives.

REFERENCES

[1] A. Aamodt and E. Plaza, "Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches," AICOM, vol. 7, no. 1, 1994, pp. 39–59.

[2] J. J. Bello-Tomás, J. A. González-Calero and B. Díaz-Agudo, "JCOLIBRI: An Object-Oriented Framework for Building CBR Systems," in Advances in Case-Based Reasoning, Lecture Notes in Computer Science, Springer Berlin/Heidelberg, vol. 3155, 2004, pp. 32–46.

[3] S. Bogaerts and D. Leake, "Increasing AI Project Effectiveness with Reusable Code Frameworks: A Case Study Using IUCBRF," in Proceedings of the 18th International Florida Artificial Intelligence Research Society Conference, Menlo Park, CA: AAAI Press, 2005.

[4] S. Bogaerts and D. Leake, "A Framework for Rapid and Modular Case-Based Reasoning System Development," Technical Report TR 617, Computer Science Department, Indiana University, Bloomington, IN, 2005.

[5] B. Díaz-Agudo, P. A. González-Calero, J. Recio-García and A. Sánchez-Ruiz, "Building CBR systems with jCOLIBRI," Journal of Science of Computer Programming, vol. 69, no. 1–3, 2007, pp. 68–75.

[6] J. H. Gennari, M. A. Musen, R. W. Fergerson, W. E. Grosso, M. Crubézy, H. Eriksson, N. F. Noy and S. W. Tu, "The evolution of Protégé: an environment for knowledge-based systems development," Int. J. Hum.-Comput. Stud., vol. 58(1), 2003, pp. 89–123.

[7] J. A. González-Calero and B. Díaz-Agudo, "An architecture for knowledge intensive CBR systems," in E. Blanzieri and L. Portinale, editors, Advances in Case-Based Reasoning (EWCBR'00), Springer-Verlag, Berlin Heidelberg New York.

[8] J. A. González-Calero and B. Díaz-Agudo, "CBROnto: a task/method ontology for CBR," in S. Haller and G. Simmons, editors, Procs. of the 15th International FLAIRS'02 Conference (Special Track on CBR), AAAI Press, pp. 101–106.

[9] M. Jaczynski and B. Trousse, "An Object-Oriented Framework for the Design and the Implementation of Case-Based Reasoners," in Proceedings of the 6th German Workshop on Case-Based Reasoning, Berlin, 1998.

[10] R. Johnson and B. Foote, "Designing reusable classes," Journal of Object-Oriented Programming, vol. 1(5), 1988, pp. 22–35.

[11] J. L. Kolodner, Case-Based Reasoning, Morgan Kaufmann Publishers, California, 1993.

[12] D. Leake, Case Based Reasoning. Experiences, Lessons and Future Directions, AAAI Press, MIT Press, USA, 1997.

[13] M. Manago, R. Bergmann, N. Conruyt, R. Traphöner, J. Pasley, J. Le Renard, F. Maurer, S. Wess, K. D. Althoff and S. Dumont, "CASUEL: a common case representation language," ESPRIT project 6322, Task 1.1, Deliverable D1, 1994.

[14] O. L. Mangasarian and W. H. Wolberg, "Cancer diagnosis via linear programming," SIAM News, vol. 23, no. 5, 1990, pp. 1, 18.

[15] A. Mulder, "Developing a Reusable Application Framework," Chariot Solutions, http://www.chariotsolutions.com/javalab/presentations.jsp, 2003.

[16] C. A. Peña-Reyes and M. Sipper, "Applying Fuzzy CoCo to Breast Cancer Diagnosis," IEEE, 2000, pp. 1168–1175.

[17] J. A. Recio-García, B. Díaz-Agudo and P. A. González-Calero, "Prototyping recommender systems in jCOLIBRI," in Proceedings of the 2008 ACM Conference on Recommender Systems (Lausanne, Switzerland, October 23–25, 2008), RecSys '08, ACM, New York, NY, pp. 243–250.

[18] J. A. Recio-García, B. Díaz-Agudo and P. A. González-Calero, "jCOLIBRI2 Tutorial," 2008. Group of Artificial Intelligence Application (GAIA), University Complutense of Madrid. Document Version 1.2.

[19] J. A. Recio-García, D. Bridge, B. Díaz-Agudo and P. A. González-Calero, "CBR for CBR: A Case-Based Template Recommender System," in K. D. Althoff and R. Bergmann, editors, Advances in Case-Based Reasoning, 9th European Conference, ECCBR 2008 (in press), LNCS, Springer.

[20] J. A. Recio-García, B. Díaz-Agudo, A. Sánchez and P. A. González-Calero, "Lessons learnt in the development of a CBR framework," in M. Petridis, editor, Proceedings of the 11th UK Workshop on Case Based Reasoning, CMS Press, University of Greenwich, 2006, pp. 60–71.

[21] J. A. Recio-García, A. Sánchez, B. Díaz-Agudo and P. A. González-Calero, "jCOLIBRI 1.0 in a nutshell. A software tool for designing CBR systems," in M. Petridis, editor, Proceedings of the 10th UK Workshop on Case Based Reasoning, CMS Press, University of Greenwich, 2005, pp. 20–28.

[22] T. R. Roth-Berghofer and D. Bahls, "Explanation Capabilities of the Open Source Case-Based Reasoning Tool myCBR," 2008.

[23] S. Schulz, "CBR-Works: A state-of-the-art shell for case-based application building," in E. Melis, ed., Proceedings of the 7th German Workshop on Case-Based Reasoning, GWCBR'99, Würzburg, Germany, University of Würzburg, pp. 166–175.

[24] A. Stahl and T. R. Roth-Berghofer, "Rapid prototyping of CBR applications with the open source tool myCBR," in R. Bergmann and K. D. Althoff, eds., Advances in Case-Based Reasoning, 2008, Springer Verlag.

[25] M. Sewak, P. Vaidya, C. C. Chan and Z. H. Duan, "SVM Approach to Breast Cancer Classification," IMSCCS, vol. 2, 2007, pp. 32–37.



MEDHAT MOHAMED AHMED ABDELAAL, MUHAMED WAEL FAROUQ ET AL.: USING DATA MINING FOR ASSESSING DIAGNOSIS 13


TABLE I.

THE RESULTS OF APPLYING CLASSIFICATION SVM AND DECISION TREES


TABLE II

EXPERIMENTAL RESULTS.


Evaluation of Clustering Algorithms for Polish Word Sense Disambiguation

Bartosz Broda, Wojciech Mazur

Institute of Informatics, Wrocław University of Technology, Poland

[email protected], [email protected]

Abstract: Word Sense Disambiguation in text is still a difficult problem, as the best supervised methods require laborious and costly manual preparation of training data. Thus, this work focuses on the evaluation of a few selected clustering algorithms in the task of Word Sense Disambiguation for Polish. We tested six clustering algorithms (K-Means, K-Medoids, hierarchical agglomerative clustering, hierarchical divisive clustering, Growing Hierarchical Self Organising Maps, graph-partitioning based clustering) and five weighting schemes. For the agglomerative and divisive algorithms, 13 criterion functions were tested. The achieved results are interesting, because the best clustering algorithms are close in terms of cluster purity to the precision of a supervised clustering algorithm on the same dataset, using the same features.

I. INTRODUCTION

Word Sense Disambiguation (WSD) deals with the contextual resolution of lexical ambiguity. Most words in natural language have more than one lexical meaning (sense), but usually only one of them is active in a given context. A typical example of an ambiguous word is line, which according to WordNet (an electronic thesaurus, cf. [1]) has 36 senses. WSD is an important problem for applications in the domain of Natural Language Processing (NLP). Machine translation cannot work without some form of disambiguation, and WSD can also be helpful for information retrieval, information extraction and computer-aided lexicography, among others [2].

WSD is a hard problem. Most difficulties arise from the fact that the concept of a meaning is vague. Usually, there are no clear boundaries between one sense and another [3]. Typically, the problem of defining meaning is tackled by using dictionaries (which are called sense inventories in the context of WSD). That is, from an algorithmic point of view, sense inventories are used to enumerate all the meanings that a given word has. The goal of WSD can then be stated as choosing the appropriate sense from the sense inventory for a given context of a word.

There are two main approaches to WSD based on machine learning: supervised and unsupervised [2].¹ Supervised learning relies on manually disambiguated examples of text snippets containing ambiguous words. An appropriate sense inventory has to be chosen in advance, at an early stage of the construction of a supervised WSD system. Features are extracted from those text snippets (or contexts²) and classifiers are trained on this manually labeled data. Most of the time, supervised approaches are superior to unsupervised ones in terms of the accuracy of automatic disambiguation when applied to the same type of texts that the systems were trained on.

¹ There is a plethora of other approaches to WSD, e.g., based on translational equivalence or hand-written rules. We omit those for brevity. For an extensive overview of other methods see, e.g., [2], [4].

Nevertheless, there is another issue connected with the problem of defining a meaning, namely the creation of the other resources used by automatic WSD systems. This is especially evident in the creation of corpora³ manually annotated (tagged) with senses, which are used for training machine learning classifiers in a supervised setting. There are two important problems in manual sense tagging of a corpus: low interannotator agreement (IA) and the high cost of the annotation process. IA measures how much the annotations assigned by one annotator differ from the annotations assigned by another annotator. IA is used to estimate an upper bound on the performance of automatic WSD. Typically, it is not enough to report percentage agreement, because agreements and disagreements may arise by chance. Cohen's κ is widely used in the computational linguistics community for this purpose, but there are also other measures [5]. The cost of annotation is high because manual annotation requires a large effort. Mihalcea estimated that constructing a corpus with a sufficient amount of data for supervised classification algorithms for 20 000 ambiguous words would require 80 man-years of work [6].
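To illustrate why raw percentage agreement is not enough, the following minimal sketch computes Cohen's κ for two annotators, correcting observed agreement by the agreement expected by chance. The function name and the toy sense tags are hypothetical and are not taken from the paper or its corpus.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random according to
    # his or her own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[s] * freq_b[s] for s in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical sense tags for six contexts of one ambiguous word.
annotator_1 = ["s1", "s1", "s2", "s2", "s1", "s3"]
annotator_2 = ["s1", "s2", "s2", "s2", "s1", "s3"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on five of six contexts, yet κ is noticeably lower than the raw agreement because part of that agreement would be expected by chance.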

On the other hand, unsupervised and semi-supervised algorithms can be used. The amount of manual labor required is much lower in learning without supervision. Unsupervised approaches to WSD tend to use unlabeled data and find sense distinctions automatically. Usually those methods involve some form of clustering. Harris's distributional hypothesis [7] can be used as a theoretical foundation for unsupervised methods of WSD. It states that "meaning of entities (...) is related to the restrictions on combinations of these entities relative to other entities". In this context entities can be understood as words.

The main goal of this work is to compare various clustering algorithms in the task of unsupervised Word Sense Disambiguation for Polish data. An unsupervised WSD system groups the contexts of a given word that express the same meaning, without providing explicit sense labels for each group (e.g., without using a dictionary) [8]. This work is also motivated by the fact that clustering is important for a semi-supervised WSD algorithm called Lexicographer Controlled Semi-automatic word Sense Disambiguation (LexCSD) [9], [10]. So far, the choice of the clustering algorithm used in LexCSD was motivated by the performance of that algorithm in other tasks and by its analytical properties, because an analysis of the performance of different clustering algorithms in similar settings (i.e., using a similar dataset and features) for Polish WSD is difficult to find.

² We will use the term context to denote a passage of text containing an ambiguous word.

³ Here we define a corpus as a collection of texts prepared for linguistic processing.

Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 25–32

ISBN 978-83-60810-27-9, ISSN 1896-7094

978-83-60810-27-9/09/$25.00 © 2010 IEEE

There are a few differences between WSD data and classical applications of clustering. To name just a few: the distributions of classes (senses) are skewed⁴; the data is represented in spaces with a very large number of dimensions (thousands or even hundreds of thousands); for some classes only very specific features are important, and these features often overlap among classes; and sometimes it is difficult to distinguish between two close classes.

The paper is organized as follows. First, the selected clustering algorithms are briefly described. The evaluation section starts with an analysis of the evaluation metrics used. Next, the corpus and experimental settings are described. Section III-D discusses the results. Section IV summarises the performed experiments and outlines directions for further work.

II. SELECTED ALGORITHMS FOR TESTING

For this work we have selected a few classical clustering algorithms, chosen to represent different approaches to the problem of clustering. We started with the K-means and K-medoids algorithms, which represent simple, hard and flat clustering methods. We chose the Growing Hierarchical Self-Organising Map (GHSOM) as a representative of the family of clustering methods based on neural networks. GHSOM is also a hierarchical clustering algorithm. We experiment with standard hierarchical clustering algorithms with different criterion functions, from both the agglomerative and the divisive families. Last but not least, we also test a graph-based clustering algorithm. We have reimplemented K-means, K-medoids and GHSOM, and we use an existing implementation of the other algorithms [11].

We focus on clustering for WSD, so we use NLP-related terminology when describing the algorithms. As WSD is a contextual task, we cluster contexts (text snippets) containing an ambiguous word. Real-valued features are extracted from each context, so a context is represented as a feature vector $\vec{v}$ in a high-dimensional space. We use the terms context and context vector interchangeably. The exact nature of the contexts and the feature extraction process are described in Sec. III-B.

A. K-means and K-medoids

K-means is one of the simplest clustering algorithms. K-means represents each cluster by the centre of mass of the contexts assigned to it [12]. These centres are called centroids. Initially, random contexts are chosen as centroids. Then each context is assigned to its most similar centroid. After this step, new centroids are computed as the mean of all the contexts in each group. This process is repeated until some stopping criterion is reached, e.g., the number of iterations reaches a predefined threshold or the clustering solution does not change significantly between subsequent iterations.

⁴ Not all senses are represented in the data equally; the distribution of senses is biased towards a few frequent senses.
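The K-means loop described above can be sketched as follows. This is an illustrative sketch and not the authors' reimplementation: cosine similarity, the optional `init` argument (used here only to make the toy run reproducible) and the stopping rule "assignments no longer change" are all assumptions.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def k_means(contexts, k, init=None, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initially, k contexts are chosen (randomly unless `init` is given) as centroids.
    chosen = init if init is not None else rng.sample(range(len(contexts)), k)
    centroids = [list(contexts[i]) for i in chosen]
    labels = None
    for _ in range(max_iter):
        # Assign each context to its most similar centroid.
        new_labels = [max(range(k), key=lambda c: cosine(x, centroids[c]))
                      for x in contexts]
        if new_labels == labels:  # the solution stopped changing
            break
        labels = new_labels
        # Recompute each centroid as the mean of the contexts in its group.
        for c in range(k):
            members = [x for x, lab in zip(contexts, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical two-dimensional context vectors forming two obvious groups.
toy_contexts = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2),
                (0.0, 1.0), (0.1, 0.9), (0.2, 0.8)]
toy_labels, toy_centroids = k_means(toy_contexts, k=2, init=[0, 3])
```

Real context vectors would have thousands of dimensions and would normally be stored sparsely; the dense tuples here only keep the sketch short.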

K-medoids is similar in concept to K-means. The most fundamental difference between the two algorithms is that K-medoids uses real contexts from the dataset (medoids) as cluster representatives, in contrast to the centroids used in K-means, which are artificial contexts. One realisation of K-medoids is an approach called Partitioning Around Medoids, or PAM [13]. PAM starts with a random selection of initial medoids. Then every swap of a medoid with a non-medoid context is tested to see whether it decreases the cost of the whole clustering solution. This approach has a drawback in terms of computational complexity, which is $O(k(n-k)^2)$, where $n$ is the number of contexts to cluster and $k$ is the number of medoids. Thus a few extensions have been proposed that, e.g., employ sampling (CLARA) or randomized search (CLARANS) [13]. Nevertheless, we use classical PAM, as both mentioned algorithms can have a negative impact on quality in comparison to PAM. Classical PAM is applicable in our experiments because we use relatively small datasets.
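The swap test at the heart of PAM can be sketched as below. Note the simplification: this sketch accepts the first swap that lowers the cost, whereas classical PAM evaluates all swaps and applies the best one in each round. The distance matrix, the `init` argument and the toy points are hypothetical.

```python
import random

def total_cost(dist, medoids):
    """Sum over all contexts of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k, init=None, seed=0):
    n = len(dist)
    rng = random.Random(seed)
    # Start from a random (or given) selection of initial medoids.
    medoids = list(init) if init is not None else rng.sample(range(n), k)
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        # Try swapping every medoid with every non-medoid context.
        for pos in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:  # keep the swap only if it lowers the cost
                    medoids, best, improved = trial, cost, True
    # Assign every context to its nearest medoid.
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels, best

# Hypothetical one-dimensional "contexts"; distance is the absolute difference.
points = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in points] for a in points]
toy_medoids, toy_assignment, toy_cost = pam(toy_dist, k=2, init=[0, 1])
```

Each call to `total_cost` scans all contexts, which is why the cost of exhaustive swapping grows quickly with the number of contexts and motivates the sampling-based variants mentioned above.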

B. Growing Hierarchical Self-Organizing Map

The Growing Hierarchical Self-Organizing Map (GHSOM) [14] is a natural extension of Kohonen's idea of Self-Organizing Maps (SOM) [15]. A SOM is an artificial neural network consisting of many neurons, each of which holds a weight vector. A SOM is trained in an unsupervised manner using the "winner takes most" strategy. Every feature vector is presented to the network input several times. For every input vector, its similarity to the weight vector of each neuron is computed. The weights of the most similar neuron (the winner) and of its neighbourhood are updated to become even more similar to the input pattern. The learning algorithm is constructed in such a way that the size of the neighbourhood and the degree of weight updating decrease over time.

GHSOM addresses one of the most important drawbacks of SOM: the a priori definition of the map structure. Rauber et al. proposed an algorithm for growing a SOM both in terms of the number of map neurons and in terms of the hierarchy [14]. After the training stage of a SOM, the mean quantization error $mqe_i$ of every neuron $i$ is calculated as the average distance of every context recognised by neuron $i$ to its weight vector. The average $MQE_j$ for the whole map on level $j$ is computed, too. If $MQE_j \geq \tau_1 \cdot MQE_{j-1}$, then an additional row or column of neurons is added to the map and the training stage is repeated. Otherwise, the $mqe_i$ of every neuron is compared to $MQE_j$. If $mqe_i \geq \tau_2 \cdot MQE_{j-1}$, then another layer of the map is created for the contexts recognised by neuron $i$.
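The growth decision described above can be sketched as a small function. Several points are assumptions made for illustration only: the two thresholds are named `tau1` and `tau2`, the comparisons are read as "greater than or equal", distance is Euclidean, and the map-level error is averaged over neurons that recognised at least one context.

```python
def mean(values):
    """Arithmetic mean of an iterable (0.0 if it is empty)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def growth_decision(weights, assigned, parent_mqe, tau1, tau2):
    """weights[i]: weight vector of neuron i; assigned[i]: contexts it recognised."""
    # mqe_i: average distance of the contexts recognised by neuron i to its weights.
    mqe = [mean(euclidean(x, w) for x in xs) for w, xs in zip(weights, assigned)]
    # MQE_j: average error of the whole map (neurons with at least one context).
    map_mqe = mean(m for m, xs in zip(mqe, assigned) if xs)
    if map_mqe >= tau1 * parent_mqe:
        return "grow_map", []          # add a row or column and retrain
    expand = [i for i, m in enumerate(mqe) if m >= tau2 * parent_mqe]
    return "expand_neurons", expand    # these neurons get a map on the next level

# Hypothetical two-neuron map; the first neuron fits its contexts poorly.
toy_weights = [(0, 0), (10, 10)]
toy_assigned = [[(0, 1), (0, 3)], [(10, 10)]]
decision_a = growth_decision(toy_weights, toy_assigned, 4.0, tau1=0.5, tau2=0.25)
decision_b = growth_decision(toy_weights, toy_assigned, 4.0, tau1=0.2, tau2=0.25)
```

Smaller values of `tau1` lead to larger flat maps, while smaller values of `tau2` lead to deeper hierarchies.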


C. Agglomerative and Divisive Clustering

Agglomerative and divisive clustering algorithms produce hierarchical clustering trees called dendrograms. Agglomerative clustering starts with each context in a separate cluster; then, in each step, the two clusters whose merge maximises the criterion function are merged. Divisive algorithms, on the other hand, start with all contexts in one cluster, and clusters are then repeatedly bisected according to the criterion function. We use the existing implementation of hierarchical algorithms from CLUTO⁵ [11]. We use the rbr variant of the divisive algorithm, i.e., standard bisecting clustering is employed and the resulting solution is further optimized according to the criterion function [16].

The criterion function is a very important aspect of both agglomerative and divisive clustering algorithms, as it drives the whole process. Many criterion functions are available [17]. We have tested the standard criterion functions used with agglomerative algorithms, i.e., single link (slink), complete link (clink), average link (upgma), and weighted variants of single (wslink), complete (wclink) and average link (wupgma).
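How the choice of link drives agglomerative merging can be sketched as follows. This is a naive illustration over a precomputed similarity matrix, not CLUTO's implementation; the weighted variants are not covered, and the function name and toy matrix are hypothetical.

```python
def agglomerative(sim, k, link="upgma"):
    """Merge clusters until k remain; sim is a symmetric similarity matrix."""
    combine = {
        "slink": max,                          # similarity of the closest pair
        "clink": min,                          # similarity of the farthest pair
        "upgma": lambda s: sum(s) / len(s),    # average over all pairs
    }[link]
    clusters = [[i] for i in range(len(sim))]  # each context starts on its own
    while len(clusters) > k:
        best_pair, best_score = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_sims = [sim[i][j] for i in clusters[a] for j in clusters[b]]
                score = combine(pair_sims)
                if best_score is None or score > best_score:
                    best_pair, best_score = (a, b), score
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]  # merge the best-scoring pair
        del clusters[b]
    return clusters

# Hypothetical similarities: contexts 0 and 1 are close, as are 2 and 3.
toy_sim = [[1.0, 0.9, 0.1, 0.1],
           [0.9, 1.0, 0.1, 0.1],
           [0.1, 0.1, 1.0, 0.8],
           [0.1, 0.1, 0.8, 1.0]]
toy_clusters = agglomerative(toy_sim, k=2, link="clink")
```

On such clean data all three links give the same result; they tend to differ on noisy data, where single link is prone to chaining and complete link favours compact clusters.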

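The second group of criterion functions, comprising $I_1$, $I_2$, $\mathcal{E}_1$, $G_1$, $G_1'$, $H_1$ and $H_2$, can be used with both agglomerative and divisive algorithms. The exact forms of these functions are given in [11]:

$$I_1 = \text{maximize} \sum_{i=1}^{k} \frac{1}{n_i} \Bigg( \sum_{\vec{v},\vec{u} \in S_i} \mathrm{sim}(\vec{v}, \vec{u}) \Bigg) \tag{1}$$

$$I_2 = \text{maximize} \sum_{i=1}^{k} \sqrt{\sum_{\vec{v},\vec{u} \in S_i} \mathrm{sim}(\vec{v}, \vec{u})} \tag{2}$$

$$\mathcal{E}_1 = \text{minimize} \sum_{i=1}^{k} n_i \frac{\sum_{\vec{v} \in S_i, \vec{u} \in S} \mathrm{sim}(\vec{v}, \vec{u})}{\sqrt{\sum_{\vec{v},\vec{u} \in S_i} \mathrm{sim}(\vec{v}, \vec{u})}} \tag{3}$$

$$G_1 = \text{minimize} \sum_{i=1}^{k} \frac{\sum_{\vec{v} \in S_i, \vec{u} \in S} \mathrm{sim}(\vec{v}, \vec{u})}{\sum_{\vec{v},\vec{u} \in S_i} \mathrm{sim}(\vec{v}, \vec{u})} \tag{4}$$

$$G_1' = \text{minimize} \sum_{i=1}^{k} n_i^2 \frac{\sum_{\vec{v} \in S_i, \vec{u} \in S} \mathrm{sim}(\vec{v}, \vec{u})}{\sum_{\vec{v},\vec{u} \in S_i} \mathrm{sim}(\vec{v}, \vec{u})} \tag{5}$$

$$H_1 = \text{maximize} \frac{I_1}{\mathcal{E}_1} \tag{6}$$

$$H_2 = \text{maximize} \frac{I_2}{\mathcal{E}_1}, \tag{7}$$

where $k$ is the total number of clusters, $S$ is the set of all contexts to cluster, $S_i$ is the set of contexts assigned to the $i$-th cluster, $n_i = |S_i|$, and $\mathrm{sim}(\vec{v}, \vec{u})$ is the similarity between two context vectors $\vec{v}$ and $\vec{u}$.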
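As a worked illustration, the sketch below evaluates the two internal criteria $I_1$ and $I_2$ for a given clustering. It assumes a precomputed similarity matrix, assumes that the inner sums run over all ordered pairs within a cluster (including a context paired with itself), and assumes the CLUTO definition of $I_2$, which takes the square root of the within-cluster sum. The function name and the toy matrix are hypothetical.

```python
import math

def internal_criteria(sim, clusters):
    """Return (I1, I2) for a clustering given as lists of context indices."""
    i1 = i2 = 0.0
    for members in clusters:
        # Sum of similarities over all ordered pairs inside the cluster.
        within = sum(sim[v][u] for v in members for u in members)
        i1 += within / len(members)
        i2 += math.sqrt(within)
    return i1, i2

# Hypothetical similarities: contexts 0 and 1 are related, context 2 is not.
toy_matrix = [[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]]
good_i1, good_i2 = internal_criteria(toy_matrix, [[0, 1], [2]])
poor_i1, poor_i2 = internal_criteria(toy_matrix, [[0], [1, 2]])
```

Both criteria are to be maximized, so the clustering that groups the two related contexts together scores higher than the one that separates them.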

D. Graph Partitioning Based Clustering

We use an implementation of the min-cut graph partitioning algorithm from CLUTO [11]. This algorithm starts by creating a neighbourhood graph based on the similarities between contexts and then applies min cut to partition the graph into disjoint regions. Min cut chooses the partition so that the total weight of the edges cut by the partition is minimal.

⁵ CLUTO is a free software package implementing several clustering algorithms, including partitioning, agglomerative and graph-based ones. Available at: http://glaros.dtc.umn.edu/gkhome/views/cluto/

This approach achieved high quality in research on the semi-automatic extension of Polish WordNet [18] and was also used in Polish WSD in a weakly supervised setting with the LexCSD algorithm [10].

III. EXPERIMENTS

A. Evaluation Measures

Clustering algorithms can be evaluated in many ways [19]. Some evaluation methods are based on external criteria, i.e., on comparing the resulting clustering solution with pre-existing, manually created categories. Alternatively, one can use internal criteria, which do not resort to a gold-standard clustering. The most important drawback of evaluation using internal criteria is that a good score does not always correspond to good clustering results in a given application [20]. As we have developed a semantically annotated corpus (SCWSD, see Sec. III-B), we can use it for evaluation. The problem with SCWSD is its small size, so there is a risk that SCWSD does not capture all of the peculiarities and biases of large corpora.⁶

We used several evaluation measures to capture different aspects of the created groups. To measure how homogeneous the clusters are, we used Purity:

$$\mathrm{Purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|, \tag{8}$$

where $\Omega = \{\omega_1, \omega_2, \ldots, \omega_k\}$ is the set of clusters, $C = \{c_1, c_2, \ldots, c_j\}$ is the set of pre-existing categories, and $N$ is the total number of contexts. In our setting each category in $C$ is the set of contexts in which the ambiguous word is annotated with the same sense. $\mathrm{Purity}(\Omega, C) \in [0, 1]$, where 1 is the best case. A drawback of Purity is its preference for solutions with a large number of groups: assigning every context to a singleton cluster gives a Purity of 1 [20].
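Purity can be computed directly from two parallel lists, one with the cluster assigned to each context and one with its gold sense. The sketch below is a minimal illustration; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def purity(cluster_ids, sense_ids):
    """Fraction of contexts that carry the majority sense of their cluster."""
    by_cluster = {}
    for cluster, sense in zip(cluster_ids, sense_ids):
        by_cluster.setdefault(cluster, Counter())[sense] += 1
    # For every cluster, count the contexts that share its most frequent sense.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)

# Hypothetical clustering of six contexts annotated with two senses.
toy_cluster_ids = [0, 0, 0, 1, 1, 1]
toy_sense_ids = ["a", "a", "b", "b", "b", "a"]
toy_purity = purity(toy_cluster_ids, toy_sense_ids)
# Putting every context in its own cluster trivially reaches the maximum.
singleton_purity = purity([0, 1, 2, 3, 4, 5], toy_sense_ids)
```

The second call shows the drawback mentioned above: the singleton solution is useless for disambiguation but still obtains the best possible score.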

The Rand Index (RI) measures accuracy on the basis of the decisions made for each pair of contexts. If we write TP for true positives, TN for true negatives, FN for false negatives and FP for false positives, the Rand Index is given by the following equation:

$$RI = \frac{TP + TN}{TP + FP + FN + TN} \tag{9}$$
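The pair-based counts can be obtained by enumerating all pairs of contexts. The sketch below assumes the usual reading of the four outcomes: a pair is a true positive when both contexts share a sense and a cluster, a false positive when they share a cluster but not a sense, a false negative when they share a sense but not a cluster, and a true negative otherwise. Function names and toy labels are hypothetical.

```python
from itertools import combinations

def pair_counts(cluster_ids, sense_ids):
    """Return (TP, FP, FN, TN) over all unordered pairs of contexts."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(cluster_ids)), 2):
        same_cluster = cluster_ids[i] == cluster_ids[j]
        same_sense = sense_ids[i] == sense_ids[j]
        if same_cluster and same_sense:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_sense:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(cluster_ids, sense_ids):
    tp, fp, fn, tn = pair_counts(cluster_ids, sense_ids)
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical clustering of six contexts annotated with two senses.
ri_clusters = [0, 0, 0, 1, 1, 1]
ri_senses = ["a", "a", "b", "b", "b", "a"]
ri_counts = pair_counts(ri_clusters, ri_senses)
ri_value = rand_index(ri_clusters, ri_senses)
```

Enumerating pairs is quadratic in the number of contexts, which is acceptable for small datasets; the same four counts are also what pair-based precision and recall are computed from.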

One of the drawbacks of using RI for evaluation is its equal treatment of false positives and false negatives. Using the decisions for context pairs, we can also use standard measures from information retrieval, i.e., precision $P$, recall $R$ and the harmonic mean of precision and recall $F$:

6On the other hand, the total size of the dat