
  • Proceedings of the International Multiconference on Computer Science and Information Technology

    Volume 5 (2010)

    M. Ganzha, M. Paprzycki (editors)

    ISSN 1896-7094

    Polskie Towarzystwo Informatyczne, Oddział Górnośląski, ul. Raciborska 3, 40-074 Katowice

    ISBN 978-83-60810-27-9

    IEEE Computer Society Press, 10662 Los Vaqueros Circle, Los Alamitos, CA 90720, USA

    TEXnical editor: Aleksandr Denisiuk

  • Proceedings of the International Multiconference on Computer Science and Information Technology

    October 18-20, 2010. Wisła, Poland

    Volume 5 (2010)

  • Dear Reader, it is our pleasure to present to you the Proceedings of the 2010 International Multiconference on Computer Science and Information Technology (IMCSIT), which took place in Wisła, Poland, on October 18-20, 2010. IMCSIT 2010 was co-located with the XXVI Autumn Meeting of the Polish Information Processing Society (PIPS).

    IMCSIT is the result of an evolutionary process. In 2005 a Scientific Session took place during the XXI Autumn Meeting of PIPS and consisted of 27 refereed presentations. After this relative success (we advertised the Session very late in the year), we decided to expand and extend it into a full-blown conference while continuing cooperation (co-location) with the Autumn Meetings of PIPS. As a result of steady growth, in 2010 IMCSIT consisted of the following events (the Proceedings are organized into sections that correspond to each of them):

    5th International Symposium Advances in Artificial Intelligence and Applications (AAIA'10),

    Workshop on Agent Based Computing: from Model to Implementation VII (ABC:MI'10),

    International Workshop on Advances in Business ICT (ABICT'10),

    Computer Aspects of Numerical Algorithms (CANA'10),

    Computational Linguistics - Applications (CLA'10),

    10th International Multidisciplinary Conference on e-Commerce and e-Government (ECOM&EGOV'10),

    International Symposium on E-Learning - Applications (EL-A'10),

    6th Workshop on Large Scale Computations on Grids and 1st Workshop on Scalable Computing in Distributed Systems (LaSCoG-SCoDiS'10),

    2nd International Workshop on Medical Informatics and Engineering (MI&E'10),

    3rd International Symposium on Multimedia - Applications and Processing (MMAP'10),

    International Workshop on Real Time Software (RTS'10),

    4th International Workshop on Secure Information Systems (SIS'10),

    International Symposium on Technologies for Social Advancement (T4SA'10),

    Workshop on Ad-Hoc Wireless Networks (WAHOC'10),

    Workshop on Computational Optimization (WCO'10).

    Each of these events had its own Organizing and Program Committee (listed in these Proceedings). We would like to express our warmest gratitude to the members of all of them for their hard work in attracting and later refereeing 201 submissions.

    Maria Ganzha, Conference Chair, Systems Research Institute Polish Academy of Sciences, Warsaw, Poland, and Gdańsk University, Gdańsk, Poland.

    Marcin Paprzycki, Systems Research Institute Polish Academy of Sciences, Warsaw and Management Academy, Warsaw, Poland.

    TABLE OF CONTENTS

    5th International Symposium Advances in Artificial Intelligence and Applications:

    Call For Papers 1

    A Breast Cancer Classifier based on a Combination of Case-Based Reasoning and Ontology Approach 3
    Essam AbdRabou, AbdEl-Badeeh Salem

    Using data mining for assessing diagnosis of breast cancer 11
    Medhat Mohamed Ahmed Abdelaal, Muhamed Wael Farouq, Hala Abou Sena, Abdel-Badeeh Mohamed Salem

    Advanced scale-space invariant, low detailed feature recognition from images - car brand recognition 19
    Štefan Badura, Stanislav Foltán

    Evaluation of Clustering Algorithms for Polish Word Sense Disambiguation 25
    Bartosz Broda, Wojciech Mazur

    Generation of First-Order Expressions from a Broad Coverage HPSG Grammar 33
    Ravi Coote, Andreas Wotzlaw

    PSO based modeling of Takagi-Sugeno fuzzy motion controller for dynamic object tracking with mobile platform 37
    Meenakshi Gupta, Laxmidhar Behera, Venkatesh K.S.

    Hierarchical Object Categorization with Automatic Feature Selection 45
    Md. Saiful Islam, Andrzej Sluzek

    Selecting the best strategy in a software certification process 53
    Waldemar Koczkodaj, Vova Babiy, Agnieszka D. Bogobowicz, Ryszard Janicki, Alan Wassyng

    Extrapolation of Non-Deterministic Processes Based on Conditional Relations 59
    Juliusz Kulikowski

    Reasoning in RDF graphic formal system with quantifiers 67
    Alena Lukasova, Marek Vajgl, Martin Žáček

    Coevolutionary Algorithm For Rule Induction 73
    Pawel Myszkowski

    Evolutionary Algorithm in Forex trade strategy generation 81
    Pawel Myszkowski, Adam Bicz

    Emotion-based Image Retrieval - an Artificial Neural Network Approach 89
    Katarzyna Agnieszka Olkiewicz, Urszula Markowska-Kaczmar

  • Automatic Visual Class Formation using Image Fragment Matching 97
    Mariusz Paradowski, Andrzej Śluzek

    Learning taxonomic relations from a set of text documents 105
    Mari-Sanna Paukkeri, Alberto Perez Garcia-Plaza, Sini Pessala, Timo Honkela

    Metric properties of populations in artificial immune systems 113
    Zbigniew Pliszka, Olgierd Unold

    The development features of the face recognition system 121
    Rauf Sadykhov, Igor Frolov

    Multiscale Segmentation Based On Mode-Shift Clustering 129
    Wojciech Tarnawski, Lukasz Miroslaw, Roman Pawlikowski, Krzysztof Ociepa

    Relational database as a source of ontology creation 135
    Zdenka Telnarova

    Emotional Speech Analysis using Artificial Neural Networks 141
    Jana Tuckova, Martin Sramka

    Usage of reflection in .NET to inference of knowledge base 149
    Marek Vajgl

    On the evaluation of the linguistic summarization of temporally focused time series using a measure of informativeness 155
    Anna Wilbik, Janusz Kacprzyk

    Workshop on Agent Based Computing: from Model to Implementation VII:

    Call For Papers 163

    Java-based Mobile Agent Platforms for Wireless Sensor Networks 165
    Francesco Aiello, Alessio Carbone, Giancarlo Fortino, Stefano Galzarano

    BeesyBees - Efficient and Reliable Execution of Service-based Workflow Applications for BeesyCluster using Distributed Agents 173
    Paweł Czarnul, Mariusz Matuszek, Michał Wójcik, Karol Zalewski

    A Technique based on Recursive Hierarchical State Machines for Application-level Capture of Agent Execution State 181
    Giancarlo Fortino, Francesco Rango

    Reorganization in Massive Multiagent Systems 189
    Henry Hexmoor

    Effectiveness of Solving Traveling Salesman Problem Using Ant Colony Optimization on Distributed Multi-Agent Middleware 197
    Sorin Ilie, Costin Badica

    Selected Security Aspects of Agent-based Computing 205
    Mariusz Matuszek, Piotr Szpryngier

    Agent-Oriented Modelling for Simulation of Complex Environments 209
    Inna Shvartsman, Kuldar Taveter, Merle Parmak, Merik Meriste

    Improving Fault-Tolerance of Distributed Multi-Agent Systems with Mobile Network-Management Agents 217
    Dejan Mitrović, Zoran Budimac, Mirjana Ivanović, Milan Vidaković

    Argumentative agents 223
    Francesca Toni

    An agent based planner for including network QoS in scientific workflows 231
    Zhiming Zhao, Paola Grosso, Ralph Koning, Jeroen van der Ham, Cees de Laat

  • International Workshop on Advances in Business ICT:

    Call For Papers 239

    A method for consolidating application landscapes during the post-merger-integration phase 241
    Andreas Freitag, Florian Matthes, Christopher Schulz

    Hybridization of Temporal Knowledge for Economic Environment Analysis 249
    Maria Antonina Mach

    Independent Operator of Measurements as a Virtual Enterprise on the Energy Market 255
    Bożena Ewa Matusiak

    A Two-level algorithm of time series change detection based on a unique deviations similarity method 259
    Tomasz Pełech-Pilichowski, Jan T. Duda

    STRATEGOS: A case-based approach to strategy making in SME 265
    Jerzy Surma

    Support of the E-business by business intelligence tools and data quality improvement 271
    Milena Tvrdíková, Ondřej Koubek

    Computer Aspects of Numerical Algorithms:

    Call For Papers 279

    The experimental analysis of GMRES convergence for solution of Markov chains 281
    Beata Bylina, Jarosław Bylina

    On the Numerical Analysis of Stochastic Lotka-Volterra Models 289
    Tugrul Dayar, Linar Mikeev, Verena Wolf

    Finite Element Approximate Inverse Preconditioning using POSIX threads on multicore systems 297
    George A. Gravvanis, P. I. Matskanidis, K. M. Giannoutakis, E. A. Lipitakis

    On the implementation of public keys algorithms based on algebraic graphs over finite commutative rings 303
    Michał Klisowski, Vasyl Ustimenko

    Analysis of Pseudo-Random Properties of Nonlinear Congruential Generators with Power of Two Modulus by Numerical Computing of the b-adic Diaphony 309
    Ivan Lirkov, Stanislava Stoilova

    Assembling Recursively Stored Sparse Matrices 317
    Michele Martone, Salvatore Filippone, Marcin Paprzycki, Salvatore Tucci

    Use of Hybrid Recursive CSR/COO Data Structures in Sparse Matrices-Vector Multiplication 327
    Michele Martone, Salvatore Filippone, Paweł Gepner, Marcin Paprzycki, Salvatore Tucci

    Higher order FEM numerical integration on GPUs with OpenCL 337
    Przemysław Płaszewski, Krzysztof Banaś, Paweł Macioł

    Parallelization of SVD of a Matrix-Systolic Approach 343
    Halil Snopce, Ilir Spahiu

    Solving a Kind of BVP for ODEs on heterogeneous CPU + CUDA-enabled GPU Systems 349
    Przemyslaw Stpiczynski, Joanna Potiopa

  • Computational Linguistics - Applications:

    Call For Papers 355

    Using Self Organizing Map to Cluster Arabic Crime Documents 357
    Meshrif Alruily, Aladdin Ayesh, Abdulsamad Al-Marghilani

    Quality Benchmarking Relational Databases and Lucene in the TREC4 Adhoc Task Environment 365
    Ahmet Arslan, Ozgur Yilmazel

    Parallel, Massive Processing in SuperMatrix - a General Tool for Distributional Semantic Analysis of Corpus 373
    Bartosz Broda, Damian Jaworski, Maciej Piasecki

    Development of a Voice Control Interface for Navigating Robots and Evaluation in Outdoor Environments 381
    Ravi Coote

    The Role of the Newly Introduced Word Types in the Translations of Novels 389
    Maria Csernoch

    SyMGiza++: A Tool for Parallel Computation of Symmetrized Word Alignment Models 397
    Marcin Junczys-Dowmunt, Arkadiusz Szał

    Semi-Automatic Extension of Morphological Lexica 403
    Tobias Kaufmann, Beat Pfister

    Automatic Extraction of Arabic Multi-Word Terms 411
    Khalid Al Khatib, Amer Badarneh

    "Beautiful picture of an ugly place". Exploring photo collections using opinion and sentiment analysis of user comments 419
    Slava Kisilevich, Christian Rohrdantz, Daniel Keim

    LEXiTRON-Pro Editor: An Integrated Tool for developing Thai Pronunciation Dictionary 429
    Supon Klaithin, Patcharika Chootrakool, Krit Kosawat

    Automatic Detection of Prominent Words in Russian Speech 435
    Daniil Kocharov

    Computing trees of named word usages from a crowdsourced lexical network 439
    Mathieu Lafourcade, Alain Joubert

    RefGen: a Tool for Reference Chains Identification 447
    Laurence Longo, Amalia Todirascu

    Is Shallow Semantic Analysis Really That Shallow? A Study on Improving Text Classification Performance 455
    Przemysław Maciołek, Grzegorz Dobrowolski

    PerGram: A TRALE Implementation of an HPSG Fragment of Persian 461
    Stefan Müller, Masood Ghayoomi

    WordnetLoom: a Graph-based Visual Wordnet Development Framework 469
    Maciej Piasecki, Michał Marcińczuk, Adam Musiał, Radosław Ramocki, Marek Maziarz

    Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata 477
    Tommi Pirinen, Krister Lindén

    The Polish Cyc lexicon as a bridge between Polish language and the Semantic Web 485
    Aleksander Pohl

    Tools for syntactic concordancing 493
    Violeta Seretan, Eric Wehrli

    Effective natural language parsing with probabilistic grammars 501
    Paweł Skórzewski

  • Finding Patterns in Strings using Suffixarrays 505
    Herman Stehouwer, Menno Van Zaanen

    Entity Summarisation with Limited Edge Budget on Knowledge Graphs 513
    Marcin Sydow, Mariusz Pikuła, Ralf Schenkel, Adam Siemion

    Multiple Noun Expression Analysis: An Implementation of Ontological Semantic Technology 517
    Julia Taylor, Victor Raskin, Maxim Petrenko, Christian F. Hempelmann

    A web-based translation service at the UOC based on Apertium 525
    Luis Villarejo, Mireia Farrus, Gema Ramírez, Sergio Ortiz

    Tools and Methodologies for Annotating Syntax and Named Entities in the National Corpus of Polish 531
    Jakub Waszczuk, Katarzyna Głowińska, Agata Savary, Adam Przepiórkowski

    TREF - TRanslation Enhancement Framework for Japanese-English 541
    Bartholomäus Wloka, Werner Winiwarter

    Matura Evaluation Experiment Based on Human Evaluation of Machine Translation 547
    Aleksandra Wojak, Filip Graliński

    German subordinate clause word order in dialogue-based CALL. 553
    Magdalena Wolska, Sabrina Wilske

    Polish Phones Statistics 561
    Bartosz Ziolko, Jakub Galka

    APyCA: Towards the Automatic Subtitling of Television Content in Spanish 567
    Aitor Álvarez, Arantza del Pozo, Andoni Arruti

    10th International Multidisciplinary Conference on e-Commerce and e-Government:

    Call For Papers 575

    Trusted Data in IBM's MDM: Accuracy Dimension 577
    Przemyslaw Pawluk

    Multicriteria Evaluation of DVB-RCS Satellite Internet Performance Used for e-Government and e-Learning Purposes 585
    Andrzej M. J. Skulimowski

    INFOMAT-E - public information system for people with sight and hearing dysfunctions 593
    Michał Socha, Wojciech Górka, Adam Piasecki, Beata Sitek

    Bidirectional voting and continuous voting concepts as possible impact of Internet use on democratic voting process 599
    Jacek Wachowicz

    The Double Jeopardy Phenomenon and the Electronic Distribution of Information 605
    Urszula Świerczyńska-Kaczor, Artur Borcuch, Paweł Kossecki

    International Symposium on E-Learning - Applications:

    Call For Papers 609

    Simple Blog Searching Framework Based on Social Network Analysis 611
    Iwona Dolińska

  • 6th Workshop on Large Scale Computations on Grids and 1st Workshop on Scalable Computing in Distributed Systems:

    Call For Papers 619

    Exploratory Programming in the Virtual Laboratory 621
    Eryk Ciepiela, Daniel Harlak, Joanna Kocot, Tomasz Bartyński, Maciej Malawski, Tomasz Gubała

    Modelling, Optimization and Execution of Workflow Applications with Data Distribution, Service Selection and Budget Constraints in BeesyCluster 629
    Paweł Czarnul

    Multi-level Parallelization with Parallel Computational Services in BeesyCluster 637
    Paweł Czarnul

    Managing large datasets with iRODS - a performance analyses 647
    Denis Hünich, Ralph Müller-Pfefferkorn

    Service level agreements for job control in high-performance computing 655
    Roland Kübert, Stefan Wesner

    A Modeling Language Approach for the Abstraction of the Berkeley Open Infrastructure for Network Computing (BOINC) Framework 663
    Christian Benjamin Ries, Thomas Hilbig, Christian Schröder

    Degisco Green Methodologies in Desktop Grids 671
    Bernhard Schott, Ad Emmen

    Resource Fabrics: the next level of grids and clouds 677
    Lutz Schubert, Matthias Assel, Stefan Wesner

    2nd International Workshop on Medical Informatics and Engineering:

    Call For Papers 685

    Agile methodology and development of software for users with specific disorders 687
    Rostislav Fojtik

    3rd International Symposium on Multimedia - Applications and Processing:

    Call For Papers 693

    An Hypergraph Object Oriented Model for Image Segmentation and Annotation 695
    Eugen Ganea, Marius Brezovan

    Classification of Image Regions Using the Wavelet Standard Deviation Descriptor 703
    Sönke Greve, Marcin Grzegorzek, Carsten Saathoff, Dietrich Paulus

    High Capacity Colored Two Dimensional Codes 709
    Antonio Grillo, Alessandro Lentini, Marco Querini, Giuseppe F. Italiano

    Region-based Measures for Evaluation of Color Image Segmentation 717
    Andreea Iancu, Bogdan Popescu, Marius Brezovan, Eugen Ganea

    Undetectable Spread-time Stegosystem Based on Noisy Channels 723
    Valery Korzhik, Guillermo Morales-Luna, Ksenia Loban, Irina Marakova-Begoc

    Building Personalized Interfaces by Data Mining Integration 729
    Marian Cristian Mihaescu

  • A Graphical Interface for Evaluating Three Graph-Based Image Segmentation 735
    Gabriel Mihai, Alina Doringa, Liana Stanescu

    Basic Consideration of MPEG-2 Coded File Entropy and Lossless Re-encoding 741
    Kazuo Ohzeki, Yuan y Wei, Eizaburo Iwata, Ulrich Speidel

    Analyzes of the processing performances of a Multimedia Database 749
    Cosmin Stoica Spahiu

    Constructive Volume Modeling 755
    Mihai Tudorache, Mihai Popescu, Razvan Tanasie

    Real-Time Embedded Fault Detection Estimators in a Satellite's Reaction Wheels 759
    Nicolae Tudoroiu, Eshan Sobhani-Tehrani, Kash Khorasani, Tiberiu Letia, Roxana-Elena Tudoroiu

    Application of optimal settings of the LMS adaptive filter for speech signal processing 767
    Jan Vaňuš, Vítězslav Stýskala

    Obfuscation Methods with Controlled Calculation Amounts and Table Function 775
    Yuanyu Wei, Kazuo Ohzeki

    International Workshop on Real Time Software:

    Call For Papers 781

    Computationally effective algorithms for 6DoF INS used for miniature UAVs 783
    Jan Floder

    Supervisory control and real-time constraints 791
    Wojciech Grega

    Integration of Scheduling Analysis into UML Based Development Processes Through Model Transformation 797
    Matthias Hagner, Ursula Goltz

    Laboratory real-time systems to facilitate automatic control education and research 805
    Krzysztof Kołek, Andrzej Turnau, Krystyn Hajduk, Paweł Piątek, Mariusz Pauluk, Dariusz Marchewka, Adam Piłat, Maciej Rosół, Przemysław Gorczyca

    Methods of Computer-Assisted Manual Control of Wheeled Robots 813
    Viktor Michna, Petr Wagner, Jiri Kotzian

    Software and hardware in the loop component for an IEC 61850 Co-Simulation platform 817
    Haffar Mohamad, Thiriet Jean Marc

    Real-time controller design based on NI Compact-RIO 825
    Maciej Rosół, Adam Piłat, Andrzej Turnau

    Intelligent Car Control and Recognition Embedded System 831
    Vilem Srovnal Jr., Zdenek Machacek, Radim Hercik, Roman Slaby, Vilem Srovnal

    4th International Workshop on Secure Information Systems:

    Call For Papers 837

    A Security Model for Personal Information Security Management Based on Partial Approximative Set Theory 839
    Zoltán Csajbók

    Social Engineering-Based Attacks - Model and New Zealand Perspective 847
    Lech Janczewski, Lingyan (Ren) Fu

  • International Symposium on Technologies for Social Advancement:

    Global Mobile Applications For Monitoring Health 855
    Tapsie Giridher Giridher, Anita Wasilewska, Jennifer Wong

    A Study on the Expectations and Actual Satisfaction about Mobile Handset before and after Purchase 861
    JIBum Jung, Seungpyo Hong

    Workshop on Ad-Hoc Wireless Networks:

    Call For Papers 867

    Wireless Transceiver for Control of Mobile Embedded Devices 869
    Jan Kordas, Petr Wagner, Jiri Kotzian

    Efficient Coloring of Wireless Ad Hoc Networks With Diminished Transmitter Power 873
    Krzysztof Krzywdziński

    Fast Construction of Broadcast Scheduling and Gossiping in Dynamic Ad Hoc Networks 879
    Krzysztof Krzywdziński

    Workshop on Computational Optimization:

    Call For Papers 885

    ACO with semi-random start applied on MKP 887
    Stefka Fidanova, Pencho Marinov, Krassimir Atanassov

    On the Probabilistic min spanning tree problem 893
    Boria Nicolas, Murat Cécile, Paschos Vangelis

    Efficient Portfolio Optimization with Conditional Value at Risk 901
    Wlodzimierz Ogryczak, Tomasz Sliwinski

    Enhanced Competitive Differential Evolution for Constrained Optimization 909
    Josef Tvrdik, Radka Polakova

  • The AAIA'10 will bring together researchers, developers, practitioners, and users to present their latest research, results, and ideas in all areas of artificial intelligence. We hope that the theory and successful applications presented at the AAIA'10 will be of interest to researchers and practitioners who want to know about both theoretical advances and the latest applied developments in Artificial Intelligence. As such, AAIA'10 will provide a forum for the exchange of ideas between theoreticians and practitioners to address the important issues.

    Papers related to theories, methodologies, and applications in science and technology in this theme are especially solicited. Topics covering industrial issues/applications and academic research include, but are not limited to:

    Knowledge management
    Decision Support System
    Approximate Reasoning
    Fuzzy modeling and control
    Data Mining
    Web Mining
    Machine learning
    Combining multiple knowledge sources in an integrated intelligent system
    Neural Networks
    Evolutionary Computation
    Artificial Immune Systems
    Ant Systems in Applications
    Natural Language processing
    Image processing and understanding (interpretation)
    Applications in Bioinformatics
    Hybrid Intelligent Systems
    Granular Computing
    Architectures of intelligent systems
    Robotics
    Real-world applications of Intelligent Systems

    INTERNATIONAL PROGRAMME COMMITTEE

    Janos Abonyi, University of Pannonia, Hungary
    Hans Jorgen Andersen, Aalborg University, Denmark
    Anna Bartkowiak, Wroclaw University, Poland
    Shlomo Berkovsky, CSIRO, Australia
    Ryszard Choras, Institute of Telecommunications, Poland
    Krzysztof Cios, Virginia Commonwealth University, USA
    Alfredo Cuzzocrea, University of Calabria, Italy
    Claudio De Stefano, University of Cassino, Italy
    Jeremiah Da Deng, University of Otago, New Zealand
    Krzysztof Goczyla, Gdansk University of Technology, Poland
    Amr Goneid, Computer Science Dept., American University in Cairo, Egypt
    Min Henderson, University of Virginia, USA
    Zdzislaw Hippe, University of Information Technology and Management in Rzeszow, Poland
    Elzbieta Hudyma, Wroclaw University of Technology, Poland
    Jerzy W. Jaromczyk, University of Kentucky, USA
    Piotr Jedrzejowicz, Gdynia Maritime University, Poland
    Jerzy Jozefczyk, Wroclaw University of Technology, Poland
    Janusz Kacprzyk, Systems Research Institute of the Polish Academy of Sciences, Poland
    Radosław Katarzyniak, Wrocław University of Technology, Poland
    Przemyslaw Kazienko, Wroclaw University of Technology, Poland
    Vojislav Kecman, Virginia Commonwealth University, USA
    Etienne Kerre, University of Gent, Belgium
    Jacek Kluska, Rzeszow University of Technology, Poland
    Yiannis Kompatsiaris, Informatics and Telematics Institute, Greece
    Jozef Korbicz, University of Zielona Gora, Poland
    Jerzy Korczak, Wroclaw University of Economics, Poland
    Witold Kosinski, Polish-Japanese Institute of Information Technology, Poland
    Adam Krzyzak, Concordia University, Canada
    Juliusz Lech Kulikowski, Institute of Computer Science of the Polish Academy of Sciences, Poland
    Lukasz Kurgan, University of Alberta, Canada
    Halina Kwasnicka, Wroclaw University of Technology, Poland
    Serguei Levachkine, National Polytechnic Institute, Mexico
    Rory Lewis, University of Colorado at Colorado Springs, USA
    Joo-Hwee Lim, Institute for Infocomm Research, A*STAR, Singapore
    Jie Lu, University of Technology Sydney, Australia
    Abdel-Badeeh M. Salem, Ain Shams University, Egypt
    Jacek Mandziuk, Warsaw University of Technology, Poland
    Urszula Markowska-Kaczmar, Wroclaw University of Technology, Poland
    Zbigniew Michalewicz, University of Adelaide, Australia
    Santiago M. Mola, Universidad Politécnica de Valencia, Spain
    Pawel Myszkowski, Wroclaw University of Technology, Poland
    Tapio Pahikkala, University of Turku, Finland

    5th International Symposium Advances in Artificial Intelligence and Applications

    CELEBRATING 75TH BIRTHDAY OF PROFESSOR LEONARD BOLC

  • Mariusz Paradowski, Wroclaw University of Technology, Poland
    Witold Pedrycz, University of Alberta, Canada
    James Peters, University of Manitoba, Canada
    Sheela Ramanna, University of Winnipeg, Canada
    Zbigniew Ras, University of North Carolina, USA
    Paolo Rosso, Universidad Politécnica Valencia, Spain
    Gunter Saake, Otto-von-Guericke-Universität, Germany
    Jerzy Sas, Wroclaw University of Technology, Poland
    Christelle Scharff, Pace University, USA
    Roman Slowinski, Poznan University of Technology, Poland
    Andrzej Sluzek, Nanyang Technological University, Singapore
    Janusz Sobecki, Wroclaw University of Technology, Poland
    Siergey Subbotin, Zaporozhye National Technical University, Ukraine
    Jerzy Swiatek, Wroclaw University of Technology, Poland
    Piotr Szczepaniak, Technical University of Lodz, Poland
    Stan Szpakowicz, SITE, University of Ottawa, Canada
    Ryszard Tadeusiewicz, AGH University of Science and Technology, Poland
    Li-Shiang Tsay, North Carolina A&T State University, USA
    Josef Tvrdik, University of Ostrava, Czech Republic
    Angelina Tzacheva, Univ. of South Carolina, USA
    Anita Wasilewska, Stony Brook University, NY, USA
    Daniela Zaharie, West University of Timisoara, Romania
    Wojciech Ziarko, University of Regina, Canada

    ORGANIZING COMMITTEE

    Halina Kwasnicka, Urszula Markowska-Kaczmar, Wrocław University of Technology, Poland

  • A Breast Cancer Classifier based on a Combination of Case-Based Reasoning and Ontology Approach

    Essam Amin M.Lotfy Abdrabou

    Ph.D Candidate

    Faculty of Computer and Information Sciences

    Ain Shams University, Abbassia, 11566, Cairo, EGYPT

    (+202) 26330636

    Email: [email protected]

    AbdEl-Badeeh M. Salem

    Professor

    Faculty of Computer and Information Sciences

    Ain Shams University, Abbassia, 11566, Cairo, EGYPT

    (+202) 26844284

    Email: [email protected]

    Abstract: Breast cancer is the second most common form of cancer amongst females and the fifth most common cause of cancer deaths worldwide. For this type of malignancy, early detection is the best form of cure, so timely and accurate diagnosis of the tumor is vital. Extensive research has been carried out on automating the critical diagnosis procedure, and various machine learning algorithms have been developed to aid physicians in optimizing the decision task. In this research, we present a breast cancer classification model that combines ontology and case-based reasoning to classify breast cancer tumors as either malignant or benign. The classification system makes use of clinical data. Two ontology-based, object-oriented CBR frameworks are used: jCOLIBRI and myCBR. A breast cancer diagnostic prototype is built. During prototyping, we examine the use and functionality of the two frameworks.

    Index Terms: Case-Based Reasoning, Case-Based Reasoning Frameworks, CBR, CBR Frameworks, jCOLIBRI, myCBR, Breast Cancer

    I. INTRODUCTION

    Breast cancer classification, diagnosis and prediction techniques have been a widely researched area of medical informatics over the past decade. Several articles have been published that try to classify breast cancer data sets using techniques such as fuzzy logic, support vector machines, Bayesian classifiers, decision trees and neural networks. Classification accuracy as high as 98.8% has been achieved using a learning algorithm combining simulated annealing with the perceptron algorithm. Another study, involving fuzzy modeling and cooperative co-evolution, achieved an accuracy of 98.98% on the widely studied Wisconsin breast cancer database [16].

    This research applies a new technique in the field of breast cancer classification. It combines ontology and case-based reasoning by using ontology-based, object-oriented case-based reasoning frameworks. Two frameworks are examined in building the classifier. One is the open source jCOLIBRI [5] system, developed by the GAIA group, which provides a framework for building CBR systems based on state-of-the-art software engineering techniques. The other is the novel open source CBR tool myCBR [24], developed at the German Research Center for Artificial Intelligence (DFKI). The objective of the classifier is to determine, based on the patient's electronic record, whether the patient's tumor is benign or malignant.

    This paper is organized in four sections. Section 1 is this introduction. Section 2 gives a theoretical background on breast cancer, ontology, CBR and object-oriented frameworks. Section 3 illustrates the implementation of the breast cancer classifier on the two frameworks. Finally, Section 4 discusses the results and concludes.

    II. THEORETICAL BACKGROUND

    A. Breast Cancer

    Breast cancer is the form of cancer that either originates in the breast or is primarily present in the breast cells. The disease occurs mostly in women, but a small population of men is also affected by it. Breast cancer is the most common form of cancer amongst the female population as well as the most common cause of cancer deaths [25]. Early detection of breast cancer saves many thousands of lives each year. Many more could be saved if patients were offered accurate, timely analysis of their particular type of cancer and the available treatment options. Since breast tumors, whether malignant or benign, share structural similarities, it is an extremely tedious and time-consuming task to differentiate them manually. As seen in Figure 1, to an untrained eye there is no visually significant difference between the fine needle biopsy images of malignant and benign tumors. Accurate classification is very important, as the potency of the cytotoxic drugs administered during treatment can be life threatening or may lead to the development of another cancer. Laboratory analysis or biopsy of the tumor is a manual, time-consuming, yet accurate system of prediction. It is, however, prone to human error, creating a need for an automated system that provides a faster and more reliable method of diagnosis and prediction for patients.

    Fig. 1. Fine needle biopsies of breast. Malignant (left) and Benign (right) [25]

    Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 3-10. ISBN 978-83-60810-27-9, ISSN 1896-7094. 978-83-60810-27-9/09/$25.00 © 2010 IEEE

    B. Ontology

    An ontology is a formal, explicit description of the concepts in a domain of discourse (classes, sometimes called concepts), the properties of each concept describing its various features and attributes (slots, sometimes called roles or properties), and the restrictions on slots (facets, sometimes called role restrictions). An ontology together with a set of individual instances of classes constitutes a knowledge base. In reality, there is a fine line where the ontology ends and the knowledge base begins [8].
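The relation between classes, slots, facets and instances can be illustrated with a minimal Python sketch. It is not taken from jCOLIBRI or myCBR; the names `Slot`, `OntologyClass`, `Tumor`, `diagnosis` and `size_mm` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Slot:
    """A property of a concept; the optional fields act as facets."""
    name: str
    value_type: type
    allowed_values: tuple = ()          # facet: enumerated values (empty = any)
    min_value: Optional[float] = None   # facet: lower bound
    max_value: Optional[float] = None   # facet: upper bound

    def accepts(self, value) -> bool:
        """Return True if the value satisfies every facet of this slot."""
        if not isinstance(value, self.value_type):
            return False
        if self.allowed_values and value not in self.allowed_values:
            return False
        if self.min_value is not None and value < self.min_value:
            return False
        if self.max_value is not None and value > self.max_value:
            return False
        return True


@dataclass
class OntologyClass:
    """A concept in the domain, described by its slots."""
    name: str
    slots: dict = field(default_factory=dict)

    def add_slot(self, slot: Slot) -> None:
        self.slots[slot.name] = slot

    def make_instance(self, **values) -> dict:
        """Create an individual instance after checking all facets."""
        for slot_name, value in values.items():
            if slot_name not in self.slots:
                raise KeyError(f"{self.name} has no slot {slot_name!r}")
            if not self.slots[slot_name].accepts(value):
                raise ValueError(f"{value!r} violates a facet of {slot_name!r}")
        return {"class": self.name, **values}


# Hypothetical ontology: one class with two slots.
tumor_class = OntologyClass("Tumor")
tumor_class.add_slot(Slot("diagnosis", str, allowed_values=("benign", "malignant")))
tumor_class.add_slot(Slot("size_mm", float, min_value=0.0))

# The ontology plus its individual instances forms the knowledge base.
knowledge_base = [
    tumor_class.make_instance(diagnosis="benign", size_mm=12.5),
    tumor_class.make_instance(diagnosis="malignant", size_mm=31.0),
]
```

In this sketch the ontology is `tumor_class` with its slots and facets, while `knowledge_base` holds the instances, which mirrors the fine line between the two described above.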

    C. Case-Based Reasoning

    In case-based reasoning (CBR) systems, expertise is embodied in a
    library of past cases rather than being encoded in classical rules.
    Each case typically contains a description of the problem, plus a
    solution and/or the outcome. The knowledge and reasoning process
    used by an expert to solve the problem is not recorded, but is
    implicit in the solution. To solve a current problem, the problem is
    matched against the cases in the case base, and similar cases are
    retrieved. The retrieved cases are used to suggest a solution, which
    is reused and tested for success. If necessary, the solution is then
    revised. Finally, the current problem and the final solution are
    retained as part of a new case.

    The CBR process can be represented by a schematic cycle,

    as shown in Figure 2 [1].

    Fig. 2. The CBR Cycle

    Representation: Given a new situation, generate appropriate

    semantic indices that will allow its classification and categorization.
    This usually implies a standard indexing vocabulary

    that the CBR system uses to store historical information

    and problems. The vocabulary must be rich enough to be

    expressive, but limited enough to allow efficient recall [2].

    Retrieval: Given a new, indexed problem, retrieve the best

    past cases from memory. This requires answering three questions:
    What constitutes an appropriate case? What are the

    criteria of closeness or similarity between cases? How should

    cases be indexed? Part of the index must be a description of the

    problem that the case solved, at some level of abstraction. Part

    of the case, though, is also the knowledge gained from solving

    the problem represented by the case. In other words, cases

    should also be indexed by some elements of their solution [11].

    Adaptation: Modify the old solutions to conform to the new

    situation, resulting in a proposed solution. With the exception

    of trivial situations, the solution recalled will not immediately

    apply to the new problem, usually because the old and the

    new problem are slightly different. CBR researchers have

    developed and used various adaptation techniques [11].

    Validation: After the system checks a solution, it must

    evaluate the results of this check. If the solution is acceptable,

    based on some domain criteria, the CBR system is done with

    reasoning. Otherwise, the case must be modified again, and

    this time the modifications will be guided by the results of the

    solution's evaluation [11].

    Update: If the solution fails, explain the failure and learn

    it, to avoid repeating it. If the solution succeeds and warrants

    retention, incorporate it into the case memory as a successful

    solution and stop. The CBR system must decide if a successful

    new solution is sufficiently different from already-known solutions
    to warrant storage. If it does warrant storage, the system

    must decide how the new case will be indexed, on which level

    of abstraction it will be saved, and where it will be put in the

    case-base organization [11].

    Retaining the case is the process of incorporating whatever

    is useful from the new case into the case library. This involves

    deciding what information to retain and in what form to retain

    it; how to index the case for future retrieval; and integrating

    the new case into the case library.
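The steps described above can be sketched as a small Python loop. This is an illustration under strong simplifying assumptions, not the method of any particular CBR system: similarity is the fraction of exactly matching attribute values, adaptation reuses the retrieved solution unchanged, and validation compares the proposal with an outcome that is assumed to be known. All function names are hypothetical.

```python
def similarity(a, b):
    """Fraction of positions at which two equal-length tuples agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def retrieve(case_base, problem):
    """Retrieval: return the stored case most similar to the problem."""
    return max(case_base, key=lambda case: similarity(case["problem"], problem))


def adapt(retrieved_case, problem):
    """Adaptation: trivially reuse the old solution unchanged."""
    return retrieved_case["solution"]


def validate(solution, expected):
    """Validation: accept the proposal if it matches the known outcome."""
    return solution == expected


def update(case_base, problem, solution):
    """Update: retain the case only if it is not already stored."""
    new_case = {"problem": problem, "solution": solution}
    if new_case not in case_base:
        case_base.append(new_case)


def solve(case_base, problem, expected):
    proposed = adapt(retrieve(case_base, problem), problem)
    # Revise the proposal when validation fails (here: take the known outcome).
    final = proposed if validate(proposed, expected) else expected
    update(case_base, problem, final)
    return proposed, final


case_base = [
    {"problem": (1, 1, 1), "solution": "benign"},
    {"problem": (9, 8, 9), "solution": "malignant"},
]
proposed, final = solve(case_base, (9, 8, 7), "malignant")
```

After the call, the case base holds three cases, because the solved problem differs from both stored ones and is therefore retained.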

    D. CBR Object-Oriented Frameworks

    The concept of object-oriented frameworks was introduced
    in the late 1980s. A framework has been defined as a set of classes
    that embodies an abstract design for solutions to a family of
    related problems and supports reuse at a larger granularity
    than classes [9].

    The goal of a framework is to capture a set of concepts

    related to a domain and the way they interact. In addition, a

    framework is in control of a part of the program activity and

    calls specific application code by dynamic method binding.

    A framework can be viewed as an incomplete application

    where the user only has to specialize some classes to build

    the complete application [9].
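The idea that a framework controls part of the program flow and reaches application code through dynamic method binding can be sketched as follows. The classes `CBRFramework` and `TumorClassifier` are hypothetical and only illustrate the principle; they do not reproduce the design of jCOLIBRI or myCBR.

```python
from abc import ABC, abstractmethod


class CBRFramework(ABC):
    """Hypothetical framework class: it owns the control flow."""

    def run(self, query, case_base):
        # The framework decides when each step runs and calls the
        # application-specific code through dynamic method binding.
        best = max(case_base, key=lambda case: self.similarity(query, case))
        return self.reuse(best)

    @abstractmethod
    def similarity(self, query, case):
        """Application-specific: how close is a stored case to the query?"""

    @abstractmethod
    def reuse(self, case):
        """Application-specific: turn a retrieved case into an answer."""


class TumorClassifier(CBRFramework):
    """The application only specializes the two hooks."""

    def similarity(self, query, case):
        return -abs(query - case["clump_thickness"])

    def reuse(self, case):
        return case["label"]


cases = [
    {"clump_thickness": 2, "label": "benign"},
    {"clump_thickness": 9, "label": "malignant"},
]
result = TumorClassifier().run(8, cases)
```

The abstract base class is the "incomplete application": it cannot be instantiated until a subclass supplies the missing methods.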

    4 PROCEEDINGS OF THE IMCSIT. VOLUME 5, 2010

  • Frameworks allow the reuse of both code and design for a
    class of problems, enabling non-experts to write
    complex applications quickly. Frameworks also allow the
    development of prototypes that can later be extended
    by specialization or composition. Once a framework is
    understood, it can be applied in a wide range of domains and
    can be enhanced by adding new components [9].

    Using frameworks to develop new applications
    helps improve software quality. It improves programmer
    productivity as well as the quality, performance, and reliability of
    software. It also enhances extensibility by providing the
    methods that applications need to extend the framework's stable
    interfaces [20]. Figure 3 shows the difference in the effort
    required to develop an application from scratch and with
    a framework [15].

    Fig. 3. Development Effort Reduction by using Frameworks

    CBR researchers agree that the best way to satisfy the
    increasing demand for CBR applications is to
    develop frameworks. Recently, several efforts within the
    CBR community have produced CBR frameworks [20]. This
    paper focuses on two of them: jCOLIBRI, developed by the GAIA
    group, and myCBR, developed by the DFKI group.

    III. EXPERIMENTS

    A. Breast Cancer Classifications

    Breast cancer has become the number one cause of cancer
    deaths among women. Once a breast tumor is detected, it
    can be classified as benign (not cancerous tissue) or malignant
    (cancerous tissue). In this study, the two CBR
    frameworks are compared by using each of them to develop a CBR
    application that classifies a breast tumor as
    benign or malignant. The Wisconsin breast cancer data set
    was used to build the case bases. It was obtained from
    Dr. William H. Wolberg of the University of Wisconsin Hospitals,
    Madison [14]. Samples were added to the data set
    periodically as Dr. Wolberg reported his clinical cases. The
    data set contains 699 instances (as of 15
    July 1992). Each record contains ten attributes plus the class
    attribute. Table I shows the attributes and their possible values.
    Of the instances, 65.5% belong to the benign class and 34.5% to
    the malignant class. Sixteen instances are incomplete (one attribute
    value is missing) and were excluded from the case base.

    TABLE I
    WISCONSIN BREAST CANCER DATASET

    No.  Attribute                    Possible Value
    1    Sample code number           id number
    2    Clump Thickness              1–10
    3    Uniformity of Cell Size      1–10
    4    Uniformity of Cell Shape     1–10
    5    Marginal Adhesion            1–10
    6    Single Epithelial Cell Size  1–10
    7    Bare Nuclei                  1–10
    8    Bland Chromatin              1–10
    9    Normal Nucleoli              1–10
    10   Mitoses                      1–10
    11   Class                        2 for benign, 4 for malignant
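As an illustration of how records with the layout of Table I might be read and split into complete and incomplete cases, consider the following Python sketch. It assumes a comma-separated file and assumes that a missing value is marked with `?`; neither detail is stated in this paper. The three sample rows are made up for the example and are not actual patient records, and the column names are hypothetical.

```python
import csv
import io

COLUMNS = [
    "id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
    "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]
CLASS_NAMES = {2: "benign", 4: "malignant"}
MISSING = "?"  # assumed marker for a missing attribute value

# Made-up rows in the layout of Table I (not actual patient records).
SAMPLE = """\
1001,5,1,1,1,2,1,3,1,1,2
1002,8,10,10,8,7,10,9,7,1,4
1003,6,6,6,9,6,?,7,8,1,2
"""


def load_cases(text):
    """Split rows into complete cases and incomplete ones."""
    complete, incomplete = [], []
    for row in csv.reader(io.StringIO(text)):
        record = dict(zip(COLUMNS, row))
        if MISSING in row:
            incomplete.append(record)
        else:
            complete.append({k: int(v) for k, v in record.items()})
    return complete, incomplete


complete, incomplete = load_cases(SAMPLE)
labels = [CLASS_NAMES[case["class"]] for case in complete]
```

In the experiments described here, the complete records would form the case base and the incomplete ones would serve as test queries.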

    B. jCOLIBRI

    1) Overview: jCOLIBRI is an evolution of the COLIBRI
    architecture [7], which consisted of a library of problem-solving
    methods (PSMs) for solving the tasks of a knowledge-intensive
    CBR system, along with an ontology, CBROnto [8],
    of common CBR terminology. COLIBRI was prototyped in
    LISP using LOOM as the knowledge representation technology.
    This prototype served as a proof of concept and was very useful, but
    it was not suitable for non-expert users. The GAIA
    group therefore started to develop a new, complete framework
    named jCOLIBRI, which stands for Cases and Ontology
    Libraries Integration for Building Reasoning Infrastructures.
    The CBR ontology assumes a vocabulary common to any
    CBR system. In jCOLIBRI, the ontology is not represented as a
    separate resource: all CBR concepts are mapped onto the classes and
    interfaces of the framework. The classes that represent the concepts of
    the ontology serve as templates to which new CBR types are
    added. They also provide the tasks and the abstract interfaces of the
    methods. The design of the jCOLIBRI framework comprises
    a hierarchy of Java classes plus a number of XML files. The
    framework is organized around the following elements [2]:

    Tasks and methods: The tasks supported by the framework

    and the methods that solve them are all stored in a set of

    XML files.

    Case-base: Different connectors are defined to support several

    types of case persistence, from the file system to a database.

    Cases: A number of interfaces and classes are included in the

    framework to provide an abstract representation of cases that

    support any type of actual case structure.

    Problem solving methods: The actual code that supports the

    methods included in the framework.

    jCOLIBRI comes in two major releases, version 1 and
    version 2. According to the tutorial [19], version 2 is a new
    implementation that follows a new and clear architecture divided
    into two layers: one oriented to developers and the other oriented
    to designers. Unfortunately, the only available distribution of
    version 2 is the developer-oriented one, which
    is outside the scope of this paper. jCOLIBRI version 1 is the first
    release of the framework. It includes a complete Graphical

    ESSAM ABDRABOU, ABDEL-BADEEH SALEM: A BREAST CANCER CLASSIFIER 5

  • Fig. 4. Implementation in jCOLIBRI: (a) Patient Case Definition in jCOLIBRI; (b) Managing Connectors in jCOLIBRI; (c) Configuration of Tasks in jCOLIBRI; (d) jCOLIBRI Retrieval

    User Interface (GUI) that guides the user through the design of a
    CBR system. This version is recommended for non-developer
    users who want to create CBR systems without writing
    any code, which is exactly the scope of this study. Version 1 was
    therefore selected to implement the required application.
    Downloading jCOLIBRI is easy; it can be
    obtained from the web page of the GAIA group. It comes as
    a compressed distribution that is extracted to obtain
    the full package. The distribution includes a ready-made batch file
    (we used the MS Windows® platform) that can be invoked
    directly to run jCOLIBRI. A Java Virtual
    Machine must be installed before the batch file is run. Invoking
    the batch file opens the first screen of the framework GUI.

    2) Implementation: With the help of the multimedia tutorials
    provided and the jCOLIBRI GUI, users can go through
    five steps to implement and deploy a CBR system:
        Definition of case structures
        Building the case base
        Managing similarity measures
        Configuring the behavior of the CBR process
        Testing and deploying the CBR application

    Definition of Case Structures: Using the jCOLIBRI GUI, users
    can create the case structure by defining the simple and
    compound attributes that describe the cases, together with
    their types, weights, and similarity measures, which are chosen from
    a library of existing similarity functions and parameters. The
    case structure can be saved to and loaded from an XML file.
    Figure 4(a) shows the definition of the patient case parameters.

    Building the case-base: jCOLIBRI builds case persistence around the
    concept of connectors. Connectors
    are objects that know how to access and retrieve
    cases from the storage media and return those cases to the
    CBR system in a uniform way. Connectors therefore provide
    an abstraction mechanism that allows users to load cases
    from different storage sources in a transparent way [24] [21].
    The predefined connectors work with plain text files, XML files,
    or relational databases. The graphical interface helps to map
    the defined case structure to the tables and columns of
    the storage scheme. Figure 4(b) shows how the patient case
    structure is mapped to columns in a text file containing the
    Wisconsin data set patient records.
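The connector idea described above can be sketched in Python as follows. The class and method names (`Connector`, `PlainTextConnector`, `InMemoryConnector`, `retrieve_all_cases`) are hypothetical and are not the jCOLIBRI API, which is written in Java; the sketch only shows how a uniform interface hides the storage medium and how a column mapping ties a case structure to a text file.

```python
import csv
import io
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical connector: hides where and how cases are stored."""

    @abstractmethod
    def retrieve_all_cases(self):
        """Return every stored case as a dict of attribute values."""


class PlainTextConnector(Connector):
    """Reads comma-separated cases and maps columns to attribute names."""

    def __init__(self, stream, column_mapping):
        self.stream = stream
        self.column_mapping = column_mapping  # attribute name -> column index

    def retrieve_all_cases(self):
        return [
            {name: row[index] for name, index in self.column_mapping.items()}
            for row in csv.reader(self.stream)
        ]


class InMemoryConnector(Connector):
    """Stands in for another storage medium, such as a database."""

    def __init__(self, cases):
        self.cases = cases

    def retrieve_all_cases(self):
        return list(self.cases)


def load_case_base(connector):
    # The CBR system depends only on the Connector interface.
    return connector.retrieve_all_cases()


mapping = {"id": 0, "clump_thickness": 1, "class": 2}
text_cases = load_case_base(
    PlainTextConnector(io.StringIO("1001,5,2\n1002,8,4\n"), mapping))
memory_cases = load_case_base(
    InMemoryConnector([{"id": "1001", "clump_thickness": "5", "class": "2"}]))
```

Both connectors return cases in the same shape, so the code that consumes them does not need to know which storage source was used.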

    Managing similarity measures: When two cases are compared,
    local similarity functions are used to compare simple
    attribute values. Global similarity functions are linked to
    compound attributes and combine the similarities of
    their component attributes into a single similarity value. Finally, the
    similarity of two cases is computed as the similarity of
    their description concepts. The available similarity measures
    are listed in a configuration file and can be managed through
    the GUI. Since our problem is simple, we keep the default
    similarity measures assigned by jCOLIBRI.

    Configuring the behavior of the CBR process: As introduced above,
    jCOLIBRI formalizes CBR knowledge using the CBR ontology
    (CBROnto), a knowledge-level description of the CBR
    tasks, and a library of reusable problem-solving methods
    (PSMs) [21]. Tasks are configured interactively: for each
    selected task, the user chooses from the library of reusable methods
    one that is suitable for solving it. The constraints of
    the selected task are tracked during the configuration
    process, so that only the methods applicable in the given context
    are offered to users. In our comparison we focus only on the
    retrieval task. Figure 4(c) shows the configured tasks in the
    breast cancer application.

    Testing and deploying the CBR application: The CBR application
    is finished when all the tasks have been configured.
    Users can test the system from inside the graphical interface.
    The first task of the CBR system (the obtain-query task) obtains
    the query that is used to retrieve the most similar
    cases. Figure 4(d) shows the GUI after a query. As queries, we used
    the 16 records that were excluded from the data set because
    each has one missing value. Only two of them were
    misclassified. The documentation states that it is possible to deploy
    the developed CBR application by generating a code template
    that contains most of the code required to run the developed system
    as an independent application. We tried this process, but
    it failed completely.
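The retrieval test with incomplete queries can be illustrated by the following Python sketch. The paper does not state how a missing value is treated during retrieval, so the sketch makes its own assumptions: attributes missing from the query are simply ignored, closeness is measured on the 1–10 scale, and the label is decided by a majority vote among the k most similar cases. The cases are made up for the example.

```python
from collections import Counter


def similarity(query, case, attributes):
    """Mean closeness over the attributes the query actually provides."""
    known = [a for a in attributes if query.get(a) is not None]
    if not known:
        return 0.0
    return sum(1.0 - abs(query[a] - case[a]) / 9 for a in known) / len(known)


def classify(query, case_base, attributes, k=3):
    """Label the query by majority vote among its k most similar cases."""
    ranked = sorted(case_base,
                    key=lambda case: similarity(query, case, attributes),
                    reverse=True)
    votes = Counter(case["class"] for case in ranked[:k])
    return votes.most_common(1)[0][0]


ATTRIBUTES = ["clump_thickness", "bare_nuclei", "mitoses"]

# Made-up cases for illustration only.
case_base = [
    {"clump_thickness": 1, "bare_nuclei": 1, "mitoses": 1, "class": "benign"},
    {"clump_thickness": 2, "bare_nuclei": 1, "mitoses": 1, "class": "benign"},
    {"clump_thickness": 3, "bare_nuclei": 2, "mitoses": 1, "class": "benign"},
    {"clump_thickness": 9, "bare_nuclei": 10, "mitoses": 3, "class": "malignant"},
    {"clump_thickness": 10, "bare_nuclei": 8, "mitoses": 7, "class": "malignant"},
]

# The query lacks one attribute value, like the 16 excluded records.
query = {"clump_thickness": 2, "bare_nuclei": None, "mitoses": 1}
predicted = classify(query, case_base, ATTRIBUTES)
```

Comparing `predicted` with the recorded class of each incomplete record is one way to count misclassifications such as the two reported above.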

    C. myCBR

    1) Overview: myCBR is an open-source plug-in for the
    open-source ontology editor Protégé [6]. Protégé is based
    on Java, is extensible, and provides a plug-and-play environment
    that makes it a flexible base for rapid prototyping
    and application development [4]. Protégé [4] allows classes and
    attributes to be defined in an object-oriented way. Furthermore,
    it manages instances of these classes, which myCBR interprets
    as cases [22]. The handling of the vocabulary and the case base
    is therefore already provided by Protégé. The myCBR plug-in provides
    several editors for defining similarity measures for an ontology
    and a retrieval interface for testing [24]. As the main goal of
    myCBR is to minimize the effort of building CBR applications
    that require knowledge-intensive similarity measures, myCBR

  • Fig. 5. Implementation in myCBR: (a) Wisconsin Dataset in a CSV File; (b) Patient Case Data Representation in myCBR; (c) Retrieval of a Case Query with a Missing Attribute Value; (d) Breast Cancer as a Stand-Alone Application

    provides comfortable GUIs for modeling various kinds of
    attribute-specific similarity measures and for evaluating the
    resulting retrieval quality. To also reduce the effort of
    the preceding step of defining an appropriate case representation,
    it includes tools for generating the case representation
    automatically from existing raw data [22]. Both novice and
    expert knowledge engineers are supported during the
    development of a myCBR project through intelligent support
    approaches and advanced GUI functionality [22].

    Obtaining myCBR requires two downloads. The first is
    the myCBR plug-in files, available directly
    from the myCBR web page. The second is
    the Protégé ontology editor, available from the
    Protégé web page. Downloading Protégé is not straightforward:
    users need to read some of the material on the site to be able to
    select a suitable version. Since myCBR is a
    plug-in for Protégé, users need to install Protégé first. A
    Java Virtual Machine must be installed before
    proceeding with the installation, or users may choose to download
    the Protégé version that bundles Java. To install the myCBR
    plug-in, users copy the myCBR plug-in files
    into Protégé's plug-ins directory. Then, after starting Protégé and
    creating a new project, users enable the myCBR plug-in
    from the configuration menu of Protégé. Once the plug-in is installed
    and activated, the user interface of Protégé is
    extended with additional tabs that give access to the myCBR modules.

    After a CBR application has been developed using the Protégé plug-in,
    myCBR can also be used as a stand-alone Java module
    to be integrated into arbitrary applications, for example JSP-based
    web applications. In this application phase, the retrieval
    engines of myCBR simply read the XML files of the
    project created with the plug-in interface and perform
    the similarity-based retrieval [24].

    For Protégé manuals and tutorials, users may consult the
    documentation section of the Protégé web site. Among other
    things, users will find the Protégé User's Guide, a "getting
    started" tutorial, and information on ontology development.
    The manual for myCBR is available on its web page in HTML
    and PDF versions. It covers installation and
    various usage issues. No multimedia tutorials are available
    for myCBR.

    2) Implementation: Four steps are required to develop a
    CBR application:
        Generation of case representations
        Modeling similarity measures
        Testing of retrieval functionality
        Implementation of a stand-alone application

    Generation of case representations: One powerful feature
    of myCBR is the ease with which a case representation is created
    by the CSV data import module [24]. Users can choose to
    import data instances into an existing Protégé class or to create
    a new class that suits their raw data. Figure 5(a) shows
    how the Wisconsin data set is arranged in a CSV file. myCBR
    also allows slots to be added manually using Protégé. Figure
    5(b) shows the myCBR screen after the data set has been imported into a
    new class, Patient, whose instances are used as query and case values
    in the retrieval step.

    Modeling of similarity measure: myCBR follows the local-global
    approach, which divides the similarity definition into
    a set of local similarity measures, one for each attribute, a set of
    attribute weights, and a global similarity measure for calculating
    the final similarity value. This means that, for an attribute-value
    based case representation consisting of n attributes, the
    similarity between a query q and a case c may be calculated
    as follows:

    \[ \mathrm{Sim}(q, c) = \sum_{i=1}^{n} w_i \cdot \mathrm{sim}_i(q_i, c_i) \qquad (1) \]

    Here, $\mathrm{sim}_i$ and $w_i$ denote the local similarity measure and the
    weight of attribute i, and Sim represents the global similarity
    measure [24]. The data set used in this experiment is simple,
    so we leave the similarity measure definition as the default of


  • myCBR. We only change the weights of the Id and Class
    slots from one to zero, so that neither the sample identifier nor
    the class label influences retrieval. Users may consult the myCBR
    tutorial for more options for defining local and global similarity
    measures.
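The weighted sum of Eq. (1), with the Id and Class weights set to zero, can be sketched in Python as follows. The local similarity used here (one minus the distance divided by the value range 1–10) is an assumption made for the example; the paper does not specify the default measures of myCBR. The attribute names are illustrative.

```python
def local_similarity(q, c, lo=1, hi=10):
    """Assumed local measure: 1 minus the distance scaled by the value range."""
    return 1.0 - abs(q - c) / (hi - lo)


def global_similarity(query, case, weights):
    """Weighted sum of local similarities, as in Eq. (1)."""
    return sum(
        w * local_similarity(query[name], case[name])
        for name, w in weights.items()
        if w != 0  # zero-weight slots (Id, Class) are never compared
    )


# Id and Class get weight zero; the names are illustrative.
weights = {"id": 0, "clump_thickness": 1, "mitoses": 1, "class": 0}

query = {"id": 1, "clump_thickness": 5, "mitoses": 1, "class": None}
case_a = {"id": 2, "clump_thickness": 5, "mitoses": 1, "class": 2}
case_b = {"id": 3, "clump_thickness": 10, "mitoses": 10, "class": 4}

sim_a = global_similarity(query, case_a, weights)
sim_b = global_similarity(query, case_b, weights)
```

Because the weights here are not normalized, the result ranges from 0 to the sum of the weights; dividing by that sum would scale it to the interval [0, 1] without changing the ranking of the cases.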

    Testing of retrieval functionality: myCBR includes an easy-to-use
    GUI for performing retrievals and for analyzing the
    corresponding results. By providing similarity highlighting and
    explanation functionality, myCBR supports the efficient analysis
    of the outcome of the similarity computation. As queries, we used the
    16 records that were excluded from the data set because each has one
    missing value. Only two of them were misclassified.
    Figure 5(c) shows the query for one of these records after the
    most similar cases have been retrieved. An alternative way of performing
    case retrieval is to use an existing case as the query. This was also
    tested and gave a similar result, as shown in Figure 5(d).

    Implementation of stand-alone application: myCBR can also

    be used as a stand-alone Java module, to be integrated in

    arbitrary applications. In this application phase, the retrieval

    engines of myCBR just read the XML files of the created

    project generated using the plug-in interface and perform the

    similarity-based retrieval. Figure 5(d) shows the breast cancer

    stand-alone application.

    IV. DISCUSSION AND CONCLUSION

    In this paper, we examined two object-oriented, ontology-based
    CBR frameworks: jCOLIBRI, developed by the GAIA group,
    and myCBR, developed by the DFKI group. A breast cancer
    classifier was built with each of the two frameworks.

    During the implementation of the breast cancer diagnostic
    application with jCOLIBRI, we found jCOLIBRI to be user-friendly
    and efficient for developing an application quickly. The
    classifier misclassified only two of the 16 test queries.
    During the implementation of the breast cancer classifier
    with myCBR, we found that myCBR is truly a tool for
    rapid prototyping of new CBR applications. Within seconds, users
    can have a running stand-alone CBR application by using the
    CSV import feature. myCBR builds
    the case structure and the case base automatically by parsing the
    provided CSV file. myCBR also avoids reinventing the wheel by letting
    the development of a new CBR application take place inside Protégé.
    This classifier, too, misclassified only two of the 16 test queries.

    In conclusion, the two CBR frameworks are very useful for
    developing a CBR-based breast cancer classifier, which can play an
    important role in detecting the disease early, so that the
    right medication can be used to save lives.

    REFERENCES

    [1] A. Aamodt and E. Plaza, Case-Based Reasoning: Foundational Issues, Methodological Variation and System Approaches, AICOM, vol. 7, no. 1, 1994, pp. 39–58.

    [2] J. J. Bello-Tomás, J. A. González-Calero and B. Díaz-Agudo, JCOLIBRI: An Object-Oriented Framework for Building CBR Systems, in Advances in Case-Based Reasoning, Lecture Notes in Computer Science, Springer Berlin/Heidelberg, vol. 3155, 2004, pp. 32–46.

    [3] S. Bogaerts and D. Leake, Increasing AI Project Effectiveness with Reusable Code Frameworks: A Case Study Using IUCBRF, in Proceedings of the 18th International Florida Artificial Intelligence Research Society Conference, Menlo Park, CA: AAAI Press, 2005.

    [4] S. Bogaerts and D. Leake, A Framework for Rapid and Modular Case-Based Reasoning System Development, Technical Report, TR 617, Computer Science Department, Indiana University, Bloomington, IN, 2005.

    [5] B. Díaz-Agudo, P. A. González-Calero, J. Recio-García and A. Sánchez-Ruiz, Building CBR systems with jCOLIBRI, Journal of Science of Computer Programming, vol. 69, no. 1–3, 2007, pp. 68–75.

    [6] J. H. Gennari, M. A. Musen, R. W. Fergerson, W. E. Grosso, M. Crubezy, H. Eriksson, N. F. Noy and S. W. Tu, The evolution of Protégé: an environment for knowledge-based systems development, Int. J. Hum.-Comput. Stud., vol. 58(1), 2003, pp. 89–123.

    [7] J. A. González-Calero and B. Díaz-Agudo, An architecture for knowledge intensive CBR systems, in E. Blanzieri and L. Portinale, editors, Advances in Case-Based Reasoning (EWCBR'00), Springer-Verlag, Berlin Heidelberg New York.

    [8] J. A. González-Calero and B. Díaz-Agudo, CBROnto: a task/method ontology for CBR, in S. Haller and G. Simmons, editors, Procs. of the 15th International FLAIRS'02 Conference (Special Track on CBR, 101–106). AAAI Press.

    [9] M. Jaczynski and B. Trousse, An Object-Oriented Framework for the Design and the Implementation of Case-Based Reasoners, in Proceedings of the 6th German Workshop on Case-Based Reasoning, Berlin, 1998.

    [10] R. Johnson and B. Foote, Designing reusable classes, Journal of Object-Oriented Programming, vol. 1(5), 1988, pp. 22–35.

    [11] J. L. Kolodner, Case-Based Reasoning, 1993, Morgan Kaufmann Publishers, California.

    [12] D. Leake, Case Based Reasoning. Experiences, Lessons and Future Directions, AAAI Press, MIT Press, USA, 1997.

    [13] M. Manago, R. Bergmann, N. Conruyt, R. Traphöner, J. Pasley, J. Le Renard, F. Maurer, S. Wess, K. D. Althoff and S. Dumont, CASUEL: a common case representation language, ESPRIT project 6322, 1994. Task 1.1, Deliverable D1.

    [14] O. L. Mangasarian and W. H. Wolberg, Cancer diagnosis via linear programming, SIAM News, vol. 23, no. 5, 1990, pp. 1 & 18.

    [15] A. Mulder, Developing a Reusable Application Framework, Chariot Solutions, http://www.chariotsolutions.com/javalab/presentations.jsp, 2003.

    [16] C. A. Peña-Reyes and M. Sipper, Applying Fuzzy CoCo to Breast Cancer Diagnosis, IEEE, 2000, pp. 1168–1175.

    [17] J. A. Recio-García, B. Díaz-Agudo and P. A. González-Calero, Prototyping recommender systems in jCOLIBRI, in Proceedings of the 2008 ACM Conference on Recommender Systems (Lausanne, Switzerland, October 23–25, 2008), RecSys '08, ACM, New York, NY, pp. 243–250.

    [18] J. A. Recio-García, B. Díaz-Agudo and P. A. González-Calero, jCOLIBRI2 Tutorial, 2008. Group of Artificial Intelligence Application (GAIA). University Complutense of Madrid. Document Version 1.2.

    [19] J. A. Recio-García, D. Bridge, B. Díaz-Agudo and P. A. González-Calero, CBR for CBR: A Case-Based Template Recommender System, in K. D. Althoff and R. Bergmann, editors, Advances in Case-Based Reasoning, 9th European Conference, ECCBR 2008 (in press), LNCS. Springer.

    [20] J. A. Recio-García, B. Díaz-Agudo, A. Sánchez and P. A. González-Calero, Lessons learnt in the development of a CBR framework, in M. Petridis, editor, Proceedings of the 11th UK Workshop on Case Based Reasoning, CMS Press, University of Greenwich, 2006, pp. 60–71.

    [21] J. A. Recio-García, A. Sánchez, B. Díaz-Agudo and P. A. González-Calero, jCOLIBRI 1.0 in a nutshell. A software tool for designing CBR systems, in M. Petridis, editor, Proceedings of the 10th UK Workshop on Case Based Reasoning, CMS Press, University of Greenwich, 2005, pp. 20–28.

    [22] T. R. Roth-Berghofer and D. Bahls, Explanation Capabilities of the Open Source Case-Based Reasoning Tool myCBR, 2008.

    [23] S. Schulz, CBR-Works: A state-of-the-art shell for case-based application building, in E. Melis, editor, Proceedings of the 7th German Workshop on Case-Based Reasoning, GWCBR'99, Würzburg, Germany, University of Würzburg, pp. 166–175.

    [24] A. Stahl and T. R. Roth-Berghofer, Rapid prototyping of CBR applications with the open source tool myCBR, in R. Bergmann and K. D. Althoff, editors, Advances in Case-Based Reasoning, 2008, Springer Verlag.

    [25] M. Sewak, P. Vaidya, C. C. Chan and Z. H. Duan, SVM Approach to Breast Cancer Classification, IMSCCS, vol. 2, 2007, pp. 32–37.



    MEDHAT MOHAMED AHMED ABDELAAL, MUHAMED WAEL FAROUQ ET AL.: USING DATA MINING FOR ASSESSING DIAGNOSIS 13


    AAABBAA"BABAEFAB"BFABCBF

    )CA"BBABAAAACBFBCF-BAEF9F8"FA'ACBBSEF8EB-TF8F A!B8&BAAEFABAEBA"F%ABABACBF)"F%ACBFEAAB"A AAA"BB!AABEF%FAABABACBFEABBAA"A"FAAB

    'C D(DME'D(*,M)

    )C)EF'!BBCBB%)CEABBACEA!ACACAF

    TABLE I.

    THE RESULTS OF APPLYING CLASSIFICATION SVM AND DECISION TREES

    FABC BA) &BBA")

    )" CFAB )" CFAB )" CFABC 3>FI 3F;; 3=@; 3F@= 3F;; 3F==(, 3F 3I3 30@= 30=0

    D, 3550 35I>@ 5F@

  • CB" A FFA!AF B3F;;3I

  • CA B A *+ !63=@5;5X33;@5>7,CFACABA*+BAEBBA"A!63@5@X335 DEFFABC*DADCCB'BDC+CBA )ACB(DACADAA(D(BDBABD,-

    B'AABFBAB&'AFAAAB'&'%;303#BABB9FB

    =0> &Y;33 &YS"AFC"B"TCF"&F$FE"";33;

    =/> "M:DFEBBAABC:;33@AAA??!!!AA!?Z.F?FEC?

    =-> A(Y!%)FB:DABABABBAABCBA8F%EF"CAB:(!EB8(EACE"A#;333

    =+> Y!"*B"B!8S+!CAFBCFC""CAABT$EBB8BF'C"""BDC#;333

    =.> YB)SBCAA"B"AFCCCB"CC"T)ACABBCAABBA;335

    =,> BCCA0FF=SAABBBAAEBCFACCCB"C"BF"BBEAT

    =4*> B8;335SBCA"B"AFCCCB"TACABBCA,""BFF"B,""ABBAFB

    =44> *FLB9F9*,BBS"AFC"B"T(!Y#A$FF;33;

    =41> *".*"&BCFC"FFB*#;33@

    =40> CBFDYDBF8B:DAABFBBAAB"B:(B+M);)F*BA(;%)*%0FF=%353

    =43> )BC*[#EY CDAADAECDEDAB)*CB)BDF*DCBAD+*AD**DB

    B'BDCYBFB,ABF&BFAAA;33>CBF5;2(C&0"I0%F0

    =4/> ABBAFB"AFCCB"$BC#"AAACAB?CCB"?AEACF;33F


  • AB ACD EF EC FE AEABEABEAAABDEABBBEBEAABFEBBEBEFEBA FE B C AEAAB B EF CFBEAEABAAEBBCBEECBEFAABCCABBFFDBBDBEECCBEBBBEBEAAB FFCB BE EAEB ABDAFFABC EBB E F BD CE EABB E A B ADBEBABEEABCCEFBCAEEABDBFEAEABEA!C AAB DEAB BBE A BE "#$%& '#%EB&B(FEBE

    ABCDEDFCDFFCECFCCECECFCCECC E!FCFFCD" FCC" CDFC

    FC#FC"CECFCCEF$CDEFC!EECECECFFCECFC FCECC"CCCCD%!FCECE"CECDFCEDDE CFCF%C&CF'EDFCECECEC(&)$C(*E)$C(+E%F) C F C EF C FF C FF$ C F C D C ED%DE F C C FF C F" C ,F C C EF C E!FCFF C D" F- C B& C % C .F C /BEF C !EE C &FE"FCE0C123CECB&C%C4ECFCEC/BDFFFCDC"C&FE"F0 C 153 C4 CF C FF C FF C D$ C EFCFE"F$CECECDDFCECF CCFECEC!EECFDCC FFCFE"FC FCFEFCFDCC"FCEC!F$C"6"FCFC FC"CFFCDC C ECCFC!FCEC"CC!E"CEF%DEFCF%EC+FCECFC"FCCE CCFFFC%FF CDCF!FC"FCEC!EFCC"CCF C EF C EF$ C E$ C EF C C "E C EC!FDCCEFCFC FC!EEFCC FCC%DECECC FFCFDCFFC FCECECECF C C!E" C FDFFE C C FC EF C7CF'EDFC C CC FC&2C

    FCD"DFCC CDEDFCCCF!FCEC!FEFCECF CFFFCCECB&C FCB&CFF%FDCF CC"CDFCCEE8FC#FC"FC FCDFB&CEC CFCFEFCC"CEFCFCC193CFCDDFCECDCC"FC FCB&CF CC FCEFCDFCC FCFCD!FCF"C"CE!FEFCDFFEFCC"F"CEECECCC C FC"FCCFFCECECEC:CEC FCECDFF%

    EF C!FCE"C;

  • EFC "E C@" C C F CDFEF C FEF C CB&CECFCFFFCC FC"FCCECDFECEFCFDF%FECCEC FC(FECEFIC153C FCB&CC FFCFDCDF-

    FFCDCFFCECE8E &FE"F C !F C " C J C FD C /FFC

    DCFD0 FDCE

    ABCDECBFBEFCFCB

    FCB&C"FC&ECJCAFECFE"FCFFCECCCEFCCFFECDCCAFECE'CC FCAF%ECE'CC FCE'CCDEECF!E!FCCCF%EC"CCC/FFCF6C20C1F31K3

    5

    C5

    5

    C

    5

    C

    5

    5 C /20

    FCFFECC CE'C$CCEC FC%E$CCE"EFC-

    D

    5

    C5

    5

    5

    5

    C 5

    C /50

    B6"EC/20CEC/50CEFCFFCC""C"C$C"CFCC CEF$CCCFECFFCDEFC FCCCECFFC"CCFFC CFDFFC%EFCACECCCEFCCECD'FCFCC FC%EFC7DD'EFCF!E!FCEFCD"FCC!"C CEDDDEFCFFC/CCEDD'ECC0C&CF%FECEDD'ECDC"ECC"F-

    D D CC D

  • FCEECFC FCEECD"CCEF FC2FKCFFCEDFCECFFCC FCFEFCCECECDECEFC2C

    C"FCCEDFCCFE CEC"CEEEFC%FCC22CFFCE CE CE C EFC8FCEC;>

    AF"F 2>

    FE" 2K

    BE F

    +EF 5K

    TABLE II

    EXPERIMENTAL RESULTS.

    )A %CB"+,-.( %CB"+,-/(

    7"

  • F C DFEF C C F C FF C C "F C C FCE FCEDFCC FCEEEFCCEDFCCE!E%EFCFCEEFC/FC6"FF0C#FCFEFCFCEFCCE!FEF C F C E C F C EF C C "F C C FDCE F C C FE C E C F C F C EDDE CD!FCFFCF"C"CCEC"ECFE"FCCECEDFC EC FCEFC"FCCFFFCFDCB"ECC CCEFC5C C#FCFCEC F CMCEC/EFCF%FFC FCFCEC FCF'CFCFD0C4FFCF"CFFCE F!FC CCMO

  • 1>3 B CB F E$C@CE$ C,FEC BEF$%$CCFCCFEB%F F-CBC D"FC+CECAEFCF$ C+ACQ

  • Evaluation of Clustering Algorithms

    for Polish Word Sense Disambiguation

    Bartosz Broda, Wojciech Mazur

    Institute of Informatics, Wrocław University of Technology, Poland

    [email protected], [email protected]

    Abstract: Word Sense Disambiguation in text is still a difficult problem, as the best supervised methods require laborious and costly manual preparation of training data. Thus, this work focuses on the evaluation of a few selected clustering algorithms in the task of Word Sense Disambiguation for Polish. We tested six clustering algorithms (K-Means, K-Medoids, hierarchical agglomerative clustering, hierarchical divisive clustering, Growing Hierarchical Self-Organising Maps, graph-partitioning based clustering) and five weighting schemes. For the agglomerative and divisive algorithms, 13 criterion functions were tested. The achieved results are interesting, because the best clustering algorithms are close in terms of cluster purity to the precision of a supervised clustering algorithm on the same dataset, using the same features.

    I. INTRODUCTION

    Word Sense Disambiguation (WSD) deals with contextual resolution of lexical ambiguity. Most words in natural language have more than one lexical meaning (sense), but usually only one of them is active in a given context. A typical example of an ambiguous word is "line", which according to WordNet (an electronic thesaurus, cf. [1]) has 36 senses. WSD is an important problem for applications in the domain of Natural Language Processing (NLP). Machine translation cannot work without some form of disambiguation, but WSD can also be helpful for information retrieval, information extraction and computer-aided lexicography, among others [2].

    WSD is a hard problem. Most difficulties arise from the fact that the concept of a meaning is vague. Usually, there are no clear boundaries between one sense and another [3]. Typically, the problem of defining meaning is tackled by using dictionaries (which are called sense inventories in the context of WSD). That is, from the algorithmic point of view, sense inventories are used to enumerate all the meanings that a given word has. The goal of WSD can then be stated as choosing the appropriate sense from the sense inventory for a word in a given context.

    There are two main approaches to WSD based on machine learning: supervised and unsupervised [2].¹ Supervised learning relies on manually disambiguated examples of text snippets containing ambiguous words. An appropriate sense inventory has to be chosen in advance, at an early stage of the construction of a supervised WSD system. Some features are extracted from those text snippets (or contexts²) and classifiers are trained using this manually labeled data. Most of the time, supervised approaches are superior to unsupervised ones in terms of accuracy of automatic disambiguation when used on the same type of texts that the systems were trained on.

    ¹ There is a plethora of other approaches to WSD, e.g., based on translational equivalence or hand-written rules. We omit those for brevity. For an extensive overview of other methods see, e.g., [2], [4].

    Nevertheless, there is another issue connected with the problem of defining meaning, namely the creation of the other resources used by automatic WSD systems. This is especially evident in the creation of corpora³ manually annotated (tagged) with senses, which are used for training machine learning classifiers in a supervised setting. There are two important problems in manual sense tagging of a corpus: low interannotator agreement (IA) and the high cost of the annotation process. IA measures how much the annotations assigned by one annotator differ from the annotations assigned by another annotator. IA is used to estimate an upper bound on the performance of automatic WSD. Typically, it is not enough to give a value of percentage agreement, because agreements and disagreements may arise by chance. Cohen's $\kappa$ is widely used in the computational linguistics community for this purpose, but there are also other measures [5]. The cost of annotation is high, because a large effort is required during manual annotation. Mihalcea estimated that the construction of a corpus with a sufficient amount of data for supervised classification algorithms for 20 000 ambiguous words would require 80 man-years of work [6].
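    The chance correction mentioned above can be illustrated with a minimal sketch of Cohen's $\kappa$ for two annotators. The function name `cohens_kappa` and the toy sense labels are hypothetical and are not taken from the paper; the formula is the standard $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of both
    # annotators' relative label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[s] * freq_b[s] for s in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators always use the same single label
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy example (hypothetical data): 8 contexts, 2 senses.
annotator_1 = ["s1", "s1", "s2", "s2", "s1", "s2", "s1", "s1"]
annotator_2 = ["s1", "s1", "s2", "s1", "s1", "s2", "s2", "s1"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

    In this toy example the raw percentage agreement is 0.75, while $\kappa$ is noticeably lower, because a large part of the agreement is already expected from the skewed label frequencies.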

    On the other hand, unsupervised and semi-supervised algorithms can be used. The amount of manual labor required is much lower in learning without supervision. Unsupervised approaches to WSD tend to use unlabeled data and find sense distinctions automatically. Usually those methods involve some form of clustering. Harris' distributional hypothesis [7] can be used as a theoretical foundation for unsupervised methods of WSD. It states that the "meaning of entities (...) is related to the restrictions on combinations of these entities relative to other entities". In this context entities can be understood as words.

    The main goal of this work is to compare various clustering algorithms in the task of unsupervised Word Sense Disambiguation for Polish data. An unsupervised WSD system groups the contexts of a given word that express the same meaning, without providing explicit sense labels for each group (e.g., without using a dictionary) [8]. This work is also motivated by the fact that clustering is important for a semi-supervised WSD algorithm called Lexicographer Controlled Semi-automatic word Sense Disambiguation (LexCSD) [9], [10]. So far, the selection of the clustering algorithm used in LexCSD was motivated by the performance of the given algorithm in other tasks and by its analytical properties, because analyses of the performance of different clustering algorithms in similar settings (i.e., using a similar dataset and features) for Polish WSD are difficult to find.

    ² We will use the term context to denote a passage of text containing an ambiguous word.

    ³ Here we define a corpus as a collection of texts prepared for linguistic processing.

    Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 25–32

    ISBN 978-83-60810-27-9, ISSN 1896-7094

    978-83-60810-27-9/09/$25.00 © 2010 IEEE

  • There are a few differences when dealing with WSD data in comparison to classical applications of clustering. To name just a few: the distributions of classes (senses) are skewed⁴, data is represented in spaces with a very large number of dimensions (thousands or even hundreds of thousands), for some classes only very specific features, often shared among classes, are important, and sometimes it is difficult to distinguish between two close classes.

    The paper is organized as follows. First, the selected clustering algorithms are briefly described. The evaluation section starts with an analysis of the evaluation metrics used. Next, the corpus and experimental settings are described. Section III-D provides a discussion of the results. Section IV gives a summary of the performed experiments and outlines directions for further work.

    II. SELECTED ALGORITHMS FOR TESTING

    For this work we have selected a few classical clustering algorithms, trying to choose algorithms that represent different approaches to the problem of clustering. We started with the K-means and K-medoids algorithms, which represent simple, hard and flat clustering methods. We chose the Growing Hierarchical Self-Organising Map (GHSOM) as a representative of the family of clustering methods based on neural networks. GHSOM is also a hierarchical clustering algorithm. We experiment with standard hierarchical clustering algorithms with different criterion functions, from both the agglomerative and the divisive families of algorithms. Last but not least, we also test a graph-based clustering algorithm. We have reimplemented K-means, K-medoids and GHSOM, and use an existing implementation of the other algorithms [11].

    We focus on clustering for WSD, so we will use NLP-related terminology in the description of the algorithms. As the task of WSD is a contextual one, we cluster contexts (text snippets) containing an ambiguous word. Some real-valued features are extracted from each context, so a context is a vector of features $\vec{v}$ in a high-dimensional space. We will use the terms context and context vector interchangeably. The exact nature of the context and the feature extraction process are described in Sec. III-B.

    A. K-means and K-medoids

    K-means is one of the simplest clustering algorithms. K-means represents each cluster by the centre of mass of the contexts being clustered [12]. Those centres are called centroids. Initially, random contexts are chosen as centroids. Then each context is assigned to its most similar centroid. After this step, new centroids are computed as the mean of all the contexts in a group. This process is repeated until some stopping criterion is reached, e.g., the number of iterations reaches a predefined threshold or the clustering solution does not change significantly between subsequent iterations.

    ⁴ Not all senses are represented in the data equally; the distribution of senses is biased towards a few frequent senses.
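    The K-means loop described above can be sketched as follows. This is a minimal illustration and not the authors' reimplementation: the use of cosine similarity, the seeded random initialisation and the names `cosine` and `k_means` are assumptions made for the example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def k_means(contexts, k, max_iter=100, seed=0):
    """Return (assignment, centroids) for a list of context vectors."""
    rng = random.Random(seed)
    # Initially, random contexts are chosen as centroids.
    centroids = [list(c) for c in rng.sample(contexts, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every context to its most similar centroid.
        new_assignment = [
            max(range(k), key=lambda j: cosine(c, centroids[j]))
            for c in contexts
        ]
        # Stopping criterion: the solution no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its contexts.
        for j in range(k):
            members = [c for c, a in zip(contexts, assignment) if a == j]
            if members:  # an emptied cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy example (hypothetical data): two obvious groups of 2-D contexts.
toy_contexts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_assignment, toy_centroids = k_means(toy_contexts, k=2)
```

    The stopping rule used here is the stricter of the two mentioned in the text (no change at all between iterations), combined with an iteration cap.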

    K-medoids is similar in concept to the K-means algorithm. The most fundamental difference between the two algorithms is that K-medoids uses real contexts from the dataset (medoids) as cluster representatives, in contrast to the centroids used in K-means (which are artificial contexts). One realisation of K-medoids is an approach called Partitioning Around Medoids, or PAM [13]. PAM starts with a random selection of initial medoids. Then every swap of a medoid with a non-medoid context is tested for whether it decreases the cost of the whole clustering solution. This approach has its drawbacks in terms of computational complexity, i.e., $O(k(n-k)^2)$, where $n$ is the number of contexts to cluster and $k$ is the number of medoids. Thus a few extensions have been proposed that, e.g., employ sampling (CLARA) or randomized search (CLARANS) [13]. Nevertheless, we use classical PAM, as both mentioned algorithms can have a negative impact on quality in comparison to PAM. This approach is applicable in our experiments, as we use relatively small datasets.
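    A simplified sketch of the swap search in PAM is given below. It is an illustration under stated assumptions rather than the implementation used in the experiments: it works on a precomputed distance matrix, accepts the first improving swap instead of the best one, and the names `pam` and `toy_dist` are hypothetical.

```python
import random


def pam(dist, k, max_iter=100, seed=0):
    """Simplified PAM on an n x n distance matrix.

    Returns (medoid indices, assignment of every item to a medoid index).
    """
    n = len(dist)
    rng = random.Random(seed)
    # Start with a random selection of initial medoids.
    medoids = rng.sample(range(n), k)

    def cost(meds):
        # Cost of a solution: distance of every item to its closest medoid.
        return sum(min(dist[i][m] for m in meds) for i in range(n))

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        # Test swapping every medoid with every non-medoid item.
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = cost(trial)
                if trial_cost < best:
                    medoids, best, improved = trial, trial_cost, True
        if not improved:
            break
    assignment = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, assignment


# Toy example (hypothetical data): six points on a line, two clear groups.
points = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in points] for a in points]
toy_medoids, toy_labels = pam(toy_dist, k=2)
```

    Every candidate swap requires re-evaluating the whole solution, which is why the method is only practical for relatively small datasets.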

    B. Growing Hierarchical Self-Organizing Map

    The Growing Hierarchical Self-Organizing Map (GHSOM) [14] is a natural extension of Kohonen's idea of Self-Organizing Maps (SOM) [15]. A SOM is an artificial neural network consisting of many neurons. Every neuron has a weight vector. A SOM is trained in an unsupervised manner using the "winner takes most" strategy. Every feature vector is presented to the network input several times. For every input vector, its similarity to the weight vector of each neuron is computed. The weights of the most similar neuron (the winner) and of its neighbourhood are updated to become even more similar to the input pattern. The learning algorithm is constructed in such a way that both the neighbourhood and the degree of weight updating decrease over time.
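    A single "winner takes most" update can be sketched as follows. The sketch assumes a one-dimensional map, Euclidean distance as the (dis)similarity, a Gaussian neighbourhood and linear decay of the learning rate and radius; none of these choices, nor the name `som_step`, comes from the paper.

```python
import math


def som_step(weights, x, t, n_steps, lr0=0.5, radius0=1.0):
    """Update a 1-D SOM (list of weight vectors) in place for input x at step t.

    Returns the index of the winning neuron.
    """
    # Learning rate and neighbourhood radius both shrink over time.
    decay = 1.0 - t / n_steps
    lr = lr0 * decay
    radius = max(radius0 * decay, 1e-9)
    # Winner: the neuron whose weight vector is closest to the input.
    winner = min(range(len(weights)), key=lambda i: math.dist(weights[i], x))
    for i, w in enumerate(weights):
        # Gaussian neighbourhood on the map grid: the winner moves most,
        # its neighbours move less.
        h = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
        weights[i] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return winner


# Toy example (hypothetical data): three neurons, one input vector.
toy_weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
toy_input = [1.0, 1.2]
toy_winner = som_step(toy_weights, toy_input, t=0, n_steps=10)
```

    Training a full map repeats this step for every input vector over several passes, with `t` increasing so that later updates are smaller and more local.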

    GHSOM addresses one of the most important drawbacks of SOM: the a priori definition of the map structure. Rauber et al. proposed an algorithm for growing a SOM both in terms of the number of map neurons and of the hierarchy [14]. After the training stage of the SOM, the mean quantization error of every neuron $i$ ($mqe_i$) is calculated as the average distance of every context recognised by neuron $i$ to its weight vector. The average $MQE_j$ for the whole map on level $j$ is computed, too. If $MQE_j \geq \tau_1 \cdot MQE_{j-1}$, then an additional row or column of neurons is added to the map and the training stage is repeated. Otherwise, the $mqe_i$ of every neuron is compared to $MQE_j$. If $mqe_i \geq \tau_2 \cdot MQE_{j-1}$, then another layer of the map is created for the contexts recognised by neuron $i$.


  • C. Agglomerative and Divisive Clustering

    Agglomerative and divisive clustering algorithms produce hierarchical clustering trees called dendrograms. Agglomerative clustering starts with each context in a separate cluster; then, in each step, the two clusters that maximise the criterion function are merged. Divisive algorithms, on the other hand, start with all contexts in one cluster, which is repeatedly bisected according to the criterion function. We use the existing implementation of hierarchical algorithms from CLUTO⁵ [11]. We use the rbr variant of the divisive algorithm, i.e., standard bisecting clustering is employed and the result is further optimized according to the criterion function [16].

    The criterion function is a very important aspect of both agglomerative and divisive clustering algorithms, as it drives the whole process. There are many criterion functions available [17]. We have tested the standard criterion functions used with agglomerative algorithms, i.e., single link (slink), complete link (clink), average link (upgma) and weighted variants of single (wslink), complete (wclink) and average link (wupgma).

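    The second group of criterion functions, including $I_1$, $I_2$, $\mathcal{E}_1$, $G_1$, $G_1'$, $H_1$ and $H_2$, can be used with both agglomerative and divisive algorithms. The exact form of those functions is given by [11]:

    $$I_1 = \text{maximize} \sum_{i=1}^{k} \frac{1}{n_i} \Big( \sum_{\vec{v},\vec{u} \in S_i} \mathrm{sim}(\vec{v},\vec{u}) \Big) \tag{1}$$

    $$I_2 = \text{maximize} \sum_{i=1}^{k} \sqrt{\sum_{\vec{v},\vec{u} \in S_i} \mathrm{sim}(\vec{v},\vec{u})} \tag{2}$$

    $$\mathcal{E}_1 = \text{minimize} \sum_{i=1}^{k} n_i \frac{\sum_{\vec{v} \in S_i, \vec{u} \in S} \mathrm{sim}(\vec{v},\vec{u})}{\sqrt{\sum_{\vec{v},\vec{u} \in S_i} \mathrm{sim}(\vec{v},\vec{u})}} \tag{3}$$

    $$G_1 = \text{minimize} \sum_{i=1}^{k} \frac{\sum_{\vec{v} \in S_i, \vec{u} \in S} \mathrm{sim}(\vec{v},\vec{u})}{\sum_{\vec{v},\vec{u} \in S_i} \mathrm{sim}(\vec{v},\vec{u})} \tag{4}$$

    $$G_1' = \text{minimize} \sum_{i=1}^{k} n_i^2 \frac{\sum_{\vec{v} \in S_i, \vec{u} \in S} \mathrm{sim}(\vec{v},\vec{u})}{\sum_{\vec{v},\vec{u} \in S_i} \mathrm{sim}(\vec{v},\vec{u})} \tag{5}$$

    $$H_1 = \text{maximize} \frac{I_1}{\mathcal{E}_1} \tag{6}$$

    $$H_2 = \text{maximize} \frac{I_2}{\mathcal{E}_1} \tag{7}$$

    where $k$ is the total number of clusters, $S$ is the set of all contexts to cluster, $S_i$ is the set of contexts assigned to the $i$-th cluster, $n_i = |S_i|$, and $\mathrm{sim}(\vec{v}, \vec{u})$ is the similarity between two context vectors $\vec{v}$ and $\vec{u}$.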
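    To make the internal criterion functions more concrete, the sketch below evaluates $I_1$ and $I_2$ for a fixed clustering. It assumes cosine similarity, sums over all ordered pairs within a cluster (including each context with itself), and follows the CLUTO definition of $I_2$ with a square root; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def criterion_i1_i2(contexts, assignment, k):
    """Return the values of the I1 and I2 criteria for a given clustering."""
    i1 = i2 = 0.0
    for i in range(k):
        cluster = [c for c, a in zip(contexts, assignment) if a == i]
        if not cluster:
            continue
        # Sum of pairwise similarities inside cluster i.
        within = sum(cosine(v, u) for v in cluster for u in cluster)
        i1 += within / len(cluster)
        i2 += math.sqrt(within)
    return i1, i2


# Toy example (hypothetical data): two identical contexts and one orthogonal.
toy_contexts = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
good = criterion_i1_i2(toy_contexts, [0, 0, 1], k=2)
bad = criterion_i1_i2(toy_contexts, [0, 1, 1], k=2)
```

    Both criteria are higher for the clustering that keeps the two identical contexts together, which is the behaviour a maximising algorithm exploits when choosing merges or bisections.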

    D. Graph Partitioning Based Clustering

    We use an implementation of the min-cut graph partitioning algorithm from CLUTO [11]. This algorithm starts with the creation of a neighbourhood graph based on similarities between contexts and then applies min cut to partition the graph into disjoint regions. Min cut looks for a partition in which the total weight of the edges crossing between regions is minimal.

    ⁵ CLUTO is a free software package implementing several clustering algorithms, including partitioning, agglomerative and graph-based ones. Available at: http://glaros.dtc.umn.edu/gkhome/views/cluto/

    This approach achieved high quality in research on semi-automatic extension of the Polish WordNet [18] and was also used in Polish WSD in a weakly supervised setting using the LexCSD algorithm [10].

    III. EXPERIMENTS

    A. Evaluation Measures

    Evaluation of clustering algorithms can be done in many ways [19]. Some of them are based on external criteria, i.e., the comparison of the resulting clustering solution with some pre-existing categories that were created manually. On the other hand, one can use internal criteria without resorting to a gold standard clustering. The most important drawback of evaluation using internal criteria is that a good score does not always correspond to good clustering results in a given application [20]. As we have developed a semantically annotated corpus (SCWSD, see Sec. III-B), we can use it for evaluation. The problem with SCWSD is its small size, so there is a risk that SCWSD does not capture all of the peculiarities and biases of large corpora.⁶

    We used several measures for evaluation to capture different aspects of the created groups. To measure how homogeneous the clusters are, we used Purity:

    $$\mathrm{Purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j| \tag{8}$$

    where $\Omega = \{\omega_1, \omega_2, \ldots, \omega_k\}$ is the set of clusters, $C = \{c_1, c_2, \ldots, c_j\}$ is the set of pre-existing categories and $N$ is the number of clustered contexts. In our setting each category in $C$ is the set of contexts in which the ambiguous word is annotated with the same sense. $\mathrm{Purity}(\Omega, C) \in [0, 1]$, where 1 is the best case. A drawback of Purity is its preference for solutions with a large number of groups. Assigning every context to a singleton cluster gives a Purity of 1 [20].
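    Purity can be computed directly from two parallel label lists, as in the following minimal sketch. The function name `purity` and the toy labels are hypothetical; the computation counts, for every cluster, the size of its most frequent gold category and divides the total by the number of contexts.

```python
from collections import Counter


def purity(clusters, categories):
    """Purity of a clustering.

    clusters, categories: equal-length lists holding, for every context,
    its cluster id and its gold sense label.
    """
    if len(clusters) != len(categories) or not clusters:
        raise ValueError("need two non-empty lists of equal length")
    per_cluster = {}
    for cluster_id, category in zip(clusters, categories):
        per_cluster.setdefault(cluster_id, Counter())[category] += 1
    # For each cluster take the size of its most frequent category.
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(clusters)


# Toy example (hypothetical data): six contexts, two senses, two clusters.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_senses = ["a", "a", "b", "b", "b", "a"]
toy_purity = purity(toy_clusters, toy_senses)
# Singleton clusters always reach the maximum score.
singleton_purity = purity(list(range(6)), toy_senses)
```

    The second call illustrates the drawback mentioned above: putting every context in its own cluster yields the maximal score although the solution is useless.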

    The Rand Index measures accuracy on the basis of the decisions made for all pairs of contexts. If we use TP for true positives, TN for true negatives, FN for false negatives and FP for false positives, the Rand Index is given by the following equation:

    $$RI = \frac{TP + TN}{TP + FP + FN + TN} \tag{9}$$
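    The pair-based counts behind the Rand Index can be sketched as follows. The sketch uses the usual pairwise convention, which is an assumption here since the text does not spell it out: a pair of contexts placed in the same cluster is a true positive if both contexts share a sense and a false positive otherwise, while a pair split across clusters is a false negative if both contexts share a sense and a true negative otherwise. The function names are hypothetical.

```python
from itertools import combinations


def pair_counts(clusters, categories):
    """Return (TP, FP, FN, TN) counted over all pairs of contexts."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_sense = categories[i] == categories[j]
        if same_cluster and same_sense:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_sense:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(clusters, categories):
    """Fraction of context pairs on which clustering and gold labels agree."""
    tp, fp, fn, tn = pair_counts(clusters, categories)
    return (tp + tn) / (tp + fp + fn + tn)


# Toy example (hypothetical data): six contexts, two senses, two clusters.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_senses = ["a", "a", "b", "b", "b", "a"]
toy_counts = pair_counts(toy_clusters, toy_senses)
toy_ri = rand_index(toy_clusters, toy_senses)
```

    The same four counts are sufficient to derive pairwise precision and recall.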

    One of the drawbacks of using $RI$ for evaluation is the equal treatment of false positives and false negatives. Using the decisions for context pairs, we can also use standard measures of information retrieval, i.e., precision $P$, recall $R$ and the harmonic mean of precision and recall $F$:

    6On the other hand, the total size of the dat