Proceedings of the International Multiconference on Computer
Science and Information Technology
Volume 5 (2010)
M. Ganzha, M. Paprzycki (editors)
ISSN 1896-7094
Polskie Towarzystwo Informatyczne, Oddział Górnośląski, ul. Raciborska 3, 40-074 Katowice
ISBN 978-83-60810-27-9
IEEE Computer Society Press, 10662 Los Vaqueros Circle, Los Alamitos, CA 90720, USA
TEXnical editor: Aleksandr Denisiuk
-
Proceedings of the International Multiconference on Computer
Science and Information Technology
October 18-20, 2010. Wisła, Poland
Volume 5 (2010)
-
Dear Reader, it is our pleasure to present to you the Proceedings of the 2010 International Multiconference on Computer Science and Information Technology (IMCSIT), which took place in Wisła, Poland, on October 18-20, 2010. IMCSIT 2010 was co-located with the XXVI Autumn Meeting of the Polish Information Processing Society (PIPS).
IMCSIT is the result of an evolutionary process. In 2005 a Scientific Session took place during the XXI Autumn Meeting of PIPS and consisted of 27 refereed presentations. After this relative success (we had advertised the Session very late in the year), we decided to expand it into a full-blown conference while continuing the cooperation (co-location) with the Autumn Meetings of PIPS. As a result of steady growth, in 2010 IMCSIT consisted of the following events (the Proceedings are organized into sections that correspond to each of them):
5th International Symposium Advances in Artificial Intelligence and Applications (AAIA'10),
Workshop on Agent Based Computing: from Model to Implementation VII (ABC:MI'10),
International Workshop on Advances in Business ICT (ABICT'10),
Computer Aspects of Numerical Algorithms (CANA'10),
Computational Linguistics - Applications (CLA'10),
10th International Multidisciplinary Conference on e-Commerce and e-Government (ECOM&EGOV'10),
International Symposium on E-Learning - Applications (EL-A'10),
6th Workshop on Large Scale Computations on Grids and 1st Workshop on Scalable Computing in Distributed Systems (LaSCoG-SCoDiS'10),
2nd International Workshop on Medical Informatics and Engineering (MI&E'10),
3rd International Symposium on Multimedia - Applications and Processing (MMAP'10),
International Workshop on Real Time Software (RTS'10),
4th International Workshop on Secure Information Systems (SIS'10),
International Symposium on Technologies for Social Advancement (T4SA'10),
Workshop on Ad-Hoc Wireless Networks (WAHOC'10),
Workshop on Computational Optimization (WCO'10).
Each of these events had its own Organizing and Program Committee (listed in these Proceedings). We would like to express our warmest gratitude to the members of all of them for their hard work in attracting and later refereeing 201 submissions.
Maria Ganzha, Conference Chair, Systems Research Institute Polish Academy of Sciences, Warsaw, Poland, and Gdańsk University, Gdańsk, Poland.
Marcin Paprzycki, Systems Research Institute Polish Academy of Sciences, Warsaw and Management Academy, Warsaw, Poland.
-
TABLE OF CONTENTS
5th International Symposium Advances in Artificial Intelligence and Applications:
Call For Papers 1
A Breast Cancer Classifier based on a Combination of Case-Based Reasoning and Ontology Approach 3
Essam AbdRabou, AbdEl-Badeeh Salem
Using data mining for assessing diagnosis of breast cancer 11
Medhat Mohamed Ahmed Abdelaal, Muhamed Wael Farouq, Hala Abou Sena, Abdel-Badeeh Mohamed Salem
Advanced scale-space invariant, low detailed feature recognition from images - car brand recognition 19
Štefan Badura, Stanislav Foltán
Evaluation of Clustering Algorithms for Polish Word Sense Disambiguation 25
Bartosz Broda, Wojciech Mazur
Generation of First-Order Expressions from a Broad Coverage HPSG Grammar 33
Ravi Coote, Andreas Wotzlaw
PSO based modeling of Takagi-Sugeno fuzzy motion controller for dynamic object tracking with mobile platform 37
Meenakshi Gupta, Laxmidhar Behera, Venkatesh K.S.
Hierarchical Object Categorization with Automatic Feature Selection 45
Md. Saiful Islam, Andrzej Sluzek
Selecting the best strategy in a software certification process 53
Waldemar Koczkodaj, Vova Babiy, Agnieszka D. Bogobowicz, Ryszard Janicki, Alan Wassyng
Extrapolation of Non-Deterministic Processes Based on Conditional Relations 59
Juliusz Kulikowski
Reasoning in RDF graphic formal system with quantifiers 67
Alena Lukasova, Marek Vajgl, Martin Žáček
Coevolutionary Algorithm For Rule Induction 73
Pawel Myszkowski
Evolutionary Algorithm in Forex trade strategy generation 81
Pawel Myszkowski, Adam Bicz
Emotion-based Image Retrieval - an Artificial Neural Network Approach 89
Katarzyna Agnieszka Olkiewicz, Urszula Markowska-Kaczmar
Automatic Visual Class Formation using Image Fragment Matching 97
Mariusz Paradowski, Andrzej Śluzek
Learning taxonomic relations from a set of text documents 105
Mari-Sanna Paukkeri, Alberto Perez Garcia-Plaza, Sini Pessala, Timo Honkela
Metric properties of populations in artificial immune systems 113
Zbigniew Pliszka, Olgierd Unold
The development features of the face recognition system 121
Rauf Sadykhov, Igor Frolov
Multiscale Segmentation Based On Mode-Shift Clustering 129
Wojciech Tarnawski, Lukasz Miroslaw, Roman Pawlikowski, Krzysztof Ociepa
Relational database as a source of ontology creation 135
Zdenka Telnarova
Emotional Speech Analysis using Artificial Neural Networks 141
Jana Tuckova, Martin Sramka
Usage of reflection in .NET to inference of knowledge base 149
Marek Vajgl
On the evaluation of the linguistic summarization of temporally focused time series using a measure of informativeness 155
Anna Wilbik, Janusz Kacprzyk
Workshop on Agent Based Computing: from Model to Implementation VII:
Call For Papers 163
Java-based Mobile Agent Platforms for Wireless Sensor Networks 165
Francesco Aiello, Alessio Carbone, Giancarlo Fortino, Stefano Galzarano
BeesyBees - Efficient and Reliable Execution of Service-based Workflow Applications for BeesyCluster using Distributed Agents 173
Paweł Czarnul, Mariusz Matuszek, Michał Wójcik, Karol Zalewski
A Technique based on Recursive Hierarchical State Machines for Application-level Capture of Agent Execution State 181
Giancarlo Fortino, Francesco Rango
Reorganization in Massive Multiagent Systems 189
Henry Hexmoor
Effectiveness of Solving Traveling Salesman Problem Using Ant Colony Optimization on Distributed Multi-Agent Middleware 197
Sorin Ilie, Costin Badica
Selected Security Aspects of Agent-based Computing 205
Mariusz Matuszek, Piotr Szpryngier
Agent-Oriented Modelling for Simulation of Complex Environments 209
Inna Shvartsman, Kuldar Taveter, Merle Parmak, Merik Meriste
Improving Fault-Tolerance of Distributed Multi-Agent Systems with Mobile Network-Management Agents 217
Dejan Mitrović, Zoran Budimac, Mirjana Ivanović, Milan Vidaković
Argumentative agents 223
Francesca Toni
An agent based planner for including network QoS in scientific workflows 231
Zhiming Zhao, Paola Grosso, Ralph Koning, Jeroen van der Ham, Cees de Laat
International Workshop on Advances in Business ICT:
Call For Papers 239
A method for consolidating application landscapes during the post-merger-integration phase 241
Andreas Freitag, Florian Matthes, Christopher Schulz
Hybridization of Temporal Knowledge for Economic Environment Analysis 249
Maria Antonina Mach
Independent Operator of Measurements as a Virtual Enterprise on the Energy Market 255
Bożena Ewa Matusiak
A Two-level algorithm of time series change detection based on a unique deviations similarity method 259
Tomasz Pełech-Pilichowski, Jan T. Duda
STRATEGOS: A case-based approach to strategy making in SME 265
Jerzy Surma
Support of the E-business by business intelligence tools and data quality improvement 271
Milena Tvrdíková, Ondřej Koubek
Computer Aspects of Numerical Algorithms:
Call For Papers 279
The experimental analysis of GMRES convergence for solution of Markov chains 281
Beata Bylina, Jarosław Bylina
On the Numerical Analysis of Stochastic Lotka-Volterra Models 289
Tugrul Dayar, Linar Mikeev, Verena Wolf
Finite Element Approximate Inverse Preconditioning using POSIX threads on multicore systems 297
George A. Gravvanis, P. I. Matskanidis, K. M. Giannoutakis, E. A. Lipitakis
On the implementation of public keys algorithms based on algebraic graphs over finite commutative rings 303
Michał Klisowski, Vasyl Ustimenko
Analysis of Pseudo-Random Properties of Nonlinear Congruential Generators with Power of Two Modulus by Numerical Computing of the b-adic Diaphony 309
Ivan Lirkov, Stanislava Stoilova
Assembling Recursively Stored Sparse Matrices 317
Michele Martone, Salvatore Filippone, Marcin Paprzycki, Salvatore Tucci
Use of Hybrid Recursive CSR/COO Data Structures in Sparse Matrices-Vector Multiplication 327
Michele Martone, Salvatore Filippone, Paweł Gepner, Marcin Paprzycki, Salvatore Tucci
Higher order FEM numerical integration on GPUs with OpenCL 337
Przemysław Płaszewski, Krzysztof Banaś, Paweł Macioł
Parallelization of SVD of a Matrix-Systolic Approach 343
Halil Snopce, Ilir Spahiu
Solving a Kind of BVP for ODEs on heterogeneous CPU + CUDA-enabled GPU Systems 349
Przemyslaw Stpiczynski, Joanna Potiopa
Computational Linguistics - Applications:
Call For Papers 355
Using Self Organizing Map to Cluster Arabic Crime Documents 357
Meshrif Alruily, Aladdin Ayesh, Abdulsamad Al-Marghilani
Quality Benchmarking Relational Databases and Lucene in the TREC4 Adhoc Task Environment 365
Ahmet Arslan, Ozgur Yilmazel
Parallel, Massive Processing in SuperMatrix - a General Tool for Distributional Semantic Analysis of Corpus 373
Bartosz Broda, Damian Jaworski, Maciej Piasecki
Development of a Voice Control Interface for Navigating Robots and Evaluation in Outdoor Environments 381
Ravi Coote
The Role of the Newly Introduced Word Types in the Translations of Novels 389
Maria Csernoch
SyMGiza++: A Tool for Parallel Computation of Symmetrized Word Alignment Models 397
Marcin Junczys-Dowmunt, Arkadiusz Szał
Semi-Automatic Extension of Morphological Lexica 403
Tobias Kaufmann, Beat Pfister
Automatic Extraction of Arabic Multi-Word Terms 411
Khalid Al Khatib, Amer Badarneh
"Beautiful picture of an ugly place". Exploring photo collections using opinion and sentiment analysis of user comments 419
Slava Kisilevich, Christian Rohrdantz, Daniel Keim
LEXiTRON-Pro Editor: An Integrated Tool for developing Thai Pronunciation Dictionary 429
Supon Klaithin, Patcharika Chootrakool, Krit Kosawat
Automatic Detection of Prominent Words in Russian Speech 435
Daniil Kocharov
Computing trees of named word usages from a crowdsourced lexical network 439
Mathieu Lafourcade, Alain Joubert
RefGen: a Tool for Reference Chains Identification 447
Laurence Longo, Amalia Todirascu
Is Shallow Semantic Analysis Really That Shallow? A Study on Improving Text Classification Performance 455
Przemysław Maciołek, Grzegorz Dobrowolski
PerGram: A TRALE Implementation of an HPSG Fragment of Persian 461
Stefan Müller, Masood Ghayoomi
WordnetLoom: a Graph-based Visual Wordnet Development Framework 469
Maciej Piasecki, Michał Marcińczuk, Adam Musiał, Radosław Ramocki, Marek Maziarz
Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata 477
Tommi Pirinen, Krister Lindén
The Polish Cyc lexicon as a bridge between Polish language and the Semantic Web 485
Aleksander Pohl
Tools for syntactic concordancing 493
Violeta Seretan, Eric Wehrli
Effective natural language parsing with probabilistic grammars 501
Paweł Skórzewski
Finding Patterns in Strings using Suffixarrays 505
Herman Stehouwer, Menno Van Zaanen
Entity Summarisation with Limited Edge Budget on Knowledge Graphs 513
Marcin Sydow, Mariusz Pikuła, Ralf Schenkel, Adam Siemion
Multiple Noun Expression Analysis: An Implementation of Ontological Semantic Technology 517
Julia Taylor, Victor Raskin, Maxim Petrenko, Christian F. Hempelmann
A web-based translation service at the UOC based on Apertium 525
Luis Villarejo, Mireia Farrus, Gema Ramírez, Sergio Ortiz
Tools and Methodologies for Annotating Syntax and Named Entities in the National Corpus of Polish 531
Jakub Waszczuk, Katarzyna Głowińska, Agata Savary, Adam Przepiórkowski
TREF - TRanslation Enhancement Framework for Japanese-English 541
Bartholomäus Wloka, Werner Winiwarter
Matura Evaluation Experiment Based on Human Evaluation of Machine Translation 547
Aleksandra Wojak, Filip Graliński
German subordinate clause word order in dialogue-based CALL 553
Magdalena Wolska, Sabrina Wilske
Polish Phones Statistics 561
Bartosz Ziolko, Jakub Galka
APyCA: Towards the Automatic Subtitling of Television Content in Spanish 567
Aitor Álvarez, Arantza del Pozo, Andoni Arruti
10th International Multidisciplinary Conference on e-Commerce and e-Government:
Call For Papers 575
Trusted Data in IBM's MDM: Accuracy Dimension 577
Przemyslaw Pawluk
Multicriteria Evaluation of DVB-RCS Satellite Internet Performance Used for e-Government and e-Learning Purposes 585
Andrzej M. J. Skulimowski
INFOMAT-E - public information system for people with sight and hearing dysfunctions 593
Michał Socha, Wojciech Górka, Adam Piasecki, Beata Sitek
Bidirectional voting and continuous voting concepts as possible impact of Internet use on democratic voting process 599
Jacek Wachowicz
The Double Jeopardy Phenomenon and the Electronic Distribution of Information 605
Urszula Świerczyńska-Kaczor, Artur Borcuch, Paweł Kossecki
International Symposium on E-Learning - Applications:
Call For Papers 609
Simple Blog Searching Framework Based on Social Network Analysis 611
Iwona Dolińska
6th Workshop on Large Scale Computations on Grids and 1st Workshop on Scalable Computing in Distributed Systems:
Call For Papers 619
Exploratory Programming in the Virtual Laboratory 621
Eryk Ciepiela, Daniel Harlak, Joanna Kocot, Tomasz Bartyński, Maciej Malawski, Tomasz Gubała
Modelling, Optimization and Execution of Workflow Applications with Data Distribution, Service Selection and Budget Constraints in BeesyCluster 629
Paweł Czarnul
Multi-level Parallelization with Parallel Computational Services in BeesyCluster 637
Paweł Czarnul
Managing large datasets with iRODS - a performance analyses 647
Denis Hünich, Ralph Müller-Pfefferkorn
Service level agreements for job control in high-performance computing 655
Roland Kübert, Stefan Wesner
A Modeling Language Approach for the Abstraction of the Berkeley Open Infrastructure for Network Computing (BOINC) Framework 663
Christian Benjamin Ries, Thomas Hilbig, Christian Schröder
Degisco Green Methodologies in Desktop Grids 671
Bernhard Schott, Ad Emmen
Resource Fabrics: the next level of grids and clouds 677
Lutz Schubert, Matthias Assel, Stefan Wesner
2nd International Workshop on Medical Informatics and Engineering:
Call For Papers 685
Agile methodology and development of software for users with specific disorders 687
Rostislav Fojtik
3rd International Symposium on Multimedia - Applications and Processing:
Call For Papers 693
An Hypergraph Object Oriented Model for Image Segmentation and Annotation 695
Eugen Ganea, Marius Brezovan
Classification of Image Regions Using the Wavelet Standard Deviation Descriptor 703
Sönke Greve, Marcin Grzegorzek, Carsten Saathoff, Dietrich Paulus
High Capacity Colored Two Dimensional Codes 709
Antonio Grillo, Alessandro Lentini, Marco Querini, Giuseppe F. Italiano
Region-based Measures for Evaluation of Color Image Segmentation 717
Andreea Iancu, Bogdan Popescu, Marius Brezovan, Eugen Ganea
Undetectable Spread-time Stegosystem Based on Noisy Channels 723
Valery Korzhik, Guillermo Morales-Luna, Ksenia Loban, Irina Marakova-Begoc
Building Personalized Interfaces by Data Mining Integration 729
Marian Cristian Mihaescu
A Graphical Interface for Evaluating Three Graph-Based Image Segmentation 735
Gabriel Mihai, Alina Doringa, Liana Stanescu
Basic Consideration of MPEG-2 Coded File Entropy and Lossless Re-encoding 741
Kazuo Ohzeki, Yuanyu Wei, Eizaburo Iwata, Ulrich Speidel
Analyzes of the processing performances of a Multimedia Database 749
Cosmin Stoica Spahiu
Constructive Volume Modeling 755
Mihai Tudorache, Mihai Popescu, Razvan Tanasie
Real-Time Embedded Fault Detection Estimators in a Satellite's Reaction Wheels 759
Nicolae Tudoroiu, Eshan Sobhani-Tehrani, Kash Khorasani, Tiberiu Letia, Roxana-Elena Tudoroiu
Application of optimal settings of the LMS adaptive filter for speech signal processing 767
Jan Vaňuš, Vítězslav Stýskala
Obfuscation Methods with Controlled Calculation Amounts and Table Function 775
Yuanyu Wei, Kazuo Ohzeki
International Workshop on Real Time Software:
Call For Papers 781
Computationally effective algorithms for 6DoF INS used for miniature UAVs 783
Jan Floder
Supervisory control and real-time constraints 791
Wojciech Grega
Integration of Scheduling Analysis into UML Based Development Processes Through Model Transformation 797
Matthias Hagner, Ursula Goltz
Laboratory real-time systems to facilitate automatic control education and research 805
Krzysztof Kołek, Andrzej Turnau, Krystyn Hajduk, Paweł Piątek, Mariusz Pauluk, Dariusz Marchewka, Adam Piłat, Maciej Rosół, Przemysław Gorczyca
Methods of Computer-Assisted Manual Control of Wheeled Robots 813
Viktor Michna, Petr Wagner, Jiri Kotzian
Software and hardware in the loop component for an IEC 61850 Co-Simulation platform 817
Haffar Mohamad, Thiriet Jean Marc
Real-time controller design based on NI Compact-RIO 825
Maciej Rosół, Adam Piłat, Andrzej Turnau
Intelligent Car Control and Recognition Embedded System 831
Vilem Srovnal Jr., Zdenek Machacek, Radim Hercik, Roman Slaby, Vilem Srovnal
4th International Workshop on Secure Information Systems:
Call For Papers 837
A Security Model for Personal Information Security Management Based on Partial Approximative Set Theory 839
Zoltán Csajbók
Social Engineering-Based Attacks - Model and New Zealand Perspective 847
Lech Janczewski, Lingyan (Ren) Fu
International Symposium on Technologies for Social Advancement:
Global Mobile Applications For Monitoring Health 855
Tapsie Giridher Giridher, Anita Wasilewska, Jennifer Wong
A Study on the Expectations and Actual Satisfaction about Mobile Handset before and after Purchase 861
JIBum Jung, Seungpyo Hong
Workshop on Ad-Hoc Wireless Networks:
Call For Papers 867
Wireless Transceiver for Control of Mobile Embedded Devices 869
Jan Kordas, Petr Wagner, Jiri Kotzian
Efficient Coloring of Wireless Ad Hoc Networks With Diminished Transmitter Power 873
Krzysztof Krzywdziński
Fast Construction of Broadcast Scheduling and Gossiping in Dynamic Ad Hoc Networks 879
Krzysztof Krzywdziński
Workshop on Computational Optimization:
Call For Papers 885
ACO with semi-random start applied on MKP 887
Stefka Fidanova, Pencho Marinov, Krassimir Atanassov
On the Probabilistic min spanning tree problem 893
Boria Nicolas, Murat Cécile, Paschos Vangelis
Efficient Portfolio Optimization with Conditional Value at Risk 901
Wlodzimierz Ogryczak, Tomasz Sliwinski
Enhanced Competitive Differential Evolution for Constrained Optimization 909
Josef Tvrdik, Radka Polakova
The AAIA'10 will bring together researchers, developers, practitioners, and users to present their latest research, results, and ideas in all areas of artificial intelligence. We hope that the theory and successful applications presented at the AAIA'10 will be of interest to researchers and practitioners who want to know about both theoretical advances and the latest applied developments in Artificial Intelligence. As such, AAIA'10 will provide a forum for the exchange of ideas between theoreticians and practitioners to address the important issues.
Papers related to theories, methodologies, and applications in science and technology in this theme are especially solicited. Topics covering industrial issues/applications and academic research include, but are not limited to:
Knowledge management
Decision Support System
Approximate Reasoning
Fuzzy modeling and control
Data Mining
Web Mining
Machine learning
Combining multiple knowledge sources in an integrated intelligent system
Neural Networks
Evolutionary Computation
Artificial Immune Systems
Ant Systems in Applications
Natural Language processing
Image processing and understanding (interpretation)
Applications in Bioinformatics
Hybrid Intelligent Systems
Granular Computing
Architectures of intelligent systems
Robotics
Real-world applications of Intelligent Systems
INTERNATIONAL PROGRAMME COMMITTEE
Janos Abonyi, University of Pannonia, Hungary
Hans Jorgen Andersen, Aalborg University, Denmark
Anna Bartkowiak, Wroclaw University, Poland
Shlomo Berkovsky, CSIRO, Australia
Ryszard Choras, Institute of Telecommunications, Poland
Krzysztof Cios, Virginia Commonwealth University, USA
Alfredo Cuzzocrea, University of Calabria, Italy
Claudio De Stefano, University of Cassino, Italy
Jeremiah Da Deng, University of Otago, New Zealand
Krzysztof Goczyla, Gdansk University of Technology, Poland
Amr Goneid, Computer Science Dept., American University in Cairo, Egypt
Min Henderson, University of Virginia, USA
Zdzislaw Hippe, University of Information Technology and Management in Rzeszow, Poland
Elzbieta Hudyma, Wroclaw University of Technology, Poland
Jerzy W. Jaromczyk, University of Kentucky, USA
Piotr Jedrzejowicz, Gdynia Maritime University, Poland
Jerzy Jozefczyk, Wroclaw University of Technology, Poland
Janusz Kacprzyk, Systems Research Institute of the Polish Academy of Sciences, Poland
Radosław Katarzyniak, Wrocław University of Technology, Poland
Przemyslaw Kazienko, Wroclaw University of Technology, Poland
Vojislav Kecman, Virginia Commonwealth University, USA
Etienne Kerre, University of Gent, Belgium
Jacek Kluska, Rzeszow University of Technology, Poland
Yiannis Kompatsiaris, Informatics and Telematics Institute, Greece
Jozef Korbicz, University of Zielona Gora, Poland
Jerzy Korczak, Wroclaw University of Economics, Poland
Witold Kosinski, Polish-Japanese Institute of Information Technology, Poland
Adam Krzyzak, Concordia University, Canada
Juliusz Lech Kulikowski, Institute of Computer Science of the Polish Academy of Sciences, Poland
Lukasz Kurgan, University of Alberta, Canada
Halina Kwasnicka, Wroclaw University of Technology, Poland
Serguei Levachkine, National Polytechnic Institute, Mexico
Rory Lewis, University of Colorado at Colorado Springs, USA
Joo-Hwee Lim, Institute for Infocomm Research, A*STAR, Singapore
Jie Lu, University of Technology Sydney, Australia
Abdel-Badeeh M. Salem, Ain Shams University, Egypt
Jacek Mandziuk, Warsaw University of Technology, Poland
Urszula Markowska-Kaczmar, Wroclaw University of Technology, Poland
Zbigniew Michalewicz, University of Adelaide, Australia
Santiago M. Mola, Universidad Politécnica de Valencia, Spain
Pawel Myszkowski, Wroclaw University of Technology, Poland
Tapio Pahikkala, University of Turku, Finland
5th International Symposium Advances in Artificial Intelligence and Applications
CELEBRATING 75TH BIRTHDAY OF PROFESSOR LEONARD BOLC
Mariusz Paradowski, Wroclaw University of Technology, Poland
Witold Pedrycz, University of Alberta, Canada
James Peters, University of Manitoba, Canada
Sheela Ramanna, University of Winnipeg, Canada
Zbigniew Ras, University of North Carolina, USA
Paolo Rosso, Universidad Politécnica Valencia, Spain
Gunter Saake, Otto-von-Guericke-Universität, Germany
Jerzy Sas, Wroclaw University of Technology, Poland
Christelle Scharff, Pace University, USA
Roman Slowinski, Poznan University of Technology, Poland
Andrzej Sluzek, Nanyang Technological University, Singapore
Janusz Sobecki, Wroclaw University of Technology, Poland
Siergey Subbotin, Zaporozhye National Technical University, Ukraine
Jerzy Swiatek, Wroclaw University of Technology, Poland
Piotr Szczepaniak, Technical University of Lodz, Poland
Stan Szpakowicz, SITE, University of Ottawa, Canada
Ryszard Tadeusiewicz, AGH University of Science and Technology, Poland
Li-Shiang Tsay, North Carolina A&T State University, USA
Josef Tvrdik, University of Ostrava, Czech Republic
Angelina Tzacheva, Univ. of South Carolina, USA
Anita Wasilewska, Stony Brook University, NY, USA
Daniela Zaharie, West University of Timisoara, Romania
Wojciech Ziarko, University of Regina, Canada
ORGANIZING COMMITTEE
Halina Kwasnicka, Urszula Markowska-Kaczmar, Wrocław University of Technology, Poland
-
A Breast Cancer Classifier based on a Combination
of Case-Based Reasoning and Ontology Approach
Essam Amin M.Lotfy Abdrabou
Ph.D Candidate
Faculty of Computer and Information Sciences
Ain Shams University, Abbassia, 11566, Cairo, EGYPT
(+202) 26330636
Email: [email protected]
AbdEl-Badeeh M. Salem
Professor
Faculty of Computer and Information Sciences
Ain Shams University, Abbassia, 11566, Cairo, EGYPT
(+202) 26844284
Email: [email protected]
Abstract: Breast cancer is the second most common form of cancer amongst females and also the fifth most common cause of cancer deaths worldwide. In the case of this particular type of malignancy, early detection is the best form of cure, and hence timely and accurate diagnosis of the tumor is extremely vital. Extensive research has been carried out on automating the critical diagnosis procedure, and various machine learning algorithms have been developed to aid physicians in optimizing the decision task effectively. In this research, we present a breast cancer classification model based on a combination of ontology and case-based reasoning that classifies breast cancer tumors as either malignant or benign. This classification system makes use of clinical data. Two ontology-based, object-oriented CBR frameworks are used: jCOLIBRI and myCBR. A breast cancer diagnostic prototype is built. During prototyping, we examine the use and functionality of the two frameworks.
Index Terms: Case-Based Reasoning, Case-Based Reasoning Frameworks, CBR, CBR Frameworks, jCOLIBRI, myCBR, Breast Cancer
I. INTRODUCTION
Breast cancer classification, diagnosis and prediction
techniques have been a widely researched area in the past
decade in the world of medical informatics. Several articles
have been published which try to classify breast cancer data
sets using various techniques such as fuzzy logic, support
vector machines, Bayesian classifiers, decision trees and neural
networks. Classification accuracy as high as 98.8% has been
achieved using a learning algorithm combining simulated
annealing with the perceptron algorithm. Another study involving
fuzzy modeling and cooperative co-evolution has reached an
accuracy of 98.98% on the widely studied Wisconsin
breast cancer database [16].
This research applies a new technique in the field of
breast cancer classification. It uses a combination of ontology
and case-based reasoning by using ontology-based
object-oriented case-based reasoning frameworks. Two frameworks
are examined in building the classifier. One is the open source
jCOLIBRI [5] system, developed by the GAIA group, which provides
a framework for building CBR systems based on state-of-the-art
software engineering techniques. The other is the novel
open source CBR tool myCBR [24], developed at the German
Research Center for Artificial Intelligence (DFKI). The objective
of this classifier is to classify a patient's tumor as benign or
malignant based on his/her electronic record.
This paper is organized in four sections. Section 1 is this
introduction. Section 2 gives a theoretical background about
breast cancer, ontology, CBR and object-oriented frameworks.
Section 3 illustrates the implementation of the breast cancer
classifier on the two frameworks. Finally, Section 4 discusses
the results and concludes.
II. THEORETICAL BACKGROUND
A. Breast Cancer
Breast cancer is the form of cancer that either originates
in the breast or is primarily present in the breast cells. The
disease occurs mostly in women but a small population of
men is also affected by it. Breast cancer is the most common
form of cancer amongst the female population as well as the
most common cause of cancer deaths [25]. Early detection
of breast cancer saves many thousands of lives each year.
Many more could be saved if the patients are offered accurate,
timely analysis of their particular type of cancer and the
available treatment options. Since the breast tumors whether
malignant or benign share structural similarities, it becomes
an extremely tedious and time consuming task to manually
differentiate them. As seen in Figure 1 there is no visually
significant difference between the fine needle biopsy image of
the malignant and benign tumor for an untrained eye. Accurate
classification is very important as the potency of the cytotoxic
drugs administered during the treatment can be life threatening
or may develop into another cancer. Laboratory analysis or
biopsies of the tumor is a manual, time consuming yet accurate
system of prediction. It is however prone to human errors,
creating a need for an automated system to provide a faster
and more reliable method of diagnosis and prediction for the
patients.
Fig. 1. Fine needle biopsies of breast. Malignant (left) and Benign (right) [25]
Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 3-10
ISBN 978-83-60810-27-9, ISSN 1896-7094
978-83-60810-27-9/09/$25.00 © 2010 IEEE
B. Ontology
Ontology is a formal explicit description of concepts in a
domain of discourse (classes (sometimes called concepts)),
properties of each concept describing various features and
attributes of the concept (slots (sometimes called roles or
properties)), and restrictions on slots (facets (sometimes called
role restrictions)). Ontology together with a set of individual
instances of classes constitutes a knowledge base. In reality,
there is a fine line where the ontology ends and the knowledge
base begins [8].
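The class/slot/facet vocabulary above can be illustrated with a minimal sketch. It is not taken from the paper or from any ontology tool: the names `Slot`, `OntologyClass`, `diagnosis` and `clump_thickness` are hypothetical, chosen only to show how facets restrict slot values and how instances of a class form a small knowledge base.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """A property of a class; `allowed` is a facet restricting its values."""
    name: str
    value_type: type
    allowed: tuple = ()  # empty tuple means "no enumerated restriction"

    def accepts(self, value) -> bool:
        if not isinstance(value, self.value_type):
            return False
        return not self.allowed or value in self.allowed


@dataclass
class OntologyClass:
    """A concept in the domain, described by its slots."""
    name: str
    slots: dict = field(default_factory=dict)

    def add_slot(self, slot: Slot) -> None:
        self.slots[slot.name] = slot

    def instantiate(self, **values) -> dict:
        """Create an individual instance after checking every facet."""
        for key, value in values.items():
            slot = self.slots.get(key)
            if slot is None:
                raise KeyError(f"unknown slot: {key}")
            if not slot.accepts(value):
                raise ValueError(f"{value!r} violates the facets of slot {key}")
        return {"class": self.name, **values}


# Hypothetical domain: a tumor concept with two illustrative slots.
tumor = OntologyClass("Tumor")
tumor.add_slot(Slot("diagnosis", str, allowed=("benign", "malignant")))
tumor.add_slot(Slot("clump_thickness", int))

# The ontology plus its individual instances constitutes a knowledge base.
knowledge_base = [tumor.instantiate(diagnosis="benign", clump_thickness=3)]
```

Keeping the facet check inside `instantiate` means an instance that violates a restriction never enters the knowledge base, which mirrors the separation between ontology and instances described above.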
C. Case-Based Reasoning
In case-based reasoning (CBR) systems, expertise is embodied
in a library of past cases, rather than being encoded in
classical rules. Each case typically contains a description of the
problem, plus a solution and/or the outcome. The knowledge
and reasoning process used by an expert to solve the problem
is not recorded, but is implicit in the solution. To solve a
current problem, the problem is matched against the cases in
the case base, and similar cases are retrieved. The retrieved
cases are used to suggest a solution that is reused and tested
for success. If necessary, the solution is then revised. Finally,
the current problem and the final solution are retained as part
of a new case.
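The retrieve, reuse, revise and retain steps described above can be sketched in a few lines. This is an illustrative sketch only, not the jCOLIBRI or myCBR API: the function names, the distance-based similarity, the majority vote, and the toy two-feature cases are all assumptions made for the example.

```python
import math
from collections import Counter


def similarity(a, b):
    """Similarity in (0, 1], derived from the Euclidean distance (an assumption)."""
    return 1.0 / (1.0 + math.dist(a, b))


def retrieve(case_base, problem, k=3):
    """Return the k stored cases most similar to the new problem."""
    ranked = sorted(case_base,
                    key=lambda case: similarity(case["problem"], problem),
                    reverse=True)
    return ranked[:k]


def reuse(neighbours):
    """Suggest a solution by majority vote over the retrieved cases."""
    votes = Counter(case["solution"] for case in neighbours)
    return votes.most_common(1)[0][0]


def solve(case_base, problem, confirmed=None, k=3):
    """Run one CBR cycle and return the final solution."""
    proposed = reuse(retrieve(case_base, problem, k))
    # Revise: an externally confirmed outcome overrides the proposal.
    final = confirmed if confirmed is not None else proposed
    # Retain: store the problem and its final solution as a new case.
    case_base.append({"problem": problem, "solution": final})
    return final


# Toy case base with two hypothetical numeric features per case.
case_base = [
    {"problem": (1.0, 1.0), "solution": "benign"},
    {"problem": (2.0, 1.0), "solution": "benign"},
    {"problem": (8.0, 9.0), "solution": "malignant"},
    {"problem": (9.0, 8.0), "solution": "malignant"},
]
answer = solve(case_base, (8.0, 8.0))
```

In this sketch revision is reduced to accepting a confirmed outcome when one is supplied; a real system would test the proposed solution before retaining it.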
The CBR process can be represented by a schematic cycle,
as shown in Figure 2 [1].
Fig. 2. The CBR Cycle
Representation: Given a new situation, generate appropriate
semantic indices that will allow its classification and
categorization. This usually implies a standard indexing vocabulary
that the CBR system uses to store historical information
and problems. The vocabulary must be rich enough to be
expressive, but limited enough to allow efficient recall [2].
Retrieval: Given a new, indexed problem, retrieve the best
past cases from memory. This requires answering three
questions: What constitutes an appropriate case? What are the
criteria of closeness or similarity between cases? How should
cases be indexed? Part of the index must be a description of the
problem that the case solved, at some level of abstraction. Part
of the case, though, is also the knowledge gained from solving
the problem represented by the case. In other words, cases
should also be indexed by some elements of their solution [11].
Adaptation: Modify the old solutions to conform to the new
situation, resulting in a proposed solution. With the exception
of trivial situations, the solution recalled will not immediately
apply to the new problem, usually because the old and the
new problem are slightly different. CBR researchers have
developed and used various adaptation techniques [11].
Validation: After the system checks a solution, it must
evaluate the results of this check. If the solution is acceptable,
based on some domain criteria, the CBR system is done with
reasoning. Otherwise, the case must be modified again, and
this time the modifications will be guided by the results of the
solution's evaluation [11].
Update: If the solution fails, explain the failure and learn
from it, to avoid repeating it. If the solution succeeds and warrants
retention, incorporate it into the case memory as a successful
solution and stop. The CBR system must decide if a successful
new solution is sufficiently different from already-known solutions
to warrant storage. If it does warrant storage, the system
must decide how the new case will be indexed, on which level
of abstraction it will be saved, and where it will be put in the
case-base organization [11].
Retaining the case is the process of incorporating whatever
is useful from the new case into the case library. This involves
deciding what information to retain and in what form to retain
it; how to index the case for future retrieval; and integrating
the new case into the case library.
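The cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from either framework: the names Case, CaseBase, solve and revise are hypothetical, the distance measure is a plain sum of absolute differences, and adaptation is reduced to reusing the retrieved solution unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    problem: tuple   # attribute values describing the problem
    solution: str    # stored solution (here simply a label)


@dataclass
class CaseBase:
    cases: list = field(default_factory=list)

    def retrieve(self, problem, k=1):
        """Retrieval: return the k stored cases closest to the query."""
        def distance(case):
            return sum(abs(a - b) for a, b in zip(case.problem, problem))
        return sorted(self.cases, key=distance)[:k]

    def retain(self, problem, solution):
        """Update: store the solved problem only if it is not already known."""
        known = any(c.problem == problem and c.solution == solution
                    for c in self.cases)
        if not known:
            self.cases.append(Case(problem, solution))


def solve(case_base, problem, revise):
    """One pass through the cycle; assumes a non-empty case base."""
    best = case_base.retrieve(problem, k=1)[0]   # Retrieval
    proposed = best.solution                     # Adaptation (trivial reuse)
    final = revise(problem, proposed)            # Validation and revision
    case_base.retain(problem, final)             # Update
    return final
```

Here revise stands for whatever domain check accepts or corrects the proposed solution; passing a function that returns its second argument unchanged corresponds to accepting every proposal.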
D. CBR Object-Oriented Frameworks
The concept of object-oriented frameworks was introduced
in the late 1980s and has been defined as a set of classes
that embodies an abstract design for solutions to a family of
related problems, and supports reuse at a larger granularity
than classes [9].
The goal of a framework is to capture a set of concepts
related to a domain and the way they interact. In addition, a
framework is in control of a part of the program activity and
calls specific application code by dynamic method binding.
A framework can be viewed as an incomplete application
where the user only has to specialize some classes to build
the complete application [9].
Frameworks allow the reuse of both code and design for a
class of problems, enabling non-experts to write
complex applications quickly. Frameworks also allow the
development of prototypes that can later be extended
by specialization or composition. Once a framework is
understood, it can be applied in a wide range of domains and
can be enhanced by adding new components [9].
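The idea that a framework controls part of the program flow and calls application code through dynamic method binding can be illustrated with a small sketch. The class names CBRFramework and NumberMatcher and their methods are hypothetical and do not correspond to the API of jCOLIBRI or myCBR; the sketch only shows an incomplete application that becomes complete once two methods are specialized.

```python
from abc import ABC, abstractmethod


class CBRFramework(ABC):
    """Hypothetical framework class: it owns the control flow."""

    def run(self, query):
        # The framework fixes the order of the steps and calls the
        # application-specific hooks through dynamic method binding.
        candidates = self.load_cases()
        return max(candidates, key=lambda case: self.similarity(query, case))

    @abstractmethod
    def load_cases(self):
        """Hook: return the stored cases."""

    @abstractmethod
    def similarity(self, query, case):
        """Hook: return a score, higher meaning more similar."""


class NumberMatcher(CBRFramework):
    """Application code: only the two hooks are specialized."""

    def load_cases(self):
        return [2, 10, 25]

    def similarity(self, query, case):
        return -abs(query - case)


print(NumberMatcher().run(11))
```

The application never calls load_cases or similarity itself; the inherited run method does, which is the inversion of control that distinguishes a framework from a class library.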
Using frameworks for the development of new applications
helps improve software quality. It improves programmers'
productivity and the quality, performance, and reliability of
software. It also enhances extensibility by providing the
methods that allow applications to extend the framework's stable
interfaces [20]. Figure 3 shows the difference in the effort
required to develop an application from scratch and to develop
it using a framework [15].
Fig. 3. Development Effort Reduction by using Frameworks
CBR researchers agree that the best way to satisfy the
increasing demand for CBR applications is to develop
frameworks. Recently, several efforts within the
CBR community have produced CBR frameworks [20]. This
paper focuses on two of them: jCOLIBRI, developed by the GAIA
group, and myCBR, developed by the DFKI group.
III. EXPERIMENTS
A. Breast Cancer Classifications
Breast cancer has become the number one cause of cancer
deaths amongst women. Once a breast cancer is detected, it
can be classified as benign (non-cancerous tissue) or malignant
(cancerous tissue). In this study, the two CBR frameworks are
compared by developing, in each of them, a CBR application that
classifies a breast tumor as benign or malignant. The Wisconsin
breast cancer data set was used for building the case-bases.
It was obtained from Dr. William H. Wolberg of the University
of Wisconsin Hospitals, Madison [14]. Samples in the data set arrive
periodically as Dr. Wolberg reports his clinical cases. The
data set contains 699 instances (as of 15
July 1992). Each record contains ten attributes plus the class
attribute. Table I shows the attributes and their possible values.
65.5% of the instances belong to the benign class and 34.5% to
the malignant class. 16 instances are incomplete (an attribute
value is missing) and have been excluded from the database.
TABLE I
WISCONSIN BREAST CANCER DATASET
No.  Attribute                     Possible Value
1    Sample code number            id number
2    Clump Thickness               1-10
3    Uniformity of Cell Size       1-10
4    Uniformity of Cell Shape      1-10
5    Marginal Adhesion             1-10
6    Single Epithelial Cell Size   1-10
7    Bare Nuclei                   1-10
8    Bland Chromatin               1-10
9    Normal Nucleoli               1-10
10   Mitoses                       1-10
11   Class                         2 for benign, 4 for malignant
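As a rough sketch of how such records might be prepared, the following Python fragment separates complete records from incomplete ones and counts the classes. It rests on several assumptions that are not stated in the paper: the records are comma-separated in the column order of Table I, a missing value is marked with "?", and the column names are our own. The three sample rows are invented for illustration and are not taken from the data set.

```python
import csv
import io

# Column names are our own labels for the attributes of Table I.
COLUMNS = ["id", "clump_thickness", "cell_size_uniformity",
           "cell_shape_uniformity", "marginal_adhesion",
           "single_epithelial_cell_size", "bare_nuclei",
           "bland_chromatin", "normal_nucleoli", "mitoses", "class"]
CLASS_LABELS = {2: "benign", 4: "malignant"}


def load_records(text, missing_marker="?"):
    """Split rows into complete records and records with a missing value."""
    complete, incomplete = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        record = dict(zip(COLUMNS, row))
        if missing_marker in row:
            incomplete.append(record)          # kept aside, values stay strings
        else:
            complete.append({k: int(v) for k, v in record.items()})
    return complete, incomplete


def class_distribution(records):
    """Count how many complete records fall into each class."""
    counts = {}
    for record in records:
        label = CLASS_LABELS[record["class"]]
        counts[label] = counts.get(label, 0) + 1
    return counts


# Invented sample rows (not real patient records).
SAMPLE = "1,5,1,1,1,2,1,3,1,1,2\n2,8,10,10,8,7,10,9,7,1,4\n3,6,6,6,9,6,?,7,8,1,2\n"
complete, incomplete = load_records(SAMPLE)
print(len(complete), len(incomplete), class_distribution(complete))
```

The incomplete records are returned rather than discarded, because in this study they are later used as test queries.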
B. jCOLIBRI
1) Overview: jCOLIBRI is an evolution of the COLIBRI
architecture [7], which consisted of a library of problem solving
methods (PSMs) for solving the tasks of a knowledge-intensive
CBR system, along with an ontology, CBROnto [8],
of common CBR terminology. COLIBRI was prototyped in
LISP using LOOM as the knowledge representation technology.
This prototype served as a proof of concept; it was very useful,
but it was not helpful for non-expert users. The GAIA
group then started to develop a new, complete framework
named jCOLIBRI. The name stands for Cases and Ontology
Libraries Integration for Building Reasoning Infrastructures.
The CBR ontology assumes the same vocabulary provided by any
CBR system. In jCOLIBRI, the ontology is not represented as a
new source; all CBR concepts are mapped into classes and
interfaces of the framework. The classes that represent the concepts
of the ontology serve as templates to which new CBR types should be
added. They also provide the tasks and the abstract interfaces of the
methods. The design of the jCOLIBRI framework comprises
a hierarchy of Java classes plus a number of XML files. The
framework is organized around the following elements [2]:
Tasks and methods: The tasks supported by the framework
and the methods that solve them are all stored in a set of
XML files.
Case-base: Different connectors are defined to support several
types of case persistence, from the file system to a database.
Cases: A number of interfaces and classes are included in the
framework to provide an abstract representation of cases that
supports any type of actual case structure.
Problem solving methods: The actual code that supports the
methods included in the framework.
jCOLIBRI comes in two major releases, version 1 and
version 2. According to the tutorial [19], version 2 is a new
implementation that follows a new and clear architecture divided
into two layers: one oriented to developers and the other oriented
to designers. Unfortunately, the only available distribution of
version 2 is the one oriented to developers, which
is out of the scope of this paper. jCOLIBRI version 1 is the first
release of the framework. It includes a complete Graphical
User Interface (GUI) that guides the user in the design of a
CBR system. This version is recommended for non-developer
users who want to create CBR systems without programming
any code, which is exactly the scope of this study. As a result,
version 1 was selected to implement the required application.
Downloading jCOLIBRI is an easy task; it can be
obtained through the web page of the GAIA group. It comes as
a compressed distribution that can be easily extracted to obtain
the full package. To run jCOLIBRI, a ready-made batch file
(we are using the MS Windows platform) can be invoked
directly. A Java Virtual Machine must be installed
before running the batch file. Invoking
this batch file brings up the first screen of the framework GUI.
(a) Patient Case Definition in jCOLIBRI
(b) Managing Connectors in jCOLIBRI
(c) Configuration of Tasks in jCOLIBRI
(d) jCOLIBRI Retrieval
Fig. 4. Implementation in jCOLIBRI
2) Implementation: With the help of the provided multimedia
tutorials and the jCOLIBRI GUI, users go through
five steps to implement and deploy a CBR system. These steps
are:
Definition of case structures
Building the case-base
Managing similarity measures
Configuring the behavior of the CBR process
Testing and deploying the CBR application
Definition of Case Structures: Using the jCOLIBRI GUI, users
can create the case structure by defining the simple and
compound attributes that describe the cases, together with
their types, weights, and similarity measures, which are chosen from
a library of existing similarity functions and parameters. The
case structure can be saved to or loaded from an XML file.
Figure 4(a) shows the definition of the patient case parameters.
Building the case-base: jCOLIBRI introduces the concept
of Connectors, around which case persistence is built.
Connectors are objects that know how to access and retrieve
cases from the storage media and return those cases to the
CBR system in a uniform way. Connectors therefore provide
an abstraction mechanism that allows users to load cases
from different storage sources in a transparent way [24] [21].
The defined connectors can work with plain text files, XML files,
or relational databases. The graphical interface helps map
the defined case structure to the tables and columns of
the storage scheme. Figure 4(b) shows how the patient case
structure is mapped to columns in a text file containing the
Wisconsin data set patient records.
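The connector idea can be sketched as follows. The code is written in Python for brevity and does not reproduce the jCOLIBRI Java API: the class names Connector, DelimitedTextConnector and InMemoryConnector, the method retrieve_all_cases, and the attribute-to-column mapping are all hypothetical illustrations of the abstraction.

```python
from abc import ABC, abstractmethod
import csv
import io


class Connector(ABC):
    """Hypothetical connector interface: hides where the cases are stored."""

    @abstractmethod
    def retrieve_all_cases(self):
        """Return every stored case as a dict of attribute name to value."""


class DelimitedTextConnector(Connector):
    """Reads cases from comma-separated text using an explicit column mapping."""

    def __init__(self, text, column_of):
        self.text = text
        self.column_of = column_of   # attribute name -> column index

    def retrieve_all_cases(self):
        cases = []
        for row in csv.reader(io.StringIO(self.text)):
            if row:
                cases.append({name: row[index]
                              for name, index in self.column_of.items()})
        return cases


class InMemoryConnector(Connector):
    """Serves cases that are already held in a Python list."""

    def __init__(self, cases):
        self.cases = list(cases)

    def retrieve_all_cases(self):
        return list(self.cases)


def load_case_base(connector):
    """The reasoning code depends only on the Connector interface."""
    return connector.retrieve_all_cases()
```

Because load_case_base sees only the interface, the storage medium can be exchanged without touching the rest of the system, which is the transparency the paragraph above refers to.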
Managing similarity measures: When two cases are compared,
local similarity functions are used to compare simple
attribute values. Global similarity functions are linked to
compound attributes and are used to combine the similarities of
the collected attributes into a single similarity value. Finally, the
similarity of two cases is computed as the similarity of
their description concepts. The available similarity measures
are listed in a configuration file and can be managed using
the GUI. Since our problem is simple, we keep the default
similarity measures assigned by jCOLIBRI.
Configuring the behavior of the CBR process: As introduced above,
jCOLIBRI formalizes the CBR knowledge using the CBR ontology
(CBROnto), a knowledge-level description of the CBR
tasks and a library of reusable Problem Solving Methods
(PSMs) [21]. Tasks are configured interactively
by choosing, from the library of reusable methods,
one that is suitable to solve the selected task. The constraints of
the selected task are tracked during the configuration
process so that only the methods applicable in the given context
are offered to users. In our comparison we focus only on the
retrieval task. Figure 4(c) shows the configured tasks in the
breast cancer application.
Testing and deploying the CBR application: The CBR
application is finished when all the tasks have been configured.
Users can test the system from inside the graphical interface.
The first task of the CBR system, the Obtain query task, obtains
the query that is going to be used to retrieve the most similar
cases. Figure 4(d) shows the GUI after a query. We tested
the 16 records that were excluded from the dataset because
each has a missing value. Only two misclassifications were
obtained. The documentation mentions that it is possible to deploy
the developed CBR application by generating a code template
with most of the code required to run the developed system
as an independent application. We tried this process, but
it failed completely.
C. myCBR
1) Overview: myCBR is an open-source plug-in for the
open-source ontology editor Protégé [6]. Protégé is based
on Java, is extensible, and provides a plug-and-play
environment that makes it a flexible base for rapid prototyping
and application development [4]. Protégé [4] allows defining
classes and attributes in an object-oriented way. Furthermore,
it manages instances of these classes, which myCBR interprets
as cases [22]. The handling of the vocabulary and the case base
is therefore already provided by Protégé. The myCBR plug-in provides
several editors to define similarity measures for an ontology
and a retrieval interface for testing [24]. As the main goal of
myCBR is to minimize the effort of building CBR applications
that require knowledge-intensive similarity measures, myCBR
provides comfortable GUIs for modeling various kinds of
attribute-specific similarity measures and for evaluating the
resulting retrieval quality. To also reduce the effort of
the preceding step of defining an appropriate case
representation, it includes tools for generating the case representation
automatically from existing raw data [22]. Both the novice
and the expert knowledge engineer are supported during the
development of a myCBR project through intelligent support
approaches and advanced GUI functionality [22].
Downloading myCBR requires two steps. The first is
to download the myCBR plug-in files; this can be done directly
through the myCBR web page. The second step is to download
the Protégé ontology editor; this can be done through the
Protégé web page. Downloading Protégé is not an easy task:
users need to do some reading on the site to be able to
select the suitable version to download. Since myCBR is a
plug-in inside Protégé, users need to install Protégé first. A
Java Virtual Machine must be installed before
proceeding with the installation, or users may choose to download
the version that includes Java. To install the myCBR
plug-in for Protégé, users need to copy the myCBR plug-ins
into Protégé's plug-ins directory. Then, to start Protégé and
create new projects, users need to enable the myCBR plug-ins
from the configuration menu of Protégé. After the myCBR plug-in
is installed and activated, the user interface of Protégé is
extended with additional tabs to access the myCBR modules.
After developing a CBR application using the Protégé plug-in,
myCBR can also be used as a stand-alone Java module,
to be integrated in arbitrary applications, for example,
JSP-based web applications. In this application phase, the retrieval
engines of myCBR just read the XML files of the
project created with the plug-in interface and perform
the similarity-based retrieval [24].
For Protégé manuals and tutorials, users may consult the
documentation section of the Protégé web site. Among other
things, users may find the Protégé User's Guide, a "getting
started" tutorial, and information on ontology development.
The manual for myCBR is available on its web page in HTML
and PDF versions. The manual covers installation and
different usage issues. No multimedia tutorials are available
for the usage of myCBR.
(a) Wisconsin Dataset in a CSV File
(b) Patient Case Data Representation in myCBR
(c) Retrieval of a Case Query with a Missing Attribute Value
(d) Breast Cancer as a Stand-Alone Application
Fig. 5. Implementation in myCBR
2) Implementation: Four steps are required to develop a
CBR application:
Generation of case representations
Modeling similarity measures
Testing of retrieval functionality
Implementation of a stand-alone application
Generation of case representations: One powerful feature
provided by myCBR is the ease of generating case representations
with the CSV data import module [24]. Users can choose to
import data instances into an existing Protégé class or to create
a new class that is suitable for their raw data. Figure 5(a) shows
how the Wisconsin dataset is arranged in a CSV file. myCBR
also allows slots to be added manually using Protégé. Figure
5(b) shows the myCBR screen after importing the dataset into a
new class, Patient, whose instances will be used as query and
case values in the retrieval step.
Modeling of similarity measures: myCBR follows the local-global
approach, which divides the similarity definition into
a set of local similarity measures for each attribute, a set of
attribute weights, and a global similarity measure for
calculating the final similarity value. This means that, for an
attribute-value based case representation consisting of n attributes, the
similarity between a query q and a case c may be calculated
as follows:
$$Sim(q, c) = \sum_{i=1}^{n} w_i \cdot sim_i(q_i, c_i) \qquad (1)$$
Here, $sim_i$ and $w_i$ denote the local similarity measure and the
weight of attribute i, and Sim represents the global similarity
measure [24]. The dataset used in this experiment is simple,
so we leave the similarity measure definition as the default of
myCBR. We only change the weight values of the Id and Class
slots from one to zero. However, users may consult the myCBR
tutorial for more options in defining local and global similarity
measures.
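Equation (1) can be illustrated with a short Python sketch. It is not the myCBR implementation: the local similarity used here (one minus the absolute difference divided by the attribute range of 9) is an assumed stand-in for the myCBR defaults, which the paper does not spell out, the attribute names are our own, and the weights are normalized to sum to one so that identical cases score 1.

```python
# Attribute names are our own labels for the slots of the Patient class.
ATTRIBUTES = ["id", "clump_thickness", "cell_size_uniformity",
              "cell_shape_uniformity", "marginal_adhesion",
              "single_epithelial_cell_size", "bare_nuclei",
              "bland_chromatin", "normal_nucleoli", "mitoses", "class"]


def local_similarity(q_value, c_value, value_range=9):
    """Assumed local measure: 1 for equal values, 0 at the maximum distance."""
    return 1.0 - abs(q_value - c_value) / value_range


def global_similarity(query, case, weights):
    """Equation (1): sum over the attributes of w_i * sim_i(q_i, c_i)."""
    return sum(w * local_similarity(query[name], case[name])
               for name, w in weights.items() if w != 0)


# All weights start at one; Id and Class are set to zero as in the experiment.
weights = {name: 1.0 for name in ATTRIBUTES}
weights["id"] = 0.0
weights["class"] = 0.0
total = sum(weights.values())
weights = {name: w / total for name, w in weights.items()}  # our normalization
```

Setting the Id and Class weights to zero keeps the record identifier and the label that is to be predicted from influencing which cases are retrieved.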
Testing of retrieval functionality: myCBR includes an
easy-to-use GUI for performing retrievals and for analyzing the
corresponding results. By providing similarity highlighting and
explanation functionality, myCBR supports the efficient analysis
of the outcome of the similarity computation. We tested the
16 records that were excluded from the dataset because each has a
missing value. Only two misclassifications were obtained.
Figure 5(c) shows one query of these records after retrieving
the most similar cases. An alternative way of performing case
retrieval is to use a query from cases. This was also tested and
gives a similar result, as shown in Figure 5(d).
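A query with a missing attribute value can be classified from its most similar cases roughly as follows. This is a sketch under our own assumptions, since the paper does not describe how either framework treats missing values: here a missing value is represented by None and simply left out of the comparison, the local similarity is the same linear measure over a range of 9, and the class is decided by a majority vote among the k most similar cases. The small case base in the usage example is invented.

```python
from collections import Counter


def similarity(query, case, value_range=9):
    """Mean local similarity over the attributes present in the query."""
    known = [(q, c) for q, c in zip(query, case) if q is not None]
    if not known:
        return 0.0
    return sum(1.0 - abs(q - c) / value_range for q, c in known) / len(known)


def classify(query, case_base, k=3):
    """Return the majority class among the k most similar stored cases."""
    ranked = sorted(case_base,
                    key=lambda item: similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented three-attribute cases, for illustration only.
CASES = [((1, 1, 1), "benign"), ((2, 1, 2), "benign"),
         ((9, 10, 8), "malignant"), ((8, 9, 10), "malignant"),
         ((1, 2, 1), "benign")]
print(classify((9, None, 9), CASES))
```

Skipping the missing attribute and averaging over the remaining ones is only one possible policy; substituting a default value for the missing attribute would be another.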
Implementation of stand-alone application: myCBR can also
be used as a stand-alone Java module, to be integrated in
arbitrary applications. In this application phase, the retrieval
engines of myCBR just read the XML files of the created
project generated using the plug-in interface and perform the
similarity-based retrieval. Figure 5(d) shows the breast cancer
stand-alone application.
IV. DISCUSSION AND CONCLUSION
In this paper, we examined two object-oriented, ontology-based
CBR frameworks: jCOLIBRI, developed by the GAIA group,
and myCBR, developed by the DFKI group. A breast cancer
classifier was built using each of the two frameworks.
During the implementation of the breast cancer diagnostic
application using jCOLIBRI, we found that jCOLIBRI is
user-friendly and efficient for developing a quick application. The
classifier was successful in classifying the selected data
set. During the implementation of the breast cancer classifier
using myCBR, we noticed that myCBR is truly a tool for
rapid prototyping of new CBR applications. In seconds, users
may have a running stand-alone CBR application by using the
CSV import feature. myCBR is intelligent enough to build
the case structure and the case base by parsing the provided
CSV file. myCBR avoids reinventing the wheel by having the
development of a new CBR application done inside Protégé.
The classifier was successful in classifying the selected
data set.
In conclusion, the two CBR frameworks are very useful for
developing a CBR-based breast cancer classifier that can play a very
important role in the early detection of the disease, so that
the right medication can be used to save lives.
REFERENCES
[1] A. Aamodt and E. Plaza, "Case-Based Reasoning: Foundational Issues, Methodological Variation and System Approaches," AICOM, vol. 7, no. 1, 1994, pp. 39-58.
[2] J. J. Bello-Tomás, P. A. González-Calero and B. Díaz-Agudo, "JCOLIBRI: An Object-Oriented Framework for Building CBR Systems," in Advances in Case-Based Reasoning, Lecture Notes in Computer Science, Springer Berlin/Heidelberg, vol. 3155, 2004, pp. 32-46.
[3] S. Bogaerts and D. Leake, "Increasing AI Project Effectiveness with Reusable Code Frameworks: A Case Study Using IUCBRF," in Proceedings of the 18th International Florida Artificial Intelligence Research Society Conference, Menlo Park, CA: AAAI Press, 2005.
[4] S. Bogaerts and D. Leake, "A Framework for Rapid and Modular Case-Based Reasoning System Development," Technical Report TR 617, Computer Science Department, Indiana University, Bloomington, IN, 2005.
[5] B. Díaz-Agudo, P. A. González-Calero, J. Recio-García and A. Sánchez-Ruiz, "Building CBR systems with jCOLIBRI," Journal of Science of Computer Programming, vol. 69, no. 1-3, 2007, pp. 68-75.
[6] J. H. Gennari, M. A. Musen, R. W. Fergerson, W. E. Grosso, M. Crubézy, H. Eriksson, N. F. Noy and S. W. Tu, "The evolution of Protégé: an environment for knowledge-based systems development," Int. J. Hum.-Comput. Stud., vol. 58, no. 1, 2003, pp. 89-123.
[7] P. A. González-Calero and B. Díaz-Agudo, "An architecture for knowledge intensive CBR systems," in E. Blanzieri and L. Portinale, editors, Advances in Case-Based Reasoning (EWCBR'00), Springer-Verlag, Berlin Heidelberg New York.
[8] P. A. González-Calero and B. Díaz-Agudo, "CBROnto: a task/method ontology for CBR," in S. Haller and G. Simmons, editors, Procs. of the 15th International FLAIRS'02 Conference (Special Track on CBR), pp. 101-106, AAAI Press.
[9] M. Jaczynski and B. Trousse, "An Object-Oriented Framework for the Design and the Implementation of Case-Based Reasoners," in Proceedings of the 6th German Workshop on Case-Based Reasoning, Berlin, 1998.
[10] R. Johnson and B. Foote, "Designing reusable classes," Journal of Object-Oriented Programming, vol. 1, no. 5, 1988, pp. 22-35.
[11] J. L. Kolodner, Case-Based Reasoning, Morgan Kaufmann Publishers, California, 1993.
[12] D. Leake, Case Based Reasoning: Experiences, Lessons and Future Directions, AAAI Press, MIT Press, USA, 1997.
[13] M. Manago, R. Bergmann, N. Conruyt, R. Traphöner, J. Pasley, J. Le Renard, F. Maurer, S. Wess, K. D. Althoff and S. Dumont, "CASUEL: a common case representation language," ESPRIT project 6322, Task 1.1, Deliverable D1, 1994.
[14] O. L. Mangasarian and W. H. Wolberg, "Cancer diagnosis via linear programming," SIAM News, vol. 23, no. 5, 1990, pp. 1, 18.
[15] A. Mulder, "Developing a Reusable Application Framework," Chariot Solutions, http://www.chariotsolutions.com/javalab/presentations.jsp, 2003.
[16] C. A. Peña-Reyes and M. Sipper, "Applying Fuzzy CoCo to Breast Cancer Diagnosis," IEEE, 2000, pp. 1168-1175.
[17] J. A. Recio-García, B. Díaz-Agudo and P. A. González-Calero, "Prototyping recommender systems in jCOLIBRI," in Proceedings of the 2008 ACM Conference on Recommender Systems (Lausanne, Switzerland, October 23-25, 2008), RecSys '08, ACM, New York, NY, pp. 243-250.
[18] J. A. Recio-García, B. Díaz-Agudo and P. A. González-Calero, "jCOLIBRI2 Tutorial," Group of Artificial Intelligence Application (GAIA), University Complutense of Madrid, Document Version 1.2, 2008.
[19] J. A. Recio-García, D. Bridge, B. Díaz-Agudo and P. A. González-Calero, "CBR for CBR: A Case-Based Template Recommender System," in K. D. Althoff and R. Bergmann, editors, Advances in Case-Based Reasoning, 9th European Conference, ECCBR 2008 (in press), LNCS, Springer.
[20] J. A. Recio-García, B. Díaz-Agudo, A. Sánchez and P. A. González-Calero, "Lessons learnt in the development of a CBR framework," in M. Petridis, editor, Proceedings of the 11th UK Workshop on Case Based Reasoning, CMS Press, University of Greenwich, 2006, pp. 60-71.
[21] J. A. Recio-García, A. Sánchez, B. Díaz-Agudo and P. A. González-Calero, "jCOLIBRI 1.0 in a nutshell. A software tool for designing CBR systems," in M. Petridis, editor, Proceedings of the 10th UK Workshop on Case Based Reasoning, CMS Press, University of Greenwich, 2005, pp. 20-28.
[22] T. R. Roth-Berghofer and D. Bahls, "Explanation Capabilities of the Open Source Case-Based Reasoning Tool myCBR," 2008.
[23] S. Schulz, "CBR-Works: A state-of-the-art shell for case-based application building," in E. Melis, editor, Proceedings of the 7th German Workshop on Case-Based Reasoning, GWCBR'99, Würzburg, Germany, University of Würzburg, pp. 166-175.
[24] A. Stahl and T. R. Roth-Berghofer, "Rapid prototyping of CBR applications with the open source tool myCBR," in R. Bergmann and K. D. Althoff, eds., Advances in Case-Based Reasoning, Springer Verlag, 2008.
[25] M. Sewak, P. Vaidya, C. C. Chan and Z. H. Duan, "SVM Approach to Breast Cancer Classification," IMSCCS, vol. 2, 2007, pp. 32-37.
MEDHAT MOHAMED AHMED ABDELAAL, MUHAMED WAEL FAROUQ ET AL.: USING DATA MINING FOR ASSESSING DIAGNOSIS 13
-
6F7
BACBF"A A"ABAEAAEA! A BE A F B!CRQ%AFBABBBBAC9"EBBAFABABAC8"AAB""AABAF
)BABFABB AB>M""BAC9%ABCAEABFABABABAC9ABBE%FC E- BFFB!A C9 A A"AABBFFB!A
D0
DDE
3 @ D0 0
DDE E DA
D0
D DDE
6037
)BAAEBFFB!A
D0
D DE33DD DED 6007
CCBF EFA B8F AB AAABC A A A AB %CBF !FEBAAABAABAA
)B8FABFBFBCF%FEAB6*&7"CB6%7)B!AB8B!!8FAB!FFEEABFAB
) *& 8F B%FA C CF AB "CBFBAFBFF%AB EA! A"A A"B AB A%AEA2 F EABBABAA%CBAF8FFBA*&D"%CB8FEAC*&8FBACA )*&AB ! CA ABAABFBCF8FA*&8FFCFFA
)ABFFABEAABFFAAAFABBB%FA AACAEAFF9EC"AAB"CBFFFSATE"8F ABAA F FAB AB E F )CBABCC8F AB A*&UF/I1)*&8FALAA8FCA'A8FABABCBF6037A
CBFE.ABFFB!A)CBAA6=7ECC9A
D0
DD
3 @ D0 0
DDE E DA
D0
D DDE
60;7
,''+()*,,
DBAFB"FCBFAE6A!B%!FA7AAAB!B!AFBA"A%EFEAE"AFBABABEF)CA BBA'AAABFAA!BA !&BBA"))BA
DBAEABAAFBAA%"AEFEBFBAABEF
,BAABB6B!7BCAB"%FAA(BAABBAFBFFSA%CFTB SFT B )ABCBABFF ASBBATBF8F A BA!!AABBAAAAB)BBABAFFBAB!AAA
DBABAAEEFAAAAB!BABA!B"B6FB7)CBAABFAAF"B)BFFSAAB"T
,''+(&++)'(L)*,,
:&BBA":A BCB"ABAABEF"AABAF%BCE"ABAABAB!A!"A"BAAAABAFBBAABCC9'CAAB"AF-ABAEABFB
))&BBAF"BACBAC9BCB"A%BCBFEFABBA*B!AACBFEFA")&BBACB"ACBAAB8B!CBF"A ))&BBAF"BACABFFCFABB
)BAEAAACEFABC9AB"AAAB$B!BCBAEFAAFFFA:BA:BAAB2!)&BBAABAAABCAFCBCAEAAACAFF)&BBACBFEEA
#) 3&0 0 &; ; 6057
AAA"A!AAA"FBA6ACA"AFB"BCBF7ABBSB%FTFC"AABA
14 PROCEEDINGS OF THE IMCSIT. VOLUME 5, 2010
-
A"#$%#$AAAABAB%%F"%ABABAAB%AFAABCAEA)&BBAF"B%AC
)AAAAABAA)FBCAAAAABABA!A%ACA AB A B ) B AAB"BA )FAFBCE"A!"ABAEABBA
FF AF A FCFF EA AFF)&BBAACBABBACFFA)&BBACBFBA"B AA BA EBEA" F" %"F%ACBF )&BBACBFBA F ABBBABBAAABF"FA!B8 )&BBA CBF F BABBBAAFABEF'FA%ABEFAABCAFFBBA%AAACBF)&BBAA$E%"BFBAB!C8A"FAAABBAFCF)&BBABAFF6AAF7CBBABABCABBAABEFBABCAB6V-WE7FB"6-7B-6-7BBAAACBF$ABBAABCAB)BAAACABBB"AFAABF"C"AB F )ABA FCA A)&BBAF"BACC8A"FAAABBA%A"B%FABBC%B!%CF"CABEABFAA"F9ABB)&BBACBF""A BAA")&BBAEF AB "B CBF 8%F FABBEFC)&BBAFEBABABA%"BFABA"AEF
,''+()*,,+*,)
DB)BACEFBBA!BABBCEABC8ABFF%ABBABADBABACFAB)&BBACBFAAAF"CEBA"B!$B!)&BBA"ABA!AABAABBA"B"ABA-AAA'BAABABA"B!CEBAAFFFABBAAAAFAFFBACEEFA
&BA)&BBABABAB"CBFBABAABABE"AB!BFABBACBF)BAAA!ABA "AAABBF"E8B!BCAABAAB)B%AAACABBB"AFAABF"C"ABF)ABAF%CAABABAF"BACC8A"F
AAABBAA"BABAEFAB"BFABCBF
)CA"BBABAAAACBFBCF-BAEF9F8"FA'ACBBSEF8EB-TF8F A!B8&BAAEFABAEBA"F%ABABACBF)"F%ACBFEAAB"A AAA"BB!AABEF%FAABABACBFEABBAA"A"FAAB
'C D(DME'D(*,M)
)C)EF'!BBCBB%)CEABBACEA!ACACAF
TABLE I.
THE RESULTS OF APPLYING CLASSIFICATION SVM AND DECISION TREES
TABLE II
EXPERIMENTAL RESULTS.
Evaluation of Clustering Algorithms
for Polish Word Sense Disambiguation
Bartosz Broda, Wojciech Mazur
Institute of Informatics, Wrocław University of Technology, Poland
[email protected], [email protected]
Abstract. Word Sense Disambiguation in text is still a difficult problem, as the best supervised methods require laborious and costly manual preparation of training data. Thus, this work focuses on the evaluation of a few selected clustering algorithms in the task of Word Sense Disambiguation for Polish. We tested six clustering algorithms (K-Means, K-Medoids, hierarchical agglomerative clustering, hierarchical divisive clustering, Growing Hierarchical Self Organising Maps, graph-partitioning based clustering) and five weighting schemes. For the agglomerative and divisive algorithms, 13 criterion functions were tested. The achieved results are interesting, because the best clustering algorithms are close in terms of cluster purity to the precision of a supervised clustering algorithm on the same dataset, using the same features.
I. INTRODUCTION
Word Sense Disambiguation (WSD) deals with contextual resolution
of lexical ambiguity. Most words in natural language have more than
one lexical meaning (sense), but usually only one of them is active
in a given context. A typical example of an ambiguous word is "line",
which according to WordNet (an electronic thesaurus, cf. [1]) has
36 senses. WSD is an important problem for applications in the domain
of Natural Language Processing (NLP). Machine translation cannot
work without some form of disambiguation, and WSD can also be
helpful for information retrieval, information extraction and
computer-aided lexicography, among others [2].
WSD is a hard problem. Most difficulties arise from the fact that
the concept of meaning is vague. Usually, there are no clear
boundaries between one sense and another [3]. Typically, the problem
of defining meaning is tackled by using dictionaries (called sense
inventories in the context of WSD). That is, from the algorithmic
point of view, sense inventories are used to enumerate all the
meanings that a given word has. The goal of WSD can now be stated as
choosing the appropriate sense from the sense inventory for a word
in a given context.
There are two main approaches to WSD based on machine learning:
supervised and unsupervised [2].¹ Supervised learning relies on
manually disambiguated examples of text snippets containing
ambiguous words. An appropriate sense inventory has to be chosen in
advance, at the early stages of the construction of a supervised WSD
system. Features are extracted from those text snippets (or
contexts²) and classifiers are trained on this manually labeled
data. Most of the time, supervised approaches are superior to
unsupervised ones in terms of accuracy of automatic disambiguation
when used on the same type of texts that the systems were trained on.
¹ There is a plethora of other approaches to WSD, e.g., based on translational equivalence or hand-written rules. We omit those for brevity. For an extensive overview of other methods see, e.g., [2], [4].
Nevertheless, there is another issue connected with the problem of
defining meaning, namely the creation of the other resources used by
automatic WSD systems. This is especially evident in the creation of
corpora³ manually annotated (tagged) with senses, which are used for
training machine learning classifiers in a supervised setting. There
are two important problems in manual sense tagging of a corpus: low
interannotator agreement (IA) and the high cost of the annotation
process. IA measures how much the annotations assigned by one
annotator differ from the annotations assigned by another annotator.
IA is used to estimate an upper bound on the performance of automatic
WSD. Typically, it is not enough to give the percentage agreement,
because agreements and disagreements may arise by chance. Cohen's κ
is widely used in the computational linguistics community for this
purpose, but there are also other measures [5]. The cost of
annotation is high, because manual annotation requires a large
effort. Mihalcea estimated that constructing a corpus with a
sufficient amount of data for supervised classification algorithms
for 20 000 ambiguous words would require 80 man-years of work [6].
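The chance correction mentioned above can be illustrated with a minimal sketch of Cohen's κ for two annotators. The function name `cohen_kappa` and the sense tags are hypothetical and serve only as an illustration; they are not taken from the annotation study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[s] * freq_b[s] for s in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one and the same label everywhere
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical sense tags for six contexts of one ambiguous word.
annotator_1 = ["s1", "s1", "s2", "s2", "s1", "s3"]
annotator_2 = ["s1", "s2", "s2", "s2", "s1", "s1"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this example the annotators agree on four of six contexts, yet κ is visibly lower than the raw agreement of 4/6, because part of that agreement is expected by chance.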
On the other hand, unsupervised and semi-supervised algorithms can
be used. The amount of manual labor required is much lower in
learning without supervision. Unsupervised approaches to WSD tend to
use unlabeled data and find sense distinctions automatically.
Usually those methods involve some form of clustering. Harris's
distributional hypothesis [7] can be used as a theoretical
foundation for unsupervised methods of WSD. It states that "meaning
of entities (...) is related to the restrictions on combinations of
these entities relative to other entities". In this context,
entities can be understood as words.
The main goal of this work is to compare various clustering
algorithms in the task of unsupervised Word Sense Disambiguation for
Polish data. In unsupervised WSD, the system groups the contexts of
a given word that express the same meaning, without providing
explicit sense labels for each group (e.g., without using a
dictionary) [8]. This work is also motivated by the fact that
clustering is important for a semi-supervised WSD algorithm called
Lexicographer Controlled Semi-automatic word Sense Disambiguation
(LexCSD) [9], [10]. So far, the selection of the clustering algorithm
used in LexCSD was motivated by the performance of the given
algorithm in other tasks and by its analytical properties, because
analyses of the performance of different clustering algorithms in
similar settings (i.e., using a similar dataset and features) for
Polish WSD are difficult to find.
² We will use the term context to denote a passage of text containing an ambiguous word.
³ Here we define a corpus as a collection of texts prepared for linguistic processing.
Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 25-32
ISBN 978-83-60810-27-9, ISSN 1896-7094
978-83-60810-27-9/09/$25.00 © 2010 IEEE
WSD data differ in a few ways from the data found in classical
applications of clustering. To name just a few: the distributions of
classes (senses) are skewed⁴; data is represented in spaces with a
very large number of dimensions (thousands or even hundreds of
thousands); for some classes only very specific features are
important, and these often overlap among classes; and sometimes it
is difficult to distinguish between two close classes.
The paper is organized as follows. First, the selected clustering
algorithms are briefly described. The evaluation section starts with
an analysis of the evaluation metrics used. Next, the corpus and
experimental settings are described. Section III-D discusses the
results. Section IV summarizes the performed experiments and
outlines directions for further work.
II. SELECTED ALGORITHMS FOR TESTING
For this work we have selected a few classical clustering
algorithms, trying to choose algorithms that represent different
approaches to the problem of clustering. We started with the K-means
and K-medoids algorithms, which represent simple, hard and flat
clustering methods. We chose the Growing Hierarchical
Self-Organising Map (GHSOM) as a representative of the family of
clustering methods based on neural networks. GHSOM is also a
hierarchical clustering algorithm. We experiment with standard
hierarchical clustering algorithms with different criterion
functions, from both the agglomerative and the divisive families of
algorithms. Last but not least, we also test a graph-based
clustering algorithm. We have reimplemented K-means, K-medoids and
GHSOM, and we use an existing implementation of the other
algorithms [11].
We are focusing on clustering for WSD, so we will use NLP-related
terminology when describing the algorithms. As WSD is a contextual
task, we will cluster contexts (text snippets) containing an
ambiguous word. Real-valued features are extracted from each
context, so a context is a vector of features $\vec{v}$ in a
high-dimensional space. We will use the terms context and context
vector interchangeably. The exact nature of the context and the
feature extraction process are described in Sec. III-B.
A. K-means and K-medoids
K-means is one of the simplest clustering algorithms. K-means
defines clusters by the centers of mass of the contexts being
clustered [12]. Those centres are represented as centroids.
Initially, random contexts are chosen as centroids. Then each
context is assigned to its most similar centroid. After this step,
new centroids are computed as the mean of all the contexts in each
group. This process is repeated until some stopping criterion is
reached, e.g., the number of iterations reaches a predefined
threshold or the clustering solution does not change significantly
between subsequent iterations.
⁴ Not all senses are represented in the data equally; the distribution of senses is biased towards a few frequent senses.
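The K-means loop described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: cosine similarity, the `seed` parameter and the stopping rule (no change in the assignment, or `max_iter` reached) are assumptions made for the example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(contexts, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initially, random contexts are chosen as centroids.
    centroids = [list(c) for c in rng.sample(contexts, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every context to its most similar centroid.
        new_assignment = [
            max(range(k), key=lambda j: cosine(ctx, centroids[j]))
            for ctx in contexts
        ]
        if new_assignment == assignment:
            break  # the clustering solution no longer changes
        assignment = new_assignment
        # Recompute each centroid as the mean of the contexts in its group.
        for j in range(k):
            members = [c for c, a in zip(contexts, assignment) if a == j]
            if members:
                centroids[j] = mean_vector(members)
    return assignment, centroids


# Hypothetical two-dimensional context vectors forming two obvious groups.
toy_contexts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_assignment, toy_centroids = k_means(toy_contexts, k=2)
```

The centroids returned here are artificial vectors: in general they do not coincide with any context in the data, which is the point of contrast with K-medoids.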
K-medoids is similar in concept to the K-means algorithm. The most
fundamental difference between the two algorithms is that K-medoids
uses real contexts from the dataset (medoids) as the basis for
clustering, in contrast to the centroids used in K-means (which are
artificial contexts). One of the realisations of K-medoids is an
approach called Partitioning Around Medoids, or PAM [13]. PAM starts
with a random selection of initial medoids. Then every swap of every
medoid with every other context is tested to see whether it
decreases the cost of the whole clustering solution. This approach
has a drawback in terms of computational complexity, which is
$O(k(n-k)^2)$, where $n$ is the number of contexts to cluster and
$k$ is the number of medoids. Thus a few extensions have been
proposed that, e.g., employ sampling (CLARA) or randomized search
(CLARANS) [13]. Nevertheless, we use classical PAM, as both
mentioned algorithms can have a negative impact on quality in
comparison to PAM. This approach is applicable in our experiments,
as we use relatively small datasets.
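The swap test at the heart of PAM can be sketched as below. This is a simplified variant under stated assumptions: it accepts the first improving swap instead of searching for the best swap in every pass, and the distance function `manhattan`, the `seed` parameter and the toy data are hypothetical.

```python
import random


def manhattan(u, v):
    """Sum of absolute coordinate differences (an assumed distance for this sketch)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def total_cost(contexts, medoid_ids, dist):
    """Cost of a solution: every context contributes its distance to the closest medoid."""
    return sum(min(dist(c, contexts[m]) for m in medoid_ids) for c in contexts)


def pam(contexts, k, dist, seed=0):
    rng = random.Random(seed)
    # Start with a random selection of initial medoids (indices into contexts).
    medoids = rng.sample(range(len(contexts)), k)
    best = total_cost(contexts, medoids, dist)
    improved = True
    while improved:
        improved = False
        # Test swapping every medoid with every non-medoid context.
        for position in range(k):
            for candidate in range(len(contexts)):
                if candidate in medoids:
                    continue
                trial = medoids[:position] + [candidate] + medoids[position + 1:]
                cost = total_cost(contexts, trial, dist)
                if cost < best:  # keep the swap only if it lowers the total cost
                    medoids, best, improved = trial, cost, True
    return medoids, best


# Hypothetical one-dimensional contexts forming two groups.
toy_points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
toy_medoids, toy_cost = pam(toy_points, k=2, dist=manhattan)
```

Each pass evaluates on the order of k(n-k) candidate swaps, and each evaluation scans all contexts, which is why the text prefers this method only for relatively small datasets.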
B. Growing Hierarchical Self-Organizing Map
The Growing Hierarchical Self-Organizing Map (GHSOM) [14] is a
natural extension of Kohonen's idea of Self-Organizing Maps (SOM)
[15]. SOM is an artificial neural network consisting of many
neurons. Every neuron holds a weight vector. SOM is trained in an
unsupervised manner using the "winner takes most" strategy. Every
feature vector is presented to the network input several times. For
every input vector, its similarity to the weight vector of each
neuron is computed. The weights of the most similar neuron (the
winner) and of its neighbourhood are updated to be even more similar
to the input pattern. The learning algorithm is constructed in such
a way that the neighbourhood and the degree of weight updating
decrease over time.
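A single "winner takes most" update can be sketched as follows. The choices below are assumptions made for illustration only: Euclidean distance to pick the winner, a Gaussian neighbourhood on the map grid, and a linear decay of the learning rate and radius in `train_som`.

```python
import math


def som_step(weights, grid, x, learning_rate, radius):
    """One SOM update for input x; weights[i] belongs to the neuron at grid[i]."""
    # The winner is the neuron whose weight vector is closest to the input.
    winner = min(
        range(len(weights)),
        key=lambda i: sum((w - a) ** 2 for w, a in zip(weights[i], x)),
    )
    for i, position in enumerate(grid):
        grid_dist2 = sum((p - q) ** 2 for p, q in zip(position, grid[winner]))
        # Neurons near the winner on the grid are updated more strongly.
        influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
        weights[i] = [
            w + learning_rate * influence * (a - w) for w, a in zip(weights[i], x)
        ]
    return winner


def train_som(weights, grid, data, epochs=10, lr0=0.5, radius0=1.0):
    """Present every vector several times while shrinking the rate and the radius."""
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        for x in data:
            som_step(weights, grid, x, lr0 * decay, max(radius0 * decay, 1e-3))
    return weights


# Hypothetical map with two neurons placed next to each other on the grid.
toy_weights = [[0.0, 0.0], [1.0, 1.0]]
toy_grid = [(0, 0), (0, 1)]
toy_winner = som_step(toy_weights, toy_grid, [0.9, 0.9], learning_rate=0.5, radius=1.0)
```

After this step the winner has moved halfway towards the input, while its neighbour has moved by a smaller amount, which is what distinguishes "winner takes most" from "winner takes all".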
GHSOM addresses one of the most important drawbacks of SOM: the a
priori definition of the map structure. Rauber et al. proposed an
algorithm for growing a SOM both in terms of the number of map
neurons and in terms of the hierarchy [14]. After the training stage
of the SOM, the mean quantization error of every neuron $i$
($mqe_i$) is calculated as the average distance of every context
recognised by the neuron $i$ to its weight vector. The average
$MQE_j$ for the whole map on level $j$ is computed, too. If
$MQE_j \geq \tau_1 \cdot MQE_{j-1}$, then an additional row or
column of neurons is added to the map and the training stage is
repeated. Otherwise, the $mqe_i$ of every neuron is compared to
$MQE_j$. If $mqe_i \geq \tau_2 \cdot MQE_{j-1}$, then another layer
of the map is created for the contexts recognised by the neuron $i$.
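The growth decision can be sketched as below. Several details are assumptions of this sketch rather than statements about the original algorithm: the threshold parameters are called `tau_1` and `tau_2`, the comparisons are taken to be "greater than or equal", and the map-level error is the plain average of the neuron errors.

```python
def mean_quantization_errors(weights, assigned, dist):
    """mqe_i of every neuron: average distance of its contexts to its weight vector."""
    return [
        sum(dist(c, w) for c in ctxs) / len(ctxs) if ctxs else 0.0
        for w, ctxs in zip(weights, assigned)
    ]


def growth_decision(mqe, mqe_parent, tau_1, tau_2):
    """Return ("grow", []) or ("expand", [indices of neurons that get a new layer])."""
    map_mqe = sum(mqe) / len(mqe)  # average error of the whole map on this level
    if map_mqe >= tau_1 * mqe_parent:
        # The map is still too coarse: add a row or column and retrain.
        return "grow", []
    # Otherwise, neurons with a large error get a new map on the next layer.
    return "expand", [i for i, e in enumerate(mqe) if e >= tau_2 * mqe_parent]


def absolute_difference(u, v):
    """Distance for one-dimensional toy vectors (an assumption of this sketch)."""
    return abs(u[0] - v[0])


# Hypothetical map with two neurons and the contexts each of them recognises.
toy_map_weights = [[0.0], [10.0]]
toy_assigned = [[[0.0], [1.0]], [[10.0], [10.2]]]
toy_errors = mean_quantization_errors(toy_map_weights, toy_assigned, absolute_difference)
decision_loose = growth_decision(toy_errors, mqe_parent=1.0, tau_1=0.2, tau_2=0.4)
decision_strict = growth_decision(toy_errors, mqe_parent=1.0, tau_1=0.5, tau_2=0.4)
```

With the smaller `tau_1` the map keeps growing horizontally; with the larger one growth stops and only the first neuron, whose error is high, is expanded into a new layer.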
-
C. Agglomerative and Divisive Clustering
Agglomerative and divisive clustering algorithms produce
hierarchical clustering trees called dendrograms. Agglomerative
clustering starts with each context in a separate cluster; then, in
each step, the two clusters that maximise the criterion function are
merged. Divisive algorithms, on the other hand, start with all
contexts in one cluster, which is repeatedly bisected according to
the criterion function. We use the existing implementation of
hierarchical algorithms from CLUTO⁵ [11]. We use the rbr variant of
the divisive algorithm, i.e., standard bisecting clustering is
employed and the result is further optimized according to the
criterion function [16].
The criterion function is a very important aspect of both
agglomerative and divisive clustering algorithms, as it drives the
whole process. There are many criterion functions available [17]. We
have tested the standard criterion functions used with agglomerative
algorithms, i.e., single link (slink), complete link (clink),
average link (upgma), and the weighted variants of single (wslink),
complete (wclink) and average link (wupgma).
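The three unweighted link criteria can be sketched inside a naive agglomerative loop. This is an illustration only and not the CLUTO implementation: the function names, the similarity function `negative_distance` and the toy data are hypothetical, and the weighted variants are left out.

```python
def cluster_similarity(a, b, sim, link):
    """Similarity of two clusters under the single, complete or average link criterion."""
    pair_sims = [sim(u, v) for u in a for v in b]
    if link == "slink":   # single link: the most similar pair decides
        return max(pair_sims)
    if link == "clink":   # complete link: the least similar pair decides
        return min(pair_sims)
    if link == "upgma":   # average link: mean over all cross-cluster pairs
        return sum(pair_sims) / len(pair_sims)
    raise ValueError("unknown link criterion: " + link)


def agglomerate(contexts, k, sim, link="upgma"):
    # Start with every context in its own cluster.
    clusters = [[c] for c in contexts]
    while len(clusters) > k:
        # Merge the pair of clusters that maximises the criterion function.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: cluster_similarity(clusters[p[0]], clusters[p[1]], sim, link),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def negative_distance(u, v):
    """Toy similarity for one-dimensional vectors: closer points are more similar."""
    return -abs(u[0] - v[0])


toy_line = [[0.0], [1.0], [5.0], [6.0]]
toy_clusters = agglomerate(toy_line, k=2, sim=negative_distance, link="upgma")
```

Recording the order of the merges instead of stopping at `k` clusters would yield the dendrogram; the loop is cubic or worse in the number of contexts and is meant only to make the criteria concrete.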
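The second group of criterion functions, comprising $I_1$, $I_2$,
$\mathcal{E}_1$, $G_1$, $G'_1$, $H_1$ and $H_2$, can be used with
both agglomerative and divisive algorithms. The exact forms of those
functions are given by [11]:
$$I_1 = \text{maximize} \sum_{i=1}^{k} \frac{1}{n_i} \Big( \sum_{\vec{v}, \vec{u} \in S_i} \mathrm{sim}(\vec{v}, \vec{u}) \Big) \tag{1}$$
$$I_2 = \text{maximize} \sum_{i=1}^{k} \sqrt{\sum_{\vec{v}, \vec{u} \in S_i} \mathrm{sim}(\vec{v}, \vec{u})} \tag{2}$$
$$\mathcal{E}_1 = \text{minimize} \sum_{i=1}^{k} n_i \frac{\sum_{\vec{v} \in S_i, \vec{u} \in S} \mathrm{sim}(\vec{v}, \vec{u})}{\sqrt{\sum_{\vec{v}, \vec{u} \in S_i} \mathrm{sim}(\vec{v}, \vec{u})}} \tag{3}$$
$$G_1 = \text{minimize} \sum_{i=1}^{k} \frac{\sum_{\vec{v} \in S_i, \vec{u} \in S} \mathrm{sim}(\vec{v}, \vec{u})}{\sum_{\vec{v}, \vec{u} \in S_i} \mathrm{sim}(\vec{v}, \vec{u})} \tag{4}$$
$$G'_1 = \text{minimize} \sum_{i=1}^{k} n_i^2 \frac{\sum_{\vec{v} \in S_i, \vec{u} \in S} \mathrm{sim}(\vec{v}, \vec{u})}{\sum_{\vec{v}, \vec{u} \in S_i} \mathrm{sim}(\vec{v}, \vec{u})} \tag{5}$$
$$H_1 = \text{maximize} \frac{I_1}{\mathcal{E}_1} \tag{6}$$
$$H_2 = \text{maximize} \frac{I_2}{\mathcal{E}_1}, \tag{7}$$
where $k$ is the total number of clusters, $S$ is the set of all
contexts to cluster, $S_i$ is the set of contexts assigned to the
$i$-th cluster, $n_i = |S_i|$, and $\mathrm{sim}(\vec{v}, \vec{u})$
is the similarity between two context vectors $\vec{v}$ and
$\vec{u}$.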
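To make the criterion functions concrete, the sketch below evaluates them for a given clustering. It follows our reading of the definitions in the CLUTO documentation [11], in particular the square roots in $I_2$ and $\mathcal{E}_1$, which should be checked against that source. The dot-product similarity and the toy vectors are assumptions of the example.

```python
import math


def dot(u, v):
    """Dot product, used here as the similarity of two context vectors."""
    return sum(a * b for a, b in zip(u, v))


def criterion_values(clusters, sim):
    """Values of I1, I2, E1, G1, G1', H1 and H2 for a list of clusters.

    Assumes that every cluster has a positive sum of within-cluster similarities.
    """
    everything = [c for cluster in clusters for c in cluster]
    i1 = i2 = e1 = g1 = g1_prime = 0.0
    for cluster in clusters:
        n_i = len(cluster)
        # Sum of similarities over all ordered pairs inside the cluster.
        within = sum(sim(v, u) for v in cluster for u in cluster)
        # Sum of similarities between the cluster and the whole collection.
        to_all = sum(sim(v, u) for v in cluster for u in everything)
        i1 += within / n_i
        i2 += math.sqrt(within)
        e1 += n_i * to_all / math.sqrt(within)
        g1 += to_all / within
        g1_prime += n_i ** 2 * to_all / within
    return {
        "I1": i1, "I2": i2, "E1": e1, "G1": g1, "G1'": g1_prime,
        "H1": i1 / e1, "H2": i2 / e1,
    }


# Hypothetical unit vectors: a clean split and a mixed split of the same contexts.
clean_split = [[(1.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (0.0, 1.0)]]
mixed_split = [[(1.0, 0.0), (0.0, 1.0)], [(1.0, 0.0), (0.0, 1.0)]]
clean_scores = criterion_values(clean_split, dot)
mixed_scores = criterion_values(mixed_split, dot)
```

The clean split scores higher on the functions that are maximized ($I_1$, $I_2$, $H_1$, $H_2$) and lower on those that are minimized ($\mathcal{E}_1$, $G_1$, $G'_1$), as expected.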
D. Graph Partitioning Based Clustering
We use the implementation of the min-cut graph partitioning
algorithm from CLUTO [11]. This algorithm starts by creating a
neighbourhood graph based on the similarities between contexts and
then applies min cut to partition the graph into disjoint regions.
Min cut looks for a partition in which the total weight of the edges
cut by the partition is minimal.
This approach achieved high quality in research on semi-automatic
extension of the Polish WordNet [18] and was also used in Polish WSD
in a weakly supervised setting using the LexCSD algorithm [10].
⁵ CLUTO is a free software package implementing several clustering algorithms, including partitioning, agglomerative and graph-based ones. Available at: http://glaros.dtc.umn.edu/gkhome/views/cluto/
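The objective of graph partitioning can be illustrated with a small sketch that builds a nearest-neighbour graph and measures the weight of the edges cut by a given partition. It does not reproduce the partitioning algorithm in CLUTO; the function names, the parameter `n_neighbours` and the similarity function are hypothetical.

```python
def neighbourhood_graph(contexts, sim, n_neighbours):
    """Undirected graph linking every context to its most similar contexts.

    Returns a dict that maps an edge (i, j) with i < j to its similarity weight.
    The similarity function is assumed to be symmetric.
    """
    edges = {}
    for i, c in enumerate(contexts):
        ranked = sorted(
            (j for j in range(len(contexts)) if j != i),
            key=lambda j: sim(c, contexts[j]),
            reverse=True,
        )
        for j in ranked[:n_neighbours]:
            edges[(min(i, j), max(i, j))] = sim(c, contexts[j])
    return edges


def cut_weight(edges, region_of):
    """Total weight of the edges whose endpoints lie in different regions."""
    return sum(w for (i, j), w in edges.items() if region_of[i] != region_of[j])


def inverse_distance(u, v):
    """Toy similarity for one-dimensional vectors."""
    return 1.0 / (1.0 + abs(u[0] - v[0]))


toy_nodes = [[0.0], [1.0], [10.0], [11.0]]
toy_edges = neighbourhood_graph(toy_nodes, inverse_distance, n_neighbours=1)
good_cut = cut_weight(toy_edges, [0, 0, 1, 1])
bad_cut = cut_weight(toy_edges, [0, 1, 0, 1])
```

A partition that keeps similar contexts together cuts no edges here, while a partition that separates them cuts every edge; a min-cut procedure searches for the former kind of solution.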
III. EXPERIMENTS
A. Evaluation Measures
Clustering algorithms can be evaluated in many ways [19]. Some of
them are based on external criteria, i.e., on comparing the
resulting clustering solution with pre-existing categories that were
created manually. On the other hand, one can use internal criteria,
without resorting to a gold standard clustering. The most important
drawback of evaluation using internal criteria is that a good score
does not always correspond to good clustering results in a given
application [20]. As we have developed a semantically annotated
corpus (SCWSD, see Sec. III-B), we can use it for evaluation. The
problem with SCWSD is its small size, so there is a risk that SCWSD
does not capture all of the peculiarities and biases of large
corpora.⁶
We used several measures for evaluation to capture different aspects
of the created groups. To measure how homogeneous the clusters are,
we used Purity:
$$\mathrm{Purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|, \tag{8}$$
where $\Omega = \{\omega_1, \omega_2, \ldots, \omega_k\}$ is the set
of clusters, $C = \{c_1, c_2, \ldots, c_j\}$ is the set of
pre-existing categories and $N$ is the total number of contexts. In
our setting, each category in $C$ is a set of contexts in which the
ambiguous word is annotated with the same sense.
$\mathrm{Purity}(\Omega, C) \in [0, 1]$, where 1 is the best case. A
drawback of Purity is its preference for solutions with a large
number of groups: assigning every context to a singleton cluster
gives a Purity of 1 [20].
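Purity can be computed directly from two parallel label lists, as in the minimal sketch below. The function name and the toy labels are hypothetical; the second call illustrates the singleton-cluster drawback noted above.

```python
from collections import Counter


def purity(cluster_ids, sense_labels):
    """Fraction of contexts that carry the majority sense of their cluster."""
    if len(cluster_ids) != len(sense_labels) or not cluster_ids:
        raise ValueError("expected two non-empty label lists of equal length")
    senses_in_cluster = {}
    for cluster, sense in zip(cluster_ids, sense_labels):
        senses_in_cluster.setdefault(cluster, Counter())[sense] += 1
    # For every cluster, count the contexts annotated with its most frequent sense.
    majority_total = sum(max(counts.values()) for counts in senses_in_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical clustering of six contexts annotated with two senses.
toy_cluster_ids = [0, 0, 0, 1, 1, 1]
toy_senses = ["a", "a", "b", "b", "b", "a"]
toy_purity = purity(toy_cluster_ids, toy_senses)
singleton_purity = purity(list(range(6)), toy_senses)
```

The two mixed clusters reach a purity of 4/6, whereas putting every context into its own cluster trivially reaches 1, which is why Purity should not be read in isolation.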
The Rand Index measures accuracy on the basis of the decisions made
for each pair of contexts. If we use TP for true positives, TN for
true negatives, FN for false negatives and FP for false positives,
the Rand Index is given by the following equation:
$$RI = \frac{TP + TN}{TP + FP + FN + TN} \tag{9}$$
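The pair-based decisions behind the Rand Index can be counted as in the following sketch. It assumes the usual reading of the four outcomes: a pair of contexts is a positive decision when both are placed in the same cluster, and the decision is correct when this matches their sense annotation. The function names and labels are hypothetical.

```python
from itertools import combinations


def pair_counts(cluster_ids, sense_labels):
    """Count TP, FP, FN and TN over all unordered pairs of contexts."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(cluster_ids)), 2):
        same_cluster = cluster_ids[i] == cluster_ids[j]
        same_sense = sense_labels[i] == sense_labels[j]
        if same_cluster and same_sense:
            tp += 1   # same sense, correctly put together
        elif same_cluster:
            fp += 1   # different senses, wrongly put together
        elif same_sense:
            fn += 1   # same sense, wrongly separated
        else:
            tn += 1   # different senses, correctly separated
    return tp, fp, fn, tn


def rand_index(cluster_ids, sense_labels):
    if len(cluster_ids) != len(sense_labels) or len(cluster_ids) < 2:
        raise ValueError("expected two label lists of equal length with at least two items")
    tp, fp, fn, tn = pair_counts(cluster_ids, sense_labels)
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical clustering of six contexts annotated with two senses.
pair_cluster_ids = [0, 0, 0, 1, 1, 1]
pair_senses = ["a", "a", "b", "b", "b", "a"]
toy_counts = pair_counts(pair_cluster_ids, pair_senses)
toy_rand_index = rand_index(pair_cluster_ids, pair_senses)
```

Six contexts give 15 pairs; here 7 of the pair decisions are correct, so the Rand Index is 7/15.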
One of the drawbacks of using $RI$ for evaluation is the equal
treatment of false positives and false negatives. Using the
decisions for context pairs, we can also use the standard measures
of information retrieval, i.e., precision $P$, recall $R$ and the
harmonic mean of precision and recall $F$:
6On the other hand, the total size of the dat