Knowledge Organization Research in the last two decades: 1988-2008
Fidelia Ibekwe-SanJuan Eric SanJuan
Outline
• Previous work
• Goal
• Data collection
• Analysis methodology
• Results
• Discussion
Previous works
• Trend surveys in KO:
– McIlwaine & Williamson (1999)
– McIlwaine (2003)
– Hjorland & Albrechtsen (1999)
– Lopez-Huertas (2008)
– Saumure & Shiri (2008)
– Smiraglia (2009)
Previous works
• Personal readings of journals & ISKO proceedings
• Query: was a query constructed and submitted to a database in order to retrieve records?
• Publications: reading / perusing of full texts?
• Records: bibliographic records (titles & abstracts)
Author | Period | Data source | Query | Size
McIlwaine & Williamson (1999) | 1988-1998 | personal readings* | - | 575 publications
McIlwaine (2003) | 1998-2003 | personal readings | - | -
Lopez-Huertas (2008) | 1989-1998? | WoS (SSCI) + LISA + personal readings | knowledge organization + others | -
Saumure & Shiri (2008) | 1966-2006 | LISTA | information organization, knowledge organization | 219 records
Previous works
• Major findings:
– 1998-2003: McIlwaine & Williamson (1999); McIlwaine (2003)
• Classification schemes (UDC, DDC, LCSH, ...)
• Bias in classification (gender, culture)
• Interoperability of KO vocabularies
• Rise of Internet technology, search engines, impact on KO
• Resource discovery
• Emerging trends in expert systems (NLP, ontologies, automatic indexing...)
• Terminology management problems
• Thesauri design
• Information visualisation in online context
Previous works
• Major findings:
– 1989-1998?: Lopez-Huertas (2008)
• Mainstream research topics in KO are reformulations of old problems (classification, thesauri)
• Recasting them in the web era gives them a new life!
• Especially since KO is more & more entwined with sister fields
• 2 major driving forces of research in KO:
– demand for quality & interoperability in a multilingual, multicultural world
– managing emergent knowledge in KOS in the semantic web era
• Both are reformulations of the multidimensionality of knowledge
• Necessitating an inter- and multi-disciplinary effort
• etc...
Previous works
• Major findings:
– 1966-2006 (40 yrs!), pre & post-web era: Saumure & Shiri (2008)
• Organizing corporate or business information
• Machine-assisted knowledge organization
• Information professionals
• Interoperability
• Cataloging and classification
• Classifying the web
• Digital preservation and digital libraries
• Metadata applications and uses
• Cognition
• Education
• Indexing and abstracting
• Thesauri initiatives
Previous works
• Major findings:
– Saumure & Shiri (2008): 1966-2006 (40 yrs!), pre & post-web era
• Trends between the pre-web (<1993, date of the first browser, Mosaic) and post-web eras
• KO research focused throughout on mainstream topics: cataloguing, classification
• Pre-web era: more focused on indexing and cataloguing
• Post-web era: metadata generation & harvesting, interoperability, thus more technological thrust
Previous works
• Summary
– Despite methodological differences in data collection and analysis methods
– Important overlaps in findings
– Mainstream research is still driving KO (classification research, cataloguing, thesauri, bias, ...)
– Reformulations in the web era (interoperability, metadata creation & harvesting, assisted indexing & retrieval, terminology issues, ...)
Goal
• Trends survey of research on KO issues over past 2 decades (1988-2008), 21 yrs.
• What can we get from automatic data analysis methods?
• Can they provide any useful insight?
Goal
• Epistemology:
– Empiricism (how): methodology - observation of evidence from data
– Pragmatism (why): is it useful and for whom?
• Some connection with bibliometrics but focus is not on mapping authors but on mapping contents
• Methodological difference with mainstream data analysis techniques: symbolic (linguistic & terminological) vs bag-of-words approach
Data collection (1)
• Issue:
– ISKO proceedings: not indexed in a machine-processable format (database)
– No problem for peer-reviewed journals...
• But ambiguity of the KO concept!
– At the end of the day... a manual selection of KO & LIS-related journals
– Records downloaded from Web-of-Science (WoS)
Data collection (2)
• List of 31 selected journals at http://fidelia1.free.fr/isko2010/data/list-journals.pdf
• 931 records, of which 838 came from KO & its ancestor (International Classification)
• 45 000 words in titles & abstracts
• Research trends will portray mostly publications from the KO journal.
• Not the entire realm of publications on KO but we had to be content with that...
Sample record from ISI-WoS
PT J
AU RADA, R
   ROSSIMORI, A
   PATON, R
   RECTOR, A
   MAGLIANI, F
   ROBBE, PD
TI THE GALEN DREAM
SO INTERNATIONAL CLASSIFICATION
AB Outlines the origin, needs and principles of GALEN, the Generalized Architecture for Languages, Encyclopedias, and Nomenclatures as applicable to Medicine. Short-term and long-term plans of GALEN have been elaborated to cope with possible developments. ''Milestones'' are given indicating what should be reached when and how much funding will be required for each milestone. In two ''vision'' pictures the situation before and after the introduction of GALEN is shown and the responsibilities at 4 different levels are listed.
SN 0340-0050
PY 1992
VL 19
IS 4
BP 188
EP 191
UT ISI:A1992KH33900002
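A record in this tagged format can be read into a simple structure before any analysis. The sketch below is only an illustration: the function name `parse_wos_record` is hypothetical, and it assumes each field starts with a two-character tag followed by a space, with continuation lines indented (as in the author list).

```python
def parse_wos_record(text):
    """Parse one tagged record into a dict: field tag -> list of values.

    Assumption: a field line starts with a two-character upper-case tag
    and a space; indented lines continue the previous field.
    """
    record = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        head = line[:2]
        is_tag_line = (
            len(head) == 2
            and head.isalnum()
            and head.isupper()
            and (len(line) == 2 or line[2] == " ")
        )
        if is_tag_line:
            tag = head
            record.setdefault(tag, []).append(line[3:].strip())
        elif tag is not None:
            # continuation of the previous field (e.g. further authors)
            record[tag].append(line.strip())
    return record


SAMPLE = """PT J
AU RADA, R
   ROSSIMORI, A
TI THE GALEN DREAM
SO INTERNATIONAL CLASSIFICATION
PY 1992
"""

sample_record = parse_wos_record(SAMPLE)
```

Titles (TI) and abstracts (AB) are the fields used for term extraction; PY gives the publication year.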
Analysis methodology (1)
Empirical observations of how terminology depicts knowledge artefacts (titles & abstracts)
– Terminology engineering
Descriptive text data analysis (automatically propose a partition of the data)
– Hierarchical agglomerative clustering
– Mapping & visualisation:
– Multidimensional view of domain structure: symbolic & numerical information
• TermWatch system (SanJuan & Ibekwe-SanJuan 2006)
Analysis methodology (2)
- Corpus split into 2 periods
* 1988-1997
* 1998-2008
- Terminology modeling
* Automatic extraction of terms
* Term variant search
- Clustering by semantic relations
- Linking clusters by co-occurrence
- Mapping & visualization
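The first step, splitting the corpus into the two periods, could look roughly like this. It is a minimal sketch, assuming each record is a dict with a publication year under the key "PY"; the helper name `split_by_period` is hypothetical.

```python
# The two periods named on the slide.
PERIODS = [(1988, 1997), (1998, 2008)]


def split_by_period(records, periods=PERIODS):
    """Group records by the period containing their publication year."""
    buckets = {period: [] for period in periods}
    for record in records:
        year = int(record["PY"])  # assumption: year stored under "PY"
        for start, end in periods:
            if start <= year <= end:
                buckets[(start, end)].append(record)
                break  # records outside both periods are simply dropped
    return buckets


demo_buckets = split_by_period([{"PY": "1992"}, {"PY": 2003}, {"PY": 1980}])
```

Each period's records are then analysed separately, so that the two resulting maps can be compared.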
Analysis methodology (3)
- Terminology modeling
* Automatic extraction of terms
  * surface morpho-syntactic properties of terms
  * rule implementation
  * extraction of likely candidates
  * filtering: statistical measures or manual
  * Problem: statistical measures only work well on massive data
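To give a feel for rule-based candidate extraction, here is a deliberately crude sketch. It is not the TermWatch method: instead of morpho-syntactic rules over part-of-speech tags, it assumes a small hand-made stopword list and keeps the multi-word sequences found between stopwords and punctuation. The names `STOPWORDS` and `candidate_terms` are hypothetical.

```python
import re

# Assumption: a tiny illustrative stopword list, not a linguistic resource.
STOPWORDS = {
    "the", "a", "an", "of", "and", "or", "in", "on", "for", "to", "is",
    "are", "as", "by", "with", "that", "be", "has", "have", "been",
    "what", "when", "how", "at", "each", "will", "from", "this", "its",
}


def candidate_terms(text, min_words=2):
    """Return multi-word sequences delimited by stopwords or punctuation."""
    candidates = []
    for segment in re.split(r"[.,;:!?()]+", text.lower()):
        current = []
        for word in segment.split():
            if word in STOPWORDS:
                if len(current) >= min_words:
                    candidates.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if len(current) >= min_words:
            candidates.append(" ".join(current))
    return candidates


demo_terms = candidate_terms(
    "The design of a universal classification scheme for the web."
)
```

The candidates would still need filtering, manually or statistically, as noted on the slide.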
Analysis methodology (4)
- Terminology modeling
* Term variant search
  * surface morpho-syntactic operations between terms
  * spelling variants (WordNet)
  * synonyms (USE/UF) (WordNet)
  * likely BT/NT candidates: syntactic information
  * likely RT: lexico-syntactic information
  * some errors and noise
  * but automation always involves a trade-off!
Analysis methodology (5)
• Some term variants acquired
Paradigmatic organization (BT/NT):
  classification scheme
    universal classification scheme
    generic classification scheme
    knowledge classification scheme
Library of Congress – LC (USE/UF)
knowledge organisation scheme ↔ knowledge organization tool (RT)
• The system does not tag these relations as such
• They are assumed to be implied by the variations
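One of these variations, adding modifiers to the left of an existing term, can be detected from the word sequences alone. The sketch below is a simplified illustration, not the system's actual rule set; `expansion_pairs` is a hypothetical name, and terms are assumed to be whitespace-separated strings.

```python
def expansion_pairs(terms):
    """Return (shorter, longer) pairs where the longer term is the shorter
    one with extra modifier words added on its left."""
    tokenized = sorted({tuple(term.lower().split()) for term in terms})
    pairs = []
    for shorter in tokenized:
        for longer in tokenized:
            if (
                shorter
                and len(longer) > len(shorter)
                and longer[-len(shorter):] == shorter
            ):
                pairs.append((" ".join(shorter), " ".join(longer)))
    return pairs


demo_pairs = expansion_pairs([
    "classification scheme",
    "universal classification scheme",
    "generic classification scheme",
    "knowledge classification scheme",
    "knowledge organization tool",
])
```

Each pair is a likely BT/NT candidate: the shorter term is the more generic one. As on the slide, the relation is implied by the variation, not asserted.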
Analysis methodology (6)
• Assumptions behind terminology modeling
• Consensus from studies on terminology/lexicography: new terms (denominations of concepts) are mostly created from existing terms
• Terms are rarely created ex nihilo
• Surface linguistic operations reveal semantic (conceptual?) relations between domain concepts
• Studying these operations and visualising how they relate terms reveals the conceptual structure of a domain
Analysis methodology (7)
• Clustering
• 3-tier process:
1st: group terms by close semantic relations
2nd: hierarchical clustering by lesser semantic relations (many iterations)
3rd: link cluster labels by co-occurrence of labels or that of their variants
• Visualisation
– Thematic maps (Pajek)
– Navigation interface (browser)
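The first and third tiers can be sketched in a few lines. This is a simplification under stated assumptions, not the TermWatch algorithm: tier 1 is approximated by connected components over variation links, the iterative hierarchical step (tier 2) is left out, and tier 3 counts the documents in which terms from two clusters co-occur. The names `group_terms` and `cooccurrence_links` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def group_terms(links):
    """Tier 1 (approximation): connected components over variation links."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path compression
            term = parent[term]
        return term

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for term in parent:
        clusters[find(term)].add(term)
    return list(clusters.values())


def cooccurrence_links(clusters, documents):
    """Tier 3: for each pair of clusters, count documents containing
    at least one term from both."""
    counts = defaultdict(int)
    for doc_terms in documents:
        present = [i for i, c in enumerate(clusters) if c & set(doc_terms)]
        for i, j in combinations(present, 2):
            counts[(i, j)] += 1
    return dict(counts)


demo_clusters = group_terms([
    ("classification scheme", "universal classification scheme"),
    ("classification scheme", "generic classification scheme"),
    ("thesaurus", "thesaurus construction"),
])
demo_links = cooccurrence_links(demo_clusters, [
    {"universal classification scheme", "thesaurus construction"},
    {"thesaurus"},
])
```

The weighted cluster pairs form a graph that a tool such as Pajek can lay out as a thematic map.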
Results (1)
Results (2)
Main topics for period 1 (1988-1997)
– Global structure: typical « core - peripheral » layout
– Knowledge is the structuring pole
– Classification
– Subjects gravitating around the Knowledge pole:
• analysis
• online vocabulary control standardization
• bibliographic information system
• indexing (automatic & manual)
• thesaurus construction and usage
• information documentation system
• translation
Results (3)
In the last decade (1998-2008):
• Research network is much more intertwined
• No one center but several « core » issues connected to one another
• Major topics are intertwined:
KO issues ↔ classification ↔ information theoretic ↔ indexing language ↔ user evaluation
• Newer topics: web issues, metadata, knowledge discovery, computer algorithm, ...
Results (4)
1998-2008, equal divide between:
– theoretical research
• information science, concept, classification theory, epistemological foundation, ...
– user-oriented studies
• user librarian, user-defined descriptor, user evaluation
– mainstream KO issues
• classification, thesaurus, KO, term selection
– technology-oriented handling of KO issues
• knowledge, system, transfer, knowledge representation, knowledge engineering, knowledge discovery, information processing, computer algorithm...
• web, web designer, web document
• information retrieval, terminology structuring, metadata, metadata quality
Discussion
• Evaluation of clusters: information-theoretic problem. No solution.
• No gold standard
• Goal of the method: precisely to propose a partition amongst the data
• Is it the best one?
• Reliance on external criteria: human (expert) evaluation
• So response from the community needed!
• Thank you for listening