Leuven, 2007-05-22
Computer Aided Document Indexing System for Accessing Legislation
A Joint Venture of Flanders and Croatia
Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb
Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb
Marie-Francine Moens, Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven
Talk overview
document indexing and computer aided document indexing
project AIDE
CADIS workstation: features
project CADIAL
eCADIS workstation: additional features
machine learning techniques
future developments
conclusions
Computer Aided Document Indexing
document indexing
attachment of descriptors from a controlled thesaurus to a document
descriptors = labels representing the content of a document
necessary for document retrieval in many document collections
parliamentary documentation
legislation
technical documentation
…
usually done manually
tedious, error prone, slow (max. 30-40 documents/day)
could computers be of any help in this process?
if we build a Computer Aided Document Indexing System (CADIS)
Project AIDE in Croatia
idea for a project
September 2004
interdisciplinary collaboration of 3 institutions
Croatian Information Documentation Referral Agency (HIDRA)
Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb
AIDE – collaborating institutions
HIDRA
collecting, processing, providing public access and promotion of the official documentation of the Republic of Croatia
coordinator Maja Cvitaš, M.A.
ZEMRIS
research in the field of artificial intelligence, neural networks, machine learning, data and text mining
coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder, M.Sc.
ZZL
computational linguistic research and building language technologies for Croatian
coordinator prof. Marko Tadić
AIDE – project objective
Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus
AIDE – how?
AIDE = Automatic Indexing of Documents with Eurovoc
automatic indexing, how?
a program which “learns to index” documents
conference at the Joint Research Centre of the EC (JRC), Ispra, Italy, 2004-09
at least 10,000 manually indexed documents
3-5 descriptors per document
10-15 documents per descriptor
indexed documents stored in XML format
Steinberger (2003)
compiling a corpus of manually indexed Croatian documents for machine learning of automatic indexing with Eurovoc descriptors
situation with Croatian documentation in 2004-09
only a few hundred documents had been indexed
manual indexing: painfully slow
how could we speed up the manual indexing?
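The corpus figures on this slide (3-5 descriptors per document, 10-15 documents per descriptor) can be checked on any manually indexed collection. A minimal sketch, assuming the corpus is available as a mapping from document id to its list of descriptors; the function name `corpus_stats` and the sample data are hypothetical and not part of AIDE or CADIS.

```python
from collections import Counter


def corpus_stats(corpus):
    """Return (average descriptors per document, average documents per descriptor)."""
    docs_per_descriptor = Counter()
    total_assignments = 0
    for descriptors in corpus.values():
        unique = set(descriptors)          # ignore accidental repeats within a document
        total_assignments += len(unique)
        docs_per_descriptor.update(unique)
    avg_desc_per_doc = total_assignments / len(corpus)
    avg_docs_per_desc = total_assignments / len(docs_per_descriptor)
    return avg_desc_per_doc, avg_docs_per_desc


# Hypothetical toy corpus: document id -> assigned descriptors
sample = {
    "doc1": ["agriculture", "fisheries", "trade"],
    "doc2": ["trade", "customs"],
    "doc3": ["agriculture", "trade", "customs", "taxation"],
}
per_doc, per_desc = corpus_stats(sample)
print(per_doc, per_desc)
```

On a real collection, both averages would be compared with the ranges quoted above before training.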
AIDE – activities
investigate and develop algorithms in the field of computational linguistics/language technologies
include that knowledge into the Computer Aided Document Indexing System (CADIS)
demonstration of CADIS in the European Parliament (2006-03-10)
CADIS features
Enhanced user interface
list of descriptors appearing literally in the document
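Listing descriptors that occur verbatim in a document amounts to a phrase lookup against the thesaurus. A minimal sketch, assuming plain case-insensitive whole-word matching; the function name and the tiny thesaurus are hypothetical, and this is not the CADIS implementation. For an inflected language such as Croatian, matching on lemmas rather than surface forms would presumably be needed, which this sketch ignores.

```python
import re


def find_literal_descriptors(text, descriptors):
    """Return the descriptors that occur verbatim in text (case-insensitive, whole words)."""
    found = []
    for descriptor in descriptors:
        # Lookarounds keep "tax" from matching inside "taxation".
        pattern = r"(?<!\w)" + re.escape(descriptor) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(descriptor)
    return found


# Hypothetical thesaurus excerpt and document text
thesaurus = ["customs duties", "fisheries policy", "import", "tax"]
document = "The regulation amends customs duties on the import of fish products."
matches = find_literal_descriptors(document, thesaurus)
print(matches)
```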
CADIS features
Integration of corpus analysis
greyed n-grams are statistically relevant in the corpus, i.e. collocations
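The slides do not say which association measure marks an n-gram as statistically relevant. As an illustration only, the sketch below scores adjacent word pairs with pointwise mutual information (PMI), one common collocation measure; the function name, the `min_count` threshold, and the sample text are assumptions.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:              # rare pairs give unreliable PMI estimates
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical sample text
tokens = ("the official gazette publishes the law and "
          "the official gazette archives the law").split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Pairs whose parts rarely occur apart, such as "official gazette" here, score higher than pairs that include a frequent function word.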
CADIS features
Manual marking of significant n-grams
important step towards further refinement of automatic indexing
AIDE – activities
ca. 10,000 Croatian documents indexed in HIDRA using the CADIS workstation during 2006
joint project proposal with Katholieke Universiteit Leuven for the CADIAL project
CADIAL project
Computer Aided Document Indexing for Accessing Legislation
a joint Flemish-Croatian project
Department International Flanders, grant no. KRO/009/06
partners:
Katholieke Universiteit Leuven (prof. Marie-Francine Moens)
University of Zagreb, HIDRA (prof. Bojana Dalbelo Bašić)
started: 2007-03
duration: 2 years
web: www.cadial.org
the goal: a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia
a new version of CADIS (eCADIS) is one of the modules in this project, planned as a web-based service
CADIAL project 2
used the 10,000 manually indexed documents to train the system for automatic indexing of documents in Croatian
used the 20,000 manually indexed documents from Acquis to train the system for automatic indexing of documents in English
included that training data into the next version: eCADIS (-version)
eCADIS () features
Automatic suggestion of relevant descriptors, i.e. automatic indexing
application of machine learning techniques
eCADIS () features
Manual marking of inappropriate suggestions
another step in further refinement of automatic indexing
eCADIS () on document in English
Automatic suggestion of relevant descriptors, i.e. automatic indexing
Training the classifiers
already existing classifiers
profile classifier (Steinberger 2003)
K-nearest neighbours
binary classifiers
SVM, Logistic Regression, Rocchio, Bayes, …
classifiers used for the preliminary training
ca. 3,500 independent binary classifiers
need to be further evaluated
Logistic Regression used for 10,000 documents in Croatian
SVM used for 20,000 documents in English
features: tokens, lemmas, stems, character n-grams
various feature selection methods and their combinations: χ², IG (information gain), MI (mutual information)…
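Assuming the first feature selection method listed is the chi-square statistic, the sketch below scores each term against one descriptor, i.e. for one of the independent binary classifiers. It uses the standard 2x2 contingency formula; the function name and toy data are hypothetical, and the slides do not describe how eCADIS implements or combines the methods.

```python
def chi_square(docs, labels, term):
    """Chi-square statistic between a term's presence and a binary descriptor label."""
    a = b = c = d = 0
    for features, positive in zip(docs, labels):
        has_term = term in features
        if has_term and positive:
            a += 1                         # term present, descriptor assigned
        elif has_term:
            b += 1                         # term present, descriptor not assigned
        elif positive:
            c += 1                         # term absent, descriptor assigned
        else:
            d += 1                         # term absent, descriptor not assigned
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:                         # term or label is constant: no information
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical toy data: each document is a set of tokens; labels mark the
# documents that carry one descriptor (say, a fisheries-related one).
docs = [{"fish", "quota", "sea"}, {"fish", "vessel"},
        {"tax", "income"}, {"tax", "customs", "sea"}]
labels = [True, True, False, False]

vocabulary = sorted(set().union(*docs))
ranked = sorted(vocabulary, key=lambda t: chi_square(docs, labels, t), reverse=True)
print(ranked[:2])
```

Chi-square rewards negative evidence as much as positive evidence, so "tax" ranks alongside "fish" in this toy example. Keeping only the top-ranked terms per descriptor would then give each binary classifier a reduced feature set.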
Further development of eCADIS
training with new features and feature selection methods
collocations, word n-grams, chunks
new measures for evaluation of results
sensitive to thesaurus hierarchy
web-interface for eCADIS for inclusion into the CADIAL system
eCADIS for other languages
now only Croatian and English (-version) covered
usable for other languages as it is, but less efficient without the linguistic module
no list of lemmas, only types
poor statistics for n-grams
cooperation with language technology experts in different languages for development of linguistic modules
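The slide does not name the hierarchy-sensitive evaluation measures. One possible reading, shown here as an assumption, is hierarchical precision and recall: both the suggested and the manually assigned descriptor sets are expanded with their broader terms before comparison, so a near miss within the same branch of the thesaurus earns partial credit. The function names and the toy hierarchy are hypothetical, and each descriptor is assumed to have at most one broader term.

```python
def with_ancestors(descriptors, parent):
    """Expand a descriptor set with all broader terms reachable through `parent`."""
    expanded = set()
    for descriptor in descriptors:
        current = descriptor
        while current is not None and current not in expanded:
            expanded.add(current)
            current = parent.get(current)
    return expanded


def hierarchical_prf(predicted, gold, parent):
    """Precision, recall and F1 computed on ancestor-expanded descriptor sets."""
    p = with_ancestors(predicted, parent)
    g = with_ancestors(gold, parent)
    overlap = len(p & g)
    precision = overlap / len(p) if p else 0.0
    recall = overlap / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical thesaurus fragment: descriptor -> broader term (None = top term)
parent = {
    "sea fishing": "fisheries",
    "fish farming": "fisheries",
    "fisheries": None,
    "customs duties": "trade",
    "trade": None,
}
result = hierarchical_prf({"fish farming"}, {"sea fishing"}, parent)
print(result)
```

A flat comparison would score this suggestion as a complete miss, whereas the expanded sets share the broader term "fisheries".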
Further development of eCADIS …
eCADIS for other languages
training the automatic indexing system for other languages
enables automatic suggestions of relevant descriptors in new, unseen documents
analysis of manual markings: descriptors, word n-grams, suggestions
promote the use of eCADIS in other countries beyond the scope of the CADIAL project
e.g. Belgium (Flanders)
linguistic module for Dutch and French needed
computational linguistics expertise
training data from Acquis can be used to make an automatic indexing system for Dutch and French
machine learning expertise
Conclusion
CADIAL
a joint Flemish-Croatian project sponsored by the Flemish government
better public access to Croatian official documentation
faster and improved document indexing
automatic content metadata generation (Semantic Web)
easier document retrieval and exploration of legislation
multilingual access via standardized EU thesaurus Eurovoc
a test-case for the usage of such a system in Flanders
Web information on CADIAL project and eCADIS
www.cadial.org
contact: