ABRAPT Mini-curso 30.08.04
The Corpógrafo: Theory and Practice
Belinda Maia & Luís Sarmento
PoloFLUP
LINGUATECA
A bit of history
• PALC ’97 – 'Do-it-yourself corpora ... with a little bit of help from your friends!'
• CULT 1998 - ‘Making corpora – a learning process’
Contrastive linguistics, corpora linguistics, translation teaching
General > specific language
A bit of history
• 2000 – First Master’s in Terminology and Translation at FLUP
• PALC 2001 - ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’
Specialized translation and terminology
Contact with domain experts
Importance of IT
Need for technical help for more ambitious students!
A bit of history
• LREC 2002 - ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’
• 2002 – Second Master’s in Terminology and Translation at FLUP
Plea for help to Diana Santos
October 2002
LINGUATECA - Polo FLUP
LINGUATECA
• See http://www.linguateca.pt
• Leader > Diana Santos (SINTEF – Oslo)
• Objective - to create resources and tools for the computational processing of Portuguese
• Poles at Oslo, Lisbon, Braga and Porto
• Porto – Polo CLUP/FLUP
Polo CLUP/FLUP
• See http://www.linguateca.pt/poloclup/
• On-line suite of corpora tools to work with comparable corpora with emphasis on bilingual research
– Focus on special domains
– Construction of terminology databases, ontologies and domain models
Corpógrafo
Polo CLUP/FLUP
• See http://www.linguateca.pt/poloclup/
• General help in constructing resources specific to the needs of FLUP/CLUP
– For researchers, teachers and students
– For teaching methodology at FLUP
BNC & Reuters corpora on intranet
A small ‘chat’ corpus
More history
• 2003 – Poster of the GC – at CL2003
• 2003 – ‘What are comparable corpora?’ CL2003
• 2003 – Experimentation with evaluation of Machine Translation
• 2003 – Experimentation with GC
• 2003 – Third Master’s in Terminology and Translation at FLUP
GC – Integrated Web Environment for Corpora Linguistics
Motivation
• Lack of comprehensive, wide-scope corpora tools
• Commercial packages are usually difficult to integrate/customize
• Tools are not prepared to support cooperative work
• Linguistic knowledge is not usually integrated in tools
What is GC?
GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for Corpora-Based Linguistic Research. GC allows users to:
• access several Corpora tools from a single entry point using a regular web browser
• access and query generic Corpora (BNC, Reuters, COMPARA, CETEMPúblico)
• build personal simple, parallel and comparable Corpora from text files (PDF, PS, Word, HTML, TXT)
• use several (on-line/off-line) tools with their personal Corpora (statistics, POS-taggers, filters, etc.)
• communicate and exchange results with other users
Internet Integration
GC provides seamless integration with the World Wide Web, allowing users to:
• search specific Corpora resources on the Internet
• query the web for concordances
• use available translation engines in parallel
[Poster diagram. Recoverable labels:]
• Input formats: DOC, HTML, TXT, PS, PDF, RTF
• Generic corpora: BNC, CETEMPúblico, COMPARA, others
• Personal Corpora
• Roles: DEV, ADM, USER
• Custom Interface
• Virtual Desktop
• Internet
• Terminology DB
• Terminology Extraction Tool (Auto/Semi-Auto)
Tool Pool
• Concordance Engine
• Taggers
• Aligner (Semi-Auto)
• Corpora Bot
• Statistics
• Custom Tools
Corpora Taxonomy
• Medium: written, spoken, multimedia
• Domain: Engineering, medicine, etc.
• Genre: scientific, technical, informative, etc.
Administrator’s Tasks:
• Users, Groups and Disk Quotas
• Corpora Taxonomy (see box)
• Documentation Organization
• Access Service Statistics
Developer’s Tasks:
• Integrate Existing Tools/Resources
• Develop Additional Generic Tools
• Interact with Users/Administrator
• Develop Custom Tools for particular research needs
Inter-User Communication
• Tagging and Aligning Cooperatively
• Messaging Service
• Exchange of Corpora Resources
Teacher’s Tasks:
• Provide on-line tutorials
• Provide links to:
– on-line teaching material
– bibliography and other resources
And then...
• PoloCLUP’s 3rd function:
• Evaluation of Machine Translation
– Experimentation with evaluation
– Teaching + research focus
• Results:
– TrAva – MT evaluation tool
– CorTA – Corpus of 1 EN input + 4 MT output sentences
Prescriptive v descriptive terminology
• Paper > digital form
• Static > dynamic resources
• ‘Democratization’ of terminology
• ISO standards > socioterminology
• Knowledge structures increasingly recognized as structured but dynamic - ask Gerhard Budin to explain this to you ….
Perspectives of terminology users
• Domain experts and vested interests
• Translators
• Information retrieval
• Knowledge engineering
Standardized terminology
Getting the right word
Finding information
Perfecting Google
Structuring knowledge
Finding it fast
Bridging the Gap
• General linguists
• Translation teachers
• Translation students
• Corpus linguists
• Computational linguists
• Computer engineers
Computer-phobia
Computer-worship
The Corpógrafo combines:
• Terminology, translation and language study and research (Belinda)
• Terminology databases (Domain experts)
• Computational linguistics research and production of resources (Diana)
• Information retrieval and artificial intelligence (Luís)
= Discussions on priorities!
Corpora and Terminology
• Corpora as input
• Terminology extraction
• Terminology databases
• Structuring of domain knowledge
• Further corpora
[Diagram. Labels: Internet, Corpora, Corpora Analysis, Terminology Database, Text details]
Working with the Corpógrafo
• Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
• All research done ONLINE
• Each username/password = separate space on our server
• At present > anyone can work with it using 10 MB space for FREE
• BUT - you get an empty space + tools + tutorial!
Terminology – old v new
• Prescriptive > descriptive
• Paper > digital form
• Static > dynamic resources
• ‘Democratization’ of terminology
• ISO standards > socioterminology
• Knowledge structures increasingly recognized as structured but dynamic - ask Gerhard Budin to explain this to you ….
Focus of Corpógrafo
• Design priorities are to:
– See the Big Picture
– Create the Overall Framework
– Get feedback from users to see their needs
– Develop according to real research needs
– Fill in the details and improve techniques as needed
Corpógrafo and special domains
• Master’s in Terminology and Translation
• Terminology projects with the support of domain specialists in:
– Engineering – Electronics, Mechanical Engineering
– Geography – Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
– Medicine – Kidney support machines, Neurology
– Science – Genetics
– Technology – GPS (Global Positioning System)
Corpógrafo and terminology/translation research
• Ongoing dissertations on aspects of:
– Terminology – databases for different uses, neologisms, definition searches, semantic relations, conceptual analysis
– Corpora – text analysis, corpora construction
– Technical writing > Electrical Appliances
– Localization
– Terminology in documentaries
– Translation of Multimedia
Linguateca
• Linguateca’s policy - all resources and tools freely available online
• Primary users - Portuguese and Brazilian
Polo CLUP/FLUP
• Bi- or multi-lingual in interest
• Corpógrafo available for experiments on a small scale to the general public
• Possibilities of future work on projects with users from other universities and other countries
Contacts
If you are interested in finding out more, please contact me:
Belinda Maia
The Corpógrafo can be used
(with a username and password) at:
http://www.linguateca.pt and
http://poloclup.linguateca.pt/ferramentas/gc
Corpógrafo
1. File Manager - area where each individual or group can:
– convert various text formats to .txt
– upload texts to their space on server
– ‘clean’ them of unnecessary material
– check tokenization and sentence divisions
– consult wordlists – alphabetical, frequency etc
– group texts into corpora
– register full information on source, domain and text type
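The wordlist step of the File Manager can be illustrated with a minimal sketch. The tokenizing rule and the function names below are assumptions made for illustration only; they are not the Corpógrafo's own tokenizer or code.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the Corpógrafo's rules):
# runs of letters, accented ones included, with optional internal
# hyphens or apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def tokenize(text):
    """Lower-cased word tokens in text order."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def frequency_wordlist(text):
    """(word, count) pairs, most frequent first; ties in alphabetical order."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def alphabetical_wordlist(text):
    """(word, count) pairs sorted by code point, so accented letters sort last."""
    return sorted(Counter(tokenize(text)).items())


sample = "The corpus feeds the database. The database grows."
freq_list = frequency_wordlist(sample)
alpha_list = alphabetical_wordlist(sample)
```

A real wordlist tool would probably want locale-aware sorting for Portuguese; plain code-point order is used here only to keep the sketch dependency-free.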
Corpógrafo
2. Corpora analysis area:
– Concordancing tools allowing for
• KWIC concordancing
• KWIC concordancing sorted according to the word to the left or right
– N-gram tool
• N-grams
• Term-candidates
– With filters for PT
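The concordancing and n-gram tools listed above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions: the tokenizer, the tiny Portuguese stopword list standing in for the "filters for PT", and all function names are hypothetical and do not describe the Corpógrafo's actual implementation.

```python
import re
from collections import Counter

# Hypothetical tokenizer: letter runs with optional internal hyphens/apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")

# Toy stopword list, assumed here as a stand-in for a Portuguese filter.
PT_STOPWORDS = {"a", "o", "as", "os", "de", "da", "do", "das", "dos", "e", "em", "um", "uma"}


def tokenize(text):
    return [t.lower() for t in TOKEN_RE.findall(text)]


def kwic(tokens, keyword, width=3, sort_by=None):
    """Keyword-in-context hits as (left, keyword, right) triples.

    sort_by=None keeps text order; "left" sorts on the word just before
    the keyword, "right" on the word just after it.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    if sort_by == "left":
        hits.sort(key=lambda h: h[0][-1] if h[0] else "")
    elif sort_by == "right":
        hits.sort(key=lambda h: h[2][0] if h[2] else "")
    return hits


def ngrams(tokens, n):
    """Frequency count of all n-token sequences."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def term_candidates(tokens, n=2, min_freq=2):
    """n-grams seen at least min_freq times that neither start nor end with a stopword."""
    return {
        gram: count
        for gram, count in ngrams(tokens, n).items()
        if count >= min_freq
        and gram[0] not in PT_STOPWORDS
        and gram[-1] not in PT_STOPWORDS
    }


tokens = tokenize("a erosão costeira afeta a costa e a erosão costeira aumenta")
candidates = term_candidates(tokens)
hits_sorted_right = kwic(tokens, "a", width=2, sort_by="right")
```

Filtering on the first and last word of each n-gram is only one simple heuristic; a term extractor could equally use part-of-speech patterns, which is not attempted here.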
Corpógrafo
3. Terminology database
– Terms
– Definitions
– Examples
– Morphology
– Multilingual equivalents
– Sources and text details of corpora used
– Semantic relations – further complexity
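One way to picture the fields of a terminology database entry is as a small record type. The sketch below is an assumption for illustration: the class names, field names and the relation label "is_a" are hypothetical and do not reflect the Corpógrafo's actual database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Where a term was found (hypothetical fields)."""
    corpus: str
    text_id: str
    domain: str = ""
    text_type: str = ""


@dataclass
class TermEntry:
    """One term with the kinds of information listed on the slide."""
    term: str
    language: str
    definitions: list = field(default_factory=list)
    examples: list = field(default_factory=list)
    morphology: str = ""
    equivalents: dict = field(default_factory=dict)  # language code -> list of terms
    sources: list = field(default_factory=list)      # Source objects
    relations: list = field(default_factory=list)    # (relation name, target term)

    def add_equivalent(self, language, term):
        self.equivalents.setdefault(language, []).append(term)

    def add_relation(self, relation, target):
        self.relations.append((relation, target))


# Illustrative entry; the values are made up for the example.
entry = TermEntry(term="coastal erosion", language="en", morphology="noun phrase")
entry.add_equivalent("pt", "erosão costeira")
entry.add_relation("is_a", "natural hazard")
entry.sources.append(Source(corpus="natural-hazards", text_id="text-001", domain="Geography"))
```

Keeping semantic relations as a simple list of pairs leaves room for the "further complexity" the slide mentions, such as typed or weighted relations.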
Future developments – general policy
• General testing and improvement of the Corpógrafo
• Experimentation with ideas from other projects, e.g. WordNet, FrameNet
• Experimentation with theories of semantic primitives, human universals etc
• Development of new ideas or functions – using isomorphic relationships between researchers’ needs and our possibilities
Future developments – File Manager
• Creation of overall framework – perhaps UDC based – for:
– consultation of research available to public
– information on ongoing research
• Coordination of individual corpus projects into bigger projects, when possible or necessary
File Manager – Theoretical questions
• Domain organization – UDC or ?
• Categorization of text by genre – how many genres?
• Reliability of texts from Internet – how does one guarantee quality?
• Is a translator or linguist able to distinguish a ‘good text’?
• Should the domain specialist choose the texts?
Corpora construction – theoretical questions / problems
• How large is a good domain corpus?
• No domain corpus will produce EVERY term in the area
• Comparable corpora v. Parallel corpora
• Aligning comparable corpora at term level
Future developments – Corpora analysis
• Development of finer-grained concordancing
• Experimentation with finding definitions in context
• Semi-automatic creation of keyword shortlists for further text retrieval
Corpora Analysis – Theoretical questions
• How far can one rely on the computational linguist or computer engineer to produce analyses of corpora?
• If (semi-)automated processes produce 80% of the possible results, should the linguist / translator rubbish these processes?
• Can we leave it all to the computer engineer?
Future developments – terminology databases
• Refinement of terminology fields
• Development of further multi-lingual functions
• Development of organized and robust set of semantic relations
• Semi-automatic visualizing of semantic relations
Terminology databases – Theory
• How much information does a database need?
• How much does the user of a database need?
• Is it reasonable to hope that all our databases could one day communicate with each other and help us with translation / information retrieval – or whatever?
How is the Corpógrafo being used at present?
• Master’s in Terminology and Translation
• Terminology projects with the support of domain specialists in:
– Engineering – Electronics, Mechanical Engineering
– Geography – Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
– Medicine – Kidney support machines, Neurology
– Science – Genetics
– Translation and Localization
How is the Corpógrafo being used at present?
• Dissertations completed on:
– Definitions for different purposes + pedagogical glossary for Corrosion, Electrical engineering http://www.fe.up.pt/~cdm/QAE/QAE_gloss_b.htm
– Socioterminology – in the area of Composite Materials
– Graphical representation of Conceptual systems
– Terminology and Metaphors
– Football Metaphors
How is the Corpógrafo being used at present?
• Ongoing dissertations on aspects of:
– Terminology – databases for different uses, neologisms, conceptual analysis
– Corpora – text analysis, corpora construction
– Translation and localization terminology
– Technical writing > Electrical Appliances
– Terminology in documentaries
Pedagogical applications of the Corpógrafo
• Undergraduate courses – only possible if both teachers and students are trained to use it
• Postgraduate research
– Terminology and translation (Belinda + domain experts)
– Computational linguistics (Diana)
– Information retrieval (Luís)
• Long live team work!
To what extent is the Corpógrafo available to others?
• Linguateca’s policy is to make all resources and tools available online
• Primary users are expected to be Portuguese and Brazilian, as most of the resources and tools are for Portuguese
• PoloFLUP’s main objective – comparable corpora and terminology tools
To what extent is the Corpógrafo available to others?
• PoloFLUP is, by definition, bi- or multi-lingual in interest
• The Corpógrafo is therefore available for experiments on a small scale to the general public
• In the future – we hope to be able to work on projects with users from other universities and other countries