Sketch Engine
Tool for text-based terminology
INTEGRA TERMINOLOGY MANAGEMENT, ZAGREB
Using texts for term mining
- Building a corpus
  - Choosing texts
  - Converting into a common format (txt)
  - Annotation
    - Croatian: http://nlp.ffzg.hr/api-for-our-language-technologies/
  - Alignment
    - CAT tools (SDL, memoQ) or LF Aligner, https://sourceforge.net/p/aligner/wiki/Home/
- Searching the corpus
  - Concordance tools: AntConc (free), WordSmith (€), ParaConc (free)
  - Web-based corpus workbench: Sketch Engine, http://www.sketchengine.co.uk
What is Sketch Engine?
- Very powerful corpus workbench: https://www.sketchengine.co.uk/
- Provides access to multiple pre-compiled corpora (British National Corpus, hrWaC, DGT corpora and many more)
- NOT free, but not expensive :) (5,99 € per month)
- Allows the creation of ad hoc corpora from web texts
- Supports TMX import (for bilingual texts!)
- Provides ways to extract terminology semi-automatically
- Online tutorials: https://www.sketchengine.co.uk/sketch-engine-video-tutorials/
Simple concordances
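A concordance lists every occurrence of a search word together with its left and right context (keyword in context, KWIC). The following is a minimal sketch of that idea in plain Python, not Sketch Engine's implementation; the function name `kwic`, the `width` parameter and the sample sentences are made up for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in one column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text, used only to demonstrate the output layout.
sample = ("Bluetongue is a viral disease of ruminants. "
          "Outbreaks of bluetongue are notifiable in the EU.")

for line in kwic(sample, "bluetongue", width=25):
    print(line)
```

Aligning the keyword in a fixed column is what makes recurring patterns to its left and right easy to spot when scanning many lines.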
Other query types
- simple: searches for a word and its inflected forms
- lemma: searches for all word forms that share the given lemma
- phrase: searches for a sequence of words
- word: searches for one specific word form
- character: searches for a string of characters
- CQL: Corpus Query Language
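The difference between word, lemma and character queries is easiest to see on a small annotated sample. The sketch below assumes each token is stored as a (word form, lemma) pair; the token list and the three function names are hypothetical and only mimic the behaviour described above.

```python
# Hypothetical annotated tokens: (word form, lemma) pairs.
TOKENS = [
    ("ribolova", "ribolov"),
    ("u", "u"),
    ("ribolovu", "ribolov"),
    ("ribolov", "ribolov"),
    ("ribolovni", "ribolovni"),
]

def word_query(tokens, form):
    """Match one exact word form only."""
    return [w for w, _ in tokens if w == form]

def lemma_query(tokens, lemma):
    """Match every word form whose lemma equals the query."""
    return [w for w, l in tokens if l == lemma]

def character_query(tokens, chars):
    """Match every word form that contains the character string."""
    return [w for w, _ in tokens if chars in w]

print(word_query(TOKENS, "ribolov"))       # exact form only
print(lemma_query(TOKENS, "ribolov"))      # all inflected forms of the noun
print(character_query(TOKENS, "ribolov"))  # also catches the adjective
```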
WordSketches
Thesaurus – similar words
Keyword extraction
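Keyword extraction typically compares how frequent a word is in the focus corpus with how frequent it is in a reference corpus. The sketch below uses the "simple maths" ratio of smoothed per-million frequencies described in Sketch Engine's documentation; that the tool applies exactly this calculation in every setting is an assumption, and the function names and counts are illustrative only.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Ratio of smoothed per-million frequencies (focus vs. reference corpus)."""
    fpm_focus = per_million(focus_count, focus_size)
    fpm_ref = per_million(ref_count, ref_size)
    # The smoothing constant avoids division by zero for words absent
    # from the reference corpus; larger values favour more frequent words.
    return (fpm_focus + smoothing) / (fpm_ref + smoothing)

# Illustrative counts: a word seen 50 times in a 1M-token focus corpus
# and never in a 1M-token reference corpus.
print(keyness(50, 1_000_000, 0, 1_000_000))
```

A score close to 1 means the word is about equally common in both corpora; high scores point to candidates worth checking as domain terms.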
Term queries in the DGT parallel corpus
- simple queries: ribolov (fishing), brancin (sea bass), grdobina (monkfish)
- lemma queries: ribolov -> ribolova, ribolovu, ribolov
- parallel query:
- querying using CQL syntax:
Basic CQL
- Typical format: [attribute="value"], e.g. [lemma="riba"]
- Specifying word class or case: [tag="N.*"] (any noun), [tag="A.*"] (any adjective)
- Regular expressions:
  - . (dot) matches any single character
  - * (asterisk) matches 0-100 repetitions of the preceding item
  - + (plus) matches 1-100 repetitions of the preceding item
  - {n,k} specifies a range of repetitions, from n to k
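The regular expressions inside a CQL value can be tried out with Python's `re` module. A CQL value has to match the whole attribute, which corresponds to `re.fullmatch` rather than `re.search`. This is a rough sketch: the helper name `value_matches` is hypothetical, and the tags are invented MULTEXT-East-style examples, not taken from a real corpus.

```python
import re

def value_matches(pattern, value):
    """True if the regular expression covers the entire attribute value."""
    return re.fullmatch(pattern, value) is not None

# [tag="N.*"]: any tag that starts with N (noun tags in this assumed tagset)
print(value_matches("N.*", "Ncmsn"))     # noun tag
print(value_matches("N.*", "Agpmsny"))   # adjective tag

# "ulov.*" covers inflected forms, plain "ulov" only the bare form
print(value_matches("ulov.*", "ulovom"))
print(value_matches("ulov", "ulovom"))

# {n,k}: between n and k repetitions of the preceding item
print(value_matches("a{2,3}", "aaa"))
print(value_matches("a{2,3}", "aaaa"))
```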
[lemma="rad"]
[tag="A.*"][lemma="riba"]
"ulov.*" []{0,3} [tag="N.*"]
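The query above reads: a word form starting with "ulov", then zero to three arbitrary tokens, then a noun. To make the gap logic concrete, here is a small sketch that emulates it over a list of annotated tokens. It assumes that a bare quoted string is matched against the word form, and it reports every matching span, which may differ from what the real query engine returns; the function name, token format and tags are hypothetical.

```python
import re

def find_with_gap(tokens, first, last, max_gap=3):
    """Find spans where `first` matches a token and `last` matches a token
    that follows after 0..max_gap intervening tokens.
    `first` and `last` are (attribute, regex) pairs; tokens are dicts."""
    hits = []
    for i, tok in enumerate(tokens):
        if not re.fullmatch(first[1], tok[first[0]]):
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j >= len(tokens):
                break
            if re.fullmatch(last[1], tokens[j][last[0]]):
                hits.append(" ".join(t["word"] for t in tokens[i:j + 1]))
    return hits

# Invented sample with illustrative tags (N... = noun, A... = adjective).
tokens = [
    {"word": "ulov", "tag": "Ncmsn"},
    {"word": "male", "tag": "Agpfsg"},
    {"word": "plave", "tag": "Agpfsg"},
    {"word": "ribe", "tag": "Ncfsg"},
]

# Emulates: "ulov.*" []{0,3} [tag="N.*"]
print(find_with_gap(tokens, ("word", "ulov.*"), ("tag", "N.*"), max_gap=3))
```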
Challenges
- Search for verbs occurring before the word "ugovor" (contract), with up to 2 words in between.
- Search for words ending in "anje".
- Search for defining contexts: a noun in the nominative case, followed by "je" (is), followed by an adjective and a noun in the nominative case.
Looking for definitions
- Exploit typical definition patterns:
  - [X] is a [Y]
  - [X] is defined as [Y]
  - [X] is a kind of [Y]
  - …
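The same definition patterns can be applied to plain English sentences outside the corpus workbench. The sketch below encodes the three patterns listed above as regular expressions and returns the term and the defining phrase. It is deliberately naive (no tagging, one sentence at a time), and the names `DEFINITION_PATTERNS` and `find_definition` as well as the sample sentence are invented for illustration.

```python
import re

# More specific patterns come first, so that "is a kind of" is not
# swallowed by the generic "is a" pattern.
DEFINITION_PATTERNS = [
    re.compile(r"^(?P<term>[\w\s-]+?) is defined as (?P<definition>.+?)\.?$", re.I),
    re.compile(r"^(?P<term>[\w\s-]+?) is a kind of (?P<definition>.+?)\.?$", re.I),
    re.compile(r"^(?P<term>[\w\s-]+?) is an? (?P<definition>.+?)\.?$", re.I),
]

def find_definition(sentence):
    """Return (term, definition) if the sentence fits a pattern, else None."""
    for pattern in DEFINITION_PATTERNS:
        m = pattern.match(sentence.strip())
        if m:
            return m.group("term").strip(), m.group("definition").strip()
    return None

print(find_definition("Bluetongue is a viral disease of ruminants."))
```

Pattern matches of this kind are only candidates: sentences such as "This is a problem" also fit, so the hits still need manual review.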
WebBootCaT
- Tool to create text collections from web pages
- User provides keywords and optionally selects sites to crawl
- Once the corpus is compiled, it can be queried or downloaded.
TMX Upload
- Allows you to create corpora from your translation memories
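Before uploading a translation memory, it can be useful to check which aligned segment pairs it actually contains. The following sketch reads source and target segments from a TMX document with Python's standard library. It assumes the usual `tu` / `tuv` / `seg` structure with `xml:lang` attributes; the function name and the tiny sample file are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for the two language codes."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs

# Minimal invented sample; real files often use codes such as "en-GB".
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1.0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>fishing</seg></tuv>
      <tuv xml:lang="hr"><seg>ribolov</seg></tuv>
    </tu>
  </body>
</tmx>"""

print(read_tmx_pairs(SAMPLE_TMX, "en", "hr"))
```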
Terminology extraction
- Works for languages with a predefined "term grammar"
- Manage corpus -> Keywords and terms
- Terms can be exported to TBX or CSV
Exercise
- Use the corpus-derived information on the following slides to create a term entry for "bluetongue".