2 body of language data collected (or curated) for a particular purpose various types of language...
Corpora
Corpus (pl. corpora)
- Body of language data
- Collected (or curated) for a particular purpose
- Various types of language
  - Spoken
  - Text
  - Images
  - Gestures
- Very valuable resource for linguist(ic)s and anyone else who is interested in language
Purposes for corpora
- Language instruction
- Task analysis
- Information access (search, indexing, etc.)
- Computer systems development
  - Training, testing/evaluating systems
- Knowledge source development (dictionaries, lexicons, etc.)
Types of corpora
- Text
- Speech
- Discourse
- Bitext
- Experimental transcripts
- Competition datasets
- Lyrics
Sources for text corpora
- Electronic text centers
- Digital libraries
  - Project Gutenberg
  - Bibliomania
- Corpus collections
- Wikipedia
- The web
Corpus distributors
- LDC (Linguistic Data Consortium)
  - BYU has a membership
  - Catalog
  - Top 10 corpora
- ELRA (European Language Resources Association): like LDC except based in Europe
- Government agencies (NIST, census, etc.)
- Companies (news agencies, etc.)
- Universities
Data formats
- Text
  - File formats: ASCII, EBCDIC, Unicode, proprietary
  - With or without markup (rtf, html, etc.)
  - Application specific (doc, wpd, etc.)
  - Can vary widely across languages
- Speech
  - Huge amount of variation across projects/hw/sw
  - TIMIT, NIST (US Gov.), AIFF (Apple), SUNAU8 (Sun), OGI File Format, WAV (Microsoft)
- Binary/machine formats
  - Sound/speech: MP3, AU, WAV, RA, …
  - Graphical: GIF, JPEG, BMP, WMF, …
- Knowledge of a scripting language (e.g. Perl) is invaluable!
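As a small illustration of why scripting helps with mixed text formats, the sketch below tries a list of candidate encodings until one decodes cleanly. It is written in Python rather than Perl; the function name `decode_text` and the candidate list are assumptions for illustration, and `cp037` is only one of several EBCDIC code pages.

```python
def decode_text(raw, encodings=("utf-8", "cp037")):
    """Return (text, encoding) for the first candidate encoding that decodes raw.

    Assumption: candidates are ordered from strictest to most permissive.
    cp037 (an EBCDIC code page) accepts any byte, so it must come last.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")

ascii_bytes = b"hub cap"                    # ASCII is a subset of UTF-8
ebcdic_bytes = "hub cap".encode("cp037")    # same text in EBCDIC
print(decode_text(ascii_bytes))
print(decode_text(ebcdic_bytes))
```

A guess based on "first encoding that does not fail" is crude; real corpora usually need metadata or manual inspection to confirm the encoding.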
Corpus metrics
- Size
  - Tokens: # of words, count ALL of them
  - Types: # of words, only count each once
- Term frequency
- Genre/topic
- Dispersion
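The token/type distinction and term frequency can be sketched in a few lines of Python. The word definition used here (lowercased runs of letters, digits, and apostrophes) is an assumption; a real corpus would need a more careful tokenizer.

```python
from collections import Counter
import re

def corpus_size(text):
    """Return (token count, type count, term-frequency Counter) for a text."""
    # Assumption: a "word" is a lowercased run of letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    freqs = Counter(tokens)
    # Tokens count every occurrence; types count each distinct word once.
    return len(tokens), len(freqs), freqs

sample = "the relief valve and the control valve"
n_tokens, n_types, freqs = corpus_size(sample)
print(n_tokens, n_types)       # 7 tokens, 5 types
print(freqs.most_common(2))    # the two most frequent terms
```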
Corpora at BYU
Lots of corpora listed here that are available for BYU faculty/student use.
- corpus.byu.edu
- scriptures.byu.edu
- General Conference corpus
Sample job
Date: Thu, 21 Feb 2013 10:40:22
University or Organization: H5
Job Location: California, USA
Web Address: http://www.h5.com
Job Rank: Consultant
Specialty Areas: Discourse Analysis; Semantics; Syntax; Text/Corpus Linguistics

About H5:
H5 serves the needs of leading law firms and corporate clients, using powerful proprietary software to provide technology-assisted review and expert search consulting & research. H5’s document review and analytic services uniquely support our clients’ requirements for large-scale litigation, investigation, records retention, and regulatory compliance. H5’s "hybrid" approach to technology-assisted review combines patented information retrieval technology and expert professional services. Through this model, H5 has created a fully integrated document review system that is unparalleled in performance, as proven in independent, benchmarked studies. For more information, visit www.h5.com.

Overview:
The H5 Professional Services Group includes linguists, lawyers, researchers, statisticians, e-discovery and data modeling experts and project managers. Our multidisciplinary teams use H5’s proprietary software and a well-defined process to build linguistic models that classify electronic data and support strategic search for documents that help our clients win. H5 is seeking candidates with backgrounds in linguistics (or related fields of textual corpus analysis), an affinity for developing novel search strategies, and a desire to collaborate with professional teams and sophisticated search technologies.

Primary Responsibilities:
- Analyzing linguistic data;
- Researching large corpora for linguistic patterns;
- Creating search strategies based on linguistic patterns;
- Researching subject matter and factual issues in complex litigation;
- Rapidly developing an understanding of new subject matter;
- Reading a wide variety of documents, from e-mail to academic articles;
- Synthesizing large amounts of information from a variety of sources;
- Designing, building, and testing search models unique to each project.

Key Competencies:
- Understanding of syntax, semantics, and pragmatics, in written communication;
- Experience in corpus, text, or discourse analysis a plus;
- Experience in ethnography or anthropology can be helpful, particularly as it relates to an understanding of contextual cues in text-based communication;
- Leadership skills, personal incentive and a demonstrated ability to initiate, develop, and successfully conclude projects;
- A sharp eye for detail and precise thinking;
- The ability to make analytical judgments;
- A practiced sense of order and organization;
- Ability to work under pressure and meet deadlines, both autonomously and collaboratively;
- Strong interpersonal skills, flexibility, curiosity, creativity, and collaborative spirit;
- Strong computer and software competency in a PC/Windows environment, including Microsoft Office;
- Experience in a software development environment a plus.

Minimal Qualifications:
- Solid academic credentials: advanced-undergraduate and/or graduate-level coursework in linguistics, textual corpus analysis, or related field;
- Experience applying linguistic and search expertise to real language data;
- Experience in a professional or business environment;
- Mastery of the English language.
Purpose of standards
- Avoid duplication of effort
- Allow synergy, integration, exchange
- Specific goals
  - Reusable text and tagging formats
  - Representative of domain/discipline/genre
  - Copyright
Text markup standards
- SGML (ISO standard)
  - Standard Generalized Markup Language
  - DTD, XOM, etc.
- HTML (W3C standard)
  - Hypertext Markup Language
  - SGML with specific DTD
- XML (W3C standard)
  - Logical SGML subset
  - Replacement (?) for HTML
Sample corpus analysis task
- ID terminology, collocations from previous publications
- Find most-used vocabulary
- Find inconsistencies, varied usages
- Get a handle on domains, topics, size of vocabulary
- Groundwork for tech writers, translators
Types of vocabulary lists
- Single-word term lists
- Collocations and compound lists
- KWIC listings
- Frequency lists
- Saliency lists
- Weirdness: typos, low-freq words, etc.
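A KWIC (keyword in context) listing lines up every occurrence of a word with a few words of context on either side. The following is a minimal sketch; the function name `kwic`, the window width, and the sample sentence are illustrative assumptions, not the tool used for the task described here.

```python
def kwic(tokens, keyword, width=3):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-justify the left context so the keywords line up in a column.
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

# Made-up sample text for illustration.
text = "After fitting the pipe into the basin check that the aft fitting is larger"
for line in kwic(text.split(), "fitting"):
    print(line)
```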
Starting point
All English-language documentation ever published for which there was a machine-readable version (typesetting)
Several hundred documents of all kinds: repair manuals, warranty notices, user manuals, testing documents, etc.
Total number of files processed: 861
Canonicalizing the input
- Standardize character representation
- Tokenize punctuation
- Strip formatting codes
- Uncapitalize sentence-initial words
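The four canonicalization steps might be sketched as follows. This is a simplified illustration under stated assumptions: formatting codes are assumed to look like `<tags>` (the original typesetting codes are not described), and sentence boundaries are assumed to follow `.`, `!`, or `?`. The function name `canonicalize` is hypothetical.

```python
import re
import unicodedata

def canonicalize(text):
    """Return a list of canonicalized tokens for a raw text string."""
    # 1. Standardize character representation (NFKC folds ligatures, full-width forms, etc.).
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    # 2. Strip formatting codes. Assumption: codes look like <tags>.
    text = re.sub(r"<[^>]+>", " ", text)
    # 3. Tokenize punctuation: split it off from adjacent words.
    tokens = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)
    # 4. Uncapitalize sentence-initial words, leaving all-caps acronyms alone.
    out, sentence_start = [], True
    for tok in tokens:
        is_acronym = len(tok) > 1 and tok.isupper()
        if sentence_start and tok[:1].isupper() and not is_acronym:
            tok = tok[0].lower() + tok[1:]
        out.append(tok)
        sentence_start = tok in {".", "!", "?"}
    return out

print(canonicalize("<b>Tighten</b> the hub cap. Check the valve."))
```

Note that this naive version also lowercases sentence-initial proper nouns, which a production pipeline would need to protect.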
ID, count single words
- De-inflect morphological variants (base-form reduction, lemmatization)
- -ing, -ed forms are problematic
  - After fitting the pipe into the basin …
  - The aft fitting is larger on the new…
  - The tightly fitting bracket should be…
  - Fuel will be shunted… / The shunted fuel…
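To see why -ing and -ed forms are problematic, consider a deliberately naive suffix stripper that ignores part of speech. The function `naive_base_form` below is a hypothetical illustration, not the de-inflection method used in this project.

```python
def naive_base_form(word):
    """Strip -ing / -ed and undo consonant doubling; ignores part of speech entirely."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "fitt" -> "fit"; "shunt" is unchanged.
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

for w in ("fitting", "shunted", "pipe"):
    print(w, "->", naive_base_form(w))
```

Every "fitting" is reduced to "fit", which suits the verb in "After fitting the pipe" but destroys the noun in "The aft fitting". Deciding which reading applies requires context, for example a part-of-speech tag.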
Single-word statistics
Total number of single-word (sw) occurrences, i.e. tokens: 7,230,000
Total number of unique single words, i.e. types: 12,000
ID, count nominal compounds
- Involve at least two of the following:
  - Nouns
  - Nominalized verb forms
  - Some adjectives
  - Any word whose category is not known
- but not:
  - Numbers, special characters, non-nouns
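One way to apply these criteria is to scan part-of-speech-tagged text for maximal runs of two or more eligible words. The sketch below assumes the input is already tagged; the tag names (`N`, `NV`, `ADJ`, `UNK`) are hypothetical, and the tagger is assumed to mark only the eligible adjectives as `ADJ`.

```python
# Hypothetical tags: N = noun, NV = nominalized verb, ADJ = eligible adjective,
# UNK = word whose category is not known.
COMPOUND_TAGS = {"N", "NV", "ADJ", "UNK"}

def nominal_compounds(tagged):
    """Yield maximal runs of two or more candidate words from a list of (word, tag) pairs."""
    run = []
    for word, tag in tagged + [("", "END")]:   # sentinel flushes the final run
        if tag in COMPOUND_TAGS:
            run.append(word)
        else:
            if len(run) >= 2:
                yield " ".join(run)
            run = []

tagged = [("drain", "V"), ("the", "DET"), ("hydraulic", "ADJ"), ("oil", "N"),
          ("tank", "N"), ("before", "P"), ("2", "NUM"), ("hours", "N")]
print(list(nominal_compounds(tagged)))
```

Numbers and other excluded categories break a run, and a lone noun such as "hours" is not reported because a compound needs at least two words.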
Sample nominal compounds
- hub cap
- low amplitude
- boom foot pin assembly
- hydraulic oil tank drain plug
- card cage type regulator voltage adjustment controls

There are ambiguities:
- check valve
- testing equipment
Nominal Compound Statistics
Total number of nominal compounds: 1,034,861
Total number of unique nominal compounds: 110,298
Sample long nominal compounds
off-highway truck final drive first reduction planetary assembly
parking brake/travel stop pilot control valve pressure switch
right front suspension cylinder pressure sensor circuit fault
fuel injection pump drive sprocket bearing lubrication line
track motor manifold valve high pressure relief setting
ground level right rear leg elevation control valve
axle wish bone ball joint flange mounting bolts
stick cylinder rod end check valve lines group
ground engaging tool bolt torques chart
scraper key start switch relay terminal
NC Frequency Distribution:

freq    # terms
-----------------
1       45877
2       22207
3       8277
4       7026
5       3554
6       3441
7       1902
8       1891
9       1367
10      1169
15      527
20      355

freq    # terms
-----------------
30      166
50      66
75      33
100     17
250     2
501     1
1098    1
3410    1
3862    1
3966    1
4889    1
6092    1
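A distribution of this kind (how many distinct terms occur exactly n times) can be derived from a list of term occurrences by counting twice. This is a minimal sketch with made-up sample data; `frequency_spectrum` is a hypothetical name.

```python
from collections import Counter

def frequency_spectrum(terms):
    """Map each frequency n to the number of distinct terms that occur exactly n times."""
    term_freq = Counter(terms)            # term -> number of occurrences
    return Counter(term_freq.values())    # frequency -> number of terms with it

# Made-up occurrences for illustration only.
occurrences = ["lb ft", "lb ft", "lb ft", "seat belt", "seat belt",
               "ball joint", "coolant leak"]
spectrum = frequency_spectrum(occurrences)
for freq in sorted(spectrum):
    print(freq, spectrum[freq])
```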
NC Frequencies
6092 lb ft
4889 cooling system
3966 fuel injection
3862 parking brake
3410 relief valve
2789 control valve
2587 service hours
2421 hydraulic oil
2588 personal injury
2373 caterpillar dealer
NC Frequencies (cont.)
2037 lift truck
1432 oil filter
953 seat belt
488 master cylinder
205 directional control
109 petroleum jelly
64 ball joint
33 caterpillar service technology group
10 outlet water temperature regulators
5 coolant leak
1 conveyor drive pump electrical displacement controls
Term Length Distribution
Len     # of terms
2       50894
3       39043
4       15189
5       3951
6       936
7       207
8       49
9       10
10      9
11      2
12      3
13      2
15      2
Semantic Classes of NC’s
- parts and components
- conditions
- vehicles
- product offerings
- tools and hardware
- measurements
- humans and occupations
- corporate entities and procedures
Non-nominal Collocations
- hand tighten
- make sure
- air dry
- away from
- air to air aftercooler
- hydraulically released disc brakes
Prep/adv-based Ambiguity (technical vs. not)
- down arrow keys
- inside cab light
- left camshaft oil gallery
- accelerator pedals down
- air inside
- bulldozer tilt left
Variation in NC’s
- Alternate spellings
- Typos
- Abbreviations
- Morphological variation ( & possessives)
- Word-boundary variation
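Of these, word-boundary variation is the simplest to detect automatically: terms that become identical once spaces, hyphens, and slashes are removed are likely variants of one another. The sketch below is an assumed approach for illustration; the function names and the variant spellings in the sample are hypothetical.

```python
from collections import defaultdict
import re

def boundary_key(term):
    """Collapse a term to a key that ignores case, spaces, hyphens, and slashes."""
    return re.sub(r"[\s\-/]+", "", term.lower())

def group_boundary_variants(terms):
    """Return {key: sorted variants} for keys written in more than one way."""
    groups = defaultdict(set)
    for term in terms:
        groups[boundary_key(term)].add(term)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

# Illustrative input; "hubcap" and "hub-cap" are made-up variants of "hub cap".
print(group_boundary_variants(["hub cap", "hubcap", "hub-cap", "seat belt"]))
```

Typos, abbreviations, and alternate spellings need fuzzier matching (for example edit distance or lookup tables) and are not covered by this key.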
Compositionality
((ground level) (front leg))
*(ground ((level front) leg))
BUT: hand fuel priming pump
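The bracketing problem grows quickly with compound length. As an illustration, the sketch below enumerates every binary bracketing of a word sequence; a four-word compound already has five candidates, of which only some are semantically sensible. The function name `bracketings` is hypothetical.

```python
def bracketings(words):
    """Return every binary bracketing of a word sequence as a string."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Try each split point, then combine every left bracketing with every right one.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append(f"({left} {right})")
    return results

candidates = bracketings("ground level front leg".split())
print(len(candidates))
for candidate in candidates:
    print(candidate)
```

Choosing among the candidates requires knowing which sub-sequences are themselves established terms, such as "ground level" and "front leg".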