DLLS 2003 1
Ontologically-based Searching for Jobs in Linguistics
Deryle Lonsdale
lonz@byu.edu
Funded by:

The BYU Data Extraction Group
  Group of faculty (5) and students (15) from CS, Linguistics, SOAIS
  Goal: ontology-based data extraction
  NSF funding: CISE/IIS/IDM TIDIE
  Website: www.deg.byu.edu/
    Papers, presentations
    Tools
    Demos

Overview
  Ontology-based extraction
  Building knowledge sources
  Jobs in linguistics (Sproat)
  Putting it all together
  Some sample results

Conceptual modeling (OSM)
  [Figure: OSM diagram of the car-ads domain. Car is linked by "has" relationships to Year, Price, Make, Mileage, Model, and Feature; PhoneNr "is for" Car and "has" Extension. Each relationship end carries a participation constraint (0..1, 0..*, or 1..*).]

Recognition and Extraction

  Car   Year  Make    Model      Mileage  Price  PhoneNr
  0001  1989  Subaru  SW                  $1900  (336)835-8597
  0002  1998          Elantra                    (336)526-5444
  0003  1994  HONDA   ACCORD EX  100K            (336)526-1081

  Car   Feature
  0001  Auto
  0001  AC
  0002  Black
  0002  4 door
  0002  tinted windows
  0002  Auto
  0002  pb
  0002  ps
  0002  cruise
  0002  am/fm
  0002  cassette stereo
  0002  a/c
  0003  Auto
  0003  jade green
  0003  gold

Car-Ads Ontology (textual)
  Car [->object];
  Car [0..1] has Year [1..*];
  Car [0..1] has Make [1..*];
  Car [0..1] has Model [1..*];
  Car [0..1] has Mileage [1..*];
  Car [0..*] has Feature [1..*];
  Car [0..1] has Price [1..*];
  PhoneNr [1..*] is for Car [0..*];
  PhoneNr [0..1] has Extension [1..*];
  Year matches [4]
    constant { extract "\d{2}";
               context "([^\$\d]|^)[4-9]\d[^\d]";
               substitute "^" -> "19"; },
    …
  …
  End;
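
One way to read the Year rule: the context pattern locates a candidate, the extract pattern pulls the value out of that context, and the substitute step normalizes it. The following is a minimal Python sketch under that assumed reading. The names `apply_rule` and `YEAR_RULE` are hypothetical, and the actual extraction engine may combine the three parts differently.

```python
import re

def apply_rule(text, extract, context, substitutions=()):
    """Assumed semantics: find each context match, extract the value
    inside it, then apply the substitutions to the extracted value."""
    values = []
    for ctx in re.finditer(context, text):
        found = re.search(extract, ctx.group(0))
        if found is None:
            continue
        value = found.group(0)
        for pattern, replacement in substitutions:
            value = re.sub(pattern, replacement, value)
        values.append(value)
    return values

# The patterns are copied from the Year rule on the slide.
YEAR_RULE = dict(
    extract=r"\d{2}",
    context=r"([^\$\d]|^)[4-9]\d[^\d]",
    substitutions=[(r"^", "19")],  # prepend the century
)

years = apply_rule("'89 Subaru SW, runs well", **YEAR_RULE)  # -> ['1989']
prices = apply_rule("$89 obo", **YEAR_RULE)                  # -> []
```

The context pattern rejects a two-digit number preceded by a dollar sign, which is why "$89" is not taken for a year.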

The data-frame library
  Low-level patterns implemented as regular expressions
  Match items such as email addresses, phone numbers, names, etc.

  Mileage matches [8]
    constant
      { extract "\b[1-9]\d{0,2}k";
        substitute "[kK]" -> "000"; },
      { extract "[1-9]\d{0,2}?,\d{3}";
        context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
        substitute "," -> ""; },
      { extract "[1-9]\d{0,2}?,\d{3}";
        context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
        substitute "," -> ""; },
      { extract "[1-9]\d{3,6}";
        context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b\les\b)"; },
      { extract "[1-9]\d{3,6}";
        context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
    keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
  end;
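
To illustrate how several constant rules might cooperate, here is a small Python sketch using the first two Mileage rules from the slide. It assumes that rules are tried in order, that matching is case-insensitive (the slide's "100K" sample suggests this, but it is not stated), and that a rule without a context is applied to the whole text. The slide does not say how the engine uses the keyword list, so the helper below only checks whether a keyword is present. All function and variable names are hypothetical.

```python
import re

# (extract, context, substitutions), copied from the first two rules.
MILEAGE_RULES = [
    (r"\b[1-9]\d{0,2}k", None, [(r"[kK]", "000")]),
    (r"[1-9]\d{0,2}?,\d{3}", r"[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]", [(r",", "")]),
]
MILEAGE_KEYWORDS = [r"\bmiles\b", r"\bmi\.", r"\bmi\b", r"\bmileage\b"]

def extract_mileage(text):
    """Return the first mileage found as an int, or None."""
    for extract, context, substitutions in MILEAGE_RULES:
        if context is None:
            scopes = [text]
        else:
            scopes = [m.group(0) for m in re.finditer(context, text, re.I)]
        for scope in scopes:
            found = re.search(extract, scope, re.I)
            if found:
                value = found.group(0)
                for pattern, replacement in substitutions:
                    value = re.sub(pattern, replacement, value)
                return int(value)
    return None

def has_mileage_keyword(text):
    """True if any keyword pattern occurs in the text."""
    return any(re.search(k, text, re.I) for k in MILEAGE_KEYWORDS)

a = extract_mileage("1994 HONDA ACCORD EX 100K")   # -> 100000
b = extract_mileage("only 87,500 miles, $4,200")   # -> 87500
c = extract_mileage("price $4,200 firm")           # -> None
```

The context in the second rule excludes a leading dollar sign, so a price such as "$4,200" is not normalized into a mileage.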

Lexicons
  Repositories of enumerable classes of lexical information
  FirstNames, LastNames, USstates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc.
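
A lexicon lookup can be sketched as a single alternation compiled from the entries. The example below is only an illustration: the lexicon names come from the slide, but the entries, the `LEXICONS` dictionary, and both function names are hypothetical, and the real lexicons are presumably much larger and stored differently.

```python
import re

# Hypothetical sample entries for two of the lexicons named on the slide.
LEXICONS = {
    "CarMakes": ["Subaru", "Honda"],
    "USstates": ["Utah", "Virginia", "West Virginia"],
}

def compile_lexicon(entries):
    """Build one case-insensitive, word-bounded pattern for a lexicon."""
    # Longest entries first, so a longer entry is preferred over a
    # shorter entry that is a prefix of it.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

def find_lexicon_hits(text, lexicons=LEXICONS):
    """Return (lexicon name, matched text) pairs, lexicon by lexicon."""
    hits = []
    for name, entries in lexicons.items():
        for match in compile_lexicon(entries).finditer(text):
            hits.append((name, match.group(0)))
    return hits

hits = find_lexicon_hits("1994 HONDA ACCORD EX, West Virginia")
```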

Accessing the output
  Extracted information is stored in a relational database
  Results can be queried using SQL
  Wide range of views is possible
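
As a rough illustration of querying extracted records, the sketch below loads the sample car-ad rows shown earlier into an in-memory SQLite database and runs two queries. The table names (`Car`, `CarFeature`), the column types, and the choice of SQLite are assumptions made for this example; the slides do not say which database or schema the group uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Car (Car TEXT PRIMARY KEY, Year TEXT, Make TEXT, Model TEXT,
                  Mileage TEXT, Price TEXT, PhoneNr TEXT);
CREATE TABLE CarFeature (Car TEXT, Feature TEXT);
""")

# Sample rows from the Recognition and Extraction slide.
conn.executemany("INSERT INTO Car VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("0001", "1989", "Subaru", "SW", None, "$1900", "(336)835-8597"),
    ("0002", "1998", None, "Elantra", None, None, "(336)526-5444"),
    ("0003", "1994", "HONDA", "ACCORD EX", "100K", None, "(336)526-1081"),
])
conn.executemany("INSERT INTO CarFeature VALUES (?, ?)", [
    ("0001", "Auto"), ("0001", "AC"),
    ("0002", "Black"), ("0002", "4 door"), ("0002", "tinted windows"),
    ("0002", "Auto"), ("0002", "pb"), ("0002", "ps"), ("0002", "cruise"),
    ("0002", "am/fm"), ("0002", "cassette stereo"), ("0002", "a/c"),
    ("0003", "Auto"), ("0003", "jade green"), ("0003", "gold"),
])

# Which cars have cruise control, and how can the seller be reached?
with_cruise = conn.execute("""
    SELECT c.Car, c.Year, c.Model, c.PhoneNr
    FROM Car c JOIN CarFeature f ON f.Car = c.Car
    WHERE f.Feature = 'cruise'
""").fetchall()

# A second view: number of extracted features per car.
feature_counts = conn.execute("""
    SELECT Car, COUNT(*) FROM CarFeature GROUP BY Car ORDER BY Car
""").fetchall()
```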

Finding jobs in linguistics
  Linguistlist.org, LSA
  Email distribution lists (corpora, langage naturelle, CAAL/ACLA, etc.)
  Usual commercial sites (monster.com, flipdog.com, dice.com)
  Word-of-mouth sources

Sproat’s analysis
  Random sample (224/2250) of LinguistList postings, 1994-2001
  Development vs. research, academic vs. industrial
  Linguists are most often (approx. 80% of the time) offered development jobs
  Linguists are hired more for specific tasks (e.g. grammar or lexicon development) than for more general research-oriented tasks (e.g. creating new technological approaches)

The banner years

  Year        Academia  Industry  % Industry
  1994        27        2         7%
  1995        45        5         10%
  1996        52        3         5%
  1997        48        3         6%
  1998        57        3         5%
  1999        56        14        20%
  2000        55        43        39%
  2001 (mid)  22        10        31%

  Dramatic rise in 1999, 2000
  Steep drop-off since 2001
  Rising demand for technical, computational skills
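
The "% Industry" column appears to be industry postings as a share of all sampled postings for the year. That definition is an assumption: it reproduces every row except 2000, where it gives about 44% instead of the 39% on the slide. The short Python sketch below recomputes the column under that assumption; it cannot tell which figure in the 2000 row is off.

```python
# (year, academia, industry, percent shown on the slide)
ROWS = [
    ("1994", 27, 2, 7),
    ("1995", 45, 5, 10),
    ("1996", 52, 3, 5),
    ("1997", 48, 3, 6),
    ("1998", 57, 3, 5),
    ("1999", 56, 14, 20),
    ("2000", 55, 43, 39),
    ("2001 (mid)", 22, 10, 31),
]

def industry_share(academia, industry):
    """Assumed definition: industry / (academia + industry), in percent."""
    return round(100 * industry / (academia + industry))

computed = {year: industry_share(a, i) for year, a, i, _ in ROWS}
mismatches = [year for year, a, i, shown in ROWS
              if industry_share(a, i) != shown]
```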

Linguistic jobs ontology
  Why?
    user-specifiable constraints
  Somewhat closely follows existing ontologies (e.g. jobs, software)

Data frames and lexicons
  Language names
    Ethnologue
  (Sub)fields of linguistics
    Linguistlist.org
  Tools, toolkits
  Software components, programming languages
  Linguistics-related job titles
  Activities
  Responsibilities
  Country names

The corpus
  3237 postings (LinguistList, Corpora, LN, WoM):
    1998  541
    1999  575
    2000  871
    2001  952
    2002  788
  Some noise (non-English, factored, program descriptions, attachments, etc.)
  Semi-automatic edits (boilerplate, publicity blurbs about institutions, etc.)

Observations
  270 postings do not contain linguist* (!)
  Demand for knowledge of English equals that for all other languages combined (G, F, S, J, C)
  Computer/computational background required for almost 1/3 (1116)
  Noticeable amount of headhunting, particularly in the Seattle and DC areas
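
The first observation implies a wildcard search over the postings. A minimal Python sketch of such a count follows; it assumes that `linguist*` means any word starting with "linguist", matched case-insensitively. The function names and the three sample postings are invented for illustration and are not taken from the corpus.

```python
import re

def wildcard_to_regex(term):
    """Turn a term such as 'linguist*' into a word-anchored pattern."""
    stem = term.rstrip("*")
    # A trailing * allows any word ending; otherwise require a full word.
    suffix = r"\w*" if term.endswith("*") else r"\b"
    return re.compile(r"\b" + re.escape(stem) + suffix, re.IGNORECASE)

def count_missing(postings, term):
    """Count the postings in which the term does not occur."""
    pattern = wildcard_to_regex(term)
    return sum(1 for posting in postings if not pattern.search(posting))

# Invented sample postings.
postings = [
    "Computational Linguist wanted for grammar development.",
    "Speech recognition engineer, Seattle area.",
    "Lecturer in Linguistics, one-year appointment.",
]

missing_wildcard = count_missing(postings, "linguist*")  # -> 1
missing_exact = count_missing(postings, "linguist")      # -> 2
```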

Programming languages
  [Bar chart, vertical axis 0 to 700. Categories: C/C++, CGI, HTML/SGML, Java/Jscript, Lisp/Python, Perl, Prolog, SQL, Tcl, VB, XML/XSLT.]

Popular subfields
  [Bar chart, vertical axis 0 to 700. Categories: IE/IR, Morpho, NLP, Phonetics, Phonology, Pragmatics, Speech, Syntax, Semantics, MT, TESOL/EFL, Translation.]

Subfields (another perspective)
  [Bar chart, vertical axis 0 to 800. Categories: Psycho, Neuro, Historical, Typological, Acquisition, Cognition, Socioling, Lexicography, Philology, Philosophy, Anthropo.]

An engineering discipline?
  160 linguistics job titles end in “engineer”
  Software development cycle
    research e., software design e.
    development e., software e.
    software quality e., linguistic test e., linguistic quality e.
    linguistic support e., user experience e.
    presales e., technical sales e.
  Specific subfields
    web site e.
    speech e., voice recognition e., speech recognition application e., speech e., ASR tuning e., audio e.
    dialog e.
    tools e.
    AI e., NLP e.
    knowledge e.
    linguist e., natural language e.
    staff e.
    human factors e., user interface e.

Paradigms
  [Bar chart, vertical axis 0 to 300. Categories: Machine learning, Finite-state, Statistical, Stoch/Prob, Math, Generative, Field Methods.]

Other observations
  Often a job title is not even listed (!)
  More i18n of data frames (e.g. email, ph. #)
  Great need for (preferably hierarchical) lexical repositories related to linguistics
    job titles
    theoretical frameworks, subfields
    typical linguist job activities
    linguistic research/development venues