Cross-domain Evaluation of Word Embeddings using an Oilfield Glossary

Farhad Nooralahzadeh
University of Oslo

LTG seminar, 9 May 2017

Outline: Introduction · Materials and methods · Experiments and results · Error Analysis · Conclusion and Future work · Appendix




SIRIUS

• Norwegian Centre for Research-driven Innovation (8-year programme)

• Supports digitalization in the oil and gas industry

• Addresses the problems of scalable data access in the oil and gas industry

Work packages/strands; consortium of companies and universities:

• Operators (Statoil), service companies (Schlumberger, DNV-GL), IT companies (IBM, Kadme, ...)

• UiO, NTNU, Oxford and Simula Research Laboratories

http://sirius-labs.no/


PhD Project

Title: Domain-adapted language technology for Oil and Gas

1. Domain study

2. Domain knowledge embedding

3. Domain-adapted information extraction pipeline

4. Domain knowledge base enrichment


Cross-domain Evaluation of Word Embeddings using an Oilfield Glossary

Objectives:

• Create a domain-specific vocabulary model

• Evaluate the quality of the model

• Provide a reliable input representation for downstream classification tasks


Word embedding evaluation

Intrinsic evaluation and extrinsic evaluation

Intrinsic: interpret what a model encodes in terms of some linguistic relations.

• Semantic relatedness: given a proximity score for each word pair, measure the degree of correlation between the scores provided by the model and the human ratings.

Gold standard resources: SimLex-999 [Hill et al., 2014], WordSim-353 [Finkelstein et al., 2001]

For example: smart/intelligent: 9.2, navy/army: 6.43, water/salt: 1.3
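
A minimal sketch of this relatedness check using gensim's built-in word-pair evaluator; the model file name and the WordSim-353 path are placeholders, not artifacts from the talk:

```python
# Hedged sketch: correlate model similarities with human relatedness
# ratings (Pearson/Spearman) using gensim's built-in evaluator.
from gensim.models import KeyedVectors

# Hypothetical paths: any word2vec-format model plus a tab-separated
# file of (word1, word2, human score) rows, e.g. WordSim-353.
wv = KeyedVectors.load_word2vec_format("oilgas.d400.bin", binary=True)
pearson, spearman, oov_ratio = wv.evaluate_word_pairs("wordsim353.tsv")
print(f"Spearman rho = {spearman[0]:.3f} (OOV = {oov_ratio:.1f}%)")
```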


Word embedding evaluation

• Analogy: ask the model to detect whether two pairs of words stand in the same relation (morphological and semantic relations)

Gold standard resource: AN-19.5K [Mikolov et al., 2013]

For example: germany:berlin :: france:paris, man:woman :: king:queen, man:men :: dollar:dollars, bad:worse :: big:bigger
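
In gensim, such analogy queries are posed as vector arithmetic over the embedding space; a short sketch (same placeholder model file as above):

```python
# Hedged sketch of a 3CosAdd analogy query: king - man + woman ~ queen.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("oilgas.d400.bin", binary=True)  # placeholder path
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```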

Extrinsic: investigate the contribution of an embedding model to the performance of a specific downstream task.

For example: NER (F1) without embeddings: x%; with embeddings: y%

The majority of work focuses on general-domain embeddings and on relations between frequent, generic terms.


Domain-specific query inventory

The Schlumberger oilfield glossary (SLB)


Domain-specific query inventory

n-grams    Nb.    Noun   Verb   Adj.   Adv.   Preposition   Transitive verb
uni-gram   1499   1261   89     189    1      2             9
bi-gram    2569   2505   33     36     1      1             2
tri-gram   660    644    13     4      0      0             0
>3         158    149    3      6      0      0             0
All        4886   4559   138    235    2      3             11

Table: N-grams & part-of-speech tags

The final query inventory contains:

• Synonym pairs: 878

• Antonym pairs: 284

• Alternative form pairs: 934

Multi-word entities:

• Multi-word units: 3,387 terms (70%)

• Nouns: 3,298 terms (72%)


Corpora and Pre-processing

Source                                         Abbreviation   Description                      Nb. documents   Nb. sentences
American Association of Petroleum Geologists   AAPG           Scientific articles              3,382           72,243
C&C Reservoirs - Digital Analogs               CCR            Field evaluation reports         1,140           244,017
Elsevier                                       ELS            Scientific articles, magazines   40,757          7,703,447
Geological Society, London, Memoirs            GSL            Scientific articles              152             32,352
Norwegian Petroleum Directorate                NPD            Norwegian field info             514             49,426
Tellus                                         TELLUS         Basin info                       1,478           179,450
Total                                                                                          47,423          8,280,935 (108M tokens)

• Tokenization and lemmatization using Stanford CoreNLP

• English stopwords and sentences with fewer than 3 words are removed from the corpus

• Shuffling: makes the contribution of all source texts roughly equivalent
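
A rough sketch of this pipeline in Python follows. It substitutes stanza (Stanford's Python NLP package) for the Java CoreNLP used in the talk, and the stopword list and file handling are assumptions:

```python
# Hedged pre-processing sketch: tokenize + lemmatize, drop English
# stopwords and very short sentences, then shuffle sentence order.
import random

import stanza                       # requires stanza.download("en") once
from nltk.corpus import stopwords   # requires nltk.download("stopwords")

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma")
STOP = set(stopwords.words("english"))

def preprocess(raw_text):
    doc = nlp(raw_text)
    sentences = []
    for sent in doc.sentences:
        lemmas = [w.lemma.lower() for w in sent.words
                  if w.lemma and w.lemma.lower() not in STOP]
        if len(lemmas) >= 3:       # drop sentences with fewer than 3 words
            sentences.append(lemmas)
    random.shuffle(sentences)      # make all source texts contribute evenly
    return sentences
```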


Word Embeddings model

• The phrase model of gensim: bi-grams and tri-grams

• CBOW and Skip-gram architectures with default settings.

• Different system design settings:

Hyper-parameter                 Values
vector dimension (dim)          50, 100, 200, 300, 400, 500, 600
context window size (win)       2, 3, 5, 10, 15, 20
negative sampling (neg)         3, 5, 10, 15
frequency cut-off (min.count)   2, 3, 5, 10
n in top n-most-similar         5

• 22 embedding models in total; a training sketch follows below
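
A sketch of this training setup with gensim's Phrases and Word2Vec classes. It uses the current gensim 4.x API (vector_size was called size in the 2017 releases); the corpus path and phrase thresholds are assumptions, not the presenter's exact settings:

```python
# Hedged training sketch: detect bi-grams/tri-grams, then train CBOW.
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# One pre-processed sentence per line, tokens separated by spaces.
sentences = [line.split() for line in open("oilgas_corpus.txt", encoding="utf-8")]

bigram = Phraser(Phrases(sentences, min_count=5))            # e.g. filter_cake
trigram = Phraser(Phrases(bigram[sentences], min_count=5))   # e.g. enhanced_oil_recovery
corpus = [trigram[bigram[s]] for s in sentences]

model = Word2Vec(corpus,
                 sg=0,              # CBOW; sg=1 would give Skip-gram
                 vector_size=400,   # dim of the final OilGas.d400 model
                 window=5, min_count=5, negative=5)
model.save("oilgas.d400.model")
```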


Evaluation

• For each term in the inventory, an embedding model should be able to propose similar words that are semantically related as either a synonym, an alternative form, or an antonym.

• Measured by comparing a target word's relation set, for instance its synonyms, against the top n-most-similar words according to the embedding model.

Example:

Query inventory: filter cake --Synonymy--> mud cake

5-most-similar: mud cake, filtercake, gel, filter-cake, cake

• Symmetric relations: the pairs (t_i, t_j) and (t_j, t_i) are considered equivalent in the evaluation.

• Recall (R) = the number of correctly predicted word pairs over all word pairs

• Precision (P) = the number of correctly predicted word pairs over all predictions for each relation category
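
A small sketch of this scoring loop, under one plausible reading of the definitions above: a pair counts as correctly predicted if either direction surfaces in the top 5, and every proposed neighbour counts as a prediction. The pair list and the "_"-joining of multi-word terms are assumptions:

```python
# Hedged evaluation sketch: top-5 neighbour check for glossary pairs.
from gensim.models import Word2Vec

wv = Word2Vec.load("oilgas.d400.model").wv   # placeholder model file

def norm(term):
    return term.lower().replace(" ", "_")    # multi-word terms as one token

synonym_pairs = [("filter cake", "mud cake"), ("seismometer", "seismograph")]

hits, n_preds = 0, 0
for a, b in synonym_pairs:
    found = False
    for src, tgt in ((a, b), (b, a)):        # symmetric: either direction counts
        if norm(src) in wv:
            top5 = [w for w, _ in wv.most_similar(norm(src), topn=5)]
            n_preds += len(top5)
            found = found or norm(tgt) in top5
    hits += found

recall = hits / len(synonym_pairs)
precision = hits / n_preds if n_preds else 0.0
print(f"R = {recall:.2%}, P = {precision:.2%}")
```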


Experiments and results

• The CBOW-based model shows better results than Skip-gram in all semantic relation tasks.

• Generally, the embedding models score higher on antonymy prediction than on synonymy.

• Exploring the impact of each hyper-parameter on the detection of semantic relations:

• The effects of different configurations are diverse and sometimes counter-intuitive.

• Results are inconsistent across the relations with respect to win, neg and min.count.

• Final configuration (OilGas.d400): CBOW model with dim = 400 and the other parameters at their defaults, i.e. win = 5, min.count = 5 and neg = 5.


Comparative evaluation

Model         Coverage (tokens, vocab.)   dim   Synonymy R / P   Antonymy R / P   Alt. form R / P
Google News   26% (100B, 3M)              300    7.0 / 1.8       37.0 /  8.1       1.6 / 0.4
Wiki+Giga     23% (6B, 400K)              300    3.2 / 0.8       43.8 / 10.2       3.7 / 0.8
OilGas.d400   31% (108M, 330K)            400   13.1 / 3.5       50.8 / 11.4      11.0 / 2.7

• Compare the domain-specific embeddings with general-domain embeddings

• Pre-trained embeddings: Wiki+Giga [1] and Google News [2]

[1] https://nlp.stanford.edu/projects/glove/
[2] https://code.google.com/archive/p/word2vec/


Comparative evaluation

• This comparison is somewhat unfair: differences in pre-processing and hyper-parameter tuning!

• enwiki: the same pre-processing steps and hyper-parameters used to train the CBOW model over the English Wikipedia dump.

• enwiki+OilGas: a similar experiment with a data set consisting of both the general and the domain-specific corpora.

Model           Coverage           dim   Synonymy R / P   Antonymy R / P   Alt. form R / P
Google News     26% (100B, 3M)     300    7.0 / 1.8       37.0 /  8.1       1.6 / 0.4
Wiki+Giga       23% (6B, 400K)     300    3.2 / 0.8       43.8 / 10.2       3.7 / 0.8
OilGas.d400     31% (108M, 330K)   400   13.1 / 3.5       50.8 / 11.4      11.0 / 2.7
enwiki          29% (1.8B, 2M)     400    6.7 / 1.8       33.3 /  7.5       8.1 / 1.9
enwiki+OilGas   31% (1.9B, 2.3M)   400    7.8 / 2.1       47.7 / 10.7       8.9 / 2.0


Error Analysis

• The domain-specific model provides better results than the general-domain models on a domain-specific benchmark.

• Performance is low for all three tasks, in particular for synonymy detection.

• Explore the reasons behind these low scores through an in-depth error analysis:

• The primary cause: out-of-vocabulary (OOV) terms in the query inventory.

• The model vocabulary contains only 31% of the evaluation dataset!

• Domain-specific terms vs. domain-specific corpus

(OOV rate in the synonymy relation)

n-grams    OOV          frq = 0   0 < frq < min.count
uni-gram   18 (10%)     12        6
bi-gram    343 (91%)    123       51
tri-gram   72 (100%)    72        -
>3         16 (100%)    16        -

• Excluding the OOV terms: R = 29%, P = 6.5% (synonymy detection).

• The scores are still low: examine the model predictions more closely!
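
As an aside, the OOV breakdown above is straightforward to reproduce; a minimal sketch, assuming glossary terms are lower-cased and multi-word terms were joined with "_" during phrase detection:

```python
# Hedged OOV-analysis sketch: bucket glossary terms by n-gram length and
# count how many are missing from the model vocabulary.
from collections import Counter
from gensim.models import Word2Vec

wv = Word2Vec.load("oilgas.d400.model").wv   # placeholder model file
glossary = ["filter cake", "mud cake", "acidizing", "in-situ combustion"]  # sample terms

oov, total = Counter(), Counter()
for term in glossary:
    n = len(term.split())
    total[n] += 1
    if term.lower().replace(" ", "_") not in wv:
        oov[n] += 1

for n in sorted(total):
    print(f"{n}-gram: {oov[n]}/{total[n]} OOV")
```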


• 100 terms that also occur in the model vocabulary are randomly selected from the reference inventory.

• Their 10-most-similar words according to the word embeddings are manually categorised, following Leeuwenberg et al. (2016).

Category                         Description
1. Spelling Variant              The prediction is an abbreviation of the target word, or prediction and target differ only in hyphenation.
2. Alternative or derived form   Inflections or derivations: the prediction is an alternative or derived form of the target word.
3. Reference-Synonyms            The prediction is a synonym of the target word in the oilfield glossary.
4. Human-judged Synonyms         The prediction is judged to be a synonym by the expert.
5. Antonyms                      The prediction is an antonym of the target term.
6. Hypernyms                     The prediction is a more general category of the target term.
7. Hyponyms                      The prediction is a more specific type of the target term.
8. Co-Hyponyms                   The prediction and the target term share a common hypernym.
9. Holonyms                      The prediction denotes a whole whose part is denoted by the target term.
10. Meronyms                     The prediction is a part of the target term.
11. Related                      Prediction and target term are semantically related.
12. Unrelated                    The association between prediction and target term is unknown; they are semantically unrelated.


Category                         Example                                 1st:10th (%)
1. Spelling Variant              borehole → bore-hole                     2.40
2. Alternative or derived form   acidizing → acidization                  3.20
3. Reference-Synonyms            filter cake → mud cake                   2.80
4. Human-judged Synonyms         seismometer → seismograph                8.40
5. Antonyms                      transgressive → regressive               0.90
6. Hypernyms                     acidizing → stimulation                  1.30
7. Hyponyms                      EOR → In-situ combustion                 9.30
8. Co-Hyponyms                   EOR → MEOR                              13.10
9. Holonyms                      shoe → wellbore                          1.10
10. Meronyms                     rig → wellhead                           2.80
11. Related                      Kirchhoff migration → NMO correlation   35.20
12. Unrelated                    backflow → sediment-laden               19.50

• In general, the predictions are semantically meaningful: in the majority of cases there is some type of morphosyntactic or semantic relation (the Unrelated/Unknown category excepted).

• Less than 20% of the errors are assigned to the Unrelated/Unknown category.


• This is consistent with the findings of Leeuwenberg et al. (2016) (on the general domain and WordNet).

• If we count the human-judged synonyms as true positives, precision and recall become considerably higher!

• The embedding model proposes additional synonyms that are not in the reference (the reference was compiled manually).

• The most frequent error type falls into the Related category.

• The hyponym and co-hyponym relations are another frequent error type (also reported in previous studies).

• Morphosyntactic relations such as alternative or derived form and spelling variant account for a further class of errors.

• Several meaningful relation types appear, such as hypernyms, meronyms and holonyms (useful in many downstream applications).


Conclusion

• We observe that constructing domain-specific word embeddings is worthwhile even with a considerably smaller corpus.

• Although the evaluation shows low performance on synonymy detection, an in-depth error analysis reveals the model's ability to discover semantic relations: hyponymy, co-hyponymy and relatedness.


Future work

• Ongoing expert annotation survey (geoscientists from Statoil): https://goo.gl/forms/hMv1PO9iZHZ5w35m2

• Cope with OOV terms:
  • Phrase extraction (n-grams that are in the input data but have not been detected by the model)
  • Unseen words (n-grams that are not in the input data)

• Employ the embeddings in downstream tasks (e.g., CDA Unstructured Data Challenge, Statoil sentence markup task)


CDA Unstructured Data Challenge

• UK subsurface well data and seismic documents + metadata

• Approx. 500,000 data items, 3.5 TB in total

• Participants executed a proposal that:
  • Makes use of the unstructured data (reports, analyses, etc.), along with any other data of interest
  • Demonstrates the value that can be added through modern data analytics techniques


CDA Unstructured Data Challenge

CS-8: Hardcopy Cataloguing standard (38 report types)

PON-9: Basic Set Well (9 types, 27 sub-types)



Thank You!


Appendix

dim   Synonymy (%)   Antonymy (%)   Alternative form (%)
      R      P       R      P       R      P
50    10.2   2.7     42.9    9.6     9.8   2.3
100   10.2   2.7     49.2   11.1    11.0   2.6
200   12.4   3.3     49.2   11.1    12.3   2.9
300   13.1   3.5     49.2   11.1    11.7   2.7
400   13.1   3.5     50.8   11.4    11.7   2.7
500   12.4   3.3     47.6   10.7    12.9   3.0
600   12.4   3.3     46.0   10.4    11.0   2.6
700   12.4   3.3     47.6   10.7    11.7   2.7

Table: Evaluation results for different vector sizes (default = 100)

win   Synonymy (%)   Antonymy (%)   Alternative form (%)
      R      P       R      P       R      P
2     10.2   2.7     49.2   11.1    12.3   2.9
3     12.4   3.3     42.9    9.6     9.8   2.3
5     10.2   2.7     49.2   11.1    11.0   2.6
10    10.9   2.9     47.6   10.7    12.3   2.9
15    10.2   2.7     50.8   11.4    11.0   2.6
20    10.2   2.7     47.6   10.7    10.4   2.4

Table: Evaluation results for different context window sizes (default = 5)


neg   Synonymy (%)   Antonymy (%)   Alternative form (%)
      R      P       R      P       R      P
3     10.2   2.7     47.6   10.7    10.4   2.4
5     10.2   2.7     49.2   11.1    11.0   2.6
10    10.2   2.7     49.2   11.1    11.7   2.7
15    10.2   2.7     46.0   10.4    12.3   2.9

Table: Evaluation results for different numbers of negative samples (default = 5)

min.count   Synonymy (%)   Antonymy (%)   Alternative form (%)
            R      P       R      P       R      P
2            9.9   2.7     48.4   10.9    11.8   2.7
3           10.1   2.7     50.0   11.2    12.0   2.8
5           10.2   2.7     49.2   11.1    11.0   2.6
10          10.4   2.8     48.3   10.9    11.7   2.7

Table: Evaluation results for different values of the frequency cut-off (default = 5)


Category   1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th   Example
SLB        12    5     2     4     3     0     0     1     1     0      filter cake / mud cake
HUM        18    11    13    5     8     6     3     5     5     6      seismometer / seismograph
SPL        5     6     2     2     0     1     5     2     1     0      borehole / bore-hole
REL        11    23    19    27    29    33    41    40    35    31     Kirchhoff migration / NMO correlation
NAM        3     6     4     5     5     4     1     5     5     4      borehole / KSDB (Kola Superdeep Borehole)
CHP        9     13    16    16    15    17    11    7     13    14     EOR / MEOR
INF/DER    8     6     4     3     3     3     2     3     0     1      acidizing / acidization
PLS        1     1     0     1     1     0     0     0     0     0      dyke / dykes
FRC        3     2     7     5     4     7     7     4     12    11     shoe / casing
HPO        11    8     7     12    9     10    9     12    7     8      EOR / In-situ combustion
HPR        4     1     1     2     2     1     1     0     1     0      acidizing / stimulation
UNK        13    12    16    13    14    14    16    17    18    20     backflow / sediment-laden
MRN        1     2     6     3     5     2     2     3     1     3      rig / wellhead
HOL        0     1     2     2     0     2     2     0     0     2      shoe / wellbore
ANT        1     3     1     0     2     0     0     1     1     0      transgressive / regressive

Table: Manual error analysis results for the 10-most-similar words
