sCooL: A System for Academic Institution Name Normalization

Ferosh Jacob, Faizan Javed, Meng Zhao, and Matt McNair
Classification R & D, CareerBuilder


Post on 24-Feb-2016



Presentation overview

About sCooL
◦ What is entity normalization?
◦ Why is academic entity normalization important?
◦ What are the academic entity normalization challenges?

Inside sCooL
◦ A high-level overview of the core components
◦ Atlas, the mapping manager

Evaluating sCooL
◦ Comparing sCooL with the existing implementation
◦ Independent evaluation of sCooL

Concluding remarks
◦ Demo
◦ Questions?

About sCooL: Academic entity normalization facts

Facts

7,021 post-secondary Title IV institutions in 2010-11*

200 million unique visitors at CB U.S.

12 million unique academic institution entries in the CB resume database

* http://nces.ed.gov/fastfacts/display.asp?id=84

About sCooL: Academic entity normalization definition

[Diagram: a table of ten surface forms (columns: No., Name (surface form), Frequency; frequencies 410, 139, 131, 6, 1, 1, 1, 1, 1, 1), with a brace grouping the surface forms under one Entity]

About sCooL: Why academic entity normalization?

Improved searching

Labor market dynamics insights

About sCooL: Academic entity normalization challenges

No.  Name (surface form)             Frequency
1    Salford College                 410
2    Salford College of Technology   139
3    Salford City College            131
4    Salford Uni                     6
5    Salford University -            1
6    The University of Salford.      1
7    Salford University **+          1
8    University of Salford 1982      1
9    =- University OF SALFORD        1
10   University of Salford-          1

Entity: Salford City College, Merchants Quay, Salford Quays, United Kingdom

Entity: University of Salford, Salford, Lancashire, United Kingdom

Entity: Salford College, 68 Grenfell Street, Adelaide, Australia

How will you identify the most accurate normalization from a given surface form?

About sCooL: Academic entity normalization challenges (continued)

String similarity algorithms
◦ Edit distance

Salford university -> Salford Unevarsity (edit distance 2; a spelling error)

St. Loye’s College -> St. Luke’s College (edit distance 2; two different academic institutions)

How will you distinguish a spelling or typing error from a mapping to a different institution?
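The two edit-distance figures above can be checked with a short sketch. This is a plain dynamic-programming Levenshtein distance, not sCooL's own code; the function name `edit_distance` is made up for illustration, and both strings are lower-cased first on the assumption that the slide's distances ignore letter case.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


pairs = [
    ("Salford university", "Salford Unevarsity"),
    ("St. Loye’s College", "St. Luke’s College"),
]
for left, right in pairs:
    # Case is ignored here (an assumption about how the slide counts edits).
    print(left, "->", right, edit_distance(left.lower(), right.lower()))
```

Both pairs come out at the same distance, which is why edit distance alone cannot separate a typo from a different institution.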

About sCooL: Academic entity normalization challenges

How will you create and maintain the surface form-entity mappings?

Legacy names (mergers)

◦ University of Central England in Birmingham is an old name of Birmingham City University

◦ In January 2009, Salford College merged with Eccles College and Pendleton College to form Salford City College

◦ In October 2004, Victoria University of Manchester merged with the University of Manchester Institute of Science and Technology to form The University of Manchester

Popular names and acronyms

◦ Ole Miss is a popular name for The University of Mississippi

◦ MIT is an acronym for Massachusetts Institute of Technology. However, GIT is not an acronym for Georgia Institute of Technology; Georgia Tech and Ga Tech are the popular names for that institution.
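One way to picture such mappings is a many-to-one lookup table from known surface forms to entity names. The sketch below is only an illustration built from the examples on this slide; the `ALIASES` table and the `lookup` helper are hypothetical names, not part of sCooL or Atlas.

```python
from typing import Optional

# Hypothetical alias table: legacy names, popular names and acronyms,
# each pointing at a single current entity name.
ALIASES = {
    "university of central england in birmingham": "Birmingham City University",
    "eccles college": "Salford City College",
    "pendleton college": "Salford City College",
    "victoria university of manchester": "The University of Manchester",
    "ole miss": "The University of Mississippi",
    "mit": "Massachusetts Institute of Technology",
    "georgia tech": "Georgia Institute of Technology",
    "ga tech": "Georgia Institute of Technology",
}


def lookup(surface_form: str) -> Optional[str]:
    """Return the mapped entity, or None when the surface form is unknown."""
    key = " ".join(surface_form.lower().split())  # fold case and whitespace
    return ALIASES.get(key)


print(lookup("Ole Miss"))
print(lookup("  Ga   Tech "))
print(lookup("GIT"))  # not a recognised acronym, so no entity is returned
```

A table like this has to be curated: nothing in the string "GIT" tells a program that it is not used for Georgia Tech, while "MIT" is used for its institution.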

About sCooL: Academic entity normalization challenges

How can we remove K-12 schools and noise?

No.  Top 10 most frequent universities in the UK dataset   Frequency
1    N/A                                                    128976
2    City & Guilds                                          23992
3    Not Specified                                          18598
4    City and Guilds                                        17441
5    Open University                                        6886
6    MIDDLESEX UNIVERSITY                                   5490
7    University of East London                              5266
8    University of Greenwich                                5108
9    CITY UNIVERSITY                                        4863
10   Kingston University                                    4856

Institution type   Distribution
College            23.32%
University         16.57%
K-12 school        34.22%
Not sure           25.89%
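Part of the noise in the top-10 list is easy to see: placeholder entries such as "N/A", and the same name written as "City & Guilds" and "City and Guilds". The sketch below drops the placeholders and merges trivially different spellings. It is an assumed preprocessing step for illustration only; the `PLACEHOLDERS` set and both helper functions are hypothetical, and it does not address K-12 detection.

```python
from collections import Counter

# Rows copied from the top-10 table above.
RAW_COUNTS = [
    ("N/A", 128976), ("City & Guilds", 23992), ("Not Specified", 18598),
    ("City and Guilds", 17441), ("Open University", 6886),
    ("MIDDLESEX UNIVERSITY", 5490), ("University of East London", 5266),
    ("University of Greenwich", 5108), ("CITY UNIVERSITY", 4863),
    ("Kingston University", 4856),
]

# Assumed list of placeholder values that carry no institution name.
PLACEHOLDERS = {"n/a", "not specified"}


def canonical(name: str) -> str:
    """Lower-case, spell '&' as 'and', and collapse repeated whitespace."""
    return " ".join(name.lower().replace("&", " and ").split())


def merge_counts(rows):
    """Sum frequencies per canonical name, skipping placeholder entries."""
    merged = Counter()
    for name, count in rows:
        key = canonical(name)
        if key not in PLACEHOLDERS:
            merged[key] += count
    return merged


merged = merge_counts(RAW_COUNTS)
for name, count in merged.most_common(3):
    print(name, count)
```

With these rules the two City and Guilds rows collapse into one entry whose frequency is the sum of both.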

About sCooL: Challenges summary

How will you identify the most accurate normalization from a given surface form?

How will you distinguish a spelling or typing error from a mapping to a different institution?

How will you create and maintain the surface form-entity mappings?

How can we remove K-12 schools and noise?

Inside sCooL: A high-level overview of the system

[Flow diagram; stages in order]

Raw input query (surface form)

Remove K-12 schools
• Weka classifier

Search institutions using mappings DB
• Lucene index

Refine results
• String comparison algorithm

Normalized entity
• Update mappings DB
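The stages of this overview can be sketched end to end in a few lines. Everything below is a stand-in chosen for brevity, not the system's implementation: a keyword check replaces the Weka classifier, a token-overlap scan replaces the Lucene index, Python's `difflib.SequenceMatcher` replaces the string comparison algorithm, and the `MAPPINGS` dictionary, the keyword list and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional

# Hypothetical mappings DB: searchable surface form -> entity display name.
MAPPINGS = {
    "university of salford": "University of Salford",
    "salford university": "University of Salford",
    "salford city college": "Salford City College",
}

# Assumed keywords; the real system uses a trained classifier for this stage.
K12_KEYWORDS = {"primary", "secondary", "grammar", "high"}


def looks_like_k12(query: str) -> bool:
    """Keyword stand-in for the K-12 classifier stage."""
    return any(token in K12_KEYWORDS for token in query.lower().split())


def search_candidates(query: str) -> List[str]:
    """Token-overlap stand-in for the index search stage."""
    tokens = set(query.lower().split())
    return [form for form in MAPPINGS if tokens & set(form.split())]


def refine(query: str, candidates: List[str], threshold: float) -> Optional[str]:
    """Keep the most similar candidate, but only if it reaches the threshold."""
    best_form, best_score = None, 0.0
    for form in candidates:
        score = SequenceMatcher(None, query.lower(), form).ratio()
        if score > best_score:
            best_form, best_score = form, score
    return best_form if best_score >= threshold else None


def normalize(query: str, threshold: float = 0.85) -> Optional[str]:
    """Run the stages in order; None means no normalization is returned."""
    if looks_like_k12(query):
        return None
    form = refine(query, search_candidates(query), threshold)
    if form is None:
        return None
    entity = MAPPINGS[form]
    MAPPINGS.setdefault(query.lower(), entity)  # update the mappings DB
    return entity


print(normalize("Salford Unevarsity"))
print(normalize("Salford High School"))
```

The final `setdefault` call mirrors the last stage of the diagram: a newly resolved surface form is stored so that the next identical query is an exact hit.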

Inside sCooL: Atlas, sCooL’s mapping manager

[Diagram with the components: CB mappings, Wiki mappings, MongoDB, Lucene, Atlas, sCooL]

Inside sCooL: Refining Lucene results

$$\text{Accuracy} = \frac{\text{True normalizations}}{\text{Non-null normalizations (True + False)}}$$

$$\text{Coverage} = \frac{\text{True normalizations}}{\text{All normalizations (True + False + Null)}}$$

[Plot: Coverage and Accuracy versus threshold similarity; horizontal axis from 0 to 2, vertical axes from 0 to 0.6 and from 0.82 to 1]
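The two metrics follow directly from three counts: true, false and null normalizations. A minimal sketch, with illustrative counts that are not taken from the evaluation:

```python
def accuracy(true_count: int, false_count: int) -> float:
    """True normalizations over all non-null normalizations."""
    return true_count / (true_count + false_count)


def coverage(true_count: int, false_count: int, null_count: int) -> float:
    """True normalizations over all queries, including null results."""
    return true_count / (true_count + false_count + null_count)


# Illustrative counts only: 90 correct, 10 wrong, 100 left unanswered.
print(accuracy(90, 10))
print(coverage(90, 10, 100))
```

Returning null for an uncertain query leaves accuracy untouched but lowers coverage, which is the trade-off the threshold controls.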

Evaluation: Comparing sCooL with the existing implementation

Targeted metrics: Accuracy and Coverage

Precision is more important than recall

Stratified sampling to estimate the ratios

Favor high-frequency queries in sampling

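Evaluation: Comparing sCooL with the existing implementation

$$\Pr\left( \left| \frac{\hat{P}_i - P_i}{P_i} \right| < h_i \right) = C$$

$$n_0 = \frac{Z_\alpha^2 \, P_i (1 - P_i)}{h_i^2}, \qquad n_i = \frac{n_0}{1 + (n_0 - 1)/N_i}$$

$$\hat{P} = \sum_{i=1}^{3} \frac{N_i}{\sum_i N_i} \, \hat{P}_i$$

[Chart: Sampling design; three strata, [1, 6], [7, 39] and [40, max]; shares 91%, 7% and 2%]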
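The sample-size formulas can be followed with a small worked sketch. The functions mirror the three expressions term by term: the initial size n_0, its correction for a finite stratum of size N_i, and the size-weighted overall estimate. The inputs (z = 1.96, p = 0.5, h = 0.05, and the toy strata) are assumptions picked for illustration, not the values used in the evaluation.

```python
from typing import Sequence


def initial_sample_size(z: float, p: float, h: float) -> float:
    """n_0 = z^2 * p * (1 - p) / h^2."""
    return z ** 2 * p * (1 - p) / h ** 2


def corrected_sample_size(n0: float, stratum_size: int) -> float:
    """n_i = n_0 / (1 + (n_0 - 1) / N_i), the finite-population correction."""
    return n0 / (1 + (n0 - 1) / stratum_size)


def stratified_estimate(sizes: Sequence[int], estimates: Sequence[float]) -> float:
    """Overall estimate: per-stratum estimates weighted by N_i / sum(N_i)."""
    total = sum(sizes)
    return sum(size / total * est for size, est in zip(sizes, estimates))


n0 = initial_sample_size(z=1.96, p=0.5, h=0.05)      # assumed inputs
print(round(n0, 2))
print(round(corrected_sample_size(n0, 3896), 2))     # a stratum of 3,896 items
print(stratified_estimate([100, 300], [0.5, 0.9]))   # toy strata
```

The correction matters most for small strata: the smaller N_i is relative to n_0, the further n_i falls below n_0.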

Evaluation: Comparing sCooL with the existing implementation

Groups      Group size  Sample size  Sampling rate  sCooL accuracy  Existing system accuracy
[1, 6]      145,126     780          1%             92%             75%
[7, 39]     11,938      736          6%             96%             79%
[40, max]   3,896       653          17%            95%             85%
Total       160,960     2,169        1%             95%             80%

Dataset: UK CareerBuilder data

                    sCooL   Existing system
Coverage            40%     1%
Weighted coverage   73%     46%

Evaluation: Independent evaluation of sCooL

Test 1: 4ICU university list

The 4ICU [22] website: 145 popular universities and colleges in the U.K.

Test 2: Guardian university list

The Guardian [23]: a list of 135 universities in the U.K.

Dataset        Accuracy                  Coverage
               sCooL  Existing system    sCooL  Existing system
Test 1 (145)   93%    91%                95%    79%
Test 2 (135)   93%    90%                88%    72%

sCooL: Demo

Atlas: http://ec2-54-193-1-73.us-west-1.compute.amazonaws.com/Atlas/

sCooL: Questions

sCooL: Appendix

Lucene search results for “University of Milan”

Rank  Searchable field                  Display name
1     polytechnic university of milan   Polytechnic University of Milan
2     university of milan               University of Milan
3     catholic university of milan      Università Cattolica del Sacro Cuore
4     iulm university of milan          IULM University of Milan
5     university of milan bicocca       University of Milan Bicocca
6     milan university                  University of Milan
7     politecnico of milan              Polytechnic University of Milan
8     milan polytechnic                 Polytechnic University of Milan
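The listing shows why the raw search ranking needs a refinement step: the exact match sits at rank 2, not rank 1. The sketch below re-scores the eight results against the query with a character-bigram Jaccard similarity and keeps the best one above a threshold. The choice of measure, the 0.8 threshold and the helper names are assumptions for illustration; the slides do not say which comparison sCooL applies at this step.

```python
from typing import List, Optional, Set, Tuple

# (searchable field, display name) pairs from the results above.
CANDIDATES = [
    ("polytechnic university of milan", "Polytechnic University of Milan"),
    ("university of milan", "University of Milan"),
    ("catholic university of milan", "Università Cattolica del Sacro Cuore"),
    ("iulm university of milan", "IULM University of Milan"),
    ("university of milan bicocca", "University of Milan Bicocca"),
    ("milan university", "University of Milan"),
    ("politecnico of milan", "Polytechnic University of Milan"),
    ("milan polytechnic", "Polytechnic University of Milan"),
]


def bigrams(text: str) -> Set[str]:
    """Set of two-character substrings of the lower-cased text."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two bigram sets, from 0.0 to 1.0."""
    left, right = bigrams(a), bigrams(b)
    if not left and not right:
        return 1.0
    return len(left & right) / len(left | right)


def best_match(query: str, candidates: List[Tuple[str, str]],
               threshold: float) -> Optional[str]:
    """Display name of the best-scoring candidate, or None below threshold."""
    scored = [(jaccard(query, field), field, name) for field, name in candidates]
    score, _, name = max(scored)
    return name if score >= threshold else None


print(best_match("University of Milan", CANDIDATES, threshold=0.8))
print(best_match("Harvard", CANDIDATES, threshold=0.8))
```

Scoring against the searchable field while returning the display name keeps matching insensitive to case and to alternative spellings stored for the same entity.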

sCooL: Appendix

String similarity algorithms

Rank  String similarity algorithm
1     Levenshtein
2     Lucene Levenshtein
3     N-gram
4     Jaccard Similarity
5     Jaro Winkler
6     Hamming
7     Equals
8     Ignore case Equals

Evaluation: Comparing sCooL with the existing implementation

Balancing between Accuracy and Coverage

[Plot: Correct, Wrong and Null results versus threshold similarity; horizontal axis from 0 to 2, vertical axis "Total input queries" from 0 to 7000]

[Plot: Coverage and Accuracy versus threshold similarity; horizontal axis from 0 to 2, vertical axes from 0 to 0.6 and from 0.82 to 1]
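The balance between the two metrics can be illustrated by sweeping a threshold over scored results. The sketch assumes that a higher score means a closer match and that any answer scoring below the threshold is returned as null; the slides do not spell out either point, and the five scored results are invented for illustration.

```python
from typing import Dict, List, Tuple


def sweep(results: List[Tuple[float, bool]],
          thresholds: List[float]) -> List[Dict[str, float]]:
    """Count correct, wrong and null answers at each threshold.

    Each result is (score of the best candidate, whether it was correct).
    """
    rows = []
    for threshold in thresholds:
        correct = sum(1 for score, ok in results if score >= threshold and ok)
        wrong = sum(1 for score, ok in results if score >= threshold and not ok)
        null = len(results) - correct - wrong
        answered = correct + wrong
        rows.append({
            "threshold": threshold,
            "correct": correct,
            "wrong": wrong,
            "null": null,
            # Accuracy is reported as 0.0 when nothing is answered.
            "accuracy": correct / answered if answered else 0.0,
            "coverage": correct / len(results) if results else 0.0,
        })
    return rows


# Invented scored results, for illustration only.
RESULTS = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.40, False)]

for row in sweep(RESULTS, [0.5, 0.8]):
    print(row)
```

In this toy data the stricter threshold removes the wrong answer but also discards a correct one, so accuracy rises while coverage falls.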

About sCooL: Related work

Cucerzan, S. (Microsoft Research) worked on large-scale named entity disambiguation based on Wikipedia data in 2007

Jijkoun, V. et al. (Univ. of Amsterdam) proposed named entity normalization (NEN) for user-generated content in 2008

Liu, X. et al. (Microsoft Research, China) performed joint inference for named entity recognition (NER) and NEN on tweets in 2012

Magdy, W. et al. (IBM, Egypt) developed NEN for Arabic names in 2007

Jonnalagadda, S. et al. (Lnx Research, CA) developed NEMO, an NER and NEN system for PubMed author affiliations, in 2011

Cohen, A. (OHSU) studied gene/protein NEN using automatically generated libraries in 2005