Application of Regular Expressions in the German Business Register

Page 1: Application of Regular Expressions in the German Business Register

© Federal Statistical Office Germany, IV A2

Federal Statistical Office Germany

Application of Regular Expressions in the German Business Register

Session 5: Projects on Improvements for Business Registers

Wiesbaden Group on Business Registers

Paris, November 26th, 2007, Patrizia Moedinger

Page 2: Application of Regular Expressions in the German Business Register

Example 1: Improving legal form coding by using regular expressions

Page 3: Application of Regular Expressions in the German Business Register

Background

information on legal forms mainly from VAT records

not all administrative sources provide information on legal forms

sources use different, incompatible legal form codings or different aggregation levels

special requirements for other purposes like the coding of institutional sectors

Page 4: Application of Regular Expressions in the German Business Register

Background

enterprises (legal units) with certain legal forms are legally obliged to carry their legal form in the enterprise name:

  incorporated firms

  non-incorporated firms

  cooperatives

  merchants that are registered in the German Commercial Register

enterprise names can be used for legal form coding

Page 5: Application of Regular Expressions in the German Business Register

Definition of search patterns

patterns from nomenclature, abbreviations and notations (tax authorities), e.g. GmbH, AG & Co. KG, Limited, Ltd.

patterns from real BR data: spelling mistakes, missing blanks, ...

construction of regular expressions
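The pattern construction described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the pattern table, the function name `detect_legal_form` and the sample enterprise names are hypothetical and are not the patterns used in the German BR. The sketch shows how one expression per legal form can tolerate missing blanks, dots and case variations, and why more specific forms have to be tested first.

```python
import re

# Hypothetical pattern table (not the production patterns of the German BR).
# More specific legal forms come first, so that "GmbH & Co. KG" is not
# coded as plain "GmbH".
LEGAL_FORM_PATTERNS = [
    ("GmbH & Co. KG",
     re.compile(r"\bG\.?\s*m\.?\s*b\.?\s*H\.?\s*(?:&|und|u\.)\s*Co\.?\s*KG\b",
                re.IGNORECASE)),
    # Case-sensitive on purpose: "AG" must not match inside ordinary words.
    ("AG & Co. KG", re.compile(r"\bAG\s*(?:&|und|u\.)\s*Co\.?\s*KG\b")),
    ("GmbH", re.compile(r"\bG\.?\s*m\.?\s*b\.?\s*H\b", re.IGNORECASE)),
    ("AG", re.compile(r"\bAG\b")),
    ("Ltd.", re.compile(r"\b(?:Limited|Ltd)\b\.?", re.IGNORECASE)),
]


def detect_legal_form(enterprise_name):
    """Return the label of the first matching pattern, or None."""
    for label, pattern in LEGAL_FORM_PATTERNS:
        if pattern.search(enterprise_name):
            return label
    return None


# Made-up enterprise names, including a missing-blank variant.
for name in ["Mueller GmbH&Co.KG", "BMW AG", "Example Gmbh", "Baeckerei Schmidt"]:
    print(name, "->", detect_legal_form(name))
```

Ordering the table from most to least specific is a design choice of this sketch; an alternative would be to collect all matches and resolve conflicts afterwards.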

Page 6: Application of Regular Expressions in the German Business Register

Evaluation of search patterns

completeness of coding

  legal obligation: high share of legal forms found in enterprise names

reliability: evaluation of coding results

  drawing a sample after legal form coding

  classification of the coding results

  calculation of sensitivity, specificity, positive predictive value, negative predictive value

Page 7: Application of Regular Expressions in the German Business Register

Completeness of coding

[Bar chart, axis 0 to 100 %: for each of four legal form groups (sole proprietors; non-incorporated firms; incorporated firms; miscellaneous legal forms, including cooperatives), the share of enterprises for which no legal form / a legal form could be detected from the enterprise name. Data labels in extracted order, paired per bar: 93.7 / 6.3, 9.9 / 90.1, 3.2 / 96.8, 89.7 / 10.3]

Page 8: Application of Regular Expressions in the German Business Register

Evaluation of Type I and II errors

                                  enterprise name contains
regular expression detects        legal form     no or wrong legal form
legal form                           1,009                 4
no or wrong legal form                  26             2,961

N = 4,000

PPV (positive predictive value) = 1,009 / (1,009 + 4) = 99.6 %

NPV (negative predictive value) = 2,961 / (2,961 + 26) = 99.1 %

Sensitivity = 1,009 / (1,009 + 26) = 97.5 %

Specificity = 2,961 / (4 + 2,961) = 99.9 %
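The four measures on this slide can be recomputed from the sample counts with a few lines of Python. The sketch below takes only the four cell counts from the slide; the function name `evaluate` and the variable names are illustrative assumptions.

```python
def evaluate(tp, fp, fn, tn):
    """Compute the four evaluation measures from a 2x2 classification table.

    tp: regex detects a legal form, name contains one
    fp: regex detects a legal form, name contains none or a wrong one
    fn: regex detects none, name contains a legal form
    tn: regex detects none, name contains none or a wrong one
    """
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Cell counts of the sample shown on the slide (N = 4,000).
measures = evaluate(tp=1009, fp=4, fn=26, tn=2961)
for name, value in measures.items():
    print(f"{name}: {100 * value:.1f} %")
```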

Page 9: Application of Regular Expressions in the German Business Register

Example 2: Data pre-processing as a preliminary for record linkage

Page 10: Application of Regular Expressions in the German Business Register

Background

no common unique identifiers available

data from different sources are initially linked by names and addresses

different or no address standards

different notations: "BMW" or "Bayerische Motorenwerke" or "Bay. Motorenwerke"

the German BR is technically limited to storing two addresses per unit (dispatch and domicile)

Page 11: Application of Regular Expressions in the German Business Register

Problem of non-standardized notations

matching by administrative identifiers

  dependent variable = match by administrative identifiers + no change in the postal code

  independent variable = differences between enterprise names, street names and town names (Levenshtein edit distance)

same (administrative) source

different sources (administrative source vs. BR)
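The independent variable named above, the Levenshtein edit distance, is plotted on the following slides after division by the maximum string length. A minimal pure-Python sketch of that normalized measure is given here; the function names are assumptions made for illustration, and the presentation does not say which implementation was actually used.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer string (0 to 1)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


# The notation variants mentioned in the background slide.
print(normalized_distance("Bayerische Motorenwerke", "Bay. Motorenwerke"))
print(normalized_distance("Bayerische Motorenwerke", "BMW"))
```

Dividing by the longer string keeps the value between 0 (identical strings) and 1 (nothing in common), which makes names, streets and towns of different lengths comparable.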

Page 12: Application of Regular Expressions in the German Business Register

Matching probability against string similarity within an administrative source (Employment Agency) (Model: logistic regression)

[Chart: predicted y (0 to 1) against Levenshtein edit distance / maximum string length (0 to 1), with one curve each for enterprise name, street name and town name; regions labeled Match and No Match]
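The shape of such a curve can be illustrated with the logistic function itself. The sketch below is not the fitted model from the presentation: the coefficients `INTERCEPT` and `SLOPE` are made-up values chosen only to show how a predicted match probability falls as the normalized edit distance grows.

```python
import math


def match_probability(distance, intercept, slope):
    """Logistic model: P(match) = 1 / (1 + exp(-(intercept + slope * distance)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * distance)))


# Hypothetical coefficients, NOT estimates from the German BR data.
INTERCEPT = 4.0
SLOPE = -10.0

for distance in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    probability = match_probability(distance, INTERCEPT, SLOPE)
    print(f"distance {distance:.1f}: predicted match probability {probability:.3f}")
```

A negative slope expresses that larger differences between two strings make a match less likely; the distance at which the curve crosses 0.5 is where the model switches from Match to No Match.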

Page 13: Application of Regular Expressions in the German Business Register

Matching probability against string similarity between an administrative source (Employment Agency) and BR (Model: logistic regression)

[Chart: predicted y (0 to 1) against Levenshtein edit distance / maximum string length (0 to 1), with one curve each for enterprise name, street name and town name; regions labeled Match and No match]

Page 14: Application of Regular Expressions in the German Business Register

Pre-processing of administrative data for record linkage

high level of similarity between two strings: identical units

high level of disparity between two strings: different units

[Diagram: scale of differences in name or address, from low (identical unit) to high (different unit)]

Page 15: Application of Regular Expressions in the German Business Register

Pre-processing of administrative data for record linkage

conversion into specific variables for string matching

enterprise address: BMW AG Branch Munich Mr Mueller

  enterprise name: BMW

  legal form: AG

  other elements: Branch Munich Mr Mueller

simplify comparison strings
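The two pre-processing steps on this slide can be sketched in Python. Everything in the sketch is an assumption made for illustration: the small `LEGAL_FORM` pattern, the function names `split_name` and `simplify`, and the rule that the text before the legal form is the enterprise name while the text after it holds the other elements. Real patterns would depend on the data source and the later use.

```python
import re

# Hypothetical, deliberately small set of legal forms for this sketch.
LEGAL_FORM = re.compile(
    r"\b(GmbH\s*&\s*Co\.?\s*KG|GmbH|AG|KG|Ltd\.?|Limited)(?!\w)"
)


def split_name(raw):
    """Split a raw name field into name, legal form and other elements."""
    text = " ".join(raw.split())  # collapse runs of whitespace
    match = LEGAL_FORM.search(text)
    if match is None:
        return {"enterprise_name": text, "legal_form": "", "other_elements": ""}
    return {
        "enterprise_name": text[:match.start()].strip(),
        "legal_form": match.group(1),
        "other_elements": text[match.end():].strip(),
    }


def simplify(text):
    """Simplify a comparison string: drop punctuation, uppercase, collapse blanks."""
    without_punctuation = re.sub(r"[^\w\s]", " ", text)
    return " ".join(without_punctuation.upper().split())


parts = split_name("BMW AG Branch Munich Mr Mueller")
print(parts)
print(simplify("Bay. Motorenwerke"))
```

Comparing the separated and simplified variables, rather than the raw field, keeps elements such as contact persons or branch names from inflating the edit distance between records of the same unit.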

Page 16: Application of Regular Expressions in the German Business Register

Methods for evaluation

evaluate the link between string similarity and match before and after pre-processing the data

evaluation of matching results (drawing a sample after the matching process)

  classification of the matching results

  calculation of sensitivity, specificity, positive predictive value, negative predictive value

controlling for effects caused by the matching program used

Page 17: Application of Regular Expressions in the German Business Register

Synopsis

BR text data needs special treatment in data processing

applications for regular expressions

  simple application: legal form coding (limited set of search patterns)

  more complex application: pre-processing (set of patterns depends on data source and later use)

application of regular expressions should always be evaluated

Page 18: Application of Regular Expressions in the German Business Register

Thank you for your attention.