Advances in Automatic Chemical Spelling Correction

ACS National Meeting, Philadelphia, USA 19th August 2012

Roger Sayle and Daniel Lowe

NextMove Software

Cambridge, UK

Example Spelling Errors

• Sample misspellings of pyrimidine in US patent grants since the beginning of this year.

Incorrect Name    US Patent No.    Issue Date
pryimidine        8093264          10th January 2012
pyrmidine         8097728          17th January 2012
pyrimdine         8114996          14th February 2012
pyrimidne         8129897          6th March 2012
pyrimidinc        8148339          3rd April 2012
pyridmidine       8158627          8th May 2012

Previous Work

• Roger Sayle, Paul Hongxing Xie and Sorel Muresan, “Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction”, Journal of Chemical Information and Modeling, Vol. 52, No. 1, pp. 51-62, 2012.

• G.H. Kirby, M.R. Lord, J.D. Rayner, “Computer Translation of IUPAC Systematic Organic Chemical Nomenclature. 6. (Semi)automatic name correction”, Journal of Chemical Information and Computer Sciences, Vol. 31, pp. 153-160, 1991.

String Edit Distance

• The Levenshtein Distance (Levenshtein 1965) is the minimum number of edits (insertions, deletions or substitutions) required to transform one string into another.

• The Damerau-Levenshtein Distance is an extension of Levenshtein Distance to include transposition of two adjacent characters.

• The distances can be efficiently computed with dynamic programming using the Needleman-Wunsch-Sellers alignment algorithm (bioinformatics).
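The dynamic programming computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the CaffeineFix implementation: the function name `edit_distance` and the `transpositions` flag are assumptions made for this example, and the transposition variant is the restricted form (often called optimal string alignment), which only swaps adjacent characters that are not edited again.

```python
def edit_distance(a, b, transpositions=False):
    """Edit distance between a and b with unit costs.

    With transpositions=False this is the Levenshtein distance; with
    transpositions=True adjacent swaps also cost 1 (restricted
    Damerau-Levenshtein, an assumption made for this sketch).
    """
    prev2 = None                      # row i-2, needed for transpositions
    prev = list(range(len(b) + 1))    # row 0: build b[:j] by j insertions
    for i in range(1, len(a) + 1):
        cur = [i]                     # column 0: delete all of a[:i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(prev[j] + 1,          # deletion
                       cur[j - 1] + 1,       # insertion
                       prev[j - 1] + cost)   # match or substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, prev2[j - 2] + 1)  # adjacent swap
            cur.append(best)
        prev2, prev = prev, cur
    return prev[-1]


print(edit_distance("cloramphenical", "chloramphenicol"))
print(edit_distance("chlorofrom", "chloroform"))
print(edit_distance("chlorofrom", "chloroform", transpositions=True))
```

Only two or three rows of the matrix are kept at a time, so memory grows with the length of one string rather than with the product of both lengths.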

Needleman-Wunsch-Sellers

[Figure: dynamic programming alignment matrix for the misspelling “cloramphenical” (columns) against “chloramphenicol” (rows).]

Further Complications

• Edit operations can have their own specific penalties.

• The latest implementation supports transpositions, to catch spelling mistakes such as “chlorofrom”.

• Some dictionaries to match against have tens of millions of entries, others are infinite.

• The start and end of a name aren’t known in free text; they are assigned based on the quality of the match.

• Correct nesting of parentheses and brackets needs to be enforced as part of the matching process.

• In summary: it’s mind-bogglingly complicated.

Dictionaries as Automata

• Nitrogen-containing heterocycles as a minimal DFA:

– Pyrrole, Pyrazole, Imidazole, Pyridine, Pyridazine, Pyrimidine, Pyrazine
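To illustrate why an automaton is a convenient dictionary for fuzzy matching, here is a minimal sketch that stores the seven heterocycle names in a plain trie and walks it while carrying one row of the edit-distance matrix per node. A trie is used instead of a minimal DFA purely for brevity, and the names `build_trie` and `fuzzy_lookup` are hypothetical; this is not how CaffeineFix is implemented.

```python
END = "$"  # marker key holding the complete word at a terminal node


def build_trie(words):
    """Build a nested-dict trie; shared prefixes share nodes."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def fuzzy_lookup(trie, query, max_dist=1):
    """Return {word: distance} for entries within max_dist edits of query."""
    results = {}

    def walk(node, ch, prev_row):
        # Extend the Levenshtein matrix by one row for trie character ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(row[j - 1] + 1,
                           prev_row[j] + 1,
                           prev_row[j - 1] + (query[j - 1] != ch)))
        if END in node and row[-1] <= max_dist:
            results[node[END]] = row[-1]
        if min(row) <= max_dist:  # otherwise no extension can recover
            for nxt, child in node.items():
                if nxt != END:
                    walk(child, nxt, row)

    first_row = list(range(len(query) + 1))
    for ch, child in trie.items():
        if ch != END:
            walk(child, ch, first_row)
    return results


HETEROCYCLES = ["pyrrole", "pyrazole", "imidazole", "pyridine",
                "pyridazine", "pyrimidine", "pyrazine"]
trie = build_trie(HETEROCYCLES)
print(fuzzy_lookup(trie, "pyrimidne"))
print(fuzzy_lookup(trie, "pyrimdine"))
```

The second query is one edit away from both pyridine and pyrimidine, which shows why the quality of a match cannot always decide between candidates on its own.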

Example IUPAC-like Grammar

• More generally still, CaffeineFix FSMs can represent formal grammars, i.e. infinite dictionaries.

alk    := “meth” | “eth” | “prop” | “but”
parent := alk “ane”
subst  := “bromo” | “chloro” | “fluoro”
locant := “1” | “2” | “3” | “4” /* any digit */
prefix := [ prefix “-” ] [ locant “-” ] subst
        | [ prefix ] subst
name   := [ prefix [ “-” ] ] parent

IUPAC-like Grammar Examples

• methane

• chloroethane

• 2-bromo-propane

• chloro-bromo-methane

• 1-fluoro-2-chloro-ethane

• chlorofluoromethane

• 4-bromomethane

• 1-chloro-1-chloro-1-chloro-methane
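Because the toy grammar is regular, it can be written as an ordinary regular expression, which is one way to check that every example above is accepted. The sketch below is an illustration only: the variable names are invented here, and the locant is assumed to be any single ASCII digit, following the “any digit” comment in the grammar.

```python
import re

ALK = r"(?:meth|eth|prop|but)"
PARENT = ALK + r"ane"
SUBST = r"(?:bromo|chloro|fluoro)"
LOCANT = r"[0-9]"  # assumption: any single digit

# One substituent, optionally preceded by "locant-".
UNIT = rf"(?:{LOCANT}-)?{SUBST}"
# prefix := [ prefix "-" ] [ locant "-" ] subst | [ prefix ] subst
# i.e. a first unit, then either "-unit" or a bare substituent, repeated.
PREFIX = rf"{UNIT}(?:-{UNIT}|{SUBST})*"
# name := [ prefix [ "-" ] ] parent
NAME = re.compile(rf"(?:{PREFIX}-?)?{PARENT}")


def is_name(text):
    """True if the whole string is generated by the toy grammar."""
    return NAME.fullmatch(text) is not None


for example in ["methane", "chloroethane", "2-bromo-propane",
                "chloro-bromo-methane", "1-fluoro-2-chloro-ethane",
                "chlorofluoromethane", "4-bromomethane",
                "1-chloro-1-chloro-1-chloro-methane"]:
    print(example, is_name(example))
```

The grammar is purely syntactic: it accepts 4-bromomethane even though methane has no position 4, and the repetition in `PREFIX` is what makes the set of accepted names infinite.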

Representing Grammars as DFAs

Backward edges allow matching an infinite number of words.

Current IUPAC Grammar FSM

• As of July 2012, the current CaffeineFix grammar contains nearly 1.2 million edges.

• This grammar covers...

– 99.15% (232144/234142) names in the NCI00 database.

– 95.28% (67995/71367) names in the Maybridge catalogue.

– 95.27% (45890/48167) names in the Keyorganics catalogue.

• These figures are comparable to name-to-structure conversion rates on these names.

Spelling Correction Examples

• lH-ben zimidazole → 1H-benzimidazole

• triphenylposhine → triphenylphosphine

• 4- (2-ADAMANTYLCARBAM0YL) -5-TERT-BUTYL-PYRAZOL-1-YL] BENZOIC ACID → 4-(2-adamantylcarbamoyl)-5-tert-butyl-pyrazol-1-yl]benzoic acid

• didec-2-ene → dodec-2-ene

• spiro[2.2]hexane → spiro[2.3]hexane

Low-Cost “Frequent” Edit Ops

• A number of common corrections are so frequent as to be given a lower (free) cost.

1. Deletion of whitespace.

2. Deletion of a hyphen (where not anticipated)

3. Substitution of “l” (lower case el) for “1” (one).

4. Substitution of “I” (upper case eye) for “l” (el) or “1” (one).

5. Substitution of “rn” by “m”.

6. Substitution of “1” (one) by “l” (el).

7. Substitution of “φ” by “rp” [OCR artifact].
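One way to picture these discounted operations is a weighted edit distance in which the listed edits cost nothing and every other edit costs 1. The sketch below is an assumption-laden illustration rather than the CaffeineFix cost model: the tables `FREE_DELETIONS`, `FREE_SUBS` and `FREE_MULTI` encode one reading of the list, with each pair written as (text as found, corrected text), and the function name is hypothetical.

```python
FREE_DELETIONS = {" ", "-"}                      # items 1 and 2
FREE_SUBS = {("l", "1"), ("1", "l"),             # items 3 and 6
             ("I", "l"), ("I", "1")}             # item 4
FREE_MULTI = [("rn", "m"), ("φ", "rp")]          # items 5 and 7


def weighted_distance(observed, target):
    """Cost of turning observed text into target with discounted edits."""
    n, m = len(observed), len(target)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    # Forward DP: every update writes to a cell visited later.
    for i in range(n + 1):
        for j in range(m + 1):
            cur = d[i][j]
            if cur == inf:
                continue
            if i < n:  # delete a character of the observed text
                cost = 0 if observed[i] in FREE_DELETIONS else 1
                d[i + 1][j] = min(d[i + 1][j], cur + cost)
            if j < m:  # insert a character of the target
                d[i][j + 1] = min(d[i][j + 1], cur + 1)
            if i < n and j < m:  # match or substitute
                a, b = observed[i], target[j]
                cost = 0 if a == b or (a, b) in FREE_SUBS else 1
                d[i + 1][j + 1] = min(d[i + 1][j + 1], cur + cost)
            for src, dst in FREE_MULTI:  # multi-character OCR confusions
                if observed.startswith(src, i) and target.startswith(dst, j):
                    ni, nj = i + len(src), j + len(dst)
                    d[ni][nj] = min(d[ni][nj], cur)
    return d[n][m]


print(weighted_distance("lH-ben zimidazole", "1H-benzimidazole"))
print(weighted_distance("Prealburnin", "Prealbumin"))
print(weighted_distance("triphenylposhine", "triphenylphosphine"))
```

Under these assumptions the first two corrections are free, while the genuine misspelling of triphenylphosphine still costs two insertions.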

Handle with Care

• Alas, introducing automatic spelling correction (fuzzy matching) to entity recognition often requires the introduction of white word lists to avoid problems.

– herein → heroin

– aspiring → aspirin

– cranium → uranium

– ability → abilify

• More aggressive correction leads to more problems:

– “be that the line” → methantheline

– “park on a zone” → parconazole
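The role of a white word list can be sketched as a guard that runs before fuzzy matching: ordinary English words on the list are never corrected. The dictionary, the list contents and the function names below are taken from the examples above or invented for illustration; a real pipeline would use far larger resources.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


DICTIONARY = ["heroin", "aspirin", "uranium", "abilify"]
WHITE_WORDS = frozenset({"herein", "aspiring", "cranium", "ability"})


def annotate(token, whitelist=WHITE_WORDS, max_dist=1):
    """Return the matched dictionary entry, or None if no match is allowed."""
    word = token.lower()
    if word in DICTIONARY:
        return word                   # exact hit, no correction needed
    if word in whitelist:
        return None                   # a known ordinary word: leave it alone
    for entry in DICTIONARY:
        if levenshtein(word, entry) <= max_dist:
            return entry              # fuzzy hit within the allowed distance
    return None


print(annotate("herein"))                         # blocked by the white list
print(annotate("herein", whitelist=frozenset()))  # corrected without it
print(annotate("asprin"))
```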

Benchmarking and Analysis

• To quantify the benefits of automatic spelling correction to “real world” chemical text mining we analysed the first 28 weeks of US patent grants from 2012.

• This corresponds to 145,473 documents, issued between 3rd January 2012 and 10th July 2012.

• A total of 4,061,670 IUPAC-like systematic names were identified, with 1,816,317 unique patent/name pairs.

• OPSIN interprets 3,722,399 of these names and 1,647,402 of the unique patent/name pairs.

Total Molecule Entities

All Extracted Molecule Entities

Category    Count      Fraction
Correct     3411665    84%
Simple      97472      2%
D=1         552533     14%

OPSIN Interpreted Molecule Entities

Category    Count      Fraction
Correct     3200193    86%
Simple      87719      2%
D=1         434487     12%

Using correction retrieves ~19% more entities and ~16% more OPSIN-recognizable names.

Unique Molecule Entries

All Extracted Molecule Entities

Category    Count      Fraction
Correct     1430931    79%
Simple      75027      4%
D=1         310359     17%

OPSIN Interpreted Molecule Entities

Category    Count      Fraction
Correct     1343755    82%
Simple      67317      4%
D=1         236330     14%

The effect is more pronounced for unique patent/compound pairs, with a +27% improvement over no correction.
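The quoted improvements can be reproduced from the chart counts: the gain is the number of entities found only through correction (Simple plus D=1) divided by the number found without it (Correct). The short sketch below does that arithmetic; the dictionary keys and the helper name `gain` are invented for this example.

```python
COUNTS = {
    "all, total":    {"correct": 3411665, "simple": 97472, "d1": 552533},
    "OPSIN, total":  {"correct": 3200193, "simple": 87719, "d1": 434487},
    "all, unique":   {"correct": 1430931, "simple": 75027, "d1": 310359},
    "OPSIN, unique": {"correct": 1343755, "simple": 67317, "d1": 236330},
}


def gain(c):
    """Relative increase in retrieved entities due to correction."""
    return (c["simple"] + c["d1"]) / c["correct"]


for label, c in COUNTS.items():
    total = sum(c.values())
    print(f"{label}: {total} entities, +{gain(c):.1%} from correction")
```

The totals in each group also add up to the figures given in the benchmarking summary (4,061,670 names, 1,816,317 unique pairs, and 3,722,399 and 1,647,402 interpreted by OPSIN).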

Influence on N2S Software (OPSIN)

Interpretation            Count     Fraction
Not Interpretable         116216    17.88%
Before not After          11779     1.81%
After not Before          270453    41.61%
Same Before/After         181693    27.95%
Different Before/After    69864     10.75%
Total                     650005    100.00%

Although some valid names are lost by correction, overall the effect is overwhelmingly positive.

Break-down of Edit Operations

Edit Operation    Count     Fraction
Deletion          392324    47.50%
Insertion         232493    28.15%
Substitution      198438    24.03%
Transposition     2670      0.32%
Total             825925    100.00%

Drug Dictionary Entities

All Drug Dictionary Entities

Category    Count      Fraction
Correct     1013711    93%
Simple      6141       0%
D=1         71684      7%

Unique Drug Dictionary Entities

Category    Count      Fraction
Correct     395482     92%
Simple      4119       1%
D=1         30687      7%

Some improvement is seen with drug dictionaries, but there’s little benefit in fixing simple OCR issues.

Target Dictionary Entries

• Simple correction of ChEMBL protein target names

• prostaglandin H- 2 synthase- 1 → prostaglandin H2 synthase 1

• Alanine amino-transferase → Alanine aminotransferase

• acetyl cholinesterase → acetylcholinesterase

• cyclooxy-genase-2 → cyclooxygenase-2

• MAP kinase ERK-2 → MAP kinase ERK2

• HEC-GLCNAC-6-ST → HEC-GLCNAC6ST

• Herne Oxygenase → Heme Oxygenase

• Prealburnin → Prealbumin

• p110- delta → p110delta

Non-word Spelling Correction

• Automatic correction technology can also be applied to entities other than words or IUPAC-like chemical nomenclature.

• For example, Chemical Abstracts Service (CAS) registry numbers.

CAS Registry Number Grammar

• Two to seven digits, followed by a hyphen, two digits, a hyphen and a final check digit

– e.g. 7732-18-5

• Regular Expression: (([1-9]\d{2,5})|([5-9]\d))-\d\d-\d

[Figure: finite state machine for the CAS registry number grammar, with numbered states and transitions on digit ranges and hyphens.]

CAS Check Digit Calculation

• More generally CaffeineFix’s finite state machines can do limited processing...

• The final check digit of a CAS number is calculated by series term summation modulo 10.

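• Each digit before the check digit is weighted by its position counted from the right: the last digit times 1, the one before it times 2, the one before that times 3, and so on. The check digit is the weighted sum modulo 10.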

• The CAS number for water is 7732-18-5.

• The checksum 5 is calculated as (1x8 + 2x1 + 3x2 + 4x3 + 5x7 + 6x7) mod 10 = 105 mod 10 = 5.
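The calculation above can be written directly in Python. This is a minimal sketch: the function names are invented here, and the pattern is the regular expression quoted on the grammar slide, used as given.

```python
import re

# Regular expression as quoted in the CAS registry number grammar.
CAS_PATTERN = re.compile(r"(([1-9]\d{2,5})|([5-9]\d))-\d\d-\d", re.ASCII)


def cas_check_digit(body_digits):
    """Check digit for the digits preceding it (hyphens removed)."""
    weighted = sum(weight * int(digit)
                   for weight, digit in enumerate(reversed(body_digits), 1))
    return weighted % 10


def is_valid_cas(text):
    """True if text has the CAS shape and a consistent check digit."""
    if CAS_PATTERN.fullmatch(text) is None:
        return False
    digits = text.replace("-", "")
    return cas_check_digit(digits[:-1]) == int(digits[-1])


print(cas_check_digit("773218"))   # water, 7732-18-5
print(is_valid_cas("7732-18-5"))
print(is_valid_cas("7732-18-8"))
```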

FSM for Matching CAS Check Digits

[Figure: finite state machine for matching CAS check digits, with states numbered up to 90.]

CAS Number Correction Example

• 7732-18-8? Did you mean...

– 7732-18-5

– 7732-11-8

– 77328-18-8

– 7733-18-8

– 77342-18-8

– 77392-18-8

– 71732-18-8

– 76732-18-8

– 97732-18-8
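These suggestions can be reproduced by brute force: generate every string one insertion, deletion or substitution away from the input, and keep those that fit the CAS grammar and have a consistent check digit. The sketch below takes that approach for illustration only; CaffeineFix matches against a finite state machine instead, and the helper names here are hypothetical.

```python
import re

# Regular expression as quoted in the CAS registry number grammar.
CAS_PATTERN = re.compile(r"(([1-9]\d{2,5})|([5-9]\d))-\d\d-\d", re.ASCII)
ALPHABET = "0123456789-"


def is_valid_cas(text):
    """True if text has the CAS shape and a consistent check digit."""
    if CAS_PATTERN.fullmatch(text) is None:
        return False
    digits = text.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(w * int(d) for w, d in enumerate(reversed(body), 1))
    return total % 10 == check


def single_edits(text):
    """All strings exactly one insertion, deletion or substitution away."""
    out = set()
    for i in range(len(text) + 1):
        for ch in ALPHABET:
            out.add(text[:i] + ch + text[i:])          # insertion
        if i < len(text):
            out.add(text[:i] + text[i + 1:])           # deletion
            for ch in ALPHABET:
                out.add(text[:i] + ch + text[i + 1:])  # substitution
    out.discard(text)
    return out


def suggest(text):
    """Valid CAS numbers within one edit of text, sorted as strings."""
    return sorted(c for c in single_edits(text) if is_valid_cas(c))


for candidate in suggest("7732-18-8"):
    print(candidate)
```

For 7732-18-8 this enumeration yields the same nine candidates as listed above; the check digit is what prunes the many strings that merely have the right shape.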

Take Home Message

• Adding advanced automatic chemical spelling correction to an annotation pipeline typically improves recall by about 20-40%.

– Andrew Hinton, “Benchmarking ChemAxon’s Name-to-Structure batch tool on Patent Data”, 2011 ChemAxon EUGM, Budapest.

– Sorel Muresan, “Automated Spelling Correction to Improve Recall Rates of Name-to-Structure Tools for Chemical Text Mining”, 2011 ChemAxon EUGM.

Acknowledgements

• Daniel Lowe, NextMove Software.

• Sorel Muresan and Paul Hongxing Xie, AstraZeneca.

• Thank you for your time.

• Any questions?
