Getting Started with Regular Expressions in MarcEdit TERRY REESE HEAD OF DIGITAL INITIATIVES, THE OHIO STATE UNIVERSITY

Upload: terry-reese

Post on 22-Feb-2017


Page 1: Getting Started with Regular Expressions In MarcEdit

Getting Started with Regular Expressions in MarcEdit
TERRY REESE
HEAD OF DIGITAL INITIATIVES, THE OHIO STATE UNIVERSITY

Page 2: Getting Started with Regular Expressions In MarcEdit

Topics

MarcEdit Regular Expression Support Information

Understanding .NET Regular Expressions
◦ Major components of the language
◦ Understanding grouping mechanisms and references

How Does MarcEdit implement expressions

Getting Regular Expression Help

Page 3: Getting Started with Regular Expressions In MarcEdit

MarcEdit Regular Expression Support

Functions that presently support regular expressions:
◦ Delete Field
◦ Edit Field
◦ Copy Field
◦ Swap Field
◦ Build New Field
◦ Extract/Delete Records
◦ Validation Processing
◦ Linked Data tooling
◦ More…

Page 4: Getting Started with Regular Expressions In MarcEdit

MarcEdit Regular Expression Support

When processing regular expressions with MarcEdit, MarcEdit makes entire fields or subfields available for processing

◦ i.e., when processing a delete field function – all data from =[field number] are part of the field that can be queried.

MarcEdit’s regular expression by default deals with one field at a time (i.e., regular expressions do not allow you to find data across fields by default)

MarcEdit’s Regular Expression Support is defined by Microsoft .NET’s Regular Expression object

◦ This object uses a syntax that looks Perl-like, but has some differences.

Page 5: Getting Started with Regular Expressions In MarcEdit

Microsoft’s Regular Expression language

Concepts:
◦ Character escapes
◦ Anchors
◦ Character classes
◦ Grouping
◦ Quantifiers
◦ Substitutions

MSDN Documentation: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

PDF Quick Reference: http://download.microsoft.com/download/D/2/4/D240EBF6-A9BA-4E4F-A63F-AEB6DA0B921C/Regular%20expressions%20quick%20reference.pdf

Page 6: Getting Started with Regular Expressions In MarcEdit

How we use Regular Expressions in MarcEdit

The most important parts of the regular expression language are:
1. Character escapes: \d, \r, \n, \$, \x##
2. Character classes: [] & [^]
3. Grouping elements: ()
4. Anchors: ^ and $
5. Quantifiers: *, ?, +, {#}
6. Substitutions: $#
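As a rough illustration of these six building blocks, here is a minimal sketch using Python's `re` module as a stand-in for the .NET engine that MarcEdit uses. The field content is hypothetical. One difference to keep in mind: Python writes group references in the replacement as `\1`, where MarcEdit (.NET) writes `$1`.

```python
import re

# Hypothetical field in MarcEdit's mnemonic format.
field = r"=260  \\$aNew York :$bViking,$c1999."

# 1. Character escapes: \$ is a literal dollar sign, \d is a digit.
# 5. Quantifiers: {4} means exactly four of the preceding token.
year = re.search(r"\$c(\d{4})", field).group(1)

# 2. Character classes: [^$] is any character except a dollar sign,
#    so [^$]* runs to the end of the subfield.
place = re.search(r"\$a([^$]*)", field).group(1)

# 4. Anchors: ^ pins the match to the start of the field, $ to the end.
is_260 = re.search(r"^=260", field) is not None
ends_with_period = re.search(r"\.$", field) is not None

# 3. Grouping and 6. Substitutions: each (...) is numbered and can be
#    reused in the replacement (\1, \2 here; $1, $2 in MarcEdit).
bracketed = re.sub(r"(\$c)(\d{4})", r"\1[\2]", field)

print(year, place, bracketed, sep="\n")
```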

Page 7: Getting Started with Regular Expressions In MarcEdit

How Expressions Manifest in MarcEdit

Part of understanding regular expressions in MarcEdit is understanding what data is exposed to the regular expression engine. Each of MarcEdit’s global edit functions sees a different level of data.

This is important to understand when:
◦ Creating processing strategies
◦ Choosing which global editing function to use

Page 8: Getting Started with Regular Expressions In MarcEdit

Replace Function

Page 9: Getting Started with Regular Expressions In MarcEdit

Replace Function Provides:
◦ Access to all field data
◦ Can be processed across fields (lines)
◦ Can do preconditional sorting/evaluation before evaluating for replacement (can search for data in one field, and then perform an action on another if true)
◦ Provides the most access to record data for evaluation
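The "search in one field, act on another" idea can be sketched outside MarcEdit as follows. This is only an illustration in Python, not how MarcEdit implements it: the helper `replace_if`, the sample records, and the 655 condition are all hypothetical.

```python
import re

def replace_if(record: str, condition: str, find: str, repl: str) -> str:
    """Apply find/repl to each field, but only if some field matches condition."""
    lines = record.split("\n")
    if not any(re.search(condition, line) for line in lines):
        return record  # precondition failed: leave the record untouched
    return "\n".join(re.sub(find, repl, line) for line in lines)

# Hypothetical records in mnemonic format, one field per line.
map_record = "\n".join([
    r"=500  \\$aScale not given",
    r"=655  \7$aMaps.$2lcgft",
])
book_record = "\n".join([
    r"=500  \\$aIncludes index",
    r"=655  \7$aFiction.$2lcgft",
])

# Add a period to the 500, but only in records whose 655 says "Maps".
condition = r"^=655.*\$aMaps"
find, repl = r"(=500.*[^\W])$", r"\1."

updated_map = replace_if(map_record, condition, find, repl)
updated_book = replace_if(book_record, condition, find, repl)
print(updated_map)
print(updated_book)
```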

Page 10: Getting Started with Regular Expressions In MarcEdit

Add/Delete Function

Page 11: Getting Started with Regular Expressions In MarcEdit

Add/Delete Function Provides:
◦ Access to all field data from the equal sign to end of line
◦ No option to evaluate across fields
◦ Only available when deleting data

Page 12: Getting Started with Regular Expressions In MarcEdit

Edit Field Data

Page 13: Getting Started with Regular Expressions In MarcEdit

Edit Field Data Function Provides:

Access to all data after the indicators (the field tag and indicators are not available)

Can be used to break up fields into new fields and do recursive searching

Page 14: Getting Started with Regular Expressions In MarcEdit

Edit Subfield Data

Page 15: Getting Started with Regular Expressions In MarcEdit

Edit Subfield Data Provides:

Access only to the defined subfield or control data positions
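To make these exposure levels concrete, the sketch below slices one mnemonic-format line into the strings each function would roughly see. It is written in Python purely for illustration. The helper `exposed_views` is hypothetical, and it assumes a variable data field laid out as `=TAG`, two spaces, two indicators, then subfields. Whether the subfield code itself is included in what Edit Subfield Data exposes is also an assumption here.

```python
def exposed_views(line: str, subfield: str) -> dict:
    """Approximate what each global edit function exposes for one field."""
    after_indicators = line[8:]  # skip '=TAG', two spaces, two indicators
    marker = "$" + subfield
    start = after_indicators.find(marker)
    if start == -1:
        subfield_data = None
    else:
        end = after_indicators.find("$", start + 1)
        stop = end if end != -1 else None
        subfield_data = after_indicators[start + len(marker):stop]
    return {
        "replace_or_delete": line,          # whole field, starting at '='
        "edit_field": after_indicators,     # data after the indicators
        "edit_subfield": subfield_data,     # one subfield only
    }

# Hypothetical field.
line = r"=245  10$aAtlas of the world /$cJohn Smith."
views = exposed_views(line, "a")
for name, value in views.items():
    print(f"{name}: {value}")
```

A pattern such as `^=245` can therefore only succeed in the first view, while a pattern anchored on `^\$a` only makes sense in the second.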

Page 16: Getting Started with Regular Expressions In MarcEdit

Regular Expression Basics

I like to think of regular expressions the same way I think of diagramming a sentence.

http://www.english-grammar-revolution.com/images/puzzler_words_october_2012.jpg

Page 17: Getting Started with Regular Expressions In MarcEdit

Regular Expression Basics

I try to look at the data I want to replace and break it into its component parts. For example, suppose I want to add a period to the 500 if it is missing.

Source Fields:

=500  \\$aPrime meridians: Greenwich and Washington

=500  \\$aPrime meridians: Greenwich and Washington?

Structure:

Expression: (=500.*[^\W])$

Page 18: Getting Started with Regular Expressions In MarcEdit

Examples

Looking at example.txt using the Replace function:

◦ Add a period to the 500 if it is missing.

◦ Add a $h of cartographic resources between the $a and $c.

◦ Split the 856 into two fields, breaking on the $u.

Page 19: Getting Started with Regular Expressions In MarcEdit

Example 1
◦ Add a period to the 500 if it is missing
◦ Find What: (=500.*[^\W])$
◦ Replace With: $1.

Explanation:
◦ (=500.*[^\W])$
◦ Searches for the 500, then matches all data in the line up to the final character. It then checks that the final character is a word character ([^\W] is a double negative, equivalent to \w), so fields that already end in punctuation are left alone.
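The behavior of this expression can be checked outside MarcEdit. The following is a minimal sketch in Python, whose `re` module handles this particular pattern the same way. The third sample field is hypothetical, and Python writes the replacement as `\1.` where MarcEdit uses `$1.`.

```python
import re

fields = [
    r"=500  \\$aPrime meridians: Greenwich and Washington",
    r"=500  \\$aPrime meridians: Greenwich and Washington?",
    r"=500  \\$aIncludes index.",  # hypothetical extra field
]

# [^\W] only matches a word character, so the pattern fails (and the field
# is left unchanged) when the field already ends in punctuation.
pattern = re.compile(r"(=500.*[^\W])$")
fixed = [pattern.sub(r"\1.", field) for field in fields]

for field in fixed:
    print(field)
```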

Page 20: Getting Started with Regular Expressions In MarcEdit

Example 2
◦ Add a $h of cartographic resources between the $a and $c.

Find What: (=245.{4})(\$a.*)(/\$c.*)
◦ (=245.{4}): Match the 245 field, accepting any value in the next 4 characters (the spacing and indicators).
◦ (\$a.*): Select everything within the subfield a.
◦ (/\$c.*): Select the / value and the subfield c (and other data).

Replace With: $1$2$$h[cartographic resource] $3
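A quick way to see what each group captures is to run the expression on a sample field. The sketch below uses Python as a stand-in and the `(/\$c.*)` form of the third group given in the explanation. The title is hypothetical. In Python a literal dollar sign in the replacement needs no escaping, whereas the .NET syntax used by MarcEdit requires `$$`.

```python
import re

# Hypothetical 245 field in mnemonic format.
field = r"=245  10$aMap of Ohio /$cOhio Dept. of Transportation."

pattern = re.compile(r"(=245.{4})(\$a.*)(/\$c.*)")
match = pattern.search(field)
tag_and_indicators, subfield_a, rest = match.groups()

# Equivalent of MarcEdit's "$1$2$$h[cartographic resource] $3".
result = pattern.sub(r"\1\2$h[cartographic resource] \3", field)
print(result)
```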

Page 21: Getting Started with Regular Expressions In MarcEdit

Example 3
◦ Split the 856 into two fields, breaking on the $u.

◦ Find What: (=856.{4})(\$u.*[^$])(\$u.*)
◦ (=856.{4}): Matches the 856 field
◦ (\$u.*[^$]): Match the first $u, stopping at the end of that subfield
◦ (\$u.*): Match the remainder of the field
◦ Replace With: $1$2\n=856  41$3
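The split can be traced with a short sketch, again using Python as a stand-in for the .NET engine. The URLs are hypothetical, and the replacement is written in Python's `\1` style rather than MarcEdit's `$1`.

```python
import re

# Hypothetical 856 carrying two $u subfields.
field = "=856  41$uhttp://example.org/one$uhttp://example.org/two"

pattern = re.compile(r"(=856.{4})(\$u.*[^$])(\$u.*)")

# Group 1 keeps the tag and indicators, group 2 the first $u, and
# group 3 the second $u, which moves to a new 856 line.
result = pattern.sub(r"\1\2\n=856  41\3", field)
new_fields = result.split("\n")
for line in new_fields:
    print(line)
```

Because `.*` is greedy, a field with three or more $u subfields would be split only at the last one by a single pass of this pattern.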

Page 22: Getting Started with Regular Expressions In MarcEdit

Lcase/ucase

MarcEdit’s regular expression engine includes two extension functions for switching the case of characters.

◦ lcase & ucase

◦ Usage: (=450.{4})(\$a.)(.*)
◦ Replace With: $1$2lcase($3)

◦ Example: Find the 500 with all upper case characters and convert the case of all values but the first letter in the sentence to lower case.
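The `lcase`/`ucase` functions are MarcEdit extensions and do not exist in other regex engines. To show the intended effect only, the sketch below imitates `$1$2lcase($3)` in Python with a replacement function. The sample note is hypothetical, and the pattern uses the 500 tag from the example description.

```python
import re

# Hypothetical all-caps note.
field = r"=500  \\$aINCLUDES BIBLIOGRAPHICAL REFERENCES."

pattern = re.compile(r"(=500.{4})(\$a.)(.*)")

def lower_third_group(match: re.Match) -> str:
    # Keep groups 1 and 2 as-is (tag, indicators, $a and the first letter);
    # lower-case everything captured by group 3.
    return match.group(1) + match.group(2) + match.group(3).lower()

result = pattern.sub(lower_third_group, field)
print(result)
```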

Page 23: Getting Started with Regular Expressions In MarcEdit

Multi-Field Replacements

By default, MarcEdit handles one field at a time when evaluating regular expressions.

◦ However, when you need to evaluate against multiple fields, you can do so by adding /m to the end of your replacement in the Replace function in the MarcEditor.

◦ This is a special function added to the MarcEdit regular expression engine.
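The difference between field-at-a-time and multi-field evaluation can be illustrated with a pattern that spans a line break. The Python sketch below makes no claim about how MarcEdit implements /m internally. It only contrasts running the same expression against each field separately and against the whole record. The record is hypothetical.

```python
import re

# Hypothetical record with two consecutive 500 notes.
record = "\n".join([
    r"=245  10$aMap of Ohio.",
    r"=500  \\$aIncludes index.",
    r"=500  \\$aScale not given.",
])

# Merge a 500 with the 500 that follows it: the pattern contains \n,
# so it can only match when it sees more than one field.
find = r"(=500.*)\n=500.{4}\$a(.*)"
repl = r"\1 \2"

per_field = "\n".join(re.sub(find, repl, line) for line in record.split("\n"))
whole_record = re.sub(find, repl, record)

print(per_field == record)  # field-at-a-time finds nothing to change
print(whole_record)
```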

Page 24: Getting Started with Regular Expressions In MarcEdit

Delete Field Function

The Delete Field function exposes all the data in the field to be acted upon by a regular expression.

◦ i.e., =856 .*
◦ So the first value in the Delete Field evaluation is an =, not the subfield data.

◦ The reason for this is to allow explicit evaluation of indicators.
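Because the evaluated string starts at the equal sign, a pattern can test the indicators directly. The sketch below imitates that in Python with hypothetical 856 fields, dropping only those whose indicators are 42.

```python
import re

# Hypothetical 856 fields that differ only in their second indicator.
fields = [
    "=856  40$uhttp://example.org/a",
    "=856  41$uhttp://example.org/b",
    "=856  42$uhttp://example.org/c",
]

# The pattern starts at '=', so tag, spacing and indicators are all testable.
delete_pattern = re.compile(r"^=856  42")
kept = [field for field in fields if not delete_pattern.search(field)]

# A pattern that expects the data to start at the subfield never matches,
# because the first character is '=' rather than '$'.
starts_with_subfield = re.search(r"^\$u", fields[0]) is not None

print(kept)
print(starts_with_subfield)
```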

Page 25: Getting Started with Regular Expressions In MarcEdit

Getting Regular Expression Help

The MarcEdit Listserv has a number of regular expression experts who provide a lot of help to users looking for it.

http://metis3.gmu.edu/cgi-bin/wa?A0=MARCEDIT-L

Page 26: Getting Started with Regular Expressions In MarcEdit

Questions