Getting Started with Regular Expressions in MarcEdit
Terry Reese
Head of Digital Initiatives, The Ohio State University
Topics
MarcEdit Regular Expression Support Information
Understanding .NET Regular Expressions
◦ Major components of the language
◦ Understanding grouping mechanisms and references
How does MarcEdit implement expressions
Getting Regular Expression Help
MarcEdit Regular Expression Support
Functions that presently support regular expressions
◦ Delete Field
◦ Edit Field
◦ Copy Field
◦ Swap Field
◦ Build New Field
◦ Extract/Delete Records
◦ Validation Processing
◦ Linked Data tooling
◦ More…
MarcEdit Regular Expression Support
When processing regular expressions, MarcEdit makes entire fields or subfields available to the expression
◦ e.g., in the Delete Field function, all data from =[field number] to the end of the line is part of the text that can be queried.
By default, MarcEdit’s regular expression processing deals with one field at a time (i.e., by default an expression cannot find data across fields)
MarcEdit’s Regular Expression Support is defined by Microsoft .NET’s Regular Expression object
◦ This object uses a syntax that looks Perl-like, but has some differences.
Microsoft’s Regular Expression language
Concepts:
◦ Character escapes
◦ Anchors
◦ Character classes
◦ Grouping
◦ Quantifiers
◦ Substitutions
MSDN Documentation: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
PDF Quick Reference: http://download.microsoft.com/download/D/2/4/D240EBF6-A9BA-4E4F-A63F-AEB6DA0B921C/Regular%20expressions%20quick%20reference.pdf
How we use Regular Expressions in MarcEdit
The most important parts of the regular expression language are:
1. Character escapes: \d \r \n \$ \x##
2. Character classes: [] and [^]
3. Grouping elements: ()
4. Anchors: ^ $
5. Quantifiers: * ? + {#}
6. Substitutions: $#
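To make the six components concrete, here is a minimal sketch that uses each of them once. It is written with Python's re module as a stand-in for the .NET engine MarcEdit uses; the sample 245 field is made up, and the two-spaces-after-the-tag layout is an assumption about MarcEdit's mnemonic format. The main syntax difference to keep in mind is that .NET writes substitutions as $1 while Python writes \1.

```python
import re

# Made-up sample field; layout assumed to be tag, two spaces, two indicators, data.
field = r"=245  10$aMap of Ohio /$cOhio Geological Survey."

# 1. Character escape: \$ matches a literal dollar sign (the subfield delimiter).
subfield_codes = re.findall(r"\$(.)", field)

# 2. Character class: [^$]* matches a run of characters that are not "$".
# 3. Grouping: the parentheses capture the text of subfield a.
subfield_a = re.search(r"\$a([^$]*)", field).group(1)

# 4. Anchor: ^ pins the match to the start of the line.
is_245 = re.search(r"^=245", field) is not None

# 5. Quantifier: {2} requires exactly two characters (here, the indicators).
indicators = re.search(r"^=245  (.{2})", field).group(1)

# 6. Substitution: .NET would write $1$3$2; Python's re writes \1\3\2.
#    This swaps the two indicator characters.
swapped = re.sub(r"^(=245  )(.)(.)", r"\1\3\2", field)

print(subfield_codes, subfield_a, is_245, indicators)
print(swapped)
```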
How Expressions Manifest in MarcEdit
Part of understanding regular expressions in MarcEdit is understanding what data is exposed to the regular expression engine. Each of MarcEdit’s global edit functions sees a different level of data.
This is important to understand when:
◦ Creating processing strategies
◦ Choosing which global editing function to use
Replace Function
Replace Function Provides:
Access to all field data
Data can be processed across fields (lines)
Can do preconditional sorting/evaluation before evaluating for replacement (can search for data in one field, and then perform an action on another if true)
Provides the most access to record data for evaluation
Add/Delete Function
Add/Delete Function Provides:
Access to all field data from the equal sign to the end of the line
No option to evaluate across fields
Only available when deleting data
Edit Field Data
Edit Field Data Function Provides:
Access to all data after the indicators (no access to the indicators or the field tag)
Can be used to break up fields into new fields and do recursive searching
Edit Subfield Data
Edit Subfield Data Provides:
Only provides access to the defined subfield or control data positions
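The differences between these functions can be pictured as different slices of the same line being handed to the expression. The sketch below is only an illustration of the slides' descriptions, not MarcEdit's actual implementation: visible_text is a hypothetical helper, the sample field is made up, and the offsets assume the mnemonic layout of tag, two spaces, two indicators, then data.

```python
import re

# Made-up sample field in the assumed mnemonic layout.
line = r"=245  10$aMap of Ohio /$cOhio Geological Survey."

def visible_text(function_name, line, subfield="a"):
    """Hypothetical helper: the part of a field each function exposes,
    following the descriptions on the slides."""
    if function_name in ("replace", "delete_field"):
        # Everything from the equal sign to the end of the line.
        # (The Replace function can additionally see the other lines.)
        return line
    if function_name == "edit_field":
        # Only the data after "=TTT" + two spaces + two indicators.
        return line[8:]
    if function_name == "edit_subfield":
        # Only the text of the named subfield.
        match = re.search(r"\$" + re.escape(subfield) + r"([^$]*)", line)
        return match.group(1) if match else ""
    raise ValueError(f"unknown function: {function_name}")

for name in ("replace", "delete_field", "edit_field", "edit_subfield"):
    print(f"{name:14} {visible_text(name, line)}")
```

An expression that starts with =245 can therefore only ever match in the first two cases, which is why the choice of function matters before the expression is written.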
Regular Expression Basics
I like to think of regular expressions the same way as I think of diagraming a sentence.
http://www.english-grammar-revolution.com/images/puzzler_words_october_2012.jpg
Regular Expression Basics
I try to look at the data I want to replace and break it into its component parts. For example, suppose I want to add a period to the end of the 500 field if it is missing.
Source Fields:
=500  \\$aPrime meridians: Greenwich and Washington
=500  \\$aPrime meridians: Greenwich and Washington?
Structure:
Expression: (=500.*[^\W])$
Examples
Looking at example.txt using the Replace function:
◦ Add a period to the 500 if it is missing.
◦ Add a $h of cartographic resources between the $a and $c.
◦ Split the 856 into two fields, breaking on the $u.
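Example 1
◦ Add a period to the 500 if it is missing
◦ Find What: (=500.*[^\W])$
◦ Replace With: $1.
Explanation:
◦ (=500.*[^\W])$
◦ Searches for the 500, then matches all data in the line up to the final character. The final character must match [^\W] ("not a non-word character"), so the expression matches only when the line ends in a letter, digit, or underscore, that is, when no closing punctuation is present. The replacement $1. writes the captured line back, followed by a period.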
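The behavior of this expression can be checked outside MarcEdit. The following is a minimal sketch using Python's re module as a stand-in for the .NET engine; the two input lines are the source fields from the earlier slide, written with two spaces after the tag (an assumption about the mnemonic layout). Python writes the substitution as \1 where .NET writes $1.

```python
import re

lines = [
    r"=500  \\$aPrime meridians: Greenwich and Washington",
    r"=500  \\$aPrime meridians: Greenwich and Washington?",
]

# Same Find What expression as on the slide.
pattern = re.compile(r"(=500.*[^\W])$")

# .NET replacement "$1." becomes r"\1." in Python.
fixed = [pattern.sub(r"\1.", line) for line in lines]

for line in fixed:
    print(line)
```

The first line ends in a word character and gains a period. The second ends in "?", so the expression does not match and the line is left untouched.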
Example 2
◦ Add a $h of cartographic resources between the $a and $c.
Find What: (=245.{4})(\$a.*)(/\$c.*)
◦ (=245.{4}): Match the 245 field, with any value in the next 4 characters being valid.
◦ (\$a.*): Select everything within the subfield a
◦ (/\$c.*): Select the / value and the subfield c (and other data)
Replace With: $1$2$$h[cartographic resource] $3
◦ In the replacement, $$ produces a literal $.
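As a rough check of how the three groups divide the field, here is a sketch in Python's re module (a stand-in for the .NET engine). The 245 field is a made-up sample, and the third group is written as (/\$c.*), following the breakdown above. In Python the groups are referenced as \1, \2, \3, and a literal $ in the replacement needs no doubling.

```python
import re

# Made-up sample title field; layout assumed to be tag, two spaces, indicators, data.
field = r"=245  10$aMap of Ohio /$cOhio Geological Survey."

pattern = r"(=245.{4})(\$a.*)(/\$c.*)"

# Show what each group captured.
groups = re.search(pattern, field).groups()

# .NET: $1$2$$h[cartographic resource] $3
result = re.sub(pattern, r"\1\2$h[cartographic resource] \3", field)

print(groups)
print(result)
```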
Example 3
◦ Split the 856 into two fields, breaking on the $u.
◦ Find What: (=856.{4})(\$u.*[^$])(\$u.*)
◦ (=856.{4}): Matches the 856 field
◦ (\$u.*[^$]): Match the $u, but stop at the end of the subfield
◦ (\$u.*): Match the remainder of the field
◦ Replace With: $1$2\n=856  41$3
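The split can be tried out with the sketch below, again using Python's re module in place of the .NET engine. The 856 field and its example.org URLs are invented for illustration, and the new field is written with two spaces after the tag on the assumption that this is what the mnemonic format expects.

```python
import re

# Made-up field holding two $u subfields.
field = r"=856  41$uhttp://example.org/a$uhttp://example.org/b"

pattern = r"(=856.{4})(\$u.*[^$])(\$u.*)"

# .NET: $1$2\n=856  41$3  (the \n starts a new field line)
result = re.sub(pattern, r"\1\2\n=856  41\3", field)

new_fields = result.split("\n")
for line in new_fields:
    print(line)
```

Because .* is greedy, the second group runs up to the last $u in the field. With more than two $u subfields, only the final one would be moved to the new field.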
Lcase/ucase
MarcEdit’s regular expression engine includes two extension functions for switching the case of characters.
◦ lcase & ucase
◦ Usage: Find What: (=450.{4})(\$a.)(.*)
◦ Replace With: $1$2lcase($3)
◦ Example: Find the 500 with all upper case characters and convert the case of all values but the first letter in the sentence to lower case.
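lcase and ucase are MarcEdit extensions, so other regular expression engines have no direct equivalent. The sketch below only imitates the effect of $1$2lcase($3) in Python by passing a function as the replacement; the sample field is made up, and this is not how MarcEdit implements the feature.

```python
import re

# Made-up sample field, all upper case.
field = r"=450  \\$aPRIME MERIDIANS"

pattern = r"(=450.{4})(\$a.)(.*)"

def lower_third_group(match):
    # Keep groups 1 and 2 as they are; lower-case group 3,
    # imitating MarcEdit's $1$2lcase($3).
    return match.group(1) + match.group(2) + match.group(3).lower()

result = re.sub(pattern, lower_third_group, field)
print(result)
```

The second group, (\$a.), holds the subfield code plus one character, which is what keeps the first letter of the sentence in upper case.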
Multi-Field Replacements
By default, MarcEdit handles one field at a time when doing regular expressions.
◦ However, when you need to evaluate multiple fields, you can do so by adding /m to the end of your replacement in the Replace function in the MarcEditor
◦ This is a special function added to the MarcEdit regular expression engine
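Why a cross-field expression needs special handling can be seen in a small sketch. It does not reproduce MarcEdit's /m switch; it only contrasts, in Python, handing the expression one field at a time with handing it the whole record. The record and the pattern are made up for illustration.

```python
import re

# Made-up two-field record, one field per line.
record = "\n".join([
    r"=245  10$aMap of Ohio",
    r"=500  \\$aScale not given",
])

# A pattern that spans the line break between a 245 and the 500 after it.
pattern = r"(=245.*)\n(=500.*)"

# One field at a time: no single line contains a line break, so nothing matches.
per_field_hits = [ln for ln in record.split("\n") if re.search(pattern, ln)]

# Whole record at once: the pattern can cross from one field into the next.
whole_record_hit = re.search(pattern, record)

print(per_field_hits)
print(whole_record_hit.group(2))
```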
Delete Field Function
The Delete Field function exposes all the data in the field to be acted upon as a regular expression.
◦ e.g., =856 .*
◦ So the first value in the Delete Field evaluation is an =, not the subfield data
◦ The reason for this is to allow explicit evaluation of indicators.
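Because the evaluated text starts at the equal sign, an expression can target specific indicator values. The sketch below imitates that in Python with made-up fields: it drops only the 856 fields whose indicators are 42. The filtering loop is an illustration of the idea, not MarcEdit's own code, and the two spaces after the tag are an assumption about the mnemonic layout.

```python
import re

# Made-up fields; only the second has indicators "42".
fields = [
    r"=856  41$uhttp://example.org/a",
    r"=856  42$uhttp://example.org/b",
    r"=500  \\$aNote",
]

# The expression is evaluated from the "=" onward, so the tag and
# indicators can be matched explicitly.
pattern = re.compile(r"^=856  42")

kept = [f for f in fields if not pattern.search(f)]

for f in kept:
    print(f)
```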
Getting Regular Expression Help
The MarcEdit Listserv has a number of regular expression experts who provide a lot of help to users looking for it
http://metis3.gmu.edu/cgi-bin/wa?A0=MARCEDIT-L
Questions