![Page 1: Enhancing the Performance and Extensibility of the XC MetadataServicesToolkit Ben Anderson, Software Engineer, XCO](https://reader030.vdocument.in/reader030/viewer/2022032803/56649e315503460f94b220da/html5/thumbnails/1.jpg)
Enhancing the Performance and Extensibility of the XC Metadata Services Toolkit
Ben Anderson, Software Engineer, XCO
Download this presentation: www.extensiblecatalog.org/learnmore
Timeline
• 2/10: Jennifer Bowen presented at code4lib
• 3/10: I began at XCO
• 4/10: work began on 0.3
• 1/11: 0.3 released
• 0.2 released
• 1.0 released
XC Software Components
• XC Drupal Toolkit: user interface for searching and browsing, on the library website (on Drupal)
• XC Metadata Services Toolkit: tools for automated processing of large batches of metadata
• XC NCIP Toolkit: tools for connectivity between XC and an ILS (circ. status/requests, authentication)
• Sources: Integrated Library System (via XC OAI Toolkit) and Repository (irplus)
• Record counts shown: MARCXML (6M records), DC-TERMS (13k records)
Learn more about XC at www.extensiblecatalog.org
One Example of Process Flow
• MARC BIB record from external repository
• Normalized MARC BIB record from normalization service
• FRBRized records from transformation service
  – work
  – expression
  – manifestation
MST Logical Process
OAI-PMH Harvest → provider caches → MARC Normalization Service → repo → MARC-XC Transformation Service → repo → OAI-PMH Harvestable
(Records move between the services inside the MST by pseudo OAI-PMH harvests.)
Add an External Repository
Schedule a Harvest
Configure Processing Rules
Browse Records
Goals for 0.3
• Each service should process one million records per hour on an “average library server”
  – 1.5 GHz SPARC V9
  – 8 GB RAM (3 GB for the JVM)
  – 10k RPM hard drive
• Services should show little to no degradation as the size of a repository grows
  – University of Rochester has 6M records
• Implementing a service should be easy
  – it should require no knowledge of MST internals
  – the service implementer should not have to figure out how to build and package their service
Determine Throughput of 0.2
• Using the MARC Normalization service as our metric, the first million records were processed at an average speed of:
  – 29 ms/record ≈ 124k/hr (goal is 3.6 ms/record = 1M/hr)
• Before the service had processed 2 million records, the process crawled to a halt (goal was little to no degradation up to at least 6 million records).
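The conversion between per-record cost and hourly throughput used here is plain arithmetic. A minimal sketch (the class and method names are illustrative, not part of the MST):

```java
final class Throughput {
    private static final double MS_PER_HOUR = 3_600_000.0;

    /** Records processed per hour at a given average cost per record. */
    static double recordsPerHour(double msPerRecord) {
        return MS_PER_HOUR / msPerRecord;
    }

    /** Average per-record budget (in ms) needed to reach a target hourly rate. */
    static double msPerRecord(double recordsPerHour) {
        return MS_PER_HOUR / recordsPerHour;
    }

    public static void main(String[] args) {
        System.out.printf("29 ms/record  -> %.0f records/hr%n", recordsPerHour(29.0));
        System.out.printf("1M records/hr -> %.1f ms/record%n", msPerRecord(1_000_000.0));
    }
}
```

At 29 ms per record the rate works out to roughly 124k records per hour, and one million records per hour leaves a budget of 3.6 ms per record.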
Determine Bottlenecks with TimingLogger
This code produces this output. [Code and output were shown as images on the original slide.]
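The kind of instrumentation this slide refers to can be sketched as a small accumulator of named timings. This is a hypothetical stand-in, assuming a start/stop style; the real MST TimingLogger API may look different.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a timing logger; not the MST's actual class. */
final class SimpleTimingLogger {
    private static final class Stat { long totalNanos; long calls; long startedAt; }

    private final Map<String, Stat> stats = new LinkedHashMap<>();

    /** Marks the start of a named section. */
    void start(String name) {
        stats.computeIfAbsent(name, k -> new Stat()).startedAt = System.nanoTime();
    }

    /** Marks the end of a named section and accumulates its elapsed time. */
    void stop(String name) {
        long now = System.nanoTime();
        Stat s = stats.get(name);
        if (s == null) throw new IllegalStateException("stop() without start(): " + name);
        s.totalNanos += now - s.startedAt;
        s.calls++;
    }

    long calls(String name) {
        Stat s = stats.get(name);
        return s == null ? 0 : s.calls;
    }

    double avgMillis(String name) {
        Stat s = stats.get(name);
        return (s == null || s.calls == 0) ? 0.0 : s.totalNanos / 1e6 / s.calls;
    }

    /** One line per section: name, number of calls, average ms per call. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (String name : stats.keySet()) {
            sb.append(String.format("%-12s calls=%d avg=%.3f ms%n", name, calls(name), avgMillis(name)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SimpleTimingLogger log = new SimpleTimingLogger();
        for (int i = 0; i < 1000; i++) {
            log.start("createDom");
            // ... build the DOM for record i ...
            log.stop("createDom");
            log.start("process");
            // ... run the service logic ...
            log.stop("process");
        }
        System.out.print(log.report());
    }
}
```

The sketch is single-threaded on purpose; a shared logger would need synchronization or per-thread state.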
Bottleneck Breakdown
• 29 ms per record
  – 2.5 ms to create the DOM
  – 5 ms for actual service processing (the innards of the MARC Normalization service)
  – 21 ms for querying Solr and inserting
• This is the average; both querying and inserting are done in batch.
• I had a hard time separating the two.
0.2 Design
MySQL
• All data needed for the UI
  – except for searching and browsing records
• All data needed for configuring harvests, services, processing rules, etc.
Solr
• Text indexes necessary for searching and browsing records
• All record/repository data
0.3 Design Change to use MySQL
MySQL
• All data needed for the UI
  – except for searching and browsing records
• All data needed for configuring harvests, services, processing rules, etc.
• All record/repository data
Solr
• Doesn’t store any data
• Used only for indexing records to support searching in the UI
0.3 Design – Keep the table sizes small
• One index for all repositories
• Each external repository cache and each service gets its own set of database tables
  – external provider repo
  – normalization repo
  – transformation repo
0.3 Design - Yes, a boring ERD
Tables:
• record_updates (record_id, update_date)
• records_xml (record_id, xml)
• record_sets
• record_predecessors (record_id, pred_record_id)
Relationship labels on the diagram: one per record; one or more per record; zero or more per record.
Did that improve things?
• 11 ms per record (previously 29)
  – 2.5 ms to create the DOM
  – 5 ms for actual service processing (the innards of the MARC Normalization service)
  – 3.5 ms (previously 21) for querying MySQL and inserting into MySQL
    • again, both querying and inserting are done in batch
    • the query time is almost nil; it’s the inserting that takes time
• It’s faster, but still nearly 3x slower than our goal
• The performance showed little to no degradation
Get rid of XPath
XPath isn’t a bad technology, but when you’re optimizing for performance, it can be beneficial to find other ways to accomplish the same task. So, I changed this code…
to this code…
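As an illustration of the kind of change described, here is a minimal sketch that reads one MARC subfield two ways: with an XPath expression and with a hand-written walk over the DOM. The record layout is simplified (no MARCXML namespace) and the class and method names are assumptions, not the MST's actual code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class SubfieldLookup {
    /** XPath version: concise, but the expression is compiled and evaluated on every call. */
    static String viaXPath(Document doc, String tag, String code) {
        String expr = "/record/datafield[@tag='" + tag + "']/subfield[@code='" + code + "']";
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Hand-written traversal: walk the child elements and compare attributes directly. */
    static String viaDom(Document doc, String tag, String code) {
        Element record = doc.getDocumentElement();
        for (Node df = record.getFirstChild(); df != null; df = df.getNextSibling()) {
            if (!isElement(df, "datafield") || !tag.equals(((Element) df).getAttribute("tag"))) {
                continue;
            }
            for (Node sf = df.getFirstChild(); sf != null; sf = sf.getNextSibling()) {
                if (isElement(sf, "subfield") && code.equals(((Element) sf).getAttribute("code"))) {
                    return sf.getTextContent();
                }
            }
        }
        return ""; // same result XPath gives when nothing matches
    }

    private static boolean isElement(Node n, String name) {
        return n.getNodeType() == Node.ELEMENT_NODE && name.equals(n.getNodeName());
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<record><datafield tag=\"245\">"
                + "<subfield code=\"a\">Moby Dick</subfield></datafield></record>");
        System.out.println(viaXPath(doc, "245", "a"));
        System.out.println(viaDom(doc, "245", "a"));
    }
}
```

Both methods return the same value; the traversal simply avoids the XPath engine on the hot path.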
Did that improve things?
• 7 ms per record (previously 11)
  – 2.5 ms to create the DOM
  – 1.0 ms (previously 5) for actual service processing (the innards of the MARC Normalization service)
  – 3.5 ms for MySQL inserts
• It’s faster, but still nearly 2x slower than our goal
Delayed Indexing in MySQL
• MySQL modifies table indexes with each insert.
• It is faster to drop the indexes, insert lots of rows into the tables, and then add the indexes back.
  – This is the way mysqldump works
  – This means you can’t read the data while doing an insert. No big deal: we’ll just do it during large loads.
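The drop, load, re-add sequence can be sketched over JDBC as follows. The table, index, and column names passed in are placeholders, and the helper is an illustration of the idea rather than the MST's implementation; identifiers are concatenated into the SQL, so they must come from trusted code.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

final class DelayedIndexing {
    static String dropIndexSql(String table, String index) {
        return "ALTER TABLE " + table + " DROP INDEX " + index;
    }

    static String addIndexSql(String table, String index, String column) {
        return "ALTER TABLE " + table + " ADD INDEX " + index + " (" + column + ")";
    }

    /** Callback that performs the actual inserts. */
    interface BulkLoad {
        void run(Connection conn) throws SQLException;
    }

    /** Drops one index, runs the load, then rebuilds the index even if the load fails. */
    static void loadWithoutIndex(Connection conn, String table, String index,
                                 String column, BulkLoad load) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(dropIndexSql(table, index));
            try {
                load.run(conn);
            } finally {
                st.execute(addIndexSql(table, index, column));
            }
        }
    }
}
```

Rebuilding the index once at the end replaces millions of incremental index updates with a single pass over the table.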
Did that improve things?
• 6 ms per record (previously 7)
  – 2.5 ms to create the DOM
  – 1.0 ms for actual service processing (the innards of the MARC Normalization service)
  – 2.2 ms (previously 3.5) for MySQL inserts
• It’s faster, but still nearly 2x slower than our goal
Batch Prepared Statements
Java/JDBC provides a highly performant way to send large chunks of data to the db at once: batch prepared statements.
There’s no way to speed this part up… or so I thought…
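For reference, a batch prepared statement insert in JDBC looks roughly like this. The table and column names are borrowed from the ERD slide; the batching policy and class name are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

final class BatchInsert {
    // Table and column names follow the ERD slide; the real MST statements may differ.
    static final String INSERT_SQL = "INSERT INTO records_xml (record_id, xml) VALUES (?, ?)";

    /** Queues every row on one prepared statement and sends them in batches. Returns rows queued. */
    static int insertAll(Connection conn, Map<Long, String> xmlById, int batchSize) throws SQLException {
        int queued = 0;
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (Map.Entry<Long, String> e : xmlById.entrySet()) {
                ps.setLong(1, e.getKey());
                ps.setString(2, e.getValue());
                ps.addBatch();
                if (++queued % batchSize == 0) {
                    ps.executeBatch(); // send a full batch
                }
            }
            if (queued % batchSize != 0) {
                ps.executeBatch(); // flush the remainder
            }
        }
        return queued;
    }
}
```

With MySQL Connector/J, the `rewriteBatchedStatements=true` connection property lets the driver rewrite such batches into multi-row inserts, which is usually where most of the gain comes from.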
LOAD DATA INFILE
When discussing db optimizations with XC’s Drupal Toolkit developer, Peter Kiraly, he said PHP didn’t have the same ability. Instead he’d have to write out a csv file and load that in. I figured I might as well try it.
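The file-based approach can be sketched in two steps: write the rows to a tab-separated file, then ask MySQL to load it. The class, method, and table names below are illustrative; the escaping targets MySQL's default LOAD DATA text format (tab-separated fields, newline-terminated lines, backslash escapes).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

final class LoadDataInfile {
    /** Escapes characters that would otherwise be read as field or line separators. */
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\t", "\\t")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    /** Writes one "record_id<TAB>xml" line per record. */
    static void writeRows(Path file, Map<Long, String> xmlById) {
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (Map.Entry<Long, String> e : xmlById.entrySet()) {
                w.write(e.getKey() + "\t" + escape(e.getValue()) + "\n");
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Builds the statement to execute over JDBC; the file must be readable by the MySQL server. */
    static String loadSql(Path file, String table) {
        String path = file.toAbsolutePath().toString().replace('\\', '/').replace("'", "\\'");
        return "LOAD DATA INFILE '" + path + "' INTO TABLE " + table
                + " CHARACTER SET utf8 (record_id, xml)";
    }
}
```

The generated statement would be run with an ordinary `Statement.execute(...)` once the file has been written.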
Did that improve things?
• 4 ms per record (previously 6)
  – 2.5 ms to create the DOM
  – 1.0 ms for actual service processing (the innards of the MARC Normalization service)
  – 0.6 ms (previously 2.2) for MySQL inserts
• Pretty close, but still not there
Sometimes it’s the little things
DomFactoryBuilderDOAServiceFactoryFactoryImpl
I knew enough not to create the DocumentBuilderFactory each time, but didn’t realize that creating the DocumentBuilder each time would have that much of an effect.
[Before and after code shown as images on the original slide: “Code was” and “Code is now”.]
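A minimal sketch of the before and after, assuming the change amounted to reusing a DocumentBuilder instead of creating one per record. The class and method names are illustrative, and the per-thread holder is an added assumption, since DocumentBuilder is not thread-safe.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DomParser {
    // Created once: obtaining the factory involves a provider lookup.
    private static final DocumentBuilderFactory FACTORY = DocumentBuilderFactory.newInstance();

    // One builder per thread, because DocumentBuilder is not thread-safe.
    private static final ThreadLocal<DocumentBuilder> BUILDER = ThreadLocal.withInitial(() -> {
        try {
            return FACTORY.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    });

    /** Slow variant: a new DocumentBuilder for every record. */
    static Document parseWithNewBuilder(String xml) {
        try {
            return FACTORY.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Faster variant: reuse this thread's DocumentBuilder for every record. */
    static Document parseWithSharedBuilder(String xml) {
        try {
            return BUILDER.get().parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Both variants produce the same Document; only the per-record setup cost differs.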
Did that improve things?
• 3 ms per record (previously 4)
  – 0.9 ms (previously 2.5) to create the DOM
  – 1.0 ms for actual service processing (the innards of the MARC Normalization service)
  – 0.6 ms for MySQL inserts
• WE DID IT! We have exceeded our goal!
0.2 Service Development
Internals of the MST were exposed to the service developer and the developer was expected to re-implement much of this internal code.
code.google.com/p/xcmetadataservicestoolkit/
0.3.x Service Development
• Install Java, Ant, MySQL
$ wget 'http://xcmetadataservicestoolkit.googlecode.com/files/example-0.3.0-dev-env.zip'
$ unzip example-0.3.0-dev-env.zip
$ cd example
$ ant retrieve
$ ant -Dtest=ProcessFiles test
$ ls -ladh ./build/test/actual_output_records/1/*
$ ant zip
Input Files for Testing
$ ls -1 ./test/input_records/1/* | xargs cat
<records xmlns="http://www.openarchives.org/OAI/2.0/">
  <record>
    <header>
      <identifier>oai:mst.rochester.edu:bib:1</identifier>
    </header>
    <metadata>
      <foo xmlns="foo:bar">pb&j</foo>
    </metadata>
  </record>
  <record>
    <header>
      <identifier>oai:mst.rochester.edu:bib:1</identifier>
    </header>
    <metadata>
      <foo xmlns="foo:bar">pb&j 2</foo>
    </metadata>
  </record>
</records>
...
Output Files from Testing
$ ls -1 ./build/test/actual_output_records/1/* | xargs cat
<records xmlns="http://www.openarchives.org/OAI/2.0/">
<record>
  <header status="replaced">
    <identifier>oai:mst.rochester.edu:example/1</identifier>
    <datestamp />
    <predecessors>
      <predecessor>oai:mst.rochester.edu:bib:1</predecessor>
    </predecessors>
  </header>
  <metadata>
    <foo xmlns="foo:bar">
      pb&j
      <bar>you've been foobarred!</bar>
    </foo>
  </metadata>
</record>
Implementing in Code
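The transformation implied by the input and output files above (append a `<bar>` child to each `<foo>` payload) can be sketched without any MST classes. The class below is a hypothetical stand-in: the real service base class and method signatures are not reproduced here, and the names `FooBarService`, `process`, and `parse` are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for an example service; not the MST's actual API. */
final class FooBarService {
    static final String FOO_NS = "foo:bar";

    /** Takes the <foo> payload of one input record and returns the payload for the output record. */
    static Element process(Element foo) {
        Document doc = foo.getOwnerDocument();
        Element bar = doc.createElementNS(FOO_NS, "bar");
        bar.setTextContent("you've been foobarred!");
        foo.appendChild(bar);
        return foo;
    }

    /** Parses a small XML payload into its root element (namespace-aware). */
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element out = process(parse("<foo xmlns=\"foo:bar\">pb&amp;j</foo>"));
        System.out.println(out.getTextContent());
    }
}
```

In the toolkit itself, the framework would be responsible for reading input records, assigning output identifiers, and recording predecessors; the service author only supplies the per-record logic.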
More tidbits for interested implementers
• The MST is now configured via Spring
  – each service is given its own application context as well as its own classloader
    • This means it can use all the objects and services from the MST without worrying about collisions (naming or dependencies) with other services
• Each service is given its own db schema (again, so you don’t have to worry about name collisions). The db schema is prefixed with “xc_”
Other Services
• MARC-XC-Transformation
  – Just as fast as the MARC Normalization service
• DC-XC-Transformation
  – Initially contributed by Kyushu University (in Japan); now one of our core services
Photo Credits
• All photos taken from flickr.com
  – “Brick Wall” by somenametoforget
  – “Snail” by DRB62
  – “Paris Train” by Pictr 30D
  – “Spaghetti with tomato sauce” by HatM
  – “Hawk in Flight” by Nick Chill
  – “Tortoise” by GraphicReality
Final Numbers
Measured on a 1.5 GHz CPU.
0.2
• 125k records / hr (29 ms / record)
• fell down before 2M records processed
• not easily extensible
0.3
• 1.2M records / hr (3.0 ms / record)
• processed 16M records with no degradation
• easily extensible