OCLC Online Computer Library Center
Parallel Text Searching on a Beowulf Cluster using SRW
Ralph LeVan
OCLC Research
Goal
Demonstrate 100 searches/second on our 50 million record WorldCat database residing on a small Beowulf cluster
Beowulf Cluster
24 nodes
– Two 2.8 GHz Xeon CPUs
– 4 GB of memory
80 GB of disk on 23 application nodes
130 GB of disk on root node
Database
50 million records
69 partitions (~700,000 records each)
– 3 partitions per application node
Partitioned by popularity
Searched using OCLC Research’s Open Source Gwen and Pears toolkits
Architecture
1 Tomcat on each application node
3 SRW/U databases configured for each Tomcat
1 client application on the root node
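The layout above (23 application nodes, 3 SRW/U databases each) gives the client 69 search targets. A minimal sketch of how those targets might be enumerated as SRU searchRetrieve URLs follows. The host names, port, URL path, and partition names are hypothetical and not taken from the slides; only the standard SRU 1.1 parameter names are real.

```python
from urllib.parse import urlencode

APP_NODES = 23      # application nodes (from the slides)
DBS_PER_NODE = 3    # SRW/U databases per Tomcat (from the slides)


def sru_url(node: int, db: int, query: str) -> str:
    """Build a searchRetrieve URL for one partition.

    The host, port, and path naming here are assumptions for illustration.
    """
    params = urlencode({
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": query,
        "maximumRecords": 0,  # assumption: ask for the hit count only
    })
    return f"http://node{node:02d}:8080/SRW/search/part{db}?{params}"


def all_endpoints(query: str) -> list[str]:
    """One URL per partition: 23 nodes x 3 databases = 69 targets."""
    return [
        sru_url(node, db, query)
        for node in range(1, APP_NODES + 1)
        for db in range(1, DBS_PER_NODE + 1)
    ]
```

With this layout a single user query fans out into 69 requests from the root node, which is the workload the trials below measure.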
Trial #1
SRW client searching 69 databases
Result:
2 searches/second (437ms/search)
Ganglia Cluster Report shows the root node glowing red and the application nodes a peaceful blue
Trial #2
SRU client with scanned response searching 69 databases
Result:
25 searches/second (40ms/search)
Ganglia Cluster Report still shows the root node glowing red and the application nodes a peaceful blue
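The slides do not say what "scanned response" means. One plausible reading, assumed here, is that the client scans the response text for the hit count instead of building a full XML document. A sketch of that idea, using the standard SRU `numberOfRecords` element and a made-up sample response:

```python
import re

# Matches <numberOfRecords> with or without a namespace prefix such as "srw:".
_HITS = re.compile(
    r"<(?:\w+:)?numberOfRecords>\s*(\d+)\s*</(?:\w+:)?numberOfRecords>"
)


def scan_hit_count(response_text: str) -> int:
    """Pull the hit count out of an SRU response without parsing the XML."""
    match = _HITS.search(response_text)
    if match is None:
        raise ValueError("no numberOfRecords element found")
    return int(match.group(1))


# Illustrative response body, not captured from the real system.
sample = (
    '<srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/">'
    "<srw:version>1.1</srw:version>"
    "<srw:numberOfRecords>1234</srw:numberOfRecords>"
    "</srw:searchRetrieveResponse>"
)
```

A scan like this skips the cost of a general-purpose parser on the client, at the price of being tied to the exact shape of the response.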
Trial #3
SRW client with hand-built XML and scanned response searching 69 databases
Result:
21 searches/second (46ms/search)
Ganglia Cluster Report still shows the root node glowing red and the application nodes a peaceful blue
SRW dropped
Rearchitecture
Problem: Ganglia reports indicate that the client is the bottleneck
Solution: Put a 3-way federator on each Tomcat (a virtual database for the client) and have the client search 23 databases instead of 69
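The federator idea can be sketched as a function that presents several partitions as one database. This is an illustration only: the slides do not describe how results are merged, so summing hit counts is an assumption, and the stand-in partitions below are hypothetical.

```python
from typing import Callable, Sequence

# A search function maps a query string to a hit count for one database.
SearchFn = Callable[[str], int]


def make_federator(backends: Sequence[SearchFn]) -> SearchFn:
    """Wrap several partition searches so they look like one database."""
    def federated_search(query: str) -> int:
        # Assumption: the merged result is the sum of the partition hit counts.
        return sum(search(query) for search in backends)
    return federated_search


def fake_partition(hits_by_query: dict) -> SearchFn:
    """Stand-in for a real partition, returning canned hit counts."""
    return lambda query: hits_by_query.get(query, 0)


# One application node: a 3-way federator over its three local partitions.
node_federator = make_federator([
    fake_partition({"dog": 10}),
    fake_partition({"dog": 5, "cat": 2}),
    fake_partition({"cat": 7}),
])
```

The point of the design is that the merge work moves from the root node to the application nodes, so the client handles 23 responses per search instead of 69.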
Result
SRU client: 71 searches/second (14 ms)
Hand-built SRW client: 33 searches/second (30 ms)
Original SRW client: 6 searches/second (164 ms)
Ganglia cluster report still shows root node red, but application nodes are now green and yellow
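The figures reported so far are consistent with a client that issues one search at a time, where throughput is simply the reciprocal of the per-search time. The slides do not state that the client was serial; that is an inference from the arithmetic below.

```python
def serial_throughput(latency_ms: float) -> float:
    """Searches per second for a client that runs one search at a time."""
    return 1000.0 / latency_ms


# (per-search time in ms, reported searches/second), as given on the slides.
reported = {
    "Trial #1, SRW client, 69 databases": (437, 2),
    "Trial #2, SRU client, 69 databases": (40, 25),
    "Trial #3, hand-built SRW client, 69 databases": (46, 21),
    "SRU client, 23 databases": (14, 71),
    "Hand-built SRW client, 23 databases": (30, 33),
    "Original SRW client, 23 databases": (164, 6),
}

for label, (latency_ms, rate) in reported.items():
    print(f"{label}: {serial_throughput(latency_ms):.1f}/s (reported {rate}/s)")
```

Under that reading, a serial client can only get faster by cutting the time per search, which is why the remaining step is to add concurrency.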
Rearchitecture
Create a virtual 23-way database on each Tomcat that will federate searches from the 23 virtual 3-way databases
Put one of these on each Tomcat
Create a new client that sends searches on threads to each available 23-way database
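A threaded client of this kind might look like the sketch below. The round-robin assignment of queries to endpoints, the endpoint names, and the `fake_search` stand-in are all assumptions; the slides say only that searches are sent on threads to each available 23-way database.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def run_client(queries, endpoints, search, threads=23):
    """Send each query to the next endpoint in round-robin order.

    `search(endpoint, query)` is supplied by the caller; up to `threads`
    searches are in flight at once. Results come back in query order.
    """
    targets = zip(queries, cycle(endpoints))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(search, endpoint, query)
                   for query, endpoint in targets]
        return [future.result() for future in futures]


def fake_search(endpoint, query):
    """Stand-in for a real SRU request to one 23-way federator."""
    return (endpoint, len(query))


# Hypothetical names for the 23 application nodes.
endpoints = [f"node{n:02d}" for n in range(1, 24)]
results = run_client(["dog", "cat", "bird"] * 10, endpoints, fake_search)
```

Because every node can answer for the whole database, the client no longer merges anything; it only has to keep all 23 nodes busy.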
Result
With 23 threads, 172 searches/second
– Average response time of 170 ms
The Ganglia report showed all nodes running red