bhl hardware architecture - storage and clusters

24
Phil Cryer open source systems architect and accidental tourist January 2010 Beijing, China Tuesday, January 12, 2010

Upload: phil-cryer

Post on 04-Dec-2014

2.770 views

Category:

Technology


1 download

DESCRIPTION

The Biodiversity Heritage Library (BHL), like many other projects within biodiversity informatics, maintains terabytes of data that must be safeguarded against loss. Further, a scalable and resilient infrastructure is required to enable continuous data interoperability, as BHL provides unique services to its community of users. This volume of data and associated availability requirements present significant challenges to a distributed organization like BHL, not only in funding capital equipment purchases, but also in ongoing system administration and maintenance. A new standardized system is required to bring new opportunities to collaborate on distributed services and processing across what will be geographically dispersed nodes. Such services and processing include taxon name finding, indexes or GUID/LSID services, distributed text mining, names reconciliation and other computationally intensive tasks, or tasks with high availability requirements.

TRANSCRIPT

Page 1: BHL hardware architecture - storage and clusters

Phil Cryeropen source systems architect

and accidental touristJanuary 2010 Beijing, China

Tuesday, January 12, 2010

Page 2: BHL hardware architecture - storage and clusters

> BHL hardware architecture

• storage issues and solutions

• hardware risks and configuration

• cloud computing possibilities

• code and communication

Tuesday, January 12, 2010

Page 3: BHL hardware architecture - storage and clusters

> Storage issues

• more data being created and saved

• expanding SANs can be expensive

• storage has not kept up with Moore’s Law

• backups are usually for disaster recovery, not redundancy or failover

Tuesday, January 12, 2010

Page 4: BHL hardware architecture - storage and clusters

> Storage solutions: distributed storage

• write once, read anywhere

• replication and fault tolerance

• error correction

• automatic redundancy

• scalable horizontally

Tuesday, January 12, 2010

Page 5: BHL hardware architecture - storage and clusters

> Distributed storage - Options

• hosted storage (cloud, Amazon S3)

• self hosted with proprietary hardware and software (Sun Thumper)

• self hosted with commodity hardware and open source distributed filesystem (GlusterFS, Lustre)

Tuesday, January 12, 2010

Page 6: BHL hardware architecture - storage and clusters

> Distributed storage - GlusterFS

• GlusterFS: a cluster file-system capable of scaling to several petabytes*

• open source software on commodity hardware

• simple to install and manage

• very customizable

• offers seamless expansion

*(1,024 terabytes = 1 petabyte)

Tuesday, January 12, 2010

Page 7: BHL hardware architecture - storage and clusters

> Distributed storage - Archival 

• Fedora-commons is an open source repository (not a Linux distribution)

• accounts for all changes, so it provides built-in version control

• provides disaster recover

• open standards to mesh with future file formats

• provides open sharing services such as OAI-PMH

Tuesday, January 12, 2010

Page 8: BHL hardware architecture - storage and clusters

Tuesday, January 12, 2010

Page 9: BHL hardware architecture - storage and clusters

> Distributed storage - Mirrored data

• now we have redundancy

• in fact, multiple redundant copies

• provides fault tolerance

• offers load balancing

• gives us future geographical distribution

Tuesday, January 12, 2010

Page 10: BHL hardware architecture - storage and clusters

> Distributed processing

• more abilities available than just storing data

• with distributed storage comes distributed processing

• distributed processing means faster answers

• faster answers mean new questions • lather, rinse, repeat

Tuesday, January 12, 2010

Page 11: BHL hardware architecture - storage and clusters

> Distributed processing

• make existing data more useful

• image and OCR re-processing

• distributed web services

• identifier resolution pools

• map/reduce frameworks

• generate new visualizations, text mining, NLP and ???

Tuesday, January 12, 2010

Page 12: BHL hardware architecture - storage and clusters

TaxonFinder TaxonFinder TaxonFinder TaxonFinder

WebService WebService

Load Balancer

Cluster Node

Cluster

Site

Request

TaxonFinder TaxonFinder TaxonFinder TaxonFinder

WebService WebService

Load Balancer

Cluster Node

Cluster

Site

TaxonFinder TaxonFinder TaxonFinder TaxonFinder

WebService WebService

Load Balancer

Cluster Node

Cluster

Site

Tuesday, January 12, 2010

Page 13: BHL hardware architecture - storage and clusters

> Hardware configuration: risks

• computers crash

• hard drives die

• networks fail

• natural disasters occur

but...

Tuesday, January 12, 2010

Page 14: BHL hardware architecture - storage and clusters

...so plan for it.

Tuesday, January 12, 2010

Page 15: BHL hardware architecture - storage and clusters

> Hardware configurations

• six 5U sized servers hosted at MBL in Woods Hole

• 8 Gigs RAM in each server

• 24 / 1.5 TB drives in each server

• 100 TB storage overall

• fast connection, far greater bandwidth than ever before

• mirror sets (3 and 3) will be stored in different locations

but...

Tuesday, January 12, 2010

Page 16: BHL hardware architecture - storage and clusters

> Some assembly required (optional)

• while our example uses new, faster commodity hardware...

• it could run on any hardware that can run Linux

• you could chain together old "out dated" computers

• build your own cluster for next to nothing (host it in your basement)

• solves some infrastructure funding issues and provides a working proof of concept

• hardware vendor neutrality

Tuesday, January 12, 2010

Page 17: BHL hardware architecture - storage and clusters

> Our proof of concept

• we ran a six box cluster to demonstrate GlusterFS clustered filesystem

• ran Debian/GNU Linux

• simulated hardware failures

• synced data with a remote cluster (STL to MBL/WH)

• ran map/reduce jobs (distributed computing)

• defined procedures, configurations and build scripts

Tuesday, January 12, 2010

Page 18: BHL hardware architecture - storage and clusters

> Distributed storage - Projected costs

Graph from Backblaze (http://www.backblaze.com)

$246,000

Tuesday, January 12, 2010

Page 19: BHL hardware architecture - storage and clusters

> Cloud computing

• BHL is participating in a pilot for Duraspace with The New York Public Library

• Duraspace would provide a link to cloud providers

• pilot to show feasibility of hosting

• testing use of image server, other services in the cloud

• cloud could seed new clusters

Tuesday, January 12, 2010

Page 20: BHL hardware architecture - storage and clusters

> Code (63 6f 64 65)

• our code and configurations are open source, hosted on Google Code

People details

Starred (view starred projects)

Code license: New BSD License

Labels: bhl, linux, djatoka, imageviewer,java, asp.net, c-sharp

Links: Biodiversity Heritage Library

Blogs: BHL BlogBHL updates on Twitter

Feeds: Project feeds

Project owners:

phil.cryer, [email protected],[email protected],cfreeland27

[email protected] | My favorites | Profile | Sign out

bhl-bitsOpen Source code from the various BiodiversityHeritage Library (BHL) projects

Search projects

Project Home Downloads Wiki Issues Source Administer Summary | Updates | People

This is the repository for code from variousBiodiversity Heritage Library (BHL) projects, all ofwhich are released under various Open Sourcelicenses. Our current portal website runs on .NET,connects to a MSSQL backend database, andutilizes an Apache Tomcat webapp fronted imageviewer. Our new citation web site runs under Linux(Debian), Drupal, Apache, PHP and MySQL with helpfrom HAProxy, Solr/Lucene, OpenVZ, KVM and othersoftware. We'll be releasing code, as well as the allimportant configuration files for applications we useand support on our servers. We're open to any andall suggestions from the community, and of courseaccept any patches that improve our code. Weappreciate your interest in our projects.

Latest updates

Portal - portal is the .NET based code thatpowers our current portal site, BiodiversityHeritage Library (BHL)Citebank - our next generation, citation website build with Drupal, MySQL, PHP, etc,Citebank. Includes our all the Drupal files, withall currently configured themes and modules you can see on the production site/citej2k viewer - our customized jpeg2000 viewer used on BHL, Botanicus and Tropicos. Uses the adore-djatoka project with our own image viewer based on Mootools, all running under TomcatUtilities - scripts that help us gather and use data from other providers

You can browse all of our code here: http://code.google.com/p/bhl-bits/source/browse/#svn/trunk

For more information about the Biodiversity Heritage Library, visit: http://biodiversitylibrary.org/About.aspx orfor other contacts or requests for specific code/configurations not present, drop me us an email (phil dotcryer at mobot dot org)

Thanks

http://code.google.com/p/bhl-bits/

Tuesday, January 12, 2010

Page 21: BHL hardware architecture - storage and clusters

> Communication

• our Projects server is a portal for various projects and ideas

http://projects.biodiversitylibrary.org/

Latest projects

Data Hosting Centre Task Group (12/02/2009 02:24 PM)Whilst an unprecedented volume of biodiversity data iscurrently being generated worldwide, it is perceived thatsignificant amounts of data get lost or will be lost afterproject closure. To investigate the causes of such loss, andrecommend strategies as to how biodiversity data can berescued and archived, through ‘data hosting centres’ a TaskGroup on ‘Data Hosting Centres’ will be commissioned.

Consistant, updated Identity (09/21/2009 03:50 AM)An ongoing project to update BHL's identity, theme and logoacross BHL Portal and Citebank

Clustered storage (08/25/2009 05:28 AM)The project to build a reproducible clustered file-systemdistributed across multiple servers, utilizing GlusterFS, tohouse all of BHL Global content.

Website feedback (06/15/2009 11:11 AM)A project to accept feedback from users to cite.bhl.org

MOBOT djatoka implementation (04/13/2009 04:30 PM)

Ten major natural history museum libraries, botanical libraries, andresearch institutions have joined to form the Biodiversity HeritageLibrary Project. The group is developing a strategy and operationalplan to digitize the published literature of biodiversity held in theirrespective collections. This literature will be available through a global“biodiversity commons.”

The participating libraries have over two million volumes ofbiodiversity literature collected over 200 years to support the work ofscientists, researchers, and students in their home institutions andthroughout the world.

The BHL will provide basic,important content forimmediate research and formultiple bioinformaticsinitiatives. For the first timein history, the core of ournatural history and herbarialibrary collections will beavailable to a truly globalaudience. Web-basedaccess to these collections will provide a substantial benefit to peopleliving and working in the developing world -- whether scientists orpolicymakers. The Biodiversity Heritage Library Project strives toestablish a major corpus of digitized publications on the Web drawnfrom the historical biodiversity literature. This material will beavailable for open access and responsible use as a part of a globalBiodiversity Commons.

We will work with the global taxonomic community, rights holders,and other interested parties to ensure that this legacy literature isavailable to all. The Biodiversity Heritage Library Project must be amulti-institutional project because no single natural history museumor botanical garden library holds the complete corpus of legacyliterature, even within the individual sub-domains of taxonomy.However, taken together, the proposed consortium of collectionsrepresents a uniquely comprehensive assemblage of literature.

Latest newsCitebank: New url live The site's URL is now: http://cite.biodiversitylibrary.orgAdded by Phil Cryer 278 days ago

Backup Linux systems: Backup software on DEV (2 comments) Backup software installed successfully on DEVAdded by Phil Cryer 292 days ago

View all news

Home

Tuesday, January 12, 2010

Page 22: BHL hardware architecture - storage and clusters

> Communication

• each project has its own wiki, allowing for detailed, how-to instructions

http://projects.biodiversitylibrary.org/wiki/clusteredfilesystem/

INSTALL

the following is a continuously updated doc and howto for configuring a BHL cluster node on acurrent Debian GNU/Linux installation. A breakdown of times involved in the physical building ofnodes is available at the build times page

System configuration

1) upgrade system to Squeeze (Lenny uses kernel 2.26, which doesn't fully support ext4)and enable contrib and non-free repos

Copy the next 4 lines, and paste them in as 1:

echo "deb http://ftp.us.debian.org/debian/ squeeze main contrib non-freedeb-src http://ftp.us.debian.org/debian/ squeeze maindeb http://security.debian.org/ squeeze/updates maindeb-src http://security.debian.org/ squeeze/updates main" > /etc/apt/sources.list

and update the system

apt-get updateaptitude safe-upgrade

and now is a good a time as any to make sure the timezone data is all setup correctly (mine wasn't)

dpkg-reconfigure tzdata

2) reboot to use latest kernel (if needed)

/sbin/shutdown -r now

3) install needed utiltities and applications

apt-get install build-essential flex bison screen nginx vim-nox git-core subversion saidar

Install and configure GlusterFS

4) download latest GlusterFS, and associated MD5SUM file

wget http://ftp.gluster.com/pub/gluster/glusterfs/3.0/LATEST/MD5SUMwget http://ftp.gluster.com/pub/gluster/glusterfs/3.0/LATEST/glusterfs-3.0.0.tar.gz

Verify the GlusterFS tarfile

md5sum -c MD5SUM

you should see

glusterfs-3.0.0.tar.gz: OK

5) build GlusterFS

tar -zxf glusterfs-3.0.0.tar.gzcd glusterfs-3.0.0./configuremake && make installldconfigglusterfs --version | head -n1

you should get something like

glusterfs 3.0.0 built on Dec 29 2009 22:15:12

6) configure GlusterFS

For our example we'll be configuring GlusterFS to do a RAID1 (mirror) between two hosts, with IPsand 128.128.164.74 (local), 128.128.164.174 (remote) and both with data directories mounted to/mnt/data

glusterfs-volgen —name store1 —raid 1 128.128.164.74:/mnt/data 128.128.164.174:/mnt/data

Tuesday, January 12, 2010

Page 23: BHL hardware architecture - storage and clusters

> Code (63 6f 64 65)

• our code and configurations are open source and hosted on Google Code

• our projects server shares detailed instructions to get you get started

• get involved

• join our mailing-lists (bhl-tech, bhl-bits)

• ask us questions

• you can do it, and we will help

Tuesday, January 12, 2010