Arctos at the University of Alaska Museum Insect Collection Derek Sikes 1 Gordon Jarrell 2 Dusty McDonald 1 1 University of Alaska Museum Fairbanks, AK 2 Museum of Southwestern Biology, NM Alaska Entomological Society 5 th Annual Meeting, Anchorage, AK


Arctos at the University of Alaska Museum Insect Collection

Derek Sikes 1, Gordon Jarrell 2, Dusty McDonald 1

1 University of Alaska Museum, Fairbanks, AK

2 Museum of Southwestern Biology, NM

Alaska Entomological Society, 5th Annual Meeting, Anchorage, AK, 27-28 Jan 2012


Major repositories using the Arctos database: (43 collections of specimens or observations, 1.4M records)


in partnership with

which is a member of

TeraGrid – A nationwide network of 11 supercomputing facilities

which is sponsored by

U.S. National Science Foundation’s Office of Cyberinfrastructure


Arctos: A 15-year history

MVZ: 1995 - Hired Stan Blum to develop relational data model (following modeling by Assoc. Systematic Collections).

MVZ: 1997 - Hired John Wieczorek to implement model (desktop application) using Sybase and Versata. Partial implementation (e.g., no loans).

UAM: 1998-2000 - John W. migrated mammal data to Oracle, set up Versata.

UAM: 2002 - Dusty McDonald replaced Versata with ColdFusion, implemented full model (first web-based instance, aka Arctos).

MSB: 2003 – Joined Arctos at UAM (first multi-hosting instance).

MVZ and MCZ: 2005-2007 - Implemented separate instances of Arctos at Berkeley and Harvard (MVZ: first Postgres, then Oracle).

MVZ: 2009 - Moved hosting of data to Alaska (Virtual Private Database version).


ARCTOS

• Specimens (objects) - body parts, tissues, containers, etc.

• Images, media (stored at TACC)

• Projects, permits, publications

• Accessions, loans, usage

• Labels, as PDF files

• Agents, agent activity

Arctos

Specimen Catalog: label data (and more)

Projects: contribute and/or use specimens

Accessions, loans, usage

Publications: cite specimens

GenBank

Federated portals

BerkeleyMapper

“Media” in TeraGrid

The rest of Cyberspace

Citations


BerkeleyMapper & Google Maps, with error circles


Breadth of Data in Arctos

- Fish, amphibians, reptiles, mammals, birds and bird eggs/nests, plants, arthropods, fossils, molluscs AND their parasites

- Specimens and observations

- Media (images, audio, video)

- Publications, fieldnotes

Arctos is constantly evolving to incorporate new kinds of data, e.g.:

- Better representation of non-publication documents (fieldnotes, correspondence)

- Cultural collections (art, anthropology...)

Nearly all that is known about an object (or observation) can be included in Arctos.


Linking specimen records to archival documentation…


1) What is the primary user audience? - large/ small museum management? taxonomic research? is a dedicated IT / programmer required? Single vs multi-user? (annual cost?)

2) GBIF - does the database provide data to GBIF?

3) Barcoding - does the database handle batch processing of specimens using barcodes? ( 'speed / ease of use')

4) Georeferencing - does it conform to the recommended 'best practices' guide published by GBIF?

5) What is the ease / difficulty of web setup?

6) Security - can a data entry technician accidentally delete or change (corrupt) large amounts of data? Is/are the database server(s) protected from disaster (eg floods, fires)?

7) Likes / dislikes & pros/cons

ECN Session – Arthropod Collections Databases


1a) What is the primary user audience?

Museums / collections data management (also: observations, Federal collections [USFWS], large private collections associated with a public institution)

1b) is a dedicated IT / programmer required?

Yes, but the IT staff are shared among all participants.

1c) Single vs multi-user?

Multi-user without practical limits.

1d) Annual cost?

Negotiated per institution based on size and maintenance needs, currently ranging from $1,300 to $27,000.



2) GBIF - does the database provide data to GBIF?

Arctos does this automagically every minute.

3) Barcoding - does the database handle batch processing of specimens using barcodes? ( 'speed / ease of use')

Arctos attaches barcodes to “parts.” This lets you track things like tissues, extractions, slides and pinned bodies of each cataloged specimen separately.



4) Georeferencing - does it conform to the recommended 'best practices' guide published by GBIF?

Arctos fully supports georeferencing "best practices," in part because the authors of that document and of Arctos' spatial data structure are one and the same. (John Wieczorek)
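As a rough illustration of the "error circles" idea behind point-radius georeferencing, the sketch below stores a locality as a coordinate plus an uncertainty radius and tests whether another point falls inside that circle. The class and field names are hypothetical and are not Arctos' actual spatial data structure; distances use the standard haversine formula on a spherical Earth.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation


@dataclass
class Georeference:
    # Hypothetical record layout for illustration, not the Arctos schema.
    lat: float          # decimal degrees
    lon: float          # decimal degrees
    max_error_m: float  # radius of the error circle, in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_error_circle(geo, lat, lon):
    """True if (lat, lon) lies inside the georeference's error circle."""
    return haversine_m(geo.lat, geo.lon, lat, lon) <= geo.max_error_m
```

A record with a 5 km radius centered on Fairbanks would contain a point about 1 km away but not a point in Anchorage.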

5) What is the ease / difficulty of web setup?

Acquire password. Enter data. (Arctos is only available via the web).



Preservation of specimens and their associated data in perpetuity

NSF will help us get our data online, but ensuring they stay online forever is a problem that hasn’t been solved.



33,090 specimens

28 institutions / private collections

736 images

4,516 bibliographic images

428 users


DMNS Arachnology Data

In-house -> NSD -> Crash -> KE EMu


Database errors...


Cabinets: antiquated, wooden, damaged = unsafe


Database: home-made, weak security, mine alone, not online = unsafe

Arctos



6) Security - can a data entry technician accidentally delete or change (corrupt) large amounts of data?

No – Data entry technicians enter data into a staging area

Data must be vetted before being loaded by someone with more access privileges

All non-select transactions are audited. We can (theoretically) roll back to any point in history, or roll any user's updates back to any point in history. We can re-create all actions by all users.



6) Security - Is/are the database server(s) protected from disaster (eg floods, fires)?

Yes – running a RAID array

Backups:

- continuous logs to a remote NAS

- local drives

- Texas Advanced Computing Center

- San Diego Supercomputer Center

“If we lose all the nightly backups (3 tectonic plates), I'm betting nobody will be overly worried about Arctos data. Or breathing.” – D. McDonald



7) Likes / dislikes & pros/cons

DISLIKES:

- Learning curve fairly steep -> back to kindergarten

- Can’t customize to my heart’s content, each change must be voted on & prioritized by other users

- Web access generally slower than I like (we are all more critical of others than ourselves)

- Only available when networked. Field work in remote areas requires special solutions if data are to be accessed.

- User interface is ~ garish, clunky, industrial (but works)



7) Likes / dislikes & pros/cons

LIKES:

- Rock-solid security, the data will outlive me

- Web-published

- Cutting-edge web integration (mapping, GenBank, etc)

- No responsibility on my part to maintain backups, software updates, etc. Need only a networked computer

- Arctos programmers & designers are biologists / users who really care about “doing it right”



6) Security - can a data entry technician accidentally delete or change (corrupt) large amounts of data?


There are multiple roles and partitions at various levels. A data entry technician has write access to exactly one table, the bulkloader. Additionally, one VPD limits his access to his own collection, another limits access to his own rows, and yet another prevents him from marking records to load. In short, he can only un-do anything he's done, and then only in a "staging area" separate from "real" data.
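The layered restrictions described above can be sketched in plain Python. This is only an in-memory stand-in under stated assumptions: the role names, class names, and methods are hypothetical, and in Arctos the equivalent rules are enforced by the database itself rather than by application code like this.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    collection: str
    role: str  # "technician" or "curator" (hypothetical role names)


@dataclass
class StagedRow:
    collection: str
    entered_by: str
    data: dict
    marked_to_load: bool = False


class StagingTable:
    """In-memory stand-in for a staging ("bulkloader") table."""

    def __init__(self):
        self._rows = {}
        self._next_id = 1

    def insert(self, user, data):
        row_id = self._next_id
        self._next_id += 1
        self._rows[row_id] = StagedRow(user.collection, user.name, dict(data))
        return row_id

    def _check_row_access(self, user, row):
        # Policy 1: users only reach rows from their own collection.
        if row.collection != user.collection:
            raise PermissionError("row belongs to another collection")
        # Policy 2: technicians only reach rows they entered themselves.
        if user.role == "technician" and row.entered_by != user.name:
            raise PermissionError("technicians may only touch their own rows")

    def update(self, user, row_id, **fields):
        row = self._rows[row_id]
        self._check_row_access(user, row)
        row.data.update(fields)

    def mark_to_load(self, user, row_id):
        row = self._rows[row_id]
        self._check_row_access(user, row)
        # Policy 3: technicians cannot flag records for loading.
        if user.role != "curator":
            raise PermissionError("only a curator may mark records to load")
        row.marked_to_load = True

    def is_marked(self, row_id):
        return self._rows[row_id].marked_to_load
```

Each check is independent, so a technician who passes the collection and ownership checks is still stopped at the final step of marking a record to load.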

A similar model is used throughout Arctos. We control access at the table and row level, and can easily implement finer-grained control if such becomes necessary. Users (theoretically) get only the rights that they need, and have demonstrated an understanding of, to the data they need, all the while having full access to shared data (like agents).

Data like agents and taxonomy - things where character strings rather than data concepts matter to collections - are trigger-protected based on usage. You can't update an agent name after it's been used as an author, for example. This is pretty basic referential integrity, and Arctos is the only thing that has it.
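A usage-based trigger of this kind can be sketched with Python's built-in sqlite3 module. Arctos runs on Oracle, so this is an analogy rather than its actual trigger code, and the table and column names below are invented for the example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a trigger guarding used agent names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE agent (
            agent_id   INTEGER PRIMARY KEY,
            agent_name TEXT NOT NULL
        );
        CREATE TABLE publication_author (
            publication_id INTEGER NOT NULL,
            agent_id       INTEGER NOT NULL REFERENCES agent(agent_id)
        );
        -- Block renaming an agent once the name has been used as an author.
        CREATE TRIGGER protect_used_agent_name
        BEFORE UPDATE OF agent_name ON agent
        FOR EACH ROW
        BEGIN
            SELECT RAISE(ABORT, 'agent name already used as an author')
            WHERE EXISTS (
                SELECT 1 FROM publication_author WHERE agent_id = OLD.agent_id
            );
        END;
        """
    )
    return conn


def rename_agent(conn, agent_id, new_name):
    """Return True if the rename succeeded, False if the database refused it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "UPDATE agent SET agent_name = ? WHERE agent_id = ?",
                (new_name, agent_id),
            )
        return True
    except sqlite3.DatabaseError:
        return False
```

Because the rule lives in the database, any client that issues the UPDATE is refused in the same way, whichever form or tool sent it.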

Data and user rules are all handled by the RDBMS, so we can plug in forms written by other people/projects, offer SQL command-line access, webservices, etc., without worrying too much about security or referential integrity. (Specify, for example, cannot safely support such access as all data and access rules live in the application layer.)

All non-select transactions are audited. We can (theoretically) roll back to any point in history, or roll any user's updates back to any point in history. We can re-create all actions by all users.
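To make the audit idea concrete, here is a minimal sketch in which every write is appended to a log and past states are rebuilt by replaying that log. It is a toy model with hypothetical names, not a description of how Oracle auditing or Arctos actually implements rollback, and it ignores deletes and dependencies between users' edits.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AuditEntry:
    seq: int    # position in the log, starting at 1
    user: str
    key: str
    value: Any


class AuditedStore:
    """Key-value store that records every write in an append-only log."""

    def __init__(self):
        self.data = {}
        self.log = []

    def write(self, user, key, value):
        self.log.append(AuditEntry(len(self.log) + 1, user, key, value))
        self.data[key] = value

    def state_at(self, seq):
        """Rebuild the store as it was after audit entry number `seq`."""
        state = {}
        for entry in self.log:
            if entry.seq > seq:
                break
            state[entry.key] = entry.value
        return state

    def state_without_user(self, user, since_seq=0):
        """Rebuild the store while skipping one user's writes made after `since_seq`."""
        state = {}
        for entry in self.log:
            if entry.user == user and entry.seq > since_seq:
                continue
            state[entry.key] = entry.value
        return state
```

Since the log is never modified, both views are derived without losing the record of who did what.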

In addition to ColdFusion's Application Security, we take full advantage of Oracle security - a breach of one just leads to another layer. Oracle handles things like secondary user access and brute-force password crack attempts. An independent semi-intelligent (and slightly paranoid) security wrapper watches for malicious behavior and blocks IP access if it detects anything anomalous.


6) Security - Is/are the database server(s) protected from disaster (eg floods, fires)?


The server is running a RAID array - we can lose a disk or two and not lose any data (or stop working). Rollback logs are continuously written to a remote NAS (Network Attached Storage) system. Daily backups are stored on the local drives, on the NAS, and on tape in GVEA's "bunker." (They won't tell us what or where that is, but your electric bill and medical records are in there and it makes the Department of Homeland Security happy.) Daily backups are also copied to the Texas Advanced Computing Center at Austin (one copy on disk and another on tape) and to tape at the San Diego Supercomputer Center. We may have another copy going to massively redundant disk at the National Center for Supercomputing Applications (University of Illinois at Urbana-Champaign) by the time you get to Reno.

We can recover to the point of failure, or at least to within a couple minutes of it, with one copy of the most recent daily backup and one copy of the rollback logs. (Depending on recent activity, we can usually actually recover from a week-or-so old daily + the rollbacks.) We'll lose <24H of data if we lose all the rollbacks - the server and the NAS. Those are in two buildings, both with serious security, separated by about a hundred yards of gravel parking lot. If we lose all the nightly backups (3 tectonic plates), I'm betting nobody will be overly worried about Arctos data. Or breathing.

There are a couple dozen probes per day - I think it's fairly safe to say that Arctos security has been tested. (Actual attacks are now kind of hard to detect due to the aforementioned paranoid IP killer, which generally shuts them off at the first probe, but we used to get one per week or so.) A big DDoS attack would easily take us down, but (1) we're too boring to attract such a thing, and (2) so what? - those things just eat servers, not data.


6) Security - Is/are the database server(s) protected from disaster (eg floods, fires)?


We've lost a few disks over the years, but never lost data or had a server go down due to it. (We've had lots of downtime, just not equipment-related.) Our biggest threat is probably a disgruntled employee with too much access and a long-term plan, but we could probably (with expensive consultant help) even recover from that, and there's no lack of tools to detect such behavior.

That might all be a little overkill - I'd settle for daily backups on 2 major tectonic plates if absolutely necessary - but I certainly think that you have an obligation to do more than install [database X] on some junker computer and maybe buy a tape drive when you take public money to create or curate digital data.

[database X] may be free, but supporting it takes a real commitment in hardware, infrastructure, and expertise that most Universities are poorly equipped to make. I don't know of a single large project that hasn't at some point lost digital data.

- Dusty McDonald, Arctos programmer


Lessons Learned

1) Proprietary software is generally a bad idea unless you have guaranteed, sustained budget for staff and upgrades.

2) Back-ups cannot merely be performed/scripted with the assumption that the job is done.

3) Back-ups should NOT be incremental, MUST be stored offsite, and MUST include separate images of operating system and databases.

4) Restoration from bare metal must be fully documented and periodically performed to verify that the process DOES work.

5) Source code must be in a distributed public repository like GitHub.

- D. Shorthouse
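Lessons 2 and 4 amount to "a backup is not done until a restore has been tested." A minimal sketch of that check for a single file follows; the function names and directory layout are assumptions for illustration, and a real bare-metal test would restore the operating system and database images, not just compare one file.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup(source, backup_dir):
    """Copy `source` into `backup_dir` and return the path of the copy."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    destination = backup_dir / Path(source).name
    shutil.copy2(source, destination)
    return destination


def verify_by_restore(source, backup_copy):
    """Restore the backup into a scratch directory and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / Path(backup_copy).name
        shutil.copy2(backup_copy, restored)
        return sha256_of(restored) == sha256_of(source)
```

Running `verify_by_restore` on a schedule, rather than only when the backup script is written, is what catches a copy that has silently gone bad.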


University of Connecticut Bird Collection data were found... on a single floppy

2031 records in a flat file


University of Connecticut Bird Collection data were found... and made available on-line


But... Something with the server setup is not stable.