Edanz Journal Selector, a Prototype Based on Solr/Nutch/Hadoop: presented by Liang Shen, European...
Edanz Journal Selector: a Prototype Based on Solr/Nutch/Hadoop
Liang SHEN
@shenzhuxi
• Web Developer, European Bioinformatics Institute
• Drupal/Solr
Edanz Journal Selector (2011)
So many journals!
DEMO
Open Access
• By National Center for Biotechnology Information, U.S. National Library of Medicine
• Approximately 26,000 records are included in the PubMed journal lists
Feeds
Journal TOCs
• 21,498 journals from 1,677 publishers
• Institute for Computer Based Learning
• Heriot-Watt University
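A minimal sketch of how a table-of-contents feed could be turned into records for later indexing. It assumes a plain RSS 2.0 layout and uses a made-up sample feed; the function name `parse_toc_feed` and the output field names are hypothetical, and real feeds may use a different format or extra namespaces.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed for illustration only; not real journal data.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Journal</title>
    <item>
      <title>First example article</title>
      <link>http://example.org/articles/1</link>
      <description>Abstract text of the first article.</description>
    </item>
    <item>
      <title>Second example article</title>
      <link>http://example.org/articles/2</link>
      <description>Abstract text of the second article.</description>
    </item>
  </channel>
</rss>"""


def parse_toc_feed(xml_text):
    """Return one dict per <item>, tagged with the channel (journal) title."""
    channel = ET.fromstring(xml_text).find("channel")
    journal = channel.findtext("title", default="")
    return [
        {
            "journal": journal,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "abstract": item.findtext("description", default=""),
        }
        for item in channel.findall("item")
    ]


records = parse_toc_feed(SAMPLE_RSS)
for record in records:
    print(record["journal"], "|", record["title"])
```

Each record keeps the journal title next to the article text, so a later step can group matches by journal.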
Springer
• Springer Metadata API
  • Provides metadata for over 5 million online documents
• Springer Open Access API
  • Provides metadata, full-text content, and images for over 80,000 open access articles
Open Source Stack
• Infrastructure: Amazon Web Services
• Data processing: Hadoop/Hive
• Index: Solr/Lucene
• Web service: Drupal
• Piwik
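As an illustration of the "Index" layer, the sketch below prepares a JSON update request for Solr without sending it. It assumes a Solr version that accepts a JSON array of documents on the `/update` handler; the host, the core name `journals`, the helper `build_update_request`, and the document fields are all hypothetical, not taken from the talk.

```python
import json
from urllib.parse import urlencode


def build_update_request(solr_base, core, docs, commit=True):
    """Build (url, body, headers) for a Solr JSON update; nothing is sent."""
    url = f"{solr_base.rstrip('/')}/{core}/update"
    if commit:
        # Ask Solr to commit right away so the documents become searchable.
        url += "?" + urlencode({"commit": "true"})
    body = json.dumps(docs).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return url, body, headers


# Hypothetical documents; the field names depend on the Solr schema in use.
docs = [
    {"id": "example-1", "journal": "Example Journal", "abstract": "First abstract."},
    {"id": "example-2", "journal": "Example Journal", "abstract": "Second abstract."},
]

url, body, headers = build_update_request("http://localhost:8983/solr", "journals", docs)
print(url)
```

The returned triple can be passed to any HTTP client, for example `urllib.request.Request(url, data=body, headers=headers)`.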
HDFS
Index
Feeds API Web
Springer Journal Selector
Chinese
Japanese
Scalability • Shards
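A rough sketch of a sharded query: Solr's distributed search takes a `shards` request parameter listing the shard addresses, and the node that receives the request merges the results. The host names, the core name, and the helper `build_sharded_query` are hypothetical; the talk does not say how its shards were laid out.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_sharded_query(coordinator, shards, text, rows=10):
    """Build a /select URL that fans the query out to the given shards."""
    params = {
        "q": text,
        "rows": rows,
        "wt": "json",
        # Comma-separated shard addresses, written without the URL scheme.
        "shards": ",".join(shards),
    }
    return f"{coordinator.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical two-shard layout.
shards = [
    "solr1.example.org:8983/solr/journals",
    "solr2.example.org:8983/solr/journals",
]
query_url = build_sharded_query(
    "http://solr1.example.org:8983/solr/journals",
    shards,
    "abstract:(protein folding)",
)

# Decode the query string again to check what Solr would receive.
decoded = parse_qs(urlsplit(query_url).query)
print(decoded["shards"][0])
```

Adding a shard then only means extending the list, as long as documents are spread across shards at index time.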
Internet vs. Intranet
Re-think after 3 years
Don't use Hadoop (<5TB)
Thanks! Liang Shen