odsc and irods

Open Data Science Conference and iRODS User Group meeting. Raminder Singh, Research Data Services, Research Technologies, Indiana University. July 7th, 2016.

Upload: raminder-singh

Post on 09-Feb-2017


Page 1: ODSC and iRODS


Open Data Science Conference and iRODS User Group meeting

Raminder Singh

Research Data Services, Research Technologies, Indiana University

July 7th, 2016

Page 2: ODSC and iRODS

ODSC East 2016: https://www.odsc.com/boston

Page 3: ODSC and iRODS

Technologies Discussed

• Julia is a high-level, high-performance dynamic programming language for technical computing with familiar syntax. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library.

• Stan is a platform for statistical modeling, data analysis, and prediction in the social, biological, and physical sciences, engineering, and business.

• Scikit-learn is a Python library with classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN. It is designed to interoperate with other libraries such as NumPy and SciPy.

• Apache Spark is a cluster-computing framework whose application programming interface is centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way.

• Apache Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.

• Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
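The Map/Reduce paradigm described for Hadoop above can be sketched in plain Python. This is only a single-process illustration of the idea, not Hadoop's API: the function names (`split_into_fragments`, `map_fragment`, `shuffle`, `reduce_counts`) and the sample lines are made up for this example. The input is split into small fragments, each fragment is mapped independently (so any fragment could be run or re-run on any node), and the results are grouped by key and reduced.

```python
from collections import defaultdict
from itertools import chain


def split_into_fragments(lines, fragment_size):
    """Divide the input into small, independent fragments of work."""
    return [lines[i:i + fragment_size] for i in range(0, len(lines), fragment_size)]


def map_fragment(fragment):
    """Map step: emit a (word, 1) pair for every word in one fragment."""
    return [(word.lower(), 1) for line in fragment for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {word: sum(values) for word, values in grouped.items()}


def word_count(lines, fragment_size=2):
    fragments = split_into_fragments(lines, fragment_size)
    # Each fragment is mapped on its own; here that happens sequentially.
    mapped = chain.from_iterable(map_fragment(f) for f in fragments)
    return reduce_counts(shuffle(mapped))


# Hypothetical sample input, for illustration only.
lines = ["spark and hadoop", "hive on hadoop", "pig on hadoop"]
counts = word_count(lines)
```

Because every fragment is processed independently, the final counts do not depend on how the input is split, which is what makes re-executing a failed fragment safe.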

Page 4: ODSC and iRODS


Keynote Speakers

Page 5: ODSC and iRODS


About Companies of Keynote Speakers

• Booz Allen Hamilton: Core business is providing management, technology, and security services to civilian government agencies. http://www.boozallen.com/datascience

• RapidMiner: Integrated environment for machine learning, data mining, text mining, predictive analytics, and business analytics. https://rapidminer.com/

• CrowdFlower: Data enrichment and data mining offered as software as a service. https://www.crowdflower.com/

Page 6: ODSC and iRODS


Other Interesting Speakers

Page 7: ODSC and iRODS


Topics for Training Workshops

• Using R for Data Analytics

– https://github.com/zachmayer/forecast

• Building a Real-time Recommender System with Spark ML, Kafka, and the PANCAKE STACK

– http://advancedspark.com/

• Analyzing Open Data in Healthcare using Public APIs and Reproducible Workflows

– https://github.com/jhajagos/health-open-data-workshop

Page 8: ODSC and iRODS

List of Good Talks Available Online

• Kirk Borne – “2 Most Important Things in Data Science”

– https://www.opendatascience.com/conferences/odsc-east-2016-kirk-borne-the-2-most-important-things-in-data-science/

– Experiment

– Data collection

• Tomorrow’s Map Room: Data Portals

– https://www.opendatascience.com/blog/tomorrows-map-room-data-portals/

• Interactive Data Visualizations in R with Shiny and ggplot2

– https://www.opendatascience.com/conferences/odsc-east-2016-joe-cheng-zev-ross-interactive-data-visualizations-in-r-with-shiny-and-ggplot2/

• Bokeh is a Python interactive visualization library that targets modern web browsers for presentation, comparable to Shiny in R or D3 in JavaScript. http://bokeh.pydata.org

– https://www.opendatascience.com/conferences/odsc-east-2016-peter-wang-interactive-viz-of-a-billion-points-with-bokeh-datashader/

• Exaptive Xap Store is an 'app store' for data applications. They are standardizing a set of libraries used to create networks. http://www.exaptive.com/data-application-gallery

Page 10: ODSC and iRODS


Objectives for Attending

• iRODS features and architecture

• User Community

• Use Cases and Solutions built over iRODS

• Future development and directions

Questions

• Can I write rules in other languages?

• Is it possible to attach iRODS to existing storage?

• What does it take to implement data policy rules for Research Data Alliance (RDA) practical policy recommendations?

Page 13: ODSC and iRODS


iRODS Implements Four Main Functions

Data Virtualization: iRODS provides a logical representation of files stored in physical storage locations. We call this logical view a virtual file system, and the capabilities it provides data virtualization.

Data Discovery: Information about data, called metadata, is extremely useful for data discovery, that is, locating relevant data within large data sets.

Workflow Automation: Once data is stored and available in the catalog, it often needs to be migrated, secured, or otherwise processed.

Secure Collaboration: Data is most useful when it’s in the hands of the right people. There is a recognized need in the public research community to publish data sets that accompany written articles.
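The first two functions, data virtualization and data discovery, can be illustrated with a toy in-memory catalog. This is a sketch of the concept only, not the iRODS API: `ToyCatalog`, its methods, and all paths and metadata values below are hypothetical. Each logical path maps to one or more physical locations, and attribute-value-unit metadata is used to find data without knowing where it is stored.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    logical_path: str
    physical_paths: list
    # Metadata as (attribute, value, unit) triples.
    metadata: list = field(default_factory=list)


class ToyCatalog:
    """Hypothetical stand-in for a catalog; not the iRODS API."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_path, physical_path):
        """Attach a physical location (replica) to a logical path."""
        obj = self._objects.setdefault(logical_path, DataObject(logical_path, []))
        obj.physical_paths.append(physical_path)
        return obj

    def annotate(self, logical_path, attribute, value, unit=""):
        """Record one metadata triple on an existing object."""
        self._objects[logical_path].metadata.append((attribute, value, unit))

    def resolve(self, logical_path):
        """Data virtualization: logical name -> physical locations."""
        return list(self._objects[logical_path].physical_paths)

    def find(self, attribute, value):
        """Data discovery: logical paths whose metadata matches."""
        return sorted(
            path
            for path, obj in self._objects.items()
            if any(a == attribute and v == value for a, v, _ in obj.metadata)
        )


# Hypothetical paths and metadata, for illustration only.
catalog = ToyCatalog()
catalog.register("/zone/home/lab/run1.csv", "server-a:/vault/0001")
catalog.register("/zone/home/lab/run1.csv", "server-b:/tape/9f3c")
catalog.register("/zone/home/lab/run2.csv", "server-a:/vault/0002")
catalog.annotate("/zone/home/lab/run1.csv", "instrument", "sequencer")
catalog.annotate("/zone/home/lab/run2.csv", "instrument", "microscope")

matches = catalog.find("instrument", "sequencer")
replicas = catalog.resolve("/zone/home/lab/run1.csv")
```

Users work only with the logical path and the metadata; the physical locations can change or multiply without affecting either lookup.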

Page 17: ODSC and iRODS


EMC2 Case of Adaptive Hierarchical Metadata Using MetaLnx

Page 19: ODSC and iRODS


Getting R to talk to iRODS

Bernhard Sonderegger, Nestlé Institute of Health Sciences

• The R language is an environment with a large and highly active user community in the field of data science. At NIHS we have developed the R-irods package, which allows user-friendly access to iRODS data objects and metadata from the R language. Information is passed to the R functions as native R objects (e.g. data frames) to facilitate integration with existing R code and to allow data access using standard R constructs.

• To maximize performance and maintain a simple architecture, the implementation heavily relies on the icommands C++ code wrapped using Rcpp bindings.

• The R-irods package has been engineered to have semantics equivalent to the icommands and can easily be used as a basis for further customization. At the NIHS we have created an ontology-aware package on top of R-irods to ensure consistent metadata annotations and to facilitate query construction.
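The idea of handing metadata back as native tabular objects can be sketched in Python as well. This is not how R-irods works (the talk describes wrapping the icommands C++ code with Rcpp); it is only an analogue that turns a text listing of attribute/value/units entries into a list of records. The function name `parse_avu_listing` and the sample text are assumptions made for this example, with the text shaped like a typical metadata listing.

```python
def parse_avu_listing(text):
    """Turn an attribute/value/units listing into a list of dict records."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("----"):
            # A separator line closes the current record.
            if current:
                records.append(current)
            current = {}
            continue
        key, sep, value = line.partition(":")
        # Header or unrelated lines are ignored.
        if sep and key in ("attribute", "value", "units"):
            current[key] = value.strip()
    if current:
        records.append(current)
    return records


# Assumed sample listing, for illustration only.
sample = """AVUs defined for dataObj /zone/home/lab/run1.csv:
attribute: instrument
value: sequencer
units:
----
attribute: read_length
value: 150
units: bp
"""

records = parse_avu_listing(sample)
with_units = [r["attribute"] for r in records if r["units"]]
```

Once the metadata is a list of uniform records, it can be filtered or joined with ordinary language constructs, which is the benefit the talk attributes to passing data frames in R.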

Page 23: ODSC and iRODS


Review

Questions

• Can I write rules in other languages?

– Yes.

• Is it possible to attach iRODS to existing storage?

– Yes. There are tools to load the existing data.

• What does it take to implement data policy rules for Research Data Alliance (RDA) practical policy recommendations?

– https://github.com/DICE-UNC/policy-workbook is a reference implementation of the RDA recommendations. It needs some work to update and test these policies with the latest version of iRODS.

Page 24: ODSC and iRODS


iRODS User Group Meeting notes and slides

• http://irods.org/documentation/articles/irods-user-group-meeting-2016/ : Use case slides

• http://irods.org/wp-content/uploads/2016/06/technical-overview-2016-web.pdf : Tech report

• http://slides.com/irods/ : Workshop slides

• https://github.com/DICE-UNC/policy-workbook : RDA policies implementation

• http://www.cyverse.org/ : iRODS as a service

• http://irods.org/documentation/articles/ : Other articles

• http://www.odum.unc.edu/

• http://datafed.org/about/use-cases/

• http://renci.org/news/virtual-institute-for-social-research/