developing big data curriculum with open source infrastructure · 2019-10-09 · developing big...

8
DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE - ANURAG NAGAR UNIVERSITY OF TEXAS AT DALLAS

Upload: others

Post on 24-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE · 2019-10-09 · DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE-ANURAG NAGAR UNIVERSITY OF TEXAS AT

DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE

- ANURAG NAGARUNIVERSITY OF TEXAS AT DALLAS

Page 2: DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE · 2019-10-09 · DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE-ANURAG NAGAR UNIVERSITY OF TEXAS AT

Big Data Education

• One of the most popular courses at the graduate & undergraduate level.

• Huge demand from industry.

• Curriculum involves study of distributed and parallel computing systems.

• Cutting edge infrastructure is essential to give students rich experience.

Page 3: DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE · 2019-10-09 · DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE-ANURAG NAGAR UNIVERSITY OF TEXAS AT

InfrastructureInfrastructure needed:

• Hadoop Distributed File System- Ability to run large MapReduce jobs

• Apache Spark for analytics and machine learning

• Apache Pig for workflow modeling

• Apache Hive for data warehousing

• High availability Cassandra cluster

Page 4: DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE · 2019-10-09 · DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE-ANURAG NAGAR UNIVERSITY OF TEXAS AT

Open Source Solutions

Hadoop:

• Apache Spark install on personal machine

• Cloudera Virtual Machine

• Cloudera Docker

• Amazon EC2 / EMR - free credits for students

• Microsoft HDInsight- free credits

Page 5: DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE · 2019-10-09 · DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE-ANURAG NAGAR UNIVERSITY OF TEXAS AT

Open Source Solutions

Spark:

• Databricks community edition- free 6 GB cluster

• Install Spark on personal machine

• Amazon EC2 - free credits for students

Page 6: DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE · 2019-10-09 · DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE-ANURAG NAGAR UNIVERSITY OF TEXAS AT

Open Source Solutions

Pig / Hive:

• Cloudera VM

• Cloudera Docker

Page 7: DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE · 2019-10-09 · DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE-ANURAG NAGAR UNIVERSITY OF TEXAS AT

Open Source Solutions

Cassandra Cluster:

• Datastax distribution- free and easy to install

• Datastax community edition

Page 8: DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE · 2019-10-09 · DEVELOPING BIG DATA CURRICULUM WITH OPEN SOURCE INFRASTRUCTURE-ANURAG NAGAR UNIVERSITY OF TEXAS AT

Conclusion • Big Data courses are very popular.

• Teaching still evolving.

• Infrastructure critical, but expensive and beyond the reach of many schools.

• Open source infrastructure can help get you easily started.

• Students get more experience with installation and administration.