NetBeans for Big Data

| Java One 2016 | UGF6436 | BigData with Free and Open Source Tools | Johannes Weigend

Upload: qaware-gmbh

Post on 08-Jan-2017

BigData with Free and Open Source Tools

NetBeans IDE for BigData Development with Apache Spark
Johannes Weigend - QAware GmbH

About this Talk
■ A brief overview of BigData processing (10 minutes)
■ Live demo: Apache Zeppelin and Spark (5 minutes)
■ Spark programming with NetBeans (10 minutes)

Horizontal Scalability is Difficult!
■ Horizontal scalability of functions
  ■ Trivial
    ■ Load balancing of (stateless) services (macro- / microservices)
    ■ More users → more machines
  ■ Non-trivial
    ■ More machines → faster response times
■ Horizontal scalability of data
  ■ Trivial
    ■ Linear distribution of data on multiple machines
    ■ More machines → more data
  ■ Non-trivial
    ■ Constant response times with growing datasets
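
The "trivial" half of data scalability, distributing records evenly over machines, can be sketched in a few lines of plain Java. This is an illustration only, not code from the talk: the class name, the helper names, and the choice of hash partitioning are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionSketch {

    // Maps a record key to one of `machines` partitions (hypothetical helper).
    static int partitionFor(String key, int machines) {
        // floorMod keeps the result in [0, machines) even for negative hash codes.
        return Math.floorMod(key.hashCode(), machines);
    }

    // Distributes the keys over `machines` buckets, one bucket per machine.
    static List<List<String>> distribute(List<String> keys, int machines) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < machines; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, machines)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add("log-" + i);
        }
        List<List<String>> buckets = distribute(keys, 5);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("machine " + i + ": " + buckets.get(i).size() + " records");
        }
    }
}
```

Adding a machine adds a bucket, so capacity grows with the cluster. What the sketch does not give is the non-trivial part: a query that must touch every bucket still takes longer as each bucket grows.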

Hadoop Gives Answers to Horizontal Scalability of Data and Functions

Apache Spark
■ Distributed computing (up to 100x faster than Hadoop M/R)
■ Distributed Map/Reduce on distributed data can be done in-memory
■ Written in Scala (JVM)
■ Java/Scala/Python APIs
■ Processes data from distributed and non-distributed sources
  ■ Text files (accessible from all nodes)
  ■ Hadoop File System (HDFS)
  ■ Databases (JDBC)
  ■ Solr via the Lucidworks API
  ■ ...

READ THIS: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

[Figure: Spark cluster architecture. Labels recoverable from the diagram: a Driver Application on the Driver Node creates a Spark Context (configured with a MasterURL) and uses a Resilient Distributed Dataset (RDD) that is split into Partitions. The Master (standalone / Yarn / Mesos) runs in a JVM on the Master Host. Each Slave host runs a Worker JVM that starts an Executor JVM, and each Executor runs the Task(s) for its Partition.]

Apache Spark - Lambda on Steroids
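
The slide title suggests that Spark code reads much like Java 8 lambdas and streams, scaled out to a cluster. The following is a local analogy in plain Java streams, not the Spark API. It counts log lines per level and assumes a hypothetical log format in which the level is the second whitespace-separated token. In Spark's Java API the same shape would be expressed with RDD operations such as `map` and `reduceByKey`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class LambdaSketch {

    // Counts log lines per level; the level is assumed to be the second token.
    static Map<String, Long> countByLevel(List<String> lines) {
        return lines.stream()
                .map(line -> line.split("\\s+"))        // "map" step: split each line into tokens
                .filter(tokens -> tokens.length >= 2)    // skip lines without a level token
                .collect(Collectors.groupingBy(
                        tokens -> tokens[1],             // key: the log level
                        TreeMap::new,                    // sorted result for stable printing
                        Collectors.counting()));         // "reduce" step: count per key
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "12:00:01 INFO service started",
                "12:00:02 ERROR connection refused",
                "12:00:03 INFO request handled");
        System.out.println(countByLevel(lines));
    }
}
```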

"Put the Cloud in a Box"

Cloud Case – 5x Intel NUC6i5SYK

CPU: 6th generation Intel® Core™ i5-6260U processor with Intel® Iris™ graphics (1.9 GHz, up to 2.8 GHz Turbo, dual core, 4 MB cache, 15 W TDP)
RAM: 32 GB dual-channel DDR4 SODIMMs, 1.2 V, 2133 MHz
Disk: 256 GB Samsung M.2 internal SSD
Total: 10 cores, 20 HT units, 160 GB RAM, 1.25 TB disk

→ This case is as powerful as five notebooks

LogFile Analysis with Apache Spark and NetBeans

■ DEMO
  - Getting started with Spark programming in NetBeans
  - Working with Gradle projects and code completion
  - Using a real cluster (the cloud case)
  - Working with the remote terminal
  - Using the embedded browser
  - Using Docker
  - Connecting to a remote Docker Engine
  - Using container logs

Spark Pattern 1: Distributed Task with Params
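
The code shown on this slide did not survive in the transcript. As a rough idea of what the title describes, here is a local analogy in plain Java: one task is run once per parameter, in parallel, and the results are collected. It uses a parallel stream rather than Spark, and the task itself (summing 1..n) is a hypothetical placeholder. With Spark's Java API the same shape would typically be `parallelize` on the parameter list followed by `map` and `collect`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TaskWithParamsSketch {

    // Hypothetical task: the work done for one parameter (here: sum 1..n).
    static long task(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    // Runs the task once per parameter, in parallel, and collects the results
    // in the order of the input parameters.
    static List<Long> runAll(List<Integer> params) {
        return params.parallelStream()
                .map(TaskWithParamsSketch::task)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(runAll(Arrays.asList(10, 100, 1000)));
    }
}
```

The parameter list is the only thing that has to be shipped to the workers, which is what makes this pattern attractive for compute-heavy jobs with small inputs.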

Spark Pattern 2: Distributed Read from External Sources

Spark Pattern 3: Caching and Further Processing with RDDs
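
The slide's code is missing from the transcript, so the following only illustrates why caching matters. RDD transformations are lazy, and without `cache()` each action recomputes the pipeline from its source. The sketch is a local analogy in plain Java, not Spark code: a lazily defined stream pipeline is re-run for every aggregation, while a materialized list is computed once and reused. The parse step and the call counter are hypothetical names introduced for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CachingSketch {

    // Counts how often the (assumed expensive) parse step runs.
    static final AtomicInteger parseCalls = new AtomicInteger();

    // Hypothetical parse step: extracts the length of a log line.
    static int parse(String line) {
        parseCalls.incrementAndGet();
        return line.length();
    }

    // Two aggregations on a lazily defined pipeline: the parse step runs twice per line.
    static int withoutCache(List<String> lines) {
        parseCalls.set(0);
        Supplier<Stream<Integer>> parsed = () -> lines.stream().map(CachingSketch::parse);
        int sum = parsed.get().mapToInt(Integer::intValue).sum();
        int max = parsed.get().mapToInt(Integer::intValue).max().orElse(0);
        System.out.println("uncached: sum=" + sum + ", max=" + max);
        return parseCalls.get();
    }

    // The parsed data is materialized once and reused: the parse step runs once per line.
    static int withCache(List<String> lines) {
        parseCalls.set(0);
        List<Integer> cached = lines.stream().map(CachingSketch::parse).collect(Collectors.toList());
        int sum = cached.stream().mapToInt(Integer::intValue).sum();
        int max = cached.stream().mapToInt(Integer::intValue).max().orElse(0);
        System.out.println("cached: sum=" + sum + ", max=" + max);
        return parseCalls.get();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "bb", "ccc");
        System.out.println("parse calls without cache: " + withoutCache(lines));
        System.out.println("parse calls with cache: " + withCache(lines));
    }
}
```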
