tajo seoul meetup-201501

Apache Tajo Quick Start

Jinho Kim (jinossy@gmail.com)

Agenda

•  Introduction to Tajo

•  Tajo Quick Start

•  Introduction to Text files

About me

•  Jinho Kim
   –  Senior Research Engineer, Gruter Corp (2011.5 ~)
   –  Full-time contributor to Apache Tajo (2013.6 ~)
   –  Apache Tajo PMC member and committer (2013.3 ~)

•  Contacts
   –  Email: jhkim AT apache.org
   –  LinkedIn: http://linkedin.com/in/jinossy/
   –  Twitter: @jinossy

INTRODUCTION TO TAJO

Apache Tajo

•  Open-source big data warehouse system (also called SQL-on-Hadoop)

•  Apache top-level project since March 2014

•  Supports SQL standards

•  Handles both low-latency queries and long-running batch queries

•  Version 0.9.0 was released in October 2014

Hadoop Ecosystem Integration

•  De facto standard file format support
   –  Parquet, RCFile, SequenceFile, and Text files

•  HCatalog support
   –  Enables Tajo to access existing tables used in Hive and other systems

•  YARN support
   –  Tajo can run on a YARN cluster by using Apache Slider

Overall Architecture

[Figure: overall architecture. Clients (JDBC, TSql, Web UI) connect to the master server, TajoMaster (HA), which uses a CatalogStore (or HCatalog) for metadata. Each slave server runs a TajoWorker containing a QueryMaster, a Local Query Engine, and a StorageManager on top of the local file system or HDFS. Arrow labels: "Submit a Query", "Manage metadata", "Allocate a query", "Send task & monitor".]

TAJO QUICK START

This document is based on the development branch, which had not been officially released at the time of writing. The link below points to a tarball prepared in advance.
http://people.apache.org/~jhkim/tajo-0.10.0-SNAPSHOT.tar.gz

Installing Tajo

•  Requirements
   –  Linux
   –  JDK 1.6 or 1.7
   –  Hadoop 2.3.0 or higher

    $ wget http://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
    $ tar xvzf hadoop-2.6.0.tar.gz

•  From a release tarball

    $ wget http://archive.apache.org/dist/tajo/tajo-X.X.X/tajo-X.X.X.tar.gz
    $ tar xvzf tajo-X.X.X.tar.gz

Installing Tajo

•  Requirements
   –  Maven 3
   –  Protocol Buffers 2.5.0

•  Building from source code

    $ git clone https://github.com/apache/tajo.git tajo
    $ cd tajo
    $ mvn clean install -Pdist -DskipTests -Dtar
    $ cp tajo-dist/target/tajo-X.X.X-SNAPSHOT.tar.gz {TAJO_HOME}

Tajo Cluster Mode

•  Local mode
   –  A local-mode Tajo instance can start up with a very simple configuration.

•  Fully distributed mode
   –  A fully distributed Tajo instance runs on the Hadoop Distributed File System (HDFS). In this mode, Tajo workers run across the physical nodes where the HDFS data nodes run.

Setting up a Local mode

•  Local mode
   –  Can be set up on a single machine without a Hadoop cluster; recommended when you mainly work with local files.

•  SSH

    $ ssh-keygen -t rsa
    $ ssh-copy-id -i ~/.ssh/id_rsa.pub {hostname}

•  conf/tajo-env.sh

    export HADOOP_HOME={HADOOP_HOME}
    export JAVA_HOME={JAVA_HOME}
    export TAJO_WORKER_HEAPSIZE=1000
    # export TAJO_LOG_DIR=${TAJO_HOME}/logs

•  Launch a Tajo cluster

    $ $TAJO_HOME/bin/start-tajo.sh

Setting up a Local mode - Optional

•  tajo.rootdir (tajo-site.xml)
   –  Directory where data such as the warehouse and system directories is stored

    <property>
      <name>tajo.rootdir</name>
      <value>file:///tajo/meetup/warehouse</value>
      <description>Base directory including system directories.</description>
    </property>

•  tajo.worker.tmpdir.locations (tajo-site.xml)
   –  Directory where the intermediate data needed during query execution is stored

    <property>
      <name>tajo.worker.tmpdir.locations</name>
      <value>/tmp/tajo-${user.name}/tmpdir</value>
      <description>A base for other temporary directories.</description>
    </property>

Setting up a Fully distributed mode

•  conf/tajo-site.xml
   –  RPC settings required to connect the master and the workers

    <property>
      <name>tajo.rootdir</name>
      <value>hdfs://hostname:port/tajo</value>
    </property>
    <property>
      <name>tajo.master.umbilical-rpc.address</name>
      <value>hostname:26001</value>
    </property>
    <property>
      <name>tajo.master.client-rpc.address</name>
      <value>hostname:26002</value>
    </property>
    <property>
      <name>tajo.resource-tracker.rpc.address</name>
      <value>hostname:26003</value>
    </property>
    <property>
      <name>tajo.catalog.client-rpc.address</name>
      <value>hostname:26005</value>
    </property>

•  SSH
   –  Register the SSH key on every Tajo worker

•  Hadoop Home

    export HADOOP_HOME={HADOOP_HOME}

•  conf/workers
   –  List every host that will be used as a worker

    Hostname1
    Hostname2
    Hostname3
    …

•  Make base directories and set permissions

    $ $HADOOP_HOME/bin/hadoop fs -mkdir /tajo
    $ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tajo

•  Distribute a Tajo home to workers

•  Launch a Tajo cluster

    $ $TAJO_HOME/bin/start-tajo.sh
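The RPC entries in conf/tajo-site.xml follow one pattern: a property name mapped to `host:port`. As a rough sketch only, the Python snippet below generates those entries for a given master host. The property names and port numbers come from the slide; the host names, the `build_site_config` helper, and the `<configuration>` root element (the usual Hadoop-style wrapper) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Property names and ports as listed on the slide.
RPC_PORTS = {
    "tajo.master.umbilical-rpc.address": 26001,
    "tajo.master.client-rpc.address": 26002,
    "tajo.resource-tracker.rpc.address": 26003,
    "tajo.catalog.client-rpc.address": 26005,
}


def build_site_config(master_host, rootdir):
    """Build a tajo-site.xml element tree (hypothetical helper)."""
    root = ET.Element("configuration")  # assumed Hadoop-style root element
    props = {"tajo.rootdir": rootdir}
    props.update({name: f"{master_host}:{port}" for name, port in RPC_PORTS.items()})
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


# Host names below are placeholders, not values from the talk.
config = build_site_config("master.example.com", "hdfs://namenode.example.com:8020/tajo")
values = {p.find("name").text: p.find("value").text for p in config.findall("property")}
print(minidom.parseString(ET.tostring(config)).toprettyxml(indent="  "))
```

Generating the file from one host name keeps the four RPC addresses consistent when the same configuration is distributed to every worker.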

First query execution

•  Sample data

    $ mkdir table1
    $ cat >> table1/data.csv << EOF
    1|abc|1.1|a
    2|def|2.3|b
    3|ghi|3.4|c
    4|jkl|4.5|d
    5|mno|5.6|e
    EOF

•  HDFS (Optional)

    $ $HADOOP_HOME/bin/hadoop fs -mkdir /tajo/warehouse/table1
    $ $HADOOP_HOME/bin/hadoop fs -put table1/data.csv /tajo/warehouse/table1

•  Starting the Tajo shell (tsql)

    $ ${TAJO_HOME}/bin/tsql

•  tsql usage

    usage: tsql [options] [database]
     -B,--background          execute as background process
     -c,--command <arg>       execute only single command, then exit
     -conf,--conf <arg>       configuration value
     -f,--file <arg>          execute commands from file, then exit
     -h,--host <arg>          Tajo server host
     -help,--help             help
     -p,--port <arg>          Tajo server port
     -param,--param <arg>     parameter value in SQL file

•  Create tables in Tajo
   –  Managed table

    default> create table table1 (
               id int, name text, score float, type text)
             using text;

   –  External table
      •  For HDFS, change the location accordingly.

    default> create external table table1 (
               id int, name text, score float, type text)
             using text with ('text.delimiter'='|')
             location 'file:/tajo/meetup/table1';

•  Selecting data

    default> select * from table1 where id > 2;
    Progress: 100%, response time: 0.492 sec
    id,  name,  score,  type
    -------------------------------
    3,  ghi,  3.4,  c
    4,  jkl,  4.5,  d
    5,  mno,  5.6,  e
    (3 rows, 0.492 sec, 36 B selected)
    default> \q

Maximum number of parallel running tasks

•  Worker heap memory size
   –  A Tajo worker shares memory within a single JVM, so it needs an adequately sized heap.
      •  TAJO_WORKER_HEAPSIZE=8000 (8GB)

•  Worker resources
   –  The number of tasks that run concurrently can be tuned according to the number of CPU cores, the memory size, and the number of disks.

•  Worker memory resource
   –  tajo.worker.resource.memory-mb
      •  Total memory that one worker can use
   –  tajo.task.memory-slot-mb.default
      •  Memory used to process one task
   –  tajo.qm.resource.memory-mb
      •  Memory used by the QueryMaster that processes one query

•  Worker disk resource
   –  The disk setting can be lowered to reduce excessive disk access or raised for fast storage. It is used in the table scan stage.
   –  tajo.worker.resource.disks
      •  Total number of disks that one worker can use
   –  tajo.task.disk-slot.default
      •  Number of disks used to process one task
   –  tajo.worker.resource.dfs-dir-aware
      •  The total number of disks is taken from the Hadoop DataNode configuration, and the tajo.worker.resource.disks setting is ignored.

•  The example below allows up to 4 tasks and 1 QueryMaster to run concurrently.

    TAJO_WORKER_HEAPSIZE=8512

    <property>
      <name>tajo.worker.resource.memory-mb</name>
      <value>8512</value>
    </property>
    <property>
      <name>tajo.task.memory-slot-mb.default</name>
      <value>2000</value>
    </property>
    <property>
      <name>tajo.worker.resource.disks</name>
      <value>3.0</value>
    </property>
    <property>
      <name>tajo.task.disk-slot.default</name>
      <value>1.0</value>
    </property>
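The "4 tasks and 1 QueryMaster" figure can be checked with simple slot arithmetic. The sketch below is a simplified model, not Tajo's actual scheduler: it assumes that tajo.qm.resource.memory-mb is 512 (the example does not set it) and that the remaining worker memory is divided into whole task slots. Disk slots are not modeled.

```python
def max_parallel_tasks(worker_memory_mb, task_slot_mb, qm_memory_mb, query_masters=1):
    """Task slots left after reserving memory for the QueryMaster(s)."""
    available = worker_memory_mb - qm_memory_mb * query_masters
    return max(available // task_slot_mb, 0)


# 8512 and 2000 come from the example; 512 for the QueryMaster is an assumption.
slots = max_parallel_tasks(worker_memory_mb=8512, task_slot_mb=2000, qm_memory_mb=512)
print(slots)  # 4
```

Under these assumptions, (8512 - 512) / 2000 gives 4 task slots, which matches the number quoted on the slide.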

INTRODUCTION TO TEXT FILES

Text Files

•  TEXT file format
   –  A plain-text file in which each row ends with an ASCII newline character and the columns within a row are separated by a delimiter character

•  JSON file format
   –  A file in which each row is a JSON (JavaScript Object Notation) document ending with an ASCII newline character

Text Files

•  Performance improvement
   –  Uses off-heap buffers instead of byte arrays, which reduces memory copies and makes memory management more efficient
   –  Low-level access (using sun.misc.Unsafe)
      •  Accesses off-heap memory directly to split lines and deserialize numbers

Text Files

•  Table file format
   –  Use the USING clause to specify the TEXT or JSON file format.
      •  create table text_table (id int, name text) using text
      •  create table json_table (id int, name text) using json

•  Physical properties
   –  The physical parameters provided for each file format can be specified with the WITH clause.
      •  create table text_table (id int, name text) using text with ('text.delimiter'='|')

Text Files

•  Delimiter
   –  Each line is recognized as one row
      •  Line delimiter: CR, LF, CR+LF
   –  Fields are separated by a 1-byte character. Default: '|'
      •  CSV: 'text.delimiter'=','
      •  TSV: 'text.delimiter'='\t'
      •  Hive default: 'text.delimiter'='\u0001'

    default> create external table table1 (
               id int, name text, score float, type text)
             using text with ('text.delimiter'='|')
             location '/tajo/meetup/table1';
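To make the delimiter rules concrete, here is a minimal Python sketch of how such a file could be read: lines are split on CR, LF, or CR+LF, and each line is split into fields on a single-character delimiter. It only illustrates the rules above and is not how Tajo's scanner is implemented; `parse_text_table` and `SCHEMA` are names invented for this example.

```python
import re

# Column names and types of table1 from the slides.
SCHEMA = [("id", int), ("name", str), ("score", float), ("type", str)]


def parse_text_table(raw, delimiter="|", schema=SCHEMA):
    """Split raw text into rows (by line) and typed fields (by delimiter)."""
    rows = []
    # A line may end with CR+LF, CR, or LF.
    for line in re.split(r"\r\n|\r|\n", raw):
        if not line:
            continue
        fields = line.split(delimiter)
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


# The sample data with mixed line endings.
sample = "1|abc|1.1|a\n2|def|2.3|b\r\n3|ghi|3.4|c\r4|jkl|4.5|d\n5|mno|5.6|e\n"

# Equivalent of: select * from table1 where id > 2
selected = [row for row in parse_text_table(sample) if row["id"] > 2]
for row in selected:
    print(row)
```

Passing `delimiter=","` or `delimiter="\t"` would correspond to the CSV and TSV settings listed above.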

Text Files

•  NULL value handling
   –  Option for recognizing null field values in a text file
      •  The default null value is the empty string
      •  Hive default: 'text.null'='\\N'

    default> create external table table1 (
               id int, name text, score float, type text)
             using text with ('text.delimiter'='|', 'text.null'='\\N')
             location 'file:/tajo/meetup/table1';
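The effect of 'text.null' can be pictured as a comparison against a null token before the field is converted to its column type. The sketch below assumes exactly that behavior and nothing more; `decode_field` and `decode_line` are hypothetical helpers, and the token `\N` is written as a Python raw string.

```python
def decode_field(raw, cast, null_token=""):
    """Return None when the raw field equals the null token, else the cast value."""
    if raw == null_token:
        return None
    return cast(raw)


def decode_line(line, casts, delimiter="|", null_token=""):
    """Decode one delimited line into a list of typed values."""
    return [decode_field(field, cast, null_token)
            for field, cast in zip(line.split(delimiter), casts)]


CASTS = [int, str, float, str]  # id int, name text, score float, type text

# Default: the empty string is read as NULL.
print(decode_line("1||1.1|a", CASTS))
# Hive-style files: the two characters \N are read as NULL.
print(decode_line(r"2|\N|2.3|b", CASTS, null_token=r"\N"))
```

With the Hive-style token configured, an empty text field is kept as an empty string rather than being read as NULL, which is the main practical difference between the two settings in this model.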

Text Files

•  Supported codecs
   –  Uses the codecs that Hadoop supports (Bzip2 is not supported)
   –  Some codecs, such as Snappy, require the Hadoop native module

    org.apache.hadoop.io.compress.DeflateCodec
    org.apache.hadoop.io.compress.GzipCodec
    org.apache.hadoop.io.compress.SnappyCodec
    ..

•  Table decompression
   –  The codec that matches the file extension is used.

•  Table compression
   –  Data is compressed as the table data is written.
      •  create table as, insert into

Text Files

•  Table compression
   –  Managed table

    default> create table table2 (
               id int, name text, score float, type text)
             using text with ('text.delimiter'='|',
               'compression.codec'='org.apache.hadoop.io.compress.DeflateCodec')
             as select * from table1;

   –  External table

    $ gzip /tajo/meetup/table3/data.csv
    $ bin/tsql
    default> create external table table3 (
               id int, name text, score float, type text)
             using text with ('text.delimiter'='|',
               'compression.codec'='org.apache.hadoop.io.compress.GzipCodec')
             location 'file:/tajo/meetup/table3';
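Choosing a decompressor from the file extension can be illustrated in a few lines of Python. This is only an analogy for the behavior described for table decompression: it uses Python's own gzip module rather than Hadoop codecs, and `OPENERS`, `open_table_file`, and `read_rows` are names made up for the sketch.

```python
import gzip
import os
import tempfile

# Extension -> opener. Only gzip is mapped here; other files are read as plain text.
OPENERS = {".gz": gzip.open}


def open_table_file(path):
    """Open a table file for text reading, decompressing based on its extension."""
    extension = os.path.splitext(path)[1]
    opener = OPENERS.get(extension, open)
    return opener(path, "rt", encoding="utf-8")


def read_rows(path, delimiter="|"):
    """Read every line of a table file and split it into fields."""
    with open_table_file(path) as handle:
        return [line.rstrip("\n").split(delimiter) for line in handle]


# Write a small gzip-compressed table file, then read it back.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "data.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("1|abc|1.1|a\n2|def|2.3|b\n")
    rows = read_rows(path)

print(rows)
```

The reader never needs to be told that the file is compressed; the `.gz` suffix alone selects the decompressor, which mirrors the rule stated for external tables.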

Text Files

•  JSON
   –  Each line is recognized as one JSON document
      •  Line delimiter: CR, LF, CR+LF
   –  Fields are identified by the keys of the JSON document
      •  Nested data structures (planned)
   –  Parsing error tolerance
      •  Rows that contain syntax errors can be tolerated
      •  If text.error-tolerance.max-num is -1, all errors are ignored.

Text Files

•  JSON

    $ mkdir /tajo/meetup/json_table1
    $ cat >> /tajo/meetup/json_table1/data.json << EOF
    {"id":1,"name":"abc", "score":1.1, "type":"a"}
    {"id":2,"name":"def", "score":2.3, "type":"b"}
    {"id":3,"name":"ghi", "score":3.4, "type":"c"}
    {"id":4,"name":"jkl", "score":4.5, "type":"d"}
    {"id":5,"name":"emiya muljomdao","score":5.6, "type":"e"}
    EOF

    default> create external table json_table1 (
               id int, name text, score float, type text)
             using json with ('text.error-tolerance.max-num'='5')
             location '/tajo/meetup/json_table1';
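The line-per-document model and the error tolerance setting can be sketched as follows. The slides only state that -1 ignores all errors; treating any other value N as "give up once more than N lines fail to parse" is an assumption of this sketch, and `read_json_rows` and `TooManyParseErrors` are hypothetical names.

```python
import json


class TooManyParseErrors(Exception):
    """Raised when malformed lines exceed the tolerated number."""


def read_json_rows(lines, columns, max_errors=0):
    """Parse one JSON document per line and project it onto the given columns.

    max_errors = -1 skips every malformed line; otherwise the reader gives up
    once more than max_errors lines fail to parse (assumed semantics).
    """
    rows, errors = [], 0
    for line in lines:
        if not line.strip():
            continue
        try:
            doc = json.loads(line)
            if not isinstance(doc, dict):
                raise ValueError("expected a JSON object")
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            errors += 1
            if max_errors != -1 and errors > max_errors:
                raise TooManyParseErrors(f"{errors} malformed lines") from None
            continue
        # Fields are looked up by key; a missing key becomes None.
        rows.append({name: doc.get(name) for name in columns})
    return rows, errors


COLUMNS = ["id", "name", "score", "type"]
LINES = [
    '{"id":1,"name":"abc", "score":1.1, "type":"a"}',
    '{"id":2,"name":"def", "score":2.3',  # truncated on purpose
    '{"id":3,"name":"ghi", "score":3.4, "type":"c"}',
]

rows, errors = read_json_rows(LINES, COLUMNS, max_errors=5)
print(len(rows), errors)
```

With `max_errors=5` the truncated line is skipped and the other two rows are returned; with `max_errors=0` the same input raises `TooManyParseErrors` in this model.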

Get Involved!

•  We are recruiting contributors!

•  General
   –  http://tajo.apache.org

•  Getting Started
   –  http://tajo.apache.org/docs/0.9.0/getting_started.html

•  Downloads
   –  http://tajo.apache.org/docs/0.9.0/getting_started/downloading_source.html

•  Jira (Issue Tracker)
   –  https://issues.apache.org/jira/browse/TAJO

•  Join the mailing list
   –  dev-subscribe@tajo.apache.org
   –  issues-subscribe@tajo.apache.org
