
Big Data Architecture and Cluster Optimization with Python

By: Chetan Khatri
Principal Big Data Engineer, Nazara Technologies.
Data Science & Machine Learning Curricula Advisor, University of Kachchh, Gujarat.

PyCon India 2016

Data Analytics Cycle
- Understand the Business
- Understand the Data
- Cleanse the Data
- Do Analytics on the Data
- Predict from the Data
- Visualize the Data
- Build Insight that helps to grow Business Revenue
- Explain to Executives (CxO)
- Take Decisions
- Increase Revenue

Capacity Planning (Cluster Sizing)

Telecom Business:
- 122 operators, 4 regions (India, Africa, ME, Latin America)
- 12 TB of data per year
- 1,100,000 transactions per day

Gaming Business:
- 6 billion events per month = (nearly) 15 TB of data per year

Total: 27 TB of data per year
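A quick back-of-envelope check of these figures (a sketch; the implied bytes-per-event number is derived from the slide's totals, not stated in the talk):

# Back-of-envelope capacity check for the figures above.
GAMING_EVENTS_PER_MONTH = 6 * 10**9     # 6 billion events/month (from the slide)
GAMING_TB_PER_YEAR = 15                 # ~15 TB/year (from the slide)
TELECOM_TB_PER_YEAR = 12                # 12 TB/year (from the slide)

events_per_year = GAMING_EVENTS_PER_MONTH * 12
# Implied average event size, assuming decimal TB (10**12 bytes); ~208 bytes/event.
bytes_per_event = GAMING_TB_PER_YEAR * 10**12 / events_per_year
total_tb_per_year = GAMING_TB_PER_YEAR + TELECOM_TB_PER_YEAR   # 27 TB, matching the slide

print("implied avg event size: %.0f bytes" % bytes_per_event)
print("total raw data per year: %d TB" % total_tb_per_year)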

Predictive Modeling Cycle
1. Data quality (removing noisy, missing data)
2. Feature engineering
3. Choosing the best model, based on the nature of the data. For example: for continuous data points go with Linear Regression; if a categorical binomial prediction is required, go with Logistic Regression; Random Forest trains on random samples of the data (feature randomization) and tends to have better generalization performance; Gradient Boosting Trees build an optimal linear combination of trees as a weighted sum of the predictions of the individual trees. Try everything from Linear Regression to Deep Learning (RNN, CNN).
4. Ensemble model (Regression + Random Forest + XGBoost); a sketch of this step follows below.
5. Tune hyper-parameters (for example, in a Deep Neural Network you need to tune mini-batch size, learning rate, epochs, hidden layers)
6. Model compression: port the model to embedded / mobile devices by compressing its matrices (Sparsify, Shrink, Break, Quantize)
7. Run on a smart-phone
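A minimal sketch of step 4 using scikit-learn; the synthetic dataset and hyper-parameters are placeholders, and GradientBoostingClassifier stands in for XGBoost:

# Ensemble sketch: logistic regression + random forest + gradient boosting, soft voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data; replace with the real feature matrix and labels.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("gbt", GradientBoostingClassifier()),   # stand-in for XGBoost
    ],
    voting="soft",   # average the predicted class probabilities
)
ensemble.fit(X_train, y_train)
print("held-out accuracy: %.3f" % ensemble.score(X_test, y_test))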

Big Data Cluster Tuning – OS Parameters

TPS (Transactions Per Second): throughput for every job.

TCP TIME_WAIT interval (e.g. 4 minutes)

Max ports / max connections:
- sysctl net.ipv4.ip_local_port_range
- sysctl net.ipv4.tcp_fin_timeout

Max threads:
- sysctl -a | grep threads-max
- echo 120000 > /proc/sys/kernel/threads-max
- echo 600000 > /proc/sys/vm/max_map_count
- cat /proc/sys/kernel/threads-max

Number of Threads = Total Virtual Memory / (Stack Size * 1024 * 1024)

java.lang.OutOfMemoryError: Java heap space !

- List RAM: free -m
- Storage: df -h
- ulimit -s   # stack memory
- ulimit -v   # virtual memory
- echo 120000 > /proc/sys/kernel/threads-max
- echo 600000 > /proc/sys/vm/max_map_count
- echo 200000 > /proc/sys/kernel/pid_max
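A small Python sketch of the thread-count estimate above (an illustration, not from the slides). resource.getrlimit reports the same limits as ulimit, but in bytes, so the formula's 1024 factors (which only convert ulimit's KB/MB units) are not needed here:

# Rough estimate: threads ~= virtual-memory limit / per-thread stack size.
import resource

vmem_soft, _ = resource.getrlimit(resource.RLIMIT_AS)      # ulimit -v, in bytes (or unlimited)
stack_soft, _ = resource.getrlimit(resource.RLIMIT_STACK)  # ulimit -s, in bytes (or unlimited)

if resource.RLIM_INFINITY in (vmem_soft, stack_soft):
    print("virtual memory or stack size is unlimited; the estimate does not apply")
else:
    print("estimated max threads: %d" % (vmem_soft // stack_soft))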

Virtual Memory Configuration – swap configuration

- sudo fallocate -l 20G /swapfile
- sudo chmod 600 /swapfile
- sudo mkswap /swapfile
- sudo swapon /swapfile
- sudo swapon -s
- sudo nano /etc/fstab
- /swapfile none swap sw 0 0   (line to add in /etc/fstab)

Maximum number of open files
- ulimit -n
- sudo nano /etc/security/limits.conf
    * soft nofile 64000
    * hard nofile 64000
    root soft nofile 64000
    root hard nofile 64000
- sudo nano /etc/pam.d/common-session
    session required pam_limits.so
- sudo nano /etc/pam.d/common-session-noninteractive
    session required pam_limits.so

Big Data Optimization: Tune kafka Cluster

- buffer.memory: default
- batch.size: "655357"
- linger.ms: "5"
- compression.type: lz4
- retries: default
- send.buffer.bytes: default
- connections.max.idle.ms: default

- bootstrap.servers
- batch.size
- linger.ms
- connections.max.idle.ms = 10000
- compression.type
- retries
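A sketch of how these producer settings might be applied from Python, assuming the kafka-python client; the broker address, topic, and payload are placeholders:

# Producer settings from the slide, applied via kafka-python (lz4 needs the `lz4` package).
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],   # placeholder broker
    batch_size=655357,                      # batch.size from the slide
    linger_ms=5,                            # linger.ms
    compression_type="lz4",                 # compression.type
    connections_max_idle_ms=10000,          # connections.max.idle.ms
    # buffer.memory, retries, send.buffer.bytes left at their defaults, as on the slide
)
producer.send("events", b"hello from the producer")  # placeholder topic and payload
producer.flush()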

Spark Cluster Hyper parameter Tuning

1) ./spark-shell \
   --conf spark.executor.memory=50g \
   --conf spark.driver.memory=150g \
   --conf spark.kryoserializer.buffer.max=256 \
   --conf spark.driver.maxResultSize=1g \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.rpc.askTimeout=300s \
   --conf spark.dynamicAllocation.minExecutors=5 \
   --conf spark.sql.shuffle.partitions=1024
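The same settings can also be supplied programmatically; a minimal PySpark sketch with the values shown above (the app name is a placeholder, and driver memory normally has to be set before the driver JVM starts, e.g. via spark-submit or spark-defaults.conf):

# The --conf flags above, expressed as a programmatic SparkConf.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cluster-tuning-demo")                       # placeholder app name
        .set("spark.executor.memory", "50g")
        .set("spark.driver.memory", "150g")                      # usually set before JVM start
        .set("spark.kryoserializer.buffer.max", "256m")          # slide shows 256; "m" is the safer form
        .set("spark.driver.maxResultSize", "1g")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.rpc.askTimeout", "300s")
        .set("spark.dynamicAllocation.minExecutors", "5")
        .set("spark.sql.shuffle.partitions", "1024"))

sc = SparkContext(conf=conf)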

Spark Cluster Hyper parameter Tuning

2) Configuration in spark-defaults.conf at /usr/local/spark-1.6.1/conf

Spark Cluster Hyper parameter Tuning

spark.master                   spark://master.prod.chetan.com:7077
spark.serializer               org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled         true
spark.history.fs.logDirectory  file:/data/tmp/spark-events
# spark.eventLog.dir=hdfs://namenode_host:namenode_port/user/spark/applicationHistory
spark.eventLog.dir             file:/data/tmp/spark-events

PySpark with Hadoop Demo: MapReduce with word count

>>> textFile = sc.textFile("file:///home/chetan306/inputfile.txt")
>>> textFile.count()
>>> textFile.first()
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
>>> wordCounts.collect()
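A possible follow-up step (not in the original slides): order the pairs by count to see the most frequent words, using the standard RDD.takeOrdered API.

>>> wordCounts.takeOrdered(10, key=lambda pair: -pair[1])   # top 10 most frequent words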

Data Science in University Education Initiative
- Data Science Lab, Computer Science Department – University of Kachchh
- Machine Learning / Data Science with Python

Questions?

Resources: https://github.com/dskskv/pycon-india-2016

Email: chetan@kutchuni.edu.in
Twitter: @khatri_chetan
