
Baroda Institute of Technology

On Job Training

Right Place To Develop Your Career

Company Profile:

Company Name: Baroda Institute Of Technology (BIT)

Registration No. B-26/23349

GST No. 24AIVPP3462A1Z8

Pan No. AIVPP3462A

BIT, a unique Centre for Professional Development, is made up of a network of passionate, supportive, collaborative, diverse and professional trainers and faculty who focus on developing students' skills to improve their performance.

Our Strength:

• Pioneers in Skill-based Training
• More than 15 years of Training Excellence
• Wide range of Training Methodologies

BIT (Baroda Institute of Technology) was incorporated in 2002 with the aim of equipping job aspirants and professionals with the skills necessary to excel in their fields by giving them quality training on par with industry standards. The promoters of the company had been in training services for more than a decade, and their experience and expertise paved the way for the company to develop and grow into further branches. BIT has its corporate office in Vadodara – Sayajigunj and branches in various other parts of Gujarat.

BIT offers career and professional training through BIT Computer Education (www.bitbaroda.com), BIT INFOTECH offers website and software development (www.bitinfotech.in), and JOBBIT offers placement services (www.jobbit.in).

Contact Details:

Baroda Institute Of Technology
B-208, Manubhai Tower, Opp. M.S. University, Sayajigunj, Vadodara, Gujarat, India
Ph. 0265-2225711
M. 9327219987
www.bitbaroda.com

Introduction:


Internship in Big Data Hadoop

This is an immersive training program for engineering and postgraduate students who want to build a career in business analytics. The extensive course curriculum, designed by industry experts, offers scenario-based learning with unmatched detail and expertise. The project-based training methodology builds the proficiency students need to contribute to and complete a project successfully.

With this comprehensive training, you will learn the fundamentals of relational databases for managing structured and unstructured data, along with grid computing, the Hadoop ecosystem, and the Hadoop Distributed File System (HDFS). The need for MapReduce, its functionality, and how to develop MapReduce applications are covered, and the expert trainer demonstrates how to set up and administer a Hadoop cluster.

Big Data and its technologies offer a progressive career path with accelerated growth. Fill the significant gap between academic learning and the skills employers are looking for. Gear up with the skills the industry demands and get ready to be part of this fast-growing job market.


Course Content

• Section 1. The big picture of Big Data
  o What is Big Data
  o Necessity of Big Data and Hadoop in the industry
  o Paradigm shift - why the industry is shifting to Big Data tools
  o Different dimensions of Big Data
  o Data explosion in the Big Data industry
  o Various implementations of Big Data
  o Different technologies to handle Big Data
  o Traditional systems and associated problems
  o Future of Big Data in the IT industry

• Section 2. Demystifying Hadoop
  o Why Hadoop is at the heart of every Big Data solution
  o Introduction to the Big Data Hadoop framework
  o Hadoop architecture and design principles
  o Ingredients of Hadoop
  o Hadoop characteristics and data-flow
  o Components of the Hadoop ecosystem
  o Hadoop Flavors – Apache, Cloudera, Hortonworks, and more

• Section 3. Setup and Installation of Hadoop

  A) SETUP AND INSTALLATION OF SINGLE-NODE HADOOP CLUSTER
  o Hadoop environment setup and pre-requisites
  o Hadoop installation and configuration
  o Working with Hadoop in pseudo-distributed mode (a sample configuration follows this section)
  o Troubleshooting encountered problems

  B) SETUP AND INSTALLATION OF HADOOP MULTI-NODE CLUSTER
  o Hadoop environment setup on the cloud (Amazon cloud)
  o Installation of Hadoop pre-requisites on all nodes
  o Configuration of masters and slaves on the cluster
  o Playing with Hadoop in distributed mode
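As a reference point for pseudo-distributed mode, the two configuration snippets below follow the standard Apache Hadoop single-node setup: core-site.xml points the default filesystem at a NameNode on localhost, and hdfs-site.xml lowers the replication factor to 1 because there is only one DataNode. The port and values shown are the usual defaults, not course-specific settings.

    <!-- etc/hadoop/core-site.xml -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- etc/hadoop/hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

After formatting the NameNode with hdfs namenode -format and starting the HDFS and YARN daemons, Hadoop behaves like a small cluster on a single machine.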

• Section 4. HDFS – The Storage Layer
  o What is HDFS (Hadoop Distributed File System)
  o HDFS daemons and architecture
  o HDFS data flow and storage mechanism
  o Hadoop HDFS characteristics and design principles
  o Responsibility of HDFS Master – NameNode
  o Storage mechanism of Hadoop meta-data
  o Work of HDFS Slaves – DataNodes
  o Data Blocks and distributed storage
  o Replication of blocks, reliability, and high availability
  o Rack-awareness, scalability, and other features
  o Different HDFS APIs and terminologies (a Java API sketch follows this section)
  o Commissioning of nodes and addition of more nodes
  o Expanding clusters in real-time
  o Hadoop HDFS Web UI and HDFS explorer
  o HDFS best practices and hardware discussion
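For a flavour of the HDFS APIs listed above, here is a minimal Java sketch using the standard org.apache.hadoop.fs.FileSystem client. The directory /user/bit/demo and the file input.txt are hypothetical examples, not part of the course material.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // reads core-site.xml from the classpath
        try (FileSystem fs = FileSystem.get(conf)) {
          Path dir = new Path("/user/bit/demo");              // hypothetical HDFS directory
          fs.mkdirs(dir);
          fs.copyFromLocalFile(new Path("input.txt"), new Path(dir, "input.txt"));
          for (FileStatus status : fs.listStatus(dir)) {
            // each file is split into blocks and replicated across DataNodes by HDFS
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
          }
        }
      }
    }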

• Section 5. A Deep Dive into MapReduce
  o What is MapReduce, the processing layer of Hadoop
  o The need for a distributed processing framework
  o Issues before MapReduce and its evolution
  o List processing concepts
  o Components of MapReduce – Mapper and Reducer
  o MapReduce terminologies – keys, values, lists, and more
  o Hadoop MapReduce execution flow
  o Mapping and reducing data based on keys
  o MapReduce word-count example to understand the flow (a compact sketch follows this section)
  o Execution of Map and Reduce together
  o Controlling the flow of mappers and reducers
  o Optimization of MapReduce Jobs
  o Fault-tolerance and data locality
  o Working with map-only jobs
  o Introduction to Combiners in MapReduce
  o How MR jobs can be optimized using combiners
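The word-count example mentioned in this section is the usual way to see the Mapper/Reducer flow end to end. A compact Java sketch is given below; the class and job names are illustrative, and the input/output paths are taken from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: for every input line, emit (word, 1) for each token
      public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sum all the 1s that arrive for the same word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer to pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }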

• Section 6. MapReduce - Advanced Concepts
  o Anatomy of MapReduce
  o Hadoop MapReduce data types
  o Developing custom data types using Writable & WritableComparable (a sketch follows this section)
  o InputFormats in MapReduce
  o InputSplit as a unit of work
  o How Partitioners partition data
  o Customization of RecordReader
  o Moving data from mapper to reducer – shuffling & sorting
  o Distributed cache and job chaining
  o Different Hadoop case-studies to customize each component
  o Job scheduling in MapReduce
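To illustrate custom data types, below is a small sketch of a composite key that implements WritableComparable; the fields customerId and timestamp are invented for the example. A key like this serializes its fields in a fixed order and defines the sort order used during shuffling.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key: sort events by customer, then by time
    public class CustomerEventKey implements WritableComparable<CustomerEventKey> {
      private long customerId;
      private long timestamp;

      public CustomerEventKey() { }                             // no-arg constructor required by Hadoop

      public CustomerEventKey(long customerId, long timestamp) {
        this.customerId = customerId;
        this.timestamp = timestamp;
      }

      @Override
      public void write(DataOutput out) throws IOException {    // serialize fields in a fixed order
        out.writeLong(customerId);
        out.writeLong(timestamp);
      }

      @Override
      public void readFields(DataInput in) throws IOException { // deserialize in the same order
        customerId = in.readLong();
        timestamp = in.readLong();
      }

      @Override
      public int compareTo(CustomerEventKey other) {            // defines the sort order of keys
        int cmp = Long.compare(customerId, other.customerId);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
      }
    }

If such a key is used with the default HashPartitioner, it should also override hashCode() and equals() so that equal keys always reach the same reducer.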

• Section 7. Hive – Data Analysis Tool
  o The need for an ad-hoc SQL-based solution – Apache Hive
  o Introduction to and architecture of Hadoop Hive
  o Playing with the Hive shell and running HQL queries (a JDBC sketch follows this section)
  o Hive DDL and DML operations
  o Hive execution flow
  o Schema design and other Hive operations
  o Schema-on-Read vs Schema-on-Write in Hive
  o Meta-store management and the need for RDBMS
  o Limitations of the default meta-store
  o Using SerDe to handle different types of data
  o Optimization of performance using partitioning
  o Different Hive applications and use cases
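As a taste of HQL from a client program, the sketch below runs a few queries through the HiveServer2 JDBC interface. It assumes HiveServer2 is reachable on localhost:10000 and that the hive-jdbc driver is on the classpath; the sales table, its columns, the credentials, and the HDFS path are all hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");    // driver class from the hive-jdbc jar
        String url = "jdbc:hive2://localhost:10000/default"; // hypothetical HiveServer2 endpoint
        try (Connection con = DriverManager.getConnection(url, "bituser", "");
             Statement stmt = con.createStatement()) {
          stmt.execute("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE, region STRING) "
              + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
          stmt.execute("LOAD DATA INPATH '/user/bit/sales.csv' INTO TABLE sales"); // hypothetical HDFS file
          try (ResultSet rs = stmt.executeQuery(
              "SELECT region, SUM(amount) FROM sales GROUP BY region")) {
            while (rs.next()) {
              System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
          }
        }
      }
    }

The same statements can also be typed directly into the Hive shell or Beeline.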


• Section 8. Pig – Data Analysis Tool
  o The need for a high level query language - Apache Pig
  o How Pig complements Hadoop with a scripting language
  o What is Pig
  o Pig execution flow
  o Different Pig operations like filter and join (a sketch follows this section)
  o Compilation of Pig code into MapReduce
  o Comparison - Pig vs MapReduce
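For a feel of Pig operations, the sketch below drives a simple filter from Java through the PigServer API in local mode; the orders.csv file and its schema are invented for the example, and the same two statements can be entered as Pig Latin in the Grunt shell.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigFilterExample {
      public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);       // use the MAPREDUCE exec type on a cluster
        // Load a hypothetical orders file, keep only the large orders, and store the result
        pig.registerQuery("orders = LOAD 'orders.csv' USING PigStorage(',') "
            + "AS (id:int, region:chararray, amount:double);");
        pig.registerQuery("big_orders = FILTER orders BY amount > 1000.0;");
        pig.store("big_orders", "big_orders_output");        // Pig compiles this plan into MapReduce jobs
      }
    }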

• Section 9. NoSQL Database – HBase
  o NoSQL databases and their need in the industry
  o Introduction to Apache HBase
  o Internals of the HBase architecture
  o The HBase Master and Slave Model
  o Column-oriented, 3-dimensional, schema-less datastores
  o Data modeling in Hadoop HBase
  o Storing multiple versions of data
  o Data high-availability and reliability
  o Comparison - HBase vs HDFS
  o Comparison - HBase vs RDBMS
  o Data access mechanisms (a Java client sketch follows this section)
  o Working with HBase using the shell
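To show one of the data access mechanisms from this section, a minimal Java sketch of the HBase client API follows. It assumes a table named users with a column family info already exists; the table name, column names, row key, and value are all hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();    // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
          // write one cell: row key, column family, qualifier, value
          Put put = new Put(Bytes.toBytes("row1"));
          put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
          table.put(put);

          // read the same cell back
          Result result = table.get(new Get(Bytes.toBytes("row1")));
          byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
          System.out.println(Bytes.toString(value));
        }
      }
    }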

• Section 10. Data Collection using Sqoop
  o The need for Apache Sqoop
  o Introduction and working of Sqoop
  o Importing data from RDBMS to HDFS
  o Exporting data to RDBMS from HDFS
  o Conversion of data import/export queries into MapReduce jobs

• Section 11. Data Collection using Flume
  o What is Apache Flume
  o Flume architecture and aggregation flow
  o Understanding Flume components like data Sources and Sinks
  o Flume channels to buffer events
  o Reliable & scalable data collection tools
  o Aggregating streams using Fan-in
  o Separating streams using Fan-out
  o Internals of the agent architecture (a sample agent configuration follows this section)
  o Production architecture of Flume
  o Collecting data from different sources to Hadoop HDFS
  o Multi-tier Flume flow for collection of volumes of data using AVRO
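For illustration, below is a minimal Flume agent configuration in the style of the Apache Flume user guide: one netcat source, one memory channel, and one logger sink. The agent and component names (a1, r1, c1, k1) are arbitrary, and a production flow would normally end in an HDFS sink rather than the logger.

    # flume-conf.properties – name the components of agent a1
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # source: read lines of text from a local TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # channel: buffer events in memory between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000

    # sink: write events to the agent's log
    a1.sinks.k1.type = logger

    # wire the source and the sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1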

• Section 12. Apache YARN & Advanced Concepts in the Latest Version
  o The need for and the evolution of YARN
  o YARN and its eco-system
  o YARN daemon architecture
  o Master of YARN – Resource Manager
  o Slave of YARN – Node Manager
  o Requesting resources from the application master
  o Dynamic slots (containers)
  o Application execution flow
  o MapReduce version 2 applications over YARN
  o Hadoop Federation and NameNode HA

• Section 13. Processing Data with Apache Spark
  o Introduction to Apache Spark
  o Comparison - Hadoop MapReduce vs Apache Spark
  o Spark key features
  o RDD and various RDD operations (a sketch follows this section)
  o RDD abstraction, interfacing, and creation of RDDs
  o Fault Tolerance in Spark
  o The Spark Programming Model
  o Data flow in Spark
  o The Spark Ecosystem, Hadoop compatibility, & integration
  o Installation & configuration of Spark
  o Processing Big Data using Spark
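To illustrate RDD operations, here is a short word-count sketch using Spark's Java API. The application name is arbitrary, the input and output paths are taken from the command line, and the master URL is assumed to be supplied by spark-submit.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          JavaRDD<String> lines = sc.textFile(args[0]);                        // RDD of input lines
          JavaPairRDD<String, Integer> counts = lines
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())   // split each line into words
              .mapToPair(word -> new Tuple2<>(word, 1))                        // pair each word with 1
              .reduceByKey(Integer::sum);                                      // add up the counts per word
          counts.saveAsTextFile(args[1]);                                      // action: triggers the lazy pipeline
        }
      }
    }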

• Section 14. Live Project
  o Gathering the customer's requirements
  o Preparing the database and business logic
  o Developing the application
  o Testing and implementing the project
  o Troubleshooting the project application after implementation
  o Summary