Apache Sqoop with a Use Case


TRANSCRIPT

Page 1

Apache Sqoop
BY DAVIN.J.ABRAHAM

Page 2
Page 3

What is Sqoop?

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Sqoop imports data from external structured datastores into HDFS or related systems like Hive and HBase.

Sqoop can also be used to export data from Hadoop into external structured datastores such as relational databases and enterprise data warehouses.

Sqoop works with relational databases such as: Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
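As a taste of both directions before the detailed walkthrough, here is a minimal sketch; the host, database, and table names (db-host, mydb, customers, customers_copy) are hypothetical, and -P prompts for the password interactively:

# Import a table from MySQL into HDFS (hypothetical names throughout)
$ sqoop import \
  --connect jdbc:mysql://db-host/mydb \
  --username user -P \
  --table customers \
  --target-dir /user/hadoop/customers

# Export the same HDFS directory back out to a MySQL table
$ sqoop export \
  --connect jdbc:mysql://db-host/mydb \
  --username user -P \
  --table customers_copy \
  --export-dir /user/hadoop/customers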

Page 4

Why Sqoop?

As more organizations deploy Hadoop to analyse vast streams of information, they may find they need to transfer large amounts of data between Hadoop and their existing databases, data warehouses and other data sources.

Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on a large cluster, is a challenging task: transferring data using ad-hoc scripts is inefficient and time-consuming.

Page 5

Hadoop-Sqoop?

Hadoop is great for storing massive volumes of data in HDFS.

It provides a scalable processing environment for structured and unstructured data.

But it is batch-oriented, and thus not suitable for low-latency interactive query operations.

Sqoop is basically an ETL tool used to copy data between HDFS and SQL databases:

Import SQL data into HDFS for archival or analysis

Export HDFS data to SQL (e.g., summarized data used in a DW fact table)

Page 6

What Sqoop Does

Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases, Apache Sqoop:

Allows data imports from external datastores and enterprise data warehouses into Hadoop

Parallelizes data transfer for fast performance and optimal system utilization

Copies data quickly from external systems to Hadoop

Makes data analysis more efficient 

Mitigates excessive loads to external systems.

Page 7

How Sqoop Works

Sqoop provides a pluggable connector mechanism for optimal connectivity to external systems.

The Sqoop extension API provides a convenient framework for building new connectors which can be dropped into Sqoop installations to provide connectivity to various systems.

Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems.
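For illustration, a connector can also be selected explicitly on the command line. In the sketch below, --driver and --connection-manager are standard Sqoop flags and the class names are real, but the host, database, and table names are made up, and the driver jar is assumed to already be on Sqoop's classpath:

# Hypothetical example: db-host, mydb and customers are placeholder names
$ sqoop import \
  --connect "jdbc:sqlserver://db-host:1433;databaseName=mydb" \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --connection-manager org.apache.sqoop.manager.GenericJdbcManager \
  --username user -P \
  --table customers \
  --target-dir /user/hadoop/customers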

Page 8

Who Uses Sqoop?

Online marketer Coupons.com uses Sqoop to exchange data between Hadoop and the IBM Netezza data warehouse appliance. The organization can query its structured databases and pipe the results into Hadoop using Sqoop.

Education company the Apollo Group also uses the software, not only to extract data from databases but also to inject the results of Hadoop jobs back into relational databases.

Countless other Hadoop users use Sqoop to move their data efficiently.

Page 9

Importing Data - List the databases on your MySQL server.

$ sqoop list-databases \
  --connect jdbc:mysql://<<mysql-server>>/employees \
  --username airawat \
  --password myPassword
...
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
employees
test

Page 10

List the tables in your MySQL database.

$ sqoop list-tables \
  --connect jdbc:mysql://<<mysql-server>>/employees \
  --username airawat \
  --password myPassword
...
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
departments
dept_emp
dept_manager
employees
employees_exp_stg
employees_export
salaries
titles

Page 11

Importing Data from MySQL into HDFS

Replace "airawat-mySqlServer-node" with the host name of the node running mySQL server, replace login credentials and target directory.Importing a table into HDFS - basic import $ sqoop import \ --connect jdbc:mysql://airawat-mySqlServer-node/employees \ --username myUID \ --password myPWD \ --table employees \ -m 1 \ --target-dir /user/airawat/sqoop-mysql/employees

.

.

.

.9139 KB/sec) 13/05/31 22:32:25 INFO mapreduce.ImportJobBase: Retrieved 300024

records

Page 12

Executing imports with an options file for static information

Rather than repeating the import command along with the connection-related input each time, you can pass an options file as an argument to sqoop.

Create a text file, as follows, and save it locally on the node you run the sqoop client from.

Sample options file:

$ vi SqoopImportOptions.txt
#
# Options file for sqoop import
#
import
--connect
jdbc:mysql://airawat-mySqlServer-node/employees
--username
myUID
--password
myPwd
#
# All other commands should be specified in the command line

Page 13

Options File - Command

The command:

$ sqoop --options-file SqoopImportOptions.txt \
  --table departments \
  -m 1 \
  --target-dir /user/airawat/sqoop-mysql/departments
...
13/05/31 22:48:55 INFO mapreduce.ImportJobBase: Transferred 153 bytes in 26.2453 seconds (5.8296 bytes/sec)
13/05/31 22:48:55 INFO mapreduce.ImportJobBase: Retrieved 9 records.

The -m argument specifies the number of mappers. The departments table has only a handful of records, so I am setting it to 1.
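For larger tables, the import can be parallelized. A sketch, reusing the employees table from earlier (the target directory name here is made up); --split-by tells Sqoop which column to partition the input ranges on:

# Run 4 parallel map tasks, splitting on the primary key emp_no
$ sqoop --options-file SqoopImportOptions.txt \
  --table employees \
  --split-by emp_no \
  -m 4 \
  --target-dir /user/airawat/sqoop-mysql/employees-parallel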

Page 14

The Files Created in HDFS

$ hadoop fs -ls -R sqoop-mysql/

drwxr-xr-x - airawat airawat 0 2013-05-31 22:48 sqoop-mysql/departments

-rw-r--r-- 3 airawat airawat 0 2013-05-31 22:48 sqoop-mysql/departments/_SUCCESS

drwxr-xr-x - airawat airawat 0 2013-05-31 22:48 sqoop-mysql/departments/_logs

drwxr-xr-x - airawat airawat 0 2013-05-31 22:48 sqoop-mysql/departments/_logs/history

-rw-r--r-- 3 airawat airawat 79467 2013-05-31 22:48 sqoop-mysql/departments/_logs/history/cdh-jt01_1369839495962_job_201305290958_0062_conf.xml

-rw-r--r-- 3 airawat airawat 12441 2013-05-31 22:48 sqoop-mysql/departments/_logs/history/job_201305290958_0062_1370058514473_airawat_departments.jar

-rw-r--r-- 3 airawat airawat 153 2013-05-31 22:48 sqoop-mysql/departments/part-m-00000

Page 15

To View the Contents of the Imported File

Data file contents:

$ hadoop fs -cat sqoop-mysql/departments/part-m-00000 | more
d009,Customer Service
d005,Development
d002,Finance
d003,Human Resources
d001,Marketing
d004,Production
d006,Quality Management
d008,Research
d007,Sales

Page 16

Import All Rows, but Only Specific Columns

$ sqoop --options-file SqoopImportOptions.txt \
  --table dept_emp \
  --columns "EMP_NO,DEPT_NO,FROM_DATE,TO_DATE" \
  --as-textfile \
  -m 1 \
  --target-dir /user/airawat/sqoop-mysql/DeptEmp

Page 17

Import All Columns, but Only Specific Rows, Using a Where Clause

Import all columns, filtering rows using a where clause:

$ sqoop --options-file SqoopImportOptions.txt \
  --table employees \
  --where "emp_no > 499948" \
  --as-textfile \
  -m 1 \
  --target-dir /user/airawat/sqoop-mysql/employeeGtTest

Page 18

Import - Free Form Query

Import with a free-form query with a where clause. The literal $CONDITIONS token is required in free-form queries: Sqoop substitutes it with the split conditions for each mapper.

$ sqoop --options-file SqoopImportOptions.txt \
  --query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where EMP_NO < 20000 AND $CONDITIONS' \
  -m 1 \
  --target-dir /user/airawat/sqoop-mysql/employeeFrfrmQry1

Page 19

Import without Where clause

Import with a free-form query without a where clause:

$ sqoop --options-file SqoopImportOptions.txt \
  --query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where $CONDITIONS' \
  -m 1 \
  --target-dir /user/airawat/sqoop-mysql/employeeFrfrmQrySmpl2

Page 20

Export: Create the Sample Table employees_export

Create a table in MySQL:

mysql> CREATE TABLE employees_export (
  emp_no int(11) NOT NULL,
  birth_date date NOT NULL,
  first_name varchar(14) NOT NULL,
  last_name varchar(16) NOT NULL,
  gender enum('M','F') NOT NULL,
  hire_date date NOT NULL,
  PRIMARY KEY (emp_no)
);

Page 21

Import employees into HDFS to Demonstrate Export

Import some data into HDFS:

$ sqoop --options-file SqoopImportOptions.txt \
  --query 'select EMP_NO,birth_date,first_name,last_name,gender,hire_date from employees where $CONDITIONS' \
  --split-by EMP_NO \
  --direct \
  --target-dir /user/airawat/sqoop-mysql/Employees

Page 22

Export – Create a Staging Table

Create a stage table in mysql:

mysql> CREATE TABLE employees_exp_stg (
  emp_no int(11) NOT NULL,
  birth_date date NOT NULL,
  first_name varchar(14) NOT NULL,
  last_name varchar(16) NOT NULL,
  gender enum('M','F') NOT NULL,
  hire_date date NOT NULL,
  PRIMARY KEY (emp_no)
);

Page 23

The Export Command

$ sqoop export \
  --connect jdbc:mysql://airawat-mysqlserver-node/employees \
  --username myUID \
  --password myPWD \
  --table employees_export \
  --staging-table employees_exp_stg \
  --clear-staging-table \
  -m 4 \
  --export-dir /user/airawat/sqoop-mysql/Employees
...
13/06/04 09:54:18 INFO manager.SqlManager: Migrated 300024 records from `employees_exp_stg` to `employees_export`

Page 24

Results of Export

Results:

mysql> select * from employees_export limit 1;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date  |
+--------+------------+------------+-----------+--------+------------+
| 200000 | 1960-01-11 | Selwyn     | Koshiba   | M      | 1987-06-05 |
+--------+------------+------------+-----------+--------+------------+

mysql> select count(*) from employees_export;
+----------+
| count(*) |
+----------+
|   300024 |
+----------+

mysql> select * from employees_exp_stg;
Empty set (0.00 sec)

Page 25

Export – Update Mode

Export in update mode.

Prep: I am going to set hire_date to null for some records, to try this functionality out.

mysql> update employees_export set hire_date = null where emp_no > 400000;
Query OK, 99999 rows affected, 65535 warnings (1.26 sec)
Rows matched: 99999  Changed: 99999  Warnings: 99999

Page 26

Now to see if the update worked

Sqoop command: Next, we will export the same data to the same table, and see if the hire date is updated.

$ sqoop export \
  --connect jdbc:mysql://airawat-mysqlserver-node/employees \
  --username myUID \
  --password myPWD \
  --table employees_export \
  --direct \
  --update-key emp_no \
  --update-mode updateonly \
  --export-dir /user/airawat/sqoop-mysql/Employees

Page 27

It Worked!

Results:

mysql> select count(*) from employees_export where hire_date is null;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.22 sec)

Page 28

Export in upsert (Update+Insert) mode

Upsert = insert if the row does not exist, update if it exists.

Page 29

Upsert Command

$ sqoop export \
  --connect jdbc:mysql://airawat-mysqlserver-node/employees \
  --username myUID \
  --password myPWD \
  --table employees_export \
  --update-key emp_no \
  --update-mode allowinsert \
  --export-dir /user/airawat/sqoop-mysql/Employees

Page 30

Exports May Fail Due To

Any of the following can cause an export to fail (a validation sketch follows this list):

Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)

Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value)

Attempting to parse an incomplete or malformed record from the HDFS source data

Attempting to parse records using incorrect delimiters

Capacity issues (such as insufficient RAM or disk space)
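One way to catch a silent partial export is Sqoop's built-in row-count validation. A sketch, reusing the earlier connection details; --validate is a standard Sqoop option for single-table copies and fails the job if the source and target row counts differ:

# Export with post-copy row-count validation
$ sqoop export \
  --connect jdbc:mysql://airawat-mysqlserver-node/employees \
  --username myUID \
  --password myPWD \
  --table employees_export \
  --export-dir /user/airawat/sqoop-mysql/Employees \
  --validate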

Page 31

Sqoop up Healthcare?

Most hospitals today store patient information in relational databases

In order to analyse this data and gain some insight from it, we need to get it into Hadoop.

Sqoop will make that process very efficient.
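A minimal sketch of what that could look like; the hospital-db-host, hospitaldb, patients, and patient_id names are all hypothetical, while --hive-import and --create-hive-table are standard Sqoop flags that land the data directly in a Hive table:

# Hypothetical import of a patients table straight into Hive
$ sqoop import \
  --connect jdbc:mysql://hospital-db-host/hospitaldb \
  --username analyst -P \
  --table patients \
  --split-by patient_id \
  -m 4 \
  --hive-import \
  --create-hive-table \
  --hive-table healthcare.patients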

Page 32

Thank You For Your Time