Sqoop 2 refactoring for generic data transfer - NYC Sqoop Meetup
TRANSCRIPT
Sqoop 2: Refactoring for generic data transfer
Abraham Elmahrek
Cloudera Ingest!
Introduction to Sqoop 2
• Ease of use – Provide a REST API and a Java API for easy integration. Existing clients include a Hue UI and a command-line client.
• Extensible – Provide a connector SDK and focus on pluggability. Existing connectors include the Generic JDBC connector and the HDFS connector.
• Security – Emphasize separation of responsibilities. Eventually have ACLs or RBAC.
Life of a Request
• Client
  – Talks to the server over REST + JSON
  – Does nothing but send requests
• Server
  – Extracts metadata from the data source
  – Delegates to the execution engine
  – Does all the heavy lifting, really
• MapReduce
  – Parallelizes execution of the job
Workflow
Job Types
IMPORT into Hadoop and EXPORT out of Hadoop
Responsibilities
Connector responsibilities vs. Sqoop framework responsibilities
Transfer data from Connector A to Hadoop
Connector Definitions
• Connectors define:
  – How to connect to a data source
  – How to extract data from a data source
  – How to load data into a data source
public Importer getImporter(); // Supply extract method
public Exporter getExporter(); // Supply load method
public Class getConnectionConfigurationClass();
public Class getJobConfigurationClass(MJob.Type type); // MJob.Type is IMPORT or EXPORT
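The signatures above can be fleshed out as a minimal connector skeleton. This is only a sketch of the shape the slide describes; the class and configuration names below are illustrative stand-ins, not the actual SDK types.

```java
// Illustrative sketch of the pre-refactoring connector shape: an importer
// (extract side), an exporter (load side), and configuration classes the
// framework introspects. All names here are hypothetical.
public class SketchConnector {
    public enum JobType { IMPORT, EXPORT }

    // Marker classes standing in for the SDK's extract/load callbacks.
    public static class Importer {}
    public static class Exporter {}

    // Configuration beans; the framework would render these as forms.
    public static class ConnectionConfig { String jdbcUrl; }
    public static class ImportJobConfig { String tableName; }
    public static class ExportJobConfig { String targetDir; }

    public Importer getImporter() { return new Importer(); } // supply extract method
    public Exporter getExporter() { return new Exporter(); } // supply load method

    public Class<?> getConnectionConfigurationClass() {
        return ConnectionConfig.class;
    }

    public Class<?> getJobConfigurationClass(JobType type) {
        return type == JobType.IMPORT ? ImportJobConfig.class
                                      : ExportJobConfig.class;
    }
}
```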
Intermediate Data Format
• Describe a single record as it moves through Sqoop
• Currently available:
  – CSV

col1,col2,col3,...
col1,col2,col3,...
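A toy sketch of how an intermediate data format might round-trip a single record through CSV. Names are hypothetical; the real CSV intermediate data format also handles quoting, escaping, NULL markers, and per-type encoding.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of a CSV intermediate data format: one record in,
// one comma-separated line out, and back again.
public class CsvRecord {
    public static String toCsv(List<String> fields) {
        return String.join(",", fields);
    }

    public static List<String> fromCsv(String line) {
        return Arrays.asList(line.split(",", -1)); // -1 keeps trailing empty fields
    }
}
```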
What’s Wrong w/ the Current Implementation?

• Hadoop as a first-class citizen prevents transfers between components of the Hadoop ecosystem
  – HBase to HDFS not supported
  – HDFS to Accumulo not supported
• Hadoop ecosystem not well defined
  – Accumulo was not considered part of the Hadoop ecosystem
  – What’s next? Kafka?
Refactoring
• Connectors already defined extractors and loaders
  – Refactor the connector SDK
• Pull HDFS integration out into a connector
• Improve Schema integration
Transfer data from Connector A to Connector B
Connector SDK
• Connectors assume all roles
• Add a Direction: FROM and TO
• Initializers and destroyers for both directions
Connector responsibilities
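The FROM/TO refactoring can be sketched as a Direction enum with per-direction lifecycle hooks: the same connector can serve as the extract side (FROM) or the load side (TO), each with its own initializer and destroyer. All names here are illustrative, not the actual SDK API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the refactored connector shape: one Direction enum replaces
// the importer/exporter split, and each direction gets its own
// initializer and destroyer.
public class DirectionalConnector {
    public enum Direction { FROM, TO }

    final List<String> lifecycleLog = new ArrayList<>();

    public void initialize(Direction d) {
        lifecycleLog.add("init:" + d);    // e.g. open connections, read schema
    }

    public void destroy(Direction d) {
        lifecycleLog.add("destroy:" + d); // e.g. commit, close connections
    }
}
```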
HDFS Connector
• Move the Hadoop role into a connector
• Schemaless
• Data formats
  – Text (CSV)
  – Sequence
  – etc.
Schema Improvements
• Schema per connector
• Intermediate data format (IDF) has a Schema
• Introduce a matcher
• Schema represents data as it moves through the system
Matcher
• Matcher ensures data goes to the right place
• Combinations:
  – FROM and TO schema
  – FROM schema only
  – TO schema only
  – No schema = error
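The combinations above can be sketched as a simple selection function: with at least one schema present a match can be attempted, and with neither the job must fail. This is illustrative only, not the actual matcher API.

```java
// Sketch of choosing a matching strategy from the four schema combinations.
public class MatcherChooser {
    public static String choose(boolean hasFrom, boolean hasTo) {
        if (hasFrom && hasTo) return "match FROM onto TO";
        if (hasFrom)          return "use FROM schema for both sides";
        if (hasTo)            return "use TO schema for both sides";
        throw new IllegalArgumentException("no schema on either side");
    }
}
```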
Matcher
Ensures that the FROM schema matches the TO schema by the index location of each column in the Schema
Matcher types:
– Location
– Name
– User defined
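A toy sketch of location-based matching, where the i-th FROM column feeds the i-th TO column; name- and user-defined matching would map by column name or by an explicit mapping instead. Illustrative only, not the SDK API.

```java
import java.util.Arrays;

// Sketch of location (index) matching: a record is passed through purely
// by column position, truncated or null-padded to the TO column count.
public class LocationMatch {
    public static Object[] match(Object[] fromRecord, int toColumnCount) {
        return Arrays.copyOf(fromRecord, toColumnCount); // pads with null if TO is wider
    }
}
```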
Thank you