Big Data Testing
Emerging trends in DW Testing
• Big data
• Mobile BI
Big Data
• 3 Vs: Volume, Velocity and Variety
• Silent 4th V: “Value”
Hadoop – Big data framework
• Hadoop is an open-source framework for handling big data processing (inspired by papers published by Google on GFS and MapReduce).
• Its 2 main components are:
• HDFS – Hadoop Distributed File System
• MapReduce – Programming framework for aggregations and counts, written in Java.
• Other tools supported by Hadoop:
• Apache Pig – Higher-level abstraction language for processing data.
• Apache Hive – SQL-like language to process data stored in Hadoop. Not fully SQL compliant.
HDFS key points
• HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce.
• By distributing storage and computation across many servers, the combined storage resource can grow with demand while remaining economical at every size.
• Files are split into blocks (the single unit of storage)
– Managed by the Namenode, stored by Datanodes
– Transparent to the user
• Blocks are replicated across machines at load time
– The same block is stored on multiple machines
– Good for fault tolerance and access
– Default replication is 3
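As a quick sanity check of the numbers above, the block count and raw storage footprint for a file can be computed directly. A minimal sketch (128 MB is the default block size in Hadoop 2.x, 64 MB in older releases, and replication defaults to 3; all are configurable):

```java
public class HdfsBlockMath {
    // Number of HDFS blocks a file occupies: ceiling of fileSize / blockSize.
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes consumed across the cluster once every block is replicated.
    static long rawStorage(long fileSizeBytes, int replicationFactor) {
        return fileSizeBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        long blockSize = 128L * 1024 * 1024;              // Hadoop 2.x default
        System.out.println(numBlocks(oneGiB, blockSize)); // 8 blocks
        System.out.println(rawStorage(oneGiB, 3));        // 3 GiB on disk with replication 3
    }
}
```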
HDFS CLI
• mkdir – takes path URIs as arguments and creates the directory or directories.
• hadoop fs -mkdir <paths>
• E.g. hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
  hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir
• ls – lists the contents of a directory.
• hadoop fs -ls <args>
• E.g. hadoop fs -ls /user/hadoop/dir1
• put – copies a single src file, or multiple src files, from the local file system to the Hadoop file system.
• hadoop fs -put <localsrc> ... <HDFS_dest_Path>
• get – copies/downloads files from HDFS to the local file system.
• hadoop fs -get <hdfs_src> <localdst>
Sample MapReduce (word count)
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
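The flow of the two classes above can be mimicked without a cluster: the map step emits (word, 1) pairs, the framework groups them by key, and the reduce step sums each group. A dependency-free sketch of that pipeline in plain Java (no Hadoop classes):

```java
import java.util.*;

public class WordCountSim {
    // Mimics Map + shuffle + Reduce for word count on an in-memory "file".
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tok = new StringTokenizer(line); // same tokenizer as the Mapper
            while (tok.hasMoreTokens()) {
                // merge() plays the role of the shuffle plus the Reducer's sum
                counts.merge(tok.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data big", "data testing");
        System.out.println(wordCount(lines)); // {big=2, data=2, testing=1}
    }
}
```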
Sample Apache Pig query
• inpt = load '/path/to/corpus' using TextLoader as (line: chararray);
  words = foreach inpt generate flatten(TOKENIZE(line)) as word;
  grpd = group words by word;
  cntd = foreach grpd generate group, COUNT(words);
  dump cntd;
• The Pig script above first splits each line into words using the TOKENIZE function, which produces a bag of words. The FLATTEN operator then turns the bag into individual tuples. In the third statement, the words are grouped together so that the count can be computed in the fourth.
Sample Hive query
• SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [HAVING having_condition]
  [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
  [LIMIT number]
Other Big data technologies
• NoSQL – characteristics:
• 1. No fixed schema
• 2. Highly scalable
• 3. Availability
• 4. Speed of access
• MongoDB (document-oriented), Cassandra, HBase (Hadoop-based)
• All support a Java programming API; two of them (Cassandra and HBase) are themselves written in Java.
• Each has its own query language – e.g. CQL for Cassandra, the HBase shell for HBase.
• This technology deserves an entire presentation of its own.
Reference architecture
Different layers
• 1. Source-to-Hadoop data ingestion (EL)
• 2. Hadoop processing using MapReduce, Pig, Hive (T)
• 3. Loading of processed data from Hadoop to the EDW
• 4. BI reports and analytics using the processed data stored in Hadoop, via Hive
Validation of pre-Hadoop processing
• Data from various sources such as weblogs, social media and transactional systems is loaded into Hadoop.
• Methodology:
• Compare the input data files against the source system data to ensure data is extracted correctly.
• Validate that the files are loaded into HDFS correctly.
• Compare using a script that reads data from HDFS (Java API) and checks it against the respective source data.
• Sample testing. But identifying the sample can be a challenge since the volume is huge; consider critical business scenarios.
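One way to script the source-vs-HDFS comparison described above is a line-by-line diff. In this hedged sketch both sides are in-memory lists; in a real check the target side would be streamed from HDFS via the Java API (FileSystem.open()):

```java
import java.util.*;

public class IngestCompare {
    // Returns the 0-based indices of lines that differ between source and target,
    // including lines present on only one side.
    static List<Integer> mismatchedLines(List<String> source, List<String> target) {
        List<Integer> bad = new ArrayList<>();
        int max = Math.max(source.size(), target.size());
        for (int i = 0; i < max; i++) {
            String s = i < source.size() ? source.get(i) : null;
            String t = i < target.size() ? target.get(i) : null;
            if (!Objects.equals(s, t)) bad.add(i); // record the mismatched row
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> src = Arrays.asList("1,alice", "2,bob", "3,carol");
        List<String> tgt = Arrays.asList("1,alice", "2,BOB", "3,carol");
        System.out.println(mismatchedLines(src, tgt)); // [1]
    }
}
```

For sample testing, the same routine can be run on just the rows selected for the critical business scenarios instead of the full extract.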
Code for accessing the HDFS file system:

public static void main(String[] args) throws IOException {
    FileSystem hdfs = FileSystem.get(new Configuration());
    Path homeDir = hdfs.getHomeDirectory();
    // Print the home directory
    System.out.println("Home folder - " + homeDir);
}

Code for copying a file from the local file system to HDFS:

Path localFilePath = new Path("c://localdata/datafile1.txt");
Path hdfsFilePath = new Path("hdfs://localhost:9000/test/file1.txt");
hdfs.copyFromLocalFile(localFilePath, hdfsFilePath);

Copying a file from HDFS to the local file system:

localFilePath = new Path("c://hdfsdata/datafile1.txt");
hdfs.copyToLocalFile(hdfsFilePath, localFilePath);

Reading data from an HDFS file:

BufferedReader bfr = new BufferedReader(new InputStreamReader(hdfs.open(hdfsFilePath)));
String str = null;
while ((str = bfr.readLine()) != null) {
    System.out.println(str);
}
Validation of Hadoop processing
• Validate the business logic in standalone mode.
• Validate the MapReduce process to verify that the key–value pairs are generated correctly.
• Validate the output data against the source files to ensure data processing is done correctly.
• Write MRUnit (a MapReduce testing framework) tests – low-level unit testing.
• Alternate approach: write Pig scripts that implement the same business logic and verify the results generated by MapReduce against them.
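MRUnit drives the real Mapper/Reducer classes, but the idea can be illustrated without Hadoop on the classpath: isolate the map logic as a pure function and assert on the emitted (key, value) pairs. This is a hedged stand-in for an MRUnit MapDriver test, not MRUnit itself:

```java
import java.util.*;

public class MapLogicTest {
    // The word-count Mapper's core logic, extracted:
    // one input line -> list of (word, 1) pairs, in input order.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(line);
        while (tok.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tok.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = mapLine("to be or not to be");
        System.out.println(pairs.size());          // 6 emitted pairs
        System.out.println(pairs.get(0).getKey()); // "to"
    }
}
```

With MRUnit the same assertion would be expressed by feeding the input record to a MapDriver and declaring the expected outputs.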
Validation of data extract and load into the EDW (Sqoop)
• Validating that the transformation rules are applied correctly
• Validating that there is no data corruption by comparing target table data against HDFS file data. (Hive can be used )
• Validating the aggregation of data and data integrity.
Validation of reports (Hive/EDW)
• Validating by firing queries against HDFS using Hive
• Validating by running SQL against the EDW
• Otherwise, the normal report-testing approach applies.
Challenges
• Volume – prepare comparison scripts to validate the data.
• To reduce the time, the comparison scripts can all be run in parallel, just as the data itself is processed in MapReduce. https://community.informatica.com/solutions/file_table_compare_utility_hdfs_hive
• Sample the data while ensuring the maximum number of scenarios is covered.
• Variety – unstructured data can be converted to structured form and then compared.
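The parallel-comparison idea above can be sketched with a thread pool: split the data into partitions and compare each partition on its own thread, much as MapReduce splits its input. An illustrative sketch with made-up in-memory partitions, not a production utility:

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelCompare {
    // Compares corresponding partitions of two datasets concurrently and
    // returns the indices of partitions whose contents differ.
    static List<Integer> diffPartitions(List<List<String>> src, List<List<String>> tgt)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Boolean>> checks = new ArrayList<>();
            for (int i = 0; i < src.size(); i++) {
                final int p = i;
                // Each partition comparison runs as its own task.
                checks.add(pool.submit(() -> src.get(p).equals(tgt.get(p))));
            }
            List<Integer> bad = new ArrayList<>();
            for (int i = 0; i < checks.size(); i++) {
                if (!checks.get(i).get()) bad.add(i); // partition i mismatched
            }
            return bad;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> src = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
        List<List<String>> tgt = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("X"));
        System.out.println(diffPartitions(src, tgt)); // [1]
    }
}
```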
Non Functional testing
• Performance testing• Failover testing• Realtime support
Summary
• 1. Data ingestion into Hadoop via Sqoop, Flume, Kafka
• 2. Data processing within Hadoop using MapReduce, Pig, Hive (or ETL tools like Informatica Big Data Edition, Talend)
• 3. Reporting using tools like Tableau and MicroStrategy (via Hive)
• 4. Loading of data from Hadoop into an EDW (Teradata/Oracle) or analytical database (Greenplum/Netezza)
Why Hadoop is gaining popularity
• 1. Open source
• 2. Can run on commodity servers
• 3. Can support any type of unstructured data (which makes it win over parallel databases)
• 4. Fault-tolerant
Use cases
• 1. ETL processing moved to Hadoop to take advantage of processing structured and unstructured data together
• 2. Machine learning over Hadoop, e.g. recommendation engines (Amazon, Flipkart)
• 3. Fraud detection in the credit card or insurance industries
• 4. Retail – understanding customers' buying patterns, market basket analysis, etc.
Emerging trends in Big data
• 1. YARN – a generic framework over Hadoop which allows applications other than MapReduce to be built on Hadoop, and also allows integration of other applications with Hadoop
• 2. Stream processing using Storm (again, open source)
• 3. In-memory cluster computing using Apache Spark over Hadoop clusters
Data Mining in Hadoop – Market Basket Analysis
• Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.
• Collect the pairs of items that most frequently occur together in transactions at a store.
Initial data
• Transaction 1: cracker, icecream, soda
• Transaction 2: chicken, pizza, coke, bread
• Transaction 3: baguette, soda, herring, cracker, soda
• Transaction 4: bourbon, coke, turkey, bread
• Transaction 5: sardines, soda, chicken, coke
• Transaction 6: apples, peppers, avocado, steak
• Transaction 7: sardines, apples, peppers, avocado, steak
• …
Data setup
• < (cracker, icecream), (cracker, soda), (icecream, soda) >
• < (chicken, pizza), (chicken, coke), (chicken, bread), (pizza, coke) ….. >
• < (baguette, soda), (baguette, herring), (baguette, cracker), (baguette, soda) >
• < (bourbon, coke), (bourbon, turkey) >
• < (sardines, soda), (sardines, chicken), (sardines, coke) >
Map phase
• < ((cracker, icecream), 1), ((cracker, soda), 1), ((icecream, soda), 1) >
• < ((chicken, pizza), 1), ((chicken, coke), 1), ((chicken, bread), 1), … ((cracker, soda), 1) >
• Output – ((key), value) pairs:
• ((cracker, icecream), 1)
• ((cracker, soda), 1)
• ((cracker, soda), 1)
Reduce phase
• Input:
• ((cracker, icecream), <1,1,1….>)
• ((cracker, soda), <1,1>)
• Output:
• ((cracker, icecream), 540)
• ((cracker, soda), 240)
Result:
1. Icecream should be placed near crackers.
2. Offer combo deals on the combination to increase sales.
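The whole map-and-reduce flow above fits in a few lines of plain Java: generate every item pair per transaction (map), then sum the 1s per pair (reduce). A dependency-free sketch on a toy dataset (the counts it produces are illustrative, not the 540/240 from the slide):

```java
import java.util.*;

public class MarketBasketSim {
    // Counts co-occurring item pairs across transactions.
    // Map step: each transaction emits ((a,b), 1) for every unordered item pair.
    // Reduce step: merge() sums the 1s per pair key.
    static Map<String, Integer> pairCounts(List<List<String>> transactions) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> txn : transactions) {
            for (int i = 0; i < txn.size(); i++) {
                for (int j = i + 1; j < txn.size(); j++) {
                    String pair = "(" + txn.get(i) + "," + txn.get(j) + ")";
                    counts.merge(pair, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> txns = Arrays.asList(
            Arrays.asList("cracker", "icecream", "soda"),
            Arrays.asList("cracker", "soda"));
        System.out.println(pairCounts(txns));
        // {(cracker,icecream)=1, (cracker,soda)=2, (icecream,soda)=1}
    }
}
```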
How to test the above code
• Create a sample dataset covering different types of item combinations and compute the expected counts (e.g. using Excel).
• Run the MapReduce job flow for MBA.
• Verify that the results from steps 1 and 2 match.
What next
• 1. Learn Java and MapReduce (not mandatory, but will definitely help)
• 2. Learn Hadoop (Hadoop: The Definitive Guide by Tom White; Hadoop in Action by Chuck Lam)
• 3. Install a free VM (Cloudera/Hortonworks)
• 4. Learn some Pig and Hive.
• 5. Plenty of tutorials on the net.
Thank you
• Questions?