Random Query Generator for Hive November 2015 Hive Contributor Meetup Szehon Ho


Page 1

Random Query Generator for Hive
November 2015 Hive Contributor Meetup

Szehon Ho

Page 2

© 2014 Cloudera, Inc. All rights reserved.

Overview
• Collaboration with the Impala team; work to run against Hive
• Automates generation of test cases, which solves:
  • Humans can only generate so many test queries
  • Humans focus on positive queries (what about machine-generated queries?)
• Idea is to have two databases: test (Hive, Impala) and reference database (Postgres, MySQL, Oracle)
• Generate random data, issue random queries against both

Page 3

Data Generator
• Table-count (max, min)
• Column-count (max, min)
• Row-count (max, min)

Column Data Types
Boolean     Float
TinyInt     Decimal(r_precision, r_scale)
SmallInt    Char(r_length)
BigInt      Varchar(r_length)
Double      Timestamp
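The generator's three knobs and its type list can be illustrated with a minimal Python sketch. Everything here apart from the type names is an assumption made for illustration: the `Column` and `Table` classes, the `generate_schema` function, and the upper bounds chosen for precision and length do not come from the talk.

```python
import random
from dataclasses import dataclass

# Type names from the slide; parameterised types get random arguments below.
SIMPLE_TYPES = ["Boolean", "TinyInt", "SmallInt", "BigInt",
                "Double", "Float", "Timestamp"]


@dataclass
class Column:
    name: str
    type_name: str


@dataclass
class Table:
    name: str
    columns: list
    row_count: int


def random_type(rng):
    """Pick one of the slide's column types, filling in random parameters."""
    kind = rng.choice(SIMPLE_TYPES + ["Decimal", "Char", "Varchar"])
    if kind == "Decimal":
        precision = rng.randint(1, 38)   # upper bound is an assumption
        scale = rng.randint(0, precision)
        return f"Decimal({precision}, {scale})"
    if kind in ("Char", "Varchar"):
        return f"{kind}({rng.randint(1, 255)})"   # upper bound is an assumption
    return kind


def generate_schema(rng, table_range, column_range, row_range):
    """Each range is a (min, max) pair, one per knob on the slide."""
    tables = []
    for t in range(rng.randint(*table_range)):
        columns = [Column(f"col_{c}", random_type(rng))
                   for c in range(rng.randint(*column_range))]
        tables.append(Table(f"table_{t}", columns, rng.randint(*row_range)))
    return tables


schema = generate_schema(random.Random(0), (1, 3), (2, 5), (10, 100))
```

Passing a seeded `random.Random` keeps a run reproducible, which matters when the same data has to be loaded into both the test and the reference database.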

Page 4

Query Generator
1. Generate QueryModel based on QueryProfile
2. ModelTranslator to translate from Model to database’s SQL dialect
3. Execute the SQL via DbConnectors
4. Result comparison (sort if unsorted)

[Diagram: profiles (HiveProfile, ImpalaProfile) → QueryModel → translators. HiveTranslator → HiveQL for the “test databases”; PostgresTranslator → SQL (Postgres dialect) and MysqlTranslator → SQL (Mysql dialect) for the “reference databases”.]
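The four steps can be sketched end to end in Python. This is only an illustration under stated assumptions: two in-memory SQLite databases stand in for the test and reference databases, the dictionary-based model and the `translate`, `sort_key` and `results_match` functions are hypothetical, and the "dialects" differ only in identifier quoting (SQLite accepts both styles).

```python
import sqlite3


def translate(model, dialect):
    """Hypothetical translator: render a tiny query model as SQL text."""
    quote = "`" if dialect == "hive" else '"'
    cols = ", ".join(f"{quote}{c}{quote}" for c in model["select"])
    return f"SELECT {cols} FROM {quote}{model['from']}{quote}"


def sort_key(row):
    # None cannot be ordered against other values in Python 3,
    # so each value is paired with a flag that places NULLs first.
    return tuple((value is not None, value) for value in row)


def results_match(test_rows, ref_rows, ordered):
    """Compare result sets; sort both first when the query has no ORDER BY."""
    if not ordered:
        test_rows = sorted(test_rows, key=sort_key)
        ref_rows = sorted(ref_rows, key=sort_key)
    return list(test_rows) == list(ref_rows)


def make_db(rows):
    """Stand-in for a DbConnector: an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return conn


rows = [(1, "x"), (None, "y"), (2, None)]
test_db = make_db(rows)
ref_db = make_db(list(reversed(rows)))   # same data, different physical order

model = {"select": ["a", "b"], "from": "t"}
test_rows = test_db.execute(translate(model, "hive")).fetchall()
ref_rows = ref_db.execute(translate(model, "postgres")).fetchall()
match = results_match(test_rows, ref_rows, ordered=False)
```

Sorting before comparison is what lets an unordered query pass even though the two engines return rows in different orders; a query with an ORDER BY would be compared with `ordered=True` instead.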

Page 5

Query Model, High Level

[Diagram: Query → Clause → Constant/Col, Funcs, TableExpr]

• Represents a valid SQL query
• Query consists of one or more clauses (from, select, group-by, union)
• Clause has one or more expressions (constants, columns, functions of columns, tables), different for different clause types
• Model is recursive in nature:
  • Funcs can be run on output of other funcs
  • Union clause can contain another query
  • Some boolean funcs can contain a subquery
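The recursive shape of the model can be made concrete with a few Python dataclasses. This is a sketch, not the tool's actual classes: the names `Constant`, `Column`, `Func`, `Query` and the helper `nesting_depth` are assumptions chosen to mirror the three kinds of recursion listed above.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Constant:
    value: object


@dataclass
class Column:
    name: str


@dataclass
class Func:
    name: str
    args: List["Expr"]          # arguments may themselves be Funcs or Queries


@dataclass
class Query:
    select: List["Expr"]
    from_table: str
    where: Optional[Func] = None
    union: Optional["Query"] = None   # a union clause holds another query


Expr = Union[Constant, Column, Func, Query]


def nesting_depth(node):
    """Count how many Func/Query levels are nested below (and including) node."""
    if isinstance(node, Func):
        return 1 + max((nesting_depth(a) for a in node.args), default=0)
    if isinstance(node, Query):
        children = list(node.select)
        if node.where is not None:
            children.append(node.where)
        if node.union is not None:
            children.append(node.union)
        return 1 + max((nesting_depth(c) for c in children), default=0)
    return 0


subquery = Query(select=[Column("id")], from_table="bar")
query = Query(
    select=[Func("abs", [Func("floor", [Column("x")])])],   # func on a func
    from_table="foo",
    where=Func("in", [Column("id"), subquery]),              # subquery in a boolean func
    union=Query(select=[Column("x")], from_table="baz"),     # query in a union
)
depth = nesting_depth(query)
```

A generator built on such a model would typically cap the nesting depth so that random queries stay finite and readable.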

Page 6

Query Model, Funcs
• Func types:
  • Boolean funcs (isnull, and, or, in, =, !=, >, <)
  • Subquery funcs (exists, not exists, in, not in): may contain another Query
  • Val funcs (Trim, Length, Concat, Add, Abs, Floor, Ceil, Greatest, Least, etc.)
  • Agg funcs (e.g., Max, Min, Sum, Avg, Count)
  • Analytic funcs (Rank, DenseRank, RowNumber, Lead, Lag, FirstValue, LastValue, Max, Min, etc.)
    • Window specification (“rows between x and y”, “rows unbounded preceding”, etc.)
    • PartitionByClause (“over (partition by x)”)
    • OrderByClause
• Rules to determine where to use a func, based on func type and return type
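The last point, choosing a func by its type and return type, might look roughly like the following Python sketch. The catalogue `FUNCS`, the `Sig` tuple, the column names and the specific rule (aggregates are allowed only at the top of an expression, and never in a WHERE predicate) are illustrative assumptions; the talk does not spell out the real rules.

```python
import random
from collections import namedtuple

# name, func type, return type, argument types (a small hypothetical catalogue)
Sig = namedtuple("Sig", "name kind returns arg_types")

FUNCS = [
    Sig("Trim", "val", "string", ("string",)),
    Sig("Concat", "val", "string", ("string", "string")),
    Sig("Length", "val", "int", ("string",)),
    Sig("Abs", "val", "int", ("int",)),
    Sig(">", "boolean", "bool", ("int", "int")),
    Sig("=", "boolean", "bool", ("string", "string")),
    Sig("Sum", "agg", "int", ("int",)),
    Sig("Count", "agg", "int", ("string",)),
]
COLUMNS = {"int": "int_col", "string": "str_col", "bool": "bool_col"}
INFIX = {">", "="}


def random_expr(rng, return_type, allowed_kinds, max_depth):
    """Build a random expression of the requested return type."""
    candidates = [f for f in FUNCS
                  if f.returns == return_type and f.kind in allowed_kinds]
    if max_depth == 0 or not candidates:
        return COLUMNS[return_type]       # fall back to a plain column
    func = rng.choice(candidates)
    # In this sketch aggregates may only appear at the top of an expression,
    # so they are removed from the kinds allowed for arguments.
    arg_kinds = set(allowed_kinds) - {"agg"}
    args = [random_expr(rng, t, arg_kinds, max_depth - 1)
            for t in func.arg_types]
    if func.name in INFIX:
        return f"({args[0]} {func.name} {args[1]})"
    return f"{func.name}({', '.join(args)})"


# A WHERE predicate must be boolean and may not contain aggregates.
where_expr = random_expr(random.Random(1), "bool", {"boolean", "val"}, 3)
# A select item may be an aggregate or a value func.
select_expr = random_expr(random.Random(1), "int", {"val", "agg"}, 3)
```

Matching on return type keeps every generated expression well typed, while the set of allowed kinds encodes where in the query the expression is being placed.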

Page 7

QueryModel: Clauses
• QueryModel
  • WithClause
  • SelectClause
  • FromClause: Table Expression
  • WhereClause:
    • Predicate (Boolean expr)
  • GroupByClause: if Select (Basic or AggFunc)
  • HavingClause: if Select (AggFunc)
    • Predicate (Boolean expr)
  • UnionClause (Query)
  • OrderByClause
  • LimitClause

• SelectClause, List of Expr’s:
  • Constant
  • Col
  • Val Funcs
  • AggFunc
  • AnalyticFunc
    • Window
    • PartitionByClause
    • OrderByClause

WithClause: Adds a table expression:
“With bar as (select * from foo) select * from bar;”

GroupByClause, List of:
• Constant
• Col

OrderByClause, List of:
• Constant
• Col
• Func
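The conditions attached to GroupByClause and HavingClause can be read in more than one way. The Python sketch below follows one reading, stated here as an assumption: a GROUP BY is emitted when the select list mixes plain items with aggregates (standard SQL requires the plain items to be grouped), and a HAVING predicate is only permitted when the select list contains an aggregate. The function names are hypothetical.

```python
def optional_clauses(select_items):
    """select_items is a list of (sql_text, is_aggregate) pairs."""
    has_agg = any(is_agg for _, is_agg in select_items)
    basic = [sql for sql, is_agg in select_items if not is_agg]
    # Group by the non-aggregated items only when aggregates are also present.
    group_by = basic if (has_agg and basic) else []
    return {"group_by": group_by, "having_allowed": has_agg}


def render(select_items, table, having=None):
    """Assemble the query text in clause order: select, from, group by, having."""
    rules = optional_clauses(select_items)
    parts = ["SELECT " + ", ".join(sql for sql, _ in select_items),
             "FROM " + table]
    if rules["group_by"]:
        parts.append("GROUP BY " + ", ".join(rules["group_by"]))
    if having is not None:
        if not rules["having_allowed"]:
            raise ValueError("HAVING needs an aggregate in the select list")
        parts.append("HAVING " + having)
    return " ".join(parts)


grouped = render([("dept", False), ("max(salary)", True)], "emp",
                 having="max(salary) > 10")
plain = render([("dept", False)], "emp")
```

Note that a select list made only of aggregates produces a HAVING without a GROUP BY under this reading, which is the kind of query that exposes dialect differences between engines.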

Page 8

QueryModel: Joins
• QueryModel
  • WithClause
  • SelectClause
  • FromClause:
    • Multiple table expressions
    • JoinClause (define table relationship)
  • WhereClause:
    • Predicate (Boolean function, using expr from tables in JoinClause)
  • GroupByClause
  • HavingClause

• JoinClause Types:
  • Inner
  • Left
  • Right
  • Left semi
  • Right semi
  • Right anti
  • Full outer
  • Cross

Page 9

Demo

Page 10

Results 1: HiveQL Discrepancies
• Language Deficiencies (as of Hive 1.1)
  • Support “Interval” for date arithmetic operations: date + INTERVAL expr unit
  • With {…} cannot be used in subquery
  • Having must have a group by
  • Cannot sort by two expressions in window function, unless window specified
  • Negative lag or lead amount not allowed
  • Only “Union all” and not “Union” (since fixed)
• Null Ordering
  • Hive lacks syntax for specifying null order (its default is the opposite of Postgres)
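One way to keep null ordering from producing false mismatches is to make the reference database imitate the test database. The Python sketch below assumes that Hive sorts NULLs as the smallest value (first when ascending, last when descending), which is the opposite of the Postgres default, and has a hypothetical translator helper `order_by_item` spell that order out for Postgres using its `NULLS FIRST` / `NULLS LAST` syntax.

```python
def order_by_item(expr, descending, dialect):
    """Render one ORDER BY item, pinning null order on the reference side."""
    direction = "DESC" if descending else "ASC"
    if dialect == "postgres":
        # Assumption: Hive treats NULL as the smallest value, so ascending
        # sorts put NULLs first and descending sorts put them last.
        nulls = "NULLS LAST" if descending else "NULLS FIRST"
        return f"{expr} {direction} {nulls}"
    # Hive (as of 1.1) has no syntax for this, so nothing is added.
    return f"{expr} {direction}"


postgres_sql = "ORDER BY " + order_by_item("price", False, "postgres")
hive_sql = "ORDER BY " + order_by_item("price", False, "hive")
```

The alternative is to leave both engines at their defaults and have the result comparison ignore the position of NULL rows, at the cost of a weaker check.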

Page 11

Results 2: JIRAs so far
• Many valid issues found, fixed since 1.1
  • HIVE-12082: Null comparison for greatest and least operator
  • HIVE-12070: Relax type restrictions on ‘Greatest’ and ‘Least’
  • HIVE-11737: IndexOutOfBounds compiling query with duplicated groupby keys
  • HIVE-11712: Duplicate groupby keys cause ClassCastException
  • HIVE-11835: Type decimal(1,1) reads 0.0, 0.00, etc from text file as NULL
  • HIVE-12296: ClassCastException when selecting constant in inner select (pending)

Page 12

Going Forward
• Tackle non-SQL-92 query-support
• Nested Types
• Partitioned tables
• Multi-insert

Page 13

Thank you.