delmolino - understanding data skew bias

Understanding Data Skew Bias in Application Code

Dominic Delmolino ([email protected])
Director of Database Development, Network Solutions

Hotsos Symposium 2005 / Dallas, Texas USA
March 2005

Agenda

• Network Solutions business profile
• COTS billing system evaluation
• Interlude (commentary on evaluation)
• Implementation harbingers
• The problem – bill runs to produce wholesale invoices
• Vendor dialogs – process examination vs. tuning
• Analysis challenges
• The aha! moment
• Solutions?
• Epilogue

Network Solutions Business Profile

• Domain name rental agent (registrar)
  – Rent (register) your domain name with us on an annual or multi-annual basis
  – Have the opportunity to buy additional services (email, web site) with us for your registered domain
• Most customers interact with web site (retail) via credit cards (7M domains)
• Resellers (wholesalers) also exist and are sent invoices (1.5M domains)
• No seasonality to business
• Major transactions include
  – Establishment of new customer
  – Registration or purchase
  – Renewal
  – Service modification
• “System” processes 500 transactions per second

Cast

• Network Solutions
  – Finance department
  – Software Engineering group
  – System Operations group
  – VP level stakeholders
• Software Vendor
• Implementation Consultants

The Project

• Replace in-house billing system with COTS product
  – Engineering wants to get out of the business of writing a billing package
  – Finance wants more functionality
  – Operations wants professional support
• Requirements driven by Finance needs / wants
  – Perform refunds
  – Accept and process payments
  – Accept paper checks
  – Apply adjustments to account balances
  – Manage invoices

Interlude (commentary on evaluation)

What we did vs. what we learned later

• What we did: Checklist based approach (“Can your package do this?”), which invoked the vendor “yes” reflex
  – What we learned later: Use techniques from HR interviews, e.g. “Tell me how your package performs the following function”
• What we did: No review of requirements prior to evaluation process
  – What we learned later: Need to quantify benefits of each requirement to balance against costs, e.g. processing paper checks
• What we did: No questions about typical reference customer usage profiles – compared only on business size
  – What we learned later: Should have compared business model to that of reference customer to determine match / fit

Product Choice

A billing system package commonly used by telecommunication companies.

Implementation Harbingers

Perspectives on Vendor and Implementation Team

Implementation Decisions: Business Focus vs. Application Focus

Business focus vs. application focus

• Business focus: Majority of transactions are on-line via credit cards
  – Application focus: Application is batch invoice generation oriented, so translate each credit card purchase into a mini batch, even if invoice is never sent
• Business focus: Products and services use a simple price schedule
  – Application focus: Application supports complex charge categories and requires them at the detailed product level

Vendor database recommendations

• Normal = settings we use in our largest online DBs
• Implementation version is 8.1.7
  – 1GB block buffer cache (3x normal)
  – 20MB sort area size, with retained also 20MB (10x normal)
  – 300MB shared pool size (2x normal)
  – hash_join_enabled to FALSE (normal is TRUE)
  – optimizer_index_caching = 0 (normal is 90)
  – optimizer_index_cost_adj = 100 (normal is 40)
• Regular, periodic table analyze using provided SQL script
  – Script analyzes every table using sample of 10,000 rows

We begin to wonder about competency and data skew bias, but of course at this point it is too late to re-open product evaluation.
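One reason a fixed 10,000-row sample makes us uneasy: on a large, skewed table it can easily miss the rare values altogether. The sketch below is only an illustration. It assumes simple random sampling without replacement, which may not match how the analyze script actually samples, and the table size and rare-row count are hypothetical figures, not measurements from our system.

```python
import math

def prob_sample_misses_all(total_rows: int, rare_rows: int, sample_size: int) -> float:
    """Probability that a simple random sample (without replacement)
    contains none of the rare rows: hypergeometric with zero successes."""
    if sample_size > total_rows - rare_rows:
        return 0.0  # the sample is too large to avoid every rare row
    log_p = 0.0
    for i in range(sample_size):
        # chance that draw i is a common row, given all earlier draws were common
        log_p += math.log((total_rows - rare_rows - i) / (total_rows - i))
    return math.exp(log_p)

# Hypothetical figures: a 2.5M-row table where only 100 rows carry an unusual value.
p_miss = prob_sample_misses_all(2_500_000, 100, 10_000)
print(f"chance a 10,000-row sample sees none of the rare rows: {p_miss:.2f}")
```

Under these assumed numbers the sample misses the rare rows about two times in three, so statistics built from it would describe the table as if the skew did not exist.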

Monthly Wholesale Bill Run is Finally Tested

• But only after all configuration is done
• And only after data importation / migration is done

Waterfall approach meant that one of our most important business processes was not tested until the very end of implementation.

• Initial runs are poor, but after changes to application configuration, final test runs take about 12 hours to process 3,000 wholesalers representing roughly 2M end customers with 2.5M products

My bad – I didn’t question why it needed to take this long.

Wholesale Bill Runs become a problem

Operations group has set an upper limit of 24 hours to complete, but bill runs start to routinely exceed that, even though we are reducing the number of wholesalers we bill.

Wholesale Bill Runs processing history

[Chart: Wholesalers vs Time. Monthly bill runs, 1-Sep-03 to 1-Feb-05. Left axis: wholesalers (0 to 3,500). Right axis: hours (0 to 40). Series: Wholesalers and Hours, each with a polynomial trend line.]

[Chart: Services vs Time. Monthly bill runs, 1-Sep-03 to 1-Feb-05. Left axis: active services (0 to 2,500,000). Right axis: hours (0 to 40). Series: Active Services and Hours, each with a polynomial trend line.]

[Chart: Charges vs Time. Monthly bill runs, 1-Sep-03 to 1-Feb-05. Left axis: charges (0 to 600,000). Right axis: hours (0 to 40). Series: Charges and Hours, each with a polynomial trend line.]

[Chart: Customers vs Time. Monthly bill runs, 1-Sep-03 to 1-Feb-05. Left axis: active customers (1,300,000 to 1,750,000). Right axis: hours (0 to 40). Series: Active Customers and Hours, each with a polynomial trend line.]

Initial Vendor Response Approach

• Investigative approach
  – Slowness is due to a h/w or db tuning problem
  – Send us your statspack report
  – Send us sar reports
  – Send us tuxedo statistics
• Initial response
  – Low (92%) hit ratio means you need to 4x your SGA (from 512M to 2G)
  – You need to change hash_join_enabled to false
  – You need to set optimizer_index_caching to 0
  – You need to set optimizer_index_cost_adj to 0

Oh, by the way…

• Are your statistics up to date?
  – We stopped running analyze in April 2004 due to complex vendor views suddenly changing execution plans (for the worse).
  – We have begun to implement plan stability and are only running targeted analyzes.
• You need to run our dbanalyze script
• Not a single question about our business model

What we learn on our own

• 2-3 resource intensive queries
  – One to get charge categories or rate plans on a per product basis
    • Common in the telco industry
    • But we use the same charge category for every product
  – One to query every child customer of a wholesaler to see if their services have any charges in the month
    • Assumption appears to be that every customer has active services
    • Every active service has MANY charges each month, again a telco-ism (lots of calls per phone line or service)
• Analyze has no effect; queries are written with hints and direct assumptions (i.e., filter conditions applied at far ends of join chains).

Our investigation reveals data skew bias

[Diagram labels: Wholesalers / Customers; Services; Charges]

Application (Telco) Assumption vs. NSI Business Data Model

• Telco assumption: Any service could be on any rate plan
  – NSI data model: All services are on the same rate plan
• Telco assumption: Customers are deleted when they terminate their services
  – NSI data model: Customers are always active – can buy new services later
• Telco assumption: Every active service has charges every bill run
  – NSI data model: Each service basically has a 1 in 30 chance of having a charge each bill run

Application SQL reflects data skew bias

Application (Telco) Assumption vs. what the code does

• Telco assumption: Any service could be on any rate plan
  – What the code does: Query all services to determine the different rate plans
• Telco assumption: Customers are deleted when they terminate their services
  – What the code does: Assume all active customers will have charges
• Telco assumption: Every active service has charges every bill run
  – What the code does: Get all services and determine their charges

NSI Business Data Model vs. what the code should do

• NSI data model: All services are on the same rate plan
  – What the code should do: If we use a global rate plan, don’t query the services
• NSI data model: Customers are always active – can buy new services later
  – What the code should do: Don’t assume a customer will have a charge (generate $0 at the end)
• NSI data model: Each service basically has a 1 in 30 chance of having a charge
  – What the code should do: Get all charges and

Sample queries from application

select /*+ ordered use_nl(a,c,sh) */ distinct c.service_id
from account a, charge c, service s
where a.customer_node_id = :b1
and a.account_id = c.account_id
and c.charge_date between :b2 and :b3
and c.service_id = s.service_id
order by 1

Commentary
  – Looping – run once for every customer (:b1)
  – Forces driving off of every customer (ordered and use_nl hints)
  – Analyze won’t have any effect on this query
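To make the looping cost concrete, here is a small sketch that contrasts the per-customer shape of the query above with a single statement that starts from the charges in the billing period. It runs on SQLite from Python rather than on Oracle, and the table layouts, row counts, and the "every 30th customer has a charge" data are hypothetical stand-ins chosen to mirror the 1 in 30 ratio, not the vendor schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table account (account_id integer primary key, customer_node_id integer);
    create table service (service_id integer primary key);
    create table charge  (account_id integer, service_id integer, charge_date text);
""")

# Hypothetical data: 300 customers with one account and one service each;
# only every 30th customer has a charge in the billing period.
for cust in range(1, 301):
    conn.execute("insert into account values (?, ?)", (cust, cust))
    conn.execute("insert into service values (?)", (cust,))
    if cust % 30 == 0:
        conn.execute("insert into charge values (?, ?, '2005-01-15')", (cust, cust))

period = ("2005-01-01", "2005-01-31")

# Shape of the application's approach: one execution per customer.
per_customer_sql = """
    select distinct c.service_id
    from account a, charge c, service s
    where a.customer_node_id = ?
    and a.account_id = c.account_id
    and c.charge_date between ? and ?
    and c.service_id = s.service_id
    order by 1
"""
loop_executions = 0
loop_result = set()
for cust in range(1, 301):
    loop_executions += 1
    for (service_id,) in conn.execute(per_customer_sql, (cust, *period)):
        loop_result.add((cust, service_id))

# Alternative shape: one execution that starts from the charges in the period.
set_based_sql = """
    select distinct a.customer_node_id, c.service_id
    from charge c, account a, service s
    where c.charge_date between ? and ?
    and c.account_id = a.account_id
    and c.service_id = s.service_id
"""
set_result = set(conn.execute(set_based_sql, period))

print(loop_executions, "executions returned", len(loop_result), "rows")
print("1 execution returned", len(set_result), "rows")
```

Both shapes return the same customer and service pairs; the difference is that the loop pays for an execution on every customer, including the many who have nothing to bill.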

Sample queries from application

select distinct charge_category_id
from customer c, service s, service_charge_category scc
where (c.root_customer_node_id = :b1
       or c.customer_node_id = :b1)
and c.customer_node_id = s.customer_node_id
and s.service_id = scc.service_id

Commentary
  – Looping – run once for every customer (:b1)

We use LIOs/join/rows processed to measure statement efficiency

• Statement #1 (getting all services associated with charges for each customer)

call     count      cpu    elapsed       disk      query       rows
------- ------ -------- ---------- ---------- ---------- ----------
Parse    25058     2.98       3.15          0          0          0
Execute  25058     1.36       1.11          0          0          0
Fetch    38710    12.28      74.72      12519    1636259      14282
------- ------ -------- ---------- ---------- ---------- ----------
total    88826    16.62      78.98      12519    1636259      14282

• Also look at rows vs. executions – don’t want to execute if you’re not going to get any rows
• LIOs/join/row processed = roughly 40 (we target <= 10 in our own code)
• Processing customer record even if no charges exist results in low row per execution count

We use LIOs/join/rows processed to measure statement efficiency

• Statement #2 (getting all charge categories)

call     count      cpu    elapsed       disk      query    current       rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse        1     0.02       0.09         14        265          0          0
Execute      1     0.00       0.00          0          0          0          0
Fetch        2     7.75     143.37      29909     427780          0          2
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total        4     7.77     143.46      29923     428045          0          2

• LIOs/join/row processed = over 210,000
• Aggregation is no excuse
• A lot of work to determine that we only use one rate plan
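The efficiency figures can be reproduced from the trace totals with a few lines of arithmetic. This is a sketch under one assumption: the slides do not spell out the divisor behind "LIOs/join/row", so the code shows both LIOs per row and LIOs per row divided by the number of tables joined (3 in both statements). The class and method names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TraceTotals:
    executions: int  # "count" on the Execute line
    lios: int        # "query" (plus "current", if reported) on the total line
    rows: int        # "rows" on the total line
    tables: int      # tables joined in the statement (assumed divisor)

    def rows_per_execution(self) -> float:
        return self.rows / self.executions

    def lios_per_row(self) -> float:
        return self.lios / self.rows

    def lios_per_row_per_table(self) -> float:
        return self.lios_per_row() / self.tables

stmt1 = TraceTotals(executions=25058, lios=1636259, rows=14282, tables=3)
stmt2 = TraceTotals(executions=1, lios=428045, rows=2, tables=3)

for name, s in (("Statement #1", stmt1), ("Statement #2", stmt2)):
    print(f"{name}: {s.rows_per_execution():.2f} rows/execution, "
          f"{s.lios_per_row():,.1f} LIOs/row, "
          f"{s.lios_per_row_per_table():,.1f} LIOs/row/table")
```

With these inputs, Statement #1 comes out at about 38 LIOs per row per table, in line with "roughly 40", and returns well under one row per execution. For Statement #2, the quoted "over 210,000" matches the undivided LIOs per row (about 214,000).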

The disconnected conversations with vendor

• We send them statspack reports and execution plans for the problematic statements
• The conversation:
  – Run dbanalyze
  – “Which plans are not as you expect? Which plans should change?”
  – We don’t know, please run dbanalyze
  – Set hash_join_enabled to false and set optimizer parameters back to defaults
  – “But your statements don’t use hash joins, and can you tell us how the plans will change when we reset the optimizer parameter?”
  – These parameters affect more than execution plans

What every H/W vendor wants to hear…

• More conversation with s/w vendor
  – Disks look slow
  – Network may be slow
  – You should consider a larger server with faster CPU and memory
• The response from our Operations group was priceless:
  – “Are you serious?”

The aha! moment

After 6-8 weeks of back and forth, where we provide more and more evidence that our database and h/w are fine, the vendor account manager finally brings in a functional consultant.

The functional consultant promptly validates our conclusions about application data skew bias

Solutions?

• Our “hack” – we adapt our data to their process
  – Prior to a bill run, temporarily delete end customers (those without any charges) by moving them to a “holding” wholesaler
  – Vendor claims this will take too long and not give any benefit
  – Our results:
    • 1 hour to move
    • Processing largest wholesaler drops from 5 hours to 5 minutes
    • 1 hour to move back
    • Total time should drop from 30 hours to 3 hours
  – We have not implemented this yet due to concerns about directly modifying data
• Vendor analysis turns up configuration items
  – Option to not maintain running subtotal (we don’t use any)
  – Option to not calculate special tariffs (we don’t use any)
  – These changes result in 40% savings in test environment
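A minimal sketch of the "holding wholesaler" move, for anyone who wants to picture the mechanics. It runs on SQLite from Python, and everything in it is hypothetical: the table and column layouts, the `parked` bookkeeping table, the holding wholesaler id, and the sample data. It is not the procedure we tested, and a real version would need to run inside a controlled transaction given the concerns about modifying data directly.

```python
import sqlite3

HOLDING_WHOLESALER = 9999  # hypothetical id of the "holding" wholesaler

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customer (customer_node_id integer primary key,
                           root_customer_node_id integer);
    create table charge   (customer_node_id integer, charge_date text);
    create table parked   (customer_node_id integer primary key,
                           original_root integer);
""")

# Hypothetical data: wholesaler 1 has 60 end customers, two of them with charges.
conn.executemany("insert into customer values (?, 1)", [(c,) for c in range(100, 160)])
conn.executemany("insert into charge values (?, '2005-01-15')", [(100,), (130,)])

def park_idle_customers(conn, start, end):
    """Move end customers with no charges in the period to the holding wholesaler."""
    # Remember where each idle customer came from before moving it.
    conn.execute("""
        insert into parked
        select customer_node_id, root_customer_node_id
        from customer c
        where root_customer_node_id <> ?
        and not exists (select 1 from charge ch
                        where ch.customer_node_id = c.customer_node_id
                        and ch.charge_date between ? and ?)
    """, (HOLDING_WHOLESALER, start, end))
    conn.execute("""
        update customer set root_customer_node_id = ?
        where customer_node_id in (select customer_node_id from parked)
    """, (HOLDING_WHOLESALER,))

def restore_parked_customers(conn):
    """Move parked customers back to their original wholesaler."""
    conn.execute("""
        update customer
        set root_customer_node_id = (select original_root from parked p
                                     where p.customer_node_id = customer.customer_node_id)
        where customer_node_id in (select customer_node_id from parked)
    """)
    conn.execute("delete from parked")

def customers_under(conn, wholesaler):
    """Count the end customers currently attached to a wholesaler."""
    return conn.execute("select count(*) from customer where root_customer_node_id = ?",
                        (wholesaler,)).fetchone()[0]

park_idle_customers(conn, "2005-01-01", "2005-01-31")
during_run = customers_under(conn, 1)   # what the bill run would loop over
restore_parked_customers(conn)
after_run = customers_under(conn, 1)
print(during_run, after_run)
```

In this toy data the bill run would visit 2 customers instead of 60, and the wholesaler is back to its full 60 customers once the parked rows are restored.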

Other avoidance strategies?

Email me at [email protected]

Epilogue

• Homegrown applications are not immune to this problem• Consider web page screen design

Epilogue

• During our requirement construction process, our UI team prepares screen specifications
• These screen specs have samples of how to display data
• The problem?
  – All screen specs are created and distributed on 8 ½” x 11” paper
  – Sample data chosen to fit paper size
  – Design assumptions perpetuate bias toward result set sizes which match sample data sizes – no concept of pagination until it’s too late.

Questions & Answers

Thank you
