


Page 1: Hash Objects – Why Use Them?

Copyright © 2008, SAS Institute Inc. All rights reserved.

Hash Objects – Why Use Them?
Carolyn Cunnison
SAS Technical Training Specialist

Page 2: Hash Objects – Why Use Them?


Agenda

What are hash objects?

When should I use them?

Some sample code.

Page 3: Hash Objects – Why Use Them?


What are HASH objects?

[Figure: an in-memory table with a Keys column followed by two Data columns]

• A hash object can be thought of as rows of keys and data loaded into memory.

Page 4: Hash Objects – Why Use Them?


Advantages of Hash Objects

Values can be hard-coded or loaded from a SAS data set.

Keys and data can be a mixture of character and numeric.

Provides in-memory data storage and retrieval.

Does not require that data be sorted.

Is sized dynamically.
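
The points above can be sketched with a small hard-coded hash object. This is only an illustrative sketch: the object name C, the variables Country, Country_Name and Region_Code, and the values loaded are made-up placeholders, not data from the presentation.

```sas
data _null_;
   /* Host variables for the hash object; names and lengths are hypothetical */
   length Country $2 Country_Name $20 Region_Code 8;

   /* No size is declared: the object grows as rows are added */
   declare hash C();
   C.definekey('Country');                        /* character key        */
   C.definedata('Country_Name', 'Region_Code');   /* character + numeric  */
   C.definedone();
   call missing(Country, Country_Name, Region_Code);

   /* Hard-coded rows (placeholder values) */
   rc = C.add(key: 'NO', data: 'Norway',         data: 1);
   rc = C.add(key: 'SE', data: 'Sweden',         data: 1);
   rc = C.add(key: 'GB', data: 'United Kingdom', data: 2);

   /* NUM_ITEMS returns the number of rows currently stored */
   n = C.num_items;
   put 'Rows in hash: ' n;

   /* Look up one key; on success the data variables are filled in */
   rc = C.find(key: 'SE');
   if rc = 0 then put Country_Name= Region_Code=;
run;
```

No sorting is involved at any point, and the lookup happens entirely in memory.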

Page 5: Hash Objects – Why Use Them?


When to Use Hash Objects (1): Joining tables

I cut my processing time by 90% using hash tables - You can do it too!; Jennifer K. Warner-Freeman
• http://www.nesug.info/Proceedings/nesug07/bb/bb16.pdf

Jennifer took an existing Proc SQL join which took between 2 and 4 hours to run. When she rewrote the program to use Hash tables, the program ran in 11 minutes.

Page 6: Hash Objects – Why Use Them?


When to Use Hash Objects (2): Summary-less summarization

Hash-Crash and Beyond; Paul Dorfman et al
• http://www2.sas.com/proceedings/forum2008/037-2008.pdf

Compares PROC SUMMARY with the NWAY option to a hash object:

proc summary data = input nway ;
   class k1 k2 ;
   var num ;
   output out = summ_sum (drop = _:) sum = sum ;
run ;

The Hash Object did “the job more than twice as fast at the same time utilizing ⅓ the memory”
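
A hash-based counterpart to the PROC SUMMARY step above might look roughly like the following. This is a sketch of the general technique, not the paper's exact code: the object name H and the output data set name hash_sum are assumptions, and edge cases such as missing class values or groups whose num values are all missing are handled differently than in PROC SUMMARY.

```sas
data _null_;
   /* ordered:'a' returns the groups in ascending key order on output */
   declare hash H(ordered: 'a');
   H.definekey('k1', 'k2');
   H.definedata('k1', 'k2', 'sum');
   H.definedone();

   do until (eof);
      set input end=eof;
      /* First time this key combination is seen: start its total at 0 */
      if H.find() ne 0 then sum = 0;
      sum + num;            /* sum statement: ignores missing num */
      H.replace();          /* store the updated total for (k1, k2) */
   end;

   /* Write one row per (k1, k2) group; hash_sum is a hypothetical name */
   H.output(dataset: 'hash_sum');
   stop;
run;
```

The input is read once, unsorted, and only one row per key combination is held in memory.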

Page 7: Hash Objects – Why Use Them?


When to Use Hash Objects (3): Dynamically output to multiple files

Paul Dorfman paper (continued)

Use a hash table instead of the following:

data out1 out2;
   set tablein;
   …
   if id = 1 then output out1;
   else if id = 2 then output out2;
run;
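
One way to avoid hard-coding the output data set names is to let the OUTPUT method build them at run time. The sketch below assumes that tablein is already sorted (or grouped) by a numeric id and that it contains a variable called value; value and the object name H are hypothetical, and only the variables listed in DEFINEDATA are written.

```sas
data _null_;
   /* A new hash instance is created for each BY group (one step iteration) */
   declare hash H(ordered: 'a');
   H.definekey('id', '_n_');          /* _n_ keeps every row's key unique */
   H.definedata('id', 'value');       /* variables written to the output  */
   H.definedone();

   /* Load all rows of the current id group into the hash */
   do _n_ = 1 by 1 until (last.id);
      set tablein;
      by id;
      H.add();
   end;

   /* The data set name is computed at run time: out1, out2, ... */
   H.output(dataset: cats('out', id));
   H.delete();                        /* free memory before the next group */
run;
```

With this approach the number of output data sets follows the data, so a new id value needs no code change.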

Page 8: Hash Objects – Why Use Them?


When to Use Hash Objects (4): Removing data extremes

Knowledge Base Sample 25990
• http://support.sas.com/kb/25/990.html

Removes top and bottom 10% of data values.

Page 9: Hash Objects – Why Use Them?


When to Use Hash Objects (5): Perform data sampling without PROC SURVEYSELECT

Better Hashing in SAS 9.2; Robert Ray and Jason Secosky
• http://support.sas.com/rnd/base/datastep/dot/better-hashing-sas92.pdf

Select observations from a table without replacement.

Perform sampling and data manipulation in one step.

Page 10: Hash Objects – Why Use Them?


Terminology

Partial list of methods:

  Object  Methods
  HASH    Definedata    Add - a row
          Definekey     Remove - a row
          Definedone    Replace - data for key
          Find          Delete - hash table
  HITER   First
          Last
          Next
          Prev

Page 11: Hash Objects – Why Use Them?

Business Scenario

You need to read orion.product_list and then look up information in the orion.supplier table.

Structure of orion.product_list

Structure of orion.supplier

Page 12: Hash Objects – Why Use Them?

Loading Data from a SAS Data Set

data supplier_info;
   drop rc;
   length Supplier_Name $40 Supplier_Address $ 45 Country $ 2;
   /* Build and load the hash object once, on the first iteration */
   if _N_=1 then do;
      declare hash S(dataset:'orion.supplier');
      S.definekey('Supplier_ID');
      S.definedata('Supplier_Name', 'Supplier_Address', 'Country');
      S.definedone();
      call missing(Supplier_Name, Supplier_Address, Country);
   end;
   set orion.product_list;
   /* Look up the current Supplier_ID; rc=0 means the key was found */
   rc=S.find();
   /* Keep only products with a matching supplier */
   if rc=0;
run;

p305d02

Page 13: Hash Objects – Why Use Them?

Execution

Partial PDV (D marks rc, which the DROP statement excludes from the output)

  Variable          Value
  Supplier_Name     (blank)
  Supplier_Address  (blank)
  Country           (blank)
  Product_ID        .
  Product_Name      (blank)
  Supplier_ID       .
  rc (D)            .
  _N_               1

Partial HASH Object S

  KEY: Supplier_ID  DATA: Supplier_Name        DATA: Supplier_Address  DATA: Country
  50                Scandinavian Clothing A/S  Kr. Augusts Gate 13     NO
  109               Petterson AB               Blasieholmstorg 1       SE
  316               Prime Sports Ltd           9 Carlisle Place        GB
  ...               ...                        ...                     ...
  3298              A Team Sports              2687 Julie Ann Ct       US

Page 14: Hash Objects – Why Use Them?

Execution

Partial PDV (hash object S is unchanged from the previous slide)

  Variable          Value
  Supplier_Name     (blank)
  Supplier_Address  (blank)
  Country           (blank)
  Product_ID        210200100009
  Product_Name      Kids Sweat Round Neck,Large Logo
  Supplier_ID       3298
  rc (D)            0
  _N_               6

Page 15: Hash Objects – Why Use Them?

Execution

Partial PDV (hash object S is unchanged from the previous slide)

  Variable          Value
  Supplier_Name     A Team Sports
  Supplier_Address  2687 Julie Ann Ct
  Country           US
  Product_ID        210200100009
  Product_Name      Kids Sweat Round Neck,Large Logo
  Supplier_ID       3298
  rc (D)            0
  _N_               6

Page 16: Hash Objects – Why Use Them?

Execution

Partial PDV and hash object S: unchanged from the previous slide.

if rc=0;  →  True

Page 17: Hash Objects – Why Use Them?

Execution

Partial PDV and hash object S: unchanged from the previous slide.

Implicit OUTPUT;
Implicit RETURN;

Page 18: Hash Objects – Why Use Them?

Execution

Partial PDV and hash object S: unchanged from the previous slide.

Continue until EOF

Page 19: Hash Objects – Why Use Them?

Results

proc print data=supplier_info(obs=10);
   var Product_ID Supplier_ID Supplier_Name Supplier_Address Country;
   title "Product Information";
run;

Partial PROC PRINT Output

Product Information

Obs  Product_ID    Supplier_ID  Supplier_Name                Supplier_Address   Country
  1  210200100009         3298  A Team Sports                2687 Julie Ann Ct  US
  2  210200100017         3298  A Team Sports                2687 Julie Ann Ct  US
  3  210200200022         6153  Nautlius SportsWear Inc      56 Bagwell Ave     US
  4  210200200023         6153  Nautlius SportsWear Inc      56 Bagwell Ave     US
  5  210200300006         1303  Eclipse Inc                  1218 Carriole Ct   US
  6  210200300007         1303  Eclipse Inc                  1218 Carriole Ct   US
  7  210200300052         1303  Eclipse Inc                  1218 Carriole Ct   US
  8  210200400020         1303  Eclipse Inc                  1218 Carriole Ct   US
  9  210200400070         1303  Eclipse Inc                  1218 Carriole Ct   US
 10  210200500002          772  AllSeasons Outdoor Clothing  553 Cliffview Dr   US

Page 20: Hash Objects – Why Use Them?

Could I do the same thing with a MERGE?

Yes. But…
• You would have to sort both tables.
• Reading from disk is slower than reading from memory.
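
For comparison, a sort-and-merge version of the same lookup could be sketched as follows. The WORK data set names, the IN= variable names and the output name supplier_info_merge are assumptions made for illustration, and the sketch assumes Supplier_ID is unique in orion.supplier.

```sas
/* Both inputs must be in Supplier_ID order before the MERGE */
proc sort data=orion.product_list out=work.product_list;
   by Supplier_ID;
run;

proc sort data=orion.supplier(keep=Supplier_ID Supplier_Name
                                   Supplier_Address Country)
          out=work.supplier;
   by Supplier_ID;
run;

data supplier_info_merge;
   merge work.product_list(in=InProduct)
         work.supplier(in=InSupplier);
   by Supplier_ID;
   /* Keep only products that have a matching supplier */
   if InProduct and InSupplier;
run;
```

Note that the result comes back in Supplier_ID order, whereas the hash version keeps the original order of orion.product_list.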

Page 21: Hash Objects – Why Use Them?

What about data size?

Scalability of Table Lookup Techniques, Rick Langston
http://support.sas.com/resources/papers/proceedings09/037-2009.pdf

Compared hash table, Sort/Merge, Indexing, PROC SQL and PROC FORMAT as table lookup techniques.

In that paper's tests, hash object processing was successful up to around 1,900,000 rows and then ran out of memory.

Page 22: Hash Objects – Why Use Them?

Did you know that…

PROC SQL sometimes uses hashing to join tables.

• Possible processing methods are:

  sqxjsl  - Step Loop Join (Cartesian product)
  sqxjm   - Merge Join
  sqxjndx - Index Join
  sqxjhsh - Hash Join

• To view the method used:

  proc sql _method;
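
As an illustration, the supplier lookup from the business scenario could be written as a PROC SQL join with the _METHOD option switched on. The output table name supplier_info_sql and the aliases p and s are assumptions; which join method the optimizer actually chooses depends on your data and environment.

```sas
/* _METHOD writes the chosen processing methods (for example sqxjhsh) to the log */
proc sql _method;
   create table supplier_info_sql as
   select p.*,
          s.Supplier_Name,
          s.Supplier_Address,
          s.Country
      from orion.product_list as p
           inner join
           orion.supplier as s
           on p.Supplier_ID = s.Supplier_ID;
quit;
```

Check the SAS log after the step runs to see which of the method codes listed above appears.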

Page 23: Hash Objects – Why Use Them?

The HITER object

• The HITER object must point to a HASH object.

• Read the HITER using the following methods: First, Last, Next, Prev.
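
A minimal sketch of walking through a hash object with a HITER object is shown below. It assumes Supplier_ID is numeric and reuses the variable lengths from the earlier lookup program; the iterator name SI is hypothetical.

```sas
data _null_;
   length Supplier_ID 8 Supplier_Name $40 Country $2;

   /* ordered:'a' so the iterator returns rows in ascending key order */
   declare hash S(dataset: 'orion.supplier', ordered: 'a');
   S.definekey('Supplier_ID');
   /* The key is also listed as data so the iterator can return it */
   S.definedata('Supplier_ID', 'Supplier_Name', 'Country');
   S.definedone();

   /* The iterator points to the hash object named in quotes */
   declare hiter SI('S');
   call missing(Supplier_ID, Supplier_Name, Country);

   /* FIRST positions on the first row; NEXT returns nonzero after the last */
   rc = SI.first();
   do while (rc = 0);
      put Supplier_ID= Supplier_Name= Country=;
      rc = SI.next();
   end;
run;
```

Swapping FIRST and NEXT for LAST and PREV would read the same rows in descending key order.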

Page 24: Hash Objects – Why Use Them?

Conclusion

Hash and Hiter objects are very flexible.

Data has to fit into memory.

Results will depend on your data, your environment, and what you are trying to do.

You have to benchmark.

Page 25: Hash Objects – Why Use Them?


Want to know more?

• SAS Programming III: Advanced Techniques and Efficiencies

https://support.sas.com/edu/schedules.html?ctry=ca&id=279

• Also available as Live Web course.

Page 26: Hash Objects – Why Use Them?


Questions?