building things :disks databases

Building Things : DisksDatabases

Introduction

• This is a mix and match section

• It contains all the information that does not have a home elsewhere in this presentation

• It covers– Disk layouts– Disk Farms– Striping on Striping– Database build

Disk Layouts

• Basically in an IQ-M system you can have a number of different disk configurations

• We consider the following– Simple SCSI (or UDMA-[1]66) Disks– Basic Disk “Farms” (Collections of raw disks)– RAID Systems (NT RAID, Clariion etc.)– Disk Subsystems (EMC, MTI Guardian etc.)

Basic Disk Farms

• Here we are considering a number of “dumb” disks

• There is no intelligence in the disks – or the controllers

• Here is where we use IQ-M striping– Here we have a number of disk drives (on various

controllers)– We stripe across a series of disks

Disk Striping

Basic Disks

1-dimensionalIQ Striping

In the 1-dimensional striping we are spreading the data writes (and reads) across a series of disk drives – limiting the overall disk head movement.

Problems with Disk Striping

• IQ-M has to make location decisions as to where to write and read the data

• It is not as simple as writing to “bit bucket” or a serial list of disk blocks

• It is faster than simple 1 disk = 1 device• The performance hit is measured in extra micro

seconds per read/write – but with a 10 TBytes database this could be important

A DCB

RAID – 0,1Redundant Array ofInexpensive Disks

Raid 1 - MirrorIn RAID 1 – Mirror, we can consider that Disks A and B are mirrored and Disks C and D

This improves read and write performance, as the controller will pick which disk heads are closer to the data required. Also there is an improvement in data security – 2 copies

A DCB

Raid 0 - StripeIn RAID 0 – Stripe, we can spread the reads and writes as per the last slide

This dramatically improves the read and write performance, but does nothing for the data security

A DCB

RAID – 0/1 and 5Redundant Array ofInexpensive Disks

Raid 0/1 – Stripe and Mirror In RAID 0/1 – Stripe and Mirror, we can see that the stripe sets on A and B mirror to C and D

This gives security – in the mirroring (copies) of the data, and improved performance due both to the two copies and the stripe sets

A DCB

Raid 5 – Checksum Multi-readIn RAID 5 – Csum/MR, The data is in “stripes” across the disks then a checksum is written to the final disk

This gives a small performance improvement for reads, no improvement for writes (in fact it will slow writes down!). The only bonus is security. The system will allow a disk failure in the pack without requiring the 2 to 1 overhead of mirroring systems

Other RAID

• There is RAID 2, 3 and 4• These are variations on the mirroring/striping and

checksum theme• There are also proprietary RAID schemes

– RAID S The is EMC’s RAID system– RAID MThis is the MTI RAID Scheme

• We will come on to talk about proprietary schemes in a few slides time

Well is RAID useful ?

• Yes it is – mainly !

• If the RAID system handles the data distribution at the hardware level (with RAID controllers) then RAID is both fast and safe

• Windows NT (and 2000) can drive RAID through the operating system

• This can be slower than IQ-M performing the striping– Watch out for large numbers of reads – and individual read

performance slowing down

– Also Avoid RAID 5. It is very slow for writes

Disk Subsystems (emc, HDS)

• These are disk arrays where we have no idea what the internal organization of the disks is (and nor do we care)

• Generally we can consider that we are writing to memory (and usually we are!)

• The tuning of the disks is usually best left up to the Hardware Support guys

Striping on Striping

• There is one very contentious area in disk organisation• This is 2 dimensional disk striping or striping on stripes• This is driven by using an operating system or RAID unit

to provide 1 dimensional striping on a series of disks• Then we apply IQ striping onto the existing striping• 2 Dimensional stripes• Do it right – it’s very fast, do it wrong and it isn’t!

A diagram

Operating SystemStriping

AS IQ-M Striping

The reason this works is now the individual disks only have a very small amount of data to read and write so we rely on very fast disk -> processor communicationsIf this is done properly it is the fastest disk access for ASIQ-M

The secret, if there is one, is to make the block size the same as the “micro” stripe size… But this needs further experimentation

Disk Striping

• By default Disk_Striping is set ON for RAW devices and OFF for file system devices

• We should be using RAW devices – they are (generally) a lot faster and potentially safer than file system devices

• You may want to play with this parameter if you are running devices on disk farms or RAID array systems

File system Storage

• If you must have the IQ Store on Operating System File System• OS_File_Cache_Buffering can be set off for Solaris and

Windows NT (and 2000)

• However the system will slow down in the following areas– IQ Page Size < file system block size

– During Loads

– Solaris > 4Gb memory

• The description in the manual is awesome!

Create Database

• OK so we have built the device or devices – about time we considered some of the options to create the database

• Of all of the options when issuing a CREATE DATABASE command the following slide details the most important

CASE - 1

• CASE– CASE RESPECT is the fastest– There is a 10-20% hit going to CASE IGNORE

• Implications to FP Indexes– Regardless of RESPECT vs. IGNORE all 1-byte and 2-byte

FP indexes store all the binary values for the data– So ABC, abc, Abc, Abc are all stored even for CASE

IGNORE

CASE - 2

• Remember because we store all bitmaps we can go from 1-byte to 2-byte FP, or 2-byte to flat FP where we might not want to

• A solution to this is to set the server in CASE RESPECT (because it is faster)

• Then use an ETL tool to rtrim() and ucase() or lcase() all of the incoming character data

CASE - 2

• The HG index stores data in what is called “conditioned” mode.– For a CASE IGNORE database there is only one entry per

logical value ABC = abc = Abc etc.

– For a CASE RESPECT database there has to be one entry per value ABC != abc != Abc etc.

CASE – 3

• For an LF index we hold partially conditioned values– For CASE RESPECT and CASE IGNORE all values have a bit-

map– This can be wasteful on space

• The reason for having this is two fold– To allow for the recreation of the FP index from the LF– To allow for some rare cases (some group by’s) where we still

project values from an LF index

COLLATION

• COLLATION– Set to ISO_BINENG, this is the fastest– If you must have a collation sequence this will slow the

system down by around 10% for 8 bit character sets and 50% for multi-byte character sets

– There are substantial slow downs for all multi-byte character sets

IQ PAGE SIZE - 1

• This is an area of extreme contention

• The IQ PAGE size (effectively) determines the the size of the smallest addressable area in memory

• NOT ON DISK !

• The disk parameter is the BLOCK SIZE

IQ PAGE SIZE – 2

• The rules for IQ PAGE SIZE (or memory buffer size) are simple

• Set to 64K unless….– Set to 128K when the memory model exceeds around 1

Gbytes per cache, and the number of rows in the FACT table exceeds 100m

– Set to 256K when the memory model exceeds 4 Gbytes per cache (this may be a rare case!)

– Do not set to 512K – there is a little bug…

BLOCK SIZE

• The block size is set automatically when the IQ PAGE SIZE is set

• (In 11.x IQ you could set the Max Compression parameter that would vary the number of blocks per page)

• In IQ 12, you can vary the BLOCK SIZE but it does not do much except in some extreme cases

So what is BLOCK SIZE?

Disk DeviceMemory Cache

A Page

A Chunk(A variable length object)

When a “page” is written out of memory it is compressed. Only the resulting used “blocks” within the page are written to disk, this set of blocks is called a “chunk”

Writing to Disk

Default Page in memory64Kb – 16 Blocks

Compress beforeDisk Write Operation

10 Blocks used before compression

Compressor

4 blocks used after compression

Disk Write

1 Chunk on Disk the size of 4 blocks - 16kbytes

Disk_Striping_Packed

• The problem with large systems is that we only have 1 freelist

• This means that if we want 8 blocks of space we will grab the first > 8 space we can find, which tends to fragment the devices

• If Disk_Striping_Packed is ON then we have one freelist for each “number of blocks” available– 1 for 1 block free– 1 for 2 blocks free– Etc.

• The trade-off is better space usage against slightly worse localisation.

BLOCKS

• So as can be seen the IO operation is performed in blocks

• but how many are written at one time depends upon– The usage (how much of the page is used)– The Compressibility (how well the used blocks compress)

• We have found the default works the best– That is for a 64K page 16 blocks of 4k

Block Size

• The compressor is designed to run in 2Kbyte chunks so – Never make the block size less than 2Kbyte– Always make is a multiple of 2Kbyte

• The compressor does not have a huge amount of memory to run in so – If the Page size is big (>128Kbyte) do not set the block size

much bigger than 32-64Kbyte

Scoping - 1

• ASIQ-M 12.4.2 used the ASA scoping rules, not the ANSI rules

• In ANSI there are strict rules as to where aliases can be used, and what they are

• ASA has rules that an extension to the ANSI rules and allow much more freedom

• In ASA you can re-use and rename anything, anywhere

Scoping - 2

• An example:Select alpha = 27, beta, gamma From tablenameWhere alpha between 27 and 29

• In ASE rows are returned that match the where clause i.e. alpha between 27 and 29

• In ASA/ASIQ all table rows are returned• Why ?

Scoping – 3

Select alpha = 27, beta, gamma

From tablename

Where alpha between 27 and 29

• In ASIQ the following happens– In the select we have defined an alias alpha and assigned a value

to it (27)

– Then in the where clause we check the value of alpha against the values 27 thru 29 and the condition is true because alpha is 27

• Note: Alpha is defined as an alias here not a column identifier

A Picture

Select alpha = 27, beta, gammaFrom tablenameWhere alpha between 27 and 29

ASE

ASIQ

In ASE this alpha is a aliasIn ASE this alpha is a column name

In ASIQ this alpha is a aliasIn ASIQ this alpha is also a

alias, because it has been definedTo make the code “correct” for ASIQ you

need to prepend the second alpha withThe table name. E.g. tablename.alpha

Building Things - End

building things :disks databases

Documents