
Chapter 1: Introduction to Data Mining

© Thomas Yee

Statistics Department, Auckland University

February 2012

[email protected]
http://www.stat.auckland.ac.nz/~yee


Chapter Outline

1.1 Introduction

1.2 Generic topics

1.3 Statistical methods used in DM

1.4 Debunking DM myths

1.5 Fraud detection

1.6 References†

1.1 Introduction

This chapter gives a brief introduction to Data Mining (DM) and issues associated with handling large datasets. We shall focus on the statistical aspects of DM. There are many other important topics which we only briefly mention or ignore.

For students with impairments, I am willing to make special arrangements re. these course notes... please let me know.

Sections daggered (†) are non-examinable, as are the quotes at the bottom of each page.

Some Definitions

Data mining (DM) is at best a vaguely defined field; its definition depends largely on the background and view of the definer. Here are some definitions.

Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for business advantage. (SAS 1998 Annual Report, p. 51)

Data mining is advanced methods for exploring and modeling relationships in large amounts of data. (SAS Institute Inc.)

Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in the data. (Fayyad)

Data mining is the process of discovering advantageous patterns in data. (John)

Data mining is the process of extracting previously unknown, valid, and actionable information from large databases and using it to make crucial business decisions. (Zekulin)

Data mining is the computer-automated exploratory data analysis of (usually) large complex data sets. (Friedman, 1998)

Data mining is the efficient discovery of valuable, nonobvious information from a large collection of data. (Bigus, 1996)

Data mining is the search for valuable information in large volumes of data. (Weiss and Indurkhya, 1998)

In contrast, Statistics is typically a more formal process of designing, collecting, organizing, presenting and analyzing data and obtaining conclusions.

Statistics is a science which ought to be honourable . . . (Thomas Carlyle)

Data miner—a hot job!!!

The June 26, 2000, issue of TIME predicted that one of the 10 hottest jobs of the 21st century would be Data Miner:

"..., research gurus will be on hand to extract useful tidbits from mountains of data, pinpointing behavior patterns for marketers and epidemiologists alike."

An excerpt from the NY Times (2009/08/06): "The rising stature of statisticians, who can earn $125,000 at top companies in their first year after getting a doctorate, is a byproduct of the recent explosion of digital data."

. . . the movement of the last hundred years is all in favour of the statistician. (H. G. Wells)

Why is it called "Data Mining"?

Data are a plentiful resource. As such, perhaps one can mine them for nuggets of gold (gems of truth/insight/knowledge), i.e., sift through vast quantities of raw data looking for valuable nuggets of business information. The analogy is that it entails sifting through piles and piles of dirt and rubbish to get at the treasure. That is, new technologies help companies find riches buried in dormant data sets.

In part, DM has been driven by individuals and organisations who find themselves with large data holdings which they feel ought to be sources of valuable information. The interest has been fanned by hardware and software computer vendors who are looking for new market niches. See later.

Some statisticians have used the term "data mining" to mean unsavoury "data dredging", "data snooping" or "fishing expeditions" in the search of publishable P-values. That is, the extraction of possibly spurious structure from data using exhaustive and invalid methods—torturing the data until it confesses.

These criticisms of DM are about getting valid statistical inference. Many DM methods were derived heuristically, and use brute-force computer power, e.g., trees. Consequently, their statistical properties are usually not tractable, and most standard statistical inference based on P-values etc. is invalid. However, the focus of DM is often prediction and not statistical inference.

There is no firm distinction between DM and statistics.

I understand mining to be a very carefully planned search for valuables hidden out of sight, not a haphazard ramble. Mining is thus rewarding, but, of course, a dangerous activity. (D. R. Cox, in the discussion of Chatfield, 1995)

Why mine data?

1. The data are being produced
Big datasets are increasingly collected routinely, e.g., corporations amass tremendous amounts of information daily. Most products have bar codes, so many transactions are recorded.

- Walmart records 20 million transactions per day
- Mobil has a database of geological data consisting of over 100 terabytes of information
- Credit card purchases and tax returns form large data sets
- The human genome project comprises gigabytes of data
- The Earth Observing System (EOS) satellites will produce 50 gigabytes of data every hour (Exercise: how much data in one year? Ans: 0.42 petabytes; see the short calculation after this list)
- AT&T has terabyte datasets of phone transactions, e.g., records of the place, time, and destination of calls. Consequently, scientists there routinely see datasets with millions, hundreds of millions, and even billions of records.
- stock trades
- telephone calls
- tax returns
- EFTPOS, bank transactions, credit card charges
- issuing of library books
- car registration, WOF, ...
- Google is the most efficient information-gathering machine ever built. Every time you use it to search the web, the query you typed, the time and date, and the IP address and unique "cookie" ID assigned to your computer are recorded and retained for 18 months. Around 2008, Google processed around 20 petabytes of data per day (to put this in context, the entire written works of humanity, in all languages, occupy about 50 petabytes). Google's mission: to organize the world's information and make it universally accessible and useful.
- etc., etc., etc.
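As a quick check of the EOS exercise above, here is a two-line R calculation (using 1024-based unit conversions, which is how the quoted 0.42 petabytes arises):

# EOS data produced in one year, assuming 50 gigabytes per hour.
gb.per.year = 50 * 24 * 365        # gigabytes in one year
gb.per.year / (1024 * 1024)        # petabytes; approximately 0.42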

Consequently, expressions such as "data tsunami", "data torrents", "data deluge" and "drowning in data" are common now.

2. The data are being warehoused
Storage devices such as CD-ROMs and tapes provide cheap means of storing very large amounts of data. Gigabyte storage devices are common, and terabyte devices are possible. Thumb drives have become very popular.

There have been advances in database management systems and languages, e.g., SQL.

3. Computing power is cheap
Computer performance has been doubling every 18–24 months (Moore's Law, 1965). Hard disk capacity has been increasing at an even faster rate.

As of Feb 2012, the retail price of an i7-processor PC with 8 GB memory, a 2 TB hard drive and a 24" screen is about NZ$2100.

Nowadays new computing strategies are being exploited to spread the load among multiple computers connected like the many neurons in a brain. For example, grid computing is where a group of networked computers is used to divide and conquer computing problems, e.g., find life in the universe, solve a cryptology problem, compute the next largest known prime number; cloud computing is where the internet is used to distribute data and computing tasks to many computers all around the world, but without the centralized hardware infrastructure of grid computing.

4. Competition between companies is strong
Fear between companies has been partly a reason for the adoption of DM.

5. Integrated software is now available
In the past there was specialized software written by researchers. This has since been implemented in a unified way and also incorporated with database access.

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay." (Sherlock Holmes)

Studies from the past

Not only is data being collected now, but data collected in the past is coming online. Governments, research funders and academic communities are getting increasingly interested in the potential of saving research data for the purposes of secondary and shared use. This is now made much easier with the arrival of GRID technologies (the Kiwi Advanced Research and Education Network (KAREN) in New Zealand) and with the wholesale digitisation of data of all kinds. There are implications for researchers. For example, the National Institutes of Health (NIH) requires all grantees to make their data available within a year of the publication of their first major research output. The Economic and Social Research Council (ESRC) in the UK requires grantees to deposit their data.

There is a trend towards reproducible research these days.

Here are some links that provide online data:

http://www.nzssds.org.nz/ for the New Zealand Social Science Data Service,

http://assda.anu.edu.au/ for the Australian Social Science Data Archive,

http://www.iassistdata.org/ for the International Association for Social Science Information Services and Technology,

http://www.cessda.org for the Council of European Social Science Data Archives.

Statistics has a bright future; I'm not so sure about statisticians. (D. R. Cox)

How DM is being used in business today

[Figure: DM application areas in business: Sales/Marketing, Buyer Behavior, Customer Retention, Quality Control, Other Sales/Mkt, Inventory, Fraud, Cost/Utilization.]

DM is having a major impact on business, industry and science. There are hundreds of examples; here are a few.

Tracking down criminals, e.g., the Oklahoma City bombing case, the Unabomber case, etc.

Fraud detection, e.g., credit card scams, fraudulent tax returns or fictitious insurance claims.

Supermarkets as information brokers, e.g., Safeway and Foodtown cards link demographic variables with purchase patterns.

Customer retention (minimizing attrition). By learning who is likely to leave and why, a company can devise a retention plan.

Identifying bad customers, e.g., people who own credit cards that they almost never use and, if they do, pay their bills in full; people who amass large debts before declaring bankruptcy.

An airline, for example, would be able to predict when to cut fares, and by how much, in order to ensure filling an aircraft on a particular route.

A telecommunications company can use DM to identify peak usage times for optimal network efficiency.

A bank can keep optimal cash levels on hand.

Retailers can identify profitable distribution channels.

In 1998, 80% of the Fortune 500 companies were involved in DM or in a DM pilot project.

The key to success in business is to know something that nobody else knows. (Aristotle Onassis)

A success story: Australia's worst serial killer

The following comes from a talk given by Matt Wand in September 2004.

In the early 1990s, seven young backpackers were murdered in what was to become Australia's most notorious serial murder case. The police had developed a profile of the killer. Investigators therefore applied a data mining method to vehicle records, gym memberships, gun licensing and internal police records. As a result, the list of suspects was progressively narrowed from 18 million individuals to a short list of 32, which included the murderer. Thousands of precious police hours were saved and police were able to focus their investigations on a more manageable list of potential suspects, leading to the eventual successful conviction of the Backpacker Murderer.

"New Zealanders who move to Australia raise the intelligence of both countries." (Robert Muldoon)

A success story: "Diapers and beer"

The observation that customers who buy diapers (nappies) are more likely (on average) to buy beer allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips between them increased sales of all three items.

Scientists do not invent truth—they discover it. (Roger Penrose)

Another success story

Due to competitiveness and confidentiality we don't get to hear of all success stories, but here is one.

The Australian Health Insurance Commission (HIC), an agency of the Australian federal government, maintains computerized records of every doctor's consultation in the country, including details on the diagnosis, prescribed drugs, and recommended treatment. Maintaining a data store measuring many terabytes, the HIC has one of the largest collections of data in the world, most of it yet to be explored.

Data mining is being used to identify previously undiscovered patterns important to the business of the organisation. Fraud prevention and disease management can both benefit from new insights.

It was hypothesized that some GPs, when confronted with a patient with a shallow suture (cut), deliberately make a claim on it as a deep suture, the reason being that they get twice the amount of money for a deep one compared to a shallow one.

By looking at the (huge) database, the ratio (# shallow)/(# deep) was computed for every GP, and tracked over time. It was found that the ratio was quite similar for GPs for the first 5 years after graduation, but then some of them deviated to extreme values. Some of these GPs were sent letters of warning, and one was struck off. Expenses in this area have gone down.

In another case, the HIC was able to detect fraud ("inappropriate practice") in pathology tests, and save A$1 million in the first year of operation.

97.3% of all statistics are made up.

Striking fool's gold

Done wrong, DM can produce bogus correlations that range from useless to dangerous. Here are some examples.

Around the time of the 1929 Wall Street Crash, observers noted a close correlation between New York and London share prices and levels of solar radiation.

Michael Drosnin wrote the best-seller "The Bible Code", which claimed to find hidden messages in the Bible about dinosaurs, Bill Clinton, the assassination of Israeli Prime Minister Yitzhak Rabin, etc. What he did was write the Hebrew Bible on a huge grid of letters and use a computer to look for words that appeared across, up, down or diagonally. The cryptic 'messages' consisted of seemingly related words that appeared near each other, e.g., "dinosaur" and "asteroid".

In 1992, ProCyte Corp. in the USA was dismayed when a clinical trial found that its new drug, Iamin, didn't seem to promote general healing of diabetic ulcer wounds. So the company searched through subsets of data and found that Iamin seemed to work on certain foot wounds. But after another expensive and fruitless clinical trial it was shown to be a statistical fluke. Not allowed drug status, Iamin is now sold as a wound dressing.

Finance is rife with wrong-headed DM, e.g., (from David J. Leinweber, CEO of First Quadrant Corp. in the USA, which manages $20 billion in assets) after sifting through a United Nations CD-ROM he discovered that the single best predictor of the Standard & Poor's 500-stock index was butter production in Bangladesh.

Guns will make us powerful; butter will only make us fat. (Hermann Goering, 1893–1946)

Andrew W. Lo, from MIT, said "Given enough time, enough attempts, and enough imagination, almost any pattern can be teased out of any data set."

Bad DM is likely to get worse. With desktop computers becoming more powerful, DM tools are being used by people who are clueless about statistics.

A senior research analyst at META Group in the USA said "we need to be sure we're not just empowering people to shoot themselves in the foot."

Another type of fool's gold is overfitting (see next chapter): "My model fits the (training) data perfectly . . . I've struck it rich!"

The statistician cannot evade the responsibility for understanding the process he applies or recommends. (Sir Ronald A. Fisher)

Confirming vs discovering

There are two basic styles of DM:

1. Hypothesis testing (aka the top-down approach) involves confirming or disproving preconceived ideas.

2. Knowledge Discovery (aka the bottom-up approach) is essentially letting a smart computer program loose on a data set and having it come back with information that we can then apply to make money and win customers.

Knowledge discovery can be directed or undirected. The first is when we want to explain the value of some particular variable in terms of the other variables. It uses classification, regression and/or prediction. Undirected learning asks the computer to identify patterns in the data that may be significant. It provides supplementary knowledge.

Undirected knowledge discovery is to recognize relationships in the data, and directed knowledge discovery is to explain those relationships once they have been found.

I distrust the facts and the inferences. (Ralph Waldo Emerson)

Directed Learning Examples:
- What is the predicted profitability of this new customer?
- What variables influence spending?

Undirected Learning Examples:
- What patterns of customer spending can be found?
- Who are the big spenders?

Discovery is the ability to be puzzled by simple things. (Noam Chomsky)

Mining the miners

DM so far has been largely a commercial enterprise. As in most gold rushes of the past, the goal is to "mine the miners". The largest profits are made by selling the tools to the miners, rather than in doing the actual mining. DM is used to sell computer hardware and software.

Hardware manufacturers emphasize high computational requirements for DM: a lot of internal memory (RAM) and massive amounts of disk space.

Software providers emphasize competitive edge: "Your competitor is doing it, so you had better keep up."

I think there is a world market for maybe five computers. (IBM Chairman Thomas Watson, 1952)

General data mining software

There is a lot to choose from. Here are a handful.

SAS
- Approximately 44,000 business, government and university sites in 108 countries
- 2007 revenues were US$2.15 billion
- Strong in data warehousing (SAS/Warehouse Administrator™)
- SAS Enterprise Miner, released in 1998, has won several prizes and medals. Auckland University obtained a licence for this in mid-2002. Licences have changed for 2012.

SPSS "Clementine"

TIBCO Spotfire Inc. "S-PLUS". Almost dead.

AlphaMiner. An open source data mining platform. http://www.eti.hku.hk/alphaminer

R: There are many R packages now, including Rattle, RWeka and tm. Revolution (http://www.revolutionanalytics.com/) is R for business plus added features, e.g., "For the first time, R users can process, visualize and model terabyte-class data sets in a fraction of the time of legacy products—without employing expensive or specialized hardware."

RapidMiner: this is growing in popularity.

http://maltman.hmdc.harvard.edu/micah_altman/socsci.shtml#MINING is a link for other open source data mining software.

I have travelled the length and breadth of this country, and talked with the best people in business administration. I can assure you on the highest authority that data processing is a fad and won't last out the year. (Unknown editor in charge of business books for Prentice Hall, c.1957)

Besides these, there are a wide variety of specialized single-purpose products.

The Meta Group in the U.S. estimated that demand for DM software and services would exceed US$8 billion in the year 2000. The key word is "business intelligence".

He divided people into statisticians, people who knew about statistics, and people who didn't. He liked the middle group best. He didn't like the real statisticians much because they argued with him, and he thought people who didn't know any statistics were just animal life. (Nigel Balchin)

The data mining process

Various companies/packages have a step-by-step process for conducting DM. For example, SPSS uses the 5 A's: Assess, Access, Analyze, Act and Automate. SAS uses SEMMA: Sample, Explore, Modify, Model, Assess.

We will use the Two Crows Process Model:

1. Define business problem: The objectives must be stated clearly, e.g., what improvements are desired and how will they be measured?

2. Build a DM database: This involves data collection, data description, data cleaning, and maintaining the database.

3. Explore the data: This includes visualization, cluster analysis and link analysis.

4. Prepare data for modeling: The final data preparation step before building models, this involves selecting variables and rows, constructing new variables, and transforming variables.

5. DM model building: This is an iterative process. Perform the classification or regression, usually for prediction. Train and test the model, e.g., using simple validation, cross validation, bootstrapping.

6. Evaluation and interpretation: The model must be evaluated in terms of its results, and their significance interpreted. Do residual plots etc.

7. Deploy the model and results: The model can be used to
   (a) recommend actions based on simply viewing the model and its results, e.g., look at the clusters identified, the rules that define the model;
   (b) apply the model to different data sets.

The minute a statistician steps into the position of the executive who must make decisions and defend them, the statistician ceases to be a statistician. (William Edwards Deming)

According to the CRoss Industry Standard Process for Data Mining (CRISP-DM), http://www.crisp-dm.org, we have

Understanding customer: 10% to 20%

Understanding data: 20% to 30%

Prepare data: 40% to 70%

Build model(s): 10% to 20% (data mining)

Evaluate model(s): 10% to 20%

Take action: 10% to 20%

I am a statistic. (Johnnie Tillmon)

Some authors have the following as their data mining process.

[Diagram: the knowledge discovery process: Data Warehouse → Selection and Sampling → Cleaning and pre-processing → Transformation and reduction → Reduced Data → Data Mining → Patterns/Models → Evaluation and visualization → Knowledge.]

The only useful function of a statistician is to make predictions, and thus to provide a basis for action. (William Edwards Deming)

Some practical tips

The following tips have been suggested by some 'real' data miners:

A benchmark method is invaluable. It is extremely useful to fit a simple model to some data before trying a more complicated model. For example, fit a logistic regression or a linear model before trying out a decision tree or neural network. Why? So that the results of the more complicated analysis can be compared to the simpler one. One can see the added value. If there is little benefit in the complicated model, then the simpler one may be preferable. Clients also like getting a result early on, with the promise that the data miner is working on an improved model.

Know your tools, especially their strengths and weaknesses. Also, every statistical method has its set of assumptions, which should be checked if used. "If you only have a hammer, everything looks like a nail."

Communication is very important, as well as being able to work in a team.

It is essential to understand the client's problem. Often the client has unrealistic expectations and expects some magic. They often have no idea of exactly what questions their data is capable of answering.

If it can be done in R then do it in R. It may be a little slower, but the added functions such as plotting make up for it.

Some statistical methods are estimated iteratively, e.g., logistic regression. To speed up convergence, one idea is to fit the model to a small random sample of the data first. Then use the estimate from that as the initial value for the model fitted to the entire dataset; a small sketch of this idea follows.
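Here is a minimal sketch of that warm-starting idea, assuming a hypothetical data frame big.data with a binary response y and a predictor x; glm()'s start argument supplies the initial coefficients.

# Fit to a small random sample first, then warm-start the full fit.
pilot = big.data[sample(nrow(big.data), 10000), ]
fit.small = glm(y ~ x, family = binomial, data = pilot)
fit.full  = glm(y ~ x, family = binomial, data = big.data,
                start = coef(fit.small))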

Statistics is the art of stating in precise terms that which one does not know. (William Kruskal)

Statistical modelling

Choosing a good model for a dataset is an art. Here are some guiding principles for the data modeller.

"All models are wrong, but some are more useful than others" (from Box, 1979).

Don't fall in love with one model to the exclusion of alternatives.

Use diagnostic procedures to check how well the data fit the model.

No method dominates all others over all situations (Friedman, 1994). There is usually no uniformly best method. Each model has a set of situations where it works best.

The purpose of models is not to fit the data but to sharpen the questions. (Samuel Karlin)

Some recent developments

The recently ended Netflix contest, which offered $1 million to anyone who could significantly improve the company's movie recommendation system, was a battle waged with the weapons of modern statistics.

http://news.bbc.co.uk/2/hi/technology/8311627.stm is an article entitled "Government opens data to public" regarding data release in the UK in late 2009. The site www.data.gov.uk is built with semantic web technology, which enables the data it offers to be drawn together into links and threads as the user searches. The American version is www.data.gov.

1.2 Generic topics


This section looks at several generic topics associated with data mining.

Garbage in, garbage out.


How large is large?

Table: Some metric prefixes, and computer storage.

kilo   2^10   1,024                  kilobytes (Kb)   1/2 sheet
mega   2^20   1,048,576              megabytes (Mb)   1 ream
giga   2^30   1,073,741,824          gigabytes (Gb)   51 m stack
tera   2^40   1.099512 × 10^12       terabytes (Tb)   51 km stack
peta   2^50   1.1259 × 10^15         petabytes (Pb)   52,200 km
exa    2^60   1.152922 × 10^18       exabytes (Eb)    53,500,000 km

A typical A4 page of text is about 2000 bytes (c. 2K). The number of fundamental particles in the universe is estimated to be around 10^80, and the age of the universe, in seconds, is about 10^18.

A terabyte is one trillion bytes. Be careful of the word "billion": a British billion is a million million but an American billion is only a thousand million!

A database has two basic dimensions:

1. the number of rows (records/subjects/cases),

2. the number of columns (variables).

One to-day is worth two to-morrows. (Benjamin Franklin)

32-bit computers have the limitation of a 4 GB address space. 64-bit computers do not have such a problem (yet?).

Parkinson's law of data, a corollary of Parkinson's law (Parkinson, 1955), states that data expands to fill the space available for storage. The amount of data in the world has been doubling every 18–24 months.

Similar to Parkinson's law is Moore's law, which states that CPU speed doubles roughly every 18 months.

640K ought to be enough for anybody. (Bill Gates, 1981)

Order notation

In data mining, the computational cost of an algorithm is an important consideration. Often one is restricted to solutions that are (computationally) cheap, or equivalently, not (computationally) expensive. One method of measuring the computational cost of an algorithm is to use the so-called order notation. Note that this applies to both time and storage (memory and/or hard disk).

Definition: f(n) = O(g(n)) if and only if there exist two constants c and n0 such that |f(n)| ≤ c |g(n)| for all n ≥ n0. We say that the computing time of an algorithm is O(g(n)) to mean that its execution time takes no more than a constant multiplied by g(n).

For sufficiently large n, an O(n) algorithm is faster than, e.g., an O(n^2) algorithm. Note that f(n) = o(g(n)) if and only if lim_{n→∞} |f(n)/g(n)| = 0.

It can be shown that

O(1) < O(log n) < O(n) < O(n log n) < O(n^2) < O(n^3) < O(2^n) < O(n!) < O(n^n).

Examples of algorithmic complexity (Wegman):

O(n^(1/2)): plot a scatter plot.

O(n): calculate means, variances, kernel density estimates.

O(n log n): calculate fast Fourier transforms.

O(nc): calculate the singular value decomposition of an r × c matrix, solve a multiple linear regression.

O(n^2): solve most clustering algorithms.

O(a^n): detect multivariate outliers.
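To get a feel for these growth rates, one can time a roughly O(n) operation against a roughly O(n^2) one as n doubles; a small R sketch (the sizes are arbitrary):

# Compare an O(n) operation (mean) with an O(n^2) operation
# (a full pairwise distance matrix) as n doubles.
set.seed(1)
for (n in c(1000, 2000, 4000)) {
  x = rnorm(n)
  t1 = system.time(mean(x))["elapsed"]   # grows roughly linearly in n
  t2 = system.time(dist(x))["elapsed"]   # grows roughly quadratically in n
  cat("n =", n, " mean:", t1, "s  dist:", t2, "s\n")
}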


Table: Computational feasibility on a Pentium PC, 10 megaflop performance assumed.

          n^(1/2)    n          n log n       n^(3/2)     n^2
Tiny      10^-6 s    10^-5 s    2 × 10^-5 s   0.0001 s    0.001 s
Small     10^-5 s    0.001 s    0.004 s       0.1 s       10 s
Medium    0.0001 s   0.1 s      0.6 s         1.67 min    1.16 days
Large     0.001 s    10 s       1.3 min       1.16 days   31.7 years
Huge      0.01 s     16.7 min   2.8 hours     3.17 years  317,000 years

Table: The Huber-Wegman taxonomy of data set sizes. Nb. usually 8 bits equals one byte or character.

Descriptor     Data set size in bytes   Storage mode
Tiny           10^2                     Piece of paper
Small          10^4                     A few pieces of paper
Medium         10^6                     A floppy disk
Large          10^8                     Hard disk
Huge           10^10                    Multiple hard disks
Massive        10^12                    Robotic magnetic tape, storage silos
Supermassive   10^15                    Distributed data archives

People who deal with bits should expect to get bitten. (Jon Bentley)

Table: Number of operations for algorithms of various computational complexities and various data set sizes.

          n^(1/2)   n        n log n     n^(3/2)   n^2
Tiny      10        10^2     2 × 10^2    10^3      10^4
Small     10^2      10^4     4 × 10^4    10^6      10^8
Medium    10^3      10^6     6 × 10^6    10^9      10^12
Large     10^4      10^8     8 × 10^8    10^12     10^16
Huge      10^5      10^10    10^11       10^15     10^20

Controlling complexity is the essence of computer programming. (Kernighan)

Exercises

1. Let A and B be general n × n matrices, and x be a general n × 1 vector. Work out the cost of computing the following quantities, e.g., n(n − 2)^2 = O(n^3) multiplications, n(n − 2) = O(n^2) additions.

(a) A + B
(b) 5A
(c) Ax
(d) x^T A x
(e) AB
(f) trace(A), where the trace of a matrix is the sum of its diagonal elements
(g) trace(A^T A)
(h) trace((A + B)^T (A + B))

2. The cost of inverting an n × n matrix is, theoretically, O(n^3). If it takes 20 minutes to compute the inverse of a 1000 by 1000 matrix, how long would it take (theoretically) to invert a 4000 by 4000 matrix?
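A back-of-the-envelope check of Exercise 2 in R, assuming the O(n^3) scaling holds exactly:

# The time scales by (4000/1000)^3 = 64.
20 * (4000 / 1000)^3        # 1280 minutes
20 * (4000 / 1000)^3 / 60   # about 21.3 hours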


Example: sorting

There are quite a few algorithms for sorting n elements. One of the simplest is called bubble sort, which compares adjacent elements in a list and swaps them if they are not in order. Eventually the smallest element 'bubbles' to the top. Repeat this process, each time stopping one indexed element earlier, until you compare only the first (or last) two elements in the list. The largest element will have 'sunk' to the bottom. Here's an S function to do bubble sorting on a vector. Show that bubble sort is O(n^2).

bubblesort = function(x) {
  if (!is.vector(x)) stop("'x' must be a vector")
  n = length(x)
  for (i in 1:(n-1)) {
    for (j in n:(i+1)) {
      if (x[j-1] > x[j]) {
        # Swap x[j-1] and x[j]
        temp   = x[j-1]
        x[j-1] = x[j]
        x[j]   = temp
      }
    }
  }
  x
}
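A quick check that the function agrees with R's built-in sort(), and an informal timing check that doubling n roughly quadruples the running time, consistent with O(n^2):

set.seed(1)
x = sample(1:100)
identical(bubblesort(x), sort(x))       # should be TRUE

system.time(bubblesort(runif(1000)))
system.time(bubblesort(runif(2000)))    # roughly 4 times slower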

Nb. The fastest general-purpose sorting algorithms are more complicated but achieve O(n log n) in speed.

Data compression†

Almost every operating system allows files to be compressed. In UNIX, the commands compress and uncompress compress and uncompress files.

The various "zip" utilities and programs (e.g., xz and unxz, gzip and gunzip, bzip2 and bunzip2, WinZip) can typically create an archive file containing the compressed contents of an entire directory/folder. (On UNIX machines the tar (tape archive) utility transforms a directory into a single file, which can then be compressed. The "zip" utilities combine these two steps.)

R can read in compressed files directly: see unz(). Also, NetCDF (network Common Data Form) and the OpenDAP system are meant to avoid the need to download huge NetCDF files locally, but they are hard to compile and use.
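For example, a gzip-compressed text file can be read without first uncompressing it on disk (the file names below are hypothetical):

# Read a gzipped CSV file directly via a connection.
dat = read.csv(gzfile("big-dataset.csv.gz"))

# Read one file from inside a zip archive with unz().
dat2 = read.csv(unz("archive.zip", "dataset.csv"))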


Dirty data

In practice this is a major issue.

Missing data may produce bias. See later.

Outliers affect the statistical methods being applied. They become increasingly difficult to detect as the dimension (number of variables) goes up.

US Census data reportedly has error rates of up to 20%.

Data preparation is often referred to as "scrubbing the data" and is the most resource-consuming step in the DM process.

More men come to doom through dirty profits than are kept by them. (Sophocles)

Data cleaning

Very importantly, data cleaning/preparation takes up the major component of a typical data mining project.

It typically takes a minimum of 40% of the time, and usually about 75% of the time. It is tedious and boring, and may easily take a few months. Traditionally, it is carried out by hand and is very hard to automate. Data cleaning involves range checking and logic checks.

One of the tasks is to get the data into a format that can be read into a statistical program.

A project needs a domain expert—this is somebody who knows everything (or a group of people who collectively know everything) about the data. Data miners need to talk to the domain experts a lot in order to understand the variables, handle missing values, etc.

Data can be collected actively or passively. The first means the data are checked on entry, and generally results in better quality. Passively collected data were entered (expediently) and never scrutinized, and may be very shocking in quality. Clients often have too high an opinion of their data.

Smoking is a leading cause of statistics.

Missing Data

Missing values can arise for a variety of reasons. They often represent unknown but knowable information. Structural missing data represent values that logically could not have a value, e.g., a male being asked "how many pregnancies have you had?" Missing values are often coded in different ways and sometimes miscoded as zeros. The reasons for the codings and the consistency of the codings must be investigated. Some missing values are informative.

Some analysis strategies are:

1. Complete-case analysis: Also known as the listwise-deletion method, use only the cases (observations) that have complete records in the analysis, e.g., na.omit() in the S language. If the missingness is related to the inputs or the target, then ignoring missing values can bias the results.

In DM, the chief disadvantage of complete-case analysis is practical: even a smattering of missing values in high-dimensional data can cause a huge reduction in data. For example, if 12 cases with 12 inputs contain 9 randomly chosen missing values (6.25%), then in the worst case a complete-case analysis would be able to use only 3 cases.

Complete-case analysis can be okay if the proportion of deleted observations is small relative to the entire data set and if the mechanism that leads to the missing data is independent of the variables in question—this is known as missing at random (MAR) or missing completely at random (MCAR).

[Figure: a cases-by-predictors data matrix with a few scattered missing values marked "?".]

Exercise: Suppose there are n cases and d variables, and that data values are missing at random with probability p. Using the values n = 1000, d = 10 and for p = 0.1 and 0.3, calculate the expected number of complete cases.
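A minimal R calculation for this exercise, under the assumption that values are missing independently so each case is complete with probability (1 − p)^d:

# Expected number of complete cases is n * (1 - p)^d.
n = 1000; d = 10
for (p in c(0.1, 0.3)) {
  cat("p =", p, " expected complete cases =", round(n * (1 - p)^d, 1), "\n")
}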


2. Imputation: Fill the missing values with some reasonable value, and then run the analysis on the full (filled-in) data.

The simplest type of imputation is to replace missing values by the mean (mode for categorical variables) of the complete cases. This is called mean imputation. The method can be refined by using the mean within homogeneous groups of the data.
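A small sketch of mean (and mode) imputation in R, on a hypothetical data frame dat with a numeric column x and a categorical column g:

# Mean imputation for a numeric variable.
dat$x[is.na(dat$x)] = mean(dat$x, na.rm = TRUE)

# Mode imputation for a categorical variable.
mode.g = names(which.max(table(dat$g)))
dat$g[is.na(dat$g)] = mode.g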

Hot-deck imputation is where a missing value is imputed by substituting a value from a similar but complete record in the same data set. Regression imputation is where a missing value is imputed by the predicted value of a regression on the completely recorded data. Multiple imputation fills in the missing values m > 1 times, where the imputed values are generated each time from a distribution that may be different for each missing value; this creates m different data sets, which are analyzed separately, and then the m results are combined to estimate model parameters, standard errors and confidence intervals.

The missing values of categorical variables could be treated as a separate category. For example, type of residence might be: own home, rents home, lives with parents, mobile home, and unknown. This method would be preferable if the missingness is itself a predictor of the target.

3. EM algorithm: This has expectation and maximization steps, and usually runs slowly. The key concept is that of unobserved latent variables. It was proposed by Dempster et al. (1977).

Some technical definitions (Thomas Lumley)†

Suppose we have a regression model with Y and X, plus some additional variables Z. Some components of (X, Y) are missing. Let Ri be the indicator variable that the ith observation is not missing. Then

if Ri ⊥ Yi | Xi then complete-case analysis is valid;

if Ri depends on Yi or Zi then complete-case analysis is typically biased.

Under Rubin's hierarchy: suppose Zi = completely observed variables, Wi = variables with some missing values. Then

MCAR: R ⊥ (Z, W). That is, data are missing independently of both observed and unobserved data. For example, a participant flips a coin to decide whether to complete a questionnaire.

MAR: R ⊥ W | Z. That is, given the observed data, data are missing independently of unobserved data. For example, male participants are more likely to refuse to fill out a questionnaire on depression, but this does not depend on the level of their depression. Another example is accidentally omitting an answer on a questionnaire.

Informative/non-ignorable missingness: everything else. This includes NMAR.

Rubin proved two important results about MAR:

Under MAR, inference based on the likelihood of (W, Z) is valid (the likelihood of R factorizes out).

MAR is a completely untestable assumption: given any joint distribution on (R, W, Z) there is an MAR mechanism that gives rise to the same observed data distributions.

Books on missing data include Little and Rubin (2002), Allison (2002), McKnight et al. (2007), Enders (2010).

There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know. (Donald Rumsfeld, US Secretary of Defense 2001–6)

Sampling

Rather than working with all of the data, one can analyze a sample of the data. While predictive power goes up as the sample size increases, the computational cost goes up too. For other than the most basic statistical procedures, there are no simple formulae for determining a "sufficient" sample size. In some cases, simulation may be possible.

Case-control sampling may be a very effective way for very rare target classes. For example, if out of 14 million transactions 500 are fraudulent, then the 'effective' sample size is closer to 500 than 14 million.

I don't predict. I never have and I never will. (Tony Blair, ex-British Prime Minister)

Confidentiality

Statistical consulting and the analysis of data usually entail confidentiality of the data. With large databases, this may become more important. There have been cases where sensitive information about certain individuals in society has been identified. For example, a database of peoples' incomes, age and race may sound anonymity-preserving, but if you're the only Maori politician aged 58, you can be identified quite quickly, especially with auxiliary information from another data source.

Statistics has been the most successful information science. Those who ignore statistics are condemned to re-invent it. (B. Efron)

Experts

The domain expert understands the particulars about the problem: the relevant background knowledge, context and terminology, and the strengths and weaknesses of the current solution (if there is one).

The data expert understands the structure, size and format of the data.

The analytical expert understands the capabilities and limitations of the (e.g., statistical) methods that may be relevant to the problem.

The embodiment of this expertise might take one, two, three or more people.

Are statisticians normal?

The time aspect

Rapidly changing data may make previously discovered patterns invalid. A statistical model therefore has a shelf-life; its ability to model/classify/cluster/predict etc. erodes over time. As new data are collected, it is important to refit the model or do another analysis. Some models may be OK for a few years; others for a few days.

A statistician can have his head in an oven and his feet in ice, and he will say that on the average he feels fine.

Multiple testing

In data mining one often explores the data from many different angles and using different subsets of the data. This is fine, but one has to be very careful when quoting P-values from these types of 'analyses'. Bonferroni's Theorem warns us that if there are too many possible conclusions to draw, some will be true for purely statistical reasons, with no physical reality.

A Famous Example: David Rhine, a "parapsychologist" at Duke University, USA, in the 1950s tested students for "extrasensory perception" by asking them to guess 10 cards—red or black. He found about 1/1000 of them guessed all 10, and instead of realizing that that is what you'd expect from random guessing, declared them to have ESP. When he retested them, he found they did no better than average. His conclusion: telling people they have ESP causes them to lose it!

Bonferroni's correction

Consider the following table.

Table: National Women's Hospital births data over 1968–1987 broken down by race.

RACE               Females   Males
European           30848     33338
Maori              6674      7130
Pacific Islander   5521      6024
Chinese            449       446
Indian             531       497
Other              361       389

Let pi be the proportion of females for the ith race, i = 1, ..., 6, where i runs down the table. What's interesting? We know biologically that P(female) is less than 0.5. One thing of interest is whether the pi are equal.

If we test H0: p1 − p5 = 0 versus H1: p1 − p5 ≠ 0, then we get a t-statistic of 2.287, yielding a P-value of 0.022. There seems fairly good evidence that Indians are less likely to have boys than Europeans. Or do they?
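This test can be reproduced from the table with a two-sample proportion test in R (no continuity correction); the chi-squared statistic is the square of the quoted z of 2.287:

# Females out of total births: Europeans (row 1) and Indians (row 5).
prop.test(x = c(30848, 531), n = c(30848 + 33338, 531 + 497),
          correct = FALSE)
# Chi-squared is about 5.2 (= 2.287^2), P-value about 0.022.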


In general, when we scan data for "interesting" comparisons, we are in effect performing many (hypothesis) tests. There's nothing wrong with exploring the data; the problem comes when we report P-values.

Suppose we test M hypotheses, and that all of them are true. Suppose the tests are independent. Every time we compute a test statistic we have a 5% chance of getting a significant result. The probability of finding at least one significant effect is

1 − P(no significant effects) = 1 − 0.95^M.

For M = 20 tests this probability is 0.64. That is, although the individual test-wise error rate is 5%, 64% of studies will make at least one erroneous claim under these circumstances. That is, the overall error rate is more than 5%.
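These familywise error rates are easy to verify in R:

# Probability of at least one falsely 'significant' result among M
# independent tests, each done at the 5% level.
M = c(1, 5, 20, 100)
round(1 - 0.95^M, 2)    # 0.05 0.23 0.64 0.99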

There is a conservative correction for multiple testing called the Bonferroni correction. It is simple: if you are potentially doing M tests, just multiply your P-values by M. If the result is still significant, you can feel safe about claiming a significant effect.

Looking at it another way (through confidence intervals rather than through hypothesis testing), rather than using the usual formula

θ̂ ± z(α/2) se(θ̂)    (1)

we use

θ̂ ± z((α/M)/2) se(θ̂).    (2)

This results in wider individual confidence intervals because if each individual confidence interval covers its true value with probability 95% then the probability that all intervals cover their true values will be less than 95%. Thus we need to modify our intervals so that they simultaneously cover their true values with probability 95%.

The Bonferroni correction can be very conservative, i.e., it does not lead to the shortest intervals (that is, it rejects the null hypothesis less often than it 'should' when the null hypothesis is true). In specific circumstances we can do quite a lot better.

This whole subject of multiple testing is very important in data mining. For example, when looking for genes that might cause coronary heart disease we need to firmly understand these ideas.

A mathematician, like a painter or poet, is a maker of patterns. If his patterns are more permanent than theirs, it is because they are made with ideas. (G. H. Hardy, A Mathematician's Apology)

Bonferroni's inequality

Why does the Bonferroni correction 'work'? It is based on the following results.

Bonferroni's inequality for two events A and B is: if P(A) = α and P(B) = β then

P(A ∩ B) ≥ α + β − 1.    (3)

In general, Bonferroni's inequality for M events A1, ..., AM is:

∑_{i=1}^{M} P(Ai) − ∑_{i<j} P(Ai ∩ Aj) ≤ P(∪_{i=1}^{M} Ai) ≤ ∑_{i=1}^{M} P(Ai).    (4)

Let the parameters be θj, and the 100(1 − α/M)% confidence intervals be Ij. Let the events Aj = {θj ∉ Ij} = "interval not ok". Then

P(all M intervals ok) = 1 − P(at least 1 interval not ok)
                      = 1 − P(∪j Aj)
                      ≥ 1 − ∑j P(Aj)
                      = 1 − ∑j α/M
                      = 1 − α.

A recent approach for multiple testing that has received a lot of attention is the method of Benjamini-Hochberg. It comes from a 1995 JRSSB paper and is related to Stein estimation.
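Both the Bonferroni and Benjamini-Hochberg adjustments are available in R through p.adjust(); a tiny illustration with made-up P-values:

# Adjust a vector of raw P-values for multiple testing.
p = c(0.001, 0.012, 0.022, 0.040, 0.300)
p.adjust(p, method = "bonferroni")   # multiply by M, capped at 1
p.adjust(p, method = "BH")           # Benjamini-Hochberg false discovery rate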


Practical versus statistical significance

From Stage 1 you would have learnt that something that is statistically significant does not necessarily mean that it has any practical relevance. This fact is particularly important in data mining where, with a sufficiently large sample, almost any hypothesis test can be rejected.

Exercise: Suppose that in a data set of 3 million people comprising exactly 50% men, 20484 men had a certain type of cancer, while 19901 women had that type of cancer too. Also suppose the data set can be thought of as a random sample from a populous country. Let p1 and p2 be the probability that a man and a woman has that certain type of cancer, respectively. Compute a 99% confidence interval for p1 − p2. Do you reject H0: p1 = p2? Why? What is your H1? Do you think there is any practical significance in this result? Why or why not?
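A starting point for this exercise in R, using prop.test() with conf.level = 0.99 for the 99% interval:

# 20484 of 1.5 million men and 19901 of 1.5 million women have the cancer.
prop.test(c(20484, 19901), c(1500000, 1500000), conf.level = 0.99)
# The difference in proportions is tiny (about 0.0004), yet the P-value is
# small because n is so large: statistically, but arguably not practically,
# significant.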


1.3 Statistical methods used in DM

1.3 Statistical methods used in DM I

These roughly fall under the following headings. The first three of thesefall into the category of directed learning and will be encountered inChapter 5.

1 Classification is a learning function that maps (classifies) a dataitem into one of several predefined classes using the values ofexplanatory variables.

Examples: determining which phone numbers correspond to facsimilemachines, diagnosing a disease based on some symptoms, financialdecision making—buy, sell or hold? signature/character recognition(handwritten).

Use classification trees, neural networks (NNs) etc.

© Thomas Yee (Stat Dept, UoA) Chapter 1: Introduction to Data Mining 79/1182012/2 79 / 118

1.3 Statistical methods used in DM

1.3 Statistical methods used in DM II2 Regression A response variable is modeled as a function of

explanatory variables.

Use linear regression, NNs, regression trees etc.

Regression is taught in detail in STATS 330.

3 Prediction The “future” value of the of the response variable ispredicted using the explanatory variables. Can be implemented in avariety of ways (e.g. using classification or regression).

Example: predicting which customers will leave within the next 6months.

© Thomas Yee (Stat Dept, UoA) Chapter 1: Introduction to Data Mining 80/1182012/2 80 / 118

1.3 Statistical methods used in DM

1.3 Statistical methods used in DM III4 Link Analysis † (aka dependency modeling, affinity grouping) is

used to determine which things go together. That is, finding a modelwhich describes significant dependencies between the variables.

Example: people who buy cigarettes also buy alcohol withprobability p1; people who buy alcohol also buy cigarettes withprobability p2.

5 Clustering is the task of splitting a heterogeneous populationinto a number of more homogeneous subgroups (clusters). It seeks toidentify a finite set of categories or clusters to describe the data.Clustering is an example of unsupervised or undirected learning .

Unlike classification, there are no predefined classes. That is, inclassification, the classes are known a priori , whereas with clusteranalysis, we don’t know how many groups there are, or what they are.It is up to the researchers to try to attach meaning to the clusters.

1.3 Statistical methods used in DM IV

6 Summarization (aka description) involves simply describing what is going on in a complicated database in a way that increases our understanding of the people, products or processes that produced the data. A good description will often suggest an explanation.

Methods include summary statistics (e.g., the five-number summary) on the variables, tables, histograms and other graphs, advanced visualization tools, etc.
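For example, in R the five-number summary and simple tabulations are one-liners (shown on built-in data purely as an illustration):

# Five-number summary (minimum, lower hinge, median, upper hinge, maximum)
fivenum(iris$Sepal.Length)

# Quick per-variable summaries and a simple frequency table
summary(iris)
table(iris$Species)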

Statistics is the art of never having to say you’re wrong. Variance is what any two statisticians are at.


Fictitious example I

Suppose you had data on the following persons, e.g., age, sex, race, education, location of birth, family background, crimes, times of crimes, places of crimes, etc.

Ramon Eduardo Arellano-Felix
Usama Bin Laden
James J. Bulger
Victor Manuel Gerena
Glen Stewart Godwin
James Charles Kopp
Eric Robert Rudolph
Agustin Vasquez-Mendoza
Arthur Lee Washington, Jr.
Donald Eugene Webb

The FBI is offering rewards for information leading to the apprehension of Top Ten Most Wanted Fugitives. Check each fugitive page for the specific amount.

Notice: The official FBI Ten Most Wanted Fugitives list is maintained on the FBI World Wide Web Site. This information may be copied and distributed, however, any unauthorized alteration of any portion of the FBI’s Ten Most Wanted Fugitives posters is a violation of federal law (18 U.S.C., Section 709). Persons who make or reproduce these alterations are subject to prosecution and, if convicted, shall be fined or imprisoned for not more than one year, or both.


Fictitious example II

For each of the six statistical methods listed on the previous pages, can you think of a worthwhile application of the method to these data?

It isn’t that they can’t see the solution. It is that they can’t see the problem. (Gilbert Chesterton)


Fictitious example III

1 Classification Classify everybody into their (country of) whereabouts (on a certain fixed date).

2 Regression Fit a regression model where the response is deserved prison sentence (years) and the explanatory variables are age, convictions, education, etc.

3 Prediction Estimate the probability of each person re-offending within the next 6 months.

4 Dependency modeling Do the data suggest that more educated people tend to be thieves and less educated people tend to be sex offenders?

5 Clustering How many types of criminals are there? Answer: there might be 3 distinct groups: thieves, murderers, and sex offenders.

6 Summarization Mean age, mean prison sentence, percent male broken down by race.

43% of all statistics are worthless.

Learning systems can be

Opaque, i.e., black-box systems, e.g., NNs. These do not explain their learning behaviour. They are hard to interpret.

Transparent, i.e., explainable systems, e.g., trees, linear regression. These provide a detailed account of why a specific operation is performed, and their reasoning is visible.

A knowledge of statistics is like a knowledge of foreign languages or of algebra; it may prove of use at any time under any circumstances. (Arthur L. Bowley)


Allied fields I

DM is an interdisciplinary field, bringing together techniques from machine learning, high-performance computing, parallel algorithms, pattern recognition, applied statistics, visualisation, and databases, with data mining at the hub.


Allied fields II

Knowledge Discovery in Databases (KDD) This is concerned with the extraction of patterns from large databases. KDD is often considered the same as DM. However, DM is considered a single step in the overall discovery process.

EDA (Exploratory Data Analysis) Coined by Tukey in the mid-1960s, this started a revolution in statistics. He defined statistics in terms of a set of problems (as most fields are), rather than a set of tools, namely those problems that pertain to data.

Informatics “is research on, development of, and use of technological, sociological, and organizational tools and approaches for the dynamic acquisition, indexing, dissemination, storage, querying, retrieval, visualization, integration, analysis, synthesis, sharing (which includes electronic means of collaboration), and publication of data such that economic and other benefits may be derived from the information by users of all sections of society.”


Allied fields III

Machine Learning “is the study of computer algorithms that improve automatically through experience. Applications range from data mining programs that discover rules in large data sets, to information filtering systems that automatically learn users’ interests.” [Mitchell, 1997].

Machine learning has the notion that a computer, fed with a number of observations about known, solved cases, can develop a general set of underlying rules that are universally true. It’s like a newcomer trying to learn the rules of rugby by watching a rugby game.

Machine learning is a branch of AI.


Allied fields IV

Pattern Recognition: given some examples of complex signals and the correct decisions for them, make decisions automatically for a stream of future examples.

For example, identify the species of a flowering plant; decide to buy or sell a stock option; classify an X-ray image of a tumour as cancerous or benign.

Pattern recognition has its roots in engineering, and is typically concerned with image classification. Pattern recognition methodology crosses over many areas.

From a statistical perspective, DM can be viewed as a computer automated exploratory data analysis of (usually) large complex data sets. Because of the multidisciplinary nature of DM, there is some confusing terminology. For example, “bias” means different things in statistics, machine learning, and neurocomputing.

Computer: a person doing calculations. (Dictionary definition, 1936)


Applications of DM†

Not all very large databases (VLDB) are commercial. Examples from science and engineering abound. These are usually associated with computer-automated data collection:

Astronomical (sky maps)

Meteorological (weather, pollution monitoring stations)

Satellite remote sensing

High energy physics

Industrial process control

Biological, e.g., human genome project

These kinds of data can also profit (in principle) from DM technology.

A statistician is a person who draws a mathematically precise line from an unwarranted assumption to a foregone conclusion. (Unknown)


Some quotes†

We are drowning in information, but starving for knowledge. (John Naisbitt)

If you’ve got terabytes of data, and you’re relying on data mining to find interesting things in there for you, you’ve lost before you’ve even begun. (Herb Edelstein)

. . . He also forced everyone, small and great, rich and poor, free and slave, to receive a mark on his right hand or on his forehead, so that no one could buy or sell unless he had the mark, which is the name of the beast or the number of his name. This calls for wisdom. If anyone has insight, let him calculate the number of the beast, for it is man’s number. His number is 666. (Revelation 13:16–18)

See the bar code figure in Section 1.5.


1.4 Debunking DM myths I

A great deal of what is said about DM is incomplete, exaggerated or wrong. DM has taken the business world by storm, but as with many new technologies (and gold rushes), there seem to be a lot of myths circulating. The new technology cycle typically goes like this: enthusiasm for an innovation leads to spectacular assertions. Ignorant of the technology’s true capabilities, users jump in without adequate preparation or training. Then, sobering reality sets in. Finally, frustrated and unhappy, users complain about the new technology and urge a return to “business as usual.”

When you undertake a data mining project, avoid a cycle of unrealistic expectations followed by disappointment. Understand the facts instead, and your DM efforts will be successful. Simply put, DM is used to discover patterns and relationships in your data in order to help you make better business decisions.


1.4 Debunking DM myths II

Myth: DM produces surprising results that will utterly transform your business

Fact: Most often, the results of DM yield steady improvement to an already successful organisation. It often contributes important incremental changes rather than revolutionary ones.

Myth: DM techniques are so sophisticated that they can substitute for domain knowledge or for experience in analysis and model building

Fact: No analysis technique can replace experience and knowledge of the business and its markets. On the contrary, DM makes education and experience in many areas more important than ever. An expert in only analytical techniques, without knowledge of the business, is of no help.


1.4 Debunking DM myths III

Myth: DM tools automatically find the patterns you’re looking for, without being told what to do

Fact: DM is most cost-effective when used to solve a particular problem. DM tools are best directed toward a specific goal. For example, simply giving a DM tool a mailing list and expecting it to find customer profiles that improve the efficiency of a direct-mail campaign is not particularly effective. You need to be more specific in your goals. For example, your model might emphasize customers who have previously bought expensive items; or, to increase the number of responses, your model might emphasize customers who have responded to previous mailings.


1.4 Debunking DM myths IV

Myth: DM is useful only in certain areas, such as marketing, sales, and fraud detection

Fact: Virtually any process can be studied, understood and improved using DM. DM is useful wherever data can be collected. Of course, in some instances a cost/benefit calculation might show that the time and effort of the analysis is not worth the likely return. For example, suppose you suspect that if you collected just one more piece of information about your customers, you could double the number of orders you receive. But you also know that mailing to twice as many people would also double the number of orders. If gathering the extra data is more expensive than sending the extra mailings, then it makes sense to increase the mailings rather than mine the data.


1.4 Debunking DM myths V

Myth: Only massive databases are worth mining

Fact: It’s true that many methods used in DM were specifically designed for analyzing very large data sets, and that many DM applications involve massive data sets. But a moderately sized or small data set can also yield valuable information. For example, buying patterns may depend most strongly on the day of the week or the time of the year. A modest database consisting of only “day” and “sales” could show this pattern, give the retailer some idea of its magnitude, and allow for planning of inventory and staffing.

Even when building a massive database, try out some simple analysis on the data while the database is still moderate in size. You may decide to collect the data differently or to collect different data altogether.


1.4 Debunking DM myths VI

Myth: DM is another fad that will soon fade, allowing us to return to standard business practice

Fact: Although the name may change, DM as a vital application will not go away. Companies have been using related quantitative techniques in many parts of their businesses for a long time.

Myth: DM is an extremely complex process

Fact: The algorithms of DM may be complex, but new tools have made those algorithms easier to apply. Often, just the correct application of relatively simple analyses, graphs, and tables can reveal a great deal about your business. Much of the difficulty in applying DM comes from the same data-organisation issues that arise when using any modeling technique. These include data preparation tasks, such as deciding which variables to include and how to encode them, and deciding how to interpret and take advantage of the results.


1.5 Fraud detection

Statistical methods can detect fraud by uncovering patterns (or lack of patterns!) that characterize deliberate deception.

Suppose, for example, you are given the choice of flipping a coin 200 times and recording the results (the full sequence of tosses), or pretending to flip a coin and faking the results. Then, with high probability of being correct, a good statistician can tell which choice you made (how?). Statisticians have the technology to do such things, which have wide implications.
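One standard trick, sketched here in R, is to look at the longest run of consecutive heads or tails: genuine sequences of 200 tosses usually contain longer runs than people are willing to fake.

# Longest run of identical outcomes in a 0/1 sequence
longest.run <- function(x) max(rle(x)$lengths)

# Simulate many genuine sequences of 200 fair coin tosses
set.seed(2012)
genuine <- replicate(10000, longest.run(rbinom(200, 1, 0.5)))

# Distribution of the longest run; a faked sequence rarely contains runs this long
table(genuine)
mean(genuine >= 6)   # proportion of genuine sequences containing a run of 6 or more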

Fraud detection can be achieved using DM methods specialized to this task. One such technique is based on Benford’s Law.

Fraud: the crime of deceiving somebody in order to get money or goods illegally. (Oxford Advanced Learner’s Dictionary, 2000)


Benford’s law I

This was apparently first discovered in 1881 by the astronomer/mathematician S. Newcomb. It started with the observation that the pages of a book of logarithms were dirtiest at the beginning and progressively cleaner throughout. He wrote a two-page paper on it but it went unnoticed. In 1938, a General Electric physicist called F. Benford rediscovered the law from the same observation. Over several years he collected data from sources as different as atomic weights, baseball statistics, numerical data from Reader’s Digest, and drainage areas of rivers. In contrast to Newcomb’s, Benford’s seminal paper drew a great deal of attention, partly due to being adjacent to a soon-to-be-famous physics paper.

Benford’s Law (aka the significant-digit law) is the empirical observation that in many naturally occurring tables of numerical data, the leading significant (nonzero) digit is not uniformly distributed in {1, 2, . . . , 9}. Instead, the leading significant digit (= D, say) obeys the law

P(D = d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d = 1, \ldots, 9.   (5)

This means P(first significant digit = 1) = log10 2 ≈ 0.301, P(first significant digit = 2) = log10(3/2) ≈ 0.176, etc.
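These probabilities are easy to compute directly; a small check in R:

# Benford probabilities for leading digits 1..9, as in equation (5)
d <- 1:9
benford <- log10(1 + 1/d)
round(benford, 3)   # 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046
sum(benford)        # the probabilities add to exactly 1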

Before the curse of statistics fell upon mankind we lived a happy, innocent life, full of merriment and go, and informed by fairly good judgement. (Hilaire Belloc)


Benford’s law III

Figure: Probability function of Benford’s distribution (leading digit 1–9 on the horizontal axis; probability, ranging from 0.00 to 0.30, on the vertical axis).


Benford’s law IV

Question 1: Why should this be?

Mathematicians have shown that the logarithmic distribution is the only distribution that is invariant under changes of scale, e.g., if the original data are multiplied by 2 or π. For example, this means that Benford’s Law holds whether the data are in metric or imperial units. This property was discovered in 1961.

The logarithmic distribution is also the only probability distribution that is invariant under changes of base, e.g., if the original data are converted from base 10 to base 100 and vice versa. For example, this means Benford’s Law holds whether the data are measured in grams or kilograms, bytes or megabytes.

The proofs of all these are difficult. Even though these are two very important properties of (5), they do not fully explain the widespread appearance of the distribution in real data.


Benford’s law V

So then, what is the real reason for Benford’s Law? Probably this one: in 1996 a discovery was made that showed that Benford’s Law corresponds to the “distribution of distributions”. Much of what we measure can be thought of as the outcome of some process or another, e.g., height is caused by genetics and environment. Each of these predictors has a distribution of its own. But if we grab random handfuls of data from a hotchpotch of such distributions, then the overall distribution of D converges to (5). That is, if random samples are taken from random distributions, then the distribution of D in the combined sample will always converge to Benford’s Law.


Benford’s law VI

Question 2: What kinds of data could be expected to follow Benford’s Law?

Two rules of thumb. Firstly, the sample size should be big enough to give the predicted proportions a chance to assert themselves. This rule of thumb should be satisfied within a data mining context! Secondly, the numbers should be free of artificial limits, and allowed to take pretty much any value they please. For example, expecting the prices of 10 different types of beer to conform to Benford’s Law is unrealistic on both counts.

Examples following Benford’s Law include stock market prices, 1990 census populations of the 3141 counties in the US, and numbers appearing in newspapers and old magazines.


Benford’s law VII

Benford’s Law has been generalized to second and higher significant digits. For readable accounts of Benford’s Law, see, e.g., Matthews (1999) and Hill (1999).

Exercise: Show that the probabilities, as defined by (5), add to exactly unity.
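A sketch of one possible route: write each term as a difference of logarithms and let the sum telescope,

\sum_{d=1}^{9} \log_{10}\left(1 + \frac{1}{d}\right)
  = \sum_{d=1}^{9} \left[ \log_{10}(d+1) - \log_{10} d \right]
  = \log_{10} 10 - \log_{10} 1 = 1.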


Implementation I

Benford’s Law has been applied to tax returns (of both individuals and companies) and to experimental scientific data, e.g., medical trials, to detect fraud. In fact, the US Institute of Internal Auditors has started holding training courses on it, hailing it as the biggest advance in the field for years.

As statisticians, we know that we could apply the χ2 test to see if a particular data set conformed to Benford’s Law. Recall, however, that something statistically significant may not be practically significant. As n → ∞, even small differences will become ‘statistically significant’. Therefore, it is beneficial to plot the data and compare it to the figure of Benford’s distribution above. Gross departures could indicate the presence of fraud.
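A minimal R sketch of such a check; the “invoice amounts” below are simulated purely for illustration:

# Leading (first significant, nonzero) digit of positive numbers
leading.digit <- function(x) {
  x <- abs(x[x != 0])
  as.integer(substr(formatC(x, format = "e"), 1, 1))
}

# Simulated invoice amounts spanning several orders of magnitude
set.seed(1)
amounts <- exp(rnorm(2000, mean = 6, sd = 2))

obs <- table(factor(leading.digit(amounts), levels = 1:9))
benford <- log10(1 + 1/(1:9))

# Chi-squared goodness-of-fit test against Benford's Law
chisq.test(obs, p = benford)

# Graphical comparison of observed proportions with Benford's Law
barplot(rbind(observed = obs / sum(obs), Benford = benford), beside = TRUE,
        names.arg = 1:9, legend.text = TRUE)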

With the current exponentially increasing availability of digital data and computing power, the use of subtle and powerful statistical tests for detection of fraud and fabricated data is also certain to increase dramatically. Benford’s Law is only the beginning.

Figure: A specimen bar code.


More on Benford’s law I

If a set of numbers is to conform to Benford’s Law then what is needed is not necessarily a constant growth rate but a geometric sequence when the data are sorted from smallest to largest. That is, each member in the series is larger than the previous one by a fixed ratio. This geometric assumption is the basis of Benford’s Law.

Note that a simple geometric sequence can be written

S_n = a \, r^{n-1}   (6)

where a is the first value of the sequence, r is the common ratio, and n denotes the nth term. Each successive value is a fixed percentage increase over the value before it.
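A quick illustrative check in R: the leading digits of a long geometric sequence approximately follow the Benford probabilities.

# Geometric sequence S_n = a * r^(n-1), as in equation (6)
a <- 3
r <- 1.07            # an arbitrary illustrative common ratio (7% growth per step)
s <- a * r^(0:999)

# Compare leading-digit proportions with Benford's Law
lead <- as.integer(substr(formatC(s, format = "e"), 1, 1))
round(cbind(observed = as.vector(table(factor(lead, levels = 1:9))) / length(s),
            Benford  = log10(1 + 1/(1:9))), 3)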


More on Benford’s law II

What data sets should conform to Benford’s Law?

1 The data set should describe the sizes of similar phenomena, e.g., the populations of towns and cities, the areas of lakes, the lengths of rivers, the heights of mountains, the market values of companies on the NYSE, the revenues of companies on the NYSE, inventories of companies on the NYSE, daily sales volumes of companies on the London Stock Exchange.

2 There should be no built-in minimum or maximum values in the data set. Examples of such limits are a $10 (commission) charge that might be made for buying something like shares or currency, or the maximum tax deduction for moving expenses in the USA. A built-in minimum of 0 is acceptable.

3 The data set should not be made up of assigned numbers, e.g., numbers given to things in place of words: IRD numbers, bank account numbers, car licence plate numbers, telephone numbers.

More on Benford’s law III

4 The data set should have more small items than big items, e.g., there are more towns than cities, more small companies than big ones, more small lakes than big lakes.

Research has shown that the numbers in data sets should have at least four digits for a good fit with Benford’s Law. If there are fewer digits then there is a slightly larger bias in favour of the lower digits.

Of course, a large data set is also required for a close fit to the Benford distribution.


1.6 References†

Allison, P. D., 2002. Missing Data. Sage Publications, Thousand Oaks, CA, USA.

Anonymous, A., 2005. Introduction to Data Mining and Knowledge Discovery, 3rd Edition. Two Crows Corporation.

Anonymous, A., 2008. Statistical Analysis and Data Mining. URL http://www.interscience.wiley.com/SAM

Benjamini, Y., Hochberg, Y., 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 (1), 289–300.

Berry, M. J. A., Linoff, G. S., 2004. Data Mining Techniques for Marketing, Sales, and Customer Support, 2nd Edition. Wiley, Indianapolis.


Berry, M. W., Browne, M. (Eds.), 2006. Lecture Notes in Data Mining. World Scientific, Singapore.

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer, New York, USA.

Bolton, R. J., Hand, D. J., 2002. Statistical fraud detection: a review. Statist. Sci. 17 (3), 235–249.

Bramer, M. A., 2007. Principles of Data Mining. Springer, New York, USA, electronic resource.

Cios, K. J., 2007. Data Mining: A Knowledge Discovery Approach. Springer, New York, USA.

Coy, P., 1997. He who mines data may strike fool’s gold. Business Week, June 16 issue, 44.


Dempster, A. P., Laird, N. M., Rubin, D. B., 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39, 1–38.

Enders, C. K., 2010. Applied Missing Data Analysis. Guilford Press, New York, USA.

Fayyad, U. M. (Ed.), 1996. Advances in Knowledge Discovery and Data Mining. AAAI Press & MIT Press, Menlo Park, CA, USA.

Fewster, R. M., 2009. A simple explanation of Benford’s law. The American Statistician 63 (1), 26–32.

Han, J., Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco, CA, USA.

Hand, D., Mannila, H., Smyth, P., 2001. Principles of Data Mining. MIT Press, Cambridge, MA, USA.


Hastie, T. J., Tibshirani, R. J., Friedman, J. H., 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Edition. Springer-Verlag, New York, USA.

Hill, T. P., 1999. The difficulty of faking data. Chance 12 (1), 27–31.

Izenman, A. J., 2008. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, New York, USA.

Kantardzic, M., 2003. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Press, Piscataway, NJ, USA.

Kargupta, H. (Ed.), 2004. Data Mining: Next Generation Challenges and Future Directions. AAAI Press, Menlo Park, CA, USA.

Larose, D. T., 2006. Data Mining Methods and Models. Wiley-Interscience, Hoboken, NJ, USA.


Little, R. J. A., Rubin, D. B., 2002. Statistical Analysis with Missing Data, 2nd Edition. Wiley, New York, USA.

Matthews, R., 1999. The power of one. New Scientist 163 (2194), 27–30.

McCue, C., 2007. Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis. Butterworth-Heinemann, Boston, MA, USA.

McKnight, P. E., McKnight, K. M., Sidani, S., Figueredo, A. J., 2007. Missing Data: A Gentle Introduction. Guilford Press, New York, USA.

Nigrini, M. J., 2000. Digital Analysis Using Benford’s Law: Tests & Statistics For Auditors, 2nd Edition. Global Audit Publications, Vancouver, BC, Canada.


Nisbet, R., Elder IV, J., Miner, G., 2009. Handbook of Statistical Analysis and Data Mining Applications. Elsevier.

Pearson, R. K., 2005. Mining Imperfect Data: Dealing with Contamination and Incomplete Records. SIAM, Philadelphia.

Pyle, D., 1999. Data Preparation for Data Mining. Morgan Kaufmann, San Francisco, CA, USA.

Pyle, D., 2003. Business Modeling and Data Mining. Morgan Kaufmann, San Francisco, CA, USA.

Roiger, R. J., Geatz, M. W., 2003. Data Mining: a Tutorial-based Primer. Addison Wesley, Boston, MA, USA.

Tan, P.-N., Steinbach, M., Kumar, V., 2005. Introduction to Data Mining. Pearson Addison Wesley, Boston, MA, USA.


Torgo, L., 2010. Data Mining with R. CRC Press, Boca Raton, FL, USA.

Webb, A., 2002. Statistical Pattern Recognition, 2nd Edition. Arnold, London.

Weiss, S. M., Indurkhya, N., 1998. Predictive Data Mining: a Practical Guide. Morgan Kaufmann, San Francisco, CA, USA.

Witten, I. H., Frank, E., 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Amsterdam; Boston, MA, USA.
