
ISSUES IN MINING SURVEY DATA

A Project Report

Submitted to the Department of Computer Science

In Partial Fulfillment of the Requirements

for the Degree of

Master of Science

in

Computer Science

University of Regina

By

Syed Uzair Ahmed Bahelvi

Regina, Saskatchewan

May 4, 2005

© Copyright 2005: Syed Uzair Ahmed Bahelvi


Abstract

A survey is a system for collecting information from or about people to describe, compare, or explain their knowledge, attitudes, and behavior. Today, billions of dollars are spent annually collecting survey data, and surveys are everywhere. The extensive use of surveys has led to the collection of a huge amount of survey data. This massive amount of data is of little use until it is analyzed and useful knowledge is extracted from it. For decades, statistics has been used as a tool for analyzing survey data. Hence, survey data are usually collected and recorded in a format suitable for statistical analysis. Recently, data mining has emerged as an active field for analyzing data. This has led survey data organizers to record data in a format helpful for data miners. Great care is taken while collecting and recording data, but a few problems still remain for data miners. In this project, we present a few problems faced while mining survey data for association rules and present a solution to these problems.

The purpose of a survey can range from describing some phenomenon to establishing relationships between variables. In this project, we concentrate on bivariate associations, that is, on determining whether two variables are associated. We show how crosstabulations, a bivariate analysis technique, can be used to determine bivariate associations. We show that the same information can be obtained from association mining. We then show how association mining becomes more advantageous than crosstabulations as the number of variables for bivariate analysis increases.


Acknowledgments

I would like to express my sincere thanks to my supervisor, Dr. Robert Hilderman, for his advice, assistance, and financial support. Without his encouragement, this work would not have been accomplished. I would also like to thank the Faculty of Graduate Studies and Research and the Department of Computer Science for financial support, my friends and relatives for their moral support, and my parents for making me capable of doing this type of work. Finally, I would like to thank my brother and my sister.


Post Acknowledgments

I would also like to thank the internal examiners, Dr. Samira Sadaoui and Dr. Lisa Fan, for their valuable comments and suggestions on my project report.


Contents

Abstract
Acknowledgments
Post Acknowledgments
Table of Contents
List of Tables
List of Figures

Chapter 1 Introduction
    1.1 Overview of Knowledge Discovery in Databases
    1.2 Overview of Surveys
    1.3 Overview of Survey Data Analysis
    1.4 Objectives of the Project
    1.5 Contributions of the Project
    1.6 Organization of the Project Report

Chapter 2 Background
    2.1 Overview of Association Rule Mining
    2.2 Key Issues of Surveys
        2.2.1 Purposes of Surveys
        2.2.2 Types of Surveys
        2.2.3 Designs of Surveys
        2.2.4 Sample Selection in a Survey
        2.2.5 Type of Information Collected in a Survey
        2.2.6 Survey Data Processing
        2.2.7 Types of Variables in a Survey
        2.2.8 Noise in Survey Data
    2.3 Conclusion of the Chapter

Chapter 3 Data Preparation for Association Rule Mining
    3.1 Features of the CCHS Data
    3.2 Survey Data Mining Problems
    3.3 PreAssociation Algorithm
    3.4 PreAssociation Software
        3.4.1 Block Diagram of PreAssociation Software
        3.4.2 Features of PreAssociation Software
    3.5 IBM’s Intelligent Data Miner
    3.6 Conclusion of the Chapter

Chapter 4 Crosstabulations and Association Rules
    4.1 Overview of Crosstabulations
        4.1.1 The Structure of a Crosstabulation
        4.1.2 Interpreting a Crosstabulation
    4.2 Crosstabulations Vs Association Rules
    4.3 Conclusion of the Chapter

Chapter 5 Experimental Results
    5.1 Crosstabulations Method to find Associations
    5.2 Association Mining Method to find Associations
    5.3 Interrelation between Crosstabulations and Association Rules
    5.4 Comparison between Crosstabulations and Association Mining
    5.5 Conclusion of the Chapter

Chapter 6 Conclusion and Future Work

List of Tables

2.1 An Example Survey Dataset
2.2 Variables contained in the Example Survey Dataset and the Associated Domain Values
3.1 Survey Data in the Raw Format
3.2 Description of the Survey Data Variables
3.3 Description of the Survey Data Variable Values
3.4 Survey Data in Market Basket Format
3.5 Full Form of PreAssociation Block Diagram Labels
4.1 A simple Crosstabulation
4.2 Crosstabulation with Row, Column and Total Percentages
5.1 Crosstabulation for Sex versus HDI
5.2 Association Rules for Sex and HDI
5.3 Correspondence between Association Rules and Crosstabulations-Part1
5.4 Correspondence between Association Rules and Crosstabulations-Part2
5.5 Crosstabulation for Sex versus HDI
5.6 Crosstabulation for Sex versus ToD
5.7 Crosstabulation for HDI versus ToD
5.8 Association Rules part1
5.9 Association Rules part2

List of Figures

1.1 KDD Process
3.1 PreAssociation Algorithm
3.2 Block Diagram of PreAssociation Software
3.3 File Breaker
3.4 Data File Tracker
3.5 Data Reader
3.6 Description Reader
3.7 Print Market Basket
3.8 Association Rules Generated by Intelligent Miner

Chapter 1

Introduction

Recently, data mining has been recognized as a key research topic by many researchers and as a powerful data analysis tool in many application areas. One such application is survey data analysis. For decades, statistical tools have remained the only tools and techniques for survey data analysis. For this reason, survey data are usually created and recorded in a format suitable for statistical analysis. However, the recent trend toward data mining has led survey data organizers to create data suitable for data miners. Although a lot of care is taken while collecting and recording survey data, a few issues still exist for data miners. Overcoming such issues is one of the steps of knowledge discovery in databases. In this chapter, we present an introduction to knowledge discovery in databases, survey data, and survey data analysis.

In Section 1.1, we present an overview of knowledge discovery in databases. Section 1.2 presents an overview of surveys. Section 1.3 presents an overview of survey data analysis, the factors that affect it, and its stepwise procedure. In Section 1.4, we present the objectives of our project, and in Section 1.5, its contributions. Finally, Section 1.6 presents an overview of the organization of the project report.

1.1 Overview of Knowledge Discovery in Databases

Data mining, which is also referred to as Knowledge Discovery in Databases (KDD), is defined as a nontrivial process of identifying valid, novel, potentially useful, and understandable patterns in data [20, 25, 10]. Here, data implies a set of facts (for example, records in a database), and patterns implies an expression in some language describing a small subset of the data. Nontrivial implies that KDD is not a simple computation; rather, it involves some search or inference. The term process implies that KDD comprises many steps. Valid implies that the discovered patterns should hold on new data with some degree of certainty, and novel implies that the discovered knowledge should not be obvious, that is, it should be new information not already known. Potentially useful implies that the KDD process should provide some benefit to the user or the system, that is, it should help solve some problem or provide some useful knowledge or direction. Finally, understandable implies that the results obtained should be easy for humans to interpret.

As shown in Figure 1.1, KDD is an iterative process that consists of a number of steps. Some of the commonly used steps [20, 12, 19, 10] are as follows:

• The goal identification step requires an understanding of the application domain and the relevant prior knowledge, followed by identification of the goal of the KDD process from the user’s point of view.
• The data selection step creates a target dataset, that is, some subset of data records and/or data variables that will be mined.
• The data cleaning and preprocessing step includes removal of noise (unwanted or erroneous data), gathering information pertaining to noise, and strategies to handle missing data.
• The data reduction and transformation step reduces the volume of the data through dimensionality reduction and transformations to find invariant representations of the data.
• The data mining step specifies the data mining algorithm to be used for discovering patterns and generates a particular representational form such as clusters, classification trees, or association rules.
• The interpretation and evaluation step involves visualization of the discovered patterns and verification of these patterns against the expected results.
• The application step applies the newly discovered and verified knowledge to some problem domain, such as a decision-making process, or uses it to cross-verify the knowledge obtained from some other system.

Figure 1.1: KDD Process (Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge)

The steps of the knowledge discovery process are interdependent. That is, the task to be performed in a certain step depends to some extent on the task to be performed in some other step. For example, given a database D, if we decide to extract the knowledge from D in the form of association rules (a decision that belongs to the goal identification step), then the task to be performed in the data mining step is association mining. If D is not already available in the required format, the data transformation step has to transform D into the format required for association mining.
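As an illustration of such a transformation, the following is a minimal Python sketch, assuming each survey record is held as a dictionary of variable names and values. The function name `to_items`, the `Variable(Value)` item labels, and the two sample records are hypothetical and chosen only for illustration; this is not the preprocessing algorithm developed later in this report.

```python
def to_items(record):
    """Turn one survey record into a set of 'Variable(Value)' items."""
    return {f"{var}({val})" for var, val in record.items()}


# Hypothetical records, for illustration only.
D = [
    {"Sex": "Male", "ToD": "Regular"},
    {"Sex": "Female", "ToD": "Occasional"},
]

# Each record becomes one transaction (a set of items) for association mining.
transactions = [to_items(record) for record in D]

for transaction in transactions:
    print(sorted(transaction))
```

Each record becomes one transaction whose items pair a variable with its value, which is the general shape of input that association mining expects.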

1.2 Overview of Surveys

A survey is a system for collecting information from or about people to describe, compare, or explain their knowledge, attitudes, and behavior [3, 11, 2, 6]. Today, billions of dollars are spent annually collecting survey data, and surveys are everywhere. Surveys are now commonly used by both public and private organizations to collect information. Governments, political parties, private corporations, trade unions, community groups, health researchers, and university researchers are all prime users of survey knowledge. The extensive use of surveys has led to the collection of a huge amount of survey data. This massive amount of data is of little use until it is analyzed and useful knowledge is extracted from it.

The routine of survey data analysis may not be difficult, but guiding it properly, along with the accompanying interpretation, requires familiarity with the background of the survey and with all its stages [11]. Some of the key issues of surveys to become familiar with are as follows [11, 9, 2]:

• The purpose of a survey can range from describing some phenomenon, to establishing relationships between variables, to providing information about what changes can be introduced to produce a desired change in something else.
• The sample population is a subset of the population that is treated as representative of the entire population.
• The types of surveys are distinguished by the medium through which they are conducted: (a) mail or postal surveys, (b) group-administered or hand-delivered questionnaires, (c) face-to-face interviews, and (d) telephone interviews.
• The survey design can be longitudinal, experimental, or cross-sectional, depending upon the time at which the measurements of the sample population are taken.
• The type of information collected largely depends upon the purpose of the survey and the industry conducting the survey. The information ranges from lifestyle habits, to households, to participation in elections.
• Data processing is the final step of the survey system and the first step towards analyzing survey data. In this step, the answers from the survey respondents are translated and transferred into a computerized data file.


1.3 Overview of Survey Data Analysis

Once the data have been collected, they are analyzed to obtain useful information. Three factors affect how the data are analyzed [4]: the number of variables being examined (one variable, two variables, or three or more variables), the level of measurement of the variables (nominal, ordinal, or interval/ratio), and the purpose of the data analysis (descriptive or inferential).

Depending upon the above three factors, the process of data analysis can be defined stepwise as follows [4]:

• Determine the number of variables in question.
• Depending upon the number of variables, choose the appropriate method of analysis, that is:
    – One variable → Univariate analysis
    – Two variables → Bivariate analysis
    – Three or more variables → Multivariate analysis
• Depending upon the level of measurement of the variables, choose an appropriate univariate, bivariate, or multivariate technique. Following are some of the survey data analysis techniques:
    – Univariate → Frequency distributions
    – Bivariate → Crosstabulations, Scattergrams, Regression, Rank order correlation, Comparison of means
    – Multivariate → Conditional tables, Partial rank order correlation, Multiple and partial correlation, Multiple and partial regression, Path analysis
• Depending upon the data analysis technique, choose appropriate descriptive statistics and, if required, appropriate inferential statistics.
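The first two steps of this procedure can be sketched as a simple lookup. The following Python fragment is only an illustration: the names `analysis_type` and `TECHNIQUES` are hypothetical, and the sketch deliberately leaves out the level of measurement, which the text says also influences the choice of technique.

```python
def analysis_type(num_variables):
    """Map the number of variables examined to the kind of analysis."""
    if num_variables < 1:
        raise ValueError("at least one variable is required")
    if num_variables == 1:
        return "univariate"
    if num_variables == 2:
        return "bivariate"
    return "multivariate"  # three or more variables


# Candidate techniques as listed in the text, keyed by kind of analysis.
TECHNIQUES = {
    "univariate": ["frequency distributions"],
    "bivariate": [
        "crosstabulations",
        "scattergrams",
        "regression",
        "rank order correlation",
        "comparison of means",
    ],
    "multivariate": [
        "conditional tables",
        "partial rank order correlation",
        "multiple and partial correlation",
        "multiple and partial regression",
        "path analysis",
    ],
}

kind = analysis_type(2)
print(kind, TECHNIQUES[kind])
```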

1.4 Objectives of the Project

The objectives of this project are:

• To prepare survey data for generating human-interpretable association rules, and
• To show that the same information can be obtained from association mining as from crosstabulations, and to show that association mining is advantageous over crosstabulations as the number of variables for bivariate analysis increases.

Realizing these objectives consists of two steps:

• To preprocess survey data for association mining.
• To generate association rules and crosstabulations on survey data and compare them.

1.5 Contributions of the Project

This project makes the following contributions to the field of survey data analysis:

• We develop and implement an algorithm to preprocess the survey data for association mining and also address the data selection step of the KDD process. While doing so, we discuss the various issues pertaining to the format in which survey data are typically collected and recorded. Survey data are not provided in the format required for association mining. Therefore, before association mining can be applied, the survey data need to be preprocessed and transformed into the required format.
• We generate association rules and build crosstabulations on the survey data and compare them. We show theoretically that the same information can be obtained from the association rules as from the crosstabulations. We show that bivariate associations between n variables can be obtained in just one run of association mining, compared to building n×(n-1)/2 crosstabulations. We then run experiments to validate this claim.
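The count n×(n-1)/2 is the number of unordered pairs of variables, since each crosstabulation relates exactly two variables. A minimal Python sketch of this count follows; the helper name `crosstab_count` is hypothetical, and the four variable names are taken from the example survey dataset used later in this report.

```python
from itertools import combinations


def crosstab_count(n):
    """Number of pairwise crosstabulations needed for n variables."""
    return n * (n - 1) // 2


# Variable names from the example survey dataset (Table 2.1).
variables = ["Sex", "HDI", "NCC", "ToD"]

# Every unordered pair of variables needs its own crosstabulation.
pairs = list(combinations(variables, 2))

for first, second in pairs:
    print(f"{first} versus {second}")
print(len(pairs), crosstab_count(len(variables)))
```

For the four variables above this gives six crosstabulations, and the number grows quadratically as more variables are added.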

1.6 Organization of the Project Report

The remainder of this report is organized as follows. In Chapter 2, we present a general overview of association rule mining and highlight its characteristics. We then present an overview of the key issues of surveys that are required for survey data analysis.

In Chapter 3, we present the problems faced while generating association rules from survey data. We develop and implement an algorithm to preprocess the survey data for association mining. We then present the functions of our software components and highlight the features of our software. Using IBM’s Intelligent Data Miner [8], we show how to generate association rules from the preprocessed survey data.

In Chapter 4, we present an overview of crosstabulations and show how the association rules correspond to the information given by crosstabulations. We then describe theoretically how association rules are advantageous over crosstabulations as the number of variables being examined increases.

In Chapter 5, we present and discuss the experimental results to validate our theoretical justifications.

In Chapter 6, we conclude and provide a summary of our work.


Chapter 2

Background

In surveys, information is collected from a sample of people who have been selected to represent a larger population, and the resulting information is then generalized back to the larger population. In this way, three types of knowledge can be gained from surveys [11]: (a) an accurate description of how attitudes and behaviors are distributed in the population, (b) analysis of the associations among attitudes and behaviors, and (c) clues to cause-and-effect relationships. To obtain the required knowledge from a survey, the data collected in the survey have to be processed and analyzed using an appropriate technique. Association rule mining is a data mining technique used to find the associations between various items in the given data. The association rules reveal the existing dependencies in the data, that is, cases where the presence of one item or a collection of items implies the presence of some other item or collection of items. This in turn gives us deeper insight into the data and helps us make better decisions.

In Section 2.1, we present an overview of association rule mining by reviewing the significant previous work done in this area [14, 23, 1]. In Section 2.2, we present an overview of the key issues of surveys that are important for survey data analysis.


2.1 Overview of Association Rule Mining

In this section, we present the formal definition of association mining, the description of the association rule support, the description of the association rule confidence, and the description of the association rule lift.

• Problem Definition: Formally, association mining has been defined as follows [14, 23, 1, 10, 18, 24]: Let I = {i1, i2, ..., im} be a set of m distinct literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T⊂I. A transaction T is said to contain X if and only if X⊂T. An association rule is an implication of the form X→Y, where X⊂I, Y⊂I, and X∩Y = Ø. X is called the antecedent or rule body, and Y is called the consequent or rule head of the rule. The rule body expresses the condition that must be satisfied for the rule head to become true. A set of items (such as the rule body or the rule head) is called an itemset. So, an association rule also expresses associations between itemsets. The association rule X→Y has support s in transaction set D if s% of the transactions in D contain X∪Y. The rule X→Y holds with confidence c in transaction set D if c% of the transactions in D that contain X also contain Y. In other words, the confidence of an association rule X→Y measures the conditional probability of Y given X, denoted by P(Y|X). The lift of an association rule indicates the increase in probability of the “then” (consequent) part given the “if” (antecedent) part; that is, lift helps us determine whether the antecedent part of a rule (rule body) has a positive effect on the consequent part of the rule (rule head), a negative effect, or no effect at all.


To give a better idea of the support, confidence, and lift measures, we demonstrate them with the help of an example. Since the theme of this paper is to demonstrate the potential use of data mining on survey data, a sample survey database is used in the example instead of the typical “market basket” transactional database that is usually described in previous work [14, 23, 1]. Hence, the terminology relevant to survey data will be used rather than that associated with transactional data. That is, an item will be referred to as a variable, an itemset as a variableset, a transaction as a record, and a TID (Transaction Identification Number) as a record number.

Record Number  SEX     HDI        NCC  ToD
1              Male    Excellent  0    Regular
2              Male    Excellent  0    Regular
3              Female  Good       2    Occasional
4              Male    Very Good  1    Former
5              Female  Good       2    Occasional
6              Male    Good       1    Regular
7              Male    Excellent  0    Former
8              Female  Fair       3    Never
9              Female  Good       2    Occasional
10             Female  Poor       5+   Former
11             Female  Poor       5+   Occasional
12             Male    Fair       2    Former
13             Female  Good       2    Never
14             Female  Very Good  1    Regular
15             Female  Fair       3    Former

Table 2.1: An Example Survey Dataset

Consider the example survey dataset containing 15 records, shown in Table 2.1, representing population health survey data. The Record Number variable contains the unique identification number given to each survey respondent. Each of the four remaining variables contains the responses to a series of survey questions for each respondent. For example, the SEX variable corresponds to the query “What is your gender?” Similarly, the HDI (Health Description Index), NCC (Number of Chronic Conditions), and ToD (Type of Drinker) variables correspond to the queries “Which category would you consider to represent your health?”, “How many chronic conditions do you exhibit?”, and “How often do you drink?”, respectively. For example, the person identified by Record Number 3 is of ‘Female’ Sex, exhibits a ‘Good’ Health Description Index, has ‘2’ chronic conditions, and is an ‘Occasional’ Type of Drinker.

SEX       HDI                         NCC                             ToD
(Gender)  (Health Description Index)  (Number of Chronic Conditions)  (Type of Drinker)
Male      Excellent                   0                               Regular
Female    Very Good                   1                               Occasional
          Good                        2                               Former
          Fair                        3                               Never
          Poor                        4
                                      5+

Table 2.2: Variables contained in the Example Survey Dataset and the Associated Domain Values

Let I = {Sex, HDI, NCC, ToD} be a set of 4 variables. Let D = {records 1 to 15 of Table 2.1} be a set of records over I, where each record contains a set of variable values. For instance, in Table 2.1, record 1 = {Male Sex, Excellent HDI, 0 NCC, Regular ToD}. Referring to records 1, 2 and 6 of Table 2.1, we obtain a rule saying “if Sex = Male then ToD = Regular”. This rule has only one variable in the rule body (Sex = Male) and one in the rule head (ToD = Regular), and can be represented as: Sex(Male) → ToD(Regular). Similarly, referring to records 1, 2, 3 and 9 of Table 2.1, we obtain the following association rules that have a variableset in their rule head and/or rule body: Sex(Male) + HDI(Excellent) → ToD(Regular), and Sex(Female) + HDI(Good) → NCC(2) + ToD(Occasional). The number of variables in the rule body and rule head together form the length of the association rule. Hence, of the above three rules, the first is of length 2, the second is of length 3, and the third is of length 4.

• Rule Support: Each variableset in a set of records D has an associated measure of statistical significance called support [25, 23, 1]. For a variableset X⊂I, support(X) = s if the fraction of records in D containing X is s. That is, support(X) is the number of records containing variableset X divided by the total number of records. For example, from Table 2.1, the support for the variableset “Sex(Male) + HDI(Excellent)” is 3/15 = 0.2 or 20%, which comes from records 1, 2 and 7. This can also be written as support(Sex(Male) + HDI(Excellent)) = 20%.

In a given set of records, the support of a rule is always equal to the support of the variableset that consists of the variables in the rule body and the variables in the rule head. For instance, the support for the rule “Sex(Male) + HDI(Excellent) → ToD(Regular)” is equal to the support for the variableset “Sex(Male) + HDI(Excellent) + ToD(Regular)”.

• Rule Confidence: A given rule has a measure of its strength called the

confidence [25, 23, 1]. The rule X → Y in the set of records D has confidence c,

represented as conf (X → Y ) = c, if c% of the records in D that contain X also

contain Y. For example, from Table 2.1, conf (Sex (Male) + HDI (Excellent)

→ ToD (Regular)) = 2/3 ≈ 0.67 or 67%.

Confidence can also be calculated as conf (X → Y ) = support(X ∩ Y )/

support(X ). It is often desirable to concentrate on rules with high values of

support and confidence. Such rules with strong support and high confidence

are called strong rules [23].

• Rule Lift: Lift is the ratio of confidence to expected confidence [13, 15] and

expected confidence can be understood as follows: If two variablesets X and Y

are statistically independent, the support for records containing both the vari-

ablesets X and Y is equal to the product of the support for variableset X and

the support for variableset Y, i.e., support(X∩Y ) = support(X ) × support(Y ).

The confidence obtained for such records containing statistically independent

variablesets is called expected confidence. From the definition of confidence,

we have conf (X→Y ) = support(X∩Y )/support(X ). Further, assuming statistical

independence, expected confidence can be written as exp_conf (X→Y ) =

support(X ) × support(Y )/support(X ) = support(Y ). In other words, expected

confidence means, “confidence, if the presence of variableset X does not enhance

the presence of variableset Y”, which according to the above equation is equal

to the support of the variableset Y.

Now, lift can be defined as the factor by which confidence exceeds expected

confidence, that is, lift(X→Y ) = conf (X→Y )/exp_conf (X→Y ) = conf (X→Y )/

support(Y ). For example, referring to records 1 and 2 of Table 2.1, we have

lift(Sex(Male) + HDI(Excellent)→ ToD(Regular)) = conf (Sex(Male) + HDI(Excellent)

→ ToD(Regular)) / support(ToD(Regular)) = (2/3)/(4/15) = 5/2 = 2.5.

A lift value greater than one is considered high and indicates a strong

association between variables or variablesets, that is, a high lift value indicates

a positive relation between the rule head and the rule body. A lift value less

than one is considered low and indicates a negative relation between rule head

and rule body, and a lift value of one is considered neutral. Thus the above rule

tells us that Male Sex and Excellent HDI have a positive effect on the person

being a Regular ToD, that is, a person of Male Sex and Excellent HDI is

more likely than an average respondent to be a Regular ToD because the lift value is

greater than one.
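
The three measures can be checked with a short Python sketch. It assumes that Table 2.1 holds the same 15 records as the example data of Table 3.1, decoded with the value labels of Table 3.3; this assumption reproduces every figure quoted above. The names `ROWS`, `support`, `confidence` and `lift` are illustrative and not part of the original report. What the text writes as support(X ∩ Y ) is computed here as the fraction of records that contain all items of X and all items of Y.

```python
from fractions import Fraction

NAMES = ("Sex", "HDI", "NCC", "ToD")

# Assumed content of Table 2.1: records 1..15, decoded from Table 3.1.
ROWS = [
    ("Male", "Excellent", "0", "Regular"),
    ("Male", "Excellent", "0", "Regular"),
    ("Female", "Good", "2", "Occasional"),
    ("Male", "Very Good", "1", "Former"),
    ("Female", "Good", "2", "Occasional"),
    ("Male", "Good", "1", "Regular"),
    ("Male", "Excellent", "0", "Former"),
    ("Female", "Fair", "3", "Never"),
    ("Female", "Good", "2", "Occasional"),
    ("Female", "Poor", "5+", "Former"),
    ("Female", "Poor", "5+", "Occasional"),
    ("Male", "Fair", "2", "Former"),
    ("Female", "Good", "2", "Not Applicable"),
    ("Female", "Very Good", "1", "Regular"),
    ("Female", "Fair", "3", "Former"),
]

# Each record becomes a set of (variable, value) items.
RECORDS = [set(zip(NAMES, row)) for row in ROWS]


def support(items):
    """Fraction of records that contain every item of the variableset."""
    items = set(items)
    hits = sum(1 for record in RECORDS if items <= record)
    return Fraction(hits, len(RECORDS))


def confidence(body, head):
    """Support of body and head together, divided by support of the body."""
    return support(set(body) | set(head)) / support(body)


def lift(body, head):
    """Confidence divided by expected confidence, i.e. support of the head."""
    return confidence(body, head) / support(head)


body = {("Sex", "Male"), ("HDI", "Excellent")}
head = {("ToD", "Regular")}

print(support(body))           # 1/5, i.e. 3/15 or 20%
print(confidence(body, head))  # 2/3
print(lift(body, head))        # 5/2, i.e. 2.5
```

Exact fractions are used instead of floating-point numbers so that the results can be compared directly with the hand calculations in the text.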

2.2 Key Issues of Surveys

Once the survey data has been collected it has to be analyzed. To properly guide

the survey data analysis process, we have to be familiar with the background of the

survey and with all its stages [11]. In this section, we briefly present the key issues

of surveys [9, 2, 5].

In Section 2.2.1, we present a few common purposes of surveys. In Section 2.2.2,

we present the various types of survey. In Section 2.2.3, we present the different

designs of surveys. In Section 2.2.4, we present the advantages of sample selection

in a survey. In Section 2.2.5, we briefly present the type of information collected

in a survey. In Section 2.2.6, we present the processing of data after surveys are

conducted. In Section 2.2.7, we present the types of variables in survey data. In

Section 2.2.8, we define noise and briefly present how noise gets collected in survey

data.

2.2.1 Purposes of Surveys

There are many purposes for which a survey is conducted. Some of the common

ones are as follows [11]:

• To describe some phenomenon.

• To establish some relationships between variables.

• To describe how attitudes and behaviors are distributed in the population.

• To analyze the associations among attitudes and behaviors.

• To find the clues to cause and effect relationships.

2.2.2 Types of Surveys

Survey respondents can provide answers either orally or in writing. Oral responses

come in interviews that can be conducted either in person or over the telephone.

Written responses come from self-administered questionnaires that are distributed

either through the mail or by delivery. Survey research involves either direct, per-

sonal contact with respondents or indirect, less personal contact. Interviews pro-

vide for a two-way exchange of information and thus the closest contact. In con-

trast, self-administered questionnaires are impersonal since they allow no dialogue be-

tween respondents and researchers. Based on the medium through which the surveys

are conducted, they are divided into [9]: (a) mail or postal surveys, (b) group-administered

or hand-delivered questionnaires, (c) face-to-face interviews and (d)

telephone interviews.

2.2.3 Designs of Surveys

Three important survey designs are described in [9, 4]:

• Longitudinal: Designs where measurements are taken at multiple times are

referred to as longitudinal designs. A longitudinal survey involves surveying

the same group of people over a longer period of time. This allows us to track

trends in population health. These powerful designs help in pinpointing causal

influences because we can see how variables change together over time.

• Experimental: In experimental designs, individuals are randomly

assigned to either the experimental group or the control group. After grouping,

the dependent variable, which is the same in both groups, is measured

in both groups. Then the independent variable is activated only

for the experimental group and the dependent variable is measured again

in both groups. Finally, the measurements of the dependent variable

taken from both groups before and after the activation of the independent

variable are compared. This method provides an effective way of establishing

whether a change in the dependent variable is caused by the independent variable.

• Cross Sectional: Designs where measurements are taken at one single point

in time are referred to as cross-sectional designs. These designs can be likened

to single snapshots from a camera, as compared to a continuous longitudinal

view provided by a motion picture.

2.2.4 Sample Selection in a Survey

It would be very expensive, and not very practical, to survey every person in a

country. Collecting information from a sample is more economical than collecting

information from an entire population [4]. Sampling, a statistical method, is an

established way to determine the characteristics of an entire population using the

answers from a randomly chosen sample. The information obtained by surveying a

sample population may actually be more satisfactory than that obtained by surveying

the entire population. For various reasons, errors occur in data collection. They are

more likely to occur when available resources are spread more thinly among more

numerous survey units. Consequently, these errors are more likely to occur when

enumerating a population than a sample because the former is more numerous than

the latter.

2.2.5 Type of Information Collected in a Survey

The type of information collected in surveys is very diverse. Some of the topics of

information are [11]: population, housing, community studies, family life, sexual be-

havior, family expenditure, nutrition, health, education, social mobility, occupations

and special groups, leisure, travel, political behavior, race relations and minority

groups, old age, crime and deviant behavior.

2.2.6 Survey Data Processing

Once all the questionnaires have been returned or all the interviews completed,

the work of analyzing the results begins. A first step in this process is to transfer

(and translate) the answers from the respondents into a computerized data file [4].

Using the data file that results from the coding process, the survey can then be

systematically analyzed [9]. All the information necessary to read and understand

the data file is provided. The reader knows the name of the variable. In cases where

the variable name is not sufficiently informative for readers, an extended variable name

or variable label is given. Values and their labels also appear in the table. If some

respondents did not answer the questions or the questions did not apply to them, the

number of missing cases is listed.

In Section 2.2.6.1, we present a description on how the information collected from

the survey is coded and transferred to the computer data files. In Section 2.2.6.2, we

present the format in which survey data are typically collected and recorded.

2.2.6.1 Coding the Survey Data

Coding involves transferring the survey information to computer data files. The

codes are the values used to represent different responses. For statistical analysis,

numeric representations or codes are used for each value of every variable. For

example, if the respondent answers “male” to the question “What is your gender?”, a

number such as “0” is chosen to represent male survey respondents. A different

number, such as “1”, is then assigned to female survey respondents. In case of computer

assisted questionnaires or interviews, the coding scheme is developed as part of the

questionnaire. Responses are directly and automatically entered into the computer

database.

2.2.6.2 Format of Survey Data

While creating the codebook for survey data, each survey respondent receives

an identification number (ID) to differentiate them from other survey respondents. The

responses to each question (variable) will be coded by assigning a numeric value for

each possible response, with all the responses making up a data line. In turn, all the

data lines will make up a data set or data matrix. A data line could look like 01 2 23

1 0 10 6 or like 0122310106, with each number representing the value of a variable,

as in 01 (ID), 2 (female), 23 (age) and so on.

1. Fixed format: The computer must be instructed on exactly how to read the

data line. For example, the computer is instructed to read column 1 and 2

together as one number; to skip column 3; read column 4 as one number, etc.

The number of the digits to be read as one number is referred to as field width,

while the left-hand column where the computer starts reading is referred to as

the field location. Note that the value “1”, when entered into a field two columns

wide, is written as “01” or “ 1”, with the space preceding the digit 1 representing

a blank (known as right justification). The above process is referred to as fixed

format because the location of each variable in the data line is fixed. For each

variable, then, the computer reads the number in each data line to determine

its value.

A completed codebook for fixed format survey data contains the following in-

formation [9]:

• Question number (or Identification number): This makes it easy to locate

the information.

• Variable name: Each variable name must be unique; usually, short

names starting with an alphabetic character are used.

• Variable label : These are longer character strings clarifying the precise

meaning of the shorter variable name.

• Value label : This identifies each category or value of a variable.

• Variable value: A number that stands for or represents each value a vari-

able takes on.

• Field width: This identifies how many columns (digits) the values of the variable occupy.

• Field location or column location: This identifies the column in which the

values for a particular variable start.

2. Free format: Data that are entered in free format have the variable values

entered in a consistent sequence for all cases, but the specific field location is

not fixed. Each code in this case is separated by a space or a comma. In the

case of free format, the codebook would require neither the field location nor

the field width.
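
The difference between the two formats can be illustrated with a short Python sketch that reads the example data line from above in both ways. Only the first three fields (ID, sex and age) are named in the text; the remaining field names `q4` to `q7`, the codebook layout and the function names are hypothetical and serve only to make the example complete.

```python
# Hypothetical codebook for the data line 0122310106:
# (variable name, field location as 1-based start column, field width)
CODEBOOK = [
    ("id", 1, 2),
    ("sex", 3, 1),
    ("age", 4, 2),
    ("q4", 6, 1),   # placeholder name, not given in the text
    ("q5", 7, 1),   # placeholder name, not given in the text
    ("q6", 8, 2),   # placeholder name, not given in the text
    ("q7", 10, 1),  # placeholder name, not given in the text
]


def parse_fixed(line, codebook):
    """Fixed format: cut each value out of the line by location and width."""
    values = {}
    for name, location, width in codebook:
        start = location - 1               # convert to a 0-based index
        field = line[start:start + width]
        values[name] = int(field)          # int(" 1") and int("01") both give 1
    return values


def parse_free(line, names):
    """Free format: values are separated by spaces or commas, in fixed order."""
    tokens = line.replace(",", " ").split()
    return {name: int(token) for name, token in zip(names, tokens)}


names = [name for name, _, _ in CODEBOOK]
fixed = parse_fixed("0122310106", CODEBOOK)
free = parse_free("01 2 23 1 0 10 6", names)
print(fixed)
print(fixed == free)
```

The fixed-format reader needs both the field location and the field width from the codebook, whereas the free-format reader needs only the order of the variables.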

2.2.7 Types of Variables in a Survey

In a survey there are two types of variables [21]:

1. Direct Variables : If the value of a variable is collected directly from the response

of the survey respondent, then such a variable is called a direct variable.

2. Derived Variables : If the value of a variable is derived from the values of direct

variables, then such a variable is called a derived variable.

2.2.8 Noise in Survey Data

We define noise as the data values that appear in the data as a result of

unanswered queries or queries that are not relevant to the survey respondent. These are

the values that we do not want to appear in our results.

There are various instances through which noise gets collected in the survey data.

We show here two common contexts in which noise gets collected in the survey data,

by considering variable values ‘Not Applicable’ and ‘Not Stated’ as noise.

• Noise through direct variables : If the survey respondent does not respond to

a direct variable, then its value is recorded as ‘Not Stated’. If a variable is

designed specific to the people of a certain region, then the value of that variable

is recorded as ‘Not Applicable’ for the people of the other regions. For example,

if a variable is designed specific to the people of Saskatchewan, then the value

of such variable for the people of Ontario is recorded as ‘Not Applicable’.

• Noise through derived variables : If the value of a derived variable is highly

dependent on the particular value ‘v’ of a direct variable, then the value of the

derived variable becomes ‘Not Applicable’ for any value of the direct variable

other than ‘v’. For example, consider that the value of the “Type of Drinker”

variable is derived from two variables, “Frequency of drinking” and “Ever had

a drink?”. Also consider that the value of the “Type of Drinker” variable is highly

dependent on the value of the “Ever had a drink?” variable. If the value of that

variable is ‘No’ or ‘Not Stated’, then the value of the “Type of Drinker” variable

becomes ‘Not Applicable’ or ‘Not Stated’, respectively.
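
How noise propagates into a derived variable can be sketched in Python as follows. The two noise branches follow the description above. The rule used when the respondent has had a drink, including the frequency label “Monthly or more”, is purely hypothetical, since the actual derivation is not given in this report; the function and constant names are likewise illustrative.

```python
NOT_APPLICABLE = "Not Applicable"
NOT_STATED = "Not Stated"
NOISE = {NOT_APPLICABLE, NOT_STATED}


def derive_type_of_drinker(ever_had_drink, frequency):
    """Derive "Type of Drinker" from two direct variables."""
    # Noise inherited from the direct variable "Ever had a drink?".
    if ever_had_drink == "No":
        return NOT_APPLICABLE
    if ever_had_drink == NOT_STATED:
        return NOT_STATED
    # Hypothetical rule for respondents who have had a drink.
    if frequency == NOT_STATED:
        return NOT_STATED
    return "Regular" if frequency == "Monthly or more" else "Occasional"


def is_noise(value):
    """True for values that should not appear in the mining results."""
    return value in NOISE


for answer in ("No", NOT_STATED, "Yes"):
    derived = derive_type_of_drinker(answer, "Monthly or more")
    print(answer, "->", derived, "(noise)" if is_noise(derived) else "")
```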

2.3 Conclusion of the Chapter

In this chapter, we presented an overview of association rule mining and high-

lighted its parameters. We then presented an overview of surveys and highlighted the

key issues of surveys required for survey data analysis.

Chapter 3

Data Preparation for Association Rule Mining

For decades, statistics has been used as a tool for analyzing survey data. Hence,

survey data are usually collected and recorded in a format suitable for statistical

analysis. Recently, data mining has emerged as an active field for analyzing data.

This has led the survey data organizers to record data in a format helpful for data

miners. Though great care is taken while collecting and recording data, there remain

a few problems for data miners. In this chapter we present the problems faced while

mining survey data for association rules and present a solution.

In Section 3.1, we present how survey data are typically recorded using CCHS

data as an example. In Section 3.2, we present the problems faced while generating

association rules from survey data and propose a solution to the problem. In

Section 3.3, we develop an algorithm, which we name the PreAssociation algorithm, to

preprocess survey data for association mining. In Section 3.4, we present the

functions and features of the PreAssociation software, which is an implementation of the

PreAssociation algorithm.

3.1 Features of the CCHS Data

The CCHS is a cross-sectional survey that collects information related to health

status, health care utilization and health determinants for the Canadian population

[21]. The CCHS operates on a two-year collection cycle, Cycle 1.1 and Cycle 1.2.

CCHS data is recorded in the fixed format. The data file used in our

experiments contains data collected in the first year of collection for the CCHS (Cycle

1.1). Information was collected between September 2000 and November 2001, for 136

health regions, covering all provinces and territories of Canada.

0121110215.51
0222110114.81
0323232226.02
0424121236.21
0523232125.93
0626131316.54
0723110435.82
0823243347.01
0929232324.91
1022255236.02
1122255225.72
1225142136.43
1323232266.21
1423221415.91
1523243334.82

Table 3.1: Survey Data in the Raw Format

The CCHS data has 614 data variables and 130800 records. It is stored in a

matrix format with 614 columns, each column corresponding to a data variable and

130800 rows, each row corresponding to a data record, as survey data is typically

stored. Each record corresponds to the responses of an individual survey respondent

to all the 614 data variables (questions). The data file is available in various formats,

Variable Name   Length   Position   Variable Description
XYZ1            2        1-2        Record Number
XYZ2            2        3-4        Province
XYZ3            1        5-5        Sex
XYZ4            1        6-6        Health Description Index
XYZ5            1        7-7        Number of Chronic Conditions
XYZ6            1        8-8        Type of Smoker
XYZ7            1        9-9        Type of Drinker
XYZ8            3        10-12      Height
XYZ9            1        13-13      Standard Weight

Table 3.2: Description of the Survey Data Variables

saved with the extensions .sav, .dat, .txt, etc. We have used the .txt file for our

experiments. Table 3.1 shows the raw format in which the CCHS data is stored, an

encoded text file, by considering a few variables from the CCHS data. Note that the

data file has no delimiters, which makes it difficult to differentiate between different

variables. Also, the data is stored in numeric codes, which makes it difficult to

interpret their meaning. Therefore, a separate file is provided to give the details of

the data variables. Table 3.2 shows the details of how the data file shown in Table 3.1

is encoded, that is, the variable name, the length of the variable (the number of digits

or the number of columns corresponding to the variable), the position of the variable

(the column corresponding to the start position of the variable) and the description

of the short variable name. For example, the XYZ4 variable has length one, that is,

it is represented by one digit in the data file of Table 3.1, is stored in column 6 and

stands for Health Description Index. Table 3.3 shows the associated variable values

and gives the description of the numeric values. For example, XYZ4 variable has 7

associated numeric values. The description of the numeric values for XYZ4 is also

shown in Table 3.3.

Now, referring to Table 3.1, Table 3.2 and Table 3.3, the first data record can

XYZ2                XYZ3                XYZ4                XYZ5
21-Ontario          1-Male              1-Excellent         0-No chronic cond.
22-Manitoba         2-Female            2-Very Good         1-1 chronic cond.
23-Saskatchewan     6-Not Applicable    3-Good              2-2 chronic cond.
24-Quebec           9-Not Stated        4-Fair              3-3 chronic cond.
........                                5-Poor              4-4 chronic cond.
66-Not Applicable                       6-Not Applicable    5-5+ chronic cond.
99-Not Stated                           9-Not Stated        6-Not Applicable
                                                            9-Not Stated

XYZ6                XYZ7                XYZ8                  XYZ9
1-Regular           1-Regular           666-Not Applicable    1-Over Weight
2-Occasional        2-Occasional        999-Not Stated        2-Under Weight
3-Former            3-Former                                  3-Acceptable Weight
4-Never             4-Never                                   4-Average Weight
6-Not Applicable    6-Not Applicable                          6-Not Applicable
9-Not Stated        9-Not Stated                              9-Not Stated

Table 3.3: Description of the Survey Data Variable Values

be read as follows: The survey respondent with record number 01, is from Ontario

province, is of male sex, has an excellent health description index, exhibits no chronic

conditions, is an occasional type of smoker, is a regular type of drinker, is 5.5 feet

tall, and is overweight.
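
The reading of a record can be reproduced with a small Python sketch that combines the layout of Table 3.2 with the value labels of Table 3.3. The names `LAYOUT`, `LABELS` and `decode` are illustrative. Height is read from columns 10 to 12 and Standard Weight from column 13, which is what the 13-character records of Table 3.1 imply; these two positions should be treated as an assumption. Codes whose labels are elided in Table 3.3 are left as raw codes.

```python
NA, NS = "Not Applicable", "Not Stated"
SMOKE_DRINK = {"1": "Regular", "2": "Occasional", "3": "Former",
               "4": "Never", "6": NA, "9": NS}

# (variable description, first column, last column), 1-based and inclusive.
LAYOUT = [
    ("Record Number", 1, 2),
    ("Province", 3, 4),
    ("Sex", 5, 5),
    ("Health Description Index", 6, 6),
    ("Number of Chronic Conditions", 7, 7),
    ("Type of Smoker", 8, 8),
    ("Type of Drinker", 9, 9),
    ("Height", 10, 12),           # assumed position, see the note above
    ("Standard Weight", 13, 13),  # assumed position, see the note above
]

LABELS = {
    "Province": {"21": "Ontario", "22": "Manitoba", "23": "Saskatchewan",
                 "24": "Quebec", "66": NA, "99": NS},
    "Sex": {"1": "Male", "2": "Female", "6": NA, "9": NS},
    "Health Description Index": {"1": "Excellent", "2": "Very Good",
                                 "3": "Good", "4": "Fair", "5": "Poor",
                                 "6": NA, "9": NS},
    "Number of Chronic Conditions": {"0": "No chronic cond.",
                                     "1": "1 chronic cond.",
                                     "2": "2 chronic cond.",
                                     "3": "3 chronic cond.",
                                     "4": "4 chronic cond.",
                                     "5": "5+ chronic cond.",
                                     "6": NA, "9": NS},
    "Type of Smoker": SMOKE_DRINK,
    "Type of Drinker": SMOKE_DRINK,
    "Height": {"666": NA, "999": NS},
    "Standard Weight": {"1": "Over Weight", "2": "Under Weight",
                        "3": "Acceptable Weight", "4": "Average Weight",
                        "6": NA, "9": NS},
}


def decode(record):
    """Translate one raw fixed-format record into labelled values."""
    decoded = {}
    for name, first, last in LAYOUT:
        code = record[first - 1:last]
        # Fall back to the raw code when no label is known for it.
        decoded[name] = LABELS.get(name, {}).get(code, code)
    return decoded


for name, value in decode("0121110215.51").items():
    print(f"{name}: {value}")
```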

Of the 614 variables in the CCHS data, some are direct variables and some are

derived variables. By direct variable, we mean that the value of that variable is the

direct response of the survey respondent to the question corresponding to that

variable. For example, XYZ4 is a direct variable corresponding to the question

“Which category would you consider to represent your health?”, provided with

categories. By derived variable, we mean that the value of that variable is derived from

the values of other direct variables. For example, the XYZ7 variable is a derived

variable derived from the values of direct variables such as “Ever had a drink”, “Frequency

of drinking alcohol”, “Number of drinks at a time”, etc.

3.2 Survey Data Mining Problems

While generating association rules from raw survey data, we face two

problems:

• Incomprehensible Association Rules : As shown in Section 3.1, survey data is

recorded in an ASCII encoded text file. If we apply association mining to it,

we obtain association rules like:

1 → 2

0 + 1 → 2.

These association rules make no sense because we cannot figure out which

variable they are referring to and hence what value they represent.

• Noise in Survey Data : As explained in Section 2.2.8, a lot of noise gets accu-

mulated in survey data and hence if association mining is applied on such data

many rules will be generated with noise in their rule body and/or rule head.

To overcome the problems faced while generating association rules from survey

data, we preprocess the survey data such that:

• Human interpretable association rules are generated.

• Noise is eliminated before the application of the association mining.

In the following sections, we show how we try to achieve the goal of generating

noise free and human interpretable association rules.

PreAssociation Algorithm
 1. begin
 2.   D = {input data file}
 3.   I = {data description file}
 4.   O = {output data file}
 5.   V = {selected data variables}
 6.   R = {selected data records}
 7.   for each data record r ∈ R
 8.   begin
 9.     for each data variable v ∈ V
10.     begin
11.       a = read the numerical value for ‘v’ from D
12.       if (a ≠ Noise)
13.       begin
14.         d1 = retrieve the description for ‘a’ from I
15.         d2 = retrieve the description for ‘v’ from I
16.         print (r, d1 + d2) to O in market basket format
17.       end;
18.     end;
19.   end;
20. end;

Figure 3.1: PreAssociation Algorithm

3.3 PreAssociation Algorithm

Figure 3.1 shows the PreAssociation algorithm developed to preprocess the survey

data for association mining. Lines 2 through 6 specify the user inputs to the algorithm.

In line 2, the user specifies the input data file D, that is, the file containing survey

data in raw format. In line 3, the user specifies the data description file I, that

is, the file containing the variable position, variable length, variable description and

variable value description. In line 4, the user specifies the output file O, that is, the

file in which the preprocessed survey data will be stored. In line 5, the user specifies

the particular data variables to be preprocessed (and hence used for association

mining after preprocessing) among the total number of data variables. In line 6, the

user specifies the particular data records to be preprocessed (and hence used for association

mining after preprocessing) among the total number of data records. Line 7 selects a

data record r, to be preprocessed, from the data file D and loops through lines 8 to 19

until all the data records r ∈ R are preprocessed. The preprocessing of the selected

data record r starts from line 8. Line 9 selects a data variable v, to be preprocessed,

from the selected data record r and loops through lines 10 to 18 until all the data

variables v ∈ V are preprocessed. The preprocessing of the selected data variables

starts from line 10. Line 11 reads the ASCII data value, a, for the selected data variable

v from the input data file D. Line 12 checks whether the data value a is noise. If a

is noise, then preprocessing for that data value stops and the loop at line 9 moves on to

the next variable value. If a is not noise, then lines 13 through 17 are executed.

Lines 14 and 15 retrieve the descriptions for a and v from the data description file I.

Line 16 prints the descriptions for a and v along with r, in the market basket format,

to the output data file O. We explain the algorithm shown in Figure 3.1 in detail by

running over the example survey dataset of Table 3.1 that has been built using a few

variables and the format of CCHS dataset.

Consider the dataset shown in Table 3.1 as the input data file, D, to the

PreAssociation algorithm. Consider, for example, selected data variables V = {XYZ3,

XYZ4, XYZ5, XYZ7} and selected data records R = {XYZ2 = 23}, that is, the records

from Saskatchewan. Let us also consider Noise = {Not Applicable}. Then, after the

application of the PreAssociation algorithm to the above data files, we obtain the

output data file, O, as shown in Table 3.4.

If we take a closer look at the output data file in Table 3.4, the variable ‘Type of Drinker’

is missing from record number 13. The reason is that in record number 13, the

RecordNo.   Variables with Corresponding Values
03          Female Sex
03          Good Health Description Index
03          2 chronic cond. Number of Chronic Conditions
03          Occasional Type of Drinker
05          Female Sex
05          Good Health Description Index
05          2 chronic cond. Number of Chronic Conditions
05          Occasional Type of Drinker
07          Male Sex
07          Excellent Health Description Index
07          No chronic cond. Number of Chronic Conditions
07          Former Type of Drinker
08          Female Sex
08          Fair Health Description Index
08          3 chronic cond. Number of Chronic Conditions
08          Never Type of Drinker
13          Female Sex
13          Good Health Description Index
13          2 chronic cond. Number of Chronic Conditions
14          Female Sex
14          Very Good Health Description Index
14          1 chronic cond. Number of Chronic Conditions
14          Regular Type of Drinker
15          Female Sex
15          Fair Health Description Index
15          3 chronic cond. Number of Chronic Conditions
15          Former Type of Drinker

Table 3.4: Survey Data in Market Basket Format

variable ‘Type of Drinker’ has ‘Not Applicable’ as its value (coded as 6 in the input data

file), which we consider an unwanted value or noise in our data file and which is hence

eliminated from the output data file.
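
The walkthrough above can be reproduced with a minimal Python sketch of the algorithm in Figure 3.1. It is not the PreAssociation software itself: the in-memory structures `RECORDS`, `VARIABLES` and `VALUES` stand in for the input data file and the data description file, the function names are illustrative, and the rows are returned as a list rather than written to an output file. Record selection is expressed as a mapping from variable to required code, which is one possible reading of R = {XYZ2 = 23}.

```python
NA, NS = "Not Applicable", "Not Stated"

# Input data file D: the 15 records of Table 3.1.
RECORDS = [
    "0121110215.51", "0222110114.81", "0323232226.02", "0424121236.21",
    "0523232125.93", "0626131316.54", "0723110435.82", "0823243347.01",
    "0929232324.91", "1022255236.02", "1122255225.72", "1225142136.43",
    "1323232266.21", "1423221415.91", "1523243334.82",
]

# Data description file I, part 1: variable -> (first col, last col, description).
VARIABLES = {
    "XYZ2": (3, 4, "Province"),
    "XYZ3": (5, 5, "Sex"),
    "XYZ4": (6, 6, "Health Description Index"),
    "XYZ5": (7, 7, "Number of Chronic Conditions"),
    "XYZ7": (9, 9, "Type of Drinker"),
}

# Data description file I, part 2: value descriptions (Table 3.3).
VALUES = {
    "XYZ3": {"1": "Male", "2": "Female", "6": NA, "9": NS},
    "XYZ4": {"1": "Excellent", "2": "Very Good", "3": "Good", "4": "Fair",
             "5": "Poor", "6": NA, "9": NS},
    "XYZ5": {"0": "No chronic cond.", "1": "1 chronic cond.",
             "2": "2 chronic cond.", "3": "3 chronic cond.",
             "4": "4 chronic cond.", "5": "5+ chronic cond.",
             "6": NA, "9": NS},
    "XYZ7": {"1": "Regular", "2": "Occasional", "3": "Former", "4": "Never",
             "6": NA, "9": NS},
}


def read_code(record, variable):
    """Line 11: read the code of one variable from a fixed-format record."""
    first, last, _ = VARIABLES[variable]
    return record[first - 1:last]


def preassociation(records, selected_vars, record_filter, noise):
    """Return (record number, value + variable description) rows."""
    rows = []
    for record in records:
        # Keep only the selected data records R.
        if any(read_code(record, var) != code
               for var, code in record_filter.items()):
            continue
        record_no = record[0:2]
        for var in selected_vars:
            code = read_code(record, var)
            value_desc = VALUES[var].get(code, code)      # line 14
            if value_desc in noise:                       # line 12
                continue
            var_desc = VARIABLES[var][2]                  # line 15
            rows.append((record_no, f"{value_desc} {var_desc}"))  # line 16
    return rows


rows = preassociation(RECORDS, ["XYZ3", "XYZ4", "XYZ5", "XYZ7"],
                      {"XYZ2": "23"}, {NA})
for record_no, item in rows:
    print(record_no, item)
```

With these inputs the sketch yields the rows of Table 3.4, including the omission of ‘Type of Drinker’ for record number 13.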

3.4 PreAssociation Software

In this section we present an implementation of the PreAssociation algorithm,

which we name the PreAssociation software and which preprocesses given survey data for

[Block diagram not reproduced. Blocks: PreAssociation Processor, File Breaker, Data File Tracker, Data Reader, Description Reader, Print PreAssociation Format. Labels: IDF, DDF, ODF, SVI, SDR, N, VLF, VPF, VDF, vDF, VVD, CRN, CVI, NDV.]

Figure 3.2: Block Diagram of PreAssociation Software

association mining.

In Section 3.4.1, we present the block diagram of the PreAssociation software and

show how its various components function together. In Section 3.4.2, we present the

features of the PreAssociation software.

Diagram Label   Full Form
IDF             Input Data File
ODF             Output Data File
DDF             Description Data File
SVI             Selected Variables Index
SDR             Selected Data Records
N               Noise
VLF             Variable Length File
VPF             Variable Position File
VDF             Variable Description File
vDF             Value Description File
VVD             Variable Value Description
CRN             Current Record Number
CVI             Current Variable Index
VD              Variable Description
vD              Value Description

Table 3.5: Full Form of PreAssociation Block Diagram Labels

3.4.1 Block Diagram of PreAssociation Software

Figure 3.2 shows the block diagram of the PreAssociation software. Due to space

constraints, abbreviations are used to label the block diagram. The full forms of the

abbreviations are given in Table 3.5.

3.4.1.1 Input to PreAssociation Software

The PreAssociation software takes the following five inputs:

1. Input data file (IDF) containing survey data in raw format.

2. Data description file (DDF) describing the input data file.

3. Indexes of the selected data variables (SVI) (a list that represents all the survey

data variables by indexes is created manually, as shown in Figure 3.4).

4. Selected data records (SDR), which specify how many of the total records are to be preprocessed and which particular records are to be preprocessed. Particular records are selected by first choosing one or more data variables and then choosing a category of each chosen variable. For example, we can select the Sex variable and then its Male category to preprocess only the records of males.

5. Noise (N), which specifies the unwanted values to be excluded from the preprocessed survey data and hence from the association rules.

3.4.1.2 Components of PreAssociation Software

The PreAssociation software consists of the following five components:

1. File Breaker:

The function of the File Breaker is to break the Data Description File into subfiles. As shown in Figure 3.3, the File Breaker component takes the Data Description File (DDF) as input and breaks it into four files:

• Variable Length File (VLF), which contains the lengths of all the data variables in the input data file.

• Variable Position File (VPF), which contains the starting positions of all the data variables in the input data file.

• Variable Description File (VDF), which contains the descriptions of all the data variables.

• Value Description File (vDF), which contains the descriptions of the values of all the data variables.

[Example: the Data Description File lists each variable with its column range and description (XYZ1, 1-2, Record Number; XYZ2, 3-4, Province; XYZ3, 5-5, Sex; XYZ4, 6-6, Marital Status; ...), followed by the value codes of each variable (XYZ2: 21-Ontario, 22-Manitoba, 23-Saskatchewan, ..., 66-Not Applicable, 99-Not Stated; XYZ3: 1-Male, 2-Female, 6-Not Applicable, 9-Not Stated; XYZ4: 1-Married, 2-Single, 3-Widowed, ..., 6-Not Applicable, 9-Not Stated). The File Breaker splits it into the Variable Length File, the Variable Position File, the Variable Description File (1 Record Number, 2 Province, 3 Sex, 4 Marital Status, ...), and the Value Description File (the code and description of every value, grouped by variable index).]

Figure 3.3: File Breaker

2. Data File Tracker:

The function of the Data File Tracker is to track the Input Data File (IDF), that is, to keep track of which data record, and which data variable in that record, has to be read from the Input Data File in the current operation. As shown in Figure 3.4, the Data File Tracker component takes the Input Data File (IDF), the Selected Variables Index (SVI), and the Selected Data Records (SDR) as input and provides the Current Record Number (CRN) and the Current Variable Index (CVI) as output.

[Flowchart: the Data File Tracker takes the Input Data File (records such as 0000012322..., 0000022111..., 0000032312...), the Selected Data Records criteria (e.g., Province = 23, Total <= 120500), and the Selected Variables Index (e.g., 2, 3, 4, ..., 523). It advances to the next record (Current Record = Previous Record + 1) and checks whether the current record is selected. If not, it gets the next record; if so, it gets the next variable index and outputs the Current Record Number and the Current Variable Index (e.g., 3 and 2).]

Figure 3.4: Data File Tracker

3. Data Reader:

The function of the Data Reader is to read the data from the specified location in the Input Data File, that is, from the specified variable in the specified record.

As shown in Figure 3.5, the Data Reader component takes the Input Data File (IDF), the Variable Length File (VLF), the Variable Position File (VPF), the Current Record Number (CRN), and the Current Variable Index (CVI) as input and provides the Ascii Data Value (ADV) as output.

[Example: the Variable Length File maps each variable index to a field length (1: 6, 2: 2, 3: 1, 4: 1, ...) and the Variable Position File maps each variable index to a starting position (1: 1, 2: 7, 3: 9, 4: 10, ...). For Current Record Number 3 and Current Variable Index 2, the Data Reader looks up length 2 and position 7 and reads the Ascii Data Value 23 from record 0000032312....]

Figure 3.5: Data Reader

4. Description Reader:

The function of the Description Reader is to retrieve the descriptions for the given variable and the given variable value from the given Data Description File (DDF).

As shown in Figure 3.6, the Description Reader component takes the Variable Description File (VDF), the Value Description File (vDF), the Noise (N), the Current Variable Index (CVI), and the Ascii Data Value (ADV) as input and provides the Variable Value Description (VVD) as output.

5. Print Market Basket: The function of the Print Market Basket is to print

the preprocessed survey data in the market basket format.

As shown in Figure 3.7, the Print Market Basket component takes the Current

Record Number (CRN) and the Variable Value Description (VVD) as input and

prints them to the Output Data File (ODF) in market basket format.
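The way the components cooperate can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the actual PreAssociation software is implemented in C, so it should be read as an illustration under stated assumptions rather than as the software itself. The sample records, lengths, positions, and descriptions follow the examples in Figures 3.3 to 3.7; the dictionary names and the functions `read_value`, `describe`, and `preprocess` are hypothetical, and the subfiles are held in memory instead of being read from disk.

```python
# Hypothetical in-memory stand-ins for the subfiles produced by the File Breaker.
VLF = {1: 6, 2: 2, 3: 1, 4: 1}       # variable index -> field length
VPF = {1: 1, 2: 7, 3: 9, 4: 10}      # variable index -> 1-based starting position
VDF = {1: "Record Number", 2: "Province", 3: "Sex", 4: "Marital Status"}
vDF = {
    2: {"21": "Ontario", "22": "Manitoba", "23": "Saskatchewan", "99": "Not Stated"},
    3: {"1": "Male", "2": "Female", "6": "Not Applicable", "9": "Not Stated"},
    4: {"1": "Married", "2": "Single", "3": "Widowed", "9": "Not Stated"},
}
NOISE = {"Not Applicable", "Not Stated"}  # values to be left out of the output


def read_value(record, var_index):
    """Data Reader: slice the fixed-width field of one variable out of a record."""
    start = VPF[var_index] - 1
    return record[start:start + VLF[var_index]]


def describe(var_index, raw_value):
    """Description Reader: return 'value variable' text, or None if the value is noise."""
    value_text = vDF[var_index].get(raw_value, raw_value)
    if value_text in NOISE:
        return None
    return f"{value_text} {VDF[var_index]}"


def preprocess(records, selected_vars, is_selected):
    """Data File Tracker and Print Market Basket: one output line per kept value."""
    lines = []
    for record in records:
        if not is_selected(record):
            continue  # record is not among the selected data records
        record_number = read_value(record, 1)
        for var_index in selected_vars:
            item = describe(var_index, read_value(record, var_index))
            if item is not None:
                lines.append(f"{record_number} {item}")
    return lines


records = ["0000012322", "0000022111", "0000032312"]
# Select the records with Province = 23 and the variables with indexes 2, 3, and 4.
basket = preprocess(records, [2, 3, 4], lambda r: read_value(r, 2) == "23")
print("\n".join(basket))
```

With these assumed inputs, the second record (Province 21) is skipped, and each remaining record contributes one line per selected variable, which is the market basket layout shown in Figure 3.7.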


[Example: given Current Variable Index 2 and Ascii Data Value 23, the Description Reader looks up the variable description "Province" in the Variable Description File and the value description "Saskatchewan" in the Value Description File, checks whether the data value is noise (Not Applicable, Not Stated), and outputs the Variable Value Description "Saskatchewan Province".]

Figure 3.6: Description Reader

3.4.2 Features of PreAssociation Software

In this section, we walk through the features of our PreAssociation software. PreAssociation is implemented in C on the Unix platform.

The PreAssociation software converts the given survey data from survey format to PreAssociation format, without which it would not be possible to generate meaningful association rules from survey data. The software provides the following helpful features:

• The software allows us to select any number of records, and any particular records, out of the total number of records available. This lets us select part of the data as a sample when we need to run a test on a sample.


[Example: for Current Record Number 3 and Variable Value Description "Saskatchewan Province", the Print Market Basket component writes the line "000003 Saskatchewan Province" to the Output Data File. The Output Data File holds one such line per record and variable, e.g., "000001 Saskatchewan Province", "000001 Female Sex", "000001 Single Marital Status", "000003 Saskatchewan Province", "000003 Male Sex", "000003 Single Marital Status".]

Figure 3.7: Print Market Basket

• The software gives us the option to select any number of the available variables for association mining. This helps us concentrate on generating association rules for only those variables we are concerned about, rather than generating thousands of rules involving all the variables, which would make the results hard to interpret. At the same time, the program also allows us to select all the available variables if we need to.

For example, as mentioned earlier, the CCHS was conducted across Canada, and data was collected from all the provinces and territories. The provinces were in turn divided into health regions. Taking this into consideration, we designed the software so that it gives us the option of selecting the data for association mining from any particular province, and also of selecting any particular health region within that province. This helps us concentrate on any subsection of the data and generate rules on it. It also helps us carry out comparative studies at different levels of the data. For example, we can compare the rules generated for different health regions within the same province, or we can compare the rules generated for different provinces. The program also allows us to select all the data records for preprocessing.

• Another feature of our software lets us get rid of unwanted values (or noise) in the data without deleting the whole record containing the unwanted value, as is done by SPSS, a statistical software package. By eliminating just the unwanted value and not the entire record containing it, we retain the other valuable information present in that record.
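The difference between the two ways of handling noise can be made concrete with a small sketch. It is only an illustration: the two records are hypothetical, and the function names `drop_noisy_records` and `drop_noisy_values` are our own labels for deleting whole records and for removing single values, respectively.

```python
NOISE = {"Not Applicable", "Not Stated"}

# Hypothetical described records: record number -> list of (variable, value) pairs.
records = {
    "000001": [("Sex", "Female"), ("Marital Status", "Not Stated")],
    "000003": [("Sex", "Male"), ("Marital Status", "Single")],
}


def drop_noisy_records(data):
    """Delete every record that contains at least one noise value."""
    return {
        record_id: items
        for record_id, items in data.items()
        if all(value not in NOISE for _, value in items)
    }


def drop_noisy_values(data):
    """Remove only the noise values and keep the rest of each record."""
    return {
        record_id: [(var, value) for var, value in items if value not in NOISE]
        for record_id, items in data.items()
    }


by_record = drop_noisy_records(records)
by_value = drop_noisy_values(records)
print(sorted(by_record))      # record 000001 is lost entirely
print(by_value["000001"])     # record 000001 keeps its Sex value
```

In this example, deleting by record discards the Sex value of record 000001 together with its unstated Marital Status, whereas removing by value keeps it available for association mining.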

3.5 IBM’s Intelligent Data Miner

In this section, we briefly present a tutorial on IBM’s Intelligent Data Miner, a data mining tool, and show how to generate association rules from the PreAssociation data file. Because of limited space, we explain only the essential steps.

The process of KDD using the Intelligent Miner includes the following steps [8]:

• Create the data object by selecting the input data file.

• Transform the data by reorganizing it, eliminating duplicate records, or converting it from one form to another (unfortunately, there is no option for transforming survey data for association mining).

• Specify the parameters for the selected data mining function.

• Run the data mining function.

• Visualize and analyze the resulting data.

We now generate association rules from preprocessed survey data using the Intelligent Data Miner. We use the CCHS dataset as our input data file. For demonstration, we choose the following 4 variables: Sex, Type of Drinker, Health Description Index, and Number of Chronic Conditions. We preprocess the CCHS data for these 4 variables and transform it into PreAssociation format, which serves as the input data file for association mining.

The two columns of the preprocessed survey data, that is, the record numbers and the variables with their corresponding values, become the input data fields for the Intelligent Miner. We then run the data miner to generate human-interpretable association rules, as shown in Figure 3.8.

3.6 Conclusion of the Chapter

In this chapter, we presented the problems faced while generating association rules from survey data. We then presented the algorithm and the software developed to preprocess survey data for association mining. We also presented the functions of our software components and highlighted the features of our software. Using IBM’s Intelligent Data Miner [8], we showed how to generate association rules from the preprocessed survey data.


Figure 3.8: Association Rules Generated by Intelligent Miner


Chapter 4

Crosstabulations and Association Rules

Bivariate analysis is a statistical data analysis method used to see whether two variables are associated [4]. Two variables are said to be associated when the distribution of values on one variable differs for different values of the other. Statistics offers various techniques for bivariate analysis, the most common of which is crosstabulation. In this chapter, we compare crosstabulations with association rules and show how association rules correspond to the information given by crosstabulations. We then show that, as the number of variables increases, bivariate analysis through association mining is simpler than through crosstabulations.

In Section 4.1, we present an overview of crosstabulations. In Section 4.2, we provide a theoretical justification to show the interrelation between crosstabulations and association rules.

4.1 Overview of Crosstabulations

Crosstabulations are a way of displaying data so that we can detect association

between two variables [4].


In Section 4.1.1, we present the structure of a crosstabulation. In Section 4.1.2,

we present how to interpret a crosstabulation.

4.1.1 The Structure of a Crosstabulation

                              Sex
                        Male   Female   Totals
Type of    Regular        3       1        4
Drinker    Occasional     0       4        4
           Former         3       2        5
           Never          0       2        2
Totals                    6       9       15

Table 4.1: A simple Crosstabulation

In Table 4.1 there are two variables: Sex across the top and Type of Drinker on the side, with two and four categories, respectively. There are two columns (one for each category of Sex) and four rows (one for each category of Type of Drinker). Tables can be described by their size, expressed as the number of columns by the number of rows. Thus Table 4.1 is a two by four table. There are three sets of totals. Those on the side (4, 4, 5, 2) are called row totals, since they represent the total number of people in each row. Those on the bottom (6, 9) are called column totals, since they represent the total number of people in each column. The number in the bottom right-hand corner (15) is the grand total.

Each column is broken into subgroups according to people’s characteristics on the other variable. Thus the Female column is broken down into four groups: female regular drinkers, female occasional drinkers, female former drinkers, and female never drinkers. Each of these subcategories is called a cell. The number in each cell represents the cell frequency or count.
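The structure just described can be reproduced with a short sketch. It expands the cell frequencies of Table 4.1 into one (sex, type of drinker) pair per person and then recomputes the cell frequencies, row totals, column totals, and grand total. The variable names are our own, and the individual records are reconstructed from the cell counts only for the purpose of illustration.

```python
from collections import Counter

# Cell frequencies of Table 4.1, keyed by (sex, type of drinker).
cell_counts = {
    ("Male", "Regular"): 3, ("Female", "Regular"): 1,
    ("Male", "Occasional"): 0, ("Female", "Occasional"): 4,
    ("Male", "Former"): 3, ("Female", "Former"): 2,
    ("Male", "Never"): 0, ("Female", "Never"): 2,
}

# One (sex, type of drinker) pair per person in the sample.
people = [pair for pair, n in cell_counts.items() for _ in range(n)]

cells = Counter(people)                              # cell frequencies
column_totals = Counter(sex for sex, _ in people)    # totals for Male and Female
row_totals = Counter(tod for _, tod in people)       # totals per type of drinker
grand_total = len(people)

print(dict(column_totals), dict(row_totals), grand_total)
```

The recomputed totals agree with the row totals (4, 4, 5, 2), column totals (6, 9), and grand total (15) of Table 4.1.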

To make associations easier to interpret, the cell frequencies can be converted into percentages. A percentage expresses a particular number as a proportion of a total. Since we have three kinds of totals, the percentage and its interpretation depend on which total is used. For example, let us consider the top right-hand cell of Table 4.1 (frequency 1), representing female regular drinkers, and convert its frequency into percentages.

                                         Sex
                                  Male     Female    Totals
Type of    Regular      row%      75%      25%       100%
Drinker                 column%   50%      11.1%     26.7%
                        total%    20%      6.7%      26.7%
           Occasional   row%      0%       100%      100%
                        column%   0%       44.4%     26.7%
                        total%    0%       26.7%     26.7%
           Former       row%      60%      40%       100%
                        column%   50%      22.2%     33.3%
                        total%    20%      13.3%     33.3%
           Never        row%      0%       100%      100%
                        column%   0%       22.2%     13.3%
                        total%    0%       13.3%     13.3%
Totals                  row%      40%      60%       100%
                        column%   100%     100%      100%
                        total%    40%      60%       100%

Table 4.2: Crosstabulation with Row, Column and Total Percentages

• The column total produces a column percentage (column%). Using this we get 1/9 × 100 = 11.1%. This means that 11.1% of the females are regular drinkers. It does not mean that 11.1% of regular drinkers are females.

• The row total produces a row percentage (row%). This gives us 1/4 × 100 = 25%. This means that 25% of regular drinkers are females.

• The grand total produces a total percentage (total%). This gives us 1/15 × 100 = 6.7%. This means that 6.7% of the whole sample are female regular drinkers.

Table 4.2 shows the crosstabulation with row%, column% and total% after conversion of the cell frequencies of Table 4.1 into percentages.
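The three percentages of a cell can be computed with one small function. The sketch below is illustrative; the function name `percentages` and its parameter names are our own. It is applied to the female regular drinker cell discussed above and to the male regular drinker cell.

```python
def percentages(cell, row_total, column_total, grand_total):
    """Return the row%, column%, and total% of one crosstabulation cell."""
    return {
        "row%": round(100 * cell / row_total, 1),
        "column%": round(100 * cell / column_total, 1),
        "total%": round(100 * cell / grand_total, 1),
    }


# Female regular drinkers in Table 4.1: cell 1, row total 4, column total 9.
female_regular = percentages(cell=1, row_total=4, column_total=9, grand_total=15)
# Male regular drinkers in Table 4.1: cell 3, row total 4, column total 6.
male_regular = percentages(cell=3, row_total=4, column_total=6, grand_total=15)

print(female_regular)
print(male_regular)
```

The results match the Regular entries of Table 4.2: 25%, 11.1%, and 6.7% for females, and 75%, 50%, and 20% for males.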

4.1.2 Interpreting a Crosstabulation

A crosstabulation arranged as in Table 4.2 is very difficult to read, so we must decide which percentage to use. When trying to find associations in the table, total% is of no value. We then have to choose between row% and column%. This choice depends on two things: first, the way the table is arranged; second, which variable is treated as independent and which as dependent. Once we have decided on the independent and dependent variables, the association between them can be detected by comparing the categories of the independent variable. If the categories of the independent variable differ in terms of their characteristics on the dependent variable, then we can say the variables are associated; if there is no difference, then they are not associated.

The process of detecting association in a crosstabulation can be defined as follows [4]:

1. Determine which variable is to be treated as independent and which as dependent.

2. Choose the appropriate percentage: column% if the independent variable is across the top; row% if it is on the side.

3. Compare percentages for each category of the independent variable within one category of the dependent variable at a time. Thus

• If the independent variable is across the top, use column% and compare these across the table. Any difference between these reflects some association.

• If the independent variable is on the side, use row% and compare these down the table.

In general, we refer to the row% or column% as %within the independent variable.
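The procedure can be sketched for Table 4.1 with Sex, the variable across the top, treated as independent. The code computes column% for every cell and then compares the Male and Female percentages within one category of Type of Drinker at a time. The names `column_percentages` and `differs` are our own, and the simple inequality test is an assumption made for illustration: the procedure above says only that any difference reflects some association, and does not say how large a difference must be to matter.

```python
# Cell frequencies of Table 4.1: rows are Type of Drinker, columns are Sex.
counts = {
    "Regular": {"Male": 3, "Female": 1},
    "Occasional": {"Male": 0, "Female": 4},
    "Former": {"Male": 3, "Female": 2},
    "Never": {"Male": 0, "Female": 2},
}

# Column totals, one per category of the independent variable (Sex).
column_totals = {
    sex: sum(row[sex] for row in counts.values()) for sex in ("Male", "Female")
}


def column_percentages(counts, column_totals):
    """Step 2: compute column% (that is, %within Sex) for every cell."""
    return {
        category: {
            sex: round(100 * n / column_totals[sex], 1) for sex, n in row.items()
        }
        for category, row in counts.items()
    }


col_pct = column_percentages(counts, column_totals)

# Step 3: compare the percentages across the table, one row at a time.
differs = {
    category: pct["Male"] != pct["Female"] for category, pct in col_pct.items()
}

for category, pct in col_pct.items():
    print(category, pct, "differs" if differs[category] else "same")
```

For Table 4.1 the column percentages differ in every row, for instance 50% of males against 11.1% of females in the Regular row.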

4.2 Crosstabulations Vs Association Rules

By now we are familiar with both association rules and crosstabulations. In this section, we show how they interrelate. Suppose, for example, that we want to find the association between male sex and regular type of drinker in the dataset of Table 2.1. The association between them can be detected as follows:

• From the definition of support, we have support(X → Y) = (number of records containing X and Y) / (total number of records). Referring to Table 2.1, we have support(Sex(Male) → ToD(Regular)) = 3/15 = 0.2 or 20%. The same support value holds for the rule ToD(Regular) → Sex(Male). From the crosstabulation, we know that %total = cell frequency / grand total. Therefore, for male sex and regular type of drinker, we have %total = cell frequency(male and regular type of drinker) / grand total = 3/15 × 100 = 20%.

We can clearly see that the same values are obtained from both the association rule support and the %total of the crosstabulation. Similarly, when we compare the rule support and %total values for bivariate associations of all variables, we find that they are the same. Hence we conclude that support ≡ %total (where “≡” means equivalent).

• From the definition of confidence, we have conf(X → Y) = support(X and Y) / support(X). Referring to Table 2.1, we have conf(Sex(Male) → ToD(Regular)) = (3/15)/(6/15) = 50% and conf(ToD(Regular) → Sex(Male)) = (3/15)/(4/15) = 75%. From the crosstabulation, we know that, considering Sex as the independent variable, %within male sex = cell frequency / (column total for male sex) = column% = 3/6 = 50%. Also, considering Type of Drinker as the independent variable, %within regular type of drinker = cell frequency / (row total for regular type of drinker) = row% = 3/4 = 75%.

Again we can clearly see that the same values are obtained from both the association rule confidence and the %within of the independent variable from the crosstabulation. Similarly, when we compare the rule confidence and %within values for bivariate associations of all variables, we find that they are the same. Hence we conclude that confidence ≡ %within.
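The two equivalences can be checked numerically for male sex and regular type of drinker. The sketch below uses the counts from Table 4.1 (3 male regular drinkers, 6 males, 4 regular drinkers, 15 records); the function names `support` and `confidence` are our own and simply restate the definitions given above.

```python
import math


def support(n_xy, n_total):
    """support(X -> Y): fraction of all records that contain both X and Y."""
    return n_xy / n_total


def confidence(n_xy, n_x):
    """conf(X -> Y) = support(X and Y) / support(X), which reduces to n_xy / n_x."""
    return n_xy / n_x


n_total, n_male, n_regular, n_male_regular = 15, 6, 4, 3

sup = support(n_male_regular, n_total)                       # both rule directions
conf_male_to_regular = confidence(n_male_regular, n_male)    # Sex(Male) -> ToD(Regular)
conf_regular_to_male = confidence(n_male_regular, n_regular) # ToD(Regular) -> Sex(Male)

# Crosstabulation percentages of the same cell.
total_pct = 100 * n_male_regular / n_total
column_pct = 100 * n_male_regular / n_male      # %within Sex(Male)
row_pct = 100 * n_male_regular / n_regular      # %within ToD(Regular)

print(sup, conf_male_to_regular, conf_regular_to_male)
print(total_pct, column_pct, row_pct)
```

Here support corresponds to total% (20%), the confidence of Sex(Male) → ToD(Regular) to column% (50%), and the confidence of ToD(Regular) → Sex(Male) to row% (75%).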

Let us suppose that we want to find the bivariate associations between the Sex, HDI, NCC, and ToD variables in the dataset of Table 4.2. If we choose the crosstabulation method, we have to build crosstabulations for Sex versus HDI, Sex versus NCC, Sex versus ToD, HDI versus NCC, HDI versus ToD, and NCC versus ToD, and then interpret them. In total, to find the bivariate associations between four variables, we have to build 6 crosstabulations, one for each bivariate combination of variables. In general, to find the bivariate associations between n variables, we have to build n × (n - 1)/2 crosstabulations. On the other hand, if we use the association mining method to find the bivariate associations among n variables, a single association mining run generates the association rules, and hence the bivariate associations, for all n variables. We have also seen that the same information can be obtained from both association rules and crosstabulations.

Hence, we conclude that association mining is preferable to crosstabulations for detecting bivariate associations between variables, and that it becomes especially useful as the number of variables to be examined increases.
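The growth in the number of crosstabulations can be illustrated by enumerating the variable pairs. The sketch below uses the standard library function `itertools.combinations`; the helper name `crosstab_count` is our own.

```python
from itertools import combinations


def crosstab_count(n):
    """Number of crosstabulations needed for n variables: n * (n - 1) / 2."""
    return n * (n - 1) // 2


variables = ["Sex", "HDI", "NCC", "ToD"]
pairs = list(combinations(variables, 2))  # every bivariate combination, once

for first, second in pairs:
    print(f"{first} versus {second}")
print(len(pairs), crosstab_count(len(variables)))
```

For the four variables above this lists the six pairs named in the text, while ten variables would already require 45 crosstabulations.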

4.3 Conclusion of the Chapter

In this chapter, we presented an overview of crosstabulations and showed how association rules correspond to the information given by crosstabulations. We then described theoretically why association rules are advantageous over crosstabulations as the number of variables for bivariate analysis increases.


Chapter 5

Experimental Results

In this chapter, we present experimental results to validate our theoretical justifications. For our experiments, we use the CCHS data [22], with 130,880 records, as an example dataset. To demonstrate the correspondence between association rules and crosstabulations, we find the associations between the Sex and Health Description Index variables. We do this using two approaches:

• Crosstabulations Method.

• Association Mining Method.

In Section 5.1, we present the experimental results of the crosstabulation method for bivariate analysis. In Section 5.2, we present the experimental results of the association mining method for bivariate analysis. In Section 5.3, we show the interrelation between crosstabulations and association rules with the help of the experimental results. In Section 5.4, we compare the experimental results obtained from the crosstabulation method and the association mining method for bivariate analysis to validate the theoretical justifications given in the previous chapter.


5.1 Crosstabulations Method to find Associations

To build crosstabulations, we used SPSS, a statistical data analysis and data management system [17, 16]. Table 5.1 shows the crosstabulation for Sex versus Health Description Index. The Sex variable has two categories, Male and Female, and the Health Description Index (HDI) variable has five categories: Poor, Fair, Good, Very Good, and Excellent. Hence it is a two by five crosstabulation. Each cell entry contains the row, column, and total percentages for the corresponding combination of categories. For example, the top leftmost cell contains the row, column, and total percentages for Male Sex and Poor Health Description Index.

                                        HDI
                         Poor     Fair     Good     Very Good   Excellent   Totals
Sex   Male     %within Sex   3.5%     10.0%    27.0%    35.4%       24.1%       100%
               %within HDI   45.1%    44.2%    45.2%    46.1%       48.7%       46.2%
               total%        1.6%     4.6%     12.5%    16.4%       11.2%       46.2%
      Female   %within Sex   3.5%     10.0%    27.0%    35.4%       24.1%       100%
               %within HDI   45.1%    44.2%    45.2%    46.1%       48.7%       46.2%
               total%        1.6%     4.6%     12.5%    16.4%       11.2%       46.2%
Total          %within Sex   3.6%     10.5%    27.5%    35.5%       22.9%       100%
               %within HDI   100%     100%     100%     100%        100%        100%
               total%        3.6%     10.5%    27.5%    35.5%       22.9%       100%

Table 5.1: Crosstabulation for Sex versus HDI

5.2 Association Mining Method to find Associations

To generate association rules, we used IBM’s Intelligent Data Miner [8], a data mining tool, and used the Association Visualizer [7] to interpret them. Table 5.2 shows the association rules for the variables Sex and Health Description Index. There are twenty association rules in total, one per row. The first column, Rule No., gives the number assigned to each association rule for easy reference. The second and third columns give the support and confidence values, respectively. The fourth column gives the association rule itself. For example, the first row, with Rule No. 1, represents the rule Sex(Male) → Health Description Index(Poor) with 1.61% support and 3.5% confidence.

Rule No.   Support   Confidence   Association Rule
1.         1.61%     3.5%         Sex(Male) → HDI(Poor)
2.         4.62%     10.0%        Sex(Male) → HDI(Fair)
3.         12.45%    36.9%        Sex(Male) → HDI(Good)
4.         16.35%    35.4%        Sex(Male) → HDI(Very Good)
5.         11.15%    24.1%        Sex(Male) → HDI(Excellent)
6.         1.96%     3.6%         Sex(Female) → HDI(Poor)
7.         5.85%     10.9%        Sex(Female) → HDI(Fair)
8.         15.07%    28.0%        Sex(Female) → HDI(Good)
9.         19.12%    35.6%        Sex(Female) → HDI(Very Good)
10.        11.72%    21.8%        Sex(Female) → HDI(Excellent)
11.        1.61%     45.1%        HDI(Poor) → Sex(Male)
12.        1.96%     54.9%        HDI(Poor) → Sex(Female)
13.        4.62%     44.2%        HDI(Fair) → Sex(Male)
14.        5.85%     55.8%        HDI(Fair) → Sex(Female)
15.        12.45%    45.3%        HDI(Good) → Sex(Male)
16.        15.07%    54.8%        HDI(Good) → Sex(Female)
17.        16.35%    46.1%        HDI(Very Good) → Sex(Male)
18.        19.12%    53.9%        HDI(Very Good) → Sex(Female)
19.        11.15%    48.8%        HDI(Excellent) → Sex(Male)
20.        11.72%    51.3%        HDI(Excellent) → Sex(Female)

Table 5.2: Association Rules for Sex and HDI


5.3 Interrelation between Crosstabulations and Association Rules

Consider the first cell in the top leftmost corner of the crosstabulation shown in Table 5.1, corresponding to Male Sex and Poor Health Description Index. We have three percentages in the cell:

• %within Sex = 3.5%, which treats Sex as the independent variable and says that 3.5% of males have a Poor Health Description Index.

• %within Health Description Index = 45.1%, which treats Health Description Index as the independent variable and says that 45.1% of people who have a Poor Health Description Index are male.

• % of total = 1.6%, which says that 1.6% of the total population are male and have a Poor Health Description Index.

Now, consider the association rules shown in Table 5.2 for the Sex and Health Description Index variables. Of those rules, Rule No. 1 and Rule No. 11 correspond to Male Sex and Poor Health Description Index. We obtain the following information from those rules:

• From Rule No. 1, we have conf(Sex(Male) → Health Description Index(Poor)) = 3.5%, which means that a male has a Poor Health Description Index with 3.5% confidence.

• From Rule No. 11, we have conf(Health Description Index(Poor) → Sex(Male)) = 45.1%, which means that a person with a Poor Health Description Index is male with 45.1% confidence.

• From Rules No. 1 and 11, we have support(Sex(Male) → Health Description Index(Poor)) = 1.61% and support(Health Description Index(Poor) → Sex(Male)) = 1.61%, respectively. Both rules mean that, out of the total population, people who are male and have a Poor Health Description Index have 1.61% support.

From the above, we can clearly see that for Male Sex and Poor Health Description

Index :

1. %within Sex = conf(Rule No. 1) = 3.5%.

2. %within Health Description Index = conf(Rule No. 11) = 45.1%.

3. % of total = support(Rule No. 1) = support(Rule No. 11) = 1.61%.

Likewise, we compare every cell entry of the crosstabulation shown in Table 5.1 with the association rules shown in Table 5.2 and obtain the correspondence between them, as shown in Table 5.3 and Table 5.4.

Together, Table 5.3 and Table 5.4 have two columns and ten rows (apart from the top label row). The first column gives the variable categories. The second column gives the correspondence between the crosstabulation cell entries for those variable categories and the parameters of the association rules for the same categories. For example, consider the first row of Table 5.3. Its first column gives the variable categories Male Sex and Poor HDI. Its second column gives the correspondence between %within Sex, %within HDI, and % of total on the one hand and the support and confidence of the corresponding association rules on the other. It says that %within Sex is equal to the confidence of Rule No. 1, %within HDI is equal to the confidence of Rule No. 11, and % of total is equal to the support of Rule No. 1 and also to the support of Rule No. 11.


Variable Categories             Relation between Crosstabulation Cell Entries and Association Rule Parameters
Male Sex and Poor HDI           %within Sex = conf(Rule No. 1)
                                %within HDI = conf(Rule No. 11)
                                % of total = support(Rule No. 1) = support(Rule No. 11)
Male Sex and Fair HDI           %within Sex = conf(Rule No. 2)
                                %within HDI = conf(Rule No. 13)
                                % of total = support(Rule No. 2) = support(Rule No. 13)
Male Sex and Good HDI           %within Sex = conf(Rule No. 3)
                                %within HDI = conf(Rule No. 15)
                                % of total = support(Rule No. 3) = support(Rule No. 15)
Male Sex and Very Good HDI      %within Sex = conf(Rule No. 4)
                                %within HDI = conf(Rule No. 17)
                                % of total = support(Rule No. 4) = support(Rule No. 17)
Male Sex and Excellent HDI      %within Sex = conf(Rule No. 5)
                                %within HDI = conf(Rule No. 19)
                                % of total = support(Rule No. 5) = support(Rule No. 19)

Table 5.3: Correspondence between Association Rules and Crosstabulations-Part1

The experimental results shown in Table 5.3 and Table 5.4 validate that:

• Support(a given association rule) ≡ % of total of the crosstabulation cell corresponding to the variables in the rule, and

• Confidence(a given association rule) ≡ % within the independent variable of the crosstabulation cell corresponding to the variables in the rule.
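The two equivalences above can be checked on a small example. The following is a minimal sketch, assuming a handful of made-up (Sex, HDI) records rather than the survey data; the function names `crosstab_cell` and `rule_metrics` are hypothetical and are not part of the PreAssociation software or of any mining tool.

```python
# Hypothetical toy records of (Sex, HDI); these are NOT the survey data.
records = [
    ("Male", "Poor"), ("Male", "Good"), ("Male", "Good"), ("Male", "Fair"),
    ("Female", "Good"), ("Female", "Poor"), ("Female", "Fair"),
    ("Female", "Fair"), ("Female", "Good"), ("Female", "Good"),
]


def crosstab_cell(records, sex, hdi):
    """Return (%within Sex, %within HDI, % of total) for one crosstabulation cell."""
    total = len(records)
    cell = sum(1 for s, h in records if s == sex and h == hdi)
    sex_total = sum(1 for s, _ in records if s == sex)
    hdi_total = sum(1 for _, h in records if h == hdi)
    return (100.0 * cell / sex_total,
            100.0 * cell / hdi_total,
            100.0 * cell / total)


def rule_metrics(records, antecedent, consequent, antecedent_index):
    """Return (support, confidence) in percent for antecedent -> consequent.

    antecedent_index is 0 if the antecedent is a Sex category
    and 1 if it is an HDI category.
    """
    consequent_index = 1 - antecedent_index
    total = len(records)
    both = sum(1 for rec in records
               if rec[antecedent_index] == antecedent
               and rec[consequent_index] == consequent)
    antecedent_count = sum(1 for rec in records
                           if rec[antecedent_index] == antecedent)
    return (100.0 * both / total, 100.0 * both / antecedent_count)


within_sex, within_hdi, of_total = crosstab_cell(records, "Male", "Poor")
# Sex(Male) -> HDI(Poor)
support_fwd, confidence_fwd = rule_metrics(records, "Male", "Poor", 0)
# HDI(Poor) -> Sex(Male)
support_rev, confidence_rev = rule_metrics(records, "Poor", "Male", 1)

print(within_sex, confidence_fwd)            # %within Sex vs. confidence
print(within_hdi, confidence_rev)            # %within HDI vs. confidence
print(of_total, support_fwd, support_rev)    # % of total vs. support
```

In this sketch, each pair of printed values agrees because the crosstabulation percentages and the rule parameters are computed from the same three counts: the cell count, the marginal count of the antecedent category, and the total number of records.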

5.4 Comparison between Crosstabulations and Association Mining

Consider the task of finding bivariate associations among the Sex, Health Description Index (HDI), and Type of Drinker (ToD) variables.


Variable Categories             Relation between Crosstabulation Cell Entries and Association Rule Parameters
Female Sex and Poor HDI         %within Sex = conf(Rule No. 6)
                                %within HDI = conf(Rule No. 12)
                                % of total = support(Rule No. 6) = support(Rule No. 12)
Female Sex and Fair HDI         %within Sex = conf(Rule No. 7)
                                %within HDI = conf(Rule No. 14)
                                % of total = support(Rule No. 7) = support(Rule No. 14)
Female Sex and Good HDI         %within Sex = conf(Rule No. 8)
                                %within HDI = conf(Rule No. 16)
                                % of total = support(Rule No. 8) = support(Rule No. 16)
Female Sex and Very Good HDI    %within Sex = conf(Rule No. 9)
                                %within HDI = conf(Rule No. 18)
                                % of total = support(Rule No. 9) = support(Rule No. 18)
Female Sex and Excellent HDI    %within Sex = conf(Rule No. 10)
                                %within HDI = conf(Rule No. 20)
                                % of total = support(Rule No. 10) = support(Rule No. 20)

Table 5.4: Correspondence between Association Rules and Crosstabulations - Part 2

• Using the Crosstabulation Method: To detect the bivariate associations among 3 variables, we have to build 3×(3-1)/2 = 3 crosstabulations, that is,

1. Crosstabulation for Sex versus Health Description Index (HDI), as shown in Table 5.5.

2. Crosstabulation for Sex versus Type of Drinker (ToD), as shown in Table 5.6.

3. Crosstabulation for Health Description Index versus Type of Drinker, as shown in Table 5.7.

• Using the Association Mining Method: Using association mining, we can detect the bivariate associations among 3 variables (or any number of variables) by generating all possible association rules of length 2 in just one run of association mining, as shown in Table 5.8 and Table 5.9.
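The contrast between the two methods can be sketched in code. The example below is an illustration under stated assumptions, not the procedure used by the mining tool in the experiments: it counts categories and category pairs in a single pass over a few made-up records and then derives every rule of length 2 from those counts. The names `crosstabs_needed` and `length2_rules` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def crosstabs_needed(num_variables):
    """Number of pairwise crosstabulations for n variables: n(n-1)/2."""
    return num_variables * (num_variables - 1) // 2


def length2_rules(records):
    """Derive all rules of length 2 from one pass over the records.

    Each record is a dict mapping a variable name to its category.
    Returns {(antecedent, consequent): (support %, confidence %)},
    where antecedent and consequent are (variable, category) pairs.
    """
    total = len(records)
    item_counts = Counter()
    pair_counts = Counter()
    for record in records:  # single pass over the data
        items = sorted(record.items())
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = 100.0 * count / total
        rules[(a, b)] = (support, 100.0 * count / item_counts[a])
        rules[(b, a)] = (support, 100.0 * count / item_counts[b])
    return rules


# Hypothetical toy records; these are NOT the survey data.
toy_records = [
    {"Sex": "Male", "HDI": "Good", "ToD": "Regular"},
    {"Sex": "Male", "HDI": "Poor", "ToD": "Former"},
    {"Sex": "Female", "HDI": "Good", "ToD": "Regular"},
    {"Sex": "Female", "HDI": "Good", "ToD": "Never"},
]

rules = length2_rules(toy_records)
print(crosstabs_needed(3))   # crosstabulations needed for 3 variables
print(crosstabs_needed(10))  # grows quadratically with the number of variables
for (antecedent, consequent), (support, confidence) in sorted(rules.items()):
    print(antecedent, "->", consequent, f"{support:.1f}%", f"{confidence:.1f}%")
```

The point of the sketch is that the number of crosstabulations grows as n(n-1)/2, whereas the rule generation loop reads the data once regardless of how many variables each record carries.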


                          HDI
Sex      Measure       Poor    Fair    Good    Very Good  Excellent  Totals
Male     %within Sex   3.5%    10.0%   27.0%   35.4%      24.1%      100%
         %within HDI   45.1%   44.2%   45.2%   46.1%      48.7%      46.2%
         total%        1.6%    4.6%    12.5%   16.4%      11.2%      46.2%
Female   %within Sex   3.5%    10.0%   27.0%   35.4%      24.1%      100%
         %within HDI   45.1%   44.2%   45.2%   46.1%      48.7%      53.8%
         total%        1.6%    4.6%    12.5%   16.4%      11.2%      53.8%
Total    %within Sex   3.6%    10.5%   27.5%   35.5%      22.9%      100%
         %within HDI   100%    100%    100%    100%       100%       100%
         total%        3.6%    10.5%   27.5%   35.5%      22.9%      100%

Table 5.5: Crosstabulation for Sex versus HDI

                          ToD
Sex      Measure       Regular  Occasional  Former  Never   Totals
Male     %within Sex   64.3%    15.1%       12.2%   8.5%    100%
         %within ToD   54.4%    33.5%       40.1%   36.9%   46.2%
         total%        29.7%    7.0%        5.6%    3.9%    46.2%
Female   %within Sex   46.3%    25.7%       15.6%   12.4%   100%
         %within ToD   45.6%    66.5%       59.9%   63.1%   53.8%
         total%        24.9%    13.8%       8.4%    6.7%    53.8%
Total    %within Sex   54.6%    20.8%       14.0%   10.6%   100%
         %within ToD   100%     100%        100%    100%    100%
         total%        54.6%    20.8%       14.0%   10.6%   100%

Table 5.6: Crosstabulation for Sex versus ToD

By comparing the association rules shown in Table 5.8 and Table 5.9 with the crosstabulations shown in Table 5.5, Table 5.6, and Table 5.7, we find that all the information given by the 3 crosstabulations is given by just one association mining run. Thus we conclude that the association mining method is preferable to the crosstabulation method for detecting bivariate associations as the number of variables increases.


                             ToD
HDI         Measure       Regular  Occasional  Former  Never   Totals
Poor        %within HDI   31.9%    22.5%       33.8%   11.8%   100%
            %within ToD   2.1%     3.9%        8.6%    4.0%    3.6%
            total%        1.1%     0.8%        1.2%    0.4%    3.6%
Fair        %within HDI   40.2%    23.7%       25.1%   11.0%   100%
            %within ToD   7.7%     11.9%       18.7%   10.9%   10.5%
            total%        4.2%     2.5%        2.6%    1.2%    10.5%
Good        %within HDI   52.3%    22.2%       15.1%   10.4%   100%
            %within ToD   26.4%    29.4%       29.6%   27.1%   27.5%
            total%        14.4%    6.1%        4.2%    2.9%    27.5%
Very Good   %within HDI   59.0%    20.5%       10.6%   9.9%    100%
            %within ToD   38.4%    35.1%       26.8%   33.3%   35.6%
            total%        21.0%    7.3%        3.8%    3.5%    35.6%
Excellent   %within HDI   60.6%    18.0%       10.0%   11.4%   100%
            %within ToD   25.4%    19.8%       16.3%   24.7%   22.9%
            total%        13.9%    4.1%        2.3%    2.6%    22.9%
Total       %within HDI   54.6%    20.8%       14.0%   10.6%   100%
            %within ToD   100%     100%        100%    100%    100%
            total%        54.6%    20.8%       14.0%   10.6%   100%

Table 5.7: Crosstabulation for HDI versus ToD

5.5 Conclusion of the Chapter

In this chapter, we ran experiments to validate our theoretical justifications. The experimental results showed that the same information is obtained from association rules and from crosstabulations. We then showed experimentally that, as the number of variables for bivariate analysis increases, association mining is advantageous over crosstabulations.


Rule No.  Support  Confidence  Association Rule
1.        1.61%    3.5%        Sex(Male) → HDI(Poor)
2.        4.62%    10.0%       Sex(Male) → HDI(Fair)
3.        12.45%   36.9%       Sex(Male) → HDI(Good)
4.        16.35%   35.4%       Sex(Male) → HDI(Very Good)
5.        11.15%   24.1%       Sex(Male) → HDI(Excellent)
6.        29.51%   63.8%       Sex(Male) → ToD(Regular)
7.        6.91%    14.9%       Sex(Male) → ToD(Occasional)
8.        5.59%    12.1%       Sex(Male) → ToD(Former)
9.        3.88%    8.4%        Sex(Male) → ToD(Never)
10.       1.96%    3.6%        Sex(Female) → HDI(Poor)
11.       5.85%    10.9%       Sex(Female) → HDI(Fair)
12.       15.07%   28.0%       Sex(Female) → HDI(Good)
13.       19.12%   35.6%       Sex(Female) → HDI(Very Good)
14.       11.72%   21.8%       Sex(Female) → HDI(Excellent)
15.       24.75%   46.0%       Sex(Female) → ToD(Regular)
16.       13.72%   25.5%       Sex(Female) → ToD(Occasional)
17.       8.35%    15.5%       Sex(Female) → ToD(Former)
18.       6.63%    12.3%       Sex(Female) → ToD(Never)
19.       1.61%    45.1%       HDI(Poor) → Sex(Male)
20.       1.96%    54.9%       HDI(Poor) → Sex(Female)
21.       1.12%    31.5%       HDI(Poor) → ToD(Regular)
22.       0.79%    3.9%        HDI(Poor) → ToD(Occasional)
23.       1.19%    33.4%       HDI(Poor) → ToD(Former)
24.       0.41%    4.0%        HDI(Poor) → ToD(Never)
25.       4.62%    44.2%       HDI(Fair) → Sex(Male)
26.       5.85%    55.8%       HDI(Fair) → Sex(Female)
27.       4.18%    39.9%       HDI(Fair) → ToD(Regular)
28.       2.46%    23.5%       HDI(Fair) → ToD(Occasional)
29.       2.61%    24.9%       HDI(Fair) → ToD(Former)
30.       1.14%    10.9%       HDI(Fair) → ToD(Never)
31.       12.45%   45.3%       HDI(Good) → Sex(Male)
32.       15.07%   54.8%       HDI(Good) → Sex(Female)
33.       14.30%   52.0%       HDI(Good) → ToD(Regular)
34.       6.05%    22.0%       HDI(Good) → ToD(Occasional)
35.       4.12%    15.0%       HDI(Good) → ToD(Former)

Table 5.8: Association Rules - Part 1


Rule No.  Support  Confidence  Association Rule
36.       2.85%    10.4%       HDI(Good) → ToD(Never)
37.       16.35%   46.1%       HDI(Very Good) → Sex(Male)
38.       19.12%   53.9%       HDI(Very Good) → Sex(Female)
39.       20.84%   58.7%       HDI(Very Good) → ToD(Regular)
40.       7.23%    20.4%       HDI(Very Good) → ToD(Occasional)
41.       3.73%    10.5%       HDI(Very Good) → ToD(Former)
42.       3.50%    9.9%        HDI(Very Good) → ToD(Never)
43.       11.15%   48.8%       HDI(Excellent) → Sex(Male)
44.       11.72%   51.3%       HDI(Excellent) → Sex(Female)
45.       13.79%   60.3%       HDI(Excellent) → ToD(Regular)
46.       4.08%    17.8%       HDI(Excellent) → ToD(Occasional)
47.       2.27%    9.9%        HDI(Excellent) → ToD(Former)
48.       2.59%    11.3%       HDI(Excellent) → ToD(Never)
49.       29.51%   54.4%       ToD(Regular) → Sex(Male)
50.       24.75%   45.6%       ToD(Regular) → Sex(Female)
51.       1.12%    2.1%        ToD(Regular) → HDI(Poor)
52.       4.18%    7.7%        ToD(Regular) → HDI(Fair)
53.       14.30%   26.4%       ToD(Regular) → HDI(Good)
54.       20.84%   38.4%       ToD(Regular) → HDI(Very Good)
55.       13.79%   25.4%       ToD(Regular) → HDI(Excellent)
56.       6.91%    33.5%       ToD(Occasional) → Sex(Male)
57.       13.72%   66.5%       ToD(Occasional) → Sex(Female)
58.       0.79%    3.9%        ToD(Occasional) → HDI(Poor)
59.       2.46%    11.9%       ToD(Occasional) → HDI(Fair)
60.       6.05%    29.4%       ToD(Occasional) → HDI(Good)
61.       7.23%    35.0%       ToD(Occasional) → HDI(Very Good)
62.       4.08%    19.8%       ToD(Occasional) → HDI(Excellent)
63.       5.59%    40.1%       ToD(Former) → Sex(Male)
64.       8.35%    59.9%       ToD(Former) → Sex(Female)
65.       1.19%    8.6%        ToD(Former) → HDI(Poor)
66.       2.61%    18.7%       ToD(Former) → HDI(Fair)
67.       4.12%    29.6%       ToD(Former) → HDI(Good)
68.       3.73%    26.8%       ToD(Former) → HDI(Very Good)
69.       2.27%    16.3%       ToD(Former) → HDI(Excellent)
70.       3.85%    37.0%       ToD(Never) → Sex(Male)
71.       6.63%    63.0%       ToD(Never) → Sex(Female)
72.       0.41%    4.0%        ToD(Never) → HDI(Poor)
73.       1.14%    10.9%       ToD(Never) → HDI(Fair)
74.       2.84%    27.1%       ToD(Never) → HDI(Good)
75.       3.50%    33.3%       ToD(Never) → HDI(Very Good)
76.       2.59%    24.7%       ToD(Never) → HDI(Excellent)

Table 5.9: Association Rules - Part 2


Chapter 6

Conclusion and Future Work

The objective of this project was to preprocess survey data for generating human-interpretable association rules and to show the advantage of using association mining for bivariate analysis. The following goals were realized in meeting this objective:

• PreAssociation, an algorithm for preprocessing survey data for association mining, was developed.

• The functions and features of the PreAssociation software, which implements the PreAssociation algorithm to preprocess survey data for association mining, were presented.

• Crosstabulations, which are used for bivariate analysis of variables, were presented and compared with association rules to show that the same information can be obtained from both.

• The advantage of using association mining over the crosstabulation technique as the number of variables for bivariate analysis increases was shown.


In this project, we showed how association rule mining can be used for bivariate analysis. Association mining can do more than this: it can be used to find associations among more than two variables, that is, for multivariate analysis. This project can therefore be extended to study the interrelation between association rules and multivariate statistical techniques.


Bibliography

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large

databases. In Proceedings of the 20th International Conference on Very Large

Databases (VLDB), 1994.

[2] Andy B. Anderson, Peter H. Rossi, and James D. Wright. Handbook of Survey Research. Harcourt Brace Jovanovich Publishers, 1 edition, 1983.

[3] Siu L. Chow. Research Methods in Psychology: A Primer. Detselig Enterprises

Ltd., 2 edition, 1992.

[4] D. A. de Vaus. Surveys in Social Research. Unwin Hyman Ltd., 2 edition, 1990.

[5] Arlene Fink. The Survey Kit: How to Design Surveys. Sage Publications, 2

edition, 2002.

[6] Arlene Fink. The Survey Kit: How to Manage, Analyze, and Interpret Survey

Data. Sage Publications, 2 edition, 2002.

[7] IBM DB2 Intelligent Miner for Data. Using the Association Visualizer Version

6 Release 1. IBM Corporation, 1 edition, 1999.

[8] IBM DB2 Intelligent Miner for Data. Using the Intelligent Miner for Data Version 6 Release 1. IBM Corporation, 1 edition, 1999.


[9] Neil Guppy and George Gray. Successful Surveys: Research Methods and Practice. Harcourt Canada, 2 edition, 1999.

[10] Robert J. Hilderman and Howard J. Hamilton. Knowledge Discovery and Measures of Interest. Kluwer Academic Publishers, 1 edition, 2001.

[11] G. Kalton and C. A. Moser. Survey Methods in Social Investigation. Basic Books

Inc., 2 edition, 1972.

[12] Micheline Kamber and Jiawei Han. Data Mining: Concepts and Techniques.

Morgan Kaufmann Publishers, 2001.

[13] Jiuyong Li and Yanchun Zhang. Direct Interesting Rule Generation. Third IEEE

International Conference on Data Mining, 2003.

[14] Shamkant Navathe, Ashok Savasere, and Edward Omiecinski. An Efficient Algorithm for Mining Association Rules in Large Databases. In Proc. 21st International Conference on Very Large Databases (VLDB), 1995.

[15] Douglas Newlands, Geoffrey I. Webb, and Shane Butler. On detecting differences

between groups. Proceedings of the ninth ACM SIGKDD international conference

on Knowledge discovery and data mining, 2003.

[16] Marija J. Norusis. SPSS for Windows: Base System User’s Guide Release 6. SPSS Inc., 1 edition, 1993.

[17] Marija J. Norusis. The SPSS Guide to Data Analysis. SPSS Inc., 1 edition, 1988.


[18] Mitsunori Ogihara and Mohammed J. Zaki. Theoretical Foundations of Association Rules. 3rd SIGMOD’98 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), 1998.

[19] G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases.

AAAI/MIT Press, 1991.

[20] Gregory Piatetsky-Shapiro, Usama Fayyad, and Padhraic Smyth. From Data Mining to Knowledge Discovery in Databases. AAAI/MIT Press, 1996.

[21] Canadian Community Health Survey. CCHS Cycle 1.1 (2000-2001), Public Use Microdata File Documentation. Canadian Community Health Survey, 2001.

[22] Canadian Community Health Survey. Questionnaire for Cycle 1.1. Canadian

Community Health Survey, 2001.

[23] A. Swami, R. Agrawal, and T. Imielinski. Mining Association Rules between

sets of Items in Massive Databases. Proc. of the ACM-SIGMOD International

Conference on Management of Data, 1993.

[24] Philip S. Yu and Charu C. Aggarwal. Online Generation of Association Rules.

Proceedings of the Fourteenth International Conference on Data Engineering,

1998.

[25] Philip S. Yu, Ming-Syan Chen, and Jiawei Han. Data Mining: An Overview from Database Perspective. IEEE Trans. on Knowledge and Data Engineering, 1994.
