
Tutorial/Assignment Sheet

Subject Name: Big Data Analytics    Tutorial Number: 1    Subject Code: CS-828    Class / Semester: 7    Max. Marks: 1    Time Allowed: 1 week

1. What is Big Data? Explain the characteristics of big data. (Mapped CO: CS 828.1)

2. What is the need of Big Data? (Mapped CO: CS 828.1)

3. What are risks associated with big data analytics? (Mapped CO: CS 828.1)

4. What is the structure of big data? Explain the data analytics life cycle. (Mapped CO: CS 828.1)


Tutorial/Assignment Sheet

Subject Name: Big Data Analytics    Tutorial Number: 2    Subject Code: CS-828    Class / Semester: 7    Max. Marks: 1    Time Allowed: 1 week

1. What are the different analytics problems? Explain. (Mapped CO: CS 828.2)

2. Explain the different models used for different problems. (Mapped CO: CS 828.2)

3. What are the performance measures for scoring problems? (Mapped CO: CS 828.2)

4. What are the performance measures for classification problems? (Mapped CO: CS 828.2)

Prepared By: Bindiya Ahuja (Course Coordinator)

Approved By: Dr. Suresh Kumar (Professor & HOD-CSE)


Solution of Tutorial 1

Ans 1. 'Big Data' is a term used to describe collections of data that are huge in size and still growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.

Characteristics Of 'Big Data'

(i) Volume – The name 'Big Data' itself refers to an enormous size. The size of data plays a crucial role in determining the value that can be derived from it, and whether a particular dataset can actually be considered Big Data depends on its volume. Hence, volume is one characteristic that needs to be considered while dealing with Big Data.

(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analysing data.

(iii) Velocity – The term 'velocity' refers to the speed of data generation: how fast the data is generated and processed to meet demand determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency that the data can show at times, which hampers the process of handling and managing the data effectively.

 

Ans 2. Big Data is needed for:

• Improving insights about our customers

• Improved decision making

• Competitive advantage

• Innovation/service creation

• Controlling and reducing operating costs

• Productivity

• Shorter time to market for products/services


• It could replace or support human decision making with automated algorithms

Ans 3:

• Lack of management support

• Limited budget

• Legacy issues

• Difficulty integrating initiatives surrounding unstructured data

• Poor data quality

• Difficulty uncovering actionable insights / inadequate reporting

• Escalating infrastructure and maintenance costs due to data growth

Ans 4:

Traditional Data Mining Life Cycle

In order to provide a framework to organize the work needed by an organization and deliver clear insights from Big Data, it is useful to think of it as a cycle with different stages. The cycle is by no means linear: all the stages are related to each other. It has superficial similarities with the more traditional data mining cycle described in the CRISP methodology.

CRISP-DM Methodology

The CRISP-DM methodology, which stands for Cross Industry Standard Process for Data Mining, is a cycle that describes the approaches commonly used by data mining experts to tackle problems in traditional BI data mining. It is still being used in traditional BI data mining teams.

The major stages of the cycle described by the CRISP-DM methodology, and the way they are interrelated, are outlined below.

CRISP-DM was conceived in 1996 and, the next year, got underway as a European Union project under the ESPRIT funding initiative. The project was led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA (an insurance company). The project was finally incorporated into SPSS. The methodology is extremely detail-oriented in how a data mining project should be specified.

Let us now look a little more closely at each of the stages involved in the CRISP-DM life cycle −

Business Understanding − This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition. A preliminary plan is designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.

Data Understanding − The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

Data Preparation − The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

Modeling − In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, it is often required to step back to the data preparation phase.

Evaluation − At this stage in the project, you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to evaluate the model thoroughly and review the steps executed to construct the model, to be certain it properly achieves the business objectives.

A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Deployment − Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer.

Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process.

In many cases, it will be the customer, not the data analyst, who carries out the deployment steps. Even if the analyst deploys the model, it is important for the customer to understand upfront the actions that will need to be carried out in order to actually make use of the created models.

SEMMA Methodology

SEMMA is another methodology, developed by SAS for data mining modeling. It stands for Sample, Explore, Modify, Model, and Assess. Here is a brief description of its stages −

Sample − The process starts with data sampling, e.g., selecting the dataset for modeling. The dataset should be large enough to contain sufficient information to retrieve, yet small enough to be used efficiently. This phase also deals with data partitioning.

Explore − This phase covers the understanding of the data by discovering anticipated and unanticipated relationships between the variables, and also abnormalities, with the help of data visualization.

Modify − The Modify phase contains methods to select, create and transform variables in preparation for data modeling.

Model − In the Model phase, the focus is on applying various modeling (data mining) techniques on the prepared variables in order to create models that possibly provide the desired outcome.

Assess − The evaluation of the modeling results shows the reliability and usefulness of the created models.

The main difference between CRISP-DM and SEMMA is that SEMMA focuses on the modeling aspect, whereas CRISP-DM gives more importance to the stages of the cycle prior to modeling, such as understanding the business problem to be solved and understanding and preprocessing the data to be used as input to, for example, machine learning algorithms.

Big Data Life Cycle

In today’s big data context, the previous approaches are either incomplete or suboptimal. For example, the SEMMA methodology completely disregards the collection and preprocessing of different data sources, even though these stages normally constitute most of the work in a successful big data project.

A big data analytics cycle can be described by the following stages −

Business Problem Definition

Research

Human Resources Assessment


Data Acquisition

Data Munging

Data Storage

Exploratory Data Analysis

Data Preparation for Modeling and Assessment

Modeling

Implementation

In this section, we will throw some light on each of these stages of the big data life cycle.

Business Problem Definition

This point is common to the traditional BI and big data analytics life cycles. Defining the problem and correctly evaluating how much potential gain it may have for the organization is normally a non-trivial stage of a big data project. It may seem obvious to mention, but the expected gains and costs of the project have to be evaluated.

Research

Analyze what other companies have done in the same situation. This involves looking for solutions that are reasonable for your company, even if it means adapting other solutions to the resources and requirements your company has. In this stage, a methodology for the future stages should be defined.

Human Resources Assessment

Once the problem is defined, it is reasonable to continue by analyzing whether the current staff is able to complete the project successfully. Traditional BI teams might not be capable of delivering an optimal solution for all the stages, so it should be considered before starting the project whether there is a need to outsource part of the project or hire more people.

Data Acquisition

This stage is key in the big data life cycle; it defines which types of profiles are needed to deliver the resulting data product. Data gathering is a non-trivial step of the process; it normally involves collecting unstructured data from different sources. To give an example, it could involve writing a crawler to retrieve reviews from a website. This involves dealing with text, perhaps in different languages, and normally requires a significant amount of time to complete.
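
As a rough, hedged sketch of what such a crawler might look like in R, the snippet below uses the rvest package; the URL and the ".review" CSS selector are hypothetical placeholders, and a real site would need its own selectors (and crawling etiquette).

# Minimal web-scraping sketch using rvest. The URL and the ".review"
# CSS selector are hypothetical placeholders, not a real endpoint.
library(rvest)

page    <- read_html("https://example.com/product/reviews")   # hypothetical URL
reviews <- page %>%
  html_elements(".review") %>%    # hypothetical selector for review blocks
  html_text2()                    # extract the visible review text

head(reviews)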


Data Munging

Once the data is retrieved, for example from the web, it needs to be stored in an easy-to-use format. To continue with the reviews example, let's assume the data is retrieved from different sites, each of which displays the data differently.

Suppose one data source gives reviews as a rating in stars, so it is possible to read the response variable as a mapping y ∈ {1, 2, 3, 4, 5}. Another data source gives reviews using a two-arrow system, one for up-voting and the other for down-voting. This implies a response variable of the form y ∈ {positive, negative}.

In order to combine both data sources, a decision has to be made to make these two response representations equivalent. This can involve converting the first representation into the second, for example by treating one star as negative and five stars as positive. This process often requires a large time allocation to be delivered with good quality.
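
A minimal sketch of this kind of harmonization, assuming two toy data frames and a cut-off of four or more stars counting as positive (an illustrative choice, not a rule from the text):

# Sketch: mapping two review sources onto one response representation.
# The toy data frames and the 4-star cut-off are illustrative assumptions.
stars_reviews <- data.frame(text = c("great", "awful"), stars = c(5, 1))
arrow_reviews <- data.frame(text = c("nice", "bad"), vote = c("up", "down"))

stars_reviews$y <- ifelse(stars_reviews$stars >= 4, "positive", "negative")
arrow_reviews$y <- ifelse(arrow_reviews$vote == "up", "positive", "negative")

# Both sources now share the representation y in {positive, negative}
reviews <- rbind(stars_reviews[, c("text", "y")], arrow_reviews[, c("text", "y")])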

Data Storage

Once the data is processed, it sometimes needs to be stored in a database. Big data technologies offer plenty of alternatives for this. The most common alternative is the Hadoop Distributed File System (HDFS) for storage, with Hive providing a limited version of SQL on top of it, known as the Hive Query Language (HiveQL). From the user's perspective, this allows most analytics tasks to be done in a similar way as in traditional BI data warehouses. Other storage options to consider are MongoDB and Redis; Spark is commonly paired with such stores as a processing engine.

This stage of the cycle is related to the human resources' knowledge, in terms of their ability to implement different architectures. Modified versions of traditional data warehouses are still being used in large-scale applications. For example, Teradata and IBM offer SQL databases that can handle terabytes of data, and open source solutions such as PostgreSQL and MySQL are still being used for large-scale applications.

Even though the different storage systems work differently in the background, from the client side most solutions provide a SQL API. Hence, a good understanding of SQL is still a key skill to have for big data analytics.
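
As a small illustration of the "SQL API from the client side" point, the sketch below uses the DBI interface with an in-memory SQLite database purely as a stand-in for whichever backend (Hive, PostgreSQL, Teradata, etc.) is actually deployed; the table and column names are hypothetical.

# Sketch: the same SQL-style client workflow applies to most backends;
# an in-memory SQLite database stands in for the real data store.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "reviews",
             data.frame(y = c("positive", "negative", "positive"),
                        source = c("stars", "arrows", "stars")))
dbGetQuery(con, "SELECT source, COUNT(*) AS n FROM reviews GROUP BY source")
dbDisconnect(con)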

A priori, this stage seems to be the most important topic; in practice, this is not true. It is not even an essential stage. It is possible to implement a big data solution that works with real-time data, in which case we only need to gather data to develop the model and then implement it in real time, so there would be no need to formally store the data at all.


Exploratory Data Analysis

Once the data has been cleaned and stored in a way that insights can be retrieved from it, the data exploration phase is mandatory. The objective of this stage is to understand the data; this is normally done with statistical techniques and by plotting the data. This is a good stage to evaluate whether the problem definition makes sense or is feasible.
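
A minimal sketch of this kind of exploration, assuming the cleaned data sits in a hypothetical data frame called reviews_df:

# Sketch: quick exploratory checks on a hypothetical cleaned data frame.
set.seed(1)
reviews_df <- data.frame(y = sample(c("positive", "negative"), 100, replace = TRUE),
                         length = rpois(100, 40))

summary(reviews_df)             # ranges, counts and missing values at a glance
table(reviews_df$y)             # class balance
hist(reviews_df$length,         # simple plot of one numeric feature
     main = "Review length", xlab = "words")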

Data Preparation for Modeling and Assessment

This stage involves reshaping the cleaned data retrieved previously and using statistical preprocessing for missing-value imputation, outlier detection, normalization, feature extraction and feature selection.
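
A minimal sketch of a few of these preprocessing steps on a toy numeric feature (the cut-offs are illustrative, not recommendations):

# Sketch: simple preprocessing on a toy numeric feature.
x <- c(2.1, NA, 3.7, 8.9, 4.2, NA, 120)        # feature with missing values

x[is.na(x)] <- median(x, na.rm = TRUE)         # missing-value imputation (median)
is_outlier  <- abs(x - mean(x)) > 3 * sd(x)    # crude outlier flag
x_scaled    <- as.numeric(scale(x))            # normalization (z-score)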

Modeling

The prior stage should have produced datasets for training and testing, for example, for a predictive model. This stage involves trying different models with a view to solving the business problem at hand. In practice, it is normally desirable that the model give some insight into the business. Finally, the best model or combination of models is selected by evaluating its performance on a left-out dataset.
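
A minimal sketch of this train/compare/select loop, using the built-in iris data purely as a stand-in for the business problem:

# Sketch: compare two candidate models on a left-out (test) set.
set.seed(1)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

m1 <- lm(Sepal.Length ~ Petal.Length, data = train)
m2 <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = train)

rmse <- function(m) sqrt(mean((predict(m, newdata = test) - test$Sepal.Length)^2))
c(model1 = rmse(m1), model2 = rmse(m2))   # keep the model with lower left-out RMSE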

Implementation

In this stage, the data product developed is implemented in the data pipeline of the company. This involves setting up a validation scheme while the data product is working, in order to track its performance. For example, in the case of a predictive model, this stage would involve applying the model to new data and, once the response is available, evaluating the model.
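
A minimal sketch of how scoring and performance tracking might be wired together; the function and object names are hypothetical, not part of any specific pipeline:

# Sketch: score incoming batches with the deployed model and log a running
# error estimate once true outcomes become available (names hypothetical).
score_batch <- function(model, new_data, truth = NULL) {
  preds <- predict(model, newdata = new_data)
  if (!is.null(truth)) {
    message("batch RMSE: ", round(sqrt(mean((preds - truth)^2)), 3))
  }
  preds
}

# score_batch(deployed_model, incoming_batch)            # predictions only
# score_batch(deployed_model, incoming_batch, y_true)    # once outcomes arrive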


Solution of Tutorial 2

Ans 1:

1. Scoring problems

2. Classification problems

3. Clustering problems

Ans 2: Some common classification methods

Naive Bayes

Naive Bayes classifiers are especially useful for problems with many input variables, categorical input variables with a very large number of possible values, and text classification. Naive Bayes would be a good first attempt at solving the product categorization problem.

Decision trees

Decision trees are useful when input variables interact with the output in “if-then” kinds of ways (such as IF age > 65, THEN has.health.insurance=T). They are also suitable when inputs have an AND relationship to each other (such as IF age < 25 AND student=T, THEN...) or when input variables are redundant or correlated. The decision rules that come from a decision tree are in principle easier for nontechnical users to understand than the decision processes that come from other classifiers.

Logistic regression

Logistic regression is appropriate when you want to estimate class probabilities (the probability that an object is in a given class) in addition to class assignments.[a] An example use of a logistic regression–based classifier is estimating the probability of fraud in credit card purchases. Logistic regression is also a good choice when you want an idea of the relative impact of different input variables on the output. For example, you might find out that a $100 increase in transaction size increases the odds that the transaction is fraud by 2%, all else being equal.

Support vector machines

Support vector machines (SVMs) are useful when there are very many input variables or when input variables interact with the outcome or with each other in complicated (nonlinear) ways. SVMs make fewer assumptions about variable distribution than do many other methods, which makes them especially useful when the training data isn’t completely representative of the way the data is distributed in production.
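
As a hedged, runnable illustration of trying one of the listed methods, the sketch below fits a decision tree with the rpart package on the built-in iris data; it stands in for, rather than reproduces, the product-categorization and fraud examples above.

# Sketch: a decision-tree classifier via rpart on a built-in dataset.
library(rpart)

fit  <- rpart(Species ~ ., data = iris, method = "class")   # grow the tree
pred <- predict(fit, newdata = iris, type = "class")        # predicted classes
table(truth = iris$Species, prediction = pred)              # quick confusion matrix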

Common scoring methods

We’ll cover the following two general scoring methods in more detail in later chapters.

Linear regression

Linear regression builds a model such that the predicted numerical output is a linear additive function of the inputs. This can be a very effective approximation, even when the underlying situation is in fact nonlinear. The resulting model also gives an indication of the relative impact of each input variable on the output. Linear regression is often a good first model to try when trying to predict a numeric value.

Logistic regression

Logistic regression always predicts a value between 0 and 1, making it suitable for predicting probabilities (when the observed outcome is a categorical value) and rates (when the observed outcome is a rate or ratio). As we mentioned, logistic regression is an appropriate approach to the fraud detection problem, if what you want to estimate is the probability that a given transaction is fraudulent or legitimate.

Some common clustering methods include the following (a minimal k-means sketch follows the list):

·        K-means clustering

·        Apriori algorithm for finding association rules

·        Nearest neighbor
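
A minimal k-means sketch on the numeric columns of the built-in iris data; k = 3 is an illustrative choice, not a recommendation:

# Sketch: k-means clustering on a built-in dataset (k = 3 is illustrative).
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
table(cluster = km$cluster, species = iris$Species)   # compare clusters to labels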

Ans 3: Evaluating scoring models

Evaluating models that assign scores can be a somewhat visual task. The main concept is looking at what are called the residuals, or the differences between our predictions f(x[i,]) and the actual outcomes y[i]. Figure 5.7 illustrates the concept.

Figure 5.7. Scoring residuals


The data and graph in figure 5.7 were produced by the R commands in the following listing.

Listing 5.5. Plotting residuals

d <- data.frame(y=(1:10)^2, x=1:10)           # quadratic toy data
model <- lm(y~x, data=d)                      # fit a linear model
d$prediction <- predict(model, newdata=d)     # store the model's predictions
library('ggplot2')
ggplot(data=d) + geom_point(aes(x=x, y=y)) +
    geom_line(aes(x=x, y=prediction), color='blue') +
    geom_segment(aes(x=x, y=prediction, yend=y, xend=x)) +   # residual segments
    scale_y_continuous('')

Root mean square error

The most common goodness-of-fit measure is called root mean square error (RMSE). This is the square root of the average square of the difference between our prediction and actual values. Think of it as being like a standard deviation: how much your prediction is typically off. In our case, the RMSE is sqrt(mean((d$prediction-d$y)^2)), or about 7.27. The RMSE is in the same units as your y-values are, so if your y-units are pounds, your RMSE is in pounds. RMSE is a good measure, because it is often what the fitting algorithms you’re using are explicitly trying to minimize. A good RMSE business goal would be “We want the RMSE on account valuation to be under $1,000 per account.”

Most RMSE calculations (including ours) don’t include any bias correction for sample size or model complexity, though you’ll see adjusted RMSE in chapter 7.

R-squared

Another important measure of fit is called R-squared (or R2, or the coefficient of determination). It’s defined as 1.0 minus how much unexplained variance your model leaves (measured relative to a null model of just using the average y as a prediction). In our case, the R-squared is 1-sum((d$prediction-d$y)^2)/sum((mean(d$y)-d$y)^2), or 0.95. R-squared is dimensionless (it’s not the units of what you’re trying to predict), and the best possible R-squared is 1.0 (with near-zero or negative R-squared being horrible). R-squared can be thought of as what fraction of the y variation is explained by the model. For linear regression (with appropriate bias corrections), this interpretation is fairly clear. Some other models (like logistic regression) use deviance to report an analogous quantity called pseudo R-squared.
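
Both measures can be computed directly from the predictions of the model in listing 5.5:

# RMSE and R-squared for the model of listing 5.5 (data frame d from above).
rmse <- sqrt(mean((d$prediction - d$y)^2))                          # about 7.27
rsq  <- 1 - sum((d$prediction - d$y)^2) / sum((mean(d$y) - d$y)^2)  # about 0.95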

Under certain circumstances, R-squared is equal to the square of another measure called the correlation (or Pearson product-moment correlation coefficient; see http://mng.bz/ndYf). R-squared can be derived from RMSE plus a few facts about the data (so R-squared can be thought of as a normalized version of RMSE). A good R-squared business goal would be “We want the model to explain 70% of account value.”

However, R-squared is not always the best business-oriented metric. For example, it’s hard to tell what a 10% reduction of RMSE would mean in relation to the Netflix Prize. But it would be easy to map the number of ranking errors and amount of suggestion diversity to actual Netflix business benefits.

Correlation


Correlation is very helpful in checking if variables are potentially useful in a model. Be advised that there are at least three calculations that go by the name of correlation: Pearson, Spearman, and Kendall (see help(cor)). The Pearson coefficient checks for linear relations, the Spearman coefficient checks for rank or ordered relations, and the Kendall coefficient checks for degree of voting agreement. Each of these coefficients performs a progressively more drastic transform than the one before and has well-known direct significance tests (see help(cor.test)).
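
Using the same data frame d from listing 5.5, the three variants are computed like this:

# The three correlation variants on the data from listing 5.5.
cor(d$prediction, d$y, method = "pearson")    # linear relation
cor(d$prediction, d$y, method = "spearman")   # rank / ordered relation
cor(d$prediction, d$y, method = "kendall")    # degree of voting agreement
cor.test(d$prediction, d$y)                   # direct significance test (Pearson)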

 

Don’t use correlation to evaluate model quality in production

It's tempting to use correlation to measure model quality, but we advise against it. The problem is this: correlation ignores shifts and scaling factors, so correlation actually measures whether some shift and rescaling of your predictor is a good predictor. This isn't a problem for training data (as these predictions tend not to have a systematic bias in shift or scaling, by design), but it can mask systematic errors that may arise when a model is used in production.

 

Absolute error

For many applications (especially those involving predicting monetary amounts), measures such as absolute error (sum(abs(d$prediction-d$y))), mean absolute error (sum(abs(d$prediction-d$y))/length(d$y)), and relative absolute error (sum(abs(d$prediction-d$y))/sum(abs(d$y))) are tempting measures. It does make sense to check and report these measures, but it’s usually not advisable to make these measures the project goal or to attempt to directly optimize them. This is because absolute error measures tend not to “get aggregates right” or “roll up reasonably” as most of the squared errors do.

As an example, consider an online advertising company with three advertisement purchases returning $0, $0, and $25 respectively. Suppose our modeling task is as simple as picking a single summary value not too far from the original three prices. The price minimizing absolute error is the median, which is $0, yielding an absolute error of sum(abs(c(0,0,25)-0)), or $25. The price minimizing square error is the mean, which is $8.33 (and which has a worse absolute error of $33.33). However, the median price of $0 misleadingly values the entire campaign at $0. One great advantage of the mean is this: aggregating a mean prediction gives an unbiased prediction of the aggregate in question. It is often an unstated project need that various totals or roll-ups of the predicted amounts be close to the roll-ups of the unknown values to be predicted. For monetary applications, predicting the totals or aggregates accurately is often more important than getting individual values right. In fact, most statistical modeling techniques are designed for regression, which is the unbiased prediction of means or expected values.
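
The advertising example works out like this in R:

# Worked example: three advertisement purchases worth $0, $0 and $25.
prices <- c(0, 0, 25)

median(prices)                        # 0:    the value minimizing absolute error
mean(prices)                          # 8.33: the value minimizing squared error
sum(abs(prices - median(prices)))     # 25:   absolute error of the median
sum(abs(prices - mean(prices)))       # 33.3: absolute error of the mean
3 * mean(prices)                      # 25:   the mean rolls up to the true total
3 * median(prices)                    # 0:    the median misvalues the whole campaign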

Ans 4: Evaluating classification models


A classification model places examples into two or more categories. The most common measure of classifier quality is accuracy. For measuring classifier performance, we’ll first introduce the incredibly useful tool called the confusion matrix and show how it can be used to calculate many important evaluation scores. The first score we’ll discuss is accuracy, and then we’ll move on to better and more detailed measures such as precision and recall.

Let’s use the example of classifying email into spam (email we in no way want) and non-spam (email we want). A ready-to-go example (with a good description) is the Spambase dataset (http://mng.bz/e8Rh). Each row of this dataset is a set of features measured for a specific email and an additional column telling whether the mail was spam (unwanted) or non-spam (wanted). We’ll quickly build a spam classification model so we have results to evaluate. To do this, download the file Spambase/spamD.tsv from the book’s GitHub site (https://github.com/WinVector/zmPDSwR/tree/master/Spambase) and then perform the steps shown in the following listing.

Listing 5.1. Building and applying a logistic regression spam model

spamD <- read.table('spamD.tsv', header=T, sep='\t')        # load the dataset

spamTrain <- subset(spamD, spamD$rgroup >= 10)              # ~90% of rows for training
spamTest  <- subset(spamD, spamD$rgroup < 10)               # ~10% of rows for testing

spamVars <- setdiff(colnames(spamD), list('rgroup','spam')) # input variables

spamFormula <- as.formula(paste('spam=="spam"',
   paste(spamVars, collapse=' + '), sep=' ~ '))             # build the model formula

spamModel <- glm(spamFormula, family=binomial(link='logit'),
   data=spamTrain)                                          # fit logistic regression

spamTrain$pred <- predict(spamModel, newdata=spamTrain, type='response')
spamTest$pred  <- predict(spamModel, newdata=spamTest,  type='response')

print(with(spamTest, table(y=spam, glmPred=pred>0.5)))

##           glmPred
## y          FALSE TRUE
##   non-spam   264   14
##   spam        22  158

A sample of the results of our simple spam classifier is shown in the next listing.

Listing 5.2. Spam classifications

> sample <- spamTest[c(7,35,224,327),c('spam','pred')]

> print(sample)

         spam         pred

115      spam 0.9903246227

361      spam 0.4800498077

2300 non-spam 0.0006846551

3428 non-spam 0.0001434345

The confusion matrix

The absolute most interesting summary of classifier performance is the confusion matrix. This matrix is just a table that summarizes the classifier’s predictions against the actual known data categories.

The confusion matrix is a table counting how often each combination of known outcomes (the truth) occurred in combination with each prediction type. For our email spam example, the confusion matrix is given by the following R command.

Listing 5.3. Spam confusion matrix

> cM <- table(truth=spamTest$spam,prediction=spamTest$pred>0.5)

> print(cM)

         prediction

truth      FALSE TRUE

  non-spam   264   14

  spam        22  158


Using this summary, we can now start to judge the performance of the model. In a two-by-two confusion matrix, every cell has a special name, as illustrated in table 5.4.

Table 5.4. Standard two-by-two confusion matrix

                              Prediction=NEGATIVE          Prediction=POSITIVE

Truth mark=NOT IN CATEGORY    True negatives (TN)          False positives (FP)
                              cM[1,1]=264                  cM[1,2]=14

Truth mark=IN CATEGORY        False negatives (FN)         True positives (TP)
                              cM[2,1]=22                   cM[2,2]=158

Changing a score to a classification

Note that we converted the numerical prediction score into a decision by checking if the score was above or below 0.5. For some scoring models (like logistic regression) the 0.5 score is likely a high accuracy value. However, accuracy isn’t always the end goal, and for unbalanced training data the 0.5 threshold won’t be good. Picking thresholds other than 0.5 can allow the data scientist to trade precision for recall (two terms that we’ll define later in this chapter). You can start at 0.5, but consider trying other thresholds and looking at the ROC curve.

 

Most of the performance measures of a classifier can be read off the entries of this confusion matrix. We start with the most common measure: accuracy.

Accuracy

Accuracy is by far the most widely known measure of classifier performance. For a classifier, accuracy is defined as the number of items categorized correctly divided by the total number of items. It’s simply what fraction of the time the classifier is correct. At the very least, you want a classifier to be accurate. In terms of our confusion matrix, accuracy is (TP+TN)/(TP+FP+TN+FN)=(cM[1,1]+cM[2,2])/sum(cM) or 92% accurate. The error of around 8% is unacceptably high for a spam filter, but good for illustrating different sorts of model evaluation criteria.

 

Categorization accuracy isn’t the same as numeric accuracy

It’s important to not confuse accuracy used in a classification sense with accuracy used in a numeric sense (as in ISO 5725, which defines score-based accuracy as a numeric quantity that can be decomposed into numeric versions of trueness and precision). These are, unfortunately, two different meanings of the word.


 

Before we move on, we’d like to share the confusion matrix of a good spam filter. In the next listing we create the confusion matrix for the Akismet comment spam filter from the Win-Vector blog.

Listing 5.4. Entering data by hand

> t <- as.table(matrix(data=c(288-1,17,1,13882-17),nrow=2,ncol=2))

> rownames(t) <- rownames(cM)

> colnames(t) <- colnames(cM)

> print(t)

         FALSE  TRUE

non-spam   287     1

spam        17 13865

Because the Akismet filter uses link destination clues and determination from other websites (in addition to text features), it achieves a more acceptable accuracy of (t[1,1]+t[2,2])/sum(t), or over 99.87%. More importantly, Akismet seems to have suppressed fewer good comments. Our next section on precision and recall will help quantify this distinction.

 

Accuracy is an inappropriate measure for unbalanced classes

Suppose we have a situation where we have a rare event (say, severe complications during childbirth). If the event we're trying to predict is rare (say, around 1% of the population), the null model (the rare event never happens) is very accurate. The null model is in fact more accurate than a useful (but not perfect) model that identifies 5% of the population as being "at risk" and captures all of the bad events within that 5%. This is not any sort of paradox. It's just that accuracy is not a good measure for events that have an unbalanced distribution or unbalanced costs (different costs of "type 1" and "type 2" errors).

 

Precision and recall

Another evaluation measure used by machine learning researchers is a pair of numbers called precision and recall. These terms come from the field of information retrieval and are defined as follows. Precision is what fraction of the items that the classifier flags as being in the class actually are in the class. So precision is TP/(TP+FP), which is cM[2,2]/(cM[2,2]+cM[1,2]), or about 0.92 (it is only a coincidence that this is so close to the accuracy number we reported earlier). Again, precision is how often a positive indication turns out to be correct. It's important to remember that precision is a function of the combination of the classifier and the dataset. It doesn't make sense to ask how precise a classifier is in isolation; it's only sensible to ask how precise a classifier is for a given dataset.

In our email spam example, 92% precision means 8% of what was flagged as spam was in fact not spam. This is an unacceptable rate for losing possibly important messages. Akismet, on the other hand, had a precision of t[2,2]/(t[2,2]+t[1,2]), or over 99.99%, so in addition to having high accuracy, Akismet has even higher precision (very important in a spam filtering application).

The companion score to precision is recall. Recall is what fraction of the things that are in the class are detected by the classifier, or TP/(TP+FN)=cM[2,2]/(cM[2,2]+cM[2,1]). For our email spam example this is 88%, and for the Akismet example it is 99.87%. In both cases most spam is in fact tagged (we have high recall) and precision is emphasized over recall (which is appropriate for a spam filtering application).

It’s important to remember this: precision is a measure of confirmation (when the classifier indicates positive, how often it is in fact correct), and recall is a measure of utility (how much the classifier finds of what there actually is to find). Precision and recall tend to be relevant to business needs and are good measures to discuss with your project sponsor and client.

F1

The F1 score is a useful combination of precision and recall. If either precision or recall is very small, then F1 is also very small. F1 is defined as 2*precision*recall/(precision+recall). So our email spam example with 0.92 precision and 0.88 recall has an F1 score of 0.90. The idea is that a classifier that improves precision or recall by sacrificing a lot of the complementary measure will have a lower F1.
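
All of these scores can be read off the confusion matrix cM from listing 5.3:

# Accuracy, precision, recall and F1 from the confusion matrix cM of listing 5.3
# (cM[1,1]=TN, cM[1,2]=FP, cM[2,1]=FN, cM[2,2]=TP).
accuracy  <- (cM[1,1] + cM[2,2]) / sum(cM)                    # about 0.92
precision <- cM[2,2] / (cM[2,2] + cM[1,2])                    # about 0.92
recall    <- cM[2,2] / (cM[2,2] + cM[2,1])                    # about 0.88
f1        <- 2 * precision * recall / (precision + recall)    # about 0.90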

Sensitivity and specificity

Scientists and doctors tend to use a pair of measures called sensitivity and specificity.

Sensitivity is also called the true positive rate and is exactly equal to recall. Specificity is also called the true negative rate and is equal to TN/(TN+FP)=cM[1,1]/(cM[1,1]+cM[1,2]), or about 95%. Both sensitivity and specificity are measures of effect: what fraction of class members are identified as positive and what fraction of non-class members are identified as negative.

An important property of sensitivity and specificity is this: if you flip your labels (switch from spam being the class you're trying to identify to non-spam being the class you're trying to identify), you just switch sensitivity and specificity. Also, any of the so-called null classifiers (classifiers that always say positive or always say negative) always return a zero score on either sensitivity or specificity. So useless classifiers always score poorly on at least one of these measures. Finally, unlike precision and accuracy, sensitivity and specificity each only involve entries from a single row of table 5.4. So they're independent of the population distribution (which means they're good for some applications and poor for others).
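
Both measures come straight from the same confusion matrix cM:

# Sensitivity and specificity from the confusion matrix cM of listing 5.3.
sensitivity <- cM[2,2] / (cM[2,2] + cM[2,1])   # true positive rate, equal to recall
specificity <- cM[1,1] / (cM[1,1] + cM[1,2])   # true negative rate, about 0.95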

Common classification performance measures

Table 5.5 summarizes the behavior of both the email spam example and the Akismet example under the common measures we’ve discussed.

Table 5.5. Example classifier performance measures

Measure       Formula                  Email spam example    Akismet spam example

Accuracy      (TP+TN)/(TP+FP+TN+FN)    0.9214                0.9987

Precision     TP/(TP+FP)               0.9187                0.9999

Recall        TP/(TP+FN)               0.8778                0.9988

Sensitivity   TP/(TP+FN)               0.8778                0.9988

Specificity   TN/(TN+FP)               0.9496                0.9965

All of these formulas can seem confusing, and the best way to think about them is to shade in various cells in table 5.4. If your denominator cells shade in a column, then you're measuring a confirmation of some sort (how often the classifier's decision is correct). If your denominator cells shade in a row, then you're measuring effectiveness (how much of a given class is detected by the classifier). The main idea is to use these standard scores and then work with your client and sponsor to see which of them best models their business needs. For each score, you should ask them if they need that score to be high, and then run a quick thought experiment with them to confirm you've understood their business need. You should then be able to write a project goal in terms of a minimum bound on a pair of these measures.