wine case report

Upload: neeladevin

Post on 11-Oct-2015


Business Intelligence and Analysis Project Report


Wine production Report

Table of Contents

Abstract
Introduction
Data Processing
Data analysis
    Finding Correlations
    Classification Tree
    Linear Regression
Summary of analysis
    Findings from Classification Tree
    Findings from Regression
Insights
Appendix
    Plotting Commands
    Commands for Decision Tree
        White Wine Tree Creation
        Red Wine Tree Creation
    Commands for Linear Regression

Data analysis on wine production

Abstract

We, the Datum team, selected wine quality as the subject of this study. To gain better business insight, we applied several complementary data mining techniques to the data. We study two types of wine (red and white), which have different physicochemical characteristics. There are wineries across the globe, and our data pertains to a particular winery in Portugal. Our dataset contains only the basic ingredients of wine preparation, along with the quality as rated by wine experts. Our study examines combinations of ingredients and their influence on quality. We first classified quality with a decision tree, which identifies the classifiers that influence wine quality. We then applied linear regression, which not only helps in predicting quality but also helps in identifying the significant ingredients. The models we created could be used independently for quality prediction, or as a support to wine tasting evaluations by experts, and could help in improving wine production.

Introduction

Wine is a beverage made from fermented grape or other fruit juice, with a relatively low alcohol content. The quality of a wine is graded on its taste and vintage. Wine tasting is as old as wine itself. When it comes to quality, many factors or attributes other than flavor come into consideration. The dataset we chose to analyze, Wine Quality, records the quality of wines (white and red) together with their physicochemical attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol). The quality score for each wine in our dataset ranges from 0 (lowest) to 10 (highest). This report uncovers some important relationships between the chemical contents of wine, such as acidity and sugar levels, and its quality. The dataset covers a wide range of distinct chemical and acidic combinations for the two types of wine (white and red). With suitable data analysis techniques, we can unearth a handful of interesting insights that would help in producing better quality wines and would also benefit the finances and business of the production company.

All the attributes in the dataset are numerical except quality, which is ordinal, and wine type, which is nominal. After a thorough study of the dataset, its attributes and their value ranges, we zeroed in on three insights, which are supported by analysis techniques such as decision trees and linear regression. The business strategies and added value depend on how well the developed models perform and on the findings they support. A production run can be considered successful if the final product is made at minimum cost and maximum quality.

One important insight is how much each chemical component contributes to the quality of the wine, and how we can grade the quality of a newly produced wine. We decided to create a classification decision tree that arrives at a quality class label, using the attributes with the highest information gain as nodes. Through the decision tree, we can identify the distinguishing factors that affect the quality level of a newly created wine and thereby set a reasonable market price for it.

Identifying which of the thirteen attributes have a statistically significant influence on wine quality is an essential finding. Using linear regression analysis, we build a model that highlights the significant attributes. The results of this regression are helpful in production and in quality prediction, because they show the impact of the significant attributes on quality.

The main focus of this report is to illustrate how these attributes are analyzed using data mining techniques in R. It also explains in detail our team's findings from this analysis. All findings are presented with clear explanations in layman's terms, together with a full technical explanation of the data mining approaches and algorithms behind them.

Data Processing

The dataset we analyze describes physicochemical characteristics of red and white wines from the Vinho Verde region of Portugal. Due to privacy and logistic issues, only the physicochemical (input) and sensory (output) variables are available. There is no data about grape types, wine brand, selling price, etc. The dataset is available through the UCI Machine Learning Repository.

As the initial preprocessing step, we checked the dataset for missing values and found none. We then looked for duplicated rows and removed them from the dataset. We studied the range and units of each attribute and confirmed that the measurements are correct and fall within the natural range for wine ingredients. The dataset contains a large number of outliers, some of them extreme. However, where an observation has an extreme outlier in one attribute, its other attributes are still in range, so we decided against discarding such observations. Extreme outliers can also help reveal interesting facts.

Data analysis

The analysis of this data can be viewed primarily as a classification problem. The classes are ordered and not balanced. The dataset contains data for red and white wines, which share a common data structure. It contains 11 explanatory variables, which are physicochemical parameters of a wine, and 1 response variable, quality. The quality rating ranges from 0 to 10. Each data case is one wine, and its quality rating is the median of the ratings given by multiple judges.
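The preprocessing and structure checks described above (missing values, duplicates, outliers, class balance) could be sketched in R roughly as follows. This is a minimal sketch, not the team's actual script: the data frame is simulated, only a few column names are borrowed from the report, and the 1.5 * IQR outlier rule is an assumption, since the report does not say how outliers were identified.

```r
# Hypothetical stand-in for the wine data: column names follow the report,
# values are simulated and are NOT the UCI measurements.
set.seed(1)
n <- 200
winedata <- data.frame(
  Wine.Type        = sample(c("red", "white"), n, replace = TRUE),
  volatile.acidity = runif(n, 0.1, 1.2),
  alcohol          = runif(n, 8, 14),
  quality          = sample(3:8, n, replace = TRUE,
                            prob = c(0.02, 0.08, 0.40, 0.35, 0.12, 0.03))
)
winedata <- rbind(winedata, winedata[1:5, ])   # plant five duplicate rows

# 1. Missing values: count NAs per column
na_counts <- colSums(is.na(winedata))

# 2. Duplicates: count, then drop rows that repeat an earlier row exactly
n_dup <- sum(duplicated(winedata))
winedata <- winedata[!duplicated(winedata), ]

# 3. Outliers: flag (do not drop) values beyond 1.5 * IQR from the quartiles
#    (assumed rule; the report does not name one)
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}
outlier_flags <- sapply(winedata[c("volatile.acidity", "alcohol")], flag_outliers)

# 4. Class balance of the ordinal response
quality_counts <- table(winedata$quality)

print(na_counts)
print(n_dup)
print(colSums(outlier_flags))
print(quality_counts)
```

Flagging outliers rather than deleting them mirrors the decision in the report to keep observations whose other attributes are still in range.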

Finding Correlations

The goal of our analysis is to predict quality based on the 11 physicochemical characteristics. Before selecting the appropriate modeling technique and the most significant attributes to be used for modeling, we studied the relationship between each of these individual attributes and quality.

The results discussed below are based on our observations on the entire data set of white and red wines. [We did the same analysis individually on white and red wine data as well. For the significant attributes discussed below, we observed similar relationships in both types. ]
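One way to summarize the relationship between each attribute and quality numerically is a rank correlation per attribute. The sketch below is illustrative only: the data is simulated so that alcohol helps and volatile acidity hurts, and the choice of Spearman correlation (because quality is ordinal) is an assumption, not something the report states.

```r
# Simulated data, NOT the wine dataset; effect sizes are made up.
set.seed(42)
n <- 500
alcohol          <- runif(n, 8, 14)
volatile.acidity <- runif(n, 0.1, 1.2)

# Simulated quality: rises with alcohol, falls with volatile acidity
score   <- 5.5 + 0.35 * (alcohol - 11) - 1.5 * (volatile.acidity - 0.65) +
  rnorm(n, sd = 0.7)
quality <- pmin(pmax(round(score), 0), 10)   # keep scores on the 0 to 10 scale
winedata <- data.frame(alcohol, volatile.acidity, quality)

# Spearman rank correlation of every predictor with quality
predictors <- setdiff(names(winedata), "quality")
rho <- sapply(winedata[predictors],
              function(x) cor(x, winedata$quality, method = "spearman"))
print(round(sort(rho), 2))
```

A negative value indicates that quality tends to fall as the attribute rises, which is the pattern the report describes for most attributes other than alcohol.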

[Figures 1 to 6: plots of quality against volatile acidity, residual sugar, chlorides, total sulfur dioxide, sulphates and alcohol, respectively]

Figure 1 is a plot of volatile acidity against quality. The plot shows a negative relationship between quality and this explanatory variable. Volatile acidity is a measure of acetic acid in wine and an indicator of the freshness of the wine. Its level increases through oxidation when wine is exposed to air, which produces a vinegar smell; in extreme cases of oxidation, the wine more or less turns into vinegar. Hence the negative relationship seems logical.

Figure 2 illustrates the relationship between quality and the residual sugar in the wine. This is again a negative relationship, but the trend is weak and does not apply to lower quality wines. Residual sugar typically refers to the sugar remaining in the wine after fermentation is stopped; in some cases sugar may also be added to the wine. The negative relationship is in line with the general perception that drier wines are of higher quality.

Figure 3 illustrates the negative correlation between quality and the chlorides present in the wine. Chlorides make wine taste salty. In the plot the negative trend is apparent at the higher quality levels. Acceptance of saltiness varies by region.

Figure 4 illustrates the relationship between quality and total sulfur dioxide. We observe a negative relationship between this attribute and quality. Sulfur dioxide is a preservative added to wines to protect them against oxidation. The high quality wines have low levels of total SO2, which suggests that their manufacturers achieved the right balance between preservation and maintaining aroma. Wines graded average show a range of SO2 values, implying that manufacturers use varying techniques to strike this balance.

Figure 5 illustrates the relationship between quality and sulphates. Sulphates and sulfur dioxide both act as preservatives for the wine. As with SO2, we observed a negative relationship to quality.

Figure 6 illustrates the relationship between quality and alcohol. This is the only prominent positive relationship that we observed. The poorest quality wines have the lowest levels of alcohol. In the higher quality wines the observed alcohol levels are also high. A higher level of alcohol is a desirable characteristic for high quality wines. In our plots this tendency is clearly visible, especially in the case of red wines.

Classification Tree

Classifiers are the distinguishing factors that help in quality analysis. A decision tree is a statistical model that identifies the classifiers with the highest information gain and the quality response associated with them.

From the plots and observations on the data, it is clear that each chemical component and acidic compound has either a negative or a positive impact on the quality of the wine. Having studied the impact of each of these attributes on the quality grades, we decided to create a model that can grade the quality of a new wine product from its chemical components and acidic compounds.

We create a model that focuses only on the attributes that carry the most information about, and have the most impact on, the quality grade given by experts. We decided to build a classification decision tree whose nodes are only the attributes with the highest information gain, that is, those whose splits reduce entropy the most. Classification can be described as a form of data analysis in which we create data models that describe important data classes. Classification has two steps: a learning step, in which we create a classification model, and a classification step, in which we use the model to predict class labels. A decision tree can be understood as a flowchart-like tree structure in which each non-leaf node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node denotes a class label.
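To make the terms entropy and information gain concrete, the following base R sketch computes both for a single candidate split. The function names `entropy` and `info_gain` and the toy values are illustrative assumptions and are not taken from the report or from any package.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by the test `x < threshold`:
# parent entropy minus the size-weighted entropy of the two children
info_gain <- function(labels, x, threshold) {
  left <- x < threshold
  w <- mean(left)
  entropy(labels) - (w * entropy(labels[left]) + (1 - w) * entropy(labels[!left]))
}

# Toy data (hypothetical values, not from the wine dataset)
alcohol <- c(9.0, 9.4, 9.8, 10.2, 11.5, 12.0, 12.4, 13.0)
quality <- c(5,   5,   5,   5,    7,    7,    7,    7)

print(entropy(quality))                   # two equally frequent classes: 1 bit
print(info_gain(quality, alcohol, 11))    # split separates the classes perfectly
print(info_gain(quality, alcohol, 9.6))   # split leaves one side mixed, so gain is lower
```

A tree-building algorithm evaluates many such candidate splits at each node and keeps the one with the highest gain, which is why only a few attributes end up as nodes.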

First, we created a tree in R for the red wine data, with the ordinal variable quality as the class label. We provided all the attributes except Wine.Type as candidates for tree nodes.

By studying the tree and its summary, we identified that only three attributes, alcohol, sulphates and volatile.acidity, have higher information gain.

Second, we created a separate decision tree for the white wine data, again with the response variable quality as the class label and all attributes except Wine.Type as candidate tree nodes. The resulting tree is as shown below.

From the summary of the tree, we identified that only three attributes, alcohol, sulphates and volatile.acidity, have higher information gain.
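The tree-building step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not a reconstruction of the team's code: the commands that created the trees did not survive in the appendix, so this sketch uses the rpart package as a stand-in, with entropy-based splitting, on simulated data in which only alcohol, sulphates and volatile.acidity affect quality.

```r
library(rpart)  # assumption: rpart is installed; the report's own call is not shown

# Simulated red wine data, NOT the UCI measurements
set.seed(7)
n <- 600
red <- data.frame(
  alcohol          = runif(n, 8, 14),
  sulphates        = runif(n, 0.3, 1.2),
  volatile.acidity = runif(n, 0.1, 1.2),
  pH               = runif(n, 2.9, 3.8)    # deliberately unrelated to quality below
)
score <- 5.5 + 0.5 * (red$alcohol - 11) + 1.0 * (red$sulphates - 0.75) -
  1.5 * (red$volatile.acidity - 0.65) + rnorm(n, sd = 0.5)
red$quality <- factor(pmin(pmax(round(score), 3), 8))   # class label

# Classification tree with information (entropy) based splitting
redTree <- rpart(quality ~ ., data = red, method = "class",
                 parms = list(split = "information"))

print(redTree$variable.importance)        # which attributes drive the splits
pred <- predict(redTree, red, type = "class")
accuracy <- mean(pred == red$quality)     # training accuracy only, so optimistic
print(accuracy)
```

Comparing `levels(pred)[table(pred) > 0]` with `levels(red$quality)` shows which quality grades the tree ever predicts, which is one way to check for grades the model cannot reach.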

Linear Regression

Multiple linear regression is a technique used to analyze the influence of several independent variables on a dependent variable. The linear model provides the following information for the analysis:

- Variation in the dependent variable corresponding to variation in the independent variables
- Prediction of the value of the dependent variable under given conditions of the independent variables
- Estimation of the influence of each independent variable on the dependent variable, holding the other independent variables constant
- The reliability of the model, based on the degree to which the independent variables explain the dependent variable
- Standard errors between the predictions and the observations
- Significance of the model at the given confidence levels
- Individual significance of the independent variables

Together, these are the basic parameters for judging whether the data provided is sufficient to estimate the dependent variable.

To accommodate the differences in the range of particular attributes between the two wines (e.g. alcohol levels), we chose to perform regression separately on each type by splitting the dataset. In addition to examining these statistics, we create the model and validate it against observed data. Each dataset is split randomly into a train and a test dataset. The train dataset is used to fit the model, while the test dataset is used to validate it. Validating on held-out data in this way guards against over-optimistic error estimates, although it cannot correct errors in the content of the dataset itself.

Summary of analysis

Findings from Classification Tree

From the summary, we find that only 43% of the red wine tuples were classified by the decision tree created from our red wine dataset. Only quality grades from 4 to 7 are classified in our red wine decision tree.

From the residual mean deviance, we find that only 58% of the white wine tuples were classified by the decision tree created from our white wine dataset. Only quality grades from 4 to 7 are classified in this decision tree. The drawback is that the models we created from the dataset only allow us to classify wines with a quality grade from 4 to 7. Wines with extreme quality grades (0 to 3 and 8 to 10) cannot be classified with these models.

Findings from Regression

Results of running the linear model for red wine, predicting quality from all the independent variables

From the results of the linear model, we draw the following findings.

- Estimate gives the estimated influence of each independent variable on quality; the sign indicates whether the influence is positive or negative. For instance, there is a one-unit decrease in quality for every 21.49% increase in density.
- Standard error measures the uncertainty of each coefficient estimate; a smaller value means a more precise estimate.
- The t value of each coefficient tests whether the coefficient is different from zero. For example, the t value of fixed acidity is Estimate / Standard Error = 0.017914 / 0.032041 = 0.559.
- Pr(>|t|) is the p value of the t test; a coefficient is significant at the 5% level when this value is less than 0.05. For instance, the p value of fixed acidity is 0.5762, so this component is not significant for the analysis.
- Residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom.
- Adjusted R-squared is the proportion of the variance in quality explained by the given set of independent variables, adjusted for the number of variables. In this model, 36.05% of the variation in quality is explained by the independent variables.
- The F statistic and its p value indicate a statistically significant relationship between the independent and dependent variables, provided the p value is below the chosen significance level.
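The link between estimate, standard error, t value and p value can be checked by hand. The sketch below divides the quoted estimate by the quoted standard error and converts the result to a two-sided p value. The residual degrees of freedom are not given in the report, so the value used here is a hypothetical placeholder; with samples of this size the p value barely depends on it.

```r
estimate  <- 0.017914   # coefficient of fixed acidity, as quoted in the report
std_error <- 0.032041   # its standard error, as quoted in the report

# t value = estimate / standard error
t_value <- estimate / std_error

# ASSUMPTION: the report does not state the residual degrees of freedom;
# 1000 is a hypothetical placeholder.
df_resid <- 1000

# Two-sided p value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(t_value, 3))
print(round(p_value, 3))
print(p_value < 0.05)    # FALSE means not significant at the 5% level
```

The resulting p value is close to the 0.5762 quoted for fixed acidity, which supports reading that coefficient as not significant.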

Prediction of the model:

The fitted model is evaluated on the test data. On evaluation, the observed quality values of the test data fall within the range of the predicted quality values.
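The split, fit and evaluate workflow described in this report can be sketched as follows. The variable names `red_ind`, `red_train_ds`, `red_test_ds`, `red_wine_lm` and `red_test_pred` are the ones used in the appendix, but what is assigned to them here is an assumption, as are the 70/30 split ratio, the use of RMSE as the error measure and the simulated data.

```r
# Simulated red wine data, NOT the UCI measurements
set.seed(123)
n <- 400
red <- data.frame(
  alcohol          = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 1.2),
  sulphates        = runif(n, 0.3, 1.2)
)
red$quality <- 5.5 + 0.3 * (red$alcohol - 11) -
  1.2 * (red$volatile.acidity - 0.65) +
  0.8 * (red$sulphates - 0.75) + rnorm(n, sd = 0.6)

# Random 70/30 split (assumed ratio; the report does not state one)
red_ind      <- sample(seq_len(n), size = floor(0.7 * n))
red_train_ds <- red[red_ind, ]
red_test_ds  <- red[-red_ind, ]

# Fit the multiple linear regression on the training rows only
red_wine_lm <- lm(quality ~ ., data = red_train_ds)
print(summary(red_wine_lm)$coefficients)   # Estimate, Std. Error, t value, Pr(>|t|)
adj_r2 <- summary(red_wine_lm)$adj.r.squared

# Predict on the held-out rows and measure the error
red_test_pred <- predict(red_wine_lm, newdata = red_test_ds)
rmse <- sqrt(mean((red_test_ds$quality - red_test_pred)^2))
print(c(adj_r2 = adj_r2, rmse = rmse))
print(range(red_test_pred))                # compare with range(red_test_ds$quality)
```

Reporting an error measure such as RMSE on the test rows gives a more direct picture of predictive accuracy than comparing the two ranges alone.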

Insights

The final verdict on wine quality rests on sensory perception. With a good prediction model, manufacturers can simulate sensory testing from the levels of the attributes; such a decision support system can complement wine tasting evaluations by experts and help improve wine production. The model we created could be used by manufacturers to build models for target markets, capturing consumer tastes typical of niche markets. The freshness and low alcohol content of Vinho Verde are two important aspects for a demanding market such as Canada, whereas freshness and affordable prices are more desirable in the U.S. The results of the classification give us the most relevant classifiers in the data. This information can be used for better control of the wine production phase. Given the harvest conditions, when we know the levels of certain attributes such as acidity or sugars in the produce, we can adjust the manufacturing process to compensate and still achieve the desired quality levels for the products.

BBC World News article: The world is facing a wine shortage, with global consumer demand already significantly outstripping supply (http://www.bbc.com/news/world-24746539).

Due to the ongoing vine pull and poor weather, production in the Old World wine countries (i.e. Europe) has been plummeting. This shortfall could be met by new producers in the New World wine countries, who could introduce product lines in the average quality range (4 to 7) at reasonable cost. Such lines would attract the large market segment of price-constrained and commodity buyers. A classification tree model derived from the data analysis therefore helps identify the levels of alcohol, volatile acidity and free sulfur dioxide that should be maintained to meet the quality of a particular product line.

Appendix

Plotting Commands

> plot(winedata$quality, winedata$Wine.Type)
> plot(winedata$quality, winedata$fixed.acidity)
> plot(winedata$fixed.acidity, winedata$quality)
> plot(winedata$volatile.acidity, winedata$quality)
> plot(winedata$citric.acid, winedata$quality)
> plot(winedata$residual.sugar, winedata$quality)
> plot(winedata$chlorides, winedata$quality)
> plot(winedata$free.sulfur.dioxide, winedata$quality)
> plot(winedata$total.sulfur.dioxide, winedata$quality)
> plot(winedata$density, winedata$quality)
> plot(winedata$pH, winedata$quality)
> plot(winedata$sulphates, winedata$quality)
> plot(winedata$alcohol, winedata$quality)
> pairs(white)
> pairs(red)
> cor(winedata[,2:13])

Commands for Decision Tree

White Wine Tree Creation:

> whiteTree <- [...]
> plot(whiteTree)
> text(whiteTree)
> summary(whiteTree)

Red Wine Tree Creation:

> redTree <- [...]
> plot(redTree)
> text(redTree)
> summary(redTree)

Commands for Linear Regression

> # Split the dataset randomly into train and test datasets
> red_ind <- [...]
> white_ind <- [...]
> # Split the dataset based on the random index
> red_train_ds <- [...]
> white_train_ds <- [...]
> # Collect the test dataset
> red_test_ds <- [...]
> white_test_ds <- [...]
> # Creation of multiple linear regression model
> red_wine_lm <- [...]
> white_wine_lm <- [...]
> summary(red_wine_lm)
> summary(white_wine_lm)
> # Predict the values for test data
> red_test_pred <- [...]
> white_test_pred <- [...]