ENTERPRISE BUSINESS INTELLIGENCE (BUS5EBI)

TOPICS COVERED: ORANGE DATA MINING, LINEAR REGRESSION, ASSOCIATION RULE MINING, CLUSTERING

SUBMITTED BY:
SUSNATA CHAKMA - 19577529
MOHAMMAD RAMJANUL KABIR - 19592907
REJO THOMAS KUTTY - 19243202
JENSON CHANDY - 18810030

SUBMITTED TO: Mr Naren Shanbhag

Table of Contents

Task 1: Decision Trees (Using Orange Data Mining)
  Identifying whether there are Anomalies (Unusual data/Missing Values)
  Two possible strategies to deal with Missing Values
    1. Imputing the data
    2. Deleting/removing the data
  Creating the dataset with appropriate steps
    Step 1
    Step 2
    Step 3
    Step 4
    Step 5
  Factors that matter when retaining Employees
  Conditions that will lead to employee attrition
  Observations

Task 2: Linear Regression (Using R Studio)
  Two possible strategies to handle Missing values
  Linear Regression
  Regression model

Task 3: Association Rule mining (Using R Studio)
  Is there any missing values in the dataset?
  Two possible strategies to handle dataset
  Association Rule Mining
  Top 3 Lift Value
  Product Recommendation to sales team
  Two Products to sell in a bundle
  Observations and recommendations

Task 4: Clustering (Using R Studio)
  Dataset description
    It consists of
  Importing dataset into R Studio
    View Dataset
  Creating a duplicate of the original dataset
  Counting Missing Values
  Eliminating the Missing Values
  Removing the unwanted columns
  Determining Number of clusters
  K mean Clustering
    Plot before cleaning/processing data
    Plot after cleaning/processing data
    Price tiers can be identified from the cluster
  Business Recommendations
    Class-1 mobile
    Class-2 mobile
    Class-3 mobile

Task 1: Decision Trees (Using Orange Data Mining)

Identifying whether there are Anomalies (Unusual data/Missing Values)

The employee_attrition_team09 data contains 1044 instances (rows) and 34 features. As the screenshot shows, the current data has no missing values. Whether a dataset contains missing values depends on the data supplied, so this has to be checked each time.

Two possible strategies to deal with Missing Values

There are two common strategies for dealing with missing values in the data:

1. Imputing the data - A missing value is replaced with an estimated one, for example the mean or the most frequent value of that feature.

2. Deleting/removing the data - The rows that contain missing values are deleted from the dataset.

Creating the dataset with appropriate steps

The steps to create a decision tree are as follows:

Step 1:

First, we load the data from the Excel sheet with the ‘File’ widget. The dataset is named ‘employee_attrition_team09’ and consists of 1044 instances. Its features include Attrition, Age, Business Travel, Daily Rate, Department, etc. After loading the data we have to set a target variable. In this case the target is ‘Attrition’.

Step 2:

Next, we add a ‘Data Table’ widget to verify that we have selected the correct data.

Step 3:

In Step 3 we add the ‘Data Sampler’ widget. We sample a fixed proportion of 70% of the data for training and use the remaining 30% for testing.
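The split in this report is done with Orange's Data Sampler widget, not with code. As a rough equivalent in R (the language used in Tasks 2 to 4), a fixed-proportion 70/30 split could be sketched as below. The data frame `employees`, its two columns and the seed are made-up stand-ins for illustration, not the actual attrition data.

```r
# Synthetic stand-in for the attrition data: 1044 rows, two illustrative columns.
set.seed(42)  # arbitrary seed, only for reproducibility
employees <- data.frame(
  Attrition = sample(c("Yes", "No"), 1044, replace = TRUE),
  Age       = sample(18:60, 1044, replace = TRUE)
)

# Draw 70% of the row indices at random for training; the rest form the test set.
n_train   <- floor(0.7 * nrow(employees))
train_idx <- sample(seq_len(nrow(employees)), size = n_train)
train_set <- employees[train_idx, ]
test_set  <- employees[-train_idx, ]

nrow(train_set)  # 730
nrow(test_set)   # 314
```

With 1044 rows the 70% share is rounded down to 730 training rows, leaving 314 for testing.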

Step 4:

In Step 4 we add a ‘Tree’ widget. The Tree widget is used to set the parameters of the tree that is shown in the tree viewer.

Step 5:

The total percentage of attrition is 16.3%. At Level 1 the tree splits on overtime: the figure for employees who work overtime is 31.4%, and for those who do not it is 10.8%. For employees earning more than $3743 the figure is 19.4%, and for those earning less than $3743 it is 55.4%. For sales representatives whose monthly earnings are below $10048 it is 26%. For sales representatives with a Job Satisfaction score below 3 it is 72.2%, and for those who are satisfied it is 11.1%. For human resources, technical degree and life sciences staff who have not been promoted in the last 13 months the figure is 93.2%, and for those who have worked more than 13 months it is around 55.6%. For managers younger than 30 years it is 70.6%, and for those older than 30 years it is 94.9%.

Factors that matter when retaining Employees

Several factors matter for retaining employees: Job Satisfaction, Monthly Salary, Business Travel, and a Daily Rate above $1102. Employees for whom these conditions are favourable tend to stay.

Conditions that will lead to employee attrition

The conditions that lead to employee attrition are Job Satisfaction, age above 30 years, Stock Option Level, Education, and monthly earnings below $4459. If these conditions are not addressed soon, the company can expect more employee attrition.

Observations

In our final observations, the model we created uses 70% of the data for training and the remaining 30% for testing. Overall, the model suggests that employees are more likely to stay when they receive a good salary and promotions, with age group also playing a role.

Task 2: Linear Regression (Using R Studio)

The dataset for Task 2 consists of 355 observations with 12 variables. The variables are:

The given number in the first column.

The ID in the second column.

Next is the name of the company.

Then the industry to which the company belongs.

The 5th column shows the inception/establishment of the company.

The 6th column shows the number of employees in each company.

Next is the state where each company is situated.

The 8th column shows the revenue of each company.

The 9th column shows the company expenses.

The 11th column shows the profit generated by each company.

The last column shows the growth of each company.

After importing the dataset into RStudio, we checked for missing values with !complete.cases(), which selects the rows containing “NA”, and used nrow() to count those rows. Only 7 rows had missing values.
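The check described above can be sketched as follows. The data frame `companies` and its values are invented for illustration; only `complete.cases()` and `nrow()` come from the report.

```r
# Toy data frame standing in for the imported dataset (values are made up).
companies <- data.frame(
  Name     = c("A", "B", "C", "D"),
  Revenue  = c(100, NA, 250, 300),
  Expenses = c(60, 80, NA, 120),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE for rows without any NA; negating it flags incomplete rows.
incomplete <- !complete.cases(companies)

companies[incomplete, ]                        # rows with at least one NA
n_incomplete <- nrow(companies[incomplete, ])  # same value as sum(incomplete)
n_incomplete                                   # 2
```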

Two possible strategies to handle Missing values

Two strategies can be used to handle missing values. The first is imputing, where a missing value is replaced with an estimate such as the average of the observed values. The second is omitting from the dataset every row that contains a missing value.

The dataset has to be understood before either strategy is applied. In our case the number of missing values is small relative to the number of observations, so omitting them will not distort the dataset and should still give approximately the same results later in the analysis. Conversely, if the number of missing values is large relative to the number of observations, for instance 350 missing values in a dataset of 1000 observations, imputing is necessary to keep enough data to work with.

Therefore, we used the omitting strategy for our dataset.

After omitting the missing values, we created another dataset with no missing values and imported it into R under the name “future_500_without_missing”.
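As a minimal sketch of the two strategies on made-up data (the data frame `toy` and its columns are hypothetical, and mean imputation is only one of several possible imputation rules):

```r
toy <- data.frame(Employees = c(10, 25, NA, 40),
                  Revenue   = c(100, NA, 250, 300))

# Strategy 1: omit every row that contains a missing value.
omitted <- na.omit(toy)

# Strategy 2: replace each NA with the mean of the observed values in its column.
imputed <- toy
for (col in names(imputed)) {
  col_mean <- mean(imputed[[col]], na.rm = TRUE)
  imputed[[col]][is.na(imputed[[col]])] <- col_mean
}

nrow(omitted)  # 2 rows survive omission
imputed        # all 4 rows kept, NAs filled with column means
```

Omission keeps only fully observed rows, while imputation keeps every row at the cost of inserting estimated values.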

Linear Regression

First, we imported the new dataset without missing values. From its 12 variables we chose Profit as the dependent variable and Revenue and Expenses as the independent variables for our regression model. Profit was chosen because it determines which company stakeholders would invest in, and Revenue and Expenses were chosen because both have a direct impact on Profit.

In general accounting terms, Profit = Revenue - Expenses. Growth could also have been chosen as the dependent variable, since all the other variables except the serial number and ID (industry, name, employees and so on) have a direct impact on it. In our case, however, we focused only on the profitability of the organization, so that a stakeholder can choose which company to invest in.

In Block 9, as the screenshot shows, we used the command “future_500_without_missing$Revenue <- as.numeric(gsub('\\$|,', '', future_500_without_missing$Revenue))” to remove the dollar sign “$” and the commas from Revenue. Similarly, in Block 10 we used the command “future_500_without_missing$Expenses <- as.numeric(gsub('\\Dollars|,', '', future_500_without_missing$Expenses))” to remove the word “Dollars” and the commas from Expenses. Both conversions are needed because the columns must be numeric before they can be used as numeric variables in the regression model.
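The cleaning step can be illustrated on a few made-up strings; the vectors below are hypothetical and only mimic the formats described above. Note that in R's regular expressions `\\D` denotes any non-digit character, so the Block 10 pattern matches “Dollars” only because “D” is itself a non-digit. The sketch uses the literal text instead.

```r
revenue_raw  <- c("$9,684,527", "$14,016,543")             # made-up values
expenses_raw <- c("1,130,700 Dollars", "804,035 Dollars")  # made-up values

# Remove the dollar sign and the thousands separators, then convert to numeric.
revenue <- as.numeric(gsub("\\$|,", "", revenue_raw))

# Remove the literal text " Dollars" (with its leading space) and the commas.
expenses <- as.numeric(gsub(" Dollars|,", "", expenses_raw))

revenue   # 9684527 14016543
expenses  # 1130700 804035
```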

After these commands we used “Profit.lm <- lm(Profit ~ Revenue + Expenses, data = future_500_without_missing)” followed by “Profit.lm” to create and display a regression model relating Profit to Revenue and Expenses. We then created a dummy dataset named “new” to show how changing the values of Revenue and Expenses produces a new predicted Profit, which can be seen in the RStudio console.

For the dataset “new” we chose Expenses of 2000 and Revenue of 1000 and used the function “predict()” to generate the Profit.

Finally, we ran the function “summary” to see the coefficients of the regression model, and for validation we used the command “summary(Profit.lm)$r.squared” to see the R-squared value of the model.
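The modelling steps can be sketched end to end on synthetic data. The data frame `toy` is generated here under the assumption that Profit equals Revenue minus Expenses exactly; it is not the Future 500 data, and the object name `profit_lm` is illustrative.

```r
set.seed(1)  # arbitrary seed
toy <- data.frame(Revenue  = runif(50, 1e6, 2e7),
                  Expenses = runif(50, 5e5, 1e7))
toy$Profit <- toy$Revenue - toy$Expenses  # assumed exact accounting identity

# Fit Profit on Revenue and Expenses.
profit_lm <- lm(Profit ~ Revenue + Expenses, data = toy)
coef(profit_lm)  # intercept near 0, Revenue near 1, Expenses near -1

# Predict Profit for a new company with Revenue 1000 and Expenses 2000.
new <- data.frame(Revenue = 1000, Expenses = 2000)
predicted <- predict(profit_lm, newdata = new)
predicted  # approximately -1000

# summary() warns about an essentially perfect fit on such data, hence the wrapper.
r2 <- suppressWarnings(summary(profit_lm)$r.squared)
r2
```

Because the response is an exact linear combination of the predictors, the fit is perfect up to floating-point error, which is why R-squared comes out as 1 in this setting.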

Regression model:

Profit = Revenue - Expenses

From the summary we can conclude that the model is statistically significant, as the p-value is below 5 percent. The R-squared value is 1, which means the model fits the data perfectly. This is consistent with the accounting identity above, in which Profit is exactly Revenue minus Expenses.

From our dataset, the variable “Growth” would be the best candidate for a dependent variable, as it is influenced by all the other variables. We chose Profit instead, which depends only on Revenue and Expenses. We did not use Growth in our regression model because the variables that influence it, such as the industry the company belongs to, its inception and the city where it is located, are not numeric, and we restricted our model to numeric variables.

As the screenshot shows, in Block 13 we took the dataset without missing values and created another dummy dataset, “new1”, containing only two variables: Profit and Expenses.

We did this to make a scatter plot of the two variables using the function “corrplot”.

From the scatter plot we can see a negative correlation between Expenses and Profit: as one variable increases, the other decreases. Here an increase in Expenses corresponds to a decrease in Profit, in line with the regression model.

In Block 14 we used the same function as in Block 13, but this time on a dataset containing only Profit and Revenue.

We did this to make a scatter plot of the variables Profit and Revenue using the function “corrplot”.

From the scatter plot we can see a positive correlation between Revenue and Profit: as one variable increases, the other increases too. Here an increase in Revenue corresponds to an increase in Profit, in line with the regression model.

Both scatter plots show some extreme values, but the data are otherwise well fitted.

Finally, we used the command “corrplot” once more to plot a diagram with numbers representing the positive and negative correlations between the three variables: Profit, Revenue and Expenses.

Here, a positive number represents a positive correlation and a negative number represents a negative correlation.

The numbers lie between -1 and +1, and the intensity of the colour shows how strong each relationship is.

From our diagram, there is a very strong positive correlation of 0.83 between Revenue and Profit, and a moderately strong negative correlation between Expenses and Profit.
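The numbers in such a diagram come from a correlation matrix. A minimal base-R sketch on synthetic data is shown below; the data frame `toy` is made up, so its correlations will not match the 0.83 reported above. A matrix like `corr_matrix` is what `corrplot::corrplot()` takes as input.

```r
set.seed(7)  # arbitrary seed
toy <- data.frame(Revenue  = runif(100, 1e6, 2e7),
                  Expenses = runif(100, 5e5, 1e7))
toy$Profit <- toy$Revenue - toy$Expenses  # assumed identity, for illustration

# Pairwise Pearson correlations between the three variables.
corr_matrix <- cor(toy[, c("Profit", "Revenue", "Expenses")])
round(corr_matrix, 2)

# Sign check: Profit rises with Revenue and falls with Expenses.
corr_matrix["Profit", "Revenue"]  > 0
corr_matrix["Profit", "Expenses"] < 0
```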

When predicting an increase in Profit, the extreme values present in both Revenue and Expenses have to be taken into consideration. They may be inflated by conditions in the industry, so the model may predict an increase in profit that differs from what happens in reality.

From the command “summary” we can also see the standard errors of the model, which we neglected, as it is impossible to have zero error.

Based on the results, other value-added observations can be made. For example, the mean profit per industry can be calculated to compare a company's profit with its industry average, and one can also identify which companies to invest in based on maximum profit and which to avoid based on minimum profit.

Task 3: Association Rule mining (Using R Studio)

Are there any missing values in the dataset?

As the screenshot shows, the minimum transaction length in the dataset is ‘1’, which means that every transaction contains at least one item. So, we can say that there are no missing values in the dataset.

Two possible strategies to handle dataset:

1. If a large dataset has only a small number of missing values, we can delete the rows with missing values. Because the missing values are few and the dataset is huge, this does not affect the analysis much. For example, if we have a dataset of 100000 or more observations and a few hundred missing values, deleting the rows with missing values will not affect the analysis much.

2. Another way of handling missing data is to obtain or collect the data again from the source of the dataset. If the number of missing values is large, it is not practical to delete the rows, because we would be left with a very small dataset for analysis. So, we may obtain an updated dataset with fewer or no missing values for further analysis.

Association Rule Mining:

In the first step we install the required packages ‘arules’ and ‘arulesViz’ and load the libraries.

In the second phase we import the dataset as transaction data using the read.transactions function, and then we inspect it, summarize it and check the itemFrequencyPlot. We use the item frequency plot to find the most frequently bought items in the dataset.

In the third phase we use the ‘apriori’ algorithm for association rule mining with a support of 0.001 and a confidence of 0.8. Then we check the summary and inspect the rules.

After that we run the algorithm again with a confidence of 1 in order to reduce the number of rules.

In the 4th phase we sort the rules by ‘lift’ from highest to lowest.

Finally, in the last step we export the rules as a CSV file.

Top 3 Lift Value

1. As we see in the screenshot, the rule {butter, hard cheese, other vegetables, whole milk} with {whipped/sour cream} has the highest lift, 13.557282. This means that whipped/sour cream appears together with this itemset about 13.6 times more often than would be expected if the two were independent. A confidence of 1 means that, in this dataset, every transaction that contains product ‘A’ also contains product ‘B’.

2. The rule {domestic eggs, frankfurter, tropical fruit, whole milk} with {whipped/sour cream} comes second with the same lift of 13.557282, at a confidence of 1.

3. The rule {frankfurter, frozen meals, tropical fruit, whole milk} with {pip fruit} has the third highest lift, 13.124060, at a confidence of 1.
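To make the three measures concrete, the sketch below computes support, confidence and lift for a single rule by hand in base R. The basket list is a made-up toy example, not the grocery dataset, and the helper `support()` is a hypothetical function written for this illustration; the report itself uses the ‘arules’ package.

```r
# Toy basket data (made up): each element is one transaction.
baskets <- list(
  c("whole milk", "butter", "whipped/sour cream"),
  c("whole milk", "butter", "whipped/sour cream"),
  c("whole milk", "rolls/buns"),
  c("other vegetables", "rolls/buns"),
  c("whole milk", "other vegetables")
)

# Share of transactions that contain every item in `items`.
support <- function(items, transactions) {
  mean(vapply(transactions, function(tr) all(items %in% tr), logical(1)))
}

lhs <- c("whole milk", "butter")
rhs <- "whipped/sour cream"

supp_rule  <- support(c(lhs, rhs), baskets)       # share containing lhs and rhs
confidence <- supp_rule / support(lhs, baskets)   # share of lhs baskets that contain rhs
lift       <- confidence / support(rhs, baskets)  # confidence relative to overall rhs share

c(support = supp_rule, confidence = confidence, lift = lift)
# support 0.4, confidence 1.0, lift 2.5
```

Here the rule holds in every basket that contains the left-hand side (confidence 1), and the right-hand side is 2.5 times more frequent in those baskets than in the data overall.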

Product Recommendation to sales team:

According to the item frequency plot, the most frequently bought products are:

1. Whole milk
2. Other vegetables
3. Rolls/buns

So, we can recommend to the sales team that these 3 items, being the most frequently bought products, should be stocked in larger quantities than the other products. As customers buy whole milk, other vegetables and rolls/buns the most, these items can be expected to keep selling more than the others.

Two Products to sell in a bundle:

The rule {frankfurter, frozen meals, tropical fruit, whole milk} with {pip fruit} has the third highest lift of 13.124060, and the rule {frankfurter, frozen meals, pip fruit, whole milk} with {tropical fruit} has the fourth highest lift of 9.473541, both with a confidence of 1. Tropical fruit and pip fruit appear in both rules. Hence, we can assume that if the sales team offers a bundle of tropical fruit and pip fruit, customers are more likely to buy both.

Observations and recommendations:

The analysis shows that people buy whole milk most frequently. So, we can recommend that the sales team place the whole milk at the far end of the store, after all other sections. People are going to buy whole milk anyway, and on their way to the whole milk section they will browse the other sections, which might increase the sales of other products.

Another recommendation is that, as people buy whole milk and other vegetables most frequently, the sales team can sell them as a bundle with an offer attached. For example, they can give a discount on sour cream to customers who buy the bundle of vegetables and whole milk.

Task 4: Clustering (Using R Studio)

The aim of the task is to find price tiers based on mobile phone features. There are 20 attributes and 2130 observations.

Dataset description

The dataset contains 2130 observations and 20 attributes. The data consist of values of different features that determine the price of a mobile.

It consists of:

battery_power - Total energy a battery can store in one time measured in mAh
blue - Has Bluetooth or not
clock_speed - Speed at which microprocessor executes instructions
dual_sim - Has dual sim support or not
fc - Front Camera mega pixels
four_g - Has 4G or not
int_memory - Internal Memory in Gigabytes
m_dep - Mobile Depth in cm
mobile_wt - Weight of mobile phone
n_cores - Number of cores of processor
pc - Primary Camera mega pixels
px_height - Pixel Resolution Height
px_width - Pixel Resolution Width
ram - Random Access Memory in Megabytes
sc_h - Screen Height of mobile in cm
sc_w - Screen Width of mobile in cm
talk_time - longest time that a single battery charge will last when you are
three_g - Has 3G or not
touch_screen - Has touch screen or not
wifi - Has WIFI or not

Importing dataset into R Studio

View Dataset

Creating a duplicate of the original dataset

To protect the original dataset from accidental changes, a duplicate is created.

Counting Missing Values

The missing values in the data table are counted: there are 171 in 2130 observations.

Because the number of missing values is small, we can eliminate them from the dataset using the omit function.

Eliminating the Missing Values

The missing values were eliminated and the resulting dataset was named MPU1.

The dataset now consists of 1959 observations and 20 variables.

Removing the unwanted columns

Of all the variables, RAM and internal memory were considered the most relevant for determining the price tiers of the mobiles.

Hence all other variables are removed, leaving only the two variables ram and int_memory.

Dataset with ram and internal memory

Determining Number of clusters

We use the elbow method to find the optimal value of K, the number of clusters.

To get an optimum value, 10 iterations are performed and the results are plotted.

The curve changes sharply at 3 clusters, so the number of clusters is set to 3.
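One common way to produce an elbow plot in base R is sketched below, assuming that the 10 iterations correspond to K = 1 to 10. The data frame `toy` is synthetic (three loose groups of made-up ram and int_memory values), and `nstart = 10` and the seed are illustrative choices, not settings taken from the report.

```r
set.seed(123)  # arbitrary seed
# Synthetic stand-in for the two retained columns: three loose groups.
toy <- data.frame(
  ram        = c(rnorm(50, 800, 150), rnorm(50, 2000, 150), rnorm(50, 3300, 150)),
  int_memory = c(rnorm(50, 12, 4),    rnorm(50, 35, 4),     rnorm(50, 55, 4))
)

# Total within-cluster sum of squares for k = 1..10.
wss <- sapply(1:10, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve flattens.
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Since ram is on a much larger numeric scale than int_memory, unscaled k-means distances are driven mostly by ram; standardising the columns with `scale()` first would be one option if both should contribute equally.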

K mean Clustering

K-means clustering is performed with the command plot <- kmeans(MPU1, 3).

The ratio between_SS / total_SS is about 88.6%, which indicates a good fit.
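The ratio quoted above can be read directly from the object that `kmeans()` returns. The sketch below uses synthetic data again (the data frame `toy`, the seed and `nstart = 10` are assumptions for illustration), so its ratio and cluster sizes will differ from the 88.6% and the sizes reported for MPU1.

```r
set.seed(123)  # arbitrary seed
toy <- data.frame(
  ram        = c(rnorm(50, 800, 150), rnorm(50, 2000, 150), rnorm(50, 3300, 150)),
  int_memory = c(rnorm(50, 12, 4),    rnorm(50, 35, 4),     rnorm(50, 55, 4))
)

fit <- kmeans(toy, centers = 3, nstart = 10)

fit$size     # number of observations in each cluster
fit$centers  # cluster centres, usable for describing the tiers

# Share of the total variation explained by the split into clusters.
ratio <- fit$betweenss / fit$totss
ratio
```

A ratio close to 1 means that most of the variation lies between clusters and little within them.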

Plot before cleaning/ processing data

Plot after cleaning/ processing data

The three clusters are of almost equal size: 650, 643 and 666.

Price tiers can be identified from the cluster

Business Recommendations

From the clusters we suggest offering three ranges of mobiles:

Class 1 - internal memory between 2GB and 25GB with RAM of 200MB to 1.5GB

Class 2 - internal memory between 25GB and 45GB with RAM of 1.5GB to 2.5GB

Class 3 - internal memory between 45GB and 65GB with RAM of 2.5GB to 4GB

Class-1 mobile

Class 1 can be a low-budget mobile with a limited set of other features, sold at a lower price and marketed in underdeveloped markets. It has lower RAM and lower internal memory.

Recommendation for class-1 other features

Bluetooth
One core processor
WIFI
3G

Class-2 mobile

Class 2 is a mid-range mobile to which most of the other features can be added, and it can be marketed in moderately developed markets. It has medium RAM and medium internal memory.

Recommendation for class-2 other features

Bluetooth
Primary camera
3G
Touch screen
WIFI
4 core processor

Class-3 mobile

Class 3 is a premium phone with all the functions. It can be marketed to higher-income customers at a higher price, and it offers high performance with high RAM and internal memory.

Recommendation for class-3 other features

Bluetooth
Dual sim
Front camera
4G
6 core processor
Primary camera
3G
Touch screen
WIFI
