Classification of Titanic Passenger Data and Chances of Surviving the Disaster
Data Mining with Weka and Kaggle Competition Data

Shawn Cicoria, John Sherlock, Manoj Muniswamaiah, Lauren Clarke
Seidenberg School of Computer Science and Information Systems, Pace University, White Plains, NY, US
[email protected], {js20454w,mm42526w,lc18948w}@pace.edu

Post on 23-Feb-2016


Page 1: Shawn  Cicoria, John Sherlock, Manoj Muniswamaiah, Lauren Clarke

Classification of Titanic Passenger Data and

Chances of Surviving the Disaster

Data Mining with Weka and Kaggle Competition Data

Shawn Cicoria, John Sherlock, Manoj Muniswamaiah, Lauren Clarke

Seidenberg School of Computer Science and Information Systems, Pace University

White Plains, NY, US

[email protected], {js20454w,mm42526w,lc18948w}@pace.edu

Page 2:

Background
• Titanic disaster: April 15, 1912
• 1,502 passengers and crew perished out of 2,224 [2]
• Researchers are still trying to identify what determined the chance of survival [2,3]

[2] “Titanic: Machine Learning from Disaster,” Kaggle.com. [Online]. Available: https://www.kaggle.com/c/titanic-gettingStarted. [Accessed: 13-Dec-2013].
[3] Wiki, “Titanic.” [Online]. Available: http://en.wikipedia.org/wiki/Titanic. [Accessed: 13-Dec-2013].

Page 3:

Kaggle.com
• Crowdsourcing and competitions for analytics and data mining
• Online presence
• Example competition
  • General Electric (GE) offering $200,000

Page 4:

Weka: Waikato Environment for Knowledge Analysis
• Open source tool
• http://www.cs.waikato.ac.nz/ml/weka/
• Collection of machine learning algorithms and analytical tools
• Cross-platform, Java based
• Primary authors: researchers at the University of Waikato, NZ

Page 5:

Basic Premise
• Which classes of passengers had the greatest impact on survivability in the Titanic disaster?
  • Sex
  • Cabin class
  • Point of departure
  • Age

Page 6:

Source Data
• Kaggle (Kaggle.com)
• Titanic Disaster Competition
  • https://www.kaggle.com/c/titanic-gettingStarted
• Used the training data set (891 labeled passengers)

Page 7:

Data Set: Coaxing for Weka
• Original Data
• Data Modifications

Field | Action | Comment
PassengerId | Removed | Not needed for analysis as it’s just an identifier
Survived | Converted to No/Yes | Needed nominal identifier
Pclass | Removed -> created Class column instead | Needed nominal identifier
Class | New column | Simple calculation based upon Pclass
Age | Removed -> created AgeGroup class | Wanted simple classification coding
AgeGroup | Formula based; some values not supplied, so ended up with 4 groups other than Unknown (Child, Adolescent, Adult, Old) | Arbitrarily did the following: =IF(F2="", "Unk", IF(F2<10, "Child", IF(F2<20, "Adolescent", IF(F2<50, "Adult", "Old"))))
Ecode | Removed -> created class Embarked | Needed nominal identifier
Embarked | New column | Converts Ecode to the real name of the departure point for the passenger
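The recoding in the table above was done in a spreadsheet. As a rough illustration only, the same steps could be sketched in Python. The helper names (`age_group`, `recode`) are hypothetical, and the mapping of one-letter port codes in Ecode to port names is an assumption based on Kaggle's usual S/C/Q coding. The example record is made up to reproduce the first data row of the ARFF file.

```python
# Sketch of the spreadsheet recoding described in the table above.
# Column names follow the slide; the helper names are hypothetical.

PCLASS_TO_CLASS = {"1": "1st", "2": "2nd", "3": "3rd"}
# Assumption: Ecode holds Kaggle's one-letter port codes.
ECODE_TO_PORT = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}


def age_group(age):
    """Mirror the slide's nested IF: blank -> Unk, then <10, <20, <50, else Old."""
    if age is None or str(age).strip() == "":
        return "Unk"
    years = float(age)
    if years < 10:
        return "Child"
    if years < 20:
        return "Adolescent"
    if years < 50:
        return "Adult"
    return "Old"


def recode(row):
    """Turn one raw record (a dict of strings) into the five nominal fields."""
    return {
        "Survived": "Yes" if row["Survived"] == "1" else "No",
        "Class": PCLASS_TO_CLASS[row["Pclass"]],
        "Sex": row["Sex"],
        "AgeGroup": age_group(row.get("Age")),
        "Embarked": ECODE_TO_PORT.get(row.get("Ecode", ""), "Unk"),
    }


# Illustrative record only; PassengerId is dropped by recode().
example = {"PassengerId": "1", "Survived": "0", "Pclass": "3",
           "Sex": "male", "Age": "22", "Ecode": "S"}
recoded = recode(example)
print(recoded)
```

The age cut-offs (10, 20, 50) are the arbitrary ones from the slide's formula, not a recommended binning.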

Page 8:

Final Data Format
• Final CSV
• ARFF Format

@relation 'train4-weka.filters.unsupervised.attribute.Remove-R1,3,6,8'
@attribute Survived {No,Yes}
@attribute Class {1st,2nd,3rd}
@attribute Sex {male,female}
@attribute AgeGroup {Child,Adolescent,Adult,Old,Unk}
@attribute Embarked {Southampton,Cherbourg,Queenstown,Unk}
@data
No,3rd,male,Adult,Southampton
Yes,1st,female,Adult,Cherbourg
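The slides do not say how the ARFF file was produced (the relation name suggests Weka's own filters were involved). As a minimal sketch, assuming the recoded records are available as Python dicts, a nominal-only ARFF file like the one above could be written by hand. The function name `write_arff` and the relation name `titanic` are placeholders.

```python
import io

# Attribute order and allowed values, copied from the ARFF header on the slide.
ATTRIBUTES = [
    ("Survived", ["No", "Yes"]),
    ("Class", ["1st", "2nd", "3rd"]),
    ("Sex", ["male", "female"]),
    ("AgeGroup", ["Child", "Adolescent", "Adult", "Old", "Unk"]),
    ("Embarked", ["Southampton", "Cherbourg", "Queenstown", "Unk"]),
]


def write_arff(rows, out, relation="titanic"):
    """Write dict rows as a nominal-only ARFF file to a text stream."""
    out.write(f"@relation {relation}\n\n")
    for name, values in ATTRIBUTES:
        out.write(f"@attribute {name} {{{','.join(values)}}}\n")
    out.write("\n@data\n")
    for row in rows:
        # Reject values that the header does not declare.
        for name, values in ATTRIBUTES:
            if row[name] not in values:
                raise ValueError(f"{name}={row[name]!r} is not a declared value")
        out.write(",".join(row[name] for name, _ in ATTRIBUTES) + "\n")


# The two data rows shown on the slide.
rows = [
    dict(Survived="No", Class="3rd", Sex="male", AgeGroup="Adult", Embarked="Southampton"),
    dict(Survived="Yes", Class="1st", Sex="female", AgeGroup="Adult", Embarked="Cherbourg"),
]
buffer = io.StringIO()
write_arff(rows, buffer)
arff_text = buffer.getvalue()
print(arff_text)
```

Checking each value against the declared set is a design choice of this sketch: it makes an undeclared nominal value fail here rather than later when the file is loaded.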

Page 9:

J48 Classifier
• Based on C4.5
• 81% correct classification
• Would have placed 42nd in the Kaggle competition if submitted

J48 pruned tree
------------------
Sex = male: No (577.0/109.0)
Sex = female
|   Class = 3rd
|   |   Embarked = Southampton: No (88.0/33.0)
|   |   Embarked = Cherbourg: Yes (23.0/8.0)
|   |   Embarked = Queenstown
|   |   |   AgeGroup = Child: Yes (0.0)
|   |   |   AgeGroup = Adolescent: Yes (5.0/1.0)
|   |   |   AgeGroup = Adult: No (5.0/1.0)
|   |   |   AgeGroup = Old: Yes (0.0)
|   |   |   AgeGroup = Unk: Yes (23.0/4.0)
|   |   Embarked = Unk: No (0.0)
|   Class = 1st: Yes (94.0/3.0)
|   Class = 2nd: Yes (76.0/6.0)

Number of Leaves : 11
Size of the tree : 15

=== Summary ===
Correctly Classified Instances      722    81.0325 %
Incorrectly Classified Instances    169    18.9675 %
Kappa statistic                     0.5714
Mean absolute error                 0.2911
Root mean squared error             0.385
Relative absolute error             61.5359 %
Root relative squared error         79.1696 %
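To make the printed tree easier to read, it can be restated as ordinary branching logic. The sketch below is a hand transcription of the tree above, not Weka output, and `predict` is a hypothetical helper name. It also tallies the leaf counts, reading each pair in Weka's usual way as (instances reaching the leaf / instances misclassified there). The leaves give 891 instances with 165 misclassified on the training data, slightly better than the 722 correct in the summary. One plausible reading, not stated on the slide, is that the summary comes from cross-validation rather than from re-testing on the training data.

```python
# The pruned tree from the slide, written as plain branching logic.
def predict(passenger):
    if passenger["Sex"] == "male":
        return "No"
    if passenger["Class"] in ("1st", "2nd"):
        return "Yes"
    # Female, 3rd class: the tree splits on port of embarkation next.
    port = passenger["Embarked"]
    if port == "Cherbourg":
        return "Yes"
    if port == "Queenstown":
        return "No" if passenger["AgeGroup"] == "Adult" else "Yes"
    return "No"  # Southampton or Unk


# (instances reaching the leaf, misclassified), copied from the printout.
LEAVES = [(577, 109), (88, 33), (23, 8), (0, 0), (5, 1), (5, 1),
          (0, 0), (23, 4), (0, 0), (94, 3), (76, 6)]

total = sum(n for n, _ in LEAVES)
errors = sum(e for _, e in LEAVES)
training_accuracy = (total - errors) / total
reported_accuracy = 722 / (722 + 169)  # figures from the summary

print(total, errors)
print(f"{training_accuracy:.4f} {reported_accuracy:.4f}")
```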

Information gain
• Amount of information gained by knowing the value of the attribute
• (Entropy of the distribution before the split) - (entropy of the distribution after it)

Claude Shannon, American mathematician and scientist (1916–2001)
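As a worked example of this definition, the sketch below computes the information gain of splitting on Sex. The survivor counts are not printed on the slides. They are derived from the leaf counts in the J48 tree: 109 of 577 males survived, and the female leaves add up to 233 survivors out of 314. The function names are illustrative. C4.5 itself ranks splits by gain ratio, a normalized variant, so this shows the underlying quantity rather than J48's exact criterion.

```python
from math import log2


def entropy(counts):
    """Shannon entropy, in bits, of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(parent)
    after = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - after


# [survived, died] counts, derived from the leaf counts in the tree printout.
male = [109, 577 - 109]
female = [233, 314 - 233]
everyone = [male[0] + female[0], male[1] + female[1]]

gain_sex = information_gain(everyone, [male, female])
print(f"Information gain for Sex: {gain_sex:.3f} bits")
```

The result is a little over 0.2 bits, against a starting entropy of just under 1 bit, which is consistent with Sex being chosen as the root of the tree.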

Page 10:

J48 Tree Diagram
• Sex: largest impact
• Cabin class
• Departure point

Page 11:

Simple K Means Clustering
• Sex had a clear clustering impact

Page 12:

Simple K Means Clustering
• Cabin class showed significant clustering
• 3rd class passengers fared worse

Page 13:

Simple K Means Clustering
• Age group
• Hard to distinguish any clustering
• Lowest influencer in the J48 tree

Page 14:

Simple K Means
• Point of departure
• Southampton seems significant
• We did not determine whether departure point was associated with cabin class; another study is needed.
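A follow-up of this kind could start from a simple cross-tabulation of departure point against cabin class. The sketch below is an assumption about how that might be done, not something from the project. `crosstab` is a hypothetical helper, and the sample holds only the two records printed in the ARFF file; a real run would load all 891.

```python
from collections import Counter


def crosstab(rows, row_attr, col_attr):
    """Count how often each (row_attr, col_attr) pair of values occurs."""
    return Counter((r[row_attr], r[col_attr]) for r in rows)


# Only the two records printed in the ARFF file; not the full data set.
sample = [
    dict(Survived="No", Class="3rd", Sex="male", AgeGroup="Adult", Embarked="Southampton"),
    dict(Survived="Yes", Class="1st", Sex="female", AgeGroup="Adult", Embarked="Cherbourg"),
]

table = crosstab(sample, "Embarked", "Class")
for (port, cabin_class), count in sorted(table.items()):
    print(port, cabin_class, count)
```

With the full data, strongly uneven counts across a port's row would suggest that departure point and cabin class are associated.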

Page 15:

Simple K Means
• Point of departure vs. Survived
• Instances colored by class (1st, 2nd, 3rd)
• Shows a strong association between point of embarkation and class

Page 16:

Conclusions and Summary
• Sex clearly had the most significant impact on the survival rate
• J48 classifier: ~81% correctly classified instances
• Kaggle competition: 43rd place.

Page 17:

Finally
• Weka is powerful, however:
  • Requires significant coaxing of the data into a more amenable format
• At first, we had chosen baseball statistics
  • Became overwhelmed
  • Baseball statistics were tossed out very late in our project.
• Kaggle to the rescue
  • Stumbled upon this dataset
  • Simple manipulation produced a compatible ARFF format
• Demonstrated which classes of passengers had the greatest impact on survivability.

Page 18:

References
[1] GE, “Flight Quest Challenge,” Kaggle.com. [Online]. Available: https://www.gequest.com/c/flight2-main. [Accessed: 13-Dec-2013].
[2] “Titanic: Machine Learning from Disaster,” Kaggle.com. [Online]. Available: https://www.kaggle.com/c/titanic-gettingStarted. [Accessed: 13-Dec-2013].
[3] Wiki, “Titanic.” [Online]. Available: http://en.wikipedia.org/wiki/Titanic. [Accessed: 13-Dec-2013].
[4] Kaggle, Data Science Community. [Online]. Available: http://www.kaggle.com/. [Accessed: 13-Dec-2013].
[5] Weka 3: Data Mining Software in Java. [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/. [Accessed: 13-Dec-2013].
[6] C4.5 Algorithm, Wikipedia, Wikimedia Foundation. [Online]. Available: http://en.wikipedia.org/wiki/C4.5_algorithm. [Accessed: 13-Dec-2013].