titanic linkedin presentation - 20022015

Post on 18-Aug-2015

TRANSCRIPT

Tackling the Titanic

Alex Akulov

Carlos Hernandez

20 February 2015

Approach in an internal analytics competition

Deloitte Titanic Analytics Competition | Walkthrough

Aussies challenge the world

Deloitte Australia called out the global member firms to an analytics competition - the response was loud and clear.

Nations: 31

Teams: 192

Practitioners: 359

Submissions: 1,954


Deloitte tackles the Titanic

The task was to predict the fate for half the passengers aboard the ship, based on the outcomes for the first half.

Sample rows (legend: Survived = 1, Died = 0):

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
31 | 0 | 1 | Uruchurtu, Don. Manuel E | male | 40 | 0 | 0 | PC 17601 | 27.7208 | | C
246 | 0 | 1 | Minahan, Dr. William Edward | male | 44 | 2 | 0 | 19928 | 90 | C78 | Q
746 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70 | 1 | 1 | WE/P 5735 | 71 | B22 | S
17 | 0 | 3 | Rice, Master. Eugene | male | 2 | 4 | 1 | 382652 | 29.125 | | Q
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S


Vancouver Data Divers tackle the Titanic

An iterative approach to modeling and tuning proved imperative to achieving a high score.

Process diagram: Feature Engineering, Data Visualization, Model Development, Kaggle Submission, Model Tuning, with feedback loops.


From raw data to usable features:

• Deterministic: deriving values from existing features

• Probabilistic: filling in the blanks with a predictive model (e.g. age based on class and title)
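The probabilistic case can be illustrated with a very simple stand-in for a predictive model: filling a missing age with the median age of passengers who share the same class and title. This is only a sketch under stated assumptions. The field names (`pclass`, `title`, `age`, `age_estimate`), the use of a group median, and the toy records are all hypothetical and are not taken from the team's workflow.

```python
from collections import defaultdict
from statistics import median


def impute_age(passengers):
    """Return copies of the records with an 'age_estimate' field.

    Known ages are kept as-is. Missing ages (None) are filled with the
    median age of passengers sharing the same (pclass, title); if that
    group has no known ages, the overall median is used instead.
    Assumes at least one record has a known age.
    """
    ages_by_group = defaultdict(list)
    known_ages = []
    for p in passengers:
        if p["age"] is not None:
            ages_by_group[(p["pclass"], p["title"])].append(p["age"])
            known_ages.append(p["age"])
    overall_median = median(known_ages)

    filled = []
    for p in passengers:
        record = dict(p)  # copy so the input is not modified
        if record["age"] is None:
            group_ages = ages_by_group.get((record["pclass"], record["title"]))
            record["age_estimate"] = median(group_ages) if group_ages else overall_median
        else:
            record["age_estimate"] = record["age"]
        filled.append(record)
    return filled


# Made-up toy records, not rows from the competition data.
toy = [
    {"pclass": 3, "title": "Master", "age": 2.0},
    {"pclass": 3, "title": "Master", "age": 4.0},
    {"pclass": 3, "title": "Master", "age": None},
    {"pclass": 1, "title": "Mr", "age": 40.0},
    {"pclass": 2, "title": "Dr", "age": None},
]
filled = impute_age(toy)
```

A regression on several attributes, or sampling from the observed age distribution within each group, would be natural refinements of the same idea.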


Feature Engineering

Attributes were derived from existing data to generate an enhanced view of passengers and help with model accuracy.

Name: Rice, Master. Eugene

• Family Name: Rice

• Given Name: Eugene

• Passenger Type: Master

Class: 1, Sex: F, Spouses: 0

• Age Estimate: 46

Feature list:

• Passenger ID; Survived? (Y/N); Passenger Class

• Name; Given Name; Family Name; Passenger Type

• Gender; Gender Code * Age; Gender Code * Passenger Class

• Age; Age Estimate (Regression); Age Estimate (Distribution)

• SibSp; Siblings; Spouses

• Parch; Parents; Children; Wife? (Y/N); Husband? (Y/N); Father? (Y/N); Mother? (Y/N)

• Travel Type 1; Travel Type 2

• Ticket; Death in Group? (Y/N); Father * Death in Group; Group Size; Ticket First Character; Ticket First Letter

• Fare; Fare per Passenger; Fare (log)

• Cabin; Deck; Cabin Number

• Embarked
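A few of the deterministic attributes can be derived with simple string handling. The sketch below is illustrative only: the function name, the output keys, the regular expressions, and the choice of `log1p` for the log fare are assumptions, not the team's actual implementation.

```python
import math
import re

# "Family, Title. Given names" as in "Rice, Master. Eugene"
NAME_PATTERN = re.compile(r"^\s*([^,]+),\s*([^.]+)\.\s*(.*)$")
# Deck letter followed by an optional cabin number, as in "C85"
CABIN_PATTERN = re.compile(r"^([A-Za-z])(\d*)")


def derive_features(row):
    """Derive a handful of attributes from one raw passenger record (a dict)."""
    features = {"family_name": None, "passenger_type": None, "given_name": None}
    name_match = NAME_PATTERN.match(row["Name"])
    if name_match:
        features["family_name"] = name_match.group(1).strip()
        features["passenger_type"] = name_match.group(2).strip()
        features["given_name"] = name_match.group(3).strip()

    # Only the first cabin is used if several are listed.
    cabin = (row.get("Cabin") or "").strip()
    cabin_match = CABIN_PATTERN.match(cabin)
    features["deck"] = cabin_match.group(1).upper() if cabin_match else None
    features["cabin_number"] = (
        int(cabin_match.group(2)) if cabin_match and cabin_match.group(2) else None
    )

    ticket = (row.get("Ticket") or "").strip()
    features["ticket_first_character"] = ticket[0] if ticket else None

    fare = row.get("Fare")
    # log(1 + fare) keeps a zero fare finite; the exact transform is an assumption.
    features["fare_log"] = math.log1p(fare) if fare is not None else None
    return features


# Two of the sample rows shown on the earlier slide.
rice = derive_features(
    {"Name": "Rice, Master. Eugene", "Cabin": "", "Ticket": "382652", "Fare": 29.125}
)
cumings = derive_features(
    {
        "Name": "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Cabin": "C85",
        "Ticket": "PC 17599",
        "Fare": 71.2833,
    }
)
```

For the first record this yields the family name, given name, and passenger type shown on the slide (Rice, Eugene, Master).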


Visualization drives feature selection for model development.

Assumptions can be tested to save time in model development.


Data visualization

Tableau rapid-fire visualizations enabled us to get to know the data and segment it better for analysis.

Chart: Count of Name by Sex (female, male) and Pclass (1, 2, 3), split by Survived (No / Yes).

Females in 3rd class need independent analysis.

Females in 1st/2nd class should be grouped.

Males in 3rd class could skew results. Best to analyze independently.
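The same kind of segmentation can be checked numerically by counting survivors per (Sex, Pclass) group. The sketch below uses only the seven sample rows shown on the earlier slide, which are far too few to support any conclusion; the function name and the tuple layout are assumptions made for illustration.

```python
from collections import Counter

# (Sex, Pclass, Survived) for the seven sample rows shown on the earlier slide.
sample_rows = [
    ("male", 1, 0),
    ("male", 1, 0),
    ("male", 1, 0),
    ("male", 3, 0),
    ("male", 3, 0),
    ("female", 1, 1),
    ("female", 3, 1),
]


def survival_by_segment(rows):
    """Return {(sex, pclass): (survivors, passengers)} for the given rows."""
    passengers = Counter()
    survivors = Counter()
    for sex, pclass, survived in rows:
        passengers[(sex, pclass)] += 1
        survivors[(sex, pclass)] += survived
    return {segment: (survivors[segment], passengers[segment]) for segment in passengers}


segments = survival_by_segment(sample_rows)
for (sex, pclass), (alive, total) in sorted(segments.items()):
    print(f"{sex:6s} class {pclass}: {alive} of {total} survived")
```

Run on the full training file, a table like this gives the numbers behind the bar chart and makes it easier to decide which groups to model separately.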

Chart: Fare by Pclass (1, 2, 3).

Passenger class cannot be determined based on fares.

There was an overlap of class cabins.

Classes are distributed across decks.

Chart: # Survived by Pclass (1, 2, 3) for each travel type (Alone, Family, Group) and sex (female, male), split by Survived (No / Yes).

Contrary to what was previously theorized, there is no direct correlation between a passenger’s travel type (Alone/Family/Group) and survival rate.


Use statistical and machine learning models suitable for the task.

Determine which features are useful and derive new ones if necessary.


Model Development

Multiple statistical tools were used since some proved better than others in predicting outcome for groups of passengers.

KNIME Analytics Platform was used for prototyping and testing of various modeling approaches:

• Splitting data into groups

• Decision Trees

• Random Forest

• Logistic Regression

• Support Vector Machines (SVM)
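The first item, splitting data into groups, amounts to routing each passenger to a model chosen for their segment. The team did this in KNIME; the Python sketch below only illustrates the idea. The segment key (sex and class), the function names, and the placeholder predictors are all hypothetical and stand in for trained models such as a decision tree or a logistic regression.

```python
def segment_of(passenger):
    """Assign a passenger to a segment (hypothetical key: sex and class)."""
    return (passenger["sex"], passenger["pclass"])


def predict(passenger, models, default_model):
    """Use the model registered for the passenger's segment, else the default."""
    model = models.get(segment_of(passenger), default_model)
    return model(passenger)


def constant(value):
    """Placeholder predictor that always returns the same outcome."""
    return lambda passenger: value


def title_rule(passenger):
    """Placeholder rule-based predictor keyed on the passenger's title."""
    return 1 if passenger.get("title") == "Master" else 0


# Placeholder assignments, chosen only to make the sketch runnable.
models = {
    ("female", 1): constant(1),
    ("female", 2): constant(1),
    ("male", 3): title_rule,
}
default_model = constant(0)

example = {"sex": "male", "pclass": 3, "title": "Master"}
outcome = predict(example, models, default_model)
```

Keeping the routing separate from the models makes it easy to swap a better model into one segment without touching the others.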


Submission is scored and ranked against other teams.

Once a certain threshold is met, the model is ready for tuning.


Kaggle Submission

Vancouver Data Divers achieved the goal of being in the top 10% two weeks before the competition deadline.

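Segmented model (diagram): passengers are split by Gender (M / F) and then by Class. Labels shown on the branches and leaves: Survived; 1st or 2nd; Logistic Regression; 3rd; Decision Tree on 'Master?', 'Fare'; If Master? = 1, Survived; 1st; Logistic Regression; 2nd, 3rd.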


Adjusting model parameters can increase model accuracy without changing the input variables.

Cross-validation is one approach to testing the model after each tuning cycle.
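As a rough illustration of cross-validation, the sketch below splits the data into k folds, trains on k-1 of them, and scores accuracy on the held-out fold. The function names, the majority-class baseline, and the toy labels are assumptions for illustration; the slides do not say how the team set up its validation.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for k folds over n rows (needs n >= k)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_val_accuracy(rows, labels, fit, k=5, seed=0):
    """Average held-out accuracy of the models returned by fit(rows, labels)."""
    scores = []
    for train, test in k_fold_indices(len(rows), k, seed):
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        correct = sum(model(rows[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def fit_majority(rows, labels):
    """Baseline 'model': always predict the most common training label."""
    majority = max(set(labels), key=labels.count)
    return lambda row: majority


# Toy data: 10 rows, 8 of which have label 0.
toy_rows = list(range(10))
toy_labels = [0] * 8 + [1] * 2
baseline_accuracy = cross_val_accuracy(toy_rows, toy_labels, fit_majority, k=5)
```

Comparing a tuned model's cross-validated accuracy against a baseline like this helps tell a genuine improvement from one that only fits the leaderboard sample.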


Model Tuning

Model parameters can be adjusted to achieve better predictive results; tuning moved the team from 19th to 13th spot.

Python was used to tune the model by:

• Choosing the optimal features

• Adjusting model parameters

• Reducing manual effort

32,768 combinations of 15 features

2,000 attempts per hour

17 hours on a Deloitte laptop
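The figures above fit together arithmetically: 15 features have 2^15 = 32,768 possible subsets (counting the empty one), and at 2,000 attempts per hour an exhaustive search takes about 16.4 hours, in line with the reported 17. The sketch below reproduces that count; the feature names are hypothetical placeholders, and how the team actually enumerated and scored the combinations is not stated in the slides.

```python
from itertools import combinations


def feature_subsets(features):
    """Yield every subset of the feature list, from the empty set to the full set."""
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            yield subset


# Hypothetical placeholder names for the 15 candidate features.
candidate_features = [f"feature_{i}" for i in range(15)]

subset_count = sum(1 for _ in feature_subsets(candidate_features))
attempts_per_hour = 2000  # rate reported on the slide
estimated_hours = subset_count / attempts_per_hour

print(subset_count, estimated_hours)
```

In a real search, each yielded subset would be passed to a train-and-score step, and the best-scoring subset kept.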


Vancouver Data Divers Placed 13th Overall

The local IM&AT talent is capable of tackling predictive modeling projects

Vancouver office is on the map for global Analytics talent

Our story got one client excited thinking about predictive modeling opportunities

Developed stronger predictive analytics capabilities that can be shared within the practice
