my first data science project (data science thailand meetup #1)

Data Science Thailand Meetup #1 (My first Data Science Project)

16 October 2015 By Komes Chandavimol

[email protected]

http://datascienceth.com/category/seminar/slides/

Café Amazon

*  Need to better understand customer opinions of, and behavior toward, the Café Amazon brand

The plan

*  The first step is to collect social media data, starting with public sources such as Twitter. This could be a good first step toward a future Big Data platform.

Café Amazon Use Case


*  Source – tweets that relate to “Café Amazon”, “Amazon Coffee”, or “Amazon”
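As a rough illustration of the keyword selection above, the following Python sketch filters a list of tweets by the three search terms. The sample tweets and the helper name `mentions_keyword` are made up for this example; how the tweets were actually collected is not shown in the slides.

```python
# Keywords taken from the slide; the sample tweets below are invented.
KEYWORDS = ["café amazon", "amazon coffee", "amazon"]


def mentions_keyword(tweet, keywords=KEYWORDS):
    """Return True if the tweet contains any keyword, ignoring case."""
    text = tweet.lower()
    return any(keyword in text for keyword in keywords)


sample_tweets = [
    "I am at Amazon Café",
    "Green tea time at the office",
    "Amazon Deals: coffee maker on sale",
]

# Keep only the tweets that mention one of the keywords.
matching = [t for t in sample_tweets if mentions_keyword(t)]
print(matching)
```

Note that the bare keyword “Amazon” also matches tweets about the online retailer, which is why the labeling step that follows is needed.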

Source - Twitter

Twitter Data Discovery

Tweet Classification using Excel

*  Step 1: Identify Class (Label)
*  Step 2: Data Cleaning
*  Step 3: Word Counting
*  Step 4: Tokenization
*  Step 5: Model Building
*  Step 6: Using Bayes and the MAP model

Step 1: Identify Class (Label)

*  Amazon (AMZ)
*  Others (OTH)

Step 2: Clean Data

*  Change everything to lowercase
*  =LOWER(A2)

*  Replace the punctuation marks . : ? ! ; and , with spaces
*  =SUBSTITUTE(B2,". "," ")
*  =SUBSTITUTE(C2,": "," ")
*  =SUBSTITUTE(D2,"?"," ")
*  =SUBSTITUTE(E2,"!"," ")
*  =SUBSTITUTE(F2,";"," ")
*  =SUBSTITUTE(G2,","," ")

Columns: Tweet | Classification (Label) | lowercase | remove . | remove : | remove ?

Formulas shown in each row: =LOWER(B2) | =SUBSTITUTE(C2,". "," ") | =SUBSTITUTE(D2,": "," ") | =SUBSTITUTE(E2,"?"," ")

Sample tweets, with the label where one is shown:

*  [AMZ] My J soya iced latte. I think Amazon Cafe makes the best
*  [AMZ] I am at Amazon Café
*  [AMZ] green tea time ❤🍵🍃 @ #cafeAmazon pic.twitter.com/28jzc4ojOy
*  [AMZ] #Coffee Instagram by @Tikkieinlove #Tuesday#relax time#icecoffee#espresso#cafeamazon#iphone6plus#
*  [OTH] wBestSellers: #> Amazon #Deals: Save $62 (48% OFF) Mr. Coffee BVMC-EL1 Café
*  #Amazon #Coffee Store©☕goo.gl/b8U4a1☕#BeanCoffee #Coffeemaker #grinder #cappuccino #hillsbros
*  #> Amazon #Deals: Save $62 (48% OFF) Mr. Coffee BVMC-EL1 Caf... bit.ly/1GHYQfG |
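The cleaning chain above (LOWER followed by a series of SUBSTITUTE calls) could be sketched in Python roughly as follows. The function name `clean_tweet` is an assumption for this example; the patterns are the ones listed on the slide, including the trailing space after the period and the colon.

```python
def clean_tweet(tweet):
    """Lowercase the tweet, then replace each punctuation pattern with a space."""
    text = tweet.lower()  # same idea as =LOWER(...)
    # Same patterns as the SUBSTITUTE chain; ". " and ": " include a space,
    # so a period or colon inside a link (no space after it) is left alone.
    for pattern in [". ", ": ", "?", "!", ";", ","]:
        text = text.replace(pattern, " ")  # same idea as =SUBSTITUTE(..., pattern, " ")
    return text


print(clean_tweet("My soya iced latte. I think Amazon Cafe makes the best"))
```

Each `replace` call plays the role of one helper column in the spreadsheet; in code the intermediate columns collapse into a single loop.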




Step 3: Word Counting

*  Prepare Words from

3.1 Separate Words

*  First N rows: set the space position to 0
*  After N rows: =IFERROR(FIND(" ",A127,B27+1),LEN(A127)+1) finds the next space position
*  =IFERROR(MID(A2,B2+1,B102-B2-1),".") extracts the word between two space positions
*  =LEN(C2) gives the word length

3.2 Check the result
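The FIND/MID approach above walks from one space position to the next and cuts out the text in between. A minimal Python sketch of the same idea is shown below; the function name `split_by_space_positions` is an assumption, and the comments map each line only loosely to the Excel formulas (Excel positions are 1-based, Python's are 0-based).

```python
def split_by_space_positions(text):
    """Split text into words by walking from one space position to the next."""
    words = []
    start = 0  # first pass: space position = 0
    while start <= len(text):
        end = text.find(" ", start)  # roughly FIND(" ", text, position + 1)
        if end == -1:                # roughly IFERROR(..., LEN(text) + 1)
            end = len(text)
        word = text[start:end]       # roughly MID(text, position + 1, next - position - 1)
        if word:                     # skip empty pieces caused by repeated spaces
            words.append(word)
        start = end + 1
    return words


print(split_by_space_positions("i am at amazon café"))
```

In everyday Python, `text.split()` gives the same result; the explicit loop is only there to mirror the spreadsheet logic.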


3.3 Use a Pivot Table to count each word
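The Pivot Table count can be approximated in Python with `collections.Counter`, keeping one counter per class. The labeled tweets below are invented for illustration and are assumed to be already cleaned.

```python
from collections import Counter

# Hypothetical cleaned tweets with their labels (AMZ or OTH).
labeled_tweets = [
    ("i am at amazon café", "AMZ"),
    ("green tea time at amazon café", "AMZ"),
    ("amazon deals save on coffee maker", "OTH"),
]

# One word counter per class, similar to one Pivot Table per label.
word_counts = {"AMZ": Counter(), "OTH": Counter()}
for text, label in labeled_tweets:
    word_counts[label].update(text.split())

print(word_counts["AMZ"].most_common(3))
print(word_counts["OTH"]["coffee"])
```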


Step 4: Tokenization

4 Using the Pivot Table word counts

*  Add one to every count: =B4+1
*  Calculate P(Token | Class): =C4/C$3
*  Calculate Log(P): =LN(D4)
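The three calculations above (add one, divide by the total, take the natural log) are sketched below in Python. This assumes that cell C3 holds the total of the add-one column, which the slides do not state explicitly; the counts and the function name `log_probabilities` are made up.

```python
import math
from collections import Counter


def log_probabilities(counts):
    """Return (log P(token | class) for each token, total of the add-one counts)."""
    smoothed = {token: n + 1 for token, n in counts.items()}  # like =B4+1
    total = sum(smoothed.values())                            # assumed to be the C$3 total
    log_probs = {
        token: math.log(n / total)                            # like =C4/C$3, then =LN(D4)
        for token, n in smoothed.items()
    }
    return log_probs, total


# Invented word counts for the AMZ class.
amz_counts = Counter({"amazon": 3, "café": 2, "latte": 1})
amz_log_probs, amz_total = log_probabilities(amz_counts)
print(amz_total)
print(amz_log_probs["amazon"])
```

Working with logarithms means the per-token values can later be added instead of multiplied, which avoids very small products.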

Step 5: Model Building


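5 Building the Model

*  Look up the log probability of each token; tokens of 3 characters or fewer score 0, and tokens missing from the table score LN(1/PropAMZ!$C$3)
*  =IF(LEN(D2)<=3,0,IF(ISNA(VLOOKUP(D2,PropAMZ!$A$4:$E$386,5,FALSE)),LN(1/PropAMZ!$C$3),VLOOKUP(D2,PropAMZ!$A$4:$E$386,5,FALSE)))
*  Sum the token scores of a tweet: =SUM(D14:AI14)
*  Compare the two class totals: =IF(C14>C26,"AMZ","OTHER")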
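The lookup, sum, and comparison formulas above can be put together in a short Python sketch. The two models here are invented: each is a pair of a log-probability table and the total used for unseen tokens, and the names `score` and `classify` are assumptions for this example rather than anything from the spreadsheet.

```python
import math


def score(tokens, log_probs, total):
    """Sum the per-token log probabilities for one class."""
    unseen = math.log(1 / total)  # like LN(1/PropAMZ!$C$3)
    result = 0.0
    for token in tokens:
        if len(token) <= 3:       # like IF(LEN(D2)<=3, 0, ...)
            continue
        result += log_probs.get(token, unseen)  # like VLOOKUP with the ISNA fallback
    return result                 # like =SUM(D14:AI14)


def classify(tokens, amz_model, oth_model):
    """Return the label with the higher total, like =IF(C14>C26,"AMZ","OTHER")."""
    amz_score = score(tokens, *amz_model)
    oth_score = score(tokens, *oth_model)
    return "AMZ" if amz_score > oth_score else "OTHER"


# Made-up models: (log-probability table, total used for unseen tokens).
amz_model = ({"amazon": math.log(0.4), "café": math.log(0.4), "latte": math.log(0.2)}, 10)
oth_model = ({"amazon": math.log(0.4), "deals": math.log(0.3), "save": math.log(0.3)}, 10)

print(classify(["iced", "latte", "amazon", "café"], amz_model, oth_model))
print(classify(["amazon", "deals", "save"], amz_model, oth_model))
```

Skipping tokens of three characters or fewer acts as a crude stop-word filter, since words such as "at" or "the" carry little information about the class.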

Step 6: Using Model


6 Testing the Model
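For reference, the standard naive Bayes decision rule in its MAP (maximum a posteriori) form can be written as below, where t_1 ... t_n are the tokens of a tweet and c is a class. This is the textbook form, not a transcription of the spreadsheet: the slides show the sum of token log probabilities, but whether the class prior log P(c) is added is not visible. With equal priors for both classes that term does not change the result.

```latex
\hat{c} = \arg\max_{c \in \{\mathrm{AMZ},\, \mathrm{OTH}\}}
  \left[ \log P(c) + \sum_{i=1}^{n} \log P(t_i \mid c) \right]
```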

Summary

*  Step 1: Identify Class (Label)
*  Step 2: Data Cleaning
*  Step 3: Word Counting
*  Step 4: Tokenization
*  Step 5: Model Building
*  Step 6: Using Bayes and the MAP model

Other Solutions?

*  www.datascienceth.com - R
*  www.datascienceth.com - Python
*  www.datascienceth.com - RapidMiner
