Data Science Thailand Meetup #1 (My first Data Science Project)
16 October 2015 By Komes Chandavimol
http://datascienceth.com/category/seminar/slides/
Café Amazon * Need to understand more about customer opinions and behavior toward the Café Amazon brand
The plan * The first step is to collect social media data, starting with public sources such as Twitter. This could be a good first step toward a future Big Data platform.
Café Amazon Use Case
* Source – Tweets that relate to “Café Amazon”, “Amazon Coffee”, or “Amazon”
Source - Twitter
Twitter Data Discovery
* Step 1: Identify Class (Label)
* Step 2: Data Cleaning
* Step 3: Word Counting
* Step 4: Tokenization
* Step 5: Model Building
* Step 6: Using Bayes and the MAP model
Tweet Classification using Excel
Step 1: Identify Class (Label)
Amazon (AMZ) Others (OTH)
* Change everything to lowercase * =LOWER(A2)
* Replace the punctuation marks . : ? ! ; and , with spaces * =SUBSTITUTE(B2,". "," "), =SUBSTITUTE(C2,": "," ") * =SUBSTITUTE(D2,"?"," "), =SUBSTITUTE(E2,"!"," ") * =SUBSTITUTE(F2,";"," "), =SUBSTITUTE(G2,","," ")
Step 2: Clean Data
Tweet | Classification (Label) | lowercase | remove . | remove : | remove ?
My J soya iced latte. I think Amazon Cafe makes the best | AMZ | =LOWER(B2) | =SUBSTITUTE(C2,". "," ") | =SUBSTITUTE(D2,": "," ") | =SUBSTITUTE(E2,"?"," ")
I am at Amazon Café | AMZ | =LOWER(B2) | =SUBSTITUTE(C2,". "," ") | =SUBSTITUTE(D2,": "," ") | =SUBSTITUTE(E2,"?"," ")
green tea time ❤🍵🍃 @ #cafeAmazon pic.twitter.com/28jzc4ojOy | AMZ | =LOWER(B2) | =SUBSTITUTE(C2,". "," ") | =SUBSTITUTE(D2,": "," ") | =SUBSTITUTE(E2,"?"," ")
#Coffee Instagram by @Tikkieinlove #Tuesday#relax time#icecoffee#espresso#cafeamazon#iphone6plus# | AMZ | =LOWER(B2) | =SUBSTITUTE(C2,". "," ") | =SUBSTITUTE(D2,": "," ") | =SUBSTITUTE(E2,"?"," ")
wBestSellers: #> Amazon #Deals: Save $62 (48% OFF) Mr. Coffee BVMC-EL1 Café | OTH | =LOWER(B2) | =SUBSTITUTE(C2,". "," ") | =SUBSTITUTE(D2,": "," ") | =SUBSTITUTE(E2,"?"," ")
#Amazon #Coffee Store©☕goo.gl/b8U4a1☕#BeanCoffee #Coffeemaker #grinder #cappuccino #hillsbros | OTH | =LOWER(B2) | =SUBSTITUTE(C2,". "," ") | =SUBSTITUTE(D2,": "," ") | =SUBSTITUTE(E2,"?"," ")
#> Amazon #Deals: Save $62 (48% OFF) Mr. Coffee BVMC-EL1 Caf... bit.ly/1GHYQfG | | OTH | =LOWER(B2) | =SUBSTITUTE(C2,". "," ") | =SUBSTITUTE(D2,": "," ") | =SUBSTITUTE(E2,"?"," ")
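For readers who prefer code to spreadsheet columns, the cleaning step can be sketched in Python. This is only an illustration of the LOWER and SUBSTITUTE chain, not the method used in the talk; the function name `clean_tweet` is a hypothetical helper, and the substitutions are applied in the same order as the formulas.

```python
def clean_tweet(text):
    """Lowercase a tweet, then replace punctuation with spaces (hypothetical helper)."""
    cleaned = text.lower()  # same role as =LOWER(...)
    # Same patterns and order as the SUBSTITUTE formulas: ". ", ": ", "?", "!", ";", ","
    for old in (". ", ": ", "?", "!", ";", ","):
        cleaned = cleaned.replace(old, " ")  # replaces every occurrence, like SUBSTITUTE
    return cleaned


sample = "My J soya iced latte. I think Amazon Cafe makes the best"
print(clean_tweet(sample))
# my j soya iced latte i think amazon cafe makes the best
```

Like the spreadsheet, this only strips a period or colon when it is followed by a space, so URLs such as bit.ly/1GHYQfG keep their dots.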
Step 3: Word Counting * Prepare Words from
First N rows: set space position = 0 =LEN(C2)
Step 3: Word Counting
3.1 Separate Words
After N Rows =IFERROR(MID(A2,B2+1,B102-B2-1),".") =LEN(C2)
Step 3: Word Counting
3.2 Check the result
=IFERROR(MID(A2,B2+1,B102-B2-1),".") =LEN(C2) =IFERROR(FIND(" ",A127,B27+1),LEN(A127)+1)
Step 3: Word Counting
3.3 Using Pivot Table and count each word
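The word-separation and counting steps can be sketched in Python as well. The sketch below walks through space positions the way the FIND and MID formulas do, then counts words with `collections.Counter` in place of the pivot table. The names `split_words` and `count_words` and the two sample strings (already cleaned tweets from the AMZ class) are assumptions made for illustration.

```python
from collections import Counter


def split_words(cleaned):
    """Slice words out of a cleaned tweet by walking through space positions."""
    words = []
    start = 0
    while start <= len(cleaned):
        pos = cleaned.find(" ", start)  # same role as FIND(" ", text, previous + 1)
        if pos == -1:
            pos = len(cleaned)  # no more spaces: stop at the end, like the IFERROR fallback
        word = cleaned[start:pos]  # same role as MID(text, previous + 1, length)
        if word:  # skip empty pieces left by double spaces
            words.append(word)
        start = pos + 1
    return words


def count_words(tweets):
    """Count every word across a list of cleaned tweets (the pivot-table step)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(split_words(tweet))
    return counts


amz_tweets = [
    "i am at amazon café",
    "my j soya iced latte i think amazon cafe makes the best",
]
counts = count_words(amz_tweets)
print(counts["amazon"])  # 2
```

In practice `cleaned.split()` gives the same words in one call; the explicit loop is kept here only because it mirrors the spreadsheet logic.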
Step 4: Tokenization
4 Using Pivot Table and count each word
• Add one to everything: =B4+1
• Calculate P(Token | Class): =C4/C$3
• Calculate Log(P): =LN(D4)
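A minimal Python sketch of these three calculations follows. It assumes that cell C$3 holds the sum of the add-one counts for the class, which the slides do not state explicitly; the function name `token_log_probs` and the toy counts are hypothetical.

```python
import math
from collections import Counter


def token_log_probs(counts):
    """Return (log-probability per token, smoothed total) for one class."""
    smoothed = {tok: n + 1 for tok, n in counts.items()}  # =B4+1
    total = sum(smoothed.values())  # assumed content of C$3
    log_probs = {tok: math.log(n / total) for tok, n in smoothed.items()}  # =LN(C4/C$3)
    return log_probs, total


# Toy counts, invented for illustration only
toy_counts = Counter({"amazon": 3, "latte": 1})
log_probs, total = token_log_probs(toy_counts)
print(total)  # 6
```

Working with logarithms turns the product of many small probabilities into a sum, which avoids numerical underflow when a tweet has many tokens.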
Step 5: Model Building
Amazon (AMZ) Others (OTH)
5 Building the Model
=IF(LEN(D2)<=3,0,IF(ISNA(VLOOKUP(D2,PropAMZ!$A$4:$E$386,5,FALSE)),LN(1/PropAMZ!$C$3),VLOOKUP(D2,PropAMZ!$A$4:$E$386,5,FALSE)))
=SUM(D14:AI14)
=IF(C14>C26,"AMZ","OTHER")
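The three formulas above can be read as: score each token, sum the scores per class, and pick the class with the larger sum. A hedged Python sketch of that reading is shown below. The function names and the two tiny models are invented for illustration; each model is a pair of (log-probability table, smoothed total), with the total standing in for PropAMZ!$C$3.

```python
import math


def token_score(token, log_probs, total):
    """Score one token against one class, following the IF/VLOOKUP formula."""
    if len(token) <= 3:  # =IF(LEN(D2)<=3,0,...): short tokens are ignored
        return 0.0
    if token not in log_probs:  # ISNA(VLOOKUP(...)): unseen token
        return math.log(1 / total)  # LN(1/total)
    return log_probs[token]


def classify(tokens, amz_model, oth_model):
    """Sum the log scores per class and pick the larger one."""
    amz = sum(token_score(t, *amz_model) for t in tokens)  # =SUM(D14:AI14)
    oth = sum(token_score(t, *oth_model) for t in tokens)
    return "AMZ" if amz > oth else "OTHER"  # =IF(C14>C26,"AMZ","OTHER")


# Hypothetical models: (log-probability table, smoothed total)
amz_model = ({"amazon": math.log(0.4), "latte": math.log(0.3), "café": math.log(0.3)}, 10)
oth_model = ({"amazon": math.log(0.4), "deals": math.log(0.3), "coffee": math.log(0.3)}, 10)

print(classify(["iced", "latte", "amazon"], amz_model, oth_model))  # AMZ
print(classify(["amazon", "deals"], amz_model, oth_model))  # OTHER
```

As far as the formulas shown go, the comparison has no prior term, which amounts to treating both classes as equally likely before any tokens are seen. Ties fall to "OTHER", as in the spreadsheet.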
Step 6: Using Model
Amazon (AMZ) Others (OTH)
6 Testing the Model
Summary
* Step 1: Identify Class (Label)
* Step 2: Data Cleaning
* Step 3: Word Counting
* Step 4: Tokenization
* Step 5: Model Building
* Step 6: Using Bayes and the MAP model
Other Solutions?
www.datascienceth.com - R
www.datascienceth.com - Python
www.datascienceth.com - RapidMiner