Data Science Weekend 2017. CleverDATA. Text mining of beauty blogs: what are they talking about...

Text Mining of Beauty Blogs: What Do Women Talk About? Artem Prosvetov, Data Scientist, CleverDATA

Upload: newprolab

Post on 22-Jan-2018


Text Mining of Beauty Blogs: What Do Women Talk About?

Artem Prosvetov, Data Scientist, CleverDATA

cleverdata.ru |[email protected]

Raw blog data

Raw data: 98,496 pages in the form of ~1,000,000 files.
Ready for analysis: 58,719 English pages (59.6%).

The remaining 40.4% of the data: empty pages and pages with errors, non-English pages (23,461), photo/video pages without text (2,315), and articles from techcrunch.com (3,402).

Pages → Authors

From 60k pages → ~2,000 authors.

Mean blog post size (in words)

One can distinguish 2 populations of bloggers:
• 'twitter style' authors with short posts (~20%)
• full-length bloggers with a mean of 200-500 words per post (~80%)
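The split into two populations could be computed roughly as follows. This is a minimal sketch: the slides give no cut-off between "short" and "full-length" posts, so `short_threshold=100` is a hypothetical value, and the function names and sample posts are illustrative only.

```python
from statistics import mean

def mean_post_size(posts):
    """Mean number of whitespace-separated words per post."""
    return mean(len(post.split()) for post in posts)

def classify_author(posts, short_threshold=100):
    # short_threshold is an assumed cut-off, not a value from the slides
    if mean_post_size(posts) < short_threshold:
        return "twitter-style"
    return "full-length"

# Hypothetical sample data: two authors with very different post lengths
posts_by_author = {
    "author_a": ["new lipstick, love it", "great serum today"],
    "author_b": ["word " * 300, "word " * 250],
}
labels = {author: classify_author(posts) for author, posts in posts_by_author.items()}
```

Here `labels` maps each author to one of the two populations named on the slide.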

Sentiment analysis

Used APIs and services:
- Sentity (https://sentity.io/)
- Twinword (https://www.twinword.com/)
- Textualinsights (http://www.textualinsights.com/)
- VivekN (https://github.com/vivekn/sentiment-web)

Sentiment analysis

• the resulting sentiment rate is based on 4 independent rating systems.
• the majority of the blogs have a positive emotion rate.
• the mean sentiment rate is 0.72 ("positive warm").
• all these results are intuitively consistent and are in good agreement with manual tests.
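The slides do not say how the 4 ratings are merged into one sentiment rate. One simple possibility is a plain average of scores that have already been rescaled to the range 0..1. The sketch below assumes exactly that; `combined_sentiment` and the per-service numbers are hypothetical.

```python
from statistics import mean

def combined_sentiment(scores):
    """Average several sentiment scores that are already on a 0..1 scale."""
    if not scores:
        raise ValueError("need at least one score")
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    return mean(scores)

# Hypothetical scores for one page from the four services
page_scores = {"sentity": 0.80, "twinword": 0.70, "textualinsights": 0.65, "vivekn": 0.75}
rate = combined_sentiment(list(page_scores.values()))
```

Averaging keeps the result on the same 0..1 scale, so it stays comparable with the 0.72 mean quoted above. A weighted average or a median would be equally plausible choices.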

Estimation of blog efficiency

We used a few traffic rank systems:
• Alexa Rank, which audits and publishes the frequency of visits to various websites.
• Yandex Thematic Citation Index (TIC), which determines the "credibility" of Internet resources based on a qualitative assessment of links to them from other sites.
• Google PageRank, which counts the number and quality of links to a blog to produce a rough estimate of how important the website is.

Relevance Rate

Content relevance rate is based on fuzzy string matching:
- Every company product name was string-matched against all blog pages.
- String matching is based on the Levenshtein metric.
- Pages with a 90% matching rate were marked.
- Tests with direct brand name matching showed about 90-100% accuracy for each product name, depending on the words in the title.
- The resulting relevance rate for each author is the sum of all marks on his/her pages.
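The marking procedure can be sketched as below. Note the substitution: to stay within the standard library, this sketch uses `difflib.SequenceMatcher` as a stand-in for a Levenshtein-based ratio, so the exact scores may differ from those in the talk. The sliding word window, the `>= 90` comparison, and the names `best_match` and `relevance_rate` are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def best_match(product, text):
    """Best similarity (0..100) between `product` and any word window of `text`."""
    words = text.lower().split()
    n = len(product.split())          # window has as many words as the product name
    target = product.lower()
    best = 0.0
    for i in range(max(1, len(words) - n + 1)):
        window = " ".join(words[i:i + n])
        best = max(best, 100 * SequenceMatcher(None, target, window).ratio())
    return best

def relevance_rate(product, pages, threshold=90):
    """Number of an author's pages marked as mentioning `product`."""
    return sum(1 for page in pages if best_match(product, page) >= threshold)

# Hypothetical pages of one author; the second contains a typo in the name
pages = [
    "I tried the BlaBlaBla Body Oil last week",
    "my review of blablabla body oill",
    "ten tips for a morning routine",
]
rate = relevance_rate("BlaBlaBla Body Oil", pages)
```

With these sample pages the exact mention and the misspelled one are both marked, while the unrelated page is not.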

Levenshtein distance

Levenshtein distance is a string metric for measuring the difference between two sequences.

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

Example: the Levenshtein-based similarity between 'beer' and 'bread' is 44/100 (the raw edit distance is 3).
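A minimal sketch of the distance itself, plus one normalization that yields a 0..100 score. The slides do not state which normalization was used. The `similarity` function below assumes the common variant in which a substitution counts as one deletion plus one insertion. Both function names are illustrative.

```python
def edit_distance(a, b, sub_cost=1):
    """Dynamic-programming edit distance; sub_cost=1 gives the Levenshtein distance."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                 # delete ca
                curr[j - 1] + 1,                             # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost), # substitute or keep
            ))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Similarity on a 0..100 scale (assumed normalization, substitution cost 2)."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * (total - edit_distance(a, b, sub_cost=2)) / total)

distance = edit_distance("beer", "bread")   # 3 single-character edits
score = similarity("beer", "bread")         # 44 on the 0..100 scale
```

The two numbers answer different questions: `distance` counts edits, while `score` is a length-normalized similarity of the kind a percentage threshold can be applied to.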

Sentiments vs Blog size

The most active authors write with a sentiment rate in a narrow range: 0.74 +/- 0.03

[Plot: Blog size (pages) on the vertical axis vs. Sentiment rate on the horizontal axis]

Discussion vs Blog size

The most discussed blogs belong to authors with mid-sized blogs.

[Plot: Mean discussion on the vertical axis vs. Log(Blog size) on the horizontal axis]

Words vs Pages

Again, 2 kinds of bloggers:
- 'twitter style' authors with short posts
- full-length bloggers

[Plot: Log(Blog size) on the vertical axis vs. Log(mean words per page) on the horizontal axis]

Discussion vs Sentiments

If you want to start a big discussion, you should praise something.

All highly discussed authors are sentiment-positive (>= 0.4).

[Plot: Mean discussion on the vertical axis vs. Sentiment rate on the horizontal axis]

Using the Klout score for bloggers

We use the Klout service to rank authors according to online social influence. Klout measures the size of a user's social media network and correlates the content created to measure how other users interact with that content.

- the median Klout score is 40.1

Sentiments vs Klout score

One can distinguish a population of beginner bloggers with a low Klout score who tend to amplify sentiments.

[Plot: Klout score on the vertical axis vs. Sentiment rate on the horizontal axis]

Final Author Rating is based on

• Amount of blog pages
• Mean discussion size
• Alexa Rank + Yandex TIC + Google PageRank
• Relevance rate
• Sentiment rate
• Klout score
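The slides list the inputs of the final rating but not the formula. Purely as an illustration, the sketch below rescales each input to 0..1 across authors and takes a weighted average. The min-max scaling, the equal default weights, the function names, and the sample numbers are all assumptions.

```python
def min_max(values):
    """Rescale a list of numbers to the range 0..1 (0.5 if all are equal)."""
    lo, hi = min(values), max(values)
    return [0.5 if hi == lo else (v - lo) / (hi - lo) for v in values]

def author_ratings(features, weights=None):
    """features: {feature name: [value per author]}; returns one score per author."""
    names = list(features)
    weights = weights or {name: 1.0 for name in names}   # assumed: equal weights
    scaled = {name: min_max(features[name]) for name in names}
    total_weight = sum(weights[name] for name in names)
    n_authors = len(features[names[0]])
    return [
        sum(weights[name] * scaled[name][i] for name in names) / total_weight
        for i in range(n_authors)
    ]

# Hypothetical values for three authors
features = {
    "pages": [10, 50, 30],
    "sentiment": [0.6, 0.8, 0.7],
    "klout": [20, 60, 40],
}
ratings = author_ratings(features)
```

Inputs where a lower value is better, such as a traffic rank position, would have to be inverted before being passed in.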

Make your data clever

Sentiment analysis
- 4 independent sentiment rating systems are combined
- the resulting sentiment rate is fully consistent with tests

Blog efficiency rating
- Alexa Rank
- Yandex Thematic Citation Index
- Google PageRank

Blog relevance rating
- based on fuzzy string matching
- blog rating according to mentions of company products in the text

Results
- list of the most PR-effective authors
- pragmatic statistical information
- key recommendations for bloggers

TOP Rated Authors

Name                              Url                                  Sentiment  Pages  Mean comments
Hayley Carr                       http://www.londonbeautyqueen.com     0.71       229    10.9
Luzanne                           http://pinkpeonies.co.za             0.77       66     68.3
Allison                           http://www.neversaydiebeauty.com     0.70       182    42.9
Mica Kelly, Beth, Jessica Diner   http://blog.birchbox.co.uk           0.74       196    0.26
Poonam                            http://beautyandmakeupmatters.com    0.78       142    4.3
Silvie                            http://mysillylittlegang.com         0.74       571    0.64

Testing the result

Hayley Carr (Top Rated Author): “BlaBlaBla is definitely a brand to be reckoned with... All of the BlaBlaBla products have multiple purposes, as well as smelling and feeling fabulous; the packaging is clean and fresh whilst still looking great in your bathroom, as well as having unique application methods that only aid the product performance... It's definitely worth checking out this growing brand, before it starts taking over the world.”

Authors ←→ Products

In order to associate a blogger with a product we must:

• Find products for promotion
• Find the main topics of each blogger
• Match the topics of each blogger with product names
• Find the best combinations of blogger and product

Finding the most promising products for promotion


Finding topics: the document-term matrix

Let's build a document-term matrix, where each row is a document, each column is a term, and the color intensity indicates that a term appears in a document at least once.

We can use the TF-IDF method to get the document-term matrix.

Finding topics: TF-IDF

• Term frequency TF(t, d) is the number of times that term t occurs in document d.
• The inverse document frequency (IDF) is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.
• Term frequency-inverse document frequency (TF-IDF) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
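As an illustration of the TF and IDF definitions, here is a minimal pure-Python sketch that builds a TF-IDF document-term matrix. The slides do not give the exact IDF formula. The common form log(N / df) is assumed here, and `tfidf_matrix` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}  # assumed IDF form
    matrix = []
    for doc in tokenized:
        tf = Counter(doc)                       # raw term counts TF(t, d)
        matrix.append([tf[term] * idf[term] for term in vocab])
    return vocab, matrix

docs = ["face oil review", "body oil review", "lipstick swatches"]
vocab, D = tfidf_matrix(docs)
```

Terms that occur in few documents ("lipstick") get a larger weight than terms shared by several documents ("oil", "review"), which is the effect the IDF bullet describes.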

Topic extraction: NMF

• NMF is a variant of matrix factorization where we start with the document-term matrix D, factorize it as D ≈ W T, and constrain the elements of W and T to be non-negative.
• This lets us interpret each row of the T matrix as a topic.
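A minimal pure-Python sketch of NMF using multiplicative update rules, one common way to compute such a factorization. The slides do not say which solver was used. The function names, the iteration count, the random initialization, and the toy matrix are illustrative assumptions.

```python
import random

def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(D, k, iters=500, seed=0, eps=1e-9):
    """Approximate non-negative D (n x m) by W (n x k) times T (k x m)."""
    rng = random.Random(seed)
    n, m = len(D), len(D[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    T = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # T <- T * (W^T D) / (W^T W T), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, D), matmul(matmul(Wt, W), T)
        T = [[T[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (D T^T) / (W T T^T), element-wise
        Tt = transpose(T)
        num, den = matmul(D, Tt), matmul(W, matmul(T, Tt))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return W, T

def frobenius_error(D, W, T):
    approx = matmul(W, T)
    return sum((d - a) ** 2 for rd, ra in zip(D, approx) for d, a in zip(rd, ra)) ** 0.5

# Toy document-term matrix with two obvious groups of terms
D = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 1], [0, 0, 3, 3]]
W, T = nmf(D, k=2)
error = frobenius_error(D, W, T)
```

Because the updates only multiply by non-negative ratios, W and T stay non-negative throughout. Each row of `T` then weights the terms of one topic, and each row of `W` says how strongly a document uses each topic.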


Topic extraction

• For each author we build a document-term matrix.
• For each document-term matrix we perform matrix factorization and find the main topics.
• For each product we match the product name with the author's main topics and find the rate of intensity.
• If an author has the exact product name in one of his/her titles, we set the rate of intensity to 0 (the author has already reviewed the product).
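The slides do not define how the rate of intensity is computed. The sketch below is one hypothetical reading: sum the topic weights of every topic term that also appears in the product name, and apply the zero rule for products the author has already reviewed. The exact-word matching, the summation, and the `intensity` name are assumptions.

```python
def intensity(product, topics, titles):
    """Rate of intensity of one author for one product (illustrative definition).

    topics: list of {term: weight} dicts, e.g. the strongest terms of each topic row.
    titles: the author's post titles.
    """
    name = product.lower()
    if any(name in title.lower() for title in titles):
        return 0.0                      # already reviewed: rule from the slide
    words = set(name.split())
    # assumed scoring: add up the weights of topic terms found in the product name
    return sum(weight for topic in topics for term, weight in topic.items() if term in words)

# Hypothetical topics of one author
topics = [{"oil": 0.5, "face": 0.25}, {"lipstick": 0.9}]
new_product = intensity("BlaBlaBla Face Oil", topics, ["My morning routine"])
reviewed = intensity("BlaBlaBla Face Oil", topics, ["BlaBlaBla Face Oil review"])
```

A fuzzy comparison of terms, as used for the relevance rate, could replace the exact word test here.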

The intensity matrix

Thus for each author-product pair we find the rate of intensity, and we can visualize it as a heat map where products are sorted by mean rate of intensity and authors are sorted by author rating.

Note: the top-rated authors show high intensity on the matrix.


The intensity matrix

Next we extract the strongest resonance peaks from the product-author intensity matrix. After each peak is extracted, the column containing that peak is dropped, so each author gets only one product.

We need to build recommendations for only 4 products, and we can select the 40 best-rated authors for this task.
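The peak extraction can be sketched as a greedy loop over the matrix, with rows as products and columns as authors. Dropping the author column after each peak follows the slide. The optional `drop_product` flag, which also retires the product so that it is paired only once, is an assumption, as are the function name and the sample numbers.

```python
def extract_peaks(intensity, products, authors, n_peaks, drop_product=False):
    """Greedily pick the largest remaining cell, then drop that author's column."""
    free_products = list(range(len(products)))
    free_authors = list(range(len(authors)))
    pairs = []
    while len(pairs) < n_peaks and free_products and free_authors:
        p, a = max(
            ((p, a) for p in free_products for a in free_authors),
            key=lambda cell: intensity[cell[0]][cell[1]],
        )
        pairs.append((products[p], authors[a], intensity[p][a]))
        free_authors.remove(a)          # each author gets only one product
        if drop_product:
            free_products.remove(p)     # assumed option: one author per product
    return pairs

# Hypothetical intensity matrix: 2 products (rows) x 3 authors (columns)
products = ["Body Oil", "Face Serum"]
authors = ["A", "B", "C"]
intensity = [[0.9, 0.8, 0.1],
             [0.7, 0.2, 0.3]]
by_author = extract_peaks(intensity, products, authors, n_peaks=2)
one_to_one = extract_peaks(intensity, products, authors, n_peaks=2, drop_product=True)
```

Without `drop_product`, a strong product can be recommended to several authors. With it, each product is matched to a single author.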

In order to associate a blogger with a product we must:

• Find products for promotion
• Find the main topics of each blogger
• Match the topics of each blogger with product names
• Find the best combinations of blogger and product
• Profit!

The resulting associations

Product                     Author                  Url
BlaBlaBla Body Oil          Allison                 http://www.neversaydiebeauty.com
BlaBlaBla Wrinkle Repair    Cindy Batchelor         http://mystylespot.net
BlaBlaBla Face Serum        Marie Papachatzis       http://iamthemakeupjunkie.blogspot.ru
BlaBlaBla Face Oil          Emily - StyleLobster    http://stylelobster.com