Data Science Weekend 2017. CleverDATA. Text mining of beauty blogs: what do they talk about...
TRANSCRIPT
Text Mining of Beauty Blogs: What Do Women Talk About?
Artem Prosvetov, Data Scientist, CleverDATA
cleverdata.ru |[email protected]
Raw blog data
Raw data: 98,496 pages in the form of ~1,000,000 files. Ready for analysis: 58,719 English pages (59.6%).
The remaining 40.4% of the data: empty pages and pages with errors, non-English pages (23,461), photo/video pages without text (2,315), articles from techcrunch.com (3,402).
Mean blog post size (in words)
One can distinguish 2 populations of bloggers:
• 'Twitter-style' authors with short posts (~20%)
• full-length bloggers with 200-500 mean words per post (~80%)
Sentiment analysis
Used APIs and services:
- Sentity (https://sentity.io/)
- Twinword (https://www.twinword.com/)
- Textualinsights (http://www.textualinsights.com/)
- VivekN (https://github.com/vivekn/sentiment-web)
Sentiment analysis
• The resulting sentiment rate is based on 4 independent rating systems.
• The majority of the blogs have a positive emotion rate.
• The mean sentiment rate is «positive warm»: 0.72.
• All these results are intuitively consistent and agree well with manual tests.
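A minimal sketch of how scores from several independent systems might be merged into one rate. The slides do not say how the four systems are combined, so the plain average, the [0, 1] scale, the function name `combine_sentiment`, and the sample scores are all assumptions for illustration.

```python
from statistics import mean

def combine_sentiment(scores):
    """Average per-system sentiment scores (each assumed to lie in [0, 1])."""
    if not scores:
        raise ValueError("at least one score is required")
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
    return mean(scores)

# Hypothetical scores for one blog page from four rating systems.
page_scores = [0.8, 0.7, 0.6, 0.9]
rate = combine_sentiment(page_scores)
```

A median or a weighted average would be equally plausible choices if one of the systems is known to be noisier than the others.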
Estimation of blog efficiency
We used a few traffic rank systems:
• Alexa Rank, which audits and publishes the frequency of visits to various websites.
• Yandex Thematic Citation Index (TIC), which determines the "credibility" of Internet resources based on a qualitative assessment of the links pointing to them from other sites.
• Google PageRank, which counts the number and quality of links to a blog to produce a rough estimate of how important the website is.
Relevance rate
Content relevance rate is based on fuzzy string matching:
- Every company product name was string-matched against all blog pages.
- String matching is based on the Levenshtein metric.
- Pages with a 90% matching rate were marked.
- Tests with direct brand name matching showed about 90-100% accuracy for each product name, depending on the words in the title.
- The resulting relevance rate for each author is the sum of all marks on his/her pages.
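The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the authors' code: the sliding word window, the normalization `1 - distance / max_length`, and the names `match_rate` and `relevance_by_author` are all hypothetical, and the sample pages are invented.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def match_rate(name, text):
    """Best similarity (0..1) between `name` and any word window of `text`."""
    name_words = name.lower().split()
    if not name_words:
        return 0.0
    words = text.lower().split()
    n = len(name_words)
    target = " ".join(name_words)
    best = 0.0
    for i in range(max(len(words) - n + 1, 0)):
        window = " ".join(words[i:i + n])
        dist = levenshtein(target, window)
        best = max(best, 1.0 - dist / max(len(target), len(window)))
    return best

def relevance_by_author(pages, product_names, threshold=0.9):
    """Sum, per author, the (page, product) pairs that reach the threshold."""
    totals = {}
    for author, text in pages:
        marks = sum(match_rate(p, text) >= threshold for p in product_names)
        totals[author] = totals.get(author, 0) + marks
    return totals

# Invented sample data; note the misspelling "Oill" on the first page.
pages = [
    ("alice", "I tried the BlaBlaBla Body Oill last week and loved it"),
    ("alice", "A post about travel and coffee"),
    ("bob", "My review of blablabla face serum is finally here"),
]
products = ["BlaBlaBla Body Oil", "BlaBlaBla Face Serum"]
scores = relevance_by_author(pages, products)
```

The misspelled mention still counts because one edit in a 19-character window leaves the similarity at about 0.95, which is above the 90% threshold.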
Levenshtein distance
Levenshtein distance is a string metric for measuring the difference between two sequences.
Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
For example, the Levenshtein-based similarity score between 'beer' and 'bread' is 44/100; the edit distance itself is 3.
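The definition can be checked with a short sketch. The function names `edit_distance` and `similarity` are illustrative, and the normalization is an assumption: counting a substitution as cost 2 and scaling the result to 0..100 is one common way for fuzzy-matching tools to turn edit distance into a score, and it happens to reproduce the 44/100 figure quoted for 'beer' and 'bread'.

```python
def edit_distance(a, b, sub_cost=1):
    """Edit distance with unit insert/delete cost and configurable substitution cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Similarity on a 0..100 scale (assumed normalization, substitution cost 2)."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * (total - edit_distance(a, b, sub_cost=2)) / total)

distance = edit_distance("beer", "bread")  # classic Levenshtein distance
score = similarity("beer", "bread")        # normalized similarity score
```

Here `distance` is 3 (for instance: insert 'r', change 'e' to 'a', change 'r' to 'd'), while `score` is 44, since (9 - 5) / 9 is about 0.44.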
Sentiments vs Blog size
The most active authors write with a sentiment rate in a narrow range: 0.74 +/- 0.03
[Figure: x-axis: Sentiment rate; y-axis: Blog size (pages)]
Discussion vs Blog size
The most discussed blogs belong to mid-size authors.
[Figure: x-axis: Log(Blog size); y-axis: Mean discussion]
Words vs Pages
Again, 2 kinds of bloggers:
- 'Twitter-style' authors with short posts
- full-length bloggers
[Figure: x-axis: Log(mean words per page); y-axis: Log(Blog size)]
Discussion vs Sentiments
If you want to start a big discussion, you should praise something.
All highly discussed authors are sentiment-positive (>= 0.4).
[Figure: x-axis: Sentiment rate; y-axis: Mean discussion]
Using Klout score for bloggers
We use the Klout service to rank authors according to online social influence. Klout measures the size of a user's social media network and correlates the content created to measure how other users interact with that content.
- The median Klout score is 40.1
Sentiments vs Klout score
One can distinguish a population of beginner bloggers with low Klout scores who tend to amplify sentiments.
[Figure: x-axis: Sentiment rate; y-axis: Klout score]
Final author rating is based on:
• Number of blog pages
• Mean discussion size
• Alexa Rank + Yandex TIC + Google PageRank
• Relevance rate
• Sentiment rate
• Klout score
Make your data clever
Sentiment analysis
- 4 independent sentiment rating systems are combined
- the resulting sentiment rate is fully consistent with tests
Blog efficiency rating
- Alexa Rank
- Yandex Thematic Citation Index
- Google PageRank
Blog relevance rating
- based on fuzzy string matching
- blog rating in accordance with mentions of company products in text
List of the most PR-effective authors
Pragmatic statistical information
Key recommendations for bloggers
TOP rated authors
Name                             URL                                 Sentiment  Pages  Mean comments
Hayley Carr                      http://www.londonbeautyqueen.com    0.71       229    10.9
Luzanne                          http://pinkpeonies.co.za            0.77       66     68.3
Allison                          http://www.neversaydiebeauty.com    0.70       182    42.9
Mica Kelly, Beth, Jessica Diner  http://blog.birchbox.co.uk          0.74       196    0.26
Poonam                           http://beautyandmakeupmatters.com   0.78       142    4.3
Silvie                           http://mysillylittlegang.com        0.74       571    0.64
Testing the result
Hayley Carr (Top Rated Author): “BlaBlaBla is definitely a brand to be reckoned with... All of the BlaBlaBla products have multiple purposes, as well as smelling and feeling fabulous; the packaging is clean and fresh whilst still looking great in your bathroom, as well as having unique application methods that only aid the product performance... It's definitely worth checking out this growing brand, before it starts taking over the world.”
In order to associate a blogger with a product we must:
• Find products for promotion
• Find the main topics of each blogger
• Match the topics of each blogger with product names
• Find the best combinations of blogger and product
Finding topics: the document-term matrix
Let's build a document-term matrix, where each row is a document, each column is a term, and the color intensity indicates that a term appears in a document at least once.
We can use the TF-IDF method to get the document-term matrix.
Finding topics: TF-IDF
• Term frequency TF(t, d) is the number of times that term t occurs in document d.
• The inverse document frequency (IDF) is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.
• Term frequency–inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
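A small pure-Python sketch of these definitions, building a TF-IDF weighted document-term matrix. The unsmoothed `log(N / df)` form of IDF, whitespace tokenization, the function name `tfidf_matrix`, and the toy documents are assumptions; the slides do not say which variant or library was used.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix) where matrix[d][t] = TF(t, d) * IDF(t)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Plain IDF variant log(N / df); libraries often use smoothed versions.
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    matrix = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw counts; missing terms count as 0
        matrix.append([tf[term] * idf[term] for term in vocab])
    return vocab, matrix

# Toy corpus: each string is one document (row of the matrix).
docs = [
    "face oil for dry skin",
    "body oil for summer",
    "lipstick swatches and lipstick review",
]
vocab, matrix = tfidf_matrix(docs)
```

In this toy corpus "lipstick" gets a high weight in the third document (it is frequent there and absent elsewhere), while "oil" and "for" are down-weighted because they occur in two of the three documents.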
Topic extraction: NMF
• NMF (non-negative matrix factorization) is a variant of matrix factorization where we start with the document-term matrix D, approximate it as a product D ≈ W·T, and constrain the elements of W and T to be non-negative.
• Let us interpret each row of the T matrix as a topic.
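To make the factorization concrete, here is a minimal pure-Python sketch using multiplicative update rules, a standard way to fit NMF. The choice of solver, the iteration count, the random initialization, and the toy matrix are assumptions; the slides do not state how the factorization was computed.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(d, w, t):
    """Squared Frobenius norm of D - W*T."""
    approx = matmul(w, t)
    return sum((x - y) ** 2 for rd, ra in zip(d, approx) for x, y in zip(rd, ra))

def nmf(d, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative D (docs x terms) into W (docs x topics), T (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(d), len(d[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    t = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # T <- T * (W^T D) / (W^T W T), element-wise
        wt = transpose(w)
        num, den = matmul(wt, d), matmul(matmul(wt, w), t)
        t = [[t[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_terms)]
             for i in range(n_topics)]
        # W <- W * (D T^T) / (W T T^T), element-wise
        tt = transpose(t)
        num, den = matmul(d, tt), matmul(w, matmul(t, tt))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_docs)]
    return w, t

# Toy document-term matrix: rows are documents, columns are terms
# (hypothetical vocabulary: oil, skin, lipstick, colour).
D = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 3],
]
W, T = nmf(D, n_topics=2)
```

Each row of `T` holds non-negative term weights for one topic, and each row of `W` says how strongly a document expresses each topic. Because the updates only multiply by non-negative ratios, both factors stay non-negative throughout.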
Topic extraction
• For each author we build a document-term matrix.
• For each document-term matrix we perform matrix factorization and find the main topics.
• For each product we match the product name with the main topics of the author and find the rate of intensity.
• If an author has the exact product name in one of his/her titles, we set the rate of intensity to 0 (the author has already reviewed the product).
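A rough sketch of the last two steps. The slides do not define how the rate of intensity is computed, so the scoring rule here (the best topic by total weight of terms shared with the product name), the function name `intensity`, and the sample topics are assumptions made only to show the shape of the computation.

```python
def intensity(product_name, topics, titles):
    """Score how strongly an author's topics overlap with a product name.

    topics: list of {term: weight} dicts, one per topic (e.g. rows of T).
    titles: the author's post titles.
    """
    name = product_name.lower()
    # The author already reviewed the product: exclude the pair.
    if any(name in title.lower() for title in titles):
        return 0.0
    words = set(name.split())
    # Assumed scoring: best topic by total weight of terms shared with the name.
    return max(
        (sum(w for term, w in topic.items() if term in words) for topic in topics),
        default=0.0,
    )

# Hypothetical topics for one author.
topics = [
    {"oil": 0.9, "skin": 0.6, "body": 0.4},
    {"lipstick": 1.1, "colour": 0.7},
]
new_pair = intensity("BlaBlaBla Body Oil", topics, ["My summer skincare routine"])
reviewed = intensity("BlaBlaBla Body Oil", topics, ["Review: BlaBlaBla Body Oil"])
```

With these inputs `new_pair` is about 1.3 (from "oil" and "body" in the first topic), and `reviewed` is 0 because the product name already appears in a title.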
The intensity matrix
Thus for each author-product pair we find a rate of intensity, and we can visualize it as a heat map where products are sorted by mean rate of intensity and authors are sorted by author rating.
Note: the top-rated authors show high intensity on the matrix.
The intensity matrix
Next we extract the highest peaks from the product-author intensity matrix. After each peak extraction the column containing the peak is dropped, so each author gets only one product.
We need to build recommendations for only 4 products, and we can select the 40 best-rated authors for this task.
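The peak extraction can be sketched as a greedy loop. This follows the description literally (only the author column is dropped after each peak); the function name `extract_peaks`, the stopping rule, and the toy matrix are assumptions.

```python
def extract_peaks(matrix, products, authors, n_peaks):
    """Greedy peak picking on a product x author intensity matrix.

    Repeatedly takes the largest remaining value and drops that author's
    column, so each author is paired with at most one product.
    """
    remaining = list(range(len(authors)))  # author columns still available
    pairs = []
    while remaining and len(pairs) < n_peaks:
        value, row, col = max(
            (matrix[r][c], r, c) for r in range(len(products)) for c in remaining
        )
        if value <= 0:
            break  # nothing meaningful left to assign
        pairs.append((products[row], authors[col], value))
        remaining.remove(col)
    return pairs

# Toy matrix: rows are products, columns are authors (all names hypothetical).
products = ["product_1", "product_2"]
authors = ["author_a", "author_b", "author_c"]
matrix = [
    [0.9, 0.2, 0.4],
    [0.8, 0.7, 0.1],
]
pairs = extract_peaks(matrix, products, authors, n_peaks=2)
```

Here the first peak pairs `product_1` with `author_a`, and after that column is dropped the next peak pairs `product_2` with `author_b`. Since only columns are dropped, a product can be picked more than once when more peaks are requested; also dropping the product row would give exactly one author per product.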
In order to associate a blogger with a product we must:
• Find products for promotion
• Find the main topics of each blogger
• Match the topics of each blogger with product names
• Find the best combinations of blogger and product
• Profit!
The resulting associations
Product                    Author                 URL
BlaBlaBla Body Oil         Allison                http://www.neversaydiebeauty.com
BlaBlaBla Wrinkle Repair   Cindy Batchelor        http://mystylespot.net
BlaBlaBla Face Serum       Marie Papachatzis      http://iamthemakeupjunkie.blogspot.ru
BlaBlaBla Face Oil         Emily - Style Lobster  http://stylelobster.com