Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining
DESCRIPTION
Analyzing language usage on the internet with data mining, natural language processing, and text analytics, and the challenges ahead.
TRANSCRIPT
[?] June, 200[?]
Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining
Rohini K. Srihari
The State University of New York at Buffalo
Janya Inc.
Multilingual 2.0, Tucson, AZ
April 1[?], 2012
Language Usage on the Internet
Volume on Chinese microblogging site Weibo has surpassed Twitter on multiple occasions
Language Usage on the Internet
http://www.internetworldstats.com/stats7.htm
Trending Phrases/Meme Detection
Monitored Urdu news throughout 2011
Compared news from the days around the killing of Osama Bin Laden (May 1-[?]) with the articles from the year to date. Based on this comparison, extract significant new phrases.
Illustrates other news besides the Bin Laden killing, e.g. the killing of Farooq Baig, leader of the Muttahida Quami Movement.
Note: currently using only content analysis to detect trending phrases. True meme detection requires social network features as well.
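The comparison described above (a short window of news against the year to date) can be sketched roughly as follows. This is a minimal illustration, not the system used in the talk: the slide does not say which statistic was used, so the add-one-smoothed log ratio, the bigram unit, whitespace tokenization, and the function names `ngrams` and `trending_phrases` are all assumptions.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Split on whitespace and return all word n-grams (assumed tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def trending_phrases(foreground_docs, background_docs, n=2, min_count=2, top_k=5):
    """Rank phrases that are unusually frequent in the foreground window.

    foreground_docs: articles from the days of interest.
    background_docs: articles from the year to date.
    The score is an add-one-smoothed log ratio of relative frequencies
    (an assumed statistic, chosen for simplicity).
    """
    fg = Counter(p for doc in foreground_docs for p in ngrams(doc, n))
    bg = Counter(p for doc in background_docs for p in ngrams(doc, n))
    fg_total = sum(fg.values()) or 1
    bg_total = sum(bg.values()) or 1
    vocab = len(set(fg) | set(bg)) or 1

    scored = []
    for phrase, count in fg.items():
        if count < min_count:
            continue  # ignore phrases seen too rarely to be meaningful
        p_fg = (count + 1) / (fg_total + vocab)
        p_bg = (bg[phrase] + 1) / (bg_total + vocab)
        scored.append((math.log(p_fg / p_bg), phrase))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for _, phrase in scored[:top_k]]


if __name__ == "__main__":
    window = ["bin laden killed in raid", "raid on bin laden compound"]
    year_to_date = ["cricket match today", "the match was in lahore"]
    print(trending_phrases(window, year_to_date))
```

As the note on the slide says, a content-only score like this finds trending phrases; it does not capture how a phrase spreads between people, which is what meme detection would need.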
Factual/Topical Analysis of News
Analyzing Potential Bias in Media
Non-Topical text analysis: characterizations are sought of the opinions, feelings, and attitudes expressed in a text, rather than just of the topics the text is about
Facilitates analysis of how media reports events:
• Summarize how the press has shifted its attitude towards India over the past year
• Show how different regions (Sindh, Peshawar, Frontier territory) differ in their perception of the current administration
• What are the main issues being reported?
[anSary nE kha myry ray^E myN eamr shyl ayk bd dmaG awr Zdy XKS hyN]
[Ansari said, “according to me Aamir Sohail is one crazy and stubborn man”]
TARGET Attributes ID:t1
AGENT Attributes ID:a1, Nested-source: “w”, TargetID:t1
ATTITUDE Attributes ID:a1[?]1, AttitudeType: negative, Intensity: high, Negative-toward: t1, Positive-toward: null
EXPRESSIVE ELEMENT Attributes ID:ex1, TargetID:t1, Emotion:anger, Intensity:high, Nested-Source: “w”, a1, Polarity:negative
Non-Topical Analysis
Agent: Opinion holder
Target: Target of the opinion being expressed (a topic, a person, an organization, etc.)
Attitude: includes Expressive Element
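One way to hold the Agent, Target, and Expressive Element annotations in code is sketched below. The attribute names and values come from the Ansari example on the previous slide; the class layout, the helper `opinions_about`, the text spans "Ansari" and "Aamir Sohail", and the reading of the nested-source chain (last entry is the immediate opinion holder) are assumptions made for illustration, not the schema of the actual system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Target:
    id: str
    text: str  # assumed: the span the opinion is about


@dataclass
class Agent:
    id: str
    text: str  # assumed: the span naming the opinion holder
    nested_source: List[str]
    target_id: str


@dataclass
class ExpressiveElement:
    id: str
    target_id: str
    emotion: str
    intensity: str
    nested_source: List[str]  # e.g. ["w", "a1"]: the writer reporting agent a1
    polarity: str


def opinions_about(target_id: str,
                   agents: List[Agent],
                   expressions: List[ExpressiveElement]) -> List[Tuple[str, str, str]]:
    """Return (holder, emotion, polarity) for every expression aimed at target_id."""
    holder_names = {agent.id: agent.text for agent in agents}
    results = []
    for ex in expressions:
        if ex.target_id != target_id:
            continue
        # Assumption: the last element of the chain is the immediate holder.
        source_id = ex.nested_source[-1]
        results.append((holder_names.get(source_id, source_id), ex.emotion, ex.polarity))
    return results


# Values taken from the slide; text spans are an illustrative reading.
target = Target(id="t1", text="Aamir Sohail")
agent = Agent(id="a1", text="Ansari", nested_source=["w"], target_id="t1")
expression = ExpressiveElement(id="ex1", target_id="t1", emotion="anger",
                               intensity="high", nested_source=["w", "a1"],
                               polarity="negative")

if __name__ == "__main__":
    print(opinions_about("t1", [agent], [expression]))
```

Keeping IDs rather than object references mirrors the flat attribute lists on the slide and makes it straightforward to facet a search by holder, target, emotion, or polarity.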
www.janyainc.com
FACETED SEARCH: DRILL DOWN TO RELEVANT CONTENT/DATA
People are filled with anger and sorrow because of the policies made by Musharraf.
OPINION HOLDER – Writer, People
TARGET – Musharraf’s policies (Musharraf is an implied target)
Human Behavior Analysis
• Process social media content, provide tools for analysts to:
  • Identify social networks: groups, members
  • Identify topics of discussion and sentiment
    • E.g. angry at govt., wanting retaliation, peacemakers
  • Thought influencers
  • Identify social goals through analysis of verbal communication
    • Manipulation: persuasion, threats, coercion
    • Religious supremacy: religious analogues
    • Recruitment
Social Media Content
Link Diagrams
Predictive Modeling
Challenges
Can’t we just translate everything to English?
Google Translation:
However, Education Minister Mohammad Hanif half-dead recently in Afghanistan, [?] million children of school textbooks have been distributed to some hope.
Context Aware Translation (name translation output: Google translation augmented by Semantex™ translation of names):
“However, Education Minister Mohammad Haneef Atmar recently in Afghanistan, [?] million children of school textbooks have been distributed to some hope.”
Native language processing required for fine-grained analysis!
Arabic Dialects are not handled well in current machine translation systems.
Tools made for MSA (Modern Standard Arabic) fail on Arabic dialects.
COLABA enables MSA tools to interpret dialects correctly.
Human translation for all Arabic variants below is the same: “There is no electricity, what happened?”
Arabic Variant | Arabic Source Text | Google Translate
Egyptian | [illegible] | Atqtat electrical wires, Why are Posted?
Levantine | [illegible] | Cklo Mafeesh [illegible], Lech heck?
Iraqi | [illegible] | MACON electricity, good?
MSA | [illegible] | Does not have electricity, what happened?
Web as Corpus: Mining Wikipedia for Language Resources
• Translation lexicons automatically extracted from Chinese Wikipedia, use cross language links to add English translations
• Easy to regenerate with new versions of Wikipedia
• Chinese Wikipedia is constantly growing
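The cross-language-link idea in the bullets above can be sketched as follows. This is a toy illustration under stated assumptions: the record layout (a `title` plus a list of `langlinks` entries with `lang` and `title` keys) is hypothetical, the sample pages are illustrative data rather than results from the talk, and stripping a trailing parenthetical disambiguator is an added heuristic. No network access or real Wikipedia dump parsing is shown.

```python
import re


def build_lexicon(pages, target_lang="en"):
    """Map each source-language title to its linked title in target_lang.

    pages: iterable of dicts with a 'title' and a list of 'langlinks',
           each langlink being {'lang': ..., 'title': ...} (assumed layout).
    Pages without a link into target_lang are skipped.
    """
    lexicon = {}
    for page in pages:
        for link in page.get("langlinks", []):
            if link["lang"] != target_lang:
                continue
            # Heuristic: drop a trailing disambiguator such as " (planet)".
            translation = re.sub(r"\s*\(.*?\)$", "", link["title"])
            lexicon[page["title"]] = translation
    return lexicon


if __name__ == "__main__":
    sample_pages = [  # illustrative records only
        {"title": "北京市", "langlinks": [{"lang": "en", "title": "Beijing"},
                                          {"lang": "fr", "title": "Pékin"}]},
        {"title": "水星", "langlinks": [{"lang": "en", "title": "Mercury (planet)"}]},
        {"title": "无链接", "langlinks": []},
    ]
    print(build_lexicon(sample_pages))
```

Because the lexicon is derived entirely from the link records, rerunning the same function on a newer snapshot regenerates it, which is the property the second bullet points to.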
Code Mixing, Switching
• Use of Latin script: lack of transliteration standards makes it difficult to process
• Urdish, Spanglish, Hinglish etc.
Afsoos key baat hai . kal tak jo batain Non Muslim bhi kartay hoay dartay thay abhi this man has brought it out in the open.
[It is sad to see that those words that even a non-Muslim would fear to utter until yesterday, this man has brought it out in the open]
Solutions:
• Apply “romanized” POS tagger, English tagger in tandem: use machine learning to combine evidence and generate final tag, language ID
• For longer English spans, use English NLP system
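A much-simplified sketch of the second solution (routing longer English spans to an English NLP system) is shown below. It swaps the machine-learned combination of tagger evidence for a plain word-list lookup, so it is a stand-in rather than the approach on the slide. The word list `ENGLISH_WORDS`, the tags "en" and "ur", the threshold `min_len`, and both function names are assumptions for illustration.

```python
# Tiny hypothetical English word list; a real system would use a large lexicon
# or, as the slide suggests, learned evidence from two taggers.
ENGLISH_WORDS = {"non", "muslim", "this", "man", "has", "brought",
                 "it", "out", "in", "the", "open"}


def tag_languages(tokens, english_words=ENGLISH_WORDS):
    """Label each token 'en' if it is in the word list, otherwise 'ur'."""
    return ["en" if tok.lower().strip(".,!?") in english_words else "ur"
            for tok in tokens]


def english_spans(tokens, tags, min_len=4):
    """Return runs of at least min_len consecutive 'en' tokens as strings."""
    spans, start = [], None
    # A sentinel tag closes any run that reaches the end of the sentence.
    for i, tag in enumerate(tags + ["<end>"]):
        if tag == "en" and start is None:
            start = i
        elif tag != "en" and start is not None:
            if i - start >= min_len:
                spans.append(" ".join(tokens[start:i]))
            start = None
    return spans


if __name__ == "__main__":
    sentence = ("Afsoos key baat hai . kal tak jo batain Non Muslim bhi kartay "
                "hoay dartay thay abhi this man has brought it out in the open.")
    tokens = sentence.split()
    tags = tag_languages(tokens)
    print(english_spans(tokens, tags))
```

The romanized Urdu word "key" in the example is also an English word, so a larger English word list would mislabel it. Ambiguity of this kind is why combining evidence from both taggers, rather than a lookup alone, is the approach the slide recommends.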
Language Resource Acquisition
Less Commonly Taught Languages (LCTL)
• Yoruba, Russian, Swahili
• Dialects
Very few linguistic resources available
• electronic lexicons
• translation lexicons
• part-of-speech taggers, chunkers
• Typically, very expensive to produce these resources by hand
• The web provides a new opportunity to automatically acquire these resources: “web as corpus”
The Road Ahead?
Context, Context, Context!
blog.gts-translation.com