

Finding Contextual Relationships between Fashion Houses

Madiha Mubin
Department of Computer Science
Stanford University
[email protected]

Sushant Shankar
Department of Computer Science
Stanford University
[email protected]

Abstract

Understanding how companies and products are compared online can provide insights for analyzing a market. This paper proposes a method to bootstrap entities (‘players’) in a market and their relationship to each other. Our corpus was high quality fashion blogs on eyewear. Our robust system takes the dataset and discovers competitive associations between different fashion houses and returns weights for each association and phrases that describe them. This is done in three steps: a) looking for patterns in sentences that are of a competitive nature, b) using Parts of Speech tagging and Named Entity Recognizer to extract singleton and paired entities, and c) traversing the Typed Dependency Graph to extract contextual phrases that describe the competitive association. Out of the top 10 fashion houses that we extracted, all the relations we caught were actual competitive relationships between fashion houses. Of all long contextual phrases that we extracted, 92.1% were accurate and informative.

1 Introduction

Comparative analysis is an important component of natural language understanding. While it makes intuitive sense to compare similar products or competing companies, defining the similarity is tricky. For instance, Apple Inc. and Google Inc. are compared in the mobile industry, but if Google Inc. and Facebook are compared, then it is more likely a comparison based on their social networking products. In this paper, our goal is to extract information from online blogs and articles to (i) find pairs of competitive companies in the Fashion industry and (ii) to extract context to provide meaning to those pairs (Figure 1). The notion of context can be extended to geographical location, various demographics, markets, and even descriptors like ‘launched a new line of prescription sunglasses’. For instance, from the sentence ‘In recent news, Gucci’s rival Prada launched a new line of retro sunglasses’, we want to extract [‘Gucci’, ‘Prada’, ‘launched a new line of retro sunglasses’].

Figure 1: Shows different types of context (e.g. demographic and geographic) that can be associated between two companies about a certain product.

In order to narrow the scope of the problem, we focused on eyewear; however, our method can generalize to other products from the fashion world.

2 Related Work

Recently, a weakly supervised method was proposed to learn comparative questions and extract comparators simultaneously (Li et al., 2010). The process begins with identifying an Indicative Extraction Pattern (IEP): a sequence that can be used to identify other comparative questions. From this initial seed, more comparative pairs are identified. For each comparator pair extracted, all the questions containing that pair are identified, which allows recognition of more comparator pairs. The comparative questions and patterns are scored for reliability, and those deemed reliable are stored for use as seeds to improve performance over time. Comparator patterns are generated using language rules that account for lexical, generalized, and specialized patterns. The reliability of each pattern at every iteration is computed as a weighted average of the pattern's performance so far, in terms of the number of questions extracted by the pattern, and a look-ahead reliability score that helps reduce the problem of underestimation due to incomplete knowledge at a given iteration.

3 Methodology

In this paper, we augmented the comparator extraction logic in three ways. First, we expanded our extraction process to be able to deal with regular sentences as well as questions that exhibited a competitive or comparative nature. Second, we used both entities in each pair that we extracted from comparative sentences to extract more information about them to inform our pipeline. Finally, using singletons and pairs of entities, we devised an algorithm to extract context from the sentences containing those entities.

For this project, we were able to accumulate a set of approximately 300 fashion-related articles by manually searching the web. In fact, the main rationale behind using fashion blogs was to find articles that explicitly talked about fashion house rivalries. One such example is http://www.christian-dior-glasses.com/articles/. These articles were blogs and reports of trends and changes in the fashion industry pertaining to eyewear from the year 2010 to the current time. Using this data source not only allowed us to extract good quality entity pairs, but also helped us validate our content extraction pipeline, as it was extracting temporally relevant information.

3.1 Pattern Generation for extraction

We focused on relations between fashion houses of a competitive nature. We looked for paragraphs in our corpus that had words with prefixes ‘rival’ or ‘compet’. We considered just looking for sentences which contained these words, but felt that the sentences around these could also potentially contain relations and contexts of a competitive nature.
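As a minimal sketch of this seed-prefix filter (our illustration, not the authors' code; the function and variable names are made up), the match can be written as a single regular expression applied per paragraph:

```python
import re

# A paragraph is flagged if any word starts with one of the seed prefixes
# ('rival', 'compet'), matching e.g. 'rivalry', 'competitor', 'competes'.
SEED_PATTERN = re.compile(r"\b(?:rival|compet)\w*", re.IGNORECASE)

def competitive_paragraphs(paragraphs):
    """Return the paragraphs likely to contain competitive relations."""
    return [p for p in paragraphs if SEED_PATTERN.search(p)]

paras = [
    "Gucci's rival Prada launched a new line of retro sunglasses.",
    "The weather in Milan was sunny during fashion week.",
    "Marchon Eyewear competes with several European houses.",
]
print(competitive_paragraphs(paras))  # keeps the first and third paragraph
```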

3.2 Entity Pair Extraction

Figure 2: Shows our pipeline starting from high-quality relation-rich blogs to extract patterns and using those as seeds to scrape more data which might not contain as much of the desired content.

3.2.1 Relation-Rich Tier (Pairs)

We start with our dataset to extract reliable patterns of the type (CompanyX, CompanyY, [Context]) that exhibit competitive behavior.

Once we identified our target paragraphs, we used the Stanford CoreNLP parser to extract Parts-of-Speech tags, Named Entity Recognition information, and collapsed typed dependency tree structures for each sentence (Toutanova, 2003; de Marneffe, 2006). Our entity extraction method simply looked for POS NNP entities that had NER tags ‘ORGANIZATION’ or ‘PERSON’. Many of the entities we caught were fashion houses, but not all. For instance, we caught entities like ‘Tom Cruise’ from the sentence ‘Tom Cruise is a fan of Ray-Ban sunglasses’ or ‘Apple’ from the sentence ‘Apple was another attendee in an event with a fashion house’.
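The extraction rule itself is simple enough to sketch. The toy function below (our illustration, not the paper's implementation) assumes the CoreNLP output has already been flattened into (word, POS, NER) triples, and merges consecutive qualifying NNP tokens into one entity:

```python
def extract_entities(tagged_tokens):
    """Collect maximal runs of NNP tokens tagged ORGANIZATION or PERSON.

    tagged_tokens: list of (word, pos, ner) triples standing in for
    CoreNLP output; consecutive qualifying tokens form one entity.
    """
    entities, current = [], []
    for word, pos, ner in tagged_tokens:
        if pos == "NNP" and ner in ("ORGANIZATION", "PERSON"):
            current.append(word)
        else:
            if current:
                entities.append(" ".join(current))
                current = []
    if current:
        entities.append(" ".join(current))
    return entities

# 'Tom Cruise is a fan of Ray-Ban sunglasses'
tokens = [
    ("Tom", "NNP", "PERSON"), ("Cruise", "NNP", "PERSON"),
    ("is", "VBZ", "O"), ("a", "DT", "O"), ("fan", "NN", "O"),
    ("of", "IN", "O"), ("Ray-Ban", "NNP", "ORGANIZATION"),
    ("sunglasses", "NNS", "O"),
]
print(extract_entities(tokens))  # ['Tom Cruise', 'Ray-Ban']
```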

We noticed a common style in our dataset. Often after the entity, a sentence contained further specification of the line or type of product that was being compared or talked about, for instance, ‘Prada sunglasses’. To extract this, we looked for POS NNS tags after any of the entities found. For instance, one of our patterns looked like: NNP (ORG) – NNS* – NNP (ORG). As we were looking only at fashion eyewear sites, these NNS tags tended to denote ‘eyeglasses’ or ‘glasses’.
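A hedged sketch of this product-specifier step, again over pre-tagged (word, POS, NER) triples rather than live CoreNLP output (the function name is ours):

```python
def entity_with_product(tagged_tokens):
    """After each ORG entity, collect trailing NNS tokens as the product
    specifier, e.g. ('Prada', 'sunglasses').

    tagged_tokens: list of (word, pos, ner) triples.
    """
    results, i = [], 0
    while i < len(tagged_tokens):
        word, pos, ner = tagged_tokens[i]
        if pos == "NNP" and ner == "ORGANIZATION":
            j, product = i + 1, []
            while j < len(tagged_tokens) and tagged_tokens[j][1] == "NNS":
                product.append(tagged_tokens[j][0])
                j += 1
            results.append((word, " ".join(product) or None))
            i = j
        else:
            i += 1
    return results

tokens = [("Prada", "NNP", "ORGANIZATION"), ("sunglasses", "NNS", "O"),
          ("sell", "VBP", "O"), ("well", "RB", "O")]
print(entity_with_product(tokens))  # [('Prada', 'sunglasses')]
```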

We hypothesized that for sentences containing more than one entity, all of the possible entities could be compared or grouped in some way with each other. Therefore, sentences with more than a pair of entities gave us more entities to compare. For instance, if we saw three entities (x, y, z), then we generated relations (x, y), (x, z), and (y, z); i.e., we generated all possible pairs from a list of entities. For example, from the sentence ‘Additionally, another Christian Dior glasses competitor Marchon Eyewear renewed its agreement with NIKE.’, we generated three relations: (‘Christian Dior’, ‘Marchon Eyewear’), (‘Christian Dior’, ‘Nike’), and (‘Marchon Eyewear’, ‘Nike’).
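This pair-generation step is exactly unordered two-element combinations; in Python it reduces to `itertools.combinations`:

```python
from itertools import combinations

def entity_pairs(entities):
    """All unordered pairs from the entities found in one sentence."""
    return list(combinations(entities, 2))

print(entity_pairs(["Christian Dior", "Marchon Eyewear", "Nike"]))
# [('Christian Dior', 'Marchon Eyewear'), ('Christian Dior', 'Nike'),
#  ('Marchon Eyewear', 'Nike')]
```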

3.2.2 Relation-Poor Tier (Singletons)

Most sentences that talk about a fashion house or new line will not contain its rival or competition. However, if more is known about the context of that rival or competitor in isolation, it can help us understand the context in which it is being compared to another company/line. When there is no more than one entity in a sentence, we call that entity a singleton; the sentence could still have some contextual information about it.

3.3 Context Extraction

The context extraction method described in this paper relies on the key assumption that sentences containing multiple entities as well as context typically introduce the entities in the first part of the sentence and the context in the second part. Consider the sentence: ‘This month Christian Dior glasses competitor Marchon released a line aiming for the teen market’. Here the entities identified are ‘Christian Dior’ and ‘Marchon’, and the product being compared is ‘glasses’. As depicted in Figure 3, the entities are introduced in the left subtree of the typed dependency graph and the context is within the right subtree. This pattern was identified by manually analyzing a sample of sentences and by understanding the style of documentation of articles in our corpus.

Algorithm 1: Extracting Context From Sentence S

GetContext(E: entities in S, C: typed dependency list for S):

• T ← construct a tree from C

• Traverse from the leftmost subtree, discarding any subtree T′ such that there is an e ∈ E found in T′.

• If there are subtrees left:

  – Use breadth-first search to assemble the context of the sentence, starting from the first subtree after all the entities in E have been visited.

  – R ← Concatenate the words, ignoring ‘cop’ and ‘det’ for compression; recheck the ordering of the words against the actual sentence.

  – return R

• else

  – return None
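Under the simplifying assumptions that the typed dependencies arrive as (head, relation, dependent) index triples and that the root index is known, Algorithm 1 can be sketched roughly as follows. This is one illustrative reading of the algorithm, not the authors' code; all names and the input encoding are ours:

```python
from collections import defaultdict, deque

def get_context(words, deps, entities, root):
    """Illustrative reading of Algorithm 1.

    words:    {index: word} for the sentence, in original order
    deps:     list of (head_index, relation, dependent_index) triples
    entities: set of token indices belonging to extracted entities
    root:     index of the root word of the dependency tree
    """
    children, rel_of = defaultdict(list), {}
    for head, rel, dep in deps:
        children[head].append(dep)
        rel_of[dep] = rel

    def subtree(node):
        # breadth-first collection of a subtree's token indices
        out, queue = set(), deque([node])
        while queue:
            n = queue.popleft()
            out.add(n)
            queue.extend(children[n])
        return out

    remaining, kept = set(entities), set()
    # visit the root's subtrees left to right (by leftmost word position),
    # discarding subtrees until every entity has been visited
    for child in sorted(children[root], key=lambda n: min(subtree(n))):
        nodes = subtree(child)
        if remaining:
            remaining -= nodes
            continue
        kept |= nodes
    if not kept:
        return None
    kept.add(root)
    # drop 'cop'/'det' dependents for compression, restore sentence order
    ordered = [n for n in sorted(kept) if rel_of.get(n) not in ("cop", "det")]
    return " ".join(words[n] for n in ordered)

# simplified version of the running example:
# 'Christian Dior competitor Marchon released a line'
words = {0: "Christian", 1: "Dior", 2: "competitor",
         3: "Marchon", 4: "released", 5: "a", 6: "line"}
deps = [(4, "nsubj", 3), (3, "nn", 2), (2, "nn", 1), (1, "nn", 0),
        (4, "dobj", 6), (6, "det", 5)]
entities, root = {0, 1, 3}, 4
print(get_context(words, deps, entities, root))  # released line
```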

Figure 3: Shows a collapsed typed dependency tree for the sentence ‘This month Christian Dior glasses competitor Marchon released a line aiming for the teen market’.

Following this assumption, we were able to design an algorithm (Algorithm 1) for context extraction given a pair of entities that were previously identified in the sentence by our specialized patterns.

Figure 4 shows a run of Algorithm 1. The leftmost subtree is ignored because it contains entities, and the right subtree is traversed to obtain context. As the subtree is traversed, words within ‘det’ or ‘cop’ dependencies are ignored. This reduced the number of traversals without altering the context of the sentence and was especially useful for longer sentences.

Figure 4: Shows the typed dependency tree that Algorithm 1 traverses. Nodes circled red are entities and those circled blue are part of the context.

While a majority of sentences in our corpus followed the assumption we specified earlier, there were a few that did not. For instance, Figure 5 shows such an example, where the entity ‘Kate Spade’ appears at the end of the sentence, causing the algorithm to miss out on the context completely. We discuss the limitations and possible improvements to our algorithm in the discussion section.

Figure 5: Shows how the sentence ‘Now that time has passed, Bebe is a now famous west coast based company that rivals European companies such as Kate Spade eyeglasses’ violated the key assumption.

4 Evaluation

4.1 Paired Relations

As such, there is no gold standard against which to evaluate relations. However, we devised a metric that captures whether a relation is saying something of significance. Since we have counts of the occurrence of each entity individually and the number of its occurrences with each other entity, we decided that Pointwise Mutual Information with Contextual Rescaling was a metric that could capture how often an entity is seen with another entity while taking into account how often it is seen by itself.

Let P(e_i) be the probability of seeing entity e_i and P(e_i, e_j) the probability of seeing entities e_i and e_j in a pair. We had an n × n frequency matrix f of fashion houses, so f_ij represents the number of times e_i appeared with e_j in our corpus.

• pmi_ij = log[ P(e_i, e_j) / (P(e_i) P(e_j)) ]   (assume log(0) = 0)

• scaledpmi_ij = pmi_ij × f_ij / (f_ij + 1) × min(Σ_{k=1}^{n} f_kj, Σ_{k=1}^{n} f_ik) / (min(Σ_{k=1}^{n} f_kj, Σ_{k=1}^{n} f_ik) + 1)
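For concreteness, here is a small sketch of this rescaled PMI over a raw co-occurrence matrix. This is our illustration, not the authors' code; in particular, estimating the marginals P(e_i) from row/column sums of the pair-count matrix is one reasonable reading, since the paper does not spell out the probability estimates:

```python
import math

def scaled_pmi(f):
    """PMI with contextual rescaling, computed from raw counts.

    f is an n x n matrix where f[i][j] is the number of times entity i
    appeared in a pair with entity j. log(0) is taken as 0, per the text.
    """
    n = len(f)
    total = sum(sum(row) for row in f) or 1
    row_sum = [sum(f[i]) for i in range(n)]                       # sum_k f_ik
    col_sum = [sum(f[i][j] for i in range(n)) for j in range(n)]  # sum_k f_kj
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            p_ij = f[i][j] / total
            p_i, p_j = row_sum[i] / total, col_sum[j] / total
            pmi = math.log(p_ij / (p_i * p_j)) if p_ij and p_i and p_j else 0.0
            m = min(col_sum[j], row_sum[i])
            # rescale by f_ij/(f_ij+1) and m/(m+1), per the formula above
            out[i][j] = pmi * (f[i][j] / (f[i][j] + 1)) * (m / (m + 1))
    return out
```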

Figure 6 below shows these adjusted PMI values for all 169 of our entities, and Figure 7 shows these PMI values for the top 10 of our entities (in terms of the number of mentions). Note that the PMI values are lower for the top entities than for others; this is because, for the less mentioned entities, the times they are mentioned with another entity make up a much higher ratio of the total times they are mentioned.

We set about manually checking the top 10 relations. To do this, we chose a PMI threshold of 0.3 above which to consider relations. We can generate a graph of relations using this threshold; see Figure 8. Checking these relations, 100% were corroborated by another source (i.e. they were related and competed or worked together). Our relation extraction is of course restricted by our corpus and amount of data, as there are many relations between fashion houses (not to mention other entities) that we do not catch because they are elsewhere on the web or unwritten.
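Thresholding the scaled PMI matrix into an edge set reduces to a scan of the matrix's upper triangle. A minimal sketch with made-up values (the entity names and scores below are illustrative, not the paper's data):

```python
def relation_graph(names, pmi, threshold=0.3):
    """Edges between entities whose scaled PMI meets the threshold
    (0.3 for the top-10 graph; > 1 for the less frequent entities)."""
    edges, n = set(), len(names)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle: each pair once
            if pmi[i][j] >= threshold:
                edges.add((names[i], names[j]))
    return edges

names = ["Prada", "Marchon", "Gucci"]
pmi = [[0.0, 0.74, 0.10],
       [0.74, 0.0, 0.05],
       [0.10, 0.05, 0.0]]
print(relation_graph(names, pmi))  # {('Prada', 'Marchon')}
```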


Figure 6: PMI with contextual discounting for all fashion houses × fashion houses. The fashion houses are ranked by the number of occurrences. Note that the top fashion houses seem to be compared with all other fashion houses; this is expected (the top fashion houses are seen as rivals by many other fashion houses and are also mentioned more, so have more customers, suppliers, etc.).

Figure 7: PMI with contextual discounting for the top 10 fashion houses.

The signature of ‘Marchon’ is particularly striking. We investigated this trend using other blogs and newspaper sources and found that Marchon is a supplier of eyewear to some of the top fashion houses as well as a competitor to others.

Figure 8: Shows the relationships between the top 10 fashion houses. These entities were connected if their PMI score was greater than or equal to 0.3.

Figure 6 also shows some interesting relationships between less frequently occurring fashion houses. Using a higher threshold (> 1), we were able to create a landscape of relationships between those fashion houses as well.

Figure 9: Shows relationships between fashion houses that were not in the top 10. These had higher PMI values and so depict relationships with higher confidence.

4.2 Context

For many of the edges in the graph, we were able to catch contextual phrases. Table 1 shows some contextual phrases that we do catch, and Table 2 shows phrases that actually have the wrong meaning. For Table 2, in the first case, the sentence was ‘In recent Prada eyeglasses news, rival companies Altair and Tommy Bahama will be teaming up to create eyewear for the upcoming 10 years’, and in the second case it was ‘Lastly, Kate Spade eyeglasses competitors Altair and Tommy Bahama grew their current partnership for creation eyeglass frames for a long time’. The sentence structures that we miss are often more complex than our algorithm can handle: a sentence may talk about an entity x's rival entity y who partners with entity z, and it is not right to say x partners with z.

Entity 1           | Entity 2            | PMI  | Contextual Phrase
Ralph Lauren       | Michael Bastian     | 0.35 | ‘Held event Monday promote his upcoming designer rimless line’
Marcolin Group     | Dolomiti            | 1.13 | ‘released their new round eyeglasses’
Marcolin Group     | Kenneth Cole Prod.  | 1.20 | ‘announced their licensing agreement for distribution of glasses’
RayBan             | Charmant Group      | 0.71 | ‘told the press that a new vintage eyeglasses line was released’
Christian Dior S.  | Filo Group          | 1.54 | ‘are partnering online sell their designer eyeglasses’
Prada              | Marchon             | 0.74 | ‘released line aiming for teen market’
Marchon            | Prada               | 0.74 | ‘just launched iWear’

Table 1: Positive Contextual Phrases extracted.

Entity 1   | Entity 2      | PMI  | Contextual Phrase
Prada      | Tommy Bahama  | 0.45 | ‘teaming for up create for eyewear for the upcoming 10 years’
Kate Spade | Tommy Bahama  | 0.54 | ‘grew their with current partnership creation eyeglass frames long time’

Table 2: Negative Contextual Phrases extracted.

To evaluate the accuracy of our contexts, we first look at unique contexts. Due to many short contexts, we have many repetitions: 181 unique contexts out of 429 total contexts. We found that none of the contexts shorter than three words were useful. It is interesting to note that the word ‘donate’ occurred 18 times, the words ‘launched’ and ‘renewed’ 12 times each, and the word ‘agreed’ 10 times. Some of these sentences did not elaborate further contextually, but others are contextual phrases that our method does not catch.

We found 183 contexts (42.7% of total contexts) and 102 unique contexts (56.4% of total unique contexts) that are longer than two words. Out of these 102 unique contexts, we found three sentences (2.9%) that really did not carry much information at all (our method caught the right phrase, but it did not mean anything: a reflection of the sentence) and five sentences (4.9%) that were inaccurate (indicating problems with our method). This means that 92.1% of our contexts were both accurate and informative. Two of these errors arose because we did not completely put back prepositions into our context (as it would require more graph traversals); three were because the sentences had a more complex structure.

5 Conclusion

We have proposed a method to discover entities and relationships between fashion houses. In addition, we have shown a method to provide contextual phrases describing these relationships. This was done by: a) looking for patterns in sentences that are of a competitive nature, b) using Parts of Speech tagging and Named Entity Recognition to extract singleton and paired entities, and c) traversing the Typed Dependency Graph to extract contextual phrases that describe the competitive association. We showed an evaluation metric that uses Pointwise Mutual Information with Contextual Discounting to identify important or ‘surprising’ relationships, along with graph representations of the relationships between entities. We have also shown that our context extraction system is quite accurate: of all long contextual phrases that we extracted, 92.1% were accurate and informative. This method can be applied to bootstrap and learn entities and relationships in any market given the appropriate corpus.

6 Future Work

6.1 Improve contextual phrase extraction

As pointed out in our evaluation, we miss phrases that have a more complex sentence structure. In addition, we do not catch sentences where the context is mentioned before the entities (this would require a simple modification of our algorithm, where we start at the end instead of the beginning).

In addition, our contextual phrase extraction sometimes identifies phrases that are common to multiple entities, but often has phrases that are specific to one entity; our method does not disambiguate these.

6.2 Filtering contexts

Our method finds phrases for contexts that do not provide much information. This is partially because our method is too simple and needs more rules for different structures of sentences and contexts. Largely, however, it is because most sentences do not contain relevant information and do not say much. There needs to be a method to filter the contexts that are possibly relevant. A classifier could be built using features such as the length of the phrase and the number of modifiers in the context (as a proxy for how specific the context is); other such features can be developed to classify an important context versus a less informative one.
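A sketch of such a feature extractor, assuming the phrase arrives as (word, POS) pairs; both features named above are included, and the function name and tag conventions (Penn Treebank JJ*/RB* for modifiers) are our illustrative choices:

```python
def context_features(tagged_phrase):
    """Feature vector for classifying a context as informative or not.

    tagged_phrase: list of (word, pos) pairs for the extracted phrase.
    Features: phrase length, and number of modifiers (adjectives and
    adverbs) as a proxy for how specific the context is.
    """
    length = len(tagged_phrase)
    modifiers = sum(1 for _, pos in tagged_phrase
                    if pos.startswith(("JJ", "RB")))
    return [length, modifiers]

phrase = [("released", "VBD"), ("their", "PRP$"), ("new", "JJ"),
          ("round", "JJ"), ("eyeglasses", "NNS")]
print(context_features(phrase))  # [5, 2]
```

These vectors could then feed any standard binary classifier trained on a small hand-labeled sample of informative versus empty contexts.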

6.3 Types of relationship

We believe one of the most exciting contributions of the paper is the rich set of relations and contexts that we are able to generate from a relatively small corpus with just two seed prefixes to search for (‘rival’ and ‘compet’). If a more extensive lexicon is developed for this type of relation, we believe we can catch even more relations and contexts. Additionally, this paper is focused on catching relationships of a competitive or rival nature. What we found is that, since we are looking at the sentences around sentences that contain ‘rival’ or ‘compete’, we are also catching relationships that are partnerships (licensing agreements, for example), or sentences that contain multiple relationships (e.g., ‘Rival of x, y, is partnering with z’). We can similarly create a lexicon for other types of relationships, such as partnership, supplier, or customer relationships between companies and their products.

6.4 Obtaining more data

While these high quality fashion blogs are limited in number, they give us a starting point by providing a set of high confidence relation triples. In the future, we hope to augment our corpus by adding fashion-related articles from reliable newspapers. Most fashion houses compete over more than one product, and by adding information for other products, we hope to recover more interesting competitive relationships.

Acknowledgments

We would like to thank Christopher Potts for helpful discussions and constructive feedback, and for the Typed Dependency visualization code.

References

Shasha Li et al. 2010. Comparable Entity Mining from Comparative Questions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 650–658.

Advaith Siddharthan. 2011. Text Simplification Using Typed Dependencies: A Comparison of the Robustness of Different Strategies. In Proceedings of the 13th European Workshop on Natural Language Generation (ENLG), pages 2–11.

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pages 63–70.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pages 252–259.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC 2006.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pages 363–370.