Towards Finding Relevant Information Graphics: Identifying the Independent and Dependent Axis from User-Written Queries

Zhuo Li and Matthew Stagitis and Kathleen McCoy and Sandra Carberry
Department of Computer and Information Science
University of Delaware
Newark, DE 19716

Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Information graphics (non-pictorial graphics such as bar charts and line graphs) contain a great deal of knowledge. Information retrieval research has focused on retrieving textual documents and on extracting images based on words appearing in the accompanying article or based on low-level features such as color or texture. Our goal is to build a system for retrieving information graphics that reasons about the content of the graphic itself in deciding its relevance to the user query. As a first step, we aim to identify, from a full-sentence user query, what should be depicted on the independent and dependent axes of potentially relevant graphs. Natural language processing techniques are used to extract features from the query and machine learning is employed to build a model for hypothesizing the content of the axes. Results have shown that our models can achieve accuracy higher than 80% on a corpus of collected user queries.

Introduction

The amount of information available electronically has grown dramatically. Although information retrieval research has addressed the need to identify and access information relevant to a user's needs, these research efforts have focused on the text of documents and to some extent on their pictorial images. Unfortunately, information graphics (non-pictorial graphics such as bar charts and line graphs) have been largely ignored. Such graphics are prevalent in popular media such as newspapers, magazines, blogs, and social networking sites and, unlike graphs in scientific articles where the article text explicitly refers to and explains the graphics, the information conveyed by graphics in popular media is often not repeated in the article's text (Carberry, Elzer, and Demir 2006). Yet these graphics are a significant knowledge source.

Consider, for example, an author who is constructing a report and wishes to make a point concerning how social media such as Twitter can give predictive insights into public opinion on popular events such as the United States presidential election. The author could try to describe in words that the comparative number of tweets mentioning each candidate generally corresponds to the popularity of that candidate. On the other hand, imagine that the author has at his/her disposal a query system that enables him/her to issue a query to retrieve an information graphic that compares the number of tweets for the two candidates (President Barack Obama and Governor Mitt Romney) in the weeks leading up to the election. The following, which we will refer to as Q1, might be such a query:

Q1: How many tweets mentioned Obama compared to Romney from October to November?

Figure 1: Comparative number of tweets mentioning Barack Obama and Mitt Romney

Figure 1, found on the official website of HootSuite (a social media management system), would be an ideal response to query Q1. Imagine how much more powerfully and convincingly the author's point can be made by including this graphic in the article, because the comparative trends shown in this graphic track remarkably well with opinion polls leading up to the election (which ultimately was won by President Obama).

Unfortunately, most commercial search engines are unable to respond effectively to requests for information graphics. This is largely due to their inability to understand the content of the graphic; instead they rely on the text of the document containing the graphic, with special attention paid to the image tag information, the image file name, and the text paragraphs surrounding the image. For example, if query Q1 is input to Google with a request for images, the first returned graphic compares Canadian tweets for the two candidates on the three debate days and the second returned graphic compares the sentiment of Obama tweets with Gallup polls. All of the top returned graphics are similarly off-target. This is due to the retrieval system's reliance on the appearance of the query words in the text of the webpage source file, not on the graphic's content and whether it is relevant to the query.

Figure 2: Bar chart showing the rank of American Express (bars for Visa, Mastercard, Discover, American Express, and Diner's Club; horizontal axis: U.S. Credit Cards in Circulation in 2003, in millions)

The long-term goal of our research is to build a system for retrieving relevant information graphics in response to user queries. Our prior work (Wu et al. 2010; Elzer, Carberry, and Zukerman 2011; Burns et al. 2012) has produced a system for recognizing the overall message of an information graphic by reasoning about the content of the graphic, including its communicative signals, such as one bar in a bar chart being colored differently from the other bars. For example, our system recognizes that the bar chart in Figure 2 is conveying the rank of American Express with respect to the other credit card companies in terms of US credit cards in circulation, what we categorize as a Rank message, and which can be formally represented as Rank(American Express, {Visa, Mastercard, Discover, Diner's Club}, US credit cards in circulation). This recognized message will be used in our retrieval system to represent the graphic's high-level informational content.
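To make the message representation concrete, here is a minimal sketch of how a recognized Rank message might be encoded as a data structure. This is purely illustrative: the class and field names are our own invention, not the representation used by the authors' system.

```python
from dataclasses import dataclass

# Hypothetical encoding of a recognized intended message; the class and
# field names are illustrative, not the system's actual representation.
@dataclass(frozen=True)
class RankMessage:
    focused_entity: str        # the entity whose rank is being conveyed
    other_entities: frozenset  # the entities it is ranked against
    measurement: str           # what the dependent axis measures

msg = RankMessage(
    focused_entity="American Express",
    other_entities=frozenset({"Visa", "Mastercard", "Discover", "Diner's Club"}),
    measurement="US credit cards in circulation",
)
print(msg.focused_entity)  # American Express
```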

In order to retrieve graphics in response to a user query, it is necessary to identify requisite characteristics of relevant graphics. These include the category of the intended message (such as a Rank message), any focused parameters (such as American Express), the entities on the independent axis (such as credit card companies), and the entities on the dependent axis (such as the number of credit cards in circulation). We present in this work our methodology for developing a learned model that takes as input a natural language query and hypothesizes the content of the independent and dependent axes of relevant graphics. Future work will extend this methodology to hypothesize the category of the intended message and any focused parameters.

Related Work

Most image retrieval systems use various kinds of text annotation of the image, such as automatic annotation learned from a manually annotated training set (Jeon, Lavrenko, and Manmatha 2003), co-occurring text generated from the multimedia document (Lapata 2010), and user-provided metadata tags in social media such as Flickr and YouTube (Gao et al. 2011). State-of-the-art content-based image retrieval looks into the image and uses low-level features to recognize entity objects such as cars and tables, human faces, and background scenery in natural images (Yue et al. 2011; Doersch et al. 2012; Kapoor et al. 2012). Research on information graphics has focused on identifying the type of graphic, such as bar charts and line graphs (Shao and Futrelle 2006; Mishchenko and Vassilieva 2011b). Others have focused on information extraction by converting information graphics into tabular-form data, such as XML representations (Huang and Tan 2007; Mishchenko and Vassilieva 2011a; Gao, Zhou, and Barner 2012). To the best of our knowledge, our work is the only project that attempts to recognize the high-level message or knowledge conveyed by an information graphic and use it in determining the relevance of a graphic to a user query.

Motivation for Identifying Axis Content from Queries

One approach to retrieving information graphics, given the graphic's caption, axis labels, and intended message, would be to match the words in the query to the words in the graphic directly. However, in many situations the majority of the words in the query are not actually contained in the graphic. Even when they are, simply taking graphics with a high overlapping word count would produce poor search results, as illustrated in the following examples.

Consider the following two queries:

Q2: Which Asian countries have the most endangered animals?
Q3: Which endangered animals are found in the most Asian countries?

These two queries contain almost identical words but are asking for completely different graphics. Query Q2 is asking for a comparison of Asian countries (independent axis) according to the number of endangered animals (dependent axis) in each country. On the other hand, query Q3 is asking for a comparison of different endangered animals (independent axis) according to the number of Asian countries they dwell in (dependent axis). The two queries ask for graphics in which the two mentioned entities, Asian countries and endangered animals, are completely flipped around, simply because the query is organized in a different way. This difference calls for two graphs with completely different content to be retrieved.

In addition, a query may contain more than two entities. Consider the following two queries, where a third entity is mentioned in the query:


Q4: How many major hurricanes occurred in each east coast state in 2002?
Q5: How many east coast states have had major hurricanes since 2002?

This third entity could be a confusing factor causing irrelevant graphics to be retrieved. Query Q4 is asking for a comparison of states on the east coast according to the number of major hurricanes in 2002, where "2002" should only be part of the label on the dependent axis. Query Q5, on the other hand, is asking for the trend in the number of east coast states affected by major hurricanes during the years since 2002, where the independent axis should contain a listing of the years since 2002 and the dependent axis should give the number of east coast states with major hurricanes in each of the years listed.

Based on the above concerns, we contend that preprocessing the query is needed to identify words that can be matched against descriptions of the independent and dependent axes of candidate graphs. This will eliminate many irrelevant graphics and reduce the number of graphs that need to be given further consideration.

Methodology for Hypothesizing Axis Content

Figure 3: System Model (pipeline: Query → CCG Parser → Query-Entity Pairs → Query and Entity Attributes → Independent and Dependent Axis Decision Tree → classified into Independent, Dependent, or None)

Our methodology is to extract clues from the user's query and use these clues to construct a learned model for hypothesizing the content of the independent and dependent axes of relevant graphs. Figure 3 outlines the application of the learned model to hypothesizing the content of the axes of potentially relevant graphs. Given a new query, we first pass it to a CCG parser (Clark and Curran 2007) to produce a parse tree. We use the parser developed by Clark and Curran since it is trained expressly for questions, whereas most parsers are trained entirely or mostly on declarative sentences. From the parse tree, we populate a set of candidate entities E1, E2, ..., En, and extract a set of linguistic attributes associated with each query-entity pair Q-Ei. Each query-entity pair is input to a decision tree that determines whether the entity represents the content of the independent axis, the content of the dependent axis, or neither axis.

In the following subsections, we discuss how entities that might represent axis content are extracted, the clues in a user query that we use as attributes in the decision tree, how these attributes are extracted from a user query, and the construction and evaluation of our models for hypothesizing the requisite content of the axes of potentially relevant graphs.

Candidate Entity Enumeration

To hypothesize the content of the independent and dependent axes from a user query, we first need to generate a set of candidate entities that will be considered by the decision tree as possible content of the axes. To do this, phrases that describe a period of time are first extracted as time intervals and included in the candidate entities. Afterwards, the parse tree is analyzed, and noun phrases that are not part of time intervals are also extracted and added to the set of candidate entities.

The set of candidates is filtered to remove certain categories of simple noun phrases that are used to describe a graph rather than to refer to the content of the graph. These include nouns such as "trend" or "change" that are part of the trend category, and "comparison" and "difference", which are part of the comparison category of words. Pronouns such as "that" or "it" are also filtered from the candidate list since they do not carry content themselves but rather refer back to entities previously introduced. Simple compound noun phrases that do not contain any prepositions or conjunctions are treated as a single entity, such as "American Idol". In contrast, compound noun phrases containing prepositions and/or conjunctions, such as "the revenue of Ford and Toyota", are broken down into component pieces such as "the revenue", "Ford", and "Toyota".
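The filtering and splitting rules just described can be sketched in a few lines of Python. This is a simplified stand-in, assuming the noun phrases have already been extracted from the parse tree; the word lists contain only the examples named in the text, and the real category lexicons are presumably larger.

```python
import re

# Filter words named in the text; the actual category lexicons are larger.
TREND_NOUNS = {"trend", "change"}
COMPARISON_NOUNS = {"comparison", "difference"}
PRONOUNS = {"that", "it"}

def split_compound_np(np_text):
    """Break an NP containing prepositions/conjunctions into pieces,
    e.g. 'the revenue of Ford and Toyota' -> ['the revenue', 'Ford', 'Toyota']."""
    return [p.strip() for p in re.split(r"\b(?:of|and|or)\b", np_text) if p.strip()]

def filter_candidates(noun_phrases):
    candidates = []
    for np_text in noun_phrases:
        head = np_text.lower().split()[-1]
        if head in TREND_NOUNS | COMPARISON_NOUNS | PRONOUNS:
            continue  # describes the graph itself, or is a contentless pronoun
        if re.search(r"\b(?:of|and|or)\b", np_text):
            candidates.extend(split_compound_np(np_text))  # break into pieces
        else:
            candidates.append(np_text)  # simple compound NP kept as one entity
    return candidates

print(filter_candidates(["the revenue of Ford and Toyota", "American Idol", "the trend"]))
# -> ['the revenue', 'Ford', 'Toyota', 'American Idol']
```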

Figure 4: Abbreviated parse tree for the query "How does the revenue of Discover compare to American Express in 2010?" (Wh-Q root dominating NP, VP, and PP constituents)

For example, Figure 4 shows an abbreviated version of the parse tree for the following query (while the parser we use is a CCG parser, we have drawn the tree using more traditional syntactic node names for clarity):

Q6: How does the revenue of Discover compare to American Express in 2010?

First, a specific time point phrase, "in 2010", is detected in this query sentence. Then the noun phrases "the revenue" and "Discover" are extracted from the parse tree and added to the set of candidate entities. The noun phrase "American Express" is added to the candidate list as a single noun phrase, given that it does not contain any prepositions or conjunctions. The final set of query-entity pairs for query Q6 is:

Q6-E1: the revenue
Q6-E2: Discover
Q6-E3: American Express
Q6-E4: in 2010
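The time expression detection used in this example can be approximated with regular expressions. The patterns below are illustrative guesses: the paper states that regular expressions are matched against the query, but it does not give the actual patterns.

```python
import re

MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")

# Illustrative patterns only; the paper does not list its actual expressions.
TIME_INTERVAL = re.compile(
    rf"\bfrom {MONTHS}(?: \d{{4}})? to {MONTHS}(?: \d{{4}})?\b"
    r"|\b(?:in|over) the (?:past|last) \w+ years\b")
TIME_POINT = re.compile(rf"\bin (?:{MONTHS} )?\d{{4}}\b")

q6 = "How does the revenue of Discover compare to American Express in 2010?"
print(TIME_POINT.search(q6).group())  # 'in 2010' -> candidate entity Q6-E4
print(TIME_INTERVAL.search("How have oil prices changed from January to March?").group())
# 'from January to March' -> a time interval candidate entity
```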

Cues from the User's Query

We consider only full-sentence queries, such as the examples presented above. While most text retrieval systems work with keyword queries, they are retrieving documents, each of which can fulfill a variety of information needs. We wish to retrieve information graphics that are relevant to a user who has a specific information need in mind that the graphics are to fulfill. We therefore wish to develop a mechanism that analyzes the semantics of full-sentence queries to identify characteristics of graphics relevant to the user's information need.

The content and structure of such queries provide clues to the content of relevant graphics. We have identified a set of attributes that might suggest whether a candidate entity reflects the content of the independent or dependent axis of a relevant graphic. These clues can be divided into two classes: query attributes, which are features of the whole query and are independent of any specific entity, and entity attributes, which are particular to each specific candidate entity.

One query attribute is the question type of the query sentence. For example, "Which" or "What" queries, such as Q2 and Q3, are often followed by a noun phrase that indicates the class of entities (such as countries) that should appear on the independent axis. On the other hand, "How many" and "How much" queries, such as Q4 and Q5, are often followed by a noun phrase that indicates what quantity (such as the number of major hurricanes) should be measured on the dependent axis.

Comparatives and superlatives also provide clues. For example, the presence of a superlative, such as "highest" in the query "Which countries had the highest GDP in 2011?", often suggests that the dependent axis should capture the noun phrase modified by the superlative. Similarly, comparatives such as "higher" in the query "Which countries have a higher GDP than the U.S. in 2011?" suggest that the dependent axis should capture the following noun phrase, which in this case is also "GDP".
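Comparatives and superlatives can be detected from part-of-speech tags alone (Penn Treebank tags JJR/RBR for comparatives and JJS/RBS for superlatives). The sketch below uses NLTK as an illustrative stand-in; the paper does not say which tagger its system uses.

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed

def comparative_superlative_flags(query):
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(query))]
    return {
        "has_comparative": any(t in ("JJR", "RBR") for t in tags),
        "has_superlative": any(t in ("JJS", "RBS") for t in tags),
    }

print(comparative_superlative_flags("Which countries had the highest GDP in 2011?"))
# {'has_comparative': False, 'has_superlative': True}
```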

Certain categories of phrases provide strong evidence for what should be displayed on the independent axis. For example, consider the following queries:

Q7: How does CBS differ from NBC in terms of viewers?
Q8: How does CBS compare with other networks in terms of viewers?

The presence of a comparative verb such as "differ" or "compare" suggests that the entities preceding and following it capture the content of the independent axis. Furthermore, the plurality of the noun phrases is another clue. If both the noun phrases preceding and following the comparative verb are singular, as in query Q7, then the noun phrases suggest entities that should appear on the independent axis; on the other hand, if one is plural (as in Q8), then it suggests the class of entities to be displayed on the independent axis, of which the singular noun phrase is a member.

Similar to the comparison word sets, certain words, such as "trend" or "change" in a query such as "How have oil prices changed from January to March?", indicate a change in the dependent variable, which here is "oil prices". Such words suggest that the entity (noun phrase) dominated by them is likely to be on the dependent axis. On the other hand, the characteristic of representing a time interval, such as the entity "from January to March" or "in the past three years", suggests that the entity captures the content of the independent axis.

Extracting Attributes for Each Query-Entity Pair

Table 1 describes the categories of attributes we have included in our analysis for identifying the content of the independent and dependent axes. Recall that query attributes are based on features of the whole query, whereas entity attributes are particular to the entity in the query-entity pair. For each query-entity pair, we determine the value of each of the attributes. This is accomplished by analyzing the parse tree and the part-of-speech tags of the elements of the parse tree, and by matching regular expressions for time intervals against the query-entity pair.

For example, let us consider query Q6. Query Q6 is of the "How does" question type, causing the corresponding query attribute from the question type category to be set to True for every query-entity pair derived from Q6. Regular expressions detect that Q6 contains a phrase describing a specific time point, namely "in 2010" (query-entity pair Q6-E4), so the attribute associated with the presence of a time interval is set to True for this query-entity pair and False for the other query-entity pairs from Q6. In the rest of the query, the system finds a word from the comparison word category, therefore setting the query attribute for the presence of comparison category words to True for all the query-entity pairs from Q6; for query-entity pairs Q6-E2 and Q6-E3, the attribute designating that the entity is on the left or right side, respectively, of the comparison verb is set to True. Since E1 is the leftmost noun phrase following the question head "How does" in the parse tree (Figure 4), the attribute reflecting the leftmost noun phrase is set to True for query-entity pair Q6-E1 and to False for the other query-entity pairs. Since all entities are tagged as singular nouns by the part-of-speech tagger, the plurality attribute is set to False for each query-entity pair.
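A sketch of how the attribute values for query Q6's pairs might be assembled follows. The attribute names and the simplified tests (e.g., the crude plurality check standing in for part-of-speech tags) are ours, chosen to mirror the categories in Table 1 rather than to reproduce the system's actual feature extractor.

```python
import re

TIME_POINT = re.compile(r"\bin \d{4}\b")        # simplified time-point pattern
TIME_INTERVAL = re.compile(r"\bfrom \w+ to \w+\b")

def extract_attributes(query, entity, all_entities):
    q = query.lower()
    return {
        # Whole-query attributes: identical for every pair from this query.
        "qtype_how_does": q.startswith("how does"),
        "has_comparison_word": any(w in q for w in ("compare", "comparison", "differ")),
        # Entity-specific attributes.
        "is_time_point": bool(TIME_POINT.search(entity)),
        "is_time_interval": bool(TIME_INTERVAL.search(entity)),
        "is_leftmost_np": entity == all_entities[0],
        # Crude stand-in for the POS-tag plurality test used by the system.
        "is_plural": entity.endswith("s") and not entity.istitle(),
    }

pairs = ["the revenue", "Discover", "American Express", "in 2010"]
q6 = "How does the revenue of Discover compare to American Express in 2010?"
for e in pairs:
    print(e, extract_attributes(q6, e, pairs))
```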

Whole Query Attributes:
- Presence of a superlative or comparative in the query sentence.
- Question type: Which; What; How many or How much; How do or How have; What is.
- Presence of Trend category words or Comparison category words in the query sentence.
- Whether the main verb is a trend verb or a comparison verb.
- If the main verb is a comparison verb, whether it is followed by a preposition (such as with, to, against, from, among/amongst, or between), or has no preposition and is used at the end of the query.

Specific Entity Attributes:
- Whether the entity contains a superlative or comparative.
- Whether the entity describes a time interval, a specific time point, or neither.
- Whether the entity contains or is modified by gradient category phrases (such as "in terms of"), quantity category words (such as "the number"), comparison category words, or trend category words.
- Whether the entity is modified by inclusive words such as "all" and "total", by exclusive words such as "other" and "the rest", or by words that indicate enumeration such as "each" and "every".
- Whether the entity is the direct noun phrase following the question type.
- Whether the entity is singular or plural.
- If the main verb belongs to either the comparison or trend category, whether the entity is on the left or right side of the main verb.

Table 1: A description of attribute categories

Constructing the Decision Tree

In order to construct a corpus of full-sentence queries oriented toward retrieval of information graphics, a human subject experiment was conducted. Each participant was at least 18 years of age and was either a native English speaker or fluent in English. Each subject was shown a set of information graphics on a variety of topics, such as adoption of children, oil prices, etc. For each displayed graphic, the subject was asked to construct a query that could best be answered by the displayed graphic. After dropping off-target queries, this resulted in a total of 192 queries. (The link to the experiment's online SQL database is http://www.eecis.udel.edu/~stagitis/ViewAll.php.)

To construct a training set, candidate entities are extracted from each query, a set of query-entity pairs is constructed, and values for each of the attributes are extracted. Our decision tree takes in the query-entity pairs, along with their attribute values, and returns a decision of independent axis entity, dependent axis entity, or neither. The classification of each query-entity pair for training is assigned by one researcher and then verified by another researcher, with the final annotation of each query-entity pair reflecting both researchers' consensus. To construct the decision tree for identifying the content of the dependent and independent axes, this training set is then passed to WEKA, an open-source machine learning toolkit.
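The paper trains its decision tree with WEKA; as an analogous sketch in Python, the same setup can be expressed with scikit-learn's DecisionTreeClassifier, vectorizing the attribute dictionaries and fitting on the annotated labels. The two toy training rows below are illustrative, not actual corpus data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy training data: feature dicts as produced by an attribute extractor
# like the sketch above, with the researchers' consensus labels.
X_dicts = [
    {"qtype_how_does": True, "is_time_point": False, "is_leftmost_np": True},
    {"qtype_how_does": True, "is_time_point": True,  "is_leftmost_np": False},
]
y = ["Dependent", "Independent"]  # e.g. a time-point entity -> independent axis

vec = DictVectorizer(sparse=False)
clf = DecisionTreeClassifier(random_state=0).fit(vec.fit_transform(X_dicts), y)

test = {"qtype_how_does": True, "is_time_point": True, "is_leftmost_np": False}
print(clf.predict(vec.transform([test])))  # ['Independent']
```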

Evaluation of the Methodology

Leave-one-out cross-validation holds out one instance from the full dataset of n instances. The classifier is constructed from the other n-1 instances and tested on the held-out instance. This procedure is repeated n times, with each of the n instances serving as the held-out test instance, and the results are averaged to get the overall accuracy of the classifier. This strategy maximizes the size of the training set.

In our case, each query can produce more than one query-entity pair. This leads to an unfair advantage in evaluating our classifier, since the training set would contain query-entity pairs extracted from the same query as the held-out test instance. This would mean that all query attribute values (those based on the whole query) would exist identically in both the training and testing sets at the same time. To prevent this, we use a variant of leave-one-out cross-validation which we refer to as leave-one-query-out cross-validation. In leave-one-query-out cross-validation, all of the query-entity pairs extracted from the same query are bundled together and used as the testing set, while all of the remaining query-entity pairs are used to construct the classifier. A total of n repetitions of this custom cross-validation are performed, where n is the number of unique queries in the dataset.
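Leave-one-query-out cross-validation corresponds to grouped cross-validation, where the group is the query a pair came from. A minimal sketch using scikit-learn's LeaveOneGroupOut (our choice of tooling; the paper's evaluation used its own WEKA-based machinery):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data: one row per query-entity pair; `groups` holds the id of the
# query each pair came from, so whole queries are held out together.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]])
y = np.array([1, 2, 1, 2, 1, 2])          # e.g. 1 = Independent, 2 = Dependent
groups = np.array([0, 0, 1, 1, 2, 2])     # query id for each query-entity pair

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(scores.mean())  # average held-out accuracy over the 3 "queries"
```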

Note that we are evaluating the overall system (not just the decision trees), since the held-out query is parsed and its entities, along with the values of the attributes, are automatically computed from the parse tree, its part-of-speech tags, and the use of regular expressions.

We evaluated our methodology using accuracy as the evaluation metric: the proportion of instances in which the annotated correct classification matches the system's decision (Independent, Dependent, or None). This is depicted in Table 2, in which the system is awarded 1 point each time its decision matches the annotation.

                      System Output
                   Ind.   Dep.   None
Annotation  Ind.    +1     +0     +0
            Dep.    +0     +1     +0
            None    +0     +0     +1

Table 2: Standard Accuracy Scoring

Axis Extraction

There are 76 entities classified as belonging to neither axis, which is only 7.1% of the total entities generated. Therefore, 92.9% of the entities generated by our system are helpful in identifying the independent and dependent axes of the hypothesized graph. Prediction of the independent and dependent axes has a baseline accuracy of 54.15%, obtained if the system always predicts the majority class of Independent Axis for every query-entity pair. Our methodology did considerably better than this baseline, achieving an overall accuracy of 81.51%, with overall precision of 79.26%, overall recall of 73.96%, and overall F-measure of 76.5%. For identifying the Independent Axis, we achieved precision of 81.61%, recall of 78.07%, and F-measure of 79.77%. For identifying the Dependent Axis, we achieved precision of 82.14%, recall of 87.24%, and F-measure of 84.58%.

The confusion matrix for the axis classification is displayed in Table 3.

                       System Output
                   None   Ind.   Dep.
Annotation  None     43      9     24
            Ind.      5    324     86
            Dep.     10     64    506

Table 3: Axes Classification Confusion Matrix
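As a quick check, the headline numbers can be recomputed directly from Table 3; the per-class precision and recall and the overall accuracy below match the figures reported above (the F-measures agree to within rounding).

```python
import numpy as np

# Rows = annotation (None, Ind., Dep.); columns = system output, per Table 3.
cm = np.array([[ 43,   9,  24],
               [  5, 324,  86],
               [ 10,  64, 506]])

accuracy = np.trace(cm) / cm.sum()          # 873 / 1071
precision = np.diag(cm) / cm.sum(axis=0)    # per system-output column
recall = np.diag(cm) / cm.sum(axis=1)       # per annotation row
f1 = 2 * precision * recall / (precision + recall)

print(f"overall accuracy = {accuracy:.2%}")  # 81.51%
for name, p, r, f in zip(("None", "Ind.", "Dep."), precision, recall, f1):
    print(f"{name}: precision={p:.2%} recall={r:.2%} F={f:.2%}")
# Ind.: precision=81.61%, recall=78.07%; Dep.: precision=82.14%, recall=87.24%
```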

Conclusion and Future Work

As presented in this paper, we have achieved the first step toward successful retrieval of relevant information graphics in response to a user query. Our methodology for identifying the content of the independent and dependent axes uses natural language processing and machine learning, and has a success rate of 81.51%, substantially exceeding the baseline accuracy. Our future work will include extending our methodology to recognize the message category (such as a Rank message or a Trend message) of graphics that are potentially relevant to a user query, as well as any focused parameters of the graphic's message. We will then use these in a mixture model that matches requisite features identified from the user's query with features of candidate graphs in order to rank graphs in terms of relevance. To our knowledge, this work is the only ongoing effort that attempts to use the informational content of a graphic in determining its relevance to a user query.

References

Burns, R.; Carberry, S.; Elzer, S.; and Chester, D. 2012. Automatically recognizing intended messages in grouped bar charts. Diagrammatic Representation and Inference, 8–22.

Carberry, S.; Elzer, S.; and Demir, S. 2006. Information graphics: An untapped resource for digital libraries. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 581–588. ACM.

Clark, S., and Curran, J. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics 33(4):493–552.

Doersch, C.; Singh, S.; Gupta, A.; Sivic, J.; and Efros, A. 2012. What makes Paris look like Paris? ACM Transactions on Graphics (TOG) 31(4):101.

Elzer, S.; Carberry, S.; and Zukerman, I. 2011. The automated understanding of simple bar charts. Artificial Intelligence 175(2):526–555.

Gao, Y.; Wang, M.; Luan, H.; Shen, J.; Yan, S.; and Tao, D. 2011. Tag-based social image search with visual-text joint hypergraph learning. In Proceedings of the 19th ACM International Conference on Multimedia, 1517–1520. ACM.

Gao, J.; Zhou, Y.; and Barner, K. E. 2012. VIEW: Visual information extraction widget for improving chart images accessibility. In IEEE International Conference on Image Processing (ICIP). IEEE.

Huang, W., and Tan, C. 2007. A system for understanding imaged infographics and its applications. In Proceedings of the 2007 ACM Symposium on Document Engineering, 9–18. ACM.

Jeon, J.; Lavrenko, V.; and Manmatha, R. 2003. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 119–126. ACM.

Kapoor, A.; Baker, S.; Basu, S.; and Horvitz, E. 2012. Memory constrained face recognition. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2539–2546. IEEE.

Lapata, M. 2010. Image and natural language processing for multimedia information retrieval. Advances in Information Retrieval, 12.

Mishchenko, A., and Vassilieva, N. 2011a. Chart image understanding and numerical data extraction. In Proceedings of the Sixth International Conference on Digital Information Management (ICDIM), 115–120. IEEE.

Mishchenko, A., and Vassilieva, N. 2011b. Model-based chart image classification. Advances in Visual Computing, 476–485.

Shao, M., and Futrelle, R. 2006. Recognition and classification of figures in PDF documents. Graphics Recognition: Ten Years Review and Future Perspectives, 231–242.

Wu, P.; Carberry, S.; Elzer, S.; and Chester, D. 2010. Recognizing the intended message of line graphs. Diagrammatic Representation and Inference, 220–234.

Yue, J.; Li, Z.; Liu, L.; and Fu, Z. 2011. Content-based image retrieval using color and texture fused features. Mathematical and Computer Modelling 54(3):1121–1127.
