
POLITECNICO DI MILANO
Master of Science in Computer Engineering

for the Communication

Department of Computer Engineering

USING FLICKR GEOTAGS TO FIND SIMILAR TOURISM DESTINATIONS

Supervisor: Prof. Lorenzo Cantoni
Co-Supervisor: Dr. Davide Eynard

Master Thesis of: Leonardo Gentile, matricola 744177

Academic Year: 2010 - 2011

Thanks to Professor Cantoni, who gave me the great chance to undertake this thesis on a very interesting topic.

Thanks to Alessandro Inversini and Elena Marchiori for their advice in the communication field and for spreading my survey around the world.

Thanks to Eng. Giuseppe Moscato, aka PeppeSka, who kindly shared his homemade server, letting me gather 1 million tags from Flickr.

Thanks to Stefano Celentano, who kindly shared his internet connection.

Thanks to my family for supporting me during my long student life (yes, don’t worry, it’s over...).

A very special thanks to Davide Eynard, who patiently and always very kindly guided and advised me during the creation of this work.

Abstract

The amount of geo-referenced information available on the Web is constantly increasing due to the large availability of location-aware mobile devices and map interfaces. This is enabling new search paradigms (e.g. “What is here”) but it is also generating a large number of unexplored georeferenced collections. In particular, in photo collections like Flickr the co-existence of geographical metadata with text-based annotations (tags) generates interesting location-driven trends and patterns in textual data. When enough information is available, analysis systems can identify these patterns and extract aggregate knowledge. This inspired me to create a novel method to extract representative place descriptions using users’ text annotations obtained from Flickr geo-referenced photos. On this basis I propose an attempt to predict similar locations based on the similarity of their respective descriptions. The prototype has been implemented as a web-based tool and has been positively evaluated, through a survey, by more than a hundred users.


Contents

Abstract

1 Introduction
1.1 The Web2rism project
1.2 Motivations
1.3 Objective
1.4 Thesis Outline

2 Background
2.1 Folksonomies
2.1.1 Broad Folksonomy
2.1.2 Narrow Folksonomy
2.1.3 Folksonomies Conclusions
2.2 The Geo World & Geo Web
2.2.1 The GeoTags
2.2.2 Geotagging Photos
2.2.3 Yahoo! GeoPlanet & WoeId
2.2.4 Flickr
2.3 Weighting and Scoring Methods
2.3.1 TF-IDF
2.3.2 Vector Space Model
2.3.3 VSM Related Issues
2.4 Related Works
2.4.1 Wormholes
2.4.2 World Explorer

3 My Approach
3.1 Approach Introduction
3.2 Assumptions
3.3 Datasets
3.4 Definitions
3.5 Flickr: from Narrow to Broad Folksonomy
3.6 Extract a representative description
3.6.1 Vector Space Model Representation
3.6.2 Problem Decomposition
3.6.3 TF-IDF Weights
3.6.4 Weighting Systems
3.6.5 VSM Limitations
3.7 Scoring and Retrieving Similar Places
3.7.1 Scoring
3.7.2 Retrieving Similar Places

4 Implementation
4.1 Development Platform
4.1.1 PHP
4.1.2 MySQL
4.1.3 Python
4.2 API & Online Services
4.2.1 Yahoo Geoplanet
4.2.2 Flickr
4.2.3 Yahoo Query Language
4.3 System Architecture
4.3.1 Data Storage
4.3.2 Data Analysis
4.4 User Interface Design & Data Presentation

5 Tests and Evaluations
5.1 About Similarity
5.1.1 Critical Factors
5.2 Test
5.2.1 Survey
5.3 Evaluation

6 Conclusion
6.1 Current Status of the Work
6.2 Application Fields
6.3 Future Work

A Terms and Abbreviations

B Blacklist

List of Figures

2.1 Generic Tag Distribution in the broad folksonomies
2.2 Mobile devices positioning systems and accuracies
2.3 Geoplanet Hierarchy and Relationships
2.4 Flickr Map
2.5 A three-dimension example of the Vector Space Model
2.6 Geotagged World photos distribution collected for the “Wormholes” research
2.7 Wormholes detection from Mount Everest with σ = 50 km
2.8 The World Explorer Map for a large scale of details
2.9 The World Explorer Map for a narrow scale of details for the City of Rome in Italy

3.1 Tag distribution for the city of New York using the “Top 100” dataset
3.2 Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag
3.3 Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag using a log-log scale

4.1 Data Flow Representation
4.2 Database Representation for the “Random” Dataset
4.3 Table Representation for the generic “score” table
4.4 Search Field
4.5 Disambiguation
4.6 Similarities for the city of “Rome”

5.1 Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag
5.2 Tag Weights distribution for the city of New York using the “Random” dataset truncated to the 150th tag using the W-rnd1 and W-rnd3 weights
5.3 Survey Introduction and Instruction
5.4 Screenshot of the survey for the city of Marseille
5.5 Users’ survey answers for the first day of evaluation
5.6 Users’ survey answers for the last day of evaluation
5.7 Shared tags between “Seville” and “Cordoba” according to the System A
5.8 Shared tags between “Seville” and “Cordoba” according to the System D (truncated)
5.9 Five cities similar to “Rome” according to System A and System D (in ranked order)
5.10 Shared tags between “Rome” and “Tarragona” according to System A
5.11 Five cities similar to “Rome” according to System B and System C (in ranked order)

B.1 Stop-words or Blacklist composed by the most common Flickr tags


Chapter 1

Introduction

This chapter gives an introductory overview of the whole report. First, general information about the related “Web2rism” project is presented and the motivation behind my decision to undertake this research project is explained. Next, the objective that this research attempts to satisfy and its connection with my motivation are described. Lastly, the structure of the report is given to let the reader have a clear idea of what each chapter is about.

1.1 The Web2rism project

The Web2Rism project has been carried out by the webatelier¹ lab at Università della Svizzera Italiana (USI - Lugano, Switzerland) and funded by the CTI - the Swiss Confederation’s Commission for Technology and Innovation² - and a private company called PromAx Communication³.

The webatelier lab, directed by Professor Lorenzo Cantoni⁴, is a research and development laboratory that deals with a broad range of topics related to new media communication, especially in the eTourism field, combining a strong academic background with relevant business experience. Research projects of Webatelier deal with online communication strategies for destinations and tourism companies: eWord-of-Mouth and destinations’ online reputation, eLearning and gaming in tourism, argumentation in user-generated content, usability and usage studies, websites’ information architecture, and booking engine design.

¹ http://www.webatelier.net
² http://www.kti.admin.ch
³ http://www.promax.ch/index.html
⁴ http://newmine.blogspot.com/


The aim of the Web2Rism (web 2.0 and tourism) project was to build business intelligence software for the tourism field that analyzes the online reputation of a given destination based on User Generated Content (UGC) published on different services of the so-called web 2.0, e.g. blogs, wikis, social networks, and so on. The project was divided into a research phase followed by a development phase, for a total duration of two years (2008-2010). The software, also named Web2rism, was designed, developed and released in December 2010 by researchers of Webatelier, who also published several research papers on the topic of web reputation for tourism destinations [11][12].

1.2 Motivations

The main aim of the “Web2rism” project was to analyze the online reputation of tourism destinations. When I joined the “Web2rism” team I started my research not exactly on reputation analysis but on a related topic, that is, finding similar tourism destinations with the aim of creating a tourism suggestion system. There are different approaches and already available online tools that satisfy these needs, most of them based on data generated by user travel behaviors or exploiting users’ reviews of tourism destinations. These methods are, for example, widely used by big online travel and tourism portals such as Expedia⁵ or Venere⁶.

⁵ http://www.expedia.com/
⁶ http://www.venere.com/

However, I wanted to develop a system not based on users’ reviews or tourism analyses. In particular, I started to wonder whether it would be possible to extract knowledge from users’ online photos. In other words, when a person, for example a tourist, takes a photo in a particular place, he is in a way expressing his appreciation for that place. This alone is valuable information that lets analysts identify usage patterns (e.g. a large number of photos taken in a place during a particular day may identify an event, concert, parade, etc.). Usually in online photo collections, of which Flickr is the most representative example, users can geotag their photos, meaning that they can annotate the geographical place where the photos were taken, but they can also annotate them with textual keywords in order to create a short description of their photos.

This is very interesting information because we can analyze not only geographical photo trends but also the trends among the textual descriptions aggregated by geographical location (e.g. how many users are using the tag “Trevi” in the city center of “Rome”?).

I began the research by studying the existing research papers on the topic, formalizing the hypotheses and setting the objectives.
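The example question above (how many users use the tag “Trevi” in the city center of Rome) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Photo` record, the `users_with_tag` helper, the sample photos and the bounding box around central Rome are all hypothetical and are not taken from the Flickr data used in this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Photo:
    """Hypothetical, simplified record of a geotagged photo."""
    owner: str        # user identifier
    lat: float        # latitude in decimal degrees
    lon: float        # longitude in decimal degrees
    tags: frozenset   # lower-cased textual tags


def users_with_tag(photos, tag, south, west, north, east):
    """Return the distinct users who used `tag` inside the bounding box."""
    tag = tag.lower()
    return {
        p.owner
        for p in photos
        if south <= p.lat <= north
        and west <= p.lon <= east
        and tag in p.tags
    }


# Hand-made sample records (not real Flickr data).
photos = [
    Photo("alice", 41.9009, 12.4833, frozenset({"trevi", "fountain", "rome"})),
    Photo("alice", 41.9010, 12.4830, frozenset({"trevi"})),
    Photo("bob", 41.8902, 12.4922, frozenset({"colosseum", "rome"})),
    Photo("carol", 41.9008, 12.4834, frozenset({"trevi", "night"})),
    Photo("dave", 45.4642, 9.1900, frozenset({"trevi"})),  # taken in Milan
]

# Rough box around central Rome; the coordinates are an assumption.
ROME_CENTRE = dict(south=41.88, west=12.45, north=41.92, east=12.52)

trevi_users = users_with_tag(photos, "Trevi", **ROME_CENTRE)
print(len(trevi_users), sorted(trevi_users))
```

Counting distinct users rather than photos keeps one prolific photographer from dominating the figure, which matches the way the question is phrased.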

1.3 Objective

In the past, the information retrieval community has studied methods for efficiently indexing, retrieving, ranking and browsing documents in geographic data collections [17, 30]. However, in collections like Flickr the co-existence of location metadata with unstructured text-based annotations allows the generation of interesting location-driven aggregate knowledge: when enough information is available, analysis systems can identify useful location-driven trends and patterns in the text data.

The exploitation of Flickr’s geotags in conjunction with users’ text annotations has been shown to be effective for various tasks: global event detection [25], mapping of popular tags and photos to geographical locations [1, 15], and finding important landmarks and representative photos [4]. Furthermore, several methods have been proposed to predict the geotags of a photo based on its textual tags [29], visual information [4] and individual user travel patterns [14].

I propose an attempt to predict similar geographic locations using Flickr georeferenced photos and their tags. The similarity is based on location descriptions that are, in turn, extracted by aggregating textual photo annotations. As far as I know, only one other study [3] pursuing this aim has been published.
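To give a first intuition of what “similarity of descriptions” can mean, the following Python sketch compares places through the cosine similarity of their aggregated tag counts. It is a deliberately simplified illustration, not the method developed in Chapter 3: it uses raw tag counts instead of the TF-IDF weights discussed later, and the place descriptions, tag counts and function names are invented for the example.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse tag-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(place, descriptions):
    """Rank every other place by decreasing similarity to `place`."""
    target = descriptions[place]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in descriptions.items()
        if other != place
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented tag counts, for illustration only.
descriptions = {
    "Rome": Counter({"colosseum": 5, "church": 4, "ruins": 3}),
    "Athens": Counter({"acropolis": 5, "ruins": 4, "church": 1}),
    "Miami": Counter({"beach": 6, "palm": 3}),
}

ranking = most_similar("Rome", descriptions)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these invented counts, the place that shares tags with the query ranks first and the place with no tags in common receives a score of zero.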


1.4 Thesis Outline

The thesis begins with an introduction to the background topics necessary to understand the whole research. The Background chapter begins with an excursus on the concept of folksonomy, which is particularly important in the scope of this research. The discussion then focuses on what the “Geoweb” is and why it is becoming an important emerging trend, considering also the related topics analyzed in this research. The Background chapter ends with an overview of the Information Retrieval methods used in the research and with a discussion of the related work that most influenced this thesis.

In Chapter 3, the theoretical approach of this work is explained, starting from the assumptions that form the foundations of the whole research and ending with a step-by-step illustration of all the choices I made.

The system architecture behind the project makes heavy use of online data services and APIs to obtain the information on which the implemented tool performs analysis, scoring and retrieval. These are described in Chapter 4 along with the whole system architecture.

Chapter 5 presents the tests conducted through a survey on a user focus group, followed by the evaluation based on the collected opinions.

Finally, in Chapter 6 the conclusions are drawn, highlighting the goals reached together with the main limitations of the current prototype, and suggesting possible future improvements.

Chapter 2

Background

This chapter’s aim is to provide the background necessary to better understand the characteristics of the project I worked on. It begins with an excursus on the concept of folksonomy and its subdivision into narrow and broad. Then, the discussion focuses on what the “Geoweb” is and why it is becoming an important emerging trend, considering also the related topics analyzed in this research (tags, geotags). Next, the two main online data sources used during the work, Flickr and Yahoo GeoPlanet, are presented. The chapter continues with an overview of the Information Retrieval methods used in the research, namely the Vector Space Model and the TF-IDF weighting method. Finally, it ends with a discussion of the related work that most influenced this thesis.

2.1 Folksonomies

Collaborative tagging is a phenomenon where users assign free-form keywords or short sentences (called tags) to annotate, describe and categorize shared digital content, typically over the web [16]. This practice is also known as social classification, social indexing, and social tagging. Collaborative tagging became popular on the Web around 2004¹ as part of highly popular web and social applications that enable collaborative tagging in some form, such as the music service Last.fm², the social bookmarking application Delicious³, the social networking site Facebook⁴ and the photo management and sharing tool Flickr⁵. The fact that big systems like these use collaborative tagging shows that it has become a common and likely effective way to describe various forms of digital content on the web.

Together, the tags in their respective contexts form a vocabulary often referred to as a folksonomy [22], which can be used for organization and retrieval of the digital content the folksonomy describes. Folksonomies can also be defined as large-scale bodies of lightweight annotations provided by humans, and they are becoming more and more interesting for research communities that focus on extracting machine-processable semantic structures from them [2].

Folksonomy, a term coined by Thomas Vander Wal⁶, is a blend of the terms folks (multiple people with no particular designation) and taxonomy (a hierarchical structure of classification), meaning that by adding metadata to objects or resources a community builds a personalized taxonomy.

Since tags are usually free-form and unconstrained (although some tagging sites do not allow spaces or other non-alphanumeric characters in tags), with no pre-built vocabulary imposed, folksonomies represent an alternative to the semantic web approach, where experts build ontologies⁷ with predetermined relationships among keywords. This latter approach usually requires domain experts and a community agreeing on most of the experts’ choices, while in collaborative tagging keyword indexing grows as a natural process [5]. In other terms, we can refer to a folksonomy as a means for people to tag objects (web pages, photos, videos, podcasts, etc.) using their own vocabulary so that it is easy for them to find that information again. It is important to notice that folksonomies, often living in a social network context, let others who use the same vocabulary find the object as well.

Folksonomies work best when the tags used to describe objects are in the tagger’s own common vocabulary rather than what the tagger believes others will call the object. It is possible to derive two different concepts from the general term folksonomy, broad and narrow folksonomies, which go beyond a simple understanding of tagging.

¹ http://vanderwal.net/folksonomy.html
² http://www.last.fm/
³ http://www.delicious.com/
⁴ http://www.facebook.com/
⁵ http://www.flickr.com
⁶ http://www.vanderwal.net/about.php
⁷ http://www.shirky.com/writings/ontology_overrated.html


2.1.1 Broad Folksonomy

An example of a broad folksonomy is a tool like Delicious. In a broad folksonomy many people tag the same object, and every person can tag the object with their own tags in their own vocabulary. We can describe this process by dividing it into three main actions:

• A person creates the object (content) and makes it accessible to others.

• Other people (groups of people with the same vocabulary) tag the object with their own terms.

• The people also find the information based on the tags.

Analyzing broad folksonomies, we can usually observe emerging trends, often with a distribution that follows a power law curve⁸ (Figure 2.1).

Broad Folksonomy & Power Law Curve

There are both benefits and drawbacks to the tagging approach. The main positive aspect of tagging systems is that they allow much greater malleability and adaptability in organizing information than formal classification systems do, because “groups of users do not have to agree on a hierarchy of tags or detailed taxonomy, they only need to agree, in a general sense, on the ‘meaning’ of a tag enough to label similar material with terms for there to be cooperation and shared value”⁹. However, a number of problems arise from organizing information through folksonomies, including ambiguity in the meaning of tags and the use of synonyms, which creates informational redundancy [7]. The main concern about the use of collaborative tagging to organize metadata is whether or not the system becomes relatively “stable” with time and use. By “stable” we mean that users have developed some consensus about which tags best describe an object and that those tags are used most often. The most problematic claim against tagging systems would be that, because users are not bound to a centrally controlled vocabulary, no coherent categorization scheme can emerge at all. In this case, tagging systems would be essentially unstable, with the tags used and their frequency of use in a constant state of flux.

A tag distribution for an object or resource is defined as the collection of all tags and their frequencies, ordered by rank frequency, for a given resource.
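The definition of a tag distribution can be made concrete with a short Python sketch. It assumes that the tagging activity on one resource is available as a list of (user, tag) pairs; the `tag_distribution` function name, the case-insensitive normalization and the sample assignments are assumptions made only for this example.

```python
from collections import Counter


def tag_distribution(assignments):
    """Build the tag distribution of one resource.

    `assignments` is an iterable of (user, tag) pairs. The result is a list
    of (tag, frequency) pairs ordered by decreasing frequency; ties are
    broken alphabetically so that the ranking is deterministic.
    """
    counts = Counter(tag.strip().lower() for _user, tag in assignments)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented assignments by six users on a single bookmark.
assignments = [
    ("u1", "python"),
    ("u2", "Python"),
    ("u3", "programming"),
    ("u4", "python"),
    ("u5", "tutorial"),
    ("u6", "programming"),
]

distribution = tag_distribution(assignments)
for rank, (tag, frequency) in enumerate(distribution, start=1):
    print(rank, tag, frequency)
```

The tags at the top of the resulting list correspond to the “power terms” on the left of the curve, while the tags used only once form its long tail.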

⁸ http://www.shirky.com/writings/powerlaw_weblog.html
⁹ http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html


Figure 2.1: Generic Tag Distribution in the broad folksonomies

It has been empirically shown that tag distributions for broad folksonomies actually stabilize over time, producing a distribution known as a power law [7] like the one shown in Figure 2.1. This means that from a broad folksonomy we can gather trends describing how a wide range of people are tagging one object.

We can use Figure 2.1 to describe the trend that emerges when different users tag a “bookmark” on Delicious. The tags spike, with tag “2” getting the largest share with 13 entries and tag “1” receiving 10 identical tags. From this point the trends for popular tags are easy to see: the spikes on the left (power terms) identify trends that could be used to extract a controlled vocabulary, or at least to tell a broad spectrum of people how to call the object and so find it (people similar to those who tagged the object, considering also that those who tag may not be representative of the whole). We also see the tags at the right end of the curve, known as the long tail. Here a small minority of people call the object by a given term, but their tags allow others with a similar vocabulary mindset (or maybe the same language) to find the object, even if they do not use the terms used by the masses at the left end of the curve. If we take this example and spread it over 400 or 1,000 people tagging the same object, we will see a similar distribution with even more pronounced spikes and drop-off and a longer tail, because one important feature of power laws is that they are often “scale-free”: regardless of how large the system grows, the shape of the distribution remains the same, and thus “stable”.

The long tail and power curve are benefits of the broad folksonomy that come from the richness provided by many people openly tagging the same object.


As described next, the narrow folksonomy does not have the same properties, but it has other benefits. The long-tail and power-curve benefits do not exist when items are simply tagged, most often by the content creator, for their own content.
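The “scale-free” property of power laws discussed above for broad folksonomies can be checked numerically. On a log-log plot a power law appears as a straight line whose slope is the (negative) exponent, so the slope should not change when the system grows. The sketch below fits that slope by ordinary least squares on two idealised, synthetic distributions of different size; the function names and the numbers are assumptions for illustration, and a least-squares fit on log-log data is only a rough diagnostic, not a rigorous test that real tag data follow a power law.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    `frequencies` must be positive and ordered by rank (most used tag first).
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


def ideal_power_law(scale, exponent, n_tags):
    """Synthetic frequencies f(rank) = scale * rank^(-exponent)."""
    return [scale * rank ** -exponent for rank in range(1, n_tags + 1)]


# A small system and one that is 100 times "larger" with ten times more tags.
small = ideal_power_law(scale=13, exponent=1.0, n_tags=20)
large = ideal_power_law(scale=1300, exponent=1.0, n_tags=200)

slope_small = loglog_slope(small)
slope_large = loglog_slope(large)
print(round(slope_small, 6), round(slope_large, 6))
```

Both fits recover the same slope, which is the numerical counterpart of saying that the shape of the distribution stays the same regardless of how large the system grows.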

2.1.2 Narrow Folksonomy

The narrow folksonomy, of which a web application like Flickr is an example, provides benefit in tagging objects that are not easily searchable or have no other means of using text to describe or find them (in this case images and photographs).

The narrow folksonomy is built by one or a few people providing tags that they use to get back to that information. The tags, unlike in the broad folksonomy, are singular in nature for each object (only one tag with a given term is used, as compared to 13 people using the same tag in the broad folksonomy). Often in the narrow folksonomy the person creating the object provides one or more of the tags to get things started. The goals and uses of the narrow folksonomy are different from those of the broad one, but it is still very helpful, as more than one person can describe the same object.

With the narrow folksonomy there is also little chance of really knowing how the tags are consumed or what portion of the people using the object would call it what; therefore it is not helpful for finding emerging vocabulary or trends. We do find that tags used to describe are also used for grouping, which is particularly visible and relevant in Flickr.

The narrow folksonomy does not have the richness of the broad folksonomy, but it still adds value. The value, as in the case of Flickr, lies in text tags being applied to objects that were not findable using traditional search engines or other text-related tools, which make up much of how we find things on the internet today. The narrow folksonomy gives various audiences the means to add tags in their own vocabulary that will help them and those like them to find the objects at a later time.


2.1.3 Folksonomies Conclusions

We benefit from folksonomies because both the personal vocabulary and the social aspects help people to find and retain a “chain” to objects on the web that are of interest to them. Who is doing the tagging and how the tags are consumed are important factors to understand. This also helps us to see that not all tagging is a folksonomy; some is just tagging. Folksonomy tagging can provide connections across cultures, languages and disciplines (a photograph can be tagged using two different languages and alphabets, but we can still find valuable information because one object is tagged by both communities using their own differing terms of practice).

In conclusion, folksonomies offer different advantages, even in the case of a narrow folksonomy.


2.2 The Geo World & Geo Web

The amount of geographically annotated data on the Web is drastically increasing, generating a major new trend in which geographic data (and metadata) are available in several applications and online services [1].

The Geospatial Web or Geoweb is a relatively new term that implies the merging of geographical (location-based) information with the abstract information that currently dominates the Internet. This would lead to an environment where one could search for things based on location instead of by keyword only, e.g. “What is Here?”. The concept of a Geospatial Web may have first been introduced in 1994 by Dr. Charles Herring in his US Department of Defence research paper [9].

The interest in the Geoweb has been driven by new technologies, concepts and products. Virtual globes such as Google Earth¹⁰ and NASA World Wind¹¹ as well as mapping websites such as Google Maps¹², Bing Maps¹³ and Yahoo Maps¹⁴ have been major factors in raising awareness of the importance of geography and location as a way to index information. The rise of advanced web development methods such as Ajax and the availability of geographical Application Programming Interfaces (APIs) such as the Yahoo! Geoplanet API (see the next paragraph) and the Google Maps API Family¹⁵ are providing inspiration to move Geographical Information Systems (GIS) onto the web. This is also due to the availability of map interfaces in several kinds of location-aware mobile devices that are becoming accessible to the mainstream market.

Location-Aware Mobile Devices

With the increasing popularity of mobile communications and mobile computing, the demand for location-aware and adaptive applications grows. Location-aware devices and applications exploit knowledge about the physical location of real-world objects, such as mobile persons and devices, to adapt their functional behaviour and their appearance towards the user [19]. The user can be located with different positioning systems.

Due to the massive production of affordable GPS-enabled cameras and mobile phones [21, 24], location metadata such as latitude and longitude are automatically associated with the content generated by users.

¹⁰ http://www.google.com/earth/
¹¹ http://worldwind.arc.nasa.gov/
¹² http://maps.google.com/
¹³ http://www.bing.com/maps/
¹⁴ http://maps.yahoo.com/
¹⁵ http://code.google.com/apis/maps/index.html


If the device is equipped with a GPS (Global Positioning System) module, the location is calculated on the user’s device and can be determined very accurately, within a range of 2–20 meters [13].

A mobile phone can also be located by the telecom operator in the network [13]. The positioning is based on identifying the mobile network cell in which the phone is located, or on measuring distances to overlapping cells. In urban areas the accuracy can be down to 50 meters, whereas in rural areas the accuracy may be several kilometres. The advantage of the cell-based positioning method is that no extra equipment is needed: an ordinary mobile phone is already capable of it.

Figure 2.2: Mobile devices positioning systems and accuracies

Finally, a user can also be located at a service point, using e.g. WLAN (Wireless Local Area Network), Bluetooth or infrared technologies. These kinds of proximity positioning systems require a dense network of access points [13]. The density of the network depends both on the required location accuracy and on the range of the access points. The accuracy can be down to 2 meters, even though a practical test using an iPhone device¹⁶ showed an average accuracy of 30 meters¹⁷. The user needs a mobile device able to connect to WLAN and Bluetooth services, but these capabilities are becoming very common in current smartphones. Because of the required infrastructure, such localization methods can only be used in a predefined area, e.g. a shopping centre, an exhibition area or an office building. The location of the user is available only when the user is in the service area.

2.2.1 The GeoTags

Geotagging refers to the process of assigning geospatial context information, ranging from specific point locations to arbitrarily shaped regions, to objects and online resources. The concept is similar to the tagging of online resources explained in Section 2.1, but in this case the objects and resources are annotated with geographical metadata instead of free-form textual keywords.

Different sources of geospatial context information for annotating Web resources often co-occur in real-world applications [20]:

• Annotation provided by the user, manually or through location-aware devices such as car navigation systems, RFID-tagged products and GPS-enabled cellular phones. These devices geotag information automatically when it is created.

• Determination of the user’s location by analyzing their connection point to the Internet, e.g. by querying the Whois¹⁸ database for domain registrations or using the W3C Geolocation API¹⁹ for a higher accuracy. For mobile devices, the location can be determined with one of the positioning methods mentioned in the paragraphs above.

• Automated annotation of existing documents. The processes of recognizing geographic context and assigning spatial coordinates are commonly referred to as geoparsing and geocoding, respectively.

The geospatial metadata (geotags) usually consist of latitude and longitude coordinates, even though they can also include annotations like altitude, bearing, distance, accuracy data, and place names.

¹⁶ http://www.apple.com/iphone/specs.html
¹⁷ http://www.wired.com/gadgets/wireless/magazine/17-02/lp_guineapig
¹⁸ http://www.dnsstuff.com/
¹⁹ http://dev.w3.org/geo/api/specsource.html


Geotagging can help users find a wide variety of location-specific information. For instance, one can find images taken near a given location by entering latitude and longitude coordinates into an appropriate image search engine.

Geotagging-enabled information services can also potentially be used to find location-based news, websites, or other resources. The related term geocoding refers to the process of taking non-coordinate based geographical identifiers, such as a textual street address, and finding the associated geographic coordinates (or vice versa for reverse geocoding). Such techniques can be used together with geotagging to provide alternative search techniques. Geocoding usually analyzes unambiguous structured location references, such as postal addresses and formatted numerical coordinates, while geoparsing handles ambiguous references in unstructured discourse, such as “Venice”, which is the name of several places, including towns in both Italy and the USA.

2.2.2 Geotagging Photos

There are several circumstances in which the location where a picture was taken is important: tourists shoot photos of family while traveling on vacation, botanists record images of plant species, and real-estate firms post shots of houses and neighborhoods[32]. These are only a few examples in which the geographic location where the photographs were taken provides critical context. Social sharing is another motivation: geotagging pictures lets as many users as possible find them²⁰. Geotagging can for example tell users the location where a given picture was taken, and conversely some media platforms can show the pictures relevant to a given location.

Users have the opportunity to spatially organise and browse their personal media, and photo sharing services (see Flickr in the next section) are leading the growing enthusiasm for personal location-awareness[31]. Geo-referenced photos can be organised in a browsable taxonomy of major locations or pinpointed on a map to identify very small regions. Some of the most popular examples are Flickr Places and Google Panoramio²¹.

The base resource for geotagging digital objects is the position, which in almost every case is derived from GPS and based on the latitude/longitude coordinate system, which represents each location on the earth from 180° west through 180° east along the Equator and from 90° north through 90° south along the prime meridian.

There are two main options for geotagging photos: capturing positioning information (usually GPS) at the time the photo is taken, or “attaching” the photograph to a map after the picture is taken.

²⁰ http://www.msnbc.msn.com/id/22732770/ns/technology_and_science-internet
²¹ http://www.panoramio.com/

In order to capture GPS data at the time the photograph is taken, the user must have a camera with built-in GPS, or a standalone GPS along with a digital camera. Because of the requirement for wireless service providers in the United States to supply more precise location information for 911 calls by September 11, 2012²², more and more cell phones, including those sold around the world, have built-in GPS chips.

Some cell phones, like the iPhone and different devices using the Android²³ operating system, already utilize a GPS chip along with built-in cameras to allow users to automatically geotag photos. Others may have the GPS chip and camera but lack the internal software needed to embed the GPS information within the picture.

A few digital cameras, such as some models from Nikon, Sony and Ricoh, also have a built-in GPS that allows automatic geotagging. Almost any digital camera can be coupled with a standalone GPS, and its photos post-processed with photo mapping software (such as GPS-Photo Link²⁴, Alta4²⁵, or EveryTrail²⁶) to write the location information to the image’s Exif header.

An alternative way to know the location of the pictures is to use a camera with an SD or SDHC memory card²⁷ that has a wireless connection and geotagging capabilities. The most common SD memory card with these characteristics is the Eye-Fi Geo²⁸, which provides a distinctive method to extract the place location. The Eye-Fi card geotags pictures through Wi-Fi Positioning System (WPS)[19] technology. Using the built-in Wi-Fi module, the Eye-Fi card senses the surrounding Wi-Fi networks while the user is taking pictures. The location is not locally recorded in conventional Exif coordinate form, but the geotags are inserted into Exif when photos are uploaded using the Eye-Fi Service.

Geographic coordinates can also be added to a photograph after the photograph is taken by “attaching” the photograph to a map[29] using online services such as Flickr and Panoramio. Once the location has been selected on an online map, these tools can write the latitude and longitude into the photo’s Exif header.

²² http://www.fcc.gov/cgb/consumerfacts/wireless911srvc.html
²³ http://www.android.com/
²⁴ http://www.geospatialexperts.com/productfeatures.php
²⁵ http://www.alta4.com/eng/geoimaging/camera/index.php
²⁶ http://www.everytrail.com/garmin_import.php
²⁷ http://www.sdcard.org/
²⁸ http://uk.eye.fi/products/geox2

Tag vs. Geotag

In online photo sharing communities the users’ text-based annotations (tags) and the location metadata (geotags) often co-exist. In these contexts it is not rare to refer to a geotag as the textual tag that also carries the georeference information associated with the photo that it describes.

In this report the term “geotag” refers to the geographic annotations (e.g. where a photo was taken) while “tag” is always intended as the textual annotation related to a photo (e.g. “cat”).

EXIF Metadata

Geotag information is typically embedded in the metadata (stored in EXIF format). These data are not visible in the picture itself but are read and written by almost any digital imaging program and by most digital cameras and modern scanners.

EXIF stands for “Exchangeable image file format”, and represents a specification²⁹ for the image file format used by digital cameras (including smartphones) and scanners. The specification uses the existing JPEG, TIFF Revision 6.0, and RIFF WAV file formats, with the addition of specific metadata tags. It is not supported in JPEG 2000, PNG, or GIF.

Version 2.1 of the specification is dated June 12, 1998; the latest version, 2.3, dated April 2010³⁰, was jointly formulated by JEITA³¹ and CIPA³². Though the specification is not currently maintained by any industry or standards organization, its use by camera manufacturers is nearly universal.

The metadata tags defined in the EXIF standard cover a broad spectrum:

• Date and time information. Digital cameras will record the currentdate and time and save this in the metadata.

• Camera settings. This includes static information such as the camera model and make, and information that varies with each image such as orientation (rotation), aperture, shutter speed, focal length, metering mode, and ISO speed information.

• A thumbnail for previewing the picture on the camera’s LCD screen, in file managers, or in photo manipulation software.

• Descriptions and copyright information.

• Geotags

²⁹ http://www.exif.org/specifications.html
³⁰ http://www.cipa.jp/english/hyoujunka/kikaku/pdf/DC-008-2010_E.pdf
³¹ http://www.jeita.or.jp/english/
³² http://www.cipa.jp/english/index.html

Latitude and longitude are stored as degrees, minutes and seconds together with a hemisphere reference (N or S, E or W). When they are converted to decimal degrees, a positive sign indicates the Northern or Eastern hemisphere, while a negative sign indicates the Southern or Western hemisphere.

An example readout for a photo might look like:

GPS Latitude : 57 deg 38’ 56.83” N
GPS Longitude : 10 deg 24’ 26.79” E
GPS Position : 57 deg 38’ 56.83” N, 10 deg 24’ 26.79” E
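As an illustration, a readout like the one above can be turned into signed decimal degrees. The following is only a minimal sketch: the function name `dms_to_decimal` is hypothetical and not part of any EXIF library, and the values are assumed to have been parsed already from the readout.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter
    (N, S, E or W) into signed decimal degrees."""
    if ref not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere reference: {ref!r}")
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # Southern and Western hemispheres get a negative sign.
    return -value if ref in ("S", "W") else value


# Values taken from the example readout above.
latitude = dms_to_decimal(57, 38, 56.83, "N")
longitude = dms_to_decimal(10, 24, 26.79, "E")
print(round(latitude, 6), round(longitude, 6))
```

The signed form is the one usually passed to map services and distance computations, while the degrees/minutes/seconds form is what metadata viewers tend to display.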


2.2.3 Yahoo! GeoPlanet & WoeId

When dealing with geo-referenced data and metadata we always face ambiguity. It can arise from at least three critical factors that affected geographic representation systems long before the birth of the GeoWeb: location names, coordinate precision and boundaries.

The first factor is due to the fact that different people call the same place by several different names, depending on their language, alphabet or cultural background. Every location on Earth can have hundreds of different names referring to the same geographical object.

For each location there can be:

• different names in English

• different names in other languages (including the local one)

• well-known (but unofficial) variants for the place (e.g. “New York City” for New York)

• colloquial names for the place (e.g. “Big Apple” for New York)

• versions of the names stripped of accented characters

• abbreviations or codes for the place (e.g. “NYC” for New York)

The second factor deals with the accuracy of the coordinates identifying a place. A geographical place can be located using several distinct sources; each one provides its own version and representation of the coordinates identifying that place and, with a high probability, they will never coincide. Accuracy is the main reason for this problem: we can, for example, deal with accuracies of centimeters, meters or kilometers supplied by three different sources, all identifying the same geographical object. One could say that the solution to this problem is to adopt the system supplying the coordinates with the highest accuracy. This leads to the third source of ambiguity, that is the boundary limits.

Even adopting the system supplying the best accuracy, we have to face the fact that the three systems mentioned in the example can identify the boundaries of the same geographic place in different ways. Since geographic places may have very different real-world boundary shapes, it is a challenge to identify:

• the center of the object

• the shape of the object

• the accuracy representing the shape


The Geoplanet Service

Yahoo! provides an online service and public API that attempts to solve some of the mentioned problems: the Yahoo! GeoPlanet³³ service (Geoplanet in short). GeoPlanet is designed to bridge the gap between the real and virtual worlds by providing an open, comprehensive, and intelligent infrastructure for geo-referencing data on Earth’s surface. In practical terms, GeoPlanet is a resource for managing all geo-permanent named places on earth. It provides a vocabulary and grammar to describe the world’s geography in an unequivocal, permanent, and language-neutral manner, and is designed to facilitate spatial interoperability and geographic discovery.

Where On Earth Identifier

GeoPlanet provides information for about six million named places globally. Spatial entities provided by GeoPlanet are referenced by a 32-bit identifier: the “Where On Earth ID” (WOEID). WOEIDs are unique and non-repetitive, and are assigned to all entities within the system. A WOEID, once assigned, is never changed or recycled. If a WOEID is deprecated it is mapped to its successor or parent WOEID, so that requests to the service using a deprecated WOEID are served transparently.

The Hierarchy

The service uses a hierarchical model for places that provides both vertical consistency and horizontal consistency of place geography. The model ensures that places in each layer in the hierarchy overlay the correct and corresponding places in other layers, and that geographical relationships are preserved. The hierarchy makes it possible to query the geographic context of every named place represented by a WOEID. Every place belongs to a number of containing, superior (larger) geographic entities, and in turn may contain a number of inferior (smaller) geographic entities. The smallest fully containing official geographical entity for a place is called its parent. The list of containing official geographic entities for a place is called its ancestors. The fully contained geographic entities for a place are called its children. The hierarchy recognizes a distinction between “official” administrative places, such as country, state, county, and city, and “informal” places, such as colloquial places and historical administrative places. These informal places are included in a separate collection called belongtos.

³³ http://developer.yahoo.com/geo/geoplanet/


Figure 2.3: Geoplanet Hierarchy and Relationships

Relationship

Places have relationships with other places; Yahoo! GeoPlanet allows users to identify places that have specific relationships to others, such as the parent, children, and neighbors. For example, a list of states (or first-level administrative areas) in a particular country can be obtained by requesting the children of that country; in a similar manner, the surrounding postal codes of a particular postal code can be obtained via a call for its neighbors. The following relationships are provided by GeoPlanet:

• Parent

• Children

• Neighbors

• Siblings

• Belongtos

• Ancestors
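To make the parent, children, ancestors and siblings relationships concrete, here is a minimal in-memory sketch. It does not call the GeoPlanet API: the `Place` class is hypothetical, the identifiers are made-up integers rather than real WOEIDs, and siblings are simplified to "places with the same parent".

```python
class Place:
    """A named place in a toy hierarchy (illustrative, not the GeoPlanet API)."""

    def __init__(self, woeid, name):
        self.woeid = woeid      # made-up identifier, not a real WOEID
        self.name = name
        self.parent = None      # smallest containing place
        self.children = []      # fully contained places

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def ancestors(self):
        """Containing places, from the parent up to the root."""
        chain, node = [], self.parent
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

    def siblings(self):
        """Other places sharing the same parent (a simplification)."""
        if self.parent is None:
            return []
        return [p for p in self.parent.children if p is not self]


europe = Place(1, "Europe")
italy = europe.add_child(Place(2, "Italy"))
france = europe.add_child(Place(3, "France"))
milan = italy.add_child(Place(4, "Milan"))
rome = italy.add_child(Place(5, "Rome"))

print([p.name for p in milan.ancestors()])  # ['Italy', 'Europe']
print([p.name for p in milan.siblings()])   # ['Rome']
```

Neighbors and belongtos are not modelled here, since they depend on geometry and on informal places that a simple tree cannot express.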


Place Type

Places are categorized to help identify the geographic entity. These Place Types have unique codes that may be used to filter results for some resources. They also have localized names, so they can be displayed along with the localized place name. The following list shows a small subset of the supported place types.

• Continent (code:29)

• Country (code:12)

• Town (code:7)

Positional Consistency

Places in GeoPlanet are roughly represented in Longitude/Latitude coordinates using the WGS84³⁴ datum. All places are represented within a single positional context to ensure that content is organized in a consistent way globally. GeoPlanet also recognizes that a place has a center and an area of influence and represents these respectively by its centroid and its bounding box. Thus every place within each theme has a geometric description. Different areas within different themes overlap to enable the most granular location for an address to be found.

The coordinates provided are illustrative, not normative; the service does not aim to be the authority on the exact bounds of any particular place. The main feature is instead to provide a common naming convention, and to ensure that places are correctly represented in relation to each other in a global, consistent framework. In practice this means that the service does not claim that a particular neighborhood stops at one block and starts at the next, only that the concept of that neighborhood can be identified consistently. The primary concerns are geography and the semantics of place.

The role of Geoplanet inside my research

The Geoplanet service and its API (Section 4.2.1) were used during the project mainly to clearly identify different places on earth (the dataset used is composed of 233 cities). One important feature was the WOEID: since each city is identified by a unique integer, once the cities had been identified I could refer to them without caring about names, ambiguity, geographic coordinates and bounding boxes. The WOEID was a really important aspect of the project that simplified several development stages (see Chapter 4).

³⁴ http://earth-info.nga.mil/GandG/publications/tr8350.2/tr8350_2.html


2.2.4 Flickr

Flickr³⁵ is a popular image and video hosting website and online community created by Ludicorp and later acquired by Yahoo!.

In September 2010, it reported that it was hosting more than 5 billion images³⁶. Flickr lets photo submitters organize images using tags, which enable searchers to find images related to particular topics, such as place names or subject matter. Flickr was also an early website to implement tag clouds, which provide access to images tagged with the most popular keywords.

Because of its support for tags, Flickr has been cited as a prime example of effective use of folksonomy, although Thomas Vander Wal suggested Flickr is not the best example³⁷, defining it as a narrow folksonomy (Chapter 2.1).

Since 2006³⁸, Flickr also lets users geotag their uploaded pictures by dragging them over a map[29]³⁹ or by importing photos that have already been geotagged using other tools or services, including the automatic mobile geotagging methods explained in Section 2.2.1.

Figure 2.4: Flickr Map

³⁵ http://www.flickr.com/
³⁶ http://blog.flickr.net/en/2010/09/19/5000000000/
³⁷ http://www.vanderwal.net/random/entrysel.php?blog=1781
³⁸ http://blog.flickr.net/en/2006/08/28/great-shot-whered-you-take-that/
³⁹ http://www.flickr.com/map

The import system is able to extract the geographic metadata from the EXIF information, with a strong support for the Where On Earth identifier provided by Yahoo! Geoplanet (Section 2.2.3).

For mobile users, Flickr has official apps for iPhone⁴⁰ and BlackBerry⁴¹, as well as several third-party apps. All these mobile apps let users, who are often geolocated, interact with a system that is aware of their location (depending on the device used) and automatically upload geotagged photos from their devices.

Finally, Flickr offers a fairly comprehensive web-service API that enablesprogrammers to create applications that can perform almost any function auser on the Flickr site can do (Section 4.1.1).

Online photo sharing systems, of which Flickr is the most popular, are strongly contributing to the growing enthusiasm for personal location-awareness[29]. It is worth mentioning that for the launch of the geotagging service, on August 28, 2006, the Flickr developers made projections guessing how many photos Flickr members would geotag; they thought that they could hit a million in the first month or, in the best scenario, in two weeks. Instead, 24 hours after the launch of the geotag service, there were 1,234,384 geotagged photos⁴². Later, on January 8th, 2011, it was announced that there were 190 million geotagged photos available, with a constantly increasing trend⁴³.

The role of Flickr inside my research

Flickr aggregates a huge collection of images, most of them annotated with a wide variety of textual tags but also with other forms of information, including the “owner” of a tag, geolocation, time and photographer. This information is very valuable for the aim of this research, which is to extract patterns and generate “knowledge”.

⁴⁰ http://itunes.apple.com/us/app/flickr/id328407587?mt=8
⁴¹ http://us.blackberry.com/smartphones/features/social/flickr.jsp
⁴² http://blog.flickr.net/en/2006/08/29/geotagging-one-day-later/
⁴³ http://code.flickr.com/blog/2011/01/08/flickr-shapefiles-public-dataset-2-0/


2.3 Weighting and Scoring Methods

As anticipated earlier, the research I carried out can be summarized in the following steps:

1. Extract a representative description of each city

2. Find similarities between these representations and store the ranked list of place similarities

3. Given a city as a query term, find the most similar cities by analyzing the previously calculated ranked list

In particular, for step 1 the TF-IDF approach has been used (the motivations will become clear from the assumptions in Chapter 3.2). For steps 2 and 3 the Vector Space Model has been adopted in order to calculate and retrieve the similarities.

2.3.1 TF-IDF

The Term Frequency-Inverse Document Frequency is a weight often used in information retrieval and text mining. It was first introduced by Gerard Salton⁴⁴ in 1975[27] in conjunction with the Vector Space Model (see next section).

The weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query.

Definitions:

The term count (or frequency) for a term $t_i$ in a document $d_j$ is the number of times the term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document), giving a measure of the importance of the term $t_i$ within the particular document $d_j$.

⁴⁴ http://en.wikipedia.org/wiki/Gerard_Salton


Thus the (normalized) term frequency is defined as follows:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \tag{2.1}$$

where $n_{i,j}$ is the number of occurrences of the considered term ($t_i$) in document $d_j$, and the denominator is the sum of the numbers of occurrences of all terms in document $d_j$, that is, the size of the document $|d_j|$.

The inverse document frequency is a measure of the general importance of the term across the corpus, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:

$$idf_i = \log_2 \frac{|D|}{|\{d : t_i \in d\}|}$$

where:

• $|D|$ represents the total number of documents in the corpus

• $|\{d : t_i \in d\}|$ is the number of documents where the term $t_i$ appears (that is, $n_{i,j} \neq 0$)

The reason why the logarithm of the ratio is needed for calculating $idf_i$ is well explained by Dr. E. Garcia in one of the most influential blogs about Information Retrieval topics⁴⁵.

The weight that determines the importance of a term $t_i$ for a document $d_j$ is computed as:

$$W_{i,j} = tf_{i,j} \cdot idf_i$$

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.

Various (mathematical) forms of the tf-idf term weight can be derived depending on the probabilistic distributions of the terms and documents under analysis. I personally used different variations of the idf term in order to compute different weights based on different assumptions (Chapter 3.6.3).

The tf-idf weighting scheme is often used in the vector space model together with cosine similarity to determine the similarity between two documents.
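The definitions above can be sketched in a few lines of Python. This is a minimal illustration of the basic tf-idf weight only (normalized term frequency times the base-2 logarithm of the inverse document frequency), not of the idf variations used later in this work; the function name `tf_idf` and the toy tag lists are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents maps a document name to its list of terms.
    Returns {name: {term: weight}} using tf * log2(|D| / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))

    weights = {}
    for name, terms in documents.items():
        counts = Counter(terms)
        size = len(terms)  # document size, the denominator of tf
        weights[name] = {
            term: (n / size) * math.log2(n_docs / doc_freq[term])
            for term, n in counts.items()
        }
    return weights


# Toy corpus: each "document" is the bag of tags of one city (made-up data).
cities = {
    "venice": ["canal", "gondola", "canal", "church"],
    "amsterdam": ["canal", "bike", "bike", "museum"],
    "rome": ["church", "colosseum", "museum", "church"],
}
w = tf_idf(cities)
print(w["venice"])
```

In this toy corpus "gondola" appears once in Venice but in no other city, so it ends up with a higher weight than "canal", which appears twice in Venice but is shared with Amsterdam.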

⁴⁵ http://irthoughts.wordpress.com/2009/04/15/why-idf-is-expressed-using-logs/


2.3.2 Vector Space Model

The Vector Space Model (VSM) is an algebraic model for representing text documents (and any objects, in general) as vectors in a multidimensional space of index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings. A first version was introduced in 1960 in Gerard Salton’s SMART⁴⁶ Information Retrieval System.

Each dimension corresponds to a separate term, and the model is based on the assumption of independence between terms.

In the Vector Space Model a query is treated just as another document, and both are represented as vectors in the term space.

Each document (or query) is represented by a vector in an $M$-dimensional space, where $M$ is the number of index terms:

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{M,j})$$

$$q = (w_{1,q}, w_{2,q}, \ldots, w_{M,q})$$

Each term is identified by a unit vector $t_i = (0, 0, \ldots, 1, \ldots, 0)$ pointing in the direction of the $i$-th axis (orthogonality assumption). The set of vectors $t_i$, $i = 1, \ldots, M$ forms a canonical basis for the Euclidean space $\mathbb{R}^M$.

Any document vector $d_j$ can be represented by its canonical basis expansion:

$$d_j = \sum_{i=1}^{M} w_{i,j} \cdot t_i \tag{2.2}$$

If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting (see previous section). The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus). In my research the distinct tags x in the corpus are adopted as the basis of the VSM, each city (which can be seen as a document) is represented as a vector in this space (see Chapter 3.7.1), and the weights are computed using the tf-idf approach (see 3.6.3). Vector operations can be used to compare documents with queries; in particular, it is often assumed that documents that are close to each other in the vector space are similar to each other. Using this assumption in a keyword search, the relevance rankings of documents to a given query can be calculated.

⁴⁶ http://www.tcnj.edu/~mmmartin/EThul/SMART/smart-pres.pdf


Cosine Similarity

The Vector Space Model computes the similarity score $SC(q, d_j)$ between the query and each document, and produces a ranked list of documents. There are various measures that can be used to assess the similarity between documents (Euclidean distance, inner product, Jaccard and Dice similarity, ...) but the most used is the cosine similarity[28].

Using the assumption that documents that are close to each other in the vector space are similar to each other, it measures the similarity (not the distance) between two vectors by measuring the cosine of the angle between them.

The result of the cosine function is equal to 1 when the angle is 0, and it is less than 1 when the angle has any other value.

Calculating the cosine of the angle between two vectors thus determines whether two vectors are pointing in roughly the same direction. The cosine measure normalizes the results by considering the length of the document vector and, for normalized vectors, the cosine similarity is equal to the inner product:

$$SC(q, d_j) = \cos(\alpha) = \frac{q^T d_j}{|q|\,|d_j|}$$

Documents are thus ranked by comparing the deviation of angles between each document vector and the original query vector, where the query is represented as the same kind of vector as the documents.
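The following sketch shows how the cosine score can be computed and used for ranking. Vectors are stored as sparse dictionaries (term to weight); the function names and the weights in the example are illustrative assumptions, not values from the thesis dataset.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors {term: weight}."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up weights for three cities.
venice = {"canal": 0.3, "gondola": 0.4}
others = {
    "amsterdam": {"canal": 0.3, "bike": 0.8},
    "rome": {"church": 0.3, "colosseum": 0.4},
}
ranking = rank(venice, others)
print(ranking)
```

With these weights Amsterdam ranks above Rome for the Venice query, because it shares the "canal" dimension while Rome shares none and therefore scores 0.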

Figure 2.5: A three-dimension example of the Vector Space Model


2.3.3 VSM Related Issues

The Vector Space Model was first introduced in the 1960s, but it is still widely used nowadays (with all its improvements) in several Information Retrieval applications. Nevertheless, there are some issues to take into consideration when adopting this model to compute a ranked list of document similarities to a given query (the aim of this research):

1. Search keywords must precisely match document terms, because word substrings might result in a false positive match (e.g. “car” vs “cartoon”).

2. Semantic sensitivity: documents with a similar context but a different term vocabulary will not be associated, resulting in a false negative match.

3. The order in which the terms appear in the document is lost in the vector space representation.

4. The assumption of term independence rarely reflects real-world documents.

5. Impossibility of formulating “structured” queries, meaning that it is not possible to use operators such as OR, AND, NOT, etc.

6. Terms represent axes, which means we can reach a high number of dimensions.

7. Each document is represented by the weights of its terms with respect to the dictionary (or corpus) used. Each document contains only a fraction of the terms in the corpus, resulting in a sparse Term-Document matrix (i.e., inefficient for the calculation).

In order to address point 7, different storage schemes are usually used. For example, an Inverted Index storage scheme is usually adopted to avoid storing a sparse term-document matrix (i.e. in order not to store the positions with “0”). In addition, different text preprocessing methods are performed before computing the term weights to reduce the dimensionality (point 6) and the semantic sensitivity (point 2), by reducing inflectional forms and derivationally related forms of a word to a common base form. Usually in the VSM this is done using:

• Stemming: a heuristic process that chops off the ends of words[18, 23].

• Lemmatization: an accurate process that uses a dictionary and a morphological analysis of words.
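As an illustration of the inverted index mentioned for point 7, the sketch below stores only the non-zero weights, grouped by term, and uses them to accumulate query-document inner products. The function names and the example weights are assumptions made for this example, not the storage scheme actually used in the thesis implementation.

```python
from collections import defaultdict


def build_inverted_index(doc_weights):
    """Turn {doc: {term: weight}} into {term: {doc: weight}},
    keeping only the non-zero entries of the term-document matrix."""
    index = defaultdict(dict)
    for doc, weights in doc_weights.items():
        for term, weight in weights.items():
            if weight != 0:
                index[term][doc] = weight
    return index


def inner_products(query, index):
    """Accumulate the query-document inner product, touching only
    the documents that share at least one term with the query."""
    scores = defaultdict(float)
    for term, query_weight in query.items():
        for doc, doc_weight in index.get(term, {}).items():
            scores[doc] += query_weight * doc_weight
    return dict(scores)


# Made-up weights; the zero entry is dropped from the index.
doc_weights = {
    "venice": {"canal": 0.5, "gondola": 0.4, "bike": 0.0},
    "amsterdam": {"canal": 0.25, "bike": 0.5},
}
index = build_inverted_index(doc_weights)
print(inner_products({"canal": 1.0, "bike": 2.0}, index))
```

Documents that share no term with the query never appear in the result, which is where the saving over a dense term-document matrix comes from.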


2.4 Related Works

Several studies have been conducted on extracting knowledge from the georeferenced metadata of Flickr photo collections. In the following, the most relevant and representative works on this topic are briefly described to help the reader better understand my approach.

2.4.1 Wormholes

In “Finding Wormholes with Flickr Geotags”[3] Maarten Clements⁴⁷ et al. propose a kernel convolution method to predict similar locations (wormholes) based on human travel behaviour.

A wormhole is defined as a pair of similar, but not necessarily spatially close, locations on the planet. Their hypotheses can be summarized as follows:

• users have a specific travel preference and therefore visit locations that are to some extent similar

• taking a photo at a visited location is an indication that the user likes that location

Based on these hypotheses, the aggregated travel data of many users should be able to reveal which locations are most similar to a given query location. In photo sharing websites like Flickr, users can upload pictures, indicate their geographical location and annotate them with text-based tags. The method combines these two pieces of information in a prediction of similar locations.

Dataset

Using the public API of Flickr the research group collected the top-100 most popular localities (cities, parks, etc.) for each day in 2008. The aggregated data contains 8,643 places. To retrieve the geotagged data, they repeatedly followed the procedure:

• Select a location l from the full distribution, with a probability relative to its global popularity in 2008

• Get a photo il from this location

• Get all the photos from the user who took il

Following this strategy, they collected the tags and their relative coordinates for 36,264 users. Together these users have uploaded 52,425,279 photos, of which almost 23 million have been geotagged (see Figure 2.6).
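The sampling procedure above could be sketched as follows. This is only an illustration on in-memory toy data: the function `crawl`, its parameters and the dictionaries are hypothetical, no Flickr API call is made, and the way the original authors picked a photo within a location is not specified here, so a uniform choice is assumed.

```python
import random


def crawl(popularity, photos_by_location, photos_by_user, rounds, seed=0):
    """Repeat: pick a location with probability proportional to its
    popularity, pick one photo taken there (uniformly, by assumption),
    then collect every photo of the user who took it."""
    rng = random.Random(seed)
    locations = list(popularity)
    weights = [popularity[loc] for loc in locations]
    collected = {}
    for _ in range(rounds):
        location = rng.choices(locations, weights=weights, k=1)[0]
        photo = rng.choice(photos_by_location[location])
        user = photo["owner"]
        collected[user] = photos_by_user[user]
    return collected


# Toy data standing in for the results of API queries.
popularity = {"paris": 5, "everest": 1}
photos_by_location = {
    "paris": [{"id": 1, "owner": "alice"}],
    "everest": [{"id": 2, "owner": "bob"}],
}
photos_by_user = {
    "alice": [{"id": 1}, {"id": 3}],
    "bob": [{"id": 2}],
}
users = crawl(popularity, photos_by_location, photos_by_user, rounds=10)
print(sorted(users))
```

Sampling proportionally to popularity means that users who photograph popular places are more likely to be collected, which is worth keeping in mind when interpreting the resulting distribution.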

⁴⁷ http://homepage.tudelft.nl/5q88p/


Figure 2.6: Geotagged World photos distribution collected for the “Wormholes” research

Wormhole Detection

Given a target location $L$, the algorithm aims to find the most similar locations around the world. For each user $u$, a weight $W_{L,u}$ is computed based on the distance of the nearest geotagged photo of the user to the target location, weighted by a normal distribution:

where the standard deviation $\sigma$ of the normal distribution determines how many users are considered relevant for the target location and can therefore be used as a scaling parameter, and $d(L, G_{u,i})$ computes the Euclidean distance between the $i$-th geotag of a user, $G_{u,i}$, and $L$.
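Since the exact expression of the weight is not reproduced in this text, the sketch below only illustrates the idea under a stated assumption: the weight is taken to be an unnormalized Gaussian of the distance between the target location and the user's nearest geotag. The function `user_weight`, the planar coordinates in kilometres and the example values are all hypothetical.

```python
import math


def user_weight(target, geotags, sigma):
    """Assumed form: exp(-d_min^2 / (2 * sigma^2)), where d_min is the
    Euclidean distance from the target to the user's nearest geotag.
    Points are (x, y) pairs in a planar system, e.g. kilometres."""
    if not geotags:
        return 0.0  # a user without geotags carries no weight
    d_min = min(math.dist(target, geotag) for geotag in geotags)
    return math.exp(-(d_min ** 2) / (2 * sigma ** 2))


target = (0.0, 0.0)
near_user = [(30.0, 40.0), (500.0, 0.0)]   # nearest geotag is 50 km away
far_user = [(300.0, 400.0)]                # nearest geotag is 500 km away
print(user_weight(target, near_user, sigma=50.0))
print(user_weight(target, far_user, sigma=50.0))
```

A larger `sigma` lets users whose nearest photo is further away still contribute, which matches its role as a scaling parameter in the description above.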


The weighting function slowly decays when the nearest geotag⁴⁸ is found further from the target location $L$:

The wormholes are found by aggregating the geotags of all users with $W_{L,u}$ as weight per user. The aggregated user information is convolved with a Gaussian kernel to create a smooth prediction profile. The difference between the resulting profile and a distribution based on all users gives a score that indicates the relevance of each position on earth with respect to the target location $L$.

Method Evaluation

Maarten Clements et al. have shown that geotags can effectively be used to predict similar locations with high precision. Predicting the wormholes from Mount Everest clearly shows the similar locations: ‘The Rocky Mountains’, ‘Mount Kilimanjaro’, ‘The Scottish Highlands’ and some other mountain ranges in Indonesia, Japan and New Zealand. Many of the urban areas in the USA and Europe are predicted to have a negative relation with Mount Everest. Figure 2.7 shows the wormholes prediction for Mount Everest; positive predictions are blue, negative ones red. Further information can be found by visiting the project web page⁴⁹.

Figure 2.7: Wormholes detection from Mount Everest with σ= 50 km.

⁴⁸ the term is used meaning the textual tag carrying also the georeference information associated to the photo that it describes

⁴⁹ http://homepage.tudelft.nl/5q88p/wormholes/


2.4.2 World Explorer

In “World Explorer: Visualizing Aggregate Data from Unstructured Text in Geo-Referenced Collections” [1], researchers from Yahoo! Research Berkeley50 show how to analyze the tags associated with geo-referenced Flickr images to generate aggregate knowledge in the form of “representative tags” for arbitrary areas in the world. They used these tags to create a visualization tool, “World Explorer”51, that helps reveal the content of the analyzed data, using a map interface to display the derived tags and the original photo items.

The data analysis of the system is based on multi-level clustering and TF-IDF-based (term frequency, inverse document frequency) scoring of tags. The visualization exposes, for each map region and zoom level, the high-scoring tags for the generated clusters; these tags are shown as text over the map area where each cluster occurs.

There are challenges in analyzing and visualizing such unstructured user-contributed data. Issues of noise (e.g. photos with tags that are not relevant to the location) and errors (e.g. photos that are geotagged incorrectly) abound in the Flickr data. The algorithm tries to handle these in a graceful manner.

There are also considerations of scale, especially as the amount of data increases. To summarize, the contributions of the work are:

• An approach for deriving meaningful data from unstructured text associated with geo-referenced collections.

• A sample application that derives such information from Flickr geo-tagged images.

• A visualization technique for large-scale geo-referenced photo collections, that allows automatically-derived and effective world exploration via photos and maps.

Only the first two points will be explained, because they are the ones relevant to my work.

50 http://research.yahoo.com/Yahoo_Research_Berkeley

51 http://tagmaps.research.yahoo.com/worldexplorer.php


Figure 2.8: The World Explorer Map for a large scale of details

Figure 2.9: The World Explorer Map for a narrow scale of details for the City of Rome in Italy


Dataset

The dataset used in the research consisted of 6 million public geo-tagged photographs on Flickr (data collected in October 2006). Almost 90% of the 6 million photos were associated with user-entered tags. While the heaviest concentration of geotagged photos was found in the United States and Western Europe, the dataset had very wide coverage and included photos from almost every country in the world. The researchers excluded from their analysis photographs which had an accuracy lower than 10 (approximately city level). Other methods were also used to reduce the noise in the data, such as the weighting of tags described below.

Data Model and Objectives

The dataset consists of three basic elements: photos, tags and users. Using such data, the algorithms find the tags that are most “representative” for each given geographical area G. It is important to note that these representative tags are often not the most commonly used tags within the area under consideration. Instead, the aim is to extract tags that uniquely define sub-areas within the area in question. For example, if the user is examining a portion of the city of Rome, there is little advantage in showing the user the “Roma” or “Rome” tags, even if these tags are the most frequent. Instead, it is useful to show tags such as “Pantheon”, “San Pietro”, “Villa Borghese”, which uniquely represent specific locations, landmarks and attractions within the city. The first step in determining the “representativeness” of a tag is to have an intuition of what the term implies. While there are no formal models to define how representative a tag is for an area, Mor Naaman et al. followed some simple heuristics that guided the design of the algorithms. The heuristics attempt to capture human attention and behavior as reflected in the photo and tag dataset, and include the notions that:

• The number of photographs taken in a location is an indication of the relative importance of that location.

• The importance of a location increases with the number of individual photographers that have taken photos there.

• Users are likely to use a common set of tags to identify the objects/events/locations that occur in photographs of a specific location.

• Tags that occur in a concentrated area (and do not occur often outside that area) are more representative than tags that occur diffusely over a large region.

• The more users that used a tag in an area, the more representative the tag is for that area.

Based on the data, the requirements from the analysis are therefore:

• Identify important regions for every map region and zoom level.

• Select representative tags for the identified regions.

Computing Tags for a Geographic Area

The analysis starts by assuming that the system considers a single given geographic area G, and the photos that were taken in this area, $P_G$. The system attempts to extract the representative tags for an area G. This computation is done in two main steps:

• cluster the set of photos PG using the photos’ geographic locations.

• assign scores to the tags in each cluster.

For the first step a K-means approach was used in order to cluster photos within an area G based on the photos’ latitude and longitude. Once the clusters have been determined, the system scores each cluster’s tags to see if it is possible to extract representative tags for each cluster C. Each cluster represents a set of tags. The system ranks each tag x in the set so that the top tags are, according to the defined heuristics, the most representative tags.

The main factor used for scoring is a TF-IDF approach that assigns a higher score to tags that have a larger frequency within a cluster compared to the rest of the area under consideration (based on the assumption that the more unique a tag is for a specific cluster, the more representative the tag is for that cluster).


TF-IDF Weighting

The TF-IDF is computed with a slight deviation from its regular use in Information Retrieval.

The term frequency tf(x) for a given tag x within a cluster C is the count of the number of times x was used within the cluster.

The inverse document frequency for a tag x is usually derived from the ratio between the total number of documents (i.e. clusters) in the area G under consideration and the number of those documents that contain photos with this tag. In this particular context, the authors modified the measure to consider the overall ratio of the tag x amongst all photos in the region G under consideration: $idf(x) = |P_G| / |P_{G,x}|$. They modified the standard idf because of the small number of clusters obtained for each area: checking only for the presence of a tag within a cluster, as the standard idf does, could cause large changes in the TF-IDF weights if even a single photograph in the cluster contained the tag. They also consider only a limited set of photos ($P_G$) for the IDF computation, instead of using the statistics of the entire dataset. This restriction to the current area G makes it possible to identify local trends for individual tags, regardless of their global patterns.

As mentioned in the section above, there is a good chance of encountering noise in the Flickr data. One scenario is a single photographer who takes a large number of photographs using the same tag. To guard against this, the researchers included a user element in the scoring, which also reflects the heuristic that a tag is more valuable if a number of different photographers use it (as a further guard, they assign a score of 0 to any tag that was used by fewer than 3 photographers in a given cluster). In particular, the factor is the percentage of photographers in the cluster C that use the tag x: $Uf(x) = U_{C,x} / U_C$.

Finally, the score for tag x in cluster C is computed by $Score(C, x) = tf \cdot idf \cdot uf$. For each cluster only the tags that score above a certain threshold are retained and considered representative.
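The scoring just described can be sketched in Python as follows. This is a minimal illustration, not the implementation of the original paper: the function name `score_cluster_tags`, the representation of a photo as a `(user_id, tags)` pair and the assumption that the cluster photos are a subset of the area photos are all introduced here for the example.

```python
from collections import Counter


def score_cluster_tags(cluster_photos, area_photos, min_users=3):
    """Score the tags of one cluster with tf * idf * uf.

    cluster_photos -- list of (user_id, set_of_tags) for the cluster C
    area_photos    -- list of (user_id, set_of_tags) for the whole area G;
                      assumed to include the cluster photos
    min_users      -- tags used by fewer photographers get a score of 0
    """
    tf = Counter()          # how many times each tag is used in the cluster
    users_per_tag = {}      # photographers using each tag in the cluster
    cluster_users = set()   # all photographers in the cluster
    for user, tags in cluster_photos:
        cluster_users.add(user)
        for tag in tags:
            tf[tag] += 1
            users_per_tag.setdefault(tag, set()).add(user)

    # |P_{G,x}|: number of photos in the area carrying each tag.
    area_count = Counter(tag for _, tags in area_photos for tag in tags)
    total_area_photos = len(area_photos)  # |P_G|

    scores = {}
    for tag, freq in tf.items():
        n_users = len(users_per_tag[tag])
        if n_users < min_users:
            scores[tag] = 0.0
            continue
        idf = total_area_photos / area_count[tag]
        uf = n_users / len(cluster_users)
        scores[tag] = freq * idf * uf
    return scores
```

In such a sketch a tag like “rome”, which is frequent everywhere in the area, receives a lower idf than a tag concentrated in one cluster, while a tag used by a single prolific photographer is zeroed by the `min_users` guard.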

The other parts of the algorithm (indexing, storage and retrieval) are not relevant for my work and will not be explained. For further information, the original paper presents deeper details and explanations [1].

Chapter 3

My Approach

In this chapter the theoretical approach of this work is explained. The chapter begins with the formalization of the hypotheses through the assumptions on which the entire work is based. In particular, the concept of place similarity is clarified as intended in the scope of this research. Then the two different datasets used are presented, explaining why they are needed and why two different analyses were performed. Next, the term definitions are listed to give the reader a clear overview of the symbols used in this work. The chapter continues by discussing the core of the research, that is, the TF-IDF weighting methods and the Vector Space Model representation. Finally, the approaches to score and retrieve similar locations are presented.


3.1 Approach Introduction

The objective of the research is to find place similarity by analyzing the aggregated georeferenced tags from a subset of Flickr photos. The objectives are similar to those presented in the description of the “Wormholes” research (2.4.1), but I use a TF-IDF approach similar to the one described in the “World Explorer” research (2.4.2) in order to extract representative tags describing each place. Once a representative set of tags describing each place was obtained, I used an Information Retrieval method (the Vector Space Model) to compute a ranked list of places according to their “similarity”. Finally, I could retrieve a list of places similar to a location given as a query term by analyzing the pre-computed ranked list.

I can summarize my research in the following steps:

1. Extract a representative description for each city

2. Find similarities between these representations and store the ranked list of place similarities

3. Given a city as a query term, find the most similar one by analyzing the previously calculated ranked list

I analyzed the distribution of the textual tags of Flickr geotagged photos from the union of the top 150 tourism city destinations for the years 2007 and 2008 (see footnote 1), for a total count of 233 cities.

1 According to “Euromonitor International”: http://www.euromonitor.com/Top_150_City_Destinations_London_Leads_the_Way


3.2 Assumptions

I based my research on assumptions derived from the two main related works described in Chapters 2.4.1 and 2.4.2. For both datasets and their two related analysis methods (see Chapter 3.3) the following assumptions hold:

1. Users have a specific travel preference and therefore visit locations that are to some extent similar.

2. Taking a photo at a visited location is an indication that the user likes that location.

3. Users are likely to use a common set of tags to identify the objects/events/locations that occur in photographs of a specific location.

4. The extracted representative set of tags for a place provides a representative description of that place.

5. A place is defined as “similar” to another (not necessarily spatially close locations on the planet) if their representative descriptions are similar.

6. The more users used a tag in an area, the more representative the tag is for that area (tf term).

7. Tags that occur in a place (and do not occur often outside that place) are more representative than tags that occur diffusely over all the places (idf term).

8. The tags are assumed to be mutually independent (VSM condition).


3.3 Datasets

I used the public Flickr API to retrieve the data I worked on. Flickr offers an API that returns the top 100 most frequent tags for a given place (see 4.2.2). This offered an easy way to retrieve the data. Unfortunately, this API does not give any extra information about the users who took the photos or tagged them, nor about the photos and their exact coordinates. Also, in my opinion, a collection of only 100 tags cannot properly describe a wide area such as a city, because of the high probability of obtaining trivial results. For this reason I decided to follow two different experimental analyses: one using this dataset (with all its limitations), and another one using a subset of the actual geotagged photo distribution, obtained by randomly sampling the geotagged photos of each place, based on the hypothesis that a random sample of the original distribution yields another distribution resembling some of its characteristics.

Dataset “Top 100”

For the reasons mentioned above, this dataset, which I will call “Top 100”, consists of a list of the 100 most common tags for each of the 233 cities analyzed, together with their relative frequencies (total count 23,300 tags).

Dataset “Random”

I obtained the “Random” dataset by performing a random sampling in order to retrieve a collection of photos (and related tags), trying to overcome the apparent triviality of the “Top 100” dataset. I performed a random sampling (over the time parameter) mainly for two reasons.

First, I did not have the time or the resources to retrieve all the geotagged photos for each place. Flickr does not support aggregate query functions like “count” in SQL-like languages, so there is no way to know in advance the number of geotagged photos for each place. Anyway, in order to give an idea of the magnitude, we can observe that for the city of London the tag “london” alone appears with a frequency of 1.2 million (observed from the “Top 100” dataset), which means that there are, as a large underestimate, at least 1.2 million geotagged photos for the city of London alone. Multiplying these rough estimates by 233 different cities clearly shows that retrieving such an amount of data was impossible in the time I had available to carry out this research.


The second reason to perform a random sampling comes from experimental tests in which I retrieved geotagged photos for a fixed period of time across all the places. From these tests I recognized that data biases existed due to particular events in that period in one or more of the 233 places (e.g. a parade, elections, a rock concert), so that such a dataset would not represent the characteristics of the actual geotagged photo distributions.

Using the “Random” dataset I could obtain much more useful information not available with the “Top 100” dataset, such as the user who tagged a particular photo. Having this information was essential in order to introduce a further assumption:

• The importance of a tag for a place increases with the number of individual photographers that use it in that place [1] 2 (Uf term).

In other words, a tag is more valuable if a number of different photographers (the users who tag) use it.

NOTE: the details and the conditions of the “Random” sampling will be explained in Chapter 4.2.2.

2 Modified from the assumptions of the “World Explorer” research: Chapter 2.4.2


3.4 Definitions

• A tag is represented by $t_i$

• The union of all the retrieved tags $t_i$ for all the cities represents a corpus C

• The corpus for the “Top 100” distribution is defined as $C_{100}$

• The corpus for the “Random” distribution is defined as $C_{Rnd}$

• The set of the 233 places (cities) under analysis is defined as G

• Each place (a city) in G is represented by P

• A retrieved photo for the “Random” dataset is defined as f

• The set of all the retrieved photos for the “Random” dataset is defined as F

• A user who tagged a photo in the “Random” dataset is defined as u


3.5 Flickr: from Narrow to Broad Folksonomy

A particular aspect of this research captured my attention. As mentioned in 2.1, Flickr represents a narrow folksonomy, so we can hardly extract trends and usage patterns. This is true if we consider the distribution of tags across the photos, but in this case we are considering the distribution of tags belonging to photos that are aggregated by the place where they were taken. This means that, by tagging a geotagged photo with text-based terms, the user is describing not only the photo but also the place where the photo was taken (Assumption 3). In other words, if we take as a hypothesis that the place (city) is the object being tagged, by aggregating the tags of the photos taken in that place, we can infer that under this consideration Flickr is a broad folksonomy, because more than one user can re-use the same terms for describing the object (the city). This can be shown empirically by plotting a tag distribution for one of the places under study, for example New York:

Figure 3.1: Tag distribution for the city of New York using the “Top 100” dataset

The distribution clearly presents the power law curve (see Chapter 2.1) typical of broad folksonomies [7].


We can now consider the same distribution obtained from the “Random” dataset:

Figure 3.2: Tag distribution for the city of New York using the “Random” dataset, truncated to the 150th tag

We again obtained a power law curve, the only differences being a scale factor3 and a wider tag vocabulary. This also shows that the “Random” sampling distribution has to be considered as valid as the “Top 100” distribution.

3 The scale factor is not relevant because of the scale-invariant property of the power law curve, Chapter 2.1.1


We can now observe the distribution on a log-log scale:

Figure 3.3: Tag distribution for the city of New York using the “Random” dataset, truncated to the 150th tag, using a log-log scale

In Figure 3.3 we can identify a roughly linear trend, which indicates (with all due approximations) that Figure 3.2 shows a power law curve.
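The visual check on the log-log plot can also be expressed numerically. The following Python sketch is an illustration added here, not part of the analysis performed in this work: it fits a straight line to log(frequency) against log(rank) by ordinary least squares and reports the slope and the coefficient of determination. The function name `loglog_slope` is hypothetical, and the sketch assumes at least two distinct positive frequencies.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope and R^2 of log(frequency) against log(rank).

    frequencies -- tag counts for one place, in any order; zero counts are ignored.
    A slope close to a negative constant with R^2 near 1 is compatible
    with a power law (it does not prove one).
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared
```

For an ideal power law with frequency proportional to 1/rank, the slope returned by such a fit is close to -1; real tag distributions are expected to deviate from a perfect line, especially in the tail.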

NOTE: The hypothesis that the “Random” distribution presents some of the characteristics of the real one will not be proven, because this is not the aim of this research. Such a “Random” distribution has been used only for lack of resources to obtain the real one, and it will be considered valid for this scope.


3.6 Extract a representative description

Assumption 5 states that “A place is defined as “similar” to another (not necessarily spatially close locations on the planet) if their representative descriptions are similar”. So the first step in retrieving similar places is to extract their representative descriptions and then compare them in a further step. The union of all the retrieved tags for all the cities represents a corpus C, and each city is described using all the terms (tags) in the corpus C. Extracting the representative description of a city means giving a weight to each tag in the corpus, identifying how “important” it is in describing each place. The “importance” score has been computed using a TF-IDF weighting.

3.6.1 Vector Space Model Representation

In the Vector Space Model (see Chapter 2.3.2) text documents are represented as vectors in a multidimensional space of index terms. Each dimension corresponds to a separate term, and the model is based on the assumption of independence between terms (Assumption 8). The objective of this research is to find similarities between geographical places, but if we consider Assumption 5 (Chapter 3.2) we can reduce the problem to finding similarities between their descriptions. As a further step, we can recall that each place (city) is described using all the terms (text-based tags) in the corpus C, taken with different weights depending on the “importance” of a tag in describing each place (see TF-IDF in Chapter 3.6.3). Under these considerations the description of a place is nothing more than a text document, and since we want to find similarities between the descriptions of the places, the overall objective is reduced to finding similarities between textual documents. Hence, normal text-based Information Retrieval models can be applied, so I decided to adopt the Vector Space Model to represent each place in G.


3.6.2 Problem Decomposition

Since the objective of the research may not be clear with the adoption of the VSM, I will explain a step-by-step problem decomposition:

• each place P is described by its representative description (Assumption 4, Chapter 3.2);

• a place description is, in turn, represented in the VSM by a vector V in an M-dimensional space;

• the basis of the VSM is given by the M distinct tags occurring in the corpus C;

• the components of the generic vector V are the weights obtained by the TF-IDF approach (Chapter 3.6.3);

• the place similarity problem is reduced to the similarity of their descriptions (Assumption 5, Chapter 3.2);

• the description similarity problem is reduced to the vector similarity problem;

• the place similarity is calculated by computing the vector similarity.

From now on we can refer to the generic place P as its description (see next section). The Vector Space Model is based on the assumption that documents (places) that are close to each other in the vector space are similar to each other. Since the query and the documents are treated in the same way and represented in the same space, in order to compute a similarity score between a query and a document a distance or a similarity function is applied between the vectors representing the documents and the query.

In my case I want to compute the similarity scores between a place P in the set G (considered as the query) and all the other places in G (considered as the documents). As a consequence we can find the place similarity for each place in G with respect to all the others by adopting a similarity measure to estimate the similarity (not the distance) between the vectors. The cosine similarity has been adopted for computing a ranked list of similarities for each place P with respect to all the other places in G.


3.6.3 TF-IDF Weights

Each place P in G is described by all the tags t in the corpus C, taken with different weights. The TF-IDF approach was used to score each tag $t_i$ in C according to the “importance” that the tag $t_i$ has in describing the place P (based on Assumptions 6 and 7, Chapter 3.2). The TF-IDF assigns a higher score to tags that have a larger frequency within a city compared to the rest of the places (term frequency factor). The term frequency factor is the same for both datasets and analyses, while the inverse document frequency is calculated with some heuristic deviations from its standard usage for the “Random” dataset (and related analysis), in order to find the best weighting scheme to represent places in the VSM.

TF

The term frequency $tf_{i,j}$ for a given tag $t_i$ within a place $P_j$ is the count of the number of times $t_i$ was used within the place $P_j$. Depending on the dataset under analysis, $t_i$ is a tag in the $C_{100}$ or in the $C_{Rnd}$ corpus, respectively.

IDF for the “Top 100” Dataset

The inverse document frequency for a tag $t_i$ in the corpus $C_{100}$ is represented by the ratio of the number of documents (i.e. cities) in the set with respect to the number of documents that contain this tag. This represents the heuristic that “tags that occur in a place (and do not occur often outside that place) are more representative than tags that occur diffusely over all the places”. In other terms, the standard $idf_i$ is:

$$idf_i = \log_2 \frac{|G|}{|P_i|}$$

where |G| is the number of places used in the analysis (233) and $|P_i|$ is the number of places containing the term $t_i$.
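As a minimal sketch of the standard idf defined above, the following Python function computes it from a mapping of places to their tag counts. The data layout (`place -> {tag: count}`) and the function name `standard_idf` are assumptions made for this illustration only.

```python
import math


def standard_idf(place_tags):
    """Standard idf over places: log2(|G| / |P_i|).

    place_tags -- dict mapping a place name to a dict {tag: count}
    Returns a dict {tag: idf}.
    """
    n_places = len(place_tags)            # |G|
    places_with_tag = {}                  # |P_i| for each tag
    for tags in place_tags.values():
        for tag in tags:
            places_with_tag[tag] = places_with_tag.get(tag, 0) + 1
    return {tag: math.log2(n_places / n) for tag, n in places_with_tag.items()}
```

A tag that occurs in every place gets an idf of 0 and therefore contributes nothing to any place description, while a tag found in few places gets a high idf.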


IDF for the “Random” Dataset

The “Random” dataset carries extra information (photos and users) that was used to compute two inverse document frequency deviations from the standard Information Retrieval usage, in order to heuristically find the best weighting scheme to represent places in the VSM. In the “Random” analysis the generic tags $t_i$ are in the $C_{Rnd}$ corpus and places are described using the terms in $C_{Rnd}$.

1) First, the standard $idf_i$ was calculated:

$$\text{idf-rnd1}_i = \log_2 \frac{|G|}{|P_i|}$$

where |G| is the number of places used in the analysis (233) and $|P_i|$ is the number of places containing the term $t_i$ ($t_i$ in $C_{Rnd}$).

2) For the first idf deviation I adopted one of the heuristics introduced in the “World Explorer” research (Chapter 2.4.2), exploiting the extra information coming from the photos. In this case I estimate the overall importance of a tag $t_i$ across the photo collection F and not across the set of places G:

$$\text{idf-rnd2}_i = \frac{|F|}{|f_i|}$$

where |F| is the total number of photos in the “Random” dataset and $|f_i|$ is the number of photos containing the tag $t_i$. This measure has been introduced in an attempt to avoid large changes in the TF-IDF weights if even a single photo in the whole dataset contained the tag $t_i$.

3) For the second modified idf formula I followed the extra assumption that “The importance of a tag for a place increases with the number of individual photographers that use it in that place” for this dataset (Chapter 3.3). We refer to photographers as the users who tagged the pictures, hence who contributed to the “descriptions” of the places. To compute the “importance” of a tag in describing each place under the assumption mentioned above, I introduce a further measure, the user factor:

$$Uf_{i,j} = \frac{|u_{i,j}|}{|u_j|}$$

where $|u_{i,j}|$ is the number of users tagging with the tag $t_i$ in the place $P_j$, while $|u_j|$ is the number of distinct users who tagged at least one photo in $P_j$.


In this case the inverse document frequency will be:

$$\text{idf-rnd3}_{i,j} = \frac{|f_j|}{|f_{i,j}|}$$

where $|f_j|$ is the number of photos for the place $P_j$ and $|f_{i,j}|$ is the number of photos for the place $P_j$ containing the tag $t_i$. As with idf-rnd2, this measure has been introduced to prevent small groups of photos containing $t_i$ (within the place) from widely changing the tf-idf weights. The idf-rnd3 is not actually an inverse document frequency, because it takes into consideration not only the “importance” of the tag $t_i$ across all the places in G but also introduces a place factor. So it gives the (relative) importance of a tag $t_i$ for the place $P_j$. It looks similar to the term frequency factor (in that it depends on both document and term) but has a very different meaning.
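The photo-based and user-based factors defined for the “Random” dataset can be derived from raw photo records in a single pass. The sketch below is only an illustration under stated assumptions: each record is a `(place, user, tags)` tuple with at least one tag, and the function name `random_dataset_factors` is hypothetical.

```python
from collections import defaultdict


def random_dataset_factors(photos):
    """Compute idf-rnd2, idf-rnd3 and Uf from (place, user, tags) records.

    Returns three dicts:
      idf_rnd2[tag]          = |F| / |f_i|
      idf_rnd3[(tag, place)] = |f_j| / |f_{i,j}|
      uf[(tag, place)]       = |u_{i,j}| / |u_j|
    """
    photos = list(photos)
    photos_with_tag = defaultdict(int)            # |f_i|
    photos_in_place = defaultdict(int)            # |f_j|
    photos_with_tag_in_place = defaultdict(int)   # |f_{i,j}|
    users_in_place = defaultdict(set)             # users counted in |u_j|
    users_with_tag_in_place = defaultdict(set)    # users counted in |u_{i,j}|

    for place, user, tags in photos:
        photos_in_place[place] += 1
        users_in_place[place].add(user)
        for tag in set(tags):  # a tag counts once per photo
            photos_with_tag[tag] += 1
            photos_with_tag_in_place[tag, place] += 1
            users_with_tag_in_place[tag, place].add(user)

    total = len(photos)  # |F|
    idf_rnd2 = {tag: total / n for tag, n in photos_with_tag.items()}
    idf_rnd3 = {key: photos_in_place[key[1]] / n
                for key, n in photos_with_tag_in_place.items()}
    uf = {key: len(users) / len(users_in_place[key[1]])
          for key, users in users_with_tag_in_place.items()}
    return idf_rnd2, idf_rnd3, uf
```

Keeping the per-place counters keyed by `(tag, place)` mirrors the double index i,j used in the formulas.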

3.6.4 Weighting Systems

Finally, I obtained all the necessary measures and data to calculate the weights in the tf-idf approach, recalling that these represent the components of the vectors (the places) in the vector spaces. This means that these weights have to be calculated for all the unique terms occurring in a place (with respect to the corpus used). For each place I computed four different weight measures, organized as follows:

System A) For the “Top 100” dataset only one weight was calculated, following the standard tf-idf (Chapter 2.3.1):

$$\text{W-100}_{i,j} = tf_{i,j} \cdot idf_i, \quad t_i \in C_{100}$$

For the “Random” dataset three different weights were calculated, depending on the idf definitions explained above:

System B) Weight for the “Random” dataset based on the standard tf-idf:

$$\text{W-rnd1}_{i,j} = tf_{i,j} \cdot \text{idf-rnd1}_i, \quad t_i \in C_{Rnd}$$

System C) Weight for the “Random” dataset based on the photo factor (see point 2 in Chapter 3.6.3):

$$\text{W-rnd2}_{i,j} = tf_{i,j} \cdot \text{idf-rnd2}_i, \quad t_i \in C_{Rnd}$$

System D) Weight for the “Random” dataset based on the user factor (see point 3 in Chapter 3.6.3):

$$\text{W-rnd3}_{i,j} = tf_{i,j} \cdot \text{idf-rnd3}_{i,j} \cdot Uf_{i,j}, \quad t_i \in C_{Rnd}$$
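To make the difference between the weighting systems concrete, the following Python sketch computes the weights of a single tag in a single place from the raw counts that appear in the formulas. It is a worked illustration only: the function name `tag_weights`, its parameter names and the numbers in the usage note are hypothetical and do not come from the datasets of this work.

```python
import math


def tag_weights(tf, n_places, places_with_tag,
                n_photos, photos_with_tag,
                photos_in_place, photos_with_tag_in_place,
                users_in_place, users_with_tag_in_place):
    """Weights of one tag t_i in one place P_j under the three idf definitions.

    "standard" corresponds to Systems A and B (tf * log2(|G| / |P_i|)),
    "photo"    to System C (tf * |F| / |f_i|),
    "user"     to System D (tf * |f_j| / |f_{i,j}| * |u_{i,j}| / |u_j|).
    """
    standard_idf = math.log2(n_places / places_with_tag)
    photo_idf = n_photos / photos_with_tag
    place_idf = photos_in_place / photos_with_tag_in_place
    user_factor = users_with_tag_in_place / users_in_place
    return {
        "standard": tf * standard_idf,
        "photo": tf * photo_idf,
        "user": tf * place_idf * user_factor,
    }
```

For example, with tf = 4, 8 places of which 2 contain the tag, 1000 photos of which 10 carry the tag, 50 photos in the place of which 5 carry the tag, and 2 of the 4 users in the place using it, the three weights are 8.0, 400.0 and 20.0. The values of the different systems live on different scales, which is not a problem for the cosine similarity because each system is only ever compared with itself.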


3.6.5 VSM Limitations

The Vector Space Model presents some usage limitations, as already mentioned in Chapter 2.3.3, affecting the performance and the expectations of this research. The limitations that most affected the system in adopting a VSM representation will be analyzed, also in order to understand the overall system evaluation explained in Chapter 5.3. Some countermeasures were adopted in an attempt to overcome some of the disadvantages typical of the VSM.

Blacklist

In a first analysis of the two distributions (“Random”, “Top 100”) I noticed a certain number of repeated tags occurring all along G (the set of the 233 places). These were mainly the most common tags used throughout the Flickr collection, for example very frequent terms like “day”, “dog”, “friends” or photography-related terms such as “canon”, “nikon”, “black&white”. From Assumption 7 (see footnote 4) we know that these tags would not give any extra information for the aim of extracting a representative description for a place P. The TF-IDF weighting could easily detect that these tags are not important for any of the places (idf term), so I could have left them in and let the system do the work. But since every single distinct term occurring in the corpuses defines a dimension in the VSM representation, without pruning these tags the corpuses $C_{100}$ and $C_{Rnd}$ presented a dimensionality problem. Even though these terms are not relevant, they would still be included in the corpuses and then used in the weight and similarity computations, introducing only noise, system complexity and hence performance issues. For all these reasons I decided to prune them.

I heuristically created a blacklist (see Appendix B) based on the most used tags in Flickr (see footnote 5) and personal choices (see also Section 4.3.2.1). I also pruned all the tags occurring in only one city all along G, because in the VSM they represent terms that will never match when computing the similarity scores.
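The two pruning rules described above (blacklisted tags and tags occurring in only one city) can be sketched as follows. The data layout (`place -> {tag: count}`) and the function name `prune_tags` are assumptions made for this illustration, and the blacklist passed in the usage is a placeholder, not the list of Appendix B.

```python
def prune_tags(place_tags, blacklist):
    """Remove blacklisted tags and tags that occur in only one place.

    place_tags -- dict mapping a place name to a dict {tag: count}
    blacklist  -- set of tags to discard everywhere
    Returns a new dict with the same places and the surviving tags.
    """
    # Number of places in which each tag occurs.
    places_with_tag = {}
    for tags in place_tags.values():
        for tag in tags:
            places_with_tag[tag] = places_with_tag.get(tag, 0) + 1

    return {
        place: {
            tag: count
            for tag, count in tags.items()
            if tag not in blacklist and places_with_tag[tag] > 1
        }
        for place, tags in place_tags.items()
    }
```

Tags occurring in a single place can never contribute to a dot product between two different places, so dropping them reduces the dimensionality without removing any matching term (it does, however, change the vector norms).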

4 Chapter 3.2

5 http://www.flickr.com/photos/tags/


Inverted Index

Considering how the weights are calculated using the general TF-IDF weighting (previous section), we can easily observe that if a term $t_i$ in the corpus does not occur in a document $d_j$, its term frequency $tf_{i,j}$ will be zero, hence its weight $w_{i,j}$ will be zero. This leads to a large number of zeros inside the “term-document” representation matrix. In other words, the matrix is sparse.

Considering for example the “Random” dataset, I obtained a corpus of 55,000 unique terms, determining as many dimensions. This means that each of the 233 cities analyzed is represented by 55,000 weights, most of them being zeros. This leads to two main problems: computational inefficiency and non-optimal resource allocation. In order to prevent these problems I adopted an inverted index storing scheme. In this way, for each place $P_j$ in G only the weights of the terms having a non-zero term frequency $tf_{i,j}$ are computed; that is, the weights are computed only for the terms (of the respective corpus) occurring at least once in the place description. All the other values are implicitly zero. The inverted index thus allows both computational and storage efficiency.
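A minimal sketch of such a storing scheme in Python is given below. It is not the storage layer actually used in this work (described in Chapter 4); the in-memory dictionary layout and the function name `build_inverted_index` are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_index(place_weights):
    """Build a tag -> {place: weight} index holding only non-zero weights.

    place_weights -- dict mapping a place name to a dict {tag: weight}
    """
    index = defaultdict(dict)
    for place, weights in place_weights.items():
        for tag, weight in weights.items():
            if weight != 0:
                # Zero weights are never stored: absence means zero.
                index[tag][place] = weight
    return dict(index)
```

With this layout, the places that share a given tag can be found with a single lookup, and the memory used grows with the number of non-zero weights rather than with the full 233 × 55,000 matrix.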

Stemming Issue

Usually, in order to obtain a further reduction of the dimensionality and of the semantic sensitivity (a reduction, not a solution), different text preprocessing methods are performed before computing the term weights (see Chapter 2.3.3). These methods reduce inflectional forms and derivationally related forms of a word to a common base form. The most common way to accomplish this is stemming [18, 23].

The standard stemming algorithm for the English language is Porter’s stemmer [23]. Stemming algorithms are language dependent, and Flickr represents a folksonomy (Chapter 2.1), which means that users are allowed to use free-text, unconstrained tags in different languages and alphabets. Even though the most common language used in Flickr tags is English [10], it is not possible to use a stemming algorithm in such a multi-language context. However, different tests have been carried out in order to detect the English-language tags in the corpus, attempting to apply Porter’s stemmer at least to the English terms. The details will be explained in Chapter 4.3.2.2.


3.7 Scoring and Retrieving Similar Places

3.7.1 Scoring

Since there are two different datasets and corpuses, $C_{Rnd}$ and $C_{100}$, their respective distinct tags represent two different bases for two different VSM representations. This means that the places in G are represented in two distinct VSM representations: one uses the “Top 100” dataset (and analysis), considering the generic tags in $C_{100}$, and the other uses the “Random” dataset (and analysis), considering the generic tags in the $C_{Rnd}$ corpus. These two VSM models represent two different and separate analyses that we can call “Top 100 VSM” and “Random VSM”. The VSM representations are actually four, because in the “Random” analysis I introduced three interpretations of the idf; however, these three representations share the same analysis method and dataset.

In Chapter 3.6.1 the problem of similarity between places was reduced to finding vectors close to each other in the VSM representations. We need to adopt a measure in order to quantify such closeness. I decided to adopt the cosine similarity as the measure between vectors. This is not properly a distance measure (see Chapter 2.3.2) but a “similarity” measure. It measures the cosine of the angle between vectors. Calculating the cosine of the angle between two vectors determines whether they are pointing in roughly the same direction, so given two vectors, their similarity is determined by their directions.

The cosine similarity is the inner product between two vectors normalized by the product of their Euclidean lengths (norms). The Euclidean inner product is defined as:

a · b = |a||b|cos(α)

deriving:

similarity(a, b) = cos(α) = (a · b) / (|a||b|)

The cosine measure normalizes the results by considering the length of the (place) vectors. The angle between two vectors cannot be greater than 90° because the tf-idf weights cannot be negative; hence the cosine similarity of two places ranges from 0 to 1.

The cosine similarity is equal to 1 when the angle is 0° (exact match, i.e. a vector compared with itself or with a vector pointing in the same direction), it is between 0 and 1 when the angle has any other value (partial match), and it is 0 when the angle is 90° (no matching terms). This is a convenient way of ranking places by measuring how close their vectors are to a query vector.
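As an illustration only, the cosine similarity defined above can be sketched in a few lines of Python. The thesis implementation was written in PHP; the function name, the dictionary-based sparse vectors and the sample weights below are assumptions made for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {tag: weight} dicts."""
    # Only tags present in both vectors contribute to the inner product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an empty description
    return dot / (norm_a * norm_b)

# Invented, non-negative weights standing in for tf-idf values.
rome = {"colosseum": 2.0, "church": 1.0, "italy": 0.5}
siena = {"church": 1.5, "italy": 0.5, "palio": 2.0}
oslo = {"fjord": 2.0, "opera": 1.0}

partial = cosine_similarity(rome, siena)   # some shared tags: between 0 and 1
disjoint = cosine_similarity(rome, oslo)   # no shared tags: 0
```

Because all weights are non-negative, the returned value always falls between 0 and 1, as stated in the text.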

The similarity scoring between places has been performed by calculating the cosine similarity of each vector (representing a place) with respect to all the other vectors (the other places). The computations have been performed separately for the “Top 100” and “Random” Vector Space Model representations.

NOTE: the similarities are intended between all the places in G (233 cities)with respect to all the others within this set.

3.7.2 Retrieving Similar Places

All the cosine similarities have been computed in advance and stored in a database. In particular, I calculated four different measures of similarity between places. The four measures come from the different VSM representations obtained with the four tf-idf weighting schemes (see Chapter 3.6.3): one measure for the “Top 100” dataset, and three measures for the “Random” dataset, derived from the three idf interpretations:

• System A: from “Top 100” Dataset using a standard weight W-100_{i,j}

• System B: from “Random” Dataset using a standard weight W-rnd1_{i,j}

• System C: from “Random” Dataset using a modified weight W-rnd2_{i,j} (more importance to photos)

• System D: from “Random” Dataset using a modified weight W-rnd3_{i,j} (more importance to users)

The retrieval has been implemented as a simple database query. Since the similarity scores between a city and all the others have been pre-computed, retrieving the places similar to a place P amounts to a query that selects all the scored places in G with respect to P, ordering the results from the most similar (cosine similarity close to 1) to the least similar (cosine similarity close to 0).
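A minimal sketch of such a retrieval query follows. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name, column names, identifiers and scores are all hypothetical. It also assumes that each pair of places is stored only once, so the place P may appear in either column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per unordered pair of places, keyed by WOEID.
conn.execute("CREATE TABLE score (place_a INTEGER, place_b INTEGER, similarity REAL)")
conn.executemany(
    "INSERT INTO score VALUES (?, ?, ?)",
    [(1, 2, 0.61), (1, 3, 0.12), (2, 3, 0.35), (1, 4, 0.48)],  # made-up scores
)

def similar_places(conn, woeid, limit=5):
    """Return (other_place, similarity) rows for one place, most similar first."""
    query = """
        SELECT CASE WHEN place_a = ? THEN place_b ELSE place_a END AS other,
               similarity
        FROM score
        WHERE place_a = ? OR place_b = ?
        ORDER BY similarity DESC
        LIMIT ?
    """
    return conn.execute(query, (woeid, woeid, woeid, limit)).fetchall()

ranked = similar_places(conn, 1)
```

No similarity is computed at query time; the ordering comes entirely from the stored scores.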

Chapter 4

Implementation

In this chapter, the tools, APIs and languages used throughout the software development process of the “Place Similarity” project are presented, together with the results of their usage and some code examples. Later on, an overall view of the infrastructure used to build the system and the web application is given.

4.1 Development Platform

The implemented tool made large use of online APIs, and I started its development, from the first experiments and tests, as an online application. The later steps required heavy computation phases, but I decided to continue the development using languages and platforms typically used for web development, for the sake of compatibility between the components and because the chosen tools allowed me to carry out the work without any extra components or special effort.

The “Web2rism” project has been implemented using a LAMP platform, an acronym standing for Linux, Apache, MySQL and PHP. I decided to undertake the development adopting this platform both on my local machine and as a remote system on the “Web2rism” server hosted at the “Università della Svizzera Italiana” in Lugano (Switzerland).

4.1.1 PHP

As the main programming language I used PHP (version 5.2.4), an open-source general-purpose scripting language. PHP was originally designed for web development to produce dynamic web pages, but it has become a general-purpose scripting language. Used in this way, PHP code is processed by an interpreter in command-line mode, performing the desired operating system operations and writing the program output to its standard output channel. The interpreter is integrated in the LAMP platform. In this way I could create scripts for different purposes (web interface, data gathering, data analysis and computation) adopting a single scripting language.

4.1.2 MySQL

PHP offers good integration with the Apache web server and with several open-source Database Management Systems. MySQL is a popular (maybe the most popular) open-source relational DBMS, largely used in web development for its tight integration with PHP and the Apache server in LAMP systems. It has been used for several tasks in the “Place Similarity” development. All of the tool’s data analyses and computations have been performed offline, that is, pre-computed. MySQL offered a solid structure to store all the data gathered online and, subsequently, the results of the analyses and scoring computations.

4.1.3 Python

PHP integrates well with hundreds of modules and libraries, but although it is considered a general-purpose scripting language it is still strictly connected with the web development community. As a consequence, PHP is poorly integrated with complex mathematical and scientific modules. This is not the case for Python. Python is an interpreted, general-purpose, high-level programming language offering a wide variety of third-party extensions. Python represents a good choice for the scientific community thanks to the support offered by extensions like scipy¹ and numpy², as well as thousands of other scientific-oriented extensions.

¹ http://www.scipy.org/
² http://numpy.scipy.org/

Gensim

During the development phase I used Python for a couple of tests in which this language seemed to be more useful than PHP. In particular, an extension called gensim [26]³ provides a framework for Vector Space Modeling, offering automation and functions for all the typical VSM-related tasks as well as semantic support (Latent Semantic Analysis, Latent Dirichlet Allocation, Random Projections). Unfortunately, compatibility with the “Web2rism” project was fundamental, so I decided to keep working on the LAMP platform for compatibility and integration, giving up this approach. I then started to develop my own version of the VSM models using only PHP.

Another task where Python seemed to be a valid alternative to PHP wasthe language detection of the retrieved tags (See Chapter 4.3.2.2).

Perl

Perl is a high-level, general-purpose, interpreted scripting language. It provides powerful text processing facilities and was used during the development exactly for this feature. For example, it was used for extracting tags from the Flickr web page⁴ listing the most popular Flickr tags. This was done because Flickr does not expose this information through the normal use of the public API.

³ http://nlp.fi.muni.cz/projekty/gensim/
⁴ http://www.flickr.com/photos/tags/

4.2 API & Online Services

Since this research is based on analyses of user-generated content, it made large use of online APIs and services, mainly web services provided by Yahoo. All the online APIs were accessed using PHP through the Client URL Library⁵ (cURL).

4.2.1 Yahoo Geoplanet

The Yahoo GeoPlanet API⁶ (see Chapter 2.2.3) was mainly used to identify places on Earth while avoiding any form of ambiguity. The implemented tool lets users find a specific place using a form, returning a list of places that match the search string to a certain degree (disambiguation, see Chapter 4.4). The user can choose the specific place he is looking for by reading some extra information provided with each result⁷ (Country, Region, etc.). If the chosen place is within the set of analyzed cities (the set is composed of 233 cities), the tool presents the user with a page containing four lists of cities similar to the chosen one. The ambiguity has been avoided by indexing each place in the database with its WOEID.

The WOEID is a unique 32-bit integer identifier assigned to millions of geographical entities. Once each city had been identified, I could refer to it without caring about names, ambiguity, geographic coordinates and bounding boxes. The WOEID was a really important aspect of the project because it is well supported by the Flickr API (next section), which allows retrieving photos by place just by specifying the WOEID.

The main APIs used:

• /places: returns a collection of places that match a specified placename; used for the disambiguation;

• /place/woeid: returns a resource containing the long representation ofa place; used for taking information about each place.

⁵ http://php.net/manual/en/book.curl.php
⁶ http://developer.yahoo.com/geo/geoplanet/guide/
⁷ See Figure 4.5


4.2.2 Flickr

Flickr represented the main datasource on which all this research is based.

“Top 100” Dataset

For the first task of this work, the API flickr.places.tagsForPlace represented an easy way to obtain results for the initial experiments. This API returns a list of the top 100 unique tags for a Where On Earth ID. Since the WOEID was already known for all the 233 cities thanks to the GeoPlanet service (previous section), with this API it was possible to build the “Top 100” dataset (see Chapter 3.3). Unfortunately, this method gives no information about the single photos containing the tags, nor about their authors. This represented a big limitation, and it was the reason for performing a second kind of data gathering from Flickr (“Random”).
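As a rough sketch of how one request per city could be prepared, the snippet below only builds the REST URL for flickr.places.tagsForPlace and does not contact the service. The endpoint, the method name and the api_key and woe_id parameters are those appearing in the Open Data Table of Chapter 4.2.3; the helper name and the placeholder key are assumptions of this sketch, and the thesis issued its requests from PHP through cURL.

```python
from urllib.parse import urlencode, urlparse, parse_qs

REST_ENDPOINT = "http://api.flickr.com/services/rest/"

def tags_for_place_url(api_key, woe_id):
    """Build the request URL for the top tags of one place (no network access)."""
    params = {
        "method": "flickr.places.tagsForPlace",
        "api_key": api_key,      # placeholder: a real key is needed to call the API
        "woe_id": str(woe_id),
    }
    return REST_ENDPOINT + "?" + urlencode(params)

# 44418 is the WOEID used in the sample query of the Open Data Table.
url = tags_for_place_url("YOUR_API_KEY", 44418)
query = parse_qs(urlparse(url).query)
```

Repeating the call for each of the 233 WOEIDs would yield one list of tags per city.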

“Random” Dataset

Due to the limitations of the previous API, I performed a random sampling of photo metadata using the following APIs:

• flickr.photos.search used to retrieve lists of geotagged photos giventheir geographical location (through WOEID) for a specified period oftime (explained later);

• flickr.photos.getInfo used to get tags and owner from the above men-tioned retrieved photo lists.

The reasons for a random approach are explained in Chapter 3.3. I personally decided the criteria for the randomization: I chose to retrieve information about 10 photos for each one of 300 different days. Each of the 300 days was chosen randomly (by a custom function), also introducing a random hour factor. That means that for each one of these days a random hour (between 00:00 and 23:00) was chosen, in order to prevent bias in the tag collections, introduced for example by photos taken during the night (e.g. “night”, “nightphotography”, “long exposure”, “lights”).

I coded a PHP script having robustness as its first requirement. This was important for three reasons. First, three instances of the script were executed on different machines for more than a week, which made constant monitoring of their execution status unfeasible. Second, during the execution period the machines running the script often faced temporary network connection problems. Finally, the API service returned different sorts of errors several times. Given all these eventualities, I put some effort into creating a script able to keep running without particular problems even when these events occurred.
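The sampling criteria and the robustness requirement described above can be sketched as follows. This is not the original PHP script: the date range, the one-hour window width, the retry count and the back-off delays are assumptions chosen only for illustration.

```python
import random
import time
from datetime import datetime, timedelta

def random_windows(start, end, n_days=300, seed=None):
    """Pick n_days distinct days in [start, end) and one random hour for each."""
    rng = random.Random(seed)
    total_days = (end - start).days
    day_offsets = rng.sample(range(total_days), n_days)  # distinct days
    windows = []
    for offset in sorted(day_offsets):
        hour = rng.randint(0, 23)  # random hour between 00:00 and 23:00
        begin = start + timedelta(days=offset, hours=hour)
        # Assumed one-hour window, usable as min/max taken date in a search.
        windows.append((begin, begin + timedelta(hours=1)))
    return windows

def with_retries(fetch, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying with exponential back-off on transient errors."""
    for attempt in range(attempts):
        try:
            return fetch()
        except (OSError, ValueError):  # network failures, malformed responses
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

windows = random_windows(datetime(2008, 1, 1), datetime(2011, 1, 1), seed=1)
```

Each window would then be used for a search limited to 10 photos, wrapped in with_retries so that a temporary network or API error does not stop an unattended run.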

After this period I collected and aggregated the data gathered by the three machines, for a total of over 3 million tags.

4.2.3 Yahoo Query Language

The Flickr gathering method described above was not efficient, because it was necessary to first retrieve the whole photo list using flickr.photos.search according to the specified parameters, and then retrieve the actual information about each photo (tags) using the second API, flickr.photos.getInfo. I decided instead to perform the task more efficiently using the Yahoo Query Language, or YQL⁸. YQL is an expressive SQL-like language that lets developers query, filter, and join data across web services. Using YQL, the previous task becomes:

SELECT tags, id, owner.nsid
FROM flickr.photos.info
WHERE photo_id IN (
    SELECT id
    FROM flickr.photos.search($NUM_PHOTO_TO_TAKE)
    WHERE has_geo='true' AND woe_id='$woeid'
      AND min_taken_date='$random1' AND max_taken_date='$random2'
      AND safe_search='true'
);

YQL example using the flickr.photos.search and flickr.photos.info APIs in the same statement

It is worth noticing that YQL may slightly change the name of the underlying API even if the two are considered the same (e.g. the Flickr flickr.photos.getInfo becomes flickr.photos.info in YQL). In this way I avoided issuing two separate API calls, and the resulting code is also more human-readable (very similar to SQL). YQL provides default mappings to several online services (e.g. Flickr, GeoPlanet, YMaps and most of the Yahoo services) through YQL tables. It also gives developers the possibility to define their own Open Data Tables (i.e. a mapping between a generic online service and YQL). Since the API flickr.places.tagsForPlace, used for creating the “Top 100” dataset, was not available by default in the YQL service, I decided to create my own Open Data Table⁹ ¹⁰, as shown below. It is an XML file that has to be accessible online in order to be used.

⁸ http://developer.yahoo.com/yql/
⁹ http://developer.yahoo.com/yql/guide/yql-opentables-chapter.html
¹⁰ http://www.datatables.org/


<?xml version='1.0' encoding='UTF-8'?>
<table xmlns='http://query.yahooapis.com/v1/schema/table.xsd'>
  <!-- Return the 100 most used tags for a place -->
  <meta>
    <author>Leonardo Gentile</author>
    <documentationURL>http://www.flickr.com/services/api/flickr.places.tagsForPlace.html</documentationURL>
    <sampleQuery>select * from table where woe_id='44418'</sampleQuery>
  </meta>
  <bindings>
    <select itemPath='rsp.tags.tag' produces='XML'>
      <urls>
        <url env='all'>http://api.flickr.com/services/rest/?method=flickr.places.tagsForPlace</url>
      </urls>
      <paging model='page'>
        <start id='page' default='1'/>
        <pagesize id='per_page' max='500'/>
        <total default='100'/>
      </paging>
      <inputs>
        <key id='api_key' type='xs:string' paramType='query' required='true' default='a1eadc88c051de04ccc9bc71f915cc24'/>
        <key id='woe_id' type='xs:string' paramType='query'/>
        <key id='place_id' type='xs:string' paramType='query'/>
        <key id='min_upload_date' type='xs:string' paramType='query'/>
        <key id='max_upload_date' type='xs:string' paramType='query'/>
        <key id='min_taken_date' type='xs:string' paramType='query'/>
        <key id='max_taken_date' type='xs:string' paramType='query'/>
      </inputs>
    </select>
  </bindings>
</table>

Open Data Table for flickr.places.tagsForPlace API

4.3 System Architecture

The system architecture was not particularly complex because the implemented tool was a prototype. In particular, individual PHP scripts were manually executed through the command-line interface at each step of the data gathering and analysis, storing the result of each operation in the database. Figure 4.1 shows the data flow of the operations.

Figure 4.1: Data Flow Representation


4.3.1 Data Storage

The DBMS used was MySQL, in line with the LAMP platform, as already illustrated in Chapter 4.1.2. As in many real-life search engines, all the data were computed offline: the user is presented only with the results of previous computations (ranked lists of place similarity).

Figure 4.2: Database Representation for the “Random” Dataset

Figure 4.2 represents the database schema for the “Random” dataset (the “Top 100” dataset representation is similar but with fewer fields, since it carries less information). The rectangular shapes are the tables, the dark rounded shapes represent the key fields of each table, and the white rounded shapes are all the other fields. The blue lines represent the relations between tables (foreign keys).

The table Place contains the list of the 233 cities under analysis, while the table Tags_cleaned contains all the gathered tags (and other photo metadata) after the removal of the stop-words (see next section). The table Dictionary includes all the unique tags obtained from Tags_cleaned. Finally, Weights is the table where all the computed weights are stored.


4.3.2.2 Language Detection with PHP & Python

As mentioned in Chapter 3.6.5, it was not possible to perform stemming in a multi-language context. Porter’s stemming algorithm [23] is perhaps the most used for the English language, but several other stemming algorithms exist for other languages. I therefore decided to try to guess the language of each tag, in order to later apply the appropriate stemming algorithm to it.

PHP Approach

Language detection was first attempted using PHP and an n-gram approach. An n-gram is just an n-letter-long sequence extracted from a document; for example, the word “constable” broken down into trigrams (3-letter sequences) gives: “con”, “ons”, “nst”, “sta”, “tab”, “abl”, “ble”. There are many ways of extracting n-grams, but I adopted an algorithm that tokenizes any string passed in into 3-grams¹¹. I tested the algorithm in conjunction with a vector-space-style cosine similarity, using the dictionaries pre-installed on Mac OS X as language references. The algorithm stores the frequencies of the terms (trigram tokens) for each language; to detect the language of a document, it decomposes the document in the same way and, for each trigram present, compares its frequency with the one recorded for each candidate language.

Because in the algorithm the vectors are divided by their lengths this is anormalized dot product between the two sets of weights, which gives a scorebetween 0 and 1 (cosine similarity, see Chapter 2.3.2).
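The trigram approach can be sketched as follows. This is an illustrative Python rewrite, not the PHP code referenced above; the two tiny word lists stand in for real language dictionaries and are assumptions of this sketch.

```python
import math
from collections import Counter

def trigrams(text):
    """Split a string into overlapping 3-letter sequences."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

def profile(words):
    """Trigram frequency profile of a list of words."""
    counts = Counter()
    for word in words:
        counts.update(trigrams(word))
    return counts

def cosine(p, q):
    """Normalized dot product between two trigram profiles (0 to 1)."""
    dot = sum(c * q[g] for g, c in p.items())  # Counter returns 0 for missing keys
    norm = math.sqrt(sum(c * c for c in p.values())) * math.sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0

def rank_languages(text, profiles):
    """Rank candidate languages by similarity to the trigram profile of the text."""
    doc = profile([text])
    scores = {lang: cosine(doc, prof) for lang, prof in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy reference profiles; a real test would use full dictionaries.
profiles = {
    "english": profile(["the", "church", "street", "night", "bridge", "constable"]),
    "italian": profile(["chiesa", "strada", "notte", "ponte", "piazza", "castello"]),
}
ranking = rank_languages("bridges", profiles)
```

A single short tag yields very few trigrams, which gives an idea of why statistics designed for whole documents are unreliable on tags.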

It turns out that this kind of approach detects the language of a document by collecting term usage statistics and finding matches (actually similarities, as in the VSM approach), resulting in a ranked list of languages ordered by score. Since the documents under analysis were composed of tags in a multitude of different languages (UTF-8 encoding), this kind of algorithm fails to detect the language in such a context.

¹¹ http://phpir.com/language-detection-with-n-grams


Python Approach

After the failed attempt with the trigram approach, I decided to use Python and a spell-check approach. PyEnchant¹² is a spellchecking library for Python, based on the Enchant¹³ library. With this method it is possible to check whether a particular word (tag) belongs to a dictionary. Since the tags are unconstrained and free-form, they could be in any language of this planet, but they can also be proper nouns or anything else not included in any dictionary. Since it was impossible to obtain all the necessary dictionaries, I decided to detect only the English terms, in order to later apply the stemming algorithm to them.

It turns out that this approach fails because words in a particular language(e.g. Italian) could be included in the English dictionary but they mean dif-ferent things in their respective languages, e.g. “male” means evil in Italianand a person or animal of the male gender in English.
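The ambiguity can be illustrated with a toy dictionary lookup. The two word sets below are small hand-made stand-ins for real spell-check dictionaries (in the actual tests the lookup was delegated to PyEnchant), so the sets and the helper name are assumptions of this sketch.

```python
# Hand-made stand-ins for full dictionaries; each word is valid in the
# language it is listed under, with a different meaning where it is shared.
ENGLISH = {"male", "church", "night", "bridge", "pane", "sale", "come"}
ITALIAN = {"male", "chiesa", "notte", "ponte", "pane", "sale", "come"}

dictionaries = {"english": ENGLISH, "italian": ITALIAN}

def languages_of(tag, dictionaries):
    """Return every language whose dictionary contains the tag."""
    return sorted(name for name, words in dictionaries.items() if tag.lower() in words)

ambiguous = languages_of("male", dictionaries)    # found in both dictionaries
unknown = languages_of("colosseo", dictionaries)  # found in neither toy set
```

A tag accepted by the English dictionary is therefore not necessarily an English word in the photographer's intention, and stemming it as English could distort it.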

As a result of these tests, language detection was not performed on the corpuses, due to the problems described above. That means that it was not possible to apply any stemming or text pre-processing algorithm.

¹² http://www.rfk.id.au/software/pyenchant/index.html
¹³ http://www.abisource.com/projects/enchant/


4.3.2.3 Similarity Computation

Once the metadata had been obtained and cleaned, and their weights computed for each of the different weighting systems (see Chapter 3.6.3), the last step was calculating the similarity score of each place with respect to all the others. This final step was also pre-computed and the results stored in the database.

Figure 4.3: Table Representation for the generic “score” table

Figure 4.3 illustrates one of the four scoring tables that have been created for the four scoring systems (see Chapter 3.7.1). Since we are comparing each place with all the others in the set, the similarity scores could be represented by a matrix in which rows and columns are identified by the places in the set. Each element of the matrix represents the similarity score between a place Pi, identified by row i, and a place Pj, identified by column j. The table was instead built as a list of pairs of places, each with its score.

I did this to maximize the computation efficiency: since the similarity between a place Pi and a place Pj is the same as the similarity between Pj and Pi, the matrix is symmetric, and I wrote the PHP script to compute only its upper triangular part.

Using a MacBook Pro with an Intel Core 2 Duo processor (2.53 GHz) and 4 GB of RAM, computing the scores for the “Top 100” dataset required only a few minutes, while each of the weighting systems derived from the “Random” dataset required approximately 3 hours on average.
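The saving obtained from the symmetry can be sketched as follows. The code is an illustrative Python version, not the original PHP script; the place identifiers and weights are invented.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two {tag: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pairwise_scores(vectors):
    """Score each unordered pair once (upper triangle, p < q)."""
    scores = {}
    for p, q in combinations(sorted(vectors), 2):
        scores[(p, q)] = cosine(vectors[p], vectors[q])
    return scores

def lookup(scores, p, q):
    """Read a score in either order; a non-empty place compared with itself scores 1."""
    if p == q:
        return 1.0
    return scores[(p, q)] if p < q else scores[(q, p)]

def pair_count(n):
    """Number of unordered pairs among n places."""
    return n * (n - 1) // 2

vectors = {1: {"sea": 1.0, "port": 2.0}, 2: {"sea": 2.0, "beach": 1.0}, 3: {"snow": 1.0}}
scores = pairwise_scores(vectors)
```

For the 233 cities in G this gives pair_count(233) = 27,028 scores per system, instead of the 54,289 cells of the full matrix.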

4.4 User Interface Design & Data Presentation

The tool has been designed as a web application since the beginning. Even though it was a prototype, I put considerable effort into designing a system as usable as possible. The interface design was based on an existing open-source project, GeoPlanet Explorer¹⁴, created by the developer evangelist Christian Heilmann¹⁵.

As a first step, the user (or the tester) enters the name of a geographic location, as shown in Figure 4.4.

Figure 4.4: Search Field

The system contacts GeoPlanet through its API in order to disambiguate the entered place, as shown below.

Figure 4.5: Disambiguation

¹⁴ http://isithackday.com/geoplanet-explorer/
¹⁵ http://www.wait-till-i.com/

In Figure 4.5 basic information about each place is presented in order to let the user identify the requested place. Once the user clicks on the link of the place that he really intends, a page with a map of the place and the most recent photos from Flickr is presented. If the city is in the similarity database, the four similarity systems show the five cities most similar to that place (e.g. “Rome”) according to their scoring schemes (Figure 4.6).

Figure 4.6: Similarities for the city of “Rome”

In this phase the similar cities are retrieved by simply querying the “Score” table (Figure 4.3), ordering the results by score and limiting them to the first 5 matches. The similar cities are presented using a tag cloud tool in order to communicate to the user the “degree” of similarity.
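One simple way to turn similarity scores into tag cloud font sizes is a linear scaling between a minimum and a maximum size. The sketch below is only an assumption about how such a mapping could work: the thesis does not specify the scaling used by the tag cloud tool, and the pixel range, city labels and scores are invented.

```python
def font_sizes(scored_places, min_px=12, max_px=32):
    """Map similarity scores linearly onto a font size range (in pixels)."""
    scores = [score for _, score in scored_places]
    low, high = min(scores), max(scores)
    span = high - low
    sizes = {}
    for place, score in scored_places:
        # If all scores are equal, every place gets the largest font.
        ratio = (score - low) / span if span else 1.0
        sizes[place] = round(min_px + ratio * (max_px - min_px))
    return sizes

# Invented top-5 result, ordered from most to least similar.
top5 = [("City A", 0.42), ("City B", 0.40), ("City C", 0.37), ("City D", 0.31), ("City E", 0.22)]
sizes = font_sizes(top5)
```

The most similar city gets the largest font and the fifth one the smallest, so the ranking is readable at a glance.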

Chapter 5

Tests and Evaluations

In this chapter, the concept of similarity as perceived by the users is discussed. Understanding this perception is necessary to interpret the results of the survey delivered to the users in order to evaluate the tool. The survey results and the tool evaluation are then presented.

5.1 About Similarity

Once again, the purpose of this research and of the related tool is to find similarities between places. Before getting into the evaluation, it is a good idea to clarify the meaning of “place similarity” within the scope of this work. Assumption 5 (Chapter 3.2) states that “A place is defined as a similar to another (not necessarily spatially close locations on the planet) if their representative descriptions are similar”. The interpretation of place similarity is thus transferred to the concept of similarity between place descriptions. Each place description is derived by aggregating Flickr metadata, which represent a narrow folksonomy. Once these tags have been aggregated with respect to a place, we are creating a broad folksonomy (see Chapter 3.5), and we have to interpret it for what it actually represents.

Figure 5.1: Tag distribution for the city of New York using the “Random” dataset, truncated to the 150th tag

Figure 5.1 represents the tag distribution for the city of New York, which can be interpreted as the way users describe the city of New York using their own vocabulary, language and cultural background. As mentioned in Chapter 2.1.1, in a broad folksonomy the tag distribution, once it stabilizes over time, follows the well-known power law [7]. This means that from this distribution we can gather trends on how a wide range of people are calling or “describing” the city (high term frequency). We should not forget the right end of the curve, the long tail. This is where a small minority of people describe the city with a term that allows others with a similar vocabulary or mindset (or maybe the same language) to agree on the city description, even if they do not use the terms used by the masses at the left end of the curve.
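A distribution like the one in Figure 5.1 is obtained by counting how often each tag occurs for a place and sorting the tags by frequency. The following sketch shows the idea on a handful of invented tags; the counts are not taken from the dataset.

```python
from collections import Counter

def rank_frequency(tags):
    """Return (tag, count) pairs sorted from the most to the least frequent."""
    return Counter(tags).most_common()

def head_share(ranked, head_size):
    """Fraction of all tag occurrences covered by the head_size most frequent tags."""
    total = sum(count for _, count in ranked)
    return sum(count for _, count in ranked[:head_size]) / total

# Invented sample: a few very frequent tags and several tags used only once.
tags = ["nyc"] * 8 + ["newyork"] * 4 + ["manhattan"] * 2 + ["brooklyn", "skyline", "taxi", "bridge"]
ranked = rank_frequency(tags)
share = head_share(ranked, 2)
```

In this toy sample the two most frequent tags cover two thirds of all occurrences, while the tags used only once form the long tail.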

Under these circumstances we can analyze some critical factors before interpreting the survey results.

5.1.1 Critical Factors

A) The original distributions in the unmodified corpuses (like the one shown in Figure 5.1) have been heavily modified, extracting a description based not only on the term frequency but also on the inverse document frequency and other factors (tf-idf, Chapter 3.6.3). Since this is only one of several possible descriptions of the place, the user may not agree with the criteria chosen for creating this representative description.

Figure 5.2: Tag weight distribution for the city of New York using the “Random” dataset, truncated to the 150th tag, using the W-rnd1 and W-rnd3 weights

In Figure 5.2 we can observe the distribution of the weights for the same 150 tags of Figure 5.1, taken in the same order. This represents a part (truncated to 150 tags) of the New York description using the “Random” dataset and two different weighting schemes (W-rnd1 in red and W-rnd3 in blue, see Chapter 3.6.3). We can observe that some tags do not exist anymore (due to the blacklist, Chapter 3.6.5) and that the “importance” of the tags has changed considerably. These are only two of the possible place descriptions, which the user may or may not find suitable.

B) Some of the factors contributing to the description of a city depend strongly on the set of places under analysis (e.g. changing the set of places G will change the idf), and the users are not aware of this. For example, a user may consider Berlin the city most similar to Oslo based on their alternative music scenes, but Berlin may not be inside the analyzed set G. In this case the user may be disappointed to find Dusseldorf in what he perceives to be the position of Berlin.
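The dependence of the idf on the set G can be shown with a small worked example. The sketch uses the textbook form idf = log(N / df), which is an assumption here (the exact variants adopted in this work are defined in Chapter 3.6.3), and the tag sets are invented.

```python
import math

def idf(tag, places):
    """Textbook inverse document frequency: log(N / df), with places as documents."""
    df = sum(1 for tags in places.values() if tag in tags)
    return math.log(len(places) / df) if df else 0.0

# Invented tag sets for a tiny set of places G.
G = {
    "Oslo": {"music", "fjord"},
    "Dusseldorf": {"music", "rhine"},
    "Rome": {"church", "colosseum"},
}
# The same set with one more place added.
G_plus = dict(G, Berlin={"music", "wall"})

before = idf("music", G)      # log(3 / 2)
after = idf("music", G_plus)  # log(4 / 3)
```

Adding a single city that uses the tag lowers its idf for every place in the set, so all the descriptions, and therefore the rankings, can shift.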

C) Assumption 5 defines as similar those places that share a similar description but are not necessarily geographically close. The assumption does not exclude the usual geographic or physical concept of similarity; it only claims that this condition is not necessary. That means that similar places can also be geographically close. This can be explained with an example. If we consider the city of “Rome”, scoring System D gives as the first three (ranked) results “Siena”, “Florence” and “Venice”. These four cities are actually close (all in Italy), but their descriptions are also similar (e.g. a considerable number of matching terms in the Italian language), maybe because the citizens of those places share a common cultural background and hence a common place description. So this is a valid result, not a false positive.

5.2 Test

Because of all the “human” factors explained in the previous section, we cannot apply objective evaluation measures such as precision or recall: there is no unique way to establish which city is actually the most similar to another one and to evaluate the results of the research on this basis. The results depend only on the weighting schemes adopted (see Chapter 3.7.2) and on the initially retrieved data, that is the Flickr folksonomy, intended as the “original” description of a place created by the user tags.

We asked the users to evaluate the four scoring systems, and since they onlydiffer in the used weighting schemes (System A also differs for the underlyingdataset) that means they differ in how they create the description of a place.As a consequence we are actually asking the users which one of the four placedescriptions is the best according to their mindsets.

These descriptions are obtained from a folksonomy, and the four methods used to create them may enhance or cut off the “importance” of the different tags in contributing to each place description (Figure 5.2). Since the tool extracts knowledge and patterns from user-generated content, in my opinion a good way to estimate “how good” a similarity system is, compared to the others, is to analyze the users’ interpretation.

5.2.1 Survey

In order to evaluate this research I created an online survey asking the userswhich one of the four similarity systems is the best one according to them.Before starting the survey clear instructions explain that the similaritiesbetween places in such a tool are based on place descriptions, not geographicor physical similarities.

Figure 5.3: Survey Introduction and Instruction

For the survey a small group of cities has been selected, with a high percentage of European tourism destinations. This has been done because most of the users in the focus group may have no cultural background to judge cities elsewhere. Nonetheless, a small group of cities from Asia, South America and the USA has been included in the test.

The focus group was composed of 113 people: master students from “Università della Svizzera Italiana” and “Politecnico di Milano”, and a percentage from a Facebook group about European tourism. Each user was asked to give a relative judgment on which similarity system produces the best similarity scores with respect to the others. Each user judged one city at a time, for a total of five cities. For each city, the user could observe the five most similar cities according to the four scoring systems. The web page presented four lists, each one composed of five cities in ranked order. Figure 5.4 shows the survey for Marseille.

Figure 5.4: Screenshot of the survey for the city of Marseille

At the top of the page some basic information about the city was provided (Country, Administrative Regions, etc.) as well as a map, in order to give the users a basic background to identify the city under test. The users could understand the order of similarity between the city under test and each of the five cities (for each list) by observing the font size. The bigger the font, the more similar the city was to the one under test (e.g. Marseille) according to each of the four scoring systems.

Since place similarity is based on similarity of descriptions, in order to give an extra clue to the users, they could click on each city in the lists and check the tags in common with Marseille. This is clearly a simplification, because the similarity tool is not based on a simple term match (as in the binary information retrieval model). Nonetheless, it was considered a good way to let the users judge whether the tags in common between two cities were trivial or not. The important point of this aspect of the survey was that I did not suggest in any way what to consider trivial and what to consider an appropriate match. It is up to the user, with his mindset, to judge it. As a recap:

• we are extracting knowledge from a folksonomy;

• a folksonomy is “created” by users;

• we are asking the users which one of the knowledge extraction (andcomparison) methods is the best.

This evaluation appeared to be the most coherent with the choices I previously made throughout this research.

5.3 Evaluation

The survey was filled in by 113 users and monitored for about a week, producing 516 answers. From the beginning a trend was visible, as shown in Figure 5.5.

Figure 5.5: Users’ survey answers for the first day of evaluation.

The letters from A to D represent the respective scoring systems (e.g. System A, System D, see Chapter 3.6.4). The letter N denotes that the users did not know how to answer.


After a week the proportions between the answers remained more or less the same, as shown in Figure 5.6.

Figure 5.6: Users’ survey answers for the last day of evaluation

The final result gives several useful pieces of information. The most noticeable is that the users did not know what (or how) to answer a high number of times (84), which can be interpreted in three ways.

One interpretation is that the users did not find, according to their mindset, any of the city representations appropriate (see Chapter 5.1.1).

Another way to interpret the high number of unanswered questions is that the proposed survey may have been perceived as too complex to understand by part of the users. I tried to avoid this problem by selecting a focus group with a high education level (master students), but it seems that more effort could have been made to present the evaluation test in an easier way.

Finally, it is quite possible that the users simply did not know the place under consideration.

On the other hand, we can discern a trend in which System D remains the favorite one, followed by System A. System B and System C are considered less valuable, and they swapped positions with each other several times during the observation week.

The best system according to the users is System D (139 preferences). System D is based on the weighting scheme in which great importance was given to the user factor (Chapter 3.6.4). In this similarity system the weighting scheme is based on the extra assumption that was introduced for the “Random” dataset: “The importance of a tag for a place increases with the number of individual photographers that use it in that place” (see Chapter 3.3). This assumption was adapted from one borrowed from the related work “World Explorer” [1]. This result confirms not only that this assumption is valid for the scope and objective of this research, but also that it is the best of those tested.
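The user-factor assumption can be illustrated with a small sketch. The actual weighting formula of System D is defined in Chapter 3.6.4 and is not reproduced here; the function `user_weighted_tf` below is a hypothetical stand-in that simply counts the distinct photographers per tag instead of raw tag occurrences, so that a single prolific user cannot dominate a place description. The photo records and all names are invented for illustration.

```python
from collections import defaultdict


def user_weighted_tf(photos):
    """Count, for each tag, how many distinct users applied it in one place.

    `photos` is a list of (user_id, tags) pairs for a single city.
    Illustrative stand-in only, not the System D formula.
    """
    users_per_tag = defaultdict(set)
    for user_id, tags in photos:
        for tag in tags:
            users_per_tag[tag].add(user_id)
    return {tag: len(users) for tag, users in users_per_tag.items()}


# Hypothetical photos taken in one city: user "u1" repeats a personal tag
# on every upload, while "colosseum" is used by three different people.
photos = [
    ("u1", ["holiday2010", "colosseum"]),
    ("u1", ["holiday2010"]),
    ("u1", ["holiday2010"]),
    ("u2", ["colosseum", "forum"]),
    ("u3", ["colosseum"]),
]

weights = user_weighted_tf(photos)
```

With raw counts `holiday2010` and `colosseum` would tie at three occurrences each; counting photographers ranks the tag shared by three people above the tag repeated by one.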


System A represents the second most popular answer (118 preferences), and it shows that we can extract valuable knowledge by gathering information from the “Top 100” dataset. In other words, using only the 100 most popular tags for a city we can build a place similarity system based on them. Of course such a system would not be as comprehensive as the one represented by System D, and this can be explained with an example.

If we analyze the city of “Seville” (Spain) using System A and System D, we obtain in both cases that “Cordoba” (Spain) is in the set of the five most similar cities. So in this circumstance we have similar results using different systems, but if we check the shared tags between these two cities according to the two systems under evaluation we can see a huge difference. The tags in common between “Seville” and “Cordoba” according to System A are no more than fifty, as shown in Figure 5.7.

Figure 5.7: Shared tags between “Seville” and “Cordoba” according to the System A

On the other hand, according to System D the same two cities share thousands of tags, as illustrated in Figure 5.8.

Figure 5.8: Shared tags between “Seville” and “Cordoba” according to the System D (truncated)


As we know, the VSM does not consider a simple term match (as in this example) but takes other factors into consideration. However, according to the cosine similarity (Chapter 3.7.1), two cities that have a similarity score different from zero share at least one tag. So term matching is used here only to illustrate the examples.

Based on the last example it is clear that the similarity according to System A may not be effective because its analysis is based on a small subset of tags. Of course this subset represents “important” tags for most of the users (top 100 most common tags), but since the VSM will widely change their weights, it cannot be excluded that, once the Vector Space Model is adopted, these tags are no longer “important” for the representative description of a city (see Chapter 5.1 and Figure 5.2).
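The link between a non-zero cosine score and shared tags can be checked with a minimal sketch. The tag weights below are invented and do not come from the thesis datasets; `cosine_similarity` is a plain textbook implementation over sparse dictionaries, assumed here only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse tag vectors stored as dicts."""
    shared = set(a) & set(b)  # only shared tags contribute to the dot product
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical weighted descriptions of three cities.
seville = {"andalusia": 0.8, "flamenco": 0.5, "cathedral": 0.3}
cordoba = {"andalusia": 0.7, "mezquita": 0.6, "cathedral": 0.2}
oslo = {"fjord": 0.9, "opera": 0.4}

score_close = cosine_similarity(seville, cordoba)  # shares two tags
score_far = cosine_similarity(seville, oslo)       # shares no tag
```

Because the dot product sums only over shared tags, two descriptions with no tag in common always score exactly zero, which is why the shared-tag lists are a reasonable, if simplified, clue for the survey users.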

Under these considerations I can claim that System A is not an effective way to find place similarities, because it is based on a small corpus C100 of tags, and under certain conditions the system may fail to find similar cities due to the small size of the corpus. This can be clarified with another example.

This time I will consider the city of “Rome” (Italy). According to System A its most similar city is “Tarragona” (Spain), as shown in Figure 5.9. Analyzing the tags in common between these two cities we can observe a very small set of (trivial) tags (Figure 5.10).

Figure 5.9: Five cities similar to “Rome” according to System A and System D (in ranked order)

Under the same conditions, System D scored “Siena”, “Florence”, “Venice”, “Verona” and “Naples” (Figure 5.9) as the five most similar cities to “Rome” (in ranked order). According to System D, Rome shares thousands of tags with each one of these cities (i.e. the analysis is not trivial) and, as a personal opinion (being an Italian citizen I have the cultural background to judge this particular example), this ranked list is the more correct one compared with the one created with System A.


Figure 5.10: Shared tags between “Rome” and “Tarragona” according to System A

System B and System C did not have great success amongst the survey users, but they were not considered completely wrong either (81 and 94 preferences respectively).

System B makes use of a standard inverse document frequency for the “Random” dataset, while System C exploits the extra information coming from the photos factor for the “Random” dataset, calculating a non-standard idf (see Chapter 3.6.3).

In both cases these weighting systems were not judged as good as System D and System A.

Figure 5.11: Five cities similar to “Rome” according to System B and System C (in ranked order)
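The preference counts reported in this section can be cross-checked with a few lines of code. The numbers are the ones stated in the text (139, 118, 94, 81 and 84 answers); the variable names are only illustrative.

```python
# Survey answers per option, as reported in Section 5.3.
answers = {"A": 118, "B": 81, "C": 94, "D": 139, "N": 84}

total = sum(answers.values())

# Share of each option in percent, rounded to one decimal place.
shares = {option: round(100 * count / total, 1)
          for option, count in answers.items()}

# Ranking of the four scoring systems, excluding the "did not know" option N.
ranking = sorted((o for o in answers if o != "N"),
                 key=answers.get, reverse=True)
```

The counts add up to the 516 answers collected, and the ranking D, A, C, B matches the order discussed above. System D collects roughly 27% of all answers, while the "did not know" option accounts for about 16%.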

Chapter 6

Conclusion

6.1 Current Status of the Work

The research started with the aim of finding similar geographical locations using the geo-referenced metadata extracted from Flickr photos. Using geotagged photos and their metadata for a limited set of 233 worldwide tourism destinations (cities) was meant to narrow the scope of the research, with the objective of finding similar destinations in order to realize a tourism suggestion system.

Before starting the experimentation I formalized the hypothesis, introducing a series of assumptions on which I based the whole research.

Following the assumptions, I described each geographical location by aggregating the tags from Flickr photos taken in that location. Once these textual descriptions were obtained, I represented each location using the Vector Space Model, also introducing four TF-IDF weighting modifications in order to find the best way to describe each geographical place. The Vector Space Model was also used to calculate the similarities between the descriptions.

Finally, a survey was presented to a focus group of users in order to judge which one of the four TF-IDF modification schemes produces the best place similarity according to them.

From a total of 516 answers (survey proposed to 114 users) a trend emerged, rating System D as the most suitable similarity system (139/516). From this result we can deduce that the particular choice of considering a user factor in the TF-IDF scheme was successful.


6.2 Application Fields

This research and the related implemented tool represent an innovative method to find place similarities, especially in the field of tourism destination suggestion.

Several studies have been proposed about knowledge extraction from geo-referenced user-generated content and metadata. The innovative aspect of this research is that it goes a step further: after obtaining how the users name or describe a place, it attempts to find a similar one.

This may represent a very useful way to suggest similar tourism destinations, being different from current suggestion systems. In particular, it is not based on explicit user suggestions. When someone tags a photo taken in a generic place, he is contributing to the description of that place, creating an implicit link with other geographical locations described in a similar way. This is a “genuine” way of obtaining a place description (and then a suggestion), avoiding the “human biases” that depend on people’s feelings and are always present when writing a review (e.g. tourists who enjoyed beautiful weather in a particular week, or who experienced very uncomfortable accommodation). Although the raw descriptions extracted with this system may not be useful or readable (thousands of keywords), they turn out to be very helpful in suggesting similar places.

6.3 Future Work

The system was positively evaluated by the users, showing its potential usefulness even in this early development phase. Nonetheless it suffers from some issues typical of the most common information retrieval models. In detail, the system is based on the Vector Space Model, hence it inherits its typical issues, which may be solved as explained in the following paragraphs.

Dimensionality

The main concern was a dimensionality problem due to the high number of tags to process for each city. I reduced this problem by introducing a blacklist and some heuristics in order to suppress those terms that do not give any extra information but only add noise and computational inefficiency. Nevertheless, after the stop-word suppression the corpus CRnd (from the “Random” dataset) contained 55,000 unique terms, meaning 55,000 dimensions in the VSM. As a result, the system required several hours of computation in order to obtain all the similarity distances between each place and all the others.


If the number of analyzed places increases (say to 1000 cities), the required computation time could increase drastically. The solution to this problem is to reduce the dimensions of the VSM.
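A rough estimate shows how the workload grows with the number of places. The sketch assumes that every unordered pair of places is compared once and that, in the worst case, each comparison touches all 55,000 dimensions; the helper `pair_count` is hypothetical, and real running times depend on the sparsity of the vectors and on the implementation.

```python
def pair_count(n_places):
    """Number of unordered pairs among n_places (each pair compared once)."""
    return n_places * (n_places - 1) // 2


DIMENSIONS = 55_000  # unique terms in the "Random" corpus after filtering

current_pairs = pair_count(233)    # the 233 destinations of this research
larger_pairs = pair_count(1000)    # the hypothetical 1000-city scenario

# Worst-case multiply-add operations if every vector were fully dense.
current_ops = current_pairs * DIMENSIONS
growth = larger_pairs / current_pairs
```

Under these assumptions, going from 233 to 1000 places multiplies the number of comparisons by about 18, since the pair count grows quadratically. Reducing the number of dimensions lowers the cost of every one of those comparisons.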

One way is to enhance the detection of blacklist terms. For example, instead of manually adding exact terms to the blacklist, we may filter tags that are not in the stop-list but have a similar meaning, using the distributional hypothesis [8][6], which states that words found in similar contexts tend to be semantically similar: e.g. New York, NY, NYC, New York City.
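One possible reading of the distributional hypothesis for tags is sketched below: each tag is described by the tags it co-occurs with on the same photo, and two tags whose co-occurrence profiles are very similar become candidates for being treated as the same term. This is only an illustration under that assumption, not a method evaluated in this research; the photo tag sets and the helpers `context_vectors` and `cosine` are invented.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def context_vectors(photo_tags):
    """Map each tag to a Counter of the tags it co-occurs with on a photo."""
    contexts = defaultdict(Counter)
    for tags in photo_tags:
        for tag, other in permutations(set(tags), 2):
            contexts[tag][other] += 1
    return contexts


def cosine(a, b):
    """Cosine similarity between two sparse context vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical tag sets: "nyc" and "newyork" never appear together,
# but they appear next to the same neighbouring tags.
photo_tags = [
    ["nyc", "manhattan", "skyline"],
    ["newyork", "manhattan", "skyline"],
    ["nyc", "brooklyn"],
    ["newyork", "brooklyn"],
    ["rome", "colosseum"],
]

ctx = context_vectors(photo_tags)
same_place = cosine(ctx["nyc"], ctx["newyork"])
different_place = cosine(ctx["nyc"], ctx["rome"])
```

In this toy data the two spellings of New York have identical contexts, while `rome` shares no context with them. On real Flickr data a similarity threshold would have to be chosen empirically.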

Another way to reduce the dimension of the vocabulary is the adoption of another information retrieval model, for example the Topic-based Vector Space Model or Latent Semantic Analysis, which are less affected by this issue.

Multi-Language Context

A crucial point that could not be solved in this research was the language detection of the various tags in the corpora, hence it was not possible to apply any form of text pre-processing. Of course, the fact that the system works (with no particular problem) in a multi-language context is an advantage, so I would not suppress the non-English words in any way. The language detection of each tag in a multi-language context is a real challenge (see Chapter 4.3.2.2) and in my opinion it may have a possible solution (e.g. analyzing co-occurring tags in a particular language), but it would require a research project in its own right.

Bibliography

[1] Shane Ahern, Mor Naaman, Rahul Nair, and Jeannie Hui-I Yang. World explorer. Proceedings of the 2007 conference on Digital libraries - JCDL ’07, page 1, 2007.

[2] Ciro Cattuto, Dominik Benz, and Andreas Hotho. Semantic analysis of tag similarity measures in collaborative tagging systems. arXiv, pages 3–8.

[3] Maarten Clements, Pavel Serdyukov, and A de Vries. Finding Wormholes with Flickr Geotags. Advances in Information, pages 658–661, 2010.

[4] David J. Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. Mapping the world’s photos. Proceedings of the 18th international conference on World wide web - WWW ’09, page 761, 2009.

[5] Alan Dix, Stefano Levialdi, and Alessio Malizia. Semantic halo for collaboration tagging systems. In the Social Navigation and Community-Based Adaptation Technologies Workshop-June 20th, 2006.

[6] J. R. Firth. A synopsis of linguistic theory 1930-55. 1952-59:1–32, 1957.

[7] Harry Halpin, Valentin Robu, and Hana Shepherd. The complex dynamics of collaborative tagging. In Proceedings of the 16th international conference on World Wide Web, WWW ’07, pages 211–220, New York, NY, USA, 2007. ACM.

[8] Z. S. Harris. Mathematical Structures of Language. Wiley, New York, NY, USA, 1968.

[9] Charles (U.S. Army Construction Engineering Research Laboratory) Herring. An Architecture for Cyberspace : Spatialization of the Internet.


[10] Livia Hollenstein and Ross Purves. Exploring place through user-generated content: Using Flickr to describe city cores. Journal of Spatial Information Science, 1(1):21–48, July 2010.

[11] Buhalis D. Inversini A., Cantoni L. Destinations’ information competition and web reputation. itt – Journal of information technology tourism, pages 221–234, 2009.

[12] Dedekind C. Cantoni L. Inversini A., Marchiori E. Applying a conceptual framework to analyze online reputation of tourism destinations. Proceedings of the International Conference in Lugano, Switzerland, February 10-12, 2010, pages 321–332, 2010.

[13] Eija Kaasinen. User needs for location-aware mobile services. Personal and Ubiquitous Computing, 7(1):70–79, May 2003.

[14] Evangelos Kalogerakis, Olga Vesselova, James Hays, Alexei A. Efros, and Aaron Hertzmann. Image sequence geolocation with human travel priors, September 2009.

[15] Lyndon Kennedy, Mor Naaman, Shane Ahern, Rahul Nair, and Tye Rattenbury. How flickr helps us make sense of the world. Proceedings of the 15th international conference on Multimedia - MULTIMEDIA ’07, page 631, 2007.

[16] R. Lambiotte and M. Ausloos. Collaborative tagging as a tripartite network. ArXiv Computer Science e-prints, December 2005.

[17] RR Larson. Geographic information retrieval and spatial browsing. Geographic information systems and libraries: Patrons, Maps, and Spatial Information. Number Dl. ACM Press, New York, New York, USA, 1996.

[18] J.B. Lovins and MASSACHUSETTS INST OF TECH CAMBRIDGE ELECTRONIC SYSTEMS LAB. Development of a stemming algorithm. 11(June):22–31, 1968.

[19] Sergio Martín, Elio S. Cristobal, Rosario Gil, Gabriel Díaz, Manuel Castro, and Juan Peire. A Context-Aware Application Based on Ubiquitous Location. 2008 The Second International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies, pages 83–88, September 2008.


[20] Kevin S. McCurley. Geospatial mapping and navigation of the web. Proceedings of the tenth international conference on World Wide Web - WWW ’01, pages 221–229, 2001.

[21] Mor Naaman, Susumu Harada, Q.Y. Wang, H. Garcia-Molina, and Andreas Paepcke. Context Data in Geo-Referenced Digital Photo Collections. In Proceedings of the 12th annual ACM international conference on Multimedia, pages 196–203. ACM, 2004.

[22] Isabella Peters and Paul Becker. Folksonomies : indexing and retrieval in Web 2.0. De Gruyter/Saur, Berlin, 2009.

[23] M. F. Porter. An algorithm for suffix stripping, pages 313–316. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.

[24] Jonathan Raper, Georg Gartner, Hassan Karimi, Chris Rizos, Information Science, and Northampton Square. Applications of location–based services: a selected review. Journal of Location Based Services, (791766136), 2008.

[25] Tye Rattenbury, Nathaniel Good, and Mor Naaman. Towards automatic extraction of event and place semantics from flickr tags. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’07, page 103, 2007.

[26] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.

[27] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18:613–620, November 1975.

[28] Gerard Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.

[29] Pavel Serdyukov, Vanessa Murdock, and Roelof van Zwol. Placing flickr photos on a map. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’09, (May):484, 2009.

[30] T.R. Smith. A digital library for geographically referenced materials. Computer, 29(5):54–60, 1996.


[31] Carlo Torniai, Steve Battle, and Steve Cayzer. Sharing, discovering and browsing geotagged pictures on the web. Citeseer, 2007.

[32] Kentaro Toyama, Ron Logan, and Asta Roseway. Geographic location tags on digital images. Proceedings of the eleventh ACM international conference on Multimedia - MULTIMEDIA ’03, page 156, 2003.

Appendix A

Terms and Abbreviations

tf: term frequency
idf: inverse document frequency
TF-IDF: term frequency-inverse document frequency
VSM: Vector Space Model
YQL: Yahoo Query Language
LAMP: Linux, Apache, MySQL and PHP

Appendix B

Blacklist

Figure B.1: Stop-words or Blacklist composed by the most common Flickr tags
