towards social media as a data source for opportunistic ... · social media already exist, such as...

Towards Social Media as a Data Source for Opportunistic SensorNetworking

James Meneghello1 Kevin Lee1 Nik Thompson1

1 School of Engineering and Information TechnologyMurdoch University,

Perth, Western Australia 6150,Email: [email protected], [email protected], [email protected]

Abstract

The quality and diversity of available data sourceshas a large impact on the potential for sensornetworks to support rich applications. The high costand narrow focus of new sensor networkdeployments has led to a search for diverse, globaldata sources to support more varied sensor networkapplications. Social networks are culturally andgeographically diverse, and consist of large amountsof rich data from users. This provides a uniqueopportunity for existing social networks to beleveraged as data sources. Using social media as adata source poses significant challenges. Theseinclude the large volume of available data, theassociated difficulty in isolating relevant datasources and the lack of a universal data format forsocial networks. Integrating social and other datasources for use in sensor networking applicationsrequires a cohesive framework, including datasourcing, collection, cleaning, integration,aggregation and querying techniques. While similarframeworks exist, they require long-term collectionof all social media data for aggregation, requiringlarge infrastructure outlays. This paper presents anovel framework which is able to source social data,integrate it into a common format and performquerying operations without the high level ofresource requirements of existing solutions.Framework components are fully extensible, allowingfor the addition of new data sources as well as theextension of query functionality to support sensornetworking applications. This framework provides aconsistent, reliable querying interface to existingsocial media assets for use in sensor networkingapplications and experiments - without the cost orcomplexity of establishing new sensor networkdeployments.

Keywords: data sourcing, data mining, dataintegration, social media, social data, integrationframework, sensor networking, resource preservation

1 Introduction

Sensor networks are a grid of spatially distributedmultifunction sensor nodes designed to cooperativelymonitor environmental conditions and pass collecteddata in a collaborative fashion for further analysis.

Copyright c©2014, Australian Computer Society, Inc. Thispaper appeared at the Australasian Data Mining Conference(AusDM 2014), Brisbane, 27-28 November 2014. Conferencesin Research and Practice in Information Technology (CRPIT),Vol. 158, Richi Nayak, Xue Li, Lin Liu, Kok-Leong Ong, Yan-chang Zhao, Paul Kennedy, Ed. Reproduction for academic,not-for-profit purposes permitted provided this text is included.

These networks have provided numerous benefits tosociety since their development, and have previouslybeen deployed in areas to monitor volcano activity,floodplains (Hughes et al. 2008), bushfires andearthquake risk zones (Akyildiz et al. 2002).Human-oriented sensor networks (such as BodyArea Networks (Chen et al. 2011)) have also beendeveloped with the aim of providing benefit toindividuals through better monitoring of healthconditions, as well as epidemic detection for virusessuch as influenza (Okada et al. 2009). These systemsrely on analysis of data collected by the sensors inorder to provide actionable advice, and the majorityof these systems have application-specificimplementations. There is some repurposing ofcollected data, but sensor networks for generic useare generally not deployed as acquiring fundingwithout a well-defined scope is difficult.

Sensor networking applications are constrained byavailable data sources. Supporting new applicationsoften requires deployments of sensor hardware thatare able to collect new types of data. This data canthen be used in new analytical applications or usedto extend and enhance existing applications. Usingexisting data sources to enhance sensor networkingis an attractive prospect, as it does not require theexpense of developing, implementing and maintaininga new sensor deployment.

To support the extension of sensor networks inthis manner, existing networks capable of providingdata must be located. A candidate network shouldbe semantically and structurally similar to sensornetworks to allow a smooth application of existingsensor networking techniques to the new datasources, and also support integration with existingsensor networks. It is likely that multiple datasources properly integrated into a cohesive datasetof a standardised format could meet theseconditions, as long as the data was procured in away similar to sensor networks: a node collectsevents from the environment and communicatesthem to a central host.

Social media is an ideal candidate for use as adata source in sensor networking. It is comprised ofa number of isolated networks covering much of theglobal population, providing penetrative reach intodiverse communities. It produces very large amountsof variable-quality data with accompanyingmetadata. When properly filtered and cleaned, itcan be a source of good-quality, relevant data forintegration into sensor networking applications.Social media also follows similar design patterns tosensor networking, being primarily event-based, andcan conceptually be treated much the same way as atraditional sensor network. Social users (nodes)write comments (events) that are transmitted to

followers (connections). By identifying points ofinterconnection between social networks and sensornetworks and providing appropriate metadata, wecan inject social user data into sensor networks toemulate additional sensor nodes.

The near-ubiquitous acceptance of social mediawithin modern society has led to strong userbasegrowth across all platforms, with 56% of Americansacross all age groups now using at least one socialnetwork and 96% of those aged between 18-35(FormulaPR 2011, Edison 2012). Globally, 26% ofthe population use social media, including 57% ofAustralians and 46% of Chinese (Singapore 2014).Social media users use tend to use differentplatforms depending on their nationality - Cyworldis estimated to host 50% of the social networkingprofiles of South Korean users, while Mixi is morepopular in Japan and 20% of Orkut’s population isIndian (Vasalou et al. 2010). Integrating these socialnetworks would provide a truly global,culturally-diverse, rapidly-updated data source forsensor networking.

There is great demand for this social data withinthe academic and industrial communities. Socialstudies rely heavily on qualitative results fromuser-generated content such as surveys, whileadvertising entities harvest user metadata to moreappropriately target advertising. Currently, data forthese activities is collected directly from users,either by asking for their participation in studies orby using web-based tracking technology to monitortheir use of the internet. These two forms of datacollection have also been implemented in socialsensor networks, as participatory (Burke et al. 2006)and opportunistic (Lane et al. 2008) sensing.Problematically, both of these implementationsrequire ongoing software deployments and usage ofresources to monitor the environment - potentiallydissuading users from continuing participation.

Using social media as a data source for sensornetworking presents a number of challenges, fromrealistic and technical perspectives. Allowing themass collection of data from users in aneasily-queryable format could have unintendedconsequences, such as the mass-reporting of userlocations (Borsboom et al. 2010). Analysis is mademore resource-intensive by the high throughput ofdata posted to social media - Facebook alone collectsand warehouses almost half a petabyte of data perday (Ching et al. 2012). Platforms to aggregatesocial media already exist, such as Datasift(DataSift 2010) and Gnip (Gnip 2008), and providethis service by performing widespread collection ofall social media data, requiring substantial storageand processing infrastructure. These platforms arenot developed with sensor networking integration inmind, and the event-based behaviour of social nodesis lost during the integration process.

In this paper, we propose a framework whichperforms data mining, filtering, integration andquerying of generic social networks. This frameworkis designed to integrate with sensor networkingsystems and analytical tools to allow researchers andindustry to leverage user-driven content for thepurposes of event-detection, metadata analysis andgeneral user analysis. This is achieved throughconflating the concepts of social networks and sensornetworks and provides a robust data integrationchain with a data querying engine designed tostreamline access to desired data. In addition, toolsfor mining data from social media are detailed andintegrated into the framework to provide a morecomplete picture of each user and their content than

merely parsing a single Application ProgrammingInterface (API). The framework provides a simplifiedquery interface to a cohesive data-set made fromgeneric social media networks, providing researchersand industry with access to a large, global, genericset of user-generated metadata and content.

Section 2 presents a review of literature andprevious work completed in the areas of social datamining and opportunistic sensing. Section 3 analysesthe use of social media as a data source for sensornetworking, as well as the challenges involved indoing so. Section 4 describes a framework designedto alleviate some of these challenges by optimisingthe data collection process. Section 5 evaluates thedata sourcing algorithm ability to discover potentialdata sources relevant to a given topic. Finally,Section 6 presents some conclusions.

2 Background

The process of collecting data from disparate socialnetworks and integrating it for common use utilisestechniques from a broad spectrum of computerscience research. Supplying user data as a datasource for sensor networking is commonly referred toas participatory or opportunistic sensing, whereparticipatory involves the direct contribution of datafrom users where opportunistic uses passiveobservation of users. This section examines the useof both of these methods of data collection fromsocial networks, as well as examining previousapplications of social data in analytical studies.Finally, data integration techniques are examined foruse in linking collected data across social networks.

2.1 Participatory and Opportunistic Sensing

Sensor networks are typically a spatially-distributedmesh of potentially thousands of sensors, withsensor nodes deployed in a configuration befittingthe desired application. Each node communicatescollaboratively in order to move collected data backto the sink node - using other nodes as a relay toincrease communications range. The sink node thentypically has a connection back to more powerfulresources for processing the information, though thismay also be collected manually. Low-power wirelesssensor networks are often used to collect data fromthe natural environment, the human body,mechanical equipment and many other sources. Thisdata is, in turn, analysed and logged or used toactionable effect - for example, to trigger a warningin the event of an earthquake.

Participatory sensing uses existing mobile devicesand users as nodes within a sensor network (Burkeet al. 2006) by encouraging users to gather, analyseand share local knowledge in what is commonlyreferred to as ”crowdsourcing” (Kanhere 2011). Bytreating users as sensor nodes, participatory sensingtakes advantage of resources that already exist toextend the reach of sensor networks for a number ofpurposes, including urban planning and policydevelopment. Smartphones also come with a numberof sensors that can be made available forparticipatory sensor networks, including theimportant contextual metadata sensors of locationand time.

Participatory sensing requires direct and activeparticipation of users, which comes with a number ofchallenges. One of the key issues with using peopleas nodes in sensor networks is that they are oftenless reliable than hardware sensors (Hughes et al.

2014). Users are able to choose whether to providedata on a requested topic, and may choose not to doso. Participatory sensing is named as such for areason - without active participation, the system isunable to produce useful outcomes.

Further research in using participatory sensingfor urban sensor networks has resulted in thedevelopment of several auxiliary approaches.Opportunistic sensing reduces the burden on theuser by lessening direct participation - insteadrelying primarily on the devices the user carriesaround (Lane et al. 2008). Applications onsmartphones can take sensor measurements withoutbothering the user and pass it along to sink nodesover mobile networks. While quality data is stilldependent on the user’s presence in an area ofinterest (Min et al. 2013), the user is no longerrequired to directly answer queries. Data producedby hardware sensors through opportunistic sensingis objective and usually of good quality, as it is notprovided directly by the user and is in a standardformat (Campbell et al. 2006). Under this approach,collected data is only provided by sensors directlyattached to the device, and users cannot be queriedfor additional information. This use of smartphonesin opportunistic sensing is commonly referred to asMobile CrowdSensing (Ganti et al. 2011).

In order to leverage opportunities provided byboth participatory and opportunistic sensing, somearchitectures combine both techniques (Guo et al.2014). This allows users to provide data asrequested and fill contextual data usingopportunistic sensing. Concerns with data qualityarising from direct user involvement remain.

A major problem with many implementations ofboth participatory and opportunistic sensor systemsis the requirement of manual application installationby the user. To retrieve data from a smartphone’ssensors, an application must be installed that allowsthe smartphone to operate as a sensor node. Theseapplications are usually neither large nor difficult toinstall, but even the smallest of hurdles can hamperefforts to leverage users as sensors. These servicesalso consume energy on devices that have limitedresources, which may drive some users away. Use ofthese services usually relies on providing the userswith some kind of benefit, which is an ongoing cost.

To alleviate these issues, sensor middleware suiteshave been developed (Hachem et al. 2013, Hugheset al. 2009) for smartphones that only require asingle installation, even if there are multipledifferent deployments using the device. Theseframeworks vastly simplify sensor installation andreconfiguration, further reducing burden on the user.

The framework proposed in this paper aims touse indirect participatory and opportunistic sensingat a higher level by analysing social media. Whilefacing similar data quality challenges to bothtechniques, we work around user participation byonly monitoring existing social media usage. Noadditional applications need to be installed and noextra resources are consumed by personal devices.

2.2 Social Media

Effectively deriving useful data points from socialmedia is a topic of much discussion, and presents anumber of challenges (Maynard et al. 2012). Postsby users on social media are usually free-form - thedata comes in no particular standard format, asthey often represent part of a stream ofconsciousness from a user. Additionally, posts arenot limited to text and users often share pictures,

videos, diagrams or graphs which present uniquechallenges to data analysis.

The proliferation of social media has drivencontent and context creation, but has also made itmore difficult to locate appropriate data sources(Wandhofer et al. 2012). Where topics were oncediscussed in semi-centralised locations such asUsenet, the integration of commenting systems intomany different news and blog sites has dispersedinformation to this point where it is unrealistic toattempt collection of all relevant discussion relatingto a topic. Instead, discussion centers can bediscovered by tracking the spread of topics acrossthe internet and monitoring the most activecommunities.

Many existing studies on using social media forevent detection tend to focus on a single socialnetwork (Li & Cardie 2013, Cameron et al. 2012,Sakaki et al. 2010, Robinson et al. 2013), limitingcollection specifically to the demographicsrepresented on the chosen platform. Culturaldemographics can vary significantly per platform -Cyworld is estimated to host 50% of the socialnetworking profiles of South Korean users, whileMixi is more popular in Japan and 20% of Orkut’spopulation is Indian (Vasalou et al. 2010). In orderto develop a truly global and cross-cultural socialmedia monitoring system, data collectionmechanisms should be extensible to any form ofonline social media, rather than a specific fewplatforms.

In order to realise the concept of global genericsocial sensor networking, a diverse set of datacollection and integration techniques need to beutilised. Different social networking platformsoperate on different data structures, using differentstorage engines and producing entirely differentoutput. Even considering this, social media data isconceptually similar across platforms, consisting of anumber of common constructs (users, friends,connections, messages, events). Because of thisconceptual similarity, it is possible to integrate thisdata into a single queryable dataset through the useof data collection and integration techniques.

2.3 Generic Collection and Integration

Generically utilising data from different web servicesrequires integration between social media platforms.While programmatic access to web services hasvastly improved since the introduction of Web 2.0paradigms, interoperability and data integrationbetween social networks remains an issue. There islimited interoperability of services provided for thepurposes of open authentication, but content sharingis usually limited. There have been some attemptsto apply semantic web principles to the problem,including Semantically-Interlinked OnlineCommunities (SIOC) (Breslin et al. 2009) whichdescribes social networks using the ResourceDescription Framework (RDF) to improveinteroperability. While RDF has yet to reachwidespread adoption and is unavailable for use withmany social media systems, the data structures andprinciples in use provide a good platform uponwhich to base further work.

There are a number of challenges surrounding thematching of social media profiles between networks.The FOAF (Friend-of-a-Friend) Project (Brickley &Miller 2000) attempts to extend Semantic Webefforts to social media by providing a base for userprofile matching across networks. It does this bycombining names and user metadata (such as

location, email addresses or education details) toprovide a more substantial set of data to improveaccuracy of matches. As with the SIOC project, fewsocial networks actively support such efforts andmost do not provide FOAF output for users.

The majority of studies conducted using datacollected from social media follow a reasonablysimilar process: conduct (manual or automated)searches of a social network, save or export returneddata in a simple format, and perform analysis(online or offline) to determine answers to aparticular query. While this process is reasonablygeneric, there have been no attempts to develop aquerying engine for social media analysis. Queryengines have been developed for a range of otherpurposes (Khoury et al. 2010, Madden et al. 2005)with the intention of providing a standard queryinginterface and abstraction layer on top of complexdatasets. The development of a query engine thatcan genericise collection and analysis over multiplesocial media platforms without requiring largeinfrastructure outlays would be a useful addition tosocial media research.

2.4 Integration of Social Data

Data integration can be a complex process,depending on the complexity, relevancy and size ofthe converging datasets. Numerous techniques existto handle the process of querying integrateddatasets, designed with different operatingrequirements. Some techniques require the offlinetransformation and storage of data for laterquerying, while others can handle-queries inreal-time by inferring the goal of the query andtransforming appropriate data as required.

Extract-Transform-Load (ETL) systems arecommonly used in data warehousing, where data isinitially cleaned and transformed for storage andlater use (Vassiliadis 2009). Due to the extensiveprocessing that data undergoes in order to be in aclean state with unified schema, this process isgenerally not real-time, and can suffer from datafreshness issues. This approach is therefore onlyuseful for use in delayed queries, and its use withsocial media would exclude any real-time sensingand reactive systems.

There are also dataflow processing systems suchas Google Cloud DataFlow (Perry 2014) that wouldbe appropriate for the task of integrating large socialdatasets. In conjunction with the use of BigQuery(Sato 2012), we can provide a platform providing aqueryable interface to real-time integration of socialstreams that can be use in decision support systems,providing actionable insights.

Some attempts have previously been made tosupport interoperability between social networks,particularly the Semantically-Interlinked OnlineCommunities (SIOC) project (Breslin et al. 2009).The SIOC project uses the Resource DescriptionFramework specification to integrate some elementsof social networks, particularly the association ofuser profiles over different networks. SIOC exportershave been developed for a limited number of socialnetworks.

Commercial services such as Datasift (DataSift2010) and Gnip (Gnip 2008) perform integrationand long-term storage of social data. The integrateddata is then presented to applications through aquery API, which is often used by companies tomonitor their online social profiles. This allows forrapid response to user complaints directed at socialfollowers (rather than the company itself, through a

complaints procedure). The integration process ofthese systems does not consider the event-basedbehaviour of social nodes, and are generallyunsuitable for use in extending sensor networking.Ongoing access costs can also be a constrainingfactor for applications intending to use social data,and independently deploying such a system requiresan infeasiable level of processing and storageinfrastructure for most projects.

3 Social Media as a Data Source

Social media is comprised of a number of isolatednetworks covering much of the global population.They are primarily used for facilitatingcommunications between people over the internet,social networks have also been used to gauge userresponse to advertising (Taylor et al. 2011) and alsoas transmission mediums for sensor networks.Potential sensor networking applications rely onavailable data sources, and the penetrative reach ofsocial media into global communities and vastamount of data posted provides an ideal data sourcefor this purpose. Properly prepared, social data issuited for further analysis, particularly for detectingcross-cultural events or insights. Using social mediaas a data source for sensor networking is not atrivial task. There are significant challenges facingany platform aiming to facilitate this dataflow.

To use social media in this way requires theintegration of disparate social networking platforms.There have been attempts to ease the bidirectionalmigration of data on social networking platforms, asnoted in Section 2.3, but these have generally hadlittle industry support. Most networks have aninterest in ”locking-in” customers, to dissuade themfrom migrating between networks and losingassociated advertising revenue. Easing dataintegration processes would also accelerate migrationbetween competing networks, so efforts tostandardise export formats are unlikely to everachieve industry co-operation. Therefore, newtechniques supporting inter-network social dataintegration need to be developed in order tofacilitate the integration of social media into sensornetworking.

To enable the use of social media as a data sourcefor sensor networking, a framework supporting thisgoal must be developed. This framework is madefrom a number of processes designed to takeheterogeneous data sources, integrate them into acommon schema, provide generic queryingfunctionality and present appropriate output forfurther analysis or integration into sensor networks.

This framework must address a number of keychallenges:

1. Collecting data from all available sources foruse in sensor networks can be infeasible due tothe sheer amount of data being produced. Toquery social media data in an efficient mannerwhile retaining the ability to query as muchrelevant data as possible, some way of reducingthe incoming flow of data to exclude irrelevantsources is required.

2. To use social media as a sensor, sourced dataneeds to be retrieved. Retrieving social mediadata can be straightforward if an API isprovided, or require manual scraping if keyinformation is not provided by the API.

3. Social media data is noisy because it is almostentirely user-generated and doesn’t adhere to astandard structure or format. This datarequires extensive cleaning to remove spam andbring low-quality content up to a usablestandard (Agichtein et al. 2008). Withoutperforming this cleaning, using the data inautomated applications becomes significantlymore difficult.

4. In order to present the collected data in aformat that can be integrated with sensornetworking applications, there must be a waycomparing diverse datasets. Without a methodof mediating schematic differences between datasources, every application would require manualmapping of data structures.

5. The execution of queries over datasets as largeas those that social media provide can bechallenging, due to resource constraints andmissing values. Sensor networks also deal withdata in many different ways and can requireextensive pre-processing and aggregation to beperformed prior to use.

6. Some sensor networking applications canintegrate data in simple formats such as JSON,but others require more complex techniques(such as event injection) to emulate social nodesas sensor nodes.

In order to properly define steps in the socialintegration process, each of these challenges isrepresented by a key area of functionality,respectively: Sourcing, Collection, Cleaning,Integration, Querying and Presentation. Once asolution for each of these challenges has beenidentified, they can be joined into a unified processsupporting the integration of social media andsensor networking. These challenges are describedmore fully in the sections below.

3.1 Sourcing

One of the most simple optimisation steps inlarge-scale data processing applications is to filterincoming data, resulting in reduced processing foreach successive step in the integration process. Associal media has an extremely broad scope, relevantdata sources must be located in order to efficientlyleverage social data in sensor networkingapplications. Without this initial filtering process,much of the social data processed may becompletely irrelevant to the application. As thisdata must undergo cleaning, integration, storage andquerying, early filtering can result in significantresource savings. Hence, the process of findingquality data sources is very important.

Sourcing involves locating social data streamsthat provide data relevant to the intendedapplication. On the internet, many of these sourcescan provide access to historical or real-time datastreams. Both of these types of data can be useful toextend or enhance sensor networking applications.Real-time data can actively replace or enhancesensor nodes with additional data sources, whilehistorical data can provide longitudinal context.Data sources can also be normal rich-text web pagesthat can require substantial parsing and cleaningbefore use.

Most of these data sources also provide access tofurther sources. Social media posts often containhyperlinks to static web content, and also contain

metatags (such as other usernames and hashtags).Static content is often interlinked, with mostwebsites providing hyperlinks to other relevantmaterial on associated sites.

Figure 1 presents the process by which sourcingoccurs. The query specifies relevant keywords, whichcan be expanded upon by use of predefineddatabases and appended to by examining oft-usedkeywords on strongly-relevant search results. Thesekeywords are used to query known search APIs,returning a list of locations that potentially containresults. The exact method used for data access andsearching can vary for each system, as each platformprovides differing methods of accessing and filteringdata. Some examples of different methods are:

Facebook Using the API, search for public postswith related search terms, popular news feedsand other items of interest. Additionally collectcomment authors.

Twitter Using the API, search for public tweetscontaining related keywords. Additionallycollect information about replies to tweets, andalso examine hashtags commonly appearingwithin the initial set.

Blogs Using Google’s Search API, search for publicblog posts containing relevant keywords,including author information and comments.Additionally collect relevant results fromcommenters’ own blogs.

Forums Using manual page scraping andauthentication for forum software such asVBulletin and phpBB, search for forumscontaining threads relevant to our query.

Figure 1: Finding relevant sources with the sourcingprocess

Upon completion of this initial phase, a list ofrelevant places to look for information relating tothis single query has been collected. Not all datapresent within these sources is useful, and some maybe completely irrelevant. By selecting a fairly widesub-set of available data, we have already limitedthe initial collection to a feasible scope, allowing formore accurate filtering later. This optimisation canpotentially exclude obscure results, but this is anecessary trade-off.

3.2 Collection

To use data from identified sources in applications,data must be collected. There are three mainapproaches to data collection: the use of ApplicationProgramming Interfaces (APIs), access to formattedreal-time streams, and ad-hoc data collection.

Using APIs to collect data involves sending aspecifically-constructed request to a provided server.This request contains a number of parameters usedto focus the scope and range of data returned by theAPI. The requested data is then returned to therequester in a well-defined format. Some APIsprovide a limited view of data available to thesource, requiring the requester to follow-up with anad-hoc request for more data. These initial APIrequests often lead to further queries for relateddata (such as collecting user profile data for usersthat have posted in a comment thread).

Data can be collected from real-time streams byrequesting stream access from a source server. Thesource server then pushes a constant stream ofreal-time data towards the requester, in what isoften called a ”firehose” stream. This data ispresented in a well-defined format that can bemapped to an appropriate data schema. Manyfirehose streams consist of all available real-timedata being produced by the source. The amount ofdata provided by this real-time stream can beproblematic for systems operating with restrictedbandwidth, processing or storage resources, and caneasily overwhelm low-resource systems.

Ad-hoc parsing or scraping of data can be usedfor data sources that have no defined format and donot provide an API or formatted data stream. Thisusually requires the manual development of parsingalgorithms tailored to specific sources. Automatedparsing algorithms can also be used, to varyinglevels of success. Ad-hoc scraping can also be usedto enhance data provided by APIs, where data ismissing or deliberately restricted from API access.

Figure 2 illustrates an example structure forhandling data collection across multiple source typesand authentication methods. For many sources,collection can occur through accessing anetwork-provided API, such as Facebook orTwitter’s APIs. Many APIs operate on similarauthentication standards (such as OAuth (Hardt2012), depicted), requiring minimal work to writewrappers for a generic scraping engine even across awide variety of different sources. Other sourcesrequire collection through page-scraping, a moreresource-intensive process that involves writingparsers for source pages and handling pageauthentication in a customised manner. While APIaccess is generally less resource-intensive andrequires less development work than scraping,scraping can potentially provide more data.

Data collection is limited by the interfacesprovided by the API, e.g. the Facebook API doesnot provide location information for users, even ifthe data is set to be publically viewable. Asecond-pass scraper can access the profile pagedirectly to collect this information, providing morecomplete meta-data but is also moreresource-intensive. Second-pass collection can alsoinvolve further queries to the API for related data(such as collecting user profile data for users thathave posted in a comment thread).

The collection process can be very easilydistributed across resources, as each first-passoperation is isolated. The initial source list can bepackaged by a workflow manager and work assigned

Figure 2: Collection module flow

in an optimised manner to ensure that requests arespread evenly over workers. When workers have runout of allocated work, second-pass collection canoccur in a manner that ensures atomicity.

3.3 Cleaning

Data collected from social media and other internetsources is noisy, and can require extensive cleaningand processing. Most social data is user generatedand adheres to no particular structure, primarilybeing unstructured conversational language.Differences in data formats between sources can alsobe problematic, such as the use of different datestandards between cultures and systems.Conceptual differences between individual dataelements across different networks can also requiremanipulation of data into a unified standard.

There can also be structural problems with socialdata depending on its age and nature. Legacy datahas often undergone repeated format and schemashifts, so data from these sources can requireextensive cleaning. Modern sources adhere tostricter standards and usually require less cleaning.

The method of collection can determine theextent of data cleaning necessary. API-providedresults are usually retrieved from the database andoutput without presentation, and therefore adhereto internal database quality standards. Datareturned from ad-hoc scrapers can require extensivecleaning, including removal of undesirable elementssuch as HTML tags and extraneous characters.

It is at this point that cleaning data for privacyreasons may be handled. Anonymisation of userscan be performed during the cleaning process toensure user privacy for applications that do notrequire identifying information to be stored.

3.4 Integration

Data from different social sources are collected usingtheir original schemas. Attempting to query dataacross these many schemas can be difficult withoutproper data integration, as it requires the queryengine to specifically deal with many different dataformats and sources. Data integration provides theability to query data over a mediated schema, inwhich all sources can be treated in a genericmanner. To use this data set in sensor networkingapplications, collected data must be integrated.

There are a number of available data integrationtechniques supporting this goal, including thosediscussed in Section 2.4. The integration process canbe operated by predefined mappings from collecteddata schema to the global schema, such as the RDF

specifications used in the SIOC project (Breslinet al. 2009). It is possible to use automated mappingalgorithms (Doan et al. 2001) to develop page andAPI wrappers that require minimal modification bya developer, reducing development time.

These mappings can be applied to data in anumber of different ways, depending on the databaseengine used. Data mappings to relational designscan store the integrated data in relational databasemanagement systems, while there are also optionsfor NoSQL, key:value and tuple stores. Eachprovides the ability to perform a different form ofintegration, so any integration design process shouldalso take the choice of query engine into account.

3.5 Querying

To provide only relevant and desirable data tosensor networking applications, there must be amethod of restricting, aggregating and analysingintegrated data. Analysis performed may requireresults aggregated by categorical variables, and thiscan be handled using querying. Querying is animportant part of data analysis, and is readilyavailable in all database systems.

As the integrated data is loaded into a relationaldatabase management system with a unified schema,query processing and optimisation is greatlysimplified. Additional operators and aggregatefunctions can be added to the query syntax to allowfor the sourcing and collection mechanisms detailedin 3.1 and 3.2 respectively, including the ability tofilter by Site and restrict potential data sources bykeyword. Queries can be provided in both SQL-likesyntax or to a RESTful API, using JSON or XML.

The utility of integrating social data into sensornetworking applications ultimately hinges on theability to adequately query the data. Other systemsdesigned to integrate diverse data sources intoapplications place significant emphasis on theirquerying engines. Software projects such as Google’sBigQuery and Facebook’s Presto were designedspecifically to query large-scale sets of data inreal-time. Querying social data therefore needs to beefficient in order to provide real-time results tosensor networking applications.

3.6 Presentation

Effective data presentation is required in order touse integrated data in sensor networkingapplications. Applications often handle differentinput data formats, such as XML, JSON or eventpackets. Presenting the data in this format forintegration should be handled in such a way as to becompatible with any application.

Presentation can be delegated to clientapplications, with the querying engine onlyproviding result data in a variety of text-basedformats such as JSON and XML. This data can thenbe provided for import into other systems or directlyanalysed for use in graphical applications. For usewith automated systems, queries can be designed toprovide simplified output to directly trigger actionsor provide detailed output for further use inDecision Support Systems.

Direct integration is more involved but allows forsocial nodes to be used directly in sensor networkingapplications. This works by developing outputformatters that exist within those networks andrebroadcast data in the format required, often eventpackets. This allows social data nodes (users) todirectly act as nodes within the sensor network.

4 A Framework for Supporting Social Mediaas a Data Source

In order to support the integration of generic socialmedia data into sensor networking applications, thefunctional components described in Section 3 areimplemented in a streamlined data processingframework. This framework encapsulates allnecessary operations from initial data request tosourcing, collection, cleaning, integration, queryingand presentation of returned results. For ease ofpresentation, the framework architecture ispresented in multiple views.

Figure 3: The process of finding new data sources

Figure 3 presents the workflow for sourcing andcollection. In this stage, the framework takes theinitial query and expands upon the specifiedkeywords in an iterative process. Potential sourcesare taken from historical data and queried forrelevance, and commonly-recurring users andkeywords are crawled to discover additional relevantsources. A wide range of sources are discovered,analysed and discarded during this process to ensureadequate data coverage of relevant sources.

Sourcing is a two-step process. Using an initialseed set of keywords, the Sourcers locate sources anduse content from these sources to expand andreinforce the keyword lists, as well as isolatingrelevant metatags (such as URLs or usernames). Inthe second step, the metatags are put back throughthe Sourcers to locate additional sources and furtherreinforce keyword and source lists. Each additionaliteration of this process serves to further strengthenthe keyword and source list, effectively usingmachine learning to focus the search space.

Sourcing can be performed in two modes. Thefirst is a narrow search that attempts to find themost relevant sources while being hesitant to expandthe search space. The second is a much broadersearch that uses all available content to train thekeyword set and broaden the search space, findingpotentially-relevant sources over a much larger area.Narrow searches are designed to find the mostrelevant sources quickly, while the broad search canfind distantly relevant sources but additionalresource requirements.

For organisational reasons, the sourcingcomponents are integrated with the collectioncomponents. Functionality is shared between thesetwo processes, with some content collection requiredto assess relevance of sources (and completecollection required during the collection phase).

The sourcing and collection software processescan be distributed across multiple resources, onlyrequiring synchronisation to finalise the completedsource list and return the completed collectionresults. Individual site wrappers may also bedistributed, which can be particularly desirable inthe instance of micromanaging API throttle limitsthat tend to differ between platforms. Distributionof these processes provides significant performanceboosts in the sourcing, collection and filteringphases.

Relevant sources are cached for two purposes.The first is to expedite the sourcing process forrepeated similar queries and reduce overhead andexternal API usage, which associates sources withparticular keyword sets. The second is to identifyrich-text websites that contain a large amount ofuseful data or pages. By keeping track of which sitesare providing useful data sets, candidates for furtherwrapper development can be identified.

Figure 4: Example data flow within framework

The framework wraps around a RelationalDataBase Management System (RDBMS), as shownin Figure 4. This wrapping occurs in multipledirections by allowing modified incoming queries,automating data collection, extending availableaggregate and filtering functions and also providingfor formatted output. For simplification, Figure 4does not show processes relating to sourcing andfirst-pass filtering, which are instead shown inFigure 3.

After determining a source list, the frameworkscrapes information from sources using collectors.

The collectors are either source-specific wrappers orgeneric web scrapers that can collect varyingamounts of information from sources. Source-specificwrappers (e.g. Facebook or Twitter) return data ina standard format that can be easily integrated,while the generic scraper returns data that may below-quality or require cleaning, as discussed inSection 3.3. This cleaning process can requirefurther related data to be requested from sources.

The integration component takes cleaned dataand integrates it over the mediated schema, asshown in Figure 5. Once complete, the data is fit tobe inserted into an RDBMS using a pre-definedschema. This schema was developed using both theSemantically-Interlinked Online Communities(Breslin et al. 2009) and Friend-of-a-Friend (Brickley& Miller 2000) projects as inspiration, allowing forsimple output to common semantic web formatswhile still retaining a high standard of performanceover large datasets. The schema has been modifiedto work within an RDBMS, which retaining as muchflexibility as possible.

Figure 5: Internal mediated data model

Figure 5 presents the internal data model used tostore integrated web data. Designed for use withrelational database management systems, the modelsupports enough expression to adequately covermost social network and social page examples. Notall aspects of the model are utilised by everynetwork, and some networks use certain modelsdifferently: while Twitter and Facebook can use theTag model to represent hashtags and othermetadata, blogs can use the same model torepresent word tags as part of the word clouds usedin most blogs.

Some model entities in Figure 5 see significantreuse due to their generic nature. Blog posts,comments, tweets, statuses and forum posts are allconceptually similar, generally consisting of alimited attribute set: title, date (posted/created),author (and other user details), content and a set ofreplies, along with other assorted metadata. Asimple Post entity (related to Accounts to provideauthor and commenter data) can satisfy all of theserequirements, including any comments or replies byusing an adjacency relationship.

Queries submitted to the framework aretranslated before being passed through to theRDBMS, as there are extensions made in order tosupport sourcing and presentation and the originalqueries are not SQL. Custom second-pass filters andaggregators can be called directly, assuming theRDBMS supports user-created functions. Using thefilters in this way also allows the system to cacheand optimise queries natively where possible,

without requiring extensions to the database engine.Once the query has been completed, the data can

be formatted by presenters. These presenters can bein the form of a RESTful Web API that outputsJSON, XML or graphed images, or can alternativelybe integrated into sensor networking systemsthrough use of custom event generators. Normally,presentation is handled at an application-level and isinappropriate for inclusion, but the platformeffectively wraps the RDBMS and provides thisfunctionality to enable simple access to the data.Accessing RDBMSs without an interface layer raisesthe barrier to entry for users, requiring directconsole access and usually provides data inawkwardly-formatted text.

5 Evaluation

One of the prominent and novel advantages of theproposed framework over existing approaches is theuse of on-demand social data sourcing. Effectivedata analytics relies on the presence of quality datasources, so this is an important goal. The evaluationprovided therefore emphasises data sourcing, andexamines the relevance and quality of data sourcesdiscovered.

5.1 Experimental Setup

A proof of concept experiment for the sourcingalgorithm was implemented in Python 3.4, takingadvantage of the Natural Language ToolKit (Birdet al. 2009) library for text mining. The sourcingalgorithm was executed on an Amazon EC2m3.medium instance, and individual experimentswere conducted in a single run to ensure that theavailable social data did not dramatically changebetween runs.

Sourcers were implemented for Twitter andGoogle Custom Search, as well as a generic webscraper used to further build keyword lists fromsources. Both services are subject to API throttlinglimits, and these limits are taken into accountduring the sourcing process.

The process for each experiment consists of aseries of iterations. A single iteration consists of thefollowing steps (subject to configuration values):

1. Search Sourcers using initial seed keyword

2. Generate target source list

3. Retrieve content from list of sources

4. Mine content for commonly-occurring keywordsand metatags (URLs, Usernames, Emails)

5. Further add to source and keyword lists

6. If [broad search]: Search Sourcers using metataglists (ie. Twitter users) and add to sources

7. Iterate using expanded keyword set

There are a number of possible configurationoptions for each experiment, which are explainedbelow:

broad search Whether the sourcing algorithmshould also take sources from collected content

max search results The number of results toreturn from a Sourcer search (important toavoid API throttling)

The broad search algorithm also collects specialmetatags from content, such as usernames andhashtags on Twitter, and email addresses or URLsin static content. These metatags are then used tobroaden the search scope, discovering additionalsearch vectors and providing a significantly highernumber of data sources while being subject to muchhigher noise levels. These searches are examined inthe next section.

5.2 Relevance

The relevance of discovered data sources is animportant metric in evaluating the usefulness of thissourcing algorithm. In order to evaluate this, thealgorithm was given a broad seed keyword(“Australian politics”) and let to run over multipleiterations, with an upper limit on the number ofsearch results returned from each API of 250 forTwitter and 50 for Google. The source list wasexported to a comma-separated value file and eachsource was manually evaluated for relevance to thetopic. The algorithm was then executed twice: onceas a normal search, and once with the additionalbroad searching options enabled.

We define sources as relevant based on a numberof factors: whether the source provides data relatedto the queried topic, the ability of the source toprovide ongoing data and other sources, and thequality of data provided and its ability to be used inqueries. If a source satisfies these criteria, it isconsidered relevant. All other sources, includingthose recorded as a result of algorithmic mishaps(such as misparsed hyperlinks and advertisingservers), are deemed as irrelevant. Relevance is atherefore a boolean result.

As to be expected, the most relevant sources arequickly and easily found by a normal search with alow percentage of irrelevant results in the firstsourcing iteration, as seen in Figure 7a. The broadsearch finds a higher number of relevant sources, butwith a significantly higher number of irrelevantsources, shown in Figure 7b. In both instances,these sources are expanded and new keywords arederived, and additional relevant sources continue tobe found at a lessening rate.

After the first two iterations, new sourcescontinue to be found at a relatively linear rate. Therelevance of discovered sources decline steadilyrelative to the total after the second iteration, butstill maintain an acceptable rate of discovery. Figure6b shows a steep increase in irrelevant sourcesduring the fourth iteration, as the search spaceexpands out beyond any semblance of relevance.The broad search maintains a much more evenpercentage of relevant results over the search space.

5.3 Keywords

The search keywords (initially seeded as “Australianpolitics”) are expanded based on discovered contentand the most relevant keywords float to the top ofthe list. The first iteration of both searchesimmediately expand the keyword set to containmostly relevant keywords, as seen in Table 8a.Successive iterations narrow the search space downto a specific set of topics that quite accurately frameAustralian politics. Interestingly, the broad search(which relies more heavily on page content ratherthan Twitter) contains a higher number of historicalkeywords, seen in Table 8b. A number of thesekeywords from a broad search relate to thegovernment as of 2013, whereas those from the

(a) Normal search

(b) Broad search

Figure 6: Relevance of data sources

normal search in Table 8a relate more to the currentstate of Australian politics, circa 2014.

The difference in keyword sets between the searchtypes also affects the relevancy of discovered sourcesafter successive iterations. The iterative improvementof the search space during broad searchers potentiallyexplains why even the third and fourth iterations ofsearching still contain a relatively high percentage ofrelevant results, as seen in Figure 6b. The normalsearch relies on a much smaller set of newer contentfrom which to derive keywords, resulting in a lowerdiscovery rate of historical data sources.

5.4 Signal-to-Noise

The signal-to-noise ratio of each search type wasalso examined, relative to the total number ofsources discovered. A normal search discoversrelevant data sources at a ratio of over 7:1 duringthe first iteration (Figure 9a), but immediatelydrops to a very low success rate in successiveiterations. By comparison, the broader search startswith a much lower success rate of approximately2.5:1 (Figure 9b), but maintains a steadier ratio wellinto later iterations, providing a more steady flow ofnew relevant sources. As explained in Section 5.3,this is likely due to the broader search touching on alarger source of historical data sources due todifferences in keyword selection.

5.5 Analysis

The results of the sourcing algorithm indicate thatthe search methods (normal and broad) operatealong different parameters. Normal searches rely

(a) Normal search

(b) Broad search

Figure 7: Cumulative relevance of data sources

more heavily on new data provided by sources suchas Twitter, while the broad searches derive keywordsprimarily from older data sources such as newsfeedsand articles. As a result, the training of keywordsets tend toward two different trends: modern andhistorical, but could also indicate a disconnect indiscussion between traditional media and socialmedia. Both of these search types are useful,depending on the desired application. Overall, bothsearch types provided a significant number ofrelevant data sources and ultimately achieved theirgoal - to optimise the collection of social media data.

There are a number of improvements that couldbe made to the sourcing algorithm. One would be toincrease the use of training to include potentialsources, preferring those new sources that hadmultiple existing links to discovered sources. Asecond involves a combination of both approaches,using historical keywords to search social mediasources and modern keywords to search historicaldata sources.

6 Conclusion

This paper presents a framework that supports thesourcing, collection, cleaning, integration andpresentation of social data for experiments andapplications. The output provided by the frameworkcan be used to drive generic sensor networking andother social analytic applications without requiringthe significant hardware infrastructure investmentused to store social data for later querying. Thisallows for the use of social media as a data sourcefor generic sensor networking applications, without

Iterations

1 2 3 4

politics abbott abbott abbottaustralia australia government tonygovernment government australia australiaaustralian news tony governmentuniversity australian news newsparty minister minister ministerminister party party partyabbott politics people peoplenews tony pm politicsmedia people politics australian

(a) Normal search

Iterations

1 2 3 4

politics australia australia abbottaustralia government abbott australiaaustralian politics government ruddgovernment party rudd laborparty australian party governmentnews abbott minister partymedia minister labor ministerworld labor australian newsminister rudd politics australianabbott pm news election

(b) Broad search

Figure 8: Top 10 keywords used for searches

requiring new hardware deployments. Theintegration of disparate social and static mediaplatforms also supports cross-cultural andcross-platform experimentation and analysis withoutrequiring extensive user knowledge or experience.

This work also evaluates the use of a novel newdata sourcing algorithm, designed to optimise thetask of collecting relevant data for use in theframework. Two different approaches to sourcing areevaluated, with relevant data sources identified. Anarrow search approach is found to discover relevantsocial media sources, while a broader search is moreadept at discovering historical data sources. As theframework is designed to operate over both new andhistorical data, both approaches are useful.

As future work, improvements to the sourcingalgorithm are recommended to extend itseffectiveness to new and historical data sourcessimultaneously, providing the most relevant datasources possible for use in data analysis andapplications.

References

Agichtein, E., Castillo, C., Donato, D., Gionis,A. & Mishne, G. (2008), Finding high-qualitycontent in social media, in ‘Proceedings of the 2008International Conference on Web Search and DataMining’, ACM, pp. 183–194.

Akyildiz, I. F., Su, W., Sankarasubramaniam, Y. &Cayirci, E. (2002), ‘Wireless sensor networks: asurvey’, Computer networks 38(4), 393–422.

Bird, S., Klein, E. & Loper, E. (2009), Naturallanguage processing with Python, O’Reilly Media,Inc.

(a) Normal search

(b) Broad search

Figure 9: Signal-to-Noise Ratio of data sources

Borsboom, B., Amstel, B. v. & Groeneveld, F. (2010),‘Please rob me’.URL: http: // pleaserobme. com/

Breslin, J., Bojars, U., Passant, A., Fernandez, S.& Decker, S. (2009), ‘Sioc: Content exchange andsemantic interoperability between social networks’.

Brickley, D. & Miller, L. (2000), ‘The friend of a friend(FOAF) project’.URL: http: // www. foaf-project. org/

Burke, J. A., Estrin, D., Hansen, M., Parker,A., Ramanathan, N., Reddy, S. & Srivastava,M. B. (2006), ‘Participatory sensing’, Center forEmbedded Network Sensing .

Cameron, M. A., Power, R., Robinson, B. & Yin,J. (2012), Emergency situation awareness fromtwitter for crisis management, in ‘Proceedings ofthe 21st international conference companion onWorld Wide Web’, ACM, pp. 695–698.

Campbell, A. T., Eisenman, S. B., Lane, N. D.,Miluzzo, E. & Peterson, R. A. (2006), People-centric urban sensing, in ‘Proceedings of the2Nd Annual International Workshop on WirelessInternet’, WICON ’06, ACM, New York, NY, USA.

Chen, M., Gonzalez, S., Vasilakos, A., Cao, H.& Leung, V. C. (2011), ‘Body area networks:A survey’, Mobile Networks and Applications16(2), 171–193.

Ching, A., Murthy, R., Molkov, D., Vadali, R. &Yang, P. (2012), ‘Under the hood: SchedulingMapReduce jobs more efficiently with corona’.

URL: https: // www. facebook. com/notes/ facebook-engineering/ under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/ 10151142560538920

DataSift (2010), ‘DataSift’.URL: http: // datasift. com/

Doan, A., Domingos, P. & Halevy, A. Y. (2001),Reconciling schemas of disparate data sources:A machine-learning approach, in ‘ACM SigmodRecord’, Vol. 30, ACM, pp. 509–520.

Edison (2012), ‘The social habit’.URL: http: // socialhabit. com/ secure/ wp-content/ uploads/ 2012/ 07/ the-soCial-Habit-2012-by-edIson-research. pdf

FormulaPR (2011), ‘Social networking in summary’.URL: \url{ http: // www. formulapr. com/fuse/ june2011/ bitech. pdf}

Ganti, R. K., Ye, F. & Lei, H. (2011), ‘Mobilecrowdsensing: Current state and future challenges’,Communications Magazine, IEEE 49(11), 32–39.

Gnip (2008), ‘Gnip’.URL: http: // gnip. com/

Guo, B., Yu, Z., Zhang, D. & Zhou, X. (2014), ‘Fromparticipatory sensing to mobile crowd sensing’,arXiv preprint arXiv:1401.3090 .

Hachem, S., Pathak, A. & Issarny, V. (2013),‘Service-oriented middleware for large-scale mobileparticipatory sensing’, Pervasive and MobileComputing .

Hardt, D. (2012), ‘The OAuth 2.0 authorizationframework’.

Hughes, D., Crowley, C., Daniels, W., Bachiller, R.& Joosen, W. (2014), User-rank: generic queryoptimization for participatory social applications,in ‘System Sciences (HICSS), 2014 47th HawaiiInternational Conference on’, IEEE, pp. 1874–1883.

Hughes, D., Greenwood, P., Blair, G., Coulson,G., Grace, P., Pappenberger, F., Smith, P.& Beven, K. (2008), ‘An experiment withreflective middleware to support grid-based floodmonitoring’, Concurrency and Computation:Practice and Experience 20(11), 1303–1316.

Hughes, D., Thoelen, K., Horre, W., Matthys, N.,Cid, J. D., Michiels, S., Huygens, C. & Joosen,W. (2009), LooCI: a loosely-coupled componentinfrastructure for networked embedded systems, in‘Proceedings of the 7th International Conference onAdvances in Mobile Computing and Multimedia’,ACM, pp. 195–203.

Kanhere, S. S. (2011), Participatory sensing:Crowdsourcing data from mobile smartphonesin urban spaces, in ‘Mobile Data Management(MDM), 2011 12th IEEE International Conferenceon’, Vol. 2, IEEE, pp. 3–6.

Khoury, R., Dawborn, T., Gafurov, B., Pink, G.,Tse, E., Tse, Q., Almi’Ani, K., Gaber, M., Rohm,U. & Scholz, B. (2010), Corona: energy-efficientmulti-query processing in wireless sensor networks,in ‘Database Systems for Advanced Applications’,Springer, pp. 416–419.

Lane, N. D., Eisenman, S. B., Musolesi, M.,Miluzzo, E. & Campbell, A. T. (2008), Urbansensing systems: opportunistic or participatory?,in ‘Proceedings of the 9th workshop on Mobilecomputing systems and applications’, ACM,pp. 11–16.

Li, J. & Cardie, C. (2013), ‘Early stageinfluenza detection from twitter’, arXiv preprintarXiv:1309.7340 .

Madden, S. R., Franklin, M. J., Hellerstein, J. M.& Hong, W. (2005), ‘TinyDB: an acquisitionalquery processing system for sensor networks’,ACM Transactions on database systems (TODS)30(1), 122–173.

Maynard, D., Bontcheva, K. & Rout, D. (2012),‘Challenges in developing opinion mining tools forsocial media’, Proceedings of NLP .

Min, H., Scheuermann, P. & Heo, J. (2013), ‘Ahybrid approach for improving the data qualityof mobile phone sensing’, International Journal ofDistributed Sensor Networks 2013.

Okada, H., Itoh, T., Suzuki, K. & Tsukamoto, K.(2009), Wireless sensor system for detection ofavian influenza outbreak farms at an early stage,in ‘Sensors, 2009 IEEE’, IEEE, pp. 1374–1377.

Perry, F. (2014), ‘Sneak peek: Google cloud dataflow,a cloud-native data processing service’.URL: http: // googlecloudplatform.blogspot. com/ 2014/ 06/ sneak-peek-google-cloud-dataflow-a-cloud-native-data-processing-service. html

Robinson, B., Power, R. & Cameron, M. (2013),A sensitive twitter earthquake detector, in‘Proceedings of the 22nd international conferenceon World Wide Web companion’, pp. 999–1002.

Sakaki, T., Okazaki, M. & Matsuo, Y. (2010),Earthquake shakes twitter users: real-time eventdetection by social sensors, in ‘Proceedings of the19th international conference on World wide web’,pp. 851–860.

Sato, K. (2012), ‘An inside look at google BigQuery,white paper’, Google Inc .

Singapore, W. A. S. (2014), ‘Social, digital & mobilein APAC’.URL: http: // www. slideshare. net/wearesocialsg/ social-digital-mobile-in-apac

Taylor, D., Lewin, J. & Strutton, D. (2011), ‘Friends,fans, and followers: Do ads work on socialnetworks?’, Business Faculty Publications .

Vasalou, A., Joinson, A. N. & Courvoisier, D.(2010), ‘Cultural differences, experience with socialnetworks and the nature of “true commitment”in facebook’, International Journal of Human-Computer Studies 68(10), 719–728.

Vassiliadis, P. (2009), ‘A survey ofextract–transform–load technology’, InternationalJournal of Data Warehousing and Mining(IJDWM) 5(3), 1–27.

Wandhofer, T., Taylor, S., Walland, P., Geana,R., Weichselbaum, R., Fernandez, M. & Sizov,S. (2012), ‘Determining citizens’ opinions aboutstories in the news media: analysing google,facebook and twitter’, eJournal of eDemocracy &Open Government (JeDEM) 4(2), 198–221.

towards social media as a data source for opportunistic ... · social media already exist, such as...

Documents