The Ethics of Large-Scale Web Data Analysis (Webmetrics)

Mike Thelwall, Statistical Cybermetrics Research Group, University of Wolverhampton, UK

Rob Ackland, Australian Demographic and Social Research Institute, Australian National University

Virtual Knowledge Studio (VKS) Information Studies

Contents

1. What is webmetrics?
2. Context: Online access to personal information
3. Researchers’ use of personal information
4. Confidentiality and anonymity
5. Resource issues

What ethical considerations apply to collecting and analysing web data on a large scale from unaware web “publishers”?

1. What is webmetrics?

Large-scale analysis of web-based data
Collecting and quantitatively analysing online information
Objective is not to find information about individuals but to identify trends
Data gathered with VOSON, SocSciBot, Issue Crawler, LexiURL,…

Example

VOSON hyperlink network of political parties from 6 countries (Ackland and Gibson, 2006). Node size proportional to outdegree. 76 nodes.
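The phrase "node size proportional to outdegree" can be illustrated with a minimal sketch. The edge list, the node names and the `scale` factor below are hypothetical and are not taken from the Ackland and Gibson data; the sketch only shows how outdegree (the number of outgoing links per site) could be counted and turned into a drawing size.

```python
from collections import Counter

def outdegrees(edges):
    """Count outgoing hyperlinks per node from (source, target) pairs."""
    counts = Counter(source for source, _target in edges)
    # Nodes that only receive links still appear, with outdegree 0.
    for _source, target in edges:
        counts.setdefault(target, 0)
    return dict(counts)

def node_sizes(edges, scale=10):
    """Return a drawing size per node, proportional to its outdegree."""
    return {node: scale * degree for node, degree in outdegrees(edges).items()}

# Hypothetical toy network: party_a links to two sites, party_b to one.
toy_edges = [
    ("party_a", "party_b"),
    ("party_a", "party_c"),
    ("party_b", "party_c"),
]
print(node_sizes(toy_edges))
```

A strictly proportional size gives zero-outdegree nodes a size of zero, so a plotting tool would usually need a small minimum size added on top.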

Example: Links between EU universities (AltaVista link searches)

[Figure: normalised linking between countries, smallest countries removed; geopolitically connected. Nodes: Sweden, Finland, Norway, UK, Germany, Austria, Switzerland, Poland, Italy, Belgium, Spain, France, NL.]

Link associations between social network sites

Example: Blog searching

2. Context: Online access to personal information

Blogs, social network sites, personal web sites contain information that is:
    Private and protected (invisible to researchers)
    Intentionally public
    Publicly private¹ (intended for friends but allowed to be public)
    Unintentionally public (public but believed by owner to be private)

1. Lange (2007)

Accessing “public” information

Commercial search engines
Web crawlers
Internet Archive (includes deleted info)

Who is using Dataveillance?

Dataveillance¹: downloading or otherwise gathering data on internet users in order to influence their behaviour
Google – can use email, searching, blogging, social network activities to target advertising (& may report to US government)
Amazon – can use past activities to target adverts or improve web site

1. Zimmer (2008)

3. Researchers’ use of personal information

Key issue: for large-scale research, data from/about the unaware is used without their approval, and possibly for purposes that they might disagree with
Which ethical safeguards should be taken for this kind of research?

Issue 1: People vs. Documents

Traditionally, documents can be researched without approval, but people can’t
    Even harsh criticism is fair practice (e.g., book review/analysis)
Since web pages are documents, researching them without permission is normally OK

Issue 2: Invasion of privacy? Natural vs. normative

A situation is naturally private¹ if a reasonable person would expect privacy
A situation is normatively private¹ if a reasonable person would expect others to protect their privacy
Non-secure web pages/data are typically naturally private
    Accessing is not normally invading privacy, even if undesired by page owners and with negative consequences

1. Moor (2004)

4. Confidentiality and anonymity

When should anonymity be granted to research “subjects” (page owners)?
    When a possibly undesired label attached (e.g., hate group, terrorist)
    When undesired groups might benefit? (e.g., league table of hate groups)
    When publicly private individuals singled out (e.g., detailed analysis of “average” blogger)

Should data be anonymised – as for Census data used for research?
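One possible way to approach this question in practice is to replace page or owner identifiers with keyed pseudonyms before analysis. The sketch below is an illustration only: the function name `pseudonymise`, the example key and the example URLs are hypothetical, and keyed hashing is pseudonymisation rather than full anonymisation, since page text or link patterns may still identify an owner.

```python
import hashlib
import hmac

def pseudonymise(identifier, secret_key, length=12):
    """Map an identifier (e.g. a blog URL) to a stable, keyed pseudonym.

    The same identifier and key always give the same pseudonym, so
    records can still be linked, but the original value is not stored.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]

# Hypothetical key; a real project would keep this secret and out of the data set.
EXAMPLE_KEY = b"replace-with-a-project-secret"

for url in ["http://example.org/blog/alice", "http://example.org/blog/bob"]:
    print(pseudonymise(url, EXAMPLE_KEY))
```

Using a keyed hash rather than a plain hash means that someone without the key cannot confirm a guess by hashing a candidate URL themselves.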

5. Resource issues

Accessing a web page uses the owner’s server time/bandwidth
Crawling a web site can use a lot of the owner’s server time/bandwidth
    May incur charges or loss of service quality

Robots.txt protocol

This file lists pages/folders in a web site that may not be crawled
It does not restrict crawling speed
It should be obeyed in research
Most individual users are probably unaware of this and so don’t use its protection
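Checking a URL against robots.txt before requesting it might look like the sketch below, which uses Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name `research-crawler` and the example.org URLs are hypothetical; a real crawler would first download the file from the root of the site being crawled.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def may_crawl(parser, user_agent, url):
    """Return True if robots.txt allows this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(EXAMPLE_ROBOTS_TXT)
print(may_crawl(parser, "research-crawler", "http://example.org/index.html"))
print(may_crawl(parser, "research-crawler", "http://example.org/private/diary.html"))
```

A crawler would call `may_crawl` for every candidate URL and skip any page for which it returns False.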

Crawling speed

Web crawlers should not run so fast that they cause service issues
Full speed is probably OK on a UK university web site but not on a Burkina Faso library web site
Use judgement to decide how quickly to crawl, i.e. how long to pause between requests
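A pause between requests could be enforced with a small helper like the one sketched below. The class name `PoliteThrottle` and the pause length are hypothetical; the right pause is a judgement call that depends on the site being crawled, and the short value in the demonstration is chosen only so that the example runs quickly.

```python
import time

class PoliteThrottle:
    """Enforce a minimum pause between successive requests to one site."""

    def __init__(self, min_pause_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_pause_seconds = min_pause_seconds
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last_request = None

    def wait(self):
        """Sleep if the previous request was too recent; return seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            remaining = self.min_pause_seconds - elapsed
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request = self._clock()
        return slept

# Demonstration with a very short pause; the page names are hypothetical.
throttle = PoliteThrottle(min_pause_seconds=0.05)
for page in ["/index.html", "/about.html", "/contact.html"]:
    throttle.wait()
    print("would fetch", page)
```

Calling `wait()` immediately before each request means that time spent processing a page counts towards the pause, so a slow crawler is not delayed further.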

How many pages to crawl?

Crawling too many pages puts unnecessary strain on the server crawled
Use judgement to decide the minimum number of pages/crawl depth that is enough
Use search engine queries as a substitute, if possible
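Limits on page count and crawl depth can be built into the crawl loop itself. The sketch below is a breadth-first crawl over a hypothetical in-memory site (`TOY_SITE`); `bounded_crawl`, `max_pages` and `max_depth` are illustrative names, and in a real crawler `get_links` would download a page and extract its links.

```python
from collections import deque

def bounded_crawl(start, get_links, max_pages, max_depth):
    """Breadth-first crawl that stops at max_pages pages or max_depth link hops."""
    visited = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Hypothetical site structure: page -> pages it links to.
TOY_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/b1": ["/deep"],
}

def toy_links(url):
    return TOY_SITE.get(url, [])

print(bounded_crawl("/", toy_links, max_pages=4, max_depth=2))
```

Breadth-first order visits pages closest to the start page first, so a low page cap still yields the top levels of the site rather than one deep branch.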

Automatic search engine searches

Research can piggyback off the crawling of commercial search engines
No resource implications for site owners
Uses search engine Application Programming Interfaces (APIs)
Search engines specify the maximum number of searches per day
Results limited to the imperfect web crawling/coverage of search engine crawlers
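Staying within a search engine's daily search limit can be handled with a simple counter, sketched below. The class name `DailySearchQuota`, the limit of 3 and the example queries are hypothetical; the actual limit is whatever the search engine's API terms specify, and no real API is called here.

```python
import datetime

class DailySearchQuota:
    """Count automated searches and refuse any beyond a per-day maximum."""

    def __init__(self, max_per_day, today=datetime.date.today):
        self.max_per_day = max_per_day
        self._today = today      # injectable for testing
        self._day = today()
        self._used = 0

    def try_acquire(self):
        """Return True and count one search if quota remains, else False."""
        current_day = self._today()
        if current_day != self._day:
            # A new day has started, so the count resets.
            self._day = current_day
            self._used = 0
        if self._used >= self.max_per_day:
            return False
        self._used += 1
        return True

# Hypothetical limit and queries, for illustration only.
quota = DailySearchQuota(max_per_day=3)
for query in ["query 1", "query 2", "query 3", "query 4"]:
    if quota.try_acquire():
        print("would submit", query)
    else:
        print("daily limit reached, deferring", query)
```

Deferred queries can be kept in a list and resubmitted on the following day.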

Summary

Researchers need to be aware of potential issues when doing large-scale data analysis research
Judgement is called for in all issues
Research does not normally need participant permission
Be sensitive to impact of findings and any need for anonymity

References

Lange, P. G. (2007). Publicly private and privately public: Social networking on YouTube. Journal of Computer-Mediated Communication, 13(1). Retrieved May 8, 2008 from: http://jcmc.indiana.edu/vol2013/issue2001/lange.html

Zimmer, M. (2008). The gaze of the perfect search engine: Google as an infrastructure of dataveillance. In A. Spink & M. Zimmer (Eds.), Web search: Multidisciplinary perspectives (pp. 77-99). Berlin: Springer.

Moor, J. H. (2004). Towards a theory of privacy for the information age. In R. A. Spinello & H. T. Tavani (Eds.), Readings in CyberEthics (2nd ed., pp. 407-417). Sudbury, MA: Jones and Bartlett.