Webometrics 1.0 from AltaVista to Small Worlds and Genre Drift

Webometrics 1.0 from AltaVista to Small Worlds and Genre Drift. Lennart Björneborn, Royal School of Library and Information Science. NORSLIS PhD course in informetrics, Umeå, 18.6.2008

DESCRIPTION

NORSLIS PhD course in informetrics, Umeå University, Sweden 18.6.2008

TRANSCRIPT

Page 1: Webometrics 1.0 from AltaVista to Small Worlds and Genre Drift

Webometrics 1.0 from AltaVista to Small Worlds and Genre Drift

Lennart Björneborn

Royal School of Library and Information Science

NORSLIS PhD course in informetrics Umeå 18.6.2008

Page 2

outline

webometrics 1.0
- birth of webometrics
- early webometric research

two webometric studies
- small-world link analysis, based on graph theory and social network analysis
- genre connectivity analysis

[Image: M.C. Escher, House of Stairs, 1951]

Page 3

WWW = largest network with available connectivity data

(Wood et al. 1995)

Page 4

WWW = collaborative weaving
= macro-level aggregations of micro-level interactions
= reflect social, cultural formations

(Wood et al. 1995)

Page 5

WWW = keep track of ”the complex web of relationships between people, programs, machines and ideas” (Tim Berners-Lee, 1997)

(Wood et al. 1995)

Page 6

birth of webometrics

citation analogy
- link = implicit recommendation of a web page
- though links may also be negative references

‘Webometrics’ 1997 + ‘Web Impact Factor’ 1998
- Almind & Ingwersen (1997). Informetric analyses on the World Wide Web: methodological approaches to ‘webometrics’.
- Ingwersen (1998). The calculation of Web impact factors.

Google ‘PageRank’ 1998
- exploits link structures: who receives many links from someone who also receives many links from someone who also ... ?
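The recursive idea behind PageRank (a page matters if it is linked from pages that matter) can be illustrated with a simplified power iteration. This is only a sketch on an invented four-page toy graph, not Google's production algorithm; the function name, the toy links and the even redistribution of rank from pages without outlinks are assumptions made for illustration. The damping factor 0.85 is the value commonly quoted from the original PageRank paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict {page: [targets]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without outlinks: spread its rank evenly (an assumption).
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web: A and B link to C, C links to D.
toy_links = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(toy_links)
```

In this toy graph D ends up with the highest score although it has only one inlink, because that inlink comes from C, which itself receives two links.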

Page 7

birth of webometrics: access to link data*

linkdomain:norslis.net -site:norslis.net

link:www.norslis.net -site:norslis.net

(* cf. breakthrough of bibliometrics: access to citation data)

Page 8

linkdomain:norslis.net -site:norslis.net

Page 9

basic link terminology

[Figure: link diagram with nodes A to H]

- B has an inlink from A: ~ citation
- B has an outlink to C: ~ reference
- B has a selflink: ~ self-citation
- C and D have co-inlinks from B: ~ co-citation
- B and E have co-outlinks to D: ~ bibliographic coupling
- co-links: co-inlinks and co-outlinks

(Björneborn 2004)
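The terminology can be checked mechanically on the links that the slide text names explicitly (A to B, B to itself, B to C, B to D, E to D). The sketch below uses only those five links; the remaining links of the diagram are not recoverable from the transcript, and the variable names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

# Links stated in the slide text; the rest of the diagram is not recoverable.
links = [("A", "B"), ("B", "B"), ("B", "C"), ("B", "D"), ("E", "D")]

outlinks = defaultdict(set)   # page -> pages it links to
inlinks = defaultdict(set)    # page -> pages linking to it
selflinks = set()
for source, target in links:
    if source == target:
        selflinks.add(source)
    else:
        outlinks[source].add(target)
        inlinks[target].add(source)

# Co-inlinks: two pages linked from the same source (cf. co-citation).
co_inlinked = {
    frozenset(pair)
    for targets in outlinks.values()
    for pair in combinations(sorted(targets), 2)
}

# Co-outlinks: two pages linking to the same target (cf. bibliographic coupling).
co_outlinking = {
    frozenset(pair)
    for sources in inlinks.values()
    for pair in combinations(sorted(sources), 2)
}
```

With these links, `co_inlinked` contains the pair C and D, and `co_outlinking` contains the pair B and E, matching the slide.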

Page 10

some proposed web metrics

Netometrics (Bossy, 1995)

supplement bibliometrics and scientometrics in observing “science in action” on the Internet

Webometry (Abraham, 1996)

Internetometrics (Almind & Ingwersen, 1996)

Webometrics (Almind & Ingwersen, 1997)

Cybermetrics (journal started 1997 by Isidro Aguillo)

Web bibliometry (Chakrabarti et al., 2002)

Page 11

some related web science

Web Mining (e.g., Etzioni, 1996; Kosala & Blockeel, 2000)

Web Ecology (e.g., Pitkow, 1997; Chi et al., 1998; Huberman, 2001)

Cyber Geography (e.g., Girardin, 1995)

Cyber Cartography (e.g., Dodge, 1999)

Web Graph Analysis (e.g., Kleinberg et al., 1999; Broder et al., 2000)

Web Dynamics (e.g., Levene & Poulovassilis, 2001)

Webology (journal started 2004 by Alireza Noruzi)

Web Science (Berners-Lee et al., 2006)

Page 12

webometrics

the study of quantitative aspects of the construction and use of information resources, structures and technologies on the Web, drawing on bibliometric and informetric approaches

[Figure: relationships between informetrics, bibliometrics, scientometrics, cybermetrics and webometrics]

(Björneborn 2004)

Page 13

webometrics

four main research areas of webometric concern:
- web page content analysis;
- web link structure analysis;
- web usage analysis (e.g., log files);
- web technology analysis (e.g., search engine performance)

[Figure: relationships between informetrics, bibliometrics, scientometrics, cybermetrics and webometrics]

(Björneborn 2004)

Page 14

web data collection

non-standardized, messy data
- due to diversified, distributed, dynamic web
- lack of metadata

primary data
- own web crawler (beware: robot exclusion)
- direct access to web servers incl. log files
- Internet Archive (www.archive.org)
- manual collection with browser

secondary data
- search engines (beware: deficiencies)

necessary data cleansing
- mirror sites, variant names, typo domains + links
- many file formats, including misspellings
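One small part of such data cleansing, mapping variant names of the same page to one key, might look like the sketch below. The rules (dropping a leading `www.`, treating a few default file names as the directory itself, ignoring scheme, query and fragment) and the list of default page names are assumptions for illustration; a real study would need rules fitted to its own data.

```python
from urllib.parse import urlsplit

# Assumed list of file names that servers commonly serve for a directory.
DEFAULT_PAGES = {"index.html", "index.htm", "default.htm"}


def canonical_url(url):
    """Map common URL variants of one page to a single key (illustrative rules)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]                 # treat www.example.org and example.org alike
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_PAGES:
        path = head + "/"               # /dir/index.html -> /dir/
    # Scheme, query string and fragment are deliberately ignored here.
    return host + path
```

For example, `canonical_url("http://WWW.Norslis.net/index.html")` and `canonical_url("http://norslis.net")` both give `norslis.net/`.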

Page 15

examples of webometric analysis

power-law distributions
- e.g. pages, outlinks, inlinks, visits per web site (Adamic & Huberman 2001)

correlation between research indicators and inlinks
- e.g. UK, Taiwan, Australia (several studies by Thelwall et al.)
- EU projects EICSTES + WISER

co-inlink cluster analysis analogous to co-citation analysis
- e.g. EU universities (Polanco et al. 2001)
- e.g. Chinese IT companies (Vaughan & You 2005)

longitudinal studies
- web page change and permanence (e.g. Koehler 2004)

Page 17

Björneborn (2004). Small-world link structures across an academic Web space: A library and information science approach. PhD Thesis. www.db.dk/LB

small-world link analysis based on graph theory and social network analysis

Page 18

graph theory: Leonhard Euler (1707-1783), Königsberg

(Wilson & Watkins 1990)

Page 19

graph theory

graph = mathematical modeling of network
- directed graph: e.g. WWW
- nodes (or vertices): A, B, C, D, E
- edges (if directed: arcs, links): AC, EB, ...
- degree: d(A) = 3
  - outdegree: d_O(A) = 2; indegree: d_I(A) = 1
- directed walk: ACB: path length = 2
- geodesic distance: length of the shortest path between 2 nodes

centrality
- global centrality: least sum of geodesic distances
- betweenness centrality: most shortest paths pass through the node

[Figure: directed graph with nodes A, B, C, D, E]

Gross & Yellen (1999). Graph theory and its applications.
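The degree and distance notions can be tried out on a small directed graph. The slide names only the edges AC and EB and the walk ACB; the edges A to D and B to A in the sketch below are assumptions added so that A has outdegree 2 and indegree 1 as on the slide. The shortest-path distance is computed by breadth-first search.

```python
from collections import deque

edges = [("A", "C"), ("C", "B"), ("E", "B"),   # named on the slide
         ("A", "D"), ("B", "A")]               # assumed, to give d(A) = 3
nodes = sorted({n for edge in edges for n in edge})
successors = {n: [] for n in nodes}
for source, target in edges:
    successors[source].append(target)


def outdegree(node):
    """Number of links leaving the node."""
    return len(successors[node])


def indegree(node):
    """Number of links arriving at the node."""
    return sum(node in targets for targets in successors.values())


def distances_from(start):
    """Breadth-first search: shortest directed path length to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in successors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


print(outdegree("A"), indegree("A"))   # 2 1
print(distances_from("A")["B"])        # 2, via the walk A -> C -> B
```

Because the graph is directed, E cannot be reached from A here, so it is missing from `distances_from("A")`.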

Page 20

graph theory applications

graph theory used for mathematical modeling of networks
- e.g., biology, chemistry, physics, sociology, psychology, technology

also applied in information sciences incl. bibliometrics
- citation networks (e.g., Garner, 1967; Doreian & Fararo, 1985; Hummon & Doreian, 1989; Shepherd, Watters & Cai, 1990; Egghe & Rousseau, 1990; Fang & Rousseau, 2001; Egghe & Rousseau, 2002; 2003a; 2003b)
- information systems (e.g., Korfhage, Bhat & Nance, 1972)
- hypertextual networks (e.g., Botafogo & Shneiderman, 1991; Smeaton, 1995; Furner, Ellis & Willett, 1996)

Page 21

social network analysis

relations between actors in social network

sociometry - 1930s (Moreno) - sociograms

social networks - 1950s - social network analysis

makes use of mathematical graph theory

Wasserman & Faust (1994). Social network analysis : methods and applications. Cambridge University Press.

Otte & Rousseau (2002). Social network analysis: a powerful strategy, also for the information sciences. Journal of Information Science, 28(6): 441-454

Page 22

small-world networks

small-world = highly clustered + short paths
- short distances through shortcuts between clusters in network
- small-world = short local + short global distances
- efficient diffusion of signals, contacts, ideas, viruses, etc. in networks

social network analysis in 1960s: ‘six degrees of separation’

today: ‘small worlds’ in biological, chemical, technical, social networks
- brains, epidemics, scientific collaboration, semantic networks etc.

(Watts & Strogatz 1998)
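The two ingredients, high clustering and short paths, can be measured directly. The sketch below builds a ring lattice (each node tied to its nearest neighbours, so highly clustered but with long paths) and then adds two arbitrary long-range shortcuts. The sizes, the choice of shortcuts and the function names are illustrative assumptions; this is not the rewiring procedure of Watts & Strogatz, only a demonstration of the effect of shortcuts.

```python
from collections import deque
from itertools import combinations


def ring_lattice(n, k):
    """Undirected ring of n nodes, each tied to its k nearest neighbours per side."""
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            graph[i].add(j)
            graph[j].add(i)
    return graph


def average_clustering(graph):
    """Mean share of a node's neighbour pairs that are themselves connected."""
    total = 0.0
    for neighbours in graph.values():
        pairs = list(combinations(neighbours, 2))
        if pairs:
            total += sum(b in graph[a] for a, b in pairs) / len(pairs)
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over all node pairs (graph must be connected)."""
    total, count = 0, 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in graph[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count


lattice = ring_lattice(40, 2)
shortcut = ring_lattice(40, 2)
for a, b in [(0, 20), (10, 30)]:   # two arbitrary long-range shortcuts
    shortcut[a].add(b)
    shortcut[b].add(a)

print(average_clustering(lattice), average_path_length(lattice))
print(average_clustering(shortcut), average_path_length(shortcut))
```

The shortcuts lower the average path length while leaving clustering almost unchanged, which is the small-world combination described above.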

Page 23

small-world web
- most links connect similar topics: topical clusters
- cross-topic shortcuts

Page 24

small-world link analysis

main research question
- what types of web links, web pages and web sites function as cross-topic connectors in small-world link structures across an academic web space?

objective: identify micro-level aspects of how small-world phenomena emerge

Björneborn (2004). Small-world link structures across an academic Web space: A library and information science approach. PhD Thesis. www.db.dk/LB

Page 25

UK link data 2001

109 UK universities
- web crawler, Thelwall

7669 subsites
- www.hum.port.ac.uk, www.atm.ox.ac.uk, ...
- departments, centres, research groups, etc.

connections between 7669 subsites
- 207 865 links
- 105 817 web pages

Page 26

‘corona’ graph model: reachability structures

- SCC (Strongly Connected Component): 1893
- IN (traversable to SCC): 626
- OUT (reachable from SCC): 2660
- IN-Tendrils (connected from IN): 96
- OUT-Tendrils (connected to OUT): 55
- Tube (connecting IN to OUT): 7
- Disconnected: 2332

(Björneborn 2004)
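The main reachability components can be derived from a link list with two graph searches: everything reachable from a core node by following links forward, and everything that can reach it by following links backward. The sketch below assumes a node known to lie in the core is given; the function names and the toy link list are invented for illustration, and tendrils, tubes and disconnected nodes are lumped together as "other".

```python
def reachable(start, neighbours):
    """All nodes reachable from start by following the given neighbour map."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def core_in_out(edges, core_node):
    """Split nodes into SCC, IN, OUT and the rest, relative to core_node's SCC."""
    forward, backward, nodes = {}, {}, set()
    for source, target in edges:
        forward.setdefault(source, set()).add(target)
        backward.setdefault(target, set()).add(source)
        nodes.update((source, target))
    downstream = reachable(core_node, forward)    # core_node can reach these
    upstream = reachable(core_node, backward)     # these can reach core_node
    scc = downstream & upstream
    return {
        "SCC": scc,
        "IN": upstream - scc,
        "OUT": downstream - scc,
        "other": nodes - upstream - downstream,   # tendrils, tubes, disconnected
    }


# Hypothetical toy link list: s1 -> s2 -> s3 -> s1 forms the core.
toy_edges = [("s1", "s2"), ("s2", "s3"), ("s3", "s1"),
             ("in1", "s1"), ("s3", "out1"), ("in1", "t1"), ("x1", "x2")]
parts = core_in_out(toy_edges, "s1")
```

In the toy data `t1` is linked only from `in1`, so it behaves like an IN-tendril and ends up under "other" together with the disconnected pair `x1`, `x2`.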

Page 27

10 seed nodes (stratified sampling in SCC component)

- hum.port.ac.uk: Faculty of Humanities and Social Sciences, Portsmouth
- atm.ox.ac.uk: Atmospheric, Oceanic and Planetary Physics, Oxford
- economics.soton.ac.uk: Economics Dept, Southampton
- chem.gla.ac.uk: Chemistry Dept, Glasgow
- psy.man.ac.uk: Psychology Dept, Manchester
- maths.gcal.ac.uk: Mathematics Dept, Glasgow Caledonian
- speech.essex.ac.uk: Speech Research Group, Linguistics Dept, Essex
- palaeo.gly.bris.ac.uk: Palaeontology Research Group, Earth Sciences Dept, Bristol
- geog.plym.ac.uk: Geography Dept, Plymouth
- eye.ox.ac.uk: Ophthalmology Dept [eye research], Oxford

10 path nets with all shortest link paths between five pairs of topically dissimilar subsites

Page 28

.ac.uk

.uk

cfd.me.umist.ac.uk

ercoftac.mech.surrey.ac.uk

cajun.cs.nott.ac.uk

ukoln.bath.ac.uk

cs.man.ac.uk

ashmol.ox.ac.uk

collections.ucl.ac.uk

vlmp.museophile.sbu.ac.uk

shortest link path

Page 29

path net = ‘mini’ small world

- path net = all shortest link paths between two given nodes (subsites)
- transversal link
- network analysis tool = Pajek
- adjacency matrix

(Björneborn 2006)
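A path net in this sense, all shortest link paths between two given nodes, can be extracted with a breadth-first search that remembers every predecessor lying on some shortest route. The study itself used Pajek; the sketch below is only an illustration of the definition, with an invented function name and toy graph.

```python
from collections import deque


def path_net(successors, source, target):
    """Return all shortest directed paths from source to target as node lists."""
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in successors.get(current, ()):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                parents[nxt] = [current]
                queue.append(nxt)
            elif dist[nxt] == dist[current] + 1:
                parents[nxt].append(current)   # another equally short route
    if target not in dist:
        return []                              # target not reachable

    def unwind(node):
        # Walk back from node to source along all recorded predecessors.
        if node == source:
            return [[source]]
        return [path + [node] for parent in parents[node] for path in unwind(parent)]

    return unwind(target)


# Hypothetical toy graph with two equally short routes from s to t.
toy = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
print(path_net(toy, "s", "t"))   # [['s', 'a', 't'], ['s', 'b', 't']]
```

The union of the nodes and links on the returned paths is the path net for that pair of nodes.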

Page 30

some indicative findings

findings not generalizable: small, stratified sample

however: indicative findings may suggest
- computer-science sites = academic cross-topic connectors
- personal link creators = web cohesion ‘glue’, especially link lists
- researchers, PhD students, etc. are important providers of site outlinks and important receivers of site inlinks
- over 80% of cross-topic links are academic (research, teaching)

Page 31

small-world web implications

small local threads in the shape of users’ links affect how the global web is cohesive and may be traversed
- like ‘the strength of weak ties’ (Granovetter 1973)
- knowledge diffusion and social cohesion across social groups

counteract ‘balkanization’
- disconnected / unreachable subpopulations

reachability structures
- essential for web crawler harvests

Page 32

webometric study: genre connectivity

what role do web page genres play for cohesion and reachability on the Web? [one of the first studies]

what types of web page genres function as link providers and link receivers between university web sites?

Page 33

genre connectivity analysis

source pages and target pages in 10 path nets
- 281 source pages
- 352 links
- 249 target pages

Page 34

meta genres

Page 35

genre pairs

Page 36

web of genres

genre network graph extracted with Pajek software © Björneborn

Page 37

genre connectivity

academic web spaces = rich diversity of interlinked genres = diversified link motivations

personal link creators are important web cohesion builders
- personal link lists provide site outlinks
- personal homepages receive site inlinks

genre connectivity affects web cohesion and reachability by genre drift and topic drift
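Once each link has been labelled with the genre of its source page and of its target page, genre pairs and the roles of genres as link providers and link receivers can be tallied by simple counting. The labelled links below are hypothetical and are not data from the study; "institutional homepage" in particular is an invented genre label used only to vary the example.

```python
from collections import Counter

# Hypothetical (source genre, target genre) labels for a handful of links.
link_genres = [
    ("personal link list", "personal homepage"),
    ("personal link list", "institutional homepage"),
    ("personal homepage", "personal homepage"),
    ("personal link list", "personal homepage"),
]

genre_pairs = Counter(link_genres)                               # how often each pair occurs
providers = Counter(source for source, _ in link_genres)         # genres giving outlinks
receivers = Counter(target for _, target in link_genres)         # genres receiving inlinks

print(genre_pairs.most_common(1))
print(providers.most_common(1), receivers.most_common(1))
```

The pair counts can also serve as link weights for a genre network graph.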

Page 38

genre drift + topic drift

- topic clusters with genre diversity + genres with topical diversity
- changes in page genres and page topics along link paths
- genre drift within clusters + topic drift between clusters
- short link distances (small world)

Page 39

questions?

Page 40

read more:

Björneborn (2004). Small-world link structures across an academic web space : A library and information science approach. PhD dissertation. www.db.dk/LB

Björneborn (2006). ‘Mini small worlds’ of shortest link paths crossing domain boundaries in an academic Web space. Scientometrics, 68(3): 395-414.

Björneborn (forthcoming). Genre connectivity and genre drift in a web of genres. In: Mehler et al. Genres on the Web: Corpus Studies and Computational Models.