web archiving meeting 2013 blog archiving (trochidis ilias - tero ltd)

Harvesting and archiving blog content for current and future generations

Ilias Trochidis

Tero LTD

Research projects

State of the blogosphere

Blogs have become fairly established as an online communication and web publishing tool. Hundreds of millions of blogs are published about every conceivable subject.

Examples (12/9/2013)

• 70+ million sites in the world
• 369 million people viewing more than 11.8 billion pages each month
• 38 million new posts and 62.3 million new comments each month

• 136.5 million blogs
• 61 billion posts
• 83.7 million daily posts

Trend

Sources: http://www.tumblr.com/press and http://en.wordpress.com/stats/

Lost resources shared on social media

http://arxiv.org/abs/1209.3026

The disappearing web

The paper “Blogs of War: Weblogs as News” documented 29 blogs on the Iraq war. Of those 29 blogs:

• 13 (45%) no longer existed on the Internet as of June 2012
• Only 9 (31%) still contained information on the Iraq war
• 12 out of the 20 (60%) blogs that don’t exist were preserved by the Internet Archive (although with problems such as missing photos and comments that were not archived)

blogs on major events have already been lost

the average lifetime of a webpage is below 100 days
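The percentages quoted for the 29 Iraq war blogs can be reproduced with a few lines of arithmetic. The sketch below assumes, based only on the figures themselves, that the first two shares are taken over all 29 blogs and the third over 20 blogs; the function and variable names are illustrative.

```python
# Figures quoted on the slide. The denominators are inferred from the
# percentages and are an assumption, not something the source states.
TOTAL_BLOGS = 29


def share(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


gone = share(13, TOTAL_BLOGS)           # blogs no longer online in June 2012
still_on_topic = share(9, TOTAL_BLOGS)  # blogs still covering the Iraq war
archived = share(12, 20)                # blogs preserved by the Internet Archive

print(gone, still_on_topic, archived)   # prints: 45 31 60
```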

Blog archiving: objectives and concerns

Aim: harvest, preserve, manage and reuse blogs and their resources

Issues: frequency of change, structure and semantics of blogs, quantity and range of resources, database-driven websites, ownership and DRM, and more

BlogForever: a blog archiving project co-funded by the European Commission (March 2011 – August 2013).

BlogForever Architecture

[Figure: BlogForever Conceptual Data Model, Version 0.6]

Core entities in the figure: Blog, Entry, Post, Page, Comment, Content, Author. A Blog has Entries, and an Entry is a Post or a Page; further "has" relations connect to Comment, Content and Author.

Model areas in the figure: Categorised Content, Community, Web Feed, External Widgets, Network and Linked Data, Blog Context, Semantics, Spam Detection, Ranking, Category and Similarity, Crawling Info, Standard and Ontology Mapping.

Content types in the figure: Multimedia (Image, Audio, Video, Document), Text, Link, Tag.
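The core of the conceptual data model (Blog, Entry, Post, Page, Comment, Content, Author) can be sketched as plain Python data classes. This is a minimal illustration, not the BlogForever schema: the entity names come from the figure, but which attributes belong to which entity, and how Comment, Content and Author attach to an Entry, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Author:
    username: str
    uri: str = ""


@dataclass
class Content:
    text: str
    language: str = ""
    tags: List[str] = field(default_factory=list)


@dataclass
class Comment:
    author: Author
    content: Content
    date: str = ""


@dataclass
class Entry:
    kind: str  # "post" or "page": an Entry is a Post or a Page
    author: Author
    content: Content
    date: str = ""
    comments: List[Comment] = field(default_factory=list)  # assumed placement


@dataclass
class Blog:
    uri: str
    entries: List[Entry] = field(default_factory=list)

    def posts(self) -> List[Entry]:
        """Return only the entries that are posts."""
        return [entry for entry in self.entries if entry.kind == "post"]


# Usage with made-up data
alice = Author(username="alice")
blog = Blog(uri="http://example.org/blog")
blog.entries.append(Entry("post", alice, Content("First post", tags=["intro"])))
blog.entries.append(Entry("page", alice, Content("About this blog")))
```

Keeping posts, comments and authors as separate objects is what lets an archive answer questions such as "all comments by this author", which a stored HTML page alone cannot.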

Blog crawler

• Real-time monitoring
• HTML data extraction engine
• Spam and noise filtering
• Web services extraction engine

Data flow labels in the diagram: unstructured information; web services and blog APIs; XML metadata
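As an illustration of the path from a blog's web feed to XML metadata, the sketch below parses an RSS 2.0 feed with Python's standard library and writes one metadata record per post. It is not the BlogForever crawler: the sample feed, the function names and the `records`/`record` element names are hypothetical, and the slides do not specify the actual metadata format.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed used only for this example.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/blog/first-post</link>
      <pubDate>Thu, 12 Sep 2013 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/blog/second-post</link>
      <pubDate>Thu, 12 Sep 2013 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def extract_posts(feed_xml: str) -> list:
    """Return one dict per <item> element found in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return posts


def to_xml_metadata(posts: list) -> str:
    """Serialise the extracted posts as a simple XML document."""
    records = ET.Element("records")  # hypothetical element names
    for post in posts:
        record = ET.SubElement(records, "record")
        for key, value in post.items():
            ET.SubElement(record, key).text = value
    return ET.tostring(records, encoding="unicode")


posts = extract_posts(SAMPLE_FEED)
metadata = to_xml_metadata(posts)
```

A real crawler would fetch the feed over HTTP on a schedule and also handle Atom feeds; both are left out to keep the example self-contained.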

Blog digital repository

• Digital preservation
• Quality assurance
• Collections curation
• Public access APIs
• Personalised services
• Information retrieval
• Public web interface: browse, search, export

Stages: harvesting; preserving; managing and reusing

Interfaces: web services, web interface

BlogForever added value

BlogForever structures the archived blog content. It is not only about archiving HTML pages; it is about archiving information entities (posts, comments, authors, metadata, dates, pingbacks, etc.) based on the blog archiving data model.
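The difference between storing a page and storing its information entities can be shown with a small extraction sketch. It uses Python's standard `html.parser` on a made-up page; the CSS class names (`post-title`, `post-author`, `comment`) and the `EntityExtractor` class are hypothetical, and this is not how the BlogForever extraction engine is implemented.

```python
from html.parser import HTMLParser

# Made-up blog page used only for this example.
SAMPLE_PAGE = """
<article>
  <h1 class="post-title">Why archive blogs?</h1>
  <span class="post-author">alice</span>
  <div class="comment">Great post!</div>
  <div class="comment">Thanks for sharing.</div>
</article>
"""


class EntityExtractor(HTMLParser):
    """Collect text from elements whose class marks a blog entity."""

    # Hypothetical class names; real blog platforms use many different ones.
    FIELDS = {"post-title": "title", "post-author": "author", "comment": "comments"}

    def __init__(self):
        super().__init__()
        self.entities = {"title": "", "author": "", "comments": []}
        self._current = None  # entity field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        self._current = self.FIELDS.get(css_class)

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        if self._current == "comments":
            self.entities["comments"].append(text)
        else:
            self.entities[self._current] = text

    def handle_endtag(self, tag):
        self._current = None


extractor = EntityExtractor()
extractor.feed(SAMPLE_PAGE)
entities = extractor.entities
```

The sketch only handles flat markup with no nested tags inside a marked element. Its point is that the result is a record with a title, an author and a list of comments, each of which can be indexed and searched on its own.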

BlogForever is based on an open source state-of-the-art digital library management system developed by CERN (Invenio).

Better management of stored information increases the utility of the archive (finer granularity of the collected information, better and faster search, etc.).

Added-value services, e.g. sentiment analysis and analytics.

BlogForever Impact

Output: a simple blog archiving solution that any user, user group or institution could use to preserve their collections of blogs, ensuring authenticity, integrity, completeness, usability and long-term accessibility.

Parties that will benefit: bloggers, universities, libraries and information centres, museums, education, research, business.

Examples:

• CERN is currently implementing a physics blogs repository.
• Aristotle University, Greece, has decided to create an academic blog repository.
• The Linguistics department of the University of Hannover wants to know how certain linguistic and textual phenomena and features have evolved diachronically within internet communication.
• L3S research centre, Hannover, will collaborate with Tero and AUTH to combine the BlogForever and ARCOMEM projects in order to deliver a new, innovative web archiving platform.

• Blog archiving support and consultancy: expected release in January 2014 by Tero
• Cloud-based blog archiving: expected release in March/April 2014 by Tero
• On-demand analytics: expected release in June 2014 by Tero

Future Work

Collaborate with archives and institutions around the world and raise awareness of the need for archiving the web. Share technologies and best practices.

Support the sustainability of web archives (e.g. convincing public funders to support web archives – show what data can do).

We are already archiving the web.

Thank you!

Any questions?

[email protected]

http://twitter.com/itroch

Visit: http://www.tero.gr/en and http://blogforever.eu
