Web Archiving Meeting 2013: Blog Archiving (Ilias Trochidis - Tero Ltd)
DESCRIPTION
A software platform for blog archiving (BlogForever project).
TRANSCRIPT
Harvesting and archiving blog content for current and future generations
Ilias Trochidis
Tero LTD
Research projects
State of the blogosphere
Blogs have become fairly established as an online communication and web publishing tool. Hundreds of millions of blogs are published about every conceivable subject.
Examples 12/9/2013
• 70+ million sites in the world
• 369 million people viewing more than 11.8 billion pages each month
• 38 million new posts and 62.3 million new comments each month
• 136.5 million blogs
• 61 billion posts
• 83.7 million daily posts
Trend
http://www.tumblr.com/press
http://en.wordpress.com/stats/
Lost resources shared on social media
http://arxiv.org/abs/1209.3026
The disappearing web
The “Blogs of War: Weblogs as News” paper documented 29 blogs on the Iraq war. Of those 29 blogs:
• 13 (45%) no longer existed on the Internet as of June 2012
• Only 9 (31%) still contained information on the Iraq war
• 12 out of the 20 (60%) blogs that don’t exist were preserved by the Internet Archive (however, there are problems such as missing photos and comments that were not archived)
Blogs on major events have already been lost.
The average lifetime of a webpage is below 100 days.
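The percentages quoted for the Blogs of War sample can be checked with a few lines of arithmetic. This is only a sketch: the constant names are illustrative, and it assumes the "20" in the third bullet is the 29 documented blogs minus the 9 that still carry Iraq war content, which the slide does not state explicitly.

```python
# Counts as reported on the slide for the "Blogs of War" sample.
TOTAL_BLOGS = 29      # blogs documented in the paper
GONE = 13             # no longer on the Internet (June 2012)
STILL_ON_TOPIC = 9    # still contain information on the Iraq war
ARCHIVED = 12         # preserved by the Internet Archive

# Assumption (not stated on the slide): the "20" is 29 minus 9.
REMAINDER = TOTAL_BLOGS - STILL_ON_TOPIC


def share(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


gone_pct = share(GONE, TOTAL_BLOGS)
on_topic_pct = share(STILL_ON_TOPIC, TOTAL_BLOGS)
archived_pct = share(ARCHIVED, REMAINDER)
print(gone_pct, on_topic_pct, archived_pct)
```

Under that assumption the three rounded values match the 45%, 31% and 60% shown on the slide.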
Blog archiving: objectives and concerns
Aim: harvest, preserve, manage and reuse blogs and their resources
Issues: Frequency of change, structure and semantics of blogs, quantity and range of resources, database driven websites, ownership and DRM, +++
BlogForever: a blog archiving project co-funded by the European Commission (March 2011 – August 2013).
BlogForever Architecture
[Figure: BlogForever Conceptual Data Model, Version 0.6. Core entities: a Blog has Entries; an Entry is a Post or a Page and has Comments, Content and an Author. Model groups: Blog Context, Categorised Content, Community, Web Feed, External Widgets, Network and Linked Data, Semantics, Spam Detection, Ranking, Category and Similarity, Crawling Info, Standard and Ontology Mapping.]
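The core of the conceptual data model (a Blog has Entries; an Entry is a Post or a Page; an Entry has Comments, Content and an Author) could be sketched in Python roughly as follows. This is an illustrative simplification, not the BlogForever schema: the class layout, field names and the `posts()` helper are assumptions chosen for the example, and the many other groups in the diagram (feeds, widgets, spam detection, ranking and so on) are left out.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Author:
    username: str
    uri: Optional[str] = None


@dataclass
class Comment:
    author: Author
    text: str
    date: Optional[datetime] = None


@dataclass
class Entry:
    """Common base for posts and pages ("Entry is a Post / Page")."""
    title: str
    content: str
    author: Author
    date: Optional[datetime] = None
    comments: List[Comment] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


@dataclass
class Post(Entry):
    pass


@dataclass
class Page(Entry):
    pass


@dataclass
class Blog:
    uri: str
    title: str
    entries: List[Entry] = field(default_factory=list)

    def posts(self) -> List[Post]:
        """Return only the entries that are posts (hypothetical helper)."""
        return [e for e in self.entries if isinstance(e, Post)]


# Usage: build a tiny blog with one post (with a comment) and one page.
alice = Author(username="alice")
blog = Blog(uri="http://example.org/blog", title="Example blog")
post = Post(title="Hello", content="First post", author=alice)
post.comments.append(Comment(author=Author(username="bob"), text="Nice"))
blog.entries.append(post)
blog.entries.append(Page(title="About", content="About this blog", author=alice))
```

Modelling posts, comments and authors as separate records, rather than storing whole HTML pages, is what lets an archive query at the level of individual entities.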
Blog crawler
• Real-time monitoring
• HTML data extraction engine
• Spam and noise filtering
• Web services extraction engine
Unstructured information
Web services
Blog APIs
XML metadata
Blog digital repository
• Digital preservation
• Quality assurance
• Collections curation
• Public access APIs
• Personalised services
• Information retrieval
• Public web interface / Browse, search, export
Harvesting
Preserving
Managing and reusing
Web services
Web interface
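One of the crawler tasks listed above is turning blog output into structured records. As a minimal sketch, assuming a blog exposes a plain RSS 2.0 feed, the extraction step might look like the following. It uses only the Python standard library; the sample feed, the `extract_entries` function and the record keys are made up for illustration and are not taken from the BlogForever crawler.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import Any, Dict, List

# Hypothetical RSS 2.0 feed used only for this example.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.org/blog</link>
    <item>
      <title>First post</title>
      <link>http://example.org/blog/first-post</link>
      <pubDate>Mon, 09 Dec 2013 10:00:00 GMT</pubDate>
      <description>Hello, world.</description>
    </item>
    <item>
      <title>Undated post</title>
      <link>http://example.org/blog/undated</link>
      <description>No publication date given.</description>
    </item>
  </channel>
</rss>"""


def extract_entries(feed_xml: str) -> List[Dict[str, Any]]:
    """Parse an RSS 2.0 document and return one record per <item>."""
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if channel is None:
        return []
    blog_title = channel.findtext("title", default="")
    entries = []
    for item in channel.findall("item"):
        pub_date = item.findtext("pubDate")
        entries.append({
            "blog": blog_title,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # RSS dates use the RFC 822 format; missing dates stay None.
            "date": parsedate_to_datetime(pub_date) if pub_date else None,
            "content": item.findtext("description", default=""),
        })
    return entries


entries = extract_entries(SAMPLE_FEED)
for entry in entries:
    print(entry["title"], entry["date"])
```

A real crawler would also have to handle Atom feeds, namespaced extensions, malformed XML and pages that have no feed at all, which is where the HTML extraction engine comes in.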
BlogForever added value
BlogForever structures the archived blog content. BlogForever is not only about archiving HTML pages. It is about archiving information entities (posts, comments, authors, metadata, dates, pingbacks, etc.) based on the blog archiving data model.
BlogForever is based on an open source state-of-the-art digital library management system developed by CERN (Invenio).
Better management of stored information increases the utility of the archive (granularity of the collected information, better and faster search, etc.).
Added-value services, e.g. sentiment analysis and analytics.
BlogForever Impact
Output: a simple blog archiving solution that any user, user group or institution could use to preserve their collections of blogs, ensuring authenticity, integrity, completeness, usability and long-term accessibility.
Parties that will benefit: Bloggers, Universities, Libraries & Information Centres, Museums, Education, Research, Business
Examples:
• CERN is currently implementing a physics blogs repository.
• Aristotle University, Greece, has decided to create an academic blog repository.
• The Linguistics department of the University of Hannover wants to know how certain linguistic and textual phenomena / features have evolved diachronically within internet communication.
• L3S research centre, Hannover, will collaborate with Tero and AUTH to combine the BlogForever and ARCOMEM projects in order to deliver a new, innovative web archiving platform.
Blog archiving support and consultancy
Expected release in January 2014 by Tero
Cloud based blog archiving
Expected release in March/April 2014 by Tero
On demand analytics
Expected release in June 2014 by Tero
Future Work
Collaborate with archives and institutions around the world and spread the need for archiving the web. Share technologies and best practices.
Support the sustainability of web archives (e.g. convincing public funders to support web archives – show what data can do).
We are already archiving the web.
Thank you!
http://twitter.com/itroch
Visit: http://www.tero.gr/en and http://blogforever.eu