Using Metadata in Electronic Publishing
(CoMet internal report)
Juha Puustjärvi
Jukka Yli-Koivisto [email protected]
Espoo, 10.5.2001
Software Business and Engineering Institute
Helsinki University of Technology
Contents

1 Introduction
2 Related work
3 CoMet
   3.1 The problem
   3.2 CoMet solution issues
   3.3 Working environment
      3.3.1 External news providers
      3.3.2 Content creator
      3.3.3 CoMet System
      3.3.4 Customer
      3.3.5 Kernel
      3.3.6 CoMet application layer
   3.4 Services provided
4 Metadata structures and matching
   4.1 Hierarchical ontology
   4.2 Document profiles
   4.3 User profiles
   4.4 Matching
      4.4.1 SmartPush matching (asymmetric distance measure)
      4.4.2 LCH-matching
      4.4.3 Weighted LCH-matching
5 Conclusions and future work
References
1 Introduction
Various publishing houses have started to exploit new publishing media like the Internet and various terminals such as cellular phones and pocket PCs. In these environments a publishing house faces new challenges. More content can be distributed than in conventional newspapers, where space limits the quantity of information available. In an electronic publication, explanatory background material can be linked to news items. This is called information augmentation, and it provides larger perspectives to readers. To provide background material, several information sources must be utilized across various heterogeneous document management systems. This is done by using metadata that has been produced to describe the content of material created by different content creators. Information augmentation is therefore dependent on the services provided by the system that stores the metadata.
Selective dissemination of information (SDI) is a form of electronic publishing. Newspapers that are available online are a good example of SDI services. According to WebWombat [Onl01] there are at the moment 65 online newspapers in Finland. Entertainment and financial services are also available in various formats. These services rely on a pull- or push-type service model. Every service that distributes information by letting its customers browse the provided information manually is a pull-type service. In this model the user is responsible for requesting the desired information: he or she can point and click a link to the wanted information or request it in some other way. In a push-type service the content producer has knowledge of its customers and their needs [Çet00]. Because of this knowledge a publishing house can push the provided content to its customers. The pushed content can be browsed, but the customer cannot directly control it; content selection is made indirectly by changing the user profile. The user profile is the key to knowing the customer's needs.

Pull-type services can use user profiles just as push-type services do. A user profile is knowledge about a user and his or her needs and interests. The user profile is compared against the document metadata, and if a match is found the content can be shown to the customer. To provide this kind of service, effective means of metadata handling are needed. Different media types and heterogeneous content management systems pose difficult issues as well. Therefore a content-management-independent metadata handling model must be implemented to provide services for push- and pull-type services and information augmentation. News content brings some special features that are needed in content management and distribution of electronic publications. SDI concerns issues like the metadata format used for describing news items and a meaningful ontology hierarchy, which describes the metadata structures and their relations to each other.
Our solution is to build a content-management-independent metadata handling system that contains the information needed to build push- and pull-type services in electronic publishing. This is achieved by separating metadata from its original content; therefore a separate database is needed for the metadata. This database is handled using the facilities provided by conventional relational databases. The developed system benefits from traditional database services such as data abstraction, high-level access through query languages and controlled multi-user access. The system can also be distributed to geographically different locations and yet remain efficient. The core system serves multiple push- and pull-type applications at the same time through its service interface.
The remainder of the report is organized as follows. In chapter 2 we look at related work in electronic publishing and particularly in online newspapers. Chapter 3 introduces the basic problem area and environment of the CoMet System. In chapter 4 we look at the hierarchical metadata structures that are used in the CoMet System; the matching problem is examined there as well. At the end we present the conclusions and the need for further research.
2 Related work
SmartPush [Sav98, Kur99] is a personalized delivery system for economic news items, and it has similarities to our work. It was a research project at the Helsinki University of Technology from 1997 to 2000, and it is our predecessor project. SmartPush used a hierarchical ontology that is similar to our profile construction. However, the similarity calculation method used differs from our approach. SmartPush uses an asymmetric similarity measure [Sav99] that does not exclude documents from the result set; rather, it ranks the incoming document flow into a certain order in the result set. This is clearly a disadvantage with respect to our aims in information augmentation and data re-use. SmartPush uses agents in its implementation, and our approach differs in this respect as well. We will exploit a relational database and its calculation power for matching purposes, whereas SmartPush used a matching agent coded in Java. Our vision is that a relational database can offer a rather efficient way of handling a huge mass of document metadata compared to the calculation power that Java can offer.
Fishwrap [Che95] is a personalized newspaper that uses news material from several external news providers. It allows topic selection and layout customization of the personalized news page. The incoming news feed is matched against the user-specified topics. With this pre-categorization the system gains in performance compared to online matching of all topics. Fishwrap also has community-wide features. Page one contains several news items, and Fishwrap tracks the interest in news items by counting the number of readers each item has. A news item's position on page one changes according to the popularity it has gained. Fishwrap also has information augmentation features: it checks its photo and sound databases for pictures and sound recordings that match the news item.
User behavior had a significant role in user profile adaptation in Krakatoa Chronicle [Bra98] and its successor project Anatagonomy [Kam97]. The Java applet used enabled very intensive surveillance of customer behavior. The customer's behavior is tracked while he or she reads: activities like scrolling, maximizing, opening articles in new windows or saving them into a scrapbook are assumed to reflect positive interest towards the article content. Explicit feedback turned out to perform worse than this kind of implicit feedback. The storage method for user profiles was not explicitly described in the article, but the calculation methods used and the characterization of the user profiles let us assume a keyword-list representation. Krakatoa Chronicle looks like an ordinary newspaper because it has a multi-column layout and justified text. A news item's relevance was shown as a slider widget that the customer could re-adjust for feedback, and a community-related similarity measure was shown for every news item. Krakatoa Chronicle and Anatagonomy used a client-server architecture where the server handed the news items to a client, which was responsible for layout generation and feedback surveillance. This kind of architecture does not suit the CoMet System because it limits the customer terminals that can be used.
Telepublishing [Haa94] was implemented in HyperNeWS. This method enables a newspaper-like layout, as in Krakatoa Chronicle, but the chosen implementation excludes all terminals that do not have HyperNeWS available. A second layout type was also available; it was designed to support the observed ways in which customers use an electronic newspaper. A significant amount of work was targeted at developing ways to support the content creation process for electronic newspapers. One notable feature in Telepublishing is the background material that is offered for news items. This feature has similarities to our information augmentation feature in the CoMet System. However, Telepublishing seems to offer its background material in a limited amount compared to our System. Personalization features are also present, but the implementation of the matching problem is not clarified in the article that describes the Telepublishing system.
3 CoMet
CoMet stands for content management of a media company based on metadata. The aim of the project is to build a working prototype of the CoMet System and test it in a real content production environment. The System will provide various services that support content producers' daily activities. The CoMet System can be used as the main intelligence of personalized push and pull services that will be distributed to different terminals, including mobile phones and desktop computers.
3.1 The problem
Publishing houses can have a large number of different media products in their key business area, such as newspapers, television, Internet services, radio, and publishing activities. Content production involves a large number of different media types and customer terminals. In this kind of media-rich environment it is essential that content production be done somewhat differently for every media type. Content production facilities for newspaper production have different requirements than those for CD-ROM production. Therefore separate content production systems are needed to provide the required functionality for both. The drawbacks of heterogeneous document management systems are impaired content re-usability and impaired content dissemination across different media types.
Metadata is a critical component of effective information management [Mar98]. Therefore metadata must be added to the content produced in a publishing house. The produced content is then stored in the various content management systems used in the publishing house. Metadata contains descriptive information about a document and can therefore guide end users in their interaction with the document collection. Descriptive metadata also contains information about issues beyond the scope of the document's content itself, such as the author, media type, production date and version of a document.
A publishing house produces a large amount of content, such as news items and other documents. However, it may not be able to produce all the content it needs for its media distribution. News items can also be bought from external news producers like Reuters. Document metadata standards such as RDF [W3C00a] and Dublin Core [Dup99] are needed for efficient data interchange. For special kinds of content these standards may not provide the needed means for information interchange [Grö98, For00]. Dissatisfaction can be caused by an insufficient amount of available metadata or by unsuitable metadata structures. The NITF (News Industry Text Format) standard [Ipt00, XML01] has been introduced to provide the needed document and metadata structures for the news industry. If several different metadata formats are in use, document matching can cause problems. Because of that, metadata within the CoMet System will be homogeneous.
Content re-use and information augmentation are key issues in electronic publishing [Saa99]. The benefits of content re-use are obvious: if content has already been produced, it would be a waste of resources to produce it again. Content authoring is efficient only if content is produced only once. In information augmentation, explanatory background material is added to content. This can be seen in electronic newspapers as links to relevant background material. In this way the content of an electronic publication can be enriched to meet the customer's personalized needs. Therefore background material must be individually chosen for every logical customer. If a customer is using an electronic publication like an online newspaper, background material can be added from other publication systems to meet the individualized needs of the newspaper reader. This is a difficult task if heterogeneous document management systems with incompatible metadata structures are used.
3.2 CoMet solution issues
Our aim is to design data and metadata structures, and their distribution, for use in electronic publishing. By these means data re-use and information augmentation are made possible not only from one single production instance but from all document management systems that are involved in content production within a publishing company.

We must admit that a publishing house cannot abandon its current content production environments and simply start using a new one. Current systems may have gone through intensive customization to meet the desired production and business needs. Therefore a separate guide to the information must be provided. Metadata must be stored in a relational database along with the location of the document. The CoMet System provides transparent access to all documents within a production company, and all media types, like text, video and sound, can be distributed by our system. We have to emphasize that the CoMet System does not replace actual content production tools; it provides the means of finding content and distributing it to customers.
Our focus is to provide services for electronic publishing and especially for the news publishing industry. Our System will be a relational-database-centric solution for information handling. The CoMet System gains from the benefits of relational databases, such as data abstraction, high-level access through query languages and controlled multi-user access. Publishing houses can have geographically distributed content management systems. The CoMet System can be distributed as well, to meet performance demands and to hide the actual content production point.
In SDI systems, information is distributed according to user profiles that contain information about the users' interest areas. Therefore a method for matching documents and user profiles must be introduced [Fol92]. The vector space model is a popular way to calculate the similarity of a user profile and a document. It has been alleged that the recall and precision of the vector space model are superior to those of the Boolean (relational) model [Çet00]. Recall is the ratio of relevant documents retrieved to all relevant documents that exist in the source collection. Precision is the percentage of the returned documents that were desirable for the customer.
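In the standard formulation (spelled out here for reference; the report defines these measures only in prose):

    recall    = |retrieved ∩ relevant| / |relevant|
    precision = |retrieved ∩ relevant| / |retrieved|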
We suggest that the vector space model is not the only suitable matching method in the electronic publishing environment. This is due to the limited size of the source collection involved in electronic publications; especially in news distribution, a news item is relevant only for a short period of time. We argue that the Boolean (relational) model can be specified by SQL queries with relatively high precision and recall. To be able to do this, user profiles and document metadata must be presented in such a fashion that effective matching can be calculated by the relational database management system. This technique will provide an efficient way of calculating the similarity of user profiles and document profiles. To be able to compare the vector space model to our suggestion, both models will be implemented in the CoMet System.
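As a rough illustration of the relational approach, a minimal sketch follows. The table layout, the weight-product score and the threshold are assumptions made for this example; the report does not specify the actual CoMet schema or similarity formula.

```python
# Minimal sketch of relational matching; the table layout and scoring are
# assumptions for illustration, not the actual CoMet schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE profile_weights  (profile_id INT, concept TEXT, weight REAL);
CREATE TABLE document_weights (doc_id INT, concept TEXT, weight REAL);
""")
con.executemany("INSERT INTO profile_weights VALUES (?, ?, ?)",
                [(1, "F1", 0.2), (1, "Ice hockey", 0.2)])
con.executemany("INSERT INTO document_weights VALUES (?, ?, ?)",
                [(10, "F1", 0.8), (11, "Politics", 0.9)])

# Score each document against profile 1 by summing weight products over the
# shared concepts; the HAVING threshold excludes irrelevant documents.
rows = con.execute("""
    SELECT d.doc_id, SUM(p.weight * d.weight) AS score
    FROM profile_weights p
    JOIN document_weights d ON p.concept = d.concept
    WHERE p.profile_id = 1
    GROUP BY d.doc_id
    HAVING score > 0.1
    ORDER BY score DESC
""").fetchall()
print(rows)  # [(10, 0.16...)]: document 11 shares no concept and is excluded
```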
3.3 Working environment
An electronic newspaper is a combination of news items, documents, pictures and other media objects produced by a publishing house and by external news providers. Externally produced content must be converted and stored into the publisher's internal document management systems. The produced content can then be distributed to end-users. Different terminals, such as desktop computers and mobile phones, can be used for reading the content that has been made available. The CoMet System handles the distribution and personalization of content. An overview of the CoMet System architecture can be seen in figure 1. We now examine this delivery process more closely.
Figure 1: CoMet Architecture (external news providers feed news, via ontology conversion and metadata creation, into the CoMet System; the content creator supplies metadata and content; customers receive content matched against their profiles)
3.3.1 External news providers
A publishing house can use external news providers for its news production. In an extreme example a publishing house does not create content at all: it acts as a news gatherer and distributor that buys all the distributed content from external news providers. Even if a publishing house has its own content production, external content providers can still be used; if information is needed on some special and narrow subject, a specialist is called in to help with content production. Publishing houses often buy content from a few established external news providers like Reuters. Under these conditions a data interchange method can be negotiated, and news items and documents can be delivered using, for example, the NITF standard.
Metadata is delivered along with a news item produced by an external news producer. Unfortunately, the received metadata can differ from the publishing house's internal metadata structures and standards. The hierarchy that organizes the metadata structures, i.e. the ontology, can also be different. In that case the external news producer's ontology must be mapped to the publisher's own ontology. Additional metadata creation could be needed as well to fulfill the publisher's metadata needs. Ontology mapping is not a trivial operation, and the subject therefore goes beyond the scope of this paper; we assume that effective ontology mapping methods exist. It is important to be able to reduce the metadata creation burden: if the ontology cannot be mapped, extra work is needed to re-create the document metadata. We stated in chapter 3.1 that content must be produced only once, and the same applies to metadata creation.
3.3.2 Content creator
The content creator is responsible for creating the articles and news items, i.e. the content produced within a publishing house. Newspapers are a good example of publishing houses that have their own content production. When content production is done inside the publishing house, metadata can be created according to internal standards, so additional ontology mappings are not needed. The created content, i.e. news items and documents, is then stored in the publishing house's document handling systems.
Content creators and editors create metadata. An automated metadata creation tool has been implemented to help with this task. Fully automatic metadata creation is difficult because the interpretation of the content relies on human expertise. However, metadata creators can exploit the basic metadata structures that have been created automatically: these suggestions for metadata are audited and generation errors are fixed [Kur99]. The accuracy of the created metadata is essential for the service; inaccuracy will reduce the precision and recall of document delivery.
3.3.3 CoMet System
The CoMet System handles distribution and personalization duties in a publishing house. The CoMet Kernel acts as an information mediator between information sources and their users. It provides services for content creators, editors and other customers using SDI services. Content distribution is based on the created metadata, which is compared to user profiles. If a document is found to be interesting for a user according to his or her user profile, the document is shown to him or her with relevant background material.
Document metadata usage is the key element in the CoMet System. The System must therefore be aware of all the documents that are saved to the content management systems within a publishing house. In particular, metadata must be placed physically inside the CoMet System. Documents can still reside in the various document management systems, but a gateway to them must be implemented. With document management system gateways the CoMet System offers transparent access to all documents, and it utilizes these gateways when documents are delivered to end-users.
3.3.4 Customer
A customer is a person using the CoMet System. A publishing house has internal and external customers. Internal customers are content creators, editors and other personnel within the publishing house. External customers are external news providers or people using SDI services. The vast majority of customers are ordinary people using services provided by the publishing house.
The CoMet System provides personalization features to its customers. Every logical customer has his or her own personal view of the distributed content. One customer can have several user profiles; a separate user profile is needed to distinguish, for example, a work-related user profile from a spare-time user profile. In the evening a user might want to explore totally different material than during office hours.
Another important issue concerns customers who use different terminals to connect to the services provided by a publishing house. The same information cannot be shown in the same way on cellular phones and desktop computers because of the differences in terminal displays. An abstract of a document or a short news item is readable on a cellular phone's display, while the larger displays of desktop computers are more suitable for long articles and documents. Therefore the CoMet System has to distribute partly different material to small-display terminals. Information augmentation is also problematic on small-display terminals: if the display space is limited, the actual news item and its typography and layout are more significant than the ability to use information augmentation features.
3.3.5 Kernel
The CoMet Kernel is responsible for all the personalization and information augmentation features provided by the CoMet System. The main purpose of the CoMet Kernel is to provide services to applications within a publishing house that need personalized SDI services for information distribution. CoMet Intelligence, which forms the core of the CoMet Kernel, performs all of this. Figure 2 shows the details of the CoMet System. The CoMet Kernel is the central part of the CoMet System; it contains CoMet Intelligence, the document metadata database (Meta DBMS in figure 2) and the document store.
The document metadata database has an essential role in the CoMet Kernel. All metadata is stored there: it contains document metadata and user profiles, and only these elements. Documents are stored in a separate document management system, i.e. the document store. This is due to the fact that publishing houses are dependent on their current solutions for document management, because their business functions are built heavily on the document systems currently used in content production. The document management systems in use already provide good versioning functions and authoring facilities. Therefore documents are separated from metadata, and the location of a document is stored in its metadata. The CoMet Kernel provides a gateway to the current document management systems, so current document creation processes can remain unaltered. The document gateway can be implemented for file systems (FS), object databases and other sources.
3.3.6 CoMet application layer
The CoMet System divides into two main parts: the CoMet Kernel and the CoMet Application Layer. The latter is the part of the system that interacts with users, such as content creators and customers. Applications have a service interface that they can use; through this interface, applications in the CoMet Application Layer interact with the CoMet Kernel. Applications have the basic service portfolio at their disposal and do not have to worry about SDI issues. Their main concerns are the amount of information augmentation used and the desired layout. The CoMet Kernel does not handle layout at all; layout is done in the application layer. Content is thus separated from presentation in order to support different terminals [Saa99]. Metadata can also contain information about the devices to which a document can be distributed.
Figure 2: The CoMet System (the CoMet Application Layer, with push applications, pull applications and the content provider's tools, serves external news providers, content providers and customers; the CoMet Kernel contains CoMet Intelligence, the Meta DBMS holding document metadata and profiles, and the document store with documents in an ODBMS or file system)
The CoMet Kernel can simultaneously serve several CoMet Application Layer applications. Therefore the CoMet System can be used by multiple applications that want to use the SDI services made available through the CoMet Application Layer interface. Basically this interface works as follows: the CoMet Kernel provides a line of services, and these services can be called to provide SDI functions. A service request can be, for example, a document request for a customer who wants all the news items relevant to his or her profile. The CoMet Kernel does the matching and then hands the documents to the requesting application.
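As an illustration of how such a service call might look, consider the sketch below. The class and method names are invented for this example; the report does not define the actual interface.

```python
# Hypothetical sketch of the CoMet Application Layer service interface; the
# class and method names are invented for illustration only.
class CoMetKernel:
    def match(self, profile_id):
        """Return documents relevant to the given user profile (stub)."""
        return [{"doc_id": 42, "title": "Räikkönen ready for Interlagos"}]

class PullNewsApp:
    def __init__(self, kernel):
        self.kernel = kernel

    def front_page(self, profile_id):
        # The application only asks the kernel for matched documents and
        # renders the layout itself; SDI issues stay inside the kernel.
        return [doc["title"] for doc in self.kernel.match(profile_id)]

print(PullNewsApp(CoMetKernel()).front_page(profile_id=1))
```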
3.4 Services provided
The CoMet System has four main services. Firstly, there are the personalization features: the CoMet System has metadata about customers and documents, so it is capable of performing the information filtering task and acting as an information mediator between information sources and their users. Secondly, the CoMet System has content re-use services. If a news item that was originally created for a television newsreader is used again in an electronic publication, content re-use is realized; the CoMet System can assist content creators in finding relevant related news items and other documents. Thirdly, the CoMet System provides information augmentation features: explanatory background material is added to content, which can be seen as links to documents that provide relevant background for a news item. Fourthly, story chain management can be used; information about articles and their relations to each other is stored in the metadata database.
4 Metadata structures and matching
As we emphasized before, metadata has a very important role in the CoMet System. The main purpose of metadata is to provide a compact representation of a document's content and of a user profile. Once metadata has been created from the actual document content, it should be possible to process it independently of the original content. This knowledge is used later on to deliver relevant documents to the CoMet System's customers without accessing the original documents. The created metadata must be machine-readable, meaning that metadata can be processed without human assistance. Metadata creation does not have to be fully automated and can contain manual phases, but after metadata has been created it must be processable automatically [Sav98].
The metadata structures used must be implemented in such a fashion that a uniform metadata format can be used to describe content from different sources. A structured data format that can contain different value types is an efficient way of implementing metadata structures. We use XML [W3C00b] for presenting the CoMet metadata structures. This kind of structured data is easy to handle, and different access methods like DOM (Document Object Model) [W3C01] have been introduced for handling it. However, our approach leaves the choice of metadata standard open as long as a structured metadata format is used.
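For illustration, a document profile might be serialized along the following lines and read back with DOM. The element names here are invented; they are not the NITF or CoMet metadata format.

```python
# Hypothetical XML serialization of a document profile; the element and
# attribute names are invented and are not the NITF or CoMet format.
from xml.dom import minidom

xml_text = """<documentProfile docId="42">
  <dimension name="Subject">
    <concept name="Sport" weight="1.0"/>
    <concept name="F1" weight="0.8"/>
  </dimension>
</documentProfile>"""

doc = minidom.parseString(xml_text)
for concept in doc.getElementsByTagName("concept"):
    print(concept.getAttribute("name"), concept.getAttribute("weight"))
```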
Metadata usage has several advantages over using the document content itself to decide whether a document is distributed to a CoMet System customer [Jok00]. The obvious advantage is the size of metadata: it captures the essential semantics of the source, so the created metadata is potentially much smaller than the actual document. Size-intensive file types like video and sound benefit from the use of metadata because it can save a significant amount of storage space and computation time. Metadata supports all media formats through a single representation format. This is a significant advantage because only one matching algorithm must be implemented to cover all media types and their distribution. The main disadvantage is the time that must be spent on metadata creation; this effort increases document creation times and human resource needs in a publishing house. Another difficult issue is the need for changing metadata structures: new media formats may require changes in the metadata structures used, and in that case it is not clear how old and new metadata formats relate to each other.
4.1 Hierarchical ontology
Ontology can be understood in many ways, depending on the circumstances in which it appears. By ontology we mean a set of metadata structures consisting of concepts and their relations to each other. These concepts build a definition of the problem domain. The ontology can contain different angles, i.e. dimensions, of the content. These dimensions contain the information needed to describe the documents within the problem domain. Each dimension describes one aspect of the problem domain, such as story subject, author or geographic location of the story. Every concept in a dimension is orthogonal, i.e. independent of concepts in other dimensions. Dimensions are built up from concepts and their relations, and they form a hierarchical model of a dimension.
Ontologies and their structures can vary depending on the implementation. We will use hierarchical ontology structures because of the calculation and implementation advantages stated at the beginning of this chapter. Besides being efficient to handle, the hierarchical model provides an easily interpreted visual description of dimensions and their concept relations. A simple example of a hierarchical ontology can be seen in figure 3. In this example the hierarchical ontology forms a tree. The root level is the top node of the hierarchical model. Under it is the dimension level, from which the different orthogonal dimensions can be accessed. Here we have three dimensions: Subject, Author and Location. Let us take Location under closer observation. In the next level (dimension level +1) Location divides into two parts, Finland and International. This is the leaf level of the Location dimension; Finland and International are concepts in this dimension. As can be seen in figure 3, the number of levels in a dimension can vary. In our example the Subject dimension is much deeper than the other two dimensions; the depth of a dimension is not limited. In the Subject dimension there are 7 dimension levels. In this dimension, concepts and their relations can be observed better than in the Location dimension: a human observer can quickly form an impression of the dimension and the concept relations it contains.
Figure 3: Ontology construction (a tree with a Root level and three orthogonal dimensions: Subject, Author and Location. Location divides into Finland and International. Subject divides into Politics, Financial and Sport; Sport into Ice hockey (SM-Liiga, NHL) and Motor sport (Rally, F1); F1 into Team and General; and Team into McLaren (Häkkinen), Ferrari and Sauber (Räikkönen), giving dimension levels +1 through +6)
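The tree of figure 3 can be sketched as a simple recursive structure; this is an illustrative data structure, not the report's implementation.

```python
# Illustrative encoding of the figure 3 ontology; the node class is an
# assumption, not the report's implementation.
class Concept:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def depth_of(self, name, level=0):
        """Return the dimension level of a concept, or None if absent."""
        if self.name == name:
            return level
        for child in self.children:
            found = child.depth_of(name, level + 1)
            if found is not None:
                return found
        return None

subject = Concept("Subject", [
    Concept("Politics"),
    Concept("Financial"),
    Concept("Sport", [
        Concept("Ice hockey", [Concept("SM-Liiga"), Concept("NHL")]),
        Concept("Motor sport", [
            Concept("Rally"),
            Concept("F1", [
                Concept("Team", [
                    Concept("McLaren", [Concept("Häkkinen")]),
                    Concept("Ferrari"),
                    Concept("Sauber", [Concept("Räikkönen")]),
                ]),
                Concept("General"),
            ]),
        ]),
    ]),
])
print(subject.depth_of("F1"))         # 3, i.e. dimension level +3
print(subject.depth_of("Räikkönen"))  # 6, the deepest level of figure 3
```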
It is important to notice that the ontology structures must be created before the metadata. Metadata is, in fact, the set of terms that relate to concepts in the dimensions of the ontology; the needed metadata is thus expressed within the hierarchical ontology construction. Ontology creation is a difficult task [Jok00]. A person with good knowledge of the problem domain and of the customers' needs should create the ontology. Several iterations of ontology creation may be needed before a useful ontology is found. The problem domain will evolve and change over time, which means that an ontology is never in its final state and must be updated periodically. This raises a question about documents whose metadata uses the old ontology structures: old metadata may have to be converted to match the new model.
4.2 Document profiles
A document profile contains the semantic metadata created to describe the content of the actual document. The metadata is stored in the hierarchical ontology structures. In figure 4 we can see a news item and its metadata. This simplified example is based on the Subject dimension introduced in figure 3. The news item is about Formula 1 driver Kimi Räikkönen and his thoughts before his first race at the Interlagos circuit. Our ontology fits this problem domain and is therefore capable of describing this news item. The news item's content maps to the Subject dimension's Sport concept.
"Formula One newcomer Kimi Räikkönen is not at all perturbed by the fact that he will be racing on a new track this weekend, but unlike any new F1 track he has encountered so far this season, the Interlagos circuit runs anti-clockwise. 'Interlagos is another new circuit for me to learn. It doesn't worry me to know that it is anti-clockwise and considered as one of the toughest of the year in the calendar,' said the young Finn who has scored one championship point in two Grands Prix. 'The track is difficult because it is very bumpy. It is a lot of work to find a special set up for that, unlike any other track in the F1 calendar. I'm looking forward to driving there, and to challenging again for some points,' he concluded. Last year the Sauber team was forced to pull out of the weekend after excessive vibrations caused by the bumpy nature of the circuit caused many rear wing failures for the team. Sauber are confident they will have no such problems this year."

Figure 4: Document profile (the example news item above and its document-specific ontology, with Subject-dimension weights Subject 1, Sport 1, Motor sport 1, F1 1, Team 0.8, General 0.2, Sauber 0.8, Räikkönen 0.8; all other concepts 0)
The news item's relations to the Subject dimension's concepts are expressed as weights in the leaf nodes of the hierarchical ontology. Leaf weights are then summed into their parent node: a parent node's weight is the sum of its child nodes' weights. This summation is continued until the dimension level is reached. Every dimension, and the values in it, is normalized, and every dimension is independent of the other dimensions. If no values are set, the dimension-level node has the value 0; if weights have been defined for the leaves of a dimension, the dimension-level node will always have the value 1. In figure 4 the concept F1 has the value 1, calculated as the sum of its child node weights 0.8 (Team) and 0.2 (General). The summation is then continued, and finally the weight 1 is assigned to the Subject concept at the dimension level.
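A minimal sketch of this bottom-up weight calculation follows; the dictionary representation is an assumption, but the summation rule is the one described above.

```python
# Sketch of the bottom-up weight summation: a parent's weight is the sum of
# its children's weights, and leaves carry the editor-assigned values.
def make_weight_fn(tree, leaf_weights):
    """tree maps a concept to its child concepts; unlisted names are leaves."""
    def weight(name):
        children = tree.get(name, [])
        if not children:
            return leaf_weights.get(name, 0.0)
        return sum(weight(child) for child in children)
    return weight

subject_tree = {"F1": ["Team", "General"],
                "Team": ["McLaren", "Ferrari", "Sauber"]}
leaf_weights = {"Sauber": 0.8, "General": 0.2}

weight = make_weight_fn(subject_tree, leaf_weights)
print(weight("Team"), weight("F1"))  # 0.8 1.0, as in figure 4
```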
Document metadata creation is a part of the document creation process. Content creators or editors are responsible for metadata creation; as pointed out in the previous chapters, automatic metadata creation can help with this task. When we look at the news item and its ontology in figure 4, it is clear what the weights in the concepts reflect. This is because people are very talented at interpreting document content [Lab99]; human expertise is therefore needed in the metadata creation process. In the SmartPush project a content provider's tool was created to help with the metadata creation task [Kur99]. The content provider's tool analyzed the text's content by using certain key terms, like country names, to capture the needed metadata. Content creators and editors then modified these generated suggestions into suitable semantic metadata. The tool helped with the basic metadata generation task, but the accuracy of the metadata was still in the hands of a human expert. Another problem is the problem domain: the SmartPush project used financial news items, and the content provider's tool was optimized for that domain. In other news areas, automatic metadata generation results would not be as good as they were for financial news. Therefore automatic metadata generation must be optimized separately for every problem domain.
4.3 User profiles
A user profile captures the user's interests in a machine-understandable form. A user profile can be built from a set of keywords that describe the preferred interest areas of a customer. The keywords are compared against news items: if a news item contains one or more terms from the keyword list, it is shown to the customer. This method is used in Krakatoa Chronicle [Bra98]. Another popular way is to store the user profile as a term vector associated with term weights [Bel96]. We are using the same hierarchical ontology that was presented earlier. In this way we are able to use the same hierarchical presentation model for documents and user profiles, which is an advantage when document matching is examined more closely.
In figure 5 we can see a customer and his user profile. It consists of the Subject, Author and Location dimensions, although only the Subject dimension carries meaning in this user profile. The Subject dimension divides into Politics, Financial and Sport. The Sport concept has the finest presentation because it has been refined down to dimension level +4. Concept weights are calculated in the same way as in the document ontology. The user profile uses the same ontology that was presented in figure 3.
A user profile must be created before a customer can use the SDI services provided by the CoMet System. Its purpose is to describe the long-term information needs of a customer in a particular subject area. A customer's personal information needs can vary on a daily basis, but these short-term changes in interest areas cannot be captured. Therefore long-term interest is the only form of interest that can be captured with sufficient accuracy, and with this method document recall and precision can be guaranteed.
Figure 5: User profile (a customer's user-specific ontology with Subject-dimension weights Subject 1, Politics 0.2, Financial 0.3, Sport 0.5, Ice hockey 0.2, Motor sport 0.3, Rally 0.1, F1 0.2, Team 0.05, General 0.15; the Author and Location dimensions carry no weights)
4.4 Matching
The idea of personalized SDI is based on distributing relevant documents to customers according to their individual information needs. A user profile is consulted when knowledge of a customer's interests is needed. The decision on document delivery is made according to the knowledge of the document's content and its relevance to the user. The matching problem arises when a decision on document delivery must be made according to a customer's user profile. This kind of decision problem exists in information retrieval (IR) as well. IR research has produced three retrieval models that are applicable also to SDI: the Boolean model, the vector space model and the probabilistic model [Bel92]. The first two models are used broadly in different SDI implementations.
In the Boolean model a user profile is a sequence of distinct words, i.e. terms. This keyword list contains the terms that characterize the content of the documents the customer considers relevant. A profile matches a document if all the terms in the user profile appear in the document [Yan94b]. Because of this, the Boolean model is said to be an exact-match model: it divides the document store into two parts, and a document belongs either to the matched or to the denied document set. There are two main disadvantages in this model. First, a document cannot have a relevance measure: it either belongs to the matched document set or it does not, and the order of the matched document set cannot be defined in any way. This is clearly a disadvantage because some documents are more relevant to a customer than others. The second disadvantage is the size of the matched document set. If the source document set is large, it is hard to define the user profile granularity in such a fashion that the matched document set has a manageable size; in many cases too many or too few documents are found relevant according to the Boolean model [Çet00]. The Boolean model was widely used in SDI implementations such as library systems in the early 1990s; nowadays the vector space model has gained more popularity.
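In code, the exact-match behavior amounts to a subset test. This is a generic illustration of the Boolean model, not a CoMet component.

```python
# Generic illustration of Boolean exact matching: a document matches only if
# it contains every profile term, and there is no relevance ranking.
def boolean_match(profile_terms, document_terms):
    return set(profile_terms) <= set(document_terms)

profile = ["formula", "circuit"]
print(boolean_match(profile, ["formula", "circuit", "race"]))  # True
print(boolean_match(profile, ["formula", "race"]))             # False
```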
In the vector space model a user profile is a term vector associated with term weights [Yan94a]. Text is treated as a vector in a multidimensional space whose dimensions consist of the words used to describe the content of a document. Stop lists are used to remove terms that appear often in text: terms like more, such and perhaps do not bear any meaning for the document content and are therefore included in stop-word lists. The significant terms must be stemmed. The term weights constitute an important refinement compared to the Boolean model: with these weights the importance of terms can be specified. Statistical methods can be used to decide which terms usually bear more meaning within the text; if inverted document frequency is used, terms that appear rarely in documents gain greater weights than terms found in many documents. When matching is carried out, the profile vector and the document vector are compared against each other using, for example, a cosine similarity measure. If the similarity is within a certain threshold, the document is distributed to the customer who owns the matching user profile. The calculated similarity measures also allow the matched documents to be ordered. This is clearly a positive feature, because customers can assume that documents at the top of the result list are more relevant than those at the end. The vector space model is said to fall into the best-match model category, because the matching is fuzzy and based on an estimation of document relevance instead of an exact yes-or-no value.
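A compact sketch of vector space matching with cosine similarity follows; stop-word removal and stemming are omitted, and the term weights are illustrative.

```python
# Sketch of vector space matching: profile and document are weighted term
# vectors, cosine similarity ranks documents, and a threshold filters them.
import math

def cosine(u, v):
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in set(u) | set(v))
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

profile  = {"f1": 0.7, "hockey": 0.3}
document = {"f1": 0.9, "circuit": 0.4}
print(cosine(profile, document))  # ~0.84: above e.g. a 0.5 threshold, deliver
```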
It is important to notice that the Boolean and vector space models do not fit our environment as they were introduced above. This is because our purpose is to distribute and augment information of various media types, like text, video and sound. If SDI is implemented according to these models, the actual document content is analyzed when the matching between user profile and document is calculated; this would be impossible when, for example, the decision on distributing video material is made. Instead, we rely on document metadata and user metadata when matching is calculated. Using metadata for matching brings a significant advantage: word stemming and stop lists for several different languages are not needed. One implementation can thus be used all over the world instead of different language versions for every mother-tongue area, which is also a commercially attractive aspect.
A user profile can be presented as a query. This is feasible especially in the Boolean model, but other solutions have been presented as well [Alt00]. In many cases the idea of presenting user profiles as queries is not actually a model like the Boolean or vector space model, but rather an implementation method for the chosen retrieval model. The same information can be included in database queries as is contained in the Boolean model, and databases offer an effective way of executing the actual matching task. We are using this method in the CoMet System; however, we are not using the Boolean model for our matching. In the next chapters we describe the matching implementation of our predecessor project SmartPush, and after that we present the matching method that will be used in the CoMet System.
4.4.1 SmartPush matching (asymmetric distance measure)
The SmartPush project stated that a symmetric distance measure is not capable of capturing the distribution of the customer's interest areas [Sav98]. Although the structures of a user profile and of document metadata are the same, they have different semantics. Usually a customer's user profile consists of many separate interest areas. If a customer is interested in high fidelity, apartment advertisements and the menus of nearby restaurants, it would be odd to state that the document most relevant to the user covers all three subject matters. Because of this, the distance measure should not be symmetric. The SmartPush project matched documents against parts of the user profile, so that a part of the user profile represents one interest area. If a document is found to match one part of the user profile, it is assumed to be relevant with respect to that part, instead of treating the user profile as one whole entity. With this, the role of the user profile becomes distinct from the role of the document, and the similarity measure is no longer symmetric, as it was in the vector space model.
The SmartPush project stored metadata in the same way as we introduced in chapter 4.1, and concept weights were calculated in the same way as we showed in chapter 4.2. The hierarchical ontology eases the calculation burden because the hierarchy itself defines the relations between the concepts of a dimension. With this advantage in mind, the SmartPush project introduced a calculation method in which the different dimension levels were exploited [Sav99]. Similarity calculations were done separately for every dimension level. First a coarse measure was calculated; it gives only a rough estimate of the closeness of the document and the user profile, by calculating how much the document misses the user profile. Another calculation took care of the areas where both the document and the user profile have weight; this was called the fine measure. Finally, a way to combine these two measures was introduced. The combined distance measure was calculated separately for every dimension level, so a summarization method was needed to calculate the final similarity. The dimension level determined how much each single level contributed to the combined measure. With this calculation method every document had some distance measure.
It is important to notice that SmartPush did not try to exclude documents from the result set, but rather to calculate the optimal distribution order. In the CoMet System this kind of arrangement will not work. The CoMet System has a service where augmentation features are offered to customers; information augmentation utilizes various heterogeneous document management systems to find relevant documents, videos and other background material for a news item. It would not make sense to calculate the augmentation order of every single document stored in the various document management systems. Therefore we need a calculation method that includes exclusionary features.
4.4.2 LCH-matching
The CoMet System needs a matching method that is capable of handling large amounts of metadata to enable information augmentation and data re-use. We use metadata for the matching calculations; this decreases the calculation burden because metadata is typically compact compared to document text or video data. Despite these advantages, the selected matching method still needs to be efficient. Typically, SDI research has concentrated on the recall and precision problem. These are important aspects, but they do not really matter if high recall and precision are achieved with a calculation method whose time consumption is untenable [Yan94a]. Acceptable response times must be guaranteed for information augmentation and data re-use. Different index structures can be used for matching optimization [Yan94a, Yan94b, Alt00].
We have decided to use the hierarchical ontology structure that was introduced in chapter 4.1. Once the ontology has been constructed, we can use it for modeling both the user profiles and the document profiles. This saves time, because the definition of a good ontology is a difficult task. The user profile was introduced in chapter 4.3 and the document profile in chapter 4.2. The use of a single ontology is an advantage because calculation methods are easier to design and implement when user profiles and document profiles share the same hierarchical model. Time-consuming ontology mappings are not needed, and the calculation process is more intuitive for a human observer than in a situation where the user and document profiles use different ontologies.
The CoMet System compares the document profiles against the user profiles and decides on the closeness of these two entities. The first step is to find the largest combined hierarchy (LCH): the largest hierarchy that the user profile and the document profile share in the ontology. In this way we can separate the potentially relevant documents from the discarded document set. The LCH can be examined as a whole, or its dimensions can be observed independently. Dimensions can be divided further into concept branches. As we can see from figure 6, a dimension can split into different subject matters, like Sport, and further into Motor sport and Ice hockey. These concept branches form separate subject areas that are relevant to this particular customer. When these areas are inspected separately, better augmentation recall and precision can be obtained than when the whole LCH is treated as one entity. Depending on the service (augmentation, matching, data re-use or story chain detection), different matching methods can be used. The matching granularity can therefore differ from one service model to another, which enables us to fine-tune the matching process depending on the task the CoMet System is working on.
Next we present a simplified example of LCH detection and of the final matching calculation. In figure 6 we have two profiles: the user profile and the document profile. They are already familiar, because the same profiles were used in chapters 4.2 and 4.3 when the hierarchical document and user profiles were introduced. When these two profiles are inspected, the largest combined hierarchy can be found in the center of figure 6. The inspection is rather efficient because hierarchical trees are easy to examine. The resulting LCH has only one branch, but this does not have to be the case: similarities can be found in only one dimension, but in a more complex situation the LCH reaches over several dimensions. The final stage involves the actual calculation. We can see that the calculation is relevant only from dimension level +1 downwards, because the dimension level itself is always 1 for both profiles. The calculated elements would be Sport (1 vs. 0.5), Motor sport (1 vs. 0.3), F1 (0.8 vs. 0.2), Team (0.8 vs. 0.05) and General (0.2 vs. 0.15). The comparison of these five concepts yields the final similarity measure in this simplified example. In this manner we have been able to exclude the non-relevant documents from the result set; a similarity measure is then calculated for those documents that have an LCH with a user profile, and the result set can be ordered according to the similarity results.

Figure 6: Largest combined hierarchy matching (the document-specific and user-specific ontologies of figures 4 and 5 are inspected; their largest combined hierarchy, Sport 1 vs. 0.5, Motor sport 1 vs. 0.3, F1 0.8 vs. 0.2, Team 0.8 vs. 0.05 and General 0.2 vs. 0.15 under Subject, is used to calculate the match for delivery to the customer)
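A sketch of the LCH idea under the flat dictionary representation used in the earlier examples follows. The report leaves the final similarity formula open, so a simple weight product over the shared concepts stands in for it here.

```python
# Sketch of LCH matching: the largest combined hierarchy is taken as the set
# of concepts weighted in BOTH profiles. Documents with an empty LCH are
# excluded; the rest are ranked by a stand-in similarity over shared concepts.
def lch(user, doc):
    return {c for c in user if user[c] > 0 and doc.get(c, 0) > 0}

def lch_similarity(user, doc):
    shared = lch(user, doc)
    if not shared:
        return None  # no combined hierarchy: exclude the document
    return sum(user[c] * doc[c] for c in shared)

user = {"Sport": 0.5, "Motor sport": 0.3, "F1": 0.2,
        "Team": 0.05, "General": 0.15, "Ice hockey": 0.2}
doc  = {"Sport": 1.0, "Motor sport": 1.0, "F1": 0.8,
        "Team": 0.8, "General": 0.2}

print(sorted(lch(user, doc)))     # the five concepts compared in figure 6
print(lch_similarity(user, doc))  # ranking score; None would mean exclusion
```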
4.4.3 Weighted LCH-matching
The depth of the hierarchy used has a significant effect on the expressive power of the ontology. This can be seen in figure 3, where the Subject dimension is rather deep (dimension level +6). At dimension level +1 the Sport concept is introduced; by dimension level +3, Sport has been sharpened into the F1 concept. If an LCH is found at dimension level +3, i.e. F1, it is more significant than an LCH that reaches only to dimension level +1, i.e. Sport; the expressive power of dimension level +3 is quite strong in our ontology. Presume that two documents exist which both have an LCH with the user profile in only one dimension, the LCHs are located in the same branch, but one reaches a deeper dimension level. Plain LCH-matching ranks them as equally similar to the user profile. In these circumstances the document with the deeper LCH should be ranked above the other. This can be arranged by emphasizing those matching results that are calculated at deeper dimension levels.
Besides excluding documents from the result set, the depth of the LCH can be exploited in an inverted fashion. If a user profile is very limited, or contains only a few deep branches, the CoMet System benefits from user profile generalization. Let us presume that only one branch has been defined for a customer's user profile, and that this branch is the same as the LCH in figure 6. The customer is obviously very interested in F1 news items. If the first weighted-LCH results exclude all news items from the result set, user profile generalization can be exploited: by emphasizing the similarity calculation weight used at the deepest dimension level −1, a broader view of the defined user profile ontology can be constructed. If a customer is interested in F1 news items, one can assume that he would like to have other motor sport related news items when F1 news items are not available. This kind of feature could be used as a "see also" type of recommendation service: besides being shown a relevant document, a customer can broaden his or her view by examining the articles that belong to a nearby concept in the system's hierarchical ontology.
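As a sketch of how this generalization could work, the emphasis can simply be shifted one level up in the user profile. The boost parameter and the choice of the emphasized level are assumptions made for illustration; the report itself does not fix them.

# A sketch of user profile generalization; boost and the emphasized
# level are illustrative assumptions only.

def generalized_similarity(doc, user, boost=2.0):
    lch = set(doc) & set(user)
    if not lch:
        return None
    # Emphasize the dimension level just above the deepest one in the
    # user profile, so documents matching a nearby, more general concept
    # (e.g. other motor sport items for an F1 fan) rise in the ranking
    # and can serve as "see also" recommendations.
    target = max(len(path) for path in user) - 1
    return sum((boost if len(path) == target else 1.0)
               * (1.0 - abs(doc[path] - user[path]))
               for path in lch)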
5 Conclusions and future work
We have presented the need for selective dissemination of information (SDI) in a publishing company. This problem area covers, for example, electronic publications such as online newspapers. External news providers and internal content creators create news items and other documents such as pictures and sound material. Distribution of a news item is based on metadata, which is a compact representation of the document content.
We then focused on the metadata presentation model. We presented a hierarchical ontology structure as a container for metadata information. This model is used for storing the customers' user profiles and the document profiles. LCH-matching (Largest Combined Hierarchy) is used for matching user profiles against document profiles, and news items are distributed according to the calculation result of LCH-matching. We then refined the LCH-matching method into weighted LCH-matching, in which the hierarchical ontology is exploited to adjust the matching result. Because the depth of the hierarchy has a significant effect on the expressive power of the ontology, the depth of the LCH must be emphasized in the matching calculations.
Our goal was to define a content-management-independent metadata handling system that contains the information needed to provide SDI services in electronic publishing. The basic metadata structures and an approximate description of the CoMet System architecture were presented in this paper. The distribution architecture has been specified in more detail, but that information was not included here. Future challenges include a more detailed description of weighted LCH-matching and especially the architectural design of the distributed metadata database. The metadata database will also be responsible for calculating the matching of user profiles and document profiles.
This is a very important issue in our future work. According to our research on matching implementations, it will be a novel way of performing the matching calculation, and a relational database will provide a powerful facility for handling large amounts of metadata information.
References
Alt00 Altinel, M., Franklin, M., Efficient Filtering of XML Documents for Selective
Dissemination of Information, In Proceedings of the 26th Conference on Very
Large Databases, September 10 – 14, Cairo, Egypt, pages 53 – 64, Morgan
Kaufmann 2000
Bel92 Belkin, N., Croft W., Information Filtering and Information Retrieval: Two Sides
of the Same Coin?, Communications of the ACM, Vol. 35, No. 12, December,
pages 29 – 38, ACM 1992
Bel96 Bell, T., Moffat, A., The Design of a High Performance Information Filtering
System, In Proceedings of the 19th ACM SIGIR conference on Research and
Development in Information Retrieval, August 18 – 22, Zurich Switzerland,
ACM 1996
Bra98 Bharat, K., Kamba, T., Albers, M., Personalized, interactive news on the Web, Multimedia Systems, Vol. 6, September, pages 349 – 358, Springer-Verlag 1998
Çet00 Çetintemel, U., Franklin M., Giles C., Self-Adaptive User Profiles for Large-Scale Data Delivery, In Proceedings of the 16th International Conference on Data Engineering, pages 622 – 633, IEEE 2000
Che95 Chesnais, P., Mucklo, M., Sheena, J., The Fishwrap Personalized News System,
In Proceedings of the 2nd International Workshop on Community Networking
Integrating Multimedia Services to the Home, June 20 – 22, Princeton, New
Jersey USA, pages 275 – 282, IEEE 1995
Dup99 Dublin Core Metadata Initiative., Dublin Core Metadata Element Set, Version
1.1: Reference Description, 2.7.1999, http://purl.oclc.org/dc/documents/rec-dces-
19990702.htm
Fol92 Foltz, P., Dumais, S., Personalized Information Delivery: An Analysis of
Information Filtering Methods, Communications of the ACM, Vol. 35, No. 12,
December, pages 51 – 60, ACM 1992
For00 Forsberg, K., et al., Extensible use of RDF in a Business Context, In Proceedings of the 9th International World Wide Web Conference, The Web: The Next Generation, Amsterdam, May 15 – 19, 2000
Grö98 Grötschel, M., Lügger, J., Scientific Information Systems and Metadata, Report
SC 98-26 (October 1998), Konrad-Zuse-Zentrum für Informationstechnik Berlin,
Germany
Haa94 Haake, A., Hüser, C., Reichenberger, K., The individualized electronic
newspaper: an example of an active publication. Electronic Publishing, Vol. 7(2),
June, pages 89 – 111, Wiley 1994
Ipt00 The International Press Telecommunications Council., News Industry Text
Format (NITF), www.nitf.org, 2000
Jok00 Jokela, S., Turpeinen, M., Sulonen, R., Ontology Development for Flexible
Content, HICSS-33 Minitrack on Systems Support for Electronic Business on the
Internet, January 4 – 7, 2000
Kam97 Kamba, T., Sakagami, H., Koseki, Y., Anatagonomy: a Personalized Newspaper
on the World Wide Web, International Journal of Human-Computer Studies, Vol.
46, No. 6, June, pages 789 – 803, Academic Press 1997
Kur99 Kurki, T., et al., Agents in Delivering Personalized Content Based on Semantic Metadata, Intelligent Agents in Cyberspace, Papers from the AAAI Spring Symposium, Technical Report SS-99-03, AAAI Press 1999
Lab99 Labrou, Y., Finin, T., Yahoo! As an Ontology – Using Yahoo! Categories to Describe Documents, In Proceedings of the 8th International Conference on Information and Knowledge Management, November 2 – 6, Kansas City, MO USA, pages 180 – 187, ACM 1999
Mar98 Marshall, C., Making Metadata: a Study of Metadata Creation for a Mixed
Physical-Digital Collection, In Proceedings of the 3rd ACM Conference on
Digital libraries, June 23 - 26, Pittsburgh, PA USA, pages 162 – 171, ACM 1998
Onl01 Online Newspapers., www.onlinenewspapers.com, 2001
Saa99 Saarela, J., The Role of Metadata in Electronic Publishing, D.Sc. thesis, Acta Polytechnica Scandinavica, Ma 102, 1999
Sav98 Savia, E., Kurki, T., Jokela, S., Metadata Based Matching of Documents and User Profiles, in Human and Artificial Information Processing, Proceedings of the 8th Finnish Artificial Intelligence Conference, pages 61 – 69, Finnish Artificial Intelligence Society 1998
Sav99 Savia, E., Mathematical Methods for a Personalized Information Service, Master's Thesis, Helsinki University of Technology, Institute of Mathematics, HUT 1999
W3C00a W3C., Resource Description Framework (RDF) Schema Specification, W3C Candidate Recommendation, 27 March 2000, http://www.w3.org/TR/2000/CR-rdf-schema-20000327
W3C00b W3C., Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation, 6 October 2000, http://www.w3.org/TR/2000/REC-xml-20001006
W3C01 W3C., Document Object Model (DOM), http://www.w3.org/DOM
Xml01 XML and the news industry, www.xmlnews.org, 2001
Yan94a Yan, T., Garcia-Molina H., Index Structures for Information Filtering Under the
Vector Space Model, In Proceedings of the 10th International Conference on Data
Engineering, February 14 – 18, Houston, Texas USA, pages 337 – 347, IEEE
1994
Yan94b Yan, T., Garcia-Molina H., Index Structures for Selective Dissemination of Information Under the Boolean Model, ACM Transactions on Database Systems, Vol. 19, No. 2, June, pages 332 – 364, ACM 1994