Using Metadata in Electronic Publishing
(CoMet internal report)
Juha Puustjärvi
Jukka Yli-Koivisto [email protected]
Espoo, 10.5.2001
Software Business and Engineering Institute
Helsinki University of Technology
Contents

1 Introduction
2 Related work
3 CoMet
   3.1 The problem
   3.2 CoMet solution issues
   3.3 Working environment
      3.3.1 External news providers
      3.3.2 Content creator
      3.3.3 CoMet System
      3.3.4 Customer
      3.3.5 Kernel
      3.3.6 CoMet application layer
   3.4 Services provided
4 Metadata structures and matching
   4.1 Hierarchical ontology
   4.2 Document profiles
   4.3 User profiles
   4.4 Matching
      4.4.1 SmartPush matching (asymmetric distance measure)
      4.4.2 LCH-matching
      4.4.3 Weighted LCH-matching
5 Conclusions and future work
References
1 Introduction
Various publishing houses have started to exploit new publishing media like the Internet and various terminals such as cellular phones and pocket PCs. In these environments a publishing house faces new challenges. More content can be distributed than in conventional newspapers, where space limits the quantity of information available. In an electronic publication, explanatory background material can be linked to news items. This is called information augmentation, and it provides larger perspectives to readers. To provide background material, several information sources must be utilized across various heterogeneous document management systems. This is done by using metadata that has been produced to describe the content of material created by different content creators. Information augmentation is therefore dependent on the services provided by the system that stores the metadata.
Selective dissemination of information (SDI) is a form of electronic publishing. Newspapers that are available online are a good example of SDI services. According to WebWombat [Onl01] there are at the moment 65 online newspapers in Finland. Entertainment and financial services are also available in various formats. These services rely on a pull- or push-type service model. Every service that distributes information by letting its customers browse the provided information manually is a pull-type service. In this model the user is responsible for requesting the desired information: he or she can point and click a link to the wanted information or request it in some other way. In a push-type service the content producer has knowledge of its customers and their needs [Çet00]. Because of this knowledge a publishing house can push the provided content to its customers. The pushed content can be browsed, but the customer cannot directly control it; content selection is made indirectly by changing the user profile. The user profile is the key to knowing the customer's needs.

Pull-type services can use user profiles just as push-type services do. A user profile is knowledge about a user and his or her needs and interests. The user profile is compared against the document metadata, and if a match is found the content can be shown to the customer. To provide this kind of service, effective means of metadata handling are needed. Different media types and heterogeneous content management systems pose difficult issues as well. Therefore a content-management-independent metadata handling model must be implemented to provide services for push- and pull-type services and information augmentation. News content brings some special features that are needed in content management and distribution of electronic publications. SDI concerns issues like the metadata format used for describing news items and a meaningful ontology hierarchy, which describes the metadata structures and their relations to each other.
Our solution is to build a content-management-independent metadata handling system that contains the information needed to build push- and pull-type services in electronic publishing. This is achieved by separating metadata from its original content; therefore a separate database is needed for the metadata. This database is handled using the facilities provided by conventional relational databases. The developed system benefits from traditional database services such as data abstraction, high-level access through query languages and controlled multi-user access. The system can also be distributed to geographically different locations and yet remain efficient. The core system serves multiple push- and pull-type applications at the same time through its service interface.
The remainder of the report is organized as follows. In chapter 2 we look at related work in electronic publishing and particularly in online newspapers. Chapter 3 introduces the basic problem area and environment of the CoMet System. In chapter 4 we look at the hierarchical metadata structures that are used in the CoMet System; the matching problem is examined there as well. At the end we present the conclusions and the need for further research.
2 Related work
SmartPush [Sav98, Kur99] is a personalized delivery system for economic news items, and it has similarities to our work. It was a research project at the Helsinki University of Technology from 1997 to 2000, and it is our predecessor project. SmartPush used a hierarchical ontology that is similar to our profile construction. However, the similarity calculation method used differs from our approach. SmartPush uses an asymmetric similarity measure [Sav99] that does not exclude documents from the result set; rather, it ranks the incoming document flow into a certain order in the result set. This is clearly a disadvantage with respect to our aims in information augmentation and data re-use. SmartPush uses agents in its implementation, and our approach differs in this respect as well. We will exploit a relational database and its calculation power for matching purposes, whereas SmartPush used a matching agent coded in Java. Our vision is that a relational database can offer a rather efficient way of handling a huge mass of document metadata compared to the calculation power that Java can offer.
Fishwrap [Che95] is a personalized newspaper that uses news material from several external news providers. It allows topic selection and layout customization of the personalized news page. The incoming news feed is matched against the user-specified topics. With this pre-categorization the system gains in performance compared to online matching of all topics. Fishwrap also has community-wide features. Page one contains several news items, and Fishwrap tracks the interest in news items by counting the number of readers each item has. A news item's position on page one changes according to the popularity it has gained. Fishwrap also has information augmentation features: it checks its photo and sound databases for pictures and sound recordings that match the news item.
User behavior had a significant role in user profile adaptation in Krakatoa Chronicle [Bra98] and its successor project Anatagonomy [Kam97]. The Java applet used enabled very intensive surveillance of customer behavior. The customer's behavior is tracked while he or she reads: activities like scrolling, maximizing, opening articles in new windows or saving them into a scrapbook are assumed to reflect positive interest towards the article content. Explicit feedback turned out to perform worse than this kind of implicit feedback. The storage method for user profiles was not explicitly described in the article, but the calculation methods used and the characterization of the user profiles let us assume a keyword-list representation. Krakatoa Chronicle looks like an ordinary newspaper because it has a multi-column layout and justified text. A news item's relevance was shown as a slider widget that the customer could re-adjust for feedback, and a community-related similarity measure was shown for every news item. Krakatoa Chronicle and Anatagonomy used a client-server architecture where the server handed the news items to a client, which was responsible for layout generation and feedback surveillance. This kind of architecture does not suit the CoMet System because it limits the customer terminals that can be used.
Telepublishing [Haa94] was implemented in HyperNeWS. This method enables a newspaper-like layout, as in Krakatoa Chronicle, but the chosen implementation excludes all terminals that do not have HyperNeWS available. A second layout type was also available; it was designed to support the observed ways in which customers use an electronic newspaper. A significant amount of work was targeted at developing ways to support the content creation process for electronic newspapers. One notable feature in Telepublishing is the background material that is offered for news items. This feature has similarities to our information augmentation feature in the CoMet System. However, Telepublishing seems to offer its background material in a limited amount compared to our System. Personalization features are also present, but the implementation of the matching problem is not clarified in the article that describes the Telepublishing system.
3 CoMet
CoMet stands for content management of a media company based on metadata. The aim of the project is to build a working prototype of the CoMet System and test it in a real content production environment. The System will provide various services that support content producers' daily activities. The CoMet System can be used as the main intelligence of personalized push and pull services that will be distributed to different terminals, including mobile phones and desktop computers.
3.1 The problem
Publishing houses can have a large number of different media products in their key business area, such as newspapers, television, Internet services, radio, and publishing activities. Content production involves a large number of different media types and customer terminals. In this kind of media-rich environment it is essential that content production be done somewhat differently for every media type. Content production facilities for newspaper production have different requirements than those for CD-ROM production. Therefore separate content production systems are needed to provide the required functionality for both. The drawbacks of heterogeneous document management systems are impaired content re-usability and impaired content dissemination across different media types.
Metadata is a critical component of effective information management [Mar98]. Therefore metadata must be added to the content produced in a publishing house. The produced content is then stored in the various content management systems used in the publishing house. Metadata contains descriptive information about a document and can therefore guide end users in their interaction with the document collection. Descriptive metadata also contains information about issues beyond the scope of the document's content itself, such as the author, media type, production date and version of a document.
A publishing house produces a large amount of content, such as news items and other documents. However, it may not be able to produce all the content it needs for its media distribution. News items can also be bought from external news producers like Reuters. Document metadata standards such as RDF [W3C00a] and Dublin Core [Dup99] are needed for efficient data interchange. For special kinds of content these standards may not provide the needed means for information interchange [Grö98, For00]. Dissatisfaction can be caused by an insufficient amount of available metadata or by unsuitable metadata structures. The NITF (News Industry Text Format) standard [Ipt00, XML01] has been introduced to provide the needed document and metadata structures for the news industry. If several different metadata formats are in use, document matching can cause problems. Because of that, metadata within the CoMet System will be homogeneous.
Content re-use and information augmentation are key issues in electronic publishing [Saa99]. The benefits of content re-use are obvious: if content has already been produced, it would be a waste of resources to produce it again. Content authoring is efficient only if content is produced only once. In information augmentation, explanatory background material is added to content. This can be seen in electronic newspapers as links to relevant background material. In this way the content of an electronic publication can be enriched to meet the customer's personalized needs. Therefore background material must be individually chosen for every logical customer. If a customer is using an electronic publication like an online newspaper, background material can be added from other publication systems to meet the individualized needs of the newspaper reader. This is a difficult task if heterogeneous document management systems with incompatible metadata structures are used.
3.2 CoMet solution issues
Our aim is to design data and metadata structures, and their distribution, for use in electronic publishing. By these means data re-use and information augmentation are made possible not only from one single production instance but from all document management systems that are involved in content production within a publishing company.

We must admit that a publishing house cannot abandon its current content production environments and simply start using a new one. Current systems may have gone through intensive customization to meet the desired production and business needs. Therefore a separate guide to the information must be provided. Metadata must be stored in a relational database along with the location of the document. The CoMet System provides transparent access to all documents within a production company, and all media types, like text, video and sound, can be distributed by our system. We have to emphasize that the CoMet System does not replace actual content production tools; it provides the means of finding content and distributing it to customers.
Our focus is to provide services for electronic publishing and especially for the news publishing industry. Our System will be a relational-database-centric solution for information handling. The CoMet System gains from the benefits of relational databases, such as data abstraction, high-level access through query languages and controlled multi-user access. Publishing houses can have geographically distributed content management systems. The CoMet System can be distributed as well, to meet performance demands and to hide the actual content production point.
In SDI systems, information is distributed according to user profiles that contain information about the users' interest areas. Therefore a method for matching documents and user profiles must be introduced [Fol92]. The vector space model is a popular way to calculate the similarity of a user profile and a document. It has been alleged that the recall and precision of the vector space model are superior to those of the Boolean (relational) model [Çet00]. Recall is the ratio of relevant documents retrieved to all relevant documents that exist in the source collection. Precision is the percentage of the returned documents that were desirable for the customer.
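In the standard formulation (spelled out here for reference; the report defines these measures only in prose):

    recall    = |retrieved ∩ relevant| / |relevant|
    precision = |retrieved ∩ relevant| / |retrieved|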
We suggest that the vector space model is not the only suitable matching method in the electronic publishing environment. This is due to the limited size of the source collection involved in electronic publications; especially in news distribution, a news item is relevant only for a short period of time. We argue that the Boolean (relational) model can be specified by SQL queries with relatively high precision and recall. To be able to do this, user profiles and document metadata must be presented in such a fashion that effective matching can be calculated by the relational database management system. This technique will provide an efficient way of calculating the similarity of user profiles and document profiles. To be able to compare the vector space model to our suggestion, both models will be implemented in the CoMet System.
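As a rough illustration of the relational approach, a minimal sketch follows. The table layout, the weight-product score and the threshold are assumptions made for this example; the report does not specify the actual CoMet schema or similarity formula.

```python
# Minimal sketch of relational matching; the table layout and scoring are
# assumptions for illustration, not the actual CoMet schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE profile_weights  (profile_id INT, concept TEXT, weight REAL);
CREATE TABLE document_weights (doc_id INT, concept TEXT, weight REAL);
""")
con.executemany("INSERT INTO profile_weights VALUES (?, ?, ?)",
                [(1, "F1", 0.2), (1, "Ice hockey", 0.2)])
con.executemany("INSERT INTO document_weights VALUES (?, ?, ?)",
                [(10, "F1", 0.8), (11, "Politics", 0.9)])

# Score each document against profile 1 by summing weight products over the
# shared concepts; the HAVING threshold excludes irrelevant documents.
rows = con.execute("""
    SELECT d.doc_id, SUM(p.weight * d.weight) AS score
    FROM profile_weights p
    JOIN document_weights d ON p.concept = d.concept
    WHERE p.profile_id = 1
    GROUP BY d.doc_id
    HAVING score > 0.1
    ORDER BY score DESC
""").fetchall()
print(rows)  # [(10, 0.16...)]: document 11 shares no concept and is excluded
```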
3.3 Working environment
An electronic newspaper is a combination of news items, documents, pictures and other media objects produced by a publishing house and by external news providers. Externally produced content must be converted and stored into the publisher's internal document management systems. The produced content can then be distributed to end-users. Different terminals, such as desktop computers and mobile phones, can be used for reading the content that has been made available. The CoMet System handles the distribution and personalization of content. An overview of the CoMet System architecture can be seen in figure 1. We now examine this delivery process more closely.
Figure 1: CoMet Architecture (external news providers feed news, via ontology conversion and metadata creation, into the CoMet System; the content creator supplies metadata and content; customers receive content matched against their profiles)
3.3.1 External news providers
A publishing house can use external news providers for its news production. In an extreme example a publishing house does not create content at all: it acts as a news gatherer and distributor that buys all the distributed content from external news providers. Even if a publishing house has its own content production, external content providers can still be used; if information is needed on some special and narrow subject, a specialist is called in to help with content production. Publishing houses often buy content from a few established external news providers like Reuters. Under these conditions a data interchange method can be negotiated, and news items and documents can be delivered using, for example, the NITF standard.
Metadata is delivered along with a news item produced by an external news producer. Unfortunately, the received metadata can differ from the publishing house's internal metadata structures and standards. The hierarchy that organizes the metadata structures, i.e. the ontology, can also be different. In that case the external news producer's ontology must be mapped to the publisher's own ontology. Additional metadata creation could be needed as well to fulfill the publisher's metadata needs. Ontology mapping is not a trivial operation, and the subject therefore goes beyond the scope of this paper; we assume that effective ontology mapping methods exist. It is important to be able to reduce the metadata creation burden: if the ontology cannot be mapped, extra work is needed to re-create the document metadata. We stated in chapter 3.1 that content must be produced only once, and the same applies to metadata creation.
3.3.2 Content creator
The content creator is responsible for creating the articles and news items, i.e. the content produced within a publishing house. Newspapers are a good example of publishing houses that have their own content production. When content production is done inside the publishing house, metadata can be created according to internal standards, so additional ontology mappings are not needed. The created content, i.e. news items and documents, is then stored in the publishing house's document handling systems.
Content creators and editors create metadata. An automated metadata creation tool has been implemented to help with this task. Fully automatic metadata creation is difficult because the interpretation of the content relies on human expertise. However, metadata creators can exploit the basic metadata structures that have been created automatically: these suggestions for metadata are audited and generation errors are fixed [Kur99]. The accuracy of the created metadata is essential for the service; inaccuracy will reduce the precision and recall of document delivery.
3.3.3 CoMet System
The CoMet System handles distribution and personalization duties in a publishing house. The CoMet Kernel acts as an information mediator between information sources and their users. It provides services for content creators, editors and other customers using SDI services. Content distribution is based on the created metadata, which is compared to user profiles. If a document is found to be interesting for a user according to his or her user profile, the document is shown to him or her with relevant background material.
Document metadata usage is the key element in the CoMet System. The System must therefore be aware of all the documents that are saved to the content management systems within a publishing house. In particular, metadata must be placed physically inside the CoMet System. Documents can still reside in the various document management systems, but a gateway to them must be implemented. With document management system gateways the CoMet System offers transparent access to all documents, and it utilizes these gateways when documents are delivered to end-users.
3.3.4 Customer
A customer is a person using the CoMet System. A publishing house has internal and external customers. Internal customers are content creators, editors and other personnel within the publishing house. External customers are external news providers or people using SDI services. The vast majority of customers are ordinary people using services provided by the publishing house.
The CoMet System provides personalization features to its customers. Every logical customer has his or her own personal view of the distributed content. One customer can have several user profiles; a separate user profile is needed to distinguish, for example, a work-related user profile from a spare-time user profile. In the evening a user might want to explore totally different material than during office hours.
Another important issue concerns customers who use different terminals to connect to the services provided by a publishing house. The same information cannot be shown in the same way on cellular phones and desktop computers because of the differences in terminal displays. An abstract of a document or a short news item is readable on a cellular phone's display, while the larger displays of desktop computers are more suitable for long articles and documents. Therefore the CoMet System has to distribute partly different material to small-display terminals. Information augmentation is also problematic on small-display terminals: if the display space is limited, the actual news item and its typography and layout are more significant than the ability to use information augmentation features.
3.3.5 Kernel
The CoMet Kernel is responsible for all the personalization and information augmentation features provided by the CoMet System. The main purpose of the CoMet Kernel is to provide services to applications within a publishing house that need personalized SDI services for information distribution. CoMet Intelligence, which forms the core of the CoMet Kernel, performs all of this. Figure 2 shows the details of the CoMet System. The CoMet Kernel is the central part of the CoMet System; it contains CoMet Intelligence, the document metadata database (Meta DBMS in figure 2) and the document store.
The document metadata database has an essential role in the CoMet Kernel. All metadata is stored there: it contains document metadata and user profiles, and only these elements. Documents are stored in a separate document management system, i.e. the document store. This is due to the fact that publishing houses are dependent on their current solutions for document management, because their business functions are built heavily on the document systems currently used in content production. The document management systems in use already provide good versioning functions and authoring facilities. Therefore documents are separated from metadata, and the location of a document is stored in its metadata. The CoMet Kernel provides a gateway to the current document management systems, so current document creation processes can remain unaltered. The document gateway can be implemented for file systems (FS), object databases and other sources.
3.3.6 CoMet application layer
The CoMet System divides into two main parts: the CoMet Kernel and the CoMet Application Layer. The latter is the part of the system that interacts with users, such as content creators and customers. Applications have a service interface that they can use; through this interface, applications in the CoMet Application Layer interact with the CoMet Kernel. Applications have the basic service portfolio at their disposal and do not have to worry about SDI issues. Their main concerns are the amount of information augmentation used and the desired layout. The CoMet Kernel does not handle layout at all; layout is done in the application layer. Content is thus separated from presentation in order to support different terminals [Saa99]. Metadata can also contain information about the devices to which a document can be distributed.
Figure 2: The CoMet System (the CoMet Application Layer, with push applications, pull applications and the content provider's tools, serves external news providers, content providers and customers; the CoMet Kernel contains CoMet Intelligence, the Meta DBMS holding document metadata and profiles, and the document store with documents in an ODBMS or file system)
The CoMet Kernel can simultaneously serve several CoMet Application Layer applications. Therefore the CoMet System can be used by multiple applications that want to use the SDI services made available through the CoMet Application Layer interface. Basically this interface works as follows: the CoMet Kernel provides a line of services, and these services can be called to provide SDI functions. A service request can be, for example, a document request for a customer who wants all the news items relevant to his or her profile. The CoMet Kernel does the matching and then hands the documents to the requesting application.
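As an illustration of how such a service call might look, consider the sketch below. The class and method names are invented for this example; the report does not define the actual interface.

```python
# Hypothetical sketch of the CoMet Application Layer service interface; the
# class and method names are invented for illustration only.
class CoMetKernel:
    def match(self, profile_id):
        """Return documents relevant to the given user profile (stub)."""
        return [{"doc_id": 42, "title": "Räikkönen ready for Interlagos"}]

class PullNewsApp:
    def __init__(self, kernel):
        self.kernel = kernel

    def front_page(self, profile_id):
        # The application only asks the kernel for matched documents and
        # renders the layout itself; SDI issues stay inside the kernel.
        return [doc["title"] for doc in self.kernel.match(profile_id)]

print(PullNewsApp(CoMetKernel()).front_page(profile_id=1))
```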
3.4 Services provided
The CoMet System has four main services. Firstly, there are the personalization features: the CoMet System has metadata about customers and documents, so it is capable of performing the information filtering task and acting as an information mediator between information sources and their users. Secondly, the CoMet System has content re-use services. If a news item that was originally created for a television newsreader is used again in an electronic publication, content re-use is realized; the CoMet System can assist content creators in finding relevant related news items and other documents. Thirdly, the CoMet System provides information augmentation features: explanatory background material is added to content, which can be seen as links to documents that provide relevant background for a news item. Fourthly, story chain management can be used; information about articles and their relations to each other is stored in the metadata database.
4 Metadata structures and matching
As we emphasized before, metadata has a very important role in the CoMet System. The main purpose of metadata is to provide a compact representation of a document's content and of a user profile. Once metadata has been created from the actual document content, it should be possible to process it independently of the original content. This knowledge is used later on to deliver relevant documents to the CoMet System's customers without accessing the original documents. The created metadata must be machine-readable, meaning that metadata can be processed without human assistance. Metadata creation does not have to be fully automated and can contain manual phases, but after metadata has been created it must be processable automatically [Sav98].
The metadata structures used must be implemented in such a fashion that a uniform metadata format can be used to describe content from different sources. A structured data format that can contain different value types is an efficient way of implementing metadata structures. We use XML [W3C00b] for presenting the CoMet metadata structures. This kind of structured data is easy to handle, and different access methods like DOM (Document Object Model) [W3C01] have been introduced for handling it. However, our approach leaves the choice of metadata standard open as long as a structured metadata format is used.
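For illustration, a document profile might be serialized along the following lines and read back with DOM. The element names here are invented; they are not the NITF or CoMet metadata format.

```python
# Hypothetical XML serialization of a document profile; the element and
# attribute names are invented and are not the NITF or CoMet format.
from xml.dom import minidom

xml_text = """<documentProfile docId="42">
  <dimension name="Subject">
    <concept name="Sport" weight="1.0"/>
    <concept name="F1" weight="0.8"/>
  </dimension>
</documentProfile>"""

doc = minidom.parseString(xml_text)
for concept in doc.getElementsByTagName("concept"):
    print(concept.getAttribute("name"), concept.getAttribute("weight"))
```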
Metadata usage has several advantages over using the document content itself to decide whether a document is distributed to a CoMet System customer [Jok00]. The obvious advantage is the size of metadata: it captures the essential semantics of the source, so the created metadata is potentially much smaller than the actual document. Size-intensive file types like video and sound benefit from the use of metadata because it can save a significant amount of storage space and computation time. Metadata supports all media formats through a single representation format. This is a significant advantage because only one matching algorithm must be implemented to cover all media types and their distribution. The main disadvantage is the time that must be spent on metadata creation; this effort increases document creation times and human resource needs in a publishing house. Another difficult issue is the need for changing metadata structures: new media formats may require changes in the metadata structures used, and in that case it is not clear how old and new metadata formats relate to each other.
4.1 Hierarchical ontology
Ontology can be understood in many ways, depending on the circumstances in which it appears. By ontology we mean a set of metadata structures consisting of concepts and their relations to each other. These concepts build a definition of the problem domain. The ontology can contain different angles, i.e. dimensions, of the content. These dimensions contain the information needed to describe the documents within the problem domain. Each dimension describes one aspect of the problem domain, such as story subject, author or geographic location of the story. Every concept in a dimension is orthogonal, i.e. independent of concepts in other dimensions. Dimensions are built up from concepts and their relations, and they form a hierarchical model of a dimension.
Ontologies and their structures can vary depending on the implementation. We will use hierarchical ontology structures because of the calculation and implementation advantages stated at the beginning of this chapter. Besides being efficient to handle, the hierarchical model provides an easily interpreted visual description of dimensions and their concept relations. A simple example of a hierarchical ontology can be seen in figure 3. In this example the hierarchical ontology forms a tree. The root level is the top node of the hierarchical model. Under it is the dimension level, from which the different orthogonal dimensions can be accessed. Here we have three dimensions: Subject, Author and Location. Let us take Location under closer observation. In the next level (dimension level +1) Location divides into two parts, Finland and International. This is the leaf level of the Location dimension; Finland and International are concepts in this dimension. As can be seen in figure 3, the number of levels in a dimension can vary. In our example the Subject dimension is much deeper than the other two dimensions; the depth of a dimension is not limited. In the Subject dimension there are 7 dimension levels. In this dimension, concepts and their relations can be observed better than in the Location dimension: a human observer can quickly form an impression of the dimension and the concept relations it contains.
Figure 3: Ontology construction (a tree with a Root level and three orthogonal dimensions: Subject, Author and Location. Location divides into Finland and International. Subject divides into Politics, Financial and Sport; Sport into Ice hockey (SM-Liiga, NHL) and Motor sport (Rally, F1); F1 into Team and General; and Team into McLaren (Häkkinen), Ferrari and Sauber (Räikkönen), giving dimension levels +1 through +6)
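The tree of figure 3 can be sketched as a simple recursive structure; this is an illustrative data structure, not the report's implementation.

```python
# Illustrative encoding of the figure 3 ontology; the node class is an
# assumption, not the report's implementation.
class Concept:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def depth_of(self, name, level=0):
        """Return the dimension level of a concept, or None if absent."""
        if self.name == name:
            return level
        for child in self.children:
            found = child.depth_of(name, level + 1)
            if found is not None:
                return found
        return None

subject = Concept("Subject", [
    Concept("Politics"),
    Concept("Financial"),
    Concept("Sport", [
        Concept("Ice hockey", [Concept("SM-Liiga"), Concept("NHL")]),
        Concept("Motor sport", [
            Concept("Rally"),
            Concept("F1", [
                Concept("Team", [
                    Concept("McLaren", [Concept("Häkkinen")]),
                    Concept("Ferrari"),
                    Concept("Sauber", [Concept("Räikkönen")]),
                ]),
                Concept("General"),
            ]),
        ]),
    ]),
])
print(subject.depth_of("F1"))         # 3, i.e. dimension level +3
print(subject.depth_of("Räikkönen"))  # 6, the deepest level of figure 3
```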
It is important to notice that the ontology structures must be created before the metadata. Metadata is, in fact, the set of terms that relate to concepts in the dimensions of the ontology; the needed metadata is thus expressed within the hierarchical ontology construction. Ontology creation is a difficult task [Jok00]. A person with good knowledge of the problem domain and of the customers' needs should create the ontology. Several iterations of ontology creation may be needed before a useful ontology is found. The problem domain will evolve and change over time, which means that an ontology is never in its final state and must be updated periodically. This raises a question about documents whose metadata uses the old ontology structures: old metadata may have to be converted to match the new model.
4.2 Document profiles
A document profile contains the semantic metadata created to describe the content of the actual document. The metadata is stored in the hierarchical ontology structures. In figure 4 we can see a news item and its metadata. This simplified example is based on the Subject dimension introduced in figure 3. The news item is about Formula 1 driver Kimi Räikkönen and his thoughts before his first race at the Interlagos circuit. Our ontology fits this problem domain and is therefore capable of describing this news item. The news item's content maps to the Subject dimension's Sport concept.
"Formula One newcomer Kimi Räikkönen is not at all perturbed by the fact that he will be racing on a new track this weekend, but unlike any new F1 track he has encountered so far this season, the Interlagos circuit runs anti-clockwise. 'Interlagos is another new circuit for me to learn. It doesn't worry me to know that it is anti-clockwise and considered as one of the toughest of the year in the calendar,' said the young Finn who has scored one championship point in two Grands Prix. 'The track is difficult because it is very bumpy. It is a lot of work to find a special set up for that, unlike any other track in the F1 calendar. I'm looking forward to driving there, and to challenging again for some points,' he concluded. Last year the Sauber team was forced to pull out of the weekend after excessive vibrations caused by the bumpy nature of the circuit caused many rear wing failures for the team. Sauber are confident they will have no such problems this year."

Figure 4: Document profile (the example news item above and its document-specific ontology, with Subject-dimension weights Subject 1, Sport 1, Motor sport 1, F1 1, Team 0.8, General 0.2, Sauber 0.8, Räikkönen 0.8; all other concepts 0)
The news item's relations to the Subject dimension's concepts are expressed as weights in the leaf nodes of the hierarchical ontology. Leaf weights are then summed into their parent node: a parent node's weight is the sum of its child nodes' weights. This summation is continued until the dimension level is reached. Every dimension, and the values in it, is normalized, and every dimension is independent of the other dimensions. If no values are set, the dimension-level node has the value 0; if weights have been defined for the leaves of a dimension, the dimension-level node will always have the value 1. In figure 4 the concept F1 has the value 1, calculated as the sum of its child node weights 0.8 (Team) and 0.2 (General). The summation is then continued, and finally the weight 1 is assigned to the Subject concept at the dimension level.
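A minimal sketch of this bottom-up weight calculation follows; the dictionary representation is an assumption, but the summation rule is the one described above.

```python
# Sketch of the bottom-up weight summation: a parent's weight is the sum of
# its children's weights, and leaves carry the editor-assigned values.
def make_weight_fn(tree, leaf_weights):
    """tree maps a concept to its child concepts; unlisted names are leaves."""
    def weight(name):
        children = tree.get(name, [])
        if not children:
            return leaf_weights.get(name, 0.0)
        return sum(weight(child) for child in children)
    return weight

subject_tree = {"F1": ["Team", "General"],
                "Team": ["McLaren", "Ferrari", "Sauber"]}
leaf_weights = {"Sauber": 0.8, "General": 0.2}

weight = make_weight_fn(subject_tree, leaf_weights)
print(weight("Team"), weight("F1"))  # 0.8 1.0, as in figure 4
```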
Document metadata creation is a part of the document creation process. Content creators or editors are responsible for metadata creation; as pointed out in the previous chapters, automatic metadata creation can help with this task. When we look at the news item and its ontology in figure 4, it is clear what the weights in the concepts reflect. This is because people are very talented at interpreting document content [Lab99]; human expertise is therefore needed in the metadata creation process. In the SmartPush project a content provider's tool was created to help with the metadata creation task [Kur99]. The content provider's tool analyzed the text's content by using certain key terms, like country names, to capture the needed metadata. Content creators and editors then modified these generated suggestions into suitable semantic metadata. The tool helped with the basic metadata generation task, but the accuracy of the metadata was still in the hands of a human expert. Another problem is the problem domain: the SmartPush project used financial news items, and the content provider's tool was optimized for that domain. In other news areas, automatic metadata generation results would not be as good as they were for financial news. Therefore automatic metadata generation must be optimized separately for every problem domain.
4.3 User profiles
A user profile captures the user's interests in a machine-understandable form. A user profile can be built from a set of keywords that describe the preferred interest areas of a customer. The keywords are compared against news items: if a news item contains one or more terms from the keyword list, it is shown to the customer. This method is used in Krakatoa Chronicle [Bra98]. Another popular way is to store the user profile as a term vector associated with term weights [Bel96]. We are using the same hierarchical ontology that was presented earlier. In this way we are able to use the same hierarchical presentation model for documents and user profiles, which is an advantage when document matching is examined more closely.
In figure 5 we can see a customer and his user profile. It consists of the Subject, Author and Location dimensions, although only the Subject dimension carries meaning in this user profile. The Subject dimension divides into Politics, Financial and Sport. The Sport concept has the finest presentation because it has been refined down to dimension level +4. Concept weights are calculated in the same way as in the document ontology. The user profile uses the same ontology that was presented in figure 3.
A user profile must be created before a customer can use the SDI services provided by the CoMet System. Its purpose is to describe the long-term information needs of a customer in a particular subject area. A customer's personal information needs can vary on a daily basis, but these short-term changes in interest areas cannot be captured. Therefore long-term interest is the only form of interest that can be captured with sufficient accuracy, and with this method document recall and precision can be guaranteed.
Figure 5: User profile (a customer's user-specific ontology with Subject-dimension weights Subject 1, Politics 0.2, Financial 0.3, Sport 0.5, Ice hockey 0.2, Motor sport 0.3, Rally 0.1, F1 0.2, Team 0.05, General 0.15; the Author and Location dimensions carry no weights)
4.4 Matching
The idea of personalized SDI is based on distributing relevant documents to customers according to their individual information needs. A user profile is consulted when knowledge of a customer's interests is needed. The decision on document delivery is made according to the knowledge of the document's content and its relevance to the user. The matching problem arises when a decision on document delivery must be made according to a customer's user profile. This kind of decision problem exists in information retrieval (IR) as well. IR research has produced three retrieval models that are applicable also to SDI: the Boolean model, the vector space model and the probabilistic model [Bel92]. The first two models are used broadly in different SDI implementations.
In the Boolean model a user profile is a sequence of distinct words, i.e. terms. This keyword list contains the terms that characterize the content of the documents the customer considers relevant. A profile matches a document if all the terms in the user profile appear in the document [Yan94b]. Because of this, the Boolean model is said to be an exact-match model: it divides the document store into two parts, and a document belongs either to the matched or to the denied document set. There are two main disadvantages in this model. First, a document cannot have a relevance measure: it either belongs to the matched document set or it does not, and the order of the matched document set cannot be defined in any way. This is clearly a disadvantage because some documents are more relevant to a customer than others. The second disadvantage is the size of the matched document set. If the source document set is large, it is hard to define the user profile granularity in such a fashion that the matched document set has a manageable size; in many cases too many or too few documents are found relevant according to the Boolean model [Çet00]. The Boolean model was widely used in SDI implementations such as library systems in the early 1990s; nowadays the vector space model has gained more popularity.
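In code, the exact-match behavior amounts to a subset test. This is a generic illustration of the Boolean model, not a CoMet component.

```python
# Generic illustration of Boolean exact matching: a document matches only if
# it contains every profile term, and there is no relevance ranking.
def boolean_match(profile_terms, document_terms):
    return set(profile_terms) <= set(document_terms)

profile = ["formula", "circuit"]
print(boolean_match(profile, ["formula", "circuit", "race"]))  # True
print(boolean_match(profile, ["formula", "race"]))             # False
```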
In the vector space model a user profile is a term vector associated with term weights [Yan94a]. Text is treated as a vector in a multidimensional space whose dimensions consist of the words used to describe the content of a document. Stop lists are used to remove terms that appear often in text: terms like more, such and perhaps do not bear any meaning for the document content and are therefore included in stop-word lists. The significant terms must be stemmed. The term weights constitute an important refinement compared to the Boolean model: with these weights the importance of terms can be specified. Statistical methods can be used to decide which terms usually bear more meaning within the text; if inverted document frequency is used, terms that appear rarely in documents gain greater weights than terms found in many documents. When matching is carried out, the profile vector and the document vector are compared against each other using, for example, a cosine similarity measure. If the similarity is within a certain threshold, the document is distributed to the customer who owns the matching user profile. The calculated similarity measures also allow the matched documents to be ordered. This is clearly a positive feature, because customers can assume that documents at the top of the result list are more relevant than those at the end. The vector space model is said to fall into the best-match model category, because the matching is fuzzy and based on an estimation of document relevance instead of an exact yes-or-no value.
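A compact sketch of vector space matching with cosine similarity follows; stop-word removal and stemming are omitted, and the term weights are illustrative.

```python
# Sketch of vector space matching: profile and document are weighted term
# vectors, cosine similarity ranks documents, and a threshold filters them.
import math

def cosine(u, v):
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in set(u) | set(v))
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

profile  = {"f1": 0.7, "hockey": 0.3}
document = {"f1": 0.9, "circuit": 0.4}
print(cosine(profile, document))  # ~0.84: above e.g. a 0.5 threshold, deliver
```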
It is important to notice that the Boolean and vector space models do not fit our environment as they were introduced above. This is because our purpose is to distribute and augment information of various media types, like text, video and sound. If SDI is implemented according to these models, the actual document content is analyzed when the matching between user profile and document is calculated; this would be impossible when, for example, the decision on distributing video material is made. Instead, we rely on document metadata and user metadata when matching is calculated. Using metadata for matching brings a significant advantage: word stemming and stop lists for several different languages are not needed. One implementation can thus be used all over the world instead of different language versions for every mother-tongue area, which is also a commercially attractive aspect.
A user profile can be presented as a query. This is feasible especially in the Boolean model, but other solutions have been presented as well [Alt00]. In many cases the idea of presenting user profiles as queries is not actually a model like the Boolean or vector space model, but rather an implementation method for the chosen retrieval model. The same information can be included in database queries as is contained in the Boolean model, and databases offer an effective way of executing the actual matching task. We are using this method in the CoMet System; however, we are not using the Boolean model for our matching. In the next chapters we describe the matching implementation of our predecessor project SmartPush, and after that we present the matching method that will be used in the CoMet System.
4.4.1 SmartPush matching (asymmetric distance measure)
The SmartPush project stated that a symmetric distance measure is not capable of capturing the distribution of the customer's interest areas [Sav98]. Although the structures of a user profile and of document metadata are the same, they have different semantics. Usually a customer's user profile consists of many separate interest areas. If a customer is interested in high fidelity, apartment advertisements and the menus of nearby restaurants, it would be odd to state that the document most relevant to the user covers all three subject matters. Because of this, the distance measure should not be symmetric. The SmartPush project matched documents against parts of the user profile, so that a part of the user profile represents one interest area. If a document is found to match one part of the user profile, it is assumed to be relevant with respect to that part, instead of treating the user profile as one whole entity. With this, the role of the user profile becomes distinct from the role of the document, and the similarity measure is no longer symmetric, as it was in the vector space model.
The SmartPush project stored metadata in the same way as we introduced in chapter 4.1, and concept weights were calculated in the same way as we showed in chapter 4.2. The hierarchical ontology eases the calculation burden because the hierarchy itself defines the relations between the concepts of a dimension. With this advantage in mind, the SmartPush project introduced a calculation method in which the different dimension levels were exploited [Sav99]. Similarity calculations were done separately for every dimension level. First a coarse measure was calculated; it gives only a rough estimate of the closeness of the document and the user profile, by calculating how much the document misses the user profile. Another calculation took care of the areas where both the document and the user profile have weight; this was called the fine measure. Finally, a way to combine these two measures was introduced. The combined distance measure was calculated separately for every dimension level, so a summarization method was needed to calculate the final similarity. The dimension level determined how much each single level contributed to the combined measure. With this calculation method every document had some distance measure.
It is important to notice that SmartPush did not try to exclude documents from the result set, but rather to calculate the optimal distribution order. In the CoMet System this kind of arrangement will not work. The CoMet System has a service where augmentation features are offered to customers; information augmentation utilizes various heterogeneous document management systems to find relevant documents, videos and other background material for a news item. It would not make sense to calculate the augmentation order of every single document stored in the various document management systems. Therefore we need a calculation method that includes exclusionary features.
4.4.2 LCH-matching
The CoMet System needs a matching method that is capable of handling large amounts of metadata to enable information augmentation and data re-use. We use metadata for the matching calculations; this decreases the calculation burden because metadata is typically compact compared to document text or video data. Despite these advantages, the selected matching method still needs to be efficient. Typically, SDI research has concentrated on the recall and precision problem. These are important aspects, but they do not really matter if high recall and precision are achieved with a calculation method whose time consumption is untenable [Yan94a]. Acceptable response times must be guaranteed for information augmentation and data re-use. Different index structures can be used for matching optimization [Yan94a, Yan94b, Alt00].
We have decided to use the hierarchical ontology structure that was introduced in chapter 4.1. Once the ontology has been constructed, we can use it for modeling both the user profiles and the document profiles. This saves time, because the definition of a good ontology is a difficult task. The user profile was introduced in chapter 4.3 and the document profile in chapter 4.2. The use of a single ontology is an advantage because calculation methods are easier to design and implement when user profiles and document profiles share the same hierarchical model. Time-consuming ontology mappings are not needed, and the calculation process is more intuitive for a human observer than in a situation where the user and document profiles use different ontologies.
The CoMet System compares the document profiles against the user profiles and decides on the closeness of these two entities. The first step is to find the largest combined hierarchy (LCH): the largest hierarchy that the user profile and the document profile share in the ontology. In this way we can separate the potentially relevant documents from the discarded document set. The LCH can be examined as a whole, or its dimensions can be observed independently. Dimensions can be divided further into concept branches. As we can see from figure 6, a dimension can split into different subject matters, like Sport, and further into Motor sport and Ice hockey. These concept branches form separate subject areas that are relevant to this particular customer. When these areas are inspected separately, better augmentation recall and precision can be obtained than when the whole LCH is treated as one entity. Depending on the service (augmentation, matching, data re-use or story chain detection), different matching methods can be used. The matching granularity can therefore differ from one service model to another, which enables us to fine-tune the matching process depending on the task the CoMet System is working on.
Next we present a simplified example of LCH detection and of the final matching calculation. In figure 6 we have two profiles: the user profile and the document profile. They are already familiar, because the same profiles were used in chapters 4.2 and 4.3 when the hierarchical document and user profiles were introduced. When these two profiles are inspected, the largest combined hierarchy can be found in the center of figure 6. The inspection is rather efficient because hierarchical trees are easy to examine. The resulting LCH has only one branch, but this does not have to be the case: similarities can be found in only one dimension, but in a more complex situation the LCH reaches over several dimensions. The final stage involves the actual calculation. We can see that the calculation is relevant only from dimension level +1 downwards, because the dimension level itself is always 1 for both profiles. The calculated elements would be Sport (1 vs. 0.5), Motor sport (1 vs. 0.3), F1 (0.8 vs. 0.2), Team (0.8 vs. 0.05) and General (0.2 vs. 0.15). The comparison of these five concepts yields the final similarity measure in this simplified example. In this manner we have been able to exclude the non-relevant documents from the result set; a similarity measure is then calculated for those documents that have an LCH with a user profile, and the result set can be ordered according to the similarity results.

Figure 6: Largest combined hierarchy matching (the document-specific and user-specific ontologies of figures 4 and 5 are inspected; their largest combined hierarchy, Sport 1 vs. 0.5, Motor sport 1 vs. 0.3, F1 0.8 vs. 0.2, Team 0.8 vs. 0.05 and General 0.2 vs. 0.15 under Subject, is used to calculate the match for delivery to the customer)
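A sketch of the LCH idea under the flat dictionary representation used in the earlier examples follows. The report leaves the final similarity formula open, so a simple weight product over the shared concepts stands in for it here.

```python
# Sketch of LCH matching: the largest combined hierarchy is taken as the set
# of concepts weighted in BOTH profiles. Documents with an empty LCH are
# excluded; the rest are ranked by a stand-in similarity over shared concepts.
def lch(user, doc):
    return {c for c in user if user[c] > 0 and doc.get(c, 0) > 0}

def lch_similarity(user, doc):
    shared = lch(user, doc)
    if not shared:
        return None  # no combined hierarchy: exclude the document
    return sum(user[c] * doc[c] for c in shared)

user = {"Sport": 0.5, "Motor sport": 0.3, "F1": 0.2,
        "Team": 0.05, "General": 0.15, "Ice hockey": 0.2}
doc  = {"Sport": 1.0, "Motor sport": 1.0, "F1": 0.8,
        "Team": 0.8, "General": 0.2}

print(sorted(lch(user, doc)))     # the five concepts compared in figure 6
print(lch_similarity(user, doc))  # ranking score; None would mean exclusion
```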
4.4.3 Weighted LCH-matching
The depth of the hierarchy used has a significant effect on the expressive power of the ontology. This can be seen in figure 3, where the Subject dimension is rather deep (dimension level +6). At dimension level +1 the Sport concept is introduced; by dimension level +3, Sport has been sharpened into the F1 concept. If an LCH is found at dimension level +3, i.e. F1, it is more significant than an LCH that reaches only to dimension level +1, i.e. Sport; the expressive power of dimension level +3 is quite strong in our ontology. Presume that two documents exist which both have an LCH with the user profile in only one dimension, the LCHs are located in the same branch, but one reaches a deeper dimension level. Plain LCH-matching ranks them as equally similar to the user profile. In these circumstances the document with the deeper LCH should be ranked above the other. This can be arranged by emphasizing those matching results that are calculated at deeper dimension levels.
Besides excluding documents from the result set, the depth of the LCH can be exploited in an inverted fashion. If a user profile is very limited, or contains only a few deep branches, the CoMet System benefits from user profile generalization. Let us presume that only one branch has been defined for a customer's user profile, and that this branch is the same as the LCH in figure 6. The customer is obviously very interested in F1 news items. If the first weighted-LCH results exclude all news items from the result set, user profile generalization can be exploited: by emphasizing the similarity calculation weight used at the deepest dimension level −1, a broader view of the defined user profile ontology can be constructed. If a customer is interested in F1 news items, one can assume that he would like to have other motor sport related news items when F1 news items are not available. This kind of feature could be used as a "see also" type of recommendation service: besides being shown a relevant document, a customer can broaden his or her view by examining the articles that belong to a nearby concept in the system's hierarchical ontology.
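As a sketch of how this generalization could work, the emphasis can simply be shifted one level up in the user profile. The boost parameter and the choice of the emphasized level are assumptions made for illustration; the report itself does not fix them.

# A sketch of user profile generalization; boost and the emphasized
# level are illustrative assumptions only.

def generalized_similarity(doc, user, boost=2.0):
    lch = set(doc) & set(user)
    if not lch:
        return None
    # Emphasize the dimension level just above the deepest one in the
    # user profile, so documents matching a nearby, more general concept
    # (e.g. other motor sport items for an F1 fan) rise in the ranking
    # and can serve as "see also" recommendations.
    target = max(len(path) for path in user) - 1
    return sum((boost if len(path) == target else 1.0)
               * (1.0 - abs(doc[path] - user[path]))
               for path in lch)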
5 Conclusions and future work
We have presented the need for selective dissemination of information (SDI) in a publishing company. This problem area covers, for example, electronic publications such as online newspapers. External news providers and internal content creators create news items and other documents such as pictures and sound material. Distribution of a news item is based on metadata, which is a compact representation of the document content.
We then focused on the metadata presentation model. We presented a hierarchical ontology structure as a container for metadata information. This model is used for storing the customers' user profiles and the document profiles. LCH-matching (Largest Combined Hierarchy) is used for matching user profiles against document profiles, and news items are distributed according to the calculation result of LCH-matching. We then refined the LCH-matching method into weighted LCH-matching, in which the hierarchical ontology is exploited to adjust the matching result. Because the depth of the hierarchy has a significant effect on the expressive power of the ontology, the depth of the LCH must be emphasized in the matching calculations.
Our goal was to define a content-management-independent metadata handling system that contains the information needed to provide SDI services in electronic publishing. The basic metadata structures and an approximate description of the CoMet System architecture were presented in this paper. The distribution architecture has been specified in more detail, but that information was not included here. Future challenges include a more detailed description of weighted LCH-matching and especially the architectural design of the distributed metadata database. The metadata database will also be responsible for calculating the matching of user profiles and document profiles.
This is a very important issue in our future work. According to our research on matching implementations, it will be a novel way of performing the matching calculation, and a relational database will provide a powerful facility for handling large amounts of metadata information.
References
Alt00 Altinel, M., Franklin, M., Efficient Filtering of XML Documents for Selective
Dissemination of Information, In Proceedings of the 26th Conference on Very
Large Databases, September 10 – 14, Cairo, Egypt, pages 53 – 64, Morgan
Kaufmann 2000
Bel92 Belkin, N., Croft W., Information Filtering and Information Retrieval: Two Sides
of the Same Coin?, Communications of the ACM, Vol. 35, No. 12, December,
pages 29 – 38, ACM 1992
Bel96 Bell, T., Moffat, A., The Design of a High Performance Information Filtering
System, In Proceedings of the 19th ACM SIGIR conference on Research and
Development in Information Retrieval, August 18 – 22, Zurich Switzerland,
ACM 1996
Bra98 Bharat, K., Kamba, T., Albers, M., Personalized, interactive news on the Web, Multimedia Systems, Vol. 6, September, pages 349 – 358, Springer-Verlag 1998
Çet00 Çetintemel, U., Franklin M., Giles C., Self-Adaptive User Profiles for Large-Scale Data Delivery, In Proceedings of the 16th International Conference on Data Engineering, pages 622 – 633, IEEE 2000
Che95 Chesnais, P., Mucklo, M., Sheena, J., The Fishwrap Personalized News System,
In Proceedings of the 2nd International Workshop on Community Networking
Integrating Multimedia Services to the Home, June 20 – 22, Princeton, New
Jersey USA, pages 275 – 282, IEEE 1995
Dup99 Dublin Core Metadata Initiative., Dublin Core Metadata Element Set, Version
1.1: Reference Description, 2.7.1999, http://purl.oclc.org/dc/documents/rec-dces-
19990702.htm
Fol92 Foltz, P., Dumais, S., Personalized Information Delivery: An Analysis of
Information Filtering Methods, Communications of the ACM, Vol. 35, No. 12,
December, pages 51 – 60, ACM 1992
For00 Forsberg, K., et al., Extensible use of RDF in a Business Context, In Proceedings of the 9th International World Wide Web Conference, The Web: The Next Generation, Amsterdam, May 15 – 19, 2000
Grö98 Grötschel, M., Lügger, J., Scientific Information Systems and Metadata, Report
SC 98-26 (October 1998), Konrad-Zuse-Zentrum für Informationstechnik Berlin,
Germany
Haa94 Haake, A., Hüser, C., Reichenberger, K., The individualized electronic
newspaper: an example of an active publication. Electronic Publishing, Vol. 7(2),
June, pages 89 – 111, Wiley 1994
Ipt00 The International Press Telecommunications Council., News Industry Text
Format (NITF), www.nitf.org, 2000
Jok00 Jokela, S., Turpeinen, M., Sulonen, R., Ontology Development for Flexible
Content, HICSS-33 Minitrack on Systems Support for Electronic Business on the
Internet, January 4 – 7, 2000
Kam97 Kamba, T., Sakagami, H., Koseki, Y., Anatagonomy: a Personalized Newspaper
on the World Wide Web, International Journal of Human-Computer Studies, Vol.
46, No. 6, June, pages 789 – 803, Academic Press 1997
Kur99 Kurki, T., et al., Agents in Delivering Personalized Content Based on Semantic Metadata, Intelligent Agents in Cyberspace, Papers from the AAAI Spring Symposium, Technical Report SS-99-03, AAAI Press 1999
Lab99 Labrou, Y., Finin, T., Yahoo! As an Ontology – Using Yahoo! Categories to Describe Documents, In Proceedings of the 8th International Conference on Information and Knowledge Management, November 2 – 6, Kansas City, MO USA, pages 180 – 187, ACM 1999
Mar98 Marshall, C., Making Metadata: a Study of Metadata Creation for a Mixed
Physical-Digital Collection, In Proceedings of the 3rd ACM Conference on
Digital libraries, June 23 - 26, Pittsburgh, PA USA, pages 162 – 171, ACM 1998
Onl01 Online Newspapers., www.onlinenewspapers.com, 2001
Saa99 Saarela, J., The Role of Metadata in Electronic Publishing, D.Sc. thesis, Acta Polytechnica Scandinavica, Ma 102, 1999
Sav98 Savia, E., Kurki, T., Jokela, S., Metadata Based Matching of Documents and User Profiles, in Human and Artificial Information Processing, Proceedings of the 8th Finnish Artificial Intelligence Conference, pages 61 – 69, Finnish Artificial Intelligence Society 1998
Sav99 Savia, E., Mathematical Methods for a Personalized Information Service, Master's Thesis, Helsinki University of Technology, Institute of Mathematics, HUT 1999
W3C00a W3C., Resource Description Framework (RDF) Schema Specification, W3C Candidate Recommendation, 27 March 2000, http://www.w3.org/TR/2000/CR-rdf-schema-20000327
W3C00b W3C., Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation, 6 October 2000, http://www.w3.org/TR/2000/REC-xml-20001006
W3C01 W3C., Document Object Model (DOM), http://www.w3.org/DOM
Xml01 XML and the news industry, www.xmlnews.org, 2001
Yan94a Yan, T., Garcia-Molina H., Index Structures for Information Filtering Under the
Vector Space Model, In Proceedings of the 10th International Conference on Data
Engineering, February 14 – 18, Houston, Texas USA, pages 337 – 347, IEEE
1994
Yan94b Yan, T., Garcia-Molina H., Index Structures for Selective Dissemination of Information Under the Boolean Model, ACM Transactions on Database Systems, Vol. 19, No. 2, June, pages 332 – 364, ACM 1994