An Information Storage System for Large-Scale Knowledge Resources

Haruo Yokota

Global Scientific Information and Computing Center, Tokyo Institute of Technology

[email protected]

March 9, 2004 Haruo Yokota (GSIC, Titech) 2

Outline
• The goal and requirements of an information storage system for this project
• Our approach to realize the goal
• Examples of services using the system
• Software and hardware configurations of the system
• Concluding remarks

Goal
• Our mission: provide an infrastructure to support interdisciplinary research
  – In the Tokyo Institute of Technology 21st Century COE Program "Framework for Systematization and Application of Large-scale Knowledge Resources"
⇒ Develop an advanced information storage system
  – To preserve and utilize the large-scale knowledge resources for researchers and students

Flexibility and Extensibility
• These are key issues in implementing the system
• The stored knowledge resources are expected to be used in a wide variety of ways
  – The configurations and functions of the system should be flexible
• The numbers of knowledge resources and access requests are expected to increase rapidly
  – The system should be extensible

Information Component Examples
• Written texts & drawn figures
  – Frequently combined in documents
  – Data formats: flat text, PostScript, PDF, PPT, etc.
• Printed materials
  – e.g. pictures, scanned historical documents (The Tale of Genji)
  – Data formats: bitmap data, JPEG, GIF, etc.
• Recorded video streams & spontaneous speech
  – To support e-learning and speech recognition research
  – Data formats: MPEG, RealMedia, Windows Media, QuickTime, etc.
• Structured information
  – e.g. corpus, experimental data, etc.
  – Data formats: tables (relational databases), XML, etc.
• We may soon have other data formats and even other media (e.g. 3D contents)

Targets To Be Stored
[Figure: the knowledge resources are of wide variety (written texts, drawn figures, printed materials, video streams, speeches) and come in many data formats, so a storage system should be flexible and extensible.]

Approach
• To handle this wide and dynamic variety of information components, we combine two types of systems:
  – An advanced information storage system: KnowledgeStore (KS)
    • Providing common functions for handling the variety of data formats
  – External systems
    • Dedicated to special applications based on the knowledge resources

External Systems
• Responsible for executing independent applications
  – Using the information components preserved in the KnowledgeStore
• Changes in the environment
  – Changes in user requirements
  – Appearance of new types of materials
  can easily be coped with by modifying some external systems locally or adding new ones

KnowledgeStore (KS)
• Provides web-service APIs to the external systems
  – As well as ordinary web interfaces for interactive users
  – A web service is a standardized interface between software components using HTTP and SOAP
• Provides common functions for these external application systems via the web-service APIs
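
To make the HTTP and SOAP interface above more concrete, the following is a minimal sketch of how an external system might build a SOAP 1.1 request body. The slides do not list the actual KS operations, so the operation name `SearchContents`, its `folder` and `keyword` parameters, and the namespace `urn:example:knowledgestore` are all hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
KS_NS = "urn:example:knowledgestore"  # hypothetical namespace, not taken from the slides


def build_search_request(folder: str, keyword: str) -> bytes:
    """Build a SOAP 1.1 envelope for a hypothetical 'SearchContents' operation."""
    ET.register_namespace("soap", SOAP_ENV)
    ET.register_namespace("ks", KS_NS)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The operation element and its children are illustrative assumptions.
    operation = ET.SubElement(body, f"{{{KS_NS}}}SearchContents")
    ET.SubElement(operation, f"{{{KS_NS}}}folder").text = folder
    ET.SubElement(operation, f"{{{KS_NS}}}keyword").text = keyword
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)
```

An external system would send these bytes in an HTTP POST with a `text/xml` content type to whatever endpoint the KS exposes; the endpoint is not given in the slides and is therefore left out here.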

External System Interface
[Figure: External Systems 1 to n access the KnowledgeStore through web-service APIs over HTTP & SOAP; the KnowledgeStore provides the common functions on top of the knowledge resources.]

Common Functions
• Three types of functions:
  – Contents management functions
  – Contents retrieval functions
  – Access management functions
• To implement them, we prepare
  – Hierarchical collections as folders
    • With user access authorization
  – Metadata for each content
    • With reference to Dublin Core (a metadata standard)

Contents Management
• User-defined metadata
  – Data types
    • Based on system-predefined basic data types
    • Integer, Floating Point Number, Date, Character String, File, XML, RDB, Video Stream, URL, Reference, etc.
  – Contents attributes
    • Title, Creator, Subject, Description, Created Date, Modified Date, Language, Versions, etc.
  – Contents definition
    • Collections of contents attributes
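
The relation between data types, contents attributes, and contents definitions can be sketched as a small data model. This is only an illustration under assumptions: the class names, the `required` flag, and the mapping of the slide's basic data types onto Python types are not from the slides.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any

# A few of the basic data types named above, mapped to Python types.
# The mapping itself is an assumption made for illustration.
BASIC_TYPES = {"Integer": int, "FloatingPoint": float, "Date": date,
               "String": str, "URL": str}


@dataclass(frozen=True)
class ContentsAttribute:
    name: str               # e.g. "Title", "Creator"
    data_type: str          # a key of BASIC_TYPES
    required: bool = False  # hypothetical flag, not mentioned in the slides


@dataclass
class ContentsDefinition:
    """A contents definition is a collection of contents attributes."""
    name: str
    attributes: list = field(default_factory=list)

    def validate(self, metadata: dict) -> list:
        """Return a list of problems found in a metadata record."""
        errors = []
        for attr in self.attributes:
            if attr.name not in metadata:
                if attr.required:
                    errors.append(f"missing required attribute: {attr.name}")
                continue
            value: Any = metadata[attr.name]
            if not isinstance(value, BASIC_TYPES[attr.data_type]):
                errors.append(f"{attr.name}: expected {attr.data_type}")
        return errors
```

A definition for presentation slides could then list `Title` as a required String and `Created Date` as a Date, and `validate` would report records that do not fit.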

Contents Retrieval
• Search functions for the metadata
  – Search targets can be specified by the data types and contents attributes
• Full-text search functions for text contents
• Structural search functions for the RDB and XML contents
• Combinations of these functions in a folder or across multiple folders
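
Combining an attribute search with a full-text search over one or more folders might look like the sketch below. It is an in-memory illustration only: the `Content` record and `search` function are hypothetical, and plain substring matching stands in for whatever full-text index the KS actually uses.

```python
from dataclasses import dataclass


@dataclass
class Content:
    folder: str       # hierarchical path such as "/coe/papers"
    attributes: dict  # metadata, e.g. {"Language": "en"}
    text: str = ""    # extracted text; empty for non-text contents


def search(contents, folders, attribute_filter=None, keywords=()):
    """Return contents under any of `folders` that satisfy every filter."""
    attribute_filter = attribute_filter or {}
    hits = []
    for c in contents:
        # A content is in scope if it sits in a listed folder or below it.
        in_scope = any(
            c.folder == f or c.folder.startswith(f.rstrip("/") + "/")
            for f in folders
        )
        attrs_ok = all(c.attributes.get(k) == v
                       for k, v in attribute_filter.items())
        # Substring matching is a stand-in for a real full-text index.
        text_ok = all(kw.lower() in c.text.lower() for kw in keywords)
        if in_scope and attrs_ok and text_ok:
            hits.append(c)
    return hits
```

Passing several folders gives the "across multiple folders" case, and leaving out either filter gives a pure attribute search or a pure full-text search.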

Access Management
• A number of services have their own user authorization mechanisms
• Requiring a password every time a user enters a service would be troublesome for users
• To avoid this, we adopt single sign-on (SSO) mechanisms for the services
• Provides user group management
  – Administrators, External System Developers, Contents Registrars, General Users, etc.
• Preserves access logs
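
Group-based authorization on hierarchical folders could be checked as in the following sketch. The slides do not describe how the KS resolves permissions, so the rule used here (the nearest enclosing folder that names a group decides for that group), the `FOLDER_ACL` table, and the permission names are assumptions.

```python
from pathlib import PurePosixPath

# Hypothetical access lists: folder -> {group: set of permissions}.
FOLDER_ACL = {
    "/": {"Administrators": {"read", "write"}},
    "/coe": {"General Users": {"read"}},
    "/coe/drafts": {
        "General Users": set(),  # explicitly no access below this folder
        "External System Developers": {"read", "write"},
    },
}


def is_allowed(user_groups, folder, permission):
    """Grant access if any of the user's groups holds the permission.

    For each group, the nearest enclosing folder that mentions the group
    decides (an assumed inheritance rule, not taken from the slides).
    """
    path = PurePosixPath(folder)
    for group in user_groups:
        for candidate in [path, *path.parents]:
            acl = FOLDER_ACL.get(str(candidate), {})
            if group in acl:
                if permission in acl[group]:
                    return True
                break  # nearest entry for this group denies; try next group
    return False
```

With single sign-on, the set of groups would be resolved once at login and reused by every service, so each service only needs a check of this kind.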

Overview of KS
[Figure: the advanced information storage system KnowledgeStore holds contents and metadata and consists of Contents Retrieval, Contents Management, and Access Management, each with a web interface and a web-service interface. Users and External Systems 1 to n access it through single sign-on.]

Contents Management I/F

Access Authorization is specified for each hierarchical folder

Contents Creation I/F

Metadata can be specified for each content

Attribute Search Interface

Contents can be searched with contents attributes

Examples of External Systems
• UPRISE (by our group)
  – A presentation scene retrieval system
• Asunaro (by the group of Prof. Nishina)
  – An e-learning system for studying Japanese
• PRESRI (by the group of Prof. Okumura)
  – A citation index collecting research papers from the WWW
• Research Mining (by our group)
  – A system that discovers macro-flows of research by applying a mining method to databases of papers
• And so on

UPRISE
• UPRISE: Unified Presentation slide Retrieval by Impression Search Engine
• It stores presentation slides together with video recordings of the presentations
  – To support retrieval services based on impression indicators
    • Expressing how well a slide matches the given keywords

Impression Indicators
• Consider properties of presentation slides
  – Position impression indicator: Ip
    • Utilizes the structure of a presentation slide
  – Duration impression indicator: Id
    • Utilizes time information to select slides, especially when the same slide appears multiple times because of backtracking or reuse
  – Context impression indicator: Ic
    • Utilizes the context of a slide appearance sequence

Evaluation of the Indicators
• Extracting the relevant appearance of a slide
  – From multiple appearances of the same slide caused by backtracking or reuse
• Asked ten testers to specify a target appearance for keywords they chose
  – To determine the relevant appearance among multiple appearances in videos
  – Each tester performed six or seven experiments using different sets of keywords
• Used 20 presentations containing 849 slides and 27,209 keywords

Comparison with tf.idf
[Figure: scatter plot of the rankings of the tester-specified slide appearances. X axis: ranking by UPRISE (0 to 40); Y axis: ranking by tf.idf (0 to 40).]
• The graph compares the rankings of the slide appearances specified by the testers.
  – X: impression indicator
  – Y: traditional tf.idf
• A point above the diagonal line indicates that the specified appearance is ranked higher in the list by the impression indicator; a point below the line indicates that it is ranked lower.
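
For readers unfamiliar with the baseline, a common form of tf.idf ranking is sketched below. The slides do not state which tf.idf variant was used in the comparison, so the weighting here (term frequency normalized by slide length, times the log of inverse document frequency) and the function name `tfidf_rank` are assumptions. The impression indicators themselves are not reproduced, because their formulas are not given in the slides.

```python
import math
from collections import Counter


def tfidf_rank(slides, keywords):
    """Rank slide ids by the sum of tf * idf over the query keywords.

    `slides` maps a slide id to its text. Tokenization is plain
    whitespace splitting, which is a simplification.
    """
    tokenized = {sid: text.lower().split() for sid, text in slides.items()}
    n = len(tokenized)
    df = Counter()  # number of slides containing each term
    for tokens in tokenized.values():
        df.update(set(tokens))
    query = [kw.lower() for kw in keywords]
    scores = {}
    for sid, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[sid] = sum(
            (tf[k] / len(tokens)) * math.log(n / df[k])
            for k in query
            if df[k] and tokens  # skip unseen terms and empty slides
        )
    return sorted(scores, key=lambda sid: scores[sid], reverse=True)
```

A ranking like this depends only on the slide text, so every appearance of the same slide in a video gets the same score; that limitation is what the duration and context indicators are meant to address.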

Ranking Histogram
[Figure: histogram of the ranking of target slides (bins 1, 6, 11, 16, >21) against frequency (0 to 35). Series: It and tf.idf.]
The histogram illustrates that many selected appearances are listed earlier by our approach, and about half of them are ranked at the top.

Experimental Results
• Calculating precision
  – Precision of the proposed method is 59.4%
  – while that of tf.idf is 32.2% under the same conditions
• The results demonstrate that the proposed impression indicator is effective in searching for a presentation scene in presentation databases

UPRISE & KS
• Presentation slides and recorded video streams are stored in the KnowledgeStore
• UPRISE utilizes several functionalities of KS
  – Content retrieval of slides and videos
  – Metadata and full-text search for slides
  – Dedicated indexes corresponding to the impression indicators
  – Single sign-on function

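Dataflow Between UPRISE & KS
[Figure: components shown are presentation slides (ppt), digital video, conversion into pdf, conversion into avi, extraction of structure information, synchronization, and creation of metadata and index. Documents, video streams, metadata, and the index are held in the KnowledgeStore, which is accessed through web-service APIs both by these steps and by the search & user interface.]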

Synchronization
• To synchronize the presentation slides and the video
  – We have developed a system that applies character recognition techniques to the video stream
• It enables past presentations to be retrieved
• It also utilizes information from a laser pointer
  – In collaboration with Fujitsu Labs.

Search Interface of UPRISE
• When keywords are specified, UPRISE searches the metadata table for tuples matching the given keywords, using the index for the table
  – UPRISE calculates the impression indicators with the given keywords
• A web-based interface showing thumbnails of the slides has been developed
  – The size of each thumbnail varies with the impression indicator
  – When a thumbnail is clicked, the video starts from the portion showing that slide
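
The thumbnail behaviour described above can be sketched as a mapping from scored slide appearances to display records. The linear scaling, the pixel bounds, and the record fields (`slide`, `start_s`, `score`) are assumptions made for illustration; the slides only say that the size varies with the impression indicator and that a click starts the video at the matching portion.

```python
def build_thumbnails(appearances, min_px=64, max_px=256):
    """Turn scored slide appearances into thumbnail display records.

    Each appearance is a dict with a slide number, the start time of the
    appearance in the video (seconds), and an impression-indicator score.
    """
    if not appearances:
        return []
    scores = [a["score"] for a in appearances]
    lowest = min(scores)
    span = max(scores) - lowest
    thumbnails = []
    for a in appearances:
        # Scale linearly between the bounds; equal scores all get max_px.
        ratio = (a["score"] - lowest) / span if span else 1.0
        thumbnails.append({
            "slide": a["slide"],
            "width_px": round(min_px + ratio * (max_px - min_px)),
            "video_start_s": a["start_s"],  # playback position on click
        })
    return thumbnails
```

Because the records are built per appearance rather than per slide, a slide shown twice in a talk yields two thumbnails with different sizes and different playback positions.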

Snapshot of Searching a Slide
[Screenshot: a keyword is input and a thumbnail is clicked; the video and the slides are then displayed in synchronization.]
Each line indicates part of a sequence of presented slides. The size of each thumbnail is varied by the impression indicator.

Other External Systems & KS
• Asunaro (an e-learning system for studying Japanese)
  – KS manages the multilingual manuscripts and databases for language translation
  – Related materials, such as video streams for foreign language study, are also stored in the KS
• PRESRI (a citation index)
  – KS stores the research papers and provides search functions for them
  – PRESRI can also use the RDB in KS to store the citation information
• Research Mining (a discoverer of research macro-flows)
  – It can share the research papers and the PRESRI citation information stored in the KS

Hardware Configuration (1/3)
• Use a Fibre Channel (FC) switch to configure a storage area network (SAN)
  – 2 Gbps, 16-port FC switch
  – To connect a number of servers with FC-RAID
    • which has a storage virtualization mechanism that manages storage space by sharing pooled logical storage volumes
  – Total capacity is 9 TB
• The configuration makes it easy to scale up
  – storage capacity
  – processing performance
  by changing the number of disks and servers.

Hardware Configuration (2/3)
• Assign services to a number of servers:
  – video stream servers
  – web servers
  – relational database and XML management servers
  – single sign-on servers
  – contents creation servers
  so that servers can be extended freely
• This allows the processing performance to be adjusted service by service.

Hardware Configuration (3/3)
• To preserve the reliability of the system
  – Additional ATA RAIDs for backup
    • Nowadays, cheap RAIDs tend to be used as backup devices instead of magnetic-tape drives from a cost-performance point of view
    • Three 3 TB RAIDs (9 TB in total)
    • Connected to the SAN
    • Automatic backup from the primary RAID to the backup RAID without using the LAN (LAN-free backup)
  – Uninterruptible power supplies (UPSs)

Hardware Configuration Overview
[Figure: clients reach the system from the Internet. The SSO server, RDB / XML server, contents creation server (with scanner, camera, etc.), web server, and video stream server are connected by a LAN (Gbit Ethernet switch) and by a SAN (2 Gbps 16-port FC switch) to the 9 TB RAIDs, which hold video/sound contents, documents, metadata, and indexes, and to the 9 TB backup RAIDs.]

Specifications of Servers (1/2)
• Video Stream Server
  – UltraSPARC IIIi 1.062 GHz x 4
  – 8 GB memory and 36 GB x 4 HDD
  – Solaris 8 and Helix Universal Server
• Web Server
  – UltraSPARC IIIi 1.062 GHz x 4
  – 8 GB memory and 36 GB x 4 HDD
  – Solaris 8 and Sun ONE Application Server 7
• RDB / XML Server
  – Pentium Xeon 3.06 GHz x 2
  – 4 GB memory and 18.6 GB HDD
  – Red Hat Enterprise Linux Advanced Server 3
  – Oracle 9i and Tamino XML DB

Specifications of Servers (2/2)
• Single Sign-On Server for Video Stream Access
  – Pentium Xeon 3.06 GHz x 2
  – 1 GB memory and 18.6 GB HDD
  – Red Hat Linux 9
• Single Sign-On Server for Web Access
  – Pentium Xeon 3.06 GHz x 2
  – 2 GB memory and 36.4 GB HDD
  – Red Hat Linux 9
• Contents Creation Server
  – Pentium Xeon 3.06 GHz x 2
  – 2 GB memory and 18.6 GB HDD
  – Windows XP Professional

Very Hot News
• The hardware is being set up TODAY!

Switches and Servers

Concluding Remarks (1/2)
• We proposed combining an information storage system with external systems through web-service APIs for the Titech COE LKR project.
  – The information storage system, named KnowledgeStore, provides common functions for handling the variety of data formats
  – The external systems execute special applications based on the knowledge resources, such as presentation retrieval, e-learning, and research macro-flow discovery

Concluding Remarks (2/2)
• The configuration realizes the flexibility and extensibility required for managing the large-scale knowledge resources.
• Currently
  – Implementing the KnowledgeStore and external systems
  – Considering enhancements of functions, such as version management.

Thanks for your attention