An Information Storage System for Large-Scale Knowledge Resources
Haruo Yokota
Global Scientific Information and Computing Center, Tokyo Institute of Technology
March 9, 2004 Haruo Yokota (GSIC, Titech) 2
Outline
• The goal and requirements of an information storage system for this project
• Our approach to realize the goal
• Examples of services using the system
• Software and hardware configurations of the system
• Concluding remarks
Goal
• Our mission: provide an infrastructure to support interdisciplinary research
  – In the Tokyo Institute of Technology 21st Century COE Program "Framework for Systematization and Application of Large-scale Knowledge Resources"
⇒ Develop an advanced information storage system
  – To preserve and utilize the large-scale knowledge resources for researchers and students
Flexibility and Extensibility
• These are the key issues in implementing the system
• The stored knowledge resources are expected to be used in a wide variety of ways
  – The configurations and functions of the system should be flexible
• The number of knowledge resources and access requests is expected to increase rapidly
  – The system should be extensible
Information Component Examples
• Written texts & drawn figures
  – Frequently combined in documents
  – Data formats: flat text, PostScript, PDF, PPT, etc.
• Printed materials
  – e.g. pictures, scanned historical documents (The Tale of Genji)
  – Data formats: bitmap data, JPEG, GIF, etc.
• Recorded video streams & spontaneous speech
  – To support e-learning and speech recognition research
  – Data formats: MPEG, RealMedia, Windows Media, QuickTime, etc.
• Structured information
  – e.g. corpora, experimental data, etc.
  – Data formats: tables (relational databases), XML, etc.
• We may soon have other data formats and even other media (e.g. 3D contents)
Targets To Be Stored
[Figure: knowledge resources of wide variety (written texts, drawn figures, printed materials, video streams, speeches) come in many data formats, so the storage system should be flexible and extensible]
Approach
• To handle this wide and dynamic variety of information components, we combine two types of systems:
  – An advanced information storage system: KnowledgeStore (KS)
    • Providing common functions for handling the variety of data formats
  – External systems
    • Dedicated to special applications based on the knowledge resources
External Systems
• Responsible for executing independent applications
  – Using the information components preserved in the KnowledgeStore
• Changes in the environment
  – Changes in user requirements
  – Appearance of new types of materials
  can easily be handled by locally modifying some external systems or adding new ones
KnowledgeStore (KS)
• Provides web-service APIs to the external systems
  – As well as ordinary web interfaces for interactive users
  – A web service is a standardized interface between software components using HTTP and SOAP
• Provides common functions for these external application systems via the web-service APIs
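As a rough illustration of how an external system might address such a web-service API, the sketch below builds a SOAP 1.1 request envelope in Python. The slides do not show the actual KnowledgeStore interface, so the namespace `urn:example:knowledgestore`, the operation name `SearchContents`, and its `keyword` and `folder` parameters are all hypothetical.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace (standard).
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical namespace for KnowledgeStore operations (not from the slides).
KS_NS = "urn:example:knowledgestore"


def build_search_request(keyword: str, folder: str) -> bytes:
    """Build a SOAP envelope for a hypothetical SearchContents operation."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The operation element and its parameters are assumptions for illustration.
    operation = ET.SubElement(body, f"{{{KS_NS}}}SearchContents")
    ET.SubElement(operation, f"{{{KS_NS}}}keyword").text = keyword
    ET.SubElement(operation, f"{{{KS_NS}}}folder").text = folder
    return ET.tostring(envelope, encoding="utf-8")


if __name__ == "__main__":
    print(build_search_request("storage", "/papers").decode("utf-8"))
```

An external system would typically send such an envelope in the body of an HTTP POST request and parse the SOAP response in the same way.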
External System Interface
[Figure: External System 1, External System 2, …, External System n access the web service APIs of the KnowledgeStore over HTTP & SOAP; the APIs expose the common functions over the knowledge resources]
Common Functions
• Three types of functions:
  – Contents management functions
  – Contents retrieval functions
  – Access management functions
• To implement them, we prepare
  – Hierarchical collections as folders
    • With user access authorization
  – Metadata for each content
    • Based on Dublin Core (a metadata standard)
Contents Management
• User-defined metadata
  – Data types
    • Based on system-predefined basic data types
    • Integer, Floating Point Number, Date, Character String, File, XML, RDB, Video Stream, URL, Reference, etc.
  – Contents attributes
    • Title, Creator, Subject, Description, Created Date, Modified Date, Language, Versions, etc.
  – Contents definition
    • Collections of contents attributes
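The relationship between data types, contents attributes, and contents definitions can be sketched as follows. This is only an illustration under assumptions: the class names, the `validate` method, and the mapping of the listed basic data types onto Python types are hypothetical, and only a few of the basic types are covered.

```python
from dataclasses import dataclass
from datetime import date

# Assumed mapping from a few of the predefined basic data types to Python types.
BASIC_TYPES = {
    "Integer": int,
    "Floating Point Number": float,
    "Date": date,
    "Character String": str,
    "URL": str,
}


@dataclass(frozen=True)
class Attribute:
    """A contents attribute: a name plus one of the basic data types."""
    name: str
    data_type: str


@dataclass(frozen=True)
class ContentsDefinition:
    """A contents definition: a named collection of contents attributes."""
    name: str
    attributes: tuple

    def validate(self, metadata: dict) -> list:
        """Return a list of problems found in a metadata record (empty if none)."""
        errors = []
        for attr in self.attributes:
            if attr.name not in metadata:
                errors.append(f"missing: {attr.name}")
            elif not isinstance(metadata[attr.name], BASIC_TYPES[attr.data_type]):
                errors.append(f"wrong type: {attr.name}")
        return errors


# Hypothetical definition for a presentation slide.
slide_definition = ContentsDefinition(
    name="PresentationSlide",
    attributes=(
        Attribute("Title", "Character String"),
        Attribute("Creator", "Character String"),
        Attribute("Created Date", "Date"),
    ),
)
```

A record such as `{"Title": "KS overview", "Creator": "Yokota", "Created Date": date(2004, 3, 9)}` passes validation, while a record lacking `Creator` is reported as incomplete.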
Contents Retrieval
• Search functions for the metadata
  – Search targets can be specified by data type and contents attribute
• Full-text search functions for text contents
• Structural search functions for the RDB and XML contents
• Combinations of these functions within a folder or across multiple folders
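How an attribute search and a full-text search might be combined over one or several folders can be sketched with an in-memory list. This is an assumption-laden toy: the `Content` class, the `search` function, and its parameters are hypothetical, and a real system would use indexes rather than a linear scan.

```python
from dataclasses import dataclass


@dataclass
class Content:
    folder: str      # hierarchical folder path, e.g. "/papers/2004"
    metadata: dict   # contents attributes, e.g. {"Language": "en"}
    text: str = ""   # extracted text used for full-text search


def in_folder(content_folder: str, folder: str) -> bool:
    """True if content_folder is the given folder or lies below it."""
    prefix = folder.rstrip("/") + "/"
    return content_folder == folder or content_folder.startswith(prefix)


def search(contents, folders=None, attributes=None, full_text=None):
    """Return the contents matching every condition that was supplied."""
    results = []
    for c in contents:
        if folders and not any(in_folder(c.folder, f) for f in folders):
            continue
        if attributes and any(c.metadata.get(k) != v for k, v in attributes.items()):
            continue
        if full_text and full_text.lower() not in c.text.lower():
            continue
        results.append(c)
    return results
```

Passing several entries in `folders` corresponds to a search across multiple folders, and omitting it searches everything.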
Access Management
• Many services have their own user authorization mechanism
• Requiring a password every time a user enters a service would be troublesome for users
• To avoid this, we adopt single sign-on (SSO) mechanisms for the services
• Provides user group management
  – Administrators, External System Developers, Contents Registrars, General Users, etc.
• Preserves access logs
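Group-based authorization on hierarchical folders, together with an access log, could look roughly like the sketch below. The `AccessManager` class and its methods are hypothetical, and the rule that a folder without its own setting inherits from the nearest ancestor is an assumption, not something the slides state.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


class AccessManager:
    """Toy folder-level authorization with an access log (illustrative only)."""

    def __init__(self):
        self.acl = {}   # folder path -> set of user groups allowed to access it
        self.log = []   # (timestamp, group, folder, allowed) tuples

    def grant(self, folder: str, group: str) -> None:
        self.acl.setdefault(PurePosixPath(folder), set()).add(group)

    def is_allowed(self, group: str, folder: str) -> bool:
        path = PurePosixPath(folder)
        allowed = False
        # Assumption: the nearest folder with an explicit setting decides.
        for candidate in (path, *path.parents):
            if candidate in self.acl:
                allowed = group in self.acl[candidate]
                break
        # Every check is recorded, whether it succeeds or not.
        self.log.append((datetime.now(timezone.utc), group, str(path), allowed))
        return allowed
```

With `grant("/papers", "General Users")`, a request by General Users for `/papers/2004` is allowed through inheritance, while a request for `/videos` is denied, and both appear in the log.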
Overview of KS
[Figure: users and External Systems 1 … n pass through single sign-on to the advanced information storage system KnowledgeStore; its contents retrieval, contents management, and access management functions each offer a web interface and a web service interface over the contents and metadata]
Contents Management I/F
Access Authorization is specified for each hierarchical folder
Contents Creation I/F
Metadata can be specified for each content
Attribute Search Interface
Contents can be searched with contents attributes
Examples of External Systems
• UPRISE (by our group)
  – A presentation scene retrieval system
• Asunaro (by the group of Prof. Nishina)
  – An e-learning system for studying Japanese
• PRESRI (by the group of Prof. Okumura)
  – A citation index collecting research papers from the WWW
• Research Mining (by our group)
  – A system that discovers macro-flows of research by applying a mining method to databases of papers
• And so on
UPRISE
• UPRISE: Unified Presentation slide Retrieval by Impression Search Engine
• It stores presentation slides together with video recordings of the presentations
  – To support retrieval services based on the impression indicators
    • Expressing how well a slide matches the given keywords
Impression Indicators
• Consider properties of presentation slides
  – Position Impression Indicator: Ip
    • Utilizes the structure of a presentation slide
  – Duration Impression Indicator: Id
    • Utilizes time information to select slides, especially when the same slide appears multiple times because of backtracking or reuse
  – Context Impression Indicator: Ic
    • Utilizes the context of a slide appearance sequence
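The slides do not give formulas for Ip, Id, or Ic, so the sketch below does not compute them. It only illustrates the kind of raw input a duration-based indicator would need: given a timeline of slide changes, it collects every appearance of each slide with its start time and duration, including repeated appearances caused by backtracking. The function name and the timeline format are assumptions.

```python
from collections import defaultdict


def slide_appearances(timeline, video_end):
    """Group slide appearances by slide.

    timeline:  list of (start_seconds, slide_id), sorted by start time
    video_end: end time of the video in seconds
    Returns {slide_id: [(start_seconds, duration_seconds), ...]}.
    """
    appearances = defaultdict(list)
    for i, (start, slide_id) in enumerate(timeline):
        # An appearance lasts until the next slide change or the end of the video.
        end = timeline[i + 1][0] if i + 1 < len(timeline) else video_end
        appearances[slide_id].append((start, end - start))
    return dict(appearances)
```

In the timeline `[(0, 1), (60, 2), (150, 1), (170, 3)]`, slide 1 appears twice because the presenter went back to it, and the two appearances have different durations.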
Evaluation of the Indicators
• Extracting the relevant appearance of a slide
  – from multiple appearances of the same slide caused by backtracking or reuse
• Asked ten testers to specify a target appearance for keywords they chose
  – To determine the relevant appearance among multiple appearances in the videos
  – Each tester ran six or seven experiments using different sets of keywords
• Used 20 presentations containing 849 slides and 27209 keywords
Comparison with tf.idf
[Figure: scatter plot; X axis: ranking by UPRISE (0 to 40); Y axis: ranking by tf.idf (0 to 40)]
• The graph compares the rankings of the slide appearances specified by the testers
  – X: impression indicator
  – Y: traditional tf.idf
• A point above the diagonal line indicates that the specified appearance is ranked higher in the list by the impression indicator; a point below the line indicates that it is ranked lower
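For reference, a tf.idf baseline of the kind used in this comparison can be sketched as below. The slides do not say which tf.idf variant was used, so this assumes the textbook form (raw term frequency multiplied by log(N / document frequency)) and treats each slide's text as one document; the function name and input format are hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(slides, keywords):
    """Rank slide ids by the summed tf.idf score of the keywords.

    slides:   {slide_id: text}
    keywords: list of query terms
    """
    n = len(slides)
    tokenized = {sid: text.lower().split() for sid, text in slides.items()}

    # Document frequency: in how many slides each term occurs.
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))

    scores = {}
    for sid, words in tokenized.items():
        tf = Counter(words)
        scores[sid] = sum(
            tf[k] * math.log(n / df[k])
            for k in (kw.lower() for kw in keywords)
            if df[k]  # terms that occur nowhere contribute nothing
        )
    return sorted(scores, key=scores.get, reverse=True)
```

Such a ranking depends only on term statistics, so repeated appearances of the same slide in a video receive the same score, which is the situation the impression indicators are meant to address.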
Ranking Histogram
[Figure: histogram; X axis: ranking of target slides (1, 6, 11, 16, >21); Y axis: frequency (0 to 35); series: It and tf.idf]
The histogram illustrates that many selected appearances are listed earlier by our approach, and about half of them are ranked at the top
Experimental Results
• Calculating precision
  – Precision of the proposed method is 59.4%
  – while that of tf.idf is 32.2% under the same conditions
• The results demonstrate that the proposed impression indicators are effective for searching for a presentation scene in presentation databases
UPRISE & KS
• Presentation slides and recorded video streams are stored in the KnowledgeStore
• Utilizes several functionalities of KS
  – Content retrieval of slides and videos
  – Metadata and full-text search for slides
  – Dedicated indexes corresponding to the impression indicators
  – Single sign-on function
Dataflow Between UPRISE & KS
[Figure: presentation slides (ppt) have their structure information extracted and are converted into pdf; digital video is converted into avi; slides and video are synchronized, and metadata and an index are created; documents, metadata, index, and video stream are stored in the KnowledgeStore through the web-service APIs, which the search and user interface also uses]
Synchronization
• To synchronize the presentation slides and the video
  – We have developed a system that applies character recognition techniques to the video stream
    • It enables past presentations to be retrieved
    • It also utilizes information from a laser pointer
  – As collaborative work with Fujitsu Labs.
Search Interface of UPRISE
• When keywords are specified, UPRISE searches the table of metadata for tuples matching the given keywords using the index for the table
  – UPRISE calculates the impression indicators with the given keywords
• A web-based interface showing thumbnails of the slides has been developed
  – The size of each thumbnail varies with the impression indicator
  – When a thumbnail is clicked, the video starts from the portion showing that slide
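One simple way to vary thumbnail size with a score is linear scaling between a minimum and a maximum width, sketched below. The slides do not describe the actual scaling rule, so the function, the pixel bounds, and the linear mapping are all assumptions.

```python
def thumbnail_widths(scores, min_px=80, max_px=240):
    """Map each slide's score to a thumbnail width in pixels.

    scores: {slide_id: score}; higher scores give larger thumbnails.
    """
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    widths = {}
    for slide_id, score in scores.items():
        if span == 0:
            # All scores are equal: show every thumbnail at full size.
            widths[slide_id] = max_px
        else:
            widths[slide_id] = round(min_px + (score - lo) / span * (max_px - min_px))
    return widths
```

With scores of 0.0, 0.5, and 1.0, the three thumbnails come out 80, 160, and 240 pixels wide, so the best-matching slide is the most prominent.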
Snapshot of Searching a Slide
[Figure: screenshot; the user inputs a keyword and clicks a thumbnail, and the video and slides are displayed in synchronization]
Each line indicates part of a sequence of presented slides
The size of each thumbnail varies with the impression indicator
Other External Systems & KS
• Asunaro (an e-learning system for studying Japanese)
  – KS manages the multilingual manuscripts and databases for language translation
  – Related materials, such as video streams for foreign language study, are also stored in the KS
• PRESRI (a citation index)
  – KS stores the research papers and provides search functions for them
  – PRESRI can also use the RDB in KS to store the citation information
• Research Mining (a discoverer of research macro-flows)
  – It can share the research papers and PRESRI citation information stored in the KS
Hardware Configuration (1/3)
• Use a Fibre Channel (FC) switch to configure a storage area network (SAN)
  – 2 Gbps, 16-port FC switch
  – Connects a number of servers to FC-RAID
    • with a storage virtualization mechanism that manages storage space by sharing pooled logical storage volumes
  – Total capacity is 9 TB
• The configuration makes it easy to scale up
  – storage capacity
  – processing performance
  by changing the number of disks and servers
Hardware Configuration (2/3)
• Assign services to a number of servers:
  – video stream servers
  – web servers
  – relational database and XML management servers
  – single sign-on servers
  – contents creation servers
  to increase freedom in extending the servers
• This allows the processing performance to be adjusted service by service
Hardware Configuration (3/3)
• To preserve the reliability of the system
  – Additional ATA RAIDs for backup
    • Nowadays, cheap RAIDs tend to be used as backup devices instead of magnetic-tape drives from a cost-performance point of view
    • Three 3 TB RAIDs (9 TB in total)
    • Connected to the SAN
    • Automatic backup from the primary RAID to the backup RAID without using the LAN (LAN-free backup)
  – Uninterruptible power supplies (UPSs)
Hardware Configuration Overview
[Figure: clients on the Internet reach the SSO server, web server, video stream server, RDB / XML server, and contents creation server (with scanner, camera, etc.) over a LAN (Gbit Ethernet switch); the servers connect through the SAN (2 Gbps 16-port FC switch) to 9 TB RAIDs holding contents (video/sound, documents), metadata, and index, and to 9 TB backup RAIDs]
Specifications of Servers (1/2)
• Video Stream Server
  – UltraSparcIIIi 1.062GHz x 4
  – 8GB Memory and 36GB x 4 HDD
  – Solaris 8 and Helix Universal Server
• Web Server
  – UltraSparcIIIi 1.062GHz x 4
  – 8GB Memory and 36GB x 4 HDD
  – Solaris 8 and Sun ONE Application Server 7
• RDB / XML Server
  – Pentium Xeon 3.06GHz x 2
  – 4GB Memory and 18.6GB HDD
  – Red Hat Enterprise Linux Advanced Server 3
  – Oracle 9i and Tamino XML DB
Specifications of Servers (2/2)
• Single Sign-On Server for Video Stream Access
  – Pentium Xeon 3.06GHz x 2
  – 1GB Memory and 18.6GB HDD
  – Red Hat Linux 9
• Single Sign-On Server for Web Access
  – Pentium Xeon 3.06GHz x 2
  – 2GB Memory and 36.4GB HDD
  – Red Hat Linux 9
• Contents Creation Server
  – Pentium Xeon 3.06GHz x 2
  – 2GB Memory and 18.6GB HDD
  – Windows XP Professional
Very Hot News
• The hardware is being set up TODAY!
Switches and Servers
Concluding Remarks (1/2)
• For the Titech COE LKR project, we proposed a combination of an information storage system and external systems connected by web-service APIs
  – The information storage system, named KnowledgeStore, provides common functions for handling the variety of data formats
  – The external systems execute special applications based on the knowledge resources, such as presentation retrieval, e-learning, and research macro-flow discovery
Concluding Remarks (2/2)
• The configuration realizes the flexibility and extensibility required for managing large-scale knowledge resources
• Currently
  – Implementing the KnowledgeStore and external systems
  – Considering enhancements of functions, such as version management
Thanks for your attention