Page 1: Chapter 6 Text and Multimedia Languages and Properties

Hsin-Hsi Chen 6-1

Chapter 6: Text and Multimedia Languages and Properties

Hsin-Hsi Chen

Department of Computer Science and Information Engineering

National Taiwan University

Part of the material that follows is selected from Dr. Kuang-hua Chen's talk on XML and RDF (Department of Library and Information Science, National Taiwan University)

What Is a Document?

• Document: a single unit of information
  – a complete logical unit
    • research paper, book, manual
  – part of a larger text
    • paragraph, passage, an entry in a dictionary, …
  – a physical unit
    • file, email, Web page

Characteristics of a Document

[Figure: a document and its characteristics]
• Document: text + structure + other media
• Syntax: expresses structure, presentation style, or even external actions; implicit, or expressed in a language
• Presentation style: how a document is displayed or printed
• Semantics
• Other labels in the figure: Creator, Author

Metadata (Chinese renderings: 元資料, 超資料, 中介資料, 中間資料, 後設資料, 詮釋資料)

• Definition
  – data about the data, e.g., the schema in a DBMS
  – describes other information according to some rules or policies
• Types
  – Descriptive metadata
    • metadata that is external to the meaning of the document
    • e.g., Dublin Core
  – Semantic metadata
    • metadata that can be found within the document's content
    • e.g., Library of Congress subject codes

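Dublin Core

• Metadata Element Set (15 elements)
  – Subject (subject and keywords)
    • the topic of the resource: keywords or phrases that describe its subject or content, including controlled vocabularies or classification schemes
  – Title
    • the name given to the resource by its creator or publisher
  – Creator
    • the person, organization, or institution that created the content of the resource
  – Description
    • a textual description of the content, such as the abstract of a document or a summary of an image resource
  – Publisher
    • the organization that publishes the resource, e.g., a publishing house, a university department, a group, or an organization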

Dublin Core (Continued)

  – Contributor
    • other persons or organizations that contributed to the creation of the resource, e.g., editors, translators, or illustrators
  – Date
    • the date on which the resource was published
  – Type
    • the kind of resource, e.g., home page, novel, poem, technical report, dictionary
  – Format
    • the file format of the resource, e.g., text/html, ASCII, or a JPEG image file
  – Identifier
    • a string or number that uniquely identifies the resource, e.g., a URL or URN for a network resource, an ISBN, or another formal name

Dublin Core (Continued)

  – Relation
    • the relationship to other resources, e.g., the series the resource belongs to
  – Source
    • the work from which the resource is derived
  – Language
    • the language of the content of the resource
  – Coverage
    • the temporal and spatial characteristics of the resource
  – Rights
    • the copyright statement and the rules governing rights management and use

Example: An Artifact

<?xml version="1.0"?>
<dc-record>
  <type>器物</type> <!-- artifact -->
  <format>銅、琺瑯</format> <!-- copper, enamel -->
  <format>掐絲</format> <!-- cloisonné wire work -->
  <title>景泰掐絲琺瑯番蓮紋盒</title>
  <title>cloisonné box with lotus-spray decoration</title>
  <description>1400/1500</description>
  <description>銅胎,蓋與器身鑄成浮雕式八瓣蓮花形</description> <!-- copper body; lid and body cast in relief as an eight-petal lotus -->
  <description>高 63.cm 口徑 12.4cm 重 634.6 克</description> <!-- height, mouth diameter, weight -->
  <description>陳夏生,明清琺瑯器展覽圖錄。台北:國立故宮博物院,民 88 年 2 月。</description> <!-- exhibition catalog, National Palace Museum, Taipei, Feb. 1999 -->

Example: An Artifact (Continued)

  <subject>景泰掐絲琺瑯番蓮紋盒</subject>
  <subject>日常生活</subject> <!-- daily life -->
  <subject>容器</subject> <!-- container -->
  <subject>銅、琺瑯</subject>
  <subject>掐絲</subject>
  <subject>地區(社的座落位置)(r) place</subject> <!-- region / place -->
  <date>1400/1500</date>
  <coverage>地區(社的座落位置)(r) place</coverage>
  <rights>臺灣, 故宮</rights> <!-- Taiwan, National Palace Museum -->
</dc-record>

Example: An Ink Painting on Paper

<?xml version="1.0"?>
<dc-record>
  <type>紙本水墨</type> <!-- ink on paper -->
  <type>原件</type> <!-- original -->
  <title>古木流泉</title>
  <description>全文</description> <!-- full text -->
  <description>紙本水墨</description>
  <description>30*48.7</description>
  <!-- the next description lists the collector seals on the painting -->
  <description>蓼塘。楊世家藏。神。品。項元汴印。項子京家珍藏。項墨林鑑賞章。墨林秘玩。?李項氏士家寶玩。張澤之。柯亭文房之印。乾隆御覽之寶。石渠寶笈。重華宮鑑藏寶。樂善堂圖書記。</description>

Example: An Ink Painting on Paper (Continued)

  <description>1127/1189</description>
  <description>國立故宮博物院編輯委員會,宋代書畫冊頁名品特展。台北:國立故宮博物院,民 84 年 9 月。</description> <!-- exhibition catalog, National Palace Museum, Taipei, Sept. 1995 -->
  <subject>風景</subject> <!-- landscape -->
  <creator>馬和之</creator>
  <date>1127/1189</date>
  <language>zh</language>
  <rights>臺灣, 故宮</rights> <!-- Taiwan, National Palace Museum -->
</dc-record>
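A record in this layout can be read with any XML parser. The sketch below is only an illustration: it uses Python's standard xml.etree.ElementTree, the shortened RECORD is a hypothetical excerpt of the painting example, and dc_fields is a helper name introduced here. It groups values by element name, since Dublin Core elements may repeat.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, shortened record in the same <dc-record> layout as the slides.
RECORD = """<?xml version="1.0"?>
<dc-record>
  <type>紙本水墨</type>
  <type>原件</type>
  <title>古木流泉</title>
  <subject>風景</subject>
  <creator>馬和之</creator>
  <date>1127/1189</date>
  <language>zh</language>
  <rights>臺灣, 故宮</rights>
</dc-record>"""


def dc_fields(xml_text):
    """Group element text by element name; repeated elements keep their order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    fields = defaultdict(list)
    for child in root:
        fields[child.tag].append((child.text or "").strip())
    return dict(fields)


fields = dc_fields(RECORD)
print(fields["creator"], fields["type"])
```

Returning lists rather than single strings keeps repeated elements such as the two type entries intact.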

MARC

• Machine-Readable Cataloging record
• The most widely used format for library records
• An example (NTU Library):

  Title (書名): 公共藝術年鑑 Public art in Taiwan eng 何政廣 總編輯
  Imprint (出版項): 臺北市 行政院文化建設委員會 民 88-
  Imprint (出版項): 1999.
  Physical description (稽核項): 冊 彩圖 29公分
  Note (附註): 據民 87 年書目資料著錄
  Chinese subject heading (中文標題): csh 公共藝術 -- 年鑑
  Added author (其他作者): 何 政廣
  Control number (控制號): 100982322.
  ISBN (國際標準號): 957-02-4468-2 平裝 NT$500.
  LC card number (國會卡片號): cw 88008821.

Web Metadata

• Purposes
  – cataloging (e.g., BibTeX)
  – content rating
    • protect children from reading certain types of documents
  – intellectual property rights
  – digital signatures (for authentication)
  – privacy levels
  – applications to electronic commerce
  – …
• RDF (Resource Description Framework)


RDF

• description of nodes and attached attribute/value pairs

• nodes: any Web resource

• attributes: properties of nodes

• values: text strings or other nodes (Web resources or metadata instances)

RDF Basic Model

[Figure: Resource --Property--> Value]
• A statement consists of a subject (the resource), a predicate (the property), and an object (the value)

Example 1

[Figure: RDF graph]
• 機器貓小叮噹 (the comic "Doraemon") --作者 (author)--> 籐子不二雄 (Fujiko Fujio)
• 機器貓小叮噹 --型態 (type)--> 漫畫 (comic)

RDF Structural Model

[Figure: Resource --Property--> Resource; the second resource has its own properties, each pointing to a value]

Example 2

[Figure: RDF graph with an intermediate node]
• 機器貓小叮噹 --作者 (author)--> Dummy
• Dummy --姓名 (name)--> 籐子不二雄
• Dummy --電子郵件 (email)--> [email protected]

Name Space

• Provides a mechanism for using controlled vocabularies defined by other organizations
• Provides a mechanism for authorities to define their own controlled vocabularies
• Example:

<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax/"
     xmlns:dc="http://purl.org/dc/elements/1.0/">

  – xmlns:dc declares the Dublin Core name space

DC in RDF

[Figure: a Resource node with one arc per Dublin Core property: dc:type, dc:title, dc:description, dc:subject, dc:coverage, dc:creator, dc:contributor, dc:publisher, dc:date, dc:relation, dc:language, dc:identifier, dc:rights, dc:format, dc:source]

A DC Example in RDF

[Figure: http://x.html --dc:creator--> Kevin Chen]

<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://x.html">
    <dc:creator>Kevin Chen</dc:creator>
  </Description>
</RDF>

RDF Syntax

<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://www.lis.ntu.edu.tw/~khchen/">
    <dc:title>The Magic Shelter</dc:title>
    <dc:creator>Kuang-hua Chen</dc:creator>
  </Description>
</RDF>

[Figure: RDF graph]
• http://www.lis.ntu.edu.tw/~khchen/ --dc:title--> "The Magic Shelter"
• http://www.lis.ntu.edu.tw/~khchen/ --dc:creator--> "Kuang-hua Chen"
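The correspondence between the XML syntax and the graph can be made concrete by listing the (subject, predicate, object) statements that a description encodes. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree and the draft name-space URIs shown on the slides; it is not a full RDF parser, the function name triples is introduced here, and the property names are written in lower case as an assumption to match the graph labels.

```python
import xml.etree.ElementTree as ET

# Draft RDF name space as written on the slides (an assumption of this sketch).
RDF_NS = "http://www.w3.org/TR/WD-rdf-syntax#"

RDF_XML = """<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://www.lis.ntu.edu.tw/~khchen/">
    <dc:title>The Magic Shelter</dc:title>
    <dc:creator>Kuang-hua Chen</dc:creator>
  </Description>
</RDF>"""


def triples(rdf_xml):
    """Return one (subject, predicate, object) tuple per property element."""
    root = ET.fromstring(rdf_xml)
    statements = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get("about")  # the resource being described
        for prop in desc:
            # ElementTree expands prefixes, so prop.tag is "{namespace}name".
            statements.append((subject, prop.tag, (prop.text or "").strip()))
    return statements


for statement in triples(RDF_XML):
    print(statement)
```

Each printed tuple corresponds to one arc in the graph: the resource is the subject, the expanded property name is the predicate, and the element text is the object.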

Text

• Formats
  – Basic form
    • ASCII, …
  – Document interchange
    • Rich Text Format (RTF): used by word processors
    • Portable Document Format (PDF) and PostScript: used for displaying or printing documents
    • MIME (Multipurpose Internet Mail Extensions): supports multiple character sets, multiple languages, and multiple media

Text (Continued)

  – Compression
    • compress (Unix)
    • ARJ (PCs)
    • ZIP (gzip in Unix and WinZip in Windows)

Information Theory

• Entropy
  – measures information content or information uncertainty

$$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$$

  where $\sigma$ is the number of symbols in the alphabet and $p_i$ is the probability of symbol $i$
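As an illustration, the entropy E = -Σ p_i log2 p_i of a text can be computed once the symbol probabilities are known. The sketch below assumes that each p_i is estimated by the relative frequency of character i in the text itself, which is a modeling choice of this example rather than part of the definition.

```python
import math
from collections import Counter


def entropy(text):
    """Entropy in bits per symbol, with p_i estimated by relative frequency."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(entropy("abab"))  # two equally likely symbols
print(entropy("abcd"))  # four equally likely symbols
```

A text made of a single repeated symbol has entropy 0, and a text over σ equally likely symbols has entropy log2 σ.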

Modeling Natural Language

• Issue 1: how a word is formed
  – symbols: those that separate words and those that belong to words
  – vowels are more frequent than most consonants
  – binomial model (0-order Markov model): each symbol is generated with a certain probability
  – k-order Markov model
• Extension: how a sentence is formed
  – e.g., a 5-order Markov model estimated from the Bible
  – finite-state model (regular languages)
  – grammar model (context-free and other languages)
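A k-order Markov model over symbols can be sketched in a few lines: the probability of the next symbol depends only on the k symbols before it. The code below is an illustrative assumption-laden sketch (character symbols, maximum-likelihood counts, hypothetical function names build_model and generate), not a description of any particular system.

```python
import random
from collections import Counter, defaultdict


def build_model(text, k):
    """Map each k-symbol context to a Counter of the symbols that follow it."""
    model = defaultdict(Counter)
    for i in range(len(text) - k):
        model[text[i:i + k]][text[i + k]] += 1
    return model


def generate(model, k, seed_context, length, rng):
    """Extend seed_context by up to `length` symbols sampled from the model."""
    out = seed_context
    for _ in range(length):
        followers = model.get(out[len(out) - k:])
        if not followers:  # context never seen in training: stop early
            break
        symbols = list(followers)
        weights = list(followers.values())
        out += rng.choices(symbols, weights=weights)[0]
    return out


model = build_model("abracadabra", 2)
print(generate(model, 2, "ab", 9, random.Random(0)))
```

With k = 0 every context is the empty string, so the model reduces to drawing each symbol independently with a fixed probability, which is the 0-order case listed above.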

Modeling Natural Language (Continued)

• Issue 2: how different words are distributed inside each document
• Zipf's law
  – The frequency of the i-th most frequent word is $1/i^{\theta}$ times that of the most frequent word
  – In a text of n words with a vocabulary of V words, the i-th most frequent word appears $n / (i^{\theta} H_V(\theta))$ times, where

$$H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}}$$

  – $\theta$ = 1.5 ~ 2.0
  – In the simplest case, $\theta = 1$, the word counts add up to the n words of the text:

$$\sum_{j=1}^{V} \frac{n}{j\,H_V(1)} = \frac{n}{H_V(1)} \left( 1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{V} \right) = n$$
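To make the Zipf estimate concrete, the sketch below computes the expected number of occurrences of the i-th most frequent word as n / (i^θ · H_V(θ)), with H_V(θ) the sum of 1/j^θ for j = 1..V. The function names and the sample values of n, V, and θ are assumptions chosen only for illustration.

```python
def harmonic(V, theta):
    """Generalized harmonic number H_V(theta) = sum_{j=1..V} 1 / j**theta."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))


def zipf_count(i, n, V, theta):
    """Expected occurrences of the i-th most frequent word under Zipf's law."""
    return n / (i ** theta * harmonic(V, theta))


n, V, theta = 1000, 50, 1.5  # illustrative values only
counts = [zipf_count(i, n, V, theta) for i in range(1, V + 1)]
print(round(counts[0], 1), round(counts[1], 1))
print(round(sum(counts), 6))  # the counts add up to n
```

Dividing by H_V(θ) is what normalizes the counts so that they sum to the text length n.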

[Figure: two plots, word frequency F over the words, and vocabulary size V over text size]

• A few hundred words account for 50% of the text.
• Words that are too frequent (stopwords) can be disregarded.

Modeling Natural Language (Continued)

• Issue 3: the distribution of words in the documents of a collection
• Negative binomial distribution
  – The fraction of documents containing a word k times is

$$F(k) = \binom{\alpha + k - 1}{k} p^{k} (1 + p)^{-\alpha - k}$$

  where $p$ and $\alpha$ depend on the word and the document collection
  – e.g., $p = 9.24$ and $\alpha = 0.42$ for the word "said" in the Brown corpus
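The negative binomial fraction can be evaluated numerically. The sketch below assumes the standard form F(k) = C(α+k−1, k) · p^k · (1+p)^(−α−k) and uses the log-gamma function because α is not an integer; the function name neg_binomial is introduced here, and the parameter values are the ones quoted for "said".

```python
import math


def neg_binomial(k, p, alpha):
    """F(k) = C(alpha+k-1, k) * p**k * (1+p)**(-alpha-k), computed in log space."""
    # C(alpha+k-1, k) = Gamma(alpha+k) / (Gamma(k+1) * Gamma(alpha))
    log_coeff = math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
    return math.exp(log_coeff + k * math.log(p) - (alpha + k) * math.log(1 + p))


p, alpha = 9.24, 0.42  # values quoted for the word "said"
for k in range(4):
    print(k, round(neg_binomial(k, p, alpha), 4))
```

Summing F(k) over all k gives 1, which is a quick sanity check that the formula describes a distribution over documents.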

Modeling Natural Language (Continued)

• Issue 4: number of distinct words in a document (document vocabulary)
• Heaps' law
  – The vocabulary of a text of n words has size

$$V = K n^{\beta}$$

  where $K$ and $\beta$ depend on the particular text
  – $K$: between 10 and 100
  – $\beta$: a positive value less than 1 (e.g., $0.4 < \beta < 0.6$)
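Heaps' law, V = K · n^β, can be compared against the vocabulary growth actually measured on a text. The sketch below shows both sides; the values K = 50 and β = 0.5 are illustrative picks inside the ranges given above, and the whitespace tokenization is an assumption of this example.

```python
def heaps_vocabulary(n, K, beta):
    """Vocabulary size predicted by Heaps' law for a text of n words."""
    return K * n ** beta


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token of the text."""
    seen, growth = set(), []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


print(heaps_vocabulary(10 ** 6, K=50, beta=0.5))
print(vocabulary_growth("a b a c b d".split()))
```

Fitting K and β to the measured growth curve of a real collection is what ties the two functions together; that fitting step is left out here.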

Modeling Natural Language (Continued)

• Issue 5: average length of words
• Heaps' law
  – The length of the words in the vocabulary increases logarithmically with the text size
    • Longer words should appear as the text grows.
    • The average length of the words in the overall text is constant.

Similarity Model

• Distance function
  – symmetric: distance(a,b) = distance(b,a)
  – triangle inequality: distance(a,c) ≤ distance(a,b) + distance(b,c)
  – measures
    • Edit distance: minimum number of character insertions, deletions, and substitutions
      e.g., Edit-distance(color, colour) = 1, Edit-distance(survey, surgery) = 2
    • Longest common subsequence: only deletion is allowed
      e.g., LCS(survey, surgery) = surey (the non-common characters are deleted)
    • Longest common sequence of lines between two files: e.g., the diff command in Unix
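Both measures are usually computed by dynamic programming. The sketch below is one common formulation (not necessarily the one intended in the lecture) and reproduces the examples for color/colour and survey/surgery.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def lcs(a, b):
    """One longest common subsequence of a and b."""
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(edit_distance("color", "colour"), edit_distance("survey", "surgery"))
print(lcs("survey", "surgery"))
```

Applying lcs to lists of lines instead of strings of characters gives the core idea behind a line-based file comparison.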

Markup Languages

• Definition
  – textual syntax that describes formatting actions, structure information, text semantics, attributes, etc.
• Types
  – Procedural markup
  – Descriptive markup

Procedural Markup

Descriptive Markup

Features of Descriptive Markup

• Separates the content of a document from its presentation format
• Marks up the semantic structure of the document

Hsin-Hsi Chen 6-37

SGML(Standard Generalized Markup Language)

• 1986 年 ISO 所制定的標準- ISO 8879

• 屬於描述性標示。• 是一種 Meta-language

– HTML 是 SGML 的應用。

Features of SGML

• Flexibility
  – can describe any information structure and arbitrarily complex documents
• Non-proprietary, platform-independent, and system-independent
  – facilitates document interchange and long-term preservation
• Information re-usability

Components of an SGML Document

• SGML declaration
  – specifies the character set used by the document and specific optional features
• DTD (Document Type Definition)
  – defines the elements a document contains
  – defines the content and attributes of the elements
  – ...
• DI (Document Instance)
  – the marked-up document

SGML Declaration

• Specifies the character set used by the SGML document and specific optional features
• The SGML declaration may be omitted, in which case the document uses SGML's default character set and feature settings
• <!SGML "ISO 8879-1986" ...

Example: The Document Structure of an Email

[Figure: tree with root Email and children From, Date, To, Subject, Body]

An SGML DTD for Email

<!-- Elements     Min   Content -->
<!ELEMENT Email   - -   (From, Date, To+, Subject, Body?)>
<!ELEMENT From    - O   (#PCDATA)>
<!ELEMENT Date    - O   (#PCDATA)>
<!ELEMENT To      - -   (#PCDATA)>
<!ELEMENT Subject - O   (#PCDATA)>
<!ELEMENT Body    - -   (#PCDATA)>
<!-- End of Email DTD -->

• <!-- ... -->: comment
• Min column: whether the starting and ending tags are compulsory (-) or optional (O)
• Content model operators
  – , : concatenation
  – | : logical or
  – ? : 0 or 1 occurrence
  – * : 0 or more occurrences
  – + : 1 or more occurrences
• Content types
  – #PCDATA: parsed character data (ASCII characters)
  – NDATA: binary data
  – EMPTY

An SGML DI for Email DTD

<!DOCTYPE Email SYSTEM "c:\temp\email.dtd">
<Email>
<From>Joe
<Date>1999-7-14 AM 09:20
<To>Jay</To>
<To>Jennifer</To>
<Subject>Learning XML
<Body>XML 將在 Web 上大放異彩,趕快學喔! …</Body>
</Email>

• SYSTEM: user defined (vs. PUBLIC)
• The ending tag is optional for From, Date, and Subject


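SGML, DTDs, Document Instances, and Presentation Instances

[Figure: SGML defines many DTDs; each DTD governs many document instances (DIs); each DI can be rendered as several presentation instances, e.g., a printed version, a hypertext version, a Braille version, …]

• Style specifications used to produce the presentation instances
  – DSSSL (Document Style Semantics and Specification Language)
  – FOSI (Formatting Output Specification Instance)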

Limitations of SGML

• SGML applications are hard to develop
• SGML documents are hard to distribute on the Web
• Lack of vendor support

HTML (Hypertext Markup Language)

• An application of SGML
  – HTML 2.0 DTD
  – HTML 3.2 DTD
  – HTML 4.0 DTD
• Currently the standard data format for authoring Web pages
• Simple and easy to learn
• Portable
• Can combine hyperlinks and multimedia
• Most HTML instances do not explicitly make reference to the DTD

Characteristics of HTML

• The HTML DTD is designed mainly to meet the needs of online display
• HTML has built-in styles
• HTML uses SGML's markup minimization feature
• HTML does not adopt SGML's hyperlink mechanism

Limitations of HTML

• Structural limitations
• Limited information reuse
• Limited data interchange
• Limited automatic document processing
• Cannot support more precise queries
• The HTML extensions released by different vendors are incompatible

XML (eXtensible Markup Language)

• W3C Recommendation, 10 February 1998
  – XML 1.0
• Backed by major vendors: Microsoft, Netscape, Sun, ...
• XML is SGML-- rather than HTML++
• Takes the strengths of SGML and remedies the weaknesses of HTML
  – lets users define their own tags as needed
  – can be delivered over the Web


W3C Data Format

http://www.w3c.org/

The Most Important Features of XML

• Extensibility
  – XML lets users define their own tags as needed.
• Structure
  – XML can describe all kinds of complex document structures.
• Validation
  – XML can check the structure of a document against a DTD.

XML Standards

• XML-Language: SGML without tears
  – self-describing documents
  – well-formed and valid documents
• XML-Link: power linking
  – simple and extended links
• XML-Style: separate style from content
  – XSL (Extensible Stylesheet Language)

Status of XML Standardization

• XML 1.0
  – W3C Recommendation, 10-Feb-1998
• XML Namespace
  – W3C Recommendation, 14-Jan-1999
• XLink & XPointer
  – W3C Working Draft, 03-March-1998
• XSL
  – W3C Working Draft, 16-Dec-1998

Well-formed XML Rules

• The document contains one or more elements
• There is exactly one root element
• Neither the start-tag nor the end-tag may be omitted
• All tags must be properly nested (e.g., <B><I>bold and italic</B>italic</I> is not allowed)
• Empty tags must follow the special XML syntax (e.g., <img src="…"/>)
• All attribute values must be enclosed in single or double quotes (e.g., <font size="2">)
• All entities must be declared


Writing Well-Formed XML

• Step 1 : Make an XML Declaration

• Step 2 : Creating a Root Element

• Step 3 : Writing in XML

• Step 4 : Parsing your document

Step 1: Make an XML Declaration

• <?xml version="1.0" standalone="yes"?>
• <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
• <?xml version="1.0" encoding="big5" standalone="yes"?>
• standalone="yes": without DTD

Step 2: Creating a Root Element

<?xml version="1.0" standalone="yes"?>
<Email>
……
</Email>

Step 3: Writing in XML

<?xml version="1.0" encoding="big5" standalone="yes"?>
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML 將在 Web 上大放異彩,趕快學喔! …</Body>
</Email>

• End tags cannot be omitted

Step 4: Parsing Your Document

• Check whether your XML document conforms to the well-formed XML rules.
• Use a parser to check well-formedness
  – for example, the XML parser embedded in IE5
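Any conforming XML parser can serve as the well-formedness check. As a sketch, the helper below (is_well_formed, a name introduced here) uses Python's standard xml.etree.ElementTree in place of the browser parser mentioned above; the parser raises ParseError as soon as a rule is violated.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on any parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<Email><From>Joe</From><To>Jay</To></Email>"))
print(is_well_formed("<B><I>bold and italic</B>italic</I>"))  # improper nesting
print(is_well_formed("<Email><From>Joe</Email>"))             # omitted end-tag
```

This only checks well-formedness; it says nothing about whether the document is valid with respect to a DTD.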

Viewing Well-formed XML in Internet Explorer 5.0

Viewing a Malformed XML Document in Internet Explorer 5.0


Writing Valid XML

• Step 1 : Make an XML declaration.

• Step 2 : Designing a DTD.

• Step 3 : Writing Valid XML.

• Step 4 : Parsing your Valid XML document.

Step 1: Make an XML Declaration

• <?xml version="1.0" standalone="no"?>
• <?xml version="1.0" encoding="UTF-8" standalone="no"?>
• <?xml version="1.0" encoding="big5" standalone="no"?>

Step 2: Designing a DTD

<!-- Elements      Content -->
<!ELEMENT Email    (From, Date, To+, Subject, Body?)>
<!ELEMENT From     (#PCDATA)>
<!ELEMENT Date     (#PCDATA)>
<!ELEMENT To       (#PCDATA)>
<!ELEMENT Subject  (#PCDATA)>
<!ELEMENT Body     (#PCDATA)>
<!-- End of Email DTD -->

Step 3: Writing Valid XML

<?xml version="1.0" encoding="big5" standalone="no"?>
<!DOCTYPE Email SYSTEM "email.dtd">
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML 將在 Web 上大放異彩,趕快學喔! …</Body>
</Email>
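Validity means that the element structure matches the content model declared in the DTD. Python's standard library does not validate against DTDs, so the sketch below is only a hand-written stand-in: it rewrites the content model (From,Date,To+,Subject,Body?) as a regular expression over the child element names. EMAIL_MODEL and matches_email_model are names introduced for this illustration, and a real validating parser would check much more (for example, that the leaf elements contain only character data).

```python
import re
import xml.etree.ElementTree as ET

# Content model (From,Date,To+,Subject,Body?) as a regex over "Name," tokens.
EMAIL_MODEL = re.compile(r"From,Date,(To,)+Subject,(Body,)?")


def matches_email_model(xml_text):
    """Check the root name and the order and count of its child elements."""
    root = ET.fromstring(xml_text)
    if root.tag != "Email":
        return False
    child_names = "".join(child.tag + "," for child in root)
    return EMAIL_MODEL.fullmatch(child_names) is not None


VALID = ("<Email><From>Joe</From><Date>1999-7-14 AM 09:20</Date>"
         "<To>Jay</To><To>Jennifer</To><Subject>Learning XML</Subject>"
         "<Body>...</Body></Email>")
NO_RECIPIENT = ("<Email><From>Joe</From><Date>1999-7-14 AM 09:20</Date>"
                "<Subject>Learning XML</Subject></Email>")

print(matches_email_model(VALID))         # To+ and Body? are satisfied
print(matches_email_model(NO_RECIPIENT))  # To+ requires at least one To
```

The second document is well-formed but not valid, which is exactly the distinction between the two kinds of XML documents.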

XML Simple Link

XML Extended Linking: Multiple Ends

XML Extended Linking: Addressing by Structure

XML Extended Linking

XSL: XML Counterpart of CSS (Cascading Style Sheets)

• Sample: email.css

Email, From, Date, To, Subject, Body {
  display: block;
  margin-left: 5%;
  margin-right: 5%;
  border-style: groove;
}

XML Document with Style

<?xml version="1.0" encoding="big5" standalone="no"?>
<?xml-stylesheet href="email.css" type="text/css"?>
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML 將在 Web 上大放異彩,趕快學喔! …</Body>
</Email>

Viewing an XML Document Styled with CSS in Internet Explorer 5.0

Applications of XML

• Database interchange

• Client-side processing

• User views of the data

• Information filtering

Multimedia

• Media
  – text, sound, images, video
• Issues
  – volume, format, processing requirements

Formats

• Image
  – bit-mapped / pixel-based display
    • the simplest format
    • XBM, BMP, PCX
    • disadvantage: redundancy
  – compression
    • CompuServe's Graphics Interchange Format (GIF)
  – lossy compression
    • Joint Photographic Experts Group (JPEG)
  – exchange
    • Tagged Image File Format (TIFF)

Formats (Continued)

• Audio
  – AU, MIDI, WAVE
• Video
  – MPEG, AVI, QuickTime

Textual Images

• Definition
  – images of documents that contain mainly typed or typeset text
  – obtained by scanning the documents
• Image retrieval
  – Alternative 1
    • At creation time, a set of keywords (called metadata) is associated with each image
    • Conventional text retrieval techniques can be applied to the keywords

Textual Images (Continued)

  – Alternative 2
    • Use OCR to extract the text of the image
    • The resulting ASCII text can be used to extract keywords
  – Alternative 3
    • Use the symbols extracted from the images as basic units to combine image retrieval techniques with sequence retrieval techniques


Taxonomy of Web languages

Related Resources

• HTML 4: http://www.w3.org/TR/REC-html40
• W3C: http://www.w3c.org/
• OCLC: http://purl.oclc.org/
• XML: http://www.xml.org/
• XML Parser: http://xdev.datachannel.com/
• DDML (Document Definition Markup Language): http://www.w3.org/TR/NOTE-ddml
• XSchema: http://purl.oclc.org/NET/xschema
