CS 502: Computing Methods for Digital Libraries
Lecture 4
Text
Administration
• Assignment 1 submission problems:
Due date postponed to Thursday 12:20
Demonstration by Dean Eckstrom
• Wednesday discussion classes:
Olin 155, 7:30-8:25 and 8:35 to 9:00
Check Notices for sections
Digital Libraries and Checking Information
Email to Teaching Assistants:
"I have heard that ..."
"There is a rumor that ..."
Authoritative source(s):
Course web site -- Notices
Text
The richness of text
• Elements: letters, scripts, symbols
• Structure: words, sentences, paragraphs, headings, tables
• Appearance: fonts, layout, design, materials
• Special: mathematics, music
Digital libraries must represent every variant!
Markup and Page Description
Mark-up languages represent the structure of text
e.g., SGML, XML
The mark-up must be combined with a style sheet for rendering.
Page description languages represent the appearance of text
e.g., PostScript, PDF
Markup and Style Sheets
[Diagram: the document content and structure, together with a style sheet, are input to rendering software, which produces the formatted document.]
Alternative Renderings
[Diagram: the same document content and structure feeds two rendering paths. With a style sheet for display, rendering software produces a computer display. With a style sheet for print, rendering software produces a printed document.]
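The idea of one structured document with alternative renderings can be sketched in a few lines of Python. This is only an illustration: the `DOCUMENT` list, the two style-sheet dictionaries, and the `render` function are hypothetical names invented here, not part of SGML, XML, or any real style-sheet language.

```python
# Hypothetical structured document: (element, text) pairs carry structure only,
# with no information about appearance.
DOCUMENT = [
    ("heading", "Text"),
    ("para", "Digital libraries must represent every variant."),
]

# Hypothetical style sheets: one template per element, per output medium.
DISPLAY_STYLE = {"heading": "<h1>{}</h1>", "para": "<p>{}</p>"}
PRINT_STYLE = {"heading": "{}\n====", "para": "    {}"}


def render(document, style_sheet):
    """Combine structure with a style sheet to produce a formatted document."""
    return "\n".join(style_sheet[element].format(text) for element, text in document)


display_output = render(DOCUMENT, DISPLAY_STYLE)
print_output = render(DOCUMENT, PRINT_STYLE)
print(display_output)
print(print_output)
```

The document itself never changes; only the style sheet passed to the rendering step differs between the two outputs.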
Example: the Oxford English Dictionary
• Typography of printed text represented semantic information.
• Keyboard the text, capturing all typographic information.
• Automatic parser to extract semantics (e.g., date, quotation, phonetics, etc.).
• Markup in SGML to tag semantic information.
• Separate style sheets for various editions, print, CD-ROM, online.
• Before the web, yet used with the web.
Character
Distinguish between
• the abstract character as a structural element,
"A"
• representations of the character
the glyph A in various typefaces, the binary code 1000001, the name "capital a"
ASCII
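A binary encoding of a character as an 8-bit byte, e.g., 01000001 is the encoding for "A"
[Diagram: code ranges on a scale from 0 to 255. Standard (7-bit) ASCII covers 0 to 127, with printable ASCII beginning at 32. Extended (8-bit) ASCII continues the range to 255.]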
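A short Python sketch of the encoding described on this slide. The helper names `ascii_bits` and `is_printable_ascii` are made up for illustration. The printable range used here, 32 through 126, is the conventional one (code 127 is the DEL control character).

```python
def ascii_bits(ch):
    """Return the 8-bit binary string for a single standard ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a standard (7-bit) ASCII character")
    return format(code, "08b")


def is_printable_ascii(code):
    """True for codes 32 (space) through 126 (~); 127 is the DEL control code."""
    return 32 <= code <= 126


print(ascii_bits("A"))         # 01000001
print(is_printable_ascii(65))  # True
print(is_printable_ascii(10))  # False (line feed is a control code)
```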
Unicode
Unicode
• 16-bit codes that represent distinct characters (the original design; later versions extend the code space beyond 16 bits)
• organized by scripts, not languages
• compatible with Unihan (Chinese, Japanese, Korean)
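As a rough illustration of "organized by scripts, not languages", Python's standard `unicodedata` module reports the official name of each character, and for the examples below that name starts with the script. The helper `script_word` is a hypothetical shortcut that only works for characters named this way; it is not a general script lookup.

```python
import unicodedata


def script_word(ch):
    """First word of the Unicode character name (the script, for these examples)."""
    return unicodedata.name(ch).split()[0]


# Latin A, Greek alpha, Cyrillic ya, Hebrew alef (written as escapes for portability).
for ch in ["A", "\u03b1", "\u044f", "\u05d0"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The Latin letter A is shared by English, French, and every other language written in the Latin script; Unicode assigns it one code, not one per language.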
Scripts
Scripts supported by Unicode 2.0
Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek, Gujarati, Gurmukhi, Han, Hangul, Hebrew, Hiragana, Kannada, Katakana, Latin, Lao, Malayalam, Oriya, Phonetic, Tamil, Telugu, Thai, Tibetan
More Scripts
Numbers; General Diacritics; General Punctuation; General Symbols; Mathematical Symbols; Technical Symbols; Dingbats; Arrows, Blocks, Box Drawing Forms & Geometric Shapes; Miscellaneous Symbols; Presentation Forms
Unicode and UTF-8
UTF-8
• a stream encoding of Unicode characters.
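• one to six bytes to represent each Unicode character (the current standard, RFC 3629, allows at most four); the number of leading ones in the first byte identifies the length of a multi-byte sequence.
• single-byte characters are identical to standard (7-bit) ASCII, e.g., 01000001 has no leading one, therefore it is a single-byte code.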
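The leading-ones rule can be checked with Python's built-in UTF-8 encoder. The helpers `leading_ones` and `describe` are names invented for this sketch. Note that Python follows the current UTF-8 definition, so it never produces sequences longer than four bytes.

```python
def leading_ones(byte):
    """Count the leading 1 bits of a byte value (0-255)."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def describe(ch):
    """Return (byte count, leading ones in the first byte, bit pattern)."""
    encoded = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in encoded)
    return len(encoded), leading_ones(encoded[0]), bits


# Latin A, Greek alpha, and a Han character.
for ch in ["A", "\u03b1", "\u4e2d"]:
    print(f"U+{ord(ch):04X}", describe(ch))
```

For a multi-byte character the count of leading ones in the first byte equals the number of bytes in the sequence; each continuation byte starts with the bits 10.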
Markup Languages
SGML (Standard Generalized Markup Language)
A system for creating markup languages that represent the structure of a document
XML (eXtensible Markup Language)
A simplified version of SGML intended for use with online information
DTD (Document Type Definition)
A markup specification for a class of documents, defined within the SGML framework
HTML (Hypertext Markup Language)
A markup and formatting language with links to other objects
XML Example (Metadata)
<?xml version="1.0"?>
<!DOCTYPE dlib-meta0.1 SYSTEM "http://www.dlib.org/dlib/dlib-meta01.dtd">
<dlib-meta0.1>
  <title>Digital Libraries and the Problem of Purpose</title>
  <creator>David M. Levy</creator>
  <publisher>Corporation for National Research Initiatives</publisher>
  <date date-type = "publication">January 2000</date>
  <type resource-type = "work">article</type>
  <identifier uri-type = "DOI">10.1045/january2000-levy</identifier>
  <identifier uri-type = "URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>
  <language>English</language>
  <relation rel-type = "InSerial">
    <serial-name>D-Lib Magazine</serial-name>
    <issn>1082-9873</issn>
    <volume>6</volume>
    <issue>1</issue>
  </relation>
  <rights>Copyright (c) David M. Levy</rights>
</dlib-meta0.1>
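Because the markup records structure rather than appearance, a program can pull individual fields out of such a record. A minimal sketch using Python's standard `xml.etree.ElementTree`, working on an abbreviated copy of the record with the DOCTYPE line left out for simplicity; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the metadata record (DOCTYPE omitted for this sketch).
RECORD = """<dlib-meta0.1>
  <title>Digital Libraries and the Problem of Purpose</title>
  <creator>David M. Levy</creator>
  <date date-type="publication">January 2000</date>
  <identifier uri-type="DOI">10.1045/january2000-levy</identifier>
  <relation rel-type="InSerial">
    <serial-name>D-Lib Magazine</serial-name>
    <issn>1082-9873</issn>
  </relation>
</dlib-meta0.1>"""

root = ET.fromstring(RECORD)

title = root.findtext("title")
date_type = root.find("date").get("date-type")
# Pick the identifier whose uri-type attribute is "DOI".
doi = next(e.text for e in root.iter("identifier") if e.get("uri-type") == "DOI")
# Nested elements are reached through their parent.
serial = root.find("relation").findtext("serial-name")

print(title, date_type, doi, serial, sep="\n")
```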
Constructing a DTD: Entities
Entities are basic units of information:
• Character entities
a b ... z 0 1 ... 9 ! ? ...
&lt; &alpha;
• Any other entities
&logo; &square-root;
Entities
• The name of an entity is purely mnemonic. It makes no assertions about the context in which the entity is used or its appearance when rendered.
• The DTD used by a scientific publisher will have about 4,000 entities to represent all the special symbols and the variants used in scientific disciplines.
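Since an entity name is purely mnemonic, the rendering step decides what each name becomes. A small Python sketch of that lookup follows; the `ENTITIES` table, the `ENTITY_RE` pattern, and the `expand` function are hypothetical, and a real publisher's entity set would be far larger and defined in the DTD rather than in a dictionary.

```python
import re

# Hypothetical entity table: the name says nothing about appearance;
# the replacement is chosen when the text is rendered.
ENTITIES = {
    "lt": "<",
    "alpha": "\u03b1",        # Greek small letter alpha
    "square-root": "\u221a",  # square root sign
}

# An entity reference is written &name; in the text.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, table):
    """Replace each &name; with its table entry; leave unknown entities untouched."""
    return ENTITY_RE.sub(lambda m: table.get(m.group(1), m.group(0)), text)


print(expand("x &lt; &square-root;&alpha; &logo;", ENTITIES))
```

Here `&logo;` is left as written because the table has no entry for it; a different rendering could supply an image, a text label, or nothing at all.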
Constructing a DTD: Elements
Elements define the structure.
An element is a string of entities, bracketed by tags:
<p>This is a paragraph.</p>
<heading1>Some heading</heading1>
<author>Jane Austen</author>
<manuscript>John Hancock</manuscript>
Constructing a DTD: Grammar
Every DTD has a grammar that defines:
• allowable relationships between entities and elements
• hierarchies and nesting
• etc.
The grammar is expressed as a set of rules that can be processed automatically.
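To suggest what "rules that can be processed automatically" might look like, here is a deliberately simplified Python sketch. The `GRAMMAR` table and the `violations` function are assumptions made for illustration: the table only records which elements may appear inside which, whereas a real DTD also specifies order, repetition, and attributes, and uses its own declaration syntax rather than a Python dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar: each element maps to the set of child elements it may contain.
GRAMMAR = {
    "article": {"heading1", "author", "p"},
    "heading1": set(),
    "author": set(),
    "p": set(),
}


def violations(element, grammar):
    """Return messages for every nesting the grammar does not allow."""
    if element.tag not in grammar:
        return [f"unknown element <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in grammar[element.tag]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        problems.extend(violations(child, grammar))
    return problems


good = ET.fromstring(
    "<article><heading1>Some heading</heading1>"
    "<author>Jane Austen</author><p>This is a paragraph.</p></article>"
)
bad = ET.fromstring("<article><p><heading1>Nested heading</heading1></p></article>")

print(violations(good, GRAMMAR))
print(violations(bad, GRAMMAR))
```

The first document conforms to the table; the second is reported because a heading is nested inside a paragraph.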