html, css, url, unicode
TRANSCRIPT
HTML, CSS, URL, Unicode
Why Care About History?
Quoting HTML5 Spec section 1.6:
It must be admitted that many aspects of HTML appear at first glance to be nonsensical and inconsistent.
HTML, its supporting DOM APIs, as well as many of its supporting technologies, have been developed over a period of several decades by a wide array of people with different priorities who, in many cases, did not know of each other's existence.
Features have thus arisen from many sources, and have not always been designed in especially consistent ways. Furthermore, because of the unique characteristics of the Web, implementation bugs have often become de-facto, and now de-jure, standards, as content is often unintentionally written in ways that rely on them before they can be fixed.

HyperText Markup Language
HTML acronym: HyperText Markup Language
Combination:
• Hypertext
• Markup language
Hypertext:
• Collections of documents connected by hyperlinks
• (Paul Otlet, philosophical treatise (1934); Vannevar Bush, hypothetical Memex system (1945); Ted Nelson introduced hypertext (1968))

Markup Languages
Notation for adding formal structure to text
Standard Generalized Markup Language, SGML (1986)
(dates to Charles Goldfarb, the INLINE system (1970))
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
  <!ATTLIST greeting style (big|small) "small">
  <!ENTITY hi "Hello">
]>
<greeting style="big"> &hi; world! </greeting>

The Origins of the WWW
WWW was invented by Tim Berners-Lee at CERN (1989)
• Hypertext on the Internet (replacing FTP)
Three constituents: HTML + URL + HTTP
• HTML is an SGML language for hypertext
• URL is a notation for locating files on servers
• HTTP is a high-level protocol for file transfers

The Design of HTML
Simple, purist design principles
HTML describes the logical structure of a document
Browsers are free to interpret tags differently
HTML is a lightweight file format
Size of file containing just "Hello World!":
PostScript   11,274 bytes
PDF           4,915 bytes
MS Word      19,456 bytes
HTML             28 bytes

The History of HTML
1992: HTML 1.0, Tim Berners-Lee original proposal
1993: HTML+, some physical layout
1994: HTML 2.0, standard with best features
1995: Non-standard Netscape features
1996: Competing Netscape and Explorer features
1996: HTML 3.2, the Browser Wars end
1997: HTML 4.0, stylesheets are introduced
1999: HTML 4.01, we have a winner!
2000: XHTML 1.0, an XML version of HTML 4.01
2001: XHTML 1.1, modularization
2002: XHTML 2.0, simplified and generalized
2006/7-?: (X)HTML5, development of HTML and XHTML

Uniform Resource Locator
A Web resource is located by a URL
http://www.w3.org/TR/html4/
(scheme: http, server: www.w3.org, path: /TR/html4/)
Relative URL
sgml/dtd.html
Fragment identifier
http://www.w3.org/TR/HTML4/#minitoc

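The parts named on this slide can be pulled apart with a standard URL parser. A minimal sketch in Python, assuming the standard library module urllib.parse (not part of the slides); the variable names are illustrative only.

```python
from urllib.parse import urlsplit, urljoin

# The absolute URL from the slide, split into scheme, server, and path.
base = "http://www.w3.org/TR/html4/"
parts = urlsplit(base)
print(parts.scheme, parts.netloc, parts.path)

# A relative URL is resolved against the URL of the document that contains it.
resolved = urljoin(base, "sgml/dtd.html")
print(resolved)

# The fragment identifier is the part after '#'.
fragment = urlsplit("http://www.w3.org/TR/HTML4/#minitoc").fragment
print(fragment)
```

The fragment is interpreted by the browser after the document has been fetched; it selects a location inside the document.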
URIs, URNs, and IRIs
Uniform Resource Identifier (URI)
scheme:scheme-specific-part
Conventions about use of /, #, and ?
Uniform Resource Name (URN)
urn:isbn:0-471-94128-X
Internationalized Resource Identifier (IRI)
http://www.blåbærgrød.dk/blåbærgrød.html
http://www.xn--blbrgrd-fxak7p.dk/bl%E5b%E6rgr%F8d.html

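The ASCII form of the IRI above can be reproduced step by step. A hedged sketch in Python, assuming the built-in "idna" codec for the host name and urllib.parse.quote for the path; the slide's path uses ISO-8859-1 bytes, and the UTF-8 variant is shown only for comparison.

```python
from urllib.parse import quote

host = "www.blåbærgrød.dk"
path = "blåbærgrød.html"

# Host names are converted label by label to the ASCII "xn--" (Punycode) form.
ascii_host = host.encode("idna").decode("ascii")

# Non-ASCII path characters are percent-encoded byte by byte,
# so the result depends on which character encoding is chosen.
latin1_path = quote(path, encoding="iso-8859-1")  # the form shown on the slide
utf8_path = quote(path, encoding="utf-8")         # shown for comparison

print("http://" + ascii_host + "/" + latin1_path)
print("http://" + ascii_host + "/" + utf8_path)
```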
Survivor’s Guide to HTML
Overall structure of an HTML document
<html>
<head>
<title>The Title of the Document</title>
</head>
<body bgcolor="white">
...
</body>
</html>

Simple Formatting (1/2)
<html>
<head>
<title>Good Advice</title>
</head>
<body>
<h1>Good Advice for Everyday Life</h1>
<h2>For UNIX programmers</h2>
<b>Never</b> type:
<p><tt>rm -rf /*</tt><p>
on your computer.
<h2>For Nuclear Scientists</h2>
<b>Never</b> press the
<i>Big <font color="red">Red</font> Button</i>.
</body>
</html>

Simple Formatting (2/2)
[Figure: the document above as rendered in a browser]

More Formatting
<html>
<head>
<title>Things To Do</title>
</head>
<body>
<ol>
<li>Feed the cat.
<li>Try out the shell command:
<pre>foreach x ( `ls` )
cat $x | tr "aeiouy" "x" > $x
end</pre>
<li>Buy ticket for Timbuktu.
</ol>
</body>
</html>

Hyperlinks: Source Document
<html>
<head>
<title>Source Document</title>
</head>
<body>
<a href="target.html#danger">Better look here</a>.
</body>
</html>

Hyperlinks: Target Document
<html>
<head>
<title>Target Document</title>
</head>
<body>
...
<a name="danger"></a>
<h2>Chapter 17: Dangerous Shell Commands</h2>
Never execute a shell command that inadvertently changes
all vowels to the character 'x'.
</body>
</html>

Tables
<table border="1">
<tr>
<td>PostScript</td>
<td align="right">11,274 bytes</td>
</tr>
<tr>
<td>PDF</td>
<td align="right">4,915 bytes</td>
</tr>
<tr>
<td>MS Word</td>
<td align="right">19,456 bytes</td>
</tr>
<tr>
<td>HTML</td>
<td align="right">28 bytes</td>
</tr>
</table>

Fill-Out Forms
Collects named values from the client:
<form method="get" action="http://www.google.com/search">
<input type="text" name="q">
<input type="submit" name="btnG" value="Google Search">
</form>

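With method="get", the browser appends the named values to the action URL as a query string. A rough sketch of that step in Python, assuming urllib.parse.urlencode approximates the browser's form encoding; the text typed into the field ("hello world") is a made-up example value.

```python
from urllib.parse import urlencode

action = "http://www.google.com/search"

# One (name, value) pair per successful form control, in document order.
# "hello world" stands in for whatever the user typed into the text field.
fields = [("q", "hello world"), ("btnG", "Google Search")]

# Spaces become '+', and pairs are joined with '&'.
request_url = action + "?" + urlencode(fields)
print(request_url)
```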
GUI Elements
<input name="foo" type="text" size="20">
<hr>
<input name="bar" type="radio" value="s">Small
<input name="bar" type="radio" value="m">Medium
<input name="bar" type="radio" value="l">Large
<hr>
<input name="baz" type="checkbox" value="c">Cheese
<input name="baz" type="checkbox" value="p">Pepperoni
<input name="baz" type="checkbox" value="a">Anchovies
<hr>
<select name="bar">
<option value="s">Small
<option value="m">Medium
<option value="l">Large
</select>
<hr>
<select name="baz" multiple>
<option value="c">Cheese
<option value="p">Pepperoni
<option value="a">Anchovies
</select>
<hr>
<textarea name="foo" rows="5" cols="20">Write something here...</textarea>
<hr>
<input name="foo" type="password" value="tomato">
<hr>
<input name="foo" type="file">
<hr>
<input name="foo" type="hidden" value="you can't see this">
<hr>
<input name="qux" type="image" src="Denmark.gif">
<hr>
<input type="submit" value="Submit this form">
<hr>
<input type="reset" value="Reset this form">

Logical Versus Physical
Logical structure:
• the page starts with a header
• the entries are written in a list
• numbers are emphasized
Physical layout:
• headers are centered, huge, and grey
• lists have square bullets
• emphasis is rendered in bold-style italics

Survivor’s Guide to CSS
Cascading Stylesheets separate structure from layout
The essential concepts are selectors and properties
Properties may have different values:
color        red, yellow, rgb(212,120,20)
font-style   normal, italic, oblique
font-size    12pt, larger, 150%, 1.5em
text-align   left, right, center, justify
line-height  normal, 1.2em, 120%
display      block, inline, list-item, none

Structure of a Stylesheet
A selector is a list of tag names
For each selector, some properties are assigned values:
b {color: red; font-size: 12pt}
i {color: green}
Longer selectors give context sensitivity:
table b {color: red; font-size: 12pt}
form b {color: yellow; font-size: 12pt}
i {color: green}
The most specific selector is chosen to apply

Classes
HTML elements may have a class attribute
<p><b class="firstword">First</b> we add structure, then <b>style</b>
Classes can be denoted in selectors using a dot (.)
p b.firstword {font-style: italic}
Classes have high specificity

Specificity in Action
<html>
<head>
<style type="text/css">
b {color: red;}
b b {color: blue;}
b.foo {color: green;}
b b.foo {color: yellow;}
b.bar {color: maroon;}
</style>
<title>CSS Test</title>
</head>
<body>
<b class=foo>Hey!</b>
<b>Wow!
<b>Amazing!</b>
<b class=foo>Impressive!</b>
<b class=bar>k00l!</b>
<i>Fantastic!</i>
</b>
</body>
</html>
Rendered as:
Hey! Wow! Amazing! Impressive! K00l! Fantastic!

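Which rule wins for each <b> element can be worked out mechanically. The sketch below is a simplified model in Python, not the full CSS cascade: it assumes selectors consist only of tag names and .classes separated by spaces, counts (classes, tags) as the specificity, and breaks ties by rule order. The function names and the path representation are invented for this illustration.

```python
# Rules from the slide, in stylesheet order.
RULES = [("b", "red"), ("b b", "blue"), ("b.foo", "green"),
         ("b b.foo", "yellow"), ("b.bar", "maroon")]

def specificity(selector):
    """Return (class_count, tag_count); classes outweigh tag names."""
    classes = tags = 0
    for part in selector.split():
        tag, *class_names = part.split(".")
        tags += 1 if tag else 0
        classes += len(class_names)
    return (classes, tags)

def compound_matches(part, element):
    """Does one selector part (e.g. 'b.foo') match a (tag, class) element?"""
    tag, *class_names = part.split(".")
    element_tag, element_class = element
    return (not tag or tag == element_tag) and all(c == element_class for c in class_names)

def matches(selector, path):
    """path lists (tag, class) pairs from the outermost ancestor to the element."""
    parts = selector.split()
    if not compound_matches(parts[-1], path[-1]):
        return False
    ancestors = iter(reversed(path[:-1]))  # nearest ancestor first
    return all(any(compound_matches(p, a) for a in ancestors)
               for p in reversed(parts[:-1]))

def winning_color(path):
    """Color set by the most specific matching rule (later rules win ties)."""
    candidates = [(specificity(s), i, color)
                  for i, (s, color) in enumerate(RULES) if matches(s, path)]
    return max(candidates)[2] if candidates else None

outer = ("b", None)  # the <b> that contains "Wow!"
print(winning_color([("b", "foo")]))         # Hey!
print(winning_color([outer]))                # Wow!
print(winning_color([outer, ("b", None)]))   # Amazing!
print(winning_color([outer, ("b", "foo")]))  # Impressive!
print(winning_color([outer, ("b", "bar")]))  # k00l!
```

No rule matches the <i> element, so under normal CSS inheritance "Fantastic!" takes the color of the enclosing <b>; inheritance is not modeled in this sketch.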
Applying a Stylesheet
h1 { color: #888; font: 50px/50px "Impact"; text-align: center; }
ul { list-style-type: square; }
em { font-style: italic; font-weight: bold; }

<html>
<head>
<title>Phone Numbers</title>
<link href="style.css" rel="stylesheet" type="text/css">
</head>
<body>
<h1>Phone Numbers</h1>
<ul>
<li>John Doe, <em>(202) 555-1414</em>
<li>Jane Dow, <em>(202) 555-9132</em>
<li>Jack Doe, <em>(212) 555-1742</em>
</ul>
</body>
</html>

HTML Validity
HTML has a formal syntax specification
• 800 lines of DTD notation
A validator gives syntax errors for invalid documents
Most HTML documents on the Web are invalid:
www.microsoft.com   123 errors
www.cnn.com          58 errors
www.ibm.com          30 errors
www.google.com       27 errors
www.sun.com          19 errors

Validation Errors
Document being validated:
<html>
...
<body>
<table><b>123</i></table>
</body>
</html>
Validator output:
Line 3, column 7: document type does not allow element "BODY" here.
<body>
Line 4, column 13: document type does not allow element "B" here; assuming missing "CAPTION" start-tag
<table><b>123</i></table>
Line 4, column 20: end tag for element "I" which is not open.
<table><b>123</i></table>
Line 4, column 28: end tag for "B" omitted, but its declaration does not permit this.
<table><b>123</i></table>
Line 4, column 11: start tag was here.
<table><b>123</i></table>
Line 4, column 28: end tag for "CAPTION" omitted, but its declaration does not permit this.
<table><b>123</i></table>
Line 4, column 11: start tag was here.
<table><b>123</i></table>
Line 4, column 28: end tag for "TABLE" which is not finished.
<table><b>123</i></table>
Line 6, column 6: end tag for "HTML" which is not finished.
</html>

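Two of the messages above (an end tag that is not open, and an omitted end tag) come from simple tag bookkeeping. The sketch below illustrates only that bookkeeping in Python, assuming the standard html.parser module; it is not a DTD-based validator, it knows nothing about which elements may nest or which end tags are optional, and the class and function names are invented here.

```python
from html.parser import HTMLParser

class TagBalanceChecker(HTMLParser):
    """Tracks open elements on a stack and records mismatched end tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        line, _column = self.getpos()
        if tag not in self.open_tags:
            self.errors.append(f'Line {line}: end tag for element "{tag}" which is not open')
            return
        # Everything opened after `tag` was never closed explicitly.
        while self.open_tags[-1] != tag:
            omitted = self.open_tags.pop()
            self.errors.append(f'Line {line}: end tag for "{omitted}" omitted')
        self.open_tags.pop()

def check(document):
    checker = TagBalanceChecker()
    checker.feed(document)
    checker.close()
    return checker.errors

errors = check("<table><b>123</i></table>")
for message in errors:
    print(message)
```

A real validator additionally checks the content model of each element, which is where messages such as the missing "CAPTION" start-tag come from.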
Reasons for Invalidity
Ignorance of the HTML standard
Forgiving browsers try to interpret invalid input
<h2>Lousy HTML</h1>
<li><a>This is not very</b> good.
<li><i>In fact, it is quite bad</em>
</ul>
But the browser does <a naem="goof">something.
Lack of testing
• "This page is optimized for the XYZ browser"
• "This page is best viewed in 1024x768"
Automatic tools generate invalid HTML output

Problems with Invalidity
There are several different browsers
Each browser has many different implementations
Each implementation must interpret invalid HTML
There are many arbitrary choices to make
The fact that browsers accept invalid HTML
• has undermined the HTML standard
• makes HTML render differently in most browsers

Bytes vs. Characters
HTML documents are text, often stored on disk
Logically, a text file is a sequence of characters
But physically, a sequence of bytes
Several mappings from bytes to characters exist:
• ASCII (http://www.asciitable.com/)
• ISO-8859-1 (aka Latin-1)
• EBCDIC
• Unicode (UTF-8, UTF-16, UTF-32)
Unicode aims to cover all characters in all past or present written languages

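The same character maps to different bytes under different mappings, and the same byte maps to different characters. A small sketch in Python, assuming its built-in codecs; "cp037" is one of several EBCDIC code pages and is picked here only as an example.

```python
# The letter A as bytes (shown in hex) under several mappings.
encodings = ("ascii", "iso-8859-1", "cp037", "utf-8", "utf-16-be", "utf-32-be")
as_bytes = {name: "A".encode(name).hex() for name in encodings}
for name, hex_bytes in as_bytes.items():
    print(f"{name:12} {hex_bytes}")

# The single byte 0xC1 read back under two different mappings.
same_byte = bytes([0xC1])
print(same_byte.decode("cp037"), same_byte.decode("iso-8859-1"))
```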
Unicode Characters
A character is a symbol that appears in a text
• letters of the alphabet
• pictograms (like © and ☂)
• accents
Unicode characters are abstract entities:
• LATIN CAPITAL LETTER A
• LATIN CAPITAL LETTER A WITH RING ABOVE
• HIRAGANA LETTER SA
• RUNIC LETTER THURISAZ THURS THORN
• UMBRELLA

Unicode Glyphs
A glyph is a graphical presentation
A typical example is: Å
This may represent several characters:
• LATIN CAPITAL LETTER A WITH RING ABOVE
• ANGSTROM SIGN
Or even a sequence of characters:
• LATIN CAPITAL LETTER A
• COMBINING RING ABOVE
Some characters even result in several glyphs

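The three ways of producing the glyph Å can be told apart, and unified, programmatically. A minimal sketch in Python, assuming the standard unicodedata module; Unicode normalization (NFC here) is not mentioned on the slide and is brought in only to show that the three spellings are treated as equivalent.

```python
import unicodedata

precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"      # ANGSTROM SIGN
combining = "A\u030A"    # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

for text in (precomposed, angstrom, combining):
    print([unicodedata.name(ch) for ch in text])

# As strings they differ, although they usually look the same on screen.
print(precomposed == angstrom, precomposed == combining)

# Normalization maps all three to one canonical form.
normalized = {unicodedata.normalize("NFC", t) for t in (precomposed, angstrom, combining)}
print(normalized == {precomposed})
```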
Unicode Code Points
A code point is a unique number assigned to every Unicode character
Code points are between 0 and 1,114,111 (1,114,112 values in total)
Only around 100,000 are used today
For example
• The character HIRAGANA LETTER SA is assigned the code point 12,373
Code points 0 through 127 coincide with ASCII
Some code points are never assigned

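These facts can be checked interactively. A short sketch in Python, where ord and chr convert between characters and code points; U+FFFF is used as an example of a code point that is never assigned to a character.

```python
import sys
import unicodedata

sa = "さ"
sa_code_point = ord(sa)                 # character -> code point
print(sa_code_point, unicodedata.name(chr(sa_code_point)))

# The largest code point is 0x10FFFF, so there are 0x110000 values in total.
largest = sys.maxunicode
print(largest, largest + 1)

# Code points 0 through 127 coincide with ASCII.
ascii_agrees = all(chr(n).encode("ascii") == bytes([n]) for n in range(128))

# U+FFFF is a noncharacter: it has no name and its category is "Cn" (unassigned).
unassigned_name = unicodedata.name("\uffff", None)
unassigned_category = unicodedata.category("\uffff")
print(ascii_agrees, unassigned_name, unassigned_category)
```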
Unicode Character Encoding
A character encoding interprets a sequence of bytes as a sequence of code points
The bytes are first parsed into code units
Code units have a fixed length
One or more code units may be required to denote a code point
Examples are UTF-8, UTF-16, UTF-32

Encoding Issues: Classical Errors
• utf-8 encoded page read as iso-8859-1 encoded
• iso-8859-1 encoded page read as utf-8 encoded
• utf-8 encoded page read as utf-16

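The three classical errors are easy to reproduce by encoding text one way and decoding it another. A sketch in Python; the sample strings are made up, and errors="replace" is an assumption about how a forgiving reader handles invalid bytes (it substitutes U+FFFD).

```python
text = "blåbærgrød"

# 1. UTF-8 bytes read as ISO-8859-1: every non-ASCII letter turns into two characters.
utf8_as_latin1 = text.encode("utf-8").decode("iso-8859-1")

# 2. ISO-8859-1 bytes read as UTF-8: the lone high bytes are invalid UTF-8,
#    so each one is replaced by U+FFFD (the replacement character).
latin1_as_utf8 = text.encode("iso-8859-1").decode("utf-8", errors="replace")

# 3. UTF-8 bytes read as UTF-16: pairs of ASCII bytes fuse into unrelated characters.
utf8_as_utf16 = "abcd".encode("utf-8").decode("utf-16-le")

print(utf8_as_latin1)
print(latin1_as_utf8)
print(utf8_as_utf16)
```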
Fun with Ruby: Unicode in Source Texts
These are legal Ruby 1.9 programs

# encoding: utf-8
def ∑(x)
  sum = 0
  x.each do |i|
    sum = sum+i
  end
  sum
end
print "∑ [1, 2, 3, 4] = "
puts ∑ [1, 2, 3, 4]
#=> ∑ [1, 2, 3, 4] = 10

# encoding: utf-8
def ↣(x);x;end;
def ☁(x); !x; end;
☀ = false
☂ = ☁ ↣ ☀
puts "umbrella? #{☂}"
#=> umbrella? true

UTF-8
A multi-byte encoding of Unicode
• A code point is from 1 to 4 code units
• A code unit is a single byte
0XXXXXXX directly represents the corresponding code point
110XXXXX indicates that 2 code units are used
1110XXXX indicates that 3 code units are used
11110XXX indicates that 4 code units are used
The remaining code units look like 10XXXXXX
Xs concatenated form the code point in binary

UTF-8
Example
• 11100011 10000001 10010101
  => 0011000001010101 => 12373 (code point)
  => HIRAGANA LETTER SA (character)
  => さ (glyph)
UTF-8 has some nice properties
• Extends ASCII
• For common characters it uses only one byte
• Good chance of detecting UTF-8 text

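The bit patterns above translate directly into a decoder. The following is a teaching sketch in Python, not a conforming implementation: it follows the lead-byte patterns and concatenates the X bits, but it does not reject overlong forms or surrogate code points as a strict decoder must. The function name decode_utf8 is chosen for this example.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of code points (simplified sketch)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >> 7 == 0b0:          # 0XXXXXXX: one code unit
            length, value = 1, lead
        elif lead >> 5 == 0b110:      # 110XXXXX: two code units
            length, value = 2, lead & 0b00011111
        elif lead >> 4 == 0b1110:     # 1110XXXX: three code units
            length, value = 3, lead & 0b00001111
        elif lead >> 3 == 0b11110:    # 11110XXX: four code units
            length, value = 4, lead & 0b00000111
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError(f"truncated sequence at offset {i}")
        for unit in tail:
            if unit >> 6 != 0b10:     # remaining units must look like 10XXXXXX
                raise ValueError(f"invalid continuation byte after offset {i}")
            value = (value << 6) | (unit & 0b00111111)
        code_points.append(value)
        i += length
    return code_points

# The three bytes from the slide.
example = bytes([0b11100011, 0b10000001, 0b10010101])
decoded = decode_utf8(example)
print(decoded, chr(decoded[0]))
```

The strictness of the continuation-byte pattern is what gives the "good chance of detecting UTF-8 text": bytes produced by other encodings rarely happen to form valid sequences.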
Other Character Encodings
UTF-16: two-byte code unit, "endianness" matters
UTF-32: fixed width, four-byte code unit
ISO-8859-1: another popular character encoding
• Only 256 code points
• Single-byte code units
• Coincides with ASCII on code points 0-127
• Cannot represent general Unicode
In all, there are hundreds of different encodings...

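A few byte dumps make the differences concrete. A sketch in Python using its built-in codecs; the character U+1F600 is an arbitrary example from outside the first 65,536 code points, chosen here to show that UTF-16 then needs two code units.

```python
sa = "\u3055"  # HIRAGANA LETTER SA

# UTF-16: one two-byte code unit, whose byte order depends on endianness.
utf16_big = sa.encode("utf-16-be")
utf16_little = sa.encode("utf-16-le")

# UTF-32: always one four-byte code unit per code point.
utf32_big = sa.encode("utf-32-be")

# Code points above 0xFFFF need two UTF-16 code units (a surrogate pair).
utf16_pair = "\U0001F600".encode("utf-16-be")

# ISO-8859-1 covers only code points 0-255.
try:
    sa.encode("iso-8859-1")
    latin1_can_encode_sa = True
except UnicodeEncodeError:
    latin1_can_encode_sa = False

print(utf16_big.hex(), utf16_little.hex(), utf32_big.hex(), utf16_pair.hex())
print(latin1_can_encode_sa)
```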
Character Encodings in HTML
The document may declare its own encoding:
<meta http-equiv="Content-Type"
      content="text/html; charset=ISO-8859-1">
Unicode characters may be represented as numeric character references, e.g. &#12373; for さ

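Character references let a document in a limited encoding such as ISO-8859-1 still mention any Unicode character. A small sketch in Python, assuming the standard html module for decoding and the "xmlcharrefreplace" error handler for encoding; the sample strings are illustrative.

```python
import html

# Decimal and hexadecimal references to code point 12373 (0x3055).
from_decimal = html.unescape("&#12373;")
from_hex = html.unescape("&#x3055;")

# Named references exist for many Latin-1 characters.
from_named = html.unescape("bl&aring;b&aelig;rgr&oslash;d")

# Going the other way: characters outside the target encoding become references.
as_reference = "\u3055".encode("ascii", errors="xmlcharrefreplace")

print(from_decimal, from_hex, from_named, as_reference)
```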
World Wide Web Consortium (W3C)
Develops HTML, CSS, and most Web technology
• Read about it in the book :-)
• Reports are at various stages: from working draft to recommendation.
Consensus among members
Limited intellectual property rights
Free Web access to technical reports (unlike ISO)

Essential Online Resources
http://www.w3.org/TR/html4/
http://www.w3.org/Addressing/
http://www.w3.org/Style/CSS/
http://validator.w3.org/
http://www.w3.org/
http://unicode.org/