html, css, url, unicode


Post on 03-Feb-2022


Page 1: HTML, CSS, URL, Unicode

HTML, CSS, URL, Unicode

Page 2: HTML, CSS, URL, Unicode

Why Care About History?

Quoting HTML5 Spec section 1.6

It must be admitted that many aspects of HTML appear at first glance to be nonsensical and inconsistent.

HTML, its supporting DOM APIs, as well as many of its supporting technologies, have been developed over a period of several decades by a wide array of people with different priorities who, in many cases, did not know of each other's existence.

Features have thus arisen from many sources, and have not always been designed in especially consistent ways. Furthermore, because of the unique characteristics of the Web, implementation bugs have often become de-facto, and now de-jure, standards, as content is often unintentionally written in ways that rely on them before they can be fixed.

Page 3: HTML, CSS, URL, Unicode

HyperText Markup Language

HTML acronym: Hyper-Text Markup Language
• Combination:
  • Hypertext
  • Markup language

Hypertext:
• Collections of documents connected by hyperlinks
• (Paul Otlet, philosophical treatise (1934), Vannevar Bush, hypothetical Memex system (1945), Ted Nelson introduced hypertext (1968))

Page 4: HTML, CSS, URL, Unicode

Markup Languages

Notation for adding formal structure to text
Standard Generalized Markup Language, SGML (1986)
(dates to Charles Goldfarb, the INLINE system (1970))

<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
  <!ATTLIST greeting style (big|small) "small">
  <!ENTITY hi "Hello">
]>
<greeting style="big"> &hi; world! </greeting>

Page 5: HTML, CSS, URL, Unicode

The Origins of the WWW

WWW was invented by Tim Berners-Lee at CERN (1989)
• Hypertext on the Internet (replacing FTP)

Three constituents: HTML + URL + HTTP

HTML is an SGML language for hypertext
URL is a notation for locating files on servers
HTTP is a high-level protocol for file transfers

Page 6: HTML, CSS, URL, Unicode

The Design of HTML

Simple, purist design principles
HTML describes the logical structure of a document
Browsers are free to interpret tags differently
HTML is a lightweight file format
Size of file containing just "Hello World!":

PostScript   11,274 bytes
PDF           4,915 bytes
MS Word      19,456 bytes
HTML             28 bytes

Page 7: HTML, CSS, URL, Unicode

The History of HTML

1992: HTML 1.0, Tim Berners-Lee original proposal
1993: HTML+, some physical layout
1994: HTML 2.0, standard with best features
1995: Non-standard Netscape features
1996: Competing Netscape and Explorer features
1996: HTML 3.2, the Browser Wars end
1997: HTML 4.0, stylesheets are introduced
1999: HTML 4.01, we have a winner!
2000: XHTML 1.0, an XML version of HTML 4.01
2001: XHTML 1.1, modularization
2002: XHTML 2.0, simplified and generalized
2006/7-?: (X)HTML5, development of HTML and XHTML

Page 8: HTML, CSS, URL, Unicode

Uniform Resource Locator

A Web resource is located by a URL
http://www.w3.org/TR/html4/
(scheme: http, server: www.w3.org, path: /TR/html4/)

Relative URL
sgml/dtd.html

Fragment identifier
http://www.w3.org/TR/HTML4/#minitoc
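The three URL forms on this slide can be taken apart mechanically. The following is a minimal sketch using Python's standard `urllib.parse` module. Python does not appear in the slides and is used here only for illustration, and the variable names are made up for this example.

```python
from urllib.parse import urlsplit, urljoin, urldefrag

base = "http://www.w3.org/TR/html4/"

# Split the absolute URL into the parts labelled on the slide.
parts = urlsplit(base)
scheme, server, path = parts.scheme, parts.netloc, parts.path

# A relative URL is resolved against the URL of the document that contains it.
resolved = urljoin(base, "sgml/dtd.html")

# The fragment identifier is the part after '#'.
url_without_fragment, fragment = urldefrag("http://www.w3.org/TR/HTML4/#minitoc")

print(scheme, server, path)
print(resolved)
print(url_without_fragment, fragment)
```

Note that the relative URL only has a meaning once a base URL is known, which is why `urljoin` takes both.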

Page 9: HTML, CSS, URL, Unicode

URIs, URNs, and IRIs

Uniform Resource Identifier (URI)
scheme:scheme-specific-part

Conventions about use of /, #, and ?

Uniform Resource Name (URN)
urn:isbn:0-471-94128-X

Internationalized Resource Identifier (IRI)
http://www.blåbærgrød.dk/blåbærgrød.html

http://www.xn--blbrgrd-fxak7p.dk/bl%E5b%E6rgr%F8d.html
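The step from the IRI to the ASCII-only URL can be sketched in Python (an assumption of this example, not something the slides use). The host label goes through the standard library's `idna` codec, and the file name is percent-encoded byte by byte. The slide's path uses ISO-8859-1 bytes. Current IRI processing generally uses UTF-8 bytes instead, so both variants are shown.

```python
from urllib.parse import quote

host_label = "blåbærgrød"
file_name = "blåbærgrød.html"

# Host names are converted label by label to an ASCII form starting with "xn--".
ascii_label = host_label.encode("idna").decode("ascii")

# Path characters are percent-encoded byte by byte, so the result depends on
# which character encoding produces the bytes.
path_latin1 = quote(file_name, encoding="iso-8859-1")  # the form shown on the slide
path_utf8 = quote(file_name, encoding="utf-8")

print(ascii_label)
print(path_latin1)
print(path_utf8)
```

The two percent-encoded paths differ, so a server has to know (or guess) which encoding the client used.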

Page 10: HTML, CSS, URL, Unicode

Survivor's Guide to HTML

Overall structure of an HTML document

<html>
<head>
<title>The Title of the Document</title>
</head>
<body bgcolor="white">
...
</body>
</html>

Page 11: HTML, CSS, URL, Unicode

Simple Formatting (1/2)

<html>
<head>
<title>Good Advice</title>
</head>
<body>
<h1>Good Advice for Everyday Life</h1>
<h2>For UNIX programmers</h2>
<b>Never</b> type:
<p><tt>rm -rf /*</tt><p>
on your computer.
<h2>For Nuclear Scientists</h2>
<b>Never</b> press the
<i>Big <font color="red">Red</font> Button</i>.
</body>
</html>

Page 12: HTML, CSS, URL, Unicode

Simple Formatting (2/2)

Page 13: HTML, CSS, URL, Unicode

More Formatting

<html>
<head>
<title>Things To Do</title>
</head>
<body>
<ol>
<li>Feed the cat.
<li>Try out the shell command:
<pre>foreach x ( `ls` )
cat $x | tr "aeiouy" "x" > $x
end</pre>
<li>Buy ticket for Timbuktu.
</ol>
</body>
</html>

Page 14: HTML, CSS, URL, Unicode

Hyperlinks: Source Document

<html>
<head>
<title>Source Document</title>
</head>
<body>
<a href="target.html#danger">Better look here</a>.
</body>
</html>

Page 15: HTML, CSS, URL, Unicode

Hyperlinks: Target Document

<html>
<head>
<title>Target Document</title>
</head>
<body>
...
<a name="danger"></a>
<h2>Chapter 17: Dangerous Shell Commands</h2>
Never execute a shell command that inadvertently changes
all vowels to the character 'x'.
</body>
</html>

Page 16: HTML, CSS, URL, Unicode

Tables

<table border="1">
<tr>
<td>PostScript</td>
<td align="right">11,274 bytes</td>
</tr>
<tr>
<td>PDF</td>
<td align="right">4,915 bytes</td>
</tr>
<tr>
<td>MS Word</td>
<td align="right">19,456 bytes</td>
</tr>
<tr>
<td>HTML</td>
<td align="right">28 bytes</td>
</tr>
</table>

Page 17: HTML, CSS, URL, Unicode

Fill-Out Forms

Collects named values from the client:

<form method="get" action="http://www.google.com/search">
<input type="text" name="q">
<input type="submit" name="btnG" value="Google Search">
</form>
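What "collecting named values" means for a `method="get"` form can be sketched as follows: the name/value pairs are URL-encoded and appended to the action URL as a query string. This is a rough Python illustration using `urllib.parse.urlencode`. The text typed into the `q` field is a made-up example, and a real browser may differ in details.

```python
from urllib.parse import urlencode

action = "http://www.google.com/search"

# Named values as a browser would collect them from the form controls.
# The text typed into the "q" field is a made-up example.
fields = [("q", "hypertext markup"), ("btnG", "Google Search")]

# With method="get", the encoded pairs are appended to the action URL after "?".
request_url = action + "?" + urlencode(fields)
print(request_url)
```

Spaces become `+` and pairs are joined with `&`, which is why form values must be encoded before they can travel inside a URL.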

Page 18: HTML, CSS, URL, Unicode

GUI Elements

<input name="foo" type="text" size="20">
<hr>
<input name="bar" type="radio" value="s">Small
<input name="bar" type="radio" value="m">Medium
<input name="bar" type="radio" value="l">Large
<hr>
<input name="baz" type="checkbox" value="c">Cheese
<input name="baz" type="checkbox" value="p">Pepperoni
<input name="baz" type="checkbox" value="a">Anchovies
<hr>
<select name="bar">
<option value="s">Small
<option value="m">Medium
<option value="l">Large
</select>
<hr>
<select name="baz" multiple>
<option value="c">Cheese
<option value="p">Pepperoni
<option value="a">Anchovies
</select>
<hr>
<textarea name="foo" rows="5" cols="20">Write something here...</textarea>
<hr>
<input name="foo" type="password" value="tomato">
<hr>
<input name="foo" type="file">
<hr>
<input name="foo" type="hidden" value="you can't see this">
<hr>
<input name="qux" type="image" src="Denmark.gif">
<hr>
<input type="submit" value="Submit this form">
<hr>
<input type="reset" value="Reset this form">

Page 19: HTML, CSS, URL, Unicode

Logical Versus Physical

Logical structure
• the page starts with a header
• the entries are written in a list
• numbers are emphasized

Physical layout
• headers are centered, huge, and grey
• lists have square bullets
• emphasis is rendered in bold-style italics

Page 20: HTML, CSS, URL, Unicode

Survivor's Guide to CSS

Cascading Stylesheets separate structure from layout
The essential concepts are selectors and properties
Properties may have different values:

color         red, yellow, rgb(212,120,20)
font-style    normal, italic, oblique
font-size     12pt, larger, 150%, 1.5em
text-align    left, right, center, justify
line-height   normal, 1.2em, 120%
display       block, inline, list-item, none

Page 21: HTML, CSS, URL, Unicode

Structure of a Stylesheet

A selector is a list of tag names
For each selector, some properties are assigned values:

b {color: red; font-size: 12pt}
i {color: green}

Longer selectors give context sensitivity:

table b {color: red; font-size: 12pt}
form b {color: yellow; font-size: 12pt}
i {color: green}

The most specific selector is chosen to apply

Page 22: HTML, CSS, URL, Unicode

Classes

HTML elements may have a class attribute

<p><b class="firstword">First</b> we add structure, then <b>style</b>

Classes can be denoted in selectors using a dot (.)

p b.firstword {font-style: italic}

Classes have high specificity

Page 23: HTML, CSS, URL, Unicode

Specificity in Action

<html>
<head>
<style type="text/css">
b {color: red;}
b b {color: blue;}
b.foo {color: green;}
b b.foo {color: yellow;}
b.bar {color: maroon;}
</style>
<title>CSS Test</title>
</head>
<body>
<b class=foo>Hey!</b>
<b>Wow!
<b>Amazing!</b>
<b class=foo>Impressive!</b>
<b class=bar>k00l!</b>
<i>Fantastic!</i>
</b>
</body>
</html>

Rendered result: Hey! Wow! Amazing! Impressive! K00l! Fantastic!
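How "the most specific selector" is chosen for the rules on this slide can be sketched as a small calculation. The Python below is an illustration under simplifying assumptions: selectors consist only of tag names, `.class` and `#id` parts separated by spaces, and which selectors match an element is listed by hand rather than computed. The function names are hypothetical.

```python
import re

def specificity(selector):
    """Return (ids, classes, tag names) for a selector built only from
    tag names, .classes and #ids separated by spaces."""
    ids = len(re.findall(r"#[\w-]+", selector))
    classes = len(re.findall(r"\.[\w-]+", selector))
    tags = len(re.findall(r"(?:^|\s)[A-Za-z][\w-]*", selector))
    return (ids, classes, tags)

# The rules from the slide, in source order.
RULES = [
    ("b", "red"),
    ("b b", "blue"),
    ("b.foo", "green"),
    ("b b.foo", "yellow"),
    ("b.bar", "maroon"),
]

def winning_color(matching_selectors):
    """Pick the color among the rules whose selector matches an element.
    Higher specificity wins; on a tie the later rule wins."""
    candidates = [
        (specificity(sel), index, color)
        for index, (sel, color) in enumerate(RULES)
        if sel in matching_selectors
    ]
    return max(candidates)[2]

# Matching is done by hand: each list names the selectors that match the element.
print(winning_color(["b", "b.foo"]))                    # Hey!
print(winning_color(["b"]))                             # Wow!
print(winning_color(["b", "b b"]))                      # Amazing!
print(winning_color(["b", "b b", "b.foo", "b b.foo"]))  # Impressive!
print(winning_color(["b", "b b", "b.bar"]))             # k00l!
```

The last case is the instructive one: `b.bar` has one class and one tag name, which outranks `b b` with two tag names and no class. The `<i>` element matches none of the rules and inherits its color from the enclosing `<b>`.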

Page 24: HTML, CSS, URL, Unicode

Applying a Stylesheet

h1 { color: #888; font: 50px/50px "Impact"; text-align: center; }
ul { list-style-type: square; }
em { font-style: italic; font-weight: bold; }

<html>
<head>
<title>Phone Numbers</title>
<link href="style.css" rel="stylesheet" type="text/css">
</head>
<body>
<h1>Phone Numbers</h1>
<ul>
<li>John Doe, <em>(202) 555-1414</em>
<li>Jane Dow, <em>(202) 555-9132</em>
<li>Jack Doe, <em>(212) 555-1742</em>
</ul>
</body>
</html>

Page 25: HTML, CSS, URL, Unicode

HTML Validity

HTML has a formal syntax specification
800 lines of DTD notation
A validator gives syntax errors for invalid documents
Most HTML documents on the Web are invalid:

www.microsoft.com   123 errors
www.cnn.com          58 errors
www.ibm.com          30 errors
www.google.com       27 errors
www.sun.com          19 errors

Page 26: HTML, CSS, URL, Unicode

Validation Errors

Input document:

<html>
<body>
<table><b>123</i></table>
</body>
</html>

Validator output:

Line 3, column 7: document type does not allow element "BODY" here.
<body>

Line 4, column 13: document type does not allow element "B" here; assuming missing "CAPTION" start-tag
<table><b>123</i></table>

Line 4, column 20: end tag for element "I" which is not open.
<table><b>123</i></table>

Line 4, column 28: end tag for "B" omitted, but its declaration does not permit this.
<table><b>123</i></table>

Line 4, column 11: start tag was here.
<table><b>123</i></table>

Line 4, column 28: end tag for "CAPTION" omitted, but its declaration does not permit this.
<table><b>123</i></table>

Line 4, column 11: start tag was here.
<table><b>123</i></table>

Line 4, column 28: end tag for "TABLE" which is not finished.
<table><b>123</i></table>

Line 6, column 6: end tag for "HTML" which is not finished.
</html>
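Two of the messages above (an end tag for an element that is not open, and an omitted end tag) can be reproduced with a very small checker. The sketch below uses Python's standard `html.parser` and only tracks tag nesting. It is not a DTD validator, it knows nothing about which elements may appear where, and it would wrongly flag elements such as `<br>` or `<li>` whose end tags are optional. The class and function names are made up for this example.

```python
from html.parser import HTMLParser

class TagBalanceChecker(HTMLParser):
    """Track open elements on a stack and record nesting problems."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            self.problems.append(f"end tag </{tag}> for element which is not open")
            return
        # Close everything opened after the matching start tag.
        while self.open_tags[-1] != tag:
            self.problems.append(f"end tag for <{self.open_tags.pop()}> omitted")
        self.open_tags.pop()

def find_tag_problems(markup):
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    # Anything still open at the end of input was never closed.
    leftovers = [f"end tag for <{t}> omitted" for t in reversed(checker.open_tags)]
    return checker.problems + leftovers

for problem in find_tag_problems("<table><b>123</i></table>"):
    print(problem)
```

A real validator reports more than this because it also checks each element against the content model in the DTD, which is where the "CAPTION" messages come from.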

Page 27: HTML, CSS, URL, Unicode

Reasons for Invalidity

Ignorance of the HTML standard
Forgiving browsers try to interpret invalid input

<h2>Lousy HTML</h1>
<li><a>This is not very</b> good.
<li><i>In fact, it is quite bad</em>
</ul>
But the browser does <a naem="goof">something.

Lack of testing
• "This page is optimized for the XYZ browser"
• "This page is best viewed in 1024x768"

Automatic tools generate invalid HTML output

Page 28: HTML, CSS, URL, Unicode

Problems with Invalidity

There are several different browsers
Each browser has many different implementations
Each implementation must interpret invalid HTML
There are many arbitrary choices to make

That browsers do accept invalid HTML
• has undermined the HTML standard
• HTML renders differently in most browsers

Page 29: HTML, CSS, URL, Unicode

Bytes vs. Characters

HTML documents are text, often stored on disk
Logically, a text file is a sequence of characters
But physically, a sequence of bytes
Several mappings from bytes to characters exist:
• ASCII (http://www.asciitable.com/)
• ISO-8859-1 (aka Latin-1)
• EBCDIC
• Unicode (UTF-8, UTF-16, UTF-32)

Unicode aims to cover all characters in all past or present written languages
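The difference between the logical and the physical view shows up as soon as one character is written out under several of these mappings. A small Python sketch, using only the built-in `str.encode` (the choice of the character Å is just an example):

```python
text = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE, one character

# The same character is stored as different byte sequences under different mappings.
for encoding in ("iso-8859-1", "utf-8", "utf-16-be", "utf-32-be"):
    data = text.encode(encoding)
    print(encoding, len(data), data.hex())

# ASCII has no byte for this character at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("representable in ASCII:", ascii_ok)
```

One character becomes one, two, or four bytes depending on the mapping, so a byte count is never a character count unless the encoding is known.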

Page 30: HTML, CSS, URL, Unicode

Unicode Characters

A character is a symbol that appears in a text
• letters of the alphabet
• pictograms (like © and ☂)
• accents

Unicode characters are abstract entities:
• LATIN CAPITAL LETTER A
• LATIN CAPITAL LETTER A WITH RING ABOVE
• HIRAGANA LETTER SA
• RUNIC LETTER THURISAZ THURS THORN
• UMBRELLA

Page 31: HTML, CSS, URL, Unicode

Unicode Glyphs

A glyph is a graphical presentation
A typical example is: Å
This may represent several characters:
• LATIN CAPITAL LETTER A WITH RING ABOVE
• ANGSTROM SIGN

Or even a sequence of characters:
• LATIN CAPITAL LETTER A
• COMBINING RING ABOVE

Some characters even result in several glyphs
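The three ways of producing the glyph Å can be inspected with Python's standard `unicodedata` module. This is an illustrative sketch. The variable names are made up, and normalization is mentioned only as one common way to treat the variants as equal.

```python
import unicodedata

precomposed = "\u00c5"  # one character
angstrom = "\u212b"     # a different character with the same glyph
combining = "A\u030a"   # two characters: base letter plus combining mark

for s in (precomposed, angstrom, combining):
    print([unicodedata.name(ch) for ch in s])

# As strings the three are all different...
all_different = len({precomposed, angstrom, combining}) == 3

# ...but canonical normalization (NFC) maps them to the same sequence.
normalized = {unicodedata.normalize("NFC", s) for s in (precomposed, angstrom, combining)}
print(all_different, normalized)
```

This is why comparing text that merely looks identical on screen can fail unless both sides are normalized first.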

Page 32: HTML, CSS, URL, Unicode

Unicode Code Points

A code point is a unique number assigned to every Unicode character
Code points are between 0 and 1,114,111 (1,114,112 values in total)
Only around 100,000 are used today
For example
• The character HIRAGANA LETTER SA is assigned the code point 12,373
Code points 0 through 127 coincide with ASCII
Some code points are never assigned
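The numbers on this slide can be checked directly in Python, whose built-in `ord` and `chr` convert between characters and code points. A short sketch (variable names are illustrative):

```python
import sys
import unicodedata

sa = "\u3055"  # HIRAGANA LETTER SA

code_point = ord(sa)               # character -> code point
name = unicodedata.name(sa)        # character -> Unicode character name
largest = sys.maxunicode           # highest code point Python accepts
ascii_a = "A".encode("ascii")[0]   # ASCII byte value of the letter A

print(code_point, name)
print(largest, hex(largest))
print(ord("A") == ascii_a)
```

The upper limit is hexadecimal 10FFFF, and the code point of `A` equals its ASCII value, as the slide states for the range 0 through 127.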

Page 33: HTML, CSS, URL, Unicode

Unicode Character Encoding

A character encoding interprets a sequence of bytes as a sequence of code points
The bytes are first parsed into code units
Code units have a fixed length
One or more code units may be required to denote a code point
Examples are UTF-8, UTF-16, UTF-32

Page 34: HTML, CSS, URL, Unicode

Encoding Issues: Classical Errors

• utf-8 encoded page read as iso-8859-1 encoded
• iso-8859-1 encoded page read as utf-8 encoded
• utf-8 encoded page read as utf-16
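The first two errors are easy to reproduce. The Python sketch below encodes a sample word (chosen for this example) one way and decodes it the other way, which is what happens when a page is served with the wrong charset label.

```python
text = "blåbærgrød"

# Case 1: UTF-8 bytes interpreted as ISO-8859-1. Every byte is a valid
# ISO-8859-1 character, so decoding succeeds, but each non-ASCII letter
# turns into two wrong characters.
utf8_as_latin1 = text.encode("utf-8").decode("iso-8859-1")

# Case 2: ISO-8859-1 bytes interpreted as UTF-8. The lone high bytes are not
# valid UTF-8, so a strict decoder fails; a lenient one substitutes U+FFFD.
latin1_bytes = text.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
latin1_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(utf8_as_latin1)
print(strict_decode_failed, latin1_as_utf8)
```

The asymmetry is useful in practice: the second mistake is detectable because the bytes are not valid UTF-8, while the first one silently produces garbage.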

Page 35: HTML, CSS, URL, Unicode

Fun with Ruby: Unicode in Source Texts

These are legal Ruby 1.9 programs

# encoding: utf-8
def ∑(x)
  sum = 0
  x.each do |i|
    sum = sum+i
  end
  sum
end
print "∑ [1, 2, 3, 4] = "
puts ∑ [1, 2, 3, 4]
#=> ∑ [1, 2, 3, 4] = 10

# encoding: utf-8
def ↣(x);x;end;
def ☁(x); !x; end;
☀ = false
☂ = ☁ ↣ ☀
puts "umbrella? #{☂}"
#=> umbrella? true

Page 36: HTML, CSS, URL, Unicode

UTF-8

A multi-byte encoding of Unicode
• A code point is from 1 to 4 code units
• A code unit is a single byte
0XXXXXXX directly represents the corresponding code point
110XXXXX indicates that 2 code units are used
1110XXXX indicates that 3 code units are used
11110XXX indicates that 4 code units are used
The remaining code units look like 10XXXXXX
Xs concatenated form the code point in binary

Page 37: HTML, CSS, URL, Unicode

UTF-8

Example
• 11100011 10000001 10010101
  => 0011000001010101 => 12373 (code point)
  => HIRAGANA LETTER SA (character)
  => さ (glyph)

UTF-8 has some nice properties
• Extends ASCII
• For common characters it uses only one byte
• Good chance of detecting UTF-8 text
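The bit patterns of UTF-8 translate almost line by line into code. The following Python function is a teaching sketch of the encoding direction only. It is hand-written for this example (Python's own `str.encode("utf-8")` is what one would use in practice), and it does not reject invalid inputs such as surrogates or values above 1,114,111.

```python
def utf8_encode(code_point):
    """Encode one code point using the UTF-8 lead-byte and continuation-byte patterns."""
    if code_point < 0x80:       # 0XXXXXXX
        return bytes([code_point])
    if code_point < 0x800:      # 110XXXXX 10XXXXXX
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:    # 1110XXXX 10XXXXXX 10XXXXXX
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

encoded = utf8_encode(12373)  # HIRAGANA LETTER SA
bits = " ".join(f"{byte:08b}" for byte in encoded)
print(bits)
```

For code point 12373 this yields the same three bytes as in the example, and the result agrees with the built-in encoder.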

Page 38: HTML, CSS, URL, Unicode

Other Character Encodings

UTF-16: two-byte code unit, "endianness" matters
UTF-32: fixed width, four-byte code unit
ISO-8859-1: another popular character encoding
• Only 256 code points
• Single-byte code units
• Coincides with ASCII on code points 0-127
• Cannot represent general Unicode
In all, there are hundreds of different encodings...

Page 39: HTML, CSS, URL, Unicode

Character Encodings in HTML

The document may declare its own encoding:

<meta http-equiv="Content-Type"
      content="text/html; charset=ISO-8859-1">

Unicode characters may be represented as:
&#12373;
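A numeric character reference such as `&#12373;` simply names a code point, so it works regardless of the document's declared encoding. A short Python sketch using the standard `html` module shows the round trip. The helper `to_reference` is a made-up name for this example.

```python
import html

# A numeric character reference names a code point in decimal,
# or in hexadecimal when prefixed with "x".
decoded = html.unescape("&#12373;")
same_in_hex = html.unescape("&#x3055;")

def to_reference(ch):
    """Write a single character as a decimal character reference."""
    return f"&#{ord(ch)};"

print(decoded, same_in_hex, to_reference(decoded))
```

This is how a page declared as ISO-8859-1 can still display HIRAGANA LETTER SA even though that encoding has no byte for it.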

Page 40: HTML, CSS, URL, Unicode

World Wide Web Consortium (W3C)

Develops HTML, CSS, and most Web technology
• Read about it in the book :-)
• Reports are at various stages: from working draft to recommendation.

Consensus among members

Limited intellectual property rights

Free Web access to technical reports (unlike ISO)

Page 41: HTML, CSS, URL, Unicode

Essential Online Resources

http://www.w3.org/TR/html4/

http://www.w3.org/Addressing/

http://www.w3.org/Style/CSS/

http://validator.w3.org/

http://www.w3.org/

http://unicode.org/