Text manipulation

Suppose you want to build a web-page which will always contain the latest sports headlines collected from several newspaper websites

You might, for example, wish to include the Guardian’s sports headlines on your page

Adding these headlines manually

• You would have to access the source of the Guardian page

• You would then have to find the text which defines the headlines

• Analyse it

• And copy the relevant bits into the HTML for your own web-page

• Examining it, we find that the source contains one HTML table for each sport in the list of top stories

• Here is the table for the tennis headlines on the page seen earlier:

<table cellspacing="0"><tr>

<td class="imgholder"><a HREF="/tennis/story/0,10069,1581862,00.html"><img src="http://image.guardian.co.uk/sys-images/Sport/Pix/pictures/2005/09/30/andy2.jpg" width="128" height="128" border="0" alt="Andy Murray in action during his win over Robby Ginepri" /></a></td>

<td><a HREF="/tennis/story/0,10069,1581862,00.html">Murray magic books semi spot</a> Tennis: The biggest win of his career to-date saw Andy Murray stun Robby Ginepri and reach the last four at the Thailand Open.

<a HREF="/tennis/story/0,10069,1580918,00.html">Tough home Davis Cup tie for GB</a> <a HREF="/tennis/0,10067,495916,00.html">More tennis</a></td>

</tr></table>

• Here is the text which defines the main tennis headline on the page shown earlier:



<a HREF="/tennis/story/0,10069,1581862,00.html">

Murray magic books semi spot

</a>

 

Tennis: The biggest win of his career to-date saw Andy Murray stun Robby Ginepri and reach the last four at the Thailand Open.

• To get this story onto your own web-page you could then copy the relevant HTML segment into the source code for your web-page

• But …

• … doing this manually is very labour-intensive

• We ought to automate the complete task

Adding headlines automatically

• To add headlines automatically, you would have to write a program which would

– Download the source code for the Guardian page

– Analyse this source code to extract the appropriate text

– Add the relevant text to the source code for your own web-page
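The three steps above might be sketched as follows. This is only an illustration under stated assumptions: the download step is skipped and replaced by a hard-coded sample of the Guardian source, the helper name extractHeadline is hypothetical, and it uses preg_match (another of PHP's pre-defined regular expression functions) together with pattern features that are explained later.

```php
<?php
// Hypothetical helper: return the link target and the headline text of the
// first <a HREF="...">...</a> element found in $html, or null if none is found.
function extractHeadline(string $html): ?array
{
    // [^"]* means "any characters up to the closing quote";
    // [^<]* means "any characters up to the start of the next tag".
    if (preg_match('/<a HREF="([^"]*)">([^<]*)<\/a>/i', $html, $m) === 1) {
        return ['url' => $m[1], 'text' => trim($m[2])];
    }
    return null;
}

// Step 1 (download) is not shown: a sample of the page source stands in for it.
$source = '<td><a HREF="/tennis/story/0,10069,1581862,00.html">Murray magic books semi spot</a></td>';

// Step 2: analyse the source to extract the appropriate text.
$headline = extractHeadline($source);

// Step 3: add the relevant text to the HTML for our own page.
$myPage = '<ul><li><a href="' . $headline['url'] . '">' . $headline['text'] . '</a></li></ul>';

echo $myPage;
```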

Adding headlines automatically

• Later, we will see how to download page sources from other websites

• Now, we will focus on the issue of text analysis

Regular Expressions

• Regular expression technology provides a convenient way of searching a string for patterns of interest

Regular expressions (contd.)

• Example regular expression:

/ab*c/

this searches the target string for substring(s) that comprise

“an a followed by zero or more instances of b followed by a c”

• It will match any of the following substrings:

ac

abc

abbc

abbbc

….
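A quick way to check which strings contain a match for /ab*c/ is sketched below. It assumes the preg_match function, which returns 1 when the pattern is found in the subject and 0 when it is not; the list of candidate strings is made up for illustration.

```php
<?php
$pattern = "/ab*c/";

// Made-up candidates: the first three contain a match, the last two do not.
$candidates = ["ac", "abc", "abbbc", "adc", "bc"];

$results = [];
foreach ($candidates as $candidate) {
    // preg_match returns 1 if the pattern occurs in the subject, 0 otherwise
    $results[$candidate] = preg_match($pattern, $candidate);
    echo $candidate . ": " . ($results[$candidate] === 1 ? "match" : "no match") . "\n";
}
```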

Using regular expressions in PHP

• Regular expressions are supported in several languages, including PHP

• PHP provides a group of pre-defined functions for using them

• For now, we will focus on just one of these, the preg_replace function

The preg_replace function

• Format of call:

preg_replace (regexp, replacement, subject [, int limit])

• This function returns the result of replacing substrings in subject which match regexp with replacement

• The number of matching substrings which are replaced is controlled by the optional parameter limit

• An example application is on the next slide

• PHP code

<?php

$myString = "xyzacklmabbcpqrabbbbbcstu";

echo "myString is $myString ";

$myString = preg_replace("/ab*c/","_",$myString);

echo "myString is now $myString";

?>

• Resultant output is

myString is xyzacklmabbcpqrabbbbbcstu

myString is now xyz_klm_pqr_stu

Using the limit parameter in preg_replace

• PHP code

<?php

$myString = "xyzacklmabbcpqrabbbbbcstu";

echo "myString is $myString ";

$myString = preg_replace("/ab*c/","_",$myString,1);

echo "myString is now $myString";

?>

• Resultant output is

myString is xyzacklmabbcpqrabbbbbcstu

myString is now xyz_klmabbcpqrabbbbbcstu

Meta-characters

• We have seen that certain characters have a special meaning in regular expressions:

– the example on the last few slides used the * character which means “0 or more instances of the preceding character or pattern”

• These are called meta-characters

• Other meta-characters are listed on the next slide

• The meta-characters include:

• the * character, which means “0 or more instances of preceding”

• the + character, which means “1 or more instances of preceding”

• the ? character, which means “0 or 1 instances of preceding”

• the { and } characters, which delimit an expression specifying a range of acceptable occurrences of the preceding character or pattern

• Examples:

{m} means exactly m occurrences of preceding character/pattern

{m,} means at least m occurrences of preceding char/pattern

{m,n} means at least m, but not more than n, occurrences of preceding char/pattern

• Thus,

{0,} is equivalent to *

{1,} is equivalent to +

{0,1} is equivalent to ?
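These equivalences can be checked with a small sketch like the one below, which reuses the subject string from the earlier preg_replace example. The variable names are chosen only for illustration.

```php
<?php
$subject = "xyzacklmabbcpqrabbbbbcstu";

// Each pair uses a symbol quantifier and its brace equivalent.
$star      = preg_replace("/ab*c/", "_", $subject);
$starBrace = preg_replace("/ab{0,}c/", "_", $subject);

$plus      = preg_replace("/ab+c/", "_", $subject);
$plusBrace = preg_replace("/ab{1,}c/", "_", $subject);

$opt       = preg_replace("/ab?c/", "_", $subject);
$optBrace  = preg_replace("/ab{0,1}c/", "_", $subject);

echo "* gives $star\n";   // ac, abbc and abbbbbc are all replaced
echo "+ gives $plus\n";   // ac is left alone: at least one b is required
echo "? gives $opt\n";    // only ac is replaced: at most one b is allowed
```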


• Further meta-characters are:

• the ^ character, which matches the start of a string

• the $ character, which matches the end of a string

• the . character, which matches anything except a newline character

• the [ and ] characters, which delimit an equivalence class of characters, any of which can match one character in the target string

• the ( and ) characters, which delimit a group of sub-patterns

• the | character, which separates alternative patterns


• Example expression:

/^a.*d$/

this matches the entire target string provided the target string starts with an a, followed by zero or more non-newline characters, and ends with a d


Example application

• PHP code

<?php

$myString1 = "abcdefghijklmnopqrstuvd";

echo "myString1 is $myString1 ";

$myString1 = preg_replace("/^a.*d$/","_",$myString1);

echo "myString1 is now $myString1 ";

$myString2 = "xabcdefghijklmnopqrstuvd";

echo "myString2 is $myString2 ";

$myString2 = preg_replace("/^a.*d$/","_",$myString2);

echo "myString2 is now $myString2";

?>

• Resultant output is

myString1 is abcdefghijklmnopqrstuvd

myString1 is now _

myString2 is xabcdefghijklmnopqrstuvd

myString2 is now xabcdefghijklmnopqrstuvd


• Example expression:

/^a.{2,5}d$/

this matches the entire target string, provided the target string starts with an a, followed by between two and five non-newline characters, and ends with a d


Regular expressions (contd.)

• PHP code

<?php

$myString1 = "adabbbbccccaaaabbbbccccd";

echo "myString1 is $myString1 ";

$myString1 = preg_replace("/^a.{2,5}d$/","_",$myString1);

echo "myString1 is now $myString1 ";

$myString2 = "afghd";

echo "myString2 is $myString2 ";

$myString2 = preg_replace("/^a.{2,5}d$/","_",$myString2);

echo "myString2 is now $myString2";

?>

• Resultant output is

myString1 is adabbbbccccaaaabbbbccccd

myString1 is now adabbbbccccaaaabbbbccccd

myString2 is afghd

myString2 is now _



• Example expression:

/(abc){2,5}d/

this matches sub-string(s) in the target that comprise “between 2 and 5 repeats of the pattern abc followed by a d”


• PHP code

<?php

$myString = "klmabcabcabcdpqrabcdklmabcabcabcabcdxyz";

echo "myString is $myString ";

$myString = preg_replace("/(abc){2,5}d/","_",$myString);

echo "myString is now $myString";

?>

• Resultant output is

myString is klmabcabcabcdpqrabcdklmabcabcabcabcdxyz

myString is now klm_pqrabcdklm_xyz



• Example expression:

/(foo|bar)/

this matches sub-strings foo or bar

• An example application is on the next slide

• PHP code

<?php

$myString = "abcfoodefbarghi";

echo "myString is $myString ";

$myString = preg_replace("/(foo|bar)/","_",$myString);

echo "myString is now $myString";

?>

• Resultant output is

myString is abcfoodefbarghi

myString is now abc_def_ghi


• Although some characters have special meanings in regular expressions, we may, sometimes, just want to use them to match themselves in the target string

• We do this by escaping them in the regular expression, by preceding them with a backslash \


• Example expression:

/^a\^+.*d$/

this matches the entire target string, provided the target string starts with an a, followed by one or more caret characters, followed by zero or more non-newline characters, and ends with a d


Example application

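• PHP code

<?php

$myString1 = "adabbbbcabbcabced";

echo "myString1 is $myString1 ";

$myString1 = preg_replace("/^a\^+.*d$/","_",$myString1);

echo "myString1 is now $myString1 ";

$myString2 = "a^^^abbbbcabbcabceed";

echo "myString2 is $myString2 ";

$myString2 = preg_replace("/^a\^+.*d$/","_",$myString2);

echo "myString2 is now $myString2";

?>

• Resultant output is

myString1 is adabbbbcabbcabced

myString1 is now adabbbbcabbcabced

myString2 is a^^^abbbbcabbcabceed

myString2 is now _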


• As mentioned earlier, the [ and ] characters have a special meaning in regular expressions

– they delimit an equivalence class of characters, any one of which may be used to match one character in the target string

• Example expression:

/a[KLM]b/

matches any substring comprising “the letter a followed by one of the three letters KLM, followed by the letter b”


• The ^ character has a special meaning when used as the first character between [ and ] characters; this meaning is different from its special meaning when used outside the [ and ] characters

– when used as the first character between the [ and ] characters, the ^ character specifies the complement of the equivalence class that would have been specified if it were absent

• Example expression:

/a[^KLM]b/

matches any substring comprising “the letter a followed by any single character that is not one of KLM, followed by the letter b”
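The difference between an equivalence class and its complement can be seen in a short sketch such as this one. The subject string is made up for illustration.

```php
<?php
$subject = "aKb aXb a7b aLb";

// [KLM] matches exactly one character that is K, L or M
$inClass = preg_replace("/a[KLM]b/", "_", $subject);

// [^KLM] matches exactly one character that is none of K, L, M
// (it is not restricted to letters: the digit 7 also qualifies)
$notInClass = preg_replace("/a[^KLM]b/", "_", $subject);

echo "with [KLM]:  $inClass\n";
echo "with [^KLM]: $notInClass\n";
```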


• The - character also has a special meaning when used between [ and ] characters:

– it is used to join the start and end of a sequence of characters, any one of which may be used to match one character in the target string

• Example expression:

/a[0-9]b/

matches any substring comprising “the letter a followed by one digit, followed by the letter b”




• Example expression:

/%[a-fA-F0-9]/

matches any substring comprising “a % followed by a hexadecimal digit”
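The two range examples can be tried out with a sketch like the following. The subject string is made up for illustration.

```php
<?php
$subject = "a5b axb %4f %zz 100%";

// [0-9] matches one decimal digit, so only a5b qualifies
$digits = preg_replace("/a[0-9]b/", "_", $subject);

// [a-fA-F0-9] matches one hexadecimal digit; only the % and the single
// character after it are replaced, so the f of %4f is left behind
$hex = preg_replace("/%[a-fA-F0-9]/", "_", $subject);

echo "digits: $digits\n";
echo "hex:    $hex\n";
```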

Regular expressions (contd.)

• Certain escape sequences also have a special meaning in regular expressions. They define certain commonly used equivalence classes of characters:

\w is equivalent to [a-zA-Z0-9_]

\W is equivalent to [^a-zA-Z0-9_]

\d is equivalent to [0-9]

\D is equivalent to [^0-9]

\s is equivalent to [ \n\t\f\r]

\S is equivalent to [^ \n\t\f\r]

\b denotes a word boundary

\B denotes a non-word boundary

• Note the SP (space) character in the definitions of \s and \S; that is, the white-space equivalence class includes the space character

• By the way, \f is formFeed and \r is carriageReturn



• Example expression:

/%\d\d\d\D/

matches any substring comprising “a % followed by three decimal digits, followed by a non-digit”
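A small sketch of this expression in use is given below. The subject string is made up; the pattern is written in single quotes so that PHP passes the backslashes through unchanged.

```php
<?php
$subject = "%123x %45x %6789 %000.";

// %123x  : % + three digits + non-digit x    -> replaced
// %45x   : only two digits before the x      -> left alone
// %6789  : the fourth character is a digit   -> left alone
// %000.  : % + three digits + non-digit .    -> replaced
$result = preg_replace('/%\d\d\d\D/', "_", $subject);

echo "result is $result\n";
```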


• Example expression:

/\s\w\w\s/

matches any substring comprising “a white-space character, followed by two word characters, followed by another white-space character”

• PHP code

<?php

$myString = "This is not an apple";

echo "myString is $myString ";

$myString = preg_replace("/\s\w\w\s/","_",$myString);

echo "myString is now $myString";

?>

• Resultant output is

myString is This is not an apple

myString is now This_not_apple


• The standard quantifiers are all “greedy”

– they match as many occurrences as possible without causing the pattern to fail.

• It is possible to make them “frugal”

– that is, make them match the minimum number of times necessary

• We do this by following the quantifier with a "?"

• *? Match 0 or more times, preferably only 0

• +? Match 1 or more times, preferably only 1 time

• ?? Match 0 or 1 time, preferably only 0

• {n}? Match exactly n times

• {n,}? Match at least n times, preferably only n times

• {n,m}? Match at least n but not more than m times, preferably only n times

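<?php

$myString1 = "abcabcabcabc";

echo "myString1 is $myString1 ";

$myString1 = preg_replace("/(abc){2,5}/","x",$myString1);

echo "myString1 is now $myString1 ";

$myString2 = "abcabcabcabc";

echo "myString2 is $myString2 ";

$myString2 = preg_replace("/(abc){2,5}?/","x",$myString2);

echo "myString2 is now $myString2";

?>

• Resultant output is

myString1 is abcabcabcabc

myString1 is now x

myString2 is abcabcabcabc

myString2 is now xx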

• What is going on here? See next slide for contrast

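<?php

$myString1 = "abcabcabcabc";

echo "myString1 is $myString1 ";

$myString1 = preg_replace("/(abc){2,5}/","x",$myString1,1);

echo "myString1 is now $myString1 ";

$myString2 = "abcabcabcabc";

echo "myString2 is $myString2 ";

$myString2 = preg_replace("/(abc){2,5}?/","x",$myString2,1);

echo "myString2 is now $myString2";

?>

• Resultant output is

myString1 is abcabcabcabc

myString1 is now x

myString2 is abcabcabcabc

myString2 is now xabcabc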

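• Discussion of contrast with previous slide: without a limit, the frugal pattern matches the first two repeats of abc and then matches the remaining two as well, so both versions consume the whole string (giving x and xx); with a limit of 1, only the first match is replaced, so the frugal version leaves the second abcabc untouched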
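One way to see what each quantifier actually matched is to list the matches rather than replace them. The sketch below assumes the preg_match_all function, which stores the full text of every match in element 0 of its third argument.

```php
<?php
$subject = "abcabcabcabc";

// $greedy[0] and $frugal[0] receive the full text of every match
preg_match_all("/(abc){2,5}/", $subject, $greedy);
preg_match_all("/(abc){2,5}?/", $subject, $frugal);

// The greedy quantifier takes all four repeats in a single match;
// the frugal one stops after two repeats, and then matches again.
echo "greedy matches: " . implode(", ", $greedy[0]) . "\n";
echo "frugal matches: " . implode(", ", $frugal[0]) . "\n";
```

Each match becomes one x in the preg_replace output, which is why the frugal version produces xx without a limit and xabcabc with a limit of 1.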

A digression

• Before proceeding to further regexp concepts, let’s apply what we have already seen to HTML manipulation

Example task

• Suppose we have the following HTML

<ul><li>wine</li><li>f12</li><li>cheese</li></ul>

• Suppose we want to eliminate from the list any list item whose content comprises only non-digits

• That is, we want the HTML to become

<ul><li>f12</li></ul>

<?php

$myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>";

echo "myString is $myString ";

$myString = preg_replace("/<li>\D+<\/li>/","",$myString);

echo "myString is now $myString ";

?>

• Resultant output is

myString is

wine

f12

cheese

myString is now

f12

Seeing the raw-HTML

• Suppose we want to see the raw HTML in our output

• That is, suppose we wanted to see

myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul>

myString is now <ul><li>f12</li></ul>

• We would have to replace all occurrences of < with &lt;

• We could use regular expressions for this but,

– the string to be replaced is a constant

– so we can use a simpler technology

<?php

$myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>";

echo "myString is ".str_replace("<","&lt;",$myString)." ";

$myString = preg_replace("/<li>\D+<\/li>/","",$myString);

echo "myString is now ".str_replace("<","&lt;",$myString);

?>

• Now the resultant output is

myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul>

myString is now <ul><li>f12</li></ul>
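As an aside, PHP also has a built-in function, htmlspecialchars, that performs this kind of escaping. The sketch below compares it with the str_replace approach. It is offered only as an alternative, not as the method used in these slides; note that htmlspecialchars escapes >, & and quote characters as well as <.

```php
<?php
$myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>";

// The approach used above: replace the constant string < with &lt;
$manual = str_replace("<", "&lt;", $myString);

// Built-in alternative: also turns > into &gt; (and escapes & and quotes)
$builtin = htmlspecialchars($myString);

echo $manual . "\n";
echo $builtin . "\n";
```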

• Suppose we want to replace every list item with the fixed phrase listItem

• That is, we wanted to see this output

myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul>

myString is now <ul> listItem listItem listItem </ul>


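Regular expressions (contd.)

• Suppose we try this

<?php

$myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>";

echo "myString is ".str_replace("<","&lt;",$myString)." ";

$myString = preg_replace("/<li>.+<\/li>/"," listItem ",$myString);

echo "myString is now ".str_replace("<","&lt;",$myString);

?>

• Resultant output is

myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul>

myString is now <ul> listItem </ul>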

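• What is wrong?

– the greedy .+ matches as much as possible, so a single match runs from the first <li> to the last </li>

• We need to make the + quantifier ungreedy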

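Regular expressions (contd.)

• We must do this

<?php

$myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>";

echo "myString is ".str_replace("<","&lt;",$myString)." ";

$myString = preg_replace("/<li>.+?<\/li>/"," listItem ",$myString);

echo "myString is now ".str_replace("<","&lt;",$myString);

?>

• Resultant output is

myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul>

myString is now <ul> listItem listItem listItem </ul>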
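Another way to stop a match from running across several list items, sketched here as an assumption rather than something these slides propose, is to use a complemented equivalence class: [^<]+ cannot cross the < that starts the next tag. For this particular input it gives the same result as the frugal quantifier.

```php
<?php
$myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>";

// Frugal quantifier: .+? stops at the first </li> it can reach
$frugal = preg_replace("/<li>.+?<\/li>/", " listItem ", $myString);

// Complemented class: [^<]+ matches item text but never a tag
$negated = preg_replace("/<li>[^<]+<\/li>/", " listItem ", $myString);

echo "frugal:  " . str_replace("<", "&lt;", $frugal) . "\n";
echo "negated: " . str_replace("<", "&lt;", $negated) . "\n";
```

The two patterns differ when a list item itself contains tags: the frugal version still matches such an item, while the [^<]+ version does not.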

End of digression

• Back to regular expressions ...

Regular expressions (contd.) -- remembering subpattern matches

• When a <pattern> is being matched with a target string, substrings that match sub-patterns can be remembered and re-used later in the same pattern

• Sub-patterns whose matching substrings are to be remembered are enclosed in parentheses

• The sub-patterns are implicitly numbered, starting from 1 and their matching substrings can then be re-used later in the pattern by using back-references like \1 or \2 or \3

• However, inside a double-quoted PHP string the backslash must itself be escaped, so we must type \\1 or \\2 or \\3 in our regular expressions
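The effect of PHP string quoting on a back-reference can be sketched as follows. The variable names and the subject string are made up for illustration. In a double-quoted string, an unescaped \1 is read by PHP as an octal escape (the character with code 1), so the regular expression engine never sees the back-reference.

```php
<?php
$doubleQuoted = "/([A-Z])\\1/";  // PHP turns \\ into a single backslash
$singleQuoted = '/([A-Z])\1/';   // single quotes leave \1 untouched
$unescaped    = "/([A-Z])\1/";   // "\1" becomes the character with code 1

// The first two strings are the same pattern; the third is one character shorter
var_dump($doubleQuoted === $singleQuoted);
var_dump(strlen($unescaped) === strlen($singleQuoted) - 1);

// Only a doubled capital letter is replaced
$result = preg_replace($doubleQuoted, "_", "aXXbXYc");
echo "result is $result\n";
```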

Using back-references (contd.)

• PHP code

<?php

$myString1 = "klmAklmAAklmABklmBklmBBklm";

echo "myString1 is $myString1 ";

$myString1 = preg_replace("/([A-Z])\\1/","_",$myString1);

echo "myString1 is now $myString1 ";

?>

• Resultant output is

myString1 is klmAklmAAklmABklmBklmBBklm

myString1 is now klmAklm_klmABklmBklm_klm

http://www.cs.ucc.ie/j.bowen/cs4408/slides/
