web programming week 13 old dominion university department of computer science cs 418/518 fall 2010...

26
Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein<[email protected]> 11/23/10

Upload: shawn-terry

Post on 26-Dec-2015

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Web Programming

Week 13

Old Dominion UniversityDepartment of Computer Science

CS 418/518 Fall 2010

Martin Klein<[email protected]>

11/23/10

Page 2: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Relational Data Model is a Special Case…

SELECT pi.fname, m.aces, m.unforced_errors, m.winnersFROM player_info pi, matches mWHERE pi.fname = “Andre”

AND m.opponent_name = “Sampras”AND m.year = “2002”;

Page 3: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Unstructured Datais More Common…

Page 4: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Precision and Recall

source: http://www.hsl.creighton.edu/hsl/Searching/Recall-Precision.html

how much extra stuff did you get? how much did you miss?

Page 5: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Precision and Recall

source: http://www.hsl.creighton.edu/hsl/Searching/Recall-Precision.html

10 documents in the index are relevantsearch returns 20 documents5 of which are relevant

1 out of 4 retrieved documents are relevant half of the relevant documents were retrieved

Page 6: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Precision and Recall1

10

Precision

Recall

figure 1.2in FBY

Page 7: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Why Isn’t Recall Always 100%?

Virginia Agricultural and Mechanical College?

Virginia Agricultural and Mechanical Collegeand Polytechnic Institute?

Virginia Polytechnic Institute?

Virginia Polytechnic Instituteand State University?

Virginia Tech?

Page 8: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Precision and Recall - Literature

Page 9: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

CREATE Table

mysql> CREATE TABLE ODUtennis( -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,-> title VARCHAR(200), -> body TEXT, -> );Query OK, 0 rows affected (0.00 sec)

Page 10: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

INSERTmysql> INSERT INTO ODUtennis (title, body) VALUES

-> ('Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals','The fourth-seeded Monarchs ...'),

-> ('Monarchs Close Out Season With 4-3 Win Over South Alabama','ODU closes out the 2010 tennis schedule...'),

-> ('ODU Edged By DePaul in Mens Tennis Action, 4-3','Junior Tobias Fanselow was the other Monarch…'),

-> ('ODU Mens Tennis Drops Delaware, 5-2, in CAA Action','The Old Dominion mens tennis team…doubles…Monarchs...'), -> ('Mens Tennis Nipped by #64 UNC Wilmington, 4-3','The two teams split the first two doubles matches…Monarchs…doubles...');

Query OK, 5 rows affected (0.00 sec)Records: 5 Duplicates: 0 Warnings: 0

Page 11: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

LIKE & REGEXP

• We can search rows with the “LIKE” (or “REGEXP”) operator– http://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html

– for tables of any size, this will be s-l-o-w

Page 12: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Example 1

mysql> SELECT id, title FROM ODUtennis WHERE title LIKE 'Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals';

mysql> SELECT id, title FROM ODUtennis WHERE title REGEXP 'Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals';

+----+--------------------------------------------------------------------------------------+| id | title |+----+--------------------------------------------------------------------------------------+| 1 | Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals |+----+--------------------------------------------------------------------------------------+1 row in set (0.00 sec)

Page 13: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Example 2

mysql> SELECT id, title FROM ODUtennis WHERE title REGEXP 'Monarchs';

+----+---------------------------------------------------------------------------------------+| id | title |+----+---------------------------------------------------------------------------------------+| 1 | Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals || 2 | Monarchs Close Out Season With 4-3 Win Over South Alabama |+----+----------------------------------------------------------------------------------------+2 rows in set (0.01 sec)

BUT

mysql> SELECT id, title FROM ODUtennis WHERE title LIKE 'Monarchs';Empty set (0.00 sec)

Page 14: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Full-Text Search – The Better Way

• MATCH()…AGAINST()• Performs a natural language search over index• Index = set of one ore more columns of the same table• Index as argument to MATCH()• Search string as argument to AGAINST()• If used in WHERE clause result returned in order of

relevance score– Relevance: similarity between search string and index row

Page 15: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

CREATE Table

mysql> CREATE TABLE ODUtennis ( -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,-> title VARCHAR(200), -> body TEXT, -> FULLTEXT (title,body)-> );Query OK, 0 rows affected (0.00 sec)

• can only create FULLTEXT on CHAR, VARCHAR or TEXT columns• “title” and “body” still available as regular columns• if you want to search only on “title”, you need to create a separate index

Page 16: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

INSERTmysql> INSERT INTO ODUtennis (title, body) VALUES

-> ('Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals','The fourth-seeded Monarchs ...'),

-> ('Monarchs Close Out Season With 4-3 Win Over South Alabama','ODU closes out the 2010 tennis schedule...'),

-> ('ODU Edged By DePaul in Mens Tennis Action, 4-3','Junior Tobias Fanselow was the other Monarch…'),

-> ('ODU Mens Tennis Drops Delaware, 5-2, in CAA Action','The Old Dominion mens tennis team…doubles…Monarchs...'), -> ('Mens Tennis Nipped by #64 UNC Wilmington, 4-3','The two teams split the first two doubles matches…Monarchs…doubles...');

Query OK, 5 rows affected (0.00 sec)Records: 5 Duplicates: 0 Warnings: 0

Page 17: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

MATCH .. AGAINST

mysql> SELECT * FROM ODUtennis WHERE MATCH(title,body) AGAINST('fanselow');+----+------------------------------------------------+------------------------------------------------------------------------------------+| id | title | body |+----+------------------------------------------------+------------------------------------------------------------------------------------+| 3 | ODU Edged By DePaul in Mens Tennis Action, 4-3 | Junior Tobias Fanselow was the other … |+----+------------------------------------------------+------------------------------------------------------------------------------------+1 row in set (0.00 sec)

mysql> SELECT * FROM ODUtennis WHERE MATCH(title,body) AGAINST('Monarchs');Empty set (0.00 sec)

why?!

Page 18: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Ranking

• If the word appears in > 50% of the rows then the word is considered a “stop word” and is not matched (unless you are in Boolean mode)– this makes sense for large collections (the word

is not a good discriminator of records), but can lead to unexpected results for small collections

Page 19: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Stopwords• Stopwords exist in stoplists or negative dictionaries• Idea: remove low semantic content

– index should only have “important stuff”

• What not to index is domain dependent, but often includes:– “small” words: a, and, the, but, of, an, very, etc. – NASA ADS example:

• http://adsabs.harvard.edu/abs_doc/stopwords.html

– MySQL full-text index:• http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html

Page 20: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Stopwords• Punctuation, numbers often stripped or treated as

stopwords– precision suffers on searches for:

• NASA TM-3389• F-15• X.500• .NET• Tree::Suffix

• MySQL also treats words < 4 characters as stopwords– too bad for: “Liu”, “ORF”, “DEA”, etc.

Page 21: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Getting the Rank

mysql> SELECT id, MATCH(title,body) AGAINST('doubles') from ODUtennis;

+----+---------------------------------------------------+| id | MATCH(title,body) AGAINST('doubles') |+----+---------------------------------------------------+| 1 | 0 || 2 | 0 || 3 | 0 || 4 | 0.28169098496437 || 5 | 0.61670464277267 |+----+---------------------------------------------------+5 rows in set (0.00 sec)

Page 22: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Getting the Rank in Order

mysql> SELECT id, title, MATCH(title,body) AGAINST('doubles') AS scoreFROM ODUtennis WHERE MATCH(title,body) AGAINST('doubles');

+----+-------------------------------------------------------------------------+----------------------------+| id | title | score |+----+-------------------------------------------------------------------------+----------------------------+| 5 | Mens Tennis Nipped by #64 UNC Wilmington, 4-3 | 0.61670464277267 || 4 | ODU Mens Tennis Drops Delaware, 5-2, in CAA Action | 0.28169098496437 |+----+-------------------------------------------------------------------------+----------------------------+2 rows in set (0.00 sec)

Page 23: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Boolean Mode

• Does not use the 50% threshold• Does use stopwords, length limitation• Operator list:

– http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html

mysql> SELECT id, title FROM ODUtennisWHERE MATCH(title,body) AGAINST('+Monarchs' IN BOOLEAN MODE);

+----+-----------------------------------------------------------------------------------------+| id | title |+----+-----------------------------------------------------------------------------------------+| 1 | Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals || 2 | Monarchs Close Out Season With 4-3 Win Over South Alabama || 4 | ODU Mens Tennis Drops Delaware, 5-2, in CAA Action || 5 | Mens Tennis Nipped by #64 UNC Wilmington, 4-3 |+----+-----------------------------------------------------------------------------------------+4 rows in set (0.00 sec)

Page 24: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Blind Query Expansion (AKA Automatic Relevance Feedback)

• General assumption: user query is insufficient– Too short

– Too generic

– Too many results

• How does one keep up with Virginia Tech’s multiple names / nicknames?– Hokies, Fighting Gobblers, VPI, VPI&SU, Va Tech, VT

• Idea:

1. run the query with the requested terms

2. then take the results and

3. re-run the query with the most relevant terms from the initial results

Page 25: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

Blind Query Expansion (AKA Automatic Relevance Feedback)

mysql> SELECT * FROM ODUtennis WHERE MATCH(title,body) AGAINST('fanselow' WITH QUERY EXPANSION);

+----+-------------------------------------------------------------------------+----------------------------------------------------------------+| id | title | body | +----+-------------------------------------------------------------------------+-----------------------------------------------------------------+| 3 | ODU Edged By DePaul in Mens Tennis Action, 4-3 | Junior Tobias Fanselow was the other Monarch | | | to win, earning the win at No. 1 singles over || | Alasdair Graetz, 6-2, 7-6 |----------------------------------------------------------------------------------------------------------------------------------------------------+| 4 | ODU Mens Tennis Drops Delaware, 5-2, in CAA Action | The Old Dominion mens tennis team improved || | to 14-8 overall and 2-3 in the CAA with a 5-2 || | victory over the Delaware Fighting Blue Hens || | on Wednesday. After dropping all three doubles || | matches, the Monarchs went on to win five of | | | the six singles matches for the victory. |+----+----------------------------------------------------+--------------------------------------------------------------------------------------+2 rows in set (0.00 sec)

Page 26: Web Programming Week 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein 11/23/10

For More Information…

• MySQL documentation:– http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

• Chapter 12/13 “Building a Content Management System”

• CS 751/851 “Introduction to Digital Libraries”– http://www.cs.odu.edu/~mln/teaching/– esp. “Information Retrieval Concepts” lecture

• Is MySQL the right tool for your job?– http://lucene.apache.org/

MySQL examples in this lecture based on those found at dev.mysql.comcontent snippets taken from www.odusports.com