g. mecca, v. crescenzi, and p. merialdo. roadrunner: towards

Upload: tommy96

Post on 09-Jun-2015

G. Mecca, V. Crescenzi, and P. Merialdo. Roadrunner: Towards automatic data extraction from large web sites. Proceedings of the 27th Very Large Databases Conference, Rome, Italy, pages 109-118, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html

G. Ianni, Intelligent Anticipated Exploration of Web Sites (2001) http://citeseer.nj.nec.com/ianni01intelligent.html

Robert Baumgartner, Supervised Wrapper Generation with Lixto. The VLDB Journal 2001 http://citeseer.nj.nec.com/baumgartner01supervised.html or via the Lixto Downloads page: http://www.dbai.tuwien.ac.at/proj/lixto/download.html

Robert Baumgartner, Sergio Flesca, Georg Gottlob, Declarative Information Extraction, Web Crawling and Recursive Wrapping with Lixto, Proc LPNMR'01, 6th International Conference on Logic Programming and Nonmonotonic Reasoning, 2001. (LNCS )

Robert Baumgartner, Sergio Flesca, Georg Gottlob, Visual Web Information Extraction with Lixto. The VLDB Journal 2001 http://citeseer.nj.nec.com/baumgartner01visual.html

Lixto web site: http://www.dbai.tuwien.ac.at/proj/lixto/

William Cohen, Lee Jensen, A Structured Wrapper Induction System for Extracting Information from Semi-Structured Documents, Proc IJCAI-2001 Workshop on Adaptive Text Extraction and Mining (2001) http://citeseer.nj.nec.com/cohen01structured.html

Fabio Ciravegna, Daniela Petrelli, User Involvement in Adaptive Information Extraction: Position Paper, Proc IJCAI-2001 Workshop on Adaptive Text Extraction and Mining (2001) http://www.smi.ucd.ie/ATEM2001/proceedings/ciravegna-position-atem2001.pdf

Ralph Grishman, Adaptive Information Extraction and Sublanguage Analysis, Proc IJCAI-2001 Workshop on Adaptive Text Extraction and Mining (2001) http://www.smi.ucd.ie/ATEM2001/proceedings/grishman-position-atem2001.pdf

Shian-Hua Lin Academia Sinica 128 Academia Road Sec. 2... Discovering Informative Content Blocks from Web http://citeseer.nj.nec.com/530062.html http://kp05.iis.sinica.edu.tw/shlin/paper/kdd-ShianHuaLin.pdf

M. Brian Blake The MITRE Corporation Center for Advanced Aviation System... An Autonomous Decentralized Architecture for Distributed Data Management and Dissemination http://citeseer.nj.nec.com/461926.html

M. Brian Blake, Patricia Liguori, ISADS, An Automated Client-Driven Approach to Data Extraction using an Autonomous Decentralized Architecture (2001) http://citeseer.nj.nec.com/blake01automated.html

Yuan Jiang, Using Heuristic Approaches to Detect Record Boundaries in Semistructured Web Documents http://students.cs.byu.edu/~jiang/thesis/

Line Eikvil, Information Extraction from World Wide Web: A Survey (1999). Norwegian Computing Center http://citeseer.nj.nec.com/eikvil99information.html

Brad Adelberg, NoDoSE - A tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents, SIGMOD Conference 1998 http://students.cs.byu.edu/~jiang/thesis/nodose.ps

Ion Muslea, Steve Minton, Craig Knoblock, Wrapper Induction for Semistructured, Web-based Information Sources (1998)

Sergey Brin, Extracting Patterns and Relations from the World Wide Web, WebDB Workshop at 6th International Conference on Extending Database Technology, EDBT'98 http://citeseer.nj.nec.com/brin98extracting.html

Naveen Ashish, Craig A. Knoblock, Semi-automatic Wrapper Generation for Internet Information Sources, Proc 2nd IFCIS Conference on Cooperative Information Systems (CoopIS '97), pp 160-169, June 1997. http://citeseer.nj.nec.com/28908.html

Naveen Ashish and Craig Knoblock, Wrapper Generation for Semi-structured Internet Sources, ACM SIGMOD Workshop on Management of Semi-structured Data, 1997, Tucson, Arizona. http://ic.arc.nasa.gov/~ashish/sig.ps

Naveen Ashish, Craig A. Knoblock, Wrapper Generation for Semi-structured Internet Sources, SIGMOD Record, Vol. 26, No. 4, December 1997. (Invited Paper) http://ic.arc.nasa.gov/~ashish/sig.ps

Naveen Ashish's Home Page: http://ic.arc.nasa.gov/~ashish/

Craig Knoblock, Steve Minton, Jose-Luis Ambite, Naveen Ashish, Pragnesh Modi, Ion Muslea, Andrew Philpot and Sheila Tejada, Modeling Web Sources for Information Integration, Proc. AAAI '98, 15th National Conference on Artificial Intelligence, July 1998. http://ic.arc.nasa.gov/~ashish/aaai98.ps

Venkatesh Ganti, Mong-Li Lee, Raghu Ramakrishnan: ICICLES: Self-Tuning Samples for Approximate Query Answering. VLDB 2000: 176-187 http://www.acm.org/sigmod/vldb/conf/2000/P176.pdf

Kushmerick N. (1997). Wrapper Induction for Information Extraction. Ph.D. Dissertation, University of Washington. Technical Report UW-CSE-97-11-04. http://www.cs.ucd.ie/staff/nick/home/research/download/kushmerick-phd.ps.gz

Nicholas Kushmerick, Bernd Thomas, Adaptive information extraction: Core technologies for information agents (2002). In Intelligent Information Agents R&D in Europe: An AgentLink perspective. Springer. http://citeseer.nj.nec.com/kushmerick02adaptive.html

Bernd Thomas papers etc web site: http://www.uni-koblenz.de/~bthomas/MIA_HTML/

Nicholas Kushmerick, Finite-state approaches to Web information extraction (2002) http://citeseer.nj.nec.com/kushmerick02finitestate.html

Nicholas Kushmerick, Gleaning Answers From the Web http://citeseer.nj.nec.com/504568.html http://www.cs.ucd.ie/staff/nick/home/research/download/kushmerick-maftkb-ss02.pdf

Nicholas Kushmerick, Wrapper Verification. World Wide Web Journal, 3(2) pp 79-94. http://www.cs.ucd.ie/staff/nick/home/research/download/kushmerick-wwwj2000.ps.gz

The Niagara Internet Query System (Wisconsin). List of papers by Niagara group members, 1999-2002: http://www.cs.wisc.edu/niagara/Publications.html

Building XML Statistics for the Hidden Web. Ashraf Aboulnaga and Jeffrey F. Naughton. VLDB Conference 2002. http://www.cs.wisc.edu/niagara/papers/vldb02xmlolstat.pdf

Form-Based Proxy Caching for Database-Backed Web Sites. Qiong Luo and Jeffrey F. Naughton. VLDB Conference 2001. http://www.cs.wisc.edu/niagara/papers/formProxyFull.pdf

NiagaraCQ: A Scalable Continuous Query System for Internet Databases. Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang. Proc. SIGMOD 2000, pp. 379-390. http://www.cs.wisc.edu/niagara/papers/NiagaraCQ.pdf

B.T. Messmer, H. Bunke, Subgraph Isomorphism in Polynomial Time (1995) Technical Report IAM 95-003, University of Bern, Institute of Computer Science and Applied Mathematics, Bern, Switzerland. http://citeseer.nj.nec.com/messmer95subgraph.html

David Eppstein, Subgraph Isomorphism in Planar Graphs and Related Problems (1999) Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, pages 632--640, 1995. http://citeseer.nj.nec.com/eppstein95subgraph.html Revised later for a journal article in 1999 (that is the printed version you have).

Gio Wiederhold, Mediators in the Architecture of Future Information Systems, Readings in Agents, 1992 http://citeseer.nj.nec.com/wiederhold92mediators.html

Anthony Tomasic, Louiqa Raschid, P. Valduriez, Scaling Access to Heterogeneous Data Sources with DISCO, Knowledge and Data Engineering, 1998 http://citeseer.nj.nec.com/tomasic98scaling.html

A. Tomasic, L. Raschid, and P. Valduriez, Scaling Heterogeneous Databases and the Design of Disco (1996) Issn apport de recherche Institut National De Recherche En Informatique Et En... Proc. International Conference on Distributed Computing Systems, ICDCS, 1996. http://citeseer.nj.nec.com/tomasic96scaling.html

Olga Kapitskaia, Anthony Tomasic, Patrick Valduriez, Dealing with Discrepancies in Wrapper Functionality, Proc. 13eme Journees Bases de Donnees Avancees, BDA, 1997 http://citeseer.nj.nec.com/rd/97012851%2C145295%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/299/ftp:zSzzSzftp.inria.frzSzINRIAzSzpublicationzSzRRzSzRR-3138.pdf/kapitskaia97dealing.pdf

Anthony Tomasic, Rémy Amouroux, Philippe Bonnet, The Distributed Information Search Component (DisCo) and the World Wide Web http://citeseer.nj.nec.com/186003.html

P. Ipeirotis and L. Gravano, Distributed Search over the Hidden-Web: Hierarchical Database Sampling and Selection, Proceedings of the 28th International Conference on Very Large Databases (VLDB 2002), 2002 http://qprober.cs.columbia.edu/publications/vldb2002.pdf

P. Ipeirotis, L. Gravano, and M. Sahami, Probe, Count, and Classify: Categorizing Hidden-Web Databases, Proceedings of the 2001 ACM SIGMOD International Conference On Management of Data, 2001. http://qprober.cs.columbia.edu/publications/sigmod2001.pdf

L. Gravano, P. Ipeirotis, and M. Sahami, Query- vs. Crawling-based Classification of Searchable Web Databases, IEEE Data Engineering Bulletin, vol. 25, no. 1, March 2002. http://qprober.cs.columbia.edu/publications/deb-mar2002.pdf

Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, On the Role of Integrity Constraints in Data Integration, Data Engineering Bulletin, 25(3) September 2002. http://www.research.microsoft.com/research/db/debull/A02sept/l-article.ps

Rachel A. Pottinger, Philip A. Bernstein, Creating a Mediated Schema Based on Initial Correspondences, Data Engineering Bulletin, 25(3) September 2002. (Special Issue on Integration Management) http://www.research.microsoft.com/research/db/debull/A02sept/po-article.ps

V. Crescenzi, G. Mecca and P. Merialdo. Roadrunner: Towards automatic data extraction from large web sites. Proceedings of the 27th Very Large Databases Conference, Rome, Italy, pages 109-118, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo, Automatic Web Information Extraction in the RoadRunner System. International Workshop on Data Semantics in Web Information Systems (DASWIS-2001) in conjunction with 20th International Conference on Conceptual Modeling (ER 2001)

R. Kosala, J. Van den Bussche, M. Bruynooghe and H. Blockeel. Information Extraction in Structured Documents using Tree Automata Induction. To appear in Principles of Data Mining and Knowledge Discovery, Proceedings of the 6th International Conference (PKDD-2002). Preliminary version: http://citeseer.nj.nec.com/506574.html

R. Kosala and H. Blockeel, Web mining research: A survey. ACM SIGKDD Explorations, 2(1) pp 1-15, 2000, Special issue on "Internet Data Mining". SIGKDD Explorations: Newsletter of the ACM Special Interest Group on Knowledge Discovery & Data Mining. http://citeseer.nj.nec.com/kosala00web.html

Boris Chidlovskii, Jon Ragetli and Maarten de Rijke. Wrapper Generation via Grammar Induction. Proc. ECML 2000, 11th European Conf on Machine Learning, 2000. (LNAI 1810) pp 96-108. http://home-4.12move.nl/~sh364624/docs/chidlovskii.pdf

Boris Chidlovskii, Jon Ragetli, Maarten de Rijke, Automatic Wrapper Generation for Web Search Engines, Proc. 1st Intl Conf on Web-Age Information Management, 2000. (LNCS 1846) pp 399-410. http://home-4.12move.nl/~sh364624/docs/waimk.pdf

Jon Ragetli's publications page: http://home-4.12move.nl/~sh364624/publicaties.html

Wolfgang May, Rainer Himmeröder, Georg Lausen, Bertram Ludäscher, A Unified Framework for Wrapping, Mediating and Restructuring Information from the Web.

Naveen Ashish, PhD Thesis, March 2000. Optimizing Information Mediators by Selectively Materialising Data. (supervised by Craig Knoblock). Naveen Ashish, Craig Knoblock. Wrapper Generation for Semi-structured Internet Sources, Workshop on Management of Semistructured Data, Ventana Canyon Resort, Tucson, Arizona.

Gestalts Project: Networked Databases. Raghu Ramakrishnan, University of Wisconsin. Web page Oct '98: http://www.cs.wisc.edu/~raghu/gestalts/

The COD Project, University of Wisconsin. (Semantic database integration.) Raghu Ramakrishnan. http://www.cs.wisc.edu/~cod/

Eric Brill, Jimmy Lin, Michele Banko, Susan T. Dumais, Andrew Y. Ng: Data-Intensive Question Answering. (TREC 2001) http://trec.nist.gov/pubs/trec10/papers/Trec2001Notebook.AskMSRFinal.pdf

Home page of Ismail Khalil Ibrahim: http://www.tk.uni-linz.ac.at/~ismail/

Stephane Bressan - National University of Singapore: http://www.comp.nus.edu.sg/~steph/

Wee Hyong Tok and Stéphane Bressan, dbRouter - A Scaleable and Distributed Query Optimization and Processing Framework, Proc. 13th International Conference on Database and Expert Systems Applications, DEXA 2002, p. 658-668. (LNCS 2453). http://link.springer.de/link/service/series/0558/papers/2453/24530658.pdf

Stéphane Bressan, Cheng Goh, Semantic Integration of Disparate Information Sources over the Internet using Constraint Propagation. http://context.mit.edu/~steph/cp97/cp97.html

Stéphane Bressan and Cheng Hian Goh, Semantic integration of disparate information sources over the internet using constraint propagation. In Workshop on Constraint Reasoning on the Internet, 1997.

S. Bressan, C. H. Goh, T. Lee, S. Madnick, and M. Siegel. A Procedure for Mediation of Queries to Sources in Disparate Contexts. Proc. of the International Logic Programming Symposium, pp. 213-227, Port Jefferson, N.Y., October 12-17, 1997.

S. Bressan and C. H. Goh, Semantic Integration of Disparate Information Sources over the Internet using Constraint Propagation. Workshop on Constraint Reasoning on the Internet at CP-97, 1997.

S. Bressan, C. H. Goh, K. Fynn, M. Jakobisiak, K. Hussein, H. Kon, T. Lee, S. Madnick, T. Pena, J. Qu, A. Shum, and M. Siegel. The Context Interchange Mediator Prototype. Proc. ACM SIGMOD Conference, 1997.

S. Bressan, K. Fynn, C. H. Goh, S. Madnick, T. Pena, and M. Siegel. Overview of Prolog Implementation of the Context Interchange Mediator. Proc. Intl Conference on Practical Applications of Prolog, pp. 83-93, 1997.

C. H. Goh, S. Bressan, S. Madnick, and M. Siegel. Context Interchange: New Features and Formalisms for the Intelligent Integration of Information. Sloan School of Management Working Paper, January 1997. Submitted for publication.

COntext INterchange: List of Publications http://context.mit.edu/~coin/publications/

Information Integration with Attribution Support for Corporate Profiles, Lee, T., Chams, M., Nado, R., Madnick, S., and Siegel, M., ACM Conference on Information and Knowledge Management, 1999

Context Mediation on Wall Street, Moulton, A., Madnick, S. E., and Siegel, M. D., CoopIS98

Answering Queries in Context, Bressan, S. and Goh, C., International Conference on Flexible Query Answering, 1998 (LNAI)

Source Attribution for Querying Against Semi-structured Documents, Lee, T., Bressan, S., and Madnick, S., Workshop on Web Information and Data Management, ACM Conference on Information and Knowledge Management, 1998

Semantic Integration of Disparate Information Sources over the Internet Using Constraints, Bressan, S. and Goh, C., Constraint Programming Workshop on Constraints and the Internet, 1997

Extraction and Integration of Data from Semi-structured Documents into Business Applications, Bressan, S. and Bonnet, Ph, Conference on the Industrial Applications of Prolog, 1997

Multimodal Integration of Disparate Information Sources with Attribution, Lee, T. and Bressan, S., Entity Relationship Workshop on Information Retrieval and Conceptual Modelling, 1997

Context Mediation: New Features and Formalisms for the Intelligent Integration of Information, Goh, C. and Bressan, S. and Madnick, S. and Siegel, M., Sloan Working Paper 3941, 1997

A Procedure for the Context Mediation of Queries to Disparate Sources, Goh, C. and Bressan, S. and Lee, T. and Madnick, S. and Siegel, M., International Logic Programming Symposium, 1997

Information Brokering on the World Wide Web, Bressan, S. and Lee, T., WebNet World Conference, 1997

The COntext INterchange Mediator Prototype, Bressan, S. and Fynn, K. and Goh, C. and Jakobisiak, M. and Hussein, K. and Kon, H. and Lee, T. and Madnick, S. and Pena, T. and Qu, J. and Shum, A. and Siegel, M., ACM SIGMOD International Conference on Management of Data, 1997

Overview of the Prolog Implementation of the COntext INterchange Prototype, Bressan, S. and Fynn, K. and Goh, C. and Madnick, S. and Pena, T. and Siegel, M., Fifth International Conference on Practical Applications of Prolog, 1997

PENNY: A Programming Language and Compiler for the Context Interchange Project, Pena, T., MIT Master Thesis, Electrical Engineering and Computer Science, 1997

A Planner/Optimizer/Executioner for Context Mediated Queries, Fynn, K., MIT Master Thesis, Electrical Engineering and Computer Science, 1997

Representing and Reasoning about Semantic Conflicts in Heterogeneous Information Systems, Goh, C., Ph.D. Thesis, MIT Sloan School of Management, 1996

-------------------

Frank Dignum - Utrecht University http://www.cs.uu.nl/people/dignum/

I. K. Ibrahim, W. Winiwarter, S. Bressan. Semantic Query Transformation for the Intelligent Integration of Information Sources over the Web. Proc. of the International Workshop on Information Integration on the Web, Rio de Janeiro, Brazil, April 2001. http://www.ifs.univie.ac.at/~ww/wiiw01.ps http://citeseer.nj.nec.com/487085.html

I. K. Ibrahim, V. Dignum, W. Winiwarter, E. Weippl. Logic Based Approach to Semantic Query Transformation for Knowledge Management Applications. Proc. of the International Conference on Knowledge Management, Berlin, Springer, 2002. http://www.ifs.univie.ac.at/~ww/iknow02.pdf

I. K. Ibrahim, W. Winiwarter, S. Bressan. Rewriting Rules for Semantic Query Transformation in E-Commerce Applications. Proc. of the 9th IFIP 2.6 Working Conference on Database Semantics, Dordrecht, Kluwer Academic Publishers, 2001. http://www.ifs.univie.ac.at/~ww/ds9.ps This file doesn't print properly. Try the following PDF file from CiteSeer instead: http://citeseer.nj.nec.com/rd/97012851%2C469573%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/24013/http:zSzzSzwww.ifs.univie.ac.atzSz%7EwwzSzds9.pdf/rewriting-rules-for-semantic.pdf

I.K. Ibrahim's PhD thesis 2001: Semantic Query Transformation for the Intelligent Integration of Information, Gadjah Mada University. (It is not available on-line and is written in Indonesian, he tells me. He says he is translating it for publication as a book in 2003.) Supervised by: Stephane Bressan - National University of Singapore; Frank Dignum - Utrecht University

A. P. Sheth and J. A. Larson. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22(3):183--236, 1990. This document is not available on-line. Amit P. Sheth's home page is: http://lsdis.cs.uga.edu/~amit/

Steven Prestwich, Stéphane Bressan, A SAT Approach to Query Optimization in Mediator Systems, Fifth International Symposium on the Theory and Applications of Satisfiability Testing (SAT 2002), May 6-9, 2002, Cincinnati, Ohio, USA http://citeseer.nj.nec.com/prestwich02sat.html Conference page: http://gauss.ececs.uc.edu/Conferences/SAT2002/sat2002list.html

H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, V. Vassalos, and J. Widom, "The TSIMMIS Approach to Mediation: Data Models and Languages", Journal of Intelligent Information Systems, 2 (1997) 117-132.

Bressan, S. and Bonnet, Ph. Extraction and Integration of Data from Semi-structured Documents into Business Applications, Conference on the Industrial Applications of Prolog, 1997 http://context.mit.edu/~coin/publications/inap97/inap97.ps

Shuchi Patel, Amit Sheth, Planning And Optimizing Semantic Information Requests Using Domain Modeling And Resource Characteristics (2001) 26 page Technical Report. http://citeseer.nj.nec.com/rd/97012851%2C454763%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/22883/http:zSzzSzlsdis.cs.uga.eduzSzlibzSzdownloadzSz126-TR-Shuchi.pdf/patel01planning.pdf

Ajay Hemnani and Stephane Bressan, Information Extraction - Tree Alignment Approach to Pattern Discovery in Web Documents, Proc Database and Expert Systems Applications, 13th International Conference, DEXA 2002, (LNCS 2453) http://link.springer.de/link/service/series/0558/papers/2453/24530789.pdf

Ajay Hemnani and Stephane Bressan, Extracting Information from Semi-structured Web Documents, Proc. OOIS 2002, Advances in Object-Oriented Information Systems, 2002. LNCS 2426, pp. 166-175 http://link.springer.de/link/service/series/0558/papers/2426/24260166.pdf

Yannis Papakonstantinou's home page: http://www.db.ucsd.edu/people/yannis.htm (contains an annotated bibliography)

Yannis Papakonstantinou, Data Integration: The Need, the Challenges and the Approaches. Plenary talk given at the International Symposium on Information Systems and Engineering, July 2002. Provides a classification of data integration techniques and challenges in the XML virtual view-based approach. http://www.db.ucsd.edu/people/yannis/ISE2002_files/frame.htm (59 slides, web presentation)

V. Vassalos, Y. Papakonstantinou: Describing and Using Query Capabilities of Heterogeneous Sources. VLDB'97, 1997 http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=1997-44&format=ps&compression=

Yannis Papakonstantinou, Ashish Gupta, Laura Haas, Capabilities-Based Query Rewriting in Mediator Systems (1998), 42 pages. Proceedings of 4th International Conference on Parallel and Distributed Information Systems http://citeseer.nj.nec.com/rd/4641068%2C322185%2C1%2C0.25%2CDownload/http%3AqSqqSqciteseer.nj.nec.comqSqcacheqSqpapersqSqcsqSq15133qSqhttp%3AzSzzSzwww.db.ucsd.eduzSzpublicationszSzdapd.pdf/papakonstantinou98capabilitiesbased.pdf

Yannis Papakonstantinou, Ashish Gupta, Laura Haas, Capabilities-Based Query Rewriting in Mediator Systems (another version) (1996), 29 pages http://citeseer.nj.nec.com/rd/24323359%2C3393%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/6182/http:zSzzSzwww-db.stanford.eduzSzpubzSzpapakonstantinouzSz1995zSzcbr-extended.pdf/papakonstantinou96capabilitiesbased.pdf

Michael R. Genesereth, Arthur M. Keller, et al., Infomaster: An Information Integration System (1997) http://citeseer.nj.nec.com/cache/papers/cs/15863/http:zSzzSzmas.cs.umass.eduzSz~aseltinezSz791SzSzgenesereth.infomaster.pdf/genesereth97infomaster.pdf

Michalis Petropoulos home page: http://www-cse.ucsd.edu/~mpetropo/

Michalis Petropoulos, Y. Papakonstantinou and V. Vassalos, QURSED: QUerying and Reporting SEmistructured Data, ACM SIGMOD Conference, 2002. http://www.db.ucsd.edu/People/michalis/pubs/sigmod02.pdf

Michalis Petropoulos and V. Hristidis, Semantic Caching of XML Databases, Fifth International Workshop on the Web and Databases (WebDB), 2002. http://www.db.ucsd.edu/People/michalis/pubs/webdb02.pdf

Relational Wrapper for Navigation-Driven Lazy Mediator. Talk given at San Diego Supercomputer Center as part of the MIX project in August 99. http://www.db.ucsd.edu/People/michalis/presentations/rdbwrapper.pdf

Languages for Semistructured Data. Talk given at the Fall 99 Advanced Database Topics Seminar (CSE 291) of the CSE Department of UCSD. http://www.db.ucsd.edu/People/michalis/presentations/qlsemi.pdf

Jean-Robert Gruser, Louiqa Raschid, Vladimir Zadorozhny, Tao Zhan, Learning response time for Web Sources using query feedback and application in query optimization, The VLDB Journal, 9(1) 18-37 http://link.springer.de/link/service/journals/00778/papers/0009001/00090018.pdf#xml=http://athene.em.springer.de/search97cgi/s97_cgi?action=view&VdkVgwKey=%2Fjour%2Fjour%2F00778%2Fpapers%2F0009001%2F00090018.pdf&doctype=xml&collection=springer02&queryZIP=%28%22wrapper%22%29AND%28%22web%22%29

Jaeyoung Yang, Jungsun Kim, Kyoung-Goo Doh, and Joongmin Choi, Wrapper Generation by Using XML-Based Domain Knowledge for Intelligent Information Extraction. Springer Lecture Notes in Computer Science, LNAI 2417, pp 472--??, 2002. http://link.springer.de/link/service/series/0558/papers/2417/24170472.pdf

Jaeyoung Yang, Eunseok Lee and Joongmin Choi, A Shopping Agent That Automatically Constructs Wrappers for Semi-structured Online Vendors. Lecture Notes in Computer Science, Vol. 1983, p. 368-??, 2000. http://link.springer-ny.com/link/service/series/0558/papers/1983/19830368.pdf

Jaeyoung Yang, Heekuck Oh, Kyung-Goo Doh and Joongmin Choi, A Knowledge-Based Information Extraction System for Semi-structured Labeled Documents. Lecture Notes in Computer Science, Vol. 2412, p. 105-??, 2002. http://link.springer-ny.com/link/service/series/0558/papers/2412/24120105.pdf

Gunter Grieser, Klaus P. Jantke, Steffen Lange, Bernd Thomas, A Unifying Approach to HTML Wrapper Representation and Learning, LNCS 1967, pp 50- http://link.springer.de/link/service/series/0558/papers/1967/19670050.pdf#xml=http://athene.em.springer.de/search97cgi/s97_cgi?action=view&VdkVgwKey=%2Fjour%2Fseries%2F0558%2Fpapers%2F1967%2F19670050.pdf&doctype=xml&collection=springer02&queryZIP=%28%22wrapper%22%29AND%28%22web%22%29

The TSIMMIS Web Site: http://www-db.stanford.edu/tsimmis/

TSIMMIS Publications: http://www-db.stanford.edu/tsimmis/publications.html

Joachim Hammer, Hector Garcia-Molina, Svetlozar Nestorov, Ramana Yerneni, Marcus Breunig, Vasilis Vassalos, "Template-Based Wrappers in the TSIMMIS System". In Proceedings of the Twenty-Sixth SIGMOD International Conference on Management of Data, Tucson, Arizona, May 12-15, 1997, pp 532-535. ftp://www-db.stanford.edu/pub/papers/wrapper-demo.ps http://citeseer.nj.nec.com/hammer97templatebased.html

Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, J. Ullman. "A Query Translation Scheme for Rapid Implementation of Wrappers". In International Conference on Deductive and Object-Oriented Databases, 1995. ftp://www-db.stanford.edu/pub/papers/querytran.ps (also available as PDF file from citeseer)

J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. "Extracting Semistructured Information from the Web". In Proceedings of the Workshop on Management of Semistructured Data. Tucson, Arizona, May 1997. ftp://www-db.stanford.edu/pub/papers/extract.ps

Chen Li, Ramana Yerneni, Vasilis Vassalos, Hector Garcia-Molina, Yannis Papakonstantinou, Jeffrey Ullman, Murty Valiveti. "Capability Based Mediation in TSIMMIS". SIGMOD 98 Demo, Seattle, June 1998. http://www-db.stanford.edu/pub/papers/cap.ps

V. Vassalos, Y. Papakonstantinou. "Describing and Using Query Capabilities of Heterogeneous Sources". In VLDB Conference, Athens, Greece, August 1997. ftp://www-db.stanford.edu/pub/papers/query-cap-ext.ps

Useful classified links to papers etc available on the web, and others:

Post-Modern Database Systems: Databases Meet the Web http://db.cs.berkeley.edu/postmodern/

Data Integration at the University of Washington http://data.cs.washington.edu/integration/

Data Management and Intelligent Internet Systems http://www.cs.washington.edu/research/irdb.intro.html

Gunter Grieser, Klaus P. Jantke, Steffen Lange, Bernd Thomas, A Unifying Approach to HTML Wrapper Representation and Learning, DS 2000, Kyoto, Japan, 4.-6.12.2000 http://www.dfki.de/~lexikon/Publikationen/Files/GJLT-DS-2000.pdf

Gunter Grieser, Steffen Lange, Learning Approaches to Wrapper Induction, FLAIRS 2001, 21-23 May 2001, Key West, FL http://www.dfki.de/~lexikon/Publikationen/Files/FLAIRS-2001-GL.pdf

Steffen Lange, Gunter Grieser, Klaus P. Jantke, Extending Elementary Formal Systems, Algorithmic Learning Theory, 12th International Conference, ALT 2001, LNAI 2225, pp. 332-347, 2001. http://www.dfki.de/~lexikon/Publikationen/Files/ALT-2001-LGJ.ps

Manuel Álvarez, Alberto Pan, Juan Raposo, Fidel Cacheda, Ángel Viña: FINDER: A Mediator System for Structured and Semi-Structured Data Integration. 847-851

Juan Raposo, Alberto Pan, Manuel Álvarez, Justo Hidalgo, Ángel Viña: The Wargo System: Semi-Automatic Wrapper Generation in Presence of Complex Data Access Modes. DEXA Workshops 2002: 313-320

Alberto Pan, Juan Raposo, Manuel Álvarez, Justo Hidalgo, Ángel Viña: Semi-Automatic Wrapper Generation for Commercial Web Sources. Engineering Information Systems in the Internet Context 2002: 265-283

Alberto Pan, Paula Montoto, Anastasio Molano, Manuel Álvarez, Juan Raposo, Vicente Orjales, Ángel Viña: Mediator Systems in E-Commerce Applications. WECWIS 2002: 228-235

Proceedings of the First International Workshop on Web Document Analysis (WDA2001) Seattle, Washington, USA. September 8, 2001 (in association with ICDAR'01) http://www.csc.liv.ac.uk/~wda2001/

A. Rahman and H. Alam, Content extraction from HTML documents, Proceedings of the First International Workshop on Web Document Analysis (WDA2001), Seattle, Washington, USA. September 8, 2001 (in association with ICDAR'01) http://www.csc.liv.ac.uk/~wda2001/Papers/11_rahman_wda2001.pdf

V. Lakshmi, A.-H. Tan and C.-L. Tan, Web structure analysis for information mining, Proceedings of the First International Workshop on Web Document Analysis (WDA2001), Seattle, Washington, USA. September 8, 2001 (in association with ICDAR'01) http://www.csc.liv.ac.uk/~wda2001/Papers/18_lakshmi_wda2001.pdf

Yuan Jiang, Record-Boundary Discovery In Web Documents, MSc Dissertation, 1998. http://citeseer.nj.nec.com/rd/97012851%2C294151%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/compress/0/papers/cs/14590/http:zSzzSzosm7.cs.byu.eduzSzdegzSzpaperszSzSJ.Thesis.ps.gz/jiang99recordboundary.ps

David W. Embley, Y. S. Jiang, Yiu-Kai Ng: Record-Boundary Discovery in Web Documents. SIGMOD Conference 1999: 467-478

Publications of Embley's Data Extraction Group at Brigham Young University: http://www.deg.byu.edu/

David W. Embley, Cui Tao, and Stephen W. Liddle, Automating the Extraction of Data from HTML Tables with Unknown Structure, submitted, May 2003. (29 pages) http://www.deg.byu.edu/papers/dke2003etl.pdf

Stephen W. Liddle, Kimball A. Hewett, and David W. Embley, An Integrated Ontology Development Environment for Data Extraction, submitted, April 2003. http://www.deg.byu.edu/papers/ista2003.pdf

Tim Chartrand, Ontology-Based Extraction of RDF Data from the World Wide Web, Masters Thesis, March 2003. http://www.deg.byu.edu/papers/tim_thesis.pdf

Li Xu and D.W. Embley, Combining the Best of Global-as-View and Local-as-View for Data Integration, submitted. http://www.deg.byu.edu/papers/PODS.integration.pdf

S.W. Liddle, D.W. Embley, D.T. Scott, and S.H. Yau, Extracting Data Behind Web Forms, Proceedings of the Workshop on Conceptual Modeling Approaches for e-Business, Finland, October, 2002. http://www.deg.byu.edu/papers/vldb02.pdf

Sai Ho (Tony) Yau, Automating the Extraction of Data Behind Web Forms, Masters Thesis, December 2001. http://www.deg.byu.edu/papers/TonyYauThesis.doc

S.W. Liddle, S.H. Yau, and D.W. Embley, On the Automatic Extraction of Data from the Hidden Web, Proceedings of the International Workshop on Data Semantics in Web Information Systems (DASWIS-2001), Yokohama, Japan, 27-30 November 2001. (181K .pdf) http://www.deg.byu.edu/papers/daswis01.pdf

D.W. Embley and L. Xu, Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents, WebDB'00 Proceedings http://www.deg.byu.edu/papers/WebDB00.ps

D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, D.W. Lonsdale, Y.-K. Ng, R.D. Smith, Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages (1999), Data & Knowledge Engineering, 31(3) 227-251. http://citeseer.nj.nec.com/rd/15353230%2C389588%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/compress/0/papers/cs/18642/http:zSzzSzwww.deg.byu.eduzSzpaperszSzdke99.ps.gz/embley99conceptualmodelbased.ps
This 30-page paper uses 'Stanford Certainty Theory', for which the authors reference the following book:
Luger, G.F., Stubblefield, W.A.: Artificial Intelligence: Structures and Strategies for Complex Problem Solving, Third Edition. Addison Wesley Longman, Inc., (1998)

Lakshmi Vijjappu, Ah-Hwee Tan, and Chew-Lim Tan. Web Structure Analysis for Information Mining. In proceedings, ICDAR'01 Workshop on Web Document Analysis, Seattle, September 10-13, 2001. http://textmining.krdl.org.sg/people/ahhwee/papers/web_analysis_wda01.pdf

Jiang T, Wang L and Zhang K, "Alignment of trees - an alternative to tree edit", Theoretical Computer Science, Vol. 143, No. 1, 1995, pp. 137-148

L. Wang, T. Jiang and D. Gusfield, A more efficient approximation scheme for tree alignment, SIAM J. Comput. 30(1), 283-299. 2000. 17 pages long. http://www.cs.cityu.edu.hk/~lwang/research/siamj00.pdf

Chia-Hui Chang and Shao-Chen Lui, IEPAD: Information Extraction based on Pattern Discovery, Proc 10th International Conference on World Wide Web, WWW10, May 2-5, 2001. http://www10.org/cdrom/papers/223/
This is a web document (in html) which prints as 11 pages.

Wen-Tau Yih, MSc Thesis: Template-based Information Extraction from Tree-structured HTML Documents. National Taiwan University (1997). 98 pages. http://citeseer.nj.nec.com/rd/97012851%2C36105%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/2073/http:zSzzSzmobile.csie.ntu.edu.twzSz%7Er4526048zSzDocumentszSzthesis.pdf/yih97templatebased.pdf

Jane Yung-Jen Hsu, Wen-tau Yih, Template-Based Information Mining from HTML Documents, Proc AAAI-97, pp 256-262. (1997) 7 pages. http://citeseer.nj.nec.com/rd/38223961%2C242103%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/11305/http:zSzzSzhugo.csie.ntu.edu.twzSz%7EwtyihzSzDocumentszSzTbIE.pdf/hsu97templatebased.pdf

Chun-Nan Hsu, Ming-Tzung Dung, Generating Finite-State Transducers For Semi-Structured Data Extraction From The Web, Information Systems 23(8) 521-536, 1998. (18 pages). http://citeseer.nj.nec.com/rd/97012851%2C127191%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/5709/http:zSzzSzwww.iis.sinica.edu.twzSz%7EchunnanzSzDOWNLOADSzSzjis2.pdf/hsu98generating.pdf

Adaptive Internet Intelligent Agents Research Group (SoftMealy wrapper/extractor) http://chunnan.iis.sinica.edu.tw/software.html

IEEE Data Engineering Bulletin .. online papers: http://www.research.microsoft.com/research/db/debull/issues-list.htm

Serge Abiteboul, Issues in Monitoring Web Data, Proc. 13th International Conference on Database and Expert Systems Applications, DEXA 2002, pp1-8. (LNCS 2453)

Serge Abiteboul: Querying Semi-Structured Data. ICDT, 1997. http://dbpubs.stanford.edu:8090/pub/1996-19

Jason McHugh, Serge Abiteboul, Roy Goldman, Dallan Quass, Jennifer Widom, Lore: A Database Management System for Semistructured Data (1997), SIGMOD Record 26(3) 1997, pp54-66. http://citeseer.nj.nec.com/mchugh97lore.html

Fifth International Workshop on the Web and Databases (WebDB 2002), Madison, Wisconsin - June 6-7, 2002. Links to papers: http://feast.ucsd.edu/webdb2002/papers.html

Links to previous WebDB workshops: http://feast.ucsd.edu/webdb2002/previous.html

J. Cowie, Y. Wilks, Information Extraction. In R. Dale, H. Moisl and H. Somers (eds.) Handbook of Natural Language Processing. New York: Marcel Dekker. (2000) http://www.dcs.shef.ac.uk/~yorick/papers/infoext.pdf

Other IE papers by Yorick Wilks at Sheffield: http://www.dcs.shef.ac.uk/~yorick/papers.html

Report on Discussion Group III: Web Content Extraction and Mining, from the First International Workshop on Web Document Analysis (WDA 2001) http://www.csc.liv.ac.uk/~wda2001/Discussions/Klink_Hurst/Klink_Hurst.html

Tao Guan and Kam Fai Wong, KPS --- a Web Information Mining Algorithm, Proc WWW8 Conference May 1999. http://www8.org/w8-papers/4a-search-mining/kps/kps.html

William W. Cohen's Papers: Rule Learning http://www-2.cs.cmu.edu/~wcohen/pubs-r.html

More papers by William W. Cohen: http://www-2.cs.cmu.edu/~wcohen/pubs-s.html

William Cohen, Matthew Hurst & Lee S. Jensen, A Flexible Learning System for Wrapping Tables and Lists in HTML Documents (HTML), in WWW-2002 (2002). http://www2002.org/CDROM/refereed/355/

William Cohen, David McAllester, and Henry Kautz, Hardening Soft Information Sources (Postscript), in KDD-2000 (2000). http://www-2.cs.cmu.edu/~wcohen/postscript/kdd-2000.ps

William W. Cohen and Wei Fan, Learning Page-Independent Heuristics for Extracting Data from Web Pages, Proc The Eighth International World Wide Web Conference, May 11-14, 1999. http://www8.org/w8-papers/5a-search-query/learning/

WWW8 Conference Refereed Papers: http://www8.org/fullpaper.html

Chia-Hui Chang, Shao-Chen Lui, and Yen-Chin Wu, Applying Pattern Mining to Web Information Extraction, Proc. PAKDD 2001, Knowledge Discovery and Data Mining, 5th Pacific-Asia Conference, Hong Kong, April 2001. Lecture Notes in Computer Science 2035, pp 4-16. LNAI 2035. http://link.springer.de/link/service/series/0558/papers/2035/20350004.pdf

Fuchun Peng, Models for Information Extraction http://citeseer.nj.nec.com/489954.html

Daniela Florescu, Alon Levy, and Alberto Mendelzon, Database Techniques for the World-Wide Web: A Survey. SIGMOD Record, 27(3):59-74 (1998). http://citeseer.nj.nec.com/cache/papers/cs/1996/ftp:zSzzSzftp.db.toronto.eduzSzpubzSzpaperszSzsigrec.pdf/florescu98database.pdf
Web version: http://oopsla.snu.ac.kr/xweet/seminar/990728-jmjeong/www.html

Ion Muslea, Steve Minton, Craig Knoblock, A hierarchical approach to wrapper induction, Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, 1999. http://citeseer.nj.nec.com/muslea99hierarchical.html

I. Muslea. Extraction Patterns for Information Extraction Tasks: A Survey. In Proceedings of Workshop on Machine Learning and Information Extraction (AAAI-99), pp. 1-6, Orlando, Florida, 1999. http://blondie.cs.byu.edu/CS652/muslea99extraction.pdf

Nicholas Kushmerick, Gleaning the Web, IEEE Intelligent Systems, Vol. 14, No. 2, March/April 1999, pp. 20-22 http://www.cs.ucd.ie/staff/nick/home/research/download/kushmerick-ieeeis99.pdf

Wensheng Wu, Clement Yu, Weiyi Meng, King-Lup Yu, Text Database Selection for Longer Queries, 24 pages. (This concerns meta-search engine implementation.) 2002. http://citeseer.nj.nec.com/455651.html

Andreas Eberhart, Survey of RDF data on the Web (2002). http://citeseer.nj.nec.com/eberhart02survey.html

Andreas Eberhart, SmartGuide: An Intelligent Information System basing on Semantic Web Standards. http://www.i-u.de/schools/eberhart/icai2002.pdf

Michael K. Bergman, The Deep Web: Surfacing Hidden Value, Journal of Electronic Publishing 2001 vol 7
This White Paper is a version of the one on the BrightPlanet site. http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/deepwebwhitepaper.pdf

S. Lawrence and C.L. Giles, Accessibility of information on the Web, Nature, Vol. 400, pp. 107-109, 1999. http://wwwmetrics.com/

D. Buttler and L. Liu and C. Pu, A Fully Automated Object Extraction System for the World Wide Web. Proc. Intl. Conf. on Distributed Computing Systems, 2001. pp 361 - 371. http://citeseer.nj.nec.com/rd/67081539%2C427047%2C1%2C0.25%2CDownload/http%3AqSqqSqwww.cc.gatech.eduqSqprojectsqSqinfosphereqSqpapersqSqfinal-icdcs01.ps

David Buttler, Terence Critchlow, Using Meta-Data to Automatically Wrap Bioinformatics Sources (.pdf). In ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) Workshop on Objects, XML, and Databases, October 2001. http://www.cc.gatech.edu/~buttler/DAML/OOPSLA_01.pdf

Ling Liu, David Buttler, Terence Critchlow, Wei Han, Henrique Paques, Calton Pu, Daniel Rocco. BioZoom: Exploiting Source-Capability Information for Integrated Access to Multiple Bioinformatics Data Sources. In Proc. of 3rd IEEE Symposium on Bioinformatics and Bioengineering, 2003. http://disl.cc.gatech.edu/SDM/papers/bibe03.pdf

David Buttler's home page with papers links: http://www.cc.gatech.edu/~buttler/

Kazem Taghva, Allen Condit, and Julie Borsack. Autotag: A tool for creating structured document collections from printed materials, Proc. Electronic Publishing, Artistic Imaging, and Digital Typography Conference, EP'98 and RIDT'98, St Malo, France, April 1998, pages 420-431. LNCS 1375. http://www.isri.unlv.edu/publications/isripub/Taghva98b.pdf
See also links on page: http://www.isri.unlv.edu/publications/isri-conf.php

Kazem Taghva, Allen Condit, and Julie Borsack, An evaluation of an automatic markup system. Proc. IS&T/SPIE 1995 Intl. Symp. on Electronic Imaging Science and Technology, San Jose, CA, February 1995. http://www.isri.unlv.edu/publications/isripub/Taghva95a.ps

Theodore W Hong, Keith L Clark, Using Grammatical Inference to Automate Information Extraction from the Web, Proc PKDD 2001, 5th European Conference on Principles of Data Mining and Knowledge Discovery, 2001 pp 216-226. (LNCS 2168) http://www.springerlink.com/app/home/content.asp?wasp=ecxxykvutncq491hfj6u&referrer=contribution&format=2&page=1

William W. Cohen's Papers: Text Categorization http://www-2.cs.cmu.edu/~wcohen/pubs-t.html
A list of papers with links, including the following:
William Cohen, Improving A Page Classifier with Anchor Extraction and Link Analysis (PDF), in NIPS-2002 (2002). http://www-2.cs.cmu.edu/~wcohen/postscript/nips-2002.pdf

Un Yong Nahm and Raymond J. Mooney, Text Mining with Information Extraction. Proceedings of the AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, pp. 60-67, Stanford, CA, March 2002. http://www.cs.utexas.edu/users/ml/papers/discotex-aaaisymp-02.pdf

Mattis Neiling, Markus Schaal, Martin Schumann, WrapIt: Automated Integration of Web Databases with Extensional Overlaps. 2nd Intl. Workshop of the Working Group "Web and Databases" of the German Informatics Society (GI) (Workshop WebDB 2002), Erfurt, Thuringia, Germany, October 9-10, 2002, pp 184-198. (LNCS 2593) http://citeseer.nj.nec.com/552045.html
http://link.springer.de/link/service/series/0558/papers/2593/25930184.pdf

Andreas Rauber, Oliver Witvoet, Andreas Aschenbrenner, Robert Bruckner, Putting the World Wide Web into a Data Warehouse: A DWH-based Approach to Web Analysis http://citeseer.nj.nec.com/546144.html

Arnaud Sahuguet and Fabien Azavant, Building intelligent Web applications using lightweight wrappers, Data & Knowledge Engineering, 36(3), March 2001, pp 283-316. Available via link on the W4F project's web page: http://db.cis.upenn.edu/Research/w4f.html

Sonia Bergamaschi, Silvana Castano, Maurizio Vincini and Domenico Beneventano, Semantic integration of heterogeneous information sources, Data & Knowledge Engineering, 36(3), March 2001, pp 215-249.

Matthias Klusch, Information agent technology for the Internet: A survey, Data & Knowledge Engineering, 36(3), March 2001, pp 337-372. http://www.dfki.de/%7Eklusch/papers/iat-dke-2000.zip

Matthias Klusch Homepage (he edits the LNCS annual conf: International Workshop Series on Cooperative Information Agents) http://www.dfki.de/~klusch/

Paolo Atzeni, Giansalvatore Mecca, Paolo Merialdo, Semistructured and Structured Data in the Web: Going Back and Forth, Proc ACM SIGMOD Workshop on Management of Semistructured Data 1997, pp 1-8. http://citeseer.nj.nec.com/atzeni97semistructured.html

Robert B. Doorenbos, Oren Etzioni, Daniel S. Weld, A Scalable Comparison-Shopping Agent for the World-Wide Web, Proceedings of the First International Conference on Autonomous Agents (Agents'97), pp 39-48, 1997. http://citeseer.nj.nec.com/doorenbos97scalable.html
This link is to a 20-page technical report version of the ten-page conference paper.

Daniela Florescu, Alon Levy, and Alberto Mendelzon, Database Techniques for the World-Wide Web: A Survey, SIGMOD Record 27(3), pp 59-74, 1998. http://citeseer.nj.nec.com/florescu98database.html
Web pages version of the above paper: http://oopsla.snu.ac.kr/xweet/seminar/990728-jmjeong/www.html

Vladislav Shkapenyuk, Torsten Suel, Design and Implementation of a High-Performance Distributed Web Crawler, Proc 18th Intl Conf on Data Engineering (ICDE’02), pp 357–368, 2002. http://citeseer.nj.nec.com/shkapenyuk02design.html

GERY, Mathias. CHEVALLET, Jean-Pierre. "Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages" International Workshop on Web Dynamics (London, UK, January 2001). 10 pages. http://www.dcs.bbk.ac.uk/webDyn/webDynPapers/gery.ps

SODERLAND, Stephen, Learning to Extract Text-based Information from the World Wide Web, Proc KDD'97, pp 251-254, 1997. http://www-nlp.cs.umass.edu/pubs/Soderland-KDD97.ps

Stephen Soderland, Learning Information Extraction Rules for Semi-structured and Free Text (1999) Machine Learning 34(1-3) pp 233-272, 1999. http://citeseer.nj.nec.com/soderland99learning.html
A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically.

WEB MINING on-line list of links: http://www.dcc.uchile.cl/~ljaramil/investigacion/wmining.html

Mary Elaine Califf's Web Site http://www.acs.ilstu.edu/faculty/mecalif/calif.htm

URLs of data sources:

RISE: Repository of Online Information Sources Used in Information Extraction Tasks http://www.isi.edu/info-agents/RISE/

OKRA .. Service has been discontinued. http://okra.ucr.edu/

BigBook http://www.bigbook.com/

Internet Address Finder http://www.iaf.net/

Quote Server http://www.secapl.com/cgi-bin/qs

JOBS (a newsgroup) news:misc.jobs.offered

L.A. Weekly Home Page http://www.laweekly.com/

L.A. Weekly Restaurants Guide http://www.laweekly.com/restaurants/search.html

ZAGAT's Guide to Los Angeles Restaurants http://www.zagat.com/

Seattle Times Rentals http://classifieds.nwsource.com/classified/
Also Jobs, Autos, RealEstate, etc.

Seminar Announcements .. no URL provided by RISE repository: http://www-2.cs.cmu.edu/~dayne/SeminarAnnouncements/__Source__.html

WHIRL: A Set of 111 Sources used by William Cohen in the WHIRL project http://www.isi.edu/info-agents/RISE/Original_WHIRL/__Source__.html

The 34 WIEN sources, including OKRA, BigBook, Internet Address Finder, and Quote Server: URL no longer valid.

The Road Runner Project: Towards Automatic Data Extraction from Large Web Sites http://www.dia.uniroma3.it/db/roadRunner/experiments.html
This page includes a list of data sources and the experimental results for them. It also includes the source pages they used, so it is very valuable for comparison experiments:
amazon.com: The most popular e-commerce Web site
buy.com: A popular e-commerce Web site
wine.com: An e-commerce Web site dedicated to wines
uefa.com: The official Web site of the European Football (Soccer) Association
majorleguebaseball.com: The official Web site of the Major League Baseball
barnesandnoble.com: A popular e-commerce Web site
nba.com: The Official NBA Web Site
rpmfind.net: A site hosting Linux RPM software packages

Data sources used by Boris Chidlovskii, Jon Ragetli and Maarten de Rijke:

Library of Congress http://www.lcweb.loc.gov

IICM http://www.iicm.edu

CS Bibliography (Karlsruhe) http://liinwww.ira.uka.de/bibliography/index.html

CS Bibliography (Trier) http://www.informatik.uni-trier.de/~ley/db/index.html

ftpSearch .. their Belgian URL is no longer available.
ftpSearch.de (various countries are still available, e.g. ftpsearch.lt) says to use file search on alltheweb instead: http://www.alltheweb.com/

A list of on-line Bibliographies: http://zeeb.library.cmu.edu/bySubject/CS+ECE/bibs.html

The COD Project, University of Wisconsin. (Semantic database integration.) Raghu Ramakrishnan, Coral Deductive DB, etc. http://www.cs.wisc.edu/~cod/

Eric Brill, Jimmy Lin, Michele Banko, Susan T. Dumais, Andrew Y. Ng: Data-Intensive Question Answering. (TREC 2001) http://trec.nist.gov/pubs/trec10/papers/Trec2001Notebook.AskMSRFinal.pdf

Gestalts Project: Networked Databases. Raghu Ramakrishnan, University of Wisconsin. Web page Oct '98: http://www.cs.wisc.edu/~raghu/gestalts/

Lixto web site http://www.dbai.tuwien.ac.at/proj/lixto/