Topical Categorization of Large Collections of Electronic Theses and Dissertations

Venkat Srinivasan & Edward A. Fox
Virginia Tech, Blacksburg, VA, USA
ETD 2009 – June 11, 2009

Page 1: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Topical Categorization of Large Collections of Electronic Theses and Dissertations

Venkat Srinivasan & Edward A. Fox
Virginia Tech, Blacksburg, VA, USA

ETD 2009 – June 11, 2009

Page 2: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Outline
- Introduction
- Goals
- Approach
- Results
- Future Work

Page 3: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Introduction – great source
- Electronic submission of dissertations is increasingly preferred.
- ETDs are a great information source.
  - Substantial amount of research on a topic
  - Thorough literature review
  - Pointers to other resources (Reference section)

Page 4: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Introduction - under-utilized
- Yet ETDs are under-utilized.
  - Research papers, books, etc. are still the major (and in some cases the only) sources of information for most people.
  - Most people (except grad students trained in this) don’t even think about reading a dissertation!
- Possible causes
  - Access to ETDs is not streamlined.
  - Users don’t know where to look for ETDs.
  - ETDs of interest could be buried in search engine results.
  - Some universities do not allow outside access to their ETD collection.

Page 5: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Introduction - needs
- Efforts have been made to make ETDs more accessible.
  - NDLTD, VTLS, Scirus, etc. provide means of access to ETDs from different universities.
- These services are not very feature-rich or convenient:
  - Users search for ETDs based on keywords.
  - Users don’t know what lies underneath (no idea about the size, topical coverage, etc. of ETD collections).
  - Not very amenable to browsing (users have to sift through search results).

Page 6: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Goals
- Provide a portal to ETD collections of more universities
- Provide value-added services
  - Categorize by topic
  - Support searching and browsing the collection using various criteria (by topic, keywords, date, author, etc.)

Page 7: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Goals - priorities
- Set up infrastructure for crawling ETDs of various universities
- Come up with techniques for categorizing them into topical areas
- Set up a user-friendly search and browse interface

Page 8: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Approach
- Crawl ETDs from various universities
- Develop a taxonomy
- Categorize ETDs into topics in the taxonomy tree
- Index the ETDs
- Develop a search and browse interface

Page 9: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Approach - crawling
- NDLTD’s Union Catalog used as the starting point
  - Dublin Core metadata gathered
  - URLs used to crawl ETDs and other data from the respective universities’ websites
- Custom crawlers written
  - Technologies used: Perl and other open source Perl libraries (WWW, Mechanize, etc.)
- All metadata (Dublin Core metadata from the Union Catalog, and the metadata obtained from the respective universities) is stored in our MySQL backend database.
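The metadata step above can be illustrated with a minimal Python sketch. It is not the authors' Perl crawler: it assumes the catalog records arrive as `oai_dc`-style XML, uses an invented sample record, and uses an in-memory SQLite table as a stand-in for the MySQL backend. The table name `etd_metadata` and the `dc_` column prefix are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# The 15 elements of the Dublin Core Metadata Element Set.
DC_FIELDS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Invented record for illustration only; not taken from the Union Catalog.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Dissertation</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>Information retrieval</dc:subject>
  <dc:subject>Digital libraries</dc:subject>
  <dc:date>2008-05-01</dc:date>
  <dc:identifier>http://example.edu/etd/12345</dc:identifier>
</oai_dc:dc>"""


def parse_dc(xml_text):
    """Return a dict with one entry per Dublin Core field.

    Repeated elements (e.g. several subjects) are joined with "; ";
    missing elements become empty strings.
    """
    root = ET.fromstring(xml_text)
    record = {}
    for field in DC_FIELDS:
        elements = root.findall(f"{{{DC_NS}}}{field}")
        values = [el.text.strip() for el in elements if el.text]
        record[field] = "; ".join(values)
    return record


def create_table(conn):
    """Create one TEXT column per Dublin Core field (hypothetical schema)."""
    columns = ", ".join(f"dc_{field} TEXT" for field in DC_FIELDS)
    conn.execute(f"CREATE TABLE etd_metadata ({columns})")


def store(conn, record):
    """Insert one parsed record, using placeholders for the values."""
    placeholders = ", ".join("?" for _ in DC_FIELDS)
    conn.execute(
        f"INSERT INTO etd_metadata VALUES ({placeholders})",
        [record[field] for field in DC_FIELDS],
    )


conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
create_table(conn)
record = parse_dc(SAMPLE_RECORD)
store(conn, record)
print(conn.execute("SELECT dc_title, dc_subject FROM etd_metadata").fetchall())
```

The `dc:identifier` value is where a crawler would find the URL to follow to the university's own site.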

Page 10: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Approach - taxonomy
- Need a taxonomy of medium generality and specificity, unlike those from ProQuest, DMOZ, or Wikipedia
  - For example, DMOZ has more than 500,000 nodes!
- Solution?
  - Prune the DMOZ category tree, and then enhance it using the ProQuest categorization system.
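The slides do not say how the DMOZ tree was pruned or how ProQuest categories were merged in. As one possible reading, the sketch below prunes a toy tree to a fixed depth and then adds missing child labels from a second scheme. The depth rule, the `extra` mapping, and the "Poetry" and "Haiku" nodes are assumptions made for illustration.

```python
def prune(node, max_depth, depth=1):
    """Return a copy of the tree that keeps only levels 1..max_depth."""
    if depth >= max_depth:
        return {"name": node["name"], "children": []}
    return {
        "name": node["name"],
        "children": [prune(child, max_depth, depth + 1)
                     for child in node.get("children", [])],
    }


def enhance(node, extra):
    """Add child labels from `extra` (parent name -> child names) in place.

    Labels already present under the parent are not duplicated.
    """
    existing = {child["name"] for child in node["children"]}
    for label in extra.get(node["name"], []):
        if label not in existing:
            node["children"].append({"name": label, "children": []})
            existing.add(label)
    for child in node["children"]:
        enhance(child, extra)
    return node


def count_nodes(node):
    """Count the nodes in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node["children"])


# Toy stand-in for a much larger directory tree.
toy_tree = {"name": "TOP", "children": [
    {"name": "Arts", "children": [
        {"name": "Literature", "children": [
            {"name": "Poetry", "children": [
                {"name": "Haiku", "children": []}]}]}]},
    {"name": "Science", "children": [
        {"name": "Physics", "children": []}]},
]}

pruned = prune(toy_tree, max_depth=3)
enhanced = enhance(pruned, {"Arts": ["Literature", "Art History"]})
print(count_nodes(toy_tree), count_nodes(enhanced))
```

Pruning by depth is only one option; pruning by the number of documents under a node would fit the same interface.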

Page 11: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Approach – taxonomy levels

[Figure: Category Tree (only top 2 levels shown)]

Level 1: TOP

Level 2: Arts, Business, Computers, Health, Science, Society

Page 12: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Approach - categorize
- Supervised classification approach used
- Training set built by using topic labels as queries to Google
  - 50 webpages retrieved and used for training a Naïve Bayes classifier for each node (to distinguish between its children)
- ETD metadata used for categorization
- Level-wise categorization
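To make the classification step concrete, here is a minimal multinomial Naïve Bayes sketch in plain Python with add-one (Laplace) smoothing. It is an illustration only: the slides do not specify the Naïve Bayes variant or smoothing used, the training snippets are invented stand-ins for retrieved webpages, and stemming is left out for brevity.

```python
import math
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to"}


def tokenize(text):
    """Lowercase, split into words, and drop stopwords (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, label, tokens):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented snippets standing in for webpages retrieved per topic label.
training_pages = {
    "Computers": ["software algorithms and database systems",
                  "computer networks and programming languages"],
    "Health": ["clinical medicine and patient care",
               "public health nutrition and disease prevention"],
}

classifier = NaiveBayes()
for label, pages in training_pages.items():
    for page in pages:
        classifier.train(label, tokenize(page))

print(classifier.classify(tokenize("Efficient query algorithms for database systems")))
```

In the approach described above, one such classifier would sit at each internal node of the taxonomy and choose among that node's children.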

Page 13: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Approach (contd.)

Algorithm Pipeline

[Figure: pipeline diagram. Boxes: Category Tree, Google, Document Sets, Training Sets, Naïve Bayes Classifiers, ETD Collection, Categorized ETDs, Web Interface]

Labels in the diagram:
- Category label for each node used as query
- Top 50 webpages (for each node in the tree)
- Cleanup (stemming, stopword removal, etc.)
- Training
- ETD metadata used for categorization
- Level-wise categorization
- ETDs categorized into a node of the category tree (after classification)
- Browsing
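Level-wise categorization can be sketched as a walk down the category tree, where each node's classifier picks one child and the document descends until the requested level is reached. In the Python sketch below, a simple keyword-overlap score stands in for the per-node Naïve Bayes classifiers; the `KEYWORDS` sets and the tiny tree are hypothetical.

```python
# Parent -> children, using a few category names from the taxonomy slides.
TREE = {
    "TOP": ["Arts", "Business"],
    "Arts": ["Literature", "Visual Arts"],
    "Business": ["Accounting", "Banking"],
}

# Hypothetical keyword sets standing in for trained per-node classifiers.
KEYWORDS = {
    "Arts": {"art", "painting", "poetry", "novel"},
    "Business": {"finance", "market", "bank", "accounting"},
    "Literature": {"poetry", "novel"},
    "Visual Arts": {"painting", "sculpture"},
    "Accounting": {"accounting", "audit", "ledger"},
    "Banking": {"bank", "loan", "credit"},
}


def choose_child(node, tokens):
    """Pick the child of `node` whose keyword set overlaps `tokens` most."""
    return max(TREE[node], key=lambda child: len(KEYWORDS[child] & tokens))


def categorize(tokens, max_level):
    """Descend from TOP (level 1) until `max_level` or a leaf is reached."""
    path = ["TOP"]
    while path[-1] in TREE and len(path) < max_level:
        path.append(choose_child(path[-1], tokens))
    return path


metadata_tokens = {"bank", "loan", "credit", "risk"}
print(categorize(metadata_tokens, max_level=2))
print(categorize(metadata_tokens, max_level=3))
```

One consequence of this design is that a wrong choice at a higher level cannot be corrected further down, so accuracy at level 2 bounds accuracy at levels 3 and 4.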

Page 14: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Results
- Crawled metadata for all the ETDs from the NDLTD Union Catalog
  - ~800,000 ETDs in Union Catalog
  - 15 Dublin Core fields extracted and stored
- Crawled ~200,000 dissertations from the respective universities (where permissible), and indexing is in progress
  - Technology used: Lucene search engine
  - More dissertations being crawled

Page 15: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Results (contd.)
- Enhanced taxonomy developed
  - Some subtrees are shown in the following few slides.
  - The taxonomy currently is 4 levels deep and has ~200 nodes.
  - It is being enhanced to be 5-6 levels deep.

Page 16: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Results (contd.)

Enhanced taxonomy (some nodes from the “Arts” subtree shown)

Level 2: Arts

Level 3: Speech, Literature, Art History, Classical Studies, Visual Arts, Performing Arts, ...

Page 17: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Results (contd.)

Enhanced taxonomy (some nodes from the “Business” subtree shown)

Level 2: Business

Level 3: E-Commerce, Accounting, Human Resources, Investing, Banking, Management, ...

Page 18: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Results (contd.)
- Categorized >74K ETDs from 8 universities
  - MIT, Virginia Tech, Caltech, NCSU, Georgia Tech, Ohiolink, Rice, Texas A&M
- Categorized into 6 topical areas (Arts, Business, Computers, Health, Science, Society)
- Categorization into lower levels of the category tree (levels 3 and 4, that is) is in progress

Page 19: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Results (contd.)

                 Total No.   Category
University       of ETDs     Arts   Business   Computers   Health   Science   Society
MIT              29804        653       1847        6507      375      7141       555
Virginia Tech    11976        742        627        2665     1218      3317       340
Ohiolink          8020       1056        350        1267     1322      2887       345
Rice              6685        937        235        1181      145      2412        62
NCSU              5026        283        245        1419      512      2436       114
Texas A&M         4834        302        363        1363      566      2115       125
Caltech           4774         58         52        1392       29      3096        18
Georgia Tech      3582         32        133        1348       85      1233        23
TOTAL            74701       4063       3852       17142     4252     24637      1582
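As a quick consistency check on the table, the short Python sketch below re-adds each column and compares it with the TOTAL row. The numbers are copied from the table; the variable names are only for illustration.

```python
CATEGORIES = ["Arts", "Business", "Computers", "Health", "Science", "Society"]

# University -> (total ETDs, per-category counts in CATEGORIES order)
ROWS = {
    "MIT":           (29804, [653, 1847, 6507, 375, 7141, 555]),
    "Virginia Tech": (11976, [742, 627, 2665, 1218, 3317, 340]),
    "Ohiolink":      (8020,  [1056, 350, 1267, 1322, 2887, 345]),
    "Rice":          (6685,  [937, 235, 1181, 145, 2412, 62]),
    "NCSU":          (5026,  [283, 245, 1419, 512, 2436, 114]),
    "Texas A&M":     (4834,  [302, 363, 1363, 566, 2115, 125]),
    "Caltech":       (4774,  [58, 52, 1392, 29, 3096, 18]),
    "Georgia Tech":  (3582,  [32, 133, 1348, 85, 1233, 23]),
}
TOTAL_ROW = (74701, [4063, 3852, 17142, 4252, 24637, 1582])

# Re-add every column from the per-university rows.
total_etds = sum(total for total, _ in ROWS.values())
column_sums = [sum(counts[i] for _, counts in ROWS.values())
               for i in range(len(CATEGORIES))]

# Sum of the six category totals, to compare with the total ETD count.
assigned = sum(TOTAL_ROW[1])

print("Totals match:", total_etds == TOTAL_ROW[0] and column_sums == TOTAL_ROW[1])
print("Sum of category counts:", assigned)
print("Difference from total ETDs:", TOTAL_ROW[0] - assigned)
```

Every column agrees with the TOTAL row. The six category counts add up to 55,528, which is 19,173 fewer than the 74,701 total; the slides do not explain the difference.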

Page 20: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Results (contd.)
- Algorithm is time efficient.
  - Training the classifier is done offline.
  - Classification is fast.
  - Classifying this collection of ~74,000 ETDs took <30 mins.
- Hopefully classifiers developed can be applied to other data and in other systems.

Page 21: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Future Work
- Increase coverage
  - Crawl more ETDs
  - Collaborate with universities and consortia to gain access to ETD collections
- Better categorization approaches
  - Leverage query expansion techniques to build training set
- Web interface to facilitate browsing and search
- User studies to measure the efficacy of the system

Page 22: Topical Categorization of Large Collections of Electronic Theses and Dissertations

Questions?

[email protected]@[email protected]@vt.edu

Demo info available at http://fox.cs.vt.edu/etdbrowse/