Supporting the Exploration of Research Spaces

Chwhynny Overbeeke
[email protected]

Supervisors: Enrico Motta, Tom Heath, Paul Mulholland
Department: Knowledge Media Institute
Status: Full-time
Probation viva: Before
Starting date: December 2009

1 Introduction

It is often hard to make sense of what exactly is going on in the research community. What topics or researchers are new and emerging, gaining popularity, or disappearing? How does this happen and why? What are the key publications or events in a particular area? How can we understand whether geographical shifts are occurring in a research area? There are several tools available that allow users to explore different elements of a research area. However, making sense of the dynamics of a research area is still a very challenging task. This leads to my research question:

How can we improve the level of support for people to explore the dynamics of a research community?

2 Framework and Background

In order to answer this question we first need to identify the different elements, relations and dimensions that define a research area and put them into a framework. We then need to find existing tools that address these elements, and categorize them according to our framework in order to identify gaps in the current level of support. Some elements we have already identified are: people, institutions and organizations, events, activity, popularity, publications, citations, time, geography, keywords, studentships, funding, impact, and technologies.

The people element is about the researchers that are or were present in the research community, whilst the institutions and organizations element refers to the research groups, institutions, and organizations that are active within an area of research, and the affiliations the people within the community have with them. Events can be workshops, conferences, seminars, competitions, or any other kind of research-related happening. EventSeer¹ is a service that aggregates all the calls for papers and event announcements that float around the web into one common, searchable tool. It keeps track of events, people, topics and organizations, and lists the most popular people, topics, and organizations per week.

¹ http://www.eventseer.net

2010 CRC PhD Student Conference

The activity element refers to how active the researchers, institutions, and organizations are within the field, for instance event attendance or organization, or the number and frequency of publications and events. A tool that can be used to explore this is Faceted DBLP², a server interface for the DBLP server³ which provides bibliographic information on major computer science journals and proceedings [Ley 2002]. Faceted DBLP starts with some keyword and shows the result set along with a set of facets, e.g. distinguishing publication years, authors, venues, and publication types. The user can characterize the result set in terms of main research topics and filter it according to certain subtopics. There are GrowBag graphs available for keywords (number of hits/coverage).

Popularity is about the interest that is displayed in a person, institution or organization, publication, topic, technology, or event. WikiCFP⁴ is a service that helps organize and share academic information. Users can browse and add calls for papers per subject category, and can add calls for papers to their own personal user list. Each call for papers has information on the event name, date, location, and deadline. WikiCFP also provides hourly updated lists of the most popular categories, calls for papers, and user lists.

One indicator of topic popularity is the number of publications on a topic. There are many tools that show the number of publications per topic per year. PubSearch is a fully automatic web mining approach for the identification of research trends that searches and downloads scientific publications from web sites that typically include academic web pages [Tho et al. 2003]. It extracts citations and stores them in the tool’s Web Citation Database, which is used to generate temporal document clusters and journal clusters. These clusters are then mined to find their interrelationships, which are used to detect trends and emerging trends for a specified research area.

Another indicator of popularity is how often a publication or researcher is cited. Citations can also help identify relations between researchers through analysis of who is citing whom and when, and what their affiliations are. Publish Or Perish is a piece of software that retrieves and analyzes academic citations [Harzing and Van der Wal 2008]. It uses Google Scholar⁵ to obtain raw citations, and analyzes them. It presents a wide range of citation metrics such as the total number of papers and citations, the average number of citations per paper and author, the average number of papers per author and year, an analysis of the number of authors per paper, et cetera.
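To make the kind of metrics listed above concrete, the following is a minimal sketch of how such aggregates could be computed from a list of papers. The `Paper` record, the `citation_metrics` function and the sample data are hypothetical, and the exact definitions used by Publish Or Perish may well differ (for example in how per-author figures are normalized).

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical record for one publication in a result set."""
    title: str
    authors: list[str]
    year: int
    citations: int


def citation_metrics(papers: list[Paper]) -> dict[str, float]:
    """Compute simple aggregate citation metrics for a set of papers."""
    if not papers:
        raise ValueError("need at least one paper")
    distinct_authors = {a for p in papers for a in p.authors}
    if not distinct_authors:
        raise ValueError("need at least one author")
    total_papers = len(papers)
    total_citations = sum(p.citations for p in papers)
    years = [p.year for p in papers]
    # Assumption: "per year" means per calendar year in the covered span.
    span = max(years) - min(years) + 1
    return {
        "papers": total_papers,
        "citations": total_citations,
        "citations_per_paper": total_citations / total_papers,
        "citations_per_author": total_citations / len(distinct_authors),
        "papers_per_author": total_papers / len(distinct_authors),
        "papers_per_year": total_papers / span,
        "authors_per_paper": sum(len(p.authors) for p in papers) / total_papers,
    }


# Made-up sample data, for illustration only.
SAMPLE = [
    Paper("A", ["Ann", "Bob"], 2006, 10),
    Paper("B", ["Ann"], 2007, 4),
    Paper("C", ["Bob", "Cy", "Di"], 2009, 0),
]
metrics = citation_metrics(SAMPLE)
```

Averages of this kind are sensitive to the quality of the underlying citation data, so they are best read as rough indicators rather than exact measurements.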

Topics, interests, and people evolve over time, and the makeup of the research community changes when people and organizations enter or leave certain research areas or change their direction. Some topics appear to be more established or densely represented in certain geographical areas, for instance because a prolific institution is located there and has attracted several experts on a particular topic, or because many events on a topic are held in that area. AuthorMapper⁶ is an online tool for visualizing scientific research. It searches journal articles from SpringerLink⁷ and allows users to explore the database by plotting the location of authors, research topics and institutions on a world map. It also allows users to identify research trends through timeline graphs, statistics and regions.

Keywords are an important indicator of a research area because they are the labels that have been put on publications or events by the people and organizations within that research area. Google Scholar is a subset of the Google search index consisting of full-text journal articles, technical reports, preprints, theses, books, and web sites that are deemed ’scholarly’ [Noruzi 2005, Harzing and Van der Wal 2008]. Google Scholar has crawling and indexing agreements with several publishers. The system is based on keyword search only and its results are organized by a closely guarded relevance algorithm. The ’cited-by-x’ feature allows users to see by whom a publication was cited, and where.

² http://dblp.l3s.de/
³ http://dblp.uni-trier.de/
⁴ http://www.wikicfp.com/
⁵ http://scholar.google.com/
⁶ http://www.authormapper.com/
⁷ http://www.springerlink.com/

The availability of new studentships indicates that a research area is trying to attract new people. This may mean that the area is hoping to expand, change direction, or become more established. The availability of funding within a research area or topic is an indicator of the interest that is displayed in it, or the level of importance it is deemed to have at a particular time. The Postgraduate Studentships web site⁸ offers a search engine as well as a browsable list of study or funding opportunities organized by subjects, masters, PhD/doctoral and professional doctorates and a browsable list of general funders, funding universities and featured departments. The site also lists open days and fairs.

The level of impact of the research carried out by a research group, institution, organization or individual researcher leads to their establishment in the research community, which in turn could lead to more citations and event attendance. The technologies element refers to the technologies that are developed within an area of research, and their impact, popularity and establishment. Research impact is implemented on a small scale in Scopus (http://www.scopus.com/), currently a preview-only tool which, amongst other things, identifies and matches an organization with all its research output, tracks how primary research is practically applied in patents, and tracks the influence of peer-reviewed research on web literature. It covers nearly 18,000 titles from over 5,000 publishers, 40,000,000 records, scientific web pages, and articles-in-press. A tool that ranks publications is DBPubs, a system for analyzing and exploring the content of database publications by combining keyword search with OLAP-style aggregations, navigation, and reporting [Baid et al. 2008]. It performs keyword search over the content of publications. The metadata (title, author, venue, year, et cetera) provide static OLAP dimensions, which are combined with dynamic dimensions discovered from the content of the publications in the search result, such as frequent phrases, relevant phrases and topics. Based on the link structure between documents (i.e. citations), publication ranks are computed, which are aggregated to find seminal papers, discover trends, and rank authors.
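As an illustration of how ranks can be derived from a citation link structure, the sketch below applies a PageRank-style iteration to a small citation graph. This is a stand-in technique chosen for illustration; the ranking method actually used by DBPubs is not described here and may differ. The function name `rank_publications`, the damping factor and the example graph are all assumptions.

```python
def rank_publications(cites, damping=0.85, iterations=50):
    """PageRank-style scores over a citation graph.

    `cites` maps each paper id to the list of paper ids it cites.
    Returns a dict from paper id to score; the scores sum to 1.
    """
    papers = set(cites) | {t for targets in cites.values() for t in targets}
    if not papers:
        return {}
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Papers that cite nothing spread their score evenly over all papers.
        dangling = sum(rank[p] for p in papers if not cites.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in papers}
        for source, targets in cites.items():
            if not targets:
                continue
            # Each citing paper splits its score across the papers it cites.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Made-up citation graph: p1, p2 and p4 all cite p3; p4 also cites p1.
EXAMPLE_GRAPH = {"p1": ["p3"], "p2": ["p3"], "p3": [], "p4": ["p1", "p3"]}
scores = rank_publications(EXAMPLE_GRAPH)
top_paper = max(scores, key=scores.get)
```

Per-author scores could then be obtained by summing the scores of each author's papers, which is one simple way of aggregating publication ranks into an author ranking.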

Finally, we would like to discuss a more generic tool, DBLife⁹ [DeRose et al. 2007, Goldberg and Andrzejewski 2007, Doan et al. 2006], which is a prototype of a dynamic portal of current information for the database research community. It automatically discovers and revisits web pages and resources for the community, extracts information from them, and integrates it to present a unified view of people, organizations, papers, talks, et cetera. For example, it provides a chronological summary, has a browsable list of organizations and conferences, and it summarizes interesting new facts for the day such as new publications, events, or projects. It also provides community statistics including top cited people, top h-indexed people, and top cited publications. DBLife is currently unfinished and does not have full functionality, but from the prototype alone one can conclude it will most likely address quite a few elements from our framework.
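The h-index used in such community statistics is the largest number h such that a researcher has h papers with at least h citations each. A minimal sketch of the computation follows; the function name `h_index` is an assumption, and how DBLife gathers the underlying citation counts is not covered here.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break
    return h
```

For instance, a researcher whose papers have been cited 10, 8, 5, 4 and 3 times has an h-index of 4.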

⁸ http://www.postgraduatestudentships.co.uk/
⁹ http://dblife.cs.wisc.edu/

3 Methodology

In order to find out what the key problems are that people encounter when trying to make sense of the dynamics of a research area, we will carry out an empirical study, which consists of a task and a short questionnaire.

The 30 to 40 minute task is to be carried out by around 10 to 12 subjects who will be asked to investigate a research area that is fairly new to them and write a short report on their findings. The subjects’ actions will be recorded using screen capture software and the subjects themselves will be videoed for the duration of the task so that the entire exploration process is documented. The screen capture will show the actions the subjects take and the tools they use to reach their goal. The video data will show any reactions the subjects may display during their exploration process, for example confusion or frustration with a tool they are trying to use. The questionnaire will be filled out by as many subjects as possible, who will be asked to identify the key elements of a research area which they would take into account when planning PhD research. In the questionnaire people will be made aware of the framework we created, but we will allow for open answers and additions to the existing framework.

The technical study will consist of an overview, comparison, critical review, and gap analysis of existing tools that support the exploration of the research community. It will link those tools to our framework in order to find out to what extent the several elements are covered by the existing tools.
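The mapping of tools to framework elements can be thought of as a coverage matrix. The sketch below shows one possible way to represent it and to list the uncovered elements. The framework elements are those listed in Section 2; the tool-to-element mapping is a deliberately incomplete, illustrative assumption and not a result of the study, so the gaps it reports are an artifact of the sample data.

```python
FRAMEWORK = [
    "people", "institutions and organizations", "events", "activity",
    "popularity", "publications", "citations", "time", "geography",
    "keywords", "studentships", "funding", "impact", "technologies",
]

# Illustrative, incomplete mapping (an assumption, not a study result).
TOOL_COVERAGE = {
    "EventSeer": {"events", "people", "institutions and organizations", "popularity"},
    "Publish Or Perish": {"citations", "publications", "people"},
    "AuthorMapper": {"geography", "time", "people", "institutions and organizations"},
    "Postgraduate Studentships": {"studentships", "funding"},
}


def coverage_report(framework, tool_coverage):
    """Return, per framework element, the sorted list of tools covering it."""
    unknown = set().union(*tool_coverage.values()) - set(framework)
    if unknown:
        raise ValueError(f"elements not in framework: {sorted(unknown)}")
    return {
        element: sorted(
            tool for tool, covered in tool_coverage.items() if element in covered
        )
        for element in framework
    }


def gaps(framework, tool_coverage):
    """Framework elements that no tool in the mapping covers."""
    report = coverage_report(framework, tool_coverage)
    return [element for element, tools in report.items() if not tools]


uncovered = gaps(FRAMEWORK, TOOL_COVERAGE)
```

A set-based mapping only records whether an element is covered at all. Recording the extent of coverage, which the technical study aims at, would need a graded value per tool and element instead.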

At this stage we will have highlighted the key elements that define a research area, identified gaps in the existing support for the exploration of the research community, and gathered evidence to support this by mapping existing tools to our framework, carrying out a practical task, and sending out a questionnaire. We will then aim to improve support for people to explore the dynamics of the research community by implementing novel tools, addressing the gaps that have emerged from these studies. Our hypothesis is that at least some of these gaps are due to the lack of integration between different types of data covering different elements of a research area.

References

Baid, A., Balmin, A., Hwang, H., Nijkamp, E., Rao, J., Reinwald, B., Simitsis, A., Sismanis, Y., and Van Ham, F. (2008). DBPubs: Multidimensional Exploration of Database Publications. Proceedings of the VLDB Endowment, 1(2):1456–1459.

DeRose, P., Shen, W., Chen, F., Lee, Y., Burdick, D., Doan, A., and Ramakrishnan, R. (2007). DBLife: A Community Information Management Platform for the Database Research Community. In Weikum, G., Hellerstein, J., and Stonebraker, M., editors, Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research (CIDR 2007), Asilomar, California, USA.

Diederich, J. and Balke, W. (2008). FacetedDBLP - Navigational Access for Digital Libraries. Bulletin of the IEEE Technical Committee on Digital Libraries (TCDL), 4(1).

Diederich, J., Balke, W., and Thaden, U. (2007). Demonstrating the Semantic GrowBag: Automatically Creating Topic Facets for FacetedDBLP. In Proceedings of the ACM IEEE Joint Conference on Digital Libraries (JCDL 2007), Vancouver, British Columbia, Canada.

Doan, A., Ramakrishnan, R., Chen, F., DeRose, P., Lee, Y., McCann, R., Sayyadian, M., and Shen, W. (2006). Community Information Management. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Databases, 29.

Goldberg, A. and Andrzejewski, D. (2007). Automatic Research Summaries in DBLife. CS 764: Topics in Database Management Systems.

Harzing, A. and Van der Wal, R. (2008). Google Scholar as a New Source for Citation Analysis. Ethics in Science and Environmental Politics, 8:61–73.

Ley, M. (2002). The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives. In Proceedings of the 9th International Symposium (SPIRE 2002), pages 481–486, Lisbon, Portugal.

Noruzi, A. (2005). Google Scholar: The New Generation of Citation Indexes. Libri, 55:170–180.

Tho, Q., Hui, S., and Fong, A. (2003). Web Mining for Identifying Research Trends. In Sembok, T., Badioze Zaman, H., Chen, H., Urs, S., and Myaeng, S., editors, Proceedings of the 6th International Conference on Asian Digital Libraries (ICADL 2003), pages 290–301, Kuala Lumpur, Malaysia. Springer.
