
Proceedings of the ACL-HLT 2011 System Demonstrations, pages 74–79, Portland, Oregon, USA, 21 June 2011. © 2011 Association for Computational Linguistics

Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

Daniel Bär, Nicolai Erbs, Torsten Zesch, and Iryna Gurevych
Ubiquitous Knowledge Processing Lab

Computer Science Department, Technische Universität Darmstadt
Hochschulstrasse 10, D-64289 Darmstadt, Germany

www.ukp.tu-darmstadt.de

Abstract

We present Wikulu[1], a system focusing on supporting wiki users with their everyday tasks by means of an intelligent interface. Wikulu is implemented as an extensible architecture which transparently integrates natural language processing (NLP) techniques with wikis. It is designed to be deployed with any wiki platform, and the current prototype integrates a wide range of NLP algorithms such as keyphrase extraction, link discovery, text segmentation, summarization, and text similarity. Additionally, we show how Wikulu can be applied for visually analyzing the results of NLP algorithms, educational purposes, and enabling semantic wikis.

[1] Portmanteau of the Hawaiian terms wiki ("fast") and kukulu ("to organize")

1 Introduction

Wikis are web-based, collaborative content authoring systems (Leuf and Cunningham, 2001). As they offer fast and simple means for adding and editing content, they are used for various purposes such as creating encyclopedias (e.g. Wikipedia[2]), constructing dictionaries (e.g. Wiktionary[3]), or hosting online communities (e.g. ACLWiki[4]). However, as wikis do not force their users to structure pages or add complementary metadata, they often end up as a mass of unmanageable pages with meaningless page titles and no usable link structure (Buffa, 2006).

[2] http://www.wikipedia.org
[3] http://www.wiktionary.org
[4] http://aclweb.org/aclwiki

To solve this issue, we present the Wikulu system, which uses natural language processing to support wiki users with their typical tasks of adding, organizing, and finding content. For example, Wikulu supports users in reading longer texts by highlighting keyphrases determined by keyphrase extraction methods such as TextRank (Mihalcea and Tarau, 2004). The support integrated in Wikulu also includes text segmentation to structure long pages, text similarity to detect potential duplicates, and text summarization to facilitate reading of lengthy pages. Generally, Wikulu can integrate any NLP component which conforms to the standards of Apache UIMA (Ferrucci and Lally, 2004).
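To give a concrete picture of what a UIMA-compliant component looks like, the following minimal sketch shows an annotator that could be plugged into such an architecture. The class name and the crude capitalization heuristic are our own illustration, not part of Wikulu; a real deployment would plug in an extractor such as TextRank instead.

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example of a pluggable UIMA component. As a stand-in for
// a real keyphrase extractor, it marks runs of capitalized words.
public class NaiveKeyphraseAnnotator extends JCasAnnotator_ImplBase {

    private static final Pattern CAPITALIZED_RUN =
            Pattern.compile("([A-Z][a-z]+)(\\s+[A-Z][a-z]+)+");

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        String text = jcas.getDocumentText();
        Matcher m = CAPITALIZED_RUN.matcher(text);
        while (m.find()) {
            // uima.tcas.Annotation is part of UIMA's built-in type system,
            // so this sketch needs no custom type definitions.
            Annotation ann = new Annotation(jcas, m.start(), m.end());
            ann.addToIndexes();
        }
    }
}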

Wikulu is designed to integrate seamlessly with any wiki. Our system is implemented as an HTTP proxy server which intercepts the communication between the web browser and the underlying wiki engine. No further modifications to the original wiki installation are necessary. Currently, our system prototype contains adaptors for two widely used wiki engines: MediaWiki[5] and TWiki[6]. Adaptors for other wiki engines can be added with minimal effort. Generally, Wikulu could also be applied to any web-based system other than wikis with only slight modifications to its architecture.

In Figure 1, we show the integration of Wikulu with Wikipedia.[7] The additional user interface components are integrated into the default toolbar (highlighted by a red box in the screenshot). In this example, the user has requested keyphrase highlighting in order to quickly get an idea of the main content of the wiki article. Wikulu then invokes the corresponding NLP component and highlights the returned keyphrases in the article. In the next section, we give a more detailed overview of the different types of support provided by Wikulu.

[5] http://mediawiki.org (e.g. used by Wikipedia)
[6] http://twiki.org (often used for corporate wikis)
[7] As screenshots only provide a limited overview of Wikulu's capabilities, we refer the reader to a screencast: http://www.ukp.tu-darmstadt.de/research/projects/wikulu

Figure 1: Integration of Wikulu with Wikipedia. The augmented toolbar (red box) and the results of a keyphrase extraction algorithm (yellow text spans) are highlighted.

2 Supporting Wiki Users by Means of NLP

In this section, we present the different types of NLP-enabled support provided by Wikulu.

Detecting Duplicates Whenever users add new content to a wiki, there is the danger of duplicating information that is already present. In order to avoid duplication, users would need comprehensive knowledge of what content already exists in the wiki, which is almost impossible for large wikis like Wikipedia. Wikulu helps to detect potential duplicates by computing the text similarity between newly added content and each existing wiki page. If a potential duplicate is detected, the user is notified and may decide to augment the duplicate page instead of adding a new one. Wikulu integrates text similarity measures such as Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007) and Latent Semantic Analysis (Landauer et al., 1998).
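To illustrate the underlying idea rather than Wikulu's actual components, the sketch below reduces the duplicate check to a plain bag-of-words cosine similarity; ESA and LSA build the vectors differently but compare them in the same way. The class name and the 0.8 threshold are hypothetical.

import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a text similarity measure: term-frequency
// vectors compared by cosine similarity.
public class DuplicateDetector {

    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Such a check would run against every existing page when new
    // content is saved; the threshold is an illustrative value.
    static boolean isPotentialDuplicate(String newPage, String existingPage) {
        return cosine(termVector(newPage), termVector(existingPage)) > 0.8;
    }
}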

Suggesting Links While many wiki users readily add textual content to wikis, they often refrain from also adding links to related pages. However, links in wikis are crucial as they allow users to quickly navigate from one page to another, or browse through the wiki. Therefore, it may be reasonable to augment a page about the topic sentiment analysis with a link to a page providing related information such as evaluation datasets. Wikulu supports users in this tedious task by automatically suggesting links. Link suggestion is thereby a two-step process: (a) first, suitable text phrases are extracted which might be worth placing a link on (see Figure 2), and (b) for each phrase, related pages are ranked by comparing their relevance to the current page, and then presented to the user. The user may thus decide whether she wants to use a detected phrase as a link, and if so, which other wiki page to link this phrase to. Wikulu currently integrates the link suggestion algorithms by Geva (2007) and Itakura and Clarke (2007).

Figure 2: Automatic discovery of links to other wiki articles. Suitable text phrases to place a link on are highlighted in green.
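The following sketch illustrates step (a) in its simplest conceivable form: candidate phrases are found by matching the titles of existing wiki pages verbatim. The algorithms by Geva (2007) and Itakura and Clarke (2007) are considerably more sophisticated; all names here are our own illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of candidate phrase detection for link suggestion.
public class LinkSuggester {

    // Step (a): find spans of the page text that match existing page titles.
    static List<int[]> findCandidateSpans(String pageText, Set<String> wikiTitles) {
        List<int[]> spans = new ArrayList<>();
        String lower = pageText.toLowerCase();
        for (String title : wikiTitles) {
            int from = 0, idx;
            while ((idx = lower.indexOf(title.toLowerCase(), from)) >= 0) {
                spans.add(new int[] { idx, idx + title.length() });
                from = idx + title.length();
            }
        }
        // Step (b) would rank the candidate target pages for each span
        // by their relevance to the current page.
        return spans;
    }
}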

Semantic Searching The capabilities of a wiki's built-in search engine are typically rather limited, as it traditionally performs only keyword-based retrieval. If a keyword is not found in the wiki, the query returns an empty result set. However, a page might exist which is semantically related to the keyword, and should thus yield a match.

As the search engine is typically a core part of the wiki system, it is rather difficult to modify its behavior. However, by leveraging Wikulu's architecture, we can replace the default search mechanism by algorithms which allow for semantic search, alleviating the vocabulary mismatch problem (Gurevych et al., 2007).

Segmenting Long Pages Due to the open editing policy of wikis, pages tend to grow rather fast. For users, it is thus a major challenge to keep an overview of what content is present on a certain page. Wikulu therefore supports users by analyzing long pages with text segmentation algorithms which detect topically coherent segments of text. It then suggests segment boundaries, which the user may or may not accept for inserting a subheading that makes pages easier to read and better to navigate. As shown in Figure 3, users are also encouraged to set a title for each segment.[8] When accepting one or more of these suggested boundaries, Wikulu stores them persistently in the wiki. Wikulu currently integrates text segmentation methods such as TextTiling (Hearst, 1997) or C99 (Choi, 2000).

[8] In future work, we plan to suggest suitable titles for each segment automatically.

Figure 3: Analysis of a wiki article with respect to topical coherence. Suggested segment breaks are highlighted by yellow bars.
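The sketch below illustrates the principle behind such segmentation algorithms under strong simplifying assumptions: adjacent fixed-size token blocks are compared by vocabulary overlap, and a boundary is suggested wherever the overlap drops below a threshold. TextTiling and C99 refine this idea considerably; block size and threshold here are illustrative.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical lexical-cohesion segmenter in the spirit of TextTiling.
public class SimpleSegmenter {

    static List<Integer> suggestBoundaries(String[] tokens, int blockSize, double threshold) {
        List<Integer> boundaries = new ArrayList<>();
        for (int i = blockSize; i + blockSize <= tokens.length; i += blockSize) {
            Set<String> left = vocabulary(tokens, i - blockSize, i);
            Set<String> right = vocabulary(tokens, i, i + blockSize);
            Set<String> union = new HashSet<>(left);
            union.addAll(right);
            left.retainAll(right); // left now holds the intersection
            double jaccard = union.isEmpty() ? 0 : (double) left.size() / union.size();
            if (jaccard < threshold) {
                boundaries.add(i); // a subheading could be suggested at this token index
            }
        }
        return boundaries;
    }

    static Set<String> vocabulary(String[] tokens, int from, int to) {
        Set<String> vocab = new HashSet<>();
        for (int i = from; i < to; i++) {
            vocab.add(tokens[i].toLowerCase());
        }
        return vocab;
    }
}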

Summarizing Pages Similarly to segmenting pages, Wikulu makes long wiki pages more accessible by generating an extractive summary. While abstractive methods generate a summary in their own words, extractive methods analyze the original wiki text sentence by sentence, rank each sentence, and return a list of the most important ones (see Figure 4). Wikulu integrates extractive text summarization methods such as LexRank (Erkan and Radev, 2004).

Figure 4: Extractive summary of the original wiki page shown in Figure 3.
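As a rough illustration of extractive summarization (not of Wikulu's LexRank component, which uses graph-based centrality), the sketch below scores each sentence by the average frequency of its words in the whole page and returns the top-ranked sentences in document order.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical frequency-based extractive summarizer.
public class ExtractiveSummarizer {

    static List<String> summarize(String pageText, int k) {
        String[] sentences = pageText.split("(?<=[.!?])\\s+");

        // Count word frequencies over the whole page; very short tokens
        // are skipped as a crude stand-in for stopword filtering.
        Map<String, Integer> freq = new HashMap<>();
        for (String tok : pageText.toLowerCase().split("\\W+")) {
            if (tok.length() > 3) {
                freq.merge(tok, 1, Integer::sum);
            }
        }

        // Score each sentence by the average frequency of its words.
        double[] scores = new double[sentences.length];
        for (int i = 0; i < sentences.length; i++) {
            String[] toks = sentences[i].toLowerCase().split("\\W+");
            double sum = 0;
            for (String t : toks) {
                sum += freq.getOrDefault(t, 0);
            }
            scores[i] = toks.length == 0 ? 0 : sum / toks.length;
        }

        // Pick the k best sentences, then restore their document order.
        Integer[] order = new Integer[sentences.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (x, y) -> Double.compare(scores[y], scores[x]));
        List<Integer> best = new ArrayList<>(
                Arrays.asList(order).subList(0, Math.min(k, order.length)));
        Collections.sort(best);

        List<String> summary = new ArrayList<>();
        for (int idx : best) {
            summary.add(sentences[idx]);
        }
        return summary;
    }
}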

Highlighting Keyphrases Another approach to assist users in better grasping the idea of a wiki page at a glance is to highlight important keyphrases (see Figure 1). As Tucker and Whittaker (2009) have shown, highlighting important phrases assists users in reading longer texts and yields faster understanding. Wikulu thus improves readability by employing automatic keyphrase extraction algorithms. Additionally, Wikulu allows the number of keyphrases shown to be adjusted dynamically by presenting a slider to the user. We integrated keyphrase extraction methods such as TextRank (Mihalcea and Tarau, 2004) and KEA (Witten et al., 1999).

3 Further Use Cases

Further use cases for supporting wiki users include (i) visually analyzing the results of NLP algorithms, (ii) educational purposes, and (iii) enabling semantic wikis.

Visually Analyzing the Results of NLP Algorithms Wikulu facilitates analyzing the results of NLP algorithms by using wiki pages as input documents and visualizing the results directly on the page. Consider an NLP algorithm which performs sentiment analysis. Typically, one would put the sentences to be analyzed in a text file, launch the NLP application, process the file, and read the output from either a built-in console or a separate output file. This procedure suffers from two major drawbacks: (a) it is inconvenient to copy existing data into a custom input format which can be fed into the NLP system, and (b) the textual output does not allow presenting the results in a visually rich manner.

Wikulu tackles both challenges by using wiki pages as input/output documents. For instance, by running the sentiment analysis component right from within the wiki, its output can be written back to the originating wiki page, resulting in visually rich, possibly interactive presentations.

Educational Purposes Wikulu is a handy tool for educational purposes as it allows (a) rapidly creating test data in a collaborative manner (see Section 2), and (b) visualizing the results of NLP algorithms, as described above. Students can gather hands-on experience by experimenting with NLP components in an easy-to-use wiki system. They can both collaboratively edit input documents and explore the results of e.g. different configurations of NLP components. In our system prototype, we integrated highlighting of parts of speech which have been determined by a POS tagger.

Enabling Semantic Wikis Semantic wikis such as Semantic MediaWiki (Krötzsch et al., 2006) augment standard wikis with machine-readable semantic annotations of pages and links. As those annotations have to be entered manually, this step is often skipped by users, which severely limits the usefulness of semantic wikis. Wikulu could support users e.g. by automatically suggesting the type of a link by means of relation detection, or the type of a page by means of text categorization. Thus, Wikulu could constitute an important step towards the semantification of the content contained in wikis.

4 System Architecture

In this section, we detail our system architecture and describe what is necessary to make NLP algorithms available through our system. We also give a walk-through of Wikulu's information flow.

4.1 Core Components

Wikulu builds upon a modular architecture, as depicted in Figure 5. It acts as an HTTP proxy server which intercepts the communication between the web browser and the target wiki engine, and it allows any Apache UIMA-compliant NLP component to be run via an extensible plugin mechanism.

In the remainder of this section, we introduce each module: (a) the proxy server, which allows Wikulu to be added to any target wiki engine, (b) the JavaScript injection, which bridges the gap between the client- and server-side code, (c) the plugin manager, which gives access to any Apache UIMA-based NLP component, and (d) the wiki abstraction layer, which offers a high-level interface to typical wiki operations such as reading and writing wiki content.

Proxy Server Wikulu is designed to work with any underlying wiki engine such as MediaWiki or TWiki. Consequently, we implemented it as an HTTP proxy server, which allows it to be enabled at any time by changing the proxy settings of a user's web browser.[9] The proxy server intercepts all requests between the user, who interacts with her web browser, and the underlying wiki engine. For example, Wikulu passes certain requests to its language processing components, or augments the default wiki toolbar with additional commands. We elaborate on the latter in the following paragraph.

JavaScript Injection Wikulu modifies the requests between the web browser and the target wiki by injecting custom client-side JavaScript code. Wikulu is thus capable of altering the default behavior of the wiki engine, e.g. replacing keyword-based retrieval by enhanced search methods (cf. Section 2), adding novel behavior such as additional toolbar buttons or advanced input fields, or augmenting the originating web page after a certain request has been processed, e.g. after an NLP algorithm has been run.
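The following sketch illustrates the proxy-plus-injection principle using the JDK's built-in HTTP server; Wikulu's actual proxy is more elaborate (it forwards headers, handles POST requests, and so on). The host names, the port, and the script URL are placeholders.

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical minimal proxy: forwards GET requests to the wiki and
// injects a client-side script into every HTML response.
public class WikuluProxySketch {

    static final String WIKI_BASE = "http://wiki.example.org"; // placeholder target wiki
    static final String INJECTED =
            "<script src=\"http://localhost:8081/wikulu.js\"></script></body>";

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", WikuluProxySketch::handle);
        server.start();
    }

    static void handle(HttpExchange exchange) throws IOException {
        // Forward the request path (and query) to the underlying wiki engine.
        URL target = new URL(WIKI_BASE + exchange.getRequestURI());
        try (InputStream in = target.openStream()) {
            String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            // JavaScript injection: augment the page before it reaches the browser.
            body = body.replace("</body>", INJECTED);
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, bytes.length);
            exchange.getResponseBody().write(bytes);
        }
        exchange.close();
    }
}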

Plugin Manager Wikulu does not perform language processing itself. It relies on Apache UIMA-compliant NLP components which use wiki pages (or parts thereof) as input texts. Wikulu offers a sophisticated plugin manager which takes care of dynamically loading those NLP components. The plugin loader is designed to run plugins either every time a wiki page loads, or manually, by picking them from the augmented wiki toolbar.

The NLP components are available as server-side Java classes. Via direct web remoting,[10] those components are made accessible through a JavaScript proxy object. Wikulu offers a generic language processing plugin which takes the current page contents as input text, runs an NLP component, and writes its output back to the wiki. To run a custom Apache UIMA-compliant NLP component with Wikulu, one just needs to plug that particular NLP component into the generic plugin. No further adaptations to the generic plugin are necessary. However, more advanced users may create fully customized plugins.

[9] The process of enabling a custom proxy server can be simplified by using web browser extensions such as Multiproxy Switch (https://addons.mozilla.org/de/firefox/addon/multiproxy-switch).
[10] http://directwebremoting.org

Figure 5: Wikulu acts as a proxy server which intercepts the communication between the web browser and the underlying wiki engine. Its plugin manager allows the integration of any Apache UIMA-compliant NLP component.
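The following sketch shows, under our own naming, roughly what such a generic plugin has to do on the server side: instantiate an arbitrary analysis engine from its UIMA XML descriptor, feed it the page text, and collect the resulting annotation offsets. The descriptor file name and the example text are placeholders.

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

// Hypothetical sketch of a generic wrapper around any UIMA component.
public class GenericPluginSketch {

    public static void main(String[] args) throws Exception {
        // Load an arbitrary analysis engine from its XML descriptor.
        ResourceSpecifier spec = UIMAFramework.getXMLParser()
                .parseResourceSpecifier(new XMLInputSource("KeyphraseAnnotator.xml"));
        AnalysisEngine engine = UIMAFramework.produceAnalysisEngine(spec);

        // Feed it the current page text and run it.
        JCas jcas = engine.newJCas();
        jcas.setDocumentText("The wiki page contents would be inserted here.");
        engine.process(jcas);

        // The annotation offsets would be handed to the JavaScript side for
        // highlighting. Note that UIMA's built-in index also contains the
        // document annotation spanning the whole text.
        for (Annotation ann : jcas.getAnnotationIndex()) {
            System.out.println(ann.getBegin() + "-" + ann.getEnd()
                    + ": " + ann.getCoveredText());
        }
    }
}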

Wiki Abstraction Layer Wikulu communicates with the underlying wiki engine via an abstraction layer. That layer provides a generic interface for accessing and manipulating the underlying wiki engine. Thereby, Wikulu can be tightly coupled to a certain wiki engine such as MediaWiki or TWiki, while remaining flexible enough to adapt to a changing environment. New adaptors for other target wiki engines such as Confluence[11] can be added with minimal effort.
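A minimal sketch of the kind of interface such an abstraction layer might expose is shown below; the interface and method names are our own illustration, not Wikulu's actual API.

import java.util.List;

// Hypothetical abstraction over a wiki engine. A MediaWiki adaptor would
// implement these calls against the MediaWiki HTTP API, a TWiki adaptor
// against TWiki's scripts.
public interface WikiAdaptor {

    /** Reads the wiki markup of the given page. */
    String getPageContent(String title);

    /** Writes new markup back to the given page. */
    void setPageContent(String title, String content);

    /** Lists all page titles, e.g. for duplicate detection or link suggestion. */
    List<String> getAllPageTitles();
}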

4.2 Walk-Through Example

Let us assume that a user encounters a wiki page which is rather lengthy. She realizes that Wikulu's keyphrase extraction component might help her to better grasp the idea of this page at a glance, so she activates Wikulu by setting her web browser to pass all requests through the proxy server. After applying the settings, the JavaScript injection module adds additional links to the wiki's toolbar on the originating wiki page. Having decided to apply keyphrase extraction, she then invokes that NLP component by clicking the corresponding link (see Figure 6). Before the request is passed to that component, Wikulu extracts the wiki page contents using the high-level wiki abstraction layer. Thereafter, the request is passed via direct web remoting to the NLP component which has been loaded by Wikulu's plugin mechanism. After processing the request, the extracted keyphrases are returned to Wikulu's custom JavaScript handlers and finally highlighted in the originating wiki page.

[11] http://www.atlassian.com/software/confluence

Figure 6: Illustration of Wikulu's information flow when a user has requested to highlight keyphrases on the current page, as described in Section 4.2.

5 Related Work

Supporting wiki users with NLP techniques has not attracted a lot of research attention yet. A notable exception is the work by Witte and Gitzinger (2007). They propose an architecture to connect wikis to services providing NLP functionality which are based on the General Architecture for Text Engineering (Cunningham et al., 2002). Contrary to Wikulu, though, their system does not integrate transparently with an underlying wiki engine, but rather uses a separate application to apply NLP techniques. Thereby, wiki users can leverage the power of NLP algorithms, but need to interrupt their current workflow to switch to a different application.


Moreover, their system is only loosely coupled with the underlying wiki engine. While it allows reading and writing existing pages, it does not allow further modifications such as adding user interface controls.

A lot of work in the wiki community is done in the context of Wikipedia. For example, the FastestFox[12] plug-in for Wikipedia is able to suggest links to related articles. However, unlike Wikulu, FastestFox is tailored towards Wikipedia and cannot be used with any other wiki platform.

[12] http://smarterfox.com

6 Summary

We presented Wikulu, an extensible system which integrates natural language processing techniques with wikis. Wikulu addresses the major challenge of supporting wiki users with their everyday tasks. Besides that, we demonstrated how Wikulu serves as a flexible environment for (a) visually analyzing the results of NLP algorithms, (b) educational purposes, and (c) enabling semantic wikis. With its modular and flexible architecture, we envision that Wikulu can support wiki users both in small, focused environments and in large-scale communities such as Wikipedia.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, and by the Klaus Tschira Foundation under project No. 00.133.2008. We would like to thank Johannes Hoffart for designing and implementing the foundations of this work, as well as Artem Vovk and Carolin Deeg for their contributions.

References

Michel Buffa. 2006. Intranet Wikis. In Proceedings of the IntraWebs Workshop at the 15th International Conference on World Wide Web.

Freddy Y. Y. Choi. 2000. Advances in domain independent linear text segmentation. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, pages 26–33.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 168–175.

Güneş Erkan and Dragomir Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research, 22:457–479.

David Ferrucci and Adam Lally. 2004. UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering, pages 1–26.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611.

Shlomo Geva. 2007. GPX: Ad-Hoc Queries and Automated Link Discovery in the Wikipedia. In Preproceedings of the INEX Workshop, pages 404–416.

Iryna Gurevych, Christof Müller, and Torsten Zesch. 2007. What to be? Electronic Career Guidance Based on Semantic Relatedness. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 1032–1039.

Marti A. Hearst. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64.

Kelly Y. Itakura and Charles L. A. Clarke. 2007. University of Waterloo at INEX 2007: Adhoc and Link-the-Wiki Tracks. In INEX 2007 Workshop Preproceedings, pages 417–425.

Markus Krötzsch, Denny Vrandečić, and Max Völkel. 2006. Semantic MediaWiki. In Proceedings of the 5th International Semantic Web Conference, pages 935–942.

Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. 1998. An introduction to Latent Semantic Analysis. Discourse Processes, 25(2):259–284.

Bo Leuf and Ward Cunningham. 2001. The Wiki Way: Collaboration and Sharing on the Internet. Addison-Wesley Professional.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 404–411.

Simon Tucker and Steve Whittaker. 2009. Have A Say Over What You See: Evaluating Interactive Compression Techniques. In Proceedings of the International Conference on Intelligent User Interfaces, pages 37–46.

René Witte and Thomas Gitzinger. 2007. Connecting wikis and natural language processing systems. In Proceedings of the International Symposium on Wikis, pages 165–176.

Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 1999. KEA: Practical automatic keyphrase extraction. In Proceedings of the 4th ACM Conference on Digital Libraries, pages 254–255.
