© SAT 2005 ASGAARD-TR-2005-11 blog mining in a corporate environment Andreas Aschenbrenner, Silvia Miksch September 2005


Page 1: blog mining in a corporate environment - TU Wienieg.ifs.tuwien.ac.at/techreports/Asgaard-TR-2005-11.pdf · Apart from this, blogging also penetrates corporate information and communication

© SAT 2005 ASGAARD-TR-2005-11

blog mining in a corporate environment

Andreas Aschenbrenner, Silvia Miksch

September 2005


executive summary

Weblogs are a young and dynamic media type. The global blogosphere displays increasing influence, ongoing expansion and continuous cultural change. Blog mining techniques are the means for exploring this vibrant ecosystem. They stem from diverse origins including social network analysis, economic research, and network theory, and they are the basis for a myriad of services offered by ever more popular blog analysis engines. However, such service providers release little technical information in order to retain their competitive edge. ‘Reputation management’ is one particular service currently under development by several companies. It promises the analysis and active management of public opinion on brand names and products through the targeted placement of information nuggets in the blogosphere. Increasingly, blogs are also employed in corporate environments for communication, organisational learning, and other purposes. The nature of corporate blogs and overall information environments remains to be fully explored, and the applicability of blog mining techniques in such localised ecosystems remains to be tested. However, a variety of exciting tools are conceivable for analysing the state of the blogging environment, for extracting and synthesizing information, and for actively shaping communication patterns. All in all, the computational analysis of blogs and overall information environments is an exciting research area, and some commitment in this area may yield innovative techniques and effective tools.

table of contents

executive summary

table of contents

introduction

characterising the blogosphere

fundamental blog mining

blog mining out in the wild

nuts and bolts of blog mining

in a corporate context

synthesis

references


introduction

Blogs have gained a high profile in the last couple of years. They are well known to anybody who is comfortable with the web. Beyond the web, they are referenced and discussed in traditional media as well. Blogs have grown into a permanent phenomenon that has become a significant component of the global information and communication infrastructure.

With the massive growth of the global blogging infrastructure several meta-services emerged, which offer facilities such as searching for blog postings. Searching is in fact one of the simpler services; the more exciting ones are described in detail below. Apart from this, blogging also penetrates corporate information and communication infrastructures. This trend is rather new, and significant developments in this area are hard to project. Besides making an environmental scan of all these issues, this report contemplates the transfer of mining techniques into corporate blogging platforms. It aims to identify the opportunities and risks of such a move.

characterising the blogosphere

Viewing merely the technical intricacies of blogs is insufficient to capture the overall phenomenon. Technology and culture always develop in tandem, and this report hence takes a broad perspective that includes both. This chapter aims to draw a broad picture of the global blogging environment, the public blogosphere. It discusses what blogs are, gives a historical synopsis and illustrates the blogosphere’s path from an isolated phenomenon to a massive and vibrant ecosystem.

what actually are blogs?

Blogs - short for weblogs - are a diary-format communication channel on the web. They are mostly maintained by individuals, though group blogs are increasingly popular as well. In the blog’s dated entries, the maintainers publish personal or professional information items, references to resources, or whatever comes to their minds. Blogs may be dedicated to just about any topic, and there is a myriad of blog types and writing styles. The two characteristics that unite the blogging infrastructures are writing as a collection of small entries and an informal personal voice. [efi04a]

So much for a sketch of what blogs are. The subsequent sections present a somewhat clearer picture of the nature of blogs, their impact on society and current trends. First, let’s look at some of the buzzwords associated with blogs. As mentioned, blogs are a collection of dated entries, or postings. Permalinks provide a permanent anchor to a specific entry for reference. Usually, blogs allow comments to be made on each entry. In case another blogger prefers to make a lengthier comment in her own blog, she can use the trackback feature [trb], which is basically an automatic reference from the blog she comments on to her own blog. Often bloggers publish a list of other blogs on their radar, called the blogroll.

With more and more blogs emerging it becomes increasingly difficult to keep track of everything that’s going on. RSS (Really Simple Syndication) [rss] or other blog feed formats such as Atom allow the automatic syndication of numerous blogs. RSS software aggregates all postings of the blogs a user is subscribed to into headline update messages, which link directly to the originating blog postings. The user is hence able to keep track of all her favourite blogs, and to select only those messages of interest for reading.

So much for a brief journey through the best-known blog gimmicks. Blogs are very young and new gimmicks are being added continuously. For example, tagging allows the annotation of postings with keywords. Several blogging platforms already offer tagging extensions, but tagging is only just catching on. The possibility to include photos in postings has already become routine. Blogging from mobile devices, on the other hand, is at a rather early phase of adoption, as is the posting of audio blogs or podcasts. No doubt, more exciting features are still to come.
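The aggregation step that RSS software performs, as described above, can be illustrated with a minimal sketch. It assumes RSS 2.0 documents in which every item carries a `title`, a `link` and a `pubDate`; the two sample feeds, the blog names and the example.org URLs are invented for illustration, and real aggregators additionally have to fetch feeds over HTTP and cope with Atom and malformed XML.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Two hypothetical RSS 2.0 feeds (invented sample data, not real blogs).
FEED_A = """<rss version="2.0"><channel>
<title>Alice's blog</title>
<item><title>On tagging</title><link>http://alice.example.org/tagging</link>
<pubDate>Mon, 05 Sep 2005 10:00:00 +0000</pubDate></item>
<item><title>Hello world</title><link>http://alice.example.org/hello</link>
<pubDate>Thu, 01 Sep 2005 08:30:00 +0000</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel>
<title>Bob's k-log</title>
<item><title>Project diary, week 36</title><link>http://bob.example.org/w36</link>
<pubDate>Sat, 03 Sep 2005 12:00:00 +0000</pubDate></item>
</channel></rss>"""


def parse_feed(xml_text):
    """Return (feed_title, [(published, item_title, link), ...]) for one feed."""
    channel = ET.fromstring(xml_text).find("channel")
    feed_title = channel.findtext("title", default="")
    items = []
    for item in channel.findall("item"):
        # Assumption: every item has a pubDate in the RFC 822 style used by RSS.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        items.append((published,
                      item.findtext("title", default=""),
                      item.findtext("link", default="")))
    return feed_title, items


def aggregate(feeds):
    """Merge the items of all subscribed feeds into one headline list, newest first."""
    headlines = []
    for xml_text in feeds:
        feed_title, items = parse_feed(xml_text)
        for published, title, link in items:
            headlines.append((published, feed_title, title, link))
    return sorted(headlines, key=lambda h: h[0], reverse=True)


headlines = aggregate([FEED_A, FEED_B])
for published, feed_title, title, link in headlines:
    print(published.date(), feed_title, "-", title, link)
```

Each headline keeps the link to the originating posting, which is what lets the reader jump from the aggregated view back to the blog itself.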

history of blogs

Blogs emerged during the late 1990s, tentatively at first. The term was coined by Jorn Barger¹ in December 1997. Together with him a handful of like-minded web enthusiasts started their blogs and formed a small community. Early blogs were mostly link-driven sites. Bloggers assembled links to newspaper articles, scientific papers, or other online resources. Usually they provided a brief comment with the link, indicating in which way the resource was noteworthy: great, nice, wrong, or manipulative. In other words, these early blogs were predominantly filters, edited along the personal perspective of the blogger. [blo00]

After a period of moderate growth, the blogosphere soared in 1999 with the release of the first easy-to-use blog gateways including Pitas² and Blogger³. This opened blogs to a larger audience. Rapid expansion led to the scattering of what used to be a compact community, and it also led to more variation in form. In addition to the filter-style blog, a journal-like blog emerged, for instance - a collection of thoughts and very personal stories which were not necessarily tied to other online resources. While the sheer number of available blogs made it impossible to keep track of all of them, sub-groups of bloggers started to link between their blogs, to pick up each other’s trains of thought, and to lead full-blown conversations across successive postings, making the blogosphere ever more interactive and vibrant.

Two real-world events during 2001 further nurtured the enormous growth of the blogosphere. Both the September 11th terrorist attacks on the World Trade Center in New York and the U.S. invasion of Afghanistan triggered a broad virtual response. A multitude of bloggers posted their opinions on recent developments, and some even devoted their entire blogs to the war on terrorism. This trend continued in the coverage of the war in Iraq and other more recent developments.

Another phenomenon further adds to this politically oriented blogging. Since anybody with Internet access is able to start her own blog, blogging also spread to the battle zones and crisis areas of this planet. People who are in touch with, or directly involved in, developments of global interest are thereby in a position to post their personal perspectives on the events. In late December 2004, victims of the tsunami in south-east Asia reported almost instantaneously, before most national news agencies picked up on the event.

¹ Jorn Barger’s blog, http://www.robotwisdom.com/
² Pitas, http://www.pitas.com/
³ Blogger, http://www.blogger.com/

Beyond real-time reporting, the blogosphere incorporates a wealth of different viewpoints and expertise. Besides balanced and informed discussion, this enabled the disclosure of recent scandals and general inconsistencies in political reporting by traditional media. [dre04] Blogging has thereby become an important component of the media landscape. Today, the opinions fermenting in the blogosphere reflect public opinion and, indeed, the blogosphere has the power to influence the political landscape.

the blogosphere today - a diverse and vibrant ecosystem

Estimates of the size of the blogosphere vary greatly. By August 2005, large blog search engines including BlogPulse⁴ and Technorati⁵ were watching more than 15 million blogs each, while some estimates put the size of the global blogosphere at more than 70 million blogs [ril05].⁶ At the same time, the blogosphere is extremely dynamic, with many tens of thousands of blogs created each day. While a great number of discontinued blogs and novel occurrences like ‘spam blogs’ somewhat distort the numbers, Technorati estimates that the blogosphere doubles in size about once every five months [sif05]. With regard to the connectedness of this huge and quickly growing space, there is some disagreement in the community. Despite the common assumption and some scientific indications that the blogosphere is densely connected, other studies contradict this notion [her05]. More research based on common concepts and accepted reference points is hence needed to establish a clear characterisation of the blogosphere.

Beyond the mere numbers, we have already seen the rapid development of the blogosphere in the historical synopsis of the previous section. Because of the fundamental evolutionary changes it underwent in its relatively brief history, a clear characterisation of the blogosphere remains elusive. A genre analysis of the blogosphere [her04] shows that more than two thirds of public blogs are personal journals. Only 12 percent continue to be ‘filters’, the link-driven blog type that used to be predominant in the early stages of the blogosphere. In the same statistics, knowledge blogs (k-logs) used for collaborative knowledge management and topical discussion account for a mere 3 percent of the public blogosphere. However, k-logs can be expected to be more prevalent in intranet environments.

Other genre analyses classify individual blogs according to their origin (individual - community, personal - topical) [kri02], or their content [wik05]. Overall, however, these and other attempts to impose structure on the blogosphere remain but partial snapshots of the complexity of the dynamic ecosystem. A notable message from all these analyses is the fact that besides a technological and editorial perspective on blogs, the socio-cultural dimension of the whole blogosphere is at least as important. These three aspects affect each other and develop in parallel.

⁴ BlogPulse, www.blogpulse.com
⁵ Technorati, www.technorati.com
⁶ It would be exciting to probe the applicability of Bharat and Broder’s approach for measuring the size of the web for the blogosphere [bha98] - yet another of the many opportunities to transfer knowledge from other scientific areas to blog research.

trends for the future

While projections of the blogosphere’s future are hardly possible in this complex environment, some general trends indicate where we are heading. In tandem with the ongoing surge in blog numbers, the editorial style of blog postings continues to be very heterogeneous. There is no particular convergence towards a few atomic blog genres. On the contrary, new blog forms continue to emerge. On the technology front, new features are being added iteratively and become part of the blog infrastructure. Trackbacks and RSS⁷ are just two of these features. However, the most drastic changes occur where new media and formats are drawn into the blog universe. In a recent wave, moblogs⁸, photoblogs⁹, and podcasting (i.e. audioblogs) extended the blogosphere and further blurred its borders. In the bigger picture, these are just derivatives of the overall social software revolution. Blogs are increasingly embedded in an infrastructure of social tagging, social bookmarking¹⁰, picture sharing¹¹, and other emerging applications of the social software movement.

The bloggers themselves stem from all walks of life, and blogging emerges from the grass roots as well as top-down. Above, the historical synopsis already outlined the trend of real-time blogging directly from the place where things happen, be it war zones or natural disasters such as the tsunami in south-east Asia. As awareness of the influence of blogs is rising, some of the more famous bloggers are being encouraged to report on specific events. For example, political parties invite bloggers to events [con04] and software companies consult them for the evaluation of new software [boy05]. These events serve not only to take account of the opinions of bloggers, but also to subliminally influence them or to explicitly silence them (for example by non-disclosure agreements) [eve05]. Grass-roots bloggers are thereby incorporated in the public relations strategy of large organizations. Instead of attempting to influence independent bloggers, some organizations employ professional bloggers to produce infomercials. The effectiveness of these attempts to direct opinion in the blogosphere remains to be seen.

Beyond embedding blogs in their public relations, organizations are increasingly aware of the influence and power of blogs. [shi05a] "Today most every F500 company is looking into blogging, particularly brand centric companies [...]" [may05] Organisations increasingly employ blogs or other social software in their information and communication strategies. Some do so in order to exert some level of control over what would happen anyway, others as a genuine attempt to enhance internal information infrastructure [del05, sne05]. The latter still bears a lot of opportunity - a promising trend indeed.

New editorial styles, new technologies, the take-up of blogging by the general public and the emerging buy-in of companies and organizations - all these trends emphasize the dynamic blog landscape, and the rising importance of social software in general. However, they do not quite tell where the future will take us.

⁷ See the section ‘what actually are blogs?’ above.
⁸ moblogs - short for ‘mobile blogs’ - integrate mobile devices for real-time blogging from wherever the blogger is located
⁹ photoblogs - photos may be embedded in normal blogs, or a blog may consist of photos only
¹⁰ For example, the social bookmarking service Delicious, http://del.icio.us/
¹¹ For example, the picture sharing platform Flickr, http://www.flickr.com/

fundamental blog mining

While blogs are a relatively recent phenomenon, research about blogs has always been part of a network of research efforts and builds on the experiences of various adjacent scientific areas. These include data mining, social network analysis, game theory, economic research, and network theory.

Network theory is a particularly active field, which underwent major changes in the late 1990s. [bar03] Researchers at that time found many properties of the public internet incompatible with traditional network theory. Complex networks were formerly considered to be completely random. The link structure of the web, however, contains a considerable number of nodes to which an enormously high number of links point. These nodes are called ‘hubs’ and they reflect a property of the web called ‘scale-free’, which basically means that already popular web pages tend to become even more popular. The properties that allow the web to be scale-free in the first place are its growing and dynamic nature.

Clay Shirky was one of the first to point out that the blogosphere also displays properties of a scale-free network. [shi03] His article explains that the small number of top-ranked - so-called A-list - bloggers are just those who happen to be at the top of the typical power law curve of a scale-free network. The article came as part of a discussion about fairness in the blogosphere and offered a scientific explanation (from natural science rather than sociology and human values) for why not all blogs can receive an equal share of the community’s attention. [mcg04] found that only 20 blogs receive over 10 million pageviews per month. Further down the slope of the power law curve, there are only 200 blogs with a traffic of 1 million pageviews per month. The vast majority of blogs receive far less attention.

Scale-free networks have been detected in a variety of other scientific areas [Alb02]. They include protein interaction maps in cells, social relationships, research collaborations and global trade networks. As they share a common mathematical basis, findings can be transferred between these diverse fields. So, for example, techniques from research on protein interactions and from epidemiology can be applied in blogosphere research, which offers exciting opportunities (e.g. [gru04]). In one of these transfers between scientific areas, researchers of the HP Information Dynamics Lab were inspired by epidemiological research when they investigated information propagation through a blog network [ada05]. In particular, the techniques developed for the analysis of an outbreak of foot-and-mouth disease [hay01] allowed them to infer infection trees in the blogosphere. The researchers hence investigated the propagation of information memes in the blogosphere: who infects whom with a particular topic, and the timing of the infection.
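To make the notion of an infection tree more concrete, the following sketch infers "who infected whom" from two inputs only: the time at which each blog first mentioned a meme, and the links each blog carries. It is a deliberately naive heuristic chosen for illustration and not the method of [ada05] or [hay01]; the rule (pick the linked blog that mentioned the meme most recently before the infected blog did) and all blog names are assumptions.

```python
def infer_infection_tree(first_mention, links):
    """Guess, for every blog that mentioned a meme, which blog it caught it from.

    first_mention: dict blog -> time of the blog's first mention (any comparable value)
    links:         dict blog -> set of blogs it links to
    Returns dict blog -> inferred source, or None when no linked blog mentioned
    the meme earlier (a possible origin, or an infection from outside the data).
    """
    tree = {}
    for blog, when in first_mention.items():
        candidates = [
            source for source in links.get(blog, ())
            if source in first_mention and first_mention[source] < when
        ]
        # Assumed heuristic: the most recent earlier mention among linked blogs
        # is the most plausible source; ties are broken by name for determinism.
        tree[blog] = (
            max(candidates, key=lambda s: (first_mention[s], s))
            if candidates else None
        )
    return tree


# Hypothetical data: days on which four blogs first mentioned the same topic.
first_mention = {"a": 1, "b": 2, "c": 3, "d": 5}
links = {"b": {"a"}, "c": {"a", "b"}, "d": {"c", "e"}}

tree = infer_infection_tree(first_mention, links)
print(tree)
```

In this toy data set blog `a` comes out as the origin, and the meme travels along the chain a, b, c, d. Blog `e` is ignored as a source because it never mentioned the meme.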


While the HP research was conducted on crawled data from May 2003, Samuel Arbesman - a student in computational biology - monitored a meme as it spread through the blogosphere in his “Memespread” project [arb04, wir04]. To achieve this he posted a message on a popular blog with a description of his project and a link to his homepage. He then logged all access attempts to his homepage and recorded the blogs from which they originated.

Both the HP research and the Memespread project consider the route a meme takes through the blogosphere to be essential. Some blogs serve as connectors: their contagiousness and large fan-out to other blogs help a meme to pick up critical mass quickly. A message stagnates if it is unable to pick up critical mass via connector blogs. Such a ‘tipping point’ has been observed in various contexts [hil02, gla00]. However, more research is needed for predicting or even manipulating the complex and dynamic processes of information propagation.

Similar work has been conducted by researchers of the IBM Almaden Research Center. Their investigation of information diffusion in the blogosphere builds on various concepts and techniques, including epidemiology, the diffusion of innovation in social networks, and game theory [gru04]. By tracking topics as well as individuals they distinguished between three kinds of conversation: chatter, which is discussed continuously; just spike, which vanishes as quickly as it emerged; and spiky chatter, which has a significant chatter level and reacts strongly to external events. It is possible to extract sub-topics from spiky chatter, and together with just-spike conversations and ongoing chatter they establish the topics discussed in the blogosphere and their dynamics. Individuals post on different topics and at different times in the life-cycle of a topic. This distinction, in tandem with the identification of their position in the social web of the blogosphere, defines their influence on information propagation. Connectors with a huge fan-out to other bloggers are critical linkages for making a message cross the tipping point and become truly widespread.

IBM researchers also discovered a tipping point on a totally different scale. They surveyed the evolution of the blogosphere in the years 1999 to 2002 [kum03]. Around that time the growth of the blogosphere picked up speed. By way of community extraction and burst analysis the researchers detected a tipping point in 2002, when local communities mushroomed and the blogosphere assumed the properties of a scale-free network, with huge strongly connected components and the typical power law curve. This development was bound to happen as the blogosphere went beyond a certain size. Moreover, the assumption of the properties of a scale-free network was essential for making the blogosphere what it is today.
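The "already popular nodes become even more popular" mechanism behind scale-free networks can be simulated in a few lines. The sketch below grows a network by preferential attachment, with each new node linking to one existing node chosen with probability proportional to that node's current number of links. This is a textbook toy model used here as an assumption for illustration, not a model fitted to the blogosphere or taken from [kum03]; the network size and random seed are arbitrary.

```python
import random
from collections import Counter


def grow_network(n_nodes, seed=0):
    """Grow a network by preferential attachment and return node -> degree."""
    rng = random.Random(seed)
    # 'endpoints' holds one entry per link end, so a uniform pick from it
    # selects a node with probability proportional to its current degree.
    endpoints = [0, 1]  # start with a single link between node 0 and node 1
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        endpoints.extend([new_node, target])
    return Counter(endpoints)


degrees = grow_network(2000)
histogram = Counter(degrees.values())  # degree -> number of nodes with that degree

print("largest hubs:", degrees.most_common(3))
for degree in sorted(histogram)[:5]:
    print(degree, "link(s):", histogram[degree], "nodes")
```

Most nodes end up with a single link while a few early nodes collect many, which is the heavy-tailed pattern that the A-list discussion above refers to.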

blog mining out in the wild

With the growth of the blogosphere from a handful of tightly connected bloggers to an ever growing community of tens of millions, dedicated means for searching and navigating the blogosphere became increasingly popular. The blog search engine market is still very dynamic, with a variety of players offering a myriad of different services. The domain is still in flux as novel services continue to emerge and new engines supersede others. In this dynamic market, it appears, competition forestalls open communication about and exchange of techniques. Only a few technical papers by authors associated with a particular blog search engine can be found (e.g. [gla04]). Indeed, the handful of such publications, which can be traced with some dedication, remain on a rather superficial level.

A scan of the blog search engine environment shows that their facilities for searching the blogosphere vary greatly in efficiency. This is due to the quality of the engine’s database, but even more so to the techniques it builds on. Unfortunately there is hardly any documentation about the particular techniques to be found. Apart from searching, the additional services are the actual attraction of blog sites. Any survey is bound to lead to Blogdex¹². Cameron Marlow, an MIT student, is a pioneer in the field of applying social network analysis to the blogosphere, and created Blogdex as one of the first blog search engines [mar04]. Blogdex dubs itself the “weblog diffusion index”. It presents “contagious information currently spreading in the weblog community”, which is basically a round-up of the most often referenced blog messages and external links (e.g. to websites, news articles). While Blogdex focuses on links in blogs, Daypop¹³ works on the actual text, or rather a combination of links and text. As a notable service it offers ‘Word Bursts’, which lists words that were used particularly often in the last couple of days. Similar to Blogdex’s ‘contagious information’, Daypop offers various rankings, including Top News, Top Posts, and Top Weblogs.

Technorati¹⁴ offers similar ranking services on blogs, news, books, and movies. Of course, the data for these services stems from the blogosphere, so the ‘top news’ refers to the most-discussed news items in the blogosphere. Beyond this, Technorati taps another source, tags, and interweaves them with blog data. Tags are a sort of keyword or category name, and they are increasingly popular in the broad field of social software [shi05b]. Technorati combines tag-enhanced blogs with tags from other social software including photo sharing communities and social bookmarking.
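Since Daypop does not document how ‘Word Bursts’ is computed, the following is only one plausible way to list words used particularly often in recent postings. It compares the smoothed relative frequency of each word in recent postings with its frequency in an older baseline sample. The scoring formula, the tokenisation and the sample postings are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_counts(posts):
    """Count lower-cased words over a list of posting texts."""
    return Counter(w for post in posts for w in re.findall(r"[a-z']+", post.lower()))


def word_bursts(recent_posts, baseline_posts, top=5, smoothing=1.0):
    """Rank words by how much more frequent they are now than in the baseline."""
    recent = word_counts(recent_posts)
    baseline = word_counts(baseline_posts)
    n_recent, n_baseline = sum(recent.values()), sum(baseline.values())
    vocabulary = len(set(recent) | set(baseline))

    def burst_ratio(word):
        # Add-one style smoothing keeps words unseen in the baseline finite.
        p_recent = (recent[word] + smoothing) / (n_recent + smoothing * vocabulary)
        p_baseline = (baseline[word] + smoothing) / (n_baseline + smoothing * vocabulary)
        return p_recent / p_baseline

    # Highest ratio first; alphabetical order breaks ties deterministically.
    return sorted(recent, key=lambda w: (-burst_ratio(w), w))[:top]


# Invented sample postings.
baseline_posts = ["the cat sat on the mat", "the dog ate the food"]
recent_posts = ["the hurricane hit the coast", "hurricane warning for the coast"]

bursts = word_bursts(recent_posts, baseline_posts, top=2)
print(bursts)
```

Frequent but uninformative words such as "the" score low because they are just as common in the baseline, so no stop-word list is needed for this toy example.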

BlogPulse¹⁵ stays within the blurry confines of the blogosphere. It provides the exciting services ‘Trend Search’ and ‘Conversation Tracker’. The latter extracts sequences of blogs and re-blogs into a single conversational trail. ‘Trend Search’ provides an overview of a topic as it develops over time. So, for example, the trend analysis in Figure 1, taken from the BlogPulse website, plots the topics “war on terrorism” and “hurricane Katrina” as they evolved from early June to late August. “War on terrorism” has a relatively high chatter level. Interestingly, the London bombings on July 7 generated only a minor peak, which may indicate a low percentage of European bloggers represented in the BlogPulse database. Hurricane Katrina hit New Orleans on August 29. The graph shows a rising curve when people became aware of the imminent hazard a few days before, a small dip in the curve during the impact, but overall a very steeply rising curve. The time graph is interactive - a click on any point in the curve yields a list of the relevant blog postings at that point in time.

Figure 1 - BlogPulse trend analysis

¹² Blogdex, www.blogdex.net
¹³ Daypop, www.daypop.com
¹⁴ Technorati, www.technorati.com
¹⁵ BlogPulse, www.blogpulse.com
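A trend curve of this kind boils down to a series of daily posting counts per topic, from which a baseline "chatter level" and individual peaks can be read off. The sketch below is a rough illustration under stated assumptions: the chatter level is taken to be the median daily count, and a spike is any day exceeding the mean by `k` standard deviations. Neither definition comes from BlogPulse or from [gru04], and the counts are invented.

```python
from statistics import mean, median, pstdev


def chatter_and_spikes(daily_counts, k=2.0):
    """Return (chatter_level, spike_days) for a series of daily posting counts.

    chatter_level: median daily count, a baseline that is robust against peaks
    spike_days:    indices of days whose count exceeds mean + k * std deviation
    """
    level = median(daily_counts)
    threshold = mean(daily_counts) + k * pstdev(daily_counts)
    spikes = [day for day, count in enumerate(daily_counts) if count > threshold]
    return level, spikes


# Invented counts of postings mentioning one topic on ten consecutive days.
counts = [5, 6, 5, 7, 6, 5, 40, 12, 6, 5]

level, spikes = chatter_and_spikes(counts)
print("chatter level:", level, "spike days:", spikes)
```

A topic with a high level and few spikes would correspond to steady chatter, whereas a low level with one pronounced spike would correspond to a short-lived burst of attention.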

Another exciting service is the ‘Visual Neighborhood’ of BlogStreet¹⁶. It provides a spider-web view of related blogs. For instance, Figure 2 presents the neighbourhood of Mathemagenic¹⁷, i.e. all blogs related to it. The graph is interactive and builds on TouchGraph¹⁸ techniques. Unfortunately, BlogStreet scans only slightly more than 100,000 blogs at the time of writing, so despite the ‘coolness’ factor the service is still rather limited.

There are - at this point in time, before the market has consolidated - lots of other blog search engines (at least more than 40, cf. [pap04]). This account presented some of the more popular amongst them, and highlighted their outstanding services. Unfortunately the engines are rather restrictive with technical details about their services. Yet, dipping into academic research in this area provides some insight into the relevant techniques.

nuts and bolts of blog mining

As mentioned previously, the services in blog search engines build on a variety of scientific fields including web mining, social network analysis, network theory and epidemiology, information diffusion in economic research, and others. While it is - due to lack of information - unclear on which techniques BlogPulse’s Conversation Tracker builds, there are various possible ways to achieve such a service. Adar et al. [ada01] describe how memes make their way through the blogosphere with techniques transferred from epidemiology. Their research, the Memespread project and similar activities were mentioned before¹⁹. However, tracking conversations by way of social network analysis, information diffusion theory in economic research, socio-technical argumentation analysis [moo04], and other scientific areas is equally conceivable.

¹⁶ BlogStreet, www.blogstreet.com
¹⁷ Mathemagenic, Lilia Efimova’s blog, www.mathemagenic.com
¹⁸ TouchGraph, www.touchgraph.com
¹⁹ See the section “fundamental blog mining”.

Figure 2 - BlogStreet's visual neighbourhood


[anj05] applies linguistic analysis to extract knowledge flows between blogs. The approach attempts to identify shared conceptualisations and constructs ontologies, which establish the knowledge space of all analysed blogs and position individual bloggers within this space. This is probably best explained with the illustration in Figure 3, which features conceptualisations extracted automatically by Anjo Anjewierden from a debate between George W. Bush and John F. Kerry during the US presidential campaign [anj04].

The techniques described above for information flow analysis focus on individual blog postings. They build on both the text of blogs and their link structure. Similar methods can be used to identify blogger communities, again through both the text of blogs and the overall link structure. Web mining is also a rich resource for extracting communities (e.g. [kum99], [fla03]). Other techniques build on graph theory ([ish04], [tse05]), or, in an entirely different approach, communities may be inferred from reading patterns by employing extraneous source data (e.g. blogrolls, RSS subscriptions) [efi05]. Shinsuke Nakajima et al. [nak05] attempt to identify the role of individual bloggers within the blogosphere. The roles they identify are agitators and summarizers, in contrast to [hil02], who distinguishes between experts, link mavens, and connectors.

Techniques for analysing blogs are still young, and the scientific field is hugely dynamic and unstable. Often there are multiple approaches to the same ends, yet authoritative cross-evaluations are still missing. Moreover, information about the methodical examination of individual techniques may be unavailable due to the rather restrictive information policy of blog search engines. However, as research progresses the community will hopefully converge towards common concepts and reference points to foster authoritative benchmarking. More exciting developments and findings can be expected as the blogosphere continues to evolve and to integrate with other social software, and as blog search engines grow to include non-English, or even non-North American blogs in their activities.
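As a simple illustration of community extraction from link structure alone, the sketch below treats two blogs as related when they link to each other and reports each connected group of such reciprocal links as a community. This is a simplification chosen for brevity, not the method of [kum99], [fla03], [ish04] or [tse05]; the reciprocity criterion, the minimum group size and the sample link data are assumptions.

```python
def communities(links, min_size=2):
    """Group blogs into communities connected by reciprocal links.

    links: dict blog -> set of blogs it links to (e.g. taken from blogrolls)
    Returns a list of sets, one per community with at least min_size members.
    """
    # Keep only mutual links: a links to b and b links back to a.
    mutual = {blog: set() for blog in links}
    for blog, targets in links.items():
        for target in targets:
            if blog in links.get(target, ()):
                mutual[blog].add(target)

    seen, result = set(), []
    for start in sorted(mutual):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(mutual[node] - component)
        seen |= component
        if len(component) >= min_size:
            result.append(component)
    return result


# Invented link data for six blogs.
links = {
    "a": {"b", "x"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d", "a"}, "x": set(),
}

groups = communities(links)
print(groups)
```

One-way links, such as the link from `e` to `a`, do not merge communities in this sketch. Text-based similarity or reading patterns could be added as further evidence, as the approaches cited above suggest.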

in a corporate context

According to Gartner [gar05], corporate blogging is heading down the hype cycle curve towards the trough of disillusionment. However, Gartner expects corporate blogging to quickly gain ground and to reach a state of productive application in less than two years.

Figure 3 - extracted conceptualisations

In more detail, ‘corporate blogging’ subsumes a variety of different contexts and objectives for which blogging facilities are being applied. There are various attempts to structure the field (e.g. [roe04], [zer05], [hen05]), and more effort is needed to establish a structure that is widely shared. Figure 4 displays a direct approach to mapping existing blog types, which is a variation and extension of Ansgar Zerfaß’ chart with a focus on internal communication ([zer05], [hen05]).

Figure 4 - continuum of internal blog types

The graph draws up a continuum of objectives from information to discussion to coordination. All these blog types may be edited by an individual or by a team, and they may be set up for a fixed term to accompany a specific task or as ongoing facilities. Work or research diaries [hug96] can be useful tools for documentation and transparency, and for communicating progress and eliciting comments. Blogs can also be used to facilitate learning, either by helping individuals manage their own personal learning [efi04a] or by enabling organisational learning through knowledge blogs (k-logs) ([efi04b], [nic01]). Inner-organisational communication may be spurred through crisis blogs [zer05] or, more generally, bulletin boards. CEO blogs may be directed at an external audience, an internal audience, or both. Letters from the CEO are often used in large companies to set the company strategy, to connect the organisation in the face of geographical separation, and to support corporate culture. Finally, blogs may be a useful tool for project management [pru04]. This account of blog types in corporate communication is not necessarily complete, but it gives a good impression of the various applications of blogs in a corporate environment.

Apart from purely internal communication, organisations may adopt an external communication strategy based on blogs, as already alluded to in the context of CEO blogs. Other blog variations for external communication are listed in [zer05]; they may serve, for instance, marketing purposes or communication with customers.

With the stepwise emergence of a corporate blogging landscape, best practices to guide the introduction of blogs are being formulated (e.g. [roe04], [hen05]). These initial recommendations advise organisations to actively plan, organise and monitor their blogging infrastructure. Employees need time resources and active encouragement to read and post messages. Moreover, it is important to establish clear guidelines that outline form and content, and what must be avoided. While relevant for both, such guidelines differ between purely internal blog types and external ones, where the borders between confidentiality and transparency need to be specified. Despite such limiting guidelines, blog postings should be informal and frank if blogs are to be accepted as an independent information and communication channel. While these two objectives pull in different directions, corporate interests and blog culture can be reconciled, as numerous examples demonstrate. IBM, for example, issued a blogging policy [ibm05]; its internal blogging platform hosts more than a thousand blogs so far, and external blogging20 is starting up as well [hei05]. Other large organisations with similar blogging platforms include Microsoft21, Sun Microsystems22, and Groove Networks23.

There are various conceivable applications of mining techniques for analysing internal corporate weblog platforms. For example, information about communication patterns may shed light on the informal structure of an organisation beyond the centrally issued organisational chart. This information about organisational dynamics and coherence could guide the optimisation of business processes, or the analysis could feed back into the weblog platform itself with measures to facilitate conversation and knowledge creation. Blog mining techniques may also be applied in a corporate context to assist collaboration, for example by identifying topical overlap between different activities where no social connections exist and the stakeholders are unaware of each other. Furthermore, blog mining may be capable of extracting the topics of current concern in the organisation from a grassroots perspective, of identifying the expertise and roles of employees within the organisation, and of supporting other analyses that may feed back into organisational development.
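The collaboration scenario above, topical overlap without an existing social connection, can be sketched with a simple set-based similarity measure. This is an illustration under stated assumptions rather than a tested technique: each blog is reduced to a set of characteristic terms, overlap is measured with the Jaccard coefficient, and the function names, the threshold of 0.3 and the example data are all hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient of two term sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def unconnected_overlaps(terms_by_blog, links, threshold=0.3):
    """Return pairs of blogs with similar topics but no link between them.

    `terms_by_blog` maps a blog name to a set of characteristic terms.
    `links` is an iterable of (blog, blog) pairs; direction is ignored.
    """
    linked = {frozenset(pair) for pair in links}
    suggestions = []
    for a, b in combinations(sorted(terms_by_blog), 2):
        if frozenset((a, b)) in linked:
            continue  # these bloggers already know of each other
        score = jaccard(terms_by_blog[a], terms_by_blog[b])
        if score >= threshold:
            suggestions.append((a, b, round(score, 2)))
    # Highest overlap first; ties keep alphabetical order.
    return sorted(suggestions, key=lambda s: -s[2])

# Hypothetical term sets and link data.
terms = {
    "blog_a": {"search", "index", "ranking", "crawler"},
    "blog_b": {"ranking", "index", "search", "latency"},
    "blog_c": {"payroll", "travel", "expenses"},
    "blog_d": {"search", "index", "crawler", "ranking"},
}
suggestions = unconnected_overlaps(terms, [("blog_a", "blog_d")])
```

In practice the term sets would have to come from some form of text analysis of the postings, which is outside the scope of this sketch.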

However, the application of existing blog mining techniques in a corporate context remains untested, and it is unclear whether they produce valuable information in an environment different to the public blogosphere. More research and testing is needed to fathom this.

While mining internal weblog platforms has taken a back seat up to now, some companies are venturing into the analysis of the public blogosphere to inform corporate strategy. One illustrative case is the marketing strategy for a multiplayer phone game called ‘Pocket Kingdom’. In a model implementation of viral marketing paradigms, the PR strategists of Pocket Kingdom identified the 100 most influential community leaders and approached them directly. Once those key people were evangelised, they spread the word into the whole community. This way, the directed marketing campaign achieved maximum effect with minimum budget. While the key people were not identified through blog mining in this very case, blog mining is in principle capable of achieving this in a blog environment.
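How key people might be identified from link data can be illustrated with a small link-analysis sketch. It ranks bloggers with a plain PageRank-style power iteration, which is one common influence measure and not the procedure used in the Pocket Kingdom campaign. The function names, the damping factor of 0.85 and the example links are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score blogs by a PageRank-style power iteration.

    `links` is a list of (linking_blog, linked_blog) pairs.
    """
    nodes = sorted({blog for pair in links for blog in pair})
    out_links = {node: set() for node in nodes}
    for src, dst in links:
        if src != dst:
            out_links[src].add(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Blogs without outgoing links spread their score evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

def top_bloggers(links, k):
    """Return the k highest-scoring blogs, ties broken alphabetically."""
    rank = pagerank(links)
    return sorted(rank, key=lambda blog: (-rank[blog], blog))[:k]

# Hypothetical link data: three blogs link to "hub", which links back to "a".
example_links = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
scores = pagerank(example_links)
leaders = top_bloggers(example_links, 2)
```

A campaign like the one described would call `top_bloggers(links, 100)` on a much larger link graph; whether link-based scores match real-world influence would still need to be validated.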

Another conceivable business intelligence application is analysis to fathom the reputation of companies or products, and to identify issues to address or new market opportunities. ‘Reputation management’ tools, for instance, are high on the agenda of PR departments of large companies ([com04], [com05]), and Gartner lists this area as high potential - or at least as ‘cool’ - in a research note [cla04]. Companies in this area include Intelliseek24, CooperKatz25, Expansion+ 26, MWW Group27, and Bacon's28.

20 IBM’s blogging platform, http://www-128.ibm.com/developerworks/blogs
21 Microsoft’s blogging platform, http://blogs.msdn.com/
22 Sun Microsystems’s blogging platform, http://blogs.sun.com/
23 Groove Networks’s blogging platform, http://www.groove.net/blog/

synthesis

This report portrayed the blogosphere as a social space and, from a technical perspective, outlined the diverse blog-related activities in the international community. Subsequently, it underlined the diversity of blog applications in a corporate environment and highlighted some conceivable blog mining applications for organisations. Of the two directions distinguished above, mining corporate blogging infrastructures and mining the public blogosphere for business intelligence, the following discussion focuses on the former. Mining the public blogosphere for corporate purposes is certainly an exciting area and, as touched upon above, several companies are breaking into this new market. In contrast, mining corporate internal blogging infrastructures has not yet received due attention.

The section “in a corporate context” highlighted the variety of application scenarios for blogs as part of an internal information and communication landscape. From a technical perspective, this raises the question of their nature and technical idiosyncrasies. Are the postings particularly long? Are there lots of comments? Are they well interlinked? How do they compare to the public blogosphere? Answers to these questions can be expected to vary according to corporate information culture and the embedding of blogs within the information infrastructure. As a general trend, there may be fewer internal links in corporate blogs, and there may be a low chatter level with few general topics that spur ongoing discussion. Expert blogs that touch predominantly on professional issues are hardly conducive to lively interactivity and a dynamic environment with strong inter-linkage. If these hypotheses turn out to be correct, information diffusion in corporate blogs can be expected to be rather localised. Such an environment is comparable to the public blogosphere in the time before the tipping point in 2001 (cf. section “fundamental blog mining”), which was void of the properties of a scale-free network and lacked connectedness, dedicated communities, and generally a coherent body with an implicit structure. This in turn may limit the applicability of mining techniques in the first place.

Both issues, the nature of corporate blogging infrastructures and the applicability of mining techniques in corporate blogging environments, are exciting research questions. However, any answer to these questions is hardly final. Blogs for corporate infrastructures are still quite high on the hype curve. In the long term, organisations may convert to other groupware for internal communications such as wikis, forums, instant messaging, or a mix of these technologies. Other social software may be further integrated into the corporate information and communication infrastructure. Above all, corporate information culture and best practices will continue to change.

24 Intelliseek, also the creator of the BlogPulse blog search engine, www.intelliseek.com
25 CooperKatz, http://www.cooperkatz.com/blogs.shtml
26 Expansion+, http://expansionplus.com/
27 MWW Group, http://www.mww.com/
28 Bacon's, http://www.bacons.com/

This does not imply that it is better to wait until infrastructures are stable before attempting the implementation of mining techniques. Evolution always proceeds in alternating cycles of technical and cultural shifts. After all, mining techniques have the potential to support individuals, the corporate information and communication infrastructure, and the whole organisation with a variety of effective applications, and these applications may themselves influence further cultural and technical evolution. Mining techniques may

• extract and synthesize information. Mining techniques may, for instance, match blogs with topics of interest for reading recommendations. They are also capable of identifying experts on particular subjects.

• inform about the state of the blogging environment. How active are bloggers? How well is it interlinked? This may feed back into strategic decisions and, for example, elicit activities to encourage reading of other blogs.

• provide evidence for strategic decisions that go beyond the information infrastructure. For example, the identification of a particularly high communication level between specific corporate divisions may suggest reinforcing the ties between these divisions in the corporate structure.
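The second and third points above can be illustrated with a few descriptive statistics. The sketch below assumes a hypothetical export of the blogging platform in which each posting records its author and the colleagues whose blogs it links to, plus a mapping from blogger to division. The field names, the function `environment_report` and the example data are illustrative assumptions, not features of any existing platform.

```python
from collections import Counter

def environment_report(posts, division_of):
    """Summarise activity and interlinking of an internal blog platform.

    `posts` is a list of dicts with the keys "author" and "links_to".
    `division_of` maps every blogger to a division; it is assumed to
    cover all authors and all link targets.
    """
    posts_per_blogger = Counter(post["author"] for post in posts)
    # Distinct directed blogger-to-blogger links, ignoring self-links.
    edges = {
        (post["author"], target)
        for post in posts
        for target in post["links_to"]
        if target != post["author"]
    }
    n = len(division_of)
    # Share of all possible directed links that actually exist.
    density = len(edges) / (n * (n - 1)) if n > 1 else 0.0
    # Count links that cross a division boundary, per pair of divisions.
    between_divisions = Counter()
    for src, dst in edges:
        pair = tuple(sorted((division_of[src], division_of[dst])))
        if pair[0] != pair[1]:
            between_divisions[pair] += 1
    return {
        "posts_per_blogger": posts_per_blogger,
        "inactive_bloggers": sorted(
            b for b in division_of if posts_per_blogger[b] == 0
        ),
        "link_density": density,
        "between_divisions": between_divisions,
    }

# Hypothetical platform export.
division_of = {"ann": "research", "bob": "research",
               "cat": "sales", "dan": "support"}
posts = [
    {"author": "ann", "links_to": ["bob", "cat"]},
    {"author": "ann", "links_to": ["cat"]},
    {"author": "cat", "links_to": ["ann"]},
    {"author": "bob", "links_to": []},
]
report = environment_report(posts, division_of)
```

A high count in `between_divisions` for a pair of divisions would be the kind of evidence mentioned in the third point, while `link_density` and `inactive_bloggers` speak to the state of the blogging environment itself.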

In a nutshell, the development of mining techniques for a corporate context carries a significant risk of failure, considering the open questions and the limited information currently available. It demands long-term commitment, as the findings of today may be invalidated tomorrow. However, the conceivable applications are exciting, and introducing them now may have a lasting effect on the evolution of corporate information and communication infrastructures.

references

[ada01] Eytan Adar, Li Zhang, Lada A. Adamic, Rajan M. Lukose: Implicit structure and the dynamics of blogspace. Workshop on the Weblogging Ecosystem, 13th International World Wide Web Conference, May 18th, 2004. http://www.hpl.hp.com/research/idl/papers/blogs/index.html

[ada05] Eytan Adar and Lada A. Adamic: Tracking Information Epidemics in Blogspace. Web Intelligence 2005, Compiegne, France, Sept. 19-22, 2005. http://www.hpl.hp.com/research/idl/papers/blogs2/index.html

[anj04] Anjo Anjewierden: Sigmund on the US Presidential Debate. Blog posting, October 2004. http://anjo.blogs.com/metis/2004/10/sigmund_on_the_.html

[anj05] Anjo Anjewierden, Robert de Hoog, Rogier Brussee and Lilia Efimova. "Detecting knowledge flows in weblogs". Common Semantics for Sharing Knowledge: Contributions to ICCS 2005 13th International Conference on Conceptual Structures, Frithjof Dau, Marie-Laure Mugnier and Gerd Stumme (eds.), Kassel University Press, Kassel, Germany, pp. 1-12, 2005 (July). http://staff.science.uva.nl/~anjo/kflows_iccs2005.pdf

[bha98] K. Bharat, A. Broder: A technique for measuring the relative size and overlap of public Web search engines. In: Computer Networks 30(1-7), pp 379-388, 1998.

[arb04] Samuel Arbesman: The Memespread Project: An Initial Analysis of the Contagious Nature of Information in Online Networks. Initial report, April 2004. http://www.arbesman.net/memespread.pdf

[bar03] Barabási, A.-L., Bonabeau, E.: Scale-free networks. Scientific American. 288, 60-69 (2003). http://www.nd.edu/~networks/PDF/Scale-Free%20Sci%20Amer%20May03.pdf

[blo00] Rebecca Blood: weblogs: a history and perspective. 7 september 2000. http://www.rebeccablood.net/essays/weblog_history.html (viewed May 2005)

[boy05] Danah Boyd: initial impression of Yahoo 360. In: Many2many - a group weblog on social software. March 24, 2005. http://www.corante.com/many/archives/2005/03/24/initial_impression_of_yahoo_360.php

[cla04] F. Caldwell: Cool Vendors in Social Network Analysis. Gartner Research Note COM-22-2916, March 2004. http://newyorkguide.blogs.com/psfk/files/gartner_cool_vendors_in_social_network_analysis.pdf

[com04] Computerworld: Winning the Name Game. News Story (April 2004). http://www.computerworld.com/q?45643

[com05] The Community Engine Blog: Your meme in the blogosphere, how PR-style analytics can help. February 2005. http://thecommunityengine.com/home/archives/2005/02/your_meme_in_th.html

[con04] Convention Bloggers - A community site for bloggers participating in the Republican National Convention, August 30 - September 2, 2004. Website. http://www.conventionbloggers.com/

[del05] Michelle Delio: Enterprise collaboration with blogs and wikis - Companies turn to new tools to foster dialog between employees, customers, and the public. In: InfoWorld, March 28, 2005. http://www.infoworld.com/article/05/03/28/13FEblogwiki_1.html

[dre04] Daniel W. Drezner, Henry Farrell: Web of Influence. Foreign Policy, November/December 2004. http://www.foreignpolicy.com/story/cms.php?story_id=2707

[efi04a] L. Efimova, S. Fiedler: Learning webs: Learning in weblog networks. In: P. Kommers, P. Isaias, & M. B. Nunes (Eds.), Proceedings of the IADIS International Conference Web Based Communities 2004, Lisbon, 24-26 March 2004 (pp.490-494), IADIS Press. https://doc.telin.nl/dscgi/ds.py/Get/File-35344/LearningWebs.pdf

[efi04b] Efimova, L. (2004). Discovering the iceberg of knowledge work: A weblog case. In Proceedings of The Fifth European Conference on Organisational Knowledge, Learning and Capabilities (OKLC 2004), April 2-3, 2004. https://doc.telin.nl/dscgi/ds.py/Get/File-34786/OKLC_Efimova.pdf

[efi05] Lilia Efimova, Stephanie Hendrick: In search for a virtual settlement: An exploration of weblog community boundaries. (submitted to) Communities and Technologies 2005; Milano, Italy; June 2005. https://doc.telin.nl/dscgi/ds.py/Get/File-46041

[eve05] Joris Evers: Microsoft Recruits Bloggers to Preview Longhorn. In: PCWorld. May 03, 2005. http://www.pcworld.com/resource/article/0,aid,120693,pg,1,RSS,RSS,00.asp

[fla03] Flake, G.W., Tsioutsiouliklis, K., Zhukov, L.: Methods for Mining Web Communities: Bibliometric, Spectral, and Flow. In: Web Dynamics. ed by Poulovassilis, A., Levene, M. (Springer, 2003).

[gar05] Gartner: Hype Cycle for Emerging Technologies, 2005. Teleconference, July 2005. http://www.gartner.com/teleconferences/asset_129930_75.jsp

[gla00] Malcolm Gladwell: The Tipping Point: How little things can make a big difference. 2000. http://www.gladwell.com/tippingpoint/

[gla04] Natalie S. Glance, Matthew Hurst and Takashi Tomokiyo: BlogPulse: Automated Trend Discovery for Weblogs. WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics. New York, May 18th 2004. http://www.blogpulse.com/papers/www2004glance.pdf

[gru04] Daniel Gruhl, R. Guha, David Liben-Nowell, Andrew Tomkins. Information diffusion through Blogspace. Proceedings of the WWW 2004, pp 491-501.

[hay01] Haydon, D.T., M. Chase-Topping, D.J. Shaw, L.Matthews, J.K. Friar, J. Wilesmith, and M.E.J. Woolhouse: The construction and analysis of epidemic trees with reference to the 2001 UK foot-and-mouth outbreak. Proceedings Royal Society B, 270:121-127.

[hei05] IBM ermutigt Mitarbeiter zum Bloggen. heise online news, 16.05.2005. http://www.heise.de/newsticker/meldung/59570

[hen05] Stefan Heng, Sabine Kaiser: Blogs: The new magic formula for corporate communications? Deutsche Bank Research, No. 53, August 2005. http://www.dbresearch.com/PROD/DBR_INTERNET_EN-PROD/PROD0000000000190745.pdf

[her04] Herring, S. C., Scheidt, L. A., Bonus, S., and Wright, E. (2004). Bridging the gap: A genre analysis of weblogs. Proceedings of the 37th Hawai'i International Conference on System Sciences (HICSS-37). Los Alamitos: IEEE Computer Society Press. http://www.blogninja.com/DDGDD04.doc

[her05] Susan C. Herring, et al: Conversations in the Blogosphere: An Analysis "From the Bottom Up". In: Proceedings of the Thirty-Eighth Hawai'i International Conference on System Sciences (HICSS-38). 2005. Los Alamitos: IEEE Press. http://www.blogninja.com/hicss05.blogconv.pdf

[hil02] J.Hiler: The tipping blog: How weblogs can turn an idea into an epidemic. Microcontent news, March 2002. http://www.microcontentnews.com/articles/tippingblog.htm

[hug96] Ian Hughes: How to Keep a Research Diary. 1996. http://www.scu.edu.au/schools/gcm/ar/arr/arow/rdiary.html

[ibm05] IBM Corporation: IBM Blogging Policy and Guidelines. May 2005. http://www.snellspace.com/IBM_Blogging_Policy_and_Guidelines.pdf

[ish04] Kazunari Ishida: Extracting Latent Weblog Communities: A Partitioning Algorithm for Bipartite Graphs. Workshop on the Weblogging Ecosystem, 13th International World Wide Web Conference, May 18th, 2004. http://www-idl.hpl.hp.com/blogworkshop2005/ishida.pdf

[kri02] S. Krishnamurthy: The Multidimensionality of Blog Conversations: The Virtual Enactment of September 11. In Maastricht, The Netherlands: Internet Research 3.0 (2002).

[kum99] Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the Web for Emerging Cyber-Communities. Computer Networks 31(11), 1481-1493 (1999). http://citeseer.ist.psu.edu/243432.html

[kum03] Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins. On the bursty evolution of Blogspace. Proceedings of the WWW 2003, pp 568-576. http://doi.acm.org/10.1145/775152.775233

[may05] Ross Mayfield: Fear, Greed and Social Software. many2many, 25 May 2005. http://www.corante.com/many/archives/2005/05/25/fear_greed_and_social_software.php

[mar04] C. Marlow: Audience, structure and authority in the weblog community. Presented at the International Communication Association Conference, May, 2004, New Orleans, LA.

[mcg04] Rob McGann: The Blogosphere By the Numbers. ClickZ Stats, Traffic Patterns. 22 November 2004. http://www.clickz.com/stats/sectors/traffic_patterns/article.php/3438891

[moo04] A. de Moor, L. Efimova: An argumentation analysis of weblog conversations. In: M. Aakhus & M. Lind, (Eds.), Proceedings of the 9th International Working Conference on the Language-Action Perspective on Communication Modelling (LAP 2004), New Brunswick, NJ, USA, June 2-3, 2004. https://doc.telin.nl/dscgi/ds.py/Get/File-41656/lap2004_demoor_efimova.pdf

[nic01] Maish Nichani, Venkat Rajamanickam: Grassroots KM through blogging. elearningpost, May 2001. http://www.elearningpost.com/features/archives/001009.asp

[nak05] Shinsuke Nakajima, Junichi Tatemura, Yoichiro Hino, Yoshinori Hara, Katsumi Tanaka: Discovering Important Bloggers based on Analyzing Blog Threads. Proceedings of the 2nd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics. Workshop at the WWW2005, May 2005. http://www-idl.hpl.hp.com/blogworkshop2005/nakajima.pdf

[pap04] Ari Paparo: Big List of Blog Search Engines. Ari Paparo's blog, January 2004. http://www.aripaparo.com/archive/000632.html

[pru04] Reinhard Prügl, Michael Schuster: Weblogs in Project Management - Using Weblogs as a tool in innovative projects. Presentation at the BlogTalk 2.0, Vienna Austria, July 2004. http://www.knallgrau.at/blogtalk/files/weblogs_in_pm.pdf

[ril05] Duncan Riley: World wide blog count for May: now over 60 million blogs. The Blog Herald, May 25th, 2005. http://www.blogherald.com/2005/05/25/world-wide-blog-count-for-may-now-over-60-million-blogs/

Duncan Riley: Blog Count for July: 70 million blogs. The Blog Herald, July 19th, 2005. http://www.blogherald.com/2005/07/19/blog-count-for-july-70-million-blogs/

[roe04] Martin Röll: Business Weblogs - A pragmatic Approach to introducing Weblogs in medium and large Enterprises. In: BlogTalk, Vienna Austria, 2004. http://www.roell.net/publikationen/Business-Weblogs_BlogTalk_Paper_Martin_Roell_English.pdf

[rss] Berkman Center for Internet & Society at Harvard Law School: RSS 2.0 Specification. http://blogs.law.harvard.edu/tech/rss

[shi03] Clay Shirky: Power Laws, Weblogs, and Inequality. In: Clay Shirky's Writings About the Internet; February 2003. http://www.shirky.com/writings/powerlaw_weblog.html

[shi05a] Clay Shirky: Business data point. In: Many2many - a group weblog on social software. March 28, 2005. http://www.corante.com/many/archives/2005/03/28/business_data_point.php

[shi05b] Clay Shirky: Ontology is Overrated: Categories, Links, and Tags. In: Clay Shirky's Writings About the Internet; 2005. http://www.shirky.com/writings/ontology_overrated.html

[sif05] David Sifry: State of The Blogosphere. Sifry's Alerts, March 14, 2005. http://www.sifry.com/alerts/archives/000298.html

[sne05] James Snell: Blogging@IBM. IBM developerWorks, 16 May 2005. http://www-128.ibm.com/developerworks/blogs/dw_blog_comments.jspa?blog=351&entry=81328

[trb] Trott, M., and B. Trott: A Beginner’s Guide to TrackBack. http://www.movabletype.org/trackback/beginners/

[tse05] Belle Tseng: Tomographic Clustering To Visualize Blog Communities as Mountain Views. 2nd Annual Workshop on the Weblogging Ecosystem (WWW 2005), Chiba, Japan, May 10th 2005. http://www-idl.hpl.hp.com/blogworkshop2005/tseng.pdf

[wir04] Wired News article (May 7, 2004): How the Word Gets Around. http://www.wired.com/news/infostructure/0,1377,63344,00.html

[wik05] wikipedia: Weblog. http://en.wikipedia.org/wiki/Weblog (viewed May 2005)

[zer05] Ansgar Zerfaß: Corporate Blogs: Einsatzmöglichkeiten und Herausforderungen. In: BIG BlogInitiativeGermany vom 27.01.2005, S. 1-9. http://www.zerfass.de/CorporateBlogs-AZ-270105.pdf