scei technical whitepaper-19.06.2012

SCEI (Semantic Communication Engine Innsbruck), pronounced SKY

Technical Whitepaper

Dieter Fensel, Michael Fried, Christoph Fuchs, Iker Larizgoitia, Alex Oberhauser, Stefan Thaler, Ioan Toma

v 0.6, 19.06.2012

Abstract
1. Introduction
2. Problem definition
3. Reference architecture
3.1. Semantic layer/domain knowledge
3.2. Separation of components
3.3. Data and content storage
3.4. The weaving process in general
4. Reference implementation
4.1. Content Management System
4.1.1. Domain and task specific UI
4.1.2. Workflow engine and communication patterns
4.1.3. Export of RDF data (OWLIM Integration)
4.1.4. The Weaving Process within the CMS
4.1.4.1. Publication in CMS
4.1.4.2. Feedback collection in CMS
4.1.4.3. Statistics collection in CMS
4.2. dacodi
4.2.1. The Weaving Process within dacodi
4.2.1.1. Common Weaver Model
4.2.1.2. Publication in dacodi
4.2.1.3. Feedback Collection in dacodi
4.2.1.4. Statistics Collection in dacodi
4.2.3. Adapters

Abstract

The Semantic Communication Engine Innsbruck (SCEI) is a fully fledged online communication software suite. It supports users in online communication, gathering feedback and measuring online impact. The software contains workflow assistance as well as communication patterns that support the planning and execution of online campaigns. Furthermore, we look into possibilities of integrating crowdsourcing support, for example for the translation of texts into foreign languages. In particular, we enable fast and easy one-click publishing and collection of content on a multitude of marketing channels, hiding technological complexity behind a user-friendly interface and directly reflecting the impact within online communities and on web presence. The core idea of our approach is to introduce a semantic layer on top of the various Internet-based communication channels that is domain specific (e.g. tourism, hotels, marketing, agencies, etc.) and not channel specific.

This document describes the overall technical architecture of the multi-channel management and communication software. Furthermore, we present the motivation, some use cases and architectural diagrams that outline the implementation details.

Note: In this version of the document we mainly focus on the core publication, feedback and statistics functionality of SCEI, together with an overview of the semantic layer. Later versions will also cover workflow capabilities, communication patterns and the crowdsourcing components.

1. Introduction

Today’s online world is more than ever driven by the fast-paced exchange of information. The rise of Facebook, YouTube and others has resulted in a notable shift in how companies and individuals share and exchange information. These social media platforms and online services enable everyone to interact with a huge, already established user base. While a “traditional” online presence in the form of a company or personal web page is still relevant, the inherent recommendation mechanics of social media platforms help to reach potential customers. However, online information is not exclusively for human consumption. Using semantic technologies, information can be enriched with metadata, making it readable for machines as well.

This difference between the traditional, social and machine readable ways of making information available is essential with regard to how the information is treated. While a traditional web page has many advantages in terms of content ownership and freedom of presentation, there are usually only limited metrics which indicate whether the presented information was appreciated by a visitor. Unless special mechanisms (such as a rating or feedback system) are implemented, visitor numbers and geographic data are the only metrics available. Social media platforms, on the other hand, provide very simple feedback mechanisms which are usually unobtrusive. Further, communication is encouraged by providing an easy way to exchange messages between users. The emphasis on feedback and interaction is the main difference in how information is treated on social media platforms as opposed to traditional web pages. Analyzing the accumulated feedback is a useful indicator of whether the published information was received well by the audience. More broadly, this enables the steering of brand perception as well as holistic online reputation management and customer relationship management.
Traditional web pages and social media platforms concentrate on humans as their main data consumers. The rise of various services (such as web services or mobile applications) and of publishing methods such as linked data provides an incentive to present the data in machine readable form as well. Our goal is to develop a set of tools which combines the traditional, social and machine readable ways to interact with information and makes this process easier than it is with existing tools. To reach this goal we will develop a unified layer - dacodi - which is able to interact with social media platforms, and extend existing content management systems (starting with Drupal) to incorporate the social and machine readable aspects into existing solutions. Additionally, we will develop support for defining workflows and identify communication patterns that help in the planning, execution and controlling of online campaigns. What differentiates our solution is the introduction of a semantic layer that abstracts information items and underlying concepts from the concrete channels that the user wants to manage. This semantic layer is specific to a domain (e.g. hotels, restaurants, doctors, event managers, ...), which enables users to work from a conceptual view rather than a channel view. Throughout this paper these software components are referred to as the Semantic Communication Engine Innsbruck, or SCEI.

Figure 1: SCEI conceptual overview

The aforementioned differentiation (traditional, social, machine readable) allows us to separate the responsibilities of our software components, making SCEI as a whole modular, which results in higher efficiency, robustness and scalability. This separation is not strict, and the three variants can overlap: certain types of web pages combine various technologies and paradigms which do not allow a strict classification. However, this three-fold separation is not meant as a classification of current online information. Its purpose is to define the types of information with which SCEI interacts.

The aim of this document is to introduce our technical solution to the problem of online multi-channel management. Before defining the general problem in detail, we establish a clear terminology. After the problem description in Section 2 we present the high level approach of our solution, i.e. the reference architecture, in Section 3, followed by a more detailed and technical look into the software components, i.e. the reference implementation, in Section 4. This comprises the separation of the system into two major parts, the CMS and dacodi components; the introduction of our approach to merging content and channels, which we call the weaving process; and its impact on publication, feedback and statistics collection. We also introduce a Common Weaver Model which enables scalability of the system by exploiting the fact that similar channels have common characteristics.

Terminology

We define the following terms in order to establish a common understanding of the topic:

● Communication [1] is, according to Wikipedia, the activity of conveying information. Communication requires a sender, a message (an object of communication, information or a form of information), a channel (the medium) and an intended recipient. Bidirectional communication follows a broad process model that often starts with a publication or broadcasting activity, which can be followed by feedback, which in turn often triggers the exchange of further information or even leads to engagement in a long conversation.

● Dissemination [2] is the act of broadcasting content to the public without direct feedback from the audience.

● Content Management is the set of processes and technologies that support the collection, management and publishing of information in any form and medium. Digital content may take the form of text, multimedia files or any other file type which follows a content life cycle that requires management.

● RDFa is a W3C Recommendation that allows RDF statements to be embedded into XHTML, HTML4 and HTML5 documents.

● Microdata is an approach similar to RDFa. It allows semantics to be embedded into existing HTML content. Microdata aims to be simpler than RDFa and plays a major role in search engine optimization (SEO).

● A Channel is a means of transporting a message, i.e. a medium. In our definition an online channel does NOT equal a full communication platform. Potentially every URI is a channel. For example, an HTML page within a larger website can be a channel.

● A Platform is a collection or a group of channels. For example, Facebook is not one channel but a collection of multiple channels, the Facebook wall being one of them. A Platform allows access to more than one communication channel (e.g. video, text, image).

● Pull channels are channels that actively gather data from a predefined source. A homepage (single HTML site) or Wiki page, for example, requests data from a server. These data sources can be manifold (e.g. a semantic repository), but the procedure is always the same: the system pulls information from a source. For example, a Linked Data endpoint can also be queried using SPARQL, and the extracted information can be transformed, reused, etc. If we have direct control over the underlying data we can semantically annotate it using technologies like RDFa and microdata.

● Push channels are channels to which information has to be explicitly sent. These channels include email, bulletin boards and Web 2.0 platforms. None of these channels actively gathers information from external data sources. This means that if we want to distribute information to such channels we have to actively push it to the correct one. Also, because the user usually does not have full control over the data pool and storage, it is not possible to control the semantic annotation of, for example, a tweet or Facebook post.

● Information Item is the entity to be published. An information item may be semantically enriched and thus described by an underlying concept. Viewed from the syntactic side, an information item is represented as XHTML with the possibility of RDFa annotations.

● User in our terminology is an agent (human or software solution) who executes a task related to online communication.

● Adapters in dacodi are used to provide uniform access to all communication channels. They are the link between the actual communication channel (e.g. the Facebook API for wall posts) and dacodi. We distinguish between two types of adapters: publishing and retrieval.

[1] http://en.wikipedia.org/w/index.php?title=Communication&oldid=480484048
[2] http://en.wikipedia.org/w/index.php?title=Dissemination&oldid=458980901
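As a rough illustration of the two adapter types named in the terminology, the following Python sketch defines a publishing and a retrieval interface plus an in-memory stand-in for a channel. All class and method names (`PublishingAdapter`, `RetrievalAdapter`, `publish`, `fetch_feedback`, `InMemoryChannel`) are assumptions made for this example and are not taken from dacodi.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class PublishingAdapter(ABC):
    """Hypothetical interface: pushes content to one channel."""

    @abstractmethod
    def publish(self, text: str) -> str:
        """Publish text and return a channel-specific post id."""


class RetrievalAdapter(ABC):
    """Hypothetical interface: reads feedback back from one channel."""

    @abstractmethod
    def fetch_feedback(self, post_id: str) -> List[str]:
        """Return the feedback (e.g. comments) attached to a post."""


class InMemoryChannel(PublishingAdapter, RetrievalAdapter):
    """Stand-in for a real channel API, used only for illustration."""

    def __init__(self) -> None:
        self._posts: Dict[str, str] = {}
        self._comments: Dict[str, List[str]] = {}

    def publish(self, text: str) -> str:
        post_id = f"post-{len(self._posts) + 1}"
        self._posts[post_id] = text
        self._comments[post_id] = []
        return post_id

    def add_comment(self, post_id: str, comment: str) -> None:
        # Simulates a reader reacting on the channel itself.
        self._comments[post_id].append(comment)

    def fetch_feedback(self, post_id: str) -> List[str]:
        return list(self._comments[post_id])
```

A real adapter would wrap one concrete channel API behind the same two methods, so that the rest of the system never has to deal with channel-specific calls.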

2. Problem definition

After introducing the topic, we now focus on defining the problems that have to be overcome in order to achieve scalable and efficient online communication. On a high level, the general problem is the following: A user has content that he or she wants to make accessible to others. This content can be published as static content on a traditional web page, as a “status update” or something comparable on a social media platform, or as RDF triples in a triple store or the Linked Open Data cloud. If the user wants to utilize multiple outlets - to reach a wider and more diverse audience, for example - the content has to be published in multiple places. Currently, however, publication in multiple places results in duplicated effort and manual labor.

While the problem may be very similar conceptually, publication on different channels works differently at the level of technical details. These are not mere technicalities or minor differences, however. When it comes to, for example, ownership of the published data, the differences between a traditional web page and a social media platform are major. This shift of ownership and responsibility implies further differences with regard to which operations are possible on published data, e.g. modification and deletion of already published content. A homepage is in most cases intended to be world readable without restrictions, whereas social networks can be quite restrictive and make content consumable only for registered users. Additionally, a homepage should be well structured so that the user can quickly discover the information needed, whereas in most social networks recent content is automatically delivered to a user in his or her stream. The traditional Web and especially the Web 2.0 (Social Web) are becoming an inseparable part of identity creation and represent a key medium for companies to communicate with existing and potential customers.
However, the opportunities for companies to leverage Web technologies for attracting more site visitors and reaching more target-group users are accompanied by a number of challenges. These include, as stated before, technical difficulties but, more importantly, handling the growing number and diversity of social platforms, specialized news web pages, blogs, discussion forums and messaging services. We address these hindrances by providing innovative marketing communication and impact-measurement solutions. In particular, we offer the first product that employs semantics to create a level of abstraction over all communication channels, thus supporting the recommendation of suitable channels and the simultaneous publishing of content. We base the tool development on four main approaches for handling complexity and reducing the amount of manual effort:

● Description of communication channels’ capabilities which is implicitly given through clustering of channels into groups with similar functionality

● Semantic representation of the customer’s domain information. SCEI makes use of semantic annotations, which can refer to a domain ontology. These annotations are useful for other services, as well as for the publication component of the tool (i.e. dacodi) and play an important role in search engine optimisation.

● Channel recommendation

● Content transformation to fit a particular channel

Content distribution and feedback monitoring across various channels is a manual and labor-intensive task. Take, for example, video upload. A user has to upload a video to potentially multiple platforms (YouTube, Vimeo, Facebook video), copy and paste the video title and description, and enter tags manually. After the upload, the user may want to notify his or her clients about the new video via social networks. Thus, the video link has to be copied and posted as a status update. Further, a short description alongside the link would be beneficial, which again has to be written or copied. Our tool aims to eliminate all manual labor in this and similar processes that can be automated. The resulting software product saves time and hides technology specifics behind an easy-to-use interface, enabling a flexible and scalable multi-channel communication strategy. Furthermore, the tool uses different metrics to statistically capture and analyze online reach and impact, providing means both for evaluating the online marketing strategy and for conducting reputation management by reacting in a timely manner, especially to negative posts and feedback.

3. Reference architecture

In order to solve the problems mentioned above, this section explains our solution on a conceptual level. The central element of our approach is the separation of content and online channels. This allows the same content to be reused for various means of communication. Through this reuse we want to achieve scalability of multi-channel communication. The explicit modeling of content independent of specific channels also adds a second element of reuse: similar operational entities active in the same domain can reuse significant parts of such a content model. Separating content from channels also requires the explicit alignment of both. This is achieved through a weaving process. Figure 2 shows the SCEI high level reference architecture. The following sections give more details about the reference architecture, its components, where the content generated by the user is stored, and the weaving process in general.

Figure 2: SCEI reference architecture

3.1. Semantic layer/domain knowledge

In order to abstract the domain specific communication from the actual channels, thus lifting the distribution and data collection in channels to an upper conceptual layer, we need semantics on top of our solution. This layer on the one hand captures data in domain specific ontologies and on the other hand describes the various communication channels. In order to interweave the domain specific concepts with the underlying communication channels we propose a weaving process, which is explained in more detail later in this document. In the end this semantic layer decides which kind of content is distributed to which channel in which form.

Let’s take for example a hotelier who wants to build up or extend the online presence of his or her business. First of all, there is a need to know all relevant channels which reflect the target group. This list can include things like a homepage, mailing lists, fora or social networks. Once the available channels are known, accounts have to be created on the selected platforms. In addition, a hotelier has to be present on various rating and review sites in order to maximize business opportunities. These channels can be manifold, and it is extremely hard to keep an overview of what is going on in them without technical assistance, i.e. a tool that distributes to and aggregates from all channels in a single interface. At the same time, technical details as well as emerging channels have to be integrated quickly and transparently. The end user therefore needs to work on a level he or she understands, i.e. a domain specific layer with concepts well known in the industry sector, instead of handling each channel separately.
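To make the idea of working on domain concepts rather than channels more concrete, the following Python sketch maps a few hotel-domain concepts to suitable channels. It is only a toy lookup table under stated assumptions: the concept names, channel names and the function `channels_for` are invented for this illustration, and the actual semantic layer is described as being based on ontologies rather than a hard-coded dictionary.

```python
from typing import Dict, List, Set

# Hypothetical domain-to-channel mapping for a hotel. Concept and channel
# names are illustrative assumptions, not part of the SCEI models.
CHANNELS_BY_CONCEPT: Dict[str, List[str]] = {
    "SpecialOffer": ["homepage", "facebook_wall", "newsletter"],
    "Event": ["homepage", "twitter"],
    "ContactInfo": ["homepage", "facebook_profile", "review_site_profile"],
}


def channels_for(concept: str, available: Set[str]) -> List[str]:
    """Return the channels suited to a concept that the user actually has."""
    candidates = CHANNELS_BY_CONCEPT.get(concept, [])
    return [channel for channel in candidates if channel in available]
```

With such a mapping the hotelier only states that a piece of content is, say, a special offer; which of the configured channels receive it follows from the model.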

3.2. Separation of components

The STI online communication tool is split into a set of components that can be conceptually grouped into two major parts, namely:

1. The content management system (CMS), together with the domain and task specific interfaces and the workflow engine and communication patterns component

2. The data and content distribution component, responsible mainly for the Web 2.0 communication (i.e. data distribution to and feedback collection in push channels).

This separation is needed to satisfy the different requirements of push and pull channels, which follow contradicting approaches and are applied by different existing channels. Note, however, that both paradigms have in common that multi-directional communication (conversations between multiple users) can occur and that statistical information can often be extracted from the channels. The component separation also supports scalability, allows easy adaptation to multiple use cases and simplifies the integration with the seekda hotel booking solution as well as other 3rd party apps.

Another main motivation for this separation is to have a single layer which unifies social media platforms (Web 2.0 channels) - namely the data and content distribution component. This enables easy integration by providing a single common interface, as well as the possibility of external use, as mentioned above. The alternative would be to integrate everything in the CMS of choice, thus giving up loose coupling, reuse and a component-based architecture. In our approach, on each data change in the CMS a separate module sends the newly created or updated content to the data and content distribution component API. This loosely coupled approach allows an easy exchange of the CMS part and makes the data and content distribution component independent from current content management solutions. The API also makes it possible to create use case specific interfaces for the data and content distribution component (e.g. to enable white labeling) and to quickly integrate it into 3rd party applications in use cases where the “heavy weight” CMS part, including things like content hosting, is not needed. The following use cases were defined to show the advantage of such a flexible architecture:

● A hotelier does not want to change his or her existing homepage infrastructure and CMS but nevertheless profits from addressing multiple Web 2.0, e-mail and rating channels via the data and content distribution component. Setup and usage of the software must be easy enough to be performed by an averagely skilled user. Here the communication with the customer, including engagement in conversations via the tool, is the primary focus of the user. Such a tool can be offered at a low price, since the data and content distribution component mainly handles content distribution and does not have to take care of site hosting and content per se.

● The dissemination partner of an EU project needs a fully fledged, out of the box solution to address all important channels at once. The CMS with semantic data export in combination with the data and content distribution component enables this. The initial setup is, however, a non-trivial task, since the homepage structure and the links to LOD vocabularies and other ontologies must be created. We can, however, expect a more technically skilled user to operate this full package.

● A marketing agency with a multitude of different customers faces several other problems. Each customer wants a fully fledged offline and online presence in multiple channels. SCEI is very flexible in this regard: the agency can maintain the full Web 2.0 presence of a customer via the data and content distribution component and, if needed, also provide its customers with a state-of-the-art CMS solution.

In Figure 2 we provide a high level overview of all SCEI components and actors which are the following:

● User: Person that operates the software and works on the level of information items rather than channels. We distinguish between several specific user roles, namely:

○ Content creator: Person that generates the content of the items to be disseminated.

○ Workflow designer: Person that defines communication patterns and workflows involving communication, multi-channel publishing and social media monitoring within an organization (e.g. a hotel business).

● Information Item: Content that traverses the system and is stored, distributed and transformed within the process.

● CMS: The content management system (in our case Drupal 7.x) exposes the user interface to the user as well as HTML in the form of a website, accessible via the Web.

○ Domain and Task specific UI: Depending on the application domain, user, task and role, the user interface adapts itself and presents all relevant information in an easily accessible way.

○ Workflow Engine/Communication patterns: SCEI contains an engine that supports publication and controlling workflows. Well known communication patterns help in planning and executing these workflows.

○ RDFa annotation: Enriches the information item with semantic metadata in order to export it to an RDF repository and to ease the distribution via the data and content distribution component, because the tool then understands the meaning of the information item instead of just its structure.

○ Scheduling: Contains rules about delayed or recurrent publication of the information item.

○ DB: Database which stores the actual content.

○ RDF export plugin: Exports all information items from the DB to an external repository.

● Semantic repository: External RDF repository which exposes all information items via a SPARQL endpoint. It also contains the domain and channel models and makes this information accessible to both the CMS and the data and content distribution component.

● Data and content distribution: Distributes content in and aggregates information from all push channels.

○ API: Makes the data and content distribution component accessible to the CMS as well as to other 3rd party applications, which allows this part to be integrated into external software solutions. Receives HTML which can be additionally enriched with RDFa annotations.

○ DB: Database stores references to the information items and their representation in the different channels.

○ Publishing Module: Is responsible for distributing the information item to the different channels.

■ Content Extractor: Analyzes the HTML coming from the API and extracts all relevant information.

■ Concept to channel mapper: Decides which part of the information item will be pushed to which channel.

■ Content Transformer: Transforms the content in order to fit the channels e.g. shortens a text to 140 characters for Twitter publication.

■ Scheduling: Contains rules about delayed or recurrent publication of the information item.

○ Statistics module: Collects and stores all valuable statistical information coming from the various channels e.g. site visits, number of views, and such.

■ Item Analyzer: Handles statistics of an information item in various channels.

■ Channel Analyzer: Handles statistics coming from a specific channel regarding all information items published within.

○ Engagement Module: Is responsible for direct interaction in various channels.

■ Feedback Collector: Gathers feedback from all channels in order to present it centrally. This can be, for example, comments or reviews.

■ Interaction Component: Enables the user to react to the gathered feedback, for example to reply to online comments.

○ Impact Analyser Module: Figures out which impact publications have had.

■ Impact Analyser: Specialized form of statistics that tries to figure out how to efficiently make an impact in the online world. We differentiate here between real impact, based on active publications, and potential impact, meaning how many people the user can potentially reach in the various channels, given a limited number of e.g. friends or subscribers.
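The Content Extractor and Content Transformer listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, assuming that "relevant information" simply means the visible text of the HTML; the function names `extract_text` and `shorten` are hypothetical, and only the 140 character limit for Twitter is taken from the component description.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def extract_text(html: str) -> str:
    """Content extraction step: strip markup and normalise whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    return " ".join("".join(parser.parts).split())


def shorten(text: str, limit: int = 140) -> str:
    """Content transformation step: fit text into limit characters."""
    if len(text) <= limit:
        return text
    # Reserve three characters for the ellipsis and cut at a word boundary.
    cut = text[: limit - 3].rsplit(" ", 1)[0]
    return cut + "..."
```

For example, `shorten(extract_text(html))` yields a plain-text version of an information item that fits a channel with a 140 character limit, while longer-form channels could receive the untransformed text.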

3.3. Data and content storage

The CMS stores data and content (for example pictures) in its internal database. References, i.e. links, to these data items can be found in the website’s HTML code as well as in the exported RDF triples. The data and content distribution component, on the other hand, is not meant as a content hosting solution and therefore does not store all content and data. The only exception, where the data and content distribution component stores data like images and videos, albeit temporarily, is when publishing is delayed by the scheduling mechanism.

We distinguish between static and dynamic information publishing. By static information we refer to a “distributed profile” across all Web 2.0 channels that can be changed at once. Such a profile contains things like contact information or a representative picture, in short things that should not change frequently and are valid without temporal constraints. Dynamic information comprises things that are pushed to e.g. news feeds and represent information at a certain point in time. For such publications the data and content distribution component only stores a reference to the channels to which the content was distributed and a textual description of the content, so that it can later be identified by the user and specific feedback can be assigned to it. Acting mainly as a mouthpiece, the data and content distribution component provides a lightweight and scalable solution.

3.4. The weaving process in general

As mentioned in the previous sections, the general problem is one of content distribution and feedback collection. We define a “weaving” process to formalize the steps necessary to solve this problem. In general, this process can be broken down as follows:

1. Content input
2. Selection of publication channels
3. Content adaptation
4. Publication
5. Collection of feedback
6. Collection of statistics
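The six steps above can be read as a simple pipeline. The following Python sketch wires them together with caller-supplied functions; it is a simplified illustration, not the SCEI implementation. The function `weave` and its parameter names are assumptions, and in a real system feedback and statistics would be collected repeatedly over time rather than immediately after publication.

```python
from typing import Callable, Dict, List


def weave(
    content: str,                                    # step 1: content input
    channels: List[str],                             # step 2: selected channels
    adapt: Callable[[str, str], str],
    publish: Callable[[str, str], None],
    collect_feedback: Callable[[str], List[str]],
    collect_statistics: Callable[[str], int],
) -> Dict[str, dict]:
    """Run one pass of the weaving steps for every selected channel."""
    report: Dict[str, dict] = {}
    for channel in channels:
        adapted = adapt(content, channel)            # step 3: content adaptation
        publish(channel, adapted)                    # step 4: publication
        report[channel] = {
            "content": adapted,
            "feedback": collect_feedback(channel),       # step 5
            "statistics": collect_statistics(channel),   # step 6
        }
    return report
```

Passing the channel-specific behaviour in as functions mirrors the separation of content and channels: the pipeline itself never needs to know how a particular channel works.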

4. Reference implementation

We provide a reference implementation of the reference architecture presented in Section 3. The reference implementation is outlined in Figure 3 and is split into two major parts: a content management system (CMS) part based on Drupal 7.x, and the in-house implementation of the data and content distribution component, called dacodi. Additionally, an external semantic repository, namely OWLIM, is used to store content. The rest of this section provides the technical details of the reference implementation.

Figure 3: SCEI reference implementation

4.1. Content Management System

As the basis for our CMS solution we use Drupal 7.x. The reasons are its native RDF support, the availability of additional semantic modules, such as a SPARQL endpoint or microdata export, and the possibility of third-party module development. The publication of new content or the update of existing content (an information item) starts with the responsible person creating or changing one piece of information. This process is handled by the underlying CMS. If necessary, scheduling information can be provided to postpone the publication. After a successful change the content is saved to the external OWLIM repository and sent to the dacodi API. The CMS utilizes dacodi to extend its content distribution capabilities. Conversely, from the dacodi viewpoint the CMS acts as a specialized kind of user interface. In the following we outline how RDF data is exported by the CMS and how the first part of the weaving process works. The second part of the weaving process is described in the dacodi section of this document.

4.1.1. Domain and task specific UI

The domain and task specific user interfaces are the components through which users interact directly with our system. They are sub-components of the CMS and are implemented directly using it. The design and look-and-feel of these components are closely adapted to the mindset of the users, supporting them in specifying content in a terminology that is familiar to them. For example, hoteliers will specify the content items that they want to disseminate in terms of offers, touristic packages, etc. The domain and task specific user interfaces thus support an abstraction of information dissemination that is based on the concrete domain and independent of the dissemination channel(s). They also allow the user to manage and carry out task specific activities including yield, brand and reputation management, customer relationship management and online advertising.

4.1.2. Workflow engine and communication patterns

In order to support the user we offer a workflow engine together with support for communication patterns. This component enables users to define and manage complex workflows on top of the underlying SCEI components for communication, multi-channel publishing and social media monitoring. Such workflows usually have a long lifespan and involve multiple employees working together on improving the visibility, reputation and communication of an entity. The workflow engine and communication patterns component can be used to manage the communication workflow, including assigning, tracking and responding to user feedback. Using this component one can define and manage steps and protocols to be activated when certain events related to the published information occur, e.g. when a negative comment is written on a Facebook post. Take for example a hotelier. Using the workflow engine and communication patterns component, the hotelier can specify and manage when and which employees, depending on their availability, take care of responding to customers’ posts about the hotel on various channels, or engage with customers to present new hotel offers to them.
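The hotelier example can be illustrated with a single, very small workflow rule. The Python sketch below assigns negative feedback to the first available employee; the data classes `Employee` and `FeedbackEvent` and the function `assign_responder` are hypothetical names chosen for this illustration and say nothing about how the actual workflow engine models rules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Employee:
    name: str
    available: bool


@dataclass
class FeedbackEvent:
    channel: str
    text: str
    negative: bool


def assign_responder(event: FeedbackEvent, staff: List[Employee]) -> Optional[str]:
    """Return the name of the first available employee for negative feedback."""
    if not event.negative:
        return None  # this rule only triggers on negative feedback
    for employee in staff:
        if employee.available:
            return employee.name
    return None  # nobody is available; a real workflow might escalate here
```

A full workflow would chain several such rules and also track whether the assigned employee has actually responded.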

4.1.3. Export of RDF data (OWLIM Integration)

The export of the CMS content to an external triplestore repository allows the website data to be published as a bubble in the linked data cloud. The two databases are kept consistent by hooks that the CMS triggers on each add, update and delete operation. Hooks are functions that make it possible to intercept the internal workflow of the CMS. After an operation has been executed successfully, the RDF export plug-in creates triples and uses the Sesame REST API to add or change the content in OWLIM. For semantically annotating (RDFa and microdata) the content on the homepage exposed by the underlying CMS, we use the Drupal internal database, since available Drupal modules already provide this annotation functionality. The OWLIM repository mainly serves as a linked data SPARQL endpoint. As seen in Figure 3, the changes to the CMS are not intrusive, since the added functionality is provided by plug-ins. Two additional plug-ins need to be written: one for the OWLIM integration and one for dacodi.
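The hook-driven export described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual CMS plug-in: the `HookRegistry` class, the `node_to_triples` helper, the example base URI and the choice of Dublin Core predicates are all assumptions made for illustration. The callable `send` stands in for the request to the triplestore, so the sketch does not depend on a running Sesame or OWLIM endpoint.

```python
from collections import defaultdict

BASE = "http://example.org/node/"    # hypothetical base URI for CMS nodes
DC = "http://purl.org/dc/terms/"     # Dublin Core terms (assumed vocabulary)


def escape_literal(value):
    """Escape a string for use as an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def node_to_triples(node):
    """Serialize a CMS node (a plain dict here) as N-Triples lines."""
    subject = f"<{BASE}{node['id']}>"
    lines = []
    for field in ("title", "description"):
        if field in node:
            lines.append(f'{subject} <{DC}{field}> "{escape_literal(node[field])}" .')
    return lines


class HookRegistry:
    """Minimal stand-in for the CMS hook mechanism."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, operation, callback):
        self._hooks[operation].append(callback)

    def fire(self, operation, node):
        # Called by the CMS after the operation has succeeded.
        for callback in self._hooks[operation]:
            callback(node)


def make_export_plugin(send):
    """Build the hook callbacks; `send(action, subject, triples)` is the transport."""
    def on_add(node):
        send("add", f"{BASE}{node['id']}", node_to_triples(node))

    def on_update(node):
        # Replace all statements about the node with freshly generated ones.
        send("replace", f"{BASE}{node['id']}", node_to_triples(node))

    def on_delete(node):
        send("delete", f"{BASE}{node['id']}", [])

    return {"add": on_add, "update": on_update, "delete": on_delete}


# Usage: record what would be sent instead of contacting a triplestore.
sent = []
registry = HookRegistry()
for operation, callback in make_export_plugin(lambda *args: sent.append(args)).items():
    registry.register(operation, callback)

registry.fire("add", {"id": 7, "title": 'Summer "special" offer'})
registry.fire("delete", {"id": 7})
```

Keeping the transport behind a single callable keeps the hook logic testable without a triplestore; in a deployment, `send` would issue the corresponding request against the Sesame REST API.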

4.1.4. The Weaving Process within the CMS

With regard to a content management system, the weaving process looks as follows:

1. Content input
The content is entered in the CMS directly by the user of the CMS.

2. Selection of publication channels
The user selects where the document should be published with regard to the internal document tree of the CMS. If distribution to social media platforms via dacodi is desired, Web 2.0 channels can be selected as well.

3. Content adaptation
Content adaptation is not necessary for the CMS, since there are no content restrictions.

4. Publication
The document is published in the CMS and, if desired, as triples in the LOD cloud. During the publication phase, the information item is passed to dacodi along with the previously selected Web 2.0 channels.

5. Collection of feedback
Direct user feedback, such as comments, shares and retweets, is gathered by dacodi and can be queried by the CMS using the dacodi API (see Section 4.2.1.3, Feedback Collection in dacodi).

6. Collection of statistics
Visitor numbers and demographic data can be collected with a tool like Google Analytics or the open source solution PIWIK.

The publication as triples in the LOD cloud (or to any external triplestore), mentioned in step 4 of the weaving process, is done by a plug-in which integrates OWLIM into the CMS.

4.1.4.1. Publication in CMS

The CMS component enables the publication of content on a homepage. It also provides functionality to annotate the website’s HTML with RDF data and to export this RDF data as a whole in order to make it machine-understandable.

4.1.4.2. Feedback collection in CMS

Feedback within the CMS can come from various sources, for example an internal commenting or rating system.

4.1.4.3. Statistics collection in CMS

Statistics within the CMS can come from various sources, such as Google Analytics or PIWIK for analyzing page visits, or internal comment and feedback systems. In the following section we explain how dacodi distributes content in multiple channels.

4.2. dacodi

The dacodi component is used to distribute information in various Web 2.0 and email channels, to collect and analyze feedback from those channels, and to actively engage in conversations (e.g. reply to comments). Central to dacodi is the weaving process, which enables channel selection based on the semantics of the information item to be distributed and content transformation based on these channels. If manual effort is necessary, for example for entering content at a certain place in a wiki system, the content can be sent to the responsible webmaster via email. We describe how the weaving process works within dacodi and how the component interacts with online channels using Adapters for publication and for feedback and statistics collection.


4.2.1. The Weaving Process within dacodi

The ultimate goal of the weaving process is the semi-automated publication of the information item in fitting channels, including necessary transformations, based on the information type. Thus, the weaving process can be broken down into the following steps:

1. Content input
In the case of dacodi, this is the acquisition of the information item, either through the API (coming e.g. from the CMS) or through a dedicated user interface.

2. Selection of publication channels
Selection of appropriate Web 2.0 channels based on the information type.

3. Content adaptation
Transformation of the information item into a Common Weaver Model (CWM) instance.

4. Publication
Publication of the (transformed) information item in the selected channels.

5. Collection of feedback
Feedback collection via the APIs of the used channels.

6. Collection of statistics
Statistics collection via the APIs of the used channels.

We discuss the two steps that need further explanation, channel selection and content transformation, in the following subsections. Afterwards we explain in detail the Common Weaver Model, which is part of the content transformation component and specific to dacodi, not to the CMS.

Channel Selection

Based on the type of the information item (e.g. a business event), fitting channels for the information item are selected (e.g. a business event is announced on LinkedIn but not on Facebook). The central component of the channel selection process is the (Concept-to-Channel) Mapper. It takes a concept as input and returns the list of channels that are relevant for that concept. The Mapper of the prototype implementation uses a static mapping from every concept to a list of channels. Due to the modular architecture of the application, the Mapper component can easily be replaced with a more sophisticated, dynamic approach. It would be possible, for example, to implement a Mapper that incrementally learns from user adjustments and thus alters the channel mappings based on the user’s needs.

Transformation

For every channel the information item has to fit into, a transformation is necessary. For example, a business event might include fields such as short title, long title, description, start date, end date, location and venue. Further, there might be an accompanying image which represents the event, like a poster. A channel which only takes short text messages (Twitter, for example) cannot handle all those fields. Thus, the information item has to be transformed into what we call a Common Weaver Model (CWM) instance. To expand on the previous example, one could combine the most important information of a business event (short title, start date, end date, location and venue) into a string which fits the channel’s restrictions, for example Twitter’s 140 characters. The transformer component defines which transformations are necessary to go from an information item to a Common Weaver Model instance.
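Channel selection and text transformation, as described above, might look roughly like the following Python sketch. It is an illustration under assumptions, not the prototype code: the concept names, the channel identifiers, the `StaticMapper` and `to_text` names and the example event are made up, and the only figure taken from the text is the 140 character limit for Twitter.

```python
from dataclasses import dataclass

# Static concept-to-channel mapping; concept and channel names are examples only.
STATIC_MAPPING = {
    "BusinessEvent": ["linkedin", "twitter"],
    "SpecialOffer": ["facebook", "twitter"],
}

# Per-channel text limits; 140 characters for Twitter as stated in the text.
TEXT_LIMITS = {"twitter": 140}


class StaticMapper:
    """Maps a concept to the channels considered relevant for it."""

    def __init__(self, mapping):
        self._mapping = mapping

    def channels_for(self, concept):
        # Unknown concepts map to no channel at all.
        return list(self._mapping.get(concept, []))


@dataclass
class BusinessEvent:
    short_title: str
    start_date: str
    end_date: str
    location: str
    venue: str


def to_text(event, channel):
    """Combine the key fields into one string that respects the channel limit."""
    text = (f"{event.short_title}, {event.start_date} to {event.end_date}, "
            f"{event.venue}, {event.location}")
    limit = TEXT_LIMITS.get(channel)
    if limit is not None and len(text) > limit:
        # Truncate and mark the cut with an ellipsis.
        text = text[:limit - 3].rstrip() + "..."
    return text


mapper = StaticMapper(STATIC_MAPPING)
event = BusinessEvent("Semantic Web Meetup", "2012-09-03", "2012-09-04",
                      "Innsbruck", "University of Innsbruck")
messages = {channel: to_text(event, channel)
            for channel in mapper.channels_for("BusinessEvent")}
```

Because the mapper is only accessed through `channels_for`, a learning mapper could replace `StaticMapper` without changes to the transformation code.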

4.2.1.1. Common Weaver Model

The Common Weaver Model (CWM; see footnote 3) exploits the fact that similar channels have common characteristics. For example, Facebook status updates and Twitter enable the user to share short text messages in the form of status updates. YouTube, Vimeo and Facebook video enable the user to upload and share videos. After looking at various Web 2.0 channels, we have identified the following Common Weaver Models:

● Text: A String of varying length. Online communication as it is today relies heavily on the exchange of short text messages. In essence, those messages are simply Strings of varying length. Depending on the platform, such text messages can be between 140 (Twitter) and many thousand characters (63,206 in the case of Facebook).

● Link: A common hyperlink denoted by the <link /> or <a /> HTML element.

● Image: A two-dimensional image denoted by the <img /> HTML element. While support may differ depending on the Channel, possible Internet media types include: gif, jpeg, png, svg, tiff.

● Video: A video file. While support may differ depending on the Channel, possible Internet media types include MPEG-1 video with multiplexed audio, MP4 video, Ogg Theora video, QuickTime video, WebM Matroska-based open media format, Matroska open media format, Windows Media Video.

● Presentation: A presentation file. We want to support this type, and thus related Web 2.0 platforms like SlideShare, in a future version. Not supported in the prototype.

● Audio: An audio file. Not supported in the prototype.

During the weaving process, instances of those models are extracted from the information item and sent to the selected publishing adapters. Each Common Weaver Model instance is stored internally using a unique identifier and grouped by the information item to which it is related. The granularity of the extraction depends on the information item to be published. For example, if the user simply wants to publish a single link in various channels, it makes sense to extract the link and publish it. On the other hand, if a more complex information item contains dozens of links, it does not make sense to extract and publish every link (this would amount to spamming), unless the user explicitly wants to do so.
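As an illustration of how CWM instances could be represented and grouped, consider the following Python sketch. The class and function names (`CwmInstance`, `CwmStore`, `extract`), the URL pattern and the `max_links` threshold are assumptions introduced here; the text only states that instances carry a unique identifier, are grouped by information item, and that extracting dozens of links should be avoided.

```python
import re
import uuid
from collections import defaultdict
from dataclasses import dataclass, field

URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class CwmInstance:
    kind: str       # "text", "link", "image", "video", ...
    payload: str    # the text itself, or a URL / file reference
    item_id: str    # the information item this instance was extracted from
    instance_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class CwmStore:
    """Keeps CWM instances grouped by their information item."""

    def __init__(self):
        self._by_item = defaultdict(list)

    def add(self, instance):
        self._by_item[instance.item_id].append(instance)

    def for_item(self, item_id):
        return list(self._by_item.get(item_id, []))


def extract(item_id, body, image=None, max_links=3):
    """Extract CWM instances from an information item.

    Links are only extracted individually when there are few of them
    (hypothetical threshold), to avoid flooding channels with link posts.
    """
    instances = [CwmInstance("text", body, item_id)]
    links = URL_PATTERN.findall(body)
    if len(links) <= max_links:
        instances += [CwmInstance("link", url, item_id) for url in links]
    if image is not None:
        instances.append(CwmInstance("image", image, item_id))
    return instances


store = CwmStore()
for instance in extract("event-1", "Details at http://example.org/event",
                        image="poster.png"):
    store.add(instance)
```

Since every instance keeps its `item_id`, a link back to the original information item can be attached when the instance is published.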

3 The model in Common Weaver Model refers to a model from a software engineering point of view, as in MVC (Model-View-Controller). A model manages the behavior and data of the application domain.


Figure 4: Extraction of Common Weaver Model (CWM) instances from an information item.

When an information item, e.g. a business event, is published, CWM instances are extracted. Expanding on the business event example introduced in the Transformation section: if the business event includes an image, it is extracted and published to fitting image channels like Flickr. The essential information about the event can be combined in a string format and published via text channels, such as Twitter, Xing or Facebook. Since every CWM instance knows from which information item it was extracted, a link to the original information item (in this example, the business event) can be embedded, e.g. in the description of the image.

4.2.1.2. Publication in dacodi

The publisher module takes care of two things: publication of the information item (at this stage of the weaving process represented as Common Weaver Model instances) using adapters, and scheduling. We plan to support scheduling in two ways: delayed publication and repeated publication. For example, delayed publication can be used to announce an event or a special offer at a specific time, whereas repeated publication may be used to send reminders (e.g. for a call for papers) in all channels.
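The two planned scheduling modes can be expressed as lists of publication times. The following Python sketch is only meant to make the distinction concrete; the function names and the example dates are assumptions, and the text does not specify how scheduling is implemented in dacodi.

```python
from datetime import datetime, timedelta


def delayed(publish_at):
    """Delayed publication: a single run at the given time."""
    return [publish_at]


def repeated(first_run, interval, count):
    """Repeated publication: `count` runs spaced `interval` apart."""
    return [first_run + i * interval for i in range(count)]


def due(schedule, now):
    """Return the scheduled times that have been reached at `now`."""
    return [t for t in schedule if t <= now]


# Three weekly reminders, e.g. for a call for papers.
reminders = repeated(datetime(2012, 7, 1, 9, 0), timedelta(days=7), 3)
```

A publisher loop would periodically call `due` and hand the corresponding CWM instances to the publishing adapters.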

4.2.1.3. Feedback Collection in dacodi

Every information item that is published by dacodi is tracked by the system in order to provide statistical information and a per-channel impact analysis. This feature allows the user to see how well the published information item was received without having to check every channel individually.

Feedback

There are four forms of feedback that are supported by various Web 2.0 platforms and thus relevant for dacodi:

● Unary feedback. Any feedback that is a predefined, positive feedback. Examples: “like” on Facebook, “retweet” on Twitter, “favourite” on flickr, “favourite” on YouTube, etc.

● Binary feedback. Any feedback that is a predefined, positive or negative feedback. Example: thumbs up/down on YouTube.

● Rating/ranking. Feedback that can be quantified on a discrete scale. Example: star rating on a hotel review platform.

● Textual feedback. Any feedback that is user-created, in the form of replies, comments or any other written feedback. NLP techniques can be used to analyze the textual feedback and provide the user with additional information, e.g. whether a comment or reply was positive or negative. A user can react directly to textual feedback within dacodi if the underlying channel allows this functionality.

4.2.1.4. Statistics Collection in dacodi

Several statistical metrics are relevant for the user. While the unary, binary, rating and textual feedback is centered on the information item, statistics are relevant on a per-item and per-channel basis. Examples:

● Amount of unary, binary, rating, and textual feedback per information item (this includes features like “most discussed information item”, i.e. the information item with the most textual feedback).

● Number of information items published in each channel over a certain amount of time (day, week, month, year).

● Calculation of a combined impact metric per channel, based on feedback analysis of the information items published in the channel.
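The per-item counts and the combined per-channel impact metric listed above could be computed along the following lines. This Python sketch rests on several assumptions: the `Feedback` record, the numeric encoding of each feedback form and the `WEIGHTS` table are hypothetical, since the text does not define how the combined impact metric is calculated.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    item_id: str
    channel: str
    kind: str      # "unary", "binary", "rating" or "textual"
    value: float   # assumed encoding: unary 1, binary +1/-1,
                   # rating scaled to 0..1, textual sentiment in -1..1


def counts_per_item(feedback):
    """Count feedback entries per (information item, feedback kind)."""
    return Counter((f.item_id, f.kind) for f in feedback)


def most_discussed(feedback):
    """The information item with the most textual feedback, or None."""
    textual = Counter(f.item_id for f in feedback if f.kind == "textual")
    return textual.most_common(1)[0][0] if textual else None


# Hypothetical weights for the combined impact metric.
WEIGHTS = {"unary": 1.0, "binary": 1.0, "rating": 2.0, "textual": 3.0}


def impact_per_channel(feedback):
    """Weighted sum of feedback values per channel."""
    impact = Counter()
    for f in feedback:
        impact[f.channel] += WEIGHTS[f.kind] * f.value
    return dict(impact)


sample = [
    Feedback("offer-1", "facebook", "unary", 1),
    Feedback("offer-1", "facebook", "textual", 0.5),
    Feedback("offer-1", "youtube", "binary", -1),
    Feedback("event-2", "twitter", "unary", 1),
    Feedback("offer-1", "twitter", "textual", -1),
]
```

With the sample data, `most_discussed(sample)` yields the item with two textual entries, and `impact_per_channel(sample)` gives one signed score per channel.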

4.2.3. Adapters

As mentioned before, we distinguish between two types of adapters: publishing and retrieval. The purpose of a publishing adapter is to publish an information item in a certain channel. Retrieval adapters are used to gather information about already published information items. Since the APIs as well as the offered functionality differ from channel to channel (e.g. Twitter’s and Facebook’s APIs differ), a separate adapter needs to be written for each channel. In our prototype we intend to create publishing and retrieval adapters for the following platforms: YouTube, Facebook and Twitter. All three of them have a Web API and cover a majority of the features we want to realize, such as publishing videos, texts, images and links. This is a starting point for implementing new adapters that provide similar functionality. We have identified the following channel specific features that each adapter has to be able to handle:

● Mapping CWM properties to the appropriate properties in the target communication channel. For example, the text property of a tweet is called ‘text’, whereas the text property of a Facebook post is named ‘message’. Since we have to implement an adapter for each communication channel we want to address, we implement this as a simple mapping routine in each adapter.

● Authentication and authorization to the communication channel. Most Web 2.0 communication channels rely on OAuth / OAuth2 (see footnote 4) for authorization and authentication. However, some of them rely on OpenID (see footnote 5), basic HTTP authentication or form-based authentication mechanisms to restrict user access. The adapter has to be able to deal with these individual mechanisms and has to store and load the credentials of each user.

● Publishing a specific CWM instance. As mentioned above, the publishing process varies from platform to platform, so this functionality has to be abstracted. The same holds for retrieving feedback from different platforms.
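The property mapping and the abstraction of the publishing step could be sketched as follows in Python. The class names and the shape of the returned dictionary are assumptions made for illustration; only the property names ‘text’ and ‘message’ are taken from the text, and no real platform API is called.

```python
from abc import ABC, abstractmethod


class PublishingAdapter(ABC):
    """Common interface; each channel supplies its own property mapping."""

    # CWM property name -> channel specific property name
    property_map = {}

    def build_payload(self, cwm_properties):
        """Rename CWM properties to the names the channel expects."""
        return {self.property_map[name]: value
                for name, value in cwm_properties.items()
                if name in self.property_map}

    @abstractmethod
    def publish(self, cwm_properties, credentials):
        """Publish to the channel and return a description of the request."""


class TwitterTextAdapter(PublishingAdapter):
    property_map = {"text": "text"}

    def publish(self, cwm_properties, credentials):
        # A real adapter would authenticate and send an HTTP request here.
        return {"channel": "twitter", "payload": self.build_payload(cwm_properties)}


class FacebookTextAdapter(PublishingAdapter):
    property_map = {"text": "message"}

    def publish(self, cwm_properties, credentials):
        # A real adapter would authenticate and send an HTTP request here.
        return {"channel": "facebook", "payload": self.build_payload(cwm_properties)}


results = [adapter.publish({"text": "New summer offer"}, credentials=None)
           for adapter in (TwitterTextAdapter(), FacebookTextAdapter())]
```

The caller treats all adapters alike, and the channel specific naming stays inside each adapter's `property_map`.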

Adapter loading and naming conventions

We designed our adapters and the adapter loading mechanism to achieve the following three goals:

● No adapter duplication. The same functionality should be achieved with the same code. (Minimize codebase, achieve simplest possible code base).

● A common adapter structure for all platforms. Platforms are structured differently. However, for the clarity of dacodi we want adapters to be integrated into the system in a uniform way. Adding the same functionality (e.g. adding an image channel) should be achieved in a similar manner on all platforms.

● Automatic adapter loading and execution. There should be no manual effort involved in adding a new adapter to the system, except for programming the adapter.

To achieve these three goals, we designed the system carefully and introduced naming and loading conventions for our adapters. These are described in the following sections. Figure 5 depicts the motivation for our design. The illustration shows that social media platforms offer more than one way to publish information. Additionally, each user account on a platform gives access to another, similar set of communication channels. For example, someone with two accounts on Twitter has every available Twitter communication channel twice: two text channels, two image channels and so on. The only difference between those channels is the user credentials used to authenticate the post. Different platforms, in turn, allow similar common weaver items to be posted. Adhering to the single responsibility principle (SRP; see footnote 6), we chose to write a separate adapter for each communication channel on each platform. Together with a file naming convention, this also allows us to load and execute adapter classes automatically, without changing configuration files or any additional manual effort. If no adapter class is found for a channel, that channel is simply not supported.
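Convention-based loading can be illustrated with the following Python sketch. It deliberately substitutes a class naming convention within a single module for the file naming convention mentioned above, so that the example stays self-contained; the convention `<Platform><ChannelType>Adapter` and all names are assumptions. In a file-based layout, the lookup would import a module derived from the same convention instead.

```python
class Adapter:
    """Base class; subclasses are named <Platform><ChannelType>Adapter."""

    def publish(self, payload):
        raise NotImplementedError


class TwitterTextAdapter(Adapter):
    def publish(self, payload):
        return f"twitter/text: {payload}"


class FacebookImageAdapter(Adapter):
    def publish(self, payload):
        return f"facebook/image: {payload}"


def adapter_class_name(platform, channel_type):
    """Derive the expected class name from platform and channel type."""
    return f"{platform.capitalize()}{channel_type.capitalize()}Adapter"


def load_adapter(platform, channel_type):
    """Find the adapter by name; None means the channel is not supported."""
    wanted = adapter_class_name(platform, channel_type)
    for cls in Adapter.__subclasses__():
        if cls.__name__ == wanted:
            return cls()
    return None


adapter = load_adapter("twitter", "text")
unsupported = load_adapter("twitter", "video")   # no such class defined
```

Adding a new channel then only requires defining a class with the matching name; no registry or configuration file has to be edited.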

4 http://en.wikipedia.org/wiki/Oauth
5 http://en.wikipedia.org/wiki/Openid
6 http://en.wikipedia.org/wiki/Single_responsibility_principle


Figure 5: Platform as channel groups offering multiple ways to publish information

Three components define a communication channel in dacodi: a channel type, a platform and user credentials. In detail, they are:

● Channel Type: Is a virtual grouping of channels that allow publishing the same common weaver items, e.g. image, video or text. This is depicted in Figure 5. It is virtual, because it is split up into many different adapters to many different platforms but is accessed in a uniform way nonetheless.

● Platform: A grouping of channels that share the same user credentials. An example of a platform, or channel group, is Facebook. The notion of the channel group was introduced because a platform such as Twitter or Facebook actually gives access to more than one communication channel (e.g. video, text, image).

● User credentials: The information needed to authenticate and authorize a client on a certain platform and to associate it with a certain account. In dacodi, user credentials contain the following information: an account id, which associates a channel group with a user (e.g. the Facebook account id 1234 with the dacodi user 27); the authorization token and secret, which store the information required to complete actions on a platform such as posting (e.g. an OAuth2 token associated with the account, or a password); and the consumer key and consumer secret, which identify the application that is about to publish (they can be thought of as the authentication of dacodi itself, assuring the platform that it is really dacodi publishing on the user's behalf). The notion of user credentials was introduced because a user may have multiple accounts on one platform.
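The user credentials described above map naturally onto a small record type. The following Python sketch uses hypothetical field and class names and placeholder values; it only illustrates how one dacodi user can hold several accounts on the same platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserCredentials:
    dacodi_user_id: int    # the dacodi user, e.g. 27
    platform: str          # the channel group, e.g. "facebook"
    account_id: str        # the account on the platform, e.g. "1234"
    token: str             # authorization token (or password)
    token_secret: str
    consumer_key: str      # identifies the publishing application
    consumer_secret: str


class CredentialStore:
    """Stores credentials per (dacodi user, platform)."""

    def __init__(self):
        self._by_user = {}

    def add(self, credentials):
        key = (credentials.dacodi_user_id, credentials.platform)
        self._by_user.setdefault(key, []).append(credentials)

    def accounts(self, dacodi_user_id, platform):
        return list(self._by_user.get((dacodi_user_id, platform), []))


# Placeholder values; one user with two Twitter accounts and one Facebook account.
credential_store = CredentialStore()
credential_store.add(UserCredentials(27, "facebook", "1234", "token-a", "secret-a", "app-key", "app-secret"))
credential_store.add(UserCredentials(27, "twitter", "hotel_main", "token-b", "secret-b", "app-key", "app-secret"))
credential_store.add(UserCredentials(27, "twitter", "hotel_offers", "token-c", "secret-c", "app-key", "app-secret"))
```

Each stored record corresponds to one set of channels on the platform, which is why two Twitter accounts yield two text channels, two image channels and so on.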
