

Research Article

Online Knowledge-Based Model for Big Data Topic Extraction

Muhammad Taimoor Khan,1,2 Mehr Durrani,3 Shehzad Khalid,1 and Furqan Aziz4

1 Bahria University, Shangrilla Road, Sector E-8, Islamabad 44000, Pakistan
2 FAST-NUCES, Industrial Estate Road, Hayatabad, Peshawar 25000, Pakistan
3 COMSATS IIT, Kamra Road, Attock 43600, Pakistan
4 IMSciences, Phase 7, Hayatabad, Peshawar 25000, Pakistan

Correspondence should be addressed to Muhammad Taimoor Khan; taimoormuhammad@gmail.com

Received 4 February 2016; Revised 16 March 2016; Accepted 24 March 2016

Academic Editor: Leo Chen

Copyright © 2016 Muhammad Taimoor Khan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Lifelong machine learning (LML) models learn with experience, maintaining a knowledge-base, without user intervention. Unlike traditional single-domain models, they can easily scale up to explore big data. The existing LML models have high data dependency, consume more resources, and do not support streaming data. This paper proposes an online LML model (OAMC) to support streaming data with reduced data dependency. By engineering the knowledge-base and introducing new knowledge features, the learning pattern of the model is improved for data arriving in pieces. OAMC improves accuracy, as topic coherence, by 7% for streaming data while reducing the processing cost to half.

1. Introduction

Machine learning (ML) topic models are popularly used for different Natural Language Processing (NLP) related tasks. Although they ignore word semantics, topic models provide better results with the least user involvement. Unsupervised topic models are frequently used for domain exploration in many research areas where training data is not available or is expensive to produce. Topic models are statistics-based techniques, evaluating word cooccurrence probabilities, where the results improve with the size of data. The relation-based techniques struggle with the richness of natural language, having to explore all possible forms that a word can have. Lifelong learning models put an end to traditional single-shot learning, which aims at improving the accuracy on single-domain data. Big textual data has large volume, has a variety of domains discussed, is being poured in continuously, and is expected to have inconsistencies and noise. By keeping up with the amount and flow of content, lifelong learning models can help build practical applications.

The unsupervised topic models extract incoherent topics as well, along with the coherent topics [1-5]. Different extensions of topic models are proposed to improve their accuracy. The semisupervised topic models used manually provided seed aspects to help find more relevant product aspects [6-9]. Supervised topic models are trained on a domain to achieve high accuracy on it. Hybrid models are trained on small labeled data to predict suitable initial values for a topic model [10-15]. Knowledge-based topic models are guided by domain experts with knowledge rules instead of seed terms to deal with incoherent topics [16-18]. Transfer-learning based topic models are trained on one domain and tested on another, closely related, single target domain that is known prior to analysis. However, all of these models dealt with improving the accuracy for a single domain. Therefore, they lack scalability and cannot be used for big data consisting of multiple unknown domains. Despite their high accuracy, these models have limited application as they are devised for single-domain analysis.

The lifelong machine learning model takes a completely new dimension by using a model that can process many unknown domains. It incorporates automatically generated knowledge to improve the quality and coherence of topics without requiring any external support. The model maintains its own knowledge-base that grows with experience, and so the model matures in its decisions to attain improved accuracy.

Hindawi Publishing Corporation, Computational Intelligence and Neuroscience, Volume 2016, Article ID 6081804, 10 pages, http://dx.doi.org/10.1155/2016/6081804


LML models are computational systems that process tasks and retain popular patterns from them. This approach is also called human-like learning or never-ending learning in the literature. Lifelong learning has only recently been introduced to NLP tasks. The Never Ending Language Learner (NELL) is considered the first attempt to use lifelong learning for NLP [19]. It extracts information from web sources to extend a knowledge-base in an endless manner, aiming to achieve improved performance with experience. Unlike for other machine learning approaches, there is little research work available on lifelong machine learning for NLP related tasks.

There are four main components of LML models, namely knowledge representation, extraction, transfer, and retention and maintenance, discussed in [20]. These components are strongly connected, and analyzing them for quality improvement is called knowledge engineering. The knowledge learnt is represented through an abstracted set of features. The extraction component focuses on identifying popular patterns and mining them as potential knowledge. The transfer component is responsible for transferring the impact of knowledge into the modeling process when suitable. The retention and maintenance module deals with storing knowledge for future use. The objective is to use knowledge to reduce the number of incoherent topics. Topic incoherence means that the words in a topic do not semantically hold well together. It can be of various types, that is, chained, intruded, random, and unbalanced, as explained by [21]. The chained problem refers to the case where there are word pairs having strong interconnection, forming a chain, but there is no collective relatedness among the words of a topic. Intruded topics have a set of words with good correlation within the set but not fitting well with the rest of the words in the topic. Random topics do not make any sense, while unbalanced topics have a mix of general and specific terms that do not hold well together. With the incoherence problems known, the modules of an LML model can be modified to closely focus on them.

Lifelong learning topic models are introduced to benefit from the knowledge-base while keeping it independent of domain-specific manual tuning. This makes it possible for topic models to mine topics from big data with many domains. The Automatic Knowledge Learning (AKL) model [22] is the first such model that used automatic knowledge learning for topic models. Later, different lifelong learning models were proposed where the process of knowledge extraction and transfer is refined for improved accuracy and performance. However, they still have limitations to be addressed. The existing LML models lack a knowledge maintenance and retention module. Therefore, they have to load topics from all tasks performed to sample relevant knowledge for the current task. The knowledge learnt is limited, initially consisting of must-links as word pairs having strong positive correlation. In later models, cannot-links were also added as word pairs with strong negative correlation. Following a batch processing approach, they expect the big data to be available prior to analysis. If topics of an already performed task are removed, the learning curve of the model is affected. Therefore, existing state-of-the-art models consume more storage and processing resources and do not provide support for streaming data arriving in pieces.

The proposed model is online, that is, processing one domain at a time as they are received. Each domain is processed only once, with the knowledge available at that time. After processing each domain, the knowledge-base is updated. With a knowledge retention module, it does not require loading all the topics from the previously processed domains. This makes the model lightweight without compromising accuracy. The model has low data dependency, where it only has to load the occurrences of relevant knowledge maintained in abstract form. The contributions of this paper are highlighted below.

2. Highlights

(1) With its online approach, it processes the data as it is received in intervals. This makes the proposed model closer to the essence of big data streaming from different sources.

(2) A knowledge retention module is introduced to store knowledge in abstract form. Unlike existing models, which load topics from all domains, it only consults the knowledge-base, thus reducing dependency on data. Therefore, it is efficient in performance and resource utilization.

(3) Must-links and cannot-links are extracted with a single mechanism, that is, normalized Pointwise Mutual Information (PMI) as nPMI [23]. They are stored, maintained, and transferred irrespective of their orientation for improved performance.

(4) Being online, the proposed model has limited data at a time to learn from and is expected to learn low quality knowledge. Knowledge freshness, utility, and confidence are introduced to monitor, prioritize, and even filter out knowledge if it is not contributing considerably.

(5) Knowledge representation and abstraction are improved to resolve word sense disambiguation without consulting all the processed domains. Knowledge rules are supported with a background representing the top topic words from which the rule is mined. Reasonable overlap between the current domain and a knowledge background suggests the same word sense.

3. Related Work

Probabilistic topic models based on Latent Dirichlet Allocation (LDA) extract correlations among words through statistical computations by calculating probabilities of word coexistence [24]. LDA uses Markov Chain Monte Carlo (MCMC) inference to combine words under topics having a thematic relevance. Following the bag-of-words (BOW) approach, it considers word presence and ignores word positions. Although initially introduced with a variational Bayesian technique, mostly Gibbs sampling is used for inference, to converge the analysis towards more probable groupings in a given number of iterations, usually kept high, in the thousands. The accuracy of topic models relies on the initial values for hyperparameters as well; however, they generally perform well for larger datasets. Probabilistic Latent Semantic Analysis (pLSA) [25], having extensions, is also used for topic modeling; however, LDA is preferred for its adjustable priors that closely reflect the nature of specific documents and topics. LDA holds the advantage of identifying and aggregating topics together, whereas these are separate steps in other approaches, for example, dictionary-based and relation-based ones. The extensions of unsupervised LDA used for aspect and sentiment term extraction as topics are [1-5]. The hybrid topic models used for topic extraction and sentiment analysis are [10-15]. The semisupervised LDA models guided with initial seed aspect terms for topic extraction as product aspects are [6-9].
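As a concrete illustration of this workflow, the short sketch below runs an unsupervised LDA model on a toy corpus. It uses the gensim library purely as an example (the cited works use their own Gibbs-sampling implementations), and the toy documents are hypothetical.

# Illustrative unsupervised LDA run on a toy corpus (gensim used as an
# example library; not the implementation of the cited works).
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["battery", "charge", "screen", "bright"],
    ["battery", "life", "charge", "long"],
    ["screen", "resolution", "bright", "picture"],
]  # hypothetical tokenized documents; real runs use thousands of reviews

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

# alpha (and eta) are the Dirichlet hyperparameters discussed above.
lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               alpha="auto", passes=10, random_state=1)
for topic_id, words in lda.show_topics(num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])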

To improve the accuracy of topic models while keeping user involvement to a minimum, knowledge-based semisupervised models are proposed. The purpose of these models is to provide domain-specific knowledge rules that guide the extraction process to reduce the number of incoherent topics extracted. Dirichlet Forest priors based LDA (DF-LDA) was introduced as a knowledge-based topic model to which knowledge rules are provided in the form of must-links and cannot-links [16]. It changed the perspective towards topic extraction, as the accuracy was dependent on the quality of knowledge. The knowledge rules guided the model to put terms under the same or different topics based on the intuition. A must-link increases the probability of a word pair to coexist under the same topic, while a cannot-link works opposite to it. DF-LDA, being transitive, could not handle multiple senses of a word, which was not addressed in their later model [17] either. The knowledge rules are assumed to be accurate, with no refinement mechanism, and are applied with equal priority. The rules were relaxed to sets in [18], having topic labels for words and thus avoiding one-to-one relationships for word coexistence. These semisupervised knowledge-based topic models are manually provided with domain-specific knowledge and cannot scale up to big data. The knowledge provided is assumed to be accurate, there being no mechanism for verifying and rectifying knowledge if provided by mediocre domain experts.

Lifelong machine learning is used for various tasks in the works of [26-29]; however, none of them used LML for topic modeling. Automatic knowledge-based models are introduced to incorporate a knowledge module with topic models so as to have more semantically correlated topics. The automatic LML models learn and apply knowledge automatically, without external support, and can be directly applied to big data. Overlap among different domains is exploited for learning and transferring knowledge. AKL [22] was the first attempt to introduce lifelong machine learning to topic modeling. The model processes all the domains in the dataset with baseline LDA to extract initial topics. The extracted topics are clustered to learn and utilize knowledge. The Lifelong Topic Model (LTM) [30] improved the performance and accuracy of LML topic mining by using Frequent Itemset Mining (FIM) for knowledge extraction and the Generalized Polya Urn (GPU) model for transferring it to future tasks. Manual support was provided to set the threshold for FIM; however, it could not address the problem of mining local knowledge. A large threshold value means mining only global knowledge, while a low value that focuses on local knowledge brings in many irrelevant knowledge rules as well. Both models used only positively correlated words as knowledge, that is, must-links.

In [20], the automatic must-links cannot-links (AMC) model is proposed to address the problems of earlier automatic LML models. Unlike other LML models, it used both must-link and cannot-link knowledge types. The knowledge rules are extracted with FIM with multiple supports (MS-FIM) to consider both global and local knowledge while avoiding irrelevant knowledge rules. Since the cannot-links are believed to be very numerous, due to the large size of the vocabulary, the model only extracts cannot-links for the current domain. In order to ensure their quality, the must-links and cannot-links are verified from topics of all domains. For transferring the learnt knowledge into the inference technique, the Multigeneralized Polya Urn (M-GPU) model is used. For a word drawn from a topic, if there are relevant must-links, all the paired words get a push up in the topic, whereas a penalty is applied to the paired words in cannot-links. M-GPU provides the bias supported by knowledge in order to help distributions with topics having semantically correlated words. The transitivity problem is addressed with a graph connecting knowledge rules that share a common word used in the same sense.

The existing state-of-the-art LML models follow a batch processing approach. They expect all the domains to be available prior to analysis. The big data is processed with baseline LDA to extract topics in a first pass. In the second pass, each domain is processed again with knowledge. Topics from all domains are loaded for each task to learn relevant knowledge that is verified from many domains. This makes these models highly dependent on data, as they require topics from all domains to be available at all times. All the knowledge rules extracted are treated equally, whereas some knowledge rules have high confidence while others marginally cross the threshold. There is no mechanism to prioritize knowledge rules. These models lack a knowledge maintenance module and have to load topics from all domains for each task, thus consuming more resources.

4. Proposed Model (OAMC)

An online LML model (OAMC) is proposed to address the issues discussed. It is more closely related to the essence of big data analysis, expecting data to arrive continuously in chunks. It is less dependent on data, therefore consuming limited storage and processing resources. The efficiency of the model is associated with the knowledge-base instead of topics from all domains. With a newly introduced knowledge maintenance and retention module, the harvested knowledge rules are stored in an abstract form. The knowledge-base is maintained after each task to keep it consistent. New knowledge rules are mined while existing rules are updated, or even filtered, as a continuous process. Additional features of knowledge are introduced to help maintain and prioritize them. The working of the model is shown in Figure 1. With the online sources producing textual content continuously, the proposed model can keep up with the processing needs of the data in hand. This makes it more practical for developing real-world monitoring and analysis applications.


OAMCModel(Dt, V, N, knowledge-base)
(1) procedure OAMCModel(Dt, V, N, knowledge-base)
(2)   relRules ← sampleRelevantKnowledge(V, knowledge-base)
(3)   if (|relRules| = 0) then ▷ without knowledge
(4)     topics ← LDA(Dt, V, N, 0)
(5)   end if
(6)   if (|relRules| > 0) then ▷ with relevant knowledge
(7)     topics ← GibbsSampler(Dt, V, N, relRules)
(8)   end if
(9)   rules_refined ← refineRules(Dt, V, topics, relRules, knowledge-base)
(10)  rules_new ← mineRules(Dt, topics)
(11)  knowledge-base ← rules_refined ∪ rules_new
(12) end procedure

Algorithm 1: OAMC model.

[Figure 1 shows the processing pipeline: for each of Domain 1 to Domain N (M docs each), topics are extracted (baseline LDA for the first domain, OAMC afterwards), and a learning module after each domain updates the shared knowledge-base.]

Figure 1: Proposed model (OAMC) using online knowledge extraction and transfer mechanism.

The OAMC model can operate both with and without knowledge. In case there is no relevant knowledge available, the model operates as baseline LDA. However, as the model matures, its knowledge-base grows to support a variety of future tasks. With relevant knowledge, the model shows improved performance by grouping semantically related words under topics with the help of knowledge learnt from previous tasks. The working of the OAMC model is shown in Algorithm 1. The set of documents in the current domain is Dt = {D1, D2, ..., Dn}, V is the vocabulary of the current task, while N is the number of iterations for the Gibbs sampler [13] used as the inference technique. The knowledge-base holds the knowledge rules, a variety of must-links and cannot-links learnt through experience. In step (2), only those must-links and cannot-links are selected that are relevant to the current task, that is, both words of the must-link or cannot-link exist in the vocabulary of the current domain. The rules relevant to the current domain are stored as relRules. If no relevant rules are found, the current task is processed with baseline LDA, as shown in step (4); the 0 there means that no relevant knowledge is found for the current task, which may happen at low experience, when the knowledge-base has limited rules and the current task is about a completely different subject. When relevant rules are found, step (7) uses the Gibbs sampler [20] to incorporate knowledge to guide the inference. The knowledge-base maintains itself after each task to stay consistent. Step (9) refines relRules, growing their confidence and utility as a token of their services for the current domain. All rules in the knowledge-base get a task older. As part of the continuous learning process, the model harvests new knowledge rules after processing each task in step (10). The new rules are merged with the existing rules and added to the knowledge-base, which is then ready to process a new task.
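The control flow of Algorithm 1 can be summarized in a few lines of Python. This is a minimal sketch: the four injected callables stand in for the inference and learning components described in the text and are assumptions, not the authors' implementation.

# Sketch of one online task (Algorithm 1); helper callables are hypothetical.
from typing import Callable, Dict, List, Set, Tuple

Rule = Tuple[str, str]  # word pair; the signed confidence gives orientation

def relevant_rules(vocab: Set[str], kb: Dict[Rule, float]) -> Dict[Rule, float]:
    # Step (2): a rule is relevant iff both its words occur in the vocabulary.
    return {p: c for p, c in kb.items() if p[0] in vocab and p[1] in vocab}

def oamc_task(docs: List[List[str]], n_iter: int, kb: Dict[Rule, float],
              baseline_lda: Callable, guided_gibbs: Callable,
              refine_rules: Callable, mine_rules: Callable):
    vocab = {w for doc in docs for w in doc}
    rel = relevant_rules(vocab, kb)
    if not rel:
        topics = baseline_lda(docs, n_iter)          # steps (3)-(5)
    else:
        topics = guided_gibbs(docs, n_iter, rel)     # steps (6)-(8)
    kb.update(refine_rules(docs, topics, rel, kb))   # step (9)
    kb.update(mine_rules(docs, topics))              # steps (10)-(11)
    return topics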

4.1. Knowledge Representation. Additional features are added to knowledge rules to help maintain the knowledge-base and keep it consistent, by continuously weeding out wrong or irrelevant knowledge. In previous models, the knowledge rules are maintained as separate lists of must-links and cannot-links, each with a word pair and a confidence value. Our OAMC model stores them together as knowledge rules, irrespective of their orientation. The orientation is instead communicated through a positive or negative sign on the confidence value. New rule features are added, namely utility, freshness, and background, to strictly monitor the contribution of each rule. To survive longer, a knowledge rule kRule(w1, w2) needs to maintain high utility in the tasks that follow. The confidence of a rule is also expected to be reassured by future domains, which adds to the net confidence. The utility and freshness of a rule maintain a check on its contribution and relevance. A rule with high confidence but low usage and freshness is also expected to be filtered. These newly added features keep the knowledge-base consistent, with quality knowledge. The background feature helps to resolve word sense disambiguation by providing the context of the knowledge rule. A knowledge rule in the knowledge-base is stored as shown in Figure 2.
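A sketch of such a rule record is given below, assuming hypothetical field names; the sign of the confidence encodes the orientation, and the background holds the top words of the source topic.

# Illustrative record for a knowledge rule (field names are assumptions).
from dataclasses import dataclass, field
from typing import FrozenSet

@dataclass
class KRule:
    w1: str
    w2: str
    confidence: float   # net nPMI confidence; sign encodes orientation
    utility: int = 1    # tasks in which the rule contributed
    age: int = 1        # tasks since the rule was mined; freshness = 1 / age
    background: FrozenSet[str] = field(default_factory=frozenset)
                        # top words of the topic the rule was mined from

    @property
    def is_must_link(self) -> bool:
        return self.confidence > 0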

4.2. Knowledge Extraction. The proposed model has access to only limited big data for a task, that is, the domain of the current task.


MineRules(Dt, N, topics)
(1) procedure MineRules(Dt, N, topics)
(2)   for (t : all topics) do
(3)     for (i, j : all word pairs in a topic) do
(4)       if (nPMI(i, j) < μ1) then
(5)         Rules ← Rule(i, j, nPMI(i, j), topic) ▷ learn as negative rule
(6)       end if
(7)       if (nPMI(i, j) > μ2) then
(8)         Rules ← Rule(i, j, nPMI(i, j), topic) ▷ learn as positive rule
(9)       end if
(10)    end for
(11)  end for
(12) end procedure

Algorithm 2: Mine rules.

[Figure 2 shows a knowledge rule kRule(w1, w2) in the knowledge-base together with its features: confidence, utility, freshness, orientation, and background.]

Figure 2: Representation of a knowledge rule as a word pair in the knowledge-base. Utility, freshness, and background are the newly introduced knowledge features to monitor their quality.

Therefore, the model does not have enough instances to verify the quality of knowledge and is expected to learn low quality knowledge as well. A knowledge rule can be of low quality if it has enough support in the data from which it was learnt but is not supported by future tasks. It may happen due to noise or high bias in a domain. The refinement technique deals with it until it is removed. In order to extract new knowledge rules kRule(w1, w2), normalized PMI [23], as nPMI(w1, w2), is used, as given in

kRule(w1, w2) = nPMI(w1, w2) = PMI(w1, w2) / (−log p(w1, w2)),   (1)

where nPMI gives a value in the [−1, 1] range. All candidate word pairs are evaluated through nPMI. For a word pair, if the score is close to 1, the knowledge rule (must-link) is stored with its nPMI score as confidence. A word pair having it close to −1 is stored as a negative rule (cannot-link), with its nPMI score as confidence. PMI in (1) can be evaluated as

PMI(w1, w2) = log( p(w1, w2) / (p(w1) p(w2)) ),   (2)

where p(w1, w2) is the coexistence probability of w1 and w2, while p(w1) and p(w2) are the probabilities of their existence in the current domain. Coexistence is the presence of two words in a conceptual document, irrespective of their frequency. Calculating the coexistence probability and the individual probabilities is shown in (3), where Dt is the set of all documents while |Dt| is the total number of documents:

p(w1, w2) = |Dt(w1, w2)| / |Dt|,   p(w) = |Dt(w)| / |Dt|.   (3)

Freshness and utility are initialized, while the topic of the word pair is added to the background of the rule. The two thresholds μ1 and μ2 are set to harvest word pairs with nPMI scores at either of the two extremes, as given in

nPMI(w1, w2) < μ1 or nPMI(w1, w2) > μ2 ⇒ learn; otherwise ignore.   (4)

When nPMI gives a value that is too high, that is, close to 1, or too low, that is, close to −1, it is considered to be of significance. This untypical behavior is learnt as knowledge, while values close to zero are ignored, as they have not shown any considerable correlation. Word pairs having nPMI below μ1 are learnt as knowledge (cannot-links), while values above μ2 are also learnt (must-links). The threshold values are mentioned in Setup (Section 5.1). Our contribution in knowledge extraction is to use nPMI(w1, w2) as a very lightweight extraction mechanism. Secondly, both must-links and cannot-links are extracted together in a single step. MS-FIM was previously used to harvest knowledge, which was causing a performance bottleneck. The rule mining mechanism is given in Algorithm 2. It generates word pairs within each topic and evaluates them for possible knowledge from step (2) to step (11). Satisfying either of the thresholds in step (4) and step (7), a word pair is mined as a negative or positive knowledge rule.
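A minimal sketch of (1)-(4) and Algorithm 2 follows, computing nPMI from document-level coexistence counts; the thresholds follow the values stated in Section 5.1, and the function names are illustrative rather than the authors' code.

# Sketch of nPMI-based rule mining, (1)-(4) and Algorithm 2.
import math
from itertools import combinations
from typing import Dict, List, Set, Tuple

MU1, MU2 = -0.9, 0.8   # cannot-link / must-link thresholds (Section 5.1)

def npmi(w1: str, w2: str, docs: List[Set[str]]) -> float:
    """Normalized PMI over document-level coexistence, as in (1)-(3)."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum((w1 in d) and (w2 in d) for d in docs) / n
    if p12 == 0.0:
        return -1.0        # never coexist: maximal negative association
    if p12 == 1.0:
        return 1.0         # degenerate case: both words in every document
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def mine_rules(topics: List[List[str]],
               docs: List[Set[str]]) -> Dict[Tuple[str, str], float]:
    """Algorithm 2: evaluate word pairs within each topic, keep the extremes."""
    rules = {}
    for top_words in topics:
        for w1, w2 in combinations(top_words, 2):
            score = npmi(w1, w2, docs)
            if score < MU1 or score > MU2:   # equation (4): learn, else ignore
                rules[(w1, w2)] = score      # signed score = confidence
    return rules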

4.3. Knowledge Transfer. The knowledge transfer mechanism selects relevant knowledge rules each time a word is sampled to have its probability increased under a topic. The effect of relevant knowledge rules is added as a bias into the topic model, to push positively correlated words up under the same topic and vice versa. Since the current task has limited data, knowledge from previous tasks is incorporated to guide the inference. For a sampled word w1 in topic t, if a rule exists for it, then the associated word w2 has its probability updated in the current topic t, as shown in

p(rule_i = rule | t) ∝ p(w1 | t) × p(w2 | t),   (5)

where w1 and w2 are the two words of a rule rule_i (must-link or cannot-link). When the model inference increases the probability of a word w1 for a topic t, the probability of w2 is increased for topic t in the case of a must-link and decreased for a cannot-link. To address word sense disambiguation, the following condition is proposed:

|V ∩ ruleBackground| / |ruleBackground| > ξ.   (6)

Here ξ is the threshold that has to be satisfied for the overlap between the vocabulary of the current task and the rule background. The rule background represents the set of words in the topic from which the rule was learnt. Comparing the rule background with the current vocabulary ensures that the rule is used in the same context in which it was mined. For example, the word Bank has different contexts when used with Accounts and Sea, respectively. Topics of all processed tasks were previously used for the same purpose. The net confidence and utility of a knowledge rule are used to transfer the impact of the knowledge rule in topic sampling. The Multigeneralized Polya Urn (M-GPU) model [20] is used to transfer the effect of knowledge into the current state of the Markov chain in the Gibbs sampler in

p(w | t) ∝ (Σ_{w′=1..V} (υ_{w,w′} ρ) n_{t,w′} + β) / (Σ_{v=1..V} (Σ_{w′=1..V} (υ_{v,w′} ρ) n_{t,w′} + β)).   (7)

Equation (7) shows the amount of update in the probability, by a factor υ_{w,w′} ρ, for the other word in a knowledge rule. When the probability of a sampled word is increased for a topic using the rules, the probability of all the words associated with the sampled word is also increased. The sampling equation is updated to incorporate both positively and negatively correlated rules in a single step as

p(z_i = t | z_{−i}, w, α, β, λ) ∝ [(n_{d,t,−i} + α) / Σ_{t′=1..T} (n_{d,t′,−i} + α)] × [(Σ_{w′:(w′,w_i)∈rules} (υ_{w′,w_i} ρ) n_{t,w′,−i} + β) / (Σ_{v=1..V} (Σ_{w′:(w′,v)∈rules} (υ_{w′,v} ρ) n_{t,w′,−i} + β))].   (8)

For a sampled word w, the other word w′ sharing a rule with it gets a push as strong as the net confidence, in the direction of its orientation. The objective is to increase the probability of positively correlated words appearing together as top words under the same topic and vice versa.
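The two ingredients of the transfer step can be sketched as follows: the background-overlap test of (6) and a multiplicative reweighting of a topic's word counts in the spirit of (7). The promotion factor and the names are assumptions for illustration, not the exact M-GPU update.

# Sketch of the word-sense check (6) and a knowledge bias in the spirit of (7).
from typing import Dict, FrozenSet, Set

XI = 0.3   # vocabulary/background overlap threshold (Section 5.1)

def same_sense(vocab: Set[str], background: Set[str], xi: float = XI) -> bool:
    """Equation (6): apply a rule only if its background overlaps the current
    vocabulary enough, i.e. it was mined under the same word sense."""
    if not background:
        return False
    return len(vocab & background) / len(background) > xi

def biased_counts(counts: Dict[str, float], word: str,
                  rules: Dict[FrozenSet[str], float]) -> Dict[str, float]:
    """When `word` is promoted in a topic, scale the (pseudo-)count of every
    word sharing a rule with it by the rule's signed net confidence."""
    out = dict(counts)
    for pair, conf in rules.items():
        if word in pair:
            (other,) = pair - {word}        # the second word of the pair
            if other in out:
                out[other] *= 1.0 + conf    # conf > 0 pushes up, conf < 0 down
    return out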

Table 1: OAMC positive and negative knowledge rules with high utility.

Type | Knowledge
+ive | (tech, support), (cable, usb), (pro, con), (customer, support), (battery, charge), (video, card), (high, long), (connection, vga), (operating, system), (monitor, macbook), (mac, support), (cd, dvd), (windows, xp), (money, worth), (inch, picture), (input, output), (friend, mine), (big, deal)
−ive | (sound, money), (feature, battery), (battery, price), (feature, battery), (price, device), (screen, sound), (review, screen), (worse, digital), (price, easy), (gas, screen), (signal, easy), (home, gps), (hotel, traffic), (driver, cooler), (monitor, processor), (monitor, speed)

4.4. Knowledge Maintenance and Retention. Supported with a knowledge maintenance and retention module, useful knowledge rules are stored for as long as they justify their existence. The knowledge-base is refined after each task to keep it consistent, and all rules get older by a task. The relevant knowledge rules grow in confidence, supported by the current domain, and gain higher utility. After updating the state of each knowledge rule, the rules are passed through a threshold again to justify their existence, as proposed in

kRule(w1, w2): removed if μ1 < (υ × ρ)/λ < μ2; retained otherwise,   (9)

where υ is the net confidence, ρ is the utility, and 1/λ is the freshness of a knowledge rule. If the confidence of a knowledge rule in the current task is above its net confidence υ, the current domain is allowed to update the knowledge background to better represent its context, as proposed in

kRule(w1, w2): background updated if conf > υ; unchanged otherwise,   (10)

where the background is updated by adding words from the supporting topic in the current task. New rules are mined while existing rules evolve, as a continuous process. The newly added knowledge features are used to rank and prioritize knowledge rules. The difference in rank between good and bad knowledge rules widens as the model matures with experience. Algorithm 3 (given below) is used to refine existing knowledge, with step (2) collecting the relevant knowledge as Rules′. In steps (4) to (7), all relevant rules get their net confidence updated and their utility incremented. Step (8) has all rules aged by a task. Some of the popular rules from the knowledge-base of the OAMC model are shown in Table 1.
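Before the formal listing (Algorithm 3 below), the maintenance pass can be sketched as follows, reusing the KRule record sketched in Section 4.1; the weighting in the retention test follows (9), and all names are illustrative.

# Sketch of Algorithm 3 plus the retention test (9); KRule as in Section 4.1.
from typing import Dict, Iterable, Tuple

def refine_and_retain(kb: Dict[Tuple[str, str], "KRule"],
                      used: Iterable[Tuple[str, str]],
                      task_scores: Dict[Tuple[str, str], float],
                      mu1: float = -0.9, mu2: float = 0.8) -> None:
    for pair in used:                        # steps (4)-(7): reward used rules
        kb[pair].confidence += task_scores.get(pair, 0.0)
        kb[pair].utility += 1
    for pair, rule in list(kb.items()):
        rule.age += 1                        # step (8): every rule gets a task older
        weighted = rule.confidence * rule.utility / rule.age
        if mu1 < weighted < mu2:             # equation (9): weak rules are removed
            del kb[pair]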

RefineRules(Dt, N, topics, V, Rules)
(1) procedure RefineRules(Dt, N, topics, V, Rules)
(2)   Rules′ ← relevantRules(V, knowledge-base) ▷ set of relevant rules
(3)   for (r : all Rules) do
(4)     if (Rules′ contains r) then
(5)       υ ← υ + nPMI(r) ▷ add to cumulative confidence
(6)       ρ ← ρ + 1 ▷ rule's utility increased by 1
(7)     end if
(8)     λ ← λ + 1 ▷ the rule gets a task older
(9)   end for
(10) end procedure

Algorithm 3: Refine rules.

5. Experimental Results

The proposed model (OAMC) contributes to the application of lifelong learning to NLP tasks, specifically used to improve topic coherence for big streaming data. Various experiments are performed to highlight the improvement in accuracy and performance of lifelong topic mining, extracting topics as product aspects. Unlike traditional ML models, it operates on big data with a variety of domains. The standard dataset used by previous LML models is used to evaluate the proposed model in comparison to the other state-of-the-art models. The data consists of 50 domains, each an electronic product. To highlight the improvement in the OAMC model, the dataset was provided with a decreasing number of domains. The models are evaluated on topic coherence for accuracy and time in minutes for performance. The number of knowledge rules is used as the quality of knowledge, where fewer rules producing better results are considered to be of better quality.

5.1. Setup. To set up the test environment, an HP Pavilion DM4 machine is used, with an Intel 2.4 GHz Core i5, 8 GB RAM, and a 500 GB hard drive. Following the evaluation approach of existing models, 15 topics are extracted per domain for all models, and, as in previous models, the top 30 words per topic are considered to calculate the topic coherence of the proposed model also. The hyperparameters are kept the same for all models. The number of Gibbs sampler iterations is kept the same (2000) as used in previous models. The models are configured and initialized as in their original work. The proposed OAMC has its thresholds μ1 and μ2 set to −0.9 and 0.8, respectively, to learn knowledge and select relevant knowledge for a given task. Since negative correlations are far more numerous than positive rules, the thresholds are set to focus more on positive rules. In order to choose relevant rules, the overlap threshold between the vocabulary and the knowledge background is set to 0.3.
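For reference, the settings stated above can be gathered in one place; the variable names are illustrative.

# Experimental settings as stated in Section 5.1 (names are illustrative).
OAMC_CONFIG = {
    "topics_per_domain": 15,     # topics extracted per domain, all models
    "top_words_per_topic": 30,   # words used to compute topic coherence
    "gibbs_iterations": 2000,    # same for all compared models
    "mu1": -0.9,                 # nPMI threshold for cannot-links
    "mu2": 0.8,                  # nPMI threshold for must-links
    "xi": 0.3,                   # vocabulary/background overlap threshold
}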

5.2. Results. The accuracy of lifelong models, and of topic models in general, is evaluated through topic coherence, which has a high correlation with human evaluation. All of the models are given the same number of documents in a domain for both learning and evaluation. The existing models drop in accuracy when they have to learn from a smaller dataset, with documents in the hundreds per domain, while compromising accuracy when evaluated on a larger dataset with thousands of documents. Therefore, the previous models were given large datasets to learn from and were evaluated on smaller datasets. However, for fairness to all models, here they are provided with the test set only, to learn from and to test on.

[Figure 3 plots topic coherence against the number of domains available (50 down to 1) for LDA, LTM, AMC-M, AMC, and OAMC; coherence values range from about −1000 to −700.]

Figure 3: The proposed model (OAMC) produces the highest accuracy (as topic coherence) among lifelong learning models for streaming big data.

The topic coherence of the compared models has an interesting outcome. LTM surprisingly could not maintain a better topic coherence when it had to learn from a smaller dataset; it even dropped below the baseline. AMC and AMC-M performed well; however, the proposed OAMC beat them when the number of domains in the dataset dropped below 10, as shown in Figure 3. AMC and AMC-M hardly learn anything with fewer domains, which makes them ineffective on streaming data. Therefore, their accuracy continuously went down towards the baseline. On the other hand, LTM had its learning pattern badly affected by the limited variety of domains and learnt wrong knowledge, which dropped its accuracy. The knowledge rules learnt by AMC and AMC-M were 100 or fewer per domain for less than 10 domains in the dataset, growing to more than 2000 for 50 domains. This is because of the overreliance of the existing models on large volumes of data being available at once, which makes them least effective on streaming data. It shows that they only learn better knowledge and produce better results when there are many domains in the dataset and they are all available prior to analysis. With a robust knowledge storing mechanism, OAMC learnt more knowledge, refined it, and used it to improve topic coherence with few domains available. The OAMC model shows on average a 7%, and a maximum of 10%, improvement in topic coherence when there is one domain at a time.

[Figure 4 is a bar chart of processing time in minutes (roughly 0 to 140) for LTM, AMC-M, AMC, and OAMC.]

Figure 4: The proposed model (OAMC) has the highest performance efficiency among LML topic models.

OAMC learns wrong knowledge too, as it considers only the current domain when harvesting it. However, with a robust knowledge retention module, after performing only a few tasks, only good quality knowledge survives the filtering mechanism. Knowledge rules with less confidence from future tasks or limited utility are pushed down and finally filtered out. Knowledge that gets its confidence reenforced by future tasks and has growing utility stays longer. To evaluate the relevance of a knowledge rule to the current task, the domain of the current task must have the rule words' confidence above threshold and an overlap with the rule background higher than threshold. The online OAMC model truly learns like humans, having no idea of future tasks, unlike existing models. Its learning is dependent only on past experience, as the domains processed, and uses it for a future task. It is less dependent on data, parses it once, and consults the knowledge module with abstract rules only to sample relevant rules; it therefore shows an improvement of up to 50% in performance, as shown in Figure 4.

For the discussed intrinsic evaluation, the models are also compared through PMI mean and median values, which had relevance to the topic coherence values. The extrinsic WordNet based techniques used in the literature (Wu-Palmer, Resnik, Path, Lin, Leacock-Chodorow, Lesk, Jiang-Conrath, and Hirst-StOnge) were inconclusive in differentiating among the qualities of the topics extracted, due to the specialized nature of the electronic products domains.

5.3. Evaluation. OAMC improving the performance of AMC by 50%, as shown in Figure 4, is attributed to the following reasons. OAMC does not extract relevant knowledge anew for each task and rather chooses from a well maintained knowledge module. Each domain is parsed only once. With a knowledge filtering mechanism, only good quality knowledge rules are maintained. There are generally two performance bottlenecks in existing LML models, that is, FIM for knowledge extraction and the high number of Gibbs sampling iterations. The OAMC model uses nPMI [23], which is highly efficient, to extract knowledge. Secondly, it learns both must-links and cannot-links together, with a single mechanism, as rules. Even with 50 domains in the dataset, OAMC performed better than AMC for certain individual domains by producing higher topic coherence while using fewer knowledge rules, as shown in Figure 5. Table 1 shows some of the most frequently used knowledge rules of OAMC. Through manual evaluation of topics from all the domains, it was observed that the knowledge learnt by LML models is more effective towards the random and intruded types of topic incoherence but less effective towards the chained and unbalanced types. In fact, the knowledge-based models can at times enhance chained topic incoherence if there are many knowledge rule pairs sharing common words.

5.4. Knowledge Analysis. The accuracy of LML models depends upon the quality of their knowledge. The previous models did not provide a discussion on the quality of knowledge; in fact, improved topic coherence was taken as the evidence of good quality knowledge. The knowledge rules shown in Table 1 have utility above 10 as a token of their quality. The ratio of use of knowledge rules per domain by AMC and OAMC is 5:1, signifying the quality of knowledge produced by OAMC. The net confidence of the rules also shows that most of the relevant domains enforced the rule and increased its confidence (above 0.8, with 1 as maximum). Since this information is not available for the existing models, the values cannot be matched. It is still very difficult to explore the impact of each knowledge rule separately; however, the discussed features in general keep track of good quality knowledge. The knowledge learnt is of two types only, capturing positive and negative word correlations. However, more varieties of knowledge can be introduced for further insights. By introducing "the type of relationship" as knowledge, the unbalanced topic incoherence problem can be addressed. Along with the correlation between words, the nature of their correlation may also be explored for better analysis. This type of knowledge is also useful for finding hierarchical topics. The knowledge rules extracted are isolated, while knowledge as a complex network can be more effective for a variety of analyses. AMC uses a small graph only to associate must-links sharing a common word in the same context. Knowledge extraction, transfer, representation, and retention were only introduced to NLP tasks in 2010 and require mature ideas from other fields of research.

[Figure 5 has two bar charts over individual domains (AlarmClock, Battery, CDPlayer, Fan, Lamp) comparing OAMC and AMC: (a) topic coherence for each domain, with values between about −760 and −820; (b) number of knowledge rules used for each domain, from a few hundred for OAMC to about 1500 for AMC.]

Figure 5: The proposed model (OAMC), in comparison to the state-of-the-art LML model, (a) produces better accuracy as topic coherence for individual domains while (b) using fewer knowledge rules.

6. Conclusion

LML models are structured to process big data having many domains, and therefore they are the only suitable solution despite their shortcomings. LML models were recently introduced to big data topic extraction and are far from mature. The existing LML topic models had limited application due to following a batch processing approach and their overreliance on data. Our proposed model, named OAMC, follows an online approach to LML topic modeling that can support streaming data. With an efficient learning mechanism, our OAMC model has a lower dependency on data and therefore improved performance, by 50%, as compared to state-of-the-art LML models. Velocity is an important feature of big data, as data is believed to be received continuously at high speed. The OAMC model is designed to process only the data in hand and therefore outperforms existing models on streaming data. As the data is expected to have noise, OAMC is supported with a strict monitoring and filtering mechanism to maintain a consistent knowledge-base. The rules are made to compete for survival based on utility, freshness, and net confidence. OAMC, being blind to the future, gets higher bumps in accuracy when the current task has low relevance to past experience. This is very common to a human-like learning approach. However, the proportion of tasks for which limited relevant knowledge is available drops considerably with experience. The OAMC model shows a 7% improvement in accuracy on average for streaming big data, and up to 10% for certain individual domains, as compared to the state of the art.

In the future, we will be working on introducing more varieties of knowledge. Through complex network analysis, the knowledge rules can be stored in association with each other. Cooccurrence can mean many things, and therefore the relationship type should also be considered in the future. It will help to associate aspects with products, sentiments with aspects, and reasons with sentiments. To improve the performance and attempt near real-time analysis, Gibbs sampling should be replaced with a lighter inference technique.

Disclosure

This research is the authors' own work, not published or under review elsewhere.

Competing Interests

The authors declare that they have no competing interests.

Authors' Contributions

In this research work, Muhammad Taimoor Khan (corresponding author) developed the model, while Mehr Durrani helped with adjusting the knowledge learning and transfer mechanism to better adapt to the current task. Furqan Aziz contributed by incorporating the knowledge maintenance module for high efficiency and reduced data dependency. Shehzad Khalid supervised the activity and contributed by running experiments and compiling results. Shehzad Khalid also helped with introducing the new knowledge features and with prioritizing them.

Acknowledgments

The authors are thankful to Mr. Yousaf Marwat, Lecturer at COMSATS IIT, Attock, for facilitating editing and proofreading services.

References

[1] S. Branavan, H. Chen, J. Eisenstein, and R. Barzilay, "Learning document-level semantic properties from free-text annotations," Journal of Artificial Intelligence Research, vol. 34, pp. 569-603, 2009.
[2] S. Brody and N. Elhadad, "An unsupervised aspect-sentiment model for online reviews," in Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '10), pp. 804-812, Association for Computational Linguistics, Los Angeles, Calif, USA, June 2010.
[3] F. Li, M. Huang, and X. Zhu, "Sentiment analysis with global topics and local dependency," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI '10), Atlanta, Ga, USA, July 2010.
[4] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai, "Topic sentiment mixture: modeling facets and opinions in weblogs," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 171-180, ACM, Alberta, Canada, May 2007.
[5] I. Titov and R. McDonald, "Modeling online reviews with multi-grain topic models," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 111-120, ACM, Beijing, China, April 2008.
[6] Y. Lu, C. X. Zhai, and N. Sundaresan, "Rated aspect summarization of short comments," in Proceedings of the 18th International World Wide Web Conference (WWW '09), pp. 131-140, ACM, April 2009.
[7] A. Mukherjee and B. Liu, "Aspect extraction through semi-supervised modeling," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL '12), vol. 1, pp. 339-348, Association for Computational Linguistics, July 2012.
[8] I. Titov and R. T. McDonald, "A joint model of text and aspect ratings for sentiment summarization," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, vol. 8, pp. 308-316, Association for Computational Linguistics, Columbus, Ohio, USA, 2008.
[9] H. Wang, Y. Lu, and C. Zhai, "Latent aspect rating analysis on review text data: a rating regression approach," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), pp. 783-792, ACM, July 2010.
[10] T. L. Griffiths, M. Steyvers, D. M. Blei, and J. B. Tenenbaum, "Integrating topics and syntax," in Advances in Neural Information Processing Systems, pp. 537-544, MIT Press, Cambridge, Mass, USA, 2004.
[11] Y. Jo and A. H. Oh, "Aspect and sentiment unification model for online review analysis," in Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM '11), pp. 815-824, ACM, Hong Kong, February 2011.
[12] J. Liu, Y. Cao, C. Y. Lin, Y. Huang, and M. Zhou, "Low-quality product review detection in opinion summarization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 334-342, Prague, Czech Republic, June 2007.
[13] Y. Lu and C. Zhai, "Opinion integration through semi-supervised topic modeling," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 121-130, ACM, Beijing, China, April 2008.
[14] C. Sauper, A. Haghighi, and R. Barzilay, "Content models with attitude," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, vol. 1 of Human Language Technologies, pp. 350-358, Association for Computational Linguistics, 2011.
[15] W. X. Zhao, J. Jiang, H. Yan, and X. Li, "Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 56-65, Association for Computational Linguistics, October 2010.
[16] D. Andrzejewski, X. Zhu, and M. Craven, "Incorporating domain knowledge into topic modeling via Dirichlet Forest priors," in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 25-32, ACM, June 2009.
[17] D. Andrzejewski, X. Zhu, M. Craven, and B. Recht, "A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic," in Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI '11), pp. 1171-1177, Barcelona, Spain, July 2011.
[18] Z. Chen, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh, "Discovering coherent topics using general knowledge," in Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM '13), pp. 209-218, ACM, November 2013.
[19] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell, "Toward an architecture for never-ending language learning," in Proceedings of the 24th AAAI Conference on Artificial Intelligence, vol. 5, p. 3, Atlanta, Ga, USA, July 2010.
[20] Z. Chen and B. Liu, "Mining topics in documents: standing on the shoulders of big data," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pp. 1116-1125, ACM, New York, NY, USA, August 2014.
[21] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum, "Optimizing semantic coherence in topic models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262-272, Association for Computational Linguistics, July 2011.
[22] Z. Chen, A. Mukherjee, and B. Liu, "Aspect extraction with automated prior knowledge learning," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL '14), pp. 347-358, Baltimore, Md, USA, June 2014.
[23] G. Bouma, "Normalized (pointwise) mutual information in collocation extraction," in Proceedings of the Conference of the German Society for Computational Linguistics (GSCL '09), pp. 31-40, Potsdam, Germany, 2009.
[24] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," The Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993-1022, 2003.
[25] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 50-57, ACM, Berkeley, Calif, USA, August 1999.
[26] E. Eaton and P. L. Ruvolo, "ELLA: an efficient lifelong learning algorithm," in Proceedings of the 30th International Conference on Machine Learning (ICML '13), pp. 507-515, 2013.
[27] E. Kamar, A. Kapoor, and E. Horvitz, "Lifelong learning for acquiring the wisdom of the crowd," in Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI '13), pp. 2313-2320, AAAI Press, Beijing, China, August 2013.
[28] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, "Self-taught learning: transfer learning from unlabeled data," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 759-766, ACM, June 2007.
[29] D. L. Silver, Q. Yang, and L. Li, "Lifelong machine learning systems: beyond learning algorithms," in Proceedings of the AAAI Spring Symposium: Lifelong Machine Learning, Stanford University, March 2013.
[30] Z. Chen and B. Liu, "Topic modeling using topics from many domains, lifelong learning and big data," in Proceedings of the 31st International Conference on Machine Learning (ICML '14), pp. 703-711, Beijing, China, June 2014.


LML models are computational systems that process tasks and retain popular patterns from them. This approach is also called human-like learning or never-ending learning in the literature. Lifelong learning has only recently been introduced to NLP tasks. The Never Ending Language Learner (NELL) is considered the first attempt to use lifelong learning for NLP [19]. It extracts information from web sources to extend a knowledge-base in an endless manner, aiming to achieve improved performance with experience. Unlike other machine learning approaches, there is little research work available on lifelong machine learning for NLP related tasks.

There are four main components of LML models, namely knowledge representation, extraction, transfer, and retention and maintenance, discussed in [20]. These components are strongly connected, and analyzing them for quality improvement is called knowledge engineering. The knowledge learnt is represented through an abstracted set of features. The extraction component focuses on identifying popular patterns and mining them as potential knowledge. The transfer component is responsible for transferring the impact of knowledge into the modeling process when suitable. The retention and maintenance module deals with storing knowledge for future use. The objective is to use knowledge to reduce the number of incoherent topics. Topic incoherence means that the words in a topic do not semantically hold well together. It can be of various types, that is, chained, intruded, random, and unbalanced, as explained by [21]. The chained problem refers to the case where there are word pairs having strong interconnection, forming a chain, but there is no collective relatedness among the words of a topic. Intruded topics have a set of words with good correlation within the set but not fitting well with the rest of the words in the topic. Random topics do not make any sense, while unbalanced topics have a mix of general and specific terms that do not hold well together. With the incoherence problems known, the modules of an LML model can be modified to closely focus on them.

Lifelong learning topic models are introduced to benefit from the knowledge-base while keeping it independent of domain specific manual tuning. This makes it possible for topic models to mine topics from big data with many domains. The Automatic Knowledge Learning (AKL) model [22] is the first such model that used automatic knowledge learning for topic models. Later, different lifelong learning models were proposed in which the process of knowledge extraction and transfer is refined for improved accuracy and performance. However, they still have limitations to be addressed. The existing LML models lack a knowledge maintenance and retention module. Therefore, they have to load topics from all tasks performed to sample relevant knowledge for the current task. The knowledge learnt is limited, initially consisting of must-links as word pairs having strong positive correlation. In later models, cannot-links were also added as word pairs with strong negative correlation. Following a batch processing approach, they expect the big data to be available prior to analysis. If topics of an already performed task are removed, the learning curve of the model is affected. Therefore, existing state-of-the-art models consume more storage and processing resources and do not provide support for streaming data arriving in pieces.

The proposed model is online, that is, it processes one domain at a time as domains are received. Each domain is processed only once, with the knowledge available at that time. After processing each domain, the knowledge-base is updated. With a knowledge retention module, the model does not require loading all the topics from the previously processed domains. This makes the model lightweight without compromising accuracy. The model has low data dependency, as it only has to load the occurrences of relevant knowledge maintained in abstract form. The contributions of this paper are highlighted below.

2. Highlights

(1) With the online approach, the model processes the data as it is received in intervals. This makes the proposed model closer to the essence of big data streaming from different sources.

(2) A knowledge retention module is introduced to store knowledge in abstract form. Unlike existing models that load topics from all domains, it only consults the knowledge-base, thus reducing dependency on data. Therefore, it is efficient in performance and resource utilization.

(3) Must-links and cannot-links are extracted with a single mechanism, that is, normalized Pointwise Mutual Information (PMI) as nPMI [23]. They are stored, maintained, and transferred irrespective of their orientation, for improved performance.

(4) Being online, the proposed model has limited data at a time to learn from and is expected to learn low quality knowledge as well. Knowledge freshness, utility, and confidence are introduced to monitor, prioritize, and even filter out knowledge if it is not contributing considerably.

(5) Knowledge representation and abstraction are improved to resolve word sense disambiguation without consulting all the processed domains. Knowledge rules are supported with a background representing the top topic words from which the rule was mined. Reasonable overlap between the current domain and a knowledge background suggests the same word sense.

3. Related Work

Probabilistic topic models based on Latent Dirichlet Allocation (LDA) extract correlations among words through statistical computations by calculating probabilities of word coexistence [24]. LDA uses Markov Chain Monte Carlo (MCMC) inference to combine words under topics having a thematic relevance. Following the bag-of-words (BOW) approach, it considers the presence of words but not their positions. Initially introduced with a variational Bayesian technique, Gibbs sampling is now mostly used for inference to converge the analysis towards more probable groupings within the given number of iterations, usually kept high, in the thousands. The accuracy of topic models relies on the initial values of the hyperparameters as well; however, it generally performs well for larger datasets. Probabilistic Latent Semantic Analysis (pLSA) [25], with its extensions, is also used for topic modeling; however, LDA is preferred for adjustable priors that closely refer to the nature of specific documents and topics. LDA holds the advantage of identifying and aggregating topics together, where these are separate steps in other approaches, for example, dictionary based and relation based. The extensions of unsupervised LDA used for aspect and sentiment terms extraction as topics are [1–5]. The hybrid topic models used for topic extraction and sentiment analysis are [10–15]. The semisupervised LDA models guided with initial seed aspect terms for topic extraction as product aspects are [6–9].

To improve the accuracy of topic models while keeping user involvement to a minimum, knowledge-based semisupervised models were proposed. The purpose of these models is to provide domain specific knowledge rules that guide the extraction process to reduce the number of incoherent topics extracted. Dirichlet Forest priors based LDA (DF-LDA) was introduced as a knowledge-based topic model to which knowledge rules are provided in the form of must-links and cannot-links [16]. It changed the perspective towards topic extraction, as the accuracy was dependent on the quality of knowledge. The knowledge rules guided the model to put terms under the same or different topics based on the intuition. A must-link increases the probability of a word pair to coexist under the same topic, while a cannot-link works opposite to it. DF-LDA, being transitive, could not handle multiple senses of a word, which was not addressed in their later model [17] either. The knowledge rules are assumed to be accurate, with no refinement mechanism, and are applied with equal priority. The rules were relaxed to insets in [18], having topic labels for words and thus avoiding one-to-one relationships for word coexistence. These semisupervised knowledge-based topic models are manually provided with domain specific knowledge and cannot scale up to big data. The knowledge provided is assumed to be accurate, there being no mechanism for verifying and rectifying knowledge if it is provided by mediocre domain experts.

Lifelong machine learning is used for various tasks in the works of [26–29]; however, none of them used LML for topic modeling. Automatic knowledge-based models were introduced to incorporate a knowledge module with topic models and obtain more semantically correlated topics. The automatic LML models learn and apply knowledge automatically, without external support, and can be directly applied to big data. Overlap among different domains is exploited for learning and transferring knowledge. AKL [22] was the first attempt to introduce lifelong machine learning to topic modeling. The model processes all the domains in the dataset with baseline LDA to extract initial topics. The extracted topics are clustered to learn and utilize knowledge. The Lifelong Topic Model (LTM) [30] improved the performance and accuracy of LML topic mining by using Frequent Itemset Mining (FIM) for knowledge extraction and the Generalized Polya Urn (GPU) model for transferring it to future tasks. Manual support was provided to set the threshold for FIM; however, this could not address the problem of mining local knowledge. A large threshold value means mining only global knowledge, while a low value that focuses on local knowledge brings many irrelevant knowledge rules as well. Both models used only positively correlated words as knowledge, that is, must-links.

In [20], the automatic must-links cannot-links (AMC) model is proposed to address the problems of earlier automatic LML models. Unlike other LML models, it used both must-link and cannot-link knowledge types. The knowledge rules are extracted with FIM with multiple minimum supports (MS-FIM) to consider both global and local knowledge while avoiding irrelevant knowledge rules. Since the cannot-links are believed to be very numerous, due to the large size of the vocabulary, the model only extracts cannot-links for the current domain. In order to ensure their quality, the must-links and cannot-links are verified from the topics of all domains. For transferring the learnt knowledge into the inference technique, the Multigeneralized Polya Urn (M-GPU) model is used. For a word drawn from a topic, if there are relevant must-links, all the paired words get a push up in the topic, whereas a penalty is applied to the paired words in cannot-links. M-GPU provides the bias supported by knowledge in order to help the distributions towards topics having semantically correlated words. The transitivity problem is addressed with a graph connecting knowledge rules that share a common word used in the same sense.

The existing state-of-the-art LML models follow a batch processing approach. They expect all the domains to be available prior to analysis. The big data is processed with baseline LDA to extract topics in the first pass. In the second pass, each domain is processed again with knowledge. Topics from all domains are loaded for each task to learn relevant knowledge that is verified from many domains. This makes these models highly dependent on data, as they require topics from all domains to be available at all times. All the knowledge rules extracted are treated equally, whereas some knowledge rules have high confidence while others marginally cross the threshold; there is no mechanism to prioritize knowledge rules. These models lack a knowledge maintenance module and have to load topics from all domains for each task, thus consuming more resources.

4. Proposed Model (OAMC)

An online LML model (OAMC) is proposed to address the issues discussed. It is more closely related to the essence of big data analysis in that it expects data to arrive continuously in chunks. It is less dependent on data, therefore consuming limited storage and processing resources. Efficiency of the model is associated with the knowledge-base instead of topics from all domains. With a newly introduced knowledge maintenance and retention module, the harvested knowledge rules are stored in an abstract form. The knowledge-base is maintained after each task to keep it consistent. New knowledge rules are mined, while existing rules are updated or even filtered, as a continuous process. Additional features of knowledge are introduced to help maintain and prioritize them. The working of the model is shown in Figure 1. With the online sources producing textual content continuously, the proposed model can keep up with the processing needs of the data in hand. This makes it more practical for developing real-world monitoring and analysis applications.


OAMCModel(Dt, V, N, knowledge-base)
(1) procedure onlineAMCModel(Dt, V, N, knowledge-base)
(2)   relRules ← sampleRelevantKnowledge(V, knowledge-base)
(3)   if (relRules = 0) then                          ▷ without knowledge
(4)     topics ← LDA(Dt, V, N, 0)
(5)   end if
(6)   if (relRules > 0) then                          ▷ with relevant knowledge
(7)     topics ← GibbsSampler(Dt, V, N, relRules)
(8)   end if
(9)   rulesRefined ← refineRules(Dt, V, topics, relRules, knowledge-base)
(10)  rulesNew ← mineRules(Dt, topics)
(11)  knowledge-base ← rulesRefined ∪ rulesNew
(12) end procedure

Algorithm 1: OAMC model.

[Figure: N domains of M documents each arrive in sequence; the first domain is processed with baseline LDA and later domains with OAMC; the topics of each task feed a learning module that updates the shared knowledge-base.]

Figure 1: Proposed model (OAMC) using online knowledge extraction and transfer mechanism.

The OAMC model can operate both with and without knowledge. In case there is no relevant knowledge available, the model operates as baseline LDA. However, as the model matures, its knowledge-base grows to support a variety of future tasks. With relevant knowledge, the model shows improved performance by grouping semantically related words under topics with the help of knowledge learnt from previous tasks. The working of the OAMC model is shown in Algorithm 1. The set of documents in the current domain is $D_t = \{D_1, D_2, \ldots, D_n\}$, $V$ is the vocabulary of the current task, while $N$ is the number of iterations for the Gibbs sampler [13] used as inference technique. The knowledge-base holds the knowledge rules, as must-links and cannot-links, learnt through experience. In step (2), only those must-links and cannot-links are selected that are relevant to the current task, that is, both words of the must-link or cannot-link exist in the vocabulary of the current domain. The rules relevant to the current domain are stored as relRules. If no relevant rules are found, the current task is processed with baseline LDA, as shown in step (4); here 0 means that no relevant knowledge is found for the current task and therefore the baseline LDA model is to be used. This may happen at low experience, when the knowledge-base has limited rules and the current task is about a completely different subject. When relevant rules are found, step (7) uses the Gibbs sampler [20] to incorporate knowledge and guide the inference. The knowledge-base maintains itself after each task to stay consistent. Step (9) refines relRules, growing their confidence and utility as a token of their services for the current domain; all rules in the knowledge-base get a task older. As part of the continuous learning process, the model harvests new knowledge rules after processing each task, in step (10). The new rules are merged with the existing rules and added to the knowledge-base, which is then ready to process a new task.

4.1. Knowledge Representation. Additional features are added to knowledge rules to help maintain the knowledge-base and keep it consistent by continuously weeding out wrong or irrelevant knowledge. In previous models, the knowledge rules are maintained as separate lists of must-links and cannot-links, each with a word pair and a confidence value. Our OAMC model stores them together as knowledge rules, irrespective of their orientation; the orientation is instead communicated through a positive or negative sign on the confidence value. New rule features are added, namely utility, freshness, and background, to strictly monitor the contribution of each rule. To survive longer, a knowledge rule kRule(w1, w2) needs to maintain high utility in the tasks that follow. The confidence of a rule is also expected to be reassured by future domains, which adds to the net confidence. The utility and freshness of a rule maintain a check on its contribution and relevance; a rule with high confidence but low usage and freshness is also expected to be filtered. These newly added features keep the knowledge-base consistent with quality knowledge. The background feature helps to resolve word sense disambiguation by providing the context of the knowledge rule. A knowledge rule in the knowledge-base is stored as shown in Figure 2.

4.2. Knowledge Extraction. The proposed model has access to only limited big data for a task, that is, the domain of the current task.


MineRules(Dt, N, topics)
(1) procedure MineRules(Dt, N, topics)
(2)   for (t : all topics) do
(3)     for (i, j : all word pairs in a topic) do
(4)       if (nPMI(i, j) < μ1) then
(5)         Rules ← Rules ∪ Rule(i, j, nPMI(i, j), topic)   ▷ learn as negative rule
(6)       end if
(7)       if (nPMI(i, j) > μ2) then
(8)         Rules ← Rules ∪ Rule(i, j, nPMI(i, j), topic)   ▷ learn as positive rule
(9)       end if
(10)    end for
(11)  end for
(12) end procedure

Algorithm 2: Mine rules.

[Figure: a knowledge rule kRule(w1, w2) in the knowledge-base carries confidence, utility, freshness, orientation, and background.]

Figure 2: Representation of a knowledge rule as a word pair in the knowledge-base. Utility, freshness, and background are the newly introduced knowledge features to monitor rule quality.

Therefore, the model does not have enough instances to verify the quality of knowledge and is expected to learn low quality knowledge as well. A knowledge rule can be of low quality if it has enough support in the data from which it was learnt but is not supported by future tasks. This may happen due to noise or high bias in a domain. The refinement technique deals with such a rule until it is removed. In order to extract new knowledge rules kRule(w1, w2), normalized PMI [23], written nPMI(w1, w2), is used, as given in

\[ \mathrm{kRule}(w_1, w_2) = \mathrm{nPMI}(w_1, w_2) = \frac{\mathrm{PMI}(w_1, w_2)}{-\log p(w_1, w_2)}, \quad (1) \]

where nPMI gives a value in the [−1, 1] range. All candidate word pairs are evaluated through nPMI. For a word pair, if the score is close to 1, the knowledge rule (must-link) is stored with its nPMI score as confidence. A word pair having a score close to −1 is stored as a negative rule (cannot-link) with its nPMI score as confidence. PMI in (1) is evaluated as

\[ \mathrm{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}, \quad (2) \]

where p(w1, w2) is the coexistence probability of w1 and w2, while p(w1) and p(w2) are the probabilities of their individual existence in the current domain. Coexistence is the presence of two words in a conceptual document, irrespective of their frequency. The calculation of the coexistence probability and the individual probabilities is shown in (3), where D_t(·) denotes the documents containing the given word(s) and |D_t| is the total number of documents:

\[ p(w_1, w_2) = \frac{|D_t(w_1, w_2)|}{|D_t|}, \qquad p(w) = \frac{|D_t(w)|}{|D_t|}. \quad (3) \]
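Equations (1)–(3) translate directly into code. The sketch below assumes each document is represented as a set of its distinct words, so only document-level coexistence counts matter, and adopts the usual conventions for the degenerate cases (a pair that never coexists scores −1; a pair coexisting in every document scores 1); it is illustrative, not the authors' implementation:

import math

def npmi(w1, w2, docs):
    """Normalized PMI over document coexistence, following (1)-(3); range [-1, 1]."""
    n = len(docs)
    n1 = sum(1 for d in docs if w1 in d)
    n2 = sum(1 for d in docs if w2 in d)
    n12 = sum(1 for d in docs if w1 in d and w2 in d)
    if n12 == 0:
        return -1.0          # never coexist: treated as the negative extreme
    if n12 == n:
        return 1.0           # coexist in every document: the positive extreme
    p1, p2, p12 = n1 / n, n2 / n, n12 / n
    pmi = math.log(p12 / (p1 * p2))   # equation (2)
    return pmi / -math.log(p12)       # equation (1), Bouma's normalization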

Freshness and utility are initialized, while the topic of the word pair is added to the background of the rule. The two thresholds μ1 and μ2 are set to harvest word pairs with nPMI scores at either of the two extremes, as given in

\[ \mathrm{nPMI}(w_1, w_2): \begin{cases} \text{learn} & \text{if } \mathrm{nPMI}(w_1, w_2) < \mu_1 \text{ or } \mathrm{nPMI}(w_1, w_2) > \mu_2, \\ \text{ignore} & \text{otherwise}. \end{cases} \quad (4) \]

When nPMI gives a value that is too high, that is, close to 1, or too low, that is, close to −1, it is considered to be of significance. This untypical behavior is learnt as knowledge, while values close to zero are ignored, as they have not shown any considerable correlation. Word pairs having nPMI below μ1 are learnt as cannot-links, while values above μ2 are learnt as must-links. The threshold values are given in Setup. Our contribution in knowledge extraction is to use nPMI(w1, w2) as a very lightweight extraction mechanism; secondly, both must-links and cannot-links are extracted together in a single step. MS-FIM was previously used to harvest knowledge, which caused a performance bottleneck. The rule mining mechanism is given in Algorithm 2. It generates word pairs within each topic and evaluates them for possible knowledge from step (2) to step (11). Satisfying either of the thresholds in step (4) and step (7), a word pair is mined as a negative or positive knowledge rule.

4.3. Knowledge Transfer. The knowledge transfer mechanism selects relevant knowledge rules each time a word is sampled, so that related probabilities are adjusted under a topic. The effect of relevant knowledge rules is added as bias into the topic model, to push positively correlated words up under the same topic and vice versa. Since the current task has limited data, knowledge from previous tasks is incorporated to guide the inference. For a sampled word w1 in topic t, if a rule exists for it, then the associated word w2 has its probability updated in the current topic t, as shown in

\[ p(\mathrm{rule}_i = \mathrm{rule} \mid t) \propto p(w_1 \mid t) \times p(w_2 \mid t), \quad (5) \]

where w1 and w2 are the two words of a rule rule_i (must-link or cannot-link). When the model inference increases the probability of a word w1 for a topic t, the probability of w2 is increased for topic t in the case of a must-link and decreased in the case of a cannot-link. To address word sense disambiguation, the following is proposed:

\[ \frac{|V \cap \mathrm{ruleBackground}|}{|\mathrm{ruleBackground}|} > \xi. \quad (6) \]

Here ξ is the threshold that has to be satisfied for the overlap between the vocabulary of the current task and the rule background. The rule background represents the set of words in the topic from which the rule was learnt. Comparing the rule background with the current vocabulary ensures that the rule is used in the same context in which it was mined. For example, the word Bank has different contexts when used with Accounts and Sea, respectively. Topics of all processed tasks were previously used for the same purpose. The net confidence and utility of a knowledge rule are used to transfer the impact of the knowledge rule in topic sampling. The Multigeneralized Polya Urn (M-GPU) model [20] is used to transfer the effect of knowledge into the current state of the Markov chain in the Gibbs sampler in

\[ p(w \mid t) \propto \frac{\sum_{w'=1}^{V} (\upsilon_{w,w'}\,\rho) \times n_{t,w'} + \beta}{\sum_{v=1}^{V} \left( \sum_{w'=1}^{V} (\upsilon_{v,w'}\,\rho) \times n_{t,w'} + \beta \right)}. \quad (7) \]

Equation (7) shows the amount of update in the probability, by a factor υ_{w,w′} ρ, for the other word in a knowledge rule. When the probability of a sampled word is increased for a topic using the rules, the probabilities of all the words associated with the sampled word are adjusted as well. The Gibbs sampling update is extended to incorporate both positively and negatively correlated rules in a single step as

\[ p(z_i = t \mid z_{-i}, w, \alpha, \beta, \lambda) \propto \frac{n_{d,t}^{-i} + \alpha}{\sum_{t'=1}^{T} \left( n_{d,t'}^{-i} + \alpha \right)} \times \frac{\sum_{w' : (w', w_i) \in \mathrm{rule}} (\upsilon_{w',w_i}\,\rho) \times n_{t,w'}^{-i} + \beta}{\sum_{v=1}^{V} \left( \sum_{w' : (w', v) \in \mathrm{rule}} (\upsilon_{w',v}\,\rho) \times n_{t,w'}^{-i} + \beta \right)}. \quad (8) \]

For a sampled word w, the other word w′ sharing a rule with it gets a push as strong as the net confidence, in the direction of its orientation. The objective is to increase the probability of positively correlated words to appear together as top words under the same topic and vice versa.

Table 1: OAMC positive and negative knowledge rules with high utility.

Type    Knowledge
+ive    (tech support), (cable usb), (pro con), (customer support), (battery charge), (video card), (high long), (connection vga), (operating system), (monitor macbook), (mac support), (cd dvd), (windows xp), (money worth), (inch picture), (input output), (friend mine), (big deal)
−ive    (sound money), (feature battery), (battery price), (price device), (screen sound), (review screen), (worse digital), (price easy), (gas screen), (signal easy), (home gps), (hotel traffic), (driver cooler), (monitor processor), (monitor speed)

4.4. Knowledge Maintenance and Retention. Supported by a knowledge maintenance and retention module, useful knowledge rules are stored for as long as they justify their existence. The knowledge-base is refined after each task to keep it consistent, and all rules get older by a task. The relevant knowledge rules grow in confidence, supported by the current domain, and gain higher utility. After updating the state of each knowledge rule, the rules are passed through a threshold again to justify their existence, as proposed in

\[ \mathrm{kRule}(w_1, w_2): \begin{cases} \text{remove} & \text{if } \mu_1 < \dfrac{\upsilon \times \rho}{\lambda} < \mu_2, \\ \text{retain} & \text{otherwise}, \end{cases} \quad (9) \]

where υ is the net confidence, ρ is the utility, and 1/λ is the freshness of a knowledge rule. If the confidence of a knowledge rule in the current task is above its net confidence υ, the current domain is allowed to update the knowledge background to better represent its context, as proposed in

\[ \mathrm{kRule}(w_1, w_2): \begin{cases} \text{update background} & \text{if } \mathrm{conf} > \upsilon, \\ \text{unchanged} & \text{otherwise}, \end{cases} \quad (10) \]

where the background is updated by adding words from the supporting topic in the current task. New rules are mined while existing rules evolve, as a continuous process. The newly added knowledge features are used to rank and prioritize knowledge rules, and the difference in rank between good and bad knowledge rules widens as the model matures with experience. Algorithm 3 is used to refine existing knowledge, with step (2) obtaining the relevant knowledge as Rules′. In steps (4) to (7), all relevant rules get their net confidence updated and their utility incremented; afterwards, all rules are aged by a task. Some of the popular rules from the knowledge-base of the OAMC model are shown in Table 1.

5. Experimental Results

The proposed model (OAMC) contributes to the application of lifelong learning for NLP tasks, specifically used to improve


RefineRules(Dt, N, topics, V, Rules)
(1) procedure RefineRules(Dt, N, topics, V, Rules)
(2)   Rules′ ← relevantRules(V, knowledge-base)       ▷ set of relevant rules
(3)   for (r : all Rules) do
(4)     if (Rules′ contains r) then
(5)       υ ← υ + nPMI(r)                             ▷ add to cumulative confidence
(6)       ρ ← ρ + 1                                   ▷ rule's utility increased by 1
(7)     end if
(8)   end for
(9) end procedure

Algorithm 3: Refine rules.

topic coherence for big streaming data. Various experiments are performed to highlight the improvement in accuracy and performance of lifelong topic mining as product aspects. Unlike traditional ML models, it operates on big data with a variety of domains. The standard dataset used by previous LML models is used to evaluate the proposed model in comparison to the other state-of-the-art models. The data consists of 50 domains of electronic products. To highlight the improvement of the OAMC model, the dataset was provided with a decreasing number of domains. The models are evaluated on topic coherence for accuracy and on time in minutes for performance. The number of knowledge rules is used as a measure of quality of knowledge, where fewer rules producing better results are considered to be of better quality.

5.1. Setup. To set up the test environment, an HP Pavilion DM4 machine is used with an Intel 2.4 GHz Core i5, 8 GB RAM, and a 500 GB hard drive. Following the evaluation approach of existing models, 15 topics are extracted per domain for all models, and, as in previous models, the top 30 words per topic are considered to calculate the topic coherence of the proposed model also. The hyperparameters are kept the same for all models. The number of Gibbs sampler iterations is kept the same (2000) as used in previous models. The models are configured and initialized as in their original work. The proposed OAMC has its thresholds μ1 and μ2 set to −0.9 and 0.8, respectively, to learn knowledge and select relevant knowledge for a given task. Since negative correlations are far more numerous than positive ones, the thresholds are set to focus more on positive rules. In order to choose relevant rules, the overlap threshold between the vocabulary and the knowledge background is set to 0.3.
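For reference, the constants fixed in this section can be gathered in one place; the dictionary below is purely illustrative (the names are invented), with values as stated above:

CONFIG = {
    "topics_per_domain": 15,
    "top_words_per_topic": 30,
    "gibbs_iterations": 2000,
    "mu1_cannot_link": -0.9,      # learn cannot-links below this nPMI
    "mu2_must_link": 0.8,         # learn must-links above this nPMI
    "xi_background_overlap": 0.3, # minimum vocabulary/background overlap
}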

5.2. Results. The accuracy of lifelong models, and of topic models in general, is evaluated through topic coherence, which has high correlation with human evaluation. All of the models are given the same number of documents in a domain for both learning and evaluation. The existing models drop accuracy when they have to learn from a smaller dataset, with documents in the hundreds per domain, while compromising accuracy when evaluated on a larger dataset with thousands of documents. Therefore, the previous models were given large datasets to learn from and were evaluated on smaller datasets. However,

[Figure: topic coherence (−1000 to −700) against the number of domains available (50 down to 1) for LDA, LTM, AMC-M, AMC, and OAMC.]

Figure 3: The proposed model (OAMC) produces the highest accuracy (as topic coherence) among lifelong learning models for streaming big data.

for fairness to all models, they are provided with the test set only, both to learn from and to test on.

Topic coherence of the models compared has an interesting outcome. LTM surprisingly could not maintain a better topic coherence when it had to learn from a smaller dataset; it even dropped below the baseline. AMC and AMC-M performed well; however, the proposed OAMC beat them when the number of domains in the dataset dropped below 10, as shown in Figure 3. AMC and AMC-M hardly learn anything with fewer domains, which makes them ineffective for streaming data; therefore, their accuracy continuously went down towards the baseline. On the other hand, LTM had its learning pattern badly affected by the limited variety of domains and learnt wrong knowledge, which dropped its accuracy. The knowledge rules learnt by AMC and AMC-M were 100 or fewer per domain for fewer than 10 domains in the dataset, growing to more than 2000 for 50 domains. This is because of the overreliance of the existing models on large volumes of data being available at once, which makes them least effective for streaming data. It shows that they only learn better knowledge and produce better results when there are many domains in the dataset and they are all available prior to analysis. With a robust knowledge


[Figure: runtime in minutes (0 to 140) for the LML models LTM, AMC-M, AMC, and OAMC.]

Figure 4: The proposed model (OAMC) has the highest performance efficiency among LML topic models.

storing mechanism, OAMC learnt more knowledge, refined it, and used it to improve topic coherence with few domains available. The OAMC model shows on average a 7% improvement in topic coherence, with a maximum of 10%, when there is one domain at a time.
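Topic coherence here is the metric of [21], computed over the top 30 words of each topic. A compact version is sketched below; the log-count form with +1 smoothing is the standard formulation and is assumed here, since the paper does not restate the formula:

import math

def topic_coherence(topic_words, docs, top_n=30):
    """Coherence of [21]: sum over ranked pairs of log((D(wi, wj) + 1) / D(wj))."""
    words = topic_words[:top_n]
    score = 0.0
    for i in range(1, len(words)):
        for j in range(i):
            d_j = sum(1 for d in docs if words[j] in d)
            d_ij = sum(1 for d in docs if words[i] in d and words[j] in d)
            if d_j > 0:
                score += math.log((d_ij + 1) / d_j)
    return score  # less negative means more coherent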

OAMC learns wrong knowledge too, as it considers only the current domain when harvesting it. However, with a robust knowledge retention module, after a few tasks only good quality knowledge survives the filtering mechanism. Knowledge rules with little confidence from future tasks or limited utility are pushed down and finally filtered out, while knowledge whose confidence is reinforced by future tasks and whose utility grows stays longer. To evaluate the relevance of a knowledge rule to the current task, the domain of the current task must have the rule words' confidence above threshold and an overlap with the rule background higher than threshold. The online OAMC model truly learns like humans, having no idea of future tasks, unlike existing models. Its learning depends only on past experience, that is, the domains processed, and it uses this experience for future tasks. It is less dependent on data, parses each domain once, and consults only the knowledge module with abstract rules to sample relevant rules, and it therefore shows an improvement of up to 50% in performance, as shown in Figure 4.

For the discussed intrinsic evaluation, the models are also compared through PMI mean and median values, which had relevance to the topic coherence values. The extrinsic WordNet based techniques used in the literature (Wu-Palmer, Resnik, Path, Lin, Leacock-Chodorow, Lesk, Jiang-Conrath, and Hirst-St.Onge) were inconclusive in differentiating among the qualities of the topics extracted, due to the specialized nature of the electronic products domains.

5.3. Evaluation. OAMC improving the performance of AMC by 50%, as shown in Figure 4, is attributed to the following reasons. OAMC does not extract relevant knowledge anew for each task; rather, it chooses from a well maintained knowledge module. Each domain is parsed only once. With a knowledge filtering mechanism, only good quality knowledge rules are maintained. There are generally two performance bottlenecks in existing LML models, that is, FIM for knowledge extraction and the high number of Gibbs sampling iterations. The OAMC model uses nPMI [23], which is highly efficient, to extract knowledge; secondly, it learns both must-links and cannot-links together, with a single mechanism, as rules. Even with 50 domains in the dataset, OAMC performed better than AMC for certain individual domains by producing higher topic coherence while using fewer knowledge rules, as shown in Figure 5. Table 1 shows some of the most frequently used knowledge rules of OAMC. Through manual evaluation of topics from all the domains, it was observed that the knowledge learnt by LML models is more effective against the random and intruded types of topic incoherence but less effective against the chained and unbalanced types. In fact, the knowledge-based models can at times even aggravate chained topic incoherence if there are many knowledge rule pairs sharing common words.

5.4. Knowledge Analysis. The accuracy of LML models depends upon the quality of their knowledge. The previous models did not provide a discussion on the quality of knowledge; in fact, improved topic coherence was taken as evidence of good quality knowledge. The knowledge rules shown in Table 1 have utility above 10 as a token of their quality. The ratio of use of knowledge rules per domain by AMC and OAMC is 5:1, signifying the quality of knowledge produced by OAMC. The net confidence of the rules also shows that most of the relevant domains reinforced a rule and increased its confidence (above 0.8, with 1 as maximum). Since this information is not available for the existing models, the values cannot be matched. It is still very difficult to explore the impact of each knowledge rule separately; however, the discussed features in general keep track of good quality knowledge. The knowledge learnt is of two types only, capturing positive and negative word correlations. However, more varieties of knowledge can be introduced for further insights. By introducing the type of relationship as knowledge, the unbalanced topic incoherence problem can be addressed. Along with the correlation between words, the nature of their correlation may also be explored for better analysis. This type of knowledge is also useful for finding hierarchical topics. The knowledge rules extracted are isolated, while knowledge as a complex network can be more effective for a variety of analyses; AMC uses only a small graph to associate must-links sharing a common word in the same context. Knowledge extraction, transfer, representation, and retention were only introduced to NLP tasks in 2010 and require mature ideas from other fields of research.

6. Conclusion

LML models are structured to process big data having many domains, and therefore they are the only suitable solution despite their shortcomings. LML models have only recently been introduced to big data topic extraction and are far from mature. The existing LML topic models had limited application due to following a batch processing approach and overreliance on data. Our proposed model, named OAMC, follows an online approach to LML topic modeling that can support streaming data. With an efficient learning mechanism, our OAMC model has lower dependency on data and therefore improved performance by


[Figure: (a) topic coherence for each individual domain (AlarmClock, Battery, CDPlayer, Fan, Lamp) for OAMC and AMC; (b) number of knowledge rules used for each domain by OAMC and AMC.]

Figure 5: The proposed model (OAMC), in comparison to the state-of-the-art LML model, (a) produces better accuracy as topic coherence for individual domains while (b) using fewer knowledge rules.

50% as compared to state-of-the-art LML models. Velocity is an important feature of big data, as the data is believed to be received continuously at high speed. The OAMC model is designed to process only the data in hand and therefore outperforms existing models for streaming data. As the data is expected to have noise, OAMC is supported with a strict monitoring and filtering mechanism to maintain a consistent knowledge-base. The rules are made to compete for survival based on utility, freshness, and net confidence. OAMC, being blind to the future, gets higher bumps in accuracy when the current task has low relevance to past experience; this is very common to a human-like learning approach. However, the proportion of such tasks, for which limited relevant knowledge is available, drops considerably with experience. The OAMC model shows a 7% improvement in accuracy on average for streaming big data, and up to 10% for certain individual domains, as compared to the state of the art.

In the future, we will be working on introducing more varieties of knowledge. Through complex network analysis, the knowledge rules can be stored in association with each other. Cooccurrence can mean many things, and therefore the relationship type should also be considered in the future. It will help to associate aspects with products, sentiments with aspects, and reasons with sentiments. To improve the performance and attempt near real-time analysis, Gibbs sampling should be replaced with a lighter inference technique.

Disclosure

This research is the authors' own work, not published or under review elsewhere.

Competing Interests

The authors declare that they have no competing interests.

Authors' Contributions

In this research work, Muhammad Taimoor Khan (corresponding author) developed the model, while Mehr Durrani helped with adjusting the knowledge learning and transfer mechanism to better adapt to the current task. Furqan Aziz contributed through incorporating the knowledge maintenance module for high efficiency and reduced data dependency. Shehzad Khalid supervised the activity, contributed through running experiments and compiling results, and helped with introducing the new knowledge features and prioritizing them.

Acknowledgments

The authors are thankful to Mr. Yousaf Marwat, Lecturer at COMSATS IIT, Attock, for facilitating with editing and proofreading services.

References

[1] S. Branavan, H. Chen, J. Eisenstein, and R. Barzilay, "Learning document-level semantic properties from free-text annotations," Journal of Artificial Intelligence Research, vol. 34, pp. 569–603, 2009.
[2] S. Brody and N. Elhadad, "An unsupervised aspect-sentiment model for online reviews," in Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '10), pp. 804–812, Association for Computational Linguistics, Los Angeles, Calif, USA, June 2010.
[3] F. Li, M. Huang, and X. Zhu, "Sentiment analysis with global topics and local dependency," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI '10), Atlanta, Ga, USA, July 2010.
[4] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai, "Topic sentiment mixture: modeling facets and opinions in weblogs," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 171–180, ACM, Alberta, Canada, May 2007.
[5] I. Titov and R. McDonald, "Modeling online reviews with multi-grain topic models," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 111–120, ACM, Beijing, China, April 2008.
[6] Y. Lu, C. X. Zhai, and N. Sundaresan, "Rated aspect summarization of short comments," in Proceedings of the 18th International World Wide Web Conference (WWW '09), pp. 131–140, ACM, April 2009.
[7] A. Mukherjee and B. Liu, "Aspect extraction through semi-supervised modeling," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL '12), vol. 1, pp. 339–348, Association for Computational Linguistics, July 2012.
[8] I. Titov and R. T. McDonald, "A joint model of text and aspect ratings for sentiment summarization," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, vol. 8, pp. 308–316, Association for Computational Linguistics, Columbus, Ohio, USA, 2008.
[9] H. Wang, Y. Lu, and C. Zhai, "Latent aspect rating analysis on review text data: a rating regression approach," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), pp. 783–792, ACM, July 2010.
[10] T. L. Griffiths, M. Steyvers, D. M. Blei, and J. B. Tenenbaum, "Integrating topics and syntax," in Advances in Neural Information Processing Systems, pp. 537–544, MIT Press, Cambridge, Mass, USA, 2004.
[11] Y. Jo and A. H. Oh, "Aspect and sentiment unification model for online review analysis," in Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM '11), pp. 815–824, ACM, Hong Kong, February 2011.
[12] J. Liu, Y. Cao, C. Y. Lin, Y. Huang, and M. Zhou, "Low-quality product review detection in opinion summarization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 334–342, Prague, Czech Republic, June 2007.
[13] Y. Lu and C. Zhai, "Opinion integration through semi-supervised topic modeling," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 121–130, ACM, Beijing, China, April 2008.
[14] C. Sauper, A. Haghighi, and R. Barzilay, "Content models with attitude," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, vol. 1 of Human Language Technologies, pp. 350–358, Association for Computational Linguistics, 2011.
[15] W. X. Zhao, J. Jiang, H. Yan, and X. Li, "Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 56–65, Association for Computational Linguistics, October 2010.
[16] D. Andrzejewski, X. Zhu, and M. Craven, "Incorporating domain knowledge into topic modeling via Dirichlet Forest priors," in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 25–32, ACM, June 2009.
[17] D. Andrzejewski, X. Zhu, M. Craven, and B. Recht, "A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic," in Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI '11), pp. 1171–1177, Barcelona, Spain, July 2011.
[18] Z. Chen, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh, "Discovering coherent topics using general knowledge," in Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM '13), pp. 209–218, ACM, November 2013.
[19] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell, "Toward an architecture for never-ending language learning," in Proceedings of the 24th AAAI Conference on Artificial Intelligence, vol. 5, p. 3, Atlanta, Ga, USA, July 2010.
[20] Z. Chen and B. Liu, "Mining topics in documents: standing on the shoulders of big data," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pp. 1116–1125, ACM, New York, NY, USA, August 2014.
[21] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum, "Optimizing semantic coherence in topic models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272, Association for Computational Linguistics, July 2011.
[22] Z. Chen, A. Mukherjee, and B. Liu, "Aspect extraction with automated prior knowledge learning," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL '14), pp. 347–358, Baltimore, Md, USA, June 2014.
[23] G. Bouma, "Normalized (pointwise) mutual information in collocation extraction," in Proceedings of the Conference of the German Society for Computational Linguistics (GSCL '09), pp. 31–40, Potsdam, Germany, 2009.
[24] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," The Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993–1022, 2003.
[25] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 50–57, ACM, Berkeley, Calif, USA, August 1999.
[26] E. Eaton and P. L. Ruvolo, "ELLA: an efficient lifelong learning algorithm," in Proceedings of the 30th International Conference on Machine Learning (ICML '13), pp. 507–515, 2013.
[27] E. Kamar, A. Kapoor, and E. Horvitz, "Lifelong learning for acquiring the wisdom of the crowd," in Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI '13), pp. 2313–2320, AAAI Press, Beijing, China, August 2013.
[28] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, "Self-taught learning: transfer learning from unlabeled data," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 759–766, ACM, June 2007.
[29] D. L. Silver, Q. Yang, and L. Li, "Lifelong machine learning systems: beyond learning algorithms," in Proceedings of the AAAI Spring Symposium: Lifelong Machine Learning, Stanford University, March 2013.
[30] Z. Chen and B. Liu, "Topic modeling using topics from many domains, lifelong learning and big data," in Proceedings of the 31st International Conference on Machine Learning (ICML '14), pp. 703–711, Beijing, China, June 2014.


Page 3: Research Article Online Knowledge-Based Model for Big Data ...downloads.hindawi.com/journals/cin/2016/6081804.pdf · Research Article Online Knowledge-Based Model for Big Data Topic

Computational Intelligence and Neuroscience 3

adjustable priors to closely refer to the nature of specificdocuments and topics LDA holds the advantage of identify-ing and aggregating topics together where they are separatesteps in other approaches for example dictionary based andrelation based The extensions of unsupervised LDA used foraspect and sentiment terms extraction as topics are [1ndash5]Thehybrid topic models used for topic extraction and sentimentanalysis are [10ndash15]The semisupervised LDAmodels guidedwith initial seed aspect terms for topic extraction as productaspects are [6ndash9]

To improve the accuracy of topic models while keepingthe user involvement tominimum knowledge-based semisu-pervised models are proposed The purpose of these modelsis to provide domain specific knowledge rules that guidethe extraction process to reduce the number of incoherenttopics extracted Dirichlet Forest priors based LDA (DF-LDA) was introduced as a knowledge-based topic model towhich knowledge rules are provided in the form of must-link and cannot-link [16] It changed the perspective towardstopic extraction as the accuracy was dependent on the qualityof knowledge The knowledge rules guided the model toput terms under the same or different topics based on theintuition A must-link increases the probability of word pairsto coexist under the same topic while a cannot-link worksopposite to it DF-LDA being transitive could not handlemultiple senses of a word which was not addressed in theirlater model [17] as well The knowledge rules are assumed tobe accurate with no refinement mechanism and are appliedwith equal priority The rules were relaxed to insets in [18]having topic labels for words and thus avoiding one-to-one relationships for word coexistenceThese semisupervisedknowledge-based topic models are manually provided withdomain specific knowledge and cannot scale up to big dataThe knowledge provided is assumed to be accurate havingno mechanism for verifying and rectifying knowledge ifprovided by mediocre domain experts

Lifelong machine learning is used for various tasks inthe works of [26ndash29] however none of them has used LMLfor topic modeling Automatic knowledge-based models areintroduced to incorporate a knowledge module with topicmodels to have more semantically corelated topicsThe auto-matic LML models learn and apply knowledge automaticallywithout external support and can be directly applied tobig data Overlap among different domains is exploited forlearning and transferring knowledge AKL [22] was the firstattempt to introduce lifelong machine learning with topicmodelingThemodel processes all the domains in the datasetwith baseline LDA to extract initial topics The extractedtopics are clustered to learn and utilize knowledge LifelongTopic Model (LTM) [30] improved the performance andaccuracy of LML topic mining by using Frequent ItemsetMining (FIM) for knowledge extraction and GeneralizedPolya Urn (GPU) model for transferring it to future taskManual support was provided to set threshold for FrequentItemset Mining (FIM) however it could not address theproblem of mining local knowledge A large threshold valuemeans mining only global knowledge while a low valuethat focuses on local knowledge brings many irrelevant

knowledge rules as well They both used positively corelatedwords only as knowledge that is must-links

In [20] automatic must-links cannot-links (AMC) modelis proposed to address the problems of earlier automatic LMLmodels Unlike other LML models it used both must-linkand cannot-link knowledge types The knowledge rules areextracted with FIM with multiple support (MS-FIM) to con-sider both global and local knowledge while avoiding irrele-vant knowledge rules Since the cannot-links are believed tobe very high due to the large size of vocabulary the modelonly extracts cannot-links for the current domain In orderto ensure their quality the must-links and cannot-links areverified from topics of all domains For transferring the learntknowledge into the inference technique MultigeneralizedPolya Urn (M-GPU) model is used For a word drawn from atopic if there are relevantmust-links all the pairedwords get apushup in the topic whereas a penalty is applied to the pairedwords in cannot-links M-GPU provides the bias supportedby knowledge in order to help distributions with topicshaving semantically corelatedwordsThe transitivity problemis addressed with a graph connecting knowledge rules thatshare a common word which is used in the same sense

The existing state-of-the-art LML models follow batchprocessing approach They expect all the domains to beavailable prior to analysis The big data is processed withbaseline LDA to extract topics in first parse In the secondparse each domain is processed again with knowledge Topicsfrom all domains are loaded for each task to learn relevantknowledge that is verified frommany domains Itmakes thesemodels highly dependent on data as they require topics fromall domains to be available at all times All the knowledgerules extracted are treated equally whereas some knowledgerules have high confidence while others marginally crossthreshold There is no mechanism to prioritize knowledgerules These models lack a knowledge maintenance moduleand have to load topics from all domains for each task andthus consume more resources

4 Proposed Model (OAMC)

An online LML model (OAMC) is proposed to address theissues discussed It is more closely related to the essence ofbig data analysis by expecting data to arrive continuously inchunks It is less dependent on data therefore consuminglimited storage and processing resources Efficiency of themodel is associated with the knowledge-base instead oftopics from all domainsWith a newly introduced knowledgemaintenance and retentionmodule the harvested knowledgerules are stored in an abstract form The knowledge-baseis maintained after each task to keep it consistent Newknowledge rules are mined while existing rules are updatedor even filtered as a continuous process Additional featuresof knowledge are introduced to help maintain and prioritizethem The working of the model is shown in Figure 1 Withthe online sources producing textual content continuouslythe proposedmodel can keep up with processing needs of thedata in handThismakes itmore practical for developing real-world monitoring and analysis applications

4 Computational Intelligence and Neuroscience

OAMCModel (Dt V N knowledge-base)(1) procedure onlineAMCModel (119863119905 119881119873 knowledge-base)(2) relRuleslarr sampleRelevantKnowledge (119881 Knowledge-base)(3) if (relRules = 0) then without knowledge(4) topicslarr LDA (119863119905 119881119873 0)(5) end if(6) if (relRules gt 0) then with relevant knowledge(7) topicslarr GibbsSampler (119863119905 119881119873 relRules)(8) end if(9) rulesrefined larr refineRules (119863119905 119881 topics relRules knowledge-base)(10) rulesnew larrmineRules (119863119905 topics)(11) knowledge-baselarr rulesrefined cup rulesnew(12) end procedure

Algorithm 1 OAMCmodel

Figure 1: Proposed model (OAMC) using online knowledge extraction and transfer mechanism. (Domains 1 to N, each of M documents, are processed in sequence; the first task is processed with LDA and later tasks with OAMC, a learning module feeding the shared knowledge-base after each task.)

The OAMC model can operate both with and without knowledge. In case there is no relevant knowledge available, the model operates as baseline LDA. However, as the model matures, its knowledge-base grows to support a variety of future tasks. With relevant knowledge the model shows improved performance by grouping semantically related words under topics with the help of knowledge learnt from previous tasks. The working of the OAMC model is shown in Algorithm 1. The set of documents in the current domain is $D_t = \{D_1, D_2, \ldots, D_n\}$, $V$ is the vocabulary of the current task, and $N$ is the number of iterations for the Gibbs sampler [13] used as inference technique. The knowledge-base holds the knowledge rules, a variety of must-links and cannot-links learnt through experience. In step (2) only those must-links and cannot-links are selected that are relevant to the current task, that is, both words of the must-link or cannot-link exist in the vocabulary of the current domain; the rules relevant to the current domain are stored as relRules. If no relevant rules are found, the current task is processed with baseline LDA, as shown in step (4); the 0 in step (4) means that no relevant knowledge was found for the current task and therefore the baseline LDA model is to be used. This may happen at low experience, when the knowledge-base has limited rules and the current task is about a completely different subject. When relevant rules are found, step (7) has the Gibbs sampler [20] incorporate the knowledge to guide the inference. The knowledge-base maintains itself after each task to stay consistent. Step (9) refines relRules to grow their confidence and utility as a token of their service to the current domain, and all rules in the knowledge-base get a task older. As part of the continuous learning process, the model harvests new knowledge rules after processing each task in step (10). The new rules are merged with the existing rules and added to the knowledge-base, which is then ready to process a new task.
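To make this control flow concrete, the following is a minimal Python sketch of one OAMC task, mirroring Algorithm 1. It is a sketch only: the four injected callables are hypothetical stand-ins, and the rule attributes w1 and w2 assume the rule record sketched in Section 4.1.

def oamc_step(docs, vocab, n_iter, kb,
              run_lda, gibbs_with_rules, refine_rules, mine_rules):
    """One OAMC task, mirroring Algorithm 1. The four callables are
    hypothetical stand-ins injected to keep the sketch self-contained."""
    # Step (2): a rule is relevant when both its words are in the vocabulary.
    rel_rules = {r for r in kb if r.w1 in vocab and r.w2 in vocab}
    if not rel_rules:
        topics = run_lda(docs, vocab, n_iter)                      # steps (3)-(5)
    else:
        topics = gibbs_with_rules(docs, vocab, n_iter, rel_rules)  # steps (6)-(8)
    refined = refine_rules(docs, vocab, topics, rel_rules, kb)     # step (9)
    new_rules = mine_rules(docs, topics)                           # step (10)
    return topics, set(refined) | set(new_rules)                   # step (11)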

4.1. Knowledge Representation. Additional features are added to knowledge rules to help maintain the knowledge-base and keep it consistent by continuously weeding out wrong or irrelevant knowledge. In previous models the knowledge rules are maintained as separate lists of must-links and cannot-links, each with a word pair and a confidence value. Our OAMC model stores them together as knowledge rules irrespective of their orientation; the orientation is instead communicated through a positive or negative sign on the confidence value. New rule features are added, namely utility, freshness, and background, to strictly monitor the contribution of each rule. To survive longer, a knowledge rule kRule$(w_1, w_2)$ needs to maintain high utility in the tasks that follow. The confidence of a rule is also expected to be reassured by future domains, which adds to its net confidence. The utility and freshness of a rule keep a check on its contribution and relevance; a rule with high confidence but low usage and freshness is also expected to be filtered. These newly added features keep the knowledge-base consistent with quality knowledge. The background feature helps to resolve word sense disambiguation by providing the context of the knowledge rule. A knowledge rule in the knowledge-base is stored as shown in Figure 2.
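For illustration, the rule record of Figure 2 could be sketched as below. The field names and the signed-confidence encoding of orientation follow the description above, while the defaults and types are our assumptions, not the authors' data structure.

from dataclasses import dataclass, field

@dataclass(eq=False)   # identity-based hashing, so rules can live in sets
class KRule:
    """A knowledge rule kRule(w1, w2) with the fields of Figure 2."""
    w1: str
    w2: str
    confidence: float                  # net confidence; the sign encodes the
                                       # orientation (+ must-link, - cannot-link)
    utility: int = 0                   # number of tasks the rule has served
    age: int = 1                       # tasks since it was mined; freshness = 1/age
    background: set = field(default_factory=set)   # topic words it was mined from

    @property
    def orientation(self) -> int:
        return 1 if self.confidence >= 0 else -1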

4.2. Knowledge Extraction. The proposed model has access to only a limited portion of the big data for a task, that is, the domain of the


MineRules(D_t, N, topics)
(1) procedure MineRules(D_t, N, topics)
(2)     for (t : all topics) do
(3)         for (i, j : all word pairs in a topic) do
(4)             if (nPMI(i, j) < μ1) then
(5)                 Rules ← Rules ∪ Rule(i, j, nPMI(i, j), topic) ▷ learn as negative rule
(6)             end if
(7)             if (nPMI(i, j) > μ2) then
(8)                 Rules ← Rules ∪ Rule(i, j, nPMI(i, j), topic) ▷ learn as positive rule
(9)             end if
(10)        end for
(11)    end for
(12) end procedure

Algorithm 2: Mine rules.

Figure 2: Representation of a knowledge rule as a word pair KRule$(w_1, w_2)$ in the knowledge-base, with fields confidence, utility, freshness, orientation, and background. Utility, freshness, and background are the newly introduced knowledge features to monitor rule quality.

current task. Therefore, the model does not have enough instances to verify the quality of knowledge and is expected to learn low quality knowledge as well. A knowledge rule can be of low quality if it has enough support in the data from which it was learnt but is not supported by future tasks; this may happen due to noise or high bias in a domain. The refinement technique keeps dealing with such a rule until it is removed. In order to extract new knowledge rules kRule$(w_1, w_2)$, normalized PMI [23], written nPMI$(w_1, w_2)$, is used as given in

$$\text{kRule}(w_1, w_2) = \text{nPMI}(w_1, w_2) = \frac{\text{PMI}(w_1, w_2)}{-\log p(w_1, w_2)}, \tag{1}$$

where nPMI gives a value in the $[-1, 1]$ range. All candidate word pairs are evaluated through nPMI. For a word pair, if the score is close to 1, the knowledge rule (must-link) is stored with its nPMI score as confidence; a word pair having it close to $-1$ is stored as a negative rule (cannot-link) with its nPMI score as confidence. PMI in (1) is evaluated as

$$\text{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}, \tag{2}$$

where $p(w_1, w_2)$ is the coexistence probability of $w_1$ and $w_2$, while $p(w_1)$ and $p(w_2)$ are the probabilities of their individual occurrence in the current domain. Coexistence is the presence of the two words in a conceptual document, irrespective of their frequency. The coexistence probability and individual probabilities are calculated as shown in (3), where $D_t(\cdot)$ counts the documents containing the given words while $|D_t|$ is the total number of documents:

$$p(w_1, w_2) = \frac{D_t(w_1, w_2)}{|D_t|}, \qquad p(w) = \frac{D_t(w)}{|D_t|}. \tag{3}$$
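A direct Python rendering of (1)-(3) follows, computing document-level co-occurrence counts and nPMI; the function name and the handling of degenerate counts are our choices, not the authors'.

import math

def npmi(w1, w2, docs):
    """Normalized PMI over document-level co-occurrence, equations (1)-(3).
    docs: list of sets of words, one set per conceptual document, so
    within-document frequency is deliberately ignored."""
    n = len(docs)
    n1 = sum(1 for d in docs if w1 in d)                 # D_t(w1)
    n2 = sum(1 for d in docs if w2 in d)                 # D_t(w2)
    n12 = sum(1 for d in docs if w1 in d and w2 in d)    # D_t(w1, w2)
    if n12 == 0:
        return -1.0    # never co-occur: maximally negative correlation
    if n12 == n:
        return 1.0     # co-occur everywhere: avoid a zero denominator
    p12 = n12 / n
    pmi = math.log(p12 / ((n1 / n) * (n2 / n)))          # equation (2)
    return pmi / (-math.log(p12))                        # equation (1)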

Freshness and utility are initialized, while the topic of the word pair is added to the background of the rule. The two thresholds $\mu_1$ and $\mu_2$ are set to harvest word pairs with nPMI scores at either of the two extremes, as given in

$$\text{nPMI}(w_1, w_2) \begin{cases} < \mu_1 \text{ or } > \mu_2, & \text{learn}, \\ \text{otherwise}, & \text{ignore}. \end{cases} \tag{4}$$

When nPMI gives a value that is too high (close to 1) or too low (close to $-1$), it is considered significant. This untypical behavior is learnt as knowledge, while values close to zero are ignored, as such pairs have not shown any considerable correlation. Word pairs having nPMI below $\mu_1$ are learnt as knowledge (cannot-links), while values above $\mu_2$ are also learnt (must-links). The threshold values are given in Setup (Section 5.1). Our contribution in knowledge extraction is, first, to use nPMI$(w_1, w_2)$ as a very lightweight extraction mechanism and, second, to extract both must-links and cannot-links together in a single step. MS-FIM, previously used to harvest knowledge, was causing a performance bottleneck. The rule mining mechanism is given in Algorithm 2. It generates word pairs within each topic and evaluates them for possible knowledge from step (2) to step (11); satisfying either of the thresholds in step (4) or step (7), the word pair is mined as a negative or positive knowledge rule, respectively.
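Algorithm 2 can then be sketched as below, reusing the npmi function and KRule record from the earlier sketches. Since the Setup section and Algorithm 2 assign $\mu_1$ and $\mu_2$ inconsistently, the sign assignment here (negative threshold for cannot-links, positive for must-links) is our reading.

from itertools import combinations

MU1, MU2 = -0.9, 0.8   # thresholds from Section 5.1; sign assignment is ours

def mine_rules(docs, topics):
    """Algorithm 2: turn extreme word-pair correlations into signed rules."""
    rules = []
    for topic in topics:                                  # step (2)
        for i, j in combinations(topic, 2):               # step (3)
            score = npmi(i, j, docs)
            if score < MU1 or score > MU2:                # steps (4) and (7)
                # the stored confidence's sign encodes the orientation
                rules.append(KRule(i, j, confidence=score,
                                   background=set(topic)))
    return rules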

4.3. Knowledge Transfer. The knowledge transfer mechanism selects relevant knowledge rules each time a word is sampled


to have its probability increased under a topic. The effect of relevant knowledge rules is added as a bias into the topic model to push up positively correlated words under the same topic, and vice versa. Since the current task has limited data, knowledge from previous tasks is incorporated to guide the inference. For a sampled word $w_1$ in topic $t$, if a rule exists for it, then the associated word $w_2$ has its probability updated in the current topic $t$ as shown in

$$p(\text{rule}_i = \text{rule} \mid t) \propto p(w_1 \mid t) \times p(w_2 \mid t), \tag{5}$$

where $w_1$ and $w_2$ are the two words of a rule $\text{rule}_i$ (must-link or cannot-link). When the model inference increases the probability of a word $w_1$ for a topic $t$, the probability of $w_2$ is increased for topic $t$ in the case of a must-link and decreased for a cannot-link. To address word sense disambiguation, the following test is proposed:

$$\frac{|V \cap \text{ruleBackground}|}{|\text{ruleBackground}|} > \xi. \tag{6}$$

Here $\xi$ is the threshold that has to be satisfied by the overlap between the vocabulary of the current task and the rule background. The rule background represents the set of words in the topic from which the rule was learnt. Comparing the rule background with the current vocabulary ensures that the rule is used in the same context in which it was mined; for example, the word Bank has different contexts when used with Accounts and with Sea. Topics of all processed tasks were previously used for the same purpose. The net confidence and utility of a knowledge rule are used to transfer the impact of the knowledge rule in topic sampling. The Multigeneralized Polya Urn (M-GPU) model [20] is used to transfer the effect of knowledge into the current state of the Markov chain in the Gibbs sampler, as in

$$p(w \mid t) \propto \frac{\sum_{w'=1}^{V} (\nu_{ww'}\,\rho) \times n_{t,w'} + \beta}{\sum_{v=1}^{V} \left( \sum_{w'=1}^{V} (\nu_{vw'}\,\rho) \times n_{t,w'} + \beta \right)}. \tag{7}$$

Equation (7) shows the amount of update in the probability, by a factor $\nu_{ww'}\,\rho$, for the other word of a knowledge rule. When the probability of a sampled word is increased for a topic using the rules, the probability of all the words associated with the sampled word is also increased. The following equation is updated to incorporate both positively and negatively correlated rules in a single step:

$$p(z_i = t \mid z_{-i}, w, \alpha, \beta, \lambda) \propto \frac{n_{-i}^{d,t} + \alpha}{\sum_{t=1}^{T} \left( n_{-i}^{d,t} + \alpha \right)} \times \frac{\sum_{w' : (w', w_i) \in \text{rule}} (\nu_{w'w_i}\,\rho) \times n_{-i}^{t,w'} + \beta}{\sum_{v=1}^{V} \left( \sum_{w' : (w', v) \in \text{rule}} (\nu_{w'v}\,\rho) \times n_{-i}^{t,w'} + \beta \right)}. \tag{8}$$

For a sampled word $w$, the other word $w'$ sharing a rule with it gets a push as strong as the net confidence, in the direction of the rule's orientation. The objective is to increase the probability of positively correlated words appearing together as top words under the same topic, and vice versa.
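A simplified sketch of the transfer step follows: the overlap test of (6) gates each rule, and a gated rule scales the paired word's topic weight by a factor derived from its signed confidence, in the spirit of the $\nu \cdot \rho$ promotion in (7) and (8). The M-GPU bookkeeping inside the Gibbs sampler is omitted, so this is an approximation of the mechanism, not the authors' sampler.

XI = 0.3   # background-overlap threshold (xi) from Section 5.1

def in_context(rule, vocab):
    """Equation (6): apply a rule only when the current vocabulary (a set)
    overlaps enough with the topic the rule was mined from."""
    return bool(rule.background) and \
        len(vocab & rule.background) / len(rule.background) > XI

def apply_rule_bias(n_tw, topic, w, rules, vocab):
    """Scale the topic-word weight of every word paired with w. With the
    signed confidence in [-1, 1], the factor lies in [0, 2]: a push for
    must-links, a penalty for cannot-links (a simplification of the
    nu * rho promotion in (7)-(8))."""
    for rule in rules:
        if w not in (rule.w1, rule.w2) or not in_context(rule, vocab):
            continue
        other = rule.w2 if w == rule.w1 else rule.w1
        n_tw[topic][other] = n_tw[topic].get(other, 0.0) * (1.0 + rule.confidence)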

Table 1: OAMC positive and negative knowledge rules with high utility.

Type    Knowledge
+ive    (tech support), (cable usb), (pro con), (customer support), (battery charge), (video card), (high long), (connection vga), (operating system), (monitor macbook), (mac support), (cd dvd), (windows xp), (money worth), (inch picture), (input output), (friend mine), (big deal)
−ive    (sound money), (feature battery), (battery price), (feature battery), (price device), (screen sound), (review screen), (worse digital), (price easy), (gas screen), (signal easy), (home gps), (hotel traffic), (driver cooler), (monitor processor), (monitor speed)

4.4. Knowledge Maintenance and Retention. Supported by a knowledge maintenance and retention module, useful knowledge rules are stored only as long as they justify their existence. The knowledge-base is refined after each task to keep it consistent, and all rules get older by a task. The relevant knowledge rules grow in confidence, supported by the current domain, and gain utility. After the state of each knowledge rule is updated, the rules are passed through a threshold again to justify their existence, as proposed in

$$\text{kRule}(w_1, w_2) = \begin{cases} \text{retain}, & \dfrac{\nu \times \rho}{\lambda} > \mu_1 \text{ or } \dfrac{\nu \times \rho}{\lambda} < \mu_2, \\ \text{remove}, & \text{otherwise}, \end{cases} \tag{9}$$

where $\nu$ is net confidence, $\rho$ is utility, and $1/\lambda$ is freshness for a knowledge rule. If the confidence of a knowledge rule in the current task is above its net confidence $\nu$, the current domain is allowed to update the rule background to better represent its context, as proposed in

$$\text{kRule}(w_1, w_2) = \begin{cases} \text{update background}, & \text{conf} > \nu, \\ \text{unchanged}, & \text{otherwise}, \end{cases} \tag{10}$$

where the background is updated by adding words from the supporting topic in the current task. New rules are mined while existing rules evolve as a continuous process. The newly added knowledge features are used to rank and prioritize knowledge rules; the difference in rank between good and bad knowledge rules widens as the model matures with experience. Algorithm 3 is used to refine existing knowledge, with step (2) obtaining the relevant knowledge as Rules′. In steps (4) to (7) all relevant rules get their net confidence updated and their utility incremented, and step (8) ages all rules by a task. Some of the popular rules from the knowledge-base of the OAMC model are shown in Table 1.
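Under our reading of (9) and (10), one maintenance pass could look as follows; the running-mean update of net confidence and the exact combination of $\nu \cdot \rho / \lambda$ into a single survival score are assumptions, since the paper specifies the tests but not the bookkeeping.

def maintain(kb, used_rules, scores, supporting_topic, mu1=0.8, mu2=-0.9):
    """One maintenance pass after a task: Algorithm 3 plus tests (9)-(10).
    used_rules: rules relevant to the finished task; scores[r]: the nPMI
    confidence r obtained in the current domain; supporting_topic[r]: the
    words of the topic that supported r."""
    survivors = set()
    for r in kb:
        if r in used_rules:
            conf = scores[r]
            if conf > r.confidence:                       # equation (10)
                r.background |= supporting_topic[r]
            # running mean keeps the net confidence inside [-1, 1]
            r.confidence = (r.confidence * r.utility + conf) / (r.utility + 1)
            r.utility += 1                                # step (6)
        r.age += 1                                        # step (8): all rules age
        score = r.confidence * r.utility / r.age          # nu * rho * (1 / lambda)
        if score > mu1 or score < mu2:                    # equation (9): still extreme
            survivors.add(r)
    return survivors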

5. Experimental Results

The proposed model (OAMC) contributes to the application of lifelong learning for NLP tasks, specifically used to improve


RefineRules(D_t, N, topics, V, Rules)
(1) procedure RefineRules(D_t, N, topics, V, Rules)
(2)     Rules′ ← relevantRules(V, knowledge-base) ▷ set of relevant rules
(3)     for (r : all Rules) do
(4)         if (Rules′ contains r) then
(5)             ν ← ν + nPMI(r) ▷ add to cumulative confidence
(6)             ρ ← ρ + 1 ▷ rule's utility increased by 1
(7)         end if
(8)         λ ← λ + 1 ▷ every rule ages by one task
(9)     end for
(10) end procedure

Algorithm 3: Refine rules.

topic coherence for big streaming data. Various experiments are performed to highlight the improvement in accuracy and performance of lifelong topic mining, with topics as product aspects. Unlike traditional ML models, it operates on big data with a variety of domains. The standard dataset used by previous LML models is used to evaluate the proposed model against the other state-of-the-art models. The data consists of 50 domains, each of electronic product reviews. To highlight the improvement of the OAMC model, the dataset was provided with a decreasing number of domains. The models are evaluated on topic coherence for accuracy and on time in minutes for performance. The number of knowledge rules is used as a measure of knowledge quality, where fewer rules producing better results are considered to be of better quality.

5.1. Setup. To set up the test environment, an HP Pavilion DM4 machine is used with an Intel 2.4 GHz Core i5, 8 GB RAM, and a 500 GB hard drive. Following the evaluation approach of existing models, 15 topics are extracted per domain for all models, and, as in previous models, the top 30 words per topic are considered to calculate topic coherence of the proposed model as well. The hyperparameters are kept the same for all models. The number of Gibbs sampler iterations is kept the same (2000) as used in previous models. The models are configured and initialized as in their original work. The proposed OAMC has its thresholds $\mu_1$ and $\mu_2$ set to 0.8 and $-0.9$, respectively, to learn knowledge and select relevant knowledge for a given task. Since negative correlations are far more numerous than positive ones, the thresholds are set to focus more on positive rules. To choose relevant rules, the overlap threshold between the vocabulary and the knowledge background is set to 0.3.

5.2. Results. The accuracy of lifelong models, and of topic models in general, is evaluated through topic coherence, which has high correlation with human evaluation. All of the models are given the same number of documents in a domain for both learning and evaluation. The existing models drop accuracy when they have to learn from a smaller dataset with documents in the hundreds per domain, while compromising accuracy when evaluated on a larger dataset with thousands of documents. Therefore the previous models were given large datasets to learn from and were evaluated on smaller datasets. However,

Figure 3: The proposed model (OAMC) produces the highest accuracy (as topic coherence) among lifelong learning models for streaming big data. (Topic coherence, $-700$ to $-1000$, plotted against the number of domains available: 50, 40, 30, 20, 10, 8, 6, 4, 2, 1; models compared: LDA, LTM, AMC-M, AMC, OAMC.)

for fairness to all models, they are provided with the test set only, both to learn from and to test on.

Topic coherence of the compared models has an interesting outcome. LTM surprisingly could not maintain a better topic coherence when it had to learn from a smaller dataset; it even dropped below the baseline. AMC and AMC-M performed well; however, the proposed OAMC beat them when the number of domains in the dataset dropped below 10, as shown in Figure 3. AMC and AMC-M hardly learn anything with fewer domains, which makes them ineffective towards streaming data; therefore their accuracy continuously went down towards the baseline. LTM, on the other hand, had its learning pattern badly affected by the limited variety of domains and learnt wrong knowledge, which dropped its accuracy. The knowledge rules learnt by AMC and AMC-M were 100 and below per domain for fewer than 10 domains in the dataset, growing to more than 2000 for 50 domains. This is because of the overreliance of the existing models on large volumes of data being available at once, which makes them least effective towards streaming data. It shows that they only learn better knowledge and produce better results when there are many domains in the dataset and they are all available prior to analysis. With a robust knowledge


Figure 4: The proposed model (OAMC) has the highest performance efficiency among LML topic models. (Time in minutes, 0 to 140, for LTM, AMC-M, AMC, and OAMC.)

storing mechanism, OAMC learnt more knowledge, refined it, and used it to improve topic coherence with few domains available. The OAMC model shows on average a 7%, and at most a 10%, improvement in topic coherence when there is one domain at a time.

OAMC learns wrong knowledge too, as it considers only the current domain when harvesting it. However, with a robust knowledge retention module, after performing a few tasks only good quality knowledge survives the filtering mechanism. Knowledge rules with low confidence from future tasks or limited utility are pushed down and finally filtered out, while knowledge whose confidence is reinforced by future tasks and whose utility keeps growing stays longer. To evaluate the relevance of a knowledge rule to the current task, the domain of the current task must contain the rule words, the confidence must be above threshold, and the overlap with the rule background must be higher than threshold. The online OAMC model truly learns like humans, having no idea of future tasks, unlike existing models. Its learning depends only on past experience, that is, the domains processed so far, and uses it for a future task. It is less dependent on data, parses it once, and consults only the knowledge module with abstract rules to sample relevant rules, and therefore shows an improvement of up to 50% in performance, as shown in Figure 4.

For the discussed intrinsic evaluation, the models are also compared through PMI mean and median values, which had relevance to the topic coherence values. The extrinsic WordNet based techniques used in the literature (Wu-Palmer, Resnik, Path, Lin, Leacock-Chodorow, Lesk, Jiang-Conrath, and Hirst-StOnge) were inconclusive in differentiating among the qualities of the topics extracted, due to the specialized nature of the electronic products domains.

5.3. Evaluation. OAMC improving the performance of AMC by 50%, as shown in Figure 4, is attributed to the following reasons. OAMC does not extract relevant knowledge afresh for each task but rather chooses from a well-maintained knowledge module. Each domain is parsed only once. With a knowledge filtering mechanism, only good quality knowledge rules are maintained. There are generally two performance bottlenecks in existing LML models, namely FIM for knowledge extraction and the high number of Gibbs sampling iterations. The OAMC model uses nPMI [23], which is a highly efficient way to extract knowledge; secondly, it learns both must-links and cannot-links together with a single mechanism, as rules. Even with 50 domains in the dataset, OAMC performed better than AMC for certain individual domains by producing higher topic coherence while using fewer knowledge rules, as shown in Figure 5. Table 1 shows some of the most frequently used knowledge rules of OAMC. Through manual evaluation of topics from all the domains, it was observed that the knowledge learnt by LML models is more effective against random and intruded types of topic incoherence but less effective against chained and unbalanced types of incoherence. In fact, the knowledge-based models can at times enhance chained topic incoherence if there are many knowledge rule pairs sharing common words.

5.4. Knowledge Analysis. The accuracy of LML models depends upon the quality of their knowledge. The previous models did not provide a discussion on the quality of knowledge; in fact, improved topic coherence was taken as evidence of good quality knowledge. The knowledge rules shown in Table 1 have utility above 10 as a token of their quality. The ratio of knowledge rules used per domain by AMC and OAMC is 5:1, signifying the quality of knowledge produced by OAMC. The net confidence of rules also shows that most of the relevant domains reinforced a rule and increased its confidence (above 0.8, with 1 as maximum). Since this information is not available for the existing models, the values cannot be matched. It is still very difficult to explore the impact of each knowledge rule separately; however, the discussed features in general keep track of good quality knowledge. The knowledge learnt is of two types only, capturing positive and negative word correlations; however, more varieties of knowledge can be introduced for further insights. By introducing "the type of relationship" as knowledge, the unbalanced topic incoherence problem can be addressed. Along with the correlation between words, the nature of their correlation may also be explored for better analysis; this type of knowledge is also useful for finding hierarchical topics. The knowledge rules extracted are isolated, while knowledge as a complex network can be more effective for a variety of analyses; AMC uses only a small graph to associate must-links sharing a common word in the same context. Knowledge extraction, transfer, representation, and retention were only introduced to NLP tasks in 2010 and require mature ideas from other fields of research.

6. Conclusion

LML models are structured to process big data having many domains and are therefore the only suitable solution despite their shortcomings. LML models were only recently introduced to big data topic extraction and are far from mature. The existing LML topic models had limited application due to following a batch processing approach and overreliance on data. Our proposed model, named OAMC, follows an online approach to LML topic modeling that can support streaming data. With an efficient learning mechanism, our OAMC model has lower dependency on data and therefore improved performance by


Figure 5: The proposed model (OAMC) in comparison to the state-of-the-art LML model: (a) produces better accuracy as topic coherence for individual domains, while (b) using fewer knowledge rules. (Panels plot topic coherence and the number of knowledge rules used per domain for OAMC and AMC over the individual domains AlarmClock, Battery, CDPlayer, Fan, and Lamp.)

50% as compared to state-of-the-art LML models. Velocity is an important feature of big data, as the data is received continuously at high speed. The OAMC model is designed to process only the data in hand and therefore outperforms existing models for streaming data. As the data is expected to have noise, OAMC is supported with a strict monitoring and filtering mechanism to maintain a consistent knowledge-base. The rules are made to compete for survival based on utility, freshness, and net confidence. OAMC, being blind to the future, gets bigger bumps in accuracy when the current task has low relevance to past experience; this is very common to a human-like learning approach. However, the share of tasks for which limited relevant knowledge is available drops considerably with experience. The OAMC model shows a 7% improvement in accuracy on average for streaming big data, and up to 10% for certain individual domains, as compared to the state of the art.

In the future we will be working on introducing more varieties of knowledge. Through complex network analysis, the knowledge rules can be stored in association with each other. Cooccurrence can mean many things, and therefore the type of relationship should also be considered in the future; it will help to associate aspects with products, sentiments with aspects, and reasons with sentiments. To improve the performance and attempt near real-time analysis, Gibbs sampling should be replaced with a lighter inference technique.

Disclosure

This research is the authors' work, not published or under review elsewhere.

Competing Interests

The authors declare that they have no competing interests.

Authors' Contributions

In this research work, Muhammad Taimoor Khan (corresponding author) developed the model, while Mehr Durrani helped with adjusting the knowledge learning and transfer mechanism to better adapt to the current task. Furqan Aziz contributed by incorporating the knowledge maintenance module for high efficiency and reduced data dependency. Shehzad Khalid supervised the activity, contributed by running experiments and compiling results, and helped with introducing the new knowledge features and with prioritizing them.

Acknowledgments

The authors are thankful to Mr. Yousaf Marwat, Lecturer at COMSATS IIT, Attock, for facilitating editing and proofreading services.

References

[1] S. Branavan, H. Chen, J. Eisenstein, and R. Barzilay, "Learning document-level semantic properties from free-text annotations," Journal of Artificial Intelligence Research, vol. 34, pp. 569–603, 2009.

[2] S. Brody and N. Elhadad, "An unsupervised aspect-sentiment model for online reviews," in Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '10), pp. 804–812, Association for Computational Linguistics, Los Angeles, Calif, USA, June 2010.

[3] F. Li, M. Huang, and X. Zhu, "Sentiment analysis with global topics and local dependency," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI '10), Atlanta, Ga, USA, July 2010.

[4] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai, "Topic sentiment mixture: modeling facets and opinions in weblogs," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 171–180, ACM, Alberta, Canada, May 2007.

[5] I. Titov and R. McDonald, "Modeling online reviews with multi-grain topic models," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 111–120, ACM, Beijing, China, April 2008.

[6] Y. Lu, C. X. Zhai, and N. Sundaresan, "Rated aspect summarization of short comments," in Proceedings of the 18th International World Wide Web Conference (WWW '09), pp. 131–140, ACM, April 2009.

[7] A. Mukherjee and B. Liu, "Aspect extraction through semi-supervised modeling," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL '12), vol. 1, pp. 339–348, Association for Computational Linguistics, July 2012.

[8] I. Titov and R. T. McDonald, "A joint model of text and aspect ratings for sentiment summarization," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, vol. 8, pp. 308–316, Association for Computational Linguistics, Columbus, Ohio, USA, 2008.

[9] H. Wang, Y. Lu, and C. Zhai, "Latent aspect rating analysis on review text data: a rating regression approach," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), pp. 783–792, ACM, July 2010.

[10] T. L. Griffiths, M. Steyvers, D. M. Blei, and J. B. Tenenbaum, "Integrating topics and syntax," in Advances in Neural Information Processing Systems, pp. 537–544, MIT Press, Cambridge, Mass, USA, 2004.

[11] Y. Jo and A. H. Oh, "Aspect and sentiment unification model for online review analysis," in Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM '11), pp. 815–824, ACM, Hong Kong, February 2011.

[12] J. Liu, Y. Cao, C. Y. Lin, Y. Huang, and M. Zhou, "Low-quality product review detection in opinion summarization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 334–342, Prague, Czech Republic, June 2007.

[13] Y. Lu and C. Zhai, "Opinion integration through semi-supervised topic modeling," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 121–130, ACM, Beijing, China, April 2008.

[14] C. Sauper, A. Haghighi, and R. Barzilay, "Content models with attitude," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, vol. 1 of Human Language Technologies, pp. 350–358, Association for Computational Linguistics, 2011.

[15] W. X. Zhao, J. Jiang, H. Yan, and X. Li, "Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 56–65, Association for Computational Linguistics, October 2010.

[16] D. Andrzejewski, X. Zhu, and M. Craven, "Incorporating domain knowledge into topic modeling via Dirichlet Forest priors," in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 25–32, ACM, June 2009.

[17] D. Andrzejewski, X. Zhu, M. Craven, and B. Recht, "A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic," in Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI '11), pp. 1171–1177, Barcelona, Spain, July 2011.

[18] Z. Chen, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh, "Discovering coherent topics using general knowledge," in Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM '13), pp. 209–218, ACM, November 2013.

[19] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell, "Toward an architecture for never-ending language learning," in Proceedings of the 24th AAAI Conference on Artificial Intelligence, vol. 5, p. 3, Atlanta, Ga, USA, July 2010.

[20] Z. Chen and B. Liu, "Mining topics in documents: standing on the shoulders of big data," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pp. 1116–1125, ACM, New York, NY, USA, August 2014.

[21] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum, "Optimizing semantic coherence in topic models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272, Association for Computational Linguistics, July 2011.

[22] Z. Chen, A. Mukherjee, and B. Liu, "Aspect extraction with automated prior knowledge learning," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL '14), pp. 347–358, Baltimore, Md, USA, June 2014.

[23] G. Bouma, "Normalized (pointwise) mutual information in collocation extraction," in Proceedings of the Conference of the German Society for Computational Linguistics (GSCL '09), pp. 31–40, Potsdam, Germany, 2009.

[24] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," The Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993–1022, 2003.

[25] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 50–57, ACM, Berkeley, Calif, USA, August 1999.

[26] E. Eaton and P. L. Ruvolo, "ELLA: an efficient lifelong learning algorithm," in Proceedings of the 30th International Conference on Machine Learning (ICML '13), pp. 507–515, 2013.

[27] E. Kamar, A. Kapoor, and E. Horvitz, "Lifelong learning for acquiring the wisdom of the crowd," in Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI '13), pp. 2313–2320, AAAI Press, Beijing, China, August 2013.

[28] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, "Self-taught learning: transfer learning from unlabeled data," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 759–766, ACM, June 2007.

[29] D. L. Silver, Q. Yang, and L. Li, "Lifelong machine learning systems: beyond learning algorithms," in Proceedings of the AAAI Spring Symposium: Lifelong Machine Learning, Stanford University, March 2013.

[30] Z. Chen and B. Liu, "Topic modeling using topics from many domains, lifelong learning and big data," in Proceedings of the 31st International Conference on Machine Learning (ICML '14), pp. 703–711, Beijing, China, June 2014.


Page 4: Research Article Online Knowledge-Based Model for Big Data ...downloads.hindawi.com/journals/cin/2016/6081804.pdf · Research Article Online Knowledge-Based Model for Big Data Topic

4 Computational Intelligence and Neuroscience

OAMCModel (Dt V N knowledge-base)(1) procedure onlineAMCModel (119863119905 119881119873 knowledge-base)(2) relRuleslarr sampleRelevantKnowledge (119881 Knowledge-base)(3) if (relRules = 0) then without knowledge(4) topicslarr LDA (119863119905 119881119873 0)(5) end if(6) if (relRules gt 0) then with relevant knowledge(7) topicslarr GibbsSampler (119863119905 119881119873 relRules)(8) end if(9) rulesrefined larr refineRules (119863119905 119881 topics relRules knowledge-base)(10) rulesnew larrmineRules (119863119905 topics)(11) knowledge-baselarr rulesrefined cup rulesnew(12) end procedure

Algorithm 1 OAMCmodel

M DocsDomain 1

M DocsDomain 2

M Docs

LDA OAMCOAMCTopics Topics Topics

Knowledge-base

Learning module Learning module Learning module

Domain N

Figure 1 Proposedmodel (OAMC) using online knowledge extrac-tion and transfer mechanism

The OAMC model can operate both with and withoutknowledge In case there is no relevant knowledge availablethe model operates as baseline LDA However as the modelmatures its knowledge-base grows to support a variety offuture tasks With relevant knowledge the model showsimproved performance by grouping semantically relatedwords under topics with the help of knowledge learnt fromprevious tasks The working of OAMC model is shownin Algorithm 1 The set of documents in current domainare 119863

119905= 119863

1 1198632 119863

119899 119881 is the vocabulary of the

current task while 119873 is the number of iterations for Gibbssampler [13] used as inference technique Knowledge-basehas the knowledge rules as must-links and cannot-links Theknowledge-base has a variety of must-links and cannot-linkslearnt through experience In step (2) only those must-linksand cannot-links are selected that are relevant to the currenttask that is both the words in must-link and cannot-linkexist in the vocabulary of the current domain The rulesrelevant to the current domain are stored as relRules If norelevant rules are found the current task is processed withbaseline LDA as shown in step (4) In step (4) 0means that norelevant knowledge is found for the current task and therefore

the baseline LDA model is to be used for the current taskIt may happen at low experience when the knowledge-basehas limited rules and the current task is about a completelydifferent subject When relevant rules are found step (7)has Gibbs sampler [20] used for incorporating knowledge toguide the inferenceTheknowledge-basemaintains itself aftereach task to be consistent Step (9) refines relRules to growtheir confidence and utility as a token of their services forthe current domain All rules in the knowledge-base get atask older As part of the continuous learning process themodel harvests new knowledge rules after processing eachtask in step (10) The new rules are merged with the existingrules and are added to the knowledge-base which is ready toprocess a new task

41 Knowledge Representation Additional features are addedto knowledge rules to help maintain the knowledge-baseand keep it consistent by continuously weeding out wrongor irrelevant knowledge In previous models the knowledgerules are maintained as separate lists of must-links andcannot-links each with a word pair and a confidence valueOur OAMC model stores them together as knowledge rulesirrespective of their orientation However the orientationis communicated through a positive or negative sign usedwith the confidence value New rule features are added asutility freshness and background to strictly monitor thecontribution of each rule To survive longer a knowledgerule kRule(119908

1 1199082) needs to maintain high utility in the tasks

that follows The confidence of a rule is also expected to beassured by future domains which adds to the net confidenceThe utility and freshness of a rule maintain a check on itscontribution and relevance A rule with high confidence butlow usage and freshness is also expected to be filtered Thesenewly added features keep the knowledge-base consistentwith quality knowledge The background feature helps toresolve word sense disambiguation by providing the contextof the knowledge rule A knowledge rule in the knowledge-base is stored as shown in Figure 2

42 Knowledge Extraction The proposed model has accessto only limited big data for a task that is domain of the

Computational Intelligence and Neuroscience 5

MineRules (Dt N topics)(1) procedureMineRules (119863119905119873 topics)(2) for (t all topics) do(3) for (i j all word pairs in a topic) do(4) if (nPMI(i j) lt 1205831) then(5) Rules

119895larr Rules (119894 119895 nPMI(119894 119895) topic) learn as negative rule

(6) end if(7) if (nPMI(i j) gt 1205832) then(8) Rules

119895larr Rule (119894 119895 nPMI(119894 119895) topic) learn as positive rule

(9) end if(10) end for(11) end for(12) end procedure

Algorithm 2 Mine rules

Knowledge-base

Background

Orientation

Freshness

Utility

Confidence

KRule(w1 w2)

Figure 2 Representation of a knowledge rule as word pair inknowledge-base Utility freshness and background are the newlyintroduced knowledge features to monitor their quality

current task Therefore the model does not have enoughinstances to verify the quality of knowledge and is expectedto learn low quality knowledge as well A knowledge rulecan be of low quality if it has enough support in data fromwhere it was learnt but is not supported by future tasks Itmayhappen due to noise or high bias in a domainThe refinementtechnique deals with it until it is removed In order to extractnew knowledge rules kRule(119908

1 1199082) normalized PMI [23] as

nPMI(1199081 1199082) is used as given in

kRule (1199081 1199082) = nPMI (119908

1 1199082) =

PMI (1199081 1199082)

log119901 (1199081 1199082) (1)

where nPMI gives a value [minus1 1] range All candidate wordpairs are evaluated through nPMI For a word pair if the scoreis close to 1 the knowledge rule (must-link) is stored withits nPMI score as confidence Word pair having it close to minus1is stored as a negative rule (cannot-link) with nPMI score asconfidence PMI in (1) can be evaluated as

PMI (1199081 1199082) = log

119901 (1199081 1199082)

119901 (1199081) 119901 (119908

2) (2)

where 119901(1199081 1199082) is the coexistence probability of 119908

1and 119908

2

while 119901(1199081) and 119901(119908

2) are probabilities of their existence in

the current domain Coexistence is the presence of two wordsin a conceptual document irrespective of their frequencyCalculating coexistence probability and individual probabil-ities is shown in (3) where119863119905 show all documents while 119863119905is total number of documents

119901 (1199081 1199082) =

119863119905 (1199081 1199082)

119863119905

119901 (119908) =119863119905 (119908)119863119905

(3)

Freshness and utility are initialized while topic of the wordpair is added to the background of the rule The twothresholds 120583

1and 120583

2used are set to harvest word pairs with

nPMI scores at either of the two extremes given in

nPMI (1199081 1199082) =

⟨1205831or⟩ 1205832

learn

else ignore(4)

When nPMI gives a value that is too high that is closeto 1 or too low that is close to minus1 then it is consideredto be of significance This untypical behavior is learnt asknowledge while the values close to zero are ignored asthey have not shown any considerable corelation Word pairshaving nPMI below 120583

1are learnt as knowledge (cannot-links)

while the values above 1205832are also learnt (must-links) The

threshold values are mentioned in Setup Our contributionin knowledge extraction is to use nPMI(119908

1 1199082) as a very

lightweight extractionmechanism Secondly bothmust-linksand cannot-links are extracted together in a single stepMS-FIM was previously used to harvest knowledge whichwas causing a performance bottleneck The rule miningmechanism is mentioned in Algorithm 2 It generates wordpairs within each topic and evaluates them for possibleknowledge from step (2) to step (11) Satisfying either of thethresholds in step (4) and step (7) the word pair is mined aspositive or negative knowledge rule

43 Knowledge Transfer The knowledge transfer mechanismselects relevant knowledge rules each time a word is sampled

6 Computational Intelligence and Neuroscience

to have its probability increased under a topic The effectof relevant knowledge rules is added as bias into the topicmodel to push up positively corelated words under the sametopic and vice versa Since the current task has limited dataknowledge from previous tasks is incorporated to guide theinference For a sampled word 119908

1in topic 119905 if a rule exists

for it then the associated word 1199082has probability updated in

current topic 119905 as shown in

119901 (rule119894= rule | 119905) prop 119901 (119908

1| 119905) times 119901 (119908

2| 119905) (5)

where 1199081and 119908

2are the two words of a rule rule

119894(must-

link or cannot-link)When the model inference increases theprobability of a word 119908

1for a topic 119905 the probability of 119908

2

is increased for topic 119905 in case of must-link and decreasedfor cannot-link To address word sense disambiguation thefollowing equation is proposed

(119881 cup ruleBackground)ruleBackground

gt 120585 (6)

Here 120585 is the threshold that has to be satisfied for overlapbetween the vocabulary of the current task and the rulebackgroundThe rule background represents the set of wordsin topic from which the rule was learnt Comparing therule background with the current vocabulary is to ensurethat the rule is used in the same context in which it wasmined For example the word Bank has different contextswhen used with Accounts and Sea respectively Topics of allprocessed tasks were previously used for the same purposeThe net confidence and utility of a knowledge rule are usedto transfer the impact of knowledge rule in topic samplingMultigeneralized Polya Urn (M-GPU) model [20] is used totransfer the effect of knowledge into the current status ofMarkov chain in Gibbs sampler in

119901 (119908 | 119905) propsum119881

1199081015840=1(1205921199081199081015840120588) times 119899

1199051199081015840 + 120573

sum119881

V=1 (sum119881

1199081015840=1(120592V1199081015840120588) times 1198991199051199081015840 + 120573)

(7)

Equation (7) shows the amount of update in the probability bya factor 120592

1199081199081015840120588 for the other word in knowledge rule When

the probability of a sampled word is increased for a topicusing the rules the probability of all thewords associatedwiththe sampled word is also increasedThe following equation isupdated to incorporate both positive and negatively corelatedrules in a single step as

119901 (119911119894= 119905 | 119911

minus119894 119908 120572 120573 120582)

prop119899119889119905

minus119894+ 120572

sum119879

119905=1(119899119889119905

minus119894+ 120572)

times

sum1199081015840119908119894isinrule (1205921199081015840 119908119894120588) times 1198991199051199081015840

minus119894+ 120573

sum119881

V=1 (sum1199081015840 Visinrule (1205921199081015840 V120588) times 1198991198961199081015840minus119894+ 120573)

(8)

For word sampled 119908 the other word 1199081015840 sharing the same

rule with it gets a push as strong as the net confidence in thedirection of its orientation The objective is to increase theprobability of positively corelated words to appear togetheras top words under the same topic and vice versa

Table 1 OAMC positive and negative knowledge rules with highutility

Type Knowledge

+ive

(tech support) (cable usb) (pro con) (customersupport) (batter charge) (video card) (high long)(connection vga) (operating system) (monitormacbook) (mac support) (cd dvd) (windows xp)(money worth) (inch picture) (input output)(friend mine) (big deal)

minusive

(sound money) (feature battery) (battery price)(feature battery) (price device) (screen sound)(review screen) (worse digital) (price easy) (gasscreen) (signal easy) (home gps) (hotel traffic)(driver cooler) (monitor processor) (monitorspeed)

44 KnowledgeMaintenance andRetention Supportedwith aknowledgemaintenance and retentionmodule useful knowl-edge rules are stored until they justify their existence Theknowledge-base is refined after each task to keep it consistentThey all get older by a taskThe relevant knowledge rules growin confidence supported by current domain and have higherutility After updating the state of each knowledge rule theyare passed through a threshold again to justify their existenceas proposed in

nPMI (1199081 1199082)

=

120592 times 120588

120582gt 1205831or

120592 times 120588

120582lt 1205832

remove

otherwise retain

(9)

where 120592 is net confidence 120588 is utility and 1120582 is freshness fora knowledge rule If the confidence of a knowledge rule incurrent task is above its net confidence 120592 the current domainis allowed to update the knowledge background to betterrepresent its context as proposed in

kRule (1199081 1199082) =

conf gt 120592 update background

otherwise unchanged(10)

where background is updated by adding words from sup-porting topic in the current task New rules are minedwhile existing rules are evolved as a continuous processThe newly added knowledge features are used to rank andprioritize knowledge rules The difference in rank of goodand bad knowledge rules widens as the model matures withexperience Algorithm 3 is used to refine existing knowledgewith step (2) having relevant knowledge as Rule1015840 In steps (4)to (7) all relevant rules get their net confidence updated andutility incremented Step (8) has all rules aged by a task Someof the popular rules from knowledge-base of OAMC modelare shown in Table 1

5 Experimental Results

The proposed model (OAMC) contributes to the applicationof lifelong learning forNLP tasks specifically used to improve

Computational Intelligence and Neuroscience 7

RefineRules (Dt N topics V Rules)(1) procedure RefineRules (119863119905119873 topics 119881 Rules)(2) Rules1015840 larr relevantRules (119881 knowledge-base) set of relevant rules(3) for (r all Rules) do(4) if (Rules1015840 contains r) do(5) 120592 larr nPMI(119903) add to cumulative confidence(6) 120588 larr +1 Rules utility increased by 1(7) end if(8) end for(12) end procedure

Algorithm 3 Refine rules

topic coherence for big streaming data Various experimentsare performed to highlight the improvement in accuracyand performance of lifelong topic mining as product aspectsUnlike traditional ML models it operates on big data with avariety of domains The standard dataset used by previousLML models is used to evaluate the proposed model incomparison to the other state-of-the-art models The dataconsists of 50 domains each of electronic products To high-light the improvement in the OAMC model the dataset wasprovided with decreasing number of domains The modelsare evaluated in topic coherence for accuracy and time inminutes for performance The number of knowledge rules isused as quality of knowledge where fewer rules producingbetter results are considered to be of better quality

51 Setup To set up the test environment an HP PavilionDM4 machine is used with Intel 24GHz Core i5 having8GB RAM and 500GB hard drive Following the evaluationapproach in existing models 15 topics are extracted perdomain for all models while as in previous models top 30words per topic are considered to calculate topic coherenceof the proposedmodel alsoThe hyperparameters are kept thesame for all models The number of Gibbs sampler iterationsis kept the same (2000) as used in previous models Themodels are configured and initialized as used in their originalwork The proposed OAMC has its thresholds 120583

1and 120583

2

set to 08 and minus09 respectively to learn knowledge andselect relevant knowledge for a given task Since negativecorelations are too many as compared to positive rules thethreshold is set to focus more on positive rules In orderto choose relevant rule the overlap threshold is set to 03between the vocabulary and knowledge background

52 Results The accuracy of lifelong models and topicmodels in general is evaluated through topic coherence It hashigh corelation with human evaluation All of the models aregiven the same number of documents in a domain for bothlearning and evaluation The existing models drop accuracywhen it has to learn from a smaller dataset with documentsin hundreds per domain while compromising accuracy whenevaluated on larger dataset with thousands of documentsTherefore the previous models were given large datasets tolearn from and were evaluated on smaller datasets However

50 40 30 20 10 8 6 4 2 1Number of domains available

Topic coherence with different number of domains in dataset

LDALTMAMC-M

AMCOAMC

minus700

minus750

minus800

minus850

minus900

minus950

minus1000

Topi

c coh

eren

ce

Figure 3The proposed model (OAMC) produces highest accuracy(as topic coherence) among lifelong learning models for streamingbig data

for fairness to all models they are provided with the test setonly to learn from and to test on

Topic coherence of the models compared has interestingoutcome LTM surprisingly could not maintain a better topiccoherence when it had to learn from a smaller dataset It evendropped below the baseline AMC and AMC-M performedwell however the proposed OAMC beat them when anumber of domains in the dataset were dropped below 10 asshown in Figure 3 AMC and AMC-M hardly learn anythingwith fewer domains which makes them ineffective towardsstreaming data Therefore their accuracy continuously wentdown towards baseline On the other hand LTM has itslearning pattern badly affected by limited variety of domainsand learnt wrong knowledge which dropped its accuracyThe knowledge rules learnt by AMC and AMC-M were100 and below per domain for less than 10 domains in thedataset which grow up to more than 2000 for 50 domainsIt is because of the overreliance of the existing models onlarge volumes of data to be available at once which makesthem least effective towards streaming data It shows thatthey only learn better knowledge and produce better resultswhen there are many domains in the dataset and they areall available prior to analysis With a robust knowledge

8 Computational Intelligence and Neuroscience

LTM AMC-M AMC OAMCLML models

Performance

020406080

100120140

Tim

e (m

in)

Figure 4 The proposed model (OAMC) has the highest perfor-mance efficiency among LML topic models

storing mechanism OAMC learnt more knowledge refinedit and used it to improve topic coherence with few domainsavailable The OAMC model shows on average 7 with amaximum of 10 improvement in topic coherence whenthere is one domain at a time

OAMC learns wrong knowledge too as it considersonly the current domain for harvesting it However with arobust knowledge retention module after performing fewtasks only good quality knowledge survives through thefiltering mechanism Knowledge rules with less confidencefrom future tasks or limited utility are pushed down andfinally filtered out Knowledge that is getting confidencereenforced from future tasks and has growing utility stayslonger To evaluate relevance of a knowledge rule with thecurrent task the domain of the current task must have therule words confidence above threshold and overlap with therule background higher than threshold The online OAMCmodels truly learn like humans having no idea of future tasksunlike existing models Its learning is dependent on the pastexperience as the domains processed only and uses it for afuture task It is less dependent on data parses it once andconsiders the knowledge module with abstract rules only tosample relevant rules and therefore shows improvement of upto 50 in performance as shown in Figure 4

For the discussed intrinsic evaluation the models arealso compared through PMI mean and median values thathad relevance with topic coherence values The extrinsicWordNet based techniques used in the literature (WuPalmerResnik Path Lin LeacockChodorow Lesk JiangConrath andHirstStOnge) were inconclusive to differentiate among thequalities of topics extracted due to the specialized nature ofelectronic products domains

53 Evaluation OAMC improving performance of AMCby 50 as shown in Figure 4 is attributed to the followingreasons OAMC does not extract relevant knowledge for eachtask and rather chooses from a well maintained knowledgemodule Each domain is parsed only once With a knowledgefiltering mechanism only good quality knowledge rules aremaintainedThere are generally two performance bottlenecksin existing LML models that is FIM for knowledge extrac-tion and higher iterations of Gibbs sampling The OAMCmodel uses nPMI [23] which is highly efficient to extract

knowledge Secondly it learns both must-links and cannot-links together with a single mechanism as rules Even with50 domains in the dataset OAMC performed better thanAMC for certain individual domains by producing highertopic coherence while using fewer knowledge rules as shownin Figure 5 Table 1 shows some of the most frequently usedknowledge rules of OAMC Through manual evaluation oftopics from all the domains it was observed that the knowl-edge learnt by LMLmodels is more effective towards randomand intruded types of topic incoherence but less effectivetowards chained and unbalanced type of incoherence In factthe knowledge-based models can at times enhance chainedtopic incoherence if there are many knowledge rule pairssharing common words

5.4. Knowledge Analysis. The accuracy of LML models depends upon the quality of their knowledge. The previous models did not provide a discussion on the quality of knowledge; instead, improved topic coherence was taken as evidence of good quality knowledge. The knowledge rules shown in Table 1 have utility above 10 as a token of their quality. The ratio of knowledge rules used per domain by AMC and OAMC is 5:1, signifying the quality of the knowledge produced by OAMC. The net confidence of the rules also shows that most of the relevant domains enforced a rule and increased its confidence (above 0.8, with 1 as the maximum). Since this information is not available for the existing models, the values cannot be compared. It is still very difficult to isolate the impact of each knowledge rule; however, the discussed features in general keep track of good quality knowledge. The knowledge learnt is of two types only, capturing positive and negative word correlations. More varieties of knowledge can be introduced for further insights: by introducing "the type of relationship" as knowledge, the unbalanced topic incoherence problem can be addressed, and along with the correlation between words, the nature of their correlation may also be explored for better analysis. This type of knowledge is also useful for finding hierarchical topics. The knowledge rules extracted are isolated, whereas knowledge as a complex network can be more effective for a variety of analyses; AMC uses only a small graph to associate must-links sharing a common word in the same context. Knowledge extraction, transfer, representation, and retention were introduced to NLP tasks only in 2010 and still require mature ideas from other fields of research.

6. Conclusion

LML models are structured to process big data spanning many domains and are therefore the only suitable solution despite their shortcomings. LML models have only recently been introduced to big data topic extraction and are far from mature. The existing LML topic models had limited application due to following a batch processing approach and overrelying on data. Our proposed model, named OAMC, follows an online approach to LML topic modeling that can support streaming data. With an efficient learning mechanism, our OAMC model has lower dependency on data and therefore improved performance by 50% as compared to the state-of-the-art LML models.

Figure 5: The proposed model (OAMC) in comparison to the state-of-the-art LML model (AMC) on individual domains (AlarmClock, Battery, CDPlayer, Fan, Lamp): (a) OAMC produces better accuracy as topic coherence, while (b) using fewer knowledge rules.

Velocity is an important feature of big data, which is expected to arrive continuously and at high speed. The OAMC model is designed to process only the data at hand and therefore outperforms the existing models on streaming data. As the data is expected to contain noise, OAMC is supported with a strict monitoring and filtering mechanism to maintain a consistent knowledge-base: the rules are made to compete for survival based on utility, freshness, and net confidence. OAMC, being blind to the future, experiences larger bumps in accuracy when the current task has low relevance to past experience; this is natural for a human-like learning approach. However, the proportion of tasks for which limited relevant knowledge is available drops considerably with experience. The OAMC model shows a 7% improvement in accuracy on average for streaming big data, and up to 10% for certain individual domains, as compared to the state of the art.

In the future we will work on introducing more varieties of knowledge. Through complex network analysis, the knowledge rules can be stored in association with one another. Cooccurrence can mean many things, so the relationship type should also be considered in future work; it will help to associate aspects with products, sentiments with aspects, and reasons with sentiments. To improve performance further and attempt near real-time analysis, Gibbs sampling should be replaced with a lighter inference technique.

Disclosure

This research is the authors' own work and has not been published or submitted for review elsewhere.

Competing Interests

The authors declare that they have no competing interests.

Authors' Contributions

In this research work, Muhammad Taimoor Khan (corresponding author) developed the model, while Mehr Durrani helped with adjusting the knowledge learning and transfer mechanism to better adapt to the current task. Furqan Aziz contributed by incorporating the knowledge maintenance module for high efficiency and reduced data dependency. Shehzad Khalid supervised the activity, contributed through running experiments and compiling results, and helped with introducing the new knowledge features and prioritizing them.

Acknowledgments

The authors are thankful to Mr. Yousaf Marwat, Lecturer at COMSATS IIT, Attock, for facilitating editing and proofreading services.

References

[1] S. Branavan, H. Chen, J. Eisenstein, and R. Barzilay, "Learning document-level semantic properties from free-text annotations," Journal of Artificial Intelligence Research, vol. 34, pp. 569–603, 2009.
[2] S. Brody and N. Elhadad, "An unsupervised aspect-sentiment model for online reviews," in Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '10), pp. 804–812, Association for Computational Linguistics, Los Angeles, Calif, USA, June 2010.
[3] F. Li, M. Huang, and X. Zhu, "Sentiment analysis with global topics and local dependency," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI '10), Atlanta, Ga, USA, July 2010.
[4] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai, "Topic sentiment mixture: modeling facets and opinions in weblogs," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 171–180, ACM, Alberta, Canada, May 2007.
[5] I. Titov and R. McDonald, "Modeling online reviews with multi-grain topic models," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 111–120, ACM, Beijing, China, April 2008.
[6] Y. Lu, C. X. Zhai, and N. Sundaresan, "Rated aspect summarization of short comments," in Proceedings of the 18th International World Wide Web Conference (WWW '09), pp. 131–140, ACM, April 2009.
[7] A. Mukherjee and B. Liu, "Aspect extraction through semi-supervised modeling," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL '12), vol. 1, pp. 339–348, Association for Computational Linguistics, July 2012.
[8] I. Titov and R. T. McDonald, "A joint model of text and aspect ratings for sentiment summarization," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, vol. 8, pp. 308–316, Association for Computational Linguistics, Columbus, Ohio, USA, 2008.
[9] H. Wang, Y. Lu, and C. Zhai, "Latent aspect rating analysis on review text data: a rating regression approach," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), pp. 783–792, ACM, July 2010.
[10] T. L. Griffiths, M. Steyvers, D. M. Blei, and J. B. Tenenbaum, "Integrating topics and syntax," in Advances in Neural Information Processing Systems, pp. 537–544, MIT Press, Cambridge, Mass, USA, 2004.
[11] Y. Jo and A. H. Oh, "Aspect and sentiment unification model for online review analysis," in Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM '11), pp. 815–824, ACM, Hong Kong, February 2011.
[12] J. Liu, Y. Cao, C. Y. Lin, Y. Huang, and M. Zhou, "Low-quality product review detection in opinion summarization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 334–342, Prague, Czech Republic, June 2007.
[13] Y. Lu and C. Zhai, "Opinion integration through semi-supervised topic modeling," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 121–130, ACM, Beijing, China, April 2008.
[14] C. Sauper, A. Haghighi, and R. Barzilay, "Content models with attitude," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, vol. 1 of Human Language Technologies, pp. 350–358, Association for Computational Linguistics, 2011.
[15] W. X. Zhao, J. Jiang, H. Yan, and X. Li, "Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 56–65, Association for Computational Linguistics, October 2010.
[16] D. Andrzejewski, X. Zhu, and M. Craven, "Incorporating domain knowledge into topic modeling via Dirichlet Forest priors," in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 25–32, ACM, June 2009.
[17] D. Andrzejewski, X. Zhu, M. Craven, and B. Recht, "A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic," in Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI '11), pp. 1171–1177, Barcelona, Spain, July 2011.
[18] Z. Chen, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh, "Discovering coherent topics using general knowledge," in Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM '13), pp. 209–218, ACM, November 2013.
[19] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell, "Toward an architecture for never-ending language learning," in Proceedings of the 24th AAAI Conference on Artificial Intelligence, vol. 5, p. 3, Atlanta, Ga, USA, July 2010.
[20] Z. Chen and B. Liu, "Mining topics in documents: standing on the shoulders of big data," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pp. 1116–1125, ACM, New York, NY, USA, August 2014.
[21] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum, "Optimizing semantic coherence in topic models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272, Association for Computational Linguistics, July 2011.
[22] Z. Chen, A. Mukherjee, and B. Liu, "Aspect extraction with automated prior knowledge learning," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL '14), pp. 347–358, Baltimore, Md, USA, June 2014.
[23] G. Bouma, "Normalized (pointwise) mutual information in collocation extraction," in Proceedings of the Conference of the German Society for Computational Linguistics (GSCL '09), pp. 31–40, Potsdam, Germany, 2009.
[24] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," The Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993–1022, 2003.
[25] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 50–57, ACM, Berkeley, Calif, USA, August 1999.
[26] E. Eaton and P. L. Ruvolo, "ELLA: an efficient lifelong learning algorithm," in Proceedings of the 30th International Conference on Machine Learning (ICML '13), pp. 507–515, 2013.
[27] E. Kamar, A. Kapoor, and E. Horvitz, "Lifelong learning for acquiring the wisdom of the crowd," in Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI '13), pp. 2313–2320, AAAI Press, Beijing, China, August 2013.
[28] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, "Self-taught learning: transfer learning from unlabeled data," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 759–766, ACM, June 2007.
[29] D. L. Silver, Q. Yang, and L. Li, "Lifelong machine learning systems: beyond learning algorithms," in Proceedings of the AAAI Spring Symposium: Lifelong Machine Learning, Stanford University, March 2013.
[30] Z. Chen and B. Liu, "Topic modeling using topics from many domains, lifelong learning and big data," in Proceedings of the 31st International Conference on Machine Learning (ICML '14), pp. 703–711, Beijing, China, June 2014.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article Online Knowledge-Based Model for Big Data ...downloads.hindawi.com/journals/cin/2016/6081804.pdf · Research Article Online Knowledge-Based Model for Big Data Topic

Computational Intelligence and Neuroscience 5

MineRules (Dt N topics)(1) procedureMineRules (119863119905119873 topics)(2) for (t all topics) do(3) for (i j all word pairs in a topic) do(4) if (nPMI(i j) lt 1205831) then(5) Rules

119895larr Rules (119894 119895 nPMI(119894 119895) topic) learn as negative rule

(6) end if(7) if (nPMI(i j) gt 1205832) then(8) Rules

119895larr Rule (119894 119895 nPMI(119894 119895) topic) learn as positive rule

(9) end if(10) end for(11) end for(12) end procedure

Algorithm 2 Mine rules

Knowledge-base

Background

Orientation

Freshness

Utility

Confidence

KRule(w1 w2)

Figure 2 Representation of a knowledge rule as word pair inknowledge-base Utility freshness and background are the newlyintroduced knowledge features to monitor their quality

current task Therefore the model does not have enoughinstances to verify the quality of knowledge and is expectedto learn low quality knowledge as well A knowledge rulecan be of low quality if it has enough support in data fromwhere it was learnt but is not supported by future tasks Itmayhappen due to noise or high bias in a domainThe refinementtechnique deals with it until it is removed In order to extractnew knowledge rules kRule(119908

1 1199082) normalized PMI [23] as

nPMI(1199081 1199082) is used as given in

kRule (1199081 1199082) = nPMI (119908

1 1199082) =

PMI (1199081 1199082)

log119901 (1199081 1199082) (1)

where nPMI gives a value [minus1 1] range All candidate wordpairs are evaluated through nPMI For a word pair if the scoreis close to 1 the knowledge rule (must-link) is stored withits nPMI score as confidence Word pair having it close to minus1is stored as a negative rule (cannot-link) with nPMI score asconfidence PMI in (1) can be evaluated as

PMI (1199081 1199082) = log

119901 (1199081 1199082)

119901 (1199081) 119901 (119908

2) (2)

where 119901(1199081 1199082) is the coexistence probability of 119908

1and 119908

2

while 119901(1199081) and 119901(119908

2) are probabilities of their existence in

the current domain Coexistence is the presence of two wordsin a conceptual document irrespective of their frequencyCalculating coexistence probability and individual probabil-ities is shown in (3) where119863119905 show all documents while 119863119905is total number of documents

119901 (1199081 1199082) =

119863119905 (1199081 1199082)

119863119905

119901 (119908) =119863119905 (119908)119863119905

(3)

Freshness and utility are initialized while topic of the wordpair is added to the background of the rule The twothresholds 120583

1and 120583

2used are set to harvest word pairs with

nPMI scores at either of the two extremes given in

nPMI (1199081 1199082) =

⟨1205831or⟩ 1205832

learn

else ignore(4)

When nPMI gives a value that is too high that is closeto 1 or too low that is close to minus1 then it is consideredto be of significance This untypical behavior is learnt asknowledge while the values close to zero are ignored asthey have not shown any considerable corelation Word pairshaving nPMI below 120583

1are learnt as knowledge (cannot-links)

while the values above 1205832are also learnt (must-links) The

threshold values are mentioned in Setup Our contributionin knowledge extraction is to use nPMI(119908

1 1199082) as a very

lightweight extractionmechanism Secondly bothmust-linksand cannot-links are extracted together in a single stepMS-FIM was previously used to harvest knowledge whichwas causing a performance bottleneck The rule miningmechanism is mentioned in Algorithm 2 It generates wordpairs within each topic and evaluates them for possibleknowledge from step (2) to step (11) Satisfying either of thethresholds in step (4) and step (7) the word pair is mined aspositive or negative knowledge rule

43 Knowledge Transfer The knowledge transfer mechanismselects relevant knowledge rules each time a word is sampled

6 Computational Intelligence and Neuroscience

to have its probability increased under a topic The effectof relevant knowledge rules is added as bias into the topicmodel to push up positively corelated words under the sametopic and vice versa Since the current task has limited dataknowledge from previous tasks is incorporated to guide theinference For a sampled word 119908

1in topic 119905 if a rule exists

for it then the associated word 1199082has probability updated in

current topic 119905 as shown in

119901 (rule119894= rule | 119905) prop 119901 (119908

1| 119905) times 119901 (119908

2| 119905) (5)

where 1199081and 119908

2are the two words of a rule rule

119894(must-

link or cannot-link)When the model inference increases theprobability of a word 119908

1for a topic 119905 the probability of 119908

2

is increased for topic 119905 in case of must-link and decreasedfor cannot-link To address word sense disambiguation thefollowing equation is proposed

(119881 cup ruleBackground)ruleBackground

gt 120585 (6)

Here 120585 is the threshold that has to be satisfied for overlapbetween the vocabulary of the current task and the rulebackgroundThe rule background represents the set of wordsin topic from which the rule was learnt Comparing therule background with the current vocabulary is to ensurethat the rule is used in the same context in which it wasmined For example the word Bank has different contextswhen used with Accounts and Sea respectively Topics of allprocessed tasks were previously used for the same purposeThe net confidence and utility of a knowledge rule are usedto transfer the impact of knowledge rule in topic samplingMultigeneralized Polya Urn (M-GPU) model [20] is used totransfer the effect of knowledge into the current status ofMarkov chain in Gibbs sampler in

119901 (119908 | 119905) propsum119881

1199081015840=1(1205921199081199081015840120588) times 119899

1199051199081015840 + 120573

sum119881

V=1 (sum119881

1199081015840=1(120592V1199081015840120588) times 1198991199051199081015840 + 120573)

(7)

Equation (7) shows the amount of update in the probability bya factor 120592

1199081199081015840120588 for the other word in knowledge rule When

the probability of a sampled word is increased for a topicusing the rules the probability of all thewords associatedwiththe sampled word is also increasedThe following equation isupdated to incorporate both positive and negatively corelatedrules in a single step as

119901 (119911119894= 119905 | 119911

minus119894 119908 120572 120573 120582)

prop119899119889119905

minus119894+ 120572

sum119879

119905=1(119899119889119905

minus119894+ 120572)

times

sum1199081015840119908119894isinrule (1205921199081015840 119908119894120588) times 1198991199051199081015840

minus119894+ 120573

sum119881

V=1 (sum1199081015840 Visinrule (1205921199081015840 V120588) times 1198991198961199081015840minus119894+ 120573)

(8)

For word sampled 119908 the other word 1199081015840 sharing the same

rule with it gets a push as strong as the net confidence in thedirection of its orientation The objective is to increase theprobability of positively corelated words to appear togetheras top words under the same topic and vice versa

Table 1 OAMC positive and negative knowledge rules with highutility

Type Knowledge

+ive

(tech support) (cable usb) (pro con) (customersupport) (batter charge) (video card) (high long)(connection vga) (operating system) (monitormacbook) (mac support) (cd dvd) (windows xp)(money worth) (inch picture) (input output)(friend mine) (big deal)

minusive

(sound money) (feature battery) (battery price)(feature battery) (price device) (screen sound)(review screen) (worse digital) (price easy) (gasscreen) (signal easy) (home gps) (hotel traffic)(driver cooler) (monitor processor) (monitorspeed)

44 KnowledgeMaintenance andRetention Supportedwith aknowledgemaintenance and retentionmodule useful knowl-edge rules are stored until they justify their existence Theknowledge-base is refined after each task to keep it consistentThey all get older by a taskThe relevant knowledge rules growin confidence supported by current domain and have higherutility After updating the state of each knowledge rule theyare passed through a threshold again to justify their existenceas proposed in

nPMI (1199081 1199082)

=

120592 times 120588

120582gt 1205831or

120592 times 120588

120582lt 1205832

remove

otherwise retain

(9)

where 120592 is net confidence 120588 is utility and 1120582 is freshness fora knowledge rule If the confidence of a knowledge rule incurrent task is above its net confidence 120592 the current domainis allowed to update the knowledge background to betterrepresent its context as proposed in

kRule (1199081 1199082) =

conf gt 120592 update background

otherwise unchanged(10)

where background is updated by adding words from sup-porting topic in the current task New rules are minedwhile existing rules are evolved as a continuous processThe newly added knowledge features are used to rank andprioritize knowledge rules The difference in rank of goodand bad knowledge rules widens as the model matures withexperience Algorithm 3 is used to refine existing knowledgewith step (2) having relevant knowledge as Rule1015840 In steps (4)to (7) all relevant rules get their net confidence updated andutility incremented Step (8) has all rules aged by a task Someof the popular rules from knowledge-base of OAMC modelare shown in Table 1

5 Experimental Results

The proposed model (OAMC) contributes to the applicationof lifelong learning forNLP tasks specifically used to improve

Computational Intelligence and Neuroscience 7

RefineRules (Dt N topics V Rules)(1) procedure RefineRules (119863119905119873 topics 119881 Rules)(2) Rules1015840 larr relevantRules (119881 knowledge-base) set of relevant rules(3) for (r all Rules) do(4) if (Rules1015840 contains r) do(5) 120592 larr nPMI(119903) add to cumulative confidence(6) 120588 larr +1 Rules utility increased by 1(7) end if(8) end for(12) end procedure

Algorithm 3 Refine rules

topic coherence for big streaming data Various experimentsare performed to highlight the improvement in accuracyand performance of lifelong topic mining as product aspectsUnlike traditional ML models it operates on big data with avariety of domains The standard dataset used by previousLML models is used to evaluate the proposed model incomparison to the other state-of-the-art models The dataconsists of 50 domains each of electronic products To high-light the improvement in the OAMC model the dataset wasprovided with decreasing number of domains The modelsare evaluated in topic coherence for accuracy and time inminutes for performance The number of knowledge rules isused as quality of knowledge where fewer rules producingbetter results are considered to be of better quality

51 Setup To set up the test environment an HP PavilionDM4 machine is used with Intel 24GHz Core i5 having8GB RAM and 500GB hard drive Following the evaluationapproach in existing models 15 topics are extracted perdomain for all models while as in previous models top 30words per topic are considered to calculate topic coherenceof the proposedmodel alsoThe hyperparameters are kept thesame for all models The number of Gibbs sampler iterationsis kept the same (2000) as used in previous models Themodels are configured and initialized as used in their originalwork The proposed OAMC has its thresholds 120583

1and 120583

2

set to 08 and minus09 respectively to learn knowledge andselect relevant knowledge for a given task Since negativecorelations are too many as compared to positive rules thethreshold is set to focus more on positive rules In orderto choose relevant rule the overlap threshold is set to 03between the vocabulary and knowledge background

52 Results The accuracy of lifelong models and topicmodels in general is evaluated through topic coherence It hashigh corelation with human evaluation All of the models aregiven the same number of documents in a domain for bothlearning and evaluation The existing models drop accuracywhen it has to learn from a smaller dataset with documentsin hundreds per domain while compromising accuracy whenevaluated on larger dataset with thousands of documentsTherefore the previous models were given large datasets tolearn from and were evaluated on smaller datasets However

50 40 30 20 10 8 6 4 2 1Number of domains available

Topic coherence with different number of domains in dataset

LDALTMAMC-M

AMCOAMC

minus700

minus750

minus800

minus850

minus900

minus950

minus1000

Topi

c coh

eren

ce

Figure 3The proposed model (OAMC) produces highest accuracy(as topic coherence) among lifelong learning models for streamingbig data

for fairness to all models they are provided with the test setonly to learn from and to test on

Topic coherence of the models compared has interestingoutcome LTM surprisingly could not maintain a better topiccoherence when it had to learn from a smaller dataset It evendropped below the baseline AMC and AMC-M performedwell however the proposed OAMC beat them when anumber of domains in the dataset were dropped below 10 asshown in Figure 3 AMC and AMC-M hardly learn anythingwith fewer domains which makes them ineffective towardsstreaming data Therefore their accuracy continuously wentdown towards baseline On the other hand LTM has itslearning pattern badly affected by limited variety of domainsand learnt wrong knowledge which dropped its accuracyThe knowledge rules learnt by AMC and AMC-M were100 and below per domain for less than 10 domains in thedataset which grow up to more than 2000 for 50 domainsIt is because of the overreliance of the existing models onlarge volumes of data to be available at once which makesthem least effective towards streaming data It shows thatthey only learn better knowledge and produce better resultswhen there are many domains in the dataset and they areall available prior to analysis With a robust knowledge

8 Computational Intelligence and Neuroscience

LTM AMC-M AMC OAMCLML models

Performance

020406080

100120140

Tim

e (m

in)

Figure 4 The proposed model (OAMC) has the highest perfor-mance efficiency among LML topic models

storing mechanism OAMC learnt more knowledge refinedit and used it to improve topic coherence with few domainsavailable The OAMC model shows on average 7 with amaximum of 10 improvement in topic coherence whenthere is one domain at a time

OAMC learns wrong knowledge too as it considersonly the current domain for harvesting it However with arobust knowledge retention module after performing fewtasks only good quality knowledge survives through thefiltering mechanism Knowledge rules with less confidencefrom future tasks or limited utility are pushed down andfinally filtered out Knowledge that is getting confidencereenforced from future tasks and has growing utility stayslonger To evaluate relevance of a knowledge rule with thecurrent task the domain of the current task must have therule words confidence above threshold and overlap with therule background higher than threshold The online OAMCmodels truly learn like humans having no idea of future tasksunlike existing models Its learning is dependent on the pastexperience as the domains processed only and uses it for afuture task It is less dependent on data parses it once andconsiders the knowledge module with abstract rules only tosample relevant rules and therefore shows improvement of upto 50 in performance as shown in Figure 4

For the discussed intrinsic evaluation the models arealso compared through PMI mean and median values thathad relevance with topic coherence values The extrinsicWordNet based techniques used in the literature (WuPalmerResnik Path Lin LeacockChodorow Lesk JiangConrath andHirstStOnge) were inconclusive to differentiate among thequalities of topics extracted due to the specialized nature ofelectronic products domains

53 Evaluation OAMC improving performance of AMCby 50 as shown in Figure 4 is attributed to the followingreasons OAMC does not extract relevant knowledge for eachtask and rather chooses from a well maintained knowledgemodule Each domain is parsed only once With a knowledgefiltering mechanism only good quality knowledge rules aremaintainedThere are generally two performance bottlenecksin existing LML models that is FIM for knowledge extrac-tion and higher iterations of Gibbs sampling The OAMCmodel uses nPMI [23] which is highly efficient to extract

knowledge Secondly it learns both must-links and cannot-links together with a single mechanism as rules Even with50 domains in the dataset OAMC performed better thanAMC for certain individual domains by producing highertopic coherence while using fewer knowledge rules as shownin Figure 5 Table 1 shows some of the most frequently usedknowledge rules of OAMC Through manual evaluation oftopics from all the domains it was observed that the knowl-edge learnt by LMLmodels is more effective towards randomand intruded types of topic incoherence but less effectivetowards chained and unbalanced type of incoherence In factthe knowledge-based models can at times enhance chainedtopic incoherence if there are many knowledge rule pairssharing common words

54 Knowledge Analysis The accuracy of LML modelsdepends upon the quality of its knowledge The previousmodels did not provide a discussion on the quality ofknowledge In fact improved topic coherence was consideredas reason for good quality knowledge The knowledge rulesshown in Table 1 have utility above 10 as a token of theirquality The ratio of use of knowledge rules per domain byAMC and OAMC is 5 1 signifying the quality of knowledgeproduced by OAMC Net confidence of rules also showsthat most of the relevant domains enforced the rule andincreased its confidence (above 08with 1 asmaximum) Sincethis information is not available for the existing models thevalues cannot be matched It is still very difficult to explorethe impact of each knowledge rule separately however thediscussed features in general keep track of good qualityknowledge The knowledge learnt is of two types only cap-turing positive and negative word corelations However morevarieties of knowledge can be introduced for further insightsBy introducing ldquothe type of relationshiprdquo as knowledge theunbalanced topic incoherence problem can be addressedAlong with correlation between words the nature of theircorrelationmay also be explored for better analysisThis typeof knowledge is also useful for finding hierarchical topicsThe knowledge rules extracted are isolated while knowledgeas complex network can be more effective for a variety ofanalyses AMC uses a small graph only to associate must-links sharing common word in the same context Knowledgeextraction transfer representation and retention are onlyintroduced to NLP tasks in 2010 and require mature ideasfrom other fields of research

6 Conclusion

LML models are structured to process big data having manydomains and therefore it is the only suitable solution despiteits shortcomings LML models are recently introduced to bigdata topic extraction and are far from mature The existingLML topic models had limited application due to followingbatch processing approach andoverreliance ondataOur pro-posed model named as OAMC follows online approach forLML topic modeling that can support streaming data Withan efficient learning mechanism our OAMCmodel has lowerdependency on data and therefore improved performance by

Computational Intelligence and Neuroscience 9

AlarmClock Battery CDPlayer Fan LampIndividual domains

Topic coherence for each domain

OAMCAMC

minus750

minus770

minus790

minus810

minus830

minus850To

pic c

oher

ence

minus778 minus785

minus812

minus782

minus809

minus790

minus817

minus761

minus788

(a)

OAMCAMC

AlarmClock Battery CDPlayer Fan LampIndividual domains

Knowledge rules used for each domain

284

1423

345

863

226518

1520

446

0

500

1000

1500

2000

Num

ber o

f kno

wle

dge

rule

s use

d

(b)

Figure 5 The proposed model (OAMC) in comparison to the state-of-the-art LML model (a) produces better accuracy as topic coherencefor individual domains while (b) using fewer knowledge rules

50 as compared to state-of-the-art LML models Velocity isan important feature of big data as it is believed to be receivedcontinuously at high speed The OAMC model is designedto process the data in hand only and therefore outperformsexisting models for streaming data As the data is expectedto have noise OAMC is supported with a strict monitoringand filteringmechanism tomaintain a consistent knowledge-base The rules are made to compete for survival based onutility freshness and net confidence OAMC being blind tothe future gets higher bumps in accuracy when current taskhas low relevance to past experience This is very commonto a human-like learning approach However the tendencyof such tasks for which limited relevant knowledge availabledrops with experience considerably OAMCmodel shows 7improvement in accuracy on average for streaming big databut up to 10 for certain individual domains as compared tostate of the art

In the future we will be working on introducing morevarieties of knowledge Through complex network analysisthe knowledge rules can be stored in association with eachother Cooccurrence can mean many things and thereforethe relationship type should also be considered in future Itwill help to associate aspects with products sentiments toaspects and reasons to sentiments To improve the perfor-mance and attempt for near real-time analysis Gibbs sam-pling should be replaced with a lighter inference technique

Disclosure

This research is the authorsrsquo work not published or underreview elsewhere

Competing Interests

The authors declare that they have no competing interests

Authorsrsquo Contributions

In this research work Muhammad Taimoor Khan(corresponding author) developed the model while Mehr

Durrani helped with adjusting the knowledge learningand transfer mechanism to better adapt to the currenttask Furqan Aziz contributed through incorporating theknowledge maintenance module for high efficiency andreduced data dependency Shehzad Khalid supervised theactivity and contributed through running experiments andcompiling results Shehzad Khalid helped with introducingthe new knowledge features and with prioritizing them

Acknowledgments

The authors are thankful to Mr Yousaf Marwat Lecturerat COMSATS IIT Attock for facilitating with editing andproofreading services

References

[1] S Branavan H Chen J Eisenstein and R Barzilay ldquoLearningdocument-level semantic properties from free-text annota-tionsrdquo Journal of Artificial Intelligence Research vol 34 pp 569ndash603 2009

[2] S Brody and N Elhadad ldquoAn unsupervised aspect-sentimentmodel for online reviewsrdquo in Proceedings of the Human Lan-guage Technologies Conference of the North American Chapterof the Association for Computational Linguistics (NAACL HLTrsquo10) pp 804ndash812 Association for Computational LinguisticsLos Angeles Calif USA June 2010

[3] F Li M Huang and X Zhu ldquoSentiment analysis with globaltopics and local dependencyrdquo in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI rsquo10)Atlanta Ga USA July 2010

[4] Q Mei X Ling M Wondra H Su and C Zhai ldquoTopicsentiment mixture modeling facets and opinions in weblogsrdquoin Proceedings of the 16th International World Wide Web Con-ference (WWW rsquo07) pp 171ndash180 ACM Alberta Canada May2007

[5] I Titov and R McDonald ldquoModeling online reviews withmulti-grain topic modelsrdquo in Proceedings of the 17th Interna-tional Conference on World Wide Web (WWW rsquo08) pp 111ndash120ACM Beijing China April 2008

10 Computational Intelligence and Neuroscience

[6] Y Lu C X Zhai andN Sundaresan ldquoRated aspect summariza-tion of short commentsrdquo in Proceedings of the 18th InternationalWorld Wide Web Conference (WWW rsquo09) pp 131ndash140 ACMApril 2009

[7] A Mukherjee and B Liu ldquoAspect extraction through semi-supervisedmodelingrdquo inProceedings of the 50thAnnualMeetingof the Association for Computational Linguistics (ACL rsquo12) vol1 pp 339ndash348 ssociation for Computational Linguistics July2012

[8] I Titov and R T McDonald ldquoA joint model of text and aspectratings for sentiment summarizationrdquo in Proceedings of theAnnualMeeting of the Association for Computational Linguisticsvol 8 pp 308ndash316 Association for Computational LinguisticsColumbus Ohio USA 2008

[9] H Wang Y Lu and C Zhai ldquoLatent aspect rating analysis onreview text data a rating regression approachrdquo in Proceedings ofthe 16th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo10) pp 783ndash792 ACM July2010

[10] T L Griffiths M Steyvers D M Blei and J B TenenbaumldquoIntegrating topics and syntaxrdquo in Advances in Neural Informa-tion Processing Systems pp 537ndash544 MIT Press CambridgeMass USA 2004

[11] Y Jo and A H Oh ldquoAspect and sentiment unification modelfor online review analysisrdquo in Proceedings of the 4th ACMInternational Conference on Web Search and Data Mining(WSDM rsquo11) pp 815ndash824 ACM Hong Kong February 2011

[12] J Liu Y Cao C Y Lin Y Huang and M Zhou ldquoLow-quality product review detection in opinion summarizationrdquoin Proceedings of the Joint Conference on Empirical Methodsin Natural Language Processing and Computational NaturalLanguage Learning (EMNLP-CoNLL rsquo07) pp 334ndash342 PragueCzech Republic June 2007

[13] Y Lu and C Zhai ldquoOpinion integration through semi-supervised topic modelingrdquo in Proceedings of the 17th Interna-tional Conference onWorld Wide Web (WWW rsquo08) pp 121ndash130ACM Beijing China April 2008

[14] C Sauper A Haghighi and R Barzilay ldquoContent models withattituderdquo in Proceedings of the 49th Annual Meeting of theAssociation for Computational Linguistics vol 1 ofHuman Lan-guage Technologies pp 350ndash358 Association for ComputationalLinguistics 2011

[15] W X Zhao J Jiang H Yan and X Li ldquoJointly modeling aspectsand opinions with a MaxEnt-LDA hybridrdquo in Proceedings ofthe Conference on Empirical Methods in Natural Language Pro-cessing pp 56ndash65 Association for Computational LinguisticsOctober 2010

[16] D Andrzejewski X Zhu and M Craven ldquoIncorporatingdomain knowledge into topicmodeling via Dirichlet Forest pri-orsrdquo in Proceedings of the 26th Annual International Conferenceon Machine Learning pp 25ndash32 ACM June 2009

[17] D Andrzejewski X Zhu M Craven and B Recht ldquoA frame-work for incorporating general domain knowledge into latentdirichlet allocation using first-order logicrdquo in Proceedings ofthe 22nd International Joint Conference on Artificial Intelligence(IJCAI rsquo11) pp 1171ndash1177 Barcelona Spain July 2011

[18] Z Chen A Mukherjee B Liu M Hsu M Castellanos and RGhosh ldquoDiscovering coherent topics using general knowledgerdquoin Proceedings of the 22nd ACM International Conference onInformation and Knowledge Management (CIKM rsquo13) pp 209ndash218 ACM November 2013

[19] A Carlson J Betteridge B Kisiel B Settles E R Hruschka Jrand T M Mitchell ldquoToward an architecture for never-endinglanguage learningrdquo in Proceedings of the 24th AAAI Conferenceon Artificial Intelligence vol 5 p 3 Atlanta Ga USA July 2010

[20] Z Chen and B Liu ldquoMining topics in documents standingon the shoulders of big datardquo in Proceedings of the 20th ACMSIGKDD International Conference on Knowledge Discovery andData Mining (KDD rsquo14) pp 1116ndash1125 ACM New York NYUSA August 2014

[21] D Mimno H M Wallach E Talley M Leenders and AMcCallum ldquoOptimizing semantic coherence in topic modelsrdquoinProceedings of theConference onEmpiricalMethods inNaturalLanguage Processing pp 262ndash272 Association for Computa-tional Linguistics July 2011

[22] Z Chen A Mukherjee and B Liu ldquoAspect extraction withautomated prior knowledge learningrdquo in Proceedings of the 52ndAnnual Meeting of the Association for Computational Linguistics(ACL rsquo14) pp 347ndash358 Baltimore Md USA June 2014

[23] G Bouma ldquoNormalized (pointwise) mutual information incollocation extractionrdquo in Proceedings of the Conference of theGerman Society for Computational Linguistics (GSCL rsquo09) pp31ndash40 Potsdam Germany 2009

[24] D M Blei A Y Ng and M I Jordan ldquoLatent dirichletallocationrdquoThe Journal ofMachine Learning Research vol 3 no4-5 pp 993ndash1022 2003

[25] T Hofmann ldquoProbabilistic latent semantic indexingrdquo in Pro-ceedings of the 22nd Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval(SIGIR rsquo99) pp 50ndash57 ACM Berkeley Calif USA August 1999

[26] E Eaton and P L Ruvolo ldquoElla an efficient lifelong learningalgorithmrdquo in Proceedings of the 30th International Conferenceon Machine Learning (ICML rsquo13) pp 507ndash515 2013

[27] E Kamar A Kapoor and E Horvitz ldquoLifelong learning foracquiring thewisdom of the crowdrdquo in Proceedings of the 23rdInternational Joint Conference on Artificial Intelligence (IJCAIrsquo13) pp 2313ndash2320 AAAI Press Beijing China August 2013

[28] R Raina A Battle H Lee B Packer and A Y Ng ldquoSelf-taughtlearning transfer learning from unlabeled datardquo in Proceedingsof the 24th International Conference onMachine Learning (ICMLrsquo07) pp 759ndash766 ACM June 2007

[29] D L Silver Q Yang and L Li ldquoLifelong machine learningsystems Be-yond learning algorithmsrdquo in Proceedings of theAAAI Spring Symposium Lifelong Machine Learning StanfordUniversity March 2013

[30] Z Chen and B Liu ldquoTopic modeling using topics from manydomains lifelong learning and big datardquo in Proceedings of the31st International Conference on Machine Learning (ICML rsquo14)pp 703ndash711 Beijing China June 2014

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 6: Research Article Online Knowledge-Based Model for Big Data ...downloads.hindawi.com/journals/cin/2016/6081804.pdf · Research Article Online Knowledge-Based Model for Big Data Topic

6 Computational Intelligence and Neuroscience

to have its probability increased under a topic The effectof relevant knowledge rules is added as bias into the topicmodel to push up positively corelated words under the sametopic and vice versa Since the current task has limited dataknowledge from previous tasks is incorporated to guide theinference For a sampled word 119908

1in topic 119905 if a rule exists

for it then the associated word 1199082has probability updated in

current topic 119905 as shown in

119901 (rule119894= rule | 119905) prop 119901 (119908

1| 119905) times 119901 (119908

2| 119905) (5)

where 1199081and 119908

2are the two words of a rule rule

119894(must-

link or cannot-link)When the model inference increases theprobability of a word 119908

1for a topic 119905 the probability of 119908

2

is increased for topic 119905 in case of must-link and decreasedfor cannot-link To address word sense disambiguation thefollowing equation is proposed

(119881 cup ruleBackground)ruleBackground

gt 120585 (6)

Here 120585 is the threshold that has to be satisfied for overlapbetween the vocabulary of the current task and the rulebackgroundThe rule background represents the set of wordsin topic from which the rule was learnt Comparing therule background with the current vocabulary is to ensurethat the rule is used in the same context in which it wasmined For example the word Bank has different contextswhen used with Accounts and Sea respectively Topics of allprocessed tasks were previously used for the same purposeThe net confidence and utility of a knowledge rule are usedto transfer the impact of knowledge rule in topic samplingMultigeneralized Polya Urn (M-GPU) model [20] is used totransfer the effect of knowledge into the current status ofMarkov chain in Gibbs sampler in

119901 (119908 | 119905) propsum119881

1199081015840=1(1205921199081199081015840120588) times 119899

1199051199081015840 + 120573

sum119881

V=1 (sum119881

1199081015840=1(120592V1199081015840120588) times 1198991199051199081015840 + 120573)

(7)

Equation (7) shows the amount of update in the probability bya factor 120592

1199081199081015840120588 for the other word in knowledge rule When

the probability of a sampled word is increased for a topicusing the rules the probability of all thewords associatedwiththe sampled word is also increasedThe following equation isupdated to incorporate both positive and negatively corelatedrules in a single step as

119901 (119911119894= 119905 | 119911

minus119894 119908 120572 120573 120582)

prop119899119889119905

minus119894+ 120572

sum119879

119905=1(119899119889119905

minus119894+ 120572)

times

sum1199081015840119908119894isinrule (1205921199081015840 119908119894120588) times 1198991199051199081015840

minus119894+ 120573

sum119881

V=1 (sum1199081015840 Visinrule (1205921199081015840 V120588) times 1198991198961199081015840minus119894+ 120573)

(8)

For word sampled 119908 the other word 1199081015840 sharing the same

rule with it gets a push as strong as the net confidence in thedirection of its orientation The objective is to increase theprobability of positively corelated words to appear togetheras top words under the same topic and vice versa

Table 1 OAMC positive and negative knowledge rules with highutility

Type Knowledge

+ive

(tech support) (cable usb) (pro con) (customersupport) (batter charge) (video card) (high long)(connection vga) (operating system) (monitormacbook) (mac support) (cd dvd) (windows xp)(money worth) (inch picture) (input output)(friend mine) (big deal)

minusive

(sound money) (feature battery) (battery price)(feature battery) (price device) (screen sound)(review screen) (worse digital) (price easy) (gasscreen) (signal easy) (home gps) (hotel traffic)(driver cooler) (monitor processor) (monitorspeed)

44 KnowledgeMaintenance andRetention Supportedwith aknowledgemaintenance and retentionmodule useful knowl-edge rules are stored until they justify their existence Theknowledge-base is refined after each task to keep it consistentThey all get older by a taskThe relevant knowledge rules growin confidence supported by current domain and have higherutility After updating the state of each knowledge rule theyare passed through a threshold again to justify their existenceas proposed in

nPMI (1199081 1199082)

=

120592 times 120588

120582gt 1205831or

120592 times 120588

120582lt 1205832

remove

otherwise retain

(9)

where 120592 is net confidence 120588 is utility and 1120582 is freshness fora knowledge rule If the confidence of a knowledge rule incurrent task is above its net confidence 120592 the current domainis allowed to update the knowledge background to betterrepresent its context as proposed in

kRule (1199081 1199082) =

conf gt 120592 update background

otherwise unchanged(10)

where background is updated by adding words from sup-porting topic in the current task New rules are minedwhile existing rules are evolved as a continuous processThe newly added knowledge features are used to rank andprioritize knowledge rules The difference in rank of goodand bad knowledge rules widens as the model matures withexperience Algorithm 3 is used to refine existing knowledgewith step (2) having relevant knowledge as Rule1015840 In steps (4)to (7) all relevant rules get their net confidence updated andutility incremented Step (8) has all rules aged by a task Someof the popular rules from knowledge-base of OAMC modelare shown in Table 1

5 Experimental Results

The proposed model (OAMC) contributes to the applicationof lifelong learning forNLP tasks specifically used to improve

Computational Intelligence and Neuroscience 7

RefineRules (Dt N topics V Rules)(1) procedure RefineRules (119863119905119873 topics 119881 Rules)(2) Rules1015840 larr relevantRules (119881 knowledge-base) set of relevant rules(3) for (r all Rules) do(4) if (Rules1015840 contains r) do(5) 120592 larr nPMI(119903) add to cumulative confidence(6) 120588 larr +1 Rules utility increased by 1(7) end if(8) end for(12) end procedure

Algorithm 3 Refine rules

topic coherence for big streaming data Various experimentsare performed to highlight the improvement in accuracyand performance of lifelong topic mining as product aspectsUnlike traditional ML models it operates on big data with avariety of domains The standard dataset used by previousLML models is used to evaluate the proposed model incomparison to the other state-of-the-art models The dataconsists of 50 domains each of electronic products To high-light the improvement in the OAMC model the dataset wasprovided with decreasing number of domains The modelsare evaluated in topic coherence for accuracy and time inminutes for performance The number of knowledge rules isused as quality of knowledge where fewer rules producingbetter results are considered to be of better quality

51 Setup To set up the test environment an HP PavilionDM4 machine is used with Intel 24GHz Core i5 having8GB RAM and 500GB hard drive Following the evaluationapproach in existing models 15 topics are extracted perdomain for all models while as in previous models top 30words per topic are considered to calculate topic coherenceof the proposedmodel alsoThe hyperparameters are kept thesame for all models The number of Gibbs sampler iterationsis kept the same (2000) as used in previous models Themodels are configured and initialized as used in their originalwork The proposed OAMC has its thresholds 120583

1and 120583

2

set to 08 and minus09 respectively to learn knowledge andselect relevant knowledge for a given task Since negativecorelations are too many as compared to positive rules thethreshold is set to focus more on positive rules In orderto choose relevant rule the overlap threshold is set to 03between the vocabulary and knowledge background

52 Results The accuracy of lifelong models and topicmodels in general is evaluated through topic coherence It hashigh corelation with human evaluation All of the models aregiven the same number of documents in a domain for bothlearning and evaluation The existing models drop accuracywhen it has to learn from a smaller dataset with documentsin hundreds per domain while compromising accuracy whenevaluated on larger dataset with thousands of documentsTherefore the previous models were given large datasets tolearn from and were evaluated on smaller datasets However

50 40 30 20 10 8 6 4 2 1Number of domains available

Topic coherence with different number of domains in dataset

LDALTMAMC-M

AMCOAMC

minus700

minus750

minus800

minus850

minus900

minus950

minus1000

Topi

c coh

eren

ce

Figure 3The proposed model (OAMC) produces highest accuracy(as topic coherence) among lifelong learning models for streamingbig data

for fairness to all models they are provided with the test setonly to learn from and to test on

The comparison of the models' topic coherence has an interesting outcome. LTM, surprisingly, could not maintain better topic coherence when it had to learn from a smaller dataset; it even dropped below the baseline. AMC and AMC-M performed well; however, the proposed OAMC beat them once the number of domains in the dataset dropped below 10, as shown in Figure 3. AMC and AMC-M hardly learn anything with fewer domains, which makes them ineffective for streaming data; therefore, their accuracy went down continuously towards the baseline. LTM, on the other hand, had its learning pattern badly affected by the limited variety of domains and learnt wrong knowledge, which dropped its accuracy. The knowledge rules learnt by AMC and AMC-M were 100 or fewer per domain when there were fewer than 10 domains in the dataset, growing to more than 2000 for 50 domains. This is because of the overreliance of the existing models on large volumes of data being available at once, which makes them least effective for streaming data. It shows that they only learn better knowledge and produce better results when there are many domains in the dataset and all of them are available prior to analysis. With a robust knowledge storing mechanism, OAMC learnt more knowledge, refined it, and used it to improve topic coherence with few domains available. The OAMC model shows on average a 7% improvement in topic coherence, with a maximum of 10%, when there is one domain at a time.

Figure 4: The proposed model (OAMC) has the highest performance efficiency among LML topic models. (Bar plot of running time in minutes, from 0 to 140, for LTM, AMC-M, AMC, and OAMC.)

OAMC learns wrong knowledge too, as it considers only the current domain when harvesting it. However, with a robust knowledge retention module, after performing only a few tasks, only good quality knowledge survives the filtering mechanism. Knowledge rules that receive little confidence from future tasks, or that have limited utility, are pushed down and finally filtered out. Knowledge whose confidence is reinforced by future tasks and whose utility keeps growing stays longer. For a knowledge rule to be relevant to the current task, the domain of the current task must give the rule's words a confidence above the threshold and must overlap with the rule's background beyond the overlap threshold. The online OAMC model truly learns like humans, having no idea of future tasks, unlike the existing models. Its learning depends only on past experience, that is, the domains processed so far, which it uses for future tasks. It is less dependent on data, parses it only once, and consults only the knowledge module of abstract rules to sample relevant rules; it therefore shows an improvement of up to 50% in performance, as shown in Figure 4.
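A compact way to picture this retention module is sketched below: rules carry net confidence, utility, and freshness; rules that stay unreinforced or unused for too long are filtered out; and relevance to a task requires the rule's words in the task vocabulary plus sufficient overlap with the rule's background (the 0.3 threshold of Section 5.1). The data structure and the cut-off values min_confidence and max_staleness are illustrative assumptions, not the paper's settings.

```python
from dataclasses import dataclass

@dataclass
class KnowledgeRule:
    words: tuple          # the word pair the rule links
    kind: str             # "must-link" or "cannot-link"
    background: set       # vocabulary of the domain(s) that produced the rule
    confidence: float     # net confidence, reinforced (or not) by later tasks
    utility: int = 0      # number of tasks in which the rule was actually used
    freshness: int = 0    # tasks elapsed since the rule last fired

def relevant_rules(task_vocabulary, knowledge_base, overlap_threshold=0.3):
    """A rule is relevant to the current task (a set of vocabulary words)
    when its words occur in the task vocabulary and the vocabulary covers
    enough of the rule's background (Section 5.1)."""
    selected = []
    for r in knowledge_base:
        if not r.background or not all(w in task_vocabulary for w in r.words):
            continue
        overlap = len(task_vocabulary & r.background) / len(r.background)
        if overlap >= overlap_threshold:
            selected.append(r)
    return selected

def retain(knowledge_base, min_confidence=0.5, max_staleness=20):
    """One retention pass after a task: low-confidence or stale rules are
    pushed out; reinforced, frequently used rules survive."""
    for r in knowledge_base:
        r.freshness += 1
    return [r for r in knowledge_base
            if r.confidence >= min_confidence and r.freshness <= max_staleness]
```

On this picture, a noisy rule learnt from a single domain dies out unless later domains keep confirming it, which is the survival competition described above.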

Besides the discussed intrinsic evaluation, the models are also compared through PMI mean and median values, which were consistent with the topic coherence values. The extrinsic WordNet-based techniques used in the literature (Wu-Palmer, Resnik, Path, Lin, Leacock-Chodorow, Lesk, Jiang-Conrath, and Hirst-StOnge) were inconclusive in differentiating the quality of the extracted topics, due to the specialized nature of the electronic products domains.

5.3. Evaluation. OAMC improving on the performance of AMC by 50%, as shown in Figure 4, is attributed to the following reasons. OAMC does not extract relevant knowledge anew for each task; rather, it chooses from a well-maintained knowledge module. Each domain is parsed only once. With a knowledge filtering mechanism, only good quality knowledge rules are maintained. There are generally two performance bottlenecks in existing LML models, namely, frequent itemset mining (FIM) for knowledge extraction and the high number of Gibbs sampling iterations. The OAMC model uses nPMI [23], which is highly efficient, to extract knowledge. Secondly, it learns both must-links and cannot-links together, as rules, with a single mechanism. Even with 50 domains in the dataset, OAMC performed better than AMC for certain individual domains, producing higher topic coherence while using fewer knowledge rules, as shown in Figure 5. Table 1 shows some of the most frequently used knowledge rules of OAMC. Through manual evaluation of topics from all the domains, it was observed that the knowledge learnt by LML models is more effective against the random and intruded types of topic incoherence but less effective against the chained and unbalanced types. In fact, the knowledge-based models can at times worsen chained topic incoherence when many knowledge rule pairs share common words.

5.4. Knowledge Analysis. The accuracy of LML models depends upon the quality of their knowledge. The previous models did not provide a discussion on knowledge quality; in fact, improved topic coherence was taken as evidence of good quality knowledge. The knowledge rules shown in Table 1 have utility above 10 as a token of their quality. The ratio of knowledge rules used per domain by AMC and OAMC is 5:1, signifying the quality of the knowledge produced by OAMC. The net confidence of the rules also shows that most of the relevant domains reinforced a rule and increased its confidence (above 0.8, with 1 as the maximum). Since this information is not available for the existing models, the values cannot be compared. It is still very difficult to isolate the impact of each knowledge rule separately; however, the discussed features in general keep track of good quality knowledge. The knowledge learnt is of two types only, capturing positive and negative word correlations. However, more varieties of knowledge can be introduced for further insights. By introducing "the type of relationship" as knowledge, the unbalanced topic incoherence problem can be addressed. Along with the correlation between words, the nature of their correlation may also be explored for better analysis. This type of knowledge is also useful for finding hierarchical topics. The knowledge rules extracted are isolated, while knowledge as a complex network could be more effective for a variety of analyses; AMC uses only a small graph to associate must-links sharing a common word in the same context. Knowledge extraction, transfer, representation, and retention were only introduced to NLP tasks in 2010 and require mature ideas from other fields of research.
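Tying these features together, the following sketch shows how the per-rule quality features discussed above (net confidence and utility) might accumulate after each task, in the spirit of the RefineRules procedure (Algorithm 3): every rule relevant to the new domain has its net confidence updated with the rule's nPMI in that domain and its utility incremented. It reuses the npmi and relevant_rules helpers sketched earlier; cooccurrence_probs is an assumed helper that estimates the pair and marginal probabilities in the current domain, and the running-average update is one plausible reading of "net confidence", not the authors' exact formula.

```python
def refine_rules(domain_docs, task_vocabulary, knowledge_base):
    """After each task, reinforce the rules the task confirmed: fold the
    rule's nPMI in the current domain into its net confidence (kept in
    [-1, 1] as a running average) and credit one unit of utility."""
    for rule in relevant_rules(task_vocabulary, knowledge_base):
        p_xy, p_x, p_y = cooccurrence_probs(domain_docs, *rule.words)  # assumed helper
        v = npmi(p_xy, p_x, p_y)
        rule.confidence = (rule.confidence * rule.utility + v) / (rule.utility + 1)
        rule.utility += 1
        rule.freshness = 0   # the rule just fired, so it is fresh again
```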

Figure 5: The proposed model (OAMC), in comparison to the state-of-the-art LML model (AMC), (a) produces better accuracy as topic coherence for individual domains, while (b) using fewer knowledge rules. (Bar plots over the individual domains AlarmClock, Battery, CDPlayer, Fan, and Lamp: (a) topic coherence per domain for each model; (b) number of knowledge rules used per domain for each model.)

6. Conclusion

LML models are structured to process big data with many domains and are therefore the only suitable solution despite their shortcomings. LML models have only recently been introduced to big data topic extraction and are far from mature. The existing LML topic models had limited application due to their batch processing approach and overreliance on data. Our proposed model, named OAMC, follows an online approach to LML topic modeling that can support streaming data. With an efficient learning mechanism, our OAMC model has lower dependency on data and therefore improved performance by 50% as compared to state-of-the-art LML models. Velocity is an important feature of big data, as it is believed to be received continuously at high speed. The OAMC model is designed to process only the data in hand and therefore outperforms existing models for streaming data. As the data is expected to have noise, OAMC is supported with a strict monitoring and filtering mechanism to maintain a consistent knowledge-base. The rules are made to compete for survival based on utility, freshness, and net confidence. OAMC, being blind to the future, gets higher bumps in accuracy when the current task has low relevance to past experience; this is very common to a human-like learning approach. However, the tendency of such tasks, for which limited relevant knowledge is available, drops considerably with experience. The OAMC model shows a 7% improvement in accuracy on average for streaming big data, and up to 10% for certain individual domains, as compared to the state of the art.

In the future, we will work on introducing more varieties of knowledge. Through complex network analysis, the knowledge rules can be stored in association with each other. Cooccurrence can mean many things, and therefore the relationship type should also be considered in the future. It will help to associate aspects with products, sentiments with aspects, and reasons with sentiments. To improve performance and attempt near real-time analysis, Gibbs sampling should be replaced with a lighter inference technique.

Disclosure

This research is the authors' own work, not published or under review elsewhere.

Competing Interests

The authors declare that they have no competing interests.

Authors' Contributions

In this research work, Muhammad Taimoor Khan (corresponding author) developed the model, while Mehr Durrani helped with adjusting the knowledge learning and transfer mechanism to better adapt to the current task. Furqan Aziz contributed by incorporating the knowledge maintenance module for high efficiency and reduced data dependency. Shehzad Khalid supervised the activity, contributed by running experiments and compiling results, and helped with introducing the new knowledge features and prioritizing them.

Acknowledgments

The authors are thankful to Mr. Yousaf Marwat, Lecturer at COMSATS IIT, Attock, for providing editing and proofreading services.

References

[1] S. Branavan, H. Chen, J. Eisenstein, and R. Barzilay, "Learning document-level semantic properties from free-text annotations," Journal of Artificial Intelligence Research, vol. 34, pp. 569–603, 2009.

[2] S. Brody and N. Elhadad, "An unsupervised aspect-sentiment model for online reviews," in Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '10), pp. 804–812, Association for Computational Linguistics, Los Angeles, Calif, USA, June 2010.

[3] F. Li, M. Huang, and X. Zhu, "Sentiment analysis with global topics and local dependency," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI '10), Atlanta, Ga, USA, July 2010.

[4] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai, "Topic sentiment mixture: modeling facets and opinions in weblogs," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 171–180, ACM, Alberta, Canada, May 2007.

[5] I. Titov and R. McDonald, "Modeling online reviews with multi-grain topic models," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 111–120, ACM, Beijing, China, April 2008.

[6] Y. Lu, C. X. Zhai, and N. Sundaresan, "Rated aspect summarization of short comments," in Proceedings of the 18th International World Wide Web Conference (WWW '09), pp. 131–140, ACM, April 2009.

[7] A. Mukherjee and B. Liu, "Aspect extraction through semi-supervised modeling," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL '12), vol. 1, pp. 339–348, Association for Computational Linguistics, July 2012.

[8] I. Titov and R. T. McDonald, "A joint model of text and aspect ratings for sentiment summarization," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, vol. 8, pp. 308–316, Association for Computational Linguistics, Columbus, Ohio, USA, 2008.

[9] H. Wang, Y. Lu, and C. Zhai, "Latent aspect rating analysis on review text data: a rating regression approach," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), pp. 783–792, ACM, July 2010.

[10] T. L. Griffiths, M. Steyvers, D. M. Blei, and J. B. Tenenbaum, "Integrating topics and syntax," in Advances in Neural Information Processing Systems, pp. 537–544, MIT Press, Cambridge, Mass, USA, 2004.

[11] Y. Jo and A. H. Oh, "Aspect and sentiment unification model for online review analysis," in Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM '11), pp. 815–824, ACM, Hong Kong, February 2011.

[12] J. Liu, Y. Cao, C. Y. Lin, Y. Huang, and M. Zhou, "Low-quality product review detection in opinion summarization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 334–342, Prague, Czech Republic, June 2007.

[13] Y. Lu and C. Zhai, "Opinion integration through semi-supervised topic modeling," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 121–130, ACM, Beijing, China, April 2008.

[14] C. Sauper, A. Haghighi, and R. Barzilay, "Content models with attitude," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, vol. 1 of Human Language Technologies, pp. 350–358, Association for Computational Linguistics, 2011.

[15] W. X. Zhao, J. Jiang, H. Yan, and X. Li, "Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 56–65, Association for Computational Linguistics, October 2010.

[16] D. Andrzejewski, X. Zhu, and M. Craven, "Incorporating domain knowledge into topic modeling via Dirichlet Forest priors," in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 25–32, ACM, June 2009.

[17] D. Andrzejewski, X. Zhu, M. Craven, and B. Recht, "A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic," in Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI '11), pp. 1171–1177, Barcelona, Spain, July 2011.

[18] Z. Chen, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh, "Discovering coherent topics using general knowledge," in Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM '13), pp. 209–218, ACM, November 2013.

[19] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell, "Toward an architecture for never-ending language learning," in Proceedings of the 24th AAAI Conference on Artificial Intelligence, vol. 5, p. 3, Atlanta, Ga, USA, July 2010.

[20] Z. Chen and B. Liu, "Mining topics in documents: standing on the shoulders of big data," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pp. 1116–1125, ACM, New York, NY, USA, August 2014.

[21] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum, "Optimizing semantic coherence in topic models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272, Association for Computational Linguistics, July 2011.

[22] Z. Chen, A. Mukherjee, and B. Liu, "Aspect extraction with automated prior knowledge learning," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL '14), pp. 347–358, Baltimore, Md, USA, June 2014.

[23] G. Bouma, "Normalized (pointwise) mutual information in collocation extraction," in Proceedings of the Conference of the German Society for Computational Linguistics (GSCL '09), pp. 31–40, Potsdam, Germany, 2009.

[24] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," The Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993–1022, 2003.

[25] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 50–57, ACM, Berkeley, Calif, USA, August 1999.

[26] E. Eaton and P. L. Ruvolo, "ELLA: an efficient lifelong learning algorithm," in Proceedings of the 30th International Conference on Machine Learning (ICML '13), pp. 507–515, 2013.

[27] E. Kamar, A. Kapoor, and E. Horvitz, "Lifelong learning for acquiring the wisdom of the crowd," in Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI '13), pp. 2313–2320, AAAI Press, Beijing, China, August 2013.

[28] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, "Self-taught learning: transfer learning from unlabeled data," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 759–766, ACM, June 2007.

[29] D. L. Silver, Q. Yang, and L. Li, "Lifelong machine learning systems: beyond learning algorithms," in Proceedings of the AAAI Spring Symposium: Lifelong Machine Learning, Stanford University, March 2013.

[30] Z. Chen and B. Liu, "Topic modeling using topics from many domains, lifelong learning and big data," in Proceedings of the 31st International Conference on Machine Learning (ICML '14), pp. 703–711, Beijing, China, June 2014.

topic coherence for big streaming data Various experimentsare performed to highlight the improvement in accuracyand performance of lifelong topic mining as product aspectsUnlike traditional ML models it operates on big data with avariety of domains The standard dataset used by previousLML models is used to evaluate the proposed model incomparison to the other state-of-the-art models The dataconsists of 50 domains each of electronic products To high-light the improvement in the OAMC model the dataset wasprovided with decreasing number of domains The modelsare evaluated in topic coherence for accuracy and time inminutes for performance The number of knowledge rules isused as quality of knowledge where fewer rules producingbetter results are considered to be of better quality

51 Setup To set up the test environment an HP PavilionDM4 machine is used with Intel 24GHz Core i5 having8GB RAM and 500GB hard drive Following the evaluationapproach in existing models 15 topics are extracted perdomain for all models while as in previous models top 30words per topic are considered to calculate topic coherenceof the proposedmodel alsoThe hyperparameters are kept thesame for all models The number of Gibbs sampler iterationsis kept the same (2000) as used in previous models Themodels are configured and initialized as used in their originalwork The proposed OAMC has its thresholds 120583

1and 120583

2

set to 08 and minus09 respectively to learn knowledge andselect relevant knowledge for a given task Since negativecorelations are too many as compared to positive rules thethreshold is set to focus more on positive rules In orderto choose relevant rule the overlap threshold is set to 03between the vocabulary and knowledge background

52 Results The accuracy of lifelong models and topicmodels in general is evaluated through topic coherence It hashigh corelation with human evaluation All of the models aregiven the same number of documents in a domain for bothlearning and evaluation The existing models drop accuracywhen it has to learn from a smaller dataset with documentsin hundreds per domain while compromising accuracy whenevaluated on larger dataset with thousands of documentsTherefore the previous models were given large datasets tolearn from and were evaluated on smaller datasets However

50 40 30 20 10 8 6 4 2 1Number of domains available

Topic coherence with different number of domains in dataset

LDALTMAMC-M

AMCOAMC

minus700

minus750

minus800

minus850

minus900

minus950

minus1000

Topi

c coh

eren

ce

Figure 3The proposed model (OAMC) produces highest accuracy(as topic coherence) among lifelong learning models for streamingbig data

for fairness to all models they are provided with the test setonly to learn from and to test on

Topic coherence of the models compared has interestingoutcome LTM surprisingly could not maintain a better topiccoherence when it had to learn from a smaller dataset It evendropped below the baseline AMC and AMC-M performedwell however the proposed OAMC beat them when anumber of domains in the dataset were dropped below 10 asshown in Figure 3 AMC and AMC-M hardly learn anythingwith fewer domains which makes them ineffective towardsstreaming data Therefore their accuracy continuously wentdown towards baseline On the other hand LTM has itslearning pattern badly affected by limited variety of domainsand learnt wrong knowledge which dropped its accuracyThe knowledge rules learnt by AMC and AMC-M were100 and below per domain for less than 10 domains in thedataset which grow up to more than 2000 for 50 domainsIt is because of the overreliance of the existing models onlarge volumes of data to be available at once which makesthem least effective towards streaming data It shows thatthey only learn better knowledge and produce better resultswhen there are many domains in the dataset and they areall available prior to analysis With a robust knowledge

8 Computational Intelligence and Neuroscience

LTM AMC-M AMC OAMCLML models

Performance

020406080

100120140

Tim

e (m

in)

Figure 4 The proposed model (OAMC) has the highest perfor-mance efficiency among LML topic models

storing mechanism OAMC learnt more knowledge refinedit and used it to improve topic coherence with few domainsavailable The OAMC model shows on average 7 with amaximum of 10 improvement in topic coherence whenthere is one domain at a time

OAMC learns wrong knowledge too as it considersonly the current domain for harvesting it However with arobust knowledge retention module after performing fewtasks only good quality knowledge survives through thefiltering mechanism Knowledge rules with less confidencefrom future tasks or limited utility are pushed down andfinally filtered out Knowledge that is getting confidencereenforced from future tasks and has growing utility stayslonger To evaluate relevance of a knowledge rule with thecurrent task the domain of the current task must have therule words confidence above threshold and overlap with therule background higher than threshold The online OAMCmodels truly learn like humans having no idea of future tasksunlike existing models Its learning is dependent on the pastexperience as the domains processed only and uses it for afuture task It is less dependent on data parses it once andconsiders the knowledge module with abstract rules only tosample relevant rules and therefore shows improvement of upto 50 in performance as shown in Figure 4

For the discussed intrinsic evaluation the models arealso compared through PMI mean and median values thathad relevance with topic coherence values The extrinsicWordNet based techniques used in the literature (WuPalmerResnik Path Lin LeacockChodorow Lesk JiangConrath andHirstStOnge) were inconclusive to differentiate among thequalities of topics extracted due to the specialized nature ofelectronic products domains

53 Evaluation OAMC improving performance of AMCby 50 as shown in Figure 4 is attributed to the followingreasons OAMC does not extract relevant knowledge for eachtask and rather chooses from a well maintained knowledgemodule Each domain is parsed only once With a knowledgefiltering mechanism only good quality knowledge rules aremaintainedThere are generally two performance bottlenecksin existing LML models that is FIM for knowledge extrac-tion and higher iterations of Gibbs sampling The OAMCmodel uses nPMI [23] which is highly efficient to extract

knowledge Secondly it learns both must-links and cannot-links together with a single mechanism as rules Even with50 domains in the dataset OAMC performed better thanAMC for certain individual domains by producing highertopic coherence while using fewer knowledge rules as shownin Figure 5 Table 1 shows some of the most frequently usedknowledge rules of OAMC Through manual evaluation oftopics from all the domains it was observed that the knowl-edge learnt by LMLmodels is more effective towards randomand intruded types of topic incoherence but less effectivetowards chained and unbalanced type of incoherence In factthe knowledge-based models can at times enhance chainedtopic incoherence if there are many knowledge rule pairssharing common words

54 Knowledge Analysis The accuracy of LML modelsdepends upon the quality of its knowledge The previousmodels did not provide a discussion on the quality ofknowledge In fact improved topic coherence was consideredas reason for good quality knowledge The knowledge rulesshown in Table 1 have utility above 10 as a token of theirquality The ratio of use of knowledge rules per domain byAMC and OAMC is 5 1 signifying the quality of knowledgeproduced by OAMC Net confidence of rules also showsthat most of the relevant domains enforced the rule andincreased its confidence (above 08with 1 asmaximum) Sincethis information is not available for the existing models thevalues cannot be matched It is still very difficult to explorethe impact of each knowledge rule separately however thediscussed features in general keep track of good qualityknowledge The knowledge learnt is of two types only cap-turing positive and negative word corelations However morevarieties of knowledge can be introduced for further insightsBy introducing ldquothe type of relationshiprdquo as knowledge theunbalanced topic incoherence problem can be addressedAlong with correlation between words the nature of theircorrelationmay also be explored for better analysisThis typeof knowledge is also useful for finding hierarchical topicsThe knowledge rules extracted are isolated while knowledgeas complex network can be more effective for a variety ofanalyses AMC uses a small graph only to associate must-links sharing common word in the same context Knowledgeextraction transfer representation and retention are onlyintroduced to NLP tasks in 2010 and require mature ideasfrom other fields of research

6 Conclusion

LML models are structured to process big data having manydomains and therefore it is the only suitable solution despiteits shortcomings LML models are recently introduced to bigdata topic extraction and are far from mature The existingLML topic models had limited application due to followingbatch processing approach andoverreliance ondataOur pro-posed model named as OAMC follows online approach forLML topic modeling that can support streaming data Withan efficient learning mechanism our OAMCmodel has lowerdependency on data and therefore improved performance by

Computational Intelligence and Neuroscience 9

AlarmClock Battery CDPlayer Fan LampIndividual domains

Topic coherence for each domain

OAMCAMC

minus750

minus770

minus790

minus810

minus830

minus850To

pic c

oher

ence

minus778 minus785

minus812

minus782

minus809

minus790

minus817

minus761

minus788

(a)

OAMCAMC

AlarmClock Battery CDPlayer Fan LampIndividual domains

Knowledge rules used for each domain

284

1423

345

863

226518

1520

446

0

500

1000

1500

2000

Num

ber o

f kno

wle

dge

rule

s use

d

(b)

Figure 5 The proposed model (OAMC) in comparison to the state-of-the-art LML model (a) produces better accuracy as topic coherencefor individual domains while (b) using fewer knowledge rules

50 as compared to state-of-the-art LML models Velocity isan important feature of big data as it is believed to be receivedcontinuously at high speed The OAMC model is designedto process the data in hand only and therefore outperformsexisting models for streaming data As the data is expectedto have noise OAMC is supported with a strict monitoringand filteringmechanism tomaintain a consistent knowledge-base The rules are made to compete for survival based onutility freshness and net confidence OAMC being blind tothe future gets higher bumps in accuracy when current taskhas low relevance to past experience This is very commonto a human-like learning approach However the tendencyof such tasks for which limited relevant knowledge availabledrops with experience considerably OAMCmodel shows 7improvement in accuracy on average for streaming big databut up to 10 for certain individual domains as compared tostate of the art

In the future we will be working on introducing morevarieties of knowledge Through complex network analysisthe knowledge rules can be stored in association with eachother Cooccurrence can mean many things and thereforethe relationship type should also be considered in future Itwill help to associate aspects with products sentiments toaspects and reasons to sentiments To improve the perfor-mance and attempt for near real-time analysis Gibbs sam-pling should be replaced with a lighter inference technique

Disclosure

This research is the authorsrsquo work not published or underreview elsewhere

Competing Interests

The authors declare that they have no competing interests

Authorsrsquo Contributions

In this research work Muhammad Taimoor Khan(corresponding author) developed the model while Mehr

Durrani helped with adjusting the knowledge learningand transfer mechanism to better adapt to the currenttask Furqan Aziz contributed through incorporating theknowledge maintenance module for high efficiency andreduced data dependency Shehzad Khalid supervised theactivity and contributed through running experiments andcompiling results Shehzad Khalid helped with introducingthe new knowledge features and with prioritizing them

Acknowledgments

The authors are thankful to Mr Yousaf Marwat Lecturerat COMSATS IIT Attock for facilitating with editing andproofreading services

References

[1] S Branavan H Chen J Eisenstein and R Barzilay ldquoLearningdocument-level semantic properties from free-text annota-tionsrdquo Journal of Artificial Intelligence Research vol 34 pp 569ndash603 2009

[2] S Brody and N Elhadad ldquoAn unsupervised aspect-sentimentmodel for online reviewsrdquo in Proceedings of the Human Lan-guage Technologies Conference of the North American Chapterof the Association for Computational Linguistics (NAACL HLTrsquo10) pp 804ndash812 Association for Computational LinguisticsLos Angeles Calif USA June 2010

[3] F Li M Huang and X Zhu ldquoSentiment analysis with globaltopics and local dependencyrdquo in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI rsquo10)Atlanta Ga USA July 2010

[4] Q Mei X Ling M Wondra H Su and C Zhai ldquoTopicsentiment mixture modeling facets and opinions in weblogsrdquoin Proceedings of the 16th International World Wide Web Con-ference (WWW rsquo07) pp 171ndash180 ACM Alberta Canada May2007

[5] I Titov and R McDonald ldquoModeling online reviews withmulti-grain topic modelsrdquo in Proceedings of the 17th Interna-tional Conference on World Wide Web (WWW rsquo08) pp 111ndash120ACM Beijing China April 2008

10 Computational Intelligence and Neuroscience

[6] Y Lu C X Zhai andN Sundaresan ldquoRated aspect summariza-tion of short commentsrdquo in Proceedings of the 18th InternationalWorld Wide Web Conference (WWW rsquo09) pp 131ndash140 ACMApril 2009

[7] A Mukherjee and B Liu ldquoAspect extraction through semi-supervisedmodelingrdquo inProceedings of the 50thAnnualMeetingof the Association for Computational Linguistics (ACL rsquo12) vol1 pp 339ndash348 ssociation for Computational Linguistics July2012

[8] I Titov and R T McDonald ldquoA joint model of text and aspectratings for sentiment summarizationrdquo in Proceedings of theAnnualMeeting of the Association for Computational Linguisticsvol 8 pp 308ndash316 Association for Computational LinguisticsColumbus Ohio USA 2008

[9] H Wang Y Lu and C Zhai ldquoLatent aspect rating analysis onreview text data a rating regression approachrdquo in Proceedings ofthe 16th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo10) pp 783ndash792 ACM July2010

[10] T L Griffiths M Steyvers D M Blei and J B TenenbaumldquoIntegrating topics and syntaxrdquo in Advances in Neural Informa-tion Processing Systems pp 537ndash544 MIT Press CambridgeMass USA 2004

[11] Y Jo and A H Oh ldquoAspect and sentiment unification modelfor online review analysisrdquo in Proceedings of the 4th ACMInternational Conference on Web Search and Data Mining(WSDM rsquo11) pp 815ndash824 ACM Hong Kong February 2011

[12] J Liu Y Cao C Y Lin Y Huang and M Zhou ldquoLow-quality product review detection in opinion summarizationrdquoin Proceedings of the Joint Conference on Empirical Methodsin Natural Language Processing and Computational NaturalLanguage Learning (EMNLP-CoNLL rsquo07) pp 334ndash342 PragueCzech Republic June 2007

[13] Y Lu and C Zhai ldquoOpinion integration through semi-supervised topic modelingrdquo in Proceedings of the 17th Interna-tional Conference onWorld Wide Web (WWW rsquo08) pp 121ndash130ACM Beijing China April 2008

[14] C Sauper A Haghighi and R Barzilay ldquoContent models withattituderdquo in Proceedings of the 49th Annual Meeting of theAssociation for Computational Linguistics vol 1 ofHuman Lan-guage Technologies pp 350ndash358 Association for ComputationalLinguistics 2011

[15] W X Zhao J Jiang H Yan and X Li ldquoJointly modeling aspectsand opinions with a MaxEnt-LDA hybridrdquo in Proceedings ofthe Conference on Empirical Methods in Natural Language Pro-cessing pp 56ndash65 Association for Computational LinguisticsOctober 2010

[16] D Andrzejewski X Zhu and M Craven ldquoIncorporatingdomain knowledge into topicmodeling via Dirichlet Forest pri-orsrdquo in Proceedings of the 26th Annual International Conferenceon Machine Learning pp 25ndash32 ACM June 2009

[17] D Andrzejewski X Zhu M Craven and B Recht ldquoA frame-work for incorporating general domain knowledge into latentdirichlet allocation using first-order logicrdquo in Proceedings ofthe 22nd International Joint Conference on Artificial Intelligence(IJCAI rsquo11) pp 1171ndash1177 Barcelona Spain July 2011

[18] Z Chen A Mukherjee B Liu M Hsu M Castellanos and RGhosh ldquoDiscovering coherent topics using general knowledgerdquoin Proceedings of the 22nd ACM International Conference onInformation and Knowledge Management (CIKM rsquo13) pp 209ndash218 ACM November 2013

[19] A Carlson J Betteridge B Kisiel B Settles E R Hruschka Jrand T M Mitchell ldquoToward an architecture for never-endinglanguage learningrdquo in Proceedings of the 24th AAAI Conferenceon Artificial Intelligence vol 5 p 3 Atlanta Ga USA July 2010

[20] Z Chen and B Liu ldquoMining topics in documents standingon the shoulders of big datardquo in Proceedings of the 20th ACMSIGKDD International Conference on Knowledge Discovery andData Mining (KDD rsquo14) pp 1116ndash1125 ACM New York NYUSA August 2014

[21] D Mimno H M Wallach E Talley M Leenders and AMcCallum ldquoOptimizing semantic coherence in topic modelsrdquoinProceedings of theConference onEmpiricalMethods inNaturalLanguage Processing pp 262ndash272 Association for Computa-tional Linguistics July 2011

[22] Z Chen A Mukherjee and B Liu ldquoAspect extraction withautomated prior knowledge learningrdquo in Proceedings of the 52ndAnnual Meeting of the Association for Computational Linguistics(ACL rsquo14) pp 347ndash358 Baltimore Md USA June 2014

[23] G Bouma ldquoNormalized (pointwise) mutual information incollocation extractionrdquo in Proceedings of the Conference of theGerman Society for Computational Linguistics (GSCL rsquo09) pp31ndash40 Potsdam Germany 2009

[24] D M Blei A Y Ng and M I Jordan ldquoLatent dirichletallocationrdquoThe Journal ofMachine Learning Research vol 3 no4-5 pp 993ndash1022 2003

[25] T Hofmann ldquoProbabilistic latent semantic indexingrdquo in Pro-ceedings of the 22nd Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval(SIGIR rsquo99) pp 50ndash57 ACM Berkeley Calif USA August 1999

[26] E Eaton and P L Ruvolo ldquoElla an efficient lifelong learningalgorithmrdquo in Proceedings of the 30th International Conferenceon Machine Learning (ICML rsquo13) pp 507ndash515 2013

[27] E Kamar A Kapoor and E Horvitz ldquoLifelong learning foracquiring thewisdom of the crowdrdquo in Proceedings of the 23rdInternational Joint Conference on Artificial Intelligence (IJCAIrsquo13) pp 2313ndash2320 AAAI Press Beijing China August 2013

[28] R Raina A Battle H Lee B Packer and A Y Ng ldquoSelf-taughtlearning transfer learning from unlabeled datardquo in Proceedingsof the 24th International Conference onMachine Learning (ICMLrsquo07) pp 759ndash766 ACM June 2007

[29] D L Silver Q Yang and L Li ldquoLifelong machine learningsystems Be-yond learning algorithmsrdquo in Proceedings of theAAAI Spring Symposium Lifelong Machine Learning StanfordUniversity March 2013

[30] Z Chen and B Liu ldquoTopic modeling using topics from manydomains lifelong learning and big datardquo in Proceedings of the31st International Conference on Machine Learning (ICML rsquo14)pp 703ndash711 Beijing China June 2014

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 8: Research Article Online Knowledge-Based Model for Big Data ...downloads.hindawi.com/journals/cin/2016/6081804.pdf · Research Article Online Knowledge-Based Model for Big Data Topic

8 Computational Intelligence and Neuroscience

LTM AMC-M AMC OAMCLML models

Performance

020406080

100120140

Tim

e (m

in)

Figure 4 The proposed model (OAMC) has the highest perfor-mance efficiency among LML topic models

storing mechanism OAMC learnt more knowledge refinedit and used it to improve topic coherence with few domainsavailable The OAMC model shows on average 7 with amaximum of 10 improvement in topic coherence whenthere is one domain at a time

OAMC learns wrong knowledge too as it considersonly the current domain for harvesting it However with arobust knowledge retention module after performing fewtasks only good quality knowledge survives through thefiltering mechanism Knowledge rules with less confidencefrom future tasks or limited utility are pushed down andfinally filtered out Knowledge that is getting confidencereenforced from future tasks and has growing utility stayslonger To evaluate relevance of a knowledge rule with thecurrent task the domain of the current task must have therule words confidence above threshold and overlap with therule background higher than threshold The online OAMCmodels truly learn like humans having no idea of future tasksunlike existing models Its learning is dependent on the pastexperience as the domains processed only and uses it for afuture task It is less dependent on data parses it once andconsiders the knowledge module with abstract rules only tosample relevant rules and therefore shows improvement of upto 50 in performance as shown in Figure 4

For the discussed intrinsic evaluation the models arealso compared through PMI mean and median values thathad relevance with topic coherence values The extrinsicWordNet based techniques used in the literature (WuPalmerResnik Path Lin LeacockChodorow Lesk JiangConrath andHirstStOnge) were inconclusive to differentiate among thequalities of topics extracted due to the specialized nature ofelectronic products domains

53 Evaluation OAMC improving performance of AMCby 50 as shown in Figure 4 is attributed to the followingreasons OAMC does not extract relevant knowledge for eachtask and rather chooses from a well maintained knowledgemodule Each domain is parsed only once With a knowledgefiltering mechanism only good quality knowledge rules aremaintainedThere are generally two performance bottlenecksin existing LML models that is FIM for knowledge extrac-tion and higher iterations of Gibbs sampling The OAMCmodel uses nPMI [23] which is highly efficient to extract

knowledge Secondly it learns both must-links and cannot-links together with a single mechanism as rules Even with50 domains in the dataset OAMC performed better thanAMC for certain individual domains by producing highertopic coherence while using fewer knowledge rules as shownin Figure 5 Table 1 shows some of the most frequently usedknowledge rules of OAMC Through manual evaluation oftopics from all the domains it was observed that the knowl-edge learnt by LMLmodels is more effective towards randomand intruded types of topic incoherence but less effectivetowards chained and unbalanced type of incoherence In factthe knowledge-based models can at times enhance chainedtopic incoherence if there are many knowledge rule pairssharing common words

54 Knowledge Analysis The accuracy of LML modelsdepends upon the quality of its knowledge The previousmodels did not provide a discussion on the quality ofknowledge In fact improved topic coherence was consideredas reason for good quality knowledge The knowledge rulesshown in Table 1 have utility above 10 as a token of theirquality The ratio of use of knowledge rules per domain byAMC and OAMC is 5 1 signifying the quality of knowledgeproduced by OAMC Net confidence of rules also showsthat most of the relevant domains enforced the rule andincreased its confidence (above 08with 1 asmaximum) Sincethis information is not available for the existing models thevalues cannot be matched It is still very difficult to explorethe impact of each knowledge rule separately however thediscussed features in general keep track of good qualityknowledge The knowledge learnt is of two types only cap-turing positive and negative word corelations However morevarieties of knowledge can be introduced for further insightsBy introducing ldquothe type of relationshiprdquo as knowledge theunbalanced topic incoherence problem can be addressedAlong with correlation between words the nature of theircorrelationmay also be explored for better analysisThis typeof knowledge is also useful for finding hierarchical topicsThe knowledge rules extracted are isolated while knowledgeas complex network can be more effective for a variety ofanalyses AMC uses a small graph only to associate must-links sharing common word in the same context Knowledgeextraction transfer representation and retention are onlyintroduced to NLP tasks in 2010 and require mature ideasfrom other fields of research

6 Conclusion

LML models are structured to process big data having manydomains and therefore it is the only suitable solution despiteits shortcomings LML models are recently introduced to bigdata topic extraction and are far from mature The existingLML topic models had limited application due to followingbatch processing approach andoverreliance ondataOur pro-posed model named as OAMC follows online approach forLML topic modeling that can support streaming data Withan efficient learning mechanism our OAMCmodel has lowerdependency on data and therefore improved performance by

Computational Intelligence and Neuroscience 9

AlarmClock Battery CDPlayer Fan LampIndividual domains

Topic coherence for each domain

OAMCAMC

minus750

minus770

minus790

minus810

minus830

minus850To

pic c

oher

ence

minus778 minus785

minus812

minus782

minus809

minus790

minus817

minus761

minus788

(a)

OAMCAMC

AlarmClock Battery CDPlayer Fan LampIndividual domains

Knowledge rules used for each domain

284

1423

345

863

226518

1520

446

0

500

1000

1500

2000

Num

ber o

f kno

wle

dge

rule

s use

d

(b)

Figure 5 The proposed model (OAMC) in comparison to the state-of-the-art LML model (a) produces better accuracy as topic coherencefor individual domains while (b) using fewer knowledge rules

50 as compared to state-of-the-art LML models Velocity isan important feature of big data as it is believed to be receivedcontinuously at high speed The OAMC model is designedto process the data in hand only and therefore outperformsexisting models for streaming data As the data is expectedto have noise OAMC is supported with a strict monitoringand filteringmechanism tomaintain a consistent knowledge-base The rules are made to compete for survival based onutility freshness and net confidence OAMC being blind tothe future gets higher bumps in accuracy when current taskhas low relevance to past experience This is very commonto a human-like learning approach However the tendencyof such tasks for which limited relevant knowledge availabledrops with experience considerably OAMCmodel shows 7improvement in accuracy on average for streaming big databut up to 10 for certain individual domains as compared tostate of the art

In the future we will be working on introducing morevarieties of knowledge Through complex network analysisthe knowledge rules can be stored in association with eachother Cooccurrence can mean many things and thereforethe relationship type should also be considered in future Itwill help to associate aspects with products sentiments toaspects and reasons to sentiments To improve the perfor-mance and attempt for near real-time analysis Gibbs sam-pling should be replaced with a lighter inference technique

Disclosure

This research is the authorsrsquo work not published or underreview elsewhere

Competing Interests

The authors declare that they have no competing interests

Authorsrsquo Contributions

In this research work Muhammad Taimoor Khan(corresponding author) developed the model while Mehr

Durrani helped with adjusting the knowledge learningand transfer mechanism to better adapt to the currenttask Furqan Aziz contributed through incorporating theknowledge maintenance module for high efficiency andreduced data dependency Shehzad Khalid supervised theactivity and contributed through running experiments andcompiling results Shehzad Khalid helped with introducingthe new knowledge features and with prioritizing them

Acknowledgments

The authors are thankful to Mr Yousaf Marwat Lecturerat COMSATS IIT Attock for facilitating with editing andproofreading services

References

[1] S Branavan H Chen J Eisenstein and R Barzilay ldquoLearningdocument-level semantic properties from free-text annota-tionsrdquo Journal of Artificial Intelligence Research vol 34 pp 569ndash603 2009

[2] S Brody and N Elhadad ldquoAn unsupervised aspect-sentimentmodel for online reviewsrdquo in Proceedings of the Human Lan-guage Technologies Conference of the North American Chapterof the Association for Computational Linguistics (NAACL HLTrsquo10) pp 804ndash812 Association for Computational LinguisticsLos Angeles Calif USA June 2010

[3] F Li M Huang and X Zhu ldquoSentiment analysis with globaltopics and local dependencyrdquo in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI rsquo10)Atlanta Ga USA July 2010

[4] Q Mei X Ling M Wondra H Su and C Zhai ldquoTopicsentiment mixture modeling facets and opinions in weblogsrdquoin Proceedings of the 16th International World Wide Web Con-ference (WWW rsquo07) pp 171ndash180 ACM Alberta Canada May2007

[5] I Titov and R McDonald ldquoModeling online reviews withmulti-grain topic modelsrdquo in Proceedings of the 17th Interna-tional Conference on World Wide Web (WWW rsquo08) pp 111ndash120ACM Beijing China April 2008

10 Computational Intelligence and Neuroscience

[6] Y Lu C X Zhai andN Sundaresan ldquoRated aspect summariza-tion of short commentsrdquo in Proceedings of the 18th InternationalWorld Wide Web Conference (WWW rsquo09) pp 131ndash140 ACMApril 2009

[7] A Mukherjee and B Liu ldquoAspect extraction through semi-supervisedmodelingrdquo inProceedings of the 50thAnnualMeetingof the Association for Computational Linguistics (ACL rsquo12) vol1 pp 339ndash348 ssociation for Computational Linguistics July2012

[8] I Titov and R T McDonald ldquoA joint model of text and aspectratings for sentiment summarizationrdquo in Proceedings of theAnnualMeeting of the Association for Computational Linguisticsvol 8 pp 308ndash316 Association for Computational LinguisticsColumbus Ohio USA 2008

[9] H Wang Y Lu and C Zhai ldquoLatent aspect rating analysis onreview text data a rating regression approachrdquo in Proceedings ofthe 16th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo10) pp 783ndash792 ACM July2010

[10] T L Griffiths M Steyvers D M Blei and J B TenenbaumldquoIntegrating topics and syntaxrdquo in Advances in Neural Informa-tion Processing Systems pp 537ndash544 MIT Press CambridgeMass USA 2004

[11] Y Jo and A H Oh ldquoAspect and sentiment unification modelfor online review analysisrdquo in Proceedings of the 4th ACMInternational Conference on Web Search and Data Mining(WSDM rsquo11) pp 815ndash824 ACM Hong Kong February 2011

[12] J Liu Y Cao C Y Lin Y Huang and M Zhou ldquoLow-quality product review detection in opinion summarizationrdquoin Proceedings of the Joint Conference on Empirical Methodsin Natural Language Processing and Computational NaturalLanguage Learning (EMNLP-CoNLL rsquo07) pp 334ndash342 PragueCzech Republic June 2007

[13] Y Lu and C Zhai ldquoOpinion integration through semi-supervised topic modelingrdquo in Proceedings of the 17th Interna-tional Conference onWorld Wide Web (WWW rsquo08) pp 121ndash130ACM Beijing China April 2008

[14] C Sauper A Haghighi and R Barzilay ldquoContent models withattituderdquo in Proceedings of the 49th Annual Meeting of theAssociation for Computational Linguistics vol 1 ofHuman Lan-guage Technologies pp 350ndash358 Association for ComputationalLinguistics 2011

[15] W X Zhao J Jiang H Yan and X Li ldquoJointly modeling aspectsand opinions with a MaxEnt-LDA hybridrdquo in Proceedings ofthe Conference on Empirical Methods in Natural Language Pro-cessing pp 56ndash65 Association for Computational LinguisticsOctober 2010

[16] D Andrzejewski X Zhu and M Craven ldquoIncorporatingdomain knowledge into topicmodeling via Dirichlet Forest pri-orsrdquo in Proceedings of the 26th Annual International Conferenceon Machine Learning pp 25ndash32 ACM June 2009

[17] D Andrzejewski X Zhu M Craven and B Recht ldquoA frame-work for incorporating general domain knowledge into latentdirichlet allocation using first-order logicrdquo in Proceedings ofthe 22nd International Joint Conference on Artificial Intelligence(IJCAI rsquo11) pp 1171ndash1177 Barcelona Spain July 2011

[18] Z Chen A Mukherjee B Liu M Hsu M Castellanos and RGhosh ldquoDiscovering coherent topics using general knowledgerdquoin Proceedings of the 22nd ACM International Conference onInformation and Knowledge Management (CIKM rsquo13) pp 209ndash218 ACM November 2013

[19] A Carlson J Betteridge B Kisiel B Settles E R Hruschka Jrand T M Mitchell ldquoToward an architecture for never-endinglanguage learningrdquo in Proceedings of the 24th AAAI Conferenceon Artificial Intelligence vol 5 p 3 Atlanta Ga USA July 2010

[20] Z Chen and B Liu ldquoMining topics in documents standingon the shoulders of big datardquo in Proceedings of the 20th ACMSIGKDD International Conference on Knowledge Discovery andData Mining (KDD rsquo14) pp 1116ndash1125 ACM New York NYUSA August 2014

[21] D Mimno H M Wallach E Talley M Leenders and AMcCallum ldquoOptimizing semantic coherence in topic modelsrdquoinProceedings of theConference onEmpiricalMethods inNaturalLanguage Processing pp 262ndash272 Association for Computa-tional Linguistics July 2011

[22] Z Chen A Mukherjee and B Liu ldquoAspect extraction withautomated prior knowledge learningrdquo in Proceedings of the 52ndAnnual Meeting of the Association for Computational Linguistics(ACL rsquo14) pp 347ndash358 Baltimore Md USA June 2014

[23] G Bouma ldquoNormalized (pointwise) mutual information incollocation extractionrdquo in Proceedings of the Conference of theGerman Society for Computational Linguistics (GSCL rsquo09) pp31ndash40 Potsdam Germany 2009

[24] D M Blei A Y Ng and M I Jordan ldquoLatent dirichletallocationrdquoThe Journal ofMachine Learning Research vol 3 no4-5 pp 993ndash1022 2003

[25] T Hofmann ldquoProbabilistic latent semantic indexingrdquo in Pro-ceedings of the 22nd Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval(SIGIR rsquo99) pp 50ndash57 ACM Berkeley Calif USA August 1999

[26] E Eaton and P L Ruvolo ldquoElla an efficient lifelong learningalgorithmrdquo in Proceedings of the 30th International Conferenceon Machine Learning (ICML rsquo13) pp 507ndash515 2013

[27] E Kamar A Kapoor and E Horvitz ldquoLifelong learning foracquiring thewisdom of the crowdrdquo in Proceedings of the 23rdInternational Joint Conference on Artificial Intelligence (IJCAIrsquo13) pp 2313ndash2320 AAAI Press Beijing China August 2013

[28] R Raina A Battle H Lee B Packer and A Y Ng ldquoSelf-taughtlearning transfer learning from unlabeled datardquo in Proceedingsof the 24th International Conference onMachine Learning (ICMLrsquo07) pp 759ndash766 ACM June 2007

[29] D L Silver Q Yang and L Li ldquoLifelong machine learningsystems Be-yond learning algorithmsrdquo in Proceedings of theAAAI Spring Symposium Lifelong Machine Learning StanfordUniversity March 2013

[30] Z Chen and B Liu ldquoTopic modeling using topics from manydomains lifelong learning and big datardquo in Proceedings of the31st International Conference on Machine Learning (ICML rsquo14)pp 703ndash711 Beijing China June 2014

[Figure 5 appears here: two bar charts comparing OAMC and AMC over the individual domains Alarm Clock, Battery, CD Player, Fan, and Lamp. Panel (a), "Topic coherence for each domain," plots topic coherence on a scale from -750 to -850; panel (b), "Knowledge rules used for each domain," plots the number of knowledge rules used, on a scale from 0 to 2000.]

Figure 5: The proposed model (OAMC), in comparison to the state-of-the-art LML model, (a) produces better accuracy as topic coherence for individual domains, while (b) using fewer knowledge rules.
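The coherence scores in Figure 5(a) are negative because corpus-based coherence metrics, such as the one proposed by Mimno et al. [21], sum log co-occurrence probabilities over a topic's top word pairs; scores closer to zero indicate a more coherent topic. The following is a minimal sketch of such a metric, not the paper's exact implementation; it assumes each document is given as a set of tokens and that every top word occurs in at least one document.

import math

def topic_coherence(top_words, docs):
    # Corpus-based coherence in the style of Mimno et al. [21]:
    # sum log((D(w_m, w_l) + 1) / D(w_l)) over ordered top-word pairs,
    # where D(...) counts the documents containing all given words.
    def doc_freq(*words):
        return sum(1 for d in docs if all(w in d for w in words))
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1)
                              / doc_freq(top_words[l]))
    return score

Summing many small log-probabilities over all topics of a domain yields large negative totals, which is consistent with the range plotted in Figure 5(a).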

OAMC reduces the processing cost by 50% as compared to state-of-the-art LML models. Velocity is an important feature of big data, as content is received continuously and at high speed. The OAMC model is designed to process only the data in hand and therefore outperforms existing models on streaming data. Since the data is expected to contain noise, OAMC is supported with a strict monitoring and filtering mechanism to maintain a consistent knowledge-base: rules compete for survival based on utility, freshness, and net confidence. Being blind to the future, OAMC suffers larger drops in accuracy when the current task has low relevance to past experience, which is natural for a human-like learning approach. However, the proportion of tasks for which only limited relevant knowledge is available drops considerably with experience. Overall, the OAMC model shows a 7% improvement in accuracy on average for streaming big data, and up to 10% for certain individual domains, as compared to the state of the art.
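The paper does not spell out this maintenance step in code, but the rule competition described above can be sketched as follows. The KnowledgeRule fields and the thresholds in prune_knowledge_base are hypothetical; they only illustrate how utility, freshness, and net confidence might jointly decide whether a rule survives.

from dataclasses import dataclass

@dataclass
class KnowledgeRule:
    # A cooccurrence-based rule linking two words (illustrative fields).
    words: tuple
    positive: int = 0    # tasks where the rule improved topic quality
    negative: int = 0    # tasks where the rule was contradicted
    used: int = 0        # tasks in which the rule was applied (utility)
    last_task: int = 0   # most recent task that used it (freshness)

    def net_confidence(self) -> float:
        total = self.positive + self.negative
        return (self.positive - self.negative) / total if total else 0.0

def prune_knowledge_base(rules, current_task, max_age=10,
                         min_utility=1, min_confidence=0.2):
    # Keep only rules that are fresh, have been useful, and remain
    # confident; the rest lose the competition and are discarded.
    return [r for r in rules
            if (current_task - r.last_task) <= max_age
            and r.used >= min_utility
            and r.net_confidence() >= min_confidence]

Under a scheme like this, a stale or frequently contradicted rule is dropped even if it once had high confidence, which is what keeps the knowledge-base consistent as noisy streaming data arrives.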

In the future, we will work on introducing more varieties of knowledge. Through complex network analysis, knowledge rules can be stored in association with each other. Since cooccurrence can signal many kinds of relationship, the relationship type should also be recorded in future work; this will help associate aspects with products, sentiments with aspects, and reasons with sentiments. To improve performance and approach near real-time analysis, Gibbs sampling should be replaced with a lighter inference technique.

Disclosure

This research is the authors' work, not published or under review elsewhere.

Competing Interests

The authors declare that they have no competing interests.

Authors' Contributions

In this research work, Muhammad Taimoor Khan (corresponding author) developed the model, while Mehr Durrani helped with adjusting the knowledge learning and transfer mechanism to better adapt to the current task. Furqan Aziz contributed by incorporating the knowledge maintenance module for high efficiency and reduced data dependency. Shehzad Khalid supervised the activity, contributed through running experiments and compiling results, and helped with introducing the new knowledge features and prioritizing them.

Acknowledgments

The authors are thankful to Mr. Yousaf Marwat, Lecturer at COMSATS IIT, Attock, for facilitating editing and proofreading services.

References

[1] S. Branavan, H. Chen, J. Eisenstein, and R. Barzilay, "Learning document-level semantic properties from free-text annotations," Journal of Artificial Intelligence Research, vol. 34, pp. 569-603, 2009.

[2] S. Brody and N. Elhadad, "An unsupervised aspect-sentiment model for online reviews," in Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '10), pp. 804-812, Association for Computational Linguistics, Los Angeles, Calif, USA, June 2010.

[3] F. Li, M. Huang, and X. Zhu, "Sentiment analysis with global topics and local dependency," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI '10), Atlanta, Ga, USA, July 2010.

[4] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai, "Topic sentiment mixture: modeling facets and opinions in weblogs," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 171-180, ACM, Alberta, Canada, May 2007.

[5] I. Titov and R. McDonald, "Modeling online reviews with multi-grain topic models," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 111-120, ACM, Beijing, China, April 2008.

[6] Y. Lu, C. X. Zhai, and N. Sundaresan, "Rated aspect summarization of short comments," in Proceedings of the 18th International World Wide Web Conference (WWW '09), pp. 131-140, ACM, April 2009.

[7] A. Mukherjee and B. Liu, "Aspect extraction through semi-supervised modeling," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL '12), vol. 1, pp. 339-348, Association for Computational Linguistics, July 2012.

[8] I. Titov and R. T. McDonald, "A joint model of text and aspect ratings for sentiment summarization," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, vol. 8, pp. 308-316, Association for Computational Linguistics, Columbus, Ohio, USA, 2008.

[9] H. Wang, Y. Lu, and C. Zhai, "Latent aspect rating analysis on review text data: a rating regression approach," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), pp. 783-792, ACM, July 2010.

[10] T. L. Griffiths, M. Steyvers, D. M. Blei, and J. B. Tenenbaum, "Integrating topics and syntax," in Advances in Neural Information Processing Systems, pp. 537-544, MIT Press, Cambridge, Mass, USA, 2004.

[11] Y. Jo and A. H. Oh, "Aspect and sentiment unification model for online review analysis," in Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM '11), pp. 815-824, ACM, Hong Kong, February 2011.

[12] J. Liu, Y. Cao, C. Y. Lin, Y. Huang, and M. Zhou, "Low-quality product review detection in opinion summarization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 334-342, Prague, Czech Republic, June 2007.

[13] Y. Lu and C. Zhai, "Opinion integration through semi-supervised topic modeling," in Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 121-130, ACM, Beijing, China, April 2008.

[14] C. Sauper, A. Haghighi, and R. Barzilay, "Content models with attitude," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, vol. 1 of Human Language Technologies, pp. 350-358, Association for Computational Linguistics, 2011.

[15] W. X. Zhao, J. Jiang, H. Yan, and X. Li, "Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 56-65, Association for Computational Linguistics, October 2010.

[16] D. Andrzejewski, X. Zhu, and M. Craven, "Incorporating domain knowledge into topic modeling via Dirichlet Forest priors," in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 25-32, ACM, June 2009.

[17] D. Andrzejewski, X. Zhu, M. Craven, and B. Recht, "A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic," in Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI '11), pp. 1171-1177, Barcelona, Spain, July 2011.

[18] Z. Chen, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh, "Discovering coherent topics using general knowledge," in Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM '13), pp. 209-218, ACM, November 2013.

[19] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell, "Toward an architecture for never-ending language learning," in Proceedings of the 24th AAAI Conference on Artificial Intelligence, vol. 5, p. 3, Atlanta, Ga, USA, July 2010.

[20] Z. Chen and B. Liu, "Mining topics in documents: standing on the shoulders of big data," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pp. 1116-1125, ACM, New York, NY, USA, August 2014.

[21] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum, "Optimizing semantic coherence in topic models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262-272, Association for Computational Linguistics, July 2011.

[22] Z. Chen, A. Mukherjee, and B. Liu, "Aspect extraction with automated prior knowledge learning," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL '14), pp. 347-358, Baltimore, Md, USA, June 2014.

[23] G. Bouma, "Normalized (pointwise) mutual information in collocation extraction," in Proceedings of the Conference of the German Society for Computational Linguistics (GSCL '09), pp. 31-40, Potsdam, Germany, 2009.

[24] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," The Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993-1022, 2003.

[25] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 50-57, ACM, Berkeley, Calif, USA, August 1999.

[26] E. Eaton and P. L. Ruvolo, "ELLA: an efficient lifelong learning algorithm," in Proceedings of the 30th International Conference on Machine Learning (ICML '13), pp. 507-515, 2013.

[27] E. Kamar, A. Kapoor, and E. Horvitz, "Lifelong learning for acquiring the wisdom of the crowd," in Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI '13), pp. 2313-2320, AAAI Press, Beijing, China, August 2013.

[28] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, "Self-taught learning: transfer learning from unlabeled data," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 759-766, ACM, June 2007.

[29] D. L. Silver, Q. Yang, and L. Li, "Lifelong machine learning systems: beyond learning algorithms," in Proceedings of the AAAI Spring Symposium: Lifelong Machine Learning, Stanford University, March 2013.

[30] Z. Chen and B. Liu, "Topic modeling using topics from many domains, lifelong learning and big data," in Proceedings of the 31st International Conference on Machine Learning (ICML '14), pp. 703-711, Beijing, China, June 2014.
