sobigdata – 654024 - cnr

23
SoBigData – 654024 www.sobigdata.eu SoBigData receives funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 654024 Project Acronym SoBigData Project Title SoBigData Research Infrastructure Social Mining & Big Data Ecosystem Project Number 654024 Deliverable Title Resource adaptation to register to the e-infrastructure 2 Deliverable No. D10.9 Delivery Date October 2018 Authors Massimiliano Assante (CNR), Leonardo Candela (CNR), Paolo Manghi (CNR), Pasquale Pagano (CNR) Ref. Ares(2018)5066838 - 03/10/2018

Upload: others

Post on 16-Jan-2022

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

SoBigData receives funding from the EuropeanUnion’sHorizon 2020 research and innovation programmeundergrantagreementNo.654024

ProjectAcronym SoBigData

ProjectTitle SoBigDataResearchInfrastructureSocialMining&BigDataEcosystem

ProjectNumber 654024

DeliverableTitle Resourceadaptationtoregistertothee-infrastructure2

DeliverableNo. D10.9

DeliveryDate October2018

Authors MassimilianoAssante(CNR),LeonardoCandela(CNR),PaoloManghi(CNR),PasqualePagano(CNR)

Ref. Ares(2018)5066838 - 03/10/2018

Page 2: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page2of23

DOCUMENTINFORMATIONPROJECT

ProjectAcronym SoBigDataProjectTitle SoBigDataResearchInfrastructure

SocialMining&BigDataEcosystemProjectStart 1stSeptember2015ProjectDuration 48monthsFunding H2020-INFRAIA-2014-2015GrantAgreementNo. 654024

DOCUMENTDeliverableNo. D10.9DeliverableTitle Resourceadaptationtoregistertothee-infrastructure2ContractualDeliveryDate 30September2018ActualDeliveryDate 03October2018Author(s) MassimilianoAssante(CNR),LeonardoCandela(CNR),PaoloManghi, (CNR)

PasqualePagano(CNR)Editor(s) MassimilianoAssante(CNR),BeatriceRapisarda(CNR)Reviewer(s) RobertoTrasarti(CNR)Contributor(s) Kalina Bontcheva (USFD), Marco Cornolti (UNIPI), Stefano Cresci (CNR),

ThorstenMay(FRH),CristinaMuntean(CNR),MarcoPonza(UNIPI),DanieleRegoli(SNS),RobertoTrasarti(CNR)

WorkPackageNo. WP10WorkPackageTitle JRA3_SoBigDatae-InfrastructureWorkPackageLeader CNRWorkPackageParticipants CNR,USFD,UNIPI,FRH,UT,IMT,LUH,KCL,SNS,AALTO,ETHZDissemination PublicNature ReportVersion/Revision 1.0Draft/Final FinalTotalNo.Pages(includingcover) 23

Keywords e-infrastructure,services,resources,integration

Page 3: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page3of23

DISCLAIMERSoBigData(654024) isaResearchandInnovationAction(RIA)fundedbytheEuropeanCommissionundertheHorizon2020researchandinnovationprogramme.

SoBigData proposes to create the Social Mining & Big Data Ecosystem: a research infrastructure (RI)providing an integrated ecosystem for ethic-sensitive scientific discoveries and advanced applications ofsocialdataminingon thevariousdimensionsof social life,as recordedby“bigdata”.Buildingonseveralestablished national infrastructures, SoBigData will open up new research avenues in multiple researchfields,includingmathematics,ICT,andhuman,socialandeconomicsciences,byenablingeasycomparison,re-useandintegrationofstate-of-the-artbigsocialdata,methods,andservices,intonewresearch.

This document contains informationon SoBigData core activities, findings andoutcomes and itmay alsocontain contributions from distinguished experts who contribute as SoBigData Board members. Anyreference to content in this document should clearly indicate the authors, source, organisation andpublicationdate.

The document has been produced with the funding of the European Commission. The content of thispublication is the sole responsibility of the SoBigData Consortium and its experts, and it cannot beconsideredtoreflecttheviewsoftheEuropeanCommission.Theauthorsofthisdocumenthavetakenanyavailable measure in order for its content to be accurate, consistent and lawful. However, neither theproject consortium as a whole nor the individual partners that implicitly or explicitly participated thecreation and publication of this document hold any sort of responsibility thatmight occur as a result ofusingitscontent.

The European Union (EU) was established in accordance with the Treaty on the European Union(Maastricht). There are currently 27member states of the EuropeanUnion. It is basedon the EuropeanCommunitiesandthememberstates’cooperationinthefieldsofCommonForeignandSecurityPolicyandJusticeandHomeAffairs.The fivemain institutionsof theEuropeanUnionare theEuropeanParliament,the Council of Ministers, the European Commission, the Court of Justice, and the Court of Auditors(http://europa.eu.int/).

Copyright©TheSoBigDataConsortium2015.Seehttp://project.sobigdata.eu/fordetailsonthecopyrightholders.

Formore informationon theproject, its partners and contributorsplease seehttp://project.sobigdata.eu/. Youarepermittedtocopyanddistributeverbatimcopiesofthisdocumentcontainingthiscopyrightnotice,butmodifyingthisdocument isnotallowed.Youarepermittedtocopythisdocument inwholeor inpart intootherdocuments ifyouattachthefollowingreferencetothecopiedelements:“Copyright©TheSoBigDataConsortium2015.”

TheinformationcontainedinthisdocumentrepresentstheviewsoftheSoBigDataConsortiumasofthedatetheyarepublished.TheSoBigDataConsortiumdoesnotguaranteethatanyinformationcontainedhereiniserror-free,oruptodate.THESoBigDataCONSORTIUMMAKESNOWARRANTIES,EXPRESS,IMPLIED,ORSTATUTORY,BYPUBLISHINGTHISDOCUMENT.

Page 4: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page4of23

GLOSSARYABBREVIATION DEFINITION

VA VirtualAccess

VRE VirtualResearchEnvironment

WPS WebProcessingService

Page 5: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page5of23

TABLEOFCONTENT

DOCUMENTINFORMATION.........................................................................................................2

DISCLAIMER................................................................................................................................3

GLOSSARY...................................................................................................................................4

TABLEOFCONTENT.....................................................................................................................5

DELIVERABLESUMMARY.............................................................................................................6

EXECUTIVESUMMARY.................................................................................................................7

1 RelevancetoSoBigData.........................................................................................................8

2 Introduction..........................................................................................................................9

3 Integrationpatterns.............................................................................................................103.1 Applications...............................................................................................................................103.2 Methods.....................................................................................................................................10

4 Resourceintegrationuse-cases............................................................................................134.1 Applicationsintegrationexperiences..........................................................................................13

4.1.1 TagMe(MarcoCornolti,DepartmentofComputerScience,UniversityofPisa)........................134.1.2 SMAPH(MarcoCornolti,DepartmentofComputerScience,UniversityofPisa).......................144.1.3 TwitterMonitor(StefanoCresciandSalvatoreMinutoli,IIT-CNR).............................................144.1.4 WAT(MarcoPonza,UNIPI).........................................................................................................174.1.5 SWAT(MarcoPonza,UNIPI).......................................................................................................17

4.2 Methodsintegrationexperiences...............................................................................................174.2.1 TrajectoryBuilder(RobertoTrasarti,ISTI-CNR)...........................................................................174.2.2 QuickRank(CristinaMuntean,ISTI-CNR)....................................................................................184.2.3 GateCloud(KalinaBontcheva,UniversityofSheffield)...............................................................214.2.4 StatVal(DanieleRegoli,SNS)......................................................................................................224.2.5 OPTICSCLUSTERING(RobertoTrasarti,CNR)..................................................................................23

Page 6: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page6of23

DELIVERABLESUMMARYDeliverableD10.9“Resourceadaptationtoregistertothee-infrastructure2” is therevisedversionoftheD10.8 deliverable “Resource adaptation to register to the e-infrastructure 1”, intended to report theexperiences of partners from different infrastructures at integrating their services, methods, andapplicationsasSoBigDataresources.Thefirstsectiondescribesthegeneralintegrationpatterns,whilethesecondsectionreportstheexperiencesfromtheindividualpartners,revealingtheeffortrequired,intermsof time and technical complexity, and earned benefits. This revised version of the document covers thewhole period of the project, including the up to date information of theD10.8 deliverable and the newexperiences of partners at integrating their services, methods, and applications as SoBigData resources,developedthroughtheproject’slifetime.

Page 7: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page7of23

EXECUTIVESUMMARYDeliverableD10.9“Resourceadaptationtoregistertothee-infrastructure2” is therevisedversionoftheD10.8 deliverable “Resource adaptation to register to the e-infrastructure 1”. In order to achieveVirtualAccess SoBigData partnersmust undertake a process of integration of theirmethod resources in the e-infrastructure. This deliverable reports the experiences of partners from different infrastructures atintegratingtheirservices,methods,andapplicationsasSoBigDataresources.Thefirstsectiondescribesthegeneralintegrationpatterns,whilethesecondsectionreportstheexperiencesfromtheindividualpartners,revealingtheeffortrequired,intermsoftimeandtechnicalcomplexity,andearnedbenefits.

The feedbackobtainedduring the first reportingperiodhasbeenused to further simplify theprocessofintegrationandmitigatecomplexity(whenthiscanbedone).Specifically,tofacilitatetheimportingprocessofmethods(andtheirversionupdating)intheSoBigDatainfrastructure,anewWebapplicationnamedSAI(StatisticalAlgorithmsImporter)hasbeenmadeavailable.

Atthetimeofitsrelease,thedeliverablecoverstheactivitiesintheaddressedareasperformedinthefirst37months of the project. The project activities cover all effort spent from partners at integrating theirservices,methods, and applications as SoBigData resources, developed through the project’s lifetime. Inparticular,twoapplicationexperiencesandonemethodexperiencewereaddedinD10.9,namely,theWATandSWATapplicationsandtheOpticsClusteringmethod.

Page 8: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page8of23

1 RELEVANCETOSOBIGDATA

In order to achieve Virtual Access SoBigData partners must undertake a process of integration of theirmethod resources in the e-infrastructure. This document reports on the actual experience of scientistsinvolvedinthisprocess,withtheaimofhighlightingthechallengesandthebenefits.Thisfeedbackwillbeusedtofurthersimplifytheprocessofintegrationandmitigatecomplexitywhenthiscanbedone.

Page 9: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page9of23

2 INTRODUCTION

This deliverable reports the experiences of partners from different infrastructures at integrating theirservices, methods, and applications as SoBigData resources. It is the revised version of the documentcoveringthewholeperiodoftheproject,includingtheuptodateinformationoftheD10.8deliverableandthe new experiences of partners at integrating their services, methods, and applications as SoBigDataresources,developedthroughtheproject’slifetime.

Thefirstsectiondescribesthegeneralintegrationpatterns,whichwereimprovedwithrespecttoD10.8.Inparticular, to facilitate the importing process of methods (and their version updating) in the SoBigDatainfrastructure,anewWebapplicationnamedSAI(StatisticalAlgorithmsImporter)hasbeenmadeavailable.SAI provide scientistswith an interface that allows to easily andquickly import Java, PythonorR scriptsontothegeneral-purposegCube-baseddataanalyticsengine.

The secondsection reports theexperiences fromthe individualpartners, revealing theeffort required, intermsoftimeandtechnicalcomplexity,andearnedbenefits.

Page 10: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page10of23

3 INTEGRATIONPATTERNS

3.1 APPLICATIONS

An application in the SoBigData infrastructure is a stand-alone system running on a remote server andoffering one or more social mining methods viaWebUIs; in some cases it may also offer social miningdatasets.

A lightweight integration is achieved by explicitly registering/publishing application information on theSoBigDatacatalogue.EachapplicationisdescribedinthecataloguebyarecordenablingitsdiscoveryandincludingaURLtothewebapplication.

A mild integration consists in integrating the WebUIs as a portlet in the SoBigData VREs, leaving theapplicationrunningontheremoteserver,equippedwithSmartGears(seeWikipage)toaccountaccesstomethodsfromSoBigDatausers.Toachieveadeeper integration, theactioncan includethe integrationoftheapplicationwiththeVREworkspace,allowingscientiststoprovideinputtotheapplicationdirectlyfromthelocalworkspaceandexpecttheresultsoftheapplicationtobestoredinthelocalworkspace.

Afullintegrationconsistsinsteadin:

● Reconsideringapplication’sarchitectureinordertoextrapolatethemethodsandthedatasets;● Makemethodscompliantwiththeguidelinesofthehostingplatformaccordingtothemethodology

described inadedicatedWikipage.Thisactivitymight requiresomemodification/adaptationofthemethod implementation, e.g. for input parameters specification. The cost of this adaptationdependsonthecomplexityofthemethod;

● A SoBigData: step-by-step procedure for algorithm integration (including a couple of Java simpleclasses)resultingfromaconcreteintegrationexerciseisavailable;

● Publishthealgorithmthroughtheplatform.IncasethemethodisimplementedwithaRscript,theplatformisprovidedwithafacilitysupportingthispublishingphase;

● Transformthedatasetsinpublishableassetsandpublishthemaspreviouslydescribed;Once“fully integrated”, theapplicationactuallybecomesasetofSoBigDatasocialminingassets thatwillbenefitfrom:

● ScalabilityWillbenefitfromadistributedandscalablecomputingplatform;● Repurposing Can be exploited in the context of many virtual research environments and it is

suitableforbeingrepurposed/appliedtodatasets;● Standard accessibilityWill be automaticallymade available via aweb-basedGUI aswell aswith

web-basedprotocols(SOAPandRest);● AccountingAremonitored and assessed by SoBigData tools, e.g. detailed statistics on usage are

transparentlycollected.If“mildlyintegrated”theapplicationwillonlybenefitfromrepurposingandaccountingbenefits.

3.2 METHODS

Amethod intheSoBigData infrastructure isapieceofcode inJava,PythonorRthat implementsasocialminingalgorithm/procedure.

Page 11: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page11of23

Besidesalightweightintegration,thatcanbeachievedbyexplicitlyregistering/publishingmethodsoftwareinformationviatheSoBigDatainfrastructureresourcecatalogue(whereeachmethodsoftwareisdescribedinthecataloguebyarecordenablingitsdiscoveryandusageandincludingaURLtothesoftware),duringthelastperiodefforthasbeenspenttofacilitatetheimportingprocessofnewmethodsintheSoBigDatainfrastructure.

AsaresultofthiseffortanewWebapplicationnamedSAI(StatisticalAlgorithmsImporter)hasbeenmadeavailable in the SoBigData infrastructure. SAI complements the full integration pattern (achieved by“wrapping” (implement the relative APIs) themethod as aWPSmethod) by providing scientistswith aninterfacethatallowstoeasilyandquicklyimportJava,PythonorRscriptsontothegeneral-purposegCube-baseddataanalyticsengine.Additionally,itallowsscientiststoupdatetheirscriptswithoutfollowinglongsoftwarere-deployingprocedurespreviouslynecessary.

The SAI Web application interface (Figure 1) resembles the R Studio environment, a popular IDE for Rscripts,inordertomakeitfriendlytoscriptproviders.

Figure1.SAIInterfacetoimportnewmethodsingCube-baseddataanalyticsengine.

TheProjectbuttonallowscreating,openingandsavingaworkingsession.Ascientistuploadsasetoffilesanddataontheworkspacearea(lower-rightpanel).Asanextstep,theuserindicatesthe“mainscript”,i.e.thescriptthatwillbeexecutedongCube-baseddataanalyticsengineandthatwillusetheotherscriptsandfiles.After selecting themain script, the left-sideeditorpanel visualises itwithR syntaxhighlightingandallows modifying it. Subsequently, the user indicates the input and output of the script by highlightingvariable definitions in the script and pressing the +Input (or +Output) button: behind the scenes theapplication parses the script strings and guesses the name, description, default value and type of the

Page 12: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page12of23

variable.Thisinformationisvisualisedinthetop-rightsideInput/Outputpanel,wheretheusercanmodifytheguessedinformation.Asanotheroption,SAIcanautomaticallycompilethesameinformationbasedonWPS4R1annotations in thescript.Other tabs in this interfacearea let settingglobalvariablesandaddingmetadatatotheprocess.Specifically,theInterpretertaballowsindicatingtheRinterpreterversionandthepackages required by the script and the Info tab allows declaring the name of the algorithm and itsdescriptionwhileintheInfotabtheusercansetuptheVREinwhichthemethodshouldbeavailable.

Oncethemetadataandthevariablesinformationhavebeencompiled,theusercancreateagCube-baseddataanalyticsengineas-a-ServiceversionofthescriptbypressingtheCreatebuttonintheSoftwarepanel.The term “software”, in this casemeans a Java program that implements an as-a-Service version of theuser-providedscripts.TheJavasoftwarecontainsinstructionstoautomaticallydownloadthescriptsandtheother required resourceson theserver thatwillexecute it, configure theenvironment,execute themainscriptandreturntheresulttotheuser.ThecomputationsaremanagedbythegCube-baseddataanalyticsenginecomputingplatformthatguaranteestheprogramhasoneinstanceforeachrequestanduser.Theserverswillmanageconcurrentrequestsfromseveralusersandexecutecodeinaclosedsandboxfolder,toavoiddamagecausedbymaliciouscode.

To recapitulate, theSAIWebapplicationenables Java,PythonorR scriptswithas-a-Service featuresandreducesintegrationtimewithrespecttodirectJavacodewritingsoastoeasethefullintegrationpatternwhichwasreportedtobeabutcomplexinthepreviousexperiences.

Once fully integrated, the application actually becomes a set of SoBigData socialmining assets that willbenefitfrom:

● ScalabilityWillbenefitfromadistributedandscalablecomputingplatform;● Repurposing Can be exploited in the context of many virtual research environments and it is

suitableforbeingrepurposed/appliedtodatasets;● Standard accessibilityWill be automaticallymade available via aweb-basedGUI aswell aswith

web-basedprotocols(SOAPandRest);● AccountingAremonitored and assessed by SoBigData tools, e.g. detailed statistics on usage are

transparentlycollected.

152North.WPS4R.https://wiki.52north.org/Geostatistics/WPS4R

Page 13: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page13of23

4 RESOURCEINTEGRATIONUSE-CASES

This section reports on the experiences of scientists, as described by the scientists themselves, inintegratingtheirapplications,methods,andservicesintheinfrastructure,includingthosecaseswheretheintegration is ongoing. Experiences are organized by typology of resources to be integrated and each ofthem reports on the effort required/envisaged, in terms of time and technical skills required for theintegration, describing the high-level procedure that was followed and the highlighting the gaps forfeedbackandfurtherimprovement.

4.1 APPLICATIONSINTEGRATIONEXPERIENCES

4.1.1 TAGME (MARCO CORNOLTI, DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OFPISA)

TagMeisanentitylinkingannotator,namelyasoftwarethat,givenatextualdocument, linksmentionsofentitiesfoundinthedocumenttowardsacatalogueofentitiesdrawnfromaknowledgebase.Thishastheimportant effect of building, on top of the document, a non-ambiguous representation of the topicsmentioned by it, with impact on NLP applications such as information extraction, question answering,documenttopicalclusteringandcategorization.

TagMeisimplementedinJava(henceitsexecutionrequiresaJavaVirtualMachine)anditsfunctionalitiesareofferedasawebserviceforeasyinteroperability.

The integration in the D4Science infrastructure required the deployment of an ad-hoc virtual machine,running a SmartGearsdistribution. SmartGears camewith rangeof additional services, including accountregistration, authentication and usage accounting with minimal effort. As a middleware to deploy theTagmewebapp,weemployedtheTomcatwebserver,thathadtobeconfiguredforthespecificapplicationinordertonotletrequestsoverloadtheservertherebyrenderingserviceunavailable.WealsohadtobuildaspecificTomcatValveinordertohavetheTagMewebappacceptUTF8-encodedtextasinput,sincethedefaultcharsetdidnotcover thewholeunicodespace.Wesetupawebpagepresentingtheservice, thedocumentation, and an online demo of the service available to unauthenticated users. We took theoccasionofthedeploymentofTagMetorebuild its indexesaccordingtothe latestversionsofWikipedia,andfixafewminorbugs.

TagMe was previously available in a different deployment, and users had to be migrated to the newendpointandauthenticationmechanism.Wedecidedtomigrateusersin"waves",inordertoprogressivelysolvepotentialproblems.ThemigrationwasconcludedinSeptember2016,whentheformerdeploymentwasshutdown.

Deploying the service on a specific D4Science VRE had an important, unexpected advantage: the VREmessage board became a forum for TagMe users and amean of cooperation for research and problemsolving,andanimportantchannelofcommunicationbetweenusersandTagMeadministrators/developers.ThedeploymentofaprototypeofTagme,withtheonlyaimoftesting,tooktwoweeksandwasreadyformid-May2016.UntilJune,wefixedissues(oftenreportedbyusers)withthewebinterfaceandtheserverconfiguration. The service was launched on mid-July, when the first wave of 1% of active users was

Page 14: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page14of23

migratedtothenewplatform,andtheregistrationtotheplatformopenedtothepublic.Theservicelaunchwas finalized at the end of August, when all users weremigrated to the new platform and the formerserviceshutdown.

Intotal,thedeploymentofTagmeroughlytookfourmonths.

TheTagMeVREisaccessiblehere.

4.1.2 SMAPH (MARCO CORNOLTI, DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OFPISA)

SMAPH (paper) is anentity linking annotatorbuilt specifically for queries, a verypeculiar typeof textualdocuments,sincetheyusuallyprovidelimitedcontext,nogrammaticality,andfeaturetypos.SMAPHisthestateoftheartforentitylinkinginqueries,andobtainedthehighestscoreintheERDChallenge2014.Itisbuiltontopofsearchengines(currentlyGoogleSearch),andusestheiroutputtoannotateaquery.Italsomakesuseofstatisticalmachinelearningtomakeitsdecisions. Its impactonqueries issimilartotheoneTagMehas on longer text, in that the semantic layer it builds on top of queriesmakes it possible to godeeper(withrespecttosyntactictextanalysis)intotheuserneedbehindaquery.

From the architectural point of view, Smaph is similar to TagMe in certain aspects: it is a Java softwareavailable through aweb service. Similarly to TagMe, it has been deployed on an ad-hoc virtualmachinerunningaSmartGearsdistribution.ThemaindifferencewithrespecttoTagMeisthatitneedsaccesstotheGoogle CustomSearchAPI, henceusers have to provide aGoogle authentication token to SMAPHwhenissuingadisambiguationrequest.

Thedeployment of the SMAPHweb servicewas easier than that of TagMe, becausenousers had to bemigrated,sincethiswasthefirsttimeiswasdeployed.Thedeploymentprocessstartedinmid-Novemberandtheservicewasreleasedinmid-December,henceittookonemonth.

TheSMAPHVREisaccessiblehere.

4.1.3 TWITTERMONITOR(STEFANOCRESCIANDSALVATOREMINUTOLI,IIT-CNR)

The Twitter Monitor is an interactiveWeb application designed to access and to collect data from theTwitter stream, by exploiting the public Twitter Streaming APIs. The application is able to manageconcurrentmonitors:itispossibletolaunchparallellisteningsessions(i.e.,morethanoneTwittercrawleratthesametime)usingdifferentparametersandcollectingdifferentsetsofdata.Inadditiontoofferinganinteractive Web interface in order to ease all the operations related to Twitter crawling, the TwitterMonitor also offers a set of functionalities aimed atminimizing the loss of data due to network or localmachineproblems.TheTwitterMonitor isautomaticallycapableofdetectingandrecoveringfromsimpleerror situations, such as a closed or disconnected Twitter stream. It is also capable of detecting moreserious issues, such as Twitter refusing to open new streaming connections, and automatically sendstargetedalertstosystemadministrators.

TheintegrationactivitiesrelatedtotheTwitterMonitorhavebeentwofold.Ontheonehand,wedecidedtotakeacourseofactionsoastoprovideafirst,loosely-integrated,versionoftheTwitterMonitorasfastaspossible.Ontheotherhand,wealsoimmediatelystartedadeeper,yetlonger,integrationprocedure.

Page 15: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page15of23

The fast-lightweight integration allowed us to register the Twitter Monitor service in the SoBigDatacatalogue2andtoincludeanavigationtabinsidetheSoBigDataVREpointingtoaniframethatcontainsthefirstversionoftheSoBigDataTwitterMonitor3.TheSoBigDataTwitterMonitorrenderedinsidetheiframeisaslightlymodifiedinstanceofthestandalonemonolithicTwitterMonitorthathasbeendevelopedbyIIT-CNR. It is a single PHP application that runs on an IIT-CNR virtual machine. Given this lightweightintegration,theTwittermonitorisabletoprovideitsservicestotheSoBigDatausers,butitcannotscaletoserve a large number of requests. The implementation effort required to modify the original TwitterMonitorsothatitcouldbeusedfrominsidetheSoBigDataVREwasabout2weeksofwork.Theresultofthis integration is visible in Figure2andFigure3andhasbeenextensivelydescribed inDeliverableD8.2CrowdsensingPlatform.

Figure2.TwitterMonitorapplicationrenderedinsideadedicatednavigationtabfromtheSoBigDataVRE.

2https://sobigdata.d4science.org/group/resourcecatalogue/data-catalogue

3https://sobigdata.d4science.org/group/sobigdata.eu/twitter-monitor

Page 16: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page16of23

Figure3.TwitterMonitordialogwindowtoallowparameterconfigurationofanewcrawler.

While completing the lightweight integration,we also started the design and implementation of the fullintegration.Atfirst,wehadtoreengineerandrefactortheoriginalmonolithicTwitterMonitorapplication.Specifically,we split thewhole application into 3 distinct logicalmodules, so as to allow a dynamic andscalable deployment of needed modules to the SoBigData eInfrastructure nodes. Then, we had toreimplementinJavathefunctionalitiesoftheoriginalPHPmodules.

Thenew,fully-integrated,TwitterMonitoriscomposedofthreemodules:theScheduler,theCronandtheCrawler. These modules were originally implemented in PHP. In order to integrate their functionalitiesinside the SoBigData infrastructure the Scheduler and the Cron have been replaced by equivalent Javamodules.FortheCrawler,dueto itshighercomplexity,wedecidedto implementaSmartExecutorpluginthatwrapstheoriginalPHPmodulebyrunningitasaprocess.TheSchedulerprovidesaGUItolettheuserspecifysomeinputparametersneededtofiltertheTwittereventstogather.IthasbeenimplementedasaStandardLocalExternalAlgorithmsubclass.TheCronrunsatfixedintervalswithoutuserinteractionandhasbeen implemented as a SmartExecutor plugin. The Crawler is run/stopped by the Cron and has beenimplementedasaSmartExecutorplugin.Themodulesinteractwitheachotherbymeansofadatabasethatcontainsthelistoftaskstorunandtheirstatus.Asafirststep,wedecidedtosetupaSmartGearsnodetohosttheSmartExecutorplugins:thishelpedmanagingtheplugindeployingmoreeasily,without involvingother systemadministrators. In the samevirtualmachinehosting theSmartGears nodewedeployed thepostgres database, needed by the plugins, and registered it within the SoBigData infrastructure. TheSmartGears installation simply required running a setup script. Thenafter few configuration steps itwasready to run. We also installed the SmartExecutor service to make the plugins accessible within theeInfrastructure.We implemented themodules by using some templates available in the documentation.Somesupportfromdevelopershelpedusinspeedinguptheimplementation.

Page 17: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page17of23

The implementation of modules required a review of the functionalities of the infrastructure and, inparticular,itsinteractionwiththemodules(i.e.theplugins).Themainconcernswereaboutdiscoveringthereference for the database, interacting with the user, storing data in the user workspace, finding andrunningotherplugins.Duringthisactivity, theD4Sciencedeveloperteamsupportedusbyprovidingcodesnippetsandquicklinkstotheproperdocumentation.

4.1.4 WAT(MARCOPONZA,UNIPI)

WATdeprecatesTagME:ithassimilarruntimeperformancebuttheresultismoreaccurate(Seethepaperfor details), though it can currently process English and well-formed (news, Web pages and RSS feeds)documentsonly.WATisalsosignificantlyfasterthanTagMeforlongdocument,whichmakeittheperfectchoiceforanannotationatWeb-scaleoflargedocumentcollections.

TheintegrationwiththeD4Scienceinfrastructureispartiallyachieved.Sofaritrequiredthedeploymentofanad-hocvirtualmachine,runningaSmartGearsdistribution.AsamiddlewaretodeploytheWATwebapp,we employed the Tomcat web server that, at the time being, has yet to be configured for the WATapplication inorderfullyexploittheadditionalservicesSmartGearsprovideswithminimaleffort(accountregistration,authenticationandusageaccounting).

WAT’s API is very similar to TagMe, thus ease a possible migration from the older entity linker. ThedocumentationofWAT’spublicAPIisprovidedathttps://services.d4science.org/web/tagme/wat-api

4.1.5 SWAT(MARCOPONZA,UNIPI)

SWATisanentity-saliencesystemwhichidentifieson-the-flythesemanticfocusofadocument,expressedby its Salient Wikipedia Entities. The core of this technology is based on a broad set of syntactic andsemantic features, extracted from the inputdocumentand later fed toa classifier, previously trainedonmillionsoftrainingexamplesextractedfromtheNewYorkTimesAnnotatedCorpus.

ThecoreofSWATiswritteninPython,butitfunctionalitiesareactuallybuiltonthetopofothersoftwaretoolsthatneedtoberunontheJVM,i.e.,CoreNLPandWAT.Accordingly,aprivateVRE,fromwhichonlySWAT's VRE can access, running a standalone CoreNLP web service has been also deployed. Thecommunication between SWAT andCoreNLP/WAT it is performed via RESTful API. The same interface isalsoprovidedbySWATtoexternalwebcallsinordertoeaseitsuseinothersoftwaretools.

AnexperimentalGUIofthesystemisavailableathttps://swat.d4science.org/andthedocumentationofitspublic API is available at https://services.d4science.org/web/tagme/swat-api. The integration of SWAT intheSoBigDatainfrastructureisalmostcompletewithAccountingeAuthZfunctionalitiesthatareplannedtobeintegratedintheimmediatefuturewithintheSWATsoftwarearchitecture.

4.2 METHODSINTEGRATIONEXPERIENCES

4.2.1 TRAJECTORYBUILDER(ROBERTOTRASARTI,ISTI-CNR)

TrajectoryBuilder(https://ckan-sobigdata.d4science.org/dataset/trajectory_builder)isabasicalgorithmoftheM-Atlaspackage(http://m-atlas.eu/). It is implemented in Javaand interactwithaPostgresdatabase

Page 18: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page18of23

(with Postgis extension for spatial primitives) in order to build a trajectory from raw spatio-temporalobservations,i.e.pointsinspaceandtimerepresentedbyatriple<latitude,longitude,timestamp>.Hencethe dependencies of this algorithm are: the M-Atlas cose (matlas.jar) and the connector drivers forpostgresql(postgresql-8.4-701.jdbc4.jar).

InordertobeintegratedalauncherclassisimplementedasasubclassofStandardLocalExternalAlgorithm(https://sobigdata.d4science.org/group/sobigdatalab/importer-documentation).TheMainproblemduringtheintegrationwastheconfigurationoftheprojectintoMavenforadeployinsidetheSoBigDataplatformdue thedependencies listedabove.Thanks to theplatformadministratorhelp theproblemsweresolvedand thedeploywasdone. The timepassed from thebeginningof the integrationand thepublishof thealgorithmintheVRELabis1month,buttheactualworkinimplementingthetoolwasonlyfewhours,therest of the time was spent interacting (1-2 times each week) with the platform administrators anddevelopersinordertofindtherightsolutionfortheintegrationandhowtoputthedependenciesinit.ThedifficultiesoftheprocessofintegrationwiththeSoBigDataplatformengineare:

● Thedocumentationissowidethatisdifficulttounderstandwhattodoandwhy.● ItisrequiredknowtoolssuchasMavenandhowtoconfigureittobecompliantwiththeplatform.● The interactionwith theplatformdeveloper sometimeswasdifficult trough the ticketing system,

facetofacemeetingwerefundamentaltosolvetheproblems.Anywayseveraloftheproblemssolvedduringthisfirstintegrationwillhelptobemoreindependentintheintegrationofnextalgorithm.

4.2.2 QUICKRANK(CRISTINAMUNTEAN,ISTI-CNR)

QuickRank4isanefficientLearning-to-RanktoolkitprovidingseveralC++implementationofLtRalgorithms.The LtR algorithms currently implemented are: GBRT, LamdaMART, Oblivious GBRT / LamdaMART,CoordinateAscent,LineSearch,RankBoost.

WemadeQuickRankavailable toSoBigDataby integrating it in theSoBigDataLabVREasaWPSmethodexecutable from VRE Method Engine5. The information about the integrated tool can be found in theSoBigDataCatalogueatthefollowinglink:https://ckan-sobigdata.d4science.org/dataset/quickrank.

QuickRank users can train and testmodelswith the help of a Command Line Interface (CLI). In order tointegrateour tool in theSoBigDataVREwehad tocreateaWPSwrapperaround it,allowing theuser tointeractwithitinamoreuser-friendlyway,throughthehelpofanUserInterface(UI).He/shecaninsertthefilesandoptionsasheinvokesthetrainingortestingcapabilitiesontheexistingLtRalgorithmsashewoulddowithaCLI.Figure4andFigure5showtheinterfaceforthetrainingandtesttools/interfacescreated.

4http://quickrank.isti.cnr.it/

5https://services.d4science.org/group/sobigdatalab/method-engine

Page 19: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page19of23

Figure4.QuickTesttraininguserinterface

Page 20: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page20of23

Figure5.QuickRanktestuserinterface

The QuickRank WPS interfaces were implemented in Java, by extending theStandardLocalExternalAlgorithm6class,putatourdisposalbytheteamresponsiblewiththeintegration.Inorder for the QuickRank wrapper to work, several libraries needed to be installed on the VRE servers.QuickRankneedsgcc4.9(orabove),CMake2.8(orabove)andgit.TheimplementationeffortrequiredtocreateawrapperaroundaC++executable,withthehelpoftheProcessBuilderclass ,whichallowsustorun the QuickRank executable from the Java class, while also passing all the necessary parameters asprovidedininputbytheuserintheUI.

Thewholeintegrationeffortlastedseveralmonths.Thetaskwasstartedonthe15thofFebruary2016,andwas the first tool to be integrated in the SoBigData VRE. Initially, we developed a test wrapper on thedevelopment servers, thisphaseendingon the22ndofMarch. Later,on the16thofAugust, the secondphasestarted,whentheactualwrapperswereinsertedanddeployedintotheproductionenvironmentintheSoBigDataVRE.Theintegrationeffortendedonlthe30thofSeptember.

Weintegratedthetwomainmodes(3wrappersintotal)inwhichQuickRankcanbeused:

● Trainingwithvalidationset● Trainingwithoutvalidationset● Test

Theresults/outputsoftheQRcomputationcanbeseenintheOutputDatasetssection(Figure6)

6Asdescribedin:https://wiki.gcube-system.org/gcube/How-to_Implement_Algorithms_for_the_Statistical_Manager

Page 21: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page21of23

Figure6.QuickRankOutputfile

The easiest partwaswriting thewrapper in Java, namely extending the StandardLocalExternalAlgorithmclass.Withthehelpoftheintegrationteamwewereabletoovercomesomeoftheissues(e.g.identifytheproperdatatypesinthegCubewiki)createtheproperinterfaces,fixalltheproblematicissuesanddeploytheprojectlocally.InordertobeputinproductionQuickRankneedtohavethedependenciesinstalledandthesourcecompiled.For thisweopeneda ticket,which tookaround2weeks tobe resolved.Soonafterthatwesucceededinfinalizingtheintegrationandtestthealgorithmsalsoontheproductionserver.

4.2.3 GATECLOUD(KALINABONTCHEVA,UNIVERSITYOFSHEFFIELD)

SoBigDatauserswillaccesstheGATECloudserviceindirectlythroughtheD4Scienceplatform.GATECloudis today deployed at Sheffield and exposes calls to methods as REST services. Such methods will beintegratedasWPSmethodcalls fromthee-infrastructureenvironments.Morespecifically,SoBigDatawillmediateend-user requests (invocationsofaservice) toGATECloudbyusingaspecialAPIkey,whichwillbypasstheaccountinginherentinGATECloudandusethatprovidedbytheVREuniformlyinstead.

Around23methodsavailableasGateCloudRESTserviceswillbeexposedthroughtheSoBigDataVRE,eachwith theirownendpoint. TheAPI foreachwill be identical, acceptingas inputdocuments inXML, JSON,plaintextorHTMLformat,producingoutputasJSON,HTML,XML.

Thesemethodswill be listed in VRE catalogue and bemade accessible via aMethod Engine application.Eventually, theVREcatalogueand theGATECloudcataloguewillbe synchronised, so thatVREusers candiscovermethods fromGATEClouddirectly.Thiswillbesupportedby theHTTPAPIcurrentlypartof theGATECloudservice.TheseservicesmaybegroupedintoasingleDataMinerinstanceforeaseofdiscovery.

In the long term, GATE Cloud will be adapted for deployment on the SoBigData hardware premises,allowingcost-freeaccesstoGATECloudservicesbyresearcherswithinthescopeofVREs.

Page 22: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page22of23

4.2.4 STATVAL(DANIELEREGOLI,SNS)

StatValisanalgorithmperformingastatisticalvalidationfilterforcomplexnetworks.

ThepurposeofStatValistakinga(possiblylargeanddense)complexnetworkandfindthelinksthatarenotexplainablebysimplerandomwiringofthenodesgiventhedegree/strengthofnodes(namely,givenhowmany linkseachnodehasattached).Theresult isanewnetworkwithonly the linksthatareunexpectedwithrespecttothesaidrandommodel.Thus,youcomeupwithanoutputnetworkmuchsparserthantheoriginal,thatcanbeconsideredastatisticallysoundrepresentationofthenetworkbackbonestructure.

StatVal is essentialy a collection of R scripts. The integration has beenperformed through theStatisticalAlgorithmImporter(SAI)availableintheD4Scienceplatform,whichsupportstoolsfortheeasyintegrationofRscriptmethods insideaVRE.StatVal isnowavailable foruse in theSoBigDataLabenvironmentviaaMethodEngineapplication(seeFigure7).Youcanfinditontheresourcecatalogue.

Figure7.WebinterfaceoftheStatValmethod

Theuserhassimplytoloadanetworkintheformofacsvedgelistandtunesomeparameterseitherrelatedto the input network or to the confidence level of the statistical tests and run themethod. Output is acompressedarchivecontainingedgelistsof(different)filterednetworks.

UsageofSAI forR scripts integration isquite straightforward.Firstweneeded tocreateaproject folder,thenwesimplyhadtoloadinthefolderthenecessaryRscripts.AmainRscript,whichisgoingtomanagetheworkflowofthealgorithm,istobegiventotheSAI.Finally,wedeclaredtheinputsandoutputsofthealgorithmandwhatRpackagesthealgorithmrequires.

Somemoreeffortwasneededinthetestingpart,namelywhencheckingwhetherthemethodwasworkingproperly,andinthemaintenance,namelyindoingchangesandmodificationstothealgorithms,mainlydueto somemisunderstandings and some problems in using test data stored inside the same folder of thesoftwareproject.ButwiththehelpofD4Scienceadministratorsallproblemsweresolved.

Totaleffort, countingboth the first integrationandsuccessivemodifications,was roughlyaroundaweekfulltime.

Page 23: SoBigData – 654024  - CNR

SoBigData–654024 www.sobigdata.eu

D10.9ResourceAdaptationtoregistertothee-infrastructure2 Page23of23

4.2.5 OPTICSCLUSTERING(ROBERTOTRASARTI,CNR)

Optics Clustering is one of the tools provided byM-Atlas7 and is implemented in Java to interactwith aPostgresdatabase(withPostgisextensionforspatialprimitives)togroupsimilartrajectoriesusingthewellknownOpticsalgorithm8.Theimplementationincludesasetofsimilarityfunctionsspecializedintrajectorydata and mobility applications. The dependencies are the M-Atlas code (matlas.jar) and the connectordrivers for postgresql (postgresql-8.4-701.jdbc4.jar). In order to be integrated a launcher class isimplemented as a subclass of StandardLocalExternalAlgorithm (https://sobigdata.d4science.org/group/sobigdatalab/importer-documentation). The implementation followed was very easy thanks to theexperiencewiththepreviousintegration(i.e.TrajectoryBuilder).

Figure8.WebinterfaceoftheOpticsClusteringmethod

TheconfigurationoftheprojectintoMavenforthedeploymentwasthemajorobstacleduethelackoftrueunderstandingofsomemandatorystepswhichcanbedoneonlybyfollowingtheexpertguidance.Thetimepassed from the beginning of the integration and the publishing of the algorithm in the VRE Labwas 2weeks.

7M-Atlasisamobilityqueryinganddataminingsystemcenteredontotheconceptofspatio-temporaldata.Besidesthemechanismsforstoringandqueryingtrajectorydata,M-Atlashasmechanismsforminingtrajectorypatternsandmodelsthat,inturn,canbestoredandqueried.8 Ankerst,Mihael&M. Breunig,Markus& Kriegel, Hans-Peter& Sander, Joerg. (1999).OPTICS:Ordering Points toIdentifytheClusteringStructure.SigmodRecord.28.49-60.10.1145/304182.304187.