

A Historical Account of Apache Flink™: Its Origins, Growing Community, and Global Impact

by Juan Soto, Technische Universität Berlin
May 31, 2016

About Apache Flink

The official Apache Flink project [R1] describes Flink as follows: “Apache Flink is an open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.” The platform offers software developers a variety of application-programming interfaces (APIs) to create new applications that are executed on the Flink engine. Examples of these APIs include an API for unbounded (data) streams (DataStream API), a complex event-processing library (CEP), a machine-learning library, and a graph-processing library (Gelly), among others. As reported in [R2], some of Flink’s distinguishing technological capabilities include “event time handling, state & fault tolerance, low latency processing, and high throughput.”

Figure 1. An overview of Apache Flink components

According to an Apache Flink Wikipedia article [R3], “Apache Flink is a community-driven open source framework for distributed big data analytics. … [that] aims to bridge the gap between MapReduce-like systems and shared-nothing parallel database systems. … Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively. Flink programs … are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment.”


The Origins of Apache Flink and Support Over the Years

The origins of Apache Flink can be traced back to June 2008, when Prof. Volker Markl founded the Database Systems and Information Management (DIMA) Group at the Technische Universität (TU) Berlin. Soon after his arrival, he laid out the vision for a massively parallel data processing system based on post-relational user-defined functions, combining database and distributed systems concepts, with the goal of enabling modern data analysis and machine learning for big data.

Prof. Markl’s PhD students Stephan Ewen and Fabian Hüske built the very first prototype and shortly thereafter teamed up with Daniel Warneke, a PhD student in Prof. Odej Kao’s Complex and Distributed IT Systems (CIT) Group at TU Berlin. Soon after, Prof. Markl and Prof. Kao sought to collaborate with additional systems researchers in the greater Berlin area in order to extend, harden, and validate their initial prototype.

In 2009, Prof. Markl and Prof. Kao, jointly with researchers from Humboldt University (HU) of Berlin and the Hasso Plattner Institute (HPI) in Potsdam, co-wrote a DFG (German Research Foundation) research unit proposal entitled “Stratosphere – Information Management on the Cloud” [R4], which was funded in 2010. This initial DFG grant (spanning 2010-2012) extended the original vision to develop a novel, database-inspired approach to analyze, aggregate, and query very large collections of either textual or (semi-)structured data on a virtualized, massively parallel cluster architecture.

The follow-on DFG proposal, entitled “Stratosphere II: Advanced Analytics for Big Data,” was also jointly co-written by researchers at TU Berlin, HU Berlin, and HPI and was funded in 2012. This second DFG grant (spanning 2012-2015) shifted the focus towards the processing of complex data analysis programs with low latency. These early initiatives, coupled with grants from the EU FP7 and Horizon 2020 Programmes, EIT Digital, German Federal Ministries (BMBF and BMWi), and industrial grants from IBM, HP, and Deutsche Telekom, among others, provided the financial resources necessary to lay the initial foundation.

Funding certainly plays a critical role; however, success could only be achieved with the support of numerous collaborators, including members at DFKI (The German Research Centre for Artificial Intelligence), SICS (The Swedish Institute of Computer Science), and SZTAKI (The Hungarian Academy of Sciences), among many others who believed in our vision, contributed, and provided support over the years. In addition, the contributions from numerous PhD and Master’s students and Postdoctoral Researcher Dr. Kostas Tzoumas paved the way for what is today Apache Flink, a Stratosphere fork that became an Apache Incubator Project in March 2014 and then went on to become an Apache Top-Level Project in December 2014.

In late 2014, Kostas Tzoumas and Stephan Ewen, along with many of the original creators of the Apache Flink project, founded data Artisans, a company focused on making Flink the next-generation open source platform for programming data-intensive applications. data Artisans started with a seed financing round of 1 million euros from b-to-v Partners in summer 2014, and raised a Series A round of 5.5 million euros led by Intel Capital with participation from b-to-v Partners and Tengelmann Ventures in April 2016. Since the company was founded, many team members (flink.apache.org/community.html#people) from data Artisans have been active contributors to Apache Flink.


Collectively, these efforts showcase the path from a research idea to an open source software system that is in use across many companies, software projects, universities, and research institutions worldwide. Apache Flink is today one of the most active open source projects in the Apache Software Foundation, with users in academia and industry, as well as contributors and communities all around the world.

The Apache Flink Community

Since 2014, the Apache Flink Community has steadily continued to grow worldwide. As of May 31, 2016 there are 186 contributors (as reflected in GitHub, github.com/apache/flink), 33 Meetups worldwide (meetup.com/topics/apache-flink/), over 6300 Apache Flink Meetup members, and almost 4800 Meetup members in big data related groups, where Apache Flink is also a topic of interest. These statistics are depicted in Figures 2-4 and Listings 1 and 2, accordingly.

Figure 2. European Meetups strictly focused on Flink with over 2500 members


Figure 3. Meetups worldwide that involve Flink with over 11000 members

Figure 4. Illustration depicting the distribution of Apache Flink members by country


Worldwide Distribution of Flink Meetup Groups and Attendees (as of 31.5.16)

Strictly Flink Meetups: 6326 Members, 21 Meetups, 20 Cities, and 12 Countries

North American Meetup Groups (3234)    Number of Members
1. DFW Area Apache Flink Meetup (Dallas)    236
2. Apache Flink Phoenix    28
3. New York City Apache Flink Meetup    936
4. Chicago Apache Flink Meetup    621
5. Seattle Apache Flink Meetup    57
6. Bay Area Apache Flink Meetup (Mountain View, CA)    789
7. Washington DC Area Apache Flink Meetup (McLean, VA)    467
8. Meetup de Apache Flink en Ciudad de México    50
9. New York (City Area) Apache Flink Meetup    50

European Meetup Groups (2578)
1. Apache Flink Berlin Meetup (Germany)    758
2. Apache Flink en Madrid (Spain)    384
3. StreamProcessing.be (Brussels, Belgium)    279
4. Apache Flink Meetup Munich (Germany)    98
5. Istanbul Apache Flink Meetup (Turkey)    56
6. Apache Flink Stockholm (Sweden)    313
7. Apache Flink London Meetup (UK)    190
8. Paris Apache Flink Meetup (France)    500

Asian Meetup Groups (281)
1. Apache Flink (Delhi, India)    37
2. Apache Flink Taiwan User Group (Taipei, Taiwan)    142
3. Bengaluru Apache Flink Meetup (Bangalore, India)    102

South American Meetup Groups (233)
1. Brazil - Sao Paulo Apache Flink Meetup    233

Listing 1. Apache Flink meetup groups worldwide with 6326 members total as of 31.5.16

North American Meetup Groups (3908)    Number of Members
1. SF Spark and Friends (San Francisco, CA)    1580
2. NJ Data Science – Apache Spark    486
3. Apache Kafka DC (Washington DC)    185
4. Austin Apache Kafka Meetup – Stream Data Platform    275
5. Portland Spark User Group (Oregon)    110
6. Big Data and Cloud Meetup (Fremont, California)    396
7. NYC BIG DATA VISIONARIES (NYC, New York)    600
8. Chicago Area Kafka Enthusiasts (Chicago, Illinois)    276

European Meetup Groups (272)
1. Data Ganesha (Madrid, Spain)    17
2. Data Driven Meetup (Powered by DCMN) (Berlin)    171
3. Paris Apache Beam (France)    84

Asian Meetup Groups (591)
1. Igniting Spark: Practical Use-cases (Bangalore, India)    132
2. Shanghai Big Data Streaming Meetup (China)    459


Listing 2. Global Meetups that include Apache Flink with 4771 members as of 31.5.16

In addition, data Artisans runs several conferences for the Apache Flink community at large. The First Annual Flink Forward Conference (see Image 1 below) took place October 12-13, 2015 in Berlin, and the 2nd Annual Flink Forward Conference [R5] is scheduled to take place on September 12-14, 2016 in Berlin.

Image 1. The 1st Annual Flink Forward Conference (Oct. ‘15) held in Berlin.


Apache Flink Adoption

In just a few years, Apache Flink has burst onto the scene and numerous entities have opted to adopt Flink, some to investigate its capabilities, while others are already using it in production. A community-maintained list of companies, software projects, universities, and research institutes is shown in Listing 3, taken from the Flink wiki [R6]. Some of the more prominent adopters include ResearchGate, Capital One, and Zalando, among others. Collectively, they represent diverse market sectors, including telecommunications, finance, social games, the travel industry, application security, and e-commerce, among many others.

Listing 3. Companies, software projects, universities and research institutes using Flink.


Apache Flink: Success Stories

Apache Flink is in use at a variety of enterprises, including King (a mobile gaming company) and Zalando (a large e-commerce company).

King [R7] engineers recently posted an article [R8] concerning their RBEA (Rule Based Event Aggregator) platform for scalable real-time analytics. In the article, they said that “[They are] impressed that Apache Flink has reached to a point of maturity where it can serve the needs of such a complex application almost out-of-the box.”

Zalando [R9], a large e-commerce company, recently posted an article [R10] concerning which big data analytics solution to employ for their business needs. They required a near real-time business intelligence solution and needed to introduce the right stream-processing framework. In the end, they selected Flink for the following reasons:

• Flink processes event streams at high throughputs with consistently low latencies. It provides an efficient, easy to use, key/value based state.

• Flink is a true stream-processing framework. It processes events one at a time and each event has its own time window. Complex semantics can be easily implemented using Flink’s rich programming model. Reasoning on the event stream is easier than in the case of micro-batching. Stream imperfections like out-of-order events can be easily handled using the framework’s event time processing support.
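The event time handling mentioned in the second bullet can be illustrated with a small, framework-independent sketch. It is written in plain Python and does not use Flink's API. The function name `tumbling_event_time_windows`, the `(timestamp, key, value)` event layout, and the simple watermark rule (highest timestamp seen minus a fixed out-of-orderness bound) are all assumptions made for illustration; they only approximate the general idea of windowing by event time rather than arrival order.

```python
from collections import defaultdict


def tumbling_event_time_windows(events, window_size, max_out_of_orderness):
    """Sum values per key in tumbling windows based on event timestamps.

    events: iterable of (timestamp, key, value) tuples in ARRIVAL order,
            which may differ from timestamp order.
    A window [start, start + window_size) is emitted once the watermark
    (highest timestamp seen minus max_out_of_orderness) reaches its end.
    Events that arrive for an already emitted window are returned as late.
    """
    open_windows = defaultdict(list)  # (key, window_start) -> buffered values
    results, late = [], []
    max_ts = float("-inf")
    watermark = float("-inf")

    for ts, key, value in events:
        start = ts - ts % window_size  # window this event belongs to
        if start + window_size <= watermark:
            late.append((ts, key, value))  # window was already emitted
            continue
        open_windows[(key, start)].append(value)

        # Advance the watermark and emit every window it has passed.
        max_ts = max(max_ts, ts)
        watermark = max_ts - max_out_of_orderness
        ready = sorted(
            (w for w in open_windows if w[1] + window_size <= watermark),
            key=lambda w: (w[1], w[0]),
        )
        for k, s in ready:
            results.append((k, s, s + window_size, sum(open_windows.pop((k, s)))))

    # End of input: flush whatever is still buffered.
    for k, s in sorted(open_windows, key=lambda w: (w[1], w[0])):
        results.append((k, s, s + window_size, sum(open_windows[(k, s)])))
    return results, late


if __name__ == "__main__":
    # Hypothetical sample stream; timestamps 3 and 7 arrive out of order.
    sample_events = [(1, "a", 10), (4, "a", 20), (3, "a", 5),
                     (12, "a", 1), (7, "a", 2), (15, "a", 3), (2, "a", 100)]
    windows, late_events = tumbling_event_time_windows(sample_events, 5, 3)
    print(windows)
    print(late_events)
```

In this sketch the events with timestamps 3 and 7 arrive after later-stamped events yet are still counted in the windows their timestamps belong to, because the watermark has not passed those windows. The final event (timestamp 2) arrives after its window has been emitted and is reported as late. A micro-batching approach that groups events by arrival time would have no comparably direct way to express this.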

Zalando engineers went on to state, “Our team is currently working on implementing a solution for near real time business process monitoring with Flink. We are continuously learning from the Flink community and we’re looking forward to being an active part of it. The release of Flink 1.0.0 has only strengthened our efforts in pursuing this path.”

Furthermore, Google is collaborating with data Artisans on an Apache Incubator Project called Beam [R11], yet another testament to the value of Apache Flink.

In [R12], big data technology writer Alex Woodie wrote that “…already, people are looking beyond Spark and wondering what comes next. Apache Flink – the elegant framework coming out of Berlin and data Artisans that offers a single API for dealing with data at rest and data in motion – is turning heads as a better, faster Spark,” and that “Other influencers in the big data space see promise in Flink, particularly in combination with Apache Kafka to analyze fast-moving data the moment it arrives.”


Global Media: News Articles, Video Content, and Google Search

News Articles Concerning Apache Flink

According to Google News, almost 800 news articles have been published on the topic of “Apache Flink,” as reflected in Image 2. Between 1.1.2016 and 31.5.2016 alone, 101 news articles about Apache Flink were published, as depicted in Image 3. That is, 100 articles in result pages 1-10, and one additional article on the 11th page.

Image 2. A snapshot taken on 31.5.2016 that shows 790 hits for “Apache Flink.”

Image 3. Between 1.1-31.5.2016 there were 101 hits for Apache Flink related news

Apache Flink Related Videos

As of May 31, 2016, there were over 1100 videos about Apache Flink, which included 736 videos in the first five months of 2016 alone, according to YouTube (Worldwide) and depicted in Image 4.

Image 4. As of 31.5.2016 there were over 1100 hits for “Apache Flink” on YouTube

Google Search for Apache Flink

A Google search for “Apache Flink” yielded about 788,000 hits, as seen in Image 5.

Image 5. As of 31.5.2016 there were 788,000 hits for the search term “Apache Flink.”


Acknowledgments

Sincere thanks to Dr. Kostas Tzoumas and Dr. Stephan Ewen (from data Artisans) and Dr. Tilmann Rabl and Prof. Volker Markl (from TU Berlin), all of whom reviewed this piece and provided accurate information. Additional thanks to Sally Khudairi (from the Apache Software Foundation) for identifying several (typographical) errors and offering corrections.

Institutional/Group Websites and Contact Person

For more information about Technische Universität (TU) Berlin and the Database Systems and Information Management (DIMA) Group, feel free to visit http://www.tu-berlin.de and http://www.dima.tu-berlin.de, respectively. For general inquiries you can reach us at [email protected].

References

[R1] Apache Flink, http://flink.apache.org.
[R2] Unified Stream & Batch Processing with Apache Flink, youtu.be/8Uh3ycG3Wew.
[R3] Apache Flink Article, https://en.wikipedia.org/wiki/Apache_Flink.
[R4] Stratosphere, http://stratosphere.eu.
[R5] Flink Forward Conference, http://flink-forward.org.
[R6] Companies, Software Projects, Universities, and Research Institutes Using Flink, https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink.
[R7] King, http://www.king.com.
[R8] RBEA: A Scalable Real Time Analytics at King, http://data-artisans.com/rbea-scalable-real-time-analytics-at-king/.
[R9] Zalando, https://www.zalando.de.
[R10] Apache Showdown Flink vs. Spark, https://tech.zalando.de/blog/apache-showdown-flink-vs.-spark/.
[R11] Why Apache Beam?, http://data-artisans.com/why-apache-beam/.
[R12] Cutting on Random Digital Mutations and Peak Hadoop, Alex Woodie, datanami, http://www.datanami.com/2016/04/01/random-digital-mutations-peak-hadoop/.


General Notice Concerning Apache Flink and the Apache Software Foundation

Availability and Oversight

Apache Flink software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Flink, visit http://flink.apache.org/ and https://twitter.com/ApacheFlink.

About The Apache Software Foundation (ASF)

Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 550 individual Members and 5,300 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Cerner, Cloudera, Comcast, Confluent, Facebook, Google, Hortonworks, HP, Huawei, IBM, InMotion Hosting, iSigma, LeaseWeb, Microsoft, PhoenixNAP, Pivotal, Private Internet Access, Produban, Red Hat, Serenata Flowers, WANdisco, & Yahoo. For more information, visit http://www.apache.org/ or follow @TheASF on Twitter.

© The Apache Software Foundation. "Apache," "Apache Flink," "Flink," and their logos are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries. All other brands and trademarks are the property of their respective owners.