thursday, september 29, 2016 10:05 am platform...


Page 1: Thursday, September 29, 2016 10:05 AM Platform ...cflsmartroads.com/projects/design/tsp/Regional_integrated...ecosystem and tools that enable Hadoop management, existing customer base,

Platform Recommendation: Hortonworks or Cloudera

Managing the Hadoop layer of a Big Data platform can be a very time-consuming and complex endeavor. Fortunately, user-friendly platforms are available that have been proven in the market by many organizations in both the public and private sectors. They ease the management of Hadoop in a Big Data environment while remaining flexible and scalable as your data grows and evolves.

VHB facilitated meetings between District 5 and both Hortonworks and Cloudera. These meetings covered in depth the maturity of each company in the market, the Apache platform, the ecosystem and tools that enable Hadoop management, the existing customer base, industry partnerships, interaction with the industry community, and each company's technical support structure.

This document explores in depth two of the most widely accepted platforms, Cloudera and Hortonworks, and provides a recommendation on which platform is perceived to be the most suitable for the District Five Big Data needs.

Hortonworks

Hortonworks is a platform based on the standard Hadoop distribution. Hortonworks differs from its competition in that it delivers a completely open source approach to its platform. This means Hortonworks has built a system that is more flexible to use and maintain than a standard Hadoop installation. Hortonworks allows the flexibility to use any of the Hadoop-ready projects (libraries, network servers, xml, big data, web-framework, database and others) within an environment. Hortonworks maintains compatibility with the open source community by being an active participant and committer within it.

Hortonworks has adopted Apache NiFi. NiFi is a project that allows for reliable and secure transfer of data between systems. It also provides many tools that can enrich and prepare data before it is transferred across systems, for example seamless data format conversion and data parsing.
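To make the idea of a data format conversion concrete, the following minimal Python sketch parses CSV records and re-emits them as JSON. It does not use NiFi or any NiFi API; the function name `csv_to_json_lines` and the field names `detector_id` and `speed_mph` are hypothetical and chosen only for illustration.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text):
    """Parse CSV text that has a header row and return one JSON string per record.

    All values stay strings, exactly as they appear in the CSV input.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row, sort_keys=True) for row in reader]


if __name__ == "__main__":
    # Hypothetical sample input; not taken from any District data source.
    sample = "detector_id,speed_mph\nD-001,54\nD-002,61\n"
    for line in csv_to_json_lines(sample):
        print(line)
```

In a data flow tool, a step like this would typically sit between a source that produces delimited text and a destination that expects JSON.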

Hortonworks provides a visual command and control center that allows users to drag and drop processors into the Hortonworks platform and build custom workflows of data. This includes hooking into various data sources and setting up data prioritization.

Hortonworks also provides built-in support for Apache Kafka and Flume. Kafka and Flume are well-known projects that facilitate building workflows that can handle a large number of dynamic and consumer-side datasets and data updates. Kafka is designed for low latency and strong data durability.

Before providing an overview of the Hortonworks platform from the standpoint of data management, client profile and market buy-in, it is important to note that Hortonworks is the only platform available in the market that can run natively in a Windows environment, while Cloudera runs natively on Linux only. Cloudera can run on Windows within a virtualized environment, and such a setup is considered suitable for testing purposes. This and other pros and cons will also be outlined in the following section of this document.

Final Platform Recommendation (Cloudera vs. Hortonworks)
Thursday, September 29, 2016 10:05 AM

Big Data Environment Sandbox Page 1


Hortonworks Platform Components

Data Management: Hortonworks can utilize any of the Apache Hadoop projects through its open source methodology. For this reason, Hortonworks can consume any project designed around data management and consumption, such as Apache NiFi, Apache Kafka or Apache Flume. These projects help address data management needs such as security, workflows and model building.

Client Profile: Hortonworks has embraced the open source community by regularly building and submitting open source projects for community use, as well as integrating open source projects from the community into the Hortonworks core platform. It is important to note that the open source community that utilizes and builds around the Hortonworks platform is large and significant. This can be beneficial when identifying solutions and recommendations for issues, bugs or fixes that may arise in the use of Hadoop and the Hortonworks platform.

Market Buy-In: Hortonworks is one of the fastest growing packaged Hadoop distributions today. Due to its open source approach, many developers gravitate towards it. Many large customers today depend on Hortonworks as their platform of choice for their Big Data needs. These customers include T-Mobile, Progressive, Pandora, and the Mayo Clinic.

Hortonworks Pros:

Completely open source and can support virtually any other Hadoop project.

Quickly growing in support (3rd party and direct vendor support for open source projects).

Can take advantage of community innovations, including no-cost (open source) solutions.

Can build and modify the various processes and connections to build a workflow that fits the needs of the District.

Only vendor that supports the Windows platform natively.

Hortonworks Cons:

While quickly growing, it is not as popular as other Hadoop competitors.

Apache NiFi is not part of the subscription license and has to be licensed separately.

The Ambari Management tool (Hortonworks management dashboard) is not as robust as the Cloudera tool.

Due to its open sourced structure, solutions can take longer to be established. This is due to many approaches coming from the open source community on how to solve an issue.

*Please refer to the Platform Comparison document on the SharePoint site for more in-depth details.

Cloudera

Cloudera is a platform built upon the core Hadoop distribution. Cloudera does not fully support all of the open source Apache projects (libraries, network servers, xml, big data, web-framework, database and others) for Hadoop. Instead, Cloudera takes the top 20 Hadoop projects and integrates them into the core of its Big Data platform. This approach makes it easier to support the most popular and best-supported projects for Apache Hadoop, and makes it possible for Cloudera to build out-of-the-box data management tools around these top 20 supported projects. Cloudera has the largest number of customer installs in the industry to date and is the most popular of the packaged Hadoop platforms.

The company offers an easy-to-use interface in the form of a dashboard that allows control of many of the Cloudera components and the environment, including data sources, configurations and management of the nodes in a clustered environment. Cloudera is the only vendor that has regular patch releases, including updates to all of its platform components. Cloudera also has the largest committer base of all the packaged Hadoop platform vendors.

Before providing an overview of the Cloudera platform from the standpoint of data management, client profile and market buy-in, it is important to note that Cloudera does NOT support Windows natively. This and other pros and cons will also be outlined in the following section of this document.

Cloudera Platform Components

Data Management: Cloudera takes an approach that differs from many of its competitors. Cloudera supports and integrates the top 20 Apache projects into its core product. These projects are geared towards making data management tasks easier. Cloudera has a set of robust dashboards and navigator applications for setting up, managing and defining the various components within your Big Data environment. Tasks such as defining data sources, expanding the data cluster and securing the environment can be carried out within the Cloudera platform.

Client Profile: Cloudera offers an Enterprise Ready platform with many functionalities out of the box. These out-of-the-box solutions can ease the management of a Big Data environment because they offer a more direct and easy path for its setup, utilization and management. Cloudera takes a traditional software company approach with its end users, making direct support available through a central point of contact. Cloudera regularly releases patches that update its core platform and supported projects. Cloudera also benefits from having the most committers on staff. This creates an advantage: when an issue arises, solutions are built and tested by Cloudera staff before being released to clients. It is important to note that this differs from Hortonworks, which relies on both its development team and the open source community to develop solutions for specific issues before reaching out to the client base.

Market Buy-In: Cloudera is the largest and most popular packaged Hadoop distribution today. Cloudera currently has major partnerships with companies such as Microsoft, Red Hat, ESRI, Elastic Search, SAP and Teradata. When it comes to an install base, Cloudera has a portfolio of clients that includes Cisco, Samsung, Cerner and Box.com.

Cloudera Pros:

Largest install base and most popular Hadoop packaged platform vendor.

A more robust and easy-to-use management tool (that can handle data sources, project installs, configurations, and cluster management).

Package install that is “Enterprise Ready” with out-of-the-box templates and solutions for setup.

Supports the most popular and utilized Apache projects.

Regular release policy for patches and updates. (Largest number of committers)

Cloudera Cons:


Vendor lock-in. (Cloudera only supports certain Apache projects. Installed packages outside the pre-approved list won’t be supported.)

No native Windows support. (Cloudera can run on Windows only via a virtual machine (VM).)

*Please refer to the Platform Comparison document on the SharePoint site for more in-depth details.

Platform Recommendation

Comparing Hortonworks and Cloudera is not an easy task. Each platform offers advantages and disadvantages that can be a deciding factor in its selection as a solution for FDOT District 5.

The FDOT District Five Testing Data Fusion Center is in a unique position to develop its big data platform completely from scratch. Several considerations must be taken into account when selecting the most suitable big data platform and Hadoop systems. These include, but are not limited to, the existing hardware, virtualization, proprietary systems and tools such as Sunguide, and the operating system platform currently utilized by the District. These factors have been analyzed by the District and the project team throughout the evaluation of the Hadoop platforms.

The project team evaluated each platform's characteristics, pros and cons, as well as the individual vendor and community profiles. Additional considerations, such as the pool of readily supported third-party platforms including ESRI and Elastic Search, were also used in determining a recommendation. Based on existing conditions and the factors mentioned above, it is recommended that the District consider utilizing Cloudera as its core Hadoop platform for the Testing Data Fusion Center.

Even though Cloudera supports Windows only through virtualization and not natively, the platform offers many advantages that will benefit the District, including the best mix of ease of use and support available through an established single point of contact. While the Hortonworks open source approach provides several opportunities to take advantage of open source resources, such as a large community base and platform flexibility, the District historically has not utilized open source solutions. By having the largest community of committers, Cloudera is able to test solutions for platform-related technical challenges and offer a consolidated solution to its clients. This can ultimately make the best use of dollar investments by avoiding the trial-and-error cycle for platform fixes and projects, which can be lengthy and costly. It is also important to note that Cloudera currently supports other third-party platforms and projects that would be needed for the scope of the FDOT Data Lake. These include ESRI and Elastic Search, which are essential for the performance of the District’s Testing Data Fusion Center and its seamless integration with the existing geospatial platform.

Building on a platform that is stable, supported and enterprise-ready will allow the District data lake to build and scale with an established community that is closely supported by Cloudera. Cloudera also supports Elastic Search as well as many Apache projects. Cloudera will provide the tools needed to stand up an environment that is flexible enough to build the processes and models needed to generate the analytic processes and matrices that will provide FDOT’s end users with the information they seek.

In summary, Cloudera has the best mix of enterprise-ready vendor profile, features, support, expandability and ease of use that will allow the FDOT Data Lake to grow and scale as needed for years to come.

It is important to note that the recommendation above is not based on the cost of the platforms, and that many elements will need to be in place before the District adopts Cloudera or any Hadoop platform, including transitioning the District to a Linux/Unix environment.


Summarized Platform Comparison Matrix (also available on SharePoint)

Platforms compared
Hortonworks: HDP (Hortonworks Data Platform)
Cloudera: CDH (Cloudera Distribution for Hadoop)

Based on core Hadoop framework?
Hortonworks: YES. Platform is based on the core Hadoop framework and its projects.
Cloudera: YES. Platform is based on the core Hadoop framework and its projects.

Is the platform fully open source?
Hortonworks: YES. Hortonworks strives to keep its platform and projects open source to the community. Their subscription model covers support and a few add-ons for security and similar needs.
Cloudera: YES & NO. Cloudera is open source to a certain degree, but many of the products used to manage your Hadoop cluster are part of their enterprise / subscription package.

Native Windows support
Hortonworks: YES
Cloudera: NO, not natively. Cloudera can run on Windows by utilizing Virtual Machine (VM) images.

Community support (do they have a thriving support community of developers, committers, and users?)
Hortonworks: YES
Cloudera: YES

Works with Elasticsearch (ES)
Hortonworks: YES
Cloudera: YES

Price
Hortonworks: Approx. $4,500 per node. Subscription pricing.
Cloudera: ??? Vendor wants to sit with us before they formulate a price, but pricing is on a per-node basis.

Summary
Hortonworks: Not as popular as Cloudera but growing in popularity at a fast rate. Has truly embraced the open source community and has a strong and growing community because of it.
Cloudera: The leader in enterprise-level Hadoop distribution. Has a proven commercial product that includes various tools and manager dashboards that make setting up and maintaining the Hadoop cluster easier.


Pilot Data Fusion Environment Sandbox Internal Architecture
Monday, December 12, 2016 4:43 PM


Production Data Fusion Environment Recommended Platform Architecture
Monday, November 28, 2016 8:23 AM


Big Data server requirements

Big Data Server Software Package Requirements
Wednesday, August 03, 2016 3:51 PM


GIS Platform needs for the big data store
Thursday, August 04, 2016 9:18 AM


List of Data Sources for Big Data Environment
Wednesday, August 24, 2016 3:56 PM

Inventory from AAM Report and UF Data Sources Spreadsheet (42 selected in green) for the Data Fusion Environment Sandbox

1- Roadway geometry and attributes (RCI or NAVTEQ) - Include both
Attributes needed: Posted Speed / Speed Limit; Fun Class; Number of Lanes; AADT; AADTT; Median Type (width); Intersection Locations; Access Management / Access Control; Surface width; Railroad crossings; NHS SIS; Bridges
2- Roadway construction (D5 GIS) (maintenance or planned roadway improvements? Ask Joe Duncan)
Will come out of ATMS.now
Volume will come out of
3- Signalized intersections (TACTICS TMDD C2C interface)
4- Volume
4.1 for Seminole County (SPM)
5- Occupancy
5.1 for Seminole
6- Signal Phasing
7- Transit Stops (SunRail Group)
8- Transit Routes (SunRail Group)
9- SunRail Route and Stops (SunRail Group)
10- SunRail Ridership (SunRail Group)
11- Land Use (D5 GIS)
12- Sidewalks
13- Road Ranger service area
14- Road Ranger Surveys
15- Crashes involving Road Rangers
16- Beat Configuration and Service details
17- Major investments (local agency investments that are not in the Work Program and do not utilize federal funding). Melissa can coordinate with local government to gather this data
18- Transit Ridership data
19- Travel Times (through BlueMAC API) For Seminole and Orange Counties only*
20- Signal Four Analytics (tables)**
Tables: Events; Drivers; Passengers; Non-Motorist; Property; Vehicles; Violations; Motor Carrier; Witness; Trailer

Sunguide:
20- Detectors location and configuration
21- Traffic Data from traffic detectors (MVDS, ILVDS, Bluetooth, Traffic Counts, Average,
22- Speed, Travel Times
23- Crash and Incident reports (from FHP / CAD)** refer to Sunguide events from events mgmt
24- Devices status
25- WAZE incident data
26- Traffic detectors
27- Travel Times*
28- Dynamic Message Signs
29- Connected Vehicle Data

BlueTOAD API:
30- Device location
31- Pairing configuration
32- Travel data (OD, pair, direction, last contact timestamp, travel time, speed and status) - every 2 to 3 seconds*

Velocity Bluetooth:
33- MAC addresses and timestamps for enabled devices by travel segments
34- Travel time, average speed feed*
35- Traffic Alerts feeds
36- Calculated traffic flow feed

37- Perception Twitter API
38- Walk Demand (Strava)
39- HERE.com

ATMS (Orlando, Orange, Seminole, Osceola, Volusia):
40- TACTICS Performance Measures Data Log (available via Linux controllers only)
41- TMDD C2C (controller status information (mode, plan, status) and detector data (aggregate volume/occupancy))

D5 GIS data:
42- ITS Master Plan
43- AAM Assets (KML)
44- Sunguide basemap
45- FDOT Unified basemap Repository

46- Park management system - City of Orlando
47- Drawbridge Openings in Volusia County
48- NMS
49- Express lanes (ELS)
50- ITS FM
51- MOMS
52- LYNX
53- Permits
54- Parking
55- Land Use
56- Emergency Responder AVL - Low Priority
Discussion Point -> Is this the same thing as Response Zones for fire rescue (see # 74)? NO - & will never get this AVL data because of public safety/security issue
57- Computer Aided Dispatch (CAD)
58- Special Event Information
59- Passenger Counters
60- AVL
61- Routes
62- Schedules
63- Capacity
64- Uber
65- Lyft
66- Zip Car
67- Juice
68- Rethink
69- INRIX
70- LOS
71- Bike lanes
72- Crosswalks
73- School locations
74- Response Zones for fire rescue - Low Priority
75- Trails
76- Weigh stations / weigh in motion
77- Rest areas
78- Work program (5 year adopted; Tentative; Maintenance)
Discussion Point -> Is this part of the D5GIS? Check with Joe Duncan
79- LCIS - Lane closures
80- VOTRAN
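Items 33 and 34 above pair MAC address detections with a travel time and average speed feed. As a rough illustration of how the two kinds of data can relate, the following Python sketch matches the same MAC address seen at two detectors and derives a travel time and speed. This is a generic re-identification approach assumed for illustration, not a description of how any vendor feed in this inventory is actually computed; the function names, the sample MAC strings and the one-mile segment length are all hypothetical.

```python
from datetime import datetime


def first_seen(detections):
    """Map each MAC address to its earliest timestamp in a list of (mac, timestamp) pairs."""
    earliest = {}
    for mac, ts in detections:
        if mac not in earliest or ts < earliest[mac]:
            earliest[mac] = ts
    return earliest


def match_travel_times(upstream, downstream):
    """Return {mac: travel time in seconds} for devices seen first upstream, then downstream."""
    up = first_seen(upstream)
    down = first_seen(downstream)
    travel = {}
    for mac, t_up in up.items():
        t_down = down.get(mac)
        # Keep only devices that reached the downstream detector after the upstream one.
        if t_down is not None and t_down > t_up:
            travel[mac] = (t_down - t_up).total_seconds()
    return travel


def average_speed_mph(segment_miles, travel_seconds):
    """Convert a segment length in miles and a travel time in seconds to miles per hour."""
    return segment_miles * 3600.0 / travel_seconds


if __name__ == "__main__":
    # Hypothetical detections at two ends of a one-mile segment.
    upstream = [("AA:01", datetime(2016, 8, 24, 8, 0, 0)),
                ("BB:02", datetime(2016, 8, 24, 8, 0, 10))]
    downstream = [("AA:01", datetime(2016, 8, 24, 8, 1, 0))]
    for mac, seconds in match_travel_times(upstream, downstream).items():
        print(mac, seconds, average_speed_mph(1.0, seconds))
```

A production feed would also need to filter outliers (for example, devices that stopped along the segment) and anonymize MAC addresses, both of which are omitted here.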


Data Dictionary Definition Document
Monday, October 24, 2016 10:55 AM


Data Access Architecture
Saturday, November 05, 2016 12:29 PM

Data Access Architecture refers to how the Data Fusion Environment facilitates user access to the Data Fusion Environment resources.

There are several objectives to serving users with access to data, each with potential solutions for consideration:

1. Authentication - verifying who an access request is being made on behalf of
a. OpenID Connect - https://en.wikipedia.org/wiki/OAuth
b. Lightweight Directory Access Protocol (LDAP)

2. Authorization - verifying the user is authorized to make the access request
a. Token/Role Manager
b. Open Standard for Authorization 2.0 - https://en.wikipedia.org/wiki/OAuth

3. Access - actually making requests and accessing the data
a. Representational state transfer (REST)
i. http://www.andrewhavens.com/posts/20/beginners-guide-to-creating-a-rest-api/
ii. https://en.wikipedia.org/wiki/Representational_state_transfer
b. Method parameters for an access protocol should be consistent
Datetime range (start / end)
Other temporal parameters (time of day within the range, multiple ranges, weekdays, weekends, or days in the week, etc.)
Location parameters (county, roadway, direction, mile markers, etc.)
Others specific to the data source, for example, detector ID

4. Administration - configuring users and permissions to resources
a. https://syncope.apache.org/ is a possible solution
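The requirement that method parameters stay consistent across access protocols can be sketched as a single query object shared by every data source. The Python example below is a minimal illustration under stated assumptions: the class name `AccessQuery`, the parameter names (`start`, `end`, `days`, `county`, `roadway`, `direction`), the `detector_id` key and the example URL are all hypothetical and are not defined anywhere in this document.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional
from urllib.parse import urlencode


@dataclass
class AccessQuery:
    """One parameter set reused for every data source (hypothetical names)."""
    start: datetime                       # datetime range start
    end: datetime                         # datetime range end
    days_of_week: List[str] = field(default_factory=list)  # other temporal filter
    county: Optional[str] = None          # location parameters
    roadway: Optional[str] = None
    direction: Optional[str] = None
    extra: Dict[str, str] = field(default_factory=dict)    # source-specific, e.g. detector ID

    def to_params(self):
        """Return the query as an ordered list of (name, value) pairs."""
        if self.end <= self.start:
            raise ValueError("end must be after start")
        params = [("start", self.start.isoformat()), ("end", self.end.isoformat())]
        if self.days_of_week:
            params.append(("days", ",".join(self.days_of_week)))
        for name in ("county", "roadway", "direction"):
            value = getattr(self, name)
            if value is not None:
                params.append((name, value))
        # Source-specific parameters come last, in a stable order.
        params.extend(sorted(self.extra.items()))
        return params

    def to_url(self, base_url):
        """Build a REST-style GET URL from a base endpoint and the shared parameters."""
        return base_url + "?" + urlencode(self.to_params())


if __name__ == "__main__":
    query = AccessQuery(
        start=datetime(2016, 11, 1),
        end=datetime(2016, 11, 2),
        county="Seminole",
        extra={"detector_id": "D-001"},
    )
    # Hypothetical endpoint, used only to show the resulting URL shape.
    print(query.to_url("https://example.org/api/travel-times"))
```

Keeping the common temporal and location parameters in one place means a client written against one data source can query another by changing only the endpoint and the source-specific extras.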
