
This ESG Lab Review was commissioned by Trifacta and is distributed under license from ESG.

© 2017 by The Enterprise Strategy Group, Inc. All Rights Reserved.

Abstract

This report documents ESG’s performance audit of Trifacta Wrangler Enterprise, a data wrangling solution that speeds and simplifies data preparation to enhance analytics. Testing focused on evaluating the single-node data processing performance of Trifacta’s Photon Compute Engine.

Background: Trifacta Wrangler Enterprise

Trifacta Wrangler Enterprise is a data wrangling solution that simplifies and accelerates data preparation for analytics. It enables end users to explore raw, diverse data and arrange it into structured formats for analysis, without complex, error-prone processes. Users can explore data of any shape or size and build a recipe of wrangling steps that make the data usable downstream. Using that recipe, they can define a data processing job that can leverage various execution engines, making the output useful in downstream visualization and analysis applications.

Trifacta’s approach to wrangling data uses data visualization, machine learning, and human-computer interaction

techniques to enhance the user experience. Trifacta delivers:

• Interactive Exploration: Wrangler Enterprise lets users see exactly what is in their data to understand its distribution

and quality; they can explore the data, manipulate it, and immediately see how various transformations will impact the

output.

• Predictive Transformation: Wrangler Enterprise offers intelligent, contextual suggestions about what transformations

to apply to optimize the potential insight from the data sets in use. Users prompt these intelligent suggestions through

simple interactions with their data such as clicking on or selecting certain data elements.

• Intelligent Execution: Wrangler Enterprise suggests the processing engine for data transformation execution; smaller

data sets can be transformed directly in the browser, while larger data sets can be processed on any desktop or server,

and data sets that require parallel processing can run on external engines such as Spark.

• Collaborative Governance: Wrangler Enterprise integrates with security, data lineage, and access frameworks, ensuring

proper access without having to implement add-on solutions.


Figure 1. Trifacta Wrangler Enterprise

ESG Lab Review

Trifacta Wrangler Enterprise: Evaluating the Performance of Trifacta’s Photon Compute Engine

Date: April 2017 Author: Mike Leone and Kerry Dolan, Senior Analysts

Enterprise Strategy Group | Getting to the bigger truth.™



Photon Compute Engine

Trifacta products feature a lightweight, in-memory data processing engine called Photon that provides the data processing

power in the browser to support the computations required for the application’s rich visualizations and transformation

suggestions. Trifacta also leverages Photon to process data outside of the browser in single-node environments such as a

desktop and single-node server. This enables business analysts, who have the best understanding of their data and the

insights they seek, to prepare data leveraging Trifacta’s intuitive process without having to resort to complex Excel, ETL

tools, or other scripting tasks, and without having to enlist the aid of IT.

Trifacta built Photon after unsuccessfully searching for a powerful engine that could 1) deliver the performance required to support Trifacta’s fluid user experience in-browser (Spark, for example, does not run in-browser) and 2) deliver the performance required for small and medium data volumes on desktop and single-node environments. Photon powers Trifacta’s user experience and enables the application to provide visualizations, interactions, and suggestions directly within the application; it also lets users intelligently create a recipe using larger and more diverse representative samples of their data, saving time while delivering better insights.

Different Processing Engines for Different Data Volumes

One of the defining aspects of Trifacta’s architecture is the concept of Any Scale Data Processing. Given that analysts are tasked with working with data sets of varying sizes, Trifacta provides support for a growing ecosystem of data processing engines to ensure users are leveraging the best-fit engine for their particular data set size and type. For small-scale data that fits in-browser or on a single server, Trifacta leverages its own Photon engine, but for larger-scale data, Trifacta supports parallel processing engines including Spark, MapReduce, and Google Dataflow, among others. Diverse data processing engine support also ties into how Wrangler Enterprise supports various deployment options, both on-premises with deployments of Cloudera, Hortonworks, MapR, and Infosys IIP, among others, and in the cloud, including AWS, Google Cloud, and Microsoft Azure.
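The best-fit engine idea described above can be pictured with a minimal sketch. The size cut-offs, the `Engine` names, and the `choose_engine` function are hypothetical and exist only for illustration; the report does not state the thresholds or logic Trifacta actually uses.

```python
from enum import Enum


class Engine(Enum):
    PHOTON_BROWSER = "Photon (in browser)"
    PHOTON_SINGLE_NODE = "Photon (desktop or single server)"
    PARALLEL = "parallel engine such as Spark"


# Hypothetical cut-offs chosen only for this example; the report gives none.
BROWSER_LIMIT_GB = 0.1
SINGLE_NODE_LIMIT_GB = 10.0


def choose_engine(data_size_gb: float) -> Engine:
    """Pick an execution engine from the data set size alone (illustrative)."""
    if data_size_gb < 0:
        raise ValueError("data size cannot be negative")
    if data_size_gb <= BROWSER_LIMIT_GB:
        return Engine.PHOTON_BROWSER
    if data_size_gb <= SINGLE_NODE_LIMIT_GB:
        return Engine.PHOTON_SINGLE_NODE
    return Engine.PARALLEL


if __name__ == "__main__":
    for size_gb in (0.05, 5.0, 500.0):
        print(f"{size_gb} GB -> {choose_engine(size_gb).value}")
```

A real selector would likely weigh data type and available resources as well as size; this sketch keys on size only to keep the idea visible.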

Performance Analysis of Trifacta’s Photon Compute Engine

ESG Lab audited the performance of Trifacta Wrangler Enterprise with a focus on single-node data processing. Performance testing was done on a single edge server (dual eight-core Intel Xeon processors; 128GB RAM) within a Cloudera Hadoop cluster.

Trifacta created Photon specifically to handle data wrangling and support the product’s unique architecture, but Trifacta also supports other multi-purpose engines, including Apache Spark and Google Dataflow, for scalable processing. Photon, a C++ data wrangling engine, can run directly in a web browser or on a single server, creating an interactive data wrangling experience in low-resource environments. This powers a user experience in which execution of data transformations completes immediately with minimal memory requirements or consumption. During the testing, we found that for small- to intermediate-sized data sets that require only minimal computing resources, such as a single server, Photon can perform better than Spark.

With the understanding that both Photon and Spark have their place in the data center, Trifacta supports both, enabling

organizations to pick the right engine for their job, whether that be for small or large data sets, completing numerical or

textual transformations. Textual transformations occur more frequently for data wrangling, while numerical

transformations are more common in analytic environments. The obvious reason for this is the format in which data is

received. For example, little to no wrangling is required when receiving a stream of similarly formatted numbers, but with a

collection of tweets, more data wrangling is required to properly format and organize the text.

For the purposes of understanding the performance of Photon for different transformations common in a data wrangling workflow, ESG analyzed performance results testing the speed of executing numerical and textual data transformations on data sets of varying sizes using both Photon and Spark. Since Trifacta supports batch execution on both Photon and Spark via a Common Data Flow (CDF) specification, the same single-server environment was used to provide a true “apples-to-apples” performance comparison between the two processing engines. To run on a single server, Spark was configured in local mode with all threads ready to execute. Eight total threads were used for each transformation test across the four different data set sizes tested: 1GB, 2GB, 5GB, and 10GB. The data used in the tests was public data taken from Twitter and included information such as username, followers, tweets, favorites, and language. Each test was run five times to ensure consistency, and the average execution time was recorded. It should be noted that no additional performance tuning was done in either environment.
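The measurement procedure described above (five runs per test, with the average execution time recorded) can be sketched as a small timing harness. This is an illustration only, not ESG's actual tooling; the function names and the stand-in workload are assumptions.

```python
import time
from statistics import mean
from typing import Callable

RUNS_PER_TEST = 5  # the report states each test was run five times


def time_once(job: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one execution of job."""
    start = time.perf_counter()
    job()
    return time.perf_counter() - start


def average_runtime(job: Callable[[], object], runs: int = RUNS_PER_TEST) -> float:
    """Run job several times and return the mean execution time in seconds."""
    return mean(time_once(job) for _ in range(runs))


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Stand-in workload (hypothetical): a descending sort of 100,000 integers.
    avg = average_runtime(lambda: sorted(range(100_000), reverse=True))
    print(f"average over {RUNS_PER_TEST} runs: {avg:.4f} s")
```

Given two averages measured this way, `speedup(spark_avg, photon_avg)` would produce the kind of "X faster" figure quoted throughout the results.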

Numerical Transformation Tests

The first series of tests focused on numerical transformations—completing simple and complex transformation tasks

including addition, division, sorting, and grouping. Five numerical transformations were tested and are summarized below:

1. (N1) Division – Create a new column from the division of two existing columns.

2. (N2) Aggregate-addition – Compute the sum of all values in a single column.

3. (N3) Sorting – Sort a column based on numerical value in descending order.

4. (N4) Aggregate-addition and Division – Compute the sum of all values in one column, and then divide a value from a different, preexisting column by that sum.

5. (N5) Aggregate-addition and Grouping – Compute the sum of all values in a column based on a grouping and create

a new column with the results.
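To make the five numerical transformations concrete, here is a minimal plain-Python sketch on a three-row table. The sample values are invented for illustration; the column names follow the fields the report says were in the test data. This is not Trifacta recipe syntax, and the reading of N4 (dividing each row's value in a second column by the column sum) is an assumption.

```python
from collections import defaultdict

# Tiny invented sample; the real tests used 1GB to 10GB of public Twitter data.
rows = [
    {"username": "a", "language": "en", "followers": 100, "tweets": 50, "favorites": 10},
    {"username": "b", "language": "en", "followers": 300, "tweets": 20, "favorites": 30},
    {"username": "c", "language": "fr", "followers": 200, "tweets": 40, "favorites": 60},
]

# N1: new column from the division of two existing columns.
n1 = [r["followers"] / r["tweets"] for r in rows]

# N2: sum of all values in a single column.
n2 = sum(r["followers"] for r in rows)

# N3: sort by a numerical column in descending order.
n3 = sorted(rows, key=lambda r: r["followers"], reverse=True)

# N4 (assumed reading): divide each value of another column by the N2 sum.
n4 = [r["favorites"] / n2 for r in rows]

# N5: sum a column per group and attach the group total as a new column.
group_totals = defaultdict(int)
for r in rows:
    group_totals[r["language"]] += r["followers"]
n5 = [dict(r, followers_by_language=group_totals[r["language"]]) for r in rows]

if __name__ == "__main__":
    print(n1, n2, [r["username"] for r in n3], n4)
    print([r["followers_by_language"] for r in n5])
```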

Tests with more than one transformation are considered complex. An example of one of the complex numerical

transformations, the aggregate-addition and division transformation (N4), is shown in Figure 3 as viewed in the Trifacta web

interface. The formula highlights the specific transformation being done on the data set.

Figure 3. Numerical Transformation – Aggregate-addition and Division (N4)

Source: Enterprise Strategy Group, 2017



Numerical Transformation Results

The results of the performance comparison analysis for five numerical transformations on four differently sized data sets are shown in Figure 4. Across all tests, Photon yielded an average performance improvement of 1.7X over Spark. When looking just at the more complex numerical transformations (N4 and N5), that average performance improvement increased, with Photon performing 2.3X faster than Spark. The largest individual improvement from Photon was witnessed for the 1GB data set running the complex addition and division (N4) test, where Photon ran 2.6X faster than Spark.

Figure 4. Numerical Transformation Performance Comparison


What the Numbers Mean

• As expected, in both test scenarios, execution times increased as the data set size increased.

• In the division test (N1), Photon outperformed Spark, with execution times completing an average of 1.6X faster.

• The aggregate-addition test (N2) was the only numerical transformation in which Spark slightly outperformed Photon. Since this test took mere minutes to complete, even at the larger data set size, the 1.2X average time advantage Spark yielded can be considered negligible.

• ESG was impressed by the result of the sorting test (N3). Although Spark is known to perform well when handling sorting transformations, ESG witnessed a slight benefit when using Photon, which completed the transformation on average 1.2X faster.

• The first of the more complex tasks (N4 – aggregate-addition and division) proved Photon can outperform Spark by a sizable margin. Execution times with Photon completed an average of 2.5X faster than Spark.

• The second complex task (N5 – aggregate-addition and grouping) also produced an impressive result, with Photon completing the execution on average 2.1X faster than Spark.
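The per-test averages in the bullets above can be combined to check the headline figures. The sketch below is a rough check under one assumption: that the headline numbers are simple means of per-test ratios. Because N2 favored Spark by 1.2X, Photon's ratio for that test is taken as 1/1.2. The inputs are the rounded figures quoted in the bullets, so the overall mean lands near the reported 1.7X rather than exactly on it; the report presumably averaged unrounded results.

```python
from statistics import mean

# Photon-over-Spark average speedups per test, as quoted in the bullets.
# N2 favored Spark by 1.2X, so Photon's ratio there is 1 / 1.2 (below 1).
photon_speedup = {
    "N1": 1.6,
    "N2": 1 / 1.2,
    "N3": 1.2,
    "N4": 2.5,
    "N5": 2.1,
}

# Mean over the two complex tests (N4 and N5).
complex_avg = mean([photon_speedup["N4"], photon_speedup["N5"]])

# Mean over all five tests, computed from the rounded inputs above.
overall_avg = mean(list(photon_speedup.values()))

if __name__ == "__main__":
    print(f"complex tests (N4, N5): {complex_avg:.1f}X")
    print(f"all five tests:         {overall_avg:.2f}X")
```

The complex-test mean agrees with the reported 2.3X; the all-test mean from rounded inputs is about 1.65X.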



Textual Transformation Tests

The second series of tests focused on textual transformations—completing simple and complex transformation tasks that

included merging, extracting, creating, sorting, and joining. These types of transformations are very common in data

wrangling settings. Seven textual transformations were tested and are summarized below:

1. (T1) Merging – Merge two columns into a new column with a comma separating the inputs.

2. (T2) Extracting – Extract specific text based on a predefined search word or character pattern.

3. (T3) Creating – Create a single list of values based on an existing text string.

4. (T4) Creating – Create a new row based on an extracted text string.

5. (T5) Sorting – Sort a column in ascending alphabetical order.

6. (T6) Creating and Merging – Create a new column per primary key for every key-value pair in a column.

7. (T7) Joining – Join a 250MB subset of data into an existing data set based on a unique text identifier.
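Several of the textual transformations listed above can be illustrated with a short plain-Python sketch. The sample tweets, the hashtag pattern, and the reading of T4 (one new row per extracted hashtag) are assumptions made for illustration; this is not Trifacta recipe syntax.

```python
import re

# Invented sample rows; the real tests used public Twitter data.
tweets = [
    {"username": "bo", "language": "fr", "text": "Bonjour #paris"},
    {"username": "ana", "language": "en", "text": "Cleaning data again #datawrangling #etl"},
]

# T1: merge two columns into a new column, separated by a comma.
merged = [f'{t["username"]},{t["language"]}' for t in tweets]

# T2: extract text matching a predefined character pattern (here, hashtags).
HASHTAG = re.compile(r"#\w+")
hashtags = [HASHTAG.findall(t["text"]) for t in tweets]

# T4 (assumed reading): create one new row per extracted hashtag.
exploded = [
    {"username": t["username"], "hashtag": tag}
    for t, tags in zip(tweets, hashtags)
    for tag in tags
]

# T5: sort a column in ascending alphabetical order.
sorted_usernames = sorted(t["username"] for t in tweets)

if __name__ == "__main__":
    print(merged)
    print(hashtags)
    print(exploded)
    print(sorted_usernames)
```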

In Figure 5, we have provided a visual example of what performing an extraction in Trifacta’s web interface looks like. Using

the extraction test (T2) as an example, Figure 5 highlights an instance in which hashtags were extracted from the text of a

collection of tweets. As shown, each tweet contained at least one hashtag, while some contained as many as ten.

Figure 5. Textual Transformation – Extracting Text (T2)




Textual Transformation Results

A performance comparison chart similar to the numerical transformations chart is shown in Figure 6 for seven different

textual transformations on four differently sized data sets. It should be noted that for a handful of scenarios, the

transformation was unable to complete due to the limited resources in the single-server environment. This occurred once

with Photon and three times with Spark.

In 23 of the 27 textual transformation test scenarios, ESG witnessed Photon executing faster than Spark, by an average of nearly 3X. The largest performance improvement offered by Photon was achieved during the test to create a single list of values based on an existing text string (T3). This test yielded an average performance gain of 6.4X across the full data set size range of 1-10GB, with the largest improvement being 6.6X on Photon compared with running the same transformation on Spark.

Figure 6. Textual Transformation Performance Comparison


What the Numbers Mean

• In the merge test (T1), Photon outperformed Spark, with execution times completing an average of 1.5X faster.

• The text extraction transformation test (T2) completed an average of 1.8X faster on Photon than on Spark.

• The test which created a single list based on existing text values (T3) yielded the most impressive result, a 6.6X faster execution time on Photon (22 seconds) compared with Spark (2.4 minutes) for the 1GB data set. At the 10GB data set size, Photon completed the transformation in under four minutes, while Spark took more than 20 minutes.

• The new row creation based on extracted text (T4) yielded an average execution speed nearly 3X faster with Photon than with Spark.

• The test that sorted a column in alphabetical order (T5) completed an average of 2.2X faster on Photon.

• The complex test of creating a new column and merging data based on primary key and key-value pairs (T6) favored Photon, which completed an average of 2.7X faster.

• The only test that favored Spark, joining a new data set with an existing data set (T7), yielded an average 1.4X faster execution time than Photon.



Memory Utilization

ESG also examined the memory utilization of the Photon and Spark tests to understand how Trifacta keeps memory requirements minimal while still delivering faster-than-Spark performance for data sets with small footprints. The 1GB and 2GB tests were analyzed across all tested numerical and textual transformations.

For numerical transformations, average peak memory utilization across all tests was 22X higher when running on Spark.

Most impressive were the individual Photon results when running the division (N1) and addition (N2) tests. Photon

completed the division transformation utilizing 59X less memory than required by Spark, while the addition transformation

required 39X less memory on Photon. Further, the worst case still produced a small advantage to Photon, with the addition

and division (N4) test requiring nearly 3X less memory than Spark.

The textual transformation memory utilization favored Photon even further, with average peak memory utilization across all tests 28X higher on Spark. Two tests in particular stood out: the merging (T1) and the creating and merging (T6) tests. For merging alone, Photon consumed just 0.119GB of RAM, while Spark consumed 85X that. The creating and merging test produced a 33X memory utilization savings over Spark.
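The "N times less memory" ratios quoted above can be converted to percentage reductions with simple arithmetic. The sketch below uses only the ratios stated in this section (treating "nearly 3X" as 3); the helper name `percent_saved` is introduced here for illustration.

```python
def percent_saved(fold_reduction: float) -> float:
    """Convert 'N times less memory' into a percentage reduction."""
    if fold_reduction <= 0:
        raise ValueError("fold reduction must be positive")
    return (1 - 1 / fold_reduction) * 100


# Ratios quoted in this section: how many times more memory Spark used.
memory_ratios = {
    "N1 division": 59,
    "N2 addition": 39,
    "N4 addition and division": 3,  # the report says "nearly 3X"
    "T1 merging": 85,
    "T6 creating and merging": 33,
}

# T1: Photon's reported peak, and the Spark peak implied by the 85X ratio.
photon_t1_gb = 0.119
spark_t1_gb = photon_t1_gb * memory_ratios["T1 merging"]

if __name__ == "__main__":
    for test, ratio in memory_ratios.items():
        print(f"{test}: {ratio}X less memory = {percent_saved(ratio):.1f}% reduction")
    print(f"T1 implied Spark peak: {spark_t1_gb:.1f} GB")
```

By this arithmetic, the 59X and 85X ratios correspond to reductions of roughly 98 to 99 percent, and the T1 ratio implies a Spark peak of roughly 10GB.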

Why This Matters

For a majority of organizations, and for business analysts in particular, the need for an efficient, purpose-built engine to handle their data wrangling needs has never been more apparent. Organizations still need scalable general-purpose processing engines like Spark and MapReduce to handle their larger data sets, but a complementary engine that provides a faster way to merge and consume, or “wrangle,” small to intermediate sized data sets in the tens of GBs is a welcome addition to any analytics infrastructure and provides users with substantial performance benefits.

ESG confirmed that the Photon Compute Engine, Trifacta’s in-memory, interactive data wrangling engine, delivers fast batch execution and low-latency performance for common numerical and textual transformation tasks. In testing that used an identical environment and focused on small and medium data volumes on a single node, Photon was faster than Spark in 16 of 20 numerical transformations and 23 of 27 textual transformations. Between speed of execution, low resource consumption, and mobility, ESG was impressed with Photon’s ability to handle both simple and complex transformation tasks.

The Bigger Truth

ESG Lab validated that Trifacta Wrangler Enterprise provides a fast data wrangling solution that works for data sets of various sizes. Wrangler’s in-memory engine, Photon, enables users to quickly visualize and interact with data to deliver the optimal format for the analysis at hand. Wrangling recipe development can be done in the browser, and data processing of recipes can be done in the browser, on a desktop or single server, or in a Hadoop cluster.

When comparing the execution times of numerical and textual transformations between Photon and Spark, Photon

outperformed Spark in most test cases. In fact, ESG witnessed faster, more efficient execution times on Photon, completing

transformations up to 6.6X faster while utilizing up to 98% less memory compared with Spark. Not only does Trifacta’s new,

lightweight platform effectively wrangle data faster than traditional processing engines like Spark, but it also does so

through an intuitive, interactive interface that requires few resources—it can often run directly within a browser using

client-side memory on a traditional laptop.

In ESG’s view, Trifacta Wrangler Enterprise fills a void in the market—it provides fast, high-quality data wrangling for diverse

data sets, for use by the data analysts closest to the outcomes. It empowers users with varied needs and skill sets, but also

meets organizations’ governance and security requirements. So if your organization wants to spend more time using data

and less time preparing it, ESG recommends you take a look at Trifacta Wrangler Enterprise.


© 2017 by The Enterprise Strategy Group, Inc. All Rights Reserved. www.esg-global.com P.508.482.0188


All trademark names are property of their respective companies. Information contained in this publication has been obtained by sources The Enterprise Strategy Group (ESG) considers to be

reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any

reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent

of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions,

please contact ESG Client Relations at 508.482.0188.

The goal of ESG Lab reports is to educate IT professionals about data center technology products for companies of all types and sizes. ESG Lab reports are not meant to replace the evaluation

process that should be conducted before making purchasing decisions, but rather to provide insight into these emerging technologies. Our objective is to go over some of the more valuable

feature/functions of products, show how they can be used to solve real customer problems and identify any areas needing improvement. ESG Lab's expert third-party perspective is based on our

own hands-on testing as well as on interviews with customers who use these products in production environments.
