This ESG Lab Review was commissioned by Trifacta and is distributed under license from ESG.
© 2017 by The Enterprise Strategy Group, Inc. All Rights Reserved.
Abstract
This report documents ESG’s performance audit of Trifacta Wrangler Enterprise, a data wrangling solution that speeds and
simplifies data wrangling to enhance analytics. Testing focused on evaluating the single-node data processing performance
of Trifacta’s Photon Compute Engine.
Background: Trifacta Wrangler Enterprise
Trifacta Wrangler Enterprise is a data wrangling solution that simplifies and accelerates data preparation for analytics. It
enables end users to explore raw, diverse data and arrange it into structured formats for analysis, without complex, error-
prone processes. Users can explore data of any shape or size and build a recipe of wrangling steps that make the data
usable downstream. Using that recipe, they can define a data processing job that can leverage various execution engines,
making the output useful in downstream visualization and analysis applications.
Trifacta’s approach to wrangling data uses data visualization, machine learning, and human-computer interaction
techniques to enhance the user experience. Trifacta delivers:
• Interactive Exploration: Wrangler Enterprise lets users see exactly what is in their data to understand its distribution
and quality; they can explore the data, manipulate it, and immediately see how various transformations will impact the
output.
• Predictive Transformation: Wrangler Enterprise offers intelligent, contextual suggestions about what transformations
to apply to optimize the potential insight from the data sets in use. Users prompt these intelligent suggestions through
simple interactions with their data such as clicking on or selecting certain data elements.
• Intelligent Execution: Wrangler Enterprise suggests the processing engine for data transformation execution; smaller
data sets can be transformed directly in the browser, while larger data sets can be processed on any desktop or server,
and data sets that require parallel processing can run on external engines such as Spark.
• Collaborative Governance: Wrangler Enterprise integrates with security, data lineage, and access frameworks, ensuring
proper access without having to implement add-on solutions.
Figure 1. Trifacta Wrangler Enterprise
ESG Lab Review
Trifacta Wrangler Enterprise: Evaluating the Performance of Trifacta’s Photon Compute Engine
Date: April 2017 Author: Mike Leone and Kerry Dolan, Senior Analysts
Enterprise Strategy Group | Getting to the bigger truth.™
Lab Review: Trifacta Wrangler Enterprise: Benchmarking the Performance of Trifacta’s Photon Compute Engine 2
Photon Compute Engine
Trifacta products feature a lightweight, in-memory data processing engine called Photon that provides the data processing
power in the browser to support the computations required for the application’s rich visualizations and transformation
suggestions. Trifacta also leverages Photon to process data outside of the browser in single-node environments such as a
desktop and single-node server. This enables business analysts, who have the best understanding of their data and the
insights they seek, to prepare data leveraging Trifacta’s intuitive process without having to resort to complex Excel, ETL
tools, or other scripting tasks, and without having to enlist the aid of IT.
Trifacta built Photon after unsuccessfully searching for a powerful engine that could 1) deliver the performance required to
support Trifacta’s fluid user experience in-browser (Spark, for example, does not support in-browser execution) and 2) deliver the
performance required for small and medium data volumes leveraging desktop and single-node environments. Photon
powers Trifacta’s user experience and enables the application to provide visualizations, interactions, and suggestions
directly within the application; it also lets users intelligently create a recipe using larger and more diverse representative
samples of their data, saving time while delivering better insights.
Different Processing Engines for Different Data Volumes
One of the defining aspects of Trifacta’s architecture is the concept of Any Scale Data Processing. Given that analysts are
tasked with working with data sets of varying sizes, Trifacta provides support for a growing ecosystem of data processing
engines to ensure users are leveraging the best-fit engine for their particular data set size and type. For small-scale data that
fits in-browser or on a single server, Trifacta leverages its own Photon engine, but for larger-scale data Trifacta supports
parallel processing engines including Spark, MapReduce, and Google Dataflow among others. Diverse data processing
engine support also ties into how Wrangler Enterprise supports various deployment options, both on premise with
deployments of Cloudera, Hortonworks, MapR, and Infosys IIP among others, and cloud deployments including AWS,
Google Cloud, and Microsoft Azure.
Performance Analysis of Trifacta’s Photon Compute Engine
ESG Lab audited the performance of Trifacta Wrangler Enterprise with a focus on single-node data processing. Performance
testing was done on a single edge server (dual eight-core, Intel Xeon processors; 128GB RAM) within a Cloudera Hadoop
cluster.
Trifacta created Photon to specifically handle data wrangling and support the product’s unique architecture, but Trifacta
also supports other multi-purpose engines including Apache Spark and Google Dataflow for scalable processing. Photon, a
C++ data wrangling engine, can run directly in a web browser or on a single server, creating an interactive data wrangling
experience in low-resource environments. This powers a user experience in which data transformations execute with
low latency and minimal memory consumption. During testing, ESG found that for small- to intermediate-sized data sets
that require only minimal computing resources, such as a single server, Photon can perform better than Spark.
With the understanding that both Photon and Spark have their place in the data center, Trifacta supports both, enabling
organizations to pick the right engine for their job, whether that be for small or large data sets, completing numerical or
textual transformations. Textual transformations occur more frequently for data wrangling, while numerical
transformations are more common in analytic environments. The obvious reason for this is the format in which data is
received. For example, little to no wrangling is required when receiving a stream of similarly formatted numbers, but with a
collection of tweets, more data wrangling is required to properly format and organize the text.
For the purposes of understanding the performance of Photon for different transformations common in a data wrangling
workflow, ESG analyzed performance results testing the speed of executing numerical and textual data transformations on
data sets of varying sizes using both Photon and Spark. Since Trifacta supports batch execution on both Photon and Spark
via a Common Data Flow (CDF) specification, the same single-server environment was used to provide a true “apples-to-
apples” performance comparison between the two processing engines. To run on a single server, Spark was configured in
local mode with all threads ready to execute. Eight total threads were used for each transformation test across the four
different data set sizes tested: 1GB, 2GB, 5GB, and 10GB. The data used in the tests was public data taken from Twitter and
included information such as username, followers, tweets, favorites, and language. Each test was run five times to ensure
consistency, and the average execution time was recorded. It should be noted that no additional performance tuning was
done in either environment.
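The measurement procedure described above (each test run five times, with the average execution time recorded) can be sketched as a small timing harness. This is a minimal illustration only, not ESG's actual tooling: the function name `average_runtime` and the stand-in workload are assumptions introduced here.

```python
import time
from statistics import mean

def average_runtime(job, runs=5):
    """Run `job` `runs` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return mean(durations)

# Hypothetical stand-in workload: sum one million integers.
avg = average_runtime(lambda: sum(range(1_000_000)))
```

Averaging over several runs smooths out one-off effects such as cold caches, which is presumably why the report repeats each test.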
Numerical Transformation Tests
The first series of tests focused on numerical transformations—completing simple and complex transformation tasks
including addition, division, sorting, and grouping. Five numerical transformations were tested and are summarized below:
1. (N1) Division – Create a new column from the division of two existing columns.
2. (N2) Aggregate-addition – Compute the sum of all values in a single column.
3. (N3) Sorting – Sort a column based on numerical value in descending order.
4. (N4) Aggregate-addition and Division – Compute the sum of all values in one column, and then divide a value from a
different preexisting column by the resulting sum.
5. (N5) Aggregate-addition and Grouping – Compute the sum of all values in a column based on a grouping and create
a new column with the results.
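To make the five numerical transformations concrete, the following is a minimal sketch in plain Python on a tiny in-memory table. It is not Trifacta recipe syntax and does not reflect how Photon or Spark execute these steps; the rows and column names (`followers`, `tweets`, `lang`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; the column names are illustrative, not the test schema.
rows = [
    {"user": "a", "lang": "en", "followers": 100, "tweets": 50},
    {"user": "b", "lang": "en", "followers": 300, "tweets": 100},
    {"user": "c", "lang": "fr", "followers": 40, "tweets": 20},
]

# N1: new column from the division of two existing columns.
for r in rows:
    r["followers_per_tweet"] = r["followers"] / r["tweets"]

# N2: sum of all values in a single column.
total_followers = sum(r["followers"] for r in rows)

# N3: sort by a numerical column in descending order.
by_followers_desc = sorted(rows, key=lambda r: r["followers"], reverse=True)

# N4: divide a value from another column by the aggregate sum.
for r in rows:
    r["tweets_over_total_followers"] = r["tweets"] / total_followers

# N5: grouped sum written back to each row as a new column.
group_totals = defaultdict(int)
for r in rows:
    group_totals[r["lang"]] += r["followers"]
for r in rows:
    r["lang_followers"] = group_totals[r["lang"]]
```

N4 and N5 each need a full pass over the data before any output row can be produced, which is one way to see why the report treats them as complex.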
Tests with more than one transformation are considered complex. An example of one of the complex numerical
transformations, the aggregate-addition and division transformation (N4), is shown in Figure 3 as viewed in the Trifacta web
interface. The formula highlights the specific transformation being done on the data set.
Figure 3. Numerical Transformation – Aggregate-addition and Division (N4)
Source: Enterprise Strategy Group, 2017
Numerical Transformation Results
The results of the performance comparison analysis for five numerical transformations on four differently sized data sets
are shown in Figure 4. Across all tests, Photon yielded an average performance improvement of 1.7X over Spark. For
the more complex numerical transformations alone (N4 and N5), the average improvement rose to 2.3X. The largest
individual improvement with Photon came on the 1GB data set running the complex aggregate-addition and division (N4)
test, where Photon was 2.6X faster than Spark.
Figure 4. Numerical Transformation Performance Comparison
Source: Enterprise Strategy Group, 2017
What the Numbers Mean
• As expected, execution times on both engines increased as the data set size increased.
• In the division test (N1), Photon outperformed Spark, with execution times completing an average of 1.6X faster.
• The aggregate-addition test (N2) was the only numerical transformation in which Spark slightly outperformed Photon. Since this test took mere minutes to complete, even at the larger data set size, the 1.2X average time advantage Spark yielded can be considered negligible.
• The result of the sorting test (N3) impressed ESG: although Spark is known to perform well on sorting transformations, Photon completed the transformation an average of 1.2X faster.
• The first of the more complex tasks (N4 – aggregate-addition and division) proved Photon can outperform Spark by a sizable margin. Execution times with Photon completed an average of 2.5X faster than Spark.
• The second complex task (N5 – aggregate-addition and grouping) also produced an impressive result, with Photon completing the execution on average 2.1X faster than Spark.
Textual Transformation Tests
The second series of tests focused on textual transformations—completing simple and complex transformation tasks that
included merging, extracting, creating, sorting, and joining. These types of transformations are very common in data
wrangling settings. Seven textual transformations were tested and are summarized below:
1. (T1) Merging – Merge two columns into a new column with a comma separating the inputs.
2. (T2) Extracting – Extract specific text based on a predefined search word or character pattern.
3. (T3) Creating – Create a single list of values based on an existing text string.
4. (T4) Creating – Create a new row based on an extracted text string.
5. (T5) Sorting – Sort a column in ascending alphabetical order.
6. (T6) Creating and Merging – Create a new column per primary key for every key-value pair in a column.
7. (T7) Joining – Join a 250MB subset of data into an existing data set based on a unique text identifier.
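Several of the textual transformations listed above can be illustrated with a minimal plain-Python sketch. This shows one reading of T1, T3, T4, T5, and T7 on hypothetical rows; the column names, the comma-delimited `tags` field, and the lookup-based join are assumptions for illustration, not the test schema or Trifacta recipe syntax.

```python
# Hypothetical rows; column names are illustrative only.
tweets = [
    {"id": "u2", "username": "bob", "language": "en", "tags": "data,spark"},
    {"id": "u1", "username": "alice", "language": "fr", "tags": "wrangling"},
]
profiles = [{"id": "u1", "followers": 10}, {"id": "u2", "followers": 25}]

# T1: merge two columns into a new column, comma-separated.
for t in tweets:
    t["user_lang"] = f'{t["username"]},{t["language"]}'

# T3: build a list of values from an existing text string.
for t in tweets:
    t["tag_list"] = t["tags"].split(",")

# T4: emit one new row per extracted value.
tag_rows = [{"id": t["id"], "tag": tag} for t in tweets for tag in t["tag_list"]]

# T5: sort by a text column in ascending alphabetical order.
by_username = sorted(tweets, key=lambda t: t["username"])

# T7: join a smaller data set on a unique text identifier.
followers_by_id = {p["id"]: p["followers"] for p in profiles}
joined = [{**t, "followers": followers_by_id.get(t["id"])} for t in tweets]
```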
In Figure 5, we have provided a visual example of what performing an extraction in Trifacta’s web interface looks like. Using
the extraction test (T2) as an example, Figure 5 highlights an instance in which hashtags were extracted from the text of a
collection of tweets. As shown, each tweet contained at least one hashtag, while some contained as many as ten.
Figure 5. Textual Transformation – Extracting Text (T2)
Source: Enterprise Strategy Group, 2017
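The hashtag extraction described for T2 can be approximated with a regular expression. This is a minimal sketch, assuming a hashtag is a `#` followed by one or more word characters; the pattern and the function name `extract_hashtags` are illustrative and not taken from the Trifacta recipe used in the test.

```python
import re

# Assumed pattern: '#' followed by one or more word characters.
HASHTAG = re.compile(r"#\w+")

def extract_hashtags(text):
    """Return every hashtag found in `text`, in order of appearance."""
    return HASHTAG.findall(text)

sample = "Wrangling tweets with #Photon and #Spark today"
tags = extract_hashtags(sample)  # ['#Photon', '#Spark']
```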
Textual Transformation Results
A performance comparison chart similar to the numerical transformations chart is shown in Figure 6 for seven different
textual transformations on four differently sized data sets. It should be noted that for a handful of scenarios, the
transformation was unable to complete due to the limited resources in the single-server environment. This occurred once
with Photon and three times with Spark.
ESG observed that Photon executed faster than Spark, by an average of nearly 3X, in 23 of the 27 textual
transformation test scenarios. The largest performance improvement offered by Photon was achieved during the test to
create a single list of values based on an existing text string (T3). This test yielded an average performance gain of 6.4X
across the full data set size range of 1-10GB, with the largest single improvement being 6.6X over the same
transformation run on Spark.
Figure 6. Textual Transformation Performance Comparison
Source: Enterprise Strategy Group, 2017
What the Numbers Mean
• In the merge test (T1), Photon outperformed Spark, with execution times completing an average of 1.5X faster.
• The text extraction transformation test (T2) completed an average of 1.8X faster on Photon than on Spark.
• The test which created a single list based on existing text values (T3) yielded the most impressive result, a 6.6X faster execution time on Photon (22 seconds) compared with Spark (2.4 minutes) for the 1GB data set. At the 10GB data set size, Photon completed the transformation in under four minutes, while Spark took more than 20 minutes.
• The new row creation based on extracted text (T4) yielded an average execution speed nearly 3X faster with Photon than with Spark.
• The test that sorted a column in alphabetical order (T5) completed an average of 2.2X faster on Photon.
• The complex test of creating a new column and merging data based on primary key and key-value pairs (T6) favored Photon by an average of 2.7X faster.
• Joining a new data set with an existing data set (T7) was the only textual test that favored Spark, which executed an average of 1.4X faster than Photon.
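The speedup figures quoted for T3 follow from dividing Spark's execution time by Photon's. The short calculation below uses only the rounded times stated in the bullet above, so it is a consistency check rather than a reproduction of ESG's measurements; the variable names are illustrative.

```python
# T3 at 1GB, using the rounded times quoted in the report.
photon_seconds = 22
spark_seconds = 2.4 * 60          # 2.4 minutes = 144 seconds
speedup_1gb = spark_seconds / photon_seconds

# T3 at 10GB: Photon took under 4 minutes and Spark more than 20,
# so the ratio of the two bounds gives a lower limit on the speedup.
speedup_10gb_lower_bound = 20 / 4
```

With these rounded inputs the 1GB ratio comes out at roughly 6.5X, consistent with the reported 6.6X once rounding of the quoted times is taken into account, and the 10GB result implies a speedup of more than 5X.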
Memory Utilization
ESG also viewed the memory utilization of the Photon and Spark tests to understand Trifacta’s minimal memory
requirements while still delivering faster-than-Spark performance for data sets with small footprints. The 1GB and 2GB tests
were analyzed across all tested numerical and textual transformations.
For numerical transformations, average peak memory utilization across all tests was 22X higher when running on Spark.
Most impressive were the individual Photon results when running the division (N1) and addition (N2) tests. Photon
completed the division transformation utilizing 59X less memory than required by Spark, while the addition transformation
required 39X less memory on Photon. Further, the worst case still produced a small advantage to Photon, with the addition
and division (N4) test requiring nearly 3X less memory than Spark.
The textual transformation memory utilization favored Photon even further, with average peak memory utilization across all
tests yielding a 28X increase on Spark. Two tests in particular stood out: the merging (T1) and the creating and merging (T6)
tests. For merging alone, Photon consumed just 0.119 GB of RAM, while Spark consumed 85X that. For the creating and merging
test, Photon delivered a 33X memory utilization savings over Spark.
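The memory multiples above can be converted into absolute and percentage terms with simple arithmetic. The sketch below uses only figures stated in this section; the helper name `percent_less` is an assumption introduced for illustration.

```python
def percent_less(factor):
    """Convert 'N times less' into a percentage reduction."""
    return (1 - 1 / factor) * 100

# Merging (T1): Photon used 0.119 GB and Spark used 85X that amount.
photon_merge_gb = 0.119
spark_merge_gb = photon_merge_gb * 85      # about 10.1 GB

# Express the reported multiples as percentage reductions.
division_saving = percent_less(59)         # N1: about 98.3% less memory
merge_saving = percent_less(85)            # T1: about 98.8% less memory
```

These percentages line up with the "up to 98% less memory" figure that the report cites in its conclusion.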
Why This Matters
For a majority of organizations, and for business analysts in particular, the need for an efficient, purpose-built engine to handle their data wrangling needs has never been more apparent. Organizations still need scalable general-purpose processing engines like Spark and MapReduce to handle their larger data sets, but a complementary engine that provides a faster way to merge and consume, or “wrangle,” small- to intermediate-sized data sets in the tens of GBs is a welcome addition to any analytics infrastructure and provides users with substantial performance benefits.
ESG confirmed that the Photon Compute Engine, Trifacta’s in-memory, interactive data wrangling engine, delivers fast batch execution and low-latency performance for common numerical and textual transformation tasks. In testing on an identical single-node environment focused on small and medium data volumes, Photon was faster than Spark in 16 of 20 numerical transformations and 23 of 27 textual transformations. Between speed of execution, low resource consumption, and mobility, ESG was impressed with Photon’s ability to handle both simple and complex transformation tasks.
The Bigger Truth
ESG Lab validated that Trifacta Wrangler Enterprise provides a fast data wrangling solution that works for data sets of
various sizes. Wrangler’s in-memory engine, Photon, enables users to quickly visualize and interact with data to deliver the
optimal format for the analysis at hand. Wrangling recipe development can be done in the browser, and data processing of
recipes can be done in the browser, on a desktop or single server, or in a Hadoop cluster.
When comparing the execution times of numerical and textual transformations between Photon and Spark, Photon
outperformed Spark in most test cases. In fact, ESG witnessed faster, more efficient execution times on Photon, completing
transformations up to 6.6X faster while utilizing up to 98% less memory compared with Spark. Not only does Trifacta’s new,
lightweight platform effectively wrangle data faster than traditional processing engines like Spark, but it also does so
through an intuitive, interactive interface that requires few resources—it can often run directly within a browser using
client-side memory on a traditional laptop.
In ESG’s view, Trifacta Wrangler Enterprise fills a void in the market—it provides fast, high-quality data wrangling for diverse
data sets, for use by the data analysts closest to the outcomes. It empowers users with varied needs and skill sets, but also
meets organizations’ governance and security requirements. So if your organization wants to spend more time using data
and less time preparing it, ESG recommends you take a look at Trifacta Wrangler Enterprise.
© 2017 by The Enterprise Strategy Group, Inc. All Rights Reserved. www.esg-global.com [email protected] P.508.482.0188
All trademark names are property of their respective companies. Information contained in this publication has been obtained by sources The Enterprise Strategy Group (ESG) considers to be
reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any
reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent
of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions,
please contact ESG Client Relations at 508.482.0188.
The goal of ESG Lab reports is to educate IT professionals about data center technology products for companies of all types and sizes. ESG Lab reports are not meant to replace the evaluation
process that should be conducted before making purchasing decisions, but rather to provide insight into these emerging technologies. Our objective is to go over some of the more valuable
feature/functions of products, show how they can be used to solve real customer problems and identify any areas needing improvement. ESG Lab's expert third-party perspective is based on our
own hands-on testing as well as on interviews with customers who use these products in production environments.