Bruce YellinAdvisory Systems EngineerEMC [email protected]
A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL
An article by EMC Proven Professional
Knowledge Sharing Elite Author
2014 EMC Proven Professional Knowledge Sharing 2
Table of Contents
Introduction – Baseball, Big Data, and Advanced Analytics ...................................... 9
Big Data and Baseball - Players, Coaches, Trainers, and Managers ....................... 13
Sportvision ................................................................................................................ 15
PITCHf/x ................................................................................................................ 16
HITf/x ..................................................................................................................... 25
FIELDf/x ................................................................................................................ 28
Big Data and the Business of Baseball ..................................................................... 33
Player Development .................................................................................................. 36
Revenue From Fans .................................................................................................. 38
Revenue From Media ................................................................................................ 42
Big Data Helps Create Algorithmic Baseball Journalism ......................................... 43
Listen To Your Data - Grady the Goat - The Curse of the Bambino ......................... 46
Conclusion: The Future of Big Data Baseball ........................................................... 47
Appendix - PITCHf/x Metadata ................................................................................... 50
Footnotes ..................................................................................................................... 51
PLEASE NOTE: It is recommended that this paper be printed in color.
Disclaimer: The views, processes, or methodologies published in this article are those
of the authors. They do not necessarily reflect EMC Corporation’s views, processes, or
methodologies.
2014 EMC Proven Professional Knowledge Sharing 3
Big data and advanced analytics touch every facet of our lives. We see it in action with
on-line advertising, cash register printed coupons, and the marketing of airline tickets.
Big data measurements range from hundreds of terabytes to petabytes, with analytics
refining data quantity into quality actionable information. Baseball has
always been a data-rich statistical paradise and is called a data-
driven sport, yet the amount of data it used to generate pales in
comparison to today’s game. No wonder that Joe Maddon, manager
of the Tampa Bay Rays, calls the impact of big data on baseball its second great
renaissance1.
Up until the last twenty years, we judged players based on basic statistics. Now we have
much more in-depth data to make more accurate assessments. Big data can now help
hitters battle pitchers, pitchers battle hitters, and properly position defensive players for
the pitcher-hitter dynamic. Just as big data has helped companies worldwide become
more profitable and efficient, it will also help Major League Baseball (MLB) do the same.
Big data will give teams a competitive and financial advantage, and bring the fan a more
exciting experience. The data and analytics will come from action on the field and
business decisions. It will be available to customers in the ballparks and fans around the
world, giving them insights that were impossible just a few years ago.
Pretend you are at a game on a really humid, windless, hot August night at Boston’s
century-old Fenway Park. The Red Sox lead their arch rival New York Yankees 3-2 with
2 outs in the top of the 9th inning. On 2nd base is Robinson Cano and
on 3rd, Brett Gardner who as one of the fastest runners in baseball
could score the tying run in about 3 seconds. Just 20 minutes earlier,
most of the 34,000 fans stood and sang Neil Diamond’s classic
“Sweet Caroline” with gusto. Now they are standing again, this time
vocalizing their disdain for the Yankees. Their thunderous applause, foot stomping, and
120+ decibel roar can be heard blocks away as the pitch is delivered.
The 92 mph four-seam fastball crossed the 17” wide home plate
and thuds into Jarrod “Salty” Saltalamacchia’s catcher’s mitt. An
instant later, the umpire calls “Strike Two!” against former Red Sox
hero and now Yankees “Evil Empire” foot soldier, Kevin “Youk”
Youkilis. Youk doesn’t like the umpire’s call as boisterous fans, packed like sardines in
2014 EMC Proven Professional Knowledge Sharing 4
bars along Yawkey Way, Lansdowne Street, and Brookline Avenue scream “Yes!” in
near-unison, accompanied by nearly 4 million ESPN viewers across America yelling for
or against the right-handed hitter to succeed or fail. A high-definition TV camera zooms
in at the perspiration dripping from under Kevin’s batting helmet. On the mound,
Boston’s hard-throwing right-handed warrior Clay Buchholz, his double-knit polyester
uniform drenched in sweat, has delivered his 108th pitch of the
evening. Statistically, right handed pitchers would rather face right-
handed hitters, so this matchup favors Buchholz, a former teammate
of Youk.
Buchholz takes a long time between pitches. He must get Youk out because the next
batter is even more dangerous. Over the next 24 seconds, both teams reset for the next
pitch. Before the game, the Yankees’ big data scientists charted the likely type and
location of Buchholz’s next pitch. Youk steps out of the batter’s box waiting for 3rd base
coach Robby Thompson to relay Yankee manager Joe Girardi’s offensive signal.
Thompson touches his cap, chest, thigh, gestures with his hands, then double-claps
telling Youk to expect Buchholz to jam him inside based on a PITCHf/x big data analysis.
Gardner and Cano are also looking at the signals so they know what is being attempted.
PITCHf/x data shows the Sox the location and type of the pitch Youk likes to hit. Even
with the diminishing skills of a 34-year old,
Youk is still a dangerous career .281 hitter who
thrives on pitches in the middle of the plate or
low and inside2. Red Sox bench coach Torey
Lovullo signals the fielders where to play
based on pitching coach Juan Nieves’ signal to
Salty. The impassioned fans begin to holler
again. Salty signals the 6’3” Buchholz a 1-2-3-2-1 with his fingers and moves his mitt
high. The first “1” is a fastball, “2” is an even number for an inside location, and the rest
is a decoy should Cano steal the signs and relay them to Youk. Salty and Buchholz have
agreed with 2 strikes to throw a fastball high and inside to set up their possible follow-up
pitch. Both sides have big data insight into the pitch speed, trajectory, and spin.
2014 EMC Proven Professional Knowledge Sharing 5
Wielding a bat like a gladiator’s sword, Youk is about to go into combat against 9
defenders. Youk and Buchholz are ready, both team’s benches and bullpens are
focused, and the fans are making Fenway Park vibrate with excitement. Buchholz has
an incredibly difficult job - release the ball at the precise moment and with the correct
spin and angle, because if it is off by a fraction of a millisecond or degree it will either
miss the strike zone, or Youk might hit it and tie the game in a mere matter of seconds.
Prior to the game, the relevant big data was sliced and diced and presented on the
pitching coach’s iPad. Various game parameters such as his opponents, nightly rest, by-
inning pitch counts to tally endurance factors, and even the day/night impact on each
type of pitch Buchholz throws could be factored in. This helped Buchholz and Salty
develop a pitch strategy for the game based on the humidity, wind speed, and the
strengths and weaknesses of the Yankees. In
general, if Youk hits a pitch, there is a 40%
chance it will be a ground ball (GB), 40% a fly
ball (FB), and a 20% chance of a line drive
(LD)3. Youk has also done his pre-game iPad
homework and knows Buchholz favors the fastball in a 2-
strike count. He expects the pitch to cross the plate in an area
high center to lower outside, so he prepares himself mentally
to go with the pitch and try to hit it to right field. Buchholz and
Salty are going to attempt to deceive Youk. The runners
increase their lead, Salty shifts from center of home plate to
the right side, and Buchholz, throwing from the stretch position, begins his delivery.
Buchholz throws a ball with an outer shell made from two pieces of leather laced with
108 red stitches. The stitches are 7 hundredths of an inch above the smooth surface and
push against the airflow, causing it to rotate at hundreds to thousands of RPM
depending on the pitch. The red stiches on the seam’s axis can appear to the
batter as a “red dot” when the ball is released. Hall of Famer Reggie Jackson
says “If you can’t see the rotation, you have to be able to recognize if it’s a curve ball or
a slider. And if you can’t, you’re not going to be a major league player. Anyone who can
hit above .270 can see the red dot. Anyone who saw that dot on the ball, you knew it
was the slider. If it was a really big dot, you knew it was a hanger and you could hit it
out.”4
2014 EMC Proven Professional Knowledge Sharing 6
In a tenth of a second, the ball completes 2 revolutions, travels 13’, and air resistance
starts to slow it down5. The stitches break up the airflow and reduce drag, enabling the
ball to go further than if it was smooth. Nicknamed the “Greek
God of Walks”, Youk has a keen eye for the ball with 20-11 visual
acuity (he sees from 20’ what most people see from 11’)6. He
spots the rotation and by processing the pitcher’s release point,
his hands and fingers, the ball’s spin, and probability of what Buchholz likes to throw,
surmises it’s a 4-seam fastball. The ball’s trajectory is inside and high, not low and
outside as Youk had orignally guessed. He must “protect” the plate at all costs and
prevent getting a 3rd stike, so in the next .15 seconds, he “tells” his body to swing
defensively. Salty also recognizes he is out of position and immediately moves his glove
to the left. The ball is now 25’ away from the plate. Youk’s knees flex as his weight shifts
from a flat stance to his back right leg, and his torso twists clockwise. His left foot strides
12” out and he shifts his weight forward as his coiled torso generates a massive amount
of upper body torque. It is a 90 mph fastball. Drag has slowed it by 7 mph. It travels the
55’5” between his release and home in .434 seconds assuming linear decelaration.
It is high. Gravity pulled the ball down 2.92’, offset by a backspin of -.98’, it is about to
cross the plate almost 2’ lower that its release point. Just 23’ and 1/7th of a second away,
it drops at a 100 downward trajectory as Youk’s hips begin a counterclockwise turn. He
begins his swing by maneuvering the bat through its arc and shifts his weight from back
right to front left. If he blinks, he would lose sight for .05 seconds, and it will be all over. A
camera flash might cost him .15 seconds7. Youk’s brain and muscle memory instruct his
hands for a high and inside fastball. His 31.5 ounce model T141 Louisville Slugger round
maple bat is moving at 85 mph and collides with the spinning cork, yarn, and leather-
MUST DECIDE in .03 seconds to swing and
another .03 seconds to swing high,
low, inside, outside
Brain processes information, gauges ball speed and
location.08 seconds
The batter sees the ball, sends image to brain
.1 seconds
Drag decelerates Buchholz’s 90 mph fastball to 83 mph as it
approaches Youkilis and crosses home plate in .434 seconds
Brain signals
the legs to
start stride
.03 seconds
2.75" diameter bat
must hit 3" diameter
ball within 1/8" of
dead center in ±.007
seconds.01s swing starts
.05s can still check
.15s power swing
55' 50' 45' 40' 35' 30' 25' 20' 15' 10' 5' 0'
Sec Traveled:.434s .27s .24s .18s .1s
Feet Traveled:55' 32' 29' 22' 13'
2014 EMC Proven Professional Knowledge Sharing 7
covered 5.2 ounce hardball for a millisecond “…with 6,000-8,000 pounds of force”8.
“SMACK” is heard as the impact produces a Batted Ball Speed (BBS) of 120 mph.
His timing has to be within 5 milliseconds in one of
two dimensions to hit a fair ball. Fortunately for the
Red Sox, the swing’s 2nd dimension is off. The ball
misses the bat’s sweet spot by just over ½” at -220
instead of 390 and Youk harmlessly fouls it off.
The count remains at no balls, 2 strikes. The runners go back to their bases and
everyone resets, including the fans. Red Sox manager John Farrell, a former pitcher,
signals Salty for a curveball down and away based on how Youk handled the last pitch.
Yankee manager Joe Girardi, a former catcher, also believes it will be a curve and
signals that to Robby Thompson. TV viewers are told to expect a changeup.
The Red Sox bench coach gestures his outfielders Jonny Gomes, Jacoby Ellsbury, and
Shane Victorino to the right since Buchholz will toss
a curveball low and outside. They are excellent
outfielders, but the coach knows from FIELDf/x9 big
data that correct positioning could make the
difference between a hit and an out. Outfielders
generally cover 23’ in a second. Their reaction time
is 0.35 seconds forward, 0.52 seconds sideward, and 0.65 seconds backward creating a
defensive oval10.
Buchholz agrees to Salty’s
finger-signals for a curveball.
The fans are standing and
the noise is deafening.
Before the game, Youk’s iPad displayed
Buchholz’s PITCHf/x chart to the right11.
His hitting coach, Kevin Long, had said:
1) Buchholz tires after 105 pitches 2) Buchholz likes to use his curveball 3) Buchholz’s curveball slows down to 74-75 mph after 84 pitches.
2014 EMC Proven Professional Knowledge Sharing 8
Youk readies for a slow curveball. The pitch is on its way. But instead
of low and away, Buchholz’s 110th pitch is coming down the center of
the plate. With an unquenchable desire to be tonight’s hero, Youk’s bat
strikes the curveball at 390 and rockets it towards deep center field.
In the physics of baseball, the mass and velocity of the bat and ball
determine the “conservation of momentum” and predict the equal and
opposite forces at that instant. The “coefficient of restitution” or the bounce
of a ball after it is deformed by the bat, causes the ball to gain forward motion because
of kenetic energy. The bat deforms slighty before springing back to its natural shape –
notice its shape versus the yellow line. Gardner and
Cano are off with the “crack of the bat”. Ellsbury, out
of position, instinctively turns to his left, sprinting
towards the green wall. The fan’s chanting instantly turns into a prayer.
If the ball falls in, the Yanks will lead as Gardner is nearing home and Cano is sprinting
from 2nd with the go-ahead run. Some in the crowd scream to the baseball gods for
Ellsbury to make the catch. Others turn silent as all hope is seemingly gone. Based on
where he was playing, he needs 4.4 seconds to reach the hit, but only has 4.3 seconds
as the ball sails over his head. Gardner ties the score at 3-3. Shortstop Stephen Drew
and 2nd baseman Dustin Pedroia race to short center field to catch the relay throw. Cano
has crossed 3rd base on his way home. Youk’s
smash hit the base of the wall near the “Stop N’
Shop” sign and ricochets away from Ellsbury. In this
battle of wills, Cano crosses home for the Yankee
lead as Youk lumbers into 2nd base with a double.
The Sox, with a glorious victory seemingly in hand, now trail by a run in the top of the 9th
with 2 outs. Left-handed reliever Craig Breslow replaces Buchholz and strikes out left-
handed designated hitter Travis Haffner to end the inning.
The Yankees in the bottom of the 9th bring the all-time leader in
saves and future Hall of Famer, right-handed Mariano Rivera, to the
mound. His devastating cut fastball fools the best of hitters. The cutter’s two
finger fastball grip is off-centered against the ball’s seams. Delivered like his fastball, it
has dynamic late movement half-way to home – the distance for which a batter has to
2014 EMC Proven Professional Knowledge Sharing 9
decide to swing or not.12 It breaks down and in towards a lefty or away for a righty just
before it reaches home. His WHIP
is 1.00013. Legendary Pedro Martinez, by
comparison, had a career 1.054 WHIP14, while the 2013 MLB average was 1.3015.
First to bat is Salty, a switch hitter batting left-handed. The Yankees shift players to the
right based on FIELDf/x charts. PITCHf/x’s heat map shows Rivera’s distinct cutter
locations when he faces a lefty or a righty,
and how well he succeeds at avoiding the
center of the plate16. During his career,
lefties are batting just .209 and righties
.21517. Salty pops up the 1st pitch cutter to
2nd baseman Robinson Cano for the 1st out.
Right-handed hitting 3rd baseman Will Middlebrooks is ready to face Rivera. He is fed a
steady diet of outside and low cutters and hits a ground ball to 1st baseman Lyle
Overbay for the 2nd out. Boston desperately needs a run to tie. Left-handed hitting
shortstop Stephen Drew, batting .249 with 12 home runs, is their last hope. A key to
Rivera’s success is that batters who do hit his cutter do not hit it
well. Rivera jams Drew inside with a cutter on the 1st pitch that
breaks his bat – not uncommon against Rivera. The ball is tapped
to Overbay for the 3rd out as the Yankees beat their arch rivals.
The crowd is silent.
Introduction – Baseball, Big Data, and Advanced Analytics
The modern game of baseball dates back to the late 1840’s and is one of those pure
sports where statistics has thrived. Attend a game today and a giant scoreboard displays
the batter’s picture, his batting average, home runs, RBIs, and other information. Fans
on either side of you may have bought a program with the latest statistics on each
team’s players, as well as articles, paid advertisements, and the all- important blank
2014 EMC Proven Professional Knowledge Sharing 10
scorecard for each side. To the right is a fan’s
scorecard used in the late 1800’s. Serious fans
pencil in the player’s last names and uniform
numbers, then summarize each pitch, foul ball, hit,
out, walk, error, and other events in coded
shorthand eerily close to those used at the dawn
of the game. Fielders have positional numbers
such as 1 for the pitcher, 2 for the catcher, and so
on. While the concept hasn’t changed much, the scorecard is now available as an
Android and iPhone application that helps keep score and links fans into the powerful
work of baseball statistics and sabermetrics (“Society for American Baseball Research”
plus “metrics”).
Core “counting” statistics such as at bats, runs, hits, singles, doubles, triples, home runs,
and more originated with the game, and became integral to America’s pastime with the
formation of the Elias Sports Bureau in 191318. In 1947, the Brooklyn Dodgers hired a
full-time statistician. Modern-era baseball
managers have used statistics to run their
team, perhaps none more famous than the
fiery Hall of Fame Baltimore Orioles manager
Earl Weaver. Well before the birth of the
personal computer, Weaver tracked pitcher-
hitter matchups on index cards like the one on
the right to fine-tune his platooning system19
and make pitching changes in the 1960s.
Managers like Weaver kept charts like the
one on the left which shows the 1st and 5th
innings were best for scoring runs20, perhaps
because the pitcher has yet to reach his
groove, then tires, or has already shown the
batters his repertoire a few times that game.
They kept situational charts, such as the
best time to bunt21, or with a runner on 3rd base and 1 out, they could expect to score on
average .920 runs22. Bill James evolved the statistical art through his baseball books and
2014 EMC Proven Professional Knowledge Sharing 11
Runs Scored from Situations
Runners No Outs 1 Out 2 Outs
--- .537 .294 .114
x-- .907 .544 .239
-x- 1.138 .720 .347
--x 1.349 .920 .391
xx- 1.515 .968 .468
x-x 1.762 1.140 .522
-xx 1.957 1.353 .630
xxx 2.399 1.617 .830
Value ?
2013 # Cost # Cost Per
Games 160 $20,625 44 $636,364
Appearences 673 $4,903 181 $154,696
Walks 72 $45,833 23 $1,217,391
Hits 167 $19,760 38 $736,842
Runs 103 $32,039 21 $1,333,333
Home Runs 53 $62,264 7 $4,000,000
Chris Davis Alex Rodriguez
is credited with the name sabermetrics in 1980. Sabermetrics
derived new statistics from the basic ones, such as Slugging
Percentage
, On-Base Percentage
, and On-base Plus Slugging
.
Michael Lewis’s “Moneyball: The Art of Winning an Unfair Game“ documented the birth
of modern baseball analytics and the actions of Oakland Athletics general manager Billy
Beane. Beane bucked the preconceived notions of his scouts and explored analytics to
find a better, more affordable way to assemble a winning team. Rather than
concentrate on batting averages, RBIs, and stolen bases, he focused on new
variables like on-base percentage and slugging percentage to find under-
valued talent. Moneyball became legendary when Brad Pitt starred in a 2011 movie
based on the book. Moneyball has become an adjective for finding hidden value, and a
guiding mantra that benefits individuals, corporations, and nations, not just baseball
teams.
In 2002, Beane could only spend $41 million to compete in a league of expensive stars
and richer teams like the Yankees (with their $125 million budget). That year, both the
Athletics and Yankees won 103 games. One of Beane’s approaches was to find
undervalued, patient players who helped produce runs by walking23, in lieu of overvalued
players with high metrics such as batting average and stolen bases. Oakland made the
playoffs from 2000 through 2003.
Chris Davis is one of the better undervalued stars these days24. He was a 5th round 2006
draft pick of the Texas Rangers, who traded him to the Orioles in 2009. By 2013, Davis
was making $3.3 million or 3.5% of the team payroll, and led both leagues with 53 home
runs. In comparison, Yankee’s Alex Rodriguez is the highest paid player at $28 million a
year – more than the Houston Astros payroll25. How do
they stack up?26 Davis was “paid” $62,264 for each of
his 53 home runs while Rodriguez got a whopping
$4,000,000 for each of his 9 homers.
2014 EMC Proven Professional Knowledge Sharing 12
Concepts like big data have turned raw data into useful information for the fans, players,
coaches, managers, and other organizations that promote baseball. Over the last 5
years, sensors, high-speed digital photography, and Doppler radar have led to an
explosion of baseball data.
Fans especially love the new
and exciting ways to follow their
heroes, such as this real-time
big data Internet view of balls
and strikes crossing home at
the July 6, 2013 Yankee-Oriole
game.
While baseball is still just a game, batters are always looking for an advantage over the
pitcher and vice versa. If a batter knows what type of pitch they will get, they gain a
significant upper hand by being better prepared to hit it! Big data technology may get
them that leg up. At the 2012 Sloan Sports Conference, a paper “Predicting the Next
Pitch“27 discussed an advanced analytics approach whereby batters could dramatically
increase their knowledge of what the next pitch would be. Rather than guessing or
looking for a specific pitch, like a fastball, their big data analysis of 2008 data showed
they could increase the chances of getting that fastball from 18% to an amazing 311%!
(Note: MLB rules limit players and coaches to printed reports during a game – i.e. no
electronic real-time data can actively influence play on the field28.)
Moneyball’s predictive statistics created radical theories that brought significant changes
to a team’s structure. Now, big
data is adding new insights and
dimensions to the game that never
existed before. Fans expect more,
and big data is ready to deliver29.
They are exposed to the “how” and
“why” of player actions, leading
them to experience the game at a whole new level. Teams are employing big data to find
overlooked domestic and international baseball talent they never knew existed, as well
2014 EMC Proven Professional Knowledge Sharing 13
as finding relationships between people and merchandise, ticketing, and the media. That
is big data. Beane’s statistics are just a part of it.
Big Data and Baseball - Players, Coaches, Trainers, and Managers
We have seen a fraction of the statistical metrics and analytics employed by baseball.
Even before computers, baseball was about tracking and improving. With the advent of
computers, new metrics and ways of thinking
about the sport emerged. In 2007, tools like
PITCHf/x escalated the amount of baseball
data, dwarfing the data collections that went on
before it30. Pure statistics can distort the game –
a run, hit, or out. PITCHf/x and tools like it add
insight and enhance the game. In just the last 5 years, advances in computer science
have unlocked a Pandora’s Box of data and analysis.
Players always relied on their coaches, and in the early years of the game, some had
photographs of their opponents or their own swings. Then came motion pictures shot on
film. Eventually, each team developed a VCR tape library of each hitter’s at-bats in
addition to footage of opposing pitchers and their arsenals. Hall of Fame right-fielder
Tony Gywnn played 20 years for the San Diego Padres with a career .338 batting
average and a .847 OPS31. Going back to 1983, he recorded his own games on video
tape, played it back on one VCR and recorded his swings on another VCR. He would
then further edit the “swings” tape into 3 more tapes – good at-bats, at-bats with hits,
and just the swings that produced the hits32. Gwynn was a trendsetter in this regard.
These days, pitchers can graphically review the opposing lineup before the game
through a scouting big data iPad. They examine batter
strengths and weakness against the variety, location,
and speed of pitches they throw to see if it gets them a
strike out, pop up, or a (less fortunate) home run in game
situations. After the game, the data helps the pitcher see
what he did well and areas needing improvement. A
hitter can likewise study an opposing pitcher and dissect each pitch to gain an
advantage the next time he faces him33. It can improve the predictability of pitches by
2014 EMC Proven Professional Knowledge Sharing 14
2009 FB% 2011 FB%
All 74.5 72.1
1-0 69.5 67.3
2-0 82.1 79.7
2-1 69.1 66.8
3-1 85.8 83.5
Fastball decline - hitter's counts
12.5% to 50% by leveraging data relationships such as “…Pitcher/Batter prior,
Pitcher/Count prior, the previous pitch, and the score of the game.” 34 Bloomberg Sports
offers this as a software subscription to players.
Big data has already impacted the game. It used to be that when a batter was ahead in
the count, they expected a fastball. Between 2009 and 2011, the percentage of fastballs
has dropped almost 2.4%35. The pitcher and pitching coaches are
employing dynamic, less predictable mixtures than those that have
occurred over the last 150 years of the game.
Big data technology helps bring excitement to baseball. Technology researcher Gartner
Group says “Big data is high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making.”36 With metrics that change over time, high-
volume is measured in hundreds of terabytes or petabytes, high-velocity in milliseconds,
and high-variety involves complexity, using rows and columns to process structured,
semi structured, as well as unstructured data such as social media and tweets. You
know you have a big data problem when the dataset’s size exceeds the ability of typical
database software tools to capture,
store, manage, and analyze it. Big
data analytics company Vertica
uses the illustration seen here to
define the typical dataset37.
A fundamental part of the big data definition is management of disparate data. A typical
sabermetric database might have hundreds of thousands of records, but they run into a
problem when they couple that database to scouting and medical reports, video, radar,
camera tracking, crowd size, crowd noise, home/away, defensive shifts, hotel
accommodations, length of road travel and duration, slumps, weather, health and fitness
data, team chemistry, financial contract length and value, bonuses, trade data, and other
information. In a few years, wearable sensor player data may need to be integrated (they
are unlikely to be put in the balls themselves). And that is just the start. Let’s see how
various products use big data to help analyze, predict, and optimize player performance,
and revolutionize baseball by supplying data that never existed before.
2014 EMC Proven Professional Knowledge Sharing 15
Sportvision
Emmy Award-winner Sportvision is a sports broadcasting technology company that
introduced K-Zone in 2001, the same year the Gartner Group
published their big data definition38,39. Moneyball had not yet hit the
presses, “the” baseball tool was Microsoft Excel, and all MLB statistics
totaled less than 2% of the baseball data collected today. Sportvision’s COMMANDf/x,
FIELDf/x, PITCHf/x, HITf/x, and SCOUTrax systems thrive on big data in baseball,
capturing 2.5 terabytes of data a game and 6 petabytes for all regular season games
played by 30 teams. COMMANDf/x analyzes the position of the catcher’s mitt to
determine if the pitcher hits the agreed upon spot. FIELDf/x digitally tracks the exact
location of every player and hit, and for the first time, brings movement analysis to the
game. Sportvison general manager Ryan Zander says “It’s almost overwhelming how
much data we’re creating.”40 While some of the data is non-action, such as pitcher
warm-ups, birds flying, and players waiting between innings, overall they collect 2-2.5
million records per game.
Sportvision is on the cusp of the big data movement. Much of their revenue comes from
selling data generated from missile tracking concepts. Its 2006 PITCHf/x product used
in-house compute and storage technology. Along the way, they changed computing
models and currently employ multiple replicated cloud databases using Amazon Web
Services (AWS) and business intelligence/visualization software from Domo41.
Sportvision handles streaming data from every MLB game, sometimes concurrently, and
performs real-time analysis as well as split-off data feeds for live Internet and TV action.
Sportvision focuses on giving fans a richer experience while optimizing the management
and marketing of sports businesses. While this article focuses on MLB, their amazing
enhancement techniques are also employed in football,
hockey, golf, basketball, soccer, stock car racing, and the
Olympics. For example, their Emmy Award-winning “1st
and Ten Line”42 uses a yellow virtual marker to show
football fans where a team has to go to reach a first down.
Their MLB product suite illustrates how big data computer vision impacts the sport.
Algorithms measure 3D objects such as people, baseballs, and background shapes by
leveraging artificial intelligence, cognitive science (study of the mind and its processes),
2014 EMC Proven Professional Knowledge Sharing 16
and machine learning. PITCHf/x visualization of the ball crossing the plate and
underlying data reach fans in stadiums and at home through the Internet, mobile apps,
and televised games. Trajectory and ball speed off the bat is captured and portrayed by
HITf/x. Sportvision data allows us to objectively answer questions that burn in the hearts
of baseball aficionados everywhere, such as:
Who is the best center fielder when playing shallow?
Which double-play shortstop/2nd baseman combinations are the most efficient?
Where should the infielder position himself for a relay throw from the outfield?
Who has the weakest/strongest throwing arms from the outfield?
While the Oakland Athletics used Moneyball concepts to find undervalued players, tools
like PITCHf/x and FIELDf/x now employ advanced big data metrics unavailable in that
era. These tools precisely measure player performance and calculate if they are in the
top 10% of batted ball velocity, use optimal paths 85% of the time to make a catch, or
whether they have 98% throwing accuracy – a real-time refinement of Moneyball.
In the future, big data sports analytics has to meet new challenges, such
as the demands that 3DTV will have when depth is added to a
broadcast. While 3DTV has had fits and starts, fans thrill to the batters
view of the ball as it approaches the plate, or root for a fielder trying to catch up to a
batted ball. Higher definition video is just around the corner. 4K Ultra HDTV has 8-mega
pixels or 4X the resolution of a 1080P 2-mega pixel HDTV, and will put a strain on this
real-time big data system. Who knows – fans may want to see the stitching on the ball or
the grains of wood on bats when this staggering level of detail becomes common place.
PITCHf/x
Sportvision’s PITCHf/x system tracks every pitch
in every game using three permanently mounted
digital cameras to follow each pitch at 30 frames
per second (FPS) through a centralized tracking
system. First installed in 200743, one camera is
placed high over 1st base and the other high over
home plate. A third centerfield-mounted camera
stays focused on home plate to establish the
strike zone since it differs for each batter.
2014 EMC Proven Professional Knowledge Sharing 17
For each time t, 9 values are collected: 3 positions x, y, z 3 velocities vx, vy, vz 3 accelerations ax, ay, az
Each camera is attached to its own computer allowing each image to
be instantly scanned for a baseball. If one is found, the next frame is
checked for a ball in-flight to make sure it is a pitch. The cameras
triangulate the ball’s exact mid-flight location through real-time
digitally processed pattern-recognition of each video frame. Raw
data is transformed into (x,y,z)
coordinates of the pitch. PITCHf/x
calculates 60 data points per pitch including the release point, spin,
speed, break, arc, location, and where the ball crosses the plate to within an inch. It
algorithmically determines 8 pitch types: fastball, sinker, curve,
changeup, slider, split-finger fastball, cut fastball, and a
knuckleball44. The real-time data is used by broadcasters and
internet providers like ESPN and Bloomberg, and augmented
by staff at the game with box scores and live play-by-play.
The combined data stream is transmitted to servers and the video frames are converted
into XML data files. Fans in the ballpark can follow the PITCHf/x action on the center
field screens or almost instantly on their iPads. At home, fans see PITCHf/x enhanced
TV graphics as the ball is in flight (broadcasters may call it by their trade name). Multiply
this gigabyte-per-pitch activity by the often simultaneous 15 league games at the same
sub-second, all processed by Major League Baseball Advanced Media (MLBAM), and
you get an incredible high-volume and high-velocity big data feed. The XML data is
available online for free within seconds. This bonanza of baseball insider information
includes the details of a pitcher's repertoire and their approach to a batter.
The ball’s speed, location, and trajectory are precisely measured as soon as the pitcher
releases it until it crosses home plate. It is algorithmically possible to discern the effect of
a fast versus slow pitch, the ball’s break up or down, left or
right, and its forward or reverse spin. As a
result, its initial velocity, gravity, drag, and
the Magnus force45 can be quantified and
displayed. Positional data allows the ball to
be shown from above looking at home, a
pitcher’s view towards home, and the
batter’s view looking at the pitcher.
2014 EMC Proven Professional Knowledge Sharing 18
PITCHf/x produces statistics that sabermetricians only dreamt about. More than strike,
ball, and hit outcomes, it gives information on the ball’s path and exactly where it
crosses the strike zone. A trail can be
graphically added to the ball for instant
replays46. The data also allows for follow-
up coaching analysis or further fan
enjoyment. Cameras log 36 events per
pitch, and there are roughly 300 pitches a game not counting 8 warm up pitches every
half-inning. With 15 match ups (30 teams) and 162 games a year, the system records
over 26 million real-time XML records a season. Compared to the older radar gun
approach, PITCHf/x is analogous to visiting a doctor for a continuous electrocardiogram
as opposed to just checking your pulse. The radar gun gives a single pitch observation
while PITCHf/x gives the set of flight dynamics.
PITCHf/x allows for extensive automated and accurate data on velocity, pitch type,
location, and result. It has tracked over 4 million pitches from 2007-2011 and is used by
16 leagues across the U.S., Canada, South Korea, and the Dominican Republic47.
Unfortunately, the system cannot go back in time to analyze pitching great Sandy Koufax
or home run slugger Babe Ruth – we have incomplete views of these legends.
Coaches, managers, and scouts use various tools to eke an out
an extra win and better their playoff chances. Fans can use the
same technology to enhance the game. Every year there is
something new and exciting to learn about this game. For
example, big data tools like PITCHf/x are available on smart
phones. Other tools slice and dice the data into amazing analytics
such as pitch plots of horizontal movement versus speed.
Fans love the way PITCHf/x shows the exact spot where the ball crosses the graphical
plate48, especially when they think the umpire has blown the call, which happens about
every six called pitches that do not involve a swing according to baseball author Tobias
Moskowitz49. Truth be told, making 300 split- second correct decisions a game on a ball
darting through the air at 90 mph is hard to do. It is also possible the umpire is biased
against a batter or pitcher and calls a strike on an actual “ball” and vice-versa. PITCHf/x
is invaluable in determining the accurate and optimal strike zone location. It shows which
2014 EMC Proven Professional Knowledge Sharing 19
Year W L ERA H/9 HR/9 BB/9 SO/9 OPS
2008 0 0 1.93 5.8 0.6 2.6 7.7 .501
2009 10 7 4.42 8.3 1.2 3.8 7.2 .716
2010 19 6 2.72 7.3 0.6 3.4 8.1 .637
2011 12 13 3.49 7.7 0.9 2.5 8.7 .659
2012 20 5 2.56 7.4 0.7 2.5 8.7 .602
2013 3 5 3.94 9.5 1.0 1.6 7.5 .720
batters swing at bad pitches, and it is easy to chart strike zone consistency
by pitch type and velocity. In this PITCHf/x illustration, Ichiro Suzuki of the
Seattle Mariners felt a pitch was outside (it is) and was ejected from a
game after disagreeing with an umpire’s “strike” call.
Referencing the charts below, umpire Eric Cooper called the action for a
2012 New York Mets (squares) game against the Orioles (triangles)50. The
chart on the left was against left-handed batters, calls on right-handed batters in the
chart on the right. Red is a called strike and green is a called ball. It was remarkable that
Cooper was correct in most of his judgment calls. Big data in action!
PITCHf/x can also help analyze the ball’s in-flight behavior. In this 7 pitch sequence, the
speed (SPD), break (BRK – the arc
of the ball - similar to gravity), left-
right horizontal movement caused by
its spin (PFX), pitch type, and the
result are shown. Notice the big
difference between the speed and the break of the curveball versus the fastball.
Here is another example of PITCHf/x helping to pinpoint pitching issues. David Price is a
27-year old left-handed Tampa Bay Rays ace who won the
2012 American League Cy Young award with a 20-5 record, a
2.56 ERA, and a .602 OPS. He made $4.3 million in 2012 and
$10 million in 2013. Great things were expected of him, so
fans were concerned with his 3-5 record over the first 12 games of 2013. His ERA
jumped up 53%, hits per 9 innings were up 28%, home runs/9 were up 43%, strike
outs/9 were down 14%, and OPS was up 20%. What was wrong? If this was just a short-
2014 EMC Proven Professional Knowledge Sharing 20
term issue, fans would not have been overly concerned, but the problem occurred start
after start. BrooksBaseball.com’s PITCHf/x data showed his four-seam fastball (little
movement) was down 1.9 mph and his sinking fastball (downward and horizontal
movement) down 2.3 mph. The slower these pitches are, the less effective his changeup
is. His reliance on faster pitches dropped 5% while favoring slower pitches by 5%.
With PITCHf/x data publically available at no
charge, a fan used it to create this custom graphic
to the right showing Price’s issues51. From a
batter’s perspective, you can see the movement
and ball’s speed as it leaves Price’s hand. The fan
noted it is not good to have a fastball and sinker
(circled in red) look similar to the hitter and that
Price’s curveball was more effective with a larger
drop in 2012 than during his first 12 appearances
in 2013.
R.A. Dickey, winner of the 2012 National League Cy Young award, is one of the premier
knuckleball pitchers. A knuckleball is aerodynamically difficult to hit. It rotates once or
twice after release whereas a fastball rotates 16-17 times. Generally thrown between 60-
80 mph, a knuckleball wobbles on its way to home, and is further
influenced by the wind and drag. On June 18, 2012, Dickey faced the
Orioles and had 13 strikeouts in a 5-0 shutout. This PITCHf/x
visualization shows his 114 pitches from the catcher’s viewpoint, with
84% of them being knuckleballs, - red strikes and yellow balls52.
2014 EMC Proven Professional Knowledge Sharing 21
Verlander
2011 # %
min
Vel
max
Vel
avg
Vel
Fastball 2,139 55% 91.1 101.4 95.0
Curveball 729 19% 74.4 84.9 79.4
Changeup 719 18% 82.6 91.0 86.8
Slider 327 8% 81.9 90.6 85.9
Total 3,914
PITCHf/x produces readable XML data which can be easily manipulated a myriad of
ways and is described in upcoming pages.
TexasLeaguers.com uses the exact same PITCHf/x XML data to produce a bird’s eye
view of the pitches53. On the left is Dickey’s release point. The right-hander stands to the
3rd base side of the pitching rubber. His knuckleball “* KN” is incredibly straight, with
virtually no left-right bending after it is released. On the right, the ball dramatically drops
as it crosses home. In its last 15’ of flight, it drops almost 3’ making it extremely difficult
to hit! PITCHf/x provides 50’ of flight from the pitchers release to home, but visualization
services like TexasLeaguers calculate the data closer to the release point and show 55’
of travel. “The ball's path is calculated by plugging the acceleration and velocity data
provided by MLBAM into the standard equation for acceleration: x = ax0*t2 + vx0*t + x0”
54.
Let’s leverage PITCHf/x big data to explore the success of the 2011 American League
Cy Young winner, Tigers right-handed pitcher, Justin Verlander. That year, he had an
impressive 24-5 record, averaged over 7 innings per
game, held batters to a low .191 average, and struck out
over a quarter of his opponents. PITCHf/x shows nearly
half of his 4,000 pitches were 95 mph fastballs, mixing in
a 79 mph curveball (19%), 87 mph changeup (18%), and an 86 mph slider (8%)55.
His success depends on his devastating circle changeup. It is thrown like a
fastball but with a dramatically different grip. You can see the index finger
and thumb form a circle that doesn’t touch the seams, and the ball is in the
palm of his hand. Verlander’s changeup is 8+ mph slower than his blazing fastball, and
that is the key. His delivery fools the batter into believing the pitch is a fastball, but it
takes .037 seconds longer to arrive. That is enough to throw off the batter’s timing.
When batters expect a changeup, their reaction time to the fastball is too slow. PITCHf/x
shows that the difference in velocity makes his changeup very effective.
2014 EMC Proven Professional Knowledge Sharing 22
Spin Rate RPM AVG SLG
Swinging
Strike
Low Spin <2300 .244 .403 8%
Avg Spin 2300-2500 .185 .340 10%
Plus Spin 2500-2700 .178 .251 12%
High Spin >2700 .166 .212 15%
The more a ball spins, the harder it is to hit. This chart shows batting averages, slugging
percentages, and swinging strike effectiveness by spin rates56. Verlander attributes part
of his success to his 3,004 RPM curveball which rotates 23%
more than MLB average of 2,450 rpm. He threw it 19% of the
time in 2011. PITCHf/x “…calculates the spin rate of pitched
balls based on the fit trajectory,”57 and is an approximation. Video doesn’t directly
capture the rotation, just where the ball is in each frame. It takes more advanced camera
technology with higher FPS and data rates to count the rpm of a white ball with red
seams using video. Fortunately, PITCHf/x can be augmented with additional data such
as TrackMan’s radar system that does capture the spin of the ball58 without a camera.
On the left is Verlander’s tight release point cluster for 201159. The pitch stratification on
the right is very consistent. As a right handed pitcher, his 95 mph fastball and his 86 mph
changeup stay 5-10” to the left part of home plate – speed is the only difference.
The game footage illustration below shows Verlander with PITCHf/x insights from 4
distinct pitch types superimposed on each other with captions60. He mixes 100 mph
fastballs, 79 mph curves, 86 mph changeups, and 88 mph sliders. In this at-bat, veteran
Minnesota Twins Justin Morneau shows multiple swings and misses. All 4 pitch types
had the same motion and release point. The screen shot on the right shows the pitches
have significantly separated with different velocities. In baseball slang, that is called
nasty.
2014 EMC Proven Professional Knowledge Sharing 23
Lincecum
Pitch ID
Start
Speed
End
Speed Break Rotation
Release
Point
2459447 82.9 75.7 5.5 1,042.1 6.12
2576000 86.1 80.0 3.7 689.8 6.05
Coaches today look at release points for all their pitchers when their velocity drops. It is
hard to spot a 3-6” drop from the dugout, but real-time data shows this instantly. The
ultimate use case is in the field of kinesiology where predictive analytics could prevent a
shoulder, torso, elbow, wrist, or hand injury when a “bad” pitching motion is detected.
Suppose you are a right handed batter
facing the Giant’s 2008-09 National
League Cy Young winner, righty Tim
Lincecum, and you want get a feel for his
slider. PITCHf/x big data can display
images of batters similar to you and show
what you might see in a game. The
search finds event IDs #2459447 and
#2576000. This visualization of slider #2576000 leaves Tim’s hand at 86.1 mph and has
slowed to 85.5 mph on its way to home61.
This illustration shows slider #2459447
was just outside at 75.7 mph, spun at
1,042 rpm (8 rotations from its release
until it crossed home), broke to the right
5.46”, and is was called it a strike. The 2nd
slider #2576000 crossed the plate much
lower and at 80 mph, rotated at 689 rpm
(5 rotations), had a little less break to the right at 3.65”,
and the was also called a strike.
2014 EMC Proven Professional Knowledge Sharing 24
Detroit Tigers Miguel Cabrera is a feared hitter. As the
American League 2012 Triple Crown winner, PITCHf/x
shows his hitting strengths and weaknesses62. He crushes
low and inside pitches. Opposing managers use spray
charts to position players where hits are likely to go63. The
legend shows if the ball was an out, error, or hit. In 2012, if
an opposing manager with this small sample size had his
pitcher throw Miguel a curveball, he might use the
following spray chart and have
his 3rd baseman and shortstop
play in the blue and red oval
areas on the right. Spray
charts can be automatically
created by coupling HITf/x and
PITCHf/x data64.
PITCHf/x is also useful to teach “framing”. As the ball crosses the plate, an algorithm can
determine if the umpire will call it a ball or a strike. Good catchers “frame” or move their
glove once they catch the ball in order to influence the umpire’s decision. For example, if
borderline pitches are called strikes 50% of the time, increasing the called strike success
to 60% through framing gives a team a terrific advantage. One study showed that
framing saves an average of 20 runs a season65. Another study found that Rays’ catcher
Jose Molina saved 73 runs one year by framing the pitch66. That difference could mean
a couple of extra victories, potentially worth millions of dollars to a team. Based on
PITCHf/x data, Molina frames 13.4% of balls into called strikes67.
PITCHf/x data is free and in XML format, so let’s take a
much closer look at it. The data is found at
http://gd2.mlb.com/components/game/mlb/. Pick a year
>=2007, a month, and a day, and find the game you want
and the inning. This example is from 9th inning of game 4
of the October 28, 2012 World Series at Comerica Park,
with the Giants 1st baseman Brandon Belt facing Tigers
pitcher Phil Coke. The full URL is
http://gd2.mlb.com/components/game/mlb/year_2012/month_10/day_28/gid_2012_10_28_sfnmlb_detmlb_1/inning/inning_9.xml.
2014 EMC Proven Professional Knowledge Sharing 25
Open the inning_9.XML. The metadata definitions are in the appendix. Coke throws 6
pitches: Ball, Ball, Foul ball, Foul ball, Ball, and Swinging Strike. Scroll down to this
sequence of -<atbat num=”65” for the pitches thrown to the PlayerID #474832 Brandon
Belt (Elias Sports Bureau/MLB numbering scheme). The pitcher Phil Coke is #457435.
Let’s focus on the XML data for pitch #4 that Belt fouls off in the red oval:
<pitch des="Foul" des_es="Foul" id="507" type="S" tfs="230849"
tfs_zulu="2012-10-29T03:08:49Z" x="109.01" y="132.11" sv_id="121028_230849"
start_speed="94.4" end_speed="86.0" sz_top="3.32" sz_bot="1.53"
pfx_x="8.23" pfx_z="10.3" px="-0.315" pz="2.919" x0="2.562" y0="50.0"
z0="6.035" vx0="-10.7" vy0="-137.956" vz0="-6.156" ax="15.708"
ay="33.556" az="-12.449" break_y="23.7" break_angle="-43.3"
break_length="4.1" pitch_type="FF" type_confidence=".676" zone="1"
nasty="41" spin_dir="141.469" spin_rate="2651.720" cc="" mt=""/>
The coordinates for the x-axis are to the catcher’s right, y-axis
towards the pitcher, and the z-axis vertical upward68. XML is hard
to visualize, but it says the four-seam fastball lost 8.4 mph
[start_speed = "94.4" minus end_speed = "86.0"] and
is Fouled off [pitch des = "Foul"].
HITf/x
Whereas PITCHf/x follows the pitcher’s delivery, HITf/x follows the ball’s trajectory off
the bat. HITf/x reprocesses the same PITCHf/x images to extract ball data whenever the
bat hits it – no separate cameras are required69. It tracks the ball to the pitcher’s mound
and measures its speed off the bat as well as vertical and horizontal angles as it leaves
the bat. This helps determine if the ball is a hit but not where it lands. Unlike publically
available PITCHf/x data, HITf/x data is privately licensed, and was only available to the
2014 EMC Proven Professional Knowledge Sharing 26
Date Hitter Team Pitcher Team Ballpark
Speed
off bat Feet VLA HLA
9/1/12 Edwin Encarnacion TOR J.P. Howell TB Rogers Center 118.9 488 24.0 101.5
4/15/12 Travis Hafner CLE Luis Mendoza KC Kauffman Stadium 117.2 481 29.7 64.4
8/17/12 Giancarlo Stanton MIA Josh Roenicke COL Coors Field 116.3 494 24.9 95.5
6/3/12 Nelson Cruz TEX Bobby Cassevah LAA Angel Stadium 116.3 484 26.5 100.1
7/2/12 Cameron Maybin SD Trevor Cahill ARI Chase Field 115.7 485 24.9 98.0
public for a short while in 2009. Much of the HITf/x functionality was incorporated into
Sportvision’s latest effort, FIELDf/x, which is discussed in the next section.
Algorithms kick in at the split-second the ball hits the bat to extract three-dimensional
trajectory and movement parameters. Other data can be integrated such as pitch
tracking, atmospheric conditions, park dimensions, etc. to analyze the expected
path. The course can be affected by the ball’s spin, humidity, temperature, drag,
and wind speed70. On humid days, the baseball doesn’t travel as far because
the yarn in the ball soaks up moisture causing it to absorb more of the bat’s energy.
HITf/x uses 20 optical samples of the batted ball to produce
the bat’s contact point, ball speed off the bat, the elevation
angle, and where the ball is heading71. These angles are key
to understanding the ball’s flight and affecting the defensive
alignment designed to prevent the hitter from safely reaching
base.
The distance a ball travels is highly dependent on
three factors. One is the speed off the bat and its
backspin. Dry balls travel further during warm, humid
weather at higher altitude with the wind blowing from
home to the outfield. HITf/x correlates how hard it is
struck with the odds of it being a hit72. An 85 mph hit
could be a single, with a double, triple, and home run
requiring 95-100 mph. The 2nd factor is the vertical launch angle (VLA), such as 200 (900
would be straight up in the air and 00 would
be a straight line drive). The last one is the
horizontal launch angle (HLA) or spray of the
ball (the location it is headed with 450 being the right field foul line and 1350 is the left
field foul line)73. These angles are unaffected by conditions like the weather, and
mathematically show
whether it is a home run
or in the playing area. In
2012, the hardest hit ball belongs to Toronto’s Edwin Encarnacion (119 mph).74
2014 EMC Proven Professional Knowledge Sharing 27
Let’s use the Colorado Rockies, who play at
Denver’s Coors Field a mile above sea level, to
explain the atmospheric impact on a hit. Thinner air
allows the ball to travel further, so a 400’ home run at
sea level travels 34’ further at Coors75. On August
17, 2012, the years’ longest home run (green
highlight in the previous table and image to the right)
was hit at Coors by right-handed Giancarlo Stanton
of the Miami Marlins off of Rockies righty pitcher
Josh Roenicke on a 6th inning breaking-ball. With HITf/x data augmented by Hit
Tracker76, the ball traveled 494’. It left Stanton’s bat at 116 mph at a 24.90 VLA and a
95.50 HLA. Hit Tracker doesn’t use cameras or radar, but relies on trajectory data to
calculate the path, and HITf/x data for direction, flight time, and absolute landing spot.
Not only is the data exciting to fans, but useful to position fielders. If Yankee right-hander
Phil Hughes faced the Orioles, the Yankee manager would shift the defense based on a
spray chart lineup card of previous Hughes-Orioles duels 77. This shows where Markakis,
Machado, and Davis have hit Hughes’ pitches. The shortstop would move to the right of
2nd base when Davis came to bat. A sufficiently large sample size is clearly required.
Based on this chart where Red Sox slugger David Ortiz hits the
ball78, you might move your shortstop to the right of 2nd base,
your 2nd baseman into short right field, and let your 3rd
baseman play in the shortstop position. The Rays employ a
shift based on spray patterns, and in 2011 their league leading
defense saved 85 runs79.
2014 EMC Proven Professional Knowledge Sharing 28
Scouts can use HITf/x to objectively evaluate prospects and approximate their success
in a team’s home stadium. This helps determine if their observations truly exemplify the
player’s ability, find out which are under/over-valued, or figure out if they are just lucky.
HITf/x can establish a hitter’s true performance profile. If they are in a slump, their batted
ball speed could be compared to their historical results. If they average 80 mph but could
only muster 70 this week, a coach might want to explore the power drop-off. This helps
rule out bad luck as the culprit, such as hitting balls right at fielders, and helps determine
if they have a mechanical issue with their swing. While this is just one indicator, it is
preferable to looking at a declining batting average and explaining they are having a 0-
22 hitting “streak”. Another example is whether game conditions, or the amount of foul
territory in a series of parks, could be causing their short-term poor performance.
FIELDf/x
Most of baseball’s statistical history
focuses on hitting and pitching, with
defensive stats by and large limited to
outs, errors, and double-plays once a
fielder reaches a ball80. A shortstop who
ranges far to his right to scoop up a ball,
leaps, spins, and fires the ball to 1st base
to get the runner out, is merely credited with an assist. A runner who times the pitcher’s
delivery, takes an aggressive lead, sprints to 2nd and slides head-first for a stolen base is
simply credited with a stolen base. To value a hit, the defense and its alignment, crowd
noise, park dimensions, weather, and other factors should be taken into account.
FIELDf/x real-time data tells us how the play happened, not simply that it occurred. With
FIELDf/x, defensive luck can be factored out of their ability to play a position and a
runner is given credit for his running abilities.
Box scores report numeric conclusions – runs, hits, errors, averages, innings, pitches,
etc. Quantifying fielding, throw accuracy, and efficient base running has been elusive
and often limited to a stop watch. A fielder’s skill or lack thereof is treated as a binary
event, wrapped in adjectives or ignored. FIELDf/x revolutionizes player evaluation and
reporting with everything on the field digitally tracked. Even sabermetric enhancements
2014 EMC Proven Professional Knowledge Sharing 29
pale in comparison to the in-depth big data analysis that cutting-edge FIELDf/x provides
– the majority of what playing baseball is all about. Sportvision calls FIELDf/x the “…holy
grail of baseball metrics”81 and is a tremendous leap forward for defensive positioning
and player reaction times. We can now quantify if a fielder took off in the right direction
when the ball was hit, or if he went forward when he should have gone backward. It
leads to precise and measurable objective views of how much a player adds or subtracts
from a team’s capabilities, and how interrelated his actions are to other events on the
field.
With empirical FIELDf/x data, the notion of a stolen base becomes a
subject unto itself. Hall of Famer Rickey Henderson’s plaque calls him
the “Man of Steal” as he holds the MLB record with 1,406 stolen bases
over a 25-year career. If FIELDf/x were in use when he played, we
might learn more about how he “reads” the pitcher, how he evaluated a
catcher’s throwing arm, and how he maximized a 4-5 step lead, sliding
into 2nd base 2.9 seconds later82. Conversely, teams could use data to better defend
against the stolen base. Big data systems show the length of the initial and secondary
lead off a bag against a particular pitcher’s wind-up, catch by the catcher, the catcher’s
throw to 2nd base, and the tag on the runner. It determines how fast a runner needs to be
against specific pitcher-catcher-fielder combinations. Teaching a runner to get a slightly
longer lead could mean the difference between a win or a loss, or reduce successful
pitcher pick-offs if the runner can leave a fraction of a second later and still steal the
base. The variables include the lead length, runner’s speed, the type of pitch thrown, the
catcher’s reaction time and the strength and accuracy of their arm, and whether the
runner slides feet first or head first. The catcher must get their throw to 2nd base in 2.0-
2.2 seconds83.
Until recently, precision data required a point-in-time radar gun. FIELDf/x uses four high
definition Prosilica GX computer controlled cameras mounted at high points in the
stadium to cover the exact position of all offensive and defensive players every second
they are on the field84. Initially installed in 2010 at AT&T
Park (San Francisco Giants)85, motion capture processing
is used on the images to interpret the action. The system
precisely follows the ball as it crosses the plate and
2014 EMC Proven Professional Knowledge Sharing 30
computer vision algorithms turn the offensive and defensive players, offensive coaches,
and umpires86 into trackable graphic images. An operator adds the umpire’s ball-strike
judgment call and people’s names. As with PITCHf/x, details are recorded as the
“…pitcher releases the ball, batter hits the ball, fielder gains possession of a ball (fields
it, or catches it from a throw) and fielder throws the ball.”87 With a hit ball, FIELDf/x
displays its angle of elevation, maximum height, and how far it travels, all while showing
the fielder’s speed to the landing spot and the distance they cover. It captures action at
1/30th of a second, resulting in approximately 50,000 offensive and defensive player
object-recognition measurements88 and 600,000 – 1,000,000 data points89,90 a game. It
results in an amazing high-volume, high-velocity, and high-variety big data feed.
The system provides a bird's-eye view of all on-field action – leads, speeds, routes, and
the ball’s fair or foul trajectory. It reports on a fielder’s range, reaction time, speed to the
ball, throwing velocity, and throwing accuracy to a base or as a relay throw. For
example, 3rd basemen have about 1½ seconds to glove a ball while middle infielders
have an extra ½ second giving them greater range91. FIELDf/x shows if a fielder takes
an optimal path to the ball, or plays better in the daytime or nighttime, or on grass or
artificial turf. And as we’ve already seen, managers use this data to position fielders, and
decide on pitch selection based on the situation, the pitcher, and the opponent.
The long debate over the quickest left fielder, the shortstop with the greatest range, or
who is worthy of a Gold Glove can now be quantified. A full season of
FIELDf/x empirical data makes sabermetric statistics on errors per
chance, assists, or passed balls passé. FIELDf/x metrics can help with
contract negotiations by giving general managers a player’s “hustle
value” comparison, or help pair up shortstops and 2nd basemen based on their combined
time to complete a double play. For example, if shortstop “Rick” threw to 2nd baseman
“John” in .4 seconds and then John threw to 1st baseman “Phil” in .5 seconds, you could
then determine if other player pairings could complete the play in less than .9 seconds.
Here is an example of how FIELDf/x allows coaches to accurately evaluate a fielder’s
defensive worth to a club. The Yankees lead the Rays in the 5th inning 2-1 with 2 outs on
July 2, 2012. Rays Will Rhymes hits an 87 mph pitch from Yankee Freddy Garcia to
center field. The ball is 31’ high as Yankee center fielder Curtis Granderson runs 44’ at
2014 EMC Proven Professional Knowledge Sharing 31
12.9 mph towards it92. The YELLOW path he takes is sub-optimal. Between innings,
coaches might have discussed this with Granderson to understand what went wrong.
FIELDf/x generates an order of magnitude more real-time data than PITCHf/x. Each
camera has two link-aggregated gigabit Ethernet ports that capture action at 30 FPS and
stream it at up to 240 MB/s during the game. They optically track a ball that could travel
600’, as well as 9 fielders, 4 umpires, a batter, 3 base runners, ball boys, and whoever
enters the field. The system can tell players apart even when they cross each other. All
this comes out to 2.5 million records for a 3 hour game93. Combined with other feeds,
Sportvision and MLB have amassed over 30 petabytes of data.94 It is stored and
analyzed in near-real time and is available to a manager for his decisions or to evaluate
player performance. Teams rely on analysts (data scientists) to sift through the daily
data volumes for unknown insights that could lead to an extra run or better defense.
Solid defense is key to a winning season. Correctly positioning your players for each
opponent, park geometries, weather conditions, pitcher/pitch, and other variables can
save runs and help your team get to the playoffs. That requires big data. Sabermatrician
Bill James equated runs scored and runs allowed to a team’s winning percentage95
. For example, a team that scores 800 runs and gives up
700,
, will win 92 games that year. A 5% FIELDf/x
defensive improvement could reduce the runs allowed to 665 runs
. Four extra wins could make or break a season.
This FIELDf/x image shows the ball’s HIT PATH on the right96. A straight line connects
the left fielder’s starting point to where the ball is going. The fielder runs 85’ at 20.1 mph
to the hit at a sub-optimal 3.9 seconds after taking 1.2 seconds to react to it. The path
2014 EMC Proven Professional Knowledge Sharing 32
was 15’ longer than a straight line. His
fielding efficiency is .824
.The catch probability is
calculated in real-time from the digital
photography. FIELDf/x adds a descriptive
language to simple box scores.
These still frames (with graphics enhanced to make the action easy to follow) are from a
FIELDf/x test of an Athletics-Mariners game on September 20, 200897. There is 1 out
with Seattle’s Kenji Johjima at bat, Jeremy Reed on 3rd base, Miguel Cairo on
2nd base, and Bryan LaHair on 1st base. Oakland defensively has 3rd baseman Jack
Hannahan , shortstop Bobby Crosby , 2nd baseman Cliff Pennington , center
fielder Ryan Sweeney , and left fielder Aaron Cunningham .
The action begins with Oakland’s Kirk Saarloos’s pitch to catcher Rob Bowen .
Johjima hits it deep to left field. We know from historical FIELDf/x data that
Cunningham covers 130’ in 7 seconds, or the time a high fly ball stays in the air. His
coach positioned him based on who is at bat and perhaps the type of pitch the pitcher
was throwing. Defensive players go after the hit and position themselves
2014 EMC Proven Professional Knowledge Sharing 33
for relay throws and a possible play at the plate. Runners prepare to tag up in
the event the ball is caught. The ball hits the wall and the runners sprint home. The relay
home is late and two runs score. Runners stay on 1st and 2nd.
These tools allow managers to create optimal defensive plans and decide which
shortstop to use based on weather, altitude, day game after a night game, type of turf,
etc. Coaches learn more about the pitchers their team will face, and the pitcher can
study the batters they will oppose in more detail than a traditional scouting report would
contain. A manager can build a list of pitch sequences for the pitcher to throw to
increase the chances of getting a hitter out. Likewise, a manager can let the hitter know
what type of pitch and its location to expect in a situation – runners on, wind, day/night,
turf, etc. With this level of detail available in real-time, players in training sessions can
get instant feedback on what they did well and what they need to work on98. While the
real-time data can’t actively influence play on the field, it does help improve player
preparation.
Big Data and the Business of Baseball
On the surface, baseball is a simple game and fun to play. For those running a baseball
business, however, simplicity couldn’t be farther from the truth. The overarching
2014 EMC Proven Professional Knowledge Sharing 34
organization is Major League Baseball, and they have 30 team franchisees. MLB owns
the brand, logos, and other aspects licensed for use as defined by a constitution. The
league employs revenue sharing with 31% of each team’s net local revenue distributed
evenly to all teams99. Small market teams like the Kansas City Royals get more money
than they put in, and very rich teams like the Yankees get back less than they contribute.
At a high level, each team is governed by a MLB franchise contract and permitted to
select players, set prices, and operate like a for-profit business. Teams put a “product”
on the field to generate fan interest. With each team having its own business model that
is tailored to its geographic franchise, some teams are quite profitable while others are
lucky to break even. In 2013, the Yankees were the most valuable team at $2.3 billion
with 2012 revenue of $471 million. Somewhat surprisingly, the Rays were the lowest
valued team at $451 million, yet they had 2012 revenue of $167 million100.
To bring in sufficient revenues to keep a franchise profitable, a club must begin with the
“old school” basics of putting a winning team on the field that fans love to watch, since
the fervor of great play in front of a sellout crowd also generates concession revenue.
The club then needs to layer on big data improvements, such as assembling a better
team for less money, establishing a dynamic and profitable customer relationship, and
positioning the team as a marketing and media revenue generating engine.
Consider yourself a team’s general manager, with your goal to assemble a
championship contender. A guideline you might have to follow is based on money
buying success. These are possible strategies you might use and teams that fit that
profile:
Highest salary – expensive veterans, top-of-their-game all-stars (Yankees)
High salary – past their prime stars (Cubs)
Medium salary – primary focus on pitchers, secondary focus on hitters (Giants)
Low salary – primary focus on hitting, secondary focus on pitching (Angels)
Lowest salary – minor league players for MLB minimums (Athletics)
There are many more such variables, combinations, and strategies. You probably do not
want to group hitters with pitchers, and you probably want to add factors like park
dimensions, typical weather conditions, number of day versus night games, travel
schedules, and more. You want left-handed hitters if your ballpark has a short right field
dimension. Some combinations may not make any sense, while others will have a strong
correlation. You will likely get more walks if you have a lot of shorter players. If you want
2014 EMC Proven Professional Knowledge Sharing 35
the oldest players or the heaviest players, you are getting veterans who really know how
to play the game, but may suffer
more injuries or tire easier. Pick
the youngest players and they
might be faster, but have less
experience. Or you could use
Billy Beane-style variables to
assemble a team focusing on
Defensive Runs Saved101. A
strategy of winning at all costs
may attract fans to the stadium, in which case you might want to outspend the other
teams. Any or all or none of these can be part of big data mining in assembling the right
team for your goals. To the right is an example of a strategy that could be tried102.
Leveraging big data cannot only help a team make it to the championships, but generate
meaningful extra revenue at the same time. Tickets are sold at a premium, souvenirs
and concession prices increase, and on cold October nights, clubs sell a lot of hot
chocolate. The 2009 Yankees earned an extra $72 million by playing in the
postseason103. Broadcasters view the World Series as a revenue bonanza, especially
when the teams are in top markets and the best-of seven series goes to 7 games. For
example, the 2009 World Series featured the Philadelphia Phillies against the Yankees.
An average of 19 million viewers saw the first 6 games jumping to 22 million viewers for
Game 7104. Championship games also bring additional revenues to local businesses
such as bars, restaurants, and MLB merchandise retailers105.
MLB players earn a contractual minimum of $490,000 in 2013 and the highest paid
player makes $29 million106. In 2011, Phillies pitcher Cliff Lee signed a $120 million 5-
year contract that earned him $25 million in 2013107. That equates to $8,067 for each of
his 3,099 pitches he made that year108. Teams collectively spend over $3 billion on
player salaries109, and try to determine each player’s worth based on their contributions.
Each team typically has 225 major and minor league players, plus 100-200 other full-
time employees, 1000-2000 part-time ushers, game-time concession sales people, etc.
Hundreds of thousands of direct and indirect workers make part of their living from
baseball, including parking attendants, police/security, hotel staff, peanut farmers, gas
station attendants, garment workers, newspaper writers, TV and radio commentators
2014 EMC Proven Professional Knowledge Sharing 36
and engineers, etc. In truth, baseball is a simple game played by kids in 45 countries110
and viewed by fans of all ages worldwide, while backed by a trillion dollar global industry
that touches the lives of a significant portion of the planet in one way or another.
Businesses hate uncertainty, and baseball is anything but certain. We want to know the
future, but we don’t. In the business of baseball, you have a lot of data at your disposal,
and need to make a lot of decisions, with the goal of making the business viable despite
this uncertainty. That is where big data helps the most – discovery:
What roster of players should be assembled?
Players get injured - who should replace them and how will it affect the team?
How long should their contracts be and how much are they worth?
How to predict the future performance of a player?
What arm angles and pitcher velocity changes can predict medical issues, and how to use medical care and conditioning to get them and keep them healthy?
With big data systems like PITCHf/x and FIELDf/x, each prospect might have a hitting,
base running, fielding, and throwing score during a game. Over the course of a season,
players would have an empirical rating based on their play – a comprehensive report
card. For example, if college player “Fred” has a 92 rating and “Bill” a 96, teams might
offer Bill a higher signing bonus based on their ratings. The approach would also work to
evaluate and compensate older players who substitute experience for raw athleticism as
their career wanes. Some executives might even employ personality and/or medical
testing to find the best athletes that can earn the team tens of millions more dollars.
Player Development
Fans yearn to be part of an exciting team – great
pitching, power, speed, and defense. They come
to the park to see an exciting game and cheer
their team to victory. All clubs would like to field a
team of feared all-stars, but their collective cost is
too high. The 2013 Yankee player payroll was ten
times that of the Astros. It is true that generally
the team with the best, most expensive players
tend to win more games and make it to the playoffs more often than teams with the least
expensive players, but it is far from a certainty. Injuries to star players can quickly turn
expensive teams into mediocre ones. This wins per dollar graph from 2001 to 2010111
shows teams above the line won more often than those below the line without an
2014 EMC Proven Professional Knowledge Sharing 37
ironclad correlation to payroll. The Athletics won an average of 87 games at $804,597
per win while the Yankees spent $2,244,897 for each of their 98 wins. There is obviously
more to a sucessful team than player’s paychecks.
In 2013, teams with above average minor league affiliates were the Cardinals, Mariners,
and Rays112. Prospects come from high schools, colleges, and key locations around the
world. In 2012, 27% or 224 MLB players came from the Dominican Republic (87),
Venezuela (58), Canada (15), Japan (13), Cuba (11), Mexico (9), Panama (7), and
others113. In the pre-Moneyball era, you scouted these locals to find talented players for
bargain-basement contracts. Big data has made the task a bit easier. With empirical
data and the power of the Internet, scouts can “watch” every game from their office and
analyze potential players before they hop on a plane or drive to visit them in action. They
can get a feel for a pitcher’s habits and understand what they throw in game situations.
Are they effective when behind 3 balls no strikes? Do they hold a runner at 1st base?
Does their changeup have a big break? Do they keep their first pitch down?
In the pre-Moneyball era, players were often ranked by “tools” with “5-tools” being the
best. Teams were prone to draft players who had that “look” – handsome, athletic body,
and charisma. Today, they want players who get hits from optimal mechanics, have the
fastest reaction times, take the shortest path on the bases, and have the most accurate
and strongest throwing arms. The big data objective is to employ metrics like these114:
60th percentile of ability to go with the pitch
92th percentile of forward exit velocity
62th percentile of path to ball
72th percentile of throwing arm accuracy
40th percentile of speed
As a result, Sportvision tools are appearing at little leagues, colleges,
and baseball schools, and are used alongside radar guns and stop-watches. SmartKage
uses HITf/x and PITCHf/x in a video-sensor batting cage to provide over 40 performance
metrics to student athletes115. Their hitting, pitching, fielding, running, catching, and
agility are quantified to improve the player. Scouts also receive this data116. SmartKage
is installed in over 300 locations, 3,500 colleges117, and many minor league parks118.
2014 EMC Proven Professional Knowledge Sharing 38
Hitters' Questions Model Data
What does he throw? Top 2 Pitches
Pitch Repertoire/Variety
Horizontal Pitch Location
Vertical Pitch Location
How hard does he throw? FB Velocity
What kind of movement? Horizontal Movement
Vertical Movement
Where does he come from? Release Point
How does he like to pitch? Swinging Strike %
Zone %
Edge %
Top 2-pitch Sequence
Clustering Pitchers
With all the players in MLB, it is hard to find plentiful data on specific hitter-pitcher
matchups. This small sample size problem is evident when a manager picks a batting
order against a particular pitcher and decides to use someone who is 2 for 12 against a
pitcher, 1 for 3, or even 3 for 20 over a 3-4 year period because they lack a sufficient
statistical history to work with. What if pitchers were grouped with others who pitched the
same way and clustered based on fastball,
changeup, and curve release points, velocity,
and horizontal and vertical break
characteristics? A grouping might show how a
hitter could do against this model of a pitcher’s
ability. Vince Gennaro used a YarcData Urika
big data appliance to comb through a vast amount of PITCHf/x data to create pitcher
clusters against left-handed batters119. In red are
pitchers such as Gio Gonzalez and Ricky Romero. In
green, Jon Lester and Cole Hamels. Purple has CC
Sabathia and Boone Logan. Bruce Chen and Johan Santana are blue groups, with
David Price and Chris Sale in yellow. Gennaro’s
analysis evaluated 12 different attributes as shown in
this table120. This data could help a manager use a
stronger statistical correlation to select a lineup. Instead
of picking a batter against CC Sabathia, the manager
would use the group stats of Sabathia, Craig Breslow,
Rex Brothers, Clayton Kershaw, and Boone Logan.
To Gennaro, “…big data is about discovery. It's about asking the right questions to spot
a customer trend, to identify a new market, or, in personnel matters, to find a gem of an
employee, someone who you didn't know existed until their name started showing up
regularly on Twitter or LinkedIn. Big data helps you to know what don't know.”121
Revenue From Fans
Team owners will tell you the key to a profitable baseball franchise starts with an exciting
team, for excitement equals fans. Ticket revenue is just the tip of the iceberg. Selling
more tickets means higher concession revenue, increased merchandise sales, more
advertisers, etc. Management has many approaches to sell tickets, from giveaway
player bobble-head dolls, firework evenings, co-sponsored free caps, balls, or bats for
2014 EMC Proven Professional Knowledge Sharing 39
kids, and other items. There are also special stadium events such as “Old-Timers”
reunions, retiring a player’s number, and tributes. It’s not just in the ballpark; TV viewers
on their couches, passengers listening to car radio, Internet baseball fans, all experience
the fervor through licensed broadcasts. A happy fan base translates into higher revenue,
and higher TV/radio ratings lead to higher advertising revenue.
While a widely accepted goal is to sell all of your seats before the season begins to
season subscribers, not all teams share that view. The Nationals “…limit the number of
season tickets available for 2013 to 20,000 of our 41,000 available seats.”122 Single-
game tickets have higher club revenue than games purchased on a discounted season
plan. Clubs have also experimented with various ticket sale strategies in order to
increase direct ticket revenue. Variable and dynamic are two approaches to allow for
“market” ticket pricing against a hot club or discount tickets to a last-place opponent.
When their team wins, ticket prices can increase, or vice-versa if they lose.
With variable pricing, teams set seat prices based on the day, game time, and pre-
season popularity of the opponent. For example, a $19 Monday seat may cost $21 for a
Wednesday game. The Red Sox introduced premium prices for Opening Day and
Yankees games, and reduced prices for weeknight games in colder months of April, May
and September123. This increases attendance for games with less desirable opponents,
and raises prices for high demand tickets against top teams. Dynamic pricing uses big
data to gauge real-time interest to set prices based on supply and demand. Similar to
the airlines, prices are higher when your team is on a hot streak, a player is about to
break an important record, your team is vying for the pennant, or a great summer day.
Twists on dynamic pricing include allowing the fan to upgrade their seat when the
stadium is not full at a lower price than if they purchased the better seat weeks earlier.
Merchandise creates substantial club revenue because they incur no cost while profiting
from each item sold. Hats, jerseys, and foam fingers use licensed team logos and
names of players past and present that fans relate to. In 2012, almost 25 million fans
bought items124. Fans purchased $3.2 billion worth of goods in 2011125 and it is
estimated that MLB keeps 4% of each sale before distributing the revenue to teams126.
2014 EMC Proven Professional Knowledge Sharing 40
The Nationals are a technologically advanced club that
uses big data for fan marketing and ticket promotion. Their
“Ultimate Ballpark Access"127 card is a fan’s ticket, cash,
and rewards card that gets them personalized discounts in
the park. The card is key to their Customer Relationship
Management (CRM) system that integrates marketing, sales, and customer service to
get a better understanding of the fan. In exchange, fans get a better game experience
with points exchangeable for merchandise or food discounts, seat upgrades, promotions
during rain delays for free beer with a hot dog purchase, and other advantages128. With
cash tied to the card, fans are encouraged and incentivized to use it at the game.
Real-time data, including a fan’s location in the park, allows the Nationals to redeploy
concession staff when queuing causes long waits, offer promotions to fans that arrive
early, and reward frequent fans. Knowing more about each fan even allows vendors to
greet them by their loyalty card name. “Hi Debi, thanks for coming today! As a loyal fan, I
see you are entitled to a free hot dog. Want it with mustard”? They even increase sales
during inclement weather through parking and subway discounts if fans buy a ticket.
All Nationals business groups share a common CRM view - ticketing, merchandise
manager, concession coordinator, etc. They strive to improve the customer experience
and allow fans to dictate their level of participation in the system. The Nationals track
how often a season ticket holder attends a game and make it easy for that fan to resell
their tickets on StubHub since the goal is a full stadium129. If they see a pattern where a
fan buys tickets when certain teams visit, such as when the Mets come to play, they can
offer them a Nationals-Mets ticket package. Their loyalty card is also tied in with other
merchants such as grocery stores for extra ballpark discounts. They increase the value
of the Nationals brand by knowing more about their fans and market directly to them.
Business managers also want to know their fans better. Data was always available on
season or luxury box ticket holders, but insight into fans that came once a year was hard
to come by. Big data to the rescue, especially with a CRM card! For example, they can
deduce that a fan who buys beer and cotton candy in the 3rd inning likely has a child at
the game, and can text them a 7th inning 50% discount offer on children’s caps.
2014 EMC Proven Professional Knowledge Sharing 41
Offering fans rich interactive media is part of the big data
ticket buying experience. For example, the Yankees ticket
website uses an interactive park tour to find a view you
like. MLB runs ticketing for most of the teams, and tracks
the buying habits of its fans to improve the fan experience
and help gain insight into its customers. Getting to the park
is made easier using a smart phone or tablet for directions and parking details. MLB’s
app also shows team statistics and video highlights from games attended, player songs,
special offers, and ticket upgrades130. Fans can receive smart phone alerts about the
best parking, gates to use, and bathrooms with short lines, see where they are in the
park, bring up a food and drink menu, and have their order delivered to their seats!
More seated fans means more advertising revenue. Time slots before the game begins,
between innings, during pitching changes, and on the rain tarp during inclement weather
are ideal for advertisers looking to capture the attention of 30,000-50,000 fans. Wrigley
Field (Chicago Cubs) is expanding their advertising square footage, including a left field
Jumbotron expected to “…generate $300 million over five years to use for Wrigley
renovations.”131 Advertising revenue is closely tied to fan attendance, which reflects
whether the team is in contention and exciting to watch. For example, the 2010 Mets at
Citi Field drew 2,559,738 fans with a 79-83 record132. A year later, their record fell to 77-
85 and drew 8% fewer fans. Advertising revenue dropped $1.9 million to $44.2 million133.
MLBAM was created to develop a common look-and-feel web site for each team134.
They now generate revenue, and in 2012, brought in $650 million through 3 million
MLB.tv and MLB.com subscribers. AtBat is their top smart app with almost 7 million
downloads. It provides play-by-play action plus the PITCHf/x graphical flight of the ball
as it crosses home135. Through electronic devices, fans increase their enjoyment through
real-time statistics and individual replays within 15 seconds. “...MLB.tv subscription
allows you to stream the game through an Xbox or any Wi-Fi connection. Meanwhile,
BAM's AtBat app allows for streaming on
your iPhone or iPad. It also allows fans to
watch the action in "Game Day" mode , a
data-enriched, live-graphic presentation of the
game.”136 It takes a lot of real-time big data
expertise to unify diverse data feeds behind
2014 EMC Proven Professional Knowledge Sharing 42
the scenes, mask the complexity and make it appear simple and seamless to the fan.
MLBAM is also working on bringing PITCHf/x, HITf/x, and FIELDf/x to the home video
game console. Imagine being at home, pinch hitting for a player with real runners on
base, trying to swing a virtual bat at Mariano Rivera’s cut fastball coming at you on your
big screen TV at 90 mph. The gaming potential is tremendous and promises to offer
greater fan immersion compared to standard XBOX or PlayStation baseball games.
It should not be a surprise that running a successful ball team is a big data problem. The
data shows all sorts of patterns137. Who buys tickets? Who buys merchandise? Who
likes what? What’s the best time to make them an offer? What form should that offer
take? On the game side, what combination of players produces the best results? Where
should you position your defense? What pitches in what sequence should you look for?
It’s come down to Moneyball on steroids. Moneyball was “…based on 80,000 box scores
and play by play data (1963-2002), or about 1.5GB of just outcome data”138, versus
today where multi-terabytes of data are produced in one game – a big data explosion!
There’s lots of data and answers in that data that leads to some amazing conclusions.
Revenue From Media
Clubs also make money from TV and radio deals. Teams contractually share revenue
from national contracts with CBS, ESPN, TBS, FOX, etc., amounting to $12.4 billion over
8 years, or nearly $52 million a season per club139. Teams also have local contracts. For
example, the Yankees own the YES Network which paid them $85 million in 2013 for
broadcasting rights, eventually reaching $350 million a year by 2042140. In 2012, the
Yankees sold 49% of YES to FOX Sports for $3.4 billion141. In 2013, the Dodgers and
Time Warner Cable signed a $7-8 billion, 20-25 year deal worth $280 million a year142.
Why are sports so popular with networks and advertisers? One big reason is the DVR.
Viewers record their favorite shows and fast-forward through commercials to achieve a
near non-stop experience. That behavior hurts product advertising and the revenue it
generates. People may record their favorite TV show to skip the commercials, but
watching DVR'ed sports results in a mediocre experience. There are many ways to track
a score, so once you find out the result, you tend to either not watch your DVR’ed game,
2014 EMC Proven Professional Knowledge Sharing 43
or watch it with less enjoyment. That is why advertisers pay so much to air commercials
between innings – people want to watch sports games live and not record them.
A team can make out well when it comes to negotiating advertising contracts with
broadcasters. For a broadcaster, a single game represents four 30-second TV or radio
commercials per half inning, or 56 commercials during a 9-inning game. In an extreme
example, Fox TV sold each 2010 baseball playoff 30-second slot for $225,000, and
$450,000 for World Series slots143, with the 2012 All-Star game going for $550,000 per
commercial144. There may also be advertisements during the game, such as the “Pepsi
home run blast” or the “Kia player of the game”. There are also commercials before and
after the game, but it is harder to guarantee advertisers the same viewer demographics,
ratings points, and market share of the 18-49 year olds they are trying to reach.
Print advertisers buy space on pocket schedules, media guides, game day handouts,
yearbooks, and the back of paper tickets. Ads are also placed in concourses around the
park, electronic scoreboards, outfield fences, behind home plate and the bases, and
scrolling messages around the field. Premium giveaways with an advertiser’s message,
such as a souvenir cup or trinket, are handed to fans as they enter the park. MLB
attracts an interesting demographic for advertisers. For example, 25% of the Milwaukee
Brewers fans earn $100,000 or more compared to 17% of the Milwaukee Designated
Market Area145. The team’s demographic audience, market, and team interest dictate the
prices charged for signage. For example, a big market team like the Red Sox might
charge $300,000 for the same home plate signage that costs $50,000-$60,000 for a half-
inning at Kauffman Stadium (Royals)146. When the Red Sox play the Royals, the Boston
TV audience sees advertisements that cost 1/6 the price of Fenway ads.
Big Data Helps Create Algorithmic Baseball Journalism
Automated Insights (AI) created realtime.automatedinsights.com147 which robo-writes
content about any MLB game in real-time. It includes big data mined
observations from previous games, be they from the day before or 100
years ago. With instant updates to games in progress and recaps derived from the final
box score, facts and insights are added to yield a colorful account of the play-by-play.
Originally named StatSheet.com, AI leverages structured statistics to produce big data
automated written content. They use linguistic algorithms to automatically publish 500-
2014 EMC Proven Professional Knowledge Sharing 44
1,000 word recaps such as this from a
Red Sox – Tigers 2013 game148. They
generate recaps for 417 other websites
covering the NFL, MLB, NCAA Division
I college basketball, and others,
consisting of 400,000+ tweets on score
updates, play-by-play accounts, and
more149. Their “…network is already
comprised of more than one million
unique pages of content and 15,000-
20,000 original articles are added every month.”150
Software-generated content can turn quantitative information into English, but qualitative
writing still requires a human touch. Some benefits of “robo-writing” include:
never gets writer’s block
works morning, noon, and night, and produces content quickly
can alter its writing style through parameter changes
can analyze data without getting tired
low cost per article
Programs can take something simple like “Adam Dunn of the White Sox homered to left
center” and instantly turn it into:
“That was an MLB leading 35th home run for the year. Adam
Dunn moves into sole possession of 50th place all time for
home runs in Major League Baseball. When there are two or
more home runs in a game, the White Sox are 34-10.”151
Rather than invest in a large data center with physical servers, storage systems, backup
and disaster recover capabilities, large power and cooling bills, and more to pull off this
big data feat, they leverage the elastic cloud computing capabilities of Amazon’s AWS.
On peak Tuesdays, they spin up 300 AWS servers on demand for one hour to input
millions of Yahoo data points and author millions of customized fantasy sports reports,
each tailored to an individual fantasy team owner!152 A report could delve into a player’s
actions versus historical events, a manager’s decisions, playoff implications, and more.
They create hundreds of reports a second and AWS bills them for the actual usage.
2014 EMC Proven Professional Knowledge Sharing 45
This US patent shows the AI basic flow153. Statistics flow from an information provider
(305) to any device that can relay baseball
stats according to an agreed API, XML,
CSV, or spreadsheet format. A content
generation engine (320) issues a SQL
query against the content generation
database (330) for statistics that includes
event IDs such as the one used in the
PITCHf/x metadata, or other classification trends, streaks, new records, etc.
The actual content is based on templates designed by AI journalists and programmers
which include <phrases> and <variables>. The templates can include “…event preview,
in-progress event update, event recap/summary, statistical analysis, progress report,
etc.”154 Once a template is selected, text is varied to avoid duplication of content by
using <phrase variation> and <statistic>. The variation allows the tone of the content to
be hopeful, positive, optimistic, angry, indifferent, pessimistic, judgmental, etc. Their
algorithms also derive uniqueness from the significance of the event.
Known as computational journalism, machine written news, automated content creation,
algorithmic journalism, and robotic reporter, to name a few, companies like AI mine big
data to create humanized content around statistics by adding the who, what, where,
when, and why with nouns, verbs, and other parts of speech. To guard against
repetitiveness, they use artificial intelligence to scan big data datasets for patterns,
insights, and trends using natural language constructs. While not yet at the level of a
professional writer, they do a fine job with plain English. Their approach can also
generate content for real estate listings, financial reports, and traffic alerts.
AI employs journalists who specialize on meta-data structure and content, and work with
programmers on data filtering parameters to fit that syntax. They coach algorithms to
identify various data “angles”. Content may originate from a hit or an out, winner or loser,
record setters, and players who had a great day. The methods involve context and data
mining searches from proprietary databases, external data feeds, and external big data
sources. Automated mining can free up a real reporter from the drudgery of digging out
data clues. The big data aspect adds historical perspective for “why” an event is special.
Working with data rich subjects, turning statistics like RBIs and home runs into
2014 EMC Proven Professional Knowledge Sharing 46
sentences and paragraphs is a natural fit for robo-writing. AI’s programmers help fill the
gap created by enormous rich data streams and the demand for formulaic journalism155.
Here are two different articles written about the same game. Which was written by the
NY Times and which by an algorithm?
BOSTON — Things looked bleak for the Angels when they trailed by two runs in the ninth inning, but Los Angeles recovered thanks to a key single from Vladimir Guerrero to pull out a 7-6 victory over the Boston Red Sox at Fenway Park on Sunday. Guerrero drove in two Angels runners. He went 2-4 at the plate. “When it comes down to honoring Nick Adenhart, and what happened in April in Anaheim, yes, it probably was the biggest hit (of my career),” Guerrero said. “Because I’m dedicating that to a former teammate, a guy that passed away.” After Chone Figgins walked, Bobby Abreu doubled and Torii Hunter was intentionally walked, the Angels were leading by one when Guerrero came to the plate against Jonathan Papelbon with two outs and the bases loaded in the ninth inning. He singled scoring Abreu from second and Figgins from third, which gave Angels the lead for good.
BOSTON — In the hopes of channeling one of their most famous postseason moments, Boston reached back 23 years on Sunday and had Dave Henderson throw out the ceremonial first pitch. It was Henderson’s home run that rallied the Red Sox when they were down to their last strike against the Angels in the 1986 American League Championship Series. Now with the Red Sox facing elimination by the Angels again, here was Henderson, proof that no game, no series, is over until the final out is secure. But this time, all these years later, it was the Angels who made that point. Down to their final strike three times in the ninth inning — just as Henderson was in 1986 — Los Angeles scored three runs off Boston’s dominant closer Jonathan Papelbon, to win the game, 7-6, and eliminate the Red Sox from the postseason. Henderson’s magic, apparently, is not reusable.
Still not sure? An algorithm created the one on the left from a box score with on-line
quotes thrown in for color156. While the artificial article is not as good as the journalist’s,
the quality isn’t noticeably sub-par. It shows state of the art linguistic algorithms
combined with big data. Algorithms may help with data mining research, create content
that journalists virtually never do such as Little League game commentary, handle
simple medical checkup explanations involving cholesterol readings, or traffic reports.
They can also report on the finer points of pitch movement, trajectories off the bat, and a
shortstop’s movement to his left, while updating it every second. Algorithms could also
alert a coach in natural language to a tiring pitcher’s lower release point or velocity.
Listen To Your Data - Grady the Goat - The Curse of the Bambino
As good as baseball statistics and analytics have become, they don’t guarantee a win,
nor is everyone a true believer. An example of where data was available to help make a
better decision - a premise of big data - was during the 2003 ALCS pitting the Red Sox
against the Yankees. The Red Sox were just 5 defensive outs away from winning their
5th pennant since 1918 with a chance to win the World Series, something they had not
2014 EMC Proven Professional Knowledge Sharing 47
done in 85 years. That they failed to win the game can be traced to the decision of Red
Sox manager Grady Little to “go with his gut”, and not with the data available to him, at a
critical moment late in the game.
The celebrated Red Sox pitcher Pedro Martinez was on the mound, and
was brilliant giving up only 3 hits through 7 innings. He had now thrown
over 100 pitches, however, and his team’s data analysis showed that his effectiveness
dropped noticeably after 6 innings or the 105-pitch mark. Historically, batters had a .555
OPS through 6 innings facing Martinez, jumping to .758 after that. As Martinez passed
the 105-pitch mark, the Yankees began to tee off on the previously dominant Red Sox
“ace”, chipping away at the Sox’s 5-2 lead. Astonishingly, Little refused to pull Martinez
from the game until he had thrown 123 pitches; his last pitch, thrown to Yankees catcher
Jorge Posada, saw the Yankees tie the game, which they went on to win. The so-called
“Curse of the Bambino”, which linked the Red Sox trading the legendary Babe Ruth in
1919 to the Yankees to their World Series drought since then, was seemingly alive and
well. Little was fired 11 days later; his refusal to rely on clear-cut evidence that a new
pitcher was needed had broken the hearts of Red Sox Nation for yet another year.
Conclusion: The Future of Big Data Baseball
Baseball is a game where a “once in a lifetime” batter has a .400 average and a pitcher
that wins 15 games a year is ranked in the top 10%. The Moneyball era marked a radical
overhaul of baseball analytics and the refinement of the process of baseball. Teams
today invest millions on advanced analytics to determine the players they need, how to
use them, how to help them, maximize revenue, and transform their team’s relationship
to their fans. A symbiotic relationship between the athlete and the nerd has emerged.
Fans continue to be exposed to this data explosion, with degrees of difficulty on fielding
plays replacing adjectives like “great catch”, or base running metrics that make for hotly
debated player comparisons. At home, fans see a liberal dose of data capabilities
because of advertising demands and new metrics that grab everyone’s attention.
Management and the media continue to expand revenues and increase fan interest.
Coaches are able to keep players healthier during the grueling season, and turn a
player’s average season into great ones with a few extra hits a month. As Vince
Gennaro says “There are going to be pregame meetings on where the shortstop is going
2014 EMC Proven Professional Knowledge Sharing 48
Client
Server
File System
Append Only Log
Write
AcknowledgeMem Table – Storage row
Flush
Commit Log Data – SS Table
WRITE PATH
DATA 1 DATA 2 DATA 3 DATA 4
DATA 1 DATA 2 DATA 3 DATA 4
DATA 1 DATA 2 DATA 3 DATA 4
DATA 1 DATA 2 DATA 3 DATA 4
DATA 1 DATA 2 DATA 3 DATA 4
Client
Server
File SystemRandom Seek
Request
MissMem Table – Storage row
Read
Data – SS Table
Key Cache
Bloom Filter
Acknowledge READ PATH
DATA 1 DATA 2 DATA 3 DATA 4
DATA 1 DATA 2 DATA 3 DATA 4
DATA 1 DATA 2 DATA 3 DATA 4
DATA 1 DATA 2 DATA 3 DATA 4
to play exactly for each of the hitters”157. Players will make better catches, runners will
get more stolen bases, and the pitcher/hitter relationship will become more intense.
Big data baseball continues to grow as PITCHf/x and other systems reach minor league
affiliates. Over time, 7,500 major and minor league players could be tracked a year. Add
7,700 U.S. colleges158, 37,000 American high schools, and big data could someday track
over a million players. By following players around the world, there will be millions of
players that could be affordably tracked. Big data has come to baseball!
More players equal more games and more data as demonstrated by Gennaro’s use of
Urika. Management will gain more insight into which high school players to draft, find
undrafted “gems”, decide who to put together on teams, how to help player
development, determine negotiating strategies based on empirical player values, and
more. The data collected is complex and difficult to derive information from using
classical database tools, and some teams have begun to experiment with Hadoop.
Hadoop is a big data processing method where structured and unstructured data is
spread amongst dozens to hundreds of computers in parallel to quickly answer complex
questions, similar to the speed of a Google search request. The Rays are using Hadoop
to help derive extra revenue and create a better fan experience159. Coaches use
Cassandra, Hive, or Hbase analytics160,161 to discover better defenses, optimize running,
or give the hitter or pitcher an advantage. The following is the basic Cassandra logic.
Big data sports managers know that live baseball entertainment is not limited to the
ballpark. They would love, for example, to tie PITCHf/x into the untapped market of real-
time video games. Fans trying to bat against Mariano Rivera with real game footage
could bring in substantial revenue. Someday, the data might tie into holographic baseball
in your den, so fans could physically “stand tall” against the pitcher with bases loaded!
2014 EMC Proven Professional Knowledge Sharing 49
Teams don’t want to eliminate the human
factors that make baseball a great game,
but instead leverage big data to help them
test theories and make smarter decisions in
all facets of their business. Grady Little’s
“old school” gut feeling days and the era of
Earl Weaver note cards may be coming to
an end. The “new school” big data explosion shows the decision pool of data is growing
at a rate beyond the abilities of a simple, singular data mining approach162. Mastering
these new analytics should yield valuable competitive insights in all facets of the game.
The Yankees have realized this and employ 21 statisticians163 who apply their
knowledge of analytics, modeling, and computer science to the customer relationship,
and solve complex business problems by asking “what if” questions of the data.
Perhaps the day will come when the fan in their seat will ask their smart phone about the
likelihood of a runner on 1st base scoring if the batter gets a curveball and drives it into
right field. Or transit officials may someday reduce car traffic and improve mass transit
service through real-time insight into the number of fans attending a game, fan
sentiment, and the score. We want to have a great time at the game and owners want us
excited about their product. Trying to understand the data relationships is a big
challenge, and teams that succeed will increase their revenue. When teams succeed,
they enhance our national pastime. Play ball!
2014 EMC Proven Professional Knowledge Sharing 50
Appendix - PITCHf/x Metadata
The following definitions support the inning_9.XML file used earlier in this paper.
Field This data value Description
des Foul ball Pitch description
des es Foul ball Pitch description Spanish
ID 507 Unique pitch number of the game
type S B ball, S strike, X in play
tfs 230849 Event #
tfs_zulu 2012-10-29 to 3:08:49Z Starting date and time of event
x 109.01 x-pixel at home plate
y 132.11 y-pixel at home plate
sv_id 121028_230849 date/time stamp when PITCHf/x first detects the pitch YYMMDD_hhmmss
start_speed 94.4 release point velocity mph
end_speed 86.0 crossing front of home plate velocity mph
sz_top 3.32 distance from ground to the top of the strike zone, feet
sz_bot 1.53 distance from ground to the bottom of the strike zone, feet
pfx_x 8.23 horizontal movement of the pitch between the release point and home plate, inches
pfx_z 10.3 vertical movement of the pitch between the release point and home plate, inches
px -0.315 left/right distance of the pitch from the middle of the plate as it crossed home plate, feet
pz 2.919 height of the pitch as it crossed the front of home plate, feet
x0 2.562 left/right distance of the pitch, measured at the initial point, feet
y0 50.0 distance from home plate to measure the initial parameters, feet
z0 6.035 height of the pitch measured at the initial point, feet
vx0 -10.7 x velocity of the pitch measured at initial point, feet per second
vy0 -137.96 y velocity of the pitch measured at initial point, feet per second
vz0 -6.156 z velocity of the pitch measured at initial point, feet per second
ax 15.708 x acceleration of the pitch measured at the initial point, feet per second per second
ay 33.556 y acceleration of the pitch measured at the initial point, feet per second per second
az -12.449 z acceleration of the pitch measured at the initial point, feet per second per second
break_y 23.7
distance from home plate to the point in the pitch trajectory where it achieved its greatest
deviation from the straight line between the release point and the front of home plate, feet
break_angle -43.3
angle from vertical to the straight line path from the release point to where the pitch crossed
the front of home plate, as seen from the catcher’s/umpire’s perspective, degrees
break_length 4.1
greatest distance between the pitch's trajectory between the release point to the front of
home plate, and the straight line from the release point to the front of home plate, inches
pitch_type FF
most probable pitch type according to a neural net classification algorithm developed by Ross
Paul of MLBAM - FF=Four-seam Fastball
type_confidence 0.676
value of classification algorithm’s output node corresponding to the most probable pitch
type, multiplied by 1.5 if the pitch is known by MLBAM as part of the pitcher’s repertoire
zone 1
location of the pitch based on the boxes into which the Gameday app divides the strike zone
for its hot/cold zone graphics
nasty 41 how hard a pitch it is to hit 0-100
spin_dir 141.469 direction of spin of the baseball, degrees
spin_rate 2651.72 measured as the number of rotations the ball, rpm
cc Comments?
mt Unknown?
2014 EMC Proven Professional Knowledge Sharing 51
Footnotes
1 http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played
2 http://espn.go.com/mlb/player/hotzones/_/id/5375/kevin-youkilis
3 http://www.fangraphs.com/graphs.aspx?playerid=1935&position=1B/3B&page=0&type=mini
4 http://blog.ceciliatan.com/?p=478
5 Adapted - http://www.ericcressey.com/troubleshooting-baseball-hitting-timing-is-not-always-the-problem
6 www.providencejournal.com/sports/red-sox/content/20120316-baseball-vision-when-20-20-eyesight-just-wont-cut-it.ece
7 “The Physics of baseball”, by Robert K. Adair, ISBN 0-06-008436-7, Page 39
8 http://www.bostonbaseball.com/whitesox/baseball_extras/physics.html
9 http://www.sportvision.com/baseball/fieldfx
10 http://baseball.physics.illinois.edu/FieldFX-TDR-GregR.pdf
11
http://www.brooksbaseball.net/pfxVB/pfx.php?s_type=2&sp_type=1&year=2012&month=8&day=4&batterX=&pitchSel=453329&game=gid_2012_08_04_minmlb_bosmlb_1/&prevGame=gid_2012_08_04_minmlb_bosmlb_1/&prevDate=84&inning1=y 12
http://princeofslides.blogspot.com/2011/05/sab-r-metrics-gif-movies-and-pitch.html 13
http://www.baseball-reference.com/players/r/riverma01.shtml 14
http://www.baseball-reference.com/players/m/martipe02.shtml 15
http://www.fangraphs.com/leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=1&season=2013&month=0&season1=2013&ind=0&team=0,ss&rost=0&age=&filter=&players=0 16
http://www.fangraphs.com/heatmap.aspx?playerid=844&position=P&pitch=FC 17
http://articles.latimes.com/2013/jul/29/sports/la-sp-0730-mariano-rivera-20130730 18
http://en.wikipedia.org/wiki/Elias_Sports_Bureau 19
http://sabr.org/sabermetrics 20
http://www.predictiveanalyticsworld.com/sanfrancisco/2012/presentations/pdf/Day2_1430_Vartanian.pdf 21
http://baseballanalysts.com/archives/2006/07/empirical_analy_1.php 22
http://media.hometeamsonline.com/photos/baseball/SOBRATO/TG_Strategy_Presentation.pdf 23
http://turbotodd.wordpress.com/2011/10/26/information-on-demand-2011-a-data-driven-conversation-with-michael-lewis-billy-beane/ 24
http://www.baseball-reference.com/players/d/davisch02.shtml 25
http://deadspin.com/2013-payrolls-and-salaries-for-every-mlb-team-462765594 26
http://baseballplayersalaries.com/ 27
http://www.sloansportsconference.com/wp-content/uploads/2012/02/98-Predicting-the-Next-Pitch_updated.pdf 28
http://sports.yahoo.com/fantasy/blog/roto_arcade/post/Been-Caught-Stealing-MLB-warns-Phillies-about-s?urn=fantasy,240535 29
http://www.geekwire.com/2013/major-league-baseball-dominates-big-data-mobile-game/ 30
http://seanlahman.com/blog/wp-content/uploads/2013/08/SABR-43-presentation.pdf 31
http://espn.go.com/mlb/player/stats/_/id/1150/tony-gwynn 32
http://sportsillustrated.cnn.com/vault/article/magazine/MAG1115578/2/index.htm 33
http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played 34
http://www.sloansportsconference.com/?p=6137 35
http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played 36
http://www.gartner.com/it-glossary/big-data/ 37
http://www.sloansportsconference.com/wp-content/uploads/2013/Slides/205/HP%20Vertica.pdf 38
http://www.sportvision.com/ 39
“The Technology of Baseball” by Thomas K. Adamson. ISBN 978-1-4296-9955-6, P.15 40
http://www.bloomberg.com/news/2011-03-31/baseball-is-set-for-deluge-in-data-as-monitoring-of-players-goes-hi-tech.html 41
http://blogs.wsj.com/cio/2013/07/15/baseball-all-stars-data-gets-more-sophisticated-with-field-fx/ 42
http://en.wikipedia.org/wiki/1st_&_Ten_(graphics_system) 43
http://www.dnallen.com/stuff/AllenSkillVLuck11.pdf 44
“Broadcasting Baseball: A History of the National Pastime on Radio and Television”, ISBN 978-0-7864-4644-5, P. 204 45
“Why a Curveball Curves”, ISBN 978-1-58816-794-1. Page 63, “The Physics of a Curveball” by Dr. Peter Brancazio, my professor at Brooklyn College, 1971. The Magnus effect is the top spin causing the ball to move downward in addition to gravity, back spin causing the ball to rise, all due to the ball’s interaction with the air and the wake it creates. 46
http://sportsvideo.org/main/blog/2010/10/tbs-sportvision-offer-hd-viewers-new-angles-with-pitchfx/ 47
http://www.kbardesign.com/141266/1404601/motion-design/sportvision-baseball-presentation 48
http://www.hardballtimes.com/main/blog_article/ichiro-strike-three/ 49
“Scorecasting: The Hidden Influences Behind How Sports Are Played and Games Are Won” by Tobias Moskowitz, ISBN 978-0-307-59180-7, P.14. 50
http://www.brooksbaseball.net/pfxVB/zoneTrack.php?&game=gid_2012_06_18_balmlb_nynmlb_1/&innings=yyyyyyyyy&month=06&day=18&year=2012 51
http://rayscoloredglasses.com/2013/05/09/velocity-concerns-for-rays-david-price-extend-beyond-his-fastball/ 52
http://pitchfx.texasleaguers.com/pitcher/285079/?batters=A&count=AA&pitches=AA&from=6%2F18%2F2012&to=6%2F18%2F2012
2014 EMC Proven Professional Knowledge Sharing 52
53
http://pitchfx.texasleaguers.com/pitcher/285079/?batters=A&count=AA&pitches=AA&from=6/18/2012&to=6/18/2012 54
http://pitchfx.texasleaguers.com/readingcharts.php 55
http://www.fangraphs.com/pitchfx.aspx?playerid=8700&position=P 56
http://sportsillustrated.cnn.com/2011/writers/tom_verducci/04/12/fastballs.trackman/index.html 57
http://baseballanalysts.com/archives/fx_visualizatio_1/ 58
An Introduction to TrackMan Baseball - http://baseballanalysts.com/archives/fx_visualizations/002587-print.html 59
http://www.fangraphs.com/pitchfxg.aspx?playerid=8700&position=P&season=2011&date=0&dh=0 60
http://wapc.mlb.com/play/?content_id=26750779&topic_id=&c_id=mlb&tcid=vpp_copy_26750779&v=3 61
http://www.kbardesign.com/141266/1400894/motion-design/pitchfx-data-visualization 62
http://msn.foxsports.com/mlb/player/miguel-cabrera/hotzone/140955?q=miguel-cabrera 63
http://pitchfx.texasleaguers.com/batter/408234/?pitchers=A&count=AA&pitches=AA&from=1/1/2012&to=12/31/2012 64
http://www.sportvision.com/news/research-confirms-hard-throwers-advantage 65
http://wapc.mlb.com/play/?c_id=mlb&content_id=20171503&topic_id=7417714&v=3 66
http://www.sportvision.com/news/baseballs-strike-zone-illusionist 67
http://www.sportvision.com/news/baseballs-strike-zone-illusionist 68
http://phys.csuchico.edu/baseball/talks/AAPT(Jul-2013)/slides.pdf 69
http://baseball.physics.illinois.edu/TrackingTechnologiesBaseball.pdf 70
http://www.sportvision.com/baseball/hitfx 71
http://www.sportvision.com/baseball/hitfx 72
http://sportsvideo.org/main/blog/2013/07/sportvisions-hit-speed-data-making-broadcast-debut-on-fox-sports-regional-networks/ 73
http://seamheads.com/2010/05/09/meet-the-new-park-factors-part-ii/ 74
http://www.beyondtheboxscore.com/2009/6/6/900817/graph-of-the-day-average-batted 75
http://baseballengineer.com/2010/04/13/meet-the-new-park-factors-–-part-ii/ 76
http://www.hittrackeronline.com/detail.php?id=2012_3635&type=hitter 77
pitchfx.texasleaguers.com/pitcher/461833/?batters=A&count=AA&pitches=AA&from=4/1/2012&to=9/30/2013 78
http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played 79
http://www.billjamesonline.com/best_defensive_teams_of_the_decade/ 80
http://www.baseballnation.com/2011/4/23/2128357/will-fieldf-x-go-public 81
http://www.businessinsider.com/fieldfx-sportvision-technology-baseball-tracking-2011-03 82
http://en.wikipedia.org/wiki/Rickey_Henderson 83
http://www.fangraphs.com/community/2010-pitchfx-summit-recap/ 84
http://www.alliedvisiontec.com/apac/news/news-display/article/lets-play-ballbrmachine-vision-cameras-in-professional-baseball.html 85
http://www.baseballprospectus.com/article.php?articleid=11869 86
New Technologies in Baseball, Aug 7, 2010, Rand Pendleton, Sportvision Dir Video Broadcast Development 87
http://baseballanalysts.com/archives/fx_visualizatio_1/ 88
http://www.baseballprospectus.com/article.php?articleid=11869 89
http://www.slideshare.net/timothyf/fisher-baseball-baseball-stats-overload?from_search=13 90
http://www.seanlahman.com/2013/08/baseball-in-the-age-of-big-data/ 91
http://www.baseballprospectus.com/article.php?articleid=11869 92
http://www.kbardesign.com/141266/1429848/motion-design/baseball-reaction-time-concepts 93
www.bloomberg.com/news/2011-03-31/baseball-is-set-for-deluge-in-data-as-monitoring-of-players-goes-hi-tech.html 94
http://www.intelfreepress.com/news/moneyball-brings-big-data-to-silicon-valleys-minor-league/6239 95
http://en.wikipedia.org/wiki/Pythagorean_expectation 96
http://60ft6in.com/2012/12/09/technology-for-the-fans/ 97
http://www.nytimes.com/2009/07/10/sports/baseball/10cameras.html?_r=1& 98
http://www.youtube.com/watch?v=HIJRZW3AoOw 99
http://www.fangraphs.com/library/business/revenue-sharing/ 100
http://www.forbes.com/mlb-valuations/list/ 101
http://espn.go.com/blog/statsinfo/tag/_/name/defensive-runs-saved 102
https://infocus.emc.com/william_schmarzo/using-the-big-data-strategy-document-to-win-the-world-series/ 103
http://www.cnbc.com/id/38046122 104
http://articles.latimes.com/2011/oct/21/sports/la-sp-world-series-tv-20111022 105
http://www.tauntongazette.com/news/x1565405444/World-Series-a-boost-for-local-business 106
http://sports.newsday.com/long-island/data/baseball/mlb-salaries-2013/ 107
http://www.baseball-reference.com/players/l/leecl02.shtml 108
http://www.fangraphs.com/statss.aspx?playerid=1636&position=P#battedball 109
http://www.cbssports.com/mlb/salaries 110
http://wiki.answers.com/Q/Which_countries_play_baseball 111
http://blogs.sas.com/content/cokins/2011/08/23/moneyball-turning-the-odds-on-the-casino-with-analytics/ 112
http://www.minorleagueball.com/2013/1/28/3925786/2013-baseball-farm-system-rankings 113
http://progressive-economy.org/files/2012/05/fact512.opening.day_1.pdf 114
http://www.sloansportsconference.com/?p=7809 115
http://www.smartkage.com/#/services/professional 116
https://www.prbuzz.com/sports/114405-more-good-news-for-boston-area-baseball-players-and-fans-of-all-ages.html 117
http://sabr.org/analytics/speakers/2012 118
http://www.reuters.com/article/2011/08/25/idUS162922+25-Aug-2011+MW20110825 119
http://www.yarcdata.com/Solutions/baseball-analytics.php
2014 EMC Proven Professional Knowledge Sharing 53
120
http://vincegennaro.mlblogs.com/2013/04/22/clustering-pitchers-by-similarity-part-1/ 121
http://www.yarcdata.com/Resources/video9.php 122
http://www.forums.mlb.com/n/pfx/forum.aspx?nav=messages&webtag=ml-washington&tid=38839 123
http://www.csnne.com/blog/red-sox-talk/red-sox-introduce-new-ticket-pricing-structure?p=ya5nbcs&ocid=yahoo 124
http://www.statista.com/topics/968/major-league-baseball/#chapter3 125
http://www.ocregister.com/articles/licensed-347956-products-major.html 126
http://www.askmen.com/sports/business/14_sports_business.html 127
http://washington.nationals.mlb.com/was/ticketing/ultimate_ballpark_access.jsp 128
http://emerging.uschamber.com/events/2013/bhs-big-data-and-why-it-matters 129
http://nationals.brochure-mlb.com/2013/faqs.php 130
http://newyork.yankees.mlb.com/mobile/attheballpark/index.jsp?c_id=nyy 131
http://www.nwherald.com/2013/07/12/cubs-get-signage-approval-ending-stadium-revenue-excuses/atyifb4/ 132
http://www.baseball-reference.com/teams/NYM/attend.shtml 133
http://bats.blogs.nytimes.com/2013/03/05/attendance-and-revenue-fall-at-citi-field/?_r=0 134
http://www.marketwatch.com/story/baseball-prospectus-announces-multi-year-partnership-with-mlbam-2013-09-04 135
http://www.forbes.com/sites/mikeozanian/2013/03/27/baseball-team-valuations-2013-yankees-on-top-at-2-3-billion/ 136
http://www.fastcompany.com/1822802/mlb-advanced-medias-bob-bowman-playing-digital-hardball-and-hes-winning 137
http://www.bloomberg.com/video/applying-big-data-thinking-to-sports-LZvCgdCFTxGYMtBXCVGm6g.html 138
http://sabr.org/latest/2013-sabr-analytics-conference-research-presentations - transcription of Vince Gennaro, "The Big Data Approach to Baseball Analytics" 139
http://www.forbes.com/sites/mikeozanian/2013/03/27/baseball-team-valuations-2013-yankees-on-top-at-2-3-billion/ 140
http://www.bloomberg.com/news/2013-03-28/yankees-value-surges-to-2-3-billion-on-yes-deal-forbes-says.html 141
http://sportsbusinessnews.com/content/big-business-major-league-baseball-and-television 142
http://www.nytimes.com/2013/01/29/sports/baseball/dodgers-sweet-tv-deal-will-taste-bitter-to-fans.html 143
www.bloomberg.com/news/2010-10-03/fox-says-ad-slots-for-baseball-playoff-series-world-series-are-90-sold.html 144
http://www.adweek.com/news/television/fox-sells-out-mlb-all-star-game-141581 145
http://milwaukee.brewers.mlb.com/mil/sponsorship/demographics/index.jsp 146
http://www.boston.com/sports/baseball/redsox/articles/2007/04/25/sox_have_dice_k_but_rivals_reaping_ad_dollars/ 147
http://www.forbes.com/sites/jasonbelzer/2013/02/26/automated-insights-poised-to-revolutionize-sports-media/ 148
http://www.sportce.com/red-sox-beat-tigers-5-2-advance-to-world-series-automated-insights/ 149
http://statsheet.com/sites 150
http://statsheet.com/about 151
http://www.youtube.com/watch?feature=player_embedded&v=M9oeJDkBjqA 152
http://upstart.bizjournals.com/companies/startups/2013/05/22/automated-insights-turns-data-to-stories.html?page=all 153
U.S. Patent US 20110246182A1 154
U.S. Patent US 20110246182A1 155
http://itknowledgeexchange.techtarget.com/cio/big-data-meets-automated-content-development/ 156
http://www.globalaffairs.org/forum/threads/computerised-journalism.62888/ 157
www.bloomberg.com/news/2011-03-31/baseball-is-set-for-deluge-in-data-as-monitoring-of-players-goes-hi-tech.html 158
http://wiki.answers.com/Q/How_many_colleges_are_there_in_the_US 159
http://www.7x7.com/tech-gadgets/lightspeed-venture-s-barry-eggers-how-big-data-disrupting-baseball 160
http://hadoop.apache.org/ 161
http://pivotallabs.com/patrick-mcfadin-killer-apps-using-apache-cassandra/ 162
http://resources.idgenterprise.com/original/AST-0082578_Turn_Big_Data_Inward_With_IT_Analytics.pdf 163
http://www.ft.com/cms/s/2/3f5cc88c-0b21-11e1-ae56-00144feabdc0.html#axzz1dkG8fZcd
EMC believes the information in this publication is accurate as of its publication date.
The information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC
CORPORATION MAKES NO RESPRESENTATIONS OR WARRANTIES OF ANY KIND
WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND
SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR
FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires
an applicable software license.