Do It In Production


TRANSCRIPT


Do It In Production
Testing Where It Counts
Seth Eliot
Senior Knowledge Engineer, Test Excellence

1

About Seth
Digital Media Services
A/B Testing of Services
Petabytes Processed
Services and Cloud

The Future of Software Testing
Part 1, Nov 2011: Testing in Production
Part 2, March 2012: TestOps
Part 3, July 2012: The Cloud

Chronologically left to right. Experience is in software services. Testing Planet links:

The future of software testing, Part Three: Cloud. July 10, 2012. http://www.thetestingplanet.com/2012/07/july-2012-issue-8/
The future of software testing, Part Two: TestOps. http://www.thetestingplanet.com/2012/03/march-2012-issue-7/
The future of software testing, Part One: Testing in Production. The Testing Planet, November 2011.

I also did a mind map: Testing in Production Mindmap, August 6, 2012, for Ministry of Testing (Software Testing Club): http://www.ministryoftesting.com/2012/08/mindmap-testing-in-production/

2

Measurement: a quantitatively expressed reduction of uncertainty, based on one or more observations.
Testing: measurement about the quality of a system under test.

Good book; I recommend it. Is there something the assembled crowd here might be interested in measuring? CLICK: yes, quality! So this is how I define testing. This INCLUDES classic pre-prod test case execution, and it will necessarily include more than classic test case execution.

3

Data-Driven Decision Making
Data-Driven Validation
Testing in Production (TiP)
Real Users
Production Environments

Data-Driven Decision Making (D3M) is about the first definition: measurement. Data-Driven Validation is about the second definition: testing. This talk is about TiP, but TiP is just one form of Data-Driven Validation. CLICK: TiP leverages real users, because we cannot know what all users will do. CLICK: and actual production, because production is a dangerous and chaotic place. All in a risk-mitigated way, to reduce uncertainty about the quality of your software.

4

Let's dive in with an example. Ben was not someone I had followed. CLICK: show re-tweet. The tweet: being from MSFT, this caught my attention. Likely IE6; even MSFT is running away from IE6. Is it cost effective to keep that XP environment around? With IE6? And how about every other OS and browser in the world? The matrix gets huge. Wouldn't it be great to answer the question: what are your users actually using? And understand how your product works with them.

5

Real World Performance Monitoring

500 million measurements per month
JSI: JavaScript Instrumentation

Instead of a huge matrix, you can use production to get the data you need on end-to-end performance under real operating conditions. In this case, PLT (page load time) for Outlook.com (Hotmail at the time) from millions of actual users. Get data on every OS, browser, geographic location, or data center used, instead of testing a huge matrix in the lab. They identified and remedied performance bottlenecks. They use JSI. This is Big Data. **** Who's heard of big data? (transitions to definition on next slide)
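(The Outlook.com tooling itself is not public, so here is only a minimal sketch of the idea in Python: aggregate real-user page load time beacons by OS and browser and report a percentile. The file layout and column names are assumptions for illustration.)

```python
import csv
from collections import defaultdict

def plt_percentiles(beacon_file, pct=0.95):
    """Group real-user PLT beacons by (os, browser); report count and a percentile."""
    groups = defaultdict(list)
    with open(beacon_file) as f:
        # Assumed columns: os, browser, plt_ms -- one row per real page view.
        for row in csv.DictReader(f):
            groups[(row["os"], row["browser"])].append(float(row["plt_ms"]))
    for key, samples in sorted(groups.items()):
        samples.sort()
        p = samples[int(pct * (len(samples) - 1))]
        print(f"{key}: n={len(samples)}, p{int(pct * 100)}={p:.0f} ms")
```

Every OS/browser combination that real users actually have shows up as a group, with no lab matrix required.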

-------------------
Not just PLT, but a round trip through everything you can't get in a lab:
Public internet
Load balancers
LAN switches
Partner services
This is (old) data from Hotmail (now Outlook.com). Based on this and similar measurements, they identified and remedied performance bottlenecks, such as upstream bandwidth constraints, by using more caching and static images.

6

What is Big Data?

Volume (MB, GB, TB, PB, EB, ZB)
Velocity
Variety
Value

The previous example makes use of Big Data. So while not all of our Data-Driven Validation needs to be Big Data, it is worthwhile understanding what Big Data is. 3 Vs: Volume, Velocity, Variety. 4th V: Value. What's the value? Efficient quality assessment.
------------------------------------------

Ultimately it is about Big Insights. Again Hubbard: when you have high uncertainty, you need very little data to make an impactful reduction in it.

3 Vs - http://radar.oreilly.com/2012/01/what-is-big-data.html

Volume: cannot be handled by a conventional RDBMS. SQL Server maxes out at 16 TB. The entire web is 0.5 ZB (2009); probably about 1-2 ZB today. Richard Wray (2009-05-18). "Internet data heads for 500bn gigabytes". The Guardian. http://www.guardian.co.uk/business/2009/may/18/digital-content-expansion

Velocity: everything's instrumented; speed of feedback is important. IBM, The Road: could you cross a busy road with just a snapshot (not live data)? http://vimeo.com/20718357 Batch vs. Stream. Partial Analysis: http://research.microsoft.com/apps/video/default.aspx?id=163222

Variety: Structured: DB. Unstructured: tweets. How about XML? One good rule of thumb: if the data structure (or lack thereof) is not sufficient for the processing task at hand, then it is unstructured.

7

Twitter

Xbox Kinect

I mentioned Twitter on the previous slide; here is how Twitter data can be used. This is an internal Microsoft tool; public tools exist to do similar things. ******* Turn the tweet stream into actionable metrics. Sentiment is positive 2:1; there was a spike in certain topics around TechFest (MSFT R&D showcase). Can be used to find bugs too: version-over-version issues.
-----------------------------

The ambient data of the web / social can be used. Data sources: Twitter, blogs, news, forums, Facebook, but mostly Twitter. Sentiment, timeliness (TechFest was in March), quality signals. Bugs? Timeline = new version? Certain phrases?

Note on sentiment: almost everything has a large neutral frequency; positive > 2:1 over negative is good. SDK and Kinect for Windows had a boost in early March: Microsoft TechFest (R&D showcase). Kinect Fusion creates a detailed 3-D rendering of your environment.

This technology may help you find bugs: certain phrases may indicate them, as may a rapid change in sentiment with a new release.

Other technologies that mine Microsoft's customer support data can also be used to find issues with released product.

8

What Does Data Have To Do With TiP?
Biggest and Best Data is in Production
Testing in Production (TiP)
Use Data for Quality Assessment
Production is Truth

Data-Driven Validation is bigger than just TiP; there is lots of good Data-Driven Validation prior to production too. For any system of sufficient scale, only production looks like production. Data center pics: ideal (lab) versus reality (production).

9

Why Testing in Production (TiP)? Find this with a unit test: walking directions. "This route may be missing sidewalks." You cannot find this bug pre-prod. Would you test walking directions between A and B for every combination in the world? It is trivial to find in production, with the right telemetry. Google can know when this happens in prod and report it. CLICK: Google knows that this route may be missing sidewalks!

Remember: only in production do you find:
The true diversity of real users and usage
The true complexity of the production environment

-----------------------------------------
"Find this with a unit test": James Whittaker - http://www.youtube.com/watch?v=cqwXUTjcabs&feature=BF&list=PL1242F05D3EA83AB1&index=16

10

5 Million Metrics

Grid Report: CPU, Network
Operational Data Store (ODS)

System / Application / Business Metrics
Historical trending and analysis

Hadoop

Let's look at an example from Facebook. Facebook uses open source monitoring software like Ganglia. Using Hadoop, which we will talk about later, they developed an internally produced ODS, persistent and accurate:
System metrics (CPU, Memory, IO, Network)
Application metrics (Web, DB, Caches)
Facebook metrics (Usage, Revenue)

They claim to collect 5 million metrics. They're a bit dodgy about what this specifically means, but it is passive validation at scale.
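(A minimal sketch of what "collect metrics on every server" means at the smallest possible scale; this is not Facebook's ODS. It assumes the psutil library and prints records instead of shipping them to a store.)

```python
import time
import psutil  # assumed available; cross-platform system metrics

def collect_system_metrics():
    """One sample of the kind of system metrics an ODS-style store ingests."""
    now = int(time.time())
    return [
        ("system.cpu.percent", psutil.cpu_percent(interval=1), now),
        ("system.mem.percent", psutil.virtual_memory().percent, now),
        ("system.disk.percent", psutil.disk_usage("/").percent, now),
    ]

# Passive validation in miniature: sample forever and ship (metric, value,
# timestamp) records to whatever does the historical trending and analysis.
while True:
    for record in collect_system_metrics():
        print(record)  # stand-in for writing to the metrics pipeline
    time.sleep(60)
```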

------------------------------------
Nagios: ping testing, ssh testing - that is Active Validation

Refs:
Ganglia, ODS: Cook, Tom. A Day in the Life of Facebook Operations. Velocity 2010. [Online] June 2010. http://www.youtube.com/watch?v=T-Xr_PJdNmQ
Picture: FB Prineville Datacenter: http://www.facebook.com/prinevilleDataCenter/

11

5 million metrics? It's still a lot.

But 5 million metrics is a bit ambiguous. I understand it to mean the number of different metrics collected times the number of servers they collect them on.

Cook, Tom. A Day in the Life of Facebook Operations. Velocity 2010. [Online] June 2010. http://www.youtube.com/watch?v=T-Xr_PJdNmQ

12

Metrics find: features broken for a significant percent of users
Metrics find: problems at scale
Constant dogfooding with reporting tools
Engineers stay with code every step of the way
"This process works for Facebook partly because Facebook does not, by and large, need to produce particularly high-quality software." Really?

So how does Facebook use their 5 million metrics to assess quality? Let's refer to a Quora answer and blog post from a FB engineer that discusses this. CLICK: How is FB like Gondor? Boromir: "Gondor has no king, Gondor needs no king." Facebook has no testers, Facebook needs no testers. CLICK: What does FB actually do then? (refer to slide) So am I mocking this or promoting it as a valid practice? CLICK: well, both really; it depends on your business requirements. The FB engineer in question said (refer to slide) ****** FB uses TiP only; they just throw it in production. We're all pretty familiar with FB's quality; if your quality needs to be higher, then this approach does not work.
-----------------------------------

"A lot of cross talk between Dev and QA; it's pretty slow; let's get rid of it."
Our engineers write, debug, and test their own code
We expose real traffic to these services
Engineers need to be there every step of the way
On the IRC channel when they deploy
Aggressively log and audit
5 million metrics can find:
Problems at scale
Broken features for a significant percent of users

Refs:
Cook, Tom. A Day in the Life of Facebook Operations. Velocity 2010. [Online] June 2010. http://www.youtube.com/watch?v=T-Xr_PJdNmQ
http://www.zdnet.com/blog/facebook/why-facebook-doesnt-have-or-need-testers/7191
http://www.quora.com/Is-it-true-that-Facebook-has-no-testers - Evan Priestley, Facebook engineer from 2007-2011

13

Just Throw It in Production?
Integration Testing
Functional Testing
Unit Testing
Metrics and Optics
maybe less of this
Testing in Production
Functional Testing
Integration Testing
Performance and Scalability
TDD
Instrument Everything

Blue = Developer
Purple = Tester

We've seen FB just throw it in production, and that is part of their business decision. But most teams will not choose to do this.

This is a simplified model of the test life cycle. I call this the BUFT model (Big Up-Front Testing). I presume this looks familiar to most of you. CLICK: So then maybe we add TiP. We still have BUFT, and now the testers have that much more to do! CLICK: So we need to adjust the model. This is just one possible way to do it. Devs take on more UFT testing: focus on functional and code quality at the COMPONENT level (Test can help with strategy). Test focuses on integrated service quality (Dev can help with implementation: testability in production).

****** Rule of thumb: you should not find bugs that could have been found in an earlier stage.

---------------------------
Other notes:
"Instrument Everything" is from FB - http://www.youtube.com/watch?v=T-Xr_PJdNmQ
Metrics and Optics give you access to the data stream
TDD is a better way to build in quality

No, do NOT just throw it in production
It should be part of a continuous test strategy
But you may want to reduce UFT (Up-Front Testing)
From BUFT to UFT + TiP

14

Real Users / Applications / Infrastructure
Synthetic Transactions

Active Validation
Passive Validation

Availability
Reliability
Performance
Operational Intelligence

Business Intelligence

Two types of Data-Driven Validation. The examples I have shown thus far are types of Passive Validation. Passive Validation is very valuable; do not be fooled by the name. The other type of Data-Driven Validation is Active Validation. Synthetic transactions will be very familiar: test cases are synthetic transactions. Let's look at some examples.

-----------------------------------------------------------
Passive Validation looks a lot like what we would call monitoring: operational intelligence, like availability and performance. Business Intelligence tells us where the user is going, crucial knowledge for a quality strategy. We always have to make hard decisions on what to test; this answers that. BI can also indicate bugs, if usage drops off when no user-facing change has been made.

Active Validation looks a lot like the testing we do today. Synthetic Transactions = Test Cases. If we do this in production, Testing becomes Active Monitoring.
Availability = Is it there? = successful Tx, regardless of result
Reliability = Does it work? = Tx without error
Performance = How long does it take?

15

Availability
Performance

From Azure

Visuals are from the Office Service Pulse dashboard; many metrics come from active validation. A specific example is Exchange Online, a hosted service providing email, calendar, and contacts management. They wanted to re-use existing on-prem tests, so they developed an execution framework running from Azure, Microsoft's cloud platform.
Availability = Is it there? = successful Tx, regardless of result. Run repeatedly to turn pass/fail into availability/non-availability.
Performance = How long does it take? Run repeatedly and time the Tx to trend historically over time. Especially useful release over release.
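(A minimal sketch of the pattern, not the Exchange Online framework: run one synthetic transaction repeatedly, turn pass/fail into availability, and turn timings into a performance trend point. The URL, pacing, and error handling are assumptions.)

```python
import time
import urllib.error
import urllib.request

def probe(url, runs=10):
    """Repeat one synthetic transaction against a production endpoint."""
    latencies, reachable = [], 0
    for _ in range(runs):
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=10)   # the synthetic Tx
            reachable += 1                            # available: it answered
            latencies.append(time.monotonic() - start)
        except urllib.error.HTTPError:
            reachable += 1                            # an error page still means "it's there"
        except OSError:
            pass                                      # timeout/refused: counts against availability
        time.sleep(1)                                 # pace the probes
    availability = reachable / runs                   # pass/fail -> availability
    median = sorted(latencies)[len(latencies) // 2] if latencies else None
    return availability, median                       # trend these release over release
```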

-----------------------------------------------

"We quite simply had to figure out how to simultaneously test a server and a service. How do we pull our existing rich cache of test automation into the services space?"

For server (on-prem):
5000 machines in test labs
70,000 automated test cases run multiple times a day on these machines

Reuse and extend our existing infrastructure. Exchange will remain one codebase. We are one team and will not have a separate service engineering team or service operations team.

Solution: TiP. Run tests from Azure.

You get: Availability, Performance

Ref:
Experiences of Test Automation; Dorothy Graham; Jan 2012; ISBN 0321754069; chapter "Moving to the Cloud: The Evolution of TiP, Continuous Regression Testing in Production"; Ken Johnston, Felix Deschamps

16

Fault Injection
Chaos Monkey

Latency Monkey

Game Day

Chaos Gorilla

Another example that does not quite look like test cases: operational fault injection. This is another type of Active Validation. It injects synthetic faults to disrupt service operation, to test system fault tolerance (assuming the system was designed to be fault tolerant!).
Chaos Monkey: April 2011 example
Simian Army: June 2012 example
Amazon Game Day

--------------------------------------
Other Notes:
Netflix is a streaming video service hosted on the Amazon AWS Cloud. Available in both North and South America, the Caribbean, United Kingdom, Ireland, Sweden, Denmark, Norway, Finland.

Chaos Monkey / Simian Army: It started with their "Chaos Monkey", a script deployed to randomly kill instances and services within their production architecture. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables. Then they took the concept further with other jobs with similar goals: Latency Monkey induces artificial delays, Conformity Monkey finds instances that don't adhere to best practices and shuts them down, Janitor Monkey searches for unused resources and disposes of them. April 2011 outage: they stayed up. June 2012 outage: Chaos Gorilla should have prepared them to survive it, but did not. Chaos Gorilla, the Simian Army member tasked with simulating the loss of an availability zone, was built for exactly this purpose. This outage highlighted the need for additional tools and use cases for both Chaos Gorilla and other parts of the Simian Army.
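(The Chaos Monkey idea in miniature, as a hedged Python sketch rather than Netflix's actual tool: pick one running production instance at random and terminate it. It assumes AWS credentials, the boto3 library, and a hypothetical opt-in tag.)

```python
import random
import boto3  # AWS SDK for Python

def release_the_monkey(region="us-east-1", opt_in_tag="chaos-monkey-enabled"):
    """Randomly terminate one opted-in running instance."""
    ec2 = boto3.client("ec2", region_name=region)
    # Only instances explicitly tagged as opted in are eligible targets.
    reservations = ec2.describe_instances(
        Filters=[{"Name": f"tag:{opt_in_tag}", "Values": ["true"]},
                 {"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])  # the injected fault
    return victim  # log it so on-call can correlate the alarms that follow
```

The point of the exercise is that the surviving system, not the script, is what is under test.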

Amazon Game Day: an entire DC is taken down. Announced in advance, but few services opt out. Service owners are alert, but mostly not worried; Amazon services are designed for this.

Refs for Chaos Monkey:
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
http://techblog.netflix.com/2011/07/netflix-simian-army.html
June 2012 outage: http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html

Refs for Amazon Game Day: there really aren't any, but this post mentions it: http://devops.com/2011/03/08/

17

Synthetics Have Risks

Who's Got the Monkey?
Who's Got the Monkey Now?
Monkey

Continuing our theme of monkeys: fault injection has some obvious risks, but even less intrusive synthetic transactions carry risks. The monkey story (below) illustrates some risks of synthetics on business metrics and reporting, and on changing the shape of production data.
-----------------------------------

Other risks of synthetics:
Service Operation
Direct User Experience
Partner Services
Security
Cost

http://thedailywtf.com/Articles/Ive-Got-The-Monkey-Now.aspx

1999 was a big year for Harvard Business School Publishing. In the past few years, they had seen their business model (selling books, journals, articles, case studies, and so forth) transform from being entirely catalogue-based to largely web-based, and it had finally come time for a major re-launch of their website. HBSP's new website was slick. On top of a fairly advanced search system, the re-designed site also featured community forums and a section called Ideas @ Work, which let users download audio broadcasts from influential business thinkers from around the world. And best of all, despite the rapid development schedule, scope creep, and all of the new bells and whistles, the new site actually worked. In the height of the dot-com era, not too many other sites could claim the same.

One key contributor to the success of Harvard Business School Publishing's new website was its extensive testing and QA. Analysts developed all sorts of test cases to cover virtually every aspect of the site. They worked closely with HBSP's logistics department to make sure the tests (searching, fulfillment, account management, etc.) were run. And not just run, but run often. This aggressive testing strategy ensured that the site would function as intended for years to come. That is, until that one day in 2002. On that day, one of the test cases failed: the Single Result Search.

The Single Result Search test case was part of a trio of cases designed to test the system's search logic. Like the Zero Result Search case, which had the tester enter a term like "asdfasdf" to produce no results, and the Many Results Search case, which had the tester enter a term like "management" to produce pages of results, the Single Result Search case had the tester enter a term (specifically, "monkey") to verify that the system would return exactly one result. And for three years, "monkey" returned exactly one result: Who's Got the Monkey? (full article text) by William Oncken Jr. Written in 1974, Oncken's article is for managers who find themselves running out of time while their subordinates are running out of work. As for the monkeys, they're just an analogy for work, not who managers should outsource work to. Apparently, Oncken wasn't that ahead of his time.

In any case, on that day in 2002, the "monkey" search returned two results. The first, as expected, was Who's Got the Monkey?. The second result was something to the effect of Who's Got The Monkey Now?, which was an update to HBSP's run-away best seller, Oncken's 1974 Who's Got the Monkey?. It seemed obvious: the Single Result Search test case just needed to be updated. But then they looked into the matter a bit further.

As part of the aggressive testing strategy mentioned earlier, the HBSP logistics team would fill their down time by executing test cases. First they'd run through the Zero Result Search test, then the Many Result Search test, then the Single Result Search. Then they'd add that single result (Who's Got the Monkey?) to their shopping cart, create a new account, submit the order, and then fulfill it. Of course, they didn't actually fulfill it; everyone knew that orders for "Mr. Test Test" at "123 Test St." were not to be filled. That is, everyone except the marketing department.

When HBSP's marketing department analyzed the sales trends, they noticed a rather interesting trend: Oncken's 1974 Who's Got the Monkey? was a run-away best seller! And like any marketing department would, they took the story and ran. HBSP created pamphlets and other distillations of the paper. They even repackaged those little plastic cocktail monkeys as official Who's Got the Monkey monkeys. And finally, sometime in 2002, the updated version of Who's Got the Monkey? was posted to HBSP, which was then picked up by the searching system, which, in turn, caused the Single Result Search test case to fail.

Of course, by this point, there was little anyone could do. The fictional success of Who's Got the Monkey had already been widely publicized as reality. And with all the subsequent write-ups (many of which are still around to this day), it may have very well become a best-seller. Needless to say, HBSP has since changed their aggressive testing policy. Some details of the story have been redacted to protect the guilty. Thanks to the two anonymous sources working at HBSP for the inside scoop, and news archives for the rest.

18

Synthetics Have Risks

More risks:
Xbox story
Amazon story
Mitigations: data tagging, filtering, and clean-up

Xbox: Obviously a negative experience, as the user is confused and may think they have been charged (they were not). This was only a handful of users. Xbox has implemented clever mitigations, such as only using UUIDs outside the range used by valid Xbox users.

Amazon: This is a negative user experience because the user is trying to find actual items to purchase, an intent not served by exposing test data as shown in this example. The exposure of such data ironically creates a sense of "immaturity" or lack of quality. The poor experience becomes worse if a user purchases such an item. It may be reasonable to have such test data on the site transiently, but it should be removed after testing is complete.

Mitigations include:
Data Tagging
Data Cleaning
Data Filtering
Transaction Stubbing
Transaction Pre-validation
Transaction Throttling

19

You've learned a lot already. A quiz: A = Active, P = Passive. Answers can be subject to argument; there are gray areas.
Provides insight into real usage
Reproducible and well understood scenarios
Covers a vast variety of environments
Requires proper handling of Personally Identifiable Information (PII)
May adversely alter production and production data
Answers: P P P A A

20

Experimentation

Exposure Control
Dogfood and Beta
A/B Testing
"To have a great idea, have a lot of them."

- Thomas Edison

To understand the power of TiP, it is illustrative to understand the power of experimentation. Experimentation is a passive validation methodology: try new things in production, build on successes, and cut your losses before they get expensive. A/B testing assigns users to one of multiple experiences, which are then compared. Dogfood and Beta are a bit different: users opt in to trying a not-yet-released version. Both use Exposure Control, which limits who sees the new code, mitigating risk through limited exposure of new code.
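(Exposure control is commonly implemented with deterministic bucketing, so a returning user always lands in the same experience. A minimal sketch; the 10,000-bucket scheme and the names are illustrative, not any particular product's implementation.)

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_fraction: float = 0.01):
    """Deterministically bucket a user into treatment or control."""
    # Hash user + experiment name so buckets are independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000
    return "treatment" if bucket < treatment_fraction * 10000 else "control"
```

Raising treatment_fraction widens exposure; setting it back to zero is the kill switch.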

Controlled Experimentation
Uncontrolled Experimentation

One way FB experiments: three concentric push phases.
p1 = internal release
p2 = small external release
p3 = full external release
Ref: http://framethink.blogspot.com/2011/01/how-facebook-ships-code.html

21

Experimentation at Google
"dice and slice in any way you can possibly fathom"

1% launches

Shadow launches

1/3
2/3

"1% launches" - Eric Schmidt. Slice and dice is about the data. Design decisions, and also service quality.

Shadow Launches: status packets, billions of packets per day. The service was launched, but users could not see it.
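(The mechanics of a shadow launch can be sketched as traffic mirroring: every live request is also sent to the launched-but-invisible service, and its responses are discarded. The hostnames here are hypothetical.)

```python
import concurrent.futures
import urllib.request

pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def handle_request(path: str) -> bytes:
    """Serve from vCurr; shadow the same request to vNext."""
    pool.submit(mirror_to_vnext, path)  # fire-and-forget copy to the shadow
    with urllib.request.urlopen(f"https://vcurr.example.com{path}") as resp:
        return resp.read()              # only vCurr's answer reaches the user

def mirror_to_vnext(path: str) -> None:
    try:
        urllib.request.urlopen(f"https://vnext.example.com{path}", timeout=5)
    except Exception:
        pass  # in a real system, log this: shadow failures are the signal we want
```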

At Microsoft we used experimentation to assess how often decisions were good. Decision makers were experts. CLICK:
1/3 achieved some degree of the desired goal
1/3 had no significant effect (this is an important result that many do not consider)
1/3 had the opposite of the desired effect
Experimentation lets you quantify the good ones and weed out the bad ones.

--------------------------------------

1% launches: "dice and slice in any way you can possibly fathom" - Eric Schmidt. Ref: How Google Fuels Its Idea Factory, BusinessWeek, April 29, 2008; http://www.businessweek.com/magazine/content/08_19/b4083054277984.htm
Somewhat famously this is used for design decisions: "design philosophy was governed by data and data exclusively" (Douglas Bowman, former Visual Design Lead - http://stopdesign.com/archive/2009/03/20/goodbye-google.html)
Slice and dice what? The data; it's a data-driven decision.

Shadow Launches. Ref: Seattle Conference on Scalability: Lessons In Building Scalable Systems, Reza Behforooz; http://video.google.com/videoplay?docid=6202268628085731280 @6:55
Google Talk presence packets: ConnectedUsers x BuddylistSize x OnlineStateChanges = billions of packets per day. Everything was happening, but nothing was displayed to users.

At Microsoft, an evaluation of decisions tested with experimentation found the 1/3 vs. 2/3 split above (1/3 achieved the goal; 2/3 did not). Ref: http://blog.clicksnconversions.com/intuition-sucks-%e2%80%93-that%e2%80%99s-why-we-test/

22

Experimentation at Netflix

1B API requests per day

Canary Deployment. Let's look at an example of experimentation more directly tied to traditional software quality assessment.

Netflix is a streaming video service hosted on the Amazon AWS Cloud, available in both North and South America, the Caribbean, United Kingdom, Ireland, Sweden, Denmark, Norway, Finland. 1B API requests = Big Data. Blue is Vcurr; a smiley face represents customer traffic carried on that (virtual) server. Red is Vnext. [click] Netflix spins up Vnext in the cloud carrying no user traffic. [click] They then put one red/Vnext server live carrying user traffic (the canary) and let it run to test code quality. [click] They then switch user traffic to red/Vnext servers but keep blue/Vcurr ones around while they run overnight and check for problems. [click] Finally, if all is well with Vnext, they release the Vcurr resources.

Typical problem found: memory leak. Move all users to Vnext and let it bake; that is big data.
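(A minimal sketch of the canary step, not Netflix's tooling: route a small fraction of live traffic to one Vnext server and promote only if its health matches the Vcurr fleet. The fraction and tolerance are assumptions.)

```python
import random

VCURR_FLEET = [f"vcurr-{n}" for n in range(1, 10)]  # carries almost all traffic
CANARY = "vnext-1"                                  # one new-version server
CANARY_FRACTION = 0.01                              # ~1% of requests

def route(request_id: int) -> str:
    """Send a small slice of real user traffic to the canary."""
    if random.random() < CANARY_FRACTION:
        return CANARY                   # watch its errors, latency, and memory
    return random.choice(VCURR_FLEET)

def canary_healthy(canary_error_rate: float, fleet_error_rate: float) -> bool:
    """Promote Vnext only if the canary is no worse than the current fleet."""
    return canary_error_rate <= fleet_error_rate * 1.05  # 5% tolerance, assumed
```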

Although not truly random and unbiased, there is still value here, especially for seeing large changes.

http://perfcap.blogspot.com/2012/03/ops-devops-and-noops-at-netflix.html

Joe Sondow, Building Cloud Tools for Netflix
Slides: http://www.slideshare.net/joesondow/building-cloudtoolsfornetflix-9419504
Talk: http://blip.tv/silicon-valley-cloud-computing-group/building-cloud-tools-for-netflix-5754984

23

A brief look at data science:
Beware of averages
Even small data is useful data
Science!
Watch out for sample bias

Data science is becoming more important for testers to know (tester as data scientist). Not going to spend a lot of time on basics like median, mean, standard deviation, or linear regression; I assume you know those or can look them up later. Here we will cover some of the more interesting TECHNIQUES and GOTCHAS, ones you won't find in a beginner stats course. Won't explain them here; will illustrate them on the following slides. Plus the tools of Big Data.

24

How many years have you worked in software?

This is one of the first computers I ever used. I have been working in software for 19 years. Survey the audience; get 5 answers (samples). This is to illustrate the Rule of 5. The median is the point with equal population above and below it. The median years in software has a 93.75% chance of being between the min and max samples surveyed (among the 5 taken). ***** Power of small data sets. ****** Explain sample bias.

25

93.75%
How many years have you worked in software?

50%, 25%, 12.5%, 6.25%, 3.125%

Sample Bias: how representative are you?

Chance of neither all 5 above nor all 5 below the median (5 coin flips, twice):
100% - 3.125% - 3.125% = 93.75%

The median is the point with equal population above and below it. The median years in software has a 93.75% chance of being between the min and max samples surveyed (among the 5 taken). Explain why the Rule of 5 works. Explain sample bias.

Explaining why the Rule of 5 works: a value has a 50% chance of being above the median, the same as the chance of heads on a coin flip. All 5 values above the median? That's 5 heads, or 3.125%. Neither all 5 values above the median nor all 5 below it: 100 - (2 x 3.125) = 93.75%.
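The same argument as a calculation:

```latex
\begin{align*}
P(\text{all 5 samples above the median}) &= (1/2)^5 = 3.125\% \\
P(\text{all 5 samples below the median}) &= (1/2)^5 = 3.125\% \\
P(\text{median between sample min and max}) &= 1 - 2(0.03125) = 93.75\%
\end{align*}
```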

Sample bias:
1. Median years in software among testers
2. Median years in software among testers attending Test Bash - would be the same as 1 if we knew Test Bash attendees were a representative sample
3. Median years in software among testers attending Test Bash who are willing to volunteer such info [self-selection bias]

Modeling: this makes no assumption about the model. By definition, a single observation has a 50% chance of being over or under the median.

26

Averages are Your Enemy
Averaging is a form of lossy data compression: it destroys information!
Take-away: you need to understand your population

The probability density function above contains samples from two distinct populations. For example, these could be different versions of the software, or different user populations: testers vs. real users, different geographic regions.

27

Averages are Your Enemy
Same as the previous slide, just a more complex example: 5 distinct populations.

28

These data have exactly the same summary statistics!
x mean = 9.0, y mean = 7.5
x SD = 3.32, y SD = 2.03
R² = 0.67
Averaging is a form of lossy data compression: it destroys information! Other stats can also be lossy.
Take-away: you need to understand your data model
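(These are exactly the summary statistics of Anscombe's quartet: four datasets that share them while looking completely different when plotted. A quick check, assuming numpy is installed:)

```python
import numpy as np

# Anscombe's quartet: four (x, y) sets with near-identical summary statistics.
x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

for x, y in [(x1, y1), (x1, y2), (x1, y3), (x4, y4)]:
    r = np.corrcoef(x, y)[0, 1]  # correlation; R^2 is its square
    print(f"x mean={x.mean():.1f}  y mean={y.mean():.2f}  "
          f"x SD={x.std(ddof=1):.2f}  y SD={y.std(ddof=1):.2f}  R^2={r ** 2:.2f}")
# All four lines print the same numbers; only plotting reveals the differences.
```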

R² is the coefficient of determination; closer to 1 indicates that a regression line fits the data well. SD is the standard deviation; a low standard deviation indicates that the data points tend to be very close to the mean, while a high standard deviation indicates that the data points are spread out over a large range of values. 1 SD covers 68.27% of the set; 2 SD, 95.4%; 3 SD, 99.73%.

29

Tools of Big Data: Hadoop

HDFS: the file ADDFCAECC is broken into blocks (ADD, FCA, ECC), and each block is stored three times for redundancy.
Map-Reduce

Map output per block: 1xA 1xC 1xF (from FCA); 1xA 2xD (from ADD); 1xE 2xC (from ECC). Reduced: 2xA 0xB 3xC 2xD 1xE 1xF.

Hadoop is a tool for processing large data sets. Processing = what you might do with a SQL SELECT: combine, sort, count.

Imagine this data set of 9 chars is actually 10s of trillions of chars

First we need to store massive amounts of data. Distributed storage: HDFS = Hadoop Distributed File System. Break the file into pieces; each piece is stored multiple times (3) for redundancy.

Then we need to process massive amounts of data: distributed computing. Map-Reduce and similar algorithms (Cosmos uses Dryad). Bring the compute to the data in its split-up form; Map-Reduce can operate on the pieces. The processing is MAPped onto the smaller subsets, and the output of these many operations is then re-combined (REDUCEd) into a single answer.

(Remembering the input is 10s of trillions of chars.) The output is a much smaller file than the input.
------------------------------------------
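(The slide's character count in miniature, as a toy Python sketch of the map-reduce idea, not Hadoop itself:)

```python
from collections import Counter
from functools import reduce

# The 9-char "file" split into HDFS-style blocks.
blocks = ["ADD", "FCA", "ECC"]

# MAP: count characters within each block independently (runs where the data lives).
mapped = [Counter(block) for block in blocks]

# REDUCE: merge the per-block counts into one much smaller answer.
totals = reduce(lambda a, b: a + b, mapped)
print(totals)  # Counter({'C': 3, 'A': 2, 'D': 2, 'F': 1, 'E': 1})
```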

Hadoop is part of a rich ecosystem of tools:
- Hive - data warehouse for Hadoop - http://hive.apache.org/ - query the data using a SQL-like language called HiveQL
- Pig - http://pig.apache.org/ - high-level language for expressing data analysis; a compiler produces sequences of Map-Reduce programs
- Mahout - machine learning library - http://mahout.apache.org/
- Scribe - log aggregation

HDInsight is Hadoop running on Microsoft Azure

Ref:
http://www.windowsazure.com/en-us/manage/services/hdinsight/
http://hadoop.apache.org/

30

Tools of Big Data: Cosmos

Processing 2 PB per day
On tens of thousands of computers
Stores hundreds of petabytes

Cosmos is similar to Hadoop. It is Microsoft-internal. The numbers are impressive.
-----------------------------------------------

Data drives search, advertising, and all of Microsoft:
Web pages: links, text, titles, etc.
Search logs: what people searched for, what they clicked, etc.
IE logs: what sites people visit, the browsing order, etc.
Advertising logs: what ads people click on, what was shown, etc.
Social feeds from Twitter & Facebook
Service telemetry: Office 365, Hotmail (not emails), MSN

Picture is a modularized Container of servers used in Microsoft Data Centers

Refs:
It stores hundreds of petabytes of data on tens of thousands of computers. Large-scale batch processing using Dryad, with a high-level language called SCOPE on top of it.
The Bing Big Data Platform - Ken Johnston; Big Data Innovation Summit 2013, Las Vegas: Process 2 PB per Day. Data drives search, advertising, and all of Microsoft.
http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf

31

"We know we can't anticipate the 101 things that will go wrong. The only thing we can control is ensuring our team responds appropriately to those situations." - Jerry Hook, Executive Producer, Halo
Hundreds of thousands of requests per second
Hack using modified Xbox
Halo Servers

Xbox Console

SharePoint
Power Pivot

Action
Target

"But even if you're following the law, you can do things where people get queasy."

HDInsight

Hadoop on Azure

HDInsight. These are Spartans from Halo. A headless Spartan cannot be killed. They should not exist, but they did. How did the Halo team find this bug and eliminate it? CLICK: HDInsight, Hadoop running on Azure - **** Hadoop as a service. CLICK: Halo had the data, but it was overwhelming. CLICK: Using the data and HDInsight, they found the bug in production and eliminated it. Headless Spartan: an unofficial mod, which can only be applied using a modified Xbox 360. Almost impossible to find pre-release, but in production they can find and eliminate it. CLICK: Here is a brief overview of what they did. From hundreds of low-wage, low-skilled testers to millions of free, highly skilled customers. CLICK: They also found less obvious bugs and cheats; it is all hidden in the data. Tell the Target story. CLICK: Reveal the quote - Target statistician Andrew Pole.
---------------------------------------------

Target Story: Target, a store in the US like Tesco in the UK. http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
Target sent a teenage daughter baby supply coupons. Target apologized, then called weeks later to apologize again. The father admitted that, unbeknownst to him at the time, his daughter was pregnant. The following purchases may indicate a woman is pregnant with a boy:
Cocoa butter lotion
A large purse
Zinc and magnesium supplements
A bright blue rug
"But even if you're following the law, you can do things where people get queasy." - Target statistician Andrew Pole. "Started mixing in all these ads for things we knew pregnant women would never buy, so the baby ads looked random. We'd put an ad for a lawn mower next to diapers."

Ref:
http://www.microsoft.com/en-us/news/features/2012/oct12/10-31halo4.aspx
http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000002102

32

Availability (y) over time (x)
Predict 75% of dips 24 hours ahead of time

Data / Machine Learning / Cosmos

Another Big Data example is Microsoft Exchange Online. ***** They can predict 75% of availability issues ahead of time. Big Data from over 8000 servers instrumented to collect 1000 metrics, processed by COSMOS. CLICK ***** Using ML they can *PREDICT* 75% of outages ahead of time.
---------------------------------------------------------

PBs of data collected, such as:
Availability
Latency
Errors
Perf counters: CPU, Memory, etc.
Lots of servers; instrument them all and you get lots of data. PBs; how can we process all that? Cosmos. Machine learning is about fitting your data to a model. Think about simple linear regression, y = mx + b; it is like that, but it can get much more advanced.
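(A toy illustration of fitting data to a model, nothing like the real Exchange Online system; the data and the alert threshold are invented. Fit a line to recent availability samples and extrapolate 24 hours ahead.)

```python
import numpy as np

hours = np.arange(48)  # x: the last 48 hourly availability samples
availability = 99.99 - 0.002 * hours + np.random.normal(0, 0.003, 48)  # fake data

m, b = np.polyfit(hours, availability, deg=1)  # least-squares fit of y = m*x + b
predicted = m * (hours[-1] + 24) + b           # extrapolate 24 hours ahead

if predicted < 99.9:  # assumed alert threshold
    print(f"predicted availability {predicted:.3f}%: raise the alert now")
```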

33

Passive

Active

The best observations are often in production.
Testing: a set of observations to reduce uncertainty about the quality of a system under test.

Testing is... We can use Passive and/or Active techniques to get those observations. Production is where we can find some of the best observations.

Using either Passive or Active Validation, we obtain data, which we use to calculate metrics, which are used to drive actions.

About the quality of the product

34

blog: http://bit.ly/seth_qa
Do It In Production

Testing Where It Counts
[email protected]

me: @setheliot

?

Thank you!
Contact Info
Seth Eliot
[email protected]
Twitter: @setheliot
Blog: http://bit.ly/seth_qa

35