AKAMAI WHITE PAPER

Handling Flash Sales: DevOps Strategies for Traffic Mitigation

Holiday Readiness White Paper

Author: Dominic Lovell, Senior Enterprise Architect, Global Consulting Services, Akamai


From managing a traffic spike during traditional holiday sales, through to the advent of daily deals sites, e-commerce retailers are increasingly taking advantage of large, short-lived flash sales. Some retailers now offer branded sales days, such as Amazon’s Prime Day and Alibaba’s Singles’ Day. In fact, Alibaba garnered $17.8 billion in sales from the 24-hour sale in 2016 [1], and some analysts suggest that Amazon may have earned up to $600 million in sales from its 30-hour sale in 2017 [2]. Whether it’s El Buen Fin in Mexico, the lead-up to the Diwali festival in India, Cyber Monday in the United States, or Boxing Day in Australia, there’s an increasing demand for online retailers around the world to provide fast and consistent experiences across their online channels during these peak sales periods.

This article aims to provide recommendations for holiday readiness, including minimizing the impact from traffic surges by utilizing techniques through a CDN, segmenting traffic, and automating the scale and visibility of your traffic patterns to support an organization and its DevOps teams. It covers the following concepts:

• Insight into the preparation needed for flash sales

• Understanding what website and application limits are before your sales period

• Segmentation options toward managing a large spike of traffic

• Ensuring a consistent experience throughout a sales period

• Reacting to traffic patterns or performance degradation in real time

• Collecting real-time metrics and having a snapshot for future events

Part 1 - Understanding our Traffic

The lead-up to any good sales cycle means we must spend time preparing for these types of events months in advance. A good starting point is a baseline from previous years, and the trends seen in current traffic patterns. As a part of this process, we must ask ourselves a series of questions on the who, what, when, and how of our sale campaign.

We can start by answering any of these questions:

• Do we have any post-mortem or success stories from last year’s sale period we can review?

• Can we analyze previous years’ traffic, through Google Analytics or other tools?

• Can we review traffic patterns from this year, and ask ourselves: What does the Year-On-Year (YOY) or Month-On-Month (MOM) traffic look like?

• What is our social media following like compared with last year? Do we have a larger social presence that could affect avenues of traffic to our site?

• How many email addresses do we have in our CRM, and what are our typical click-through rates for these types of campaigns?

• How much of our in-store traffic is supporting our online channel?

• Do we understand the demographic of our holiday shoppers? Are they baby boomers who look online to find a deal before shopping in a store? Or are they the 22% of millennials that plan to not set foot inside a retail store? [3]

• What sort of geographic traffic do we have? Are we catering to an audience across different countries or time zones, which may impact the waves of traffic that occur on the site?

• Are we aware of what the split of traffic is between desktop and mobile?

• What kind of devices usually access our site? Are they high-end devices, or low-end devices that might not be as responsive?

• Have we ensured we are mobile friendly?

• Are we running any TV or radio commercials that may lead to a spike in traffic at certain times of the day?

• How are we planning on distributing our sales URLs to our customers? Email, social, TV, or other channels?

• Can we segment our customers into groups such as repeat purchasers, highest-basketed customers, and other key value segments, whether through third-party analytics or other A/B targeting tools that allow us to cookie or differentiate our high-value customers?

These questions should give us a high-level understanding of where our users are based, the size of our audience, how they are accessing our site, across what devices, and at what times of the day, across each geography.

Part 2 - Preparation Before the Event

Step 1 - Caching - The Fundamentals

As a part of any good sale strategy, we should start with the fundamentals. Start by asking yourself, “Are we caching appropriately, and do we have good offload?” If your CDN allows you to report on or segment traffic across static and non-static content, you have an opportunity to determine how much offload you have, and whether you need any caching improvements. Importantly, you want to ensure your generic upsells, cross sells, and calls to action are all cached to an appropriate degree. This content, which may appear non-cacheable initially, can usually be cached and offloaded for some period of time. Caching matters because a 100-millisecond delay in website load time can hurt conversion rates by 7% [4]. As a general recommendation, you should follow these rules of thumb:

• Static content, such as images, CSS, JS files - High TTL values of 7 days or more

• Semi-dynamic non-personalized content, such as login pages - Medium TTL of 1 day

• Personalized content - One to many - Low TTL of 1 hr

• Dynamic content, such as price and inventory - Low TTL of 10 minutes

• Personalized content – One to one (e.g., product recommendations) - Zero seconds TTL
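The rules of thumb above can be written down as a small lookup table, which makes them easy to review and test. The sketch below is only an illustration: the content class names and the `cache_control` helper are hypothetical, and mapping a zero TTL to `no-store` is an assumption. Many CDNs set TTLs through their own configuration rather than through origin headers.

```python
# TTLs in seconds, taken from the rules of thumb listed above.
# The class names are hypothetical labels chosen for this sketch.
TTL_BY_CLASS = {
    "static": 7 * 24 * 3600,            # images, CSS, JS files: 7 days
    "semi_dynamic": 24 * 3600,          # non-personalized pages: 1 day
    "personalized_one_to_many": 3600,   # shared personalization: 1 hour
    "dynamic": 10 * 60,                 # price and inventory: 10 minutes
    "personalized_one_to_one": 0,       # per-user content: not cached
}


def cache_control(content_class: str) -> str:
    """Return a Cache-Control header value for a content class."""
    ttl = TTL_BY_CLASS[content_class]
    if ttl == 0:
        # Assumption: zero TTL is expressed as "do not store at all".
        return "no-store"
    return f"public, max-age={ttl}"


if __name__ == "__main__":
    for name in TTL_BY_CLASS:
        print(f"{name:28s} {cache_control(name)}")
```

Keeping the policy in one table means a change to a TTL is a one-line, reviewable edit rather than a setting scattered across templates.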

Step 2 - Landing Page Optimization

As of 2017, SOASTA found that although a promotional campaign may be a visitor’s first experience of a site, these types of pages often perform more poorly than other pages. In fact, they discovered that on average, load times were 30–60% slower than regular pages [5]. There were several factors uncovered, such as page bloat, too many tracking tags and third-party scripts, no performance testing beforehand, and a rush to get these landing pages out the door. Noting this, we must ensure that our landing pages and campaign content are all heavily integrated into our test and launch pipeline. We must also review the weight and load times to identify if there are any performance bottlenecks on these pages and address those concerns up front.

Step 3 - Ensuring we are Mobile Friendly

Akamai has found that time spent on mobile devices dominates time spent on the web, with 60% of total time spent on mobile. Akamai predicts that by 2021, 90% of data delivered over the web will be delivered to a smartphone. Google also reported in 2017 that we are in the age of the micro-moment, where smartphone users are 50% more likely to expect to purchase something immediately when using their smartphone, compared to a year earlier [6].

With this in mind, developers and designers must ensure that sites are mobile friendly and support the device types reviewed in Part 1. Google recommends sites are implemented with Responsive Web Design (RWD) techniques [7] to ensure users receive the same experience across desktop and mobile. One of the biggest RWD concerns we see at Akamai is a site’s inability to scale large images down to the appropriate size and resolution for the mobile device. Often mobile devices are served desktop-sized images in the browser, which leads to an over-delivery of bytes and unnecessary processing required from the device to scale the image to fit onto the viewport.
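One way to avoid over-delivering image bytes is to pre-generate a few renditions and serve the smallest one that still covers the device’s viewport. The following is a minimal sketch of that selection step. The rendition widths and the `pick_rendition` function are hypothetical, and it assumes the viewport width and device pixel ratio are known to whatever component chooses the image.

```python
import math

# Hypothetical set of pre-generated image widths, in physical pixels.
RENDITION_WIDTHS = [320, 480, 768, 1024, 1600]


def pick_rendition(viewport_css_px: int, device_pixel_ratio: float = 1.0) -> int:
    """Pick the smallest rendition that covers the viewport at full density."""
    needed = math.ceil(viewport_css_px * device_pixel_ratio)
    for width in RENDITION_WIDTHS:
        if width >= needed:
            return width
    # Nothing is large enough, so fall back to the largest rendition.
    return RENDITION_WIDTHS[-1]


if __name__ == "__main__":
    print(pick_rendition(375, 2.0))   # a 375 px wide phone at 2x density
    print(pick_rendition(1920, 1.0))  # a desktop wider than any rendition
```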


Part 3 - Understanding our Limits

Once optimal caching strategies are in place, and pages have been optimized for the types of devices likely to access your sale site, we can work out the limits of the existing system. While we might consider autoscaling the answer to an influx in traffic, autoscaling comes with its own challenges: provisioning too many instances, choosing the right instance types, and leaving idle instances running. There’s a fine balance between autoscaling and overconsumption of resources. Unfortunately, just because you met last year’s demands does not necessarily mean you will be able to handle this year’s traffic. The data collected in Part 1 supports an understanding of recent traffic, common traffic patterns, and points of entry. This data allows you to perform load and stress tests against the site, and helps demonstrate the limits of the underlying infrastructure.

The primary goal is to understand the max capacity of the system, to ensure no downtime, but also to understand at what point the application starts to slow down. In fact, website slowdowns are 10 times more frequent than outages, according to research conducted by SOASTA. A slow-loading page could have twice the impact on revenue of a site failure alone [8].

At a minimum, the following types of tests are recommended:

• Spike tests - Short bursts of high load

• Peak load tests - Simulations of the busiest periods throughout the day

• Soak tests - Tests against long periods of high load

• Max capacity tests - Understanding where the application hits its hard limits

SOASTA, using its CloudTest tool, recommends the following load levels when testing your site:

• 10 virtual users - Smoke tests to initially test performance

• 100 virtual users - Can the back end handle load with minimal traffic?

• 1,000 virtual users - Does the application perform well now, or is it waiting for connections, data, and responses?

• Max virtual users - Growing the test to a point to meet expected peaks

• 2x, 3x, or 10x max virtual users - To perform stress tests and understand where the application may break under extreme traffic scenarios

Figure 1 - Load testing strategies
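The load levels listed above can be turned into an ordered test plan once an expected peak is known. The sketch below is a hedged illustration only: the `load_test_plan` function and the stage names are hypothetical and are not part of CloudTest or any other load testing tool.

```python
def load_test_plan(expected_peak_users: int,
                   stress_multipliers=(2, 3, 10)) -> list[tuple[str, int]]:
    """Build an ordered list of (stage name, virtual users) pairs."""
    fixed_stages = [("smoke", 10), ("minimal", 100), ("moderate", 1000)]
    # Skip fixed stages that would already meet or exceed the expected peak.
    plan = [(name, users) for name, users in fixed_stages
            if users < expected_peak_users]
    plan.append(("peak", expected_peak_users))
    # Stress stages push past the expected peak to find the breaking point.
    plan += [(f"stress_{m}x", expected_peak_users * m)
             for m in stress_multipliers]
    return plan


if __name__ == "__main__":
    for stage, users in load_test_plan(5000):
        print(f"{stage:12s} {users:>7d} virtual users")
```

Running the stages in this order keeps the cheap smoke test first, so obvious problems surface before the expensive stress runs.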


If you are unaware of your max users, or max requests per second (RPS), you should start by looking at the limits of your web servers, based on the CPU and memory constraints available on your current infrastructure. For example, NGINX has released some tangible metrics behind its limits on RPS and connections per second (CPS) on live HTTP and HTTPS connections [9]. Keep in mind that these types of requests will only be for personalized or highly dynamic content that cannot be cached at Edge, such as cart or checkout information. Having prepared optimal cache settings previously will ensure you have offloaded as much as possible to your CDN.
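The relationship between offload and origin load is simple arithmetic: only the non-offloaded share of requests reaches your servers. The sketch below works through it with made-up numbers. The function names, the per-server RPS figure, and the 25% headroom are all assumptions for illustration, not measured values.

```python
import math


def origin_rps(peak_edge_rps: float, offload_ratio: float) -> float:
    """Requests per second that still reach origin after CDN offload."""
    if not 0.0 <= offload_ratio <= 1.0:
        raise ValueError("offload_ratio must be between 0 and 1")
    return peak_edge_rps * (1.0 - offload_ratio)


def servers_needed(origin_load_rps: float, per_server_rps: float,
                   headroom: float = 0.25) -> int:
    """Servers required if each runs at (1 - headroom) of its tested limit."""
    usable_rps = per_server_rps * (1.0 - headroom)
    return math.ceil(origin_load_rps / usable_rps)


if __name__ == "__main__":
    # Illustrative figures only: 20,000 RPS at the edge with 75% offload.
    load = origin_rps(20_000, 0.75)
    print(load, servers_needed(load, per_server_rps=1_000))
```

With these example inputs, 5,000 RPS reach origin, and seven servers are needed if each is kept at 75% of a 1,000 RPS limit.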

Once load tested, we ask ourselves these questions:

• Does our application meet our response time goals?

• After peak traffic, does the application recover gracefully and continue to operate normally?

• At what point does the application stop meeting our response-time goals?

• Do the servers begin to generate errors or refuse connections at any point?

• Do our servers crash during any of these virtual user tests, and have we found our maximum threshold?
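Several of the questions above come down to finding the load level at which the application first misses its goals. A minimal sketch of that check follows. The result format (`virtual_users`, `p95_ms`, `error_rate`), the sample figures, and the default 1% error budget are assumptions made for this example, not the output of any particular tool.

```python
from typing import Optional


def first_breach(results: list[dict], p95_goal_ms: float,
                 max_error_rate: float = 0.01) -> Optional[int]:
    """Return the lowest virtual-user level that misses a goal, or None."""
    for run in sorted(results, key=lambda r: r["virtual_users"]):
        too_slow = run["p95_ms"] > p95_goal_ms
        too_many_errors = run["error_rate"] > max_error_rate
        if too_slow or too_many_errors:
            return run["virtual_users"]
    return None


if __name__ == "__main__":
    # Made-up sample results, one entry per load test run.
    sample = [
        {"virtual_users": 100, "p95_ms": 420, "error_rate": 0.000},
        {"virtual_users": 1000, "p95_ms": 900, "error_rate": 0.002},
        {"virtual_users": 5000, "p95_ms": 2600, "error_rate": 0.004},
        {"virtual_users": 10000, "p95_ms": 7100, "error_rate": 0.080},
    ]
    print(first_breach(sample, p95_goal_ms=2000))
```

The level returned here is a candidate threshold for the traffic management rules discussed in Part 4.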

Part 4 - Managing Traffic at Scale

1. Managing the Initial Spike of Traffic

Many large retailers start to advertise their Cyber Monday sales campaigns before the official Monday [10]. In a 2016 survey, Rubicon found that 22% of shoppers started researching Cyber Monday early [3]. With customers eager to get in and grab a sale early, an uptick in traffic can be difficult to predict and manage.

On the other hand, offering sales early also gives your competitors an opportunity to price match or make a more competitive offer on similar products. Having a strategy in place to allow peak traffic through the doors at scale will ensure a competitive edge over rivals, and ensure your customers receive the best possible experience when visiting the flash sale for the first time. Therefore, any bargain hunters who arrive early should be redirected to a landing page with a message stating when the sale period will begin.

Figure 2 - Peak user demographics


2. Opening the Doors at the Right Time - A Rolling Sales Window Approach

Figure 3 - Monitoring global traffic

Analyzing traffic patterns in Part 1 should provide insight into traffic sources, times of day frequented, and location of our users. This insight offers a segmentation strategy for deciding when traffic should be allowed to access any new sales pages. In this approach, any EDM campaigns or social messaging should also follow a model where you only advertise to the relevant countries or groups of users at staggered times.

For example, countries which hit morning first, such as Australia and New Zealand, can receive targeted emails during their mornings on the day of the sale. Where appropriate, these users could be segmented further by state, as time zones differ between states, and East and West Coast cities are several hours apart. Similarly, social networks such as Facebook allow you to specify targeted ads based on time zones or the country of your users, to ensure only targeted users will see the sales posts on their wall.

Other such time-based targeting could rely on your users’ habits, such as targeting users on mobile devices early in the morning, as you know these users convert the best during this time. In fact, Google found that in 2017 smartphone users are significantly more likely to purchase from companies whose mobile sites or apps customize information to their location [11]. If you can understand your users’ routines, you can target your traffic around these influxes and determine who should see the sales messages at what time.

Once the sale is advertised, an audience segmentation tool will allow you to ensure only users in the relevant location or targeting segments will see the newly advertised sales. In the event people share links or forward emails, users outside these segments or country locations that have not yet hit the rolling sales window should only see a “sale coming soon” message, or be redirected to an appropriate landing page on site asking them to check back soon when the sale is opened for their segment. These landing pages should reside in cloud file storage, so they are decoupled from your backend infrastructure, and cached on your CDN, so any incoming traffic does not affect users who are currently transacting on site.

As the day progresses, traffic from earlier segments or countries in earlier time zones will continue to increase, until they hit their late evening, when traffic from these segments will usually slowly drop off. At this point, you can choose to target these users back to a landing page once their midnight occurs, or continue to allow these users to transact on the site (in which case you should leverage techniques discussed in the next section).

Through the use of appropriate segmentation cookies, users who are given access to the site should continue to have access even after their sales window closes, for the life of their session, to ensure users currently browsing can still check out. As other countries and geos come online, you should leverage automated processes within your CDN to push this traffic to the site and allow different segments to view sale content.
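The rolling sales window described above can be sketched as a single decision: is it already sale day, after opening time, in this visitor’s local time, or do they already hold a session from an earlier window? The example below is a simplified illustration. The sale date, opening time, country codes, and fixed UTC offsets are all hypothetical, and a real implementation would use proper time zone data (and likely state-level segmentation) rather than one offset per country.

```python
from datetime import date, datetime, time, timedelta, timezone

# Hypothetical sale date and local opening time for every segment.
SALE_DATE = date(2017, 11, 27)
OPEN_LOCAL = time(8, 0)

# Simplification: one fixed UTC offset per country, ignoring DST and states.
SEGMENT_UTC_OFFSET_HOURS = {"NZ": 13, "AU": 11, "US": -5}


def sale_is_open(country: str, now_utc: datetime,
                 has_session_cookie: bool = False) -> bool:
    """Decide whether a visitor sees sale content or the holding page."""
    if has_session_cookie:
        # Visitors admitted earlier keep access for the life of the session.
        return True
    offset = SEGMENT_UTC_OFFSET_HOURS.get(country)
    if offset is None:
        return False  # unknown segment: show the "coming soon" page
    local = now_utc.astimezone(timezone(timedelta(hours=offset)))
    return local.date() == SALE_DATE and local.time() >= OPEN_LOCAL


if __name__ == "__main__":
    now = datetime(2017, 11, 26, 19, 30, tzinfo=timezone.utc)
    for country in ("NZ", "AU", "US"):
        print(country, sale_is_open(country, now))
```

At the example timestamp it is 08:30 on sale day in New Zealand, so that segment is open, while Australia (06:30) and the United States (the previous afternoon) still see the holding page.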


3. Ensuring a Consistent Experience

Using the previously discussed rolling sales window technique, there will be a period throughout the day where two thirds of all geos have access to sales content. During this time, you can prioritize the types of users on site as well as the content available to certain visitors.

The first prioritization technique should determine a metric, which is considered a high priority across all users on site, such as users with items in their cart. This metric can then be used to leverage a visitor prioritization mechanism that ensures that users with items in cart are always given precedence when browsing on site. For example, users who have items in cart can have a cookie flagging them as purchasers that are allowed full access to the site. Other targeting options may be UTM query parameters for certain click-through URLs, or users who have been cookied previously who are returning users and will likely check out as repeat customers. Others may fall into a bucket of yes or no users based on a configurable setting. The yes users get unrestricted access to the site based on a set window of time established in a cookie, and the no users receive a holding page, which tells them they are in the queue to gain access to the site. On this holding page, you could show a video, or offer users a coupon, to ensure they are still engaged on site.

Using the load testing and process in Part 3, capacity limits, predefined autoscale parameters, and the hard limits imposed within the current infrastructure are now all known. These metrics can then be used as feedback signals to your CDN to load balance between servers under high load, or to shift traffic around the infrastructure. By prioritizing which users get access to the infrastructure, you can ensure the right type of traffic flows through to your checkout and conversion pages. Other users who may be showrooming, and do not progress through the conversion funnel, are not given priority on the site.

Figure 4 - Prioritizing which traffic gets access to the backend infrastructure
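A visitor prioritization rule of this kind can be sketched as follows. This is an illustration under stated assumptions, not how any particular CDN implements it: the `admit` function and its parameters are hypothetical, and the hash-based bucketing is one possible way to give each visitor a stable yes or no answer for a configurable admission percentage.

```python
import hashlib


def admit(visitor_id: str, has_cart_items: bool, returning_customer: bool,
          admit_percent: int) -> bool:
    """Return True for full site access, False for the holding page."""
    # High-priority visitors always get through.
    if has_cart_items or returning_customer:
        return True
    # Everyone else lands in a stable bucket from 0 to 99 derived from
    # their ID, so repeat requests from one visitor get the same answer.
    digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < admit_percent


if __name__ == "__main__":
    print(admit("visitor-123", True, False, admit_percent=0))
    print(admit("visitor-123", False, False, admit_percent=0))
    print(admit("visitor-123", False, False, admit_percent=100))
```

The `admit_percent` setting is the natural place to plug in the capacity limits found during load testing: lower it as the origin approaches its threshold, and raise it as load subsides.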

The other prioritization technique allows you to prioritize content rather than users. You should define endpoints within your system, or API data that can only be accessed under certain conditions. For example, if your site is known to have a large bot footprint, and these bots contribute to the overall load of your system, this may affect the amount of resources available to users who are looking to complete legitimate sales. You also do not want to have these bots index flash sale content, as the sale period will be finished by the time the content appears in search indexes. Therefore, you can target Autonomous System Numbers (ASNs) or values such as User-Agent strings, to serve bots cached content or redirect them elsewhere on site. Using this API prioritization technique allows all other users to see up-to-date content and access sales information in real time.
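The bot handling described above amounts to a routing decision based on the User-Agent string and the ASN of the request. The sketch below shows the idea only. The token list is deliberately crude, the `route_request` function is hypothetical, and the ASN used is a placeholder from the range reserved for documentation rather than a real network.

```python
# Illustrative substrings that often appear in crawler User-Agent strings.
BOT_UA_TOKENS = ("bot", "crawler", "spider")

# Placeholder ASN from the documentation-reserved range (64496 to 64511).
BOT_ASNS = {64496}


def route_request(user_agent: str, asn: int) -> str:
    """Return "cached" for suspected bots and "origin" for everyone else."""
    ua = user_agent.lower()
    if asn in BOT_ASNS or any(token in ua for token in BOT_UA_TOKENS):
        return "cached"
    return "origin"


if __name__ == "__main__":
    print(route_request("Mozilla/5.0 (compatible; Googlebot/2.1)", 15169))
    print(route_request("Mozilla/5.0 (iPhone; CPU iPhone OS 11_0)", 7018))
```

User-Agent strings are trivially spoofed, so a check like this is a first filter at best; dedicated bot detection would be needed for anything adversarial.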


Even with the best of planning and foresight, there may be times when your infrastructure goes down or begins to behave unexpectedly, particularly under times of increased stress, such as during a flash sale. In this case, ensuring a consistent branding experience is important. Users that receive a “503 Service Unavailable” message, a database error, or worse, just a white screen with no explanation, may leave and not return during the sale period. This is why a failover mechanism must be in place, to fail over to a landing page or a static copy of your site. In such an event, you should either continue to serve cached content or serve a static copy of the site from a cloud file storage location. Checkout and other transactional areas of the site should be disabled, or redirected to a landing page or other location that allows users to continue to browse and see product information while on site. Once the application has recovered or been brought back up, the switch back can be performed automatically by the CDN, or via a manual switch configured ahead of time, to allow users to access the back end and transactional pages again.
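The failover behavior described above can be expressed as a small decision function. This is a hedged sketch only: the path prefixes, the `/sale-paused.html` landing page, and the return labels are hypothetical, and in practice this logic would live in CDN configuration rather than in application code.

```python
from typing import Optional

# Hypothetical transactional paths that cannot be served from a static copy.
TRANSACTIONAL_PREFIXES = ("/cart", "/checkout", "/account")


def choose_response(path: str, origin_status: Optional[int]) -> str:
    """Decide how to answer a request given the origin's latest status.

    origin_status is the HTTP status from origin, or None on a timeout.
    """
    origin_healthy = origin_status is not None and origin_status < 500
    if origin_healthy:
        return "origin"
    if path.startswith(TRANSACTIONAL_PREFIXES):
        # Transactions are paused while origin is down.
        return "redirect:/sale-paused.html"
    # Browsing continues from the static copy held in cloud file storage.
    return "static-copy"


if __name__ == "__main__":
    print(choose_response("/products/42", 200))
    print(choose_response("/products/42", 503))
    print(choose_response("/checkout/payment", None))
```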

Using these feedback mechanisms, coupled with Visitor and API prioritization techniques, you can ensure you give your customers a consistent experience. In this approach, users have predefined opportunities to gain access to sales content, and once in and transacting onsite, have prioritized access to checkout. Other users, or third-party access such as bots and scrapers, will only see older cached content, and not contribute to the load on the system. In cases where the system has been overloaded, a failover mechanism ensures a consistent branding experience is still available, even when your infrastructure has hit a bottleneck or is unavailable. One of the key successes behind a DevOps strategy is having systems in place that will automatically perform tasks that would otherwise take time and manual intervention to roll out. Coupling the metrics around scale and performance, with the APIs necessary to prioritize and scale traffic, will ensure that these mechanisms are operating without the need for manual intervention.

4. Reacting to Traffic Patterns and Performance Issues

Many online retailers are starting to recognize that uptime isn’t the only metric that plays into consumer satisfaction, and that performance needs to be at the helm of holiday readiness. This has been supported by research from Akamai [12] that suggests that sites that experienced downtime, on average, saw a permanent abandonment rate of 9%, whereas sites that suffered from slow performance experienced a 28% permanent abandonment rate.

Load metrics and frontend performance play a vital role in increasing conversions. Google has found that if page load time increases from one second to seven seconds, the probability of a mobile site visitor bouncing increases by 113% [13]. In fact, Google also reports that more than half of all users will abandon a mobile site if it takes more than three seconds to load, and another 20% will drop off for every second of delay [14]. Therefore, it is critical to have real-time measurements in place to capture load times, and automated processes in place to scale in response to these traffic patterns.

A real-time operations dashboard is necessary to capture insight into users and how traffic is performing. This includes real-time alerts into traffic distributions, errors, and the general performance seen across all geos. If one geo or segment of traffic is underperforming, or does not meet performance goals, it may be necessary to intervene and shift traffic or disable third-party services that may be affecting user experience.

This is why Real User Monitoring (RUM) systems are necessary to capture load metrics in real time and provide insight into whether traffic segments are performing more poorly than usual. For example, by breaking down user data by geography, SOASTA found that users in Australia were used to slower load times compared with visitors in the United States, and were less likely to bounce if a site was loading slowly. As such, they recommend prioritizing audiences who are much more sensitive to page render and load times.

It is important to understand what these thresholds are ahead of time, so these rules can be integrated into the automated DevOps triggers.

For example, these metrics could include:

• Backend response times (or Time To First Byte)

  ◦ Is your caching and offload strategy working effectively?

• Frontend response times

  ◦ Are there any third-party resources or particular frontend assets that are causing a bottleneck?

  ◦ Do you have mechanisms in place to disable such services if necessary?

• Error response codes from our application

  ◦ What is the overall percentage of 4xx or 5xx errors reported from your infrastructure?

  ◦ Can you implement transaction traces to determine which requests were problematic?

  ◦ Can you capture real-time metrics from the CDN?

  ◦ Do you have mechanisms to trace problematic transactions in the application?

• What is the general sentiment across your social channels?

  ◦ Are you seeing one or two users complaining, or do you have a larger issue at play?

Some RUM systems have the ability to provide webhooks or alerting systems so DevOps teams can be made aware when these thresholds are hit. For example, a webhook could be set up to alert developers in Slack that a geo is receiving a poor user experience based on slow load times on the front end. These webhooks could also feed back into a DevOps workflow to shift the amount of traffic to a particular cluster or availability zone. Developers should be made aware of these automated changes so they can decide whether this is something they should continue to monitor, or whether there is a larger issue they will need to rectify in real time.
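A threshold alert of this kind might look like the sketch below. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the choice of a 75th-percentile load time, and the threshold are assumptions for this example, and the webhook URL must be supplied by the caller.

```python
import json
import urllib.request
from typing import Optional


def build_alert(geo: str, p75_load_ms: float,
                threshold_ms: float) -> Optional[dict]:
    """Return a Slack message payload if the geo breaches the threshold."""
    if p75_load_ms <= threshold_ms:
        return None
    return {
        "text": (f"Slow page loads in {geo}: {p75_load_ms:.0f} ms "
                 f"(threshold {threshold_ms:.0f} ms)")
    }


def post_alert(webhook_url: str, payload: dict) -> int:
    """POST the payload to an incoming-webhook URL; return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    alert = build_alert("AU", p75_load_ms=4200, threshold_ms=3000)
    print(alert)  # call post_alert(url, alert) with a real webhook URL
```

Separating the decision (`build_alert`) from the delivery (`post_alert`) makes the threshold logic testable without network access, and lets the same payload feed a traffic-shifting workflow as well as a chat channel.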

A RUM dashboard, with triggered alerts, provides the quantitative metrics around performance and load times. Coupled with real-time monitoring of qualitative feedback via social channels, this will ensure a site is supported across a full lifecycle of automated traffic and scale, yet also allows DevOps teams to intervene when necessary.

Figure 5 - Monitoring performance across a geo in real time

5. After the Event

After a flash sale has closed its doors, metrics for post-event analysis should be collected. These include max user sessions, peak load times, geographic details of your traffic, offload metrics, aggregate response codes from origin, traffic volume reports, and reports on the most frequented pages. Collect these metrics during the event where possible, and collect key metrics in the days following the event as well. Most reporting systems only retain data for a limited time, so the data may no longer be available when you need it to plan the next flash sale.

Post-event and project notes should be summarized and collated from all project and technical teams as reference for future sales. This will provide insight into any issues or opportunities for improvement for next time, as well as the tasks that worked well. This material should be accessible to the entire team, as different individuals may work on future sales.


Where appropriate, the URLs for the landing pages and sales categories that were set up for this event should redirect to a landing page noting that the sale is over. These landing pages should also link to appropriate category or product pages, or to a location on the site such as the generic sales pages or the homepage. This will ensure users who access these URLs after the sale has ended are sent to a page where they can still browse and check out on the site. To ensure maximum offload, these redirects should be cached at Edge and not served by the backend infrastructure. Landing pages can be hosted on a cloud file storage location where necessary.

Part 5 - Summary

The key to managing a flash sale successfully is preparation. Through traffic analysis techniques, DevOps teams can understand and predict where surges in traffic will arrive from. This analysis can be completed by answering a number of questions around traffic demographics and user behavior on site. This analysis will provide a baseline for any plans on scale and load testing.

Before any load testing can take place, a series of optimizations must be carried out, including a review on the existing caching strategies in place, as well as content optimization for landing pages, to ensure these pages are not overweight and have good load metrics. This content must also be optimized for mobile to leverage users who access content and purchase on the spot.

Once caching is optimal and page content has been optimized, you are now in a position to perform load tests to understand whether the current infrastructure and cache settings will support the estimated traffic. Several types of load tests will determine how your system behaves under certain conditions. The outcomes of these tests provide metrics that can be used to automate the scale of user traffic.

Understanding your traffic demographics gives you an opportunity to segment users by country, time zone, device type, and other factors. This allows you to manage initial traffic spikes by not having to pre-launch sales campaigns or perform silent launches, and allows all users into the system during the key sales period. A rolling sales window is one such technique that segments users by location and targets when and how the users can access sales content during the sale period.

Other segmentation techniques allow you to prioritize who receives ongoing access to the site, by prioritizing certain users, such as those with items in their cart. Failover planning also ensures a consistent branding experience in the event that your infrastructure goes down. These techniques, when used in conjunction with your load test metrics, ensure your application infrastructure only receives the level of traffic it can support at any one time.

Through the use of RUM systems, automated triggers can fire webhooks or other alerts that can then feed back into the automated traffic management workflow. Alerts and notifications of these automated processes will allow DevOps teams to manually intervene when necessary. Real-time RUM dashboards provide a global snapshot of user satisfaction and application health.


Once finished, sales metrics collected for post-analysis reports will ensure this event provides insight for future teams and an opportunity to optimize again in the future.

In summary, once initial traffic analysis and content optimization have been completed, traffic management can be automated with segmentation techniques and real-time feedback on end-user performance to ensure application infrastructure can support a large influx of traffic. Automating these processes provides a seamless and stress-free way for DevOps teams to support large sales, huge peaks of traffic, and irregular traffic patterns across a site.

We will be glad to assist you if you have further questions or would like to discuss the best option for your use case. Please reach out to your Akamai account team or email [email protected].

Sources

1) https://www.forbes.com/sites/franklavin/2016/11/15/singles-day-scorecard-a-day-in-china-now-bigger-than-a-year-in-brazil/#615aad921076

2) http://www.practicalecommerce.com/amazon-prime-day-2017-smashes-sales-record

3) http://go.rubiconproject.com/rs/958-XBX-033/images/Rubicon%20Holiday%20Poll%202016%20--%20FINAL.pdf

4) https://www.soasta.com/press-releases/akamai-online-retail-performance-report-milliseconds-are-critical/

5) https://www.soasta.com/blog/marketing-campaign-performance-optimization/

6) https://www.thinkwithgoogle.com/consumer-insights/micro-moments-consumer-behavior-expectations/

7) https://developers.google.com/search/mobile-sites/mobile-seo/responsive-design

8) https://www.soasta.com/blog/downtime-vs-slowtime/

9) https://www.nginx.com/blog/testing-the-performance-of-nginx-and-nginx-plus-web-servers/

10) https://www.fool.com/investing/2016/11/25/what-wal-mart-target-and-amazon-are-doing-on-cyber.aspx

11) https://www.thinkwithgoogle.com/consumer-insights/micro-moments-consumer-behavior-expectations/

12) https://www.soasta.com/blog/downtime-vs-slowtime/

13) https://www.thinkwithgoogle.com/intl/en-aunz/advertising-channels/mobile/au-mobile-page-speed-new-industry-benchmarks/

14) https://www.thinkwithgoogle.com/advertising-channels/mobile/consumer-behavior-mobile-digital-experiences/

As the world’s largest and most trusted cloud delivery platform, Akamai makes it easier for its customers to provide the best and most secure digital experiences on any device, anytime, anywhere. Akamai’s massively distributed platform is unparalleled in scale with over 200,000 servers across 130 countries, giving customers superior performance and threat protection. Akamai’s portfolio of web and mobile performance, cloud security, enterprise access, and video delivery solutions are supported by exceptional customer service and 24/7 monitoring. To learn why the top financial institutions, e-commerce leaders, media & entertainment providers, and government organizations trust Akamai please visit www.akamai.com, blogs.akamai.com, or @Akamai on Twitter. You can find our global contact information at www.akamai.com/locations. Published 10/17.