Synergy 2015 session slides: SYN408 XenDesktop 7.6 Architecture - Dealing with Failure

© 2015 Citrix

• Why – This information isn’t useful without explaining why – I will spend no more than half the speaking time on this – You don’t need to write anything down, just try to grasp my message

• What – Some examples – Actual architecture and things you can do

• Also – I will finish with at least 10 minutes for Q&A – I respond to email and Twitter

• I was the Practice Manager for Workforce Mobility at Presidio, which is a great company and Citrix partner. One of my accomplishments there was the Atlanta Public Schools XenApp/XenDesktop 7 deployment for 50,000 students (one of the first large XenDesktop 7 deployments from a partner). I honestly wanted to do more and joined Ericsson earlier this year as a Consulting Manager. I could list buzzwords like DevOps, OpenStack, CI/CD, SDN and NFV, but in reality I currently help customers align their entire deployment pipeline (including software development) with how their company produces value.

• Failures can stop business flow and cost companies money. If you’ve ever worked in Operations, you might think that its sole job is to prevent failures above anything else. To add to this thought, we consume better hardware every year and expect stable performance. Why do newer phones seem to have battery life issues and problems making calls? It’s amazing that I can grab a cell phone from 10 years ago and it will last all day on a charge. I had a Volkswagen Beetle that still runs, and we seriously can’t make data center hardware reliable? We can shorten this philosophy to five nines: 99.999% uptime seems to be on every CIO’s wishlist for any architecture today.
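
To put a number on that wishlist, here is a rough sketch of the arithmetic behind "five nines". It assumes a 365-day year and counts every minute as in scope; the function name is just for illustration.

```python
# Downtime allowed per year at a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years (assumption)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year that the target permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows {minutes:.1f} minutes of downtime per year")
```

At 99.999% the budget works out to a little over five minutes for the whole year, which is less time than most storage failovers or reboots take.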

• Failure is a tough thing to avoid or predict. We really should be looking at things a different way. I also realize that many of us have different roles and may think we don’t have a say in this. I disagree: if you can relate anything back to business value, you will get people’s ears or, at worst, a better job.

• Let’s walk through a hypothetical here. Our customer or end user can’t get to the desktop, and we find out the desktop can’t pull the profile data from the storage server. In fact, our storage appears to have failed! “Never mind the details!” says the Director or CIO. “We need this fixed now. We need to ensure storage does not fail again!”

• Let’s get a storage expert in here! The solution is a new or upgraded SAN with better performance, more reliability and a promise that it will not fail, or your money back (terms and conditions apply!). The problem with this solution is that it confuses eliminating a problem with finding a solution. It does not address the underlying cause.

• Could this have been the storage driver? How does SAN uptime prevent that? What if it’s just space/performance/latency?

• Just because the desktop failed when storage did doesn’t mean that storage is the cause.

• You are now forever justifying this fix (could you honestly admit it was wrong if you found out?). Also, how’s the SAN fabric looking?

• From the Netflix tech blog: One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture.

• If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.

• Rambo architecture: each component can survive failures of the other components it depends on.

• If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.

• http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html

• http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
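
The "degrade but still respond" idea from the recommendations example can be sketched in a few lines of Python. This is only an illustration, not Netflix’s code: `personalized_picks`, `titles_for` and the fallback list are hypothetical names, and the recommender is hard-wired to fail so the fallback path is visible.

```python
from typing import Callable, List

# Hypothetical fallback data; a real system would cache this somewhere local.
POPULAR_TITLES = ["Popular title A", "Popular title B"]


def personalized_picks(user_id: str) -> List[str]:
    """Stand-in for a call to the recommendations system (always down here)."""
    raise ConnectionError("recommendations system is down")


def titles_for(
    user_id: str,
    recommender: Callable[[str], List[str]] = personalized_picks,
) -> List[str]:
    """Return personalized picks when possible, popular titles otherwise."""
    try:
        return recommender(user_id)
    except Exception:
        # Degrade the quality of the response, but still respond.
        return POPULAR_TITLES


print(titles_for("user-1"))
```

The caller never sees the dependency failure; it only gets a less tailored answer. Killing the dependency on purpose, Chaos Monkey style, is how you find out whether that fallback path really works.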

• Automation is key. A few years ago Aaron Parker had a session where he asked how people are not automating; there is no reason not to. I did not automate much back then, but I do now! If you do something twice, you need to automate it. Humans are not ideal at repetitive entry, but computers are. Chef or Puppet is something you should look into if you haven’t yet. Also, our focus should ultimately be on cycle time. Finally, the concept of immutable servers is also a worthwhile solution: treat servers like inkjet printers, since it’s often easier to just replace them.

• http://www.thoughtworks.com/insights/blog/rethinking-building-cloud-part-4-immutable-servers

• Let’s talk about architecture for XenDesktop 7.6 and how we survive failure. Think of having desktops and apps still run despite necessary components failing. A better way to focus this is to evaluate what end users can tolerate. Surprisingly, I’ve found most handle logoffs better than slow performance. People often don’t report a logoff if they can log back in, but when a print job takes 30 minutes or longer, you can be assured of a ticket.

• Another session, SYN502, discussed issues with SMB, folder redirection and newer technologies. I’m still a fan of redirection, but mainly for Documents and file data, not for Desktop or AppData. This is one of the biggest areas to tackle for failure issues: in my earlier example, it was the profile that failed, causing the desktop to not load. I have seen profile replication in failover scenarios where one data center is primary for one set of users, while the other is primary for another set. End user feedback is important to get this issue resolved: is it worth extra hardware and slowness because people use the Desktop as their My Documents? Usually not.

• For more info see Synergy 2015 - SYN502: I’ve got 99 problems, and folder redirection is every one of them (Helge Klein, Sean Bass, Aaron Parker)

• Did I mention how easy it is to scale later using cheap hardware, storage, compute?

• Perhaps take out APS refs in the picture?

• For HA we should always add another PVS server with a SEPARATE vDisk store (you can mix SAN, local disk, etc. here).

• If we leave DHCP alone, we add a point of failure where target devices may fail to boot. You can use Windows Server 2008 R2 or 2012 to provide split scope, or use a more redundant solution such as BlueCat or Infoblox.

• PXE and TFTP are another point of HA concern; you can only provide true HA with a hardware load balancer. I often do NOT provide HA for TFTP, but if you have a hardware load balancer there is no reason not to. PXE will load the bootstrap which, if your PVS servers are not specified in it, won’t work (you need to add them).

• Use mirroring with SQL if you can. It’s great, and clustering doesn’t really protect you from issues such as the storage failing! If your storage will never ever fail then that’s awesome, but keep in mind I can use local storage and mirroring and get pretty much the same benefits, well, except for the feeling of spending tons of money. Clustering helps you update SQL nodes one at a time while keeping SQL up. That generally is not something I do, but I do recommend mirroring.

• Mirroring with automatic failover requires a witness server, a third server that doesn’t do anything other than help with the quorum (SQL deciding which server is primary). If you set this up and lose both the secondary and the witness, the primary will stop. I often put my witness
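
The quorum behavior described above can be written down as a small truth table. This is a simplified model of the rule, not SQL Server’s actual implementation; the function names and the `synchronized` flag are assumptions made for illustration.

```python
def principal_keeps_serving(sees_mirror: bool, sees_witness: bool) -> bool:
    """With a witness configured, the principal needs at least one
    other member (mirror or witness) to keep quorum."""
    return sees_mirror or sees_witness


def mirror_can_take_over(
    sees_principal: bool, sees_witness: bool, synchronized: bool
) -> bool:
    """Automatic failover: the principal is unreachable, the witness is
    still reachable, and the mirror was in sync when the principal vanished."""
    return (not sees_principal) and sees_witness and synchronized


# Losing only the secondary: the primary carries on.
print(principal_keeps_serving(sees_mirror=False, sees_witness=True))
# Losing the secondary AND the witness: the primary stops.
print(principal_keeps_serving(sees_mirror=False, sees_witness=False))
# Losing the primary while the witness is still visible: the mirror takes over.
print(mirror_can_take_over(sees_principal=False, sees_witness=True, synchronized=True))
```

The practical takeaway is that the witness should not share a failure domain with either database server, otherwise one outage can take out two of the three votes.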

• Load balancers are your friend. I reference NetScaler for obvious reasons, but keep in mind there are free Linux-based virtual load balancers that can do some of the work. You don’t have to be a CCIE to figure this stuff out either; there are tons of blogs and walkthroughs out there to guide you through it. That being said, GSLB is a LOT harder than just load balancing internal components.
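
At its core, load balancing internal components is "skip whatever fails its health check". A minimal sketch in Python, assuming a round-robin policy; the class, the server names and the health-check callback are all hypothetical and not how NetScaler is configured.

```python
from itertools import cycle
from typing import Callable, Iterable, Optional


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that fail the health check."""

    def __init__(self, backends: Iterable[str], is_healthy: Callable[[str], bool]):
        self._backends = list(backends)
        self._is_healthy = is_healthy
        self._ring = cycle(self._backends)

    def pick(self) -> Optional[str]:
        # Try each backend at most once per call.
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if self._is_healthy(candidate):
                return candidate
        return None  # every backend is down


down = {"storefront-02"}  # pretend this server failed its monitor
lb = RoundRobinBalancer(
    ["storefront-01", "storefront-02", "storefront-03"],
    is_healthy=lambda name: name not in down,
)
print([lb.pick() for _ in range(4)])
```

Users keep getting a working server as long as one backend passes its check, which is exactly the behavior you want from StoreFront, TFTP or any other component sitting behind a virtual IP.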

• This diagram is actually for application/dev updating, but the theory is the same for different scenarios. We can use blue/green for upgrades, new feature rollouts, etc. Note that we actually snapshot or clone the database, then flip over to the other application set (or data center, database, etc.). If your backups take too long and are too big, this method of updating or rolling out changes is ideal.

• Limiting Downtime – Green/Blue Deployments – Create live replica of database – Duplicate all app nodes with new code/config – Adjust routing to activate new code

• When to Use – You are updating your schema – No object versioned db – No feature flags – Can test the feature outside production – Restoring from a backup is not practical (big data sets) – Plan for the worst case scenario: Oops, my feature blew up

• http://www.slideshare.net/adrianjotto/docker-102-immutable-infrastructure
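
The "adjust routing" step of a green/blue deployment can be sketched as a router that stages the new version on the idle set and flips only after a check passes. This is a toy model in Python with hypothetical names (`Environment`, `Router`, `smoke_test`); the database replica step is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    name: str
    version: str


class Router:
    """Tracks which application set takes live traffic."""

    def __init__(self, live: Environment, idle: Environment):
        self.live = live
        self.idle = idle

    def deploy(self, version: str, smoke_test: Callable[[Environment], bool]) -> bool:
        """Stage the new version on the idle set and flip only if it passes."""
        self.idle.version = version  # duplicate app nodes with new code/config
        if not smoke_test(self.idle):
            return False  # live set untouched: "Oops, my feature blew up"
        self.live, self.idle = self.idle, self.live  # adjust routing
        return True


router = Router(Environment("blue", "1.0"), Environment("green", "1.0"))
router.deploy("1.1", smoke_test=lambda env: True)
print(router.live, router.idle)
```

After a successful flip the previous set still holds the old version, so rolling back is just another routing change rather than a restore from backup.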

2014-08-15

• Limiting Risk – Requires Feature Flags or Sticky LB sessions – Back up your data – All nodes use production database – Route new connections to new nodes

• When to use – No contract-breaking changes to schema – You have object versioned db – You use feature flags – Impractical to test the feature outside prod – Have a full backup of your data & can restore

• http://www.slideshare.net/adrianjotto/docker-102-immutable-infrastructure
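
"Route new connections to new nodes" with sticky sessions could look roughly like the sketch below. It is a Python illustration under assumptions: `StickyRouter`, the `"old"`/`"new"` pool labels and the in-memory session table are hypothetical, and a real load balancer would persist stickiness by cookie or source IP.

```python
class StickyRouter:
    """Keeps existing sessions on their pool; only new sessions see new nodes."""

    def __init__(self) -> None:
        self._sessions: dict = {}  # session id -> pool label
        self.new_nodes_enabled = False  # acts like a feature flag for the rollout

    def route(self, session_id: str) -> str:
        if session_id not in self._sessions:
            # First contact: choose a pool based on the current rollout state.
            pool = "new" if self.new_nodes_enabled else "old"
            self._sessions[session_id] = pool
        # Sticky: a known session always returns to the pool it started on.
        return self._sessions[session_id]


router = StickyRouter()
print(router.route("alice"))   # connects before the rollout
router.new_nodes_enabled = True
print(router.route("alice"))   # stays where she was
print(router.route("bob"))     # new connection lands on the new nodes
```

Because both pools use the production database, turning the flag off again limits the blast radius to the sessions that started while it was on.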

2014-08-15

• Note the right side with 3 SCVMM (Hyper-V) clusters. We use both clusters but can survive the failure of an entire cluster. All the clusters share the same SQL mirror, StoreFront farm and file server for profiles.

• This is one cluster of 2 or more for Hyper-V

• 2 blades do the work (so one blade can fail and my cluster is up). If they both fail, I have another cluster.

• I have 2 of everything

• Don’t skimp on something, make it two or more of EVERYTHING you can.
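
"Two of everything" only helps if what is left can carry the load. A quick sanity check, sketched in Python with made-up capacity numbers (desktops per blade) and a hypothetical function name:

```python
from typing import List


def survives_one_failure(capacities: List[int], peak_load: int) -> bool:
    """True if the peak load still fits after losing the largest single member."""
    if len(capacities) < 2:
        return False  # a single member has nothing to fail over to
    return sum(capacities) - max(capacities) >= peak_load


# Two blades rated for 500 desktops each (illustrative numbers only).
print(survives_one_failure([500, 500], peak_load=400))
print(survives_one_failure([500, 500], peak_load=600))
```

The same check works one level up: treat each cluster as a member and ask whether the remaining clusters can absorb the sessions of the one that failed.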
