Rescuing Resque
Posted on 14-Sep-2014
DESCRIPTION
Overview of the backgrounding architecture changes made at PeopleAdmin, which were a big win for our team and company.
TRANSCRIPT
![Page 1: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/1.jpg)
Backgrounding Overhaul: Overview and Results
![Page 2: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/2.jpg)
Credit where credit is due
● This has been a cross-team effort
  ○ Development
  ○ QA
  ○ Operations
  ○ L3
● Lots of people have helped
● This includes management (no suck-up)
![Page 3: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/3.jpg)
What are background jobs?
● Tasks to be performed in the background (duh)
● May be handed off by the web
● May be handed off by other jobs
● May be scheduled at regular intervals
● Are typically expensive
![Page 4: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/4.jpg)
At PeopleAdmin backgrounding is...
● Resque (Ruby API)
● Redis (Middleware)
● Jobs are put in queues
● Workers look at queues for work
● Workers are grouped into pools
● We have 1 pool per worker server
● We have many worker servers
● Resque scheduler puts jobs into queues at their scheduled run time
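To make the job/queue/worker vocabulary concrete, here is a minimal sketch in plain Ruby. It is not PeopleAdmin's code and it does not use the Resque gem: the `QUEUES` hash stands in for Redis, and `EmailJob`, `enqueue` and `work_one` are hypothetical names. It only mirrors the general Resque shape, where a job class names its queue and implements `perform`, and a queue holds JSON payloads with the class name and arguments.

```ruby
require "json"

# In-memory stand-in for Redis: queue name => list of JSON payloads.
# (Assumption for illustration; a real setup uses the Resque gem and Redis.)
QUEUES = Hash.new { |hash, name| hash[name] = [] }

# A job is a class that names its queue and implements `perform`.
class EmailJob
  @queue = :emails
  SENT = []

  def self.perform(recipient)
    SENT << recipient
  end
end

# Enqueue: serialize the job class name and arguments onto the job's queue.
def enqueue(job_class, *args)
  queue_name = job_class.instance_variable_get(:@queue)
  payload = JSON.generate("class" => job_class.name, "args" => args)
  QUEUES[queue_name].push(payload)
end

# Worker step: take the oldest payload from a queue and run it.
# Returns false when the queue is empty.
def work_one(queue_name)
  payload = QUEUES[queue_name].shift
  return false unless payload

  job = JSON.parse(payload)
  Object.const_get(job["class"]).perform(*job["args"])
  true
end

enqueue(EmailJob, "applicant@example.com")
work_one(:emails)
```

The web request only pays for the `enqueue` call; the expensive work happens later, in whichever worker picks the payload up.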
![Page 5: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/5.jpg)
So what do we use backgrounding for?
![Page 6: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/6.jpg)
EVERYTHING
![Page 7: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/7.jpg)
Specifically...
● Transitions of postings, applications, hiring proposals, etc...
● Emails
● Keyword indexing (search)
● Import jobs
● Export jobs
● Report generation (EEO)
● Employment task lifecycle
● Onboarding task lifecycle
● Marketplace integrations (job boards, background checks)
● Chore notifications
● Clearing cached data
● Promoting changes between customer environments
● Employer stats collection
● Et cetera
● Et cetera
● Et cetera
![Page 8: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/8.jpg)
So, uh... If everything relies on this, wouldn’t changes be dangerous?
![Page 9: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/9.jpg)
YES
But we are smart and daring (sometimes)
![Page 10: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/10.jpg)
So what were/are the problems?
● Visibility
● Performance
● Job Contention
● Technology limitations
● Technology reliability
● Deployment interruption
● Others...
![Page 11: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/11.jpg)
No Visibility
● Resque was a black box
● Operations, L3 & Development had no view into production
● Ability to diagnose problems was limited
● Also had no way to know if we were creating more problems
![Page 12: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/12.jpg)
No Visibility
● Instrumented jobs with Splunk
● Gave us sophisticated querying ability and graphing of results
● Gave us view into life of each job
● Allowed view into usage patterns, time in queue, time to perform and other metrics
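As a rough illustration of this kind of instrumentation, the sketch below wraps a job and writes one key=value log line with its time in queue and time to perform. The wrapper name `run_instrumented`, the field names and the log format are assumptions made for the example, not the instrumentation actually used; the key=value layout is chosen only because log indexers such as Splunk can typically extract fields from it.

```ruby
require "stringio"

# Hypothetical wrapper: runs the job block and emits one log line per job.
# `clock` is injectable so the example is deterministic.
def run_instrumented(job_name, enqueued_at, log: $stdout, clock: -> { Time.now })
  started_at = clock.call
  yield
  finished_at = clock.call

  time_in_queue = started_at - enqueued_at     # seconds spent waiting
  time_to_perform = finished_at - started_at   # seconds spent working
  log.puts format(
    "event=job_finished job=%s time_in_queue=%.3f time_to_perform=%.3f total=%.3f",
    job_name, time_in_queue, time_to_perform, time_in_queue + time_to_perform
  )
end

# Usage with a fake clock: enqueued at t=100, started at t=105, finished at t=108.
log = StringIO.new
ticks = [Time.at(105), Time.at(108)]
run_instrumented("EmailJob", Time.at(100), log: log, clock: -> { ticks.shift }) { :send_email }
puts log.string
```

With one such line per job, questions like "which job types wait longest in queue?" become a query over log data rather than guesswork.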
![Page 13: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/13.jpg)
No Visibility
![Page 14: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/14.jpg)
Performance
● Perceived performance is time in queue + time to perform
● Some individual jobs were particularly slow to perform
  ○ emails
  ○ system events
● These affected the system as a whole
![Page 15: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/15.jpg)
Performance
● Emails & system events targeted for performance improvements
● Perform time for emails down from 23 seconds to 9 seconds
● Perform time for system events down from 32 seconds to 8 seconds
![Page 18: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/18.jpg)
Job Contention
● Non-prod jobs interfered with production jobs
● So we separated prod & non-prod queues
● Still have a few issues...
![Page 20: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/20.jpg)
Job Contention
● Jobs of different types in the same queue would contend for workers
● So we reallocated jobs into fine-grained queues
![Page 21: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/21.jpg)
Technology Limitations
● Resque & Resque-Pool work, but are simple
● We are not simple
  ○ Multiple customers
  ○ Multiple groups
  ○ User activity dynamics
  ○ Flood possibility
● Best illustrated by example...
![Page 22: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/22.jpg)
Technology Limitations
Queues, in priority order from left to right: Keyword Indexes, Emails, Imports
● Jobs enter the queues
● Workers prioritize queues from left to right
● A worker proceeds down the list of queues until it finds a job to be processed
● If no jobs are available, the worker starts back at the left of the list
![Page 29: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/29.jpg)
Technology Limitations
Queues, in priority order from left to right: Keyword Indexes, Emails, Imports (diagram: a backlog of jobs in one queue, three workers)
● Sometimes we get floods of jobs
● Workers are dumb: they always start at the left and move right
● Queues with a lower priority than the flooded queue get lonely
● Net result is a customer waiting while a job sits in a queue
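The starvation effect can be reproduced with a small simulation. This is a sketch under stated assumptions, not Resque's implementation: `reserve` is a hypothetical helper that scans queues in a fixed order, the flood is assumed to hit the highest-priority queue, and every job is assumed to take one "round" for each of three workers.

```ruby
# Queues in priority order, leftmost first (names taken from the slides).
PRIORITY = [:keyword_indexes, :emails, :imports]

# Scan the queues left to right and take the first job found.
def reserve(queues, order)
  order.each do |name|
    job = queues[name].shift
    return [name, job] if job
  end
  nil
end

# Assumed scenario: a flood of 11 keyword-index jobs and one waiting email.
queues = {
  keyword_indexes: Array.new(11) { |i| "index-#{i}" },
  emails: ["welcome-email"],
  imports: []
}

# Three workers each take one job per round, all with the same fixed order.
served = []
4.times do
  3.times do
    picked = reserve(queues, PRIORITY)
    served << picked.first if picked
  end
end

email_position = served.index(:emails) + 1
puts "email served as job number #{email_position} of #{served.size}"
```

Because every worker uses the same order, adding workers only drains the flooded queue faster; the email is still served last.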
![Page 34: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/34.jpg)
Technology Limitations
● There was no existing solution to this problem within the Resque ecosystem.
● Our options
  ○ Migrate to a different technology
  ○ Contribute enhancements to our current technology
● We opted for the latter (Qtrix)
![Page 35: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/35.jpg)
Technology Limitations
Qtrix says, “Your priority is…”
● Our central Qtrix orchestrator tells each worker what its queue priorities are
● Workers are still dumb, but the lists are intelligently shuffled
● Every queue is the top priority of at least one worker
● Higher priority queues appear to the left more often than lower priority queues
Priority lists handed to Workers 1, 2 and 3 in the example:
● Keyword Indexes, Emails, Imports
● Emails, Imports, Keyword Indexes
● Imports, Keyword Indexes, Emails
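One way to produce lists with these two properties is sketched below. This is not Qtrix's actual algorithm, which the slides do not describe; the weights, the helper names `weighted_shuffle` and `priority_lists`, and the seeded random generator are all assumptions for illustration. The first workers are each pinned to a different head queue, and the remaining positions are filled by a weighted shuffle so heavier queues tend to land further left.

```ruby
# Hypothetical weights: a larger weight means a higher priority.
WEIGHTS = { keyword_indexes: 3, emails: 2, imports: 1 }

# Weighted shuffle: repeatedly draw a queue with probability proportional
# to its weight, without replacement.
def weighted_shuffle(weights, rng)
  remaining = weights.dup
  order = []
  until remaining.empty?
    target = rng.rand * remaining.values.sum
    pick = remaining.keys.last
    remaining.each do |name, weight|
      target -= weight
      if target < 0
        pick = name
        break
      end
    end
    order << pick
    remaining.delete(pick)
  end
  order
end

# Build one priority list per worker. The first workers are each pinned to a
# different head queue, so every queue is the top priority of some worker
# (as long as there are at least as many workers as queues).
def priority_lists(weights, worker_count, rng: Random.new(42))
  names = weights.keys
  Array.new(worker_count) do |i|
    if i < names.size
      head = names[i]
      rest = weights.reject { |name, _| name == head }
      [head] + weighted_shuffle(rest, rng)
    else
      weighted_shuffle(weights, rng)
    end
  end
end

priority_lists(WEIGHTS, 3).each_with_index do |list, i|
  puts "Worker #{i + 1}: #{list.join(', ')}"
end
```

The workers themselves stay as simple as before: each still scans its own list left to right. Only the lists differ, so a flood in one queue can no longer occupy every worker while other queues wait.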
![Page 41: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/41.jpg)
Technology Limitations
Qtrix also gives us...
● The ability to create different priority configurations for different scenarios
● The ability to change to those configurations on the fly
● The ability to script these changes in reaction to different events
● The ability to have this work elastically
We are not taking advantage of all of these things yet…
![Page 43: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/43.jpg)
Technology Reliability
● Redis is memory bound
● Resque would leave a mess
● Redis was a single point of failure
● Solutions
  ○ Automated memory cleanup
  ○ Added Redis AOF backups
  ○ Added data replication but not failover (yet)
![Page 44: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/44.jpg)
Deployment Interruption
● Jobs would be terminated
● Jobs would sit idle while workers restarted
● Scheduler would go down and scheduled execution times would be missed
● Ditto for employer method jobs, plus hung locks
![Page 45: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/45.jpg)
Deployment Interruption
● Now…
  ○ All jobs finish gracefully
  ○ There is no delay during which jobs are not getting worked (includes employer method jobs)
  ○ Scheduler is not brought down during deploys
  ○ Employer method job locks are still a problem
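The idea behind "jobs finish gracefully" can be sketched as a worker loop that only checks for shutdown between jobs. This is an illustrative model, not the deployment tooling described in the talk: `GracefulWorker` and `request_shutdown` are hypothetical names. Resque workers offer comparable behaviour through a QUIT signal, which lets the current job finish before the worker exits.

```ruby
# Minimal worker loop that checks for shutdown only between jobs,
# so a job that has started always runs to completion.
class GracefulWorker
  attr_reader :completed

  def initialize(queue)
    @queue = queue
    @shutdown = false
    @completed = []
  end

  # Safe to call from a signal handler: it only flips a flag.
  def request_shutdown
    @shutdown = true
  end

  def run
    until @shutdown
      job = @queue.shift
      break unless job

      @completed << job.call(self)
    end
  end
end

queue = [
  ->(_worker) { :first },
  ->(worker) { worker.request_shutdown; :second }, # deploy arrives mid-job
  ->(_worker) { :third }
]

worker = GracefulWorker.new(queue)
# In a real process the flag would be set by a signal, for example:
#   Signal.trap("QUIT") { worker.request_shutdown }
worker.run
puts "completed=#{worker.completed.inspect} left_in_queue=#{queue.size}"
```

The job that was running when the shutdown request arrived still completes, and the untouched job stays in the queue for a worker on the new release to pick up.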
![Page 46: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/46.jpg)
And here we are...
We have gained
● Diagnostic ability
● Performance metrics
● Better performance
● Less long-term & catastrophic risk
● Lowered resource needs
● Lower customer pain
Still issues
● Redis is a single point of failure
● Resque scheduler reliability
● Scaling elastically
● Tidying up
![Page 47: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/47.jpg)
The numbers tell the story
Since June...
● Total time waiting on jobs decreased 31%
  ○ SystemEventWorker time decreased 72%
● Total time jobs enqueued decreased 68%
  ○ Production jobs enqueued time decreased 74%
● Redis memory use decreased ~70%
● “Stuck jobs” during floods decreased 100%
● Eliminated 1 worker server
![Page 48: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/48.jpg)
Thanks!
● For the opportunity to work on these fun, challenging problems
● For the help along the way
● For the trust to be allowed to work unrestrained
● For the patience & understanding when things didn’t go according to plan
![Page 49: Rescuing Resque](https://reader031.vdocument.in/reader031/viewer/2022013117/541562f38d7f72336c8b467c/html5/thumbnails/49.jpg)
Questions?