Focus, Governance, and Innovation: How LinkedIn Scaled to 3M Jira Issues and 500M Members
LinkedIn scales to 3M issues and 500M members
A tale of process, people, and technology
ARNIE MATZ | DIRECTOR, SOFTWARE ENGINEERING | LINKEDIN
DAN HATA | SENIOR ENGINEERING MANAGER | LINKEDIN
500,000,000+ registered members
138+M United States of America
29+M Brazil
42+M India
8+M Indonesia
4+M Philippines
3+M Malaysia
1+M Singapore
1+M Japan
1+M Korea
32+M China
1+M New Zealand
8+M Australia
23+M UK
14+M France
10+M Germany
10+M Italy
9+M Spain
7+M Netherlands
3+M Belgium
2+M Denmark
2+M Sweden
1+M Ireland
Development Tools: IDE, SCM, Review, CI Pipeline, Artifact Repository, Jira
Jira Business Usage at LinkedIn: Development, Operations, Marketing, Facilities
What did LinkedIn think of Jira in 2015?
Why does my dashboard freeze at 10am?
Jira was fast at my last company.
Why don’t we just build our own Jira?
Why does Jira always crash?
Have we looked into alternatives?
Why is Jira always slow?
Will Jira ever be stable?
One last request……
Please fix Jira as soon as possible Arnie
Chart: Issues Count Growth, 2015-2022
2015 Scary Stuff
Stability and Performance: No understanding of Jira stability and performance issues.
No change control process.
Unlimited admins: Many people have admin access, making change control and standardization impossible.
Lucene index corruption: 25% of Jira restarts resulted in index corruption, and recovery takes hours.
Rapid custom field growth: Contributes to index growth. There was no governance. Admins said yes to all custom field requests.
No governance
No change control
No metrics
2015 ASSESSMENT SUMMARY
Unplanned outages almost every day
Sometimes all day
Thousands of custom fields
Growing by 150% year over year
...and way out of control
Three investment areas: People, Process, Technology
CHECK POINT
1.2 million issues, 300+ million members, 6,000+ employees
2015
People: Process:
Technology:
CRITICAL
REVIEW
CRITICAL
Roles For Supporting Jira
App Admins: Focus on customer service; external consultants
Operations: Focus on deterministic change and mitigating risk
Developers: Ensure performance and scale are built into all solutions
Managers: Focus on governance, strategy, and the Atlassian partnership
2015 Team Staffing: Before
Roles: App Admins, Developers, Operations, Managers
Headcounts: 4, 0, 0, 0
2015 Team Staffing: During
Roles: App Admins, Developers, Operations, Managers
Headcounts: 4, 0, 0, 0
2015 Team Staffing: After
Roles: App Admins, Developers, Operations, Managers
Headcounts: 4, 1, 1, 0
Monitoring and SLOs
Atlassian relationship
Governance CRITICAL
CRITICAL
Change control
2015 Process: Before
CRITICAL
WARNING
Chart: 2015 Issue Count Projections (2015 to 2021)
Series: Original 2015 Issue Projection; Issue Projection with Governance in Place
Under Control
• Removed unused Custom Fields for a 40% overall reduction!
• Limited Custom Field growth through Governance
2015: Custom Fields
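The custom field cleanup above can be illustrated with a minimal sketch. It assumes per-field usage counts have already been gathered (for example from a read replica); the field names, the counts, and both helper functions are hypothetical and are not taken from LinkedIn's tooling.

```python
from typing import Dict, List


def find_unused_custom_fields(usage_counts: Dict[str, int], min_issues: int = 1) -> List[str]:
    """Return the names of custom fields set on fewer than `min_issues` issues."""
    return sorted(name for name, count in usage_counts.items() if count < min_issues)


def reduction_percent(total_fields: int, removed_fields: int) -> float:
    """Percentage of custom fields removed, relative to the starting total."""
    if total_fields <= 0:
        raise ValueError("total_fields must be positive")
    return 100.0 * removed_fields / total_fields


# Hypothetical data: field name -> number of issues where the field has a value.
usage = {"Team Color": 0, "Root Cause": 412, "Legacy ID": 0, "Severity": 9031, "Old Flag": 0}
unused = find_unused_custom_fields(usage)
print(unused)
print(reduction_percent(len(usage), len(unused)))
```

Raising `min_issues` above 1 would also flag rarely used fields, which is a governance decision rather than a technical one.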
Governance: Unbounded columns
• Unbounded columns in Jira:
• Comments
• Versions
2015 Process Improvements
Misuse: What can I do?
• Document and communicate what is acceptable use
• Work with users to find the right solution
• Through technology, make it impossible for misuse to reoccur
2015 Process Improvements
Change Control
• Configuration as code
• All changes are tested, reviewed, communicated, with rollback plans
Service Level Objectives
• Tracked and investigated violations
• Example: <2 seconds issue creation time
2015 Process Improvements
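As a sketch of how a service level objective such as the "<2 seconds issue creation time" example might be tracked, the snippet below flags violating samples and computes a compliance ratio. The function names and the sample timings are assumptions for illustration; only the 2-second threshold comes from the slide.

```python
from typing import List, Sequence

# Threshold from the slide: issue creation should take less than 2 seconds.
ISSUE_CREATION_SLO_SECONDS = 2.0


def slo_violations(durations: Sequence[float],
                   threshold: float = ISSUE_CREATION_SLO_SECONDS) -> List[float]:
    """Return every measured duration that does not meet the SLO (>= threshold)."""
    return [d for d in durations if d >= threshold]


def slo_compliance(durations: Sequence[float],
                   threshold: float = ISSUE_CREATION_SLO_SECONDS) -> float:
    """Fraction of samples that met the SLO; 1.0 when there are no samples."""
    if not durations:
        return 1.0
    met = sum(1 for d in durations if d < threshold)
    return met / len(durations)


# Hypothetical issue-creation timings in seconds.
samples = [0.5, 1.0, 2.5, 3.0]
print(slo_violations(samples))
print(slo_compliance(samples))
```

Each entry returned by `slo_violations` is a candidate for the "tracked and investigated" follow-up the slide describes.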
Atlassian Relationship
• Introduced TAM
• Added Premier Support
• Partnered with TAM and PS at Atlassian to target a performance upgrade
• Extended licensing for end-of-life plugin
2015 Process Improvements
Monitoring and SLOs
Atlassian relationship
Governance
SUCCESS
WARNING
Change control ALMOST
2015 Process: After
ALMOST
Availability
Hardware
Monitoring and Alerting CRITICAL
CRITICAL
WARNING
2015 Technology: Before
Hardware upgrade
• Lucene index on SSD
• Currently 75 GB
• 6 hours rebuild time
2015 Technology Improvements
Adding Software-Driven Governance
• Python-Jira client enables innovations
• Replica databases provide read access to applications needing real-time Jira data
2015 Technology Improvements
Leveraging inGraphs
• All application and system resources displayed on a single page
2015 Technology Improvements
Jira Data Center
2017 Technology Improvements
CHECK POINT
2 million issues, 400+ million members, 9,000+ employees
2016
1.2 million issues, 300+ million members, 6,000+ employees
2015
People: Process:
Technology:
CRITICAL
REVIEW
CRITICAL
ALMOST
People: Process:
Technology: REVIEW
REVIEW
2016 Team Staffing: Before
Roles: App Admins, Developers, Operations, Managers
Headcounts: 4, 1, 1, 0
2016 Team Staffing: After
Roles: App Admins, Developers, Operations, Managers
Headcounts: 2, 1, 1, 0.5
Operational Excellence
Atlassian Relationship
Governance
WARNING
2016 Process: Before
ALMOST
WARNING
2016 Process Improvements
Governance
• Documented and communicated
• All requests lead with business requirement
• Scale is the most important requirement
• Automated Governance
2016 Process Improvements
Operational Excellence Culture
• Code and config reviews
• Intelligent risk decision
• Change control and communication
• Monitoring and metrics
• Automated remediation
• Service level objectives
• Awesome alerts and response
• Business continuity plan
• Relentless pursuit of exceptions causation
• Blameless postmortems
2016 Process Improvements
Partnering with Atlassian
• TAM relationship: evolved from tactical to strategic in 2016
• Partnering with TAM for all major upgrades
• Atlassian Premier Support provided a critical bug fix over the holidays to address a bug in a widely used gadget
Operational Excellence
Atlassian Relationship
Governance
2016 Process: After
SUCCESS
ALMOST
SUCCESS
Availability
Hardware
Monitoring and Alerting
CRITICAL
2016 Technology: Before
WARNING
ALMOST
User blacklisting and throttling
• Implemented blacklisting based on username
• Throttling based on requests/minute per host
2016 Technology Improvements
# Jira.conf
# Blacklist a user by adding an entry with a value of 1.
map $remote_user $user_blacklisted {
    default 0;
    "johnnynumberfive" 1;
}
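The slides show only the nginx `map` used for blacklisting. The decision logic behind username blacklisting plus per-host throttling can be sketched in Python as below. This is an illustration of the idea using a sliding one-minute window, not LinkedIn's proxy configuration; the `RequestGate` class, its parameters, and the host names are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Optional


class RequestGate:
    """Rejects blacklisted users and hosts that exceed a per-minute request budget."""

    def __init__(self, max_per_minute: int, blocked_users: Iterable[str] = ()) -> None:
        self.max_per_minute = max_per_minute
        self.blocked_users = set(blocked_users)
        # host -> timestamps (seconds) of requests accepted in the last minute
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user: str, host: str, now: Optional[float] = None) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        if user in self.blocked_users:
            return False
        now = time.monotonic() if now is None else now
        hits = self._hits[host]
        # Drop timestamps that have aged out of the 60-second window.
        while hits and now - hits[0] >= 60.0:
            hits.popleft()
        if len(hits) >= self.max_per_minute:
            return False
        hits.append(now)
        return True


# Usage with explicit timestamps so the behaviour is deterministic.
gate = RequestGate(max_per_minute=2, blocked_users={"johnnynumberfive"})
print(gate.allow("johnnynumberfive", "build-host-1", now=0.0))  # blacklisted user
print(gate.allow("alice", "build-host-1", now=0.0))
print(gate.allow("alice", "build-host-1", now=1.0))
print(gate.allow("alice", "build-host-1", now=2.0))  # third request inside one minute
```

Keeping the check at the proxy layer, as the nginx fragment suggests, means abusive clients are stopped before they reach Jira itself.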
Larger Hardware, Tuned Instance
• 64 GB upgraded to 256 GB
• JVM increased
2016 Technology Improvements
Leveraging inGraphs
• Monitoring and alerting on all bottlenecks
2016 Technology Improvements
Logstash: Parsing logs to make useful data
Kibana: Create dashboards showing insightful data
Elasticsearch: Horizontally scalable data storage
Adding in ELK
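To make "parsing logs to make useful data" concrete, here is a small Python sketch of the kind of transformation a log parser performs: turning a raw access-log line into a structured record that can be indexed and charted. The log format, field names, and sample line are assumptions for illustration; the slides do not show the actual Logstash configuration.

```python
import re
from typing import Dict, Optional, Union

# Hypothetical access-log layout:
#   host user [timestamp] "METHOD path protocol" status response_millis
LINE_RE = re.compile(
    r'(?P<host>\S+) (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<millis>\d+)'
)


def parse_line(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Parse one log line into a structured record, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    record: Dict[str, Union[str, int]] = dict(match.groupdict())
    # Convert numeric fields so they can be aggregated downstream.
    record["status"] = int(match.group("status"))
    record["millis"] = int(match.group("millis"))
    return record


sample = '10.0.0.7 jdoe [12/Mar/2016:10:00:01 +0000] "GET /secure/Dashboard.jspa HTTP/1.1" 200 1543'
print(parse_line(sample))
```

Once lines are structured this way, questions such as "which users or pages generate the slowest requests" become simple aggregations in the dashboard layer.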
CHECK POINT
2 million issues, 400+ million members, 9,000+ employees
2016
3 million issues, 500+ million members, 10,000+ employees
1.2 million issues, 300+ million members, 6,000+ employees
People: Process:
Technology:
CRITICAL
REVIEW
CRITICAL
2015
ALMOST
SUCCESS
People: Process:
Technology: ALMOST
2017
ALMOST
People: Process:
Technology: ALMOST REVIEW
REVIEW
Operational Excellence
Atlassian Relationship
Product Vision
SUCCESS
2017 Process: Before
ALMOST
SUCCESS
App Admins Operations
Developers Manager
Roles For Supporting Jira
Product Owner: Understands customer requirements and prioritizes work.
2 1
10.5
0
2017 Process Improvements
Partnering with Atlassian
• Networking with the Jira community
• Providing feedback and requesting features
Availability
Hardware
Monitoring and Alerting
2017 Technology: Before
ALMOST
ALMOST
ALMOST
Real User Monitoring
• Performance regression reports emailed daily
• Response times include rendering
• Global statistics give us insight into latency
2017 Technology Improvements
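One way a daily performance regression report could be derived from real-user timings is to compare a high percentile per page against a baseline. The sketch below is an assumption about the approach, not a description of LinkedIn's report; the nearest-rank percentile, the 10% tolerance, and the page names are all illustrative choices.

```python
import math
from typing import Dict, Sequence, Tuple


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def find_regressions(baseline: Dict[str, Sequence[float]],
                     current: Dict[str, Sequence[float]],
                     pct: float = 95.0,
                     tolerance: float = 0.10) -> Dict[str, Tuple[float, float]]:
    """Return pages whose percentile latency grew by more than `tolerance`.

    The result maps page name to (baseline latency, current latency).
    """
    regressions: Dict[str, Tuple[float, float]] = {}
    for page, samples in current.items():
        previous = baseline.get(page)
        if not previous or not samples:
            continue  # nothing to compare against
        before = percentile(previous, pct)
        after = percentile(samples, pct)
        if after > before * (1.0 + tolerance):
            regressions[page] = (before, after)
    return regressions


# Hypothetical response times in milliseconds, including rendering.
yesterday = {"view_issue": [100, 100, 100, 100], "create_issue": [50, 50]}
today = {"view_issue": [100, 100, 100, 200], "create_issue": [50, 52]}
print(find_regressions(yesterday, today))
```

A percentile is used instead of the mean so that a handful of very slow page loads is visible without being hidden by many fast ones.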
Jira Data Center
• 4 nodes improve our MTTR by avoiding lengthy index rebuilds
• Resilient to the "single click of death"
2017 Technology Improvements
Getting to Scale
Always ask why!
Invest in the team
Build vendor relationship
Lather, rinse, repeat
Thank you!