Architecting Modern Data Platforms
A Guide to Enterprise Hadoop at Scale
Jan Kunigk, Ian Buss, Paul Wilkinson, and Lars George
Beijing • Boston • Farnham • Sebastopol • Tokyo O'REILLY
Table of Contents
Foreword xiii
Preface xvii
1. Big Data Technology Primer 1
A Tour of the Landscape 3 Core Components 5 Computational Frameworks 10 Analytical SQL Engines 14 Storage Engines 18 Ingestion 25 Orchestration 25
Summary 26
Part I. Infrastructure
2. Clusters 31
Reasons for Multiple Clusters 31 Multiple Clusters for Resiliency 31 Multiple Clusters for Software Development 32 Multiple Clusters for Workload Isolation 33 Multiple Clusters for Legal Separation 34 Multiple Clusters and Independent Storage and Compute 35
Multitenancy 35 Requirements for Multitenancy 36
Sizing Clusters 37 Sizing by Storage 38
Sizing by Ingest Rate 40 Sizing by Workload 41
Cluster Growth 41 The Drivers of Cluster Growth 42 Implementing Cluster Growth 42
Data Replication 43 Replication for Software Development 43 Replication and Workload Isolation 43
Summary 44
3. Compute and Storage 45 Computer Architecture for Hadoop 46
Commodity Servers 46 Server CPUs and RAM 48 Nonuniform Memory Access 50 CPU Specifications 54 RAM 55
Commoditized Storage Meets the Enterprise 55 Modularity of Compute and Storage 57 Everything Is Java 57 Replication or Erasure Coding? 57 Alternatives 58
Hadoop and the Linux Storage Stack 58 User Space 58 Important System Calls 61 The Linux Page Cache 62 Short-Circuit and Zero-Copy Reads 65 Filesystems 69
Erasure Coding Versus Replication 71 Discussion 76 Guidance 79
Low-Level Storage 81 Storage Controllers 81 Disk Layer 84
Server Form Factors 91 Form Factor Comparison 94 Guidance 95
Workload Profiles 96 Cluster Configurations and Node Types 97
Master Nodes 98 Worker Nodes 99 Utility Nodes 100
Edge Nodes 101 Small Cluster Configurations 101 Medium Cluster Configurations 102 Large Cluster Configurations 103
Summary 104
4. Networking 107 How Services Use a Network 107
Remote Procedure Calls (RPCs) 107 Data Transfers 109 Monitoring 113 Backup 113 Consensus 114
Network Architectures 114 Small Cluster Architectures 115 Medium Cluster Architectures 116 Large Cluster Architectures 124
Network Integration 128 Reusing an Existing Network 128 Creating an Additional Network 129
Network Design Considerations 131 Layer 1 Recommendations 131 Layer 2 Recommendations 133 Layer 3 Recommendations 135
Summary 138
5. Organizational Challenges 139 Who Runs It? 140 Is It Infrastructure, Middleware, or an Application? 140 Case Study: A Typical Business Intelligence Project 141
The Traditional Approach 141 Typical Team Setup 143 Compartmentalization of IT 146 Revised Team Setup for Hadoop in the Enterprise 147 Solution Overview with Hadoop 154 New Team Setup 155 Split Responsibilities 156 Do I Need DevOps? 156 Do I Need a Center of Excellence/Competence? 157
Summary 157
6. Datacenter Considerations 159 Why Does It Matter? 159 Basic Datacenter Concepts 160
Cooling 162 Power 163 Network 164 Rack Awareness and Rack Failures 165 Failure Domain Alignment 167
Space and Racking Constraints 168 Ingest and Intercluster Connectivity 169
Software 169 Hardware 170
Replacements and Repair 171 Operational Procedures 172
Typical Pitfalls 172 Networking 172 Cluster Spanning 173
Summary 181
Part II. Platform
7. Provisioning Clusters 185 Operating Systems 185
OS Choices 187 OS Configuration for Hadoop 188 Automated Configuration Example 193
Service Databases 194 Required Databases 196 Database Integration Options 197 Database Considerations 201
Hadoop Deployment 202 Hadoop Distributions 202 Installation Choices 205 Distribution Architecture 206 Installation Process 208
Summary 210
8. Platform Validation 211 Testing Methodology 212 Useful Tools 213 Hardware Validation 213
CPU 213 Disks 216 Network 221
Hadoop Validation 227 HDFS Validation 228 General Validation 230
Validating Other Components 234 Operations Validation 235
Summary 236
9. Security 237 In-Flight Encryption 237
TLS Encryption 238 SASL Quality of Protection 240 Enabling in-Flight Encryption 241
Authentication 242 Kerberos 242 LDAP Authentication 247 Delegation Tokens 248 Impersonation 249
Authorization 250 Group Resolution 251 Superusers and Supergroups 253 Hadoop Service Level Authorization 257 Centralized Security Management 258 HDFS 260 YARN 261 ZooKeeper 262 Hive 263 Impala 264 HBase 264 Solr 265 Kudu 266 Oozie 266 Hue 266 Kafka 269 Sentry 270
At-Rest Encryption 270 Volume Encryption with Cloudera Navigator Encrypt and Key Trustee Server 273
HDFS Transparent Data Encryption 274 Encrypting Temporary Files 279
Summary 279
10. Integration with Identity Management Providers 281 Integration Areas 281 Integration Scenarios 282
Scenario 1: Writing a File to HDFS 282 Scenario 2: Submitting a Hive Query 283 Scenario 3: Running a Spark Job 284
Integration Providers 285 LDAP Integration 287
Background 287 LDAP Security 289 Load Balancing 290 Application Integration 290 Linux Integration 292
Kerberos Integration 296 Kerberos Clients 296 KDC Integration 298
Certificate Management 304 Signing Certificates 305 Converting Certificates 307 Wildcard Certificates 308 Automation 309
Summary 309
11. Accessing and Interacting with Clusters 311 Access Mechanisms 311
Programmatic Access 311 Command-Line Access 312 WebUIs 312
Access Topologies 313 Interaction Patterns 314 Proxy Access 316 Load Balancing 318 Edge Node Interactions 318
Access Security 323 Administration Gateways 324
Workbenches 324 Hue 324 Notebooks 325
Landing Zones 326 Summary 328
12. High Availability 329 High Availability Defined 330
Lateral/Service HA 330 Vertical/Systemic HA 330
Measuring Availability 331 Percentages 331 Percentiles 331
Operating for HA 331 Monitoring 331 Playbooks and Postmortems 332
HA Building Blocks 332 Quorums 332 Load Balancing 334 Database HA 341 Ancillary Services 343
General Considerations 345 Separation of Master and Worker Processes 345 Separation of Identical Service Roles 345 Master Servers in Separate Failure Domains 346 Balanced Master Configurations 346 Optimized Server Configurations 346
High Availability of Cluster Services 347 ZooKeeper 347 HDFS 348 YARN 353 HBase 356 KMS 358 Hive 359 Impala 362 Solr 367 Kafka 369 Oozie 371 Hue 372 Other Services 375 Autoconfiguration 375
Summary 376
13. Backup and Disaster Recovery 377 Context 377
Many Distributed Systems 377 Policies and Objectives 378 Failure Scenarios 379
Suitable Data Sources 382 Strategies 383 Data Types 386 Consistency 386 Validation 387 Summary 388
Data Replication 388 HBase 389 Cluster Management Tools 389 Kafka 390 Summary 391
Hadoop Cluster Backups 391 Subsystems 394 Case Study: Automating Backups with Oozie 398
Restore 405 Summary 406
Part III. Taking Hadoop to the Cloud
14. Basics of Virtualization for Hadoop 411 Compute Virtualization 412
Virtual Machine Distribution 413 Anti-Affinity Groups 414
Storage Virtualization 415 Virtualizing Local Storage 416 SANs 417 Object Storage and Network-Attached Storage 421
Network Virtualization 423 Cluster Life Cycle Models 425 Summary 430
15. Solutions for Private Clouds 433 OpenStack 435
Automation and Integration 436 Life Cycle and Storage 436 Isolation 438 Summary 438
OpenShift 439 Automation 439 Life Cycle and Storage 440 Isolation 441
Summary 441 VMware and Pivotal Cloud Foundry 442 Do It Yourself? 442
Automation 445 Isolation 446 Life Cycle Model 446 Summary 447
Object Storage for Private Clouds 448 EMC Isilon 448 Ceph 450
Summary 453
16. Solutions in the Public Cloud 455 Key Things to Know 455 Cloud Providers 457
AWS 457 Microsoft Azure 464 Google Cloud Platform 470
Implementing Clusters 473 Instances 473 Storage and Life Cycle Models 478 Network Architecture 484 High Availability 488
Summary 495
17. Automated Provisioning 497 Long-Lived Clusters 497
Configuration and Templating 498 Deployment Phases 499 Vendor Solutions 502 One-Click Deployments 505 Homegrown Automation 505 Hooking Into a Provisioning Life Cycle 505 Scaling Up and Down 506 Deploying with Security 508
Transient Clusters 510 Sharing Metadata Services 511 Summary 512
18. Security in the Cloud 513 Assessing the Risk 513 Risk Model 515
Environmental Risks 515 Deployment Risks 516
Identity Provider Options for Hadoop 517 Option A: Cloud-Only Self-Contained ID Services 519 Option B: Cloud-Only Shared ID Services 520 Option C: On-Premises ID Services 521
Object Storage Security and Hadoop 523 Identity and Access Management 523 Amazon Simple Storage Service 524 GCP Cloud Storage 527 Microsoft Azure 531
Auditing 535 Encryption for Data at Rest 535
Requirements for Key Material 536 Options for Encryption in the Cloud 537 On-Premises Key Persistence 539 Encryption via the Cloud Provider 539 Encryption Feature and Interoperability Summary 547 Recommendations and Summary for Cloud Encryption 549
Encrypting Data in Flight in the Cloud 550 Perimeter Controls and Firewalling 551
GCP 553 AWS 555 Azure 557
Summary 559
A. Backup Onboarding Checklist 561
Index 571