architecting modern data platforms - gbv

Architecting Modern Data Platforms

A Guide to Enterprise Hadoop at Scale

Jan Kunigk, Ian Buss, Paul Wilkinson, and Lars George

Beijing • Boston • Farnham • Sebastopol • Tokyo O'REILLY

Table of Contents

Foreword xiii

Preface xvii

1. Big Data Technology Primer 1

A Tour of the Landscape 3 Core Components 5 Computational Frameworks 10 Analytical SQL Engines 14 Storage Engines 18 Ingestion 25 Orchestration 25

Summary 26

Part I. Infrastructure

2. Clusters 31

Reasons for Multiple Clusters 31 Multiple Clusters for Resiliency 31 Multiple Clusters for Software Development 32 Multiple Clusters for Workload Isolation 33 Multiple Clusters for Legal Separation 34 Multiple Clusters and Independent Storage and Compute 35

Multitenancy 35 Requirements for Multitenancy 36

Sizing Clusters 37 Sizing by Storage 38

Sizing by Ingest Rate 40 Sizing by Workload 41

Cluster Growth 41 The Drivers of Cluster Growth 42 Implementing Cluster Growth 42

Data Replication 43 Replication for Software Development 43 Replication and Workload Isolation 43

Summary 44

3. Compute and Storage 45 Computer Architecture for Hadoop 46

Commodity Servers 46 Server CPUs and RAM 48 Nonuniform Memory Access 50 CPU Specifications 54 RAM 55

Commoditized Storage Meets the Enterprise 55 Modularity of Compute and Storage 57 Everything Is Java 57 Replication or Erasure Coding? 57 Alternatives 58

Hadoop and the Linux Storage Stack 58 User Space 58 Important System Calls 61 The Linux Page Cache 62 Short-Circuit and Zero-Copy Reads 65 Filesystems 69

Erasure Coding Versus Replication 71 Discussion 76 Guidance 79

Low-Level Storage 81 Storage Controllers 81 Disk Layer 84

Server Form Factors 91 Form Factor Comparison 94 Guidance 95

Workload Profiles 96 Cluster Configurations and Node Types 97

Master Nodes 98 Worker Nodes 99 Utility Nodes 100

Edge Nodes 101 Small Cluster Configurations 101 Medium Cluster Configurations 102 Large Cluster Configurations 103

Summary 104

4. Networking 107 How Services Use a Network 107

Remote Procedure Calls (RPCs) 107 Data Transfers 109 Monitoring 113 Backup 113 Consensus 114

Network Architectures 114 Small Cluster Architectures 115 Medium Cluster Architectures 116 Large Cluster Architectures 124

Network Integration 128 Reusing an Existing Network 128 Creating an Additional Network 129

Network Design Considerations 131 Layer 1 Recommendations 131 Layer 2 Recommendations 133 Layer 3 Recommendations 135

Summary 138

5. Organizational Challenges 139 Who Runs It? 140 Is It Infrastructure, Middleware, or an Application? 140 Case Study: A Typical Business Intelligence Project 141

The Traditional Approach 141 Typical Team Setup 143 Compartmentalization of IT 146 Revised Team Setup for Hadoop in the Enterprise 147 Solution Overview with Hadoop 154 New Team Setup 155 Split Responsibilities 156 Do I Need DevOps? 156 Do I Need a Center of Excellence/Competence? 157

Summary 157

6. Datacenter Considerations 159 Why Does It Matter? 159 Basic Datacenter Concepts 160

Cooling 162 Power 163 Network 164 Rack Awareness and Rack Failures 165 Failure Domain Alignment 167

Space and Racking Constraints 168 Ingest and Intercluster Connectivity 169

Software 169 Hardware 170

Replacements and Repair 171 Operational Procedures 172

Typical Pitfalls 172 Networking 172 Cluster Spanning 173

Summary 181

Part II. Platform

7. Provisioning Clusters 185 Operating Systems 185

OS Choices 187 OS Configuration for Hadoop 188 Automated Configuration Example 193

Service Databases 194 Required Databases 196 Database Integration Options 197 Database Considerations 201

Hadoop Deployment 202 Hadoop Distributions 202 Installation Choices 205 Distribution Architecture 206 Installation Process 208

Summary 210

8. Platform Validation 211 Testing Methodology 212 Useful Tools 213 Hardware Validation 213

CPU 213 Disks 216 Network 221

Hadoop Validation 227 HDFS Validation 228 General Validation 230

Validating Other Components 234 Operations Validation 235

Summary 236

9. Security 237 In-Flight Encryption 237

TLS Encryption 238 SASL Quality of Protection 240 Enabling in-Flight Encryption 241

Authentication 242 Kerberos 242 LDAP Authentication 247 Delegation Tokens 248 Impersonation 249

Authorization 250 Group Resolution 251 Superusers and Supergroups 253 Hadoop Service Level Authorization 257 Centralized Security Management 258 HDFS 260 YARN 261 ZooKeeper 262 Hive 263 Impala 264 HBase 264 Solr 265 Kudu 266 Oozie 266 Hue 266 Kafka 269 Sentry 270

At-Rest Encryption 270 Volume Encryption with Cloudera Navigator Encrypt and Key Trustee Server 273 HDFS Transparent Data Encryption 274 Encrypting Temporary Files 279

Summary 279

10. Integration with Identity Management Providers 281 Integration Areas 281 Integration Scenarios 282

Scenario 1: Writing a File to HDFS 282 Scenario 2: Submitting a Hive Query 283 Scenario 3: Running a Spark Job 284

Integration Providers 285 LDAP Integration 287

Background 287 LDAP Security 289 Load Balancing 290 Application Integration 290 Linux Integration 292

Kerberos Integration 296 Kerberos Clients 296 KDC Integration 298

Certificate Management 304 Signing Certificates 305 Converting Certificates 307 Wildcard Certificates 308 Automation 309

Summary 309

11. Accessing and Interacting with Clusters 311 Access Mechanisms 311

Programmatic Access 311 Command-Line Access 312 WebUIs 312

Access Topologies 313 Interaction Patterns 314 Proxy Access 316 Load Balancing 318 Edge Node Interactions 318

Access Security 323 Administration Gateways 324

Workbenches 324 Hue 324 Notebooks 325

Landing Zones 326 Summary 328

12. High Availability 329 High Availability Defined 330

Lateral/Service HA 330 Vertical/Systemic HA 330

Measuring Availability 331 Percentages 331 Percentiles 331

Operating for HA 331 Monitoring 331 Playbooks and Postmortems 332

HA Building Blocks 332 Quorums 332 Load Balancing 334 Database HA 341 Ancillary Services 343

General Considerations 345 Separation of Master and Worker Processes 345 Separation of Identical Service Roles 345 Master Servers in Separate Failure Domains 346 Balanced Master Configurations 346 Optimized Server Configurations 346

High Availability of Cluster Services 347 ZooKeeper 347 HDFS 348 YARN 353 HBase 356 KMS 358 Hive 359 Impala 362 Solr 367 Kafka 369 Oozie 371 Hue 372 Other Services 375 Autoconfiguration 375

Summary 376

13. Backup and Disaster Recovery 377 Context 377

Many Distributed Systems 377 Policies and Objectives 378 Failure Scenarios 379

Suitable Data Sources 382 Strategies 383 Data Types 386 Consistency 386 Validation 387 Summary 388

Data Replication 388 HBase 389 Cluster Management Tools 389 Kafka 390 Summary 391

Hadoop Cluster Backups 391 Subsystems 394 Case Study: Automating Backups with Oozie 398

Restore 405 Summary 406

Part III. Taking Hadoop to the Cloud

14. Basics of Virtualization for Hadoop 411 Compute Virtualization 412

Virtual Machine Distribution 413 Anti-Affinity Groups 414

Storage Virtualization 415 Virtualizing Local Storage 416 SANs 417 Object Storage and Network-Attached Storage 421

Network Virtualization 423 Cluster Life Cycle Models 425 Summary 430

15. Solutions for Private Clouds 433 OpenStack 435

Automation and Integration 436 Life Cycle and Storage 436 Isolation 438 Summary 438

OpenShift 439 Automation 439 Life Cycle and Storage 440 Isolation 441

Summary 441 VMware and Pivotal Cloud Foundry 442 Do It Yourself? 442

Automation 445 Isolation 446 Life Cycle Model 446 Summary 447

Object Storage for Private Clouds 448 EMC Isilon 448 Ceph 450

Summary 453

16. Solutions in the Public Cloud 455 Key Things to Know 455 Cloud Providers 457

AWS 457 Microsoft Azure 464 Google Cloud Platform 470

Implementing Clusters 473 Instances 473 Storage and Life Cycle Models 478 Network Architecture 484 High Availability 488

Summary 495

17. Automated Provisioning 497 Long-Lived Clusters 497

Configuration and Templating 498 Deployment Phases 499 Vendor Solutions 502 One-Click Deployments 505 Homegrown Automation 505 Hooking Into a Provisioning Life Cycle 505 Scaling Up and Down 506 Deploying with Security 508

Transient Clusters 510 Sharing Metadata Services 511 Summary 512

18. Security in the Cloud 513 Assessing the Risk 513 Risk Model 515

Environmental Risks 515 Deployment Risks 516

Identity Provider Options for Hadoop 517 Option A: Cloud-Only Self-Contained ID Services 519 Option B: Cloud-Only Shared ID Services 520 Option C: On-Premises ID Services 521

Object Storage Security and Hadoop 523 Identity and Access Management 523 Amazon Simple Storage Service 524 GCP Cloud Storage 527 Microsoft Azure 531

Auditing 535 Encryption for Data at Rest 535

Requirements for Key Material 536 Options for Encryption in the Cloud 537 On-Premises Key Persistence 539 Encryption via the Cloud Provider 539 Encryption Feature and Interoperability Summary 547 Recommendations and Summary for Cloud Encryption 549

Encrypting Data in Flight in the Cloud 550 Perimeter Controls and Firewalling 551

GCP 553 AWS 555 Azure 557

Summary 559

A. Backup Onboarding Checklist 561

Index 571
