
Professional Hadoop Solutions




Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich

ISBN: 978-1-118-82418-4 | September 2013 | 504 Pages



The go-to guidebook for deploying Big Data solutions with Hadoop

Today's enterprise architects need to understand how the Hadoop frameworks and APIs fit together, and how they can be integrated to deliver real-world solutions. This book is a practical, detailed guide to building and implementing those solutions, with code-level instruction in the popular Wrox tradition. It covers storing data with HDFS and HBase, processing data with MapReduce, and automating data processing with Oozie. Hadoop security, running Hadoop with Amazon Web Services, best practices, and automating Hadoop processes in real time are also covered in depth.

With in-depth code examples in Java and XML and the latest on recent additions to the Hadoop ecosystem, this complete resource also covers the use of APIs, exposing their inner workings and allowing architects and developers to better leverage and customize them.

  • The ultimate guide for developers, designers, and architects who need to build and deploy Hadoop applications
  • Covers storing and processing data with various technologies, automating data processing, Hadoop security, and delivering real-time solutions
  • Includes detailed, real-world examples and code-level guidelines
  • Explains when, why, and how to use these tools effectively
  • Written by a team of Hadoop experts in the programmer-to-programmer Wrox style

Professional Hadoop Solutions is the reference enterprise architects and developers need to maximize the power of Hadoop.

Introduction xvii

Chapter 1: Big Data and the Hadoop Ecosystem 1

Big Data Meets Hadoop 2

Hadoop: Meeting the Big Data Challenge 3

Data Science in the Business World 5

The Hadoop Ecosystem 7

Hadoop Core Components 7

Hadoop Distributions 10

Developing Enterprise Applications with Hadoop 12

Summary 16

Chapter 2: Storing Data in Hadoop 19


HDFS Architecture 20

Using HDFS Files 24

Hadoop-Specific File Types 26

HDFS Federation and High Availability 32

HBase 34

HBase Architecture 34

HBase Schema Design 40

Programming for HBase 42

New HBase Features 50

Combining HDFS and HBase for Effective Data Storage 53

Using Apache Avro 53

Managing Metadata with HCatalog 58

Choosing an Appropriate Hadoop Data Organization for Your Applications 60

Summary 62

Chapter 3: Processing Your Data with MapReduce 63

Getting to Know MapReduce 63

MapReduce Execution Pipeline 65

Runtime Coordination and Task Management in MapReduce 68

Your First MapReduce Application 70

Building and Executing MapReduce Programs 74

Designing MapReduce Implementations 78

Using MapReduce as a Framework for Parallel Processing 79

Simple Data Processing with MapReduce 81

Building Joins with MapReduce 82

Building Iterative MapReduce Applications 88

To MapReduce or Not to MapReduce? 94

Common MapReduce Design Gotchas 95

Summary 96

Chapter 4: Customizing MapReduce Execution 97

Controlling MapReduce Execution with InputFormat 98

Implementing InputFormat for Compute-Intensive Applications 100

Implementing InputFormat to Control the Number of Maps 106

Implementing InputFormat for Multiple HBase Tables 112

Reading Data Your Way with Custom RecordReaders 116

Implementing a Queue-Based RecordReader 116

Implementing RecordReader for XML Data 119

Organizing Output Data with Custom Output Formats 123

Implementing OutputFormat for Splitting MapReduce Job’s Output into Multiple Directories 124

Writing Data Your Way with Custom RecordWriters 133

Implementing a RecordWriter to Produce Output tar Files 133

Optimizing Your MapReduce Execution with a Combiner 135

Controlling Reducer Execution with Partitioners 139

Implementing a Custom Partitioner for One-to-Many Joins 140

Using Non-Java Code with Hadoop 143

Pipes 143

Hadoop Streaming 143

Using JNI 144

Summary 146

Chapter 5: Building Reliable MapReduce Apps 147

Unit Testing MapReduce Applications 147

Testing Mappers 150

Testing Reducers 151

Integration Testing 152

Local Application Testing with Eclipse 154

Using Logging for Hadoop Testing 156

Processing Applications Logs 160

Reporting Metrics with Job Counters 162

Defensive Programming in MapReduce 165

Summary 166

Chapter 6: Automating Data Processing with Oozie 167

Getting to Know Oozie 168

Oozie Workflow 170

Executing Asynchronous Activities in Oozie Workflow 173

Oozie Recovery Capabilities 179

Oozie Workflow Job Life Cycle 180

Oozie Coordinator 181

Oozie Bundle 187

Oozie Parameterization with Expression Language 191

Workflow Functions 192

Coordinator Functions 192

Bundle Functions 193

Other EL Functions 193

Oozie Job Execution Model 193

Accessing Oozie 197

Oozie SLA 199

Summary 203

Chapter 7: Using Oozie 205

Validating Information about Places Using Probes 206

Designing Place Validation Based on Probes 207

Designing Oozie Workflows 208

Implementing Oozie Workflow Applications 211

Implementing the Data Preparation Workflow 212

Implementing Attendance Index and Cluster Strands Workflows 220

Implementing Workflow Activities 222

Populating the Execution Context from a java Action 223

Using MapReduce Jobs in Oozie Workflows 223

Implementing Oozie Coordinator Applications 226

Implementing Oozie Bundle Applications 231

Deploying, Testing, and Executing Oozie Applications 232

Deploying Oozie Applications 232

Using the Oozie CLI for Execution of an Oozie Application 234

Passing Arguments to Oozie Jobs 237

Using the Oozie Console to Get Information about Oozie Applications 240

Getting to Know the Oozie Console Screens 240

Getting Information about a Coordinator Job 245

Summary 247

Chapter 8: Advanced Oozie Features 249

Building Custom Oozie Workflow Actions 250

Implementing a Custom Oozie Workflow Action 251

Deploying Oozie Custom Workflow Actions 255

Adding Dynamic Execution to Oozie Workflows 257

Overall Implementation Approach 257

A Machine Learning Model, Parameters, and Algorithm 261

Defining a Workflow for an Iterative Process 262

Dynamic Workflow Generation 265

Using the Oozie Java API 268

Using Uber Jars with Oozie Applications 272

Data Ingestion Conveyer 276

Summary 283

Chapter 9: Real-Time Hadoop 285

Real-Time Applications in the Real World 286

Using HBase for Implementing Real-Time Applications 287

Using HBase as a Picture Management System 289

Using HBase as a Lucene Back End 296

Using Specialized Real-Time Hadoop Query Systems 317

Apache Drill 319

Impala 320

Comparing Real-Time Queries to MapReduce 323

Using Hadoop-Based Event-Processing Systems 323

HFlame 324

Storm 326

Comparing Event Processing to MapReduce 329

Summary 330

Chapter 10: Hadoop Security 331

A Brief History: Understanding Hadoop Security Challenges 333

Authentication 334

Kerberos Authentication 334

Delegated Security Credentials 344

Authorization 350

HDFS File Permissions 350

Service-Level Authorization 354

Job Authorization 356

Oozie Authentication and Authorization 356

Network Encryption 358

Security Enhancements with Project Rhino 360

HDFS Disk-Level Encryption 361

Token-Based Authentication and Unified Authorization Framework 361

HBase Cell-Level Security 362

Putting it All Together — Best Practices for Securing Hadoop 362

Authentication 363

Authorization 364

Network Encryption 364

Stay Tuned for Hadoop Enhancements 365

Summary 365

Chapter 11: Running Hadoop Applications on AWS 367

Getting to Know AWS 368

Options for Running Hadoop on AWS 369

Custom Installation using EC2 Instances 369

Elastic MapReduce 370

Additional Considerations before Making Your Choice 370

Understanding the EMR-Hadoop Relationship 370

EMR Architecture 372

Using S3 Storage 373

Maximizing Your Use of EMR 374

Utilizing CloudWatch and Other AWS Components 376

Accessing and Using EMR 377

Using AWS S3 383

Understanding the Use of Buckets 383

Content Browsing with the Console 386

Programmatically Accessing Files in S3 387

Using MapReduce to Upload Multiple Files to S3 397

Automating EMR Job Flow Creation and Job Execution 399

Orchestrating Job Execution in EMR 404

Using Oozie on an EMR Cluster 404

AWS Simple Workflow 407

AWS Data Pipeline 408

Summary 409

Chapter 12: Building Enterprise Security Solutions for Hadoop Implementations 411

Security Concerns for Enterprise Applications 412

Authentication 414

Authorization 414

Confidentiality 415

Integrity 415

Auditing 416

What Hadoop Security Doesn’t Natively Provide for Enterprise Applications 416

Data-Oriented Access Control 416

Differential Privacy 417

Encrypted Data at Rest 419

Enterprise Security Integration 419

Approaches for Securing Enterprise Applications Using Hadoop 419

Access Control Protection with Accumulo 420

Encryption at Rest 430

Network Isolation and Separation Approaches 430

Summary 434

Chapter 13: Hadoop’s Future 435

Simplifying MapReduce Programming with DSLs 436

What Are DSLs? 436

DSLs for Hadoop 437

Faster, More Scalable Processing 449

Apache YARN 449

Tez 452

Security Enhancements 452

Emerging Trends 453

Summary 454

Appendix: Useful Reading 455

Index 463
ReadMe Document
Chapter 2 Code
Chapter 4 Code
Chapter 5 Code
Chapter 7 Code
Chapter 8 Code
Chapter 9 Code
Chapter 11 Code
Chapter | Page | Details | Date | Print Run
2 | 33 | Error in Text | |
Currently reads:
This is achieved by configuring Data Nodes to send block location information and heartbeat to both Data Nodes.
Should be:
This is achieved by configuring Data Nodes to send block location information and heartbeat to both Name Nodes.