Foreword by *Gareth James* xix

Foreword by *Ravi Bapna* xxi

Preface to the Python Edition xxiii

Acknowledgments xxvii

**Part I Preliminaries**

**Chapter 1 Introduction 3**

1.1 What Is Business Analytics? 3

1.2 What Is Data Mining? 5

1.3 Data Mining and Related Terms 5

1.4 Big Data 6

1.5 Data Science 7

1.6 Why Are There So Many Different Methods? 8

1.7 Terminology and Notation 9

1.8 Road Maps to This Book 11

Order of Topics 11

**Chapter 2 Overview of the Data Mining Process 15**

2.1 Introduction 15

2.2 Core Ideas in Data Mining 16

Classification 16

Prediction 16

Association Rules and Recommendation Systems 16

Predictive Analytics 17

Data Reduction and Dimension Reduction 17

Data Exploration and Visualization 17

Supervised and Unsupervised Learning 18

2.3 The Steps in Data Mining 19

2.4 Preliminary Steps 21

Organization of Datasets 21

Predicting Home Values in the West Roxbury Neighborhood 21

Loading and Looking at the Data in Python 22

Python Imports 25

Sampling from a Database 26

Oversampling Rare Events in Classification Tasks 26

Preprocessing and Cleaning the Data 27

2.5 Predictive Power and Overfitting 34

Overfitting 34

Creation and Use of Data Partitions 36

2.6 Building a Predictive Model 40

Modeling Process 40

2.7 Using Python for Data Mining on a Local Machine 45

2.8 Automating Data Mining Solutions 46

2.9 Ethical Practice in Data Mining 47

Data Mining Software: The State of the Market (by *Herb Edelstein*) 52

Problems 56

**Part II Data Exploration and Dimension Reduction**

**Chapter 3 Data Visualization 61**

3.1 Uses of Data Visualization 61

Python 63

3.2 Data Examples 64

Example 1: Boston Housing Data 64

Example 2: Ridership on Amtrak Trains 65

3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots 66

Distribution Plots: Boxplots and Histograms 68

Heatmaps: Visualizing Correlations and Missing Values 72

3.4 Multidimensional Visualization 75

Adding Variables: Color, Size, Shape, Multiple Panels, and Animation 75

Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering 77

Reference: Trend Lines and Labels 83

Scaling up to Large Datasets 84

Multivariate Plot: Parallel Coordinates Plot 84

Interactive Visualization 86

3.5 Specialized Visualizations 89

Visualizing Networked Data 90

Visualizing Hierarchical Data: Treemaps 92

Visualizing Geographical Data: Map Charts 94

3.6 Summary: Major Visualizations and Operations, by Data Mining Goal 97

Prediction 97

Classification 97

Time Series Forecasting 97

Unsupervised Learning 98

Problems 99

**Chapter 4 Dimension Reduction 101**

Python 101

4.1 Introduction 102

4.2 Curse of Dimensionality 102

4.3 Practical Considerations 103

Example 1: House Prices in Boston 104

4.4 Data Summaries 104

Summary Statistics 104

Aggregation and Pivot Tables 106

4.5 Correlation Analysis 108

4.6 Reducing the Number of Categories in Categorical Variables 110

4.7 Converting a Categorical Variable to a Numerical Variable 111

4.8 Principal Components Analysis 111

Example 2: Breakfast Cereals 111

Principal Components 116

Normalizing the Data 117

Using Principal Components for Classification and Prediction 120

4.9 Dimension Reduction Using Regression Models 122

4.10 Dimension Reduction Using Classification and Regression Trees 122

Problems 123

**Part III Performance Evaluation**

**Chapter 5 Evaluating Predictive Performance 129**

Python 129

5.1 Introduction 130

5.2 Evaluating Predictive Performance 130

Naive Benchmark: The Average 131

Prediction Accuracy Measures 131

Comparing Training and Validation Performance 132

Cumulative Gains and Lift Charts 135

5.3 Judging Classifier Performance 136

Benchmark: The Naive Rule 136

Class Separation 138

The Confusion (Classification) Matrix 139

Using the Validation Data 140

Accuracy Measures 140

Propensities and Cutoff for Classification 141

Performance in Case of Unequal Importance of Classes 143

Asymmetric Misclassification Costs 147

Generalization to More Than Two Classes 149

5.4 Judging Ranking Performance 150

Gains and Lift Charts for Binary Data 150

Decile Lift Charts 153

Beyond Two Classes 154

Gains and Lift Charts Incorporating Costs and Benefits 154

Cumulative Gains as a Function of Cutoff 154

5.5 Oversampling 155

Oversampling the Training Set 158

Evaluating Model Performance Using a Non-oversampled Validation Set 158

Evaluating Model Performance if Only Oversampled Validation Set Exists 158

Problems 161

**Part IV Prediction and Classification Methods**

**Chapter 6 Multiple Linear Regression 167**

Python 167

6.1 Introduction 168

6.2 Explanatory vs. Predictive Modeling 168

6.3 Estimating the Regression Equation and Prediction 170

Example: Predicting the Price of Used Toyota Corolla Cars 171

6.4 Variable Selection in Linear Regression 176

Reducing the Number of Predictors 176

How to Reduce the Number of Predictors 177

Regularization (Shrinkage Models) 183

Problems 187

**Chapter 7 *k*-Nearest Neighbors (*k*NN) 191**

Python 191

7.1 The *k*-NN Classifier (Categorical Outcome) 192

Determining Neighbors 192

Classification Rule 193

Example: Riding Mowers 193

Choosing *k* 195

Setting the Cutoff Value 197

*k*-NN with More Than Two Classes 200

Converting Categorical Variables to Binary Dummies 200

7.2 *k*-NN for a Numerical Outcome 200

7.3 Advantages and Shortcomings of *k*-NN Algorithms 202

Problems 204

**Chapter 8 The Naive Bayes Classifier 207**

Python 207

8.1 Introduction 207

Cutoff Probability Method 208

Conditional Probability 208

Example 1: Predicting Fraudulent Financial Reporting 209

8.2 Applying the Full (Exact) Bayesian Classifier 210

Using the “Assign to the Most Probable Class” Method 210

Using the Cutoff Probability Method 210

Practical Difficulty with the Complete (Exact) Bayes Procedure 210

Solution: Naive Bayes 211

The Naive Bayes Assumption of Conditional Independence 212

Using the Cutoff Probability Method 213

Example 2: Predicting Fraudulent Financial Reports, Two Predictors 213

Example 3: Predicting Delayed Flights 214

8.3 Advantages and Shortcomings of the Naive Bayes Classifier 221

Problems 223

**Chapter 9 Classification and Regression Trees 225**

Python 225

9.1 Introduction 226

Tree Structure 226

Decision Rules 227

Classifying a New Record 228

9.2 Classification Trees 228

Recursive Partitioning 228

Example 1: Riding Mowers 229

Measures of Impurity 231

9.3 Evaluating the Performance of a Classification Tree 237

Example 2: Acceptance of Personal Loan 237

Sensitivity Analysis Using Cross Validation 239

9.4 Avoiding Overfitting 242

Stopping Tree Growth 242

Fine-tuning Tree Parameters 244

Other Methods for Limiting Tree Size 247

9.5 Classification Rules from Trees 248

9.6 Classification Trees for More Than Two Classes 249

9.7 Regression Trees 249

Prediction 252

Measuring Impurity 252

Evaluating Performance 252

9.8 Improving Prediction: Random Forests and Boosted Trees 253

Random Forests 253

Boosted Trees 255

9.9 Advantages and Weaknesses of a Tree 256

Problems 259

**Chapter 10 Logistic Regression 263**

Python 263

10.1 Introduction 264

10.2 The Logistic Regression Model 265

10.3 Example: Acceptance of Personal Loan 267

Model with a Single Predictor 267

Estimating the Logistic Model from Data: Computing Parameter Estimates 269

Interpreting Results in Terms of Odds (for a Profiling Goal) 272

10.4 Evaluating Classification Performance 273

Variable Selection 276

10.5 Logistic Regression for Multi-class Classification 276

Ordinal Classes 277

Nominal Classes 278

Comparing Ordinal and Nominal Models 279

10.6 Example of Complete Analysis: Predicting Delayed Flights 281

Data Preprocessing 284

Model Training 285

Model Interpretation 285

Model Performance 285

Variable Selection 288

Problems 294

**Chapter 11 Neural Nets 297**

Python 297

11.1 Introduction 298

11.2 Concept and Structure of a Neural Network 298

11.3 Fitting a Network to Data 299

Example 1: Tiny Dataset 299

Computing Output of Nodes 301

Preprocessing the Data 303

Training the Model 304

Example 2: Classifying Accident Severity 308

Avoiding Overfitting 311

Using the Output for Prediction and Classification 311

11.4 Required User Input 312

11.5 Exploring the Relationship Between Predictors and Outcome 313

11.6 Deep Learning 313

Convolutional Neural Networks (CNNs) 314

Local Feature Map 316

A Hierarchy of Features 316

The Learning Process 316

Unsupervised Learning 317

Conclusion 318

11.7 Advantages and Weaknesses of Neural Networks 319

Problems 321

**Chapter 12 Discriminant Analysis 323**

Python 323

12.1 Introduction 324

Example 1: Riding Mowers 324

Example 2: Personal Loan Acceptance 324

12.2 Distance of a Record from a Class 325

12.3 Fisher’s Linear Classification Functions 328

12.4 Classification Performance of Discriminant Analysis 331

12.5 Prior Probabilities 333

12.6 Unequal Misclassification Costs 333

12.7 Classifying More Than Two Classes 335

Example 3: Medical Dispatch to Accident Scenes 335

12.8 Advantages and Weaknesses 338

Problems 339

**Chapter 13 Combining Methods: Ensembles and Uplift Modeling 343**

Python 343

13.1 Ensembles 344

Why Ensembles Can Improve Predictive Power 345

Simple Averaging 346

Bagging 347

Boosting 347

Bagging and Boosting in Python 348

Advantages and Weaknesses of Ensembles 348

13.2 Uplift (Persuasion) Modeling 350

A-B Testing 350

Uplift 350

Gathering the Data 351

A Simple Model 352

Modeling Individual Uplift 353

Computing Uplift with Python 355

Using the Results of an Uplift Model 355

13.3 Summary 355

Problems 357

**Part V Mining Relationships Among Records**

**Chapter 14 Association Rules and Collaborative Filtering 361**

Python 361

14.1 Association Rules 362

Discovering Association Rules in Transaction Databases 362

Example 1: Synthetic Data on Purchases of Phone Faceplates 363

Generating Candidate Rules 363

The Apriori Algorithm 366

Selecting Strong Rules 366

Data Format 368

The Process of Rule Selection 369

Interpreting the Results 370

Rules and Chance 372

Example 2: Rules for Similar Book Purchases 374

14.2 Collaborative Filtering 376

Data Type and Format 376

Example 3: Netflix Prize Contest 377

User-Based Collaborative Filtering: “People Like You” 378

Item-Based Collaborative Filtering 381

Advantages and Weaknesses of Collaborative Filtering 381

Collaborative Filtering vs. Association Rules 384

14.3 Summary 385

Problems 387

**Chapter 15 Cluster Analysis 391**

Python 391

15.1 Introduction 392

Example: Public Utilities 393

15.2 Measuring Distance Between Two Records 395

Euclidean Distance 396

Normalizing Numerical Measurements 397

Other Distance Measures for Numerical Data 398

Distance Measures for Categorical Data 400

Distance Measures for Mixed Data 400

15.3 Measuring Distance Between Two Clusters 401

Minimum Distance 401

Maximum Distance 401

Average Distance 401

Centroid Distance 401

15.4 Hierarchical (Agglomerative) Clustering 403

Single Linkage 404

Complete Linkage 404

Average Linkage 405

Centroid Linkage 405

Ward’s Method 405

Dendrograms: Displaying Clustering Process and Results 406

Validating Clusters 408

Limitations of Hierarchical Clustering 409

15.5 Non-Hierarchical Clustering: The *k*-Means Algorithm 411

Choosing the Number of Clusters (*k*) 412

Problems 418

**Part VI Forecasting Time Series**

**Chapter 16 Handling Time Series 423**

Python 423

16.1 Introduction 424

16.2 Descriptive vs. Predictive Modeling 425

16.3 Popular Forecasting Methods in Business 425

Combining Methods 426

16.4 Time Series Components 426

Example: Ridership on Amtrak Trains 427

16.5 Data-Partitioning and Performance Evaluation 431

Benchmark Performance: Naive Forecasts 432

Generating Future Forecasts 434

Problems 436

**Chapter 17 Regression-Based Forecasting 439**

Python 439

17.1 A Model with Trend 440

Linear Trend 440

Exponential Trend 444

Polynomial Trend 444

17.2 A Model with Seasonality 447

17.3 A Model with Trend and Seasonality 449

17.4 Autocorrelation and ARIMA Models 451

Computing Autocorrelation 451

Improving Forecasts by Integrating Autocorrelation Information 454

Evaluating Predictability 456

Problems 459

**Chapter 18 Smoothing Methods 469**

Python 469

18.1 Introduction 470

18.2 Moving Average 470

Centered Moving Average for Visualization 470

Trailing Moving Average for Forecasting 471

Choosing Window Width (*w*) 475

18.3 Simple Exponential Smoothing 475

Choosing Smoothing Parameter *α* 476

Relation Between Moving Average and Simple Exponential Smoothing 477

18.4 Advanced Exponential Smoothing 479

Series with a Trend 479

Series with a Trend and Seasonality 480

Series with Seasonality (No Trend) 480

Problems 483

**Part VII Data Analytics**

**Chapter 19 Social Network Analytics 493**

Python 493

19.1 Introduction 494

19.2 Directed vs. Undirected Networks 495

19.3 Visualizing and Analyzing Networks 495

Plot Layout 498

Edge List 499

Adjacency Matrix 500

Using Network Data in Classification and Prediction 500

19.4 Social Data Metrics and Taxonomy 500

Node-Level Centrality Metrics 502

Egocentric Network 503

Network Metrics 503

19.5 Using Network Metrics in Prediction and Classification 507

Link Prediction 507

Entity Resolution 507

Collaborative Filtering 510

19.6 Collecting Social Network Data with Python 513

19.7 Advantages and Disadvantages 514

Problems 516

**Chapter 20 Text Mining 517**

Python 517

20.1 Introduction 518

20.2 The Tabular Representation of Text: Term-Document Matrix and “Bag-of-Words” 519

20.3 Bag-of-Words vs. Meaning Extraction at Document Level 519

20.4 Preprocessing the Text 521

Tokenization 521

Text Reduction 523

Presence/Absence vs. Frequency 526

Term Frequency–Inverse Document Frequency (TF-IDF) 526

From Terms to Concepts: Latent Semantic Indexing 528

Extracting Meaning 528

20.5 Implementing Data Mining Methods 529

20.6 Example: Online Discussions on Autos and Electronics 529

Importing and Labeling the Records 530

Text Preprocessing in Python 530

Producing a Concept Matrix 530

Fitting a Predictive Model 532

Prediction 532

20.7 Summary 533

Problems 534

**Part VIII Cases**

**Chapter 21 Cases 539**

21.1 Charles Book Club 539

The Book Industry 539

Database Marketing at Charles 540

Data Mining Techniques 542

Assignment 544

21.2 German Credit 545

Background 545

Data 546

Assignment 546

21.3 Tayko Software Cataloger 551

Background 551

The Mailing Experiment 551

Data 551

Assignment 553

21.4 Political Persuasion 554

Background 554

Predictive Analytics Arrives in US Politics 554

Political Targeting 555

Uplift 555

Data 556

Assignment 557

21.5 Taxi Cancellations 558

Business Situation 558

Assignment 558

21.6 Segmenting Consumers of Bath Soap 559

Business Situation 559

Key Problems 560

Data 560

Measuring Brand Loyalty 560

Assignment 562

21.7 Direct-Mail Fundraising 562

Background 562

Data 563

Assignment 564

21.8 Catalog Cross-Selling 565

Background 565

Assignment 565

21.9 Time Series Case: Forecasting Public Transportation Demand 566

Background 566

Problem Description 566

Available Data 566

Assignment Goal 567

Assignment 567

Tips and Suggested Steps 567

References 569

Data Files Used in the Book 571

Python Utilities Functions 575

Index 585