Wiley.com

Data Mining for Business Analytics: Concepts, Techniques, and Applications in R

ISBN: 978-1-118-87936-8
584 pages
September 2017

Description

"This book has by far the most comprehensive review of business analytics methods that I have ever seen, covering everything from classical approaches such as linear and logistic regression, through to modern methods like neural networks, bagging and boosting, and even much more business specific procedures such as social network analysis and text mining. If not the bible, it is at the least a definitive manual on the subject."

- Gareth M. James, University of Southern California, co-author (with Witten, Hastie, and Tibshirani) of the best-selling book "An Introduction to Statistical Learning, with Applications in R"


Incorporating an innovative focus on data visualization and time series forecasting, Data Mining for Business Analytics supplies insightful, detailed guidance on fundamental data mining techniques. The book guides readers through the use of the freely available R software for developing predictive models and for describing and finding patterns in data. The authors use interesting, real-world examples to build a theoretical and practical understanding of key data mining methods. The book includes discussions of R subroutines, allowing readers to work hands-on with the provided data. Throughout the book, applications of the discussed topics use the business problem as motivation and avoid unnecessary statistical theory. Each chapter concludes with exercises that allow readers to deepen their comprehension of the presented material. Over a dozen cases that require use of the different data mining techniques are introduced, and a related Web site features over two dozen data sets, exercise solutions, PowerPoint slides, and case solutions. Modern topics include text analytics, recommender systems, social network analysis, getting data from a database into the analytics process, and scoring and deploying the results of an analysis to a database.


Table of Contents

Foreword 17

Preface to the R Edition 19

Acknowledgments 22

PART I PRELIMINARIES

CHAPTER 1 Introduction 3

1.1 What is Business Analytics? 3

1.2 What is Data Mining? 5

1.3 Data Mining and Related Terms 5

1.4 Big Data 7

1.5 Data Science 8

1.6 Why Are There So Many Different Methods? 8

1.7 Terminology and Notation 9

1.8 Road Maps to This Book 11

Order of Topics 11

CHAPTER 2 Overview of the Data Mining Process 17

2.1 Introduction 17

2.2 Core Ideas in Data Mining 18

Classification 18

Prediction 18

Association Rules and Recommendation Systems 18

Predictive Analytics 19

Data Reduction and Dimension Reduction 19

Data Exploration and Visualization 19

Supervised and Unsupervised Learning 20

2.3 The Steps in Data Mining 21

2.4 Preliminary Steps 23

Organization of Datasets 23

Predicting Home Values in the West Roxbury Neighborhood 23

Loading and Looking at the Data in R 24

Sampling from a Database 26

Oversampling Rare Events in Classification Tasks 27

Preprocessing and Cleaning the Data 28

2.5 Predictive Power and Overfitting 35

Overfitting 35

Creation and Use of Data Partitions 37

2.6 Building a Predictive Model 41

Modeling Process 41

2.7 Using R for Data Mining on a Local Machine 46

2.8 Automating Data Mining Solutions 46

Data Mining Software: The State of the Market (by Herb Edelstein) 47

Problems 51

PART II DATA EXPLORATION AND DIMENSION REDUCTION

CHAPTER 3 Data Visualization 57

3.1 Uses of Data Visualization 57

Base R or ggplot? 59

3.2 Data Examples 59

Example 1: Boston Housing Data 59

Example 2: Ridership on Amtrak Trains 61

3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots 61

Distribution Plots: Boxplots and Histograms 62

Heatmaps: Visualizing Correlations and Missing Values 66

3.4 Multi-Dimensional Visualization 67

Adding Variables: Color, Size, Shape, Multiple Panels, and Animation 70

Manipulations: Re-scaling, Aggregation and Hierarchies, Zooming, Filtering 72

Reference: Trend Lines and Labels 75

Scaling up to Large Datasets 77

Multivariate Plot: Parallel Coordinates Plot 77

Interactive Visualization 81

3.5 Specialized Visualizations 83

Visualizing Networked Data 83

Visualizing Hierarchical Data: Treemaps 86

Visualizing Geographical Data: Map Charts 86

3.6 Summary: Major Visualizations and Operations, by Data Mining Goal 90

Prediction 90

Classification 90

Time Series Forecasting 90

Unsupervised Learning 91

Problems 92

CHAPTER 4 Dimension Reduction 95

4.1 Introduction 95

4.2 Curse of Dimensionality 96

4.3 Practical Considerations 96

Example 1: House Prices in Boston 97

4.4 Data Summaries 98

Summary Statistics 98

Aggregation and Pivot Tables 100

4.5 Correlation Analysis 103

4.6 Reducing the Number of Categories in Categorical Variables 103

4.7 Converting a Categorical Variable to a Numerical Variable 104

4.8 Principal Components Analysis 105

Example 2: Breakfast Cereals 105

Principal Components 111

Normalizing the Data 111

Using Principal Components for Classification and Prediction 115

4.9 Dimension Reduction Using Regression Models 115

4.10 Dimension Reduction Using Classification and Regression Trees 117

Problems 118

PART III PERFORMANCE EVALUATION

CHAPTER 5 Evaluating Predictive Performance 123

5.1 Introduction 123

5.2 Evaluating Predictive Performance 124

Naive Benchmark: The Average 124

Prediction Accuracy Measures 125

Comparing Training and Validation Performance 127

Lift Chart 127

5.3 Judging Classifier Performance 128

Benchmark: The Naive Rule 130

Class Separation 130

The Confusion (Classification) Matrix 130

Using the Validation Data 132

Accuracy Measures 133

Propensities and Cutoff for Classification 133

Performance in Case of Unequal Importance of Classes 138

Asymmetric Misclassification Costs 140

Generalization to More Than Two Classes 142

5.4 Judging Ranking Performance 143

Lift Charts for Binary Data 143

Decile Lift Charts 146

Beyond Two Classes 146

Lift Charts Incorporating Costs and Benefits 147

Lift as a Function of Cutoff 147

5.5 Oversampling 148

Oversampling the Training Set 151

Evaluating Model Performance Using a Non-oversampled Validation Set 151

Evaluating Model Performance If Only Oversampled Validation Set Exists 151

Problems 154

PART IV PREDICTION AND CLASSIFICATION METHODS

CHAPTER 6 Multiple Linear Regression 159

6.1 Introduction 159

6.2 Explanatory vs. Predictive Modeling 160

6.3 Estimating the Regression Equation and Prediction 162

Example: Predicting the Price of Used Toyota Corolla Cars 162

6.4 Variable Selection in Linear Regression 168

Reducing the Number of Predictors 168

How to Reduce the Number of Predictors 169

Problems 176

CHAPTER 7 k-Nearest Neighbors (kNN) 181

7.1 The k-NN Classifier (categorical outcome) 181

Determining Neighbors 181

Classification Rule 182

Example: Riding Mowers 183

Choosing k 184

Setting the Cutoff Value 186

k-NN with More Than Two Classes 186

Converting Categorical Variables to Binary Dummies 188

7.2 k-NN for a Numerical Response 188

7.3 Advantages and Shortcomings of k-NN Algorithms 190

Problems 192

CHAPTER 8 The Naive Bayes Classifier 195

8.1 Introduction 195

Cutoff Probability Method 196

Conditional Probability 196

Example 1: Predicting Fraudulent Financial Reporting 196

8.2 Applying the Full (Exact) Bayesian Classifier 197

Using the “Assign to the Most Probable Class” Method 198

Using the Cutoff Probability Method 198

Practical Difficulty with the Complete (Exact) Bayes Procedure 198

Solution: Naive Bayes 199

The Naive Bayes Assumption of Conditional Independence 200

Using the Cutoff Probability Method 200

Example 2: Predicting Fraudulent Financial Reports, Two Predictors 201

Example 3: Predicting Delayed Flights 202

8.3 Advantages and Shortcomings of the Naive Bayes Classifier 207

Problems 210

CHAPTER 9 Classification and Regression Trees 213

9.1 Introduction 213

9.2 Classification Trees 215

Recursive Partitioning 215

Example 1: Riding Mowers 215

Measures of Impurity 218

Tree Structure 220

Classifying a New Record 221

9.3 Evaluating the Performance of a Classification Tree 223

Example 2: Acceptance of Personal Loan 223

9.4 Avoiding Overfitting 229

Stopping Tree Growth: Conditional Inference Trees 229

Pruning the Tree 230

Cross-Validation 230

Best Pruned Tree 234

9.5 Classification Rules from Trees 235

9.6 Classification Trees for More Than Two Classes 235

9.7 Regression Trees 236

Prediction 237

Measuring Impurity 237

Evaluating Performance 237

9.8 Improving Prediction: Random Forests and Boosted Trees 238

Random Forests 238

Boosted Trees 240

9.9 Advantages and Weaknesses of a Tree 241

Problems 243

CHAPTER 10 Logistic Regression 247

10.1 Introduction 247

10.2 The Logistic Regression Model 249

10.3 Example: Acceptance of Personal Loan 250

Model with a Single Predictor 252

Estimating the Logistic Model from Data: Computing Parameter Estimates 253

Interpreting Results in Terms of Odds (for a Profiling Goal) 256

10.4 Evaluating Classification Performance 257

Variable Selection 258

10.5 Example of Complete Analysis: Predicting Delayed Flights 261

Data Preprocessing 265

Model Fitting and Estimation 265

Model Interpretation 265

Model Performance 265

Variable Selection 267

10.6 Appendix: Logistic Regression for Profiling 271

Appendix A: Why Linear Regression Is Problematic for a Categorical Outcome 271

Appendix B: Evaluating Explanatory Power 272

Appendix C: Logistic Regression for More Than Two Classes 276

Problems 280

CHAPTER 11 Neural Nets 283

11.1 Introduction 283

11.2 Concept and Structure of a Neural Network 284

11.3 Fitting a Network to Data 285

Example 1: Tiny Dataset 285

Computing Output of Nodes 286

Preprocessing the Data 289

Training the Model 290

Example 2: Classifying Accident Severity 294

Avoiding Overfitting 295

Using the Output for Prediction and Classification 295

11.4 Required User Input 297

11.5 Exploring the Relationship Between Predictors and Outcome 299

11.6 Advantages and Weaknesses of Neural Networks 301

Problems 302

CHAPTER 12 Discriminant Analysis 305

12.1 Introduction 305

Example 1: Riding Mowers 306

Example 2: Personal Loan Acceptance 306

12.2 Distance of a Record from a Class 308

12.3 Fisher’s Linear Classification Functions 309

12.4 Classification Performance of Discriminant Analysis 312

12.5 Prior Probabilities 314

12.6 Unequal Misclassification Costs 314

12.7 Classifying More Than Two Classes 315

Example 3: Medical Dispatch to Accident Scenes 315

12.8 Advantages and Weaknesses 319

Problems 320

CHAPTER 13 Combining Methods: Ensembles and Uplift Modeling 323

13.1 Ensembles 323

Why Ensembles Can Improve Predictive Power 324

Simple Averaging 326

Bagging 327

Boosting 327

Bagging and Boosting in R 327

Advantages and Weaknesses of Ensembles 327

13.2 Uplift (Persuasion) Modeling 330

A-B Testing 330

Uplift 331

Gathering the Data 331

A Simple Model 333

Modeling Individual Uplift 333

Computing Uplift with R 334

Using the Results of an Uplift Model 336

13.3 Summary 336

Problems 337

PART V MINING RELATIONSHIPS AMONG RECORDS

CHAPTER 14 Association Rules and Collaborative Filtering 341

14.1 Association Rules 342

Discovering Association Rules in Transaction Databases 342

Example 1: Synthetic Data on Purchases of Phone Faceplates 342

Generating Candidate Rules 344

The Apriori Algorithm 345

Selecting Strong Rules 345

Data Format 347

The Process of Rule Selection 349

Interpreting the Results 349

Rules and Chance 351

Example 2: Rules for Similar Book Purchases 353

14.2 Collaborative Filtering 355

Data Type and Format 355

Example 3: Netflix Prize Contest 356

User-Based Collaborative Filtering: “People Like You” 357

Item-Based Collaborative Filtering 360

Advantages and Weaknesses of Collaborative Filtering 360

Collaborative Filtering vs. Association Rules 362

14.3 Summary 363

Problems 365

CHAPTER 15 Cluster Analysis 369

15.1 Introduction 369

Example: Public Utilities 371

15.2 Measuring Distance Between Two Records 373

Euclidean Distance 373

Normalizing Numerical Measurements 374

Other Distance Measures for Numerical Data 374

Distance Measures for Categorical Data 377

Distance Measures for Mixed Data 378

15.3 Measuring Distance Between Two Clusters 378

Minimum Distance 378

Maximum Distance 378

Average Distance 379

Centroid Distance 379

15.4 Hierarchical (Agglomerative) Clustering 381

Single Linkage 381

Complete Linkage 382

Average Linkage 382

Centroid Linkage 382

Ward’s Method 382

Dendrograms: Displaying Clustering Process and Results 383

Validating Clusters 385

Limitations of Hierarchical Clustering 388

15.5 Non-hierarchical Clustering: The k-Means Algorithm 388

Choosing the Number of Clusters (k) 390

Problems 395

PART VI FORECASTING TIME SERIES

CHAPTER 16 Handling Time Series 401

16.1 Introduction 401

16.2 Descriptive vs. Predictive Modeling 403

16.3 Popular Forecasting Methods in Business 403

Combining Methods 403

16.4 Time Series Components 404

Example: Ridership on Amtrak Trains 404

16.5 Data Partitioning and Performance Evaluation 409

Benchmark Performance: Naive Forecasts 410

Generating Future Forecasts 412

Problems 413

CHAPTER 17 Regression-Based Forecasting 417

17.1 A Model with Trend 417

Linear Trend 417

Exponential Trend 421

Polynomial Trend 423

17.2 A Model with Seasonality 423

17.3 A Model with Trend and Seasonality 428

17.4 Autocorrelation and ARIMA Models 430

Computing Autocorrelation 430

Improving Forecasts by Integrating Autocorrelation Information 433

Evaluating Predictability 437

Problems 439

CHAPTER 18 Smoothing Methods 449

18.1 Introduction 449

18.2 Moving Average 450

Centered Moving Average for Visualization 450

Trailing Moving Average for Forecasting 451

Choosing Window Width (w) 455

18.3 Simple Exponential Smoothing 455

Choosing Smoothing Parameter α 456

Relation Between Moving Average and Simple Exponential Smoothing 458

18.4 Advanced Exponential Smoothing 458

Series with a Trend 458

Series with a Trend and Seasonality 459

Series with Seasonality (No Trend) 460

Problems 463

PART VII DATA ANALYTICS

CHAPTER 19 Social Network Analytics 473

19.1 Introduction 473

19.2 Directed vs. Undirected Networks 475

19.3 Visualizing and Analyzing Networks 476

Graph Layout 476

Edge List 479

Adjacency Matrix 479

Using Network Data in Classification and Prediction 479

19.4 Social Data Metrics and Taxonomy 480

Node-Level Centrality Metrics 481

Egocentric Network 482

Network Metrics 484

19.5 Using Network Metrics in Prediction and Classification 486

Link Prediction 486

Entity Resolution 486

Collaborative Filtering 489

Collecting Social Network Data With R 491

Advantages and Disadvantages 493

Problems 495

CHAPTER 20 Text Mining 497

20.1 Introduction 497

20.2 The Tabular Representation of Text: Term-Document Matrix and “Bag-of-Words” 498

20.3 Bag-of-Words vs. Meaning Extraction at Document Level 499

20.4 Preprocessing the Text 500

Tokenization 502

Text Reduction 503

Presence/Absence vs. Frequency 505

Term Frequency - Inverse Document Frequency (TF-IDF) 505

From Terms to Concepts: Latent Semantic Indexing 506

Extracting Meaning 507

20.5 Implementing Data Mining Methods 507

20.6 Example: Online Discussions on Autos and Electronics 508

Importing and Labeling the Records 508

Text Preprocessing in R 510

Producing a Concept Matrix 510

Fitting a Predictive Model 510

Prediction 512

20.7 Summary 512

Problems 513

PART VIII CASES

CHAPTER 21 Cases 517

21.1 Charles Book Club 517

The Book Industry 517

Database Marketing at Charles 518

Data Mining Techniques 520

Assignment 522

21.2 German Credit 524

Background 524

Data 524

Assignment 528

21.3 Tayko Software Cataloger 529

Background 529

The Mailing Experiment 529

Data 529

Assignment 530

21.4 Political Persuasion 533

Background 533

Predictive Analytics Arrives in US Politics 533

Political Targeting 533

Uplift 534

Data 535

Assignment 535

21.5 Taxi Cancellations 537

Business Situation 537

Assignment 537

21.6 Segmenting Consumers of Bath Soap 539

Business Situation 539

Key Problems 539

Data 540

Measuring Brand Loyalty 540

Assignment 540

21.7 Direct-Mail Fundraising 543

Background 543

Data 543

Assignment 543

21.8 Catalog Cross-Selling 546

Background 546

Assignment 546

21.9 Predicting Bankruptcy 548

Predicting Corporate Bankruptcy 548

Assignment 549

21.10 Time Series Case: Forecasting Public Transportation Demand 551

Background 551

Problem Description 551

Available Data 551

Assignment Goal 551

Assignment 552

Tips and Suggested Steps 552

References 553

Data Files Used in the Book 555

Index
