Ebook
Data Mining and Predictive Analytics, 2nd Edition
ISBN: 9781118991121
824 pages
June 2015

Description
Learn methods of data analysis and their application to real-world data sets
This updated second edition serves as an introduction to data mining methods and models, including association rules, clustering, neural networks, logistic regression, and multivariate analysis. The authors apply a unified “white box” approach to data mining methods and models. This approach is designed to walk readers through the operations and nuances of the various methods, using small data sets, so readers can gain insight into the inner workings of the method under review. Chapters provide readers with hands-on analysis problems, giving them an opportunity to apply their newly acquired data mining expertise to solving real problems using large, real-world data sets.
Data Mining and Predictive Analytics, Second Edition:
 Offers comprehensive coverage of association rules, clustering, neural networks, logistic regression, multivariate analysis, and the R statistical programming language
 Features over 750 chapter exercises, allowing readers to assess their understanding of the new material
 Provides a detailed case study that brings together the lessons learned in the book
 Includes access to the companion website, www.dataminingconsultant.com, with exclusive password-protected instructor content
Data Mining and Predictive Analytics, Second Edition will appeal to computer science and statistics students, as well as students in MBA programs and chief executives.
Table of Contents
PREFACE xxi
ACKNOWLEDGMENTS xxix
PART I DATA PREPARATION 1
CHAPTER 1 AN INTRODUCTION TO DATA MINING AND PREDICTIVE ANALYTICS 3
1.1 What is Data Mining? What is Predictive Analytics? 3
1.2 Wanted: Data Miners 5
1.3 The Need for Human Direction of Data Mining 6
1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM 6
1.4.1 CRISP-DM: The Six Phases 7
1.5 Fallacies of Data Mining 9
1.6 What Tasks Can Data Mining Accomplish 10
CHAPTER 2 DATA PREPROCESSING 20
2.1 Why do We Need to Preprocess the Data? 20
2.2 Data Cleaning 21
2.3 Handling Missing Data 22
2.4 Identifying Misclassifications 25
2.5 Graphical Methods for Identifying Outliers 26
2.6 Measures of Center and Spread 27
2.7 Data Transformation 30
2.8 Min–Max Normalization 30
2.9 Z-Score Standardization 31
2.10 Decimal Scaling 32
2.11 Transformations to Achieve Normality 32
2.12 Numerical Methods for Identifying Outliers 38
2.13 Flag Variables 39
2.14 Transforming Categorical Variables into Numerical Variables 40
2.15 Binning Numerical Variables 41
2.16 Reclassifying Categorical Variables 42
2.17 Adding an Index Field 43
2.18 Removing Variables that are not Useful 43
2.19 Variables that Should Probably not be Removed 43
2.20 Removal of Duplicate Records 44
2.21 A Word About ID Fields 45
CHAPTER 3 EXPLORATORY DATA ANALYSIS 54
3.1 Hypothesis Testing Versus Exploratory Data Analysis 54
3.2 Getting to Know the Data Set 54
3.3 Exploring Categorical Variables 56
3.4 Exploring Numeric Variables 64
3.5 Exploring Multivariate Relationships 69
3.6 Selecting Interesting Subsets of the Data for Further Investigation 70
3.7 Using EDA to Uncover Anomalous Fields 71
3.8 Binning Based on Predictive Value 72
3.9 Deriving New Variables: Flag Variables 75
3.10 Deriving New Variables: Numerical Variables 77
3.11 Using EDA to Investigate Correlated Predictor Variables 78
3.12 Summary of Our EDA 81
CHAPTER 4 DIMENSION-REDUCTION METHODS 92
4.1 Need for Dimension-Reduction in Data Mining 92
4.2 Principal Components Analysis 93
4.3 Applying PCA to the Houses Data Set 96
4.4 How Many Components Should We Extract? 102
4.5 Profiling the Principal Components 105
4.6 Communalities 108
4.7 Validation of the Principal Components 110
4.8 Factor Analysis 110
4.9 Applying Factor Analysis to the Adult Data Set 111
4.10 Factor Rotation 114
4.11 User-Defined Composites 117
4.12 An Example of a User-Defined Composite 118
PART II STATISTICAL ANALYSIS 129
CHAPTER 5 UNIVARIATE STATISTICAL ANALYSIS 131
5.1 Data Mining Tasks in Discovering Knowledge in Data 131
5.2 Statistical Approaches to Estimation and Prediction 131
5.3 Statistical Inference 132
5.4 How Confident are We in Our Estimates? 133
5.5 Confidence Interval Estimation of the Mean 134
5.6 How to Reduce the Margin of Error 136
5.7 Confidence Interval Estimation of the Proportion 137
5.8 Hypothesis Testing for the Mean 138
5.9 Assessing the Strength of Evidence Against the Null Hypothesis 140
5.10 Using Confidence Intervals to Perform Hypothesis Tests 141
5.11 Hypothesis Testing for the Proportion 143
CHAPTER 6 MULTIVARIATE STATISTICS 148
6.1 Two-Sample t-Test for Difference in Means 148
6.2 Two-Sample Z-Test for Difference in Proportions 149
6.3 Test for the Homogeneity of Proportions 150
6.4 Chi-Square Test for Goodness of Fit of Multinomial Data 152
6.5 Analysis of Variance 153
CHAPTER 7 PREPARING TO MODEL THE DATA 160
7.1 Supervised Versus Unsupervised Methods 160
7.2 Statistical Methodology and Data Mining Methodology 161
7.3 Cross-Validation 161
7.4 Overfitting 163
7.5 Bias–Variance Trade-Off 164
7.6 Balancing the Training Data Set 166
7.7 Establishing Baseline Performance 167
CHAPTER 8 SIMPLE LINEAR REGRESSION 171
8.1 An Example of Simple Linear Regression 171
8.2 Dangers of Extrapolation 177
8.3 How Useful is the Regression? The Coefficient of Determination, r² 178
8.4 Standard Error of the Estimate, s 183
8.5 Correlation Coefficient r 184
8.6 Anova Table for Simple Linear Regression 186
8.7 Outliers, High Leverage Points, and Influential Observations 186
8.8 Population Regression Equation 195
8.9 Verifying the Regression Assumptions 198
8.10 Inference in Regression 203
8.11 t-Test for the Relationship Between x and y 204
8.12 Confidence Interval for the Slope of the Regression Line 206
8.13 Confidence Interval for the Correlation Coefficient ρ 208
8.14 Confidence Interval for the Mean Value of y Given x 210
8.15 Prediction Interval for a Randomly Chosen Value of y Given x 211
8.16 Transformations to Achieve Linearity 213
8.17 Box–Cox Transformations 220
CHAPTER 9 MULTIPLE REGRESSION AND MODEL BUILDING 236
9.1 An Example of Multiple Regression 236
9.2 The Population Multiple Regression Equation 242
9.3 Inference in Multiple Regression 243
9.4 Regression with Categorical Predictors, Using Indicator Variables 249
9.5 Adjusting R²: Penalizing Models for Including Predictors that are not Useful 256
9.6 Sequential Sums of Squares 257
9.7 Multicollinearity 258
9.8 Variable Selection Methods 266
9.9 Gas Mileage Data Set 270
9.10 An Application of Variable Selection Methods 271
9.11 Using the Principal Components as Predictors in Multiple Regression 279
PART III CLASSIFICATION 299
CHAPTER 10 k-NEAREST NEIGHBOR ALGORITHM 301
10.1 Classification Task 301
10.2 k-Nearest Neighbor Algorithm 302
10.3 Distance Function 305
10.4 Combination Function 307
10.5 Quantifying Attribute Relevance: Stretching the Axes 309
10.6 Database Considerations 310
10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction 310
10.8 Choosing k 311
10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler 312
CHAPTER 11 DECISION TREES 317
11.1 What is a Decision Tree? 317
11.2 Requirements for Using Decision Trees 319
11.3 Classification and Regression Trees 319
11.4 C4.5 Algorithm 326
11.5 Decision Rules 332
11.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data 332
CHAPTER 12 NEURAL NETWORKS 339
12.1 Input and Output Encoding 339
12.2 Neural Networks for Estimation and Prediction 342
12.3 Simple Example of a Neural Network 342
12.4 Sigmoid Activation Function 344
12.5 Back-Propagation 345
12.6 Gradient-Descent Method 346
12.7 Back-Propagation Rules 347
12.8 Example of Back-Propagation 347
12.9 Termination Criteria 349
12.10 Learning Rate 350
12.11 Momentum Term 351
12.12 Sensitivity Analysis 353
12.13 Application of Neural Network Modeling 353
CHAPTER 13 LOGISTIC REGRESSION 359
13.1 Simple Example of Logistic Regression 359
13.2 Maximum Likelihood Estimation 361
13.3 Interpreting Logistic Regression Output 362
13.4 Inference: are the Predictors Significant? 363
13.5 Odds Ratio and Relative Risk 365
13.6 Interpreting Logistic Regression for a Dichotomous Predictor 367
13.7 Interpreting Logistic Regression for a Polychotomous Predictor 370
13.8 Interpreting Logistic Regression for a Continuous Predictor 374
13.9 Assumption of Linearity 378
13.10 Zero-Cell Problem 382
13.11 Multiple Logistic Regression 384
13.12 Introducing Higher Order Terms to Handle Nonlinearity 388
13.13 Validating the Logistic Regression Model 395
13.14 WEKA: Hands-On Analysis Using Logistic Regression 399
CHAPTER 14 NAÏVE BAYES AND BAYESIAN NETWORKS 414
14.1 Bayesian Approach 414
14.2 Maximum a Posteriori (MAP) Classification 416
14.3 Posterior Odds Ratio 420
14.4 Balancing the Data 422
14.5 Naïve Bayes Classification 423
14.6 Interpreting the Log Posterior Odds Ratio 426
14.7 Zero-Cell Problem 428
14.8 Numeric Predictors for Naïve Bayes Classification 429
14.9 WEKA: Hands-On Analysis Using Naïve Bayes 432
14.10 Bayesian Belief Networks 436
14.11 Clothing Purchase Example 436
14.12 Using the Bayesian Network to Find Probabilities 439
CHAPTER 15 MODEL EVALUATION TECHNIQUES 451
15.1 Model Evaluation Techniques for the Description Task 451
15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks 452
15.3 Model Evaluation Measures for the Classification Task 454
15.4 Accuracy and Overall Error Rate 456
15.5 Sensitivity and Specificity 457
15.6 False-Positive Rate and False-Negative Rate 458
15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives 458
15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns 460
15.9 Decision Cost/Benefit Analysis 462
15.10 Lift Charts and Gains Charts 463
15.11 Interweaving Model Evaluation with Model Building 466
15.12 Confluence of Results: Applying a Suite of Models 466
CHAPTER 16 COST-BENEFIT ANALYSIS USING DATA-DRIVEN COSTS 471
16.1 Decision Invariance Under Row Adjustment 471
16.2 Positive Classification Criterion 473
16.3 Demonstration of the Positive Classification Criterion 474
16.4 Constructing the Cost Matrix 474
16.5 Decision Invariance Under Scaling 476
16.6 Direct Costs and Opportunity Costs 478
16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs 478
16.8 Rebalancing as a Surrogate for Misclassification Costs 483
CHAPTER 17 COST-BENEFIT ANALYSIS FOR TRINARY AND k-NARY CLASSIFICATION MODELS 491
17.1 Classification Evaluation Measures for a Generic Trinary Target 491
17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem 494
17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem 498
17.4 Comparing CART Models with and without Data-Driven Misclassification Costs 500
17.5 Classification Evaluation Measures for a Generic k-Nary Target 503
17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification 504
CHAPTER 18 GRAPHICAL EVALUATION OF CLASSIFICATION MODELS 510
18.1 Review of Lift Charts and Gains Charts 510
18.2 Lift Charts and Gains Charts Using Misclassification Costs 510
18.3 Response Charts 511
18.4 Profits Charts 512
18.5 Return on Investment (ROI) Charts 514
PART IV CLUSTERING 521
CHAPTER 19 HIERARCHICAL AND k-MEANS CLUSTERING 523
19.1 The Clustering Task 523
19.2 Hierarchical Clustering Methods 525
19.3 Single-Linkage Clustering 526
19.4 Complete-Linkage Clustering 527
19.5 k-Means Clustering 529
19.6 Example of k-Means Clustering at Work 530
19.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds 533
19.8 Application of k-Means Clustering Using SAS Enterprise Miner 534
19.9 Using Cluster Membership to Predict Churn 537
CHAPTER 20 KOHONEN NETWORKS 542
20.1 Self-Organizing Maps 542
20.2 Kohonen Networks 544
20.3 Example of a Kohonen Network Study 545
20.4 Cluster Validity 549
20.5 Application of Clustering Using Kohonen Networks 549
20.6 Interpreting The Clusters 551
20.7 Using Cluster Membership as Input to Downstream Data Mining Models 556
CHAPTER 21 BIRCH CLUSTERING 560
21.1 Rationale for BIRCH Clustering 560
21.2 Cluster Features 561
21.3 Cluster Feature Tree 562
21.4 Phase 1: Building the CF Tree 562
21.5 Phase 2: Clustering the Sub-Clusters 564
21.6 Example of BIRCH Clustering, Phase 1: Building the CF Tree 565
21.7 Example of BIRCH Clustering, Phase 2: Clustering the Sub-Clusters 570
21.8 Evaluating the Candidate Cluster Solutions 571
21.9 Case Study: Applying BIRCH Clustering to the Bank Loans Data Set 571
CHAPTER 22 MEASURING CLUSTER GOODNESS 582
22.1 Rationale for Measuring Cluster Goodness 582
22.2 The Silhouette Method 583
22.3 Silhouette Example 584
22.4 Silhouette Analysis of the IRIS Data Set 585
22.5 The Pseudo-F Statistic 590
22.6 Example of the Pseudo-F Statistic 591
22.7 Pseudo-F Statistic Applied to the IRIS Data Set 592
22.8 Cluster Validation 593
22.9 Cluster Validation Applied to the Loans Data Set 594
PART V ASSOCIATION RULES 601
CHAPTER 23 ASSOCIATION RULES 603
23.1 Affinity Analysis and Market Basket Analysis 603
23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property 605
23.3 How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets 607
23.4 How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules 608
23.5 Extension from Flag Data to General Categorical Data 611
23.6 Information-Theoretic Approach: Generalized Rule Induction Method 612
23.7 Association Rules are Easy to do Badly 614
23.8 How can we Measure the Usefulness of Association Rules? 615
23.9 Do Association Rules Represent Supervised or Unsupervised Learning? 616
23.10 Local Patterns Versus Global Models 617
PART VI ENHANCING MODEL PERFORMANCE 623
CHAPTER 24 SEGMENTATION MODELS 625
24.1 The Segmentation Modeling Process 625
24.2 Segmentation Modeling Using EDA to Identify the Segments 627
24.3 Segmentation Modeling using Clustering to Identify the Segments 629
CHAPTER 25 ENSEMBLE METHODS: BAGGING AND BOOSTING 637
25.1 Rationale for Using an Ensemble of Classification Models 637
25.2 Bias, Variance, and Noise 639
25.3 When to Apply, and not to apply, Bagging 640
25.4 Bagging 641
25.5 Boosting 643
25.6 Application of Bagging and Boosting Using IBM/SPSS Modeler 647
CHAPTER 26 MODEL VOTING AND PROPENSITY AVERAGING 653
26.1 Simple Model Voting 653
26.2 Alternative Voting Methods 654
26.3 Model Voting Process 655
26.4 An Application of Model Voting 656
26.5 What is Propensity Averaging? 660
26.6 Propensity Averaging Process 661
26.7 An Application of Propensity Averaging 661
PART VII FURTHER TOPICS 669
CHAPTER 27 GENETIC ALGORITHMS 671
27.1 Introduction To Genetic Algorithms 671
27.2 Basic Framework of a Genetic Algorithm 672
27.3 Simple Example of a Genetic Algorithm at Work 673
27.4 Modifications and Enhancements: Selection 676
27.5 Modifications and Enhancements: Crossover 678
27.6 Genetic Algorithms for Real-Valued Variables 679
27.7 Using Genetic Algorithms to Train a Neural Network 681
27.8 WEKA: Hands-On Analysis Using Genetic Algorithms 684
CHAPTER 28 IMPUTATION OF MISSING DATA 695
28.1 Need for Imputation of Missing Data 695
28.2 Imputation of Missing Data: Continuous Variables 696
28.3 Standard Error of the Imputation 699
28.4 Imputation of Missing Data: Categorical Variables 700
28.5 Handling Patterns in Missingness 701
PART VIII CASE STUDY: PREDICTING RESPONSE TO DIRECT-MAIL MARKETING 705
CHAPTER 29 CASE STUDY, PART 1: BUSINESS UNDERSTANDING, DATA PREPARATION, AND EDA 707
29.1 Cross-Industry Standard Practice for Data Mining 707
29.2 Business Understanding Phase 709
29.3 Data Understanding Phase, Part 1: Getting a Feel for the Data Set 710
29.4 Data Preparation Phase 714
29.5 Data Understanding Phase, Part 2: Exploratory Data Analysis 721
CHAPTER 30 CASE STUDY, PART 2: CLUSTERING AND PRINCIPAL COMPONENTS ANALYSIS 732
30.1 Partitioning the Data 732
30.2 Developing the Principal Components 733
30.3 Validating the Principal Components 737
30.4 Profiling the Principal Components 737
30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering 742
30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering 744
30.7 Application of k-Means Clustering 745
30.8 Validating the Clusters 745
30.9 Profiling the Clusters 745
CHAPTER 31 CASE STUDY, PART 3: MODELING AND EVALUATION FOR PERFORMANCE AND INTERPRETABILITY 749
31.1 Do you Prefer the Best Model Performance, or a Combination of Performance and Interpretability? 749
31.2 Modeling and Evaluation Overview 750
31.3 Cost-Benefit Analysis Using Data-Driven Costs 751
31.4 Variables to be Input to the Models 753
31.5 Establishing the Baseline Model Performance 754
31.6 Models that use Misclassification Costs 755
31.7 Models that Need Rebalancing as a Surrogate for Misclassification Costs 756
31.8 Combining Models Using Voting and Propensity Averaging 757
31.9 Interpreting the Most Profitable Model 758
CHAPTER 32 CASE STUDY, PART 4: MODELING AND EVALUATION FOR HIGH PERFORMANCE ONLY 762
32.1 Variables to be Input to the Models 762
32.2 Models that use Misclassification Costs 762
32.3 Models that Need Rebalancing as a Surrogate for Misclassification Costs 764
32.4 Combining Models using Voting and Propensity Averaging 765
32.5 Lessons Learned 766
32.6 Conclusions 766
APPENDIX A DATA SUMMARIZATION AND VISUALIZATION 768
Part 1: Summarization 1: Building Blocks of Data Analysis 768
Part 2: Visualization: Graphs and Tables for Summarizing and Organizing Data 770
Part 3: Summarization 2: Measures of Center, Variability, and Position 774
Part 4: Summarization and Visualization of Bivariate Relationships 777
INDEX 781
Author Information
Chantal D. Larose is a Ph.D. candidate in Statistics at the University of Connecticut. Her research focuses on the imputation of missing data and model-based clustering. She has taught undergraduate statistics since 2011 and is a statistical consultant for DataMiningConsultant.com, LLC.