Data Mining for Business Analytics: Concepts, Techniques and Applications in Python

Hardcover

Pre-order

£102.00

*VAT

Galit Shmueli, Peter C. Bruce, Peter Gedeck, Nitin R. Patel

ISBN: 978-1-119-54984-0 | November 2019 | 592 Pages

Description

Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python presents an applied approach to data mining concepts and methods, using Python software for illustration.

Readers will learn how to implement a variety of popular data mining algorithms in Python (free and open-source software) to tackle business problems and opportunities.

This is the sixth version of this successful text, and the first using Python. It covers both statistical and machine learning algorithms for prediction, classification, visualization, dimension reduction, recommender systems, clustering, text mining and network analysis. It also includes:

  • A new co-author, Peter Gedeck, who brings both experience teaching business analytics courses using Python, and expertise in the application of machine learning methods to the drug-discovery process
  • A new section on ethical issues in data mining
  • Updates and new material based on feedback from instructors teaching MBA, undergraduate, diploma and executive courses, and from their students
  • More than a dozen case studies demonstrating applications for the data mining techniques described
  • End-of-chapter exercises that help readers gauge and expand their comprehension of, and competency with, the material presented
  • A companion website with more than two dozen data sets, and instructor materials including exercise solutions, PowerPoint slides, and case solutions

Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python is an ideal textbook for graduate and upper-undergraduate level courses in data mining, predictive analytics, and business analytics. This new edition is also an excellent reference for analysts, researchers, and practitioners working with quantitative methods in the fields of business, finance, marketing, computer science, and information technology.

“This book has by far the most comprehensive review of business analytics methods that I have ever seen, covering everything from classical approaches such as linear and logistic regression, through to modern methods like neural networks, bagging and boosting, and even much more business specific procedures such as social network analysis and text mining. If not the bible, it is at the least a definitive manual on the subject.”

—Gareth M. James, University of Southern California and co-author (with Witten, Hastie and Tibshirani) of the best-selling book An Introduction to Statistical Learning, with Applications in R 

Foreword by Gareth James

Foreword by Ravi Bapna

Preface to the Python Edition

Acknowledgments

Part I Preliminaries

Chapter 1 Introduction

1.1 What Is Business Analytics?

1.2 What Is Data Mining?

1.3 Data Mining and Related Terms

1.4 Big Data

1.5 Data Science

1.6 Why Are There So Many Different Methods?

1.7 Terminology and Notation

1.8 Road Maps to This Book

Order of Topics

Chapter 2 Overview of the Data Mining Process

2.1 Introduction

2.2 Core Ideas in Data Mining

Classification

Prediction

Association Rules and Recommendation Systems

Predictive Analytics

Data Reduction and Dimension Reduction

Data Exploration and Visualization

Supervised and Unsupervised Learning

2.3 The Steps in Data Mining

2.4 Preliminary Steps

Organization of Datasets

Predicting Home Values in the West Roxbury Neighborhood

Loading and Looking at the Data in Python

Python imports

Sampling from a Database

Oversampling Rare Events in Classification Tasks

Preprocessing and Cleaning the Data

2.5 Predictive Power and Overfitting

Overfitting

Creation and Use of Data Partitions

2.6 Building a Predictive Model

Modeling Process

2.7 Using Python for Data Mining on a Local Machine

2.8 Automating Data Mining Solutions

2.9 Ethical Practice in Data Mining

Data Mining Software: The State of the Market (by Herb Edelstein)

Problems

Part II Data Exploration and Dimension Reduction

Chapter 3 Data Visualization

3.1 Uses of Data Visualization

Python

3.2 Data Examples

Example 1: Boston Housing Data

Example 2: Ridership on Amtrak Trains

3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots

Distribution Plots: Boxplots and Histograms

Heatmaps: Visualizing Correlations and Missing Values

3.4 Multidimensional Visualization

Adding Variables: Color, Size, Shape, Multiple Panels, and Animation

Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering

Reference: Trend Lines and Labels

Scaling up to Large Datasets

Multivariate Plot: Parallel Coordinates Plot

Interactive Visualization

3.5 Specialized Visualizations

Visualizing Networked Data

Visualizing Hierarchical Data: Treemaps

Visualizing Geographical Data: Map Charts

3.6 Summary: Major Visualizations and Operations, by Data Mining Goal

Prediction

Classification

Time Series Forecasting

Unsupervised Learning

Problems

Chapter 4 Dimension Reduction

Python

4.1 Introduction

4.2 Curse of Dimensionality

4.3 Practical Considerations

Example 1: House Prices in Boston

4.4 Data Summaries

Summary Statistics

Aggregation and Pivot Tables

4.5 Correlation Analysis

4.6 Reducing the Number of Categories in Categorical Variables

4.7 Converting a Categorical Variable to a Numerical Variable

4.8 Principal Components Analysis

Example 2: Breakfast Cereals

Principal Components

Normalizing the Data

Using Principal Components for Classification and Prediction

4.9 Dimension Reduction Using Regression Models

4.10 Dimension Reduction Using Classification and Regression Trees

Problems

Part III Performance Evaluation

Chapter 5 Evaluating Predictive Performance

Python

5.1 Introduction

5.2 Evaluating Predictive Performance

Naive Benchmark: The Average

Prediction Accuracy Measures

Comparing Training and Validation Performance

Cumulative Gains and Lift Charts

5.3 Judging Classifier Performance

Benchmark: The Naive Rule

Class Separation

The Confusion (Classification) Matrix

Using the Validation Data

Accuracy Measures

Propensities and Cutoff for Classification

Performance in Case of Unequal Importance of Classes

Asymmetric Misclassification Costs

Generalization to More Than Two Classes

5.4 Judging Ranking Performance

Gains and Lift Charts for Binary Data

Decile Lift Charts

Beyond Two Classes

Gains and Lift Charts Incorporating Costs and Benefits

Cumulative Gains as a Function of Cutoff

5.5 Oversampling

Oversampling the Training Set

Evaluating Model Performance Using a Non-oversampled Validation Set

Evaluating Model Performance if Only Oversampled Validation Set Exists

Problems

Part IV Prediction and Classification Methods

Chapter 6 Multiple Linear Regression

Python

6.1 Introduction

6.2 Explanatory vs. Predictive Modeling

6.3 Estimating the Regression Equation and Prediction

Example: Predicting the Price of Used Toyota Corolla Cars

6.4 Variable Selection in Linear Regression

Reducing the Number of Predictors

How to Reduce the Number of Predictors

Regularization (Shrinkage Models)

Problems

Chapter 7 k-Nearest Neighbors (kNN)

Python

7.1 The k-NN Classifier (Categorical Outcome)

Determining Neighbors

Classification Rule

Example: Riding Mowers

Choosing k

Setting the Cutoff Value

k-NN with More Than Two Classes

Converting Categorical Variables to Binary Dummies

7.2 k-NN for a Numerical Outcome

7.3 Advantages and Shortcomings of k-NN Algorithms

Problems

Chapter 8 The Naive Bayes Classifier

Python

8.1 Introduction

Cutoff Probability Method

Conditional Probability

Example 1: Predicting Fraudulent Financial Reporting

8.2 Applying the Full (Exact) Bayesian Classifier

Using the “Assign to the Most Probable Class” Method

Using the Cutoff Probability Method

Practical Difficulty with the Complete (Exact) Bayes Procedure

Solution: Naive Bayes

The Naive Bayes Assumption of Conditional Independence

Using the Cutoff Probability Method

Example 2: Predicting Fraudulent Financial Reports, Two Predictors

Example 3: Predicting Delayed Flights

8.3 Advantages and Shortcomings of the Naive Bayes Classifier

Problems

Chapter 9 Classification and Regression Trees

Python

9.1 Introduction

Tree Structure

Decision Rules

Classifying a New Record

9.2 Classification Trees

Recursive Partitioning

Example 1: Riding Mowers

Measures of Impurity

9.3 Evaluating the Performance of a Classification Tree

Example 2: Acceptance of Personal Loan

Sensitivity Analysis Using Cross Validation

9.4 Avoiding Overfitting

Stopping Tree Growth

Fine-tuning Tree Parameters

Other Methods for Limiting Tree Size

9.5 Classification Rules from Trees

9.6 Classification Trees for More Than Two Classes

9.7 Regression Trees

Prediction

Measuring Impurity

Evaluating Performance

9.8 Improving Prediction: Random Forests and Boosted Trees

Random Forests

Boosted Trees

9.9 Advantages and Weaknesses of a Tree

Problems

Chapter 10 Logistic Regression

Python

10.1 Introduction

10.2 The Logistic Regression Model

10.3 Example: Acceptance of Personal Loan

Model with a Single Predictor

Estimating the Logistic Model from Data: Computing Parameter Estimates

Interpreting Results in Terms of Odds (for a Profiling Goal)

10.4 Evaluating Classification Performance

Variable Selection

10.5 Logistic Regression for Multi-class Classification

Ordinal Classes

Nominal Classes

Comparing Ordinal and Nominal Models

10.6 Example of Complete Analysis: Predicting Delayed Flights

Data Preprocessing

Model Training

Model Interpretation

Model Performance

Variable Selection

Problems

Chapter 11 Neural Nets

Python

11.1 Introduction

11.2 Concept and Structure of a Neural Network

11.3 Fitting a Network to Data

Example 1: Tiny Dataset

Computing Output of Nodes

Preprocessing the Data

Training the Model

Example 2: Classifying Accident Severity

Avoiding Overfitting

Using the Output for Prediction and Classification

11.4 Required User Input

11.5 Exploring the Relationship Between Predictors and Outcome

11.6 Deep Learning

Convolutional neural networks (CNNs)

Local feature map

A Hierarchy of Features

The Learning Process

Unsupervised Learning

Conclusion

11.7 Advantages and Weaknesses of Neural Networks

Problems

Chapter 12 Discriminant Analysis

Python

12.1 Introduction

Example 1: Riding Mowers

Example 2: Personal Loan Acceptance

12.2 Distance of a Record from a Class

12.3 Fisher’s Linear Classification Functions

12.4 Classification Performance of Discriminant Analysis

12.5 Prior Probabilities

12.6 Unequal Misclassification Costs

12.7 Classifying More Than Two Classes

Example 3: Medical Dispatch to Accident Scenes

12.8 Advantages and Weaknesses

Problems

Chapter 13 Combining Methods: Ensembles and Uplift Modeling

Python

13.1 Ensembles

Why Ensembles Can Improve Predictive Power

Simple Averaging

Bagging

Boosting

Bagging and Boosting in Python

Advantages and Weaknesses of Ensembles

13.2 Uplift (Persuasion) Modeling

A-B Testing

Uplift

Gathering the Data

A Simple Model

Modeling Individual Uplift

Computing Uplift with Python

Using the Results of an Uplift Model

13.3 Summary

Problems

Part V Mining Relationships Among Records

Chapter 14 Association Rules and Collaborative Filtering

Python

14.1 Association Rules

Discovering Association Rules in Transaction Databases

Example 1: Synthetic Data on Purchases of Phone Faceplates

Generating Candidate Rules

The Apriori Algorithm

Selecting Strong Rules

Data Format

The Process of Rule Selection

Interpreting the Results

Rules and Chance

Example 2: Rules for Similar Book Purchases

14.2 Collaborative Filtering

Data Type and Format

Example 3: Netflix Prize Contest

User-Based Collaborative Filtering: “People Like You”

Item-Based Collaborative Filtering

Advantages and Weaknesses of Collaborative Filtering

Collaborative Filtering vs. Association Rules

14.3 Summary

Problems

Chapter 15 Cluster Analysis

Python

15.1 Introduction

Example: Public Utilities

15.2 Measuring Distance Between Two Records

Euclidean Distance

Normalizing Numerical Measurements

Other Distance Measures for Numerical Data

Distance Measures for Categorical Data

Distance Measures for Mixed Data

15.3 Measuring Distance Between Two Clusters

Minimum Distance

Maximum Distance

Average Distance

Centroid Distance

15.4 Hierarchical (Agglomerative) Clustering

Single Linkage

Complete Linkage

Average Linkage

Centroid Linkage

Ward’s Method

Dendrograms: Displaying Clustering Process and Results

Validating Clusters

Limitations of Hierarchical Clustering

15.5 Non-Hierarchical Clustering: The k-Means Algorithm

Choosing the Number of Clusters (k)

Problems

Part VI Forecasting Time Series

Chapter 16 Handling Time Series

Python

16.1 Introduction

16.2 Descriptive vs. Predictive Modeling

16.3 Popular Forecasting Methods in Business

Combining Methods

16.4 Time Series Components

Example: Ridership on Amtrak Trains

16.5 Data-Partitioning and Performance Evaluation

Benchmark Performance: Naive Forecasts

Generating Future Forecasts

Problems

Chapter 17 Regression-Based Forecasting

Python

17.1 A Model with Trend

Linear Trend

Exponential Trend

Polynomial Trend

17.2 A Model with Seasonality

17.3 A Model with Trend and Seasonality

17.4 Autocorrelation and ARIMA Models

Computing Autocorrelation

Improving Forecasts by Integrating Autocorrelation Information

Evaluating Predictability

Problems

Chapter 18 Smoothing Methods

Python

18.1 Introduction

18.2 Moving Average

Centered Moving Average for Visualization

Trailing Moving Average for Forecasting

Choosing Window Width (w)

18.3 Simple Exponential Smoothing

Choosing Smoothing Parameter α

Relation Between Moving Average and Simple Exponential Smoothing

18.4 Advanced Exponential Smoothing

Series with a Trend

Series with a Trend and Seasonality

Series with Seasonality (No Trend)

Problems

Part VII Data Analytics

Chapter 19 Social Network Analytics

Python

19.1 Introduction

19.2 Directed vs. Undirected Networks

19.3 Visualizing and Analyzing Networks

Plot Layout

Edge List

Adjacency Matrix

Using Network Data in Classification and Prediction

19.4 Social Data Metrics and Taxonomy

Node-Level Centrality Metrics

Egocentric Network

Network Metrics

19.5 Using Network Metrics in Prediction and Classification

Link Prediction

Entity Resolution

Collaborative Filtering

19.6 Collecting Social Network Data with Python

19.7 Advantages and Disadvantages

Problems

Chapter 20 Text Mining

Python

20.1 Introduction

20.2 The Tabular Representation of Text: Term-Document Matrix and “Bag-of-Words”

20.3 Bag-of-Words vs. Meaning Extraction at Document Level

20.4 Preprocessing the Text

Tokenization

Text Reduction

Presence/Absence vs. Frequency

Term Frequency–Inverse Document Frequency (TF-IDF)

From Terms to Concepts: Latent Semantic Indexing

Extracting Meaning

20.5 Implementing Data Mining Methods

20.6 Example: Online Discussions on Autos and Electronics

Importing and Labeling the Records

Text Preprocessing in Python

Producing a Concept Matrix

Fitting a Predictive Model

Prediction

20.7 Summary

Problems

Part VIII Cases

Chapter 21 Cases

21.1 Charles Book Club

The Book Industry

Database Marketing at Charles

Data Mining Techniques

Assignment

21.2 German Credit

Background

Data

Assignment

21.3 Tayko Software Cataloger

Background

The Mailing Experiment

Data

Assignment

21.4 Political Persuasion

Background

Predictive Analytics Arrives in US Politics

Political Targeting

Uplift

Data

Assignment

21.5 Taxi Cancellations

Business Situation

Assignment

21.6 Segmenting Consumers of Bath Soap

Business Situation

Key Problems

Data

Measuring Brand Loyalty

Assignment

21.7 Direct-Mail Fundraising

Background

Data

Assignment

21.8 Catalog Cross-Selling

Background

Assignment

21.9 Time Series Case: Forecasting Public Transportation Demand

Background

Problem Description

Available Data

Assignment Goal

Assignment

Tips and Suggested Steps

References

Data Files Used in the Book

Python Utilities Functions

Index