Machine Learning in Python: Essential Techniques for Predictive Analysis
ISBN: 9781118961742 May 2015 360 Pages
Description
Machine Learning in Python shows you how to analyze data successfully using only two core machine learning algorithms, and how to apply them using Python. By focusing on two algorithm families that effectively predict outcomes, the book can describe the mechanisms at work in full and illustrate them with specific, hackable code examples. The algorithms are explained in simple terms with no complex math and applied using Python, with guidance on algorithm selection, data preparation, and using the trained models in practice. You will learn a core set of Python programming techniques, various methods of building predictive models, and how to measure the performance of each model to ensure that the right one is used. The chapters on penalized linear regression and ensemble methods dive deep into each of the algorithms, and you can use the sample code in the book to develop your own data analysis solutions.
Machine learning algorithms are at the core of data analytics and visualization. In the past, these methods required a deep background in math and statistics, often in combination with the specialized R programming language. This book demonstrates how machine learning can be implemented using the more widely used and accessible Python programming language.
 Predict outcomes using linear and ensemble algorithm families
 Build predictive models that solve a range of simple and complex problems
 Apply core machine learning algorithms using Python
 Use sample code directly to build custom solutions
Machine learning doesn't have to be complex and highly specialized. Python makes this technology more accessible to a much wider audience, using methods that are simpler, effective, and well tested. Machine Learning in Python shows you how to do this, without requiring an extensive background in math or statistics.
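As a rough illustration of the first algorithm family named above, penalized linear regression, the sketch below computes a ridge-penalized coefficient for a single attribute in plain Python. It is not taken from the book: the function name, the toy data, and the penalty values are all assumptions chosen for illustration, and the formula is the standard one-attribute ridge solution after centering.

```python
def ridge_coefficient(xs, ys, penalty):
    """Ridge estimate for one attribute (hypothetical helper, not from the book).

    Minimizes sum((y - b * x) ** 2) + penalty * b ** 2 over b,
    after centering x and y so that no intercept term is needed.
    """
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    xc = [x - x_mean for x in xs]
    yc = [y - y_mean for y in ys]
    sxy = sum(x * y for x, y in zip(xc, yc))
    sxx = sum(x * x for x in xc)
    # Closed-form solution: the penalty inflates the denominator,
    # which shrinks the coefficient toward zero.
    return sxy / (sxx + penalty)


# Toy data (assumed): y is exactly 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

# Larger penalties give smaller coefficients, that is, simpler models.
coefficients = [ridge_coefficient(xs, ys, p) for p in (0.0, 10.0, 100.0)]
print(coefficients)
```

With a penalty of zero the function reduces to ordinary least squares; raising the penalty trades some fit on the training data for a smaller coefficient.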
Table of contents
Introduction xxiii
Chapter 1 The Two Essential Algorithms for Making Predictions 1
Why Are These Two Algorithms So Useful? 2
What Are Penalized Regression Methods? 7
What Are Ensemble Methods? 9
How to Decide Which Algorithm to Use 11
The Process Steps for Building a Predictive Model 13
Framing a Machine Learning Problem 15
Feature Extraction and Feature Engineering 17
Determining Performance of a Trained Model 18
Chapter Contents and Dependencies 18
Summary 20
Chapter 2 Understand the Problem by Understanding the Data 23
The Anatomy of a New Problem 24
Different Types of Attributes and Labels Drive Modeling Choices 26
Things to Notice about Your New Data Set 27
Classification Problems: Detecting Unexploded Mines Using Sonar 28
Physical Characteristics of the Rocks Versus Mines Data Set 29
Statistical Summaries of the Rocks versus Mines Data Set 32
Visualization of Outliers Using Quantile-Quantile Plot 35
Statistical Characterization of Categorical Attributes 37
How to Use Python Pandas to Summarize the Rocks Versus Mines Data Set 37
Visualizing Properties of the Rocks versus Mines Data Set 40
Visualizing with Parallel Coordinates Plots 40
Visualizing Interrelationships between Attributes and Labels 42
Visualizing Attribute and Label Correlations Using a Heat Map 49
Summarizing the Process for Understanding Rocks versus Mines Data Set 50
Real-Valued Predictions with Factor Variables: How Old Is Your Abalone? 50
Parallel Coordinates for Regression Problems—Visualize Variable Relationships for Abalone Problem 56
How to Use Correlation Heat Map for Regression—Visualize Pair-Wise Correlations for the Abalone Problem 60
Real-Valued Predictions Using Real-Valued Attributes: Calculate How Your Wine Tastes 62
Multiclass Classification Problem: What Type of Glass Is That? 68
Summary 73
Chapter 3 Predictive Model Building: Balancing Performance, Complexity, and Big Data 75
The Basic Problem: Understanding Function Approximation 76
Working with Training Data 76
Assessing Performance of Predictive Models 78
Factors Driving Algorithm Choices and Performance—Complexity and Data 79
Contrast Between a Simple Problem and a Complex Problem 80
Contrast Between a Simple Model and a Complex Model 82
Factors Driving Predictive Algorithm Performance 86
Choosing an Algorithm: Linear or Nonlinear? 87
Measuring the Performance of Predictive Models 88
Performance Measures for Different Types of Problems 88
Simulating Performance of Deployed Models 99
Achieving Harmony Between Model and Data 101
Choosing a Model to Balance Problem Complexity, Model Complexity, and Data Set Size 102
Using Forward Stepwise Regression to Control Overfitting 103
Evaluating and Understanding Your Predictive Model 108
Control Overfitting by Penalizing Regression Coefficients—Ridge Regression 110
Summary 119
Chapter 4 Penalized Linear Regression 121
Why Penalized Linear Regression Methods Are So Useful 122
Extremely Fast Coefficient Estimation 122
Variable Importance Information 122
Extremely Fast Evaluation When Deployed 123
Reliable Performance 123
Sparse Solutions 123
Problem May Require Linear Model 124
When to Use Ensemble Methods 124
Penalized Linear Regression: Regulating Linear Regression for Optimum Performance 124
Training Linear Models: Minimizing Errors and More 126
Adding a Coefficient Penalty to the OLS Formulation 127
Other Useful Coefficient Penalties—Manhattan and ElasticNet 128
Why Lasso Penalty Leads to Sparse Coefficient Vectors 129
ElasticNet Penalty Includes Both Lasso and Ridge 131
Solving the Penalized Linear Regression Problem 132
Understanding Least Angle Regression and Its Relationship to Forward Stepwise Regression 132
How LARS Generates Hundreds of Models of Varying Complexity 136
Choosing the Best Model from The Hundreds LARS Generates 139
Using Glmnet: Very Fast and Very General 144
Comparison of the Mechanics of Glmnet and LARS Algorithms 145
Initializing and Iterating the Glmnet Algorithm 146
Extensions to Linear Regression with Numeric Input 151
Solving Classification Problems with Penalized Regression 151
Working with Classification Problems Having More Than Two Outcomes 155
Understanding Basis Expansion: Using Linear Methods on Nonlinear Problems 156
Incorporating Non-Numeric Attributes into Linear Methods 158
Summary 163
Chapter 5 Building Predictive Models Using Penalized Linear Methods 165
Python Packages for Penalized Linear Regression 166
Multivariable Regression: Predicting Wine Taste 167
Building and Testing a Model to Predict Wine Taste 168
Training on the Whole Data Set before Deployment 172
Basis Expansion: Improving Performance by Creating New Variables from Old Ones 178
Binary Classification: Using Penalized Linear Regression to Detect Unexploded Mines 181
Build a Rocks versus Mines Classifier for Deployment 191
Multiclass Classification: Classifying Crime Scene Glass Samples 204
Summary 209
Chapter 6 Ensemble Methods 211
Binary Decision Trees 212
How a Binary Decision Tree Generates Predictions 213
How to Train a Binary Decision Tree 214
Tree Training Equals Split Point Selection 218
How Split Point Selection Affects Predictions 218
Algorithm for Selecting Split Points 219
Multivariable Tree Training—Which Attribute to Split? 219
Recursive Splitting for More Tree Depth 220
Overfitting Binary Trees 221
Measuring Overfit with Binary Trees 221
Balancing Binary Tree Complexity for Best Performance 222
Modifications for Classification and Categorical Features 225
Bootstrap Aggregation: “Bagging” 226
How Does the Bagging Algorithm Work? 226
Bagging Performance—Bias versus Variance 229
How Bagging Behaves on Multivariable Problem 231
Bagging Needs Tree Depth for Performance 235
Summary of Bagging 236
Gradient Boosting 236
Basic Principle of Gradient Boosting Algorithm 237
Parameter Settings for Gradient Boosting 239
How Gradient Boosting Iterates Toward a Predictive Model 240
Getting the Best Performance from Gradient Boosting 240
Gradient Boosting on a Multivariable Problem 244
Summary for Gradient Boosting 247
Random Forest 247
Random Forests: Bagging Plus Random Attribute Subsets 250
Random Forests Performance Drivers 251
Random Forests Summary 252
Summary 252
Chapter 7 Building Ensemble Models with Python 255
Solving Regression Problems with Python Ensemble Packages 255
Building a Random Forest Model to Predict Wine Taste 256
Constructing a Random Forest Regressor Object 256
Modeling Wine Taste with Random Forest Regressor 259
Visualizing the Performance of a Random Forests Regression Model 262
Using Gradient Boosting to Predict Wine Taste 263
Using the Class Constructor for Gradient Boosting Regressor 263
Using Gradient Boosting Regressor to Implement a Regression Model 267
Assessing the Performance of a Gradient Boosting Model 269
Coding Bagging to Predict Wine Taste 270
Incorporating Non-Numeric Attributes in Python Ensemble Models 275
Coding the Sex of Abalone for Input to Random Forest Regression in Python 275
Assessing Performance and the Importance of Coded Variables 278
Coding the Sex of Abalone for Gradient Boosting Regression in Python 278
Assessing Performance and the Importance of Coded Variables with Gradient Boosting 282
Solving Binary Classification Problems with Python Ensemble Methods 284
Detecting Unexploded Mines with Python Random Forest 285
Constructing a Random Forests Model to Detect Unexploded Mines 287
Determining the Performance of a Random Forests Classifier 291
Detecting Unexploded Mines with Python Gradient Boosting 291
Determining the Performance of a Gradient Boosting Classifier 298
Solving Multiclass Classification Problems with Python Ensemble Methods 302
Classifying Glass with Random Forests 302
Dealing with Class Imbalances 305
Classifying Glass Using Gradient Boosting 307
Assessing the Advantage of Using Random Forest Base Learners with Gradient Boosting 311
Comparing Algorithms 314
Summary 315
Index 319
Errata
| Chapter | Page | Details | Date | Print Run |
|  | 56 | Errata in text. Currently reads: "Listing 2-11: Parallel Coordinate Plot for Abalone Data-abaloneParallelPlot.py". Should read: "Listing 2-12: Parallel Coordinate Plot for Abalone Data-abaloneParallelPlot.py". | 10-Feb-18 |  |
|  | 60 | Errata in text. Currently reads: "Listing 2-12 shows the code". Should read: "Listing 2-13 shows the code". | 10-Feb-18 |  |
|  | 60 | Errata in text. Currently reads: "Listing 2-12: Correlation Calculations for Abalone Data-abaloneCorrHeat.py". Should read: "Listing 2-13: Correlation Calculations for Abalone Data-abaloneCorrHeat.py". | 10-Feb-18 |  |
|  | 61 | Errata in text. Currently reads: "Listing 2-12 shows the numeric". Should read: "Listing 2-13 shows the numeric". | 10-Feb-18 |  |
|  | 62 | Errata in text. Currently reads: "Listing 2-13 shows the code". Should read: "Listing 2-14 shows the code". | 10-Feb-18 |  |
|  | 62 | Errata in text. Currently reads: "Listing 2-13: Wine Data Summary-wineSummary.py". Should read: "Listing 2-14: Wine Data Summary-wineSummary.py". | 10-Feb-18 |  |
|  | 64 | Errata in text. Currently reads: "Listing 2-14 shows". Should read: "Listing 2-15 shows". | 10-Feb-18 |  |
|  | 64 | Errata in text. Currently reads: "Listing 2-14 normalizes the wine". Should read: "Listing 2-15 normalizes the wine". | 10-Feb-18 |  |
|  | 65 | Errata in text. Currently reads: "Listing 2-14: Producing a Parallel Coordinate Plot for Wine Data-wineParallelPlot.py". Should read: "Listing 2-15: Producing a Parallel Coordinate Plot for Wine Data-wineParallelPlot.py". | 10-Feb-18 |  |
|  | 68 | Errata in text. Currently reads: "Listing 2-15 shows the code". Should read: "Listing 2-16 shows the code". | 10-Feb-18 |  |
|  | 68 | Errata in text. Currently reads: "Listing 2-15: Summary of Glass Data Set-glassSummary.py". Should read: "Listing 2-16: Summary of Glass Data Set-glassSummary.py". | 10-Feb-18 |  |
|  | 71 | Errata in text. Currently reads: "Listing 2-16: Parallel Coordinate Plot for the Glass Data-glassParallelPlot.py". Should read: "Listing 2-17: Parallel Coordinate Plot for the Glass Data-glassParallelPlot.py". | 10-Feb-18 |  |
|  | 72 | Errata in text. Currently reads: "Listing 2-16 shows the code". Should read: "Listing 2-17 shows the code". | 10-Feb-18 |  |
|  | 130 | Errata in text. Currently reads: "Equations 4-7 and 4-9". Should read: "Equations 4-7 and 4-8". | 10-Feb-18 |  |
|  | 131 | Errata in text. Currently reads: "in Equation 4-6". Should read: "in Equation 4-7". | 10-Feb-18 |  |
|  | 131 | Errata in text. Currently reads: "for Equations 4-6". Should read: "for Equations 4-7". | 10-Feb-18 |  |
|  | 132 | Errata in text. Currently reads: "in Equations 4-6, 4-8, and 4-11". Should read: "in Equations 4-6, 4-8, and ElasticNet". | 10-Feb-18 |  |
|  | 136 | Errata in text. Currently reads: "at Equations 4-6, 4-8, and 4-11". Should read: "at Equations 4-6 and 4-8". | 10-Feb-18 |  |
|  | 137 | Errata in text. Currently reads: "Equations 4-6, 4-8, and 4-11". Should read: "Equations 4-6 and 4-8". | 10-Feb-18 |  |
|  | 144 | Errata in text. Currently reads: "The glmnet algorithm solves the ElasticNet problem given by Equation 4-11". Should read: "The glmnet algorithm solves the ElasticNet problem". | 10-Feb-18 |  |
|  | 145 | Errata in text. Currently reads: "key iterative equation for the coefficients that solve Equation 11". Should read: "key iterative equation for the coefficients that solve". Note: delete "Equation 11". | 10-Feb-18 |  |
|  | 145 | Errata in text. Currently reads: "the iteration inferred in 4-12". Should read: "the iteration inferred in 4-9". | 10-Feb-18 |  |
|  | 146 | Errata in text. Currently reads: "how Equation 4-12". Should read: "how Equation 4-9". | 10-Feb-18 |  |
|  | 146 | Errata in text. Currently reads: "in Equation 4-12 gives". Should read: "in Equation 4-9 gives". | 10-Feb-18 |  |
