Preface xiii

**1 Introduction: Classification, Learning, Features, and Applications 1**

1.1 Scope 1

1.2 Why Machine Learning? 2

1.3 Some Applications 3

1.3.1 Image Recognition 3

1.3.2 Speech Recognition 3

1.3.3 Medical Diagnosis 4

1.3.4 Statistical Arbitrage 4

1.4 Measurements, Features, and Feature Vectors 4

1.5 The Need for Probability 5

1.6 Supervised Learning 5

1.7 Summary 6

1.8 Appendix: Induction 6

1.9 Questions 7

1.10 References 8

**2 Probability 10**

2.1 Probability of Some Basic Events 10

2.2 Probabilities of Compound Events 12

2.3 Conditional Probability 13

2.4 Drawing Without Replacement 14

2.5 A Classic Birthday Problem 15

2.6 Random Variables 15

2.7 Expected Value 16

2.8 Variance 17

2.9 Summary 19

2.10 Appendix: Interpretations of Probability 19

2.11 Questions 20

2.12 References 21

**3 Probability Densities 23**

3.1 An Example in Two Dimensions 23

3.2 Random Numbers in [0,1] 23

3.3 Density Functions 24

3.4 Probability Densities in Higher Dimensions 27

3.5 Joint and Conditional Densities 27

3.6 Expected Value and Variance 28

3.7 Laws of Large Numbers 29

3.8 Summary 30

3.9 Appendix: Measurability 30

3.10 Questions 32

3.11 References 32

**4 The Pattern Recognition Problem 34**

4.1 A Simple Example 34

4.2 Decision Rules 35

4.3 Success Criterion 37

4.4 The Best Classifier: Bayes Decision Rule 37

4.5 Continuous Features and Densities 38

4.6 Summary 39

4.7 Appendix: Uncountably Many 39

4.8 Questions 40

4.9 References 41

**5 The Optimal Bayes Decision Rule 43**

5.1 Bayes Theorem 43

5.2 Bayes Decision Rule 44

5.3 Optimality and Some Comments 45

5.4 An Example 47

5.5 Bayes Theorem and Decision Rule with Densities 48

5.6 Summary 49

5.7 Appendix: Defining Conditional Probability 50

5.8 Questions 50

5.9 References 53

**6 Learning from Examples 55**

6.1 Lack of Knowledge of Distributions 55

6.2 Training Data 56

6.3 Assumptions on the Training Data 57

6.4 A Brute Force Approach to Learning 59

6.5 Curse of Dimensionality, Inductive Bias, and No Free Lunch 60

6.6 Summary 61

6.7 Appendix: What Sort of Learning? 62

6.8 Questions 63

6.9 References 64

**7 The Nearest Neighbor Rule 65**

7.1 The Nearest Neighbor Rule 65

7.2 Performance of the Nearest Neighbor Rule 66

7.3 Intuition and Proof Sketch of Performance 67

7.4 Using More Neighbors 69

7.5 Summary 70

7.6 Appendix: When People Use Nearest Neighbor Reasoning 70

7.6.1 Who Is a Bachelor? 70

7.6.2 Legal Reasoning 71

7.6.3 Moral Reasoning 71

7.7 Questions 72

7.8 References 73

**8 Kernel Rules 74**

8.1 Motivation 74

8.2 A Variation on Nearest Neighbor Rules 75

8.3 Kernel Rules 76

8.4 Universal Consistency of Kernel Rules 79

8.5 Potential Functions 80

8.6 More General Kernels 81

8.7 Summary 82

8.8 Appendix: Kernels, Similarity, and Features 82

8.9 Questions 83

8.10 References 84

**9 Neural Networks: Perceptrons 86**

9.1 Multilayer Feedforward Networks 86

9.2 Neural Networks for Learning and Classification 87

9.3 Perceptrons 89

9.3.1 Threshold 90

9.4 Learning Rule for Perceptrons 90

9.5 Representational Capabilities of Perceptrons 92

9.6 Summary 94

9.7 Appendix: Models of Mind 95

9.8 Questions 96

9.9 References 97

**10 Multilayer Networks 99**

10.1 Representation Capabilities of Multilayer Networks 99

10.2 Learning and Sigmoidal Outputs 101

10.3 Training Error and Weight Space 104

10.4 Error Minimization by Gradient Descent 105

10.5 Backpropagation 106

10.6 Derivation of Backpropagation Equations 109

10.6.1 Derivation for a Single Unit 110

10.6.2 Derivation for a Network 111

10.7 Summary 113

10.8 Appendix: Gradient Descent and Reasoning toward Reflective Equilibrium 113

10.9 Questions 114

10.10 References 115

**11 PAC Learning 116**

11.1 Class of Decision Rules 117

11.2 Best Rule from a Class 118

11.3 Probably Approximately Correct Criterion 119

11.4 PAC Learning 120

11.5 Summary 122

11.6 Appendix: Identifying Indiscernibles 122

11.7 Questions 123

11.8 References 123

**12 VC Dimension 125**

12.1 Approximation and Estimation Errors 125

12.2 Shattering 126

12.3 VC Dimension 127

12.4 Learning Result 128

12.5 Some Examples 129

12.6 Application to Neural Nets 132

12.7 Summary 133

12.8 Appendix: VC Dimension and Popper Dimension 133

12.9 Questions 134

12.10 References 135

**13 Infinite VC Dimension 137**

13.1 A Hierarchy of Classes and Modified PAC Criterion 138

13.2 Misfit Versus Complexity Trade-Off 138

13.3 Learning Results 139

13.4 Inductive Bias and Simplicity 140

13.5 Summary 141

13.6 Appendix: Uniform Convergence and Universal Consistency 141

13.7 Questions 142

13.8 References 143

**14 The Function Estimation Problem 144**

14.1 Estimation 144

14.2 Success Criterion 145

14.3 Best Estimator: Regression Function 146

14.4 Learning in Function Estimation 146

14.5 Summary 147

14.6 Appendix: Regression Toward the Mean 147

14.7 Questions 148

14.8 References 149

**15 Learning Function Estimation 150**

15.1 Review of the Function Estimation/Regression Problem 150

15.2 Nearest Neighbor Rules 151

15.3 Kernel Methods 151

15.4 Neural Network Learning 152

15.5 Estimation with a Fixed Class of Functions 153

15.6 Shattering, Pseudo-Dimension, and Learning 154

15.7 Conclusion 156

15.8 Appendix: Accuracy, Precision, Bias, and Variance in Estimation 156

15.9 Questions 157

15.10 References 158

**16 Simplicity 160**

16.1 Simplicity in Science 160

16.1.1 Explicit Appeals to Simplicity 160

16.1.2 Is the World Simple? 161

16.1.3 Mistaken Appeals to Simplicity 161

16.1.4 Implicit Appeals to Simplicity 161

16.2 Ordering Hypotheses 162

16.2.1 Two Kinds of Simplicity Orderings 162

16.3 Two Examples 163

16.3.1 Curve Fitting 163

16.3.2 Enumerative Induction 164

16.4 Simplicity as Simplicity of Representation 165

16.4.1 Fix on a Particular System of Representation? 166

16.4.2 Are Fewer Parameters Simpler? 167

16.5 Pragmatic Theory of Simplicity 167

16.6 Simplicity and Global Indeterminacy 168

16.7 Summary 169

16.8 Appendix: Basic Science and Statistical Learning Theory 169

16.9 Questions 170

16.10 References 170

**17 Support Vector Machines 172**

17.1 Mapping the Feature Vectors 173

17.2 Maximizing the Margin 175

17.3 Optimization and Support Vectors 177

17.4 Implementation and Connection to Kernel Methods 179

17.5 Details of the Optimization Problem 180

17.5.1 Rewriting Separation Conditions 180

17.5.2 Equation for Margin 181

17.5.3 Slack Variables for Nonseparable Examples 181

17.5.4 Reformulation and Solution of Optimization 182

17.6 Summary 183

17.7 Appendix: Computation 184

17.8 Questions 185

17.9 References 186

**18 Boosting 187**

18.1 Weak Learning Rules 187

18.2 Combining Classifiers 188

18.3 Distribution on the Training Examples 189

18.4 The AdaBoost Algorithm 190

18.5 Performance on Training Data 191

18.6 Generalization Performance 192

18.7 Summary 194

18.8 Appendix: Ensemble Methods 194

18.9 Questions 195

18.10 References 196

Bibliography 197

Author Index 203

Subject Index 207