Ebook
An Elementary Introduction to Statistical Learning Theory
ISBN: 9781118023464
288 pages
June 2011

Description
A joint endeavor from leading researchers in the fields of philosophy and electrical engineering, An Elementary Introduction to Statistical Learning Theory is a comprehensive and accessible primer on the rapidly evolving fields of statistical pattern recognition and statistical learning theory. Explaining these areas at a level and in a way that is not often found in other books on the topic, the authors present the basic theory behind contemporary machine learning and uniquely utilize its foundations as a framework for philosophical thinking about inductive inference.
Promoting the fundamental goal of statistical learning, knowing what is achievable and what is not, this book demonstrates the value of a systematic methodology when used along with the needed techniques for evaluating the performance of a learning system. First, an introduction to machine learning is presented that includes brief discussions of applications such as image recognition, speech recognition, medical diagnostics, and statistical arbitrage. To enhance accessibility, two chapters on relevant aspects of probability theory are provided. Subsequent chapters feature coverage of topics such as the pattern recognition problem, optimal Bayes decision rule, the nearest neighbor rule, kernel rules, neural networks, support vector machines, and boosting.
Appendices throughout the book explore the relationship between the discussed material and related topics from mathematics, philosophy, psychology, and statistics, drawing insightful connections between problems in these areas and statistical learning theory. All chapters conclude with a summary section, a set of practice questions, and a reference section that supplies historical notes and additional resources for further study.
An Elementary Introduction to Statistical Learning Theory is an excellent book for courses on statistical learning theory, pattern recognition, and machine learning at the upper-undergraduate and graduate levels. It also serves as an introductory reference for researchers and practitioners in the fields of engineering, computer science, philosophy, and cognitive science who would like to further their knowledge of the topic.
Table of Contents
Preface xiii
1 Introduction: Classification, Learning, Features, and Applications 1
1.1 Scope 1
1.2 Why Machine Learning? 2
1.3 Some Applications 3
1.3.1 Image Recognition 3
1.3.2 Speech Recognition 3
1.3.3 Medical Diagnosis 4
1.3.4 Statistical Arbitrage 4
1.4 Measurements, Features, and Feature Vectors 4
1.5 The Need for Probability 5
1.6 Supervised Learning 5
1.7 Summary 6
1.8 Appendix: Induction 6
1.9 Questions 7
1.10 References 8
2 Probability 10
2.1 Probability of Some Basic Events 10
2.2 Probabilities of Compound Events 12
2.3 Conditional Probability 13
2.4 Drawing Without Replacement 14
2.5 A Classic Birthday Problem 15
2.6 Random Variables 15
2.7 Expected Value 16
2.8 Variance 17
2.9 Summary 19
2.10 Appendix: Interpretations of Probability 19
2.11 Questions 20
2.12 References 21
3 Probability Densities 23
3.1 An Example in Two Dimensions 23
3.2 Random Numbers in [0,1] 23
3.3 Density Functions 24
3.4 Probability Densities in Higher Dimensions 27
3.5 Joint and Conditional Densities 27
3.6 Expected Value and Variance 28
3.7 Laws of Large Numbers 29
3.8 Summary 30
3.9 Appendix: Measurability 30
3.10 Questions 32
3.11 References 32
4 The Pattern Recognition Problem 34
4.1 A Simple Example 34
4.2 Decision Rules 35
4.3 Success Criterion 37
4.4 The Best Classifier: Bayes Decision Rule 37
4.5 Continuous Features and Densities 38
4.6 Summary 39
4.7 Appendix: Uncountably Many 39
4.8 Questions 40
4.9 References 41
5 The Optimal Bayes Decision Rule 43
5.1 Bayes Theorem 43
5.2 Bayes Decision Rule 44
5.3 Optimality and Some Comments 45
5.4 An Example 47
5.5 Bayes Theorem and Decision Rule with Densities 48
5.6 Summary 49
5.7 Appendix: Defining Conditional Probability 50
5.8 Questions 50
5.9 References 53
6 Learning from Examples 55
6.1 Lack of Knowledge of Distributions 55
6.2 Training Data 56
6.3 Assumptions on the Training Data 57
6.4 A Brute Force Approach to Learning 59
6.5 Curse of Dimensionality, Inductive Bias, and No Free Lunch 60
6.6 Summary 61
6.7 Appendix: What Sort of Learning? 62
6.8 Questions 63
6.9 References 64
7 The Nearest Neighbor Rule 65
7.1 The Nearest Neighbor Rule 65
7.2 Performance of the Nearest Neighbor Rule 66
7.3 Intuition and Proof Sketch of Performance 67
7.4 Using More Neighbors 69
7.5 Summary 70
7.6 Appendix: When People Use Nearest Neighbor Reasoning 70
7.6.1 Who Is a Bachelor? 70
7.6.2 Legal Reasoning 71
7.6.3 Moral Reasoning 71
7.7 Questions 72
7.8 References 73
8 Kernel Rules 74
8.1 Motivation 74
8.2 A Variation on Nearest Neighbor Rules 75
8.3 Kernel Rules 76
8.4 Universal Consistency of Kernel Rules 79
8.5 Potential Functions 80
8.6 More General Kernels 81
8.7 Summary 82
8.8 Appendix: Kernels, Similarity, and Features 82
8.9 Questions 83
8.10 References 84
9 Neural Networks: Perceptrons 86
9.1 Multilayer Feedforward Networks 86
9.2 Neural Networks for Learning and Classification 87
9.3 Perceptrons 89
9.3.1 Threshold 90
9.4 Learning Rule for Perceptrons 90
9.5 Representational Capabilities of Perceptrons 92
9.6 Summary 94
9.7 Appendix: Models of Mind 95
9.8 Questions 96
9.9 References 97
10 Multilayer Networks 99
10.1 Representation Capabilities of Multilayer Networks 99
10.2 Learning and Sigmoidal Outputs 101
10.3 Training Error and Weight Space 104
10.4 Error Minimization by Gradient Descent 105
10.5 Backpropagation 106
10.6 Derivation of Backpropagation Equations 109
10.6.1 Derivation for a Single Unit 110
10.6.2 Derivation for a Network 111
10.7 Summary 113
10.8 Appendix: Gradient Descent and Reasoning toward Reflective Equilibrium 113
10.9 Questions 114
10.10 References 115
11 PAC Learning 116
11.1 Class of Decision Rules 117
11.2 Best Rule from a Class 118
11.3 Probably Approximately Correct Criterion 119
11.4 PAC Learning 120
11.5 Summary 122
11.6 Appendix: Identifying Indiscernibles 122
11.7 Questions 123
11.8 References 123
12 VC Dimension 125
12.1 Approximation and Estimation Errors 125
12.2 Shattering 126
12.3 VC Dimension 127
12.4 Learning Result 128
12.5 Some Examples 129
12.6 Application to Neural Nets 132
12.7 Summary 133
12.8 Appendix: VC Dimension and Popper Dimension 133
12.9 Questions 134
12.10 References 135
13 Infinite VC Dimension 137
13.1 A Hierarchy of Classes and Modified PAC Criterion 138
13.2 Misfit Versus Complexity Trade-off 138
13.3 Learning Results 139
13.4 Inductive Bias and Simplicity 140
13.5 Summary 141
13.6 Appendix: Uniform Convergence and Universal Consistency 141
13.7 Questions 142
13.8 References 143
14 The Function Estimation Problem 144
14.1 Estimation 144
14.2 Success Criterion 145
14.3 Best Estimator: Regression Function 146
14.4 Learning in Function Estimation 146
14.5 Summary 147
14.6 Appendix: Regression Toward the Mean 147
14.7 Questions 148
14.8 References 149
15 Learning Function Estimation 150
15.1 Review of the Function Estimation/Regression Problem 150
15.2 Nearest Neighbor Rules 151
15.3 Kernel Methods 151
15.4 Neural Network Learning 152
15.5 Estimation with a Fixed Class of Functions 153
15.6 Shattering, Pseudo-Dimension, and Learning 154
15.7 Conclusion 156
15.8 Appendix: Accuracy, Precision, Bias, and Variance in Estimation 156
15.9 Questions 157
15.10 References 158
16 Simplicity 160
16.1 Simplicity in Science 160
16.1.1 Explicit Appeals to Simplicity 160
16.1.2 Is the World Simple? 161
16.1.3 Mistaken Appeals to Simplicity 161
16.1.4 Implicit Appeals to Simplicity 161
16.2 Ordering Hypotheses 162
16.2.1 Two Kinds of Simplicity Orderings 162
16.3 Two Examples 163
16.3.1 Curve Fitting 163
16.3.2 Enumerative Induction 164
16.4 Simplicity as Simplicity of Representation 165
16.4.1 Fix on a Particular System of Representation? 166
16.4.2 Are Fewer Parameters Simpler? 167
16.5 Pragmatic Theory of Simplicity 167
16.6 Simplicity and Global Indeterminacy 168
16.7 Summary 169
16.8 Appendix: Basic Science and Statistical Learning Theory 169
16.9 Questions 170
16.10 References 170
17 Support Vector Machines 172
17.1 Mapping the Feature Vectors 173
17.2 Maximizing the Margin 175
17.3 Optimization and Support Vectors 177
17.4 Implementation and Connection to Kernel Methods 179
17.5 Details of the Optimization Problem 180
17.5.1 Rewriting Separation Conditions 180
17.5.2 Equation for Margin 181
17.5.3 Slack Variables for Nonseparable Examples 181
17.5.4 Reformulation and Solution of Optimization 182
17.6 Summary 183
17.7 Appendix: Computation 184
17.8 Questions 185
17.9 References 186
18 Boosting 187
18.1 Weak Learning Rules 187
18.2 Combining Classifiers 188
18.3 Distribution on the Training Examples 189
18.4 The Adaboost Algorithm 190
18.5 Performance on Training Data 191
18.6 Generalization Performance 192
18.7 Summary 194
18.8 Appendix: Ensemble Methods 194
18.9 Questions 195
18.10 References 196
Bibliography 197
Author Index 203
Subject Index 207
Author Information
SANJEEV KULKARNI, PhD, is Professor in the Department of Electrical Engineering at Princeton University, where he is also an affiliated faculty member in the Department of Operations Research and Financial Engineering and the Department of Philosophy. Dr. Kulkarni has published widely on statistical pattern recognition, nonparametric estimation, machine learning, information theory, and other areas. A Fellow of the IEEE, he was awarded Princeton University's President's Award for Distinguished Teaching in 2007.
GILBERT HARMAN, PhD, is James S. McDonnell Distinguished University Professor in the Department of Philosophy at Princeton University. A Fellow of the Cognitive Science Society, he is the author of more than fifty published articles in his areas of research interest, which include ethics, statistical learning theory, psychology of reasoning, and logic.
Reviews
“The main focus of the book is on the ideas behind basic principles of learning theory and I can strongly recommend the book to anyone who wants to comprehend these ideas.” (Mathematical Reviews, 1 January 2013)
“It also serves as an introductory reference for researchers and practitioners in the fields of engineering, computer science, philosophy, and cognitive science that would like to further their knowledge of the topic.” (Zentralblatt MATH, 2012)