Description
Over the past decade, pattern recognition has been one of the fastest-growing areas in chemometrics. This growth has been catalysed by the increasing capability of automated instruments such as LCMS, GCMS and NMR to acquire large quantities of data, by a parallel and significant growth in applications, especially in biomedical analytical chemistry where extracts from humans and animals are measured, and by the increased power of desktop computing. Interpreting such multivariate datasets has required the development and application of new chemometric techniques such as pattern recognition, the focus of this work.
Included within the text are:
- ‘Real world’ pattern recognition case studies from a wide variety of sources including biology, medicine, materials, pharmaceuticals, food, forensics and environmental science;
- Discussions of methods, many of which are also common in biology, biological analytical chemistry and machine learning;
- Common tools such as Partial Least Squares and Principal Components Analysis, as well as those that are rarely used in chemometrics such as Self Organising Maps and Support Vector Machines;
- Representation in full colour;
- Validation of models and hypothesis testing, and the underlying motivation of the methods, including how to avoid some common pitfalls.
The book is relevant to active chemometricians and analytical scientists in industry, academia and government establishments, as well as to those involved in applying statistics and computational pattern recognition.
1.1 Past, Present and Future.
1.2 About this Book.
2 Case Studies.
2.2 Datasets, Matrices and Vectors.
2.3 Case Study 1: Forensic Analysis of Banknotes.
2.4 Case Study 2: Near Infrared Spectroscopic Analysis of Food.
2.5 Case Study 3: Thermal Analysis of Polymers.
2.6 Case Study 4: Environmental Pollution using Headspace Mass Spectrometry.
2.7 Case Study 5: Human Sweat Analysed by Gas Chromatography Mass Spectrometry.
2.8 Case Study 6: Liquid Chromatography Mass Spectrometry of Pharmaceutical Tablets.
2.9 Case Study 7: Atomic Spectroscopy for the Study of Hypertension.
2.10 Case Study 8: Metabolic Profiling of Mouse Urine by Gas Chromatography of Urine Extracts.
2.11 Case Study 9: Nuclear Magnetic Resonance Spectroscopy for Salival Analysis of the Effect of Mouthwash.
2.12 Case Study 10: Simulations.
2.13 Case Study 11: Null Dataset.
2.14 Case Study 12: GCMS and Microbiology of Mouse Scent Marks.
3 Exploratory Data Analysis.
3.2 Principal Components Analysis.
3.2.2 Scores and Loadings.
3.2.4 PCA Algorithm.
3.2.5 Graphical Representation.
3.3 Dissimilarity Indices, Principal Co-ordinates Analysis and Ranking.
3.3.2 Principal Co-ordinates Analysis.
3.4 Self Organizing Maps.
3.4.2 SOM Algorithm.
3.4.5 Map Quality.
4.2 Data Scaling.
4.2.1 Transforming Individual Elements.
4.2.2 Row Scaling.
4.2.3 Column Scaling.
4.3 Multivariate Methods of Data Reduction.
4.3.1 Largest Principal Components.
4.3.2 Discriminatory Principal Components.
4.3.3 Partial Least Squares Discriminatory Analysis Scores.
4.4 Strategies for Data Preprocessing.
4.4.1 Flow Charts.
4.4.2 Level 1.
4.4.3 Level 2.
4.4.4 Level 3.
4.4.5 Level 4.
5 Two Class Classifiers.
5.1.1 Two Class Classifiers.
5.1.4 Autoprediction and Class Boundaries.
5.2 Euclidean Distance to Centroids.
5.3 Linear Discriminant Analysis.
5.4 Quadratic Discriminant Analysis.
5.5 Partial Least Squares Discriminant Analysis.
5.5.1 PLS Method.
5.5.2 PLS Algorithm.
5.6 Learning Vector Quantization.
5.6.1 Voronoi Tesselation and Codebooks.
5.6.4 LVQ Illustration and Summary of Parameters.
5.7 Support Vector Machines.
5.7.1 Linear Learning Machines.
5.7.3 Controlling Complexity and Soft Margin SVMs.
5.7.4 SVM Parameters.
6 One Class Classifiers.
6.2 Distance Based Classifiers.
6.3 PC Based Models and SIMCA.
6.4 Indicators of Significance.
6.4.1 Gaussian Density Estimators and Chi-Squared.
6.4.2 Hotelling’s T2.
6.4.4 Q-Statistic or Squared Prediction Error.
6.4.5 Visualization of D- and Q-Statistics for Disjoint PC Models.
6.4.6 Multivariate Normality and What to do if it Fails.
6.5 Support Vector Data Description.
6.6 Summarizing One Class Classifiers.
6.6.1 Class Membership Plots.
6.6.2 ROC Curves.
7 Multiclass Classifiers.
7.2 EDC, LDA and QDA.
7.6 One against One Decisions.
8 Validation and Optimization.
8.2 Classification Abilities, Contingency Tables and Related Concepts.
8.2.1 Two Class Classifiers.
8.2.2 Multiclass Classifiers.
8.2.3 One Class Classifiers.
8.3.1 Testing Models.
8.3.2 Test and Training Sets.
8.3.4 Increasing the Number of Variables for the Classifier.
8.4 Iterative Approaches for Validation.
8.4.1 Predictive Ability, Model Stability, Classification by Majority Vote and Cross Classification Rate.
8.4.2 Number of Iterations.
8.4.3 Test and Training Set Boundaries.
8.5 Optimizing PLS Models.
8.5.1 Number of Components: Cross-Validation and Bootstrap.
8.5.2 Thresholds and ROC Curves.
8.6 Optimizing Learning Vector Quantization Models.
8.7 Optimizing Support Vector Machine Models.
9 Determining Potential Discriminatory Variables.
9.1.1 Two Class Distributions.
9.1.2 Multiclass Distributions.
9.1.3 Multilevel and Multiway Distributions.
9.1.4 Sample Sizes.
9.1.5 Modelling after Variable Reduction.
9.1.6 Preliminary Variable Reduction.
9.2 Which Variables are most Significant?
9.2.1 Basic Concepts: Statistical Indicators and Rank.
9.2.2 T-Statistic and Fisher Weights.
9.2.3 Multiple Linear Regression, ANOVA and the F-Ratio.
9.2.4 Partial Least Squares.
9.2.5 Relationship between the Indicator Functions.
9.3 How Many Variables are Significant?
9.3.1 Probabilistic Approaches.
9.3.2 Empirical Methods: Monte Carlo.
9.3.3 Cost/Benefit of Increasing the Number of Variables.
10 Bayesian Methods and Unequal Class Sizes.
10.2 Contingency Tables and Bayes’ Theorem.
10.3 Bayesian Extensions to Classifiers.
11 Class Separation Indices.
11.2 Davies Bouldin Index.
11.3 Silhouette Width and Modified Silhouette Width.
11.3.1 Silhouette Width.
11.3.2 Modified Silhouette Width.
11.4 Overlap Coefficient.
12 Comparing Different Patterns.
12.2 Correlation Based Methods.
12.2.1 Mantel Test.
12.2.2 RV Coefficient.
12.3 Consensus PCA.
12.4 Procrustes Analysis.