Ebook
Practical Text Mining with Perl
ISBN: 9781118210505
296 pages
September 2011

This book is devoted to the fundamentals of text mining using Perl, an open-source programming tool that is freely available via the Internet (www.perl.org). It covers mining ideas from several perspectives (statistics, data mining, linguistics, and information retrieval) and provides readers with the means to successfully complete text mining tasks on their own.
The book begins with an introduction to regular expressions, a text pattern methodology, and quantitative text summaries, all of which are fundamental tools for analyzing text. Then, it builds upon this foundation to explore:
 Probability and texts, including the bag-of-words model
 Information retrieval techniques such as the TF-IDF similarity measure
 Concordance lines and corpus linguistics
 Multivariate techniques such as correlation, principal components analysis, and clustering
 Perl modules, analyzing German-language text, and permutation tests
Each chapter is devoted to a single key topic, and the author carefully and thoughtfully introduces mathematical concepts as they arise, allowing readers to learn as they go without having to refer to additional books. The inclusion of numerous exercises and worked-out examples further complements the book's student-friendly format.
Practical Text Mining with Perl is ideal as a textbook for undergraduate and graduate courses in text mining and as a reference for a variety of professionals who are interested in extracting information from text documents.
List of Tables.
Preface.
Acknowledgments.
1. Introduction.
1.1 Overview of this Book.
1.2 Text Mining and Related Fields.
1.3 Advice for Reading this Book.
2. Text Patterns.
2.1 Introduction.
2.2 Regular Expressions.
2.3 Finding Words in a Text.
2.4 Decomposing Poe's "The Tell-Tale Heart" into Words.
2.5 A Simple Concordance.
2.6 First Attempt at Extracting Sentences.
2.7 Regex Odds and Ends.
2.8 References.
3. Quantitative Text Summaries.
3.1 Introduction.
3.2 Scalars, Interpolation, and Context in Perl.
3.3 Arrays and Context in Perl.
3.4 Word Lengths in Poe's "The Tell-Tale Heart".
3.5 Arrays and Functions.
3.6 Hashes.
3.7 Two Text Applications.
3.8 Complex Data Structures.
3.9 References.
3.10 First Transition.
4. Probability and Text Sampling.
4.1 Introduction.
4.2 Probability.
4.3 Conditional Probability.
4.4 Mean and Variance of Random Variables.
4.5 The Bag-of-Words Model for Poe's "The Black Cat".
4.6 The Effect of Sample Size.
4.7 References.
5. Applying Information Retrieval to Text Mining.
5.1 Introduction.
5.2 Counting Letters and Words.
5.3 Text Counts and Vectors.
5.4 The Term-Document Matrix Applied to Poe.
5.5 Matrix Multiplication.
5.6 Functions of Counts.
5.7 Document Similarity.
5.8 References.
6. Concordance Lines and Corpus Linguistics.
6.1 Introduction.
6.2 Sampling.
6.3 Corpus as Baseline.
6.4 Concordancing.
6.5 Collocations and Concordance Lines.
6.6 Applications with References.
6.7 Second Transition.
7. Multivariate Techniques with Text.
7.1 Introduction.
7.2 Basic Statistics.
7.3 Basic Linear Algebra.
7.4 Principal Component Matrices.
7.5 Text Applications.
7.6 Applications and References.
8. Text Clustering.
8.1 Introduction.
8.2 Clustering.
8.3 A Note on Classification.
8.4 References.
8.5 Last Transition.
9. A Sample of Additional Topics.
9.1 Introduction.
9.2 Perl Modules.
9.3 Other Languages: Analyzing Goethe in German.
9.4 Permutation Tests.
9.5 References.
Appendix A. Overview of Perl for Text Mining.
A.1 Basic Data Structures.
A.2 Operators.
A.3 Branching and Looping.
A.4 A Few Functions.
A.5 Introduction to Regular Expressions.
Appendix B. Summary of R used in this Book.
B.1 Basics of R.
B.2 This Book's R Code.
References.
Index.