WILEY

Publishers since 1807

Practical Text Mining with Perl
ISBN: 978-0-470-17643-6
Hardcover
320 pages
August 2008
US $89.95

Table of Contents

Preface

Acknowledgments

1. Introduction

1.1 Overview of this Book

1.2 Text Mining and Related Fields

1.2.1 Chapter 2: Pattern Matching

1.2.2 Chapter 3: Data Structures

1.2.3 Chapter 4: Probability

1.2.4 Chapter 5: Information Retrieval

1.2.5 Chapter 6: Corpus Linguistics

1.2.6 Chapter 7: Multivariate Statistics

1.2.7 Chapter 8: Clustering

1.2.8 Chapter 9: Three Additional Topics

1.3 Advice for Reading this Book

2. Text Patterns

2.1 Introduction

2.2 Regular Expressions

2.2.1 First Regex: Finding the Word "Cat"

2.2.2 Character Ranges and Finding Telephone Numbers

2.2.3 Testing Regexes with Perl

2.3 Finding Words in a Text

2.3.1 Regex Summary

2.3.2 Nineteenth Century Literature

2.3.3 Perl Variables and the Function split

2.3.4 Match Variables

2.4 Decomposing Poe’s "The Tell-Tale Heart" into Words

2.4.1 Dashes and String Substitutions

2.4.2 Hyphens

2.4.3 Apostrophes

2.5 A Simple Concordance

2.5.1 Command Line Arguments

2.5.2 Writing to Files

2.6 First Attempt at Extracting Sentences

2.6.1 Sentence Segmentation Preliminaries

2.6.2 Sentence Segmentation for "A Christmas Carol"

2.6.3 Leftmost Greediness and Sentence Segmentation

2.7 Regex Odds and Ends

2.7.1 Match Variables and Backreferences

2.7.2 Regular Expression Operators and Their Output

2.7.3 Lookaround

2.8 References

Problems

3. Quantitative Text Summaries

3.1 Introduction

3.2 Scalars, Interpolation and Context in Perl

3.3 Arrays and Context in Perl

3.4 Word Length Application

3.5 Arrays and Functions

3.5.1 Adding and Removing Entries from Arrays

3.5.2 Selecting Subsets of an Array

3.5.3 Sorting an Array

3.6 Hashes

3.6.1 Using a Hash

3.7 Two Text Applications

3.7.1 Zipf’s Law

3.7.2 Perl for Word Games

3.7.2.1 An Aid to Crossword Puzzles

3.7.2.2 Word Anagrams

3.7.2.3 Finding Words in a Set of Letters

3.8 Complex Data Structures

3.8.1 References and Pointers

3.8.2 Arrays of Arrays and Beyond

3.8.3 Application: Comparing the Words in Two Poe Stories

3.9 References

3.10 First Transition

Problems

4. Probability and Texts

4.1 Introduction

4.2 Probability

4.2.1 Probability and Coin Flipping

4.2.2 Probabilities and Texts

4.2.2.1 Estimating Letter Probabilities

4.2.2.2 Estimating Letter Bigram Probabilities

4.3 Conditional Probability

4.3.1 Independence

4.4 Mean and Variance of Random Variables

4.4.1 Sampling and Error Estimates

4.5 The Bag-of-Words Model

4.6 The Effect of Sample Size

4.6.1 Tokens vs. Types

4.7 References

Problems

5. Applying Information Retrieval to Text Mining

5.1 Introduction

5.2 Text Counts and Vectors

5.2.1 Counting Words with Perl

5.2.2 Pronouns

5.3 Text Counts and Vectors

5.3.1 Vectors and Angles

5.3.2 Computing Angles between Vectors

5.3.2.1 Subroutines in Perl

5.3.2.2 Computing the Angle between Vectors

5.4 The Term-Document Matrix

5.5 Matrix Multiplication

5.5.1 A Text Application of Matrix Multiplication

5.6 Functions of Counts

5.7 Document Similarity

5.7.1 Inverse Document Frequency

5.7.2 Poe Story Angles Revisited

5.8 References

Problems

6. Concordance Lines and Corpus Linguistics

6.1 Introduction

6.2 Sampling

6.2.1 Statistical Survey Sampling

6.2.2 Text Sampling

6.3 Corpus as Baseline

6.3.1 Function vs. Content Words

6.4 Concordancing

6.4.1 Sorting Concordance Lines

6.4.1.1 Code for Sorting Concordance Lines

6.4.2 Application: Word Usage

6.4.3 Application: Word Morphology

6.5 Collocations and Concordance Lines

6.5.1 More Ways to Sort Concordance Lines

6.5.2 Application: Phrasal Verbs

6.5.3 Rare Events

6.6 Applications with References

6.7 Second Transition

Problems

7. Multivariate Techniques with Text

7.1 Introduction

7.2 Basic Statistics

7.2.1 Z-Scores

7.2.2 Z-Scores and Correlations

7.2.3 Correlations and Cosines

7.2.4 Correlations and Covariances

7.3 Basic Linear Algebra

7.3.1 2 by 2 Correlation Matrices

7.4 Principal Components Analysis

7.4.1 Finding the Principal Components

7.4.2 PCA and the 68 Poe Short Stories

7.4.3 Another PCA Example with Poe’s Short Stories

7.4.4 Rotations

7.5 Text Applications

7.5.1 A Word on Factor Analysis

7.6 Applications and References

Problems

8. Text Clustering

8.1 Introduction

8.2 Clustering

8.2.1 Two Variable Example of K-Means

8.2.2 K-Means with R

8.2.3 "He" versus "She" in Poe’s Short Stories

8.2.4 Poe Clusters using Eight Pronouns

8.2.5 Clustering Poe using Principal Components

8.2.6 Hierarchical Clustering of Poe’s Short Stories

8.3 A Note on Classification

8.3.1 Decision Trees and Over-fitting

8.4 References

8.5 Last Transition

Problems

9. Three Additional Topics

9.1 Introduction

9.2 Perl Modules

9.2.1 Modules for Number Words

9.2.2 The StopWords Module

9.2.3 The Sentence Segmentation Module

9.2.4 An Object Oriented Module for Tagging

9.2.5 Miscellaneous Modules

9.3 Other Languages: German

9.4 Permutation Tests

9.4.1 Runs and Hypothesis Testing

9.4.2 Distribution of Character Names

9.5 References

Appendix A: Overview of Perl for Text Mining

A.1 Basic Data Structures

A.1.1 Special Variables and Arrays

A.2 Operators

A.3 Branching and Looping

A.4 A Few Perl Functions

A.5 Introduction to Regular Expressions

Appendix B: Summary of R used in this Book

B.1 Basics of R

B.1.1 Data Entry

B.1.2 Basic Operators

B.1.3 Matrix Manipulation

B.2 This Book’s R Code

References