Wiley.com

A Data Scientist's Guide to Acquiring, Cleaning, and Managing Data in R

ISBN: 978-1-119-08002-2
312 pages
December 2017

Description

The only how-to guide offering a unified, systematic approach to acquiring, cleaning, and managing data in R

Every experienced practitioner knows that preparing data for modeling is a painstaking, time-consuming process. Adding to the difficulty is that most modelers learn the steps involved in cleaning and managing data piecemeal, often on the fly, or they develop their own ad hoc methods. This book helps simplify their task by providing a unified, systematic approach to acquiring, modeling, manipulating, cleaning, and maintaining data in R. 

Starting with the very basics, data scientists Samuel E. Buttrey and Lyn R. Whitaker walk readers through the entire process. From what data looks like and what it should look like, they progress through all the steps involved in getting data ready for modeling.  They describe best practices for acquiring data from numerous sources; explore key issues in data handling, including text/regular expressions, big data, parallel processing, merging, matching, and checking for duplicates; and outline highly efficient and reliable techniques for documenting data and recordkeeping, including audit trails, getting data back out of R, and more.

  • The only single-source guide to R data and its preparation, it describes best practices for acquiring, manipulating, cleaning, and maintaining data
  • Begins with the basics and walks readers through all the steps necessary to get data ready for the modeling process
  • Provides expert guidance on how to document the processes described so that they are reproducible
  • Written by seasoned professionals, it provides both introductory and advanced techniques
  • Features case studies with supporting data and R code, hosted on a companion website

A Data Scientist's Guide to Acquiring, Cleaning, and Managing Data in R is a valuable working resource/bench manual for practitioners who collect and analyze data, lab scientists and research associates of all levels of experience, and graduate-level data mining students.

Table of Contents

About the Authors xv

Preface xvii

Acknowledgments xix

About the Companion Website xxi

1 R 1

1.1 Introduction 1

1.1.1 What Is R? 1

1.1.2 Who Uses R and Why? 2

1.1.3 Acquiring and Installing R 2

1.1.4 Starting and Quitting R 3

1.2 Data 3

1.2.1 Acquiring Data 3

1.2.2 Cleaning Data 4

1.2.3 The Goal of Data Cleaning 4

1.2.4 Making Your Work Reproducible 5

1.3 The Very Basics of R 5

1.3.1 Top Ten Quick Facts You Need to Know about R 5

1.3.2 Vocabulary 8

1.3.3 Calculating and Printing in R 11

1.4 Running an R Session 12

1.4.1 Where Your Data Is Stored 13

1.4.2 Options 13

1.4.3 Scripts 14

1.4.4 R Packages 14

1.4.5 RStudio and Other GUIs 15

1.4.6 Locales and Character Sets 15

1.5 Getting Help 16

1.5.1 At the Command Line 16

1.5.2 The Online Manuals 16

1.5.3 On the Internet 17

1.5.4 Further Reading 17

1.6 How to Use This Book 17

1.6.1 Syntax and Conventions in This Book 17

1.6.2 The Chapters 18

2 R Data, Part 1: Vectors 21

2.1 Vectors 21

2.1.1 Creating Vectors 21

2.1.2 Sequences 22

2.1.3 Logical Vectors 23

2.1.4 Vector Operations 24

2.1.5 Names 27

2.2 Data Types 27

2.2.1 Some Less-Common Data Types 28

2.2.2 What Type of Vector Is This? 28

2.2.3 Converting from One Type to Another 29

2.3 Subsets of Vectors 31

2.3.1 Extracting 31

2.3.2 Vectors of Length 0 34

2.3.3 Assigning or Replacing Elements of a Vector 35

2.4 Missing Data (NA) and Other Special Values 36

2.4.1 The Effect of NAs in Expressions 37

2.4.2 Identifying and Removing or Replacing NAs 37

2.4.3 Indexing with NAs 39

2.4.4 NaN and Inf Values 40

2.4.5 NULL Values 40

2.5 The table() Function 40

2.5.1 Two- and Higher-Way Tables 42

2.5.2 Operating on Elements of a Table 42

2.6 Other Actions on Vectors 45

2.6.1 Rounding 45

2.6.2 Sorting and Ordering 45

2.6.3 Vectors as Sets 46

2.6.4 Identifying Duplicates and Matching 47

2.6.5 Finding Runs of Duplicate Values 49

2.7 Long Vectors and Big Data 50

2.8 Chapter Summary and Critical Data Handling Tools 50

3 R Data, Part 2: More Complicated Structures 53

3.1 Introduction 53

3.2 Matrices 53

3.2.1 Extracting and Assigning 54

3.2.2 Row and Column Names 56

3.2.3 Applying a Function to Rows or Columns 57

3.2.4 Missing Values in Matrices 59

3.2.5 Using a Matrix Subscript 60

3.2.6 Sparse Matrices 61

3.2.7 Three- and Higher-Way Arrays 62

3.3 Lists 62

3.3.1 Extracting and Assigning 64

3.3.2 Lists in Practice 65

3.4 Data Frames 67

3.4.1 Missing Values in Data Frames 69

3.4.2 Extracting and Assigning in Data Frames 69

3.4.3 Extracting Things That Aren’t There 72

3.5 Operating on Lists and Data Frames 74

3.5.1 Split, Apply, Combine 75

3.5.2 All-Numeric Data Frames 77

3.5.3 Convenience Functions 78

3.5.4 Re-Ordering, De-Duplicating, and Sampling from Data Frames 79

3.6 Date and Time Objects 80

3.6.1 Formatting Dates 80

3.6.2 Common Operations on Date Objects 82

3.6.3 Differences between Dates 83

3.6.4 Dates and Times 83

3.6.5 Creating POSIXt Objects 85

3.6.6 Mathematical Functions for Date and Times 86

3.6.7 Missing Values in Dates 88

3.6.8 Using Apply Functions with Dates and Times 89

3.7 Other Actions on Data Frames 90

3.7.1 Combining by Rows or Columns 90

3.7.2 Merging Data Frames 91

3.7.3 Comparing Two Data Frames 94

3.7.4 Viewing and Editing Data Frames Interactively 94

3.8 Handling Big Data 94

3.9 Chapter Summary and Critical Data Handling Tools 96

4 R Data, Part 3: Text and Factors 99

4.1 Character Data 100

4.1.1 The length() and nchar() Functions 100

4.1.2 Tab, New-Line, Quote, and Backslash Characters 100

4.1.3 The Empty String 101

4.1.4 Substrings 102

4.1.5 Changing Case and Other Substitutions 103

4.2 Converting Numbers into Text 103

4.2.2 Scientific Notation 106

4.2.3 Discretizing a Numeric Variable 107

4.3 Constructing Character Strings: Paste in Action 109

4.3.1 Constructing Column Names 109

4.3.2 Tabulating Dates by Year and Month or Quarter Labels 111

4.3.3 Constructing Unique Keys 112

4.3.4 Constructing File and Path Names 112

4.4 Regular Expressions 112

4.4.1 Types of Regular Expressions 113

4.4.2 Tools for Regular Expressions in R 113

4.4.3 Special Characters in Regular Expressions 114

4.4.4 Examples 114

4.4.5 The regexpr() Function and Its Variants 121

4.4.6 Using Regular Expressions in Replacement 123

4.4.7 Splitting Strings at Regular Expressions 124

4.4.8 Regular Expressions versus Wildcard Matching 125

4.4.9 Common Data Cleaning Tasks Using Regular Expressions 126

4.4.10 Documenting and Debugging Regular Expressions 127

4.5 UTF-8 and Other Non-ASCII Characters 128

4.5.1 Extended ASCII for Latin Alphabets 128

4.5.2 Non-Latin Alphabets 129

4.5.3 Character and String Encoding in R 130

4.6 Factors 131

4.6.1 What Is a Factor? 131

4.6.2 Factor Levels 132

4.6.3 Converting and Combining Factors 134

4.6.4 Missing Values in Factors 136

4.6.5 Factors in Data Frames 137

4.7 R Object Names and Commands as Text 137

4.7.1 R Object Names as Text 137

4.7.2 R Commands as Text 138

4.8 Chapter Summary and Critical Data Handling Tools 140

5 Writing Functions and Scripts 143

5.1 Functions 143

5.1.1 Function Arguments 144

5.1.2 Global versus Local Variables 148

5.1.3 Return Values 149

5.1.4 Creating and Editing Functions 151

5.2 Scripts and Shell Scripts 153

5.2.1 Line-by-Line Parsing 155

5.3 Error Handling and Debugging 156

5.3.1 Debugging Functions 156

5.3.2 Issuing Error and Warning Messages 158

5.3.3 Catching and Processing Errors 159

5.4 Interacting with the Operating System 161

5.4.1 File and Directory Handling 162

5.4.2 Environment Variables 162

5.5 Speeding Things Up 163

5.5.1 Profiling 163

5.5.2 Vectorizing Functions 164

5.5.3 Other Techniques to Speed Things Up 165

5.6 Chapter Summary and Critical Data Handling Tools 167

5.6.1 Programming Style 168

5.6.2 Common Bugs 169

5.6.3 Objects, Classes, and Methods 170

6 Getting Data into and out of R 171

6.1 Reading Tabular ASCII Data into Data Frames 171

6.1.1 Files with Delimiters 172

6.1.2 Column Classes 173

6.1.3 Common Pitfalls in Reading Tables 175

6.1.4 An Example of When read.table() Fails 177

6.1.5 Other Uses of the scan() Function 181

6.1.6 Writing Delimited Files 182

6.1.7 Reading and Writing Fixed-Width Files 183

6.1.8 A Note on End-of-Line Characters 183

6.2 Reading Large, Non-Tabular, or Non-ASCII Data 184

6.2.1 Opening and Closing Files 184

6.2.2 Reading and Writing Lines 185

6.2.3 Reading and Writing UTF-8 and Other Encodings 187

6.2.4 The Null Character 187

6.2.5 Binary Data 188

6.2.6 Reading Problem Files in Action 190

6.3 Reading Data From Relational Databases 192

6.3.1 Connecting to the Database Server 193

6.3.2 Introduction to SQL 194

6.4 Handling Large Numbers of Input Files 197

6.5 Other Formats 200

6.5.1 Using the Clipboard 200

6.5.2 Reading Data from Spreadsheets 201

6.5.3 Reading Data from the Web 203

6.5.4 Reading Data from Other Statistical Packages 208

6.6 Reading and Writing R Data Directly 209

6.7 Chapter Summary and Critical Data Handling Tools 210

7 Data Handling in Practice 213

7.1 Acquiring and Reading Data 213

7.2 Cleaning Data 214

7.3 Combining Data 216

7.3.1 Combining by Row 216

7.3.2 Combining by Column 218

7.3.3 Merging by Key 218

7.4 Transactional Data 219

7.4.1 Example of Transactional Data 219

7.4.2 Combining Tabular and Transactional Data 221

7.5 Preparing Data 225

7.6 Documentation and Reproducibility 226

7.7 The Role of Judgment 228

7.8 Data Cleaning in Action 230

7.8.1 Reading and Cleaning BedBath1.csv 231

7.8.2 Reading and Cleaning BedBath2.csv 236

7.8.3 Combining the BedBath Data Frames 238

7.8.4 Reading and Cleaning EnergyUsage.csv 239

7.8.5 Merging the BedBath and EnergyUsage Data Frames 242

7.9 Chapter Summary and Critical Data Handling Tools 245

8 Extended Exercise 247

8.1 Introduction to the Problem 247

8.1.1 The Goal 248

8.1.2 Modeling Considerations 249

8.1.3 Examples of Things to Check 249

8.2 The Data 250

8.3 Five Important Fields 252

8.4 Loan and Application Portfolios 252

8.4.1 Layout of the Beachside Lenders Data 253

8.4.2 Layout of the Wilson and Sons Data 254

8.4.3 Combining the Two Portfolios 254

8.5 Scores 256

8.5.1 Scores Layout 256

8.6 Co-borrower Scores 257

8.6.1 Co-borrower Score Examples 258

8.7 Updated KScores 259

8.7.1 Updated KScores Layout 259

8.8 Loans to Be Excluded 260

8.8.1 Sample Exclusion File 260

8.9 Response Variable 260

8.10 Assembling the Final Data Sets 262

8.10.1 Final Data Layout 262

8.10.2 Concluding Remarks 263

A Hints and Pseudocode 265

A.1 Loan Portfolios 265

A.1.1 Things to Check 266

A.2 Scores Database 267

A.2.1 Things to Check 268

A.3 Co-borrower Scores 269

A.3.1 Things to Check 270

A.4 Updated KScores 271

A.4.1 Things to Check 272

A.5 Excluder Files 272

A.5.1 Things to Check 272

A.6 Payment Matrix 273

A.6.1 Things to Check 274

A.7 Starting the Modeling Process 275

Bibliography 277

Index 279

Author Information

Samuel E. Buttrey, PhD, is an Associate Professor of Operations Research at the Naval Postgraduate School, Monterey, California, USA.

Lyn R. Whitaker, PhD, is an Associate Professor of Operations Research at the Naval Postgraduate School, Monterey, California, USA.
