Skip to main content

Data Smart: Using Data Science to Transform Information into Insight




Data Smart: Using Data Science to Transform Information into Insight

John W. Foreman

ISBN: 978-1-118-66146-8 November 2013 432 Pages


The book provides nine tutorials on optimization, machine learning, data mining, and forecasting all within the confines of a spreadsheet. Each tutorial uses a real-world problem and the author guides the reader using query’s the reader might ask as how to craft a solution using the correct data science technique. Hosting these nine spreadsheets for download will be necessary so that the reader can work the problems along with the book.

Important topics covered by the book:

  • Linear and integer programming
  • K-nearest neighbors graphs and clustering
  • Logistic regression
  • Demand forecasting with seasonal adjustments
  • Price sensitivity, revenue optimization, and price-sensitive forecasting
  • Naïve Bayes classification
  • Outlier detection using graphs and Local Outlier Factors
  • Multi-criteria decision analysis

Introduction xiii

1 Everything You Ever Needed to Know about Spreadsheets but Were Too Afraid to Ask 1

Some Sample Data 2

Moving Quickly with the Control Button 2

Copying Formulas and Data Quickly 4

Formatting Cells 5

Paste Special Values 7

Inserting Charts 8

Locating the Find and Replace Menus 9

Formulas for Locating and Pulling Values 10

Using VLOOKUP to Merge Data 12

Filtering and Sorting 13

Using PivotTables 16

Using Array Formulas 19

Solving Stuff with Solver 20

OpenSolver: I Wish We Didn’t Need This, but We Do 26

Wrapping Up 27

2 Cluster Analysis Part I: Using K-Means to Segment Your Customer Base 29

Girls Dance with Girls, Boys Scratch Their Elbows 30

Getting Real: K-Means Clustering Subscribers in E-mail Marketing 35

Joey Bag O’ Donuts Wholesale Wine Emporium 36

The Initial Dataset 36

Determining What to Measure 38

Start with Four Clusters 41

Euclidean Distance: Measuring Distances as the Crow Flies 41

Distances and Cluster Assignments for Everybody! 44

Solving for the Cluster Centers 46

Making Sense of the Results 49

Getting the Top Deals by Cluster 50

The Silhouette: A Good Way to Let Different K Values Duke It Out 53

How about Five Clusters? 60

Solving for Five Clusters 60

Getting the Top Deals for All Five Clusters 61

Computing the Silhouette for 5-Means Clustering 64

K-Medians Clustering and Asymmetric Distance Measurements 66

Using K-Medians Clustering 66

Getting a More Appropriate Distance Metric 67

Putting It All in Excel 69

The Top Deals for the 5-Medians Clusters 70

Wrapping Up 75

3 Naïve Bayes and the Incredible Lightness of Being an Idiot 77

When You Name a Product Mandrill, You’re Going to Get Some Signal and Some Noise 77

The World’s Fastest Intro to Probability Theory 79

Totaling Conditional Probabilities 80

Joint Probability, the Chain Rule, and Independence 80

What Happens in a Dependent Situation? 81

Bayes Rule 82

Using Bayes Rule to Create an AI Model 83

High-Level Class Probabilities Are Often Assumed to Be Equal 84

A Couple More Odds and Ends 85

Let’s Get This Excel Party Started 87

Removing Extraneous Punctuation 87

Splitting on Spaces 88

Counting Tokens and Calculating Probabilities 92

And We Have a Model! Let’s Use It 94

Wrapping Up 98

4 Optimization Modeling: Because That “Fresh Squeezed” Orange Juice Ain’t Gonna Blend Itself 101

Why Should Data Scientists Know Optimization? 102

Starting with a Simple Trade-Off f 103

Representing the Problem as a Polytope 103

Solving by Sliding the Level Set 105

The Simplex Method: Rooting around the Corners 106

Working in Excel 108

There’s a Monster at the End of This Chapter 117

Fresh from the Grove to Your Glasswith a Pit Stop Through a Blending Model 118

You Use a Blending Model 119

Let’s Start with Some Specs 119

Coming Back to Consistency 121

Putting the Data into Excel 121

Setting Up the Problem in Solver 124

Lowering Your Standards 126

Dead Squirrel Removal: The Minimax Formulation 131

If-Then and the “Big M” Constraint 133

Multiplying Variables: Cranking Up the Volume to 11 137

Modeling Risk 144

Normally Distributed Data 145

Wrapping Up 154

5 Cluster Analysis Part II: Network Graphs and Community Detection 155

What Is a Network Graph? 156

Visualizing a Simple Graph 157

Brief Introduction to Gephi 159

Gephi Installation and File Preparation 160

Laying Out the Graph 162

Node Degree 165

Pretty Printing 166

Touching the Graph Data 168

Building a Graph from the Wholesale Wine Data 170

Creating a Cosine Similarity Matrix 172

Producing an r-Neighborhood Graph 174

How Much Is an Edge Worth? Points and Penalties in Graph Modularity 178

What’s a Point and What’s a Penalty? 179

Setting Up the Score Sheet 183

Let’s Get Clustering! 185

Split Number 1 185

Split 2: Electric Boogaloo 190

And…Split 3: Split with a Vengeance 192

Encoding and Analyzing the Communities 193

There and Back Again: A Gephi Tale 197

Wrapping Up 202

6 The Granddaddy of Supervised Artificial Intelligence—Regression 205

Wait, What? You’re Pregnant? 205

Don’t Kid Yourself 206

Predicting Pregnant Customers at RetailMart Using Linear Regression 207

The Feature Set 207

Assembling the Training Data 209

Creating Dummy Variables 210

Let’s Bake Our Own Linear Regression 213

Linear Regression Statistics: R-Squared, F Tests, t Tests 221

Making Predictions on Some New Data and Measuring Performance 230

Predicting Pregnant Customers at RetailMart Using Logistic Regression 239

First You Need a Link Function 240

Hooking Up the Logistic Function and Reoptimizing 241

Baking an Actual Logistic Regression 244

Model Selection—Comparing the Performance of the Linear and Logistic Regressions 245

For More Information 248

Wrapping Up 249

7 Ensemble Models: A Whole Lot of Bad Pizza 251

Using the Data from Chapter 6 252

Bagging: Randomize, Train, Repeat 254

Decision Stump Is an Unsexy Term for a Stupid Predictor 254

Doesn’t Seem So Stupid to Me! 255

You Need More Power! 257

Let’s Train It 258

Evaluating the Bagged Model 267

Boosting: If You Get It Wrong, Just Boost and Try Again 272

Training the Model—Every Feature Gets a Shot 272

Evaluating the Boosted Model 280

Wrapping Up 283

8 Forecasting: Breathe Easy; You Can’t Win 285

The Sword Trade Is Hopping 286

Getting Acquainted with Time Series Data 286

Starting Slow with Simple Exponential Smoothing 288

Setting Up the Simple Exponential Smoothing Forecast 290

You Might Have a Trend 296

Holt’s Trend-Corrected Exponential Smoothing 299

Setting Up Holt’s Trend-Corrected Smoothing in a Spreadsheet 300

So Are You Done? Looking at Autocorrelations 306

Multiplicative Holt-Winters Exponential Smoothing 313

Setting the Initial Values for Level, Trend, and Seasonality 315

Getting Rolling on the Forecast 319

And Optimize! 324

Please Tell Me We’re Done Now!!! 326

Putting a Prediction Interval around the Forecast 327

Creating a Fan Chart for Effect 331

Wrapping Up 333

9 Outlier Detection: Just Because They’re Odd Doesn’t Mean They’re Unimportant 335

Outliers Are (Bad?) People, Too 335

The Fascinating Case of Hadlum v Hadlum 336

Tukey Fences 337

Applying Tukey Fences in a Spreadsheet 338

The Limitations of This Simple Approach 340

Terrible at Nothing, Bad at Everything 341

Preparing Data for Graphing 342

Creating a Graph 345

Getting the k Nearest Neighbors 347

Graph Outlier Detection Method 1: Just Use the Indegree 348

Graph Outlier Detection Method 2: Getting Nuanced with k-Distance 351

Graph Outlier Detection Method 3: Local Outlier Factors Are Where It’s At 353

Wrapping Up 358

10 Moving from Spreadsheets into R 361

Getting Up and Running with R 362

Some Simple Hand-Jamming 363

Reading Data into R 370

Doing Some Actual Data Science 372

Spherical K-Means on Wine Data in Just a Few Lines 372

Building AI Models on the Pregnancy Data 378

Forecasting in R 385

Looking at Outlier Detection 389

Wrapping Up 394

Conclusion 395

Where Am I? What Just Happened? 395

Before You Go-Go 395

Get to Know the Problem 396

We Need More Translators 397

Beware the Three-Headed Geek-Monster: Tools, Performance, and Mathematical Perfection 397

You Are Not the Most Important Function of Your Organization 400

Get Creative and Keep in Touch! 400

Index 401

Read Me Text
Chapter 1
Chapter 2
Chapter 3
Updated on 1/28/14.
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Chapter 10