

Data Science For Dummies, 2nd Edition

Lillian Pierson, Jake Porway (Foreword by)

ISBN: 978-1-119-32763-9 | March 2017 | 384 Pages


Discover how data science can help you gain in-depth insight into your business - the easy way!

Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles. Data Science For Dummies is the perfect starting point for IT professionals and students who want a quick primer on all areas of the expansive data science space. With a focus on business cases, the book explores topics in big data, data science, and data engineering, and how these three areas are combined to produce tremendous value. If you want to pick up the skills you need to begin a new career or initiate a new project, reading this book will help you understand which technologies, programming languages, and mathematical methods to focus on.

While this book serves as a wildly fantastic guide through the broad, sometimes intimidating field of big data and data science, it is not an instruction manual for hands-on implementation. Here’s what to expect:

  • Provides a background in big data and data engineering before moving on to data science and how it's applied to generate value
  • Includes coverage of big data frameworks like Hadoop, MapReduce, Spark, MPP platforms, and NoSQL
  • Explains machine learning and many of its algorithms as well as artificial intelligence and the evolution of the Internet of Things
  • Details data visualization techniques that can be used to showcase, summarize, and communicate the data insights you generate

It's a big, big data world out there—let Data Science For Dummies help you harness its power and gain a competitive edge for your organization.

Foreword xv

Introduction 1

About This Book 2

Foolish Assumptions 2

Icons Used in This Book 3

Beyond the Book 3

Where to Go from Here 4

Part 1: Getting Started with Data Science 5

Chapter 1: Wrapping Your Head around Data Science 7

Seeing Who Can Make Use of Data Science 8

Analyzing the Pieces of the Data Science Puzzle 10

Collecting, querying, and consuming data 10

Applying mathematical modeling to data science tasks 11

Deriving insights from statistical methods 12

Coding, coding, coding — it’s just part of the game 12

Applying data science to a subject area 12

Communicating data insights 14

Exploring the Data Science Solution Alternatives 14

Assembling your own in-house team 14

Outsourcing requirements to private data science consultants 15

Leveraging cloud-based platform solutions 15

Letting Data Science Make You More Marketable 16

Chapter 2: Exploring Data Engineering Pipelines and Infrastructure 17

Defining Big Data by the Three Vs 18

Grappling with data volume 18

Handling data velocity 18

Dealing with data variety 19

Identifying Big Data Sources 20

Grasping the Difference between Data Science and Data Engineering 21

Defining data science 21

Defining data engineering 22

Comparing data scientists and data engineers 23

Making Sense of Data in Hadoop 24

Digging into MapReduce 24

Stepping into real-time processing 26

Storing data on the Hadoop distributed file system (HDFS) 27

Putting it all together on the Hadoop platform 28

Identifying Alternative Big Data Solutions 28

Introducing massively parallel processing (MPP) platforms 29

Introducing NoSQL databases 29

Data Engineering in Action: A Case Study 30

Identifying the business challenge 30

Solving business problems with data engineering 32

Boasting about benefits 32

Chapter 3: Applying Data-Driven Insights to Business and Industry 33

Benefiting from Business-Centric Data Science 34

Converting Raw Data into Actionable Insights with Data Analytics 35

Types of analytics 35

Common challenges in analytics 36

Data wrangling 36

Taking Action on Business Insights 37

Distinguishing between Business Intelligence and Data Science 39

Business intelligence, defined 39

The kinds of data used in business intelligence 40

Technologies and skillsets that are useful in business intelligence 40

Defining Business-Centric Data Science 41

Kinds of data that are useful in business-centric data science 42

Technologies and skillsets that are useful in business-centric data science 43

Making business value from machine learning methods 43

Differentiating between Business Intelligence and Business-Centric Data Science 44

Knowing Whom to Call to Get the Job Done Right 45

Exploring Data Science in Business: A Data-Driven Business Success Story 46

Part 2: Using Data Science to Extract Meaning from Your Data 49

Chapter 4: Machine Learning: Learning from Data with Your Machine 51

Defining Machine Learning and Its Processes 51

Walking through the steps of the machine learning process 52

Getting familiar with machine learning terms 52

Considering Learning Styles 53

Learning with supervised algorithms 53

Learning with unsupervised algorithms 53

Learning with reinforcement 54

Seeing What You Can Do 54

Selecting algorithms based on function 54

Using Spark to generate real-time big data analytics 58

Chapter 5: Math, Probability, and Statistical Modeling 61

Exploring Probability and Inferential Statistics 62

Probability distributions 63

Conditional probability with Naïve Bayes 65

Quantifying Correlation 66

Calculating correlation with Pearson’s r 66

Ranking variable-pairs using Spearman’s rank correlation 66

Reducing Data Dimensionality with Linear Algebra 67

Decomposing data to reduce dimensionality 67

Reducing dimensionality with factor analysis 69

Decreasing dimensionality and removing outliers with PCA 70

Modeling Decisions with Multi-Criteria Decision Making 70

Turning to traditional MCDM 71

Focusing on fuzzy MCDM 72

Introducing Regression Methods 73

Linear regression 73

Logistic regression 74

Ordinary least squares (OLS) regression methods 74

Detecting Outliers 75

Analyzing extreme values 75

Detecting outliers with univariate analysis 76

Detecting outliers with multivariate analysis 77

Introducing Time Series Analysis 78

Identifying patterns in time series 78

Modeling univariate time series data 79

Chapter 6: Using Clustering to Subdivide Data 81

Introducing Clustering Basics 81

Getting to know clustering algorithms 82

Looking at clustering similarity metrics 85

Identifying Clusters in Your Data 86

Clustering with the k-means algorithm 86

Estimating clusters with kernel density estimation (KDE) 87

Clustering with hierarchical algorithms 88

Dabbling in the DBScan neighborhood 90

Categorizing Data with Decision Tree and Random Forest Algorithms 91

Chapter 7: Modeling with Instances 93

Recognizing the Difference between Clustering and Classification 94

Reintroducing clustering concepts 94

Getting to know classification algorithms 95

Making Sense of Data with Nearest Neighbor Analysis 97

Classifying Data with Average Nearest Neighbor Algorithms 98

Classifying with K-Nearest Neighbor Algorithms 101

Understanding how the k-nearest neighbor algorithm works 102

Knowing when to use the k-nearest neighbor algorithm 103

Exploring common applications of k-nearest neighbor algorithms 104

Solving Real-World Problems with Nearest Neighbor Algorithms 104

Seeing k-nearest neighbor algorithms in action 104

Seeing average nearest neighbor algorithms in action 105

Chapter 8: Building Models That Operate Internet-of-Things Devices 107

Overviewing the Vocabulary and Technologies 108

Learning the lingo 108

Procuring IoT platforms 110

Spark streaming for the IoT 110

Getting context-aware with sensor fusion 111

Digging into the Data Science Approaches 111

Taking on time series 112

Geospatial analysis 112

Dabbling in deep learning 113

Advancing Artificial Intelligence Innovation 113

Part 3: Creating Data Visualizations That Clearly Communicate Meaning 115

Chapter 9: Following the Principles of Data Visualization Design 117

Data Visualizations: The Big Three 118

Data storytelling for organizational decision makers 118

Data showcasing for analysts 118

Designing data art for activists 119

Designing to Meet the Needs of Your Target Audience 119

Step 1: Brainstorm (about Brenda) 120

Step 2: Define the purpose 121

Step 3: Choose the most functional visualization type for your purpose 121

Picking the Most Appropriate Design Style 122

Inducing a calculating, exacting response 122

Eliciting a strong emotional response 123

Choosing How to Add Context 124

Creating context with data 125

Creating context with annotations 125

Creating context with graphical elements 125

Selecting the Appropriate Data Graphic Type 127

Standard chart graphics 127

Comparative graphics 130

Statistical plots 134

Topology structures 135

Spatial plots and maps 138

Choosing a Data Graphic 140

Chapter 10: Using D3.js for Data Visualization 141

Introducing the D3.js Library 141

Knowing When to Use D3.js (and When Not To) 142

Getting Started in D3.js 143

Bringing in the HTML and DOM 144

Bringing in the JavaScript and SVG 145

Bringing in the Cascading Style Sheets (CSS) 146

Bringing in the web servers and PHP 146

Implementing More Advanced Concepts and Practices in D3.js 147

Getting to know chain syntax 151

Getting to know scales 152

Getting to know transitions and interactions 153

Chapter 11: Web-Based Applications for Visualization Design 157

Designing Data Visualizations for Collaboration 158

Visualizing and collaborating with Plotly 159

Talking about Tableau Public 161

Visualizing Spatial Data with Online Geographic Tools 162

Making pretty maps with OpenHeatMap 163

Mapmaking and spatial data analytics with CartoDB 164

Visualizing with Open Source: Web-Based Data Visualization Platforms 166

Making pretty data graphics with Google Fusion Tables 166

Using iCharts for web-based data visualization 167

Using RAW for web-based data visualization 168

Knowing When to Stick with Infographics 170

Making cool infographics with 170

Making cool infographics with Piktochart 172

Chapter 12: Exploring Best Practices in Dashboard Design 173

Focusing on the Audience 174

Starting with the Big Picture 175

Getting the Details Right 176

Testing Your Design 178

Chapter 13: Making Maps from Spatial Data 179

Getting into the Basics of GIS 180

Spatial databases 181

File formats in GIS 182

Map projections and coordinate systems 185

Analyzing Spatial Data 187

Querying spatial data 187

Buffering and proximity functions 188

Using layer overlay analysis 189

Reclassifying spatial data 190

Getting Started with Open-Source QGIS 191

Getting to know the QGIS interface 191

Adding a vector layer in QGIS 192

Displaying data in QGIS 193

Part 4: Computing for Data Science 199

Chapter 14: Using Python for Data Science 201

Sorting Out the Python Data Types 203

Numbers in Python 204

Strings in Python 204

Lists in Python 204

Tuples in Python 205

Sets in Python 205

Dictionaries in Python 205

Putting Loops to Good Use in Python 206

Having Fun with Functions 207

Keeping Cool with Classes 208

Checking Out Some Useful Python Libraries 210

Saying hello to the NumPy library 211

Getting up close and personal with the SciPy library 213

Peeking into the Pandas offering 213

Bonding with MatPlotLib for data visualization 214

Learning from data with Scikit-learn 215

Analyzing Data with Python — an Exercise 216

Installing Python on the Mac and Windows OS 216

Loading CSV files 218

Calculating a weighted average 219

Drawing trendlines 222

Chapter 15: Using Open Source R for Data Science 225

R’s Basic Vocabulary 226

Delving into Functions and Operators 229

Iterating in R 232

Observing How Objects Work 234

Sorting Out Popular Statistical Analysis Packages 236

Examining Packages for Visualizing, Mapping, and Graphing in R 238

Visualizing R statistics with ggplot2 238

Analyzing networks with statnet and igraph 239

Mapping and analyzing spatial point patterns with spatstat 240

Chapter 16: Using SQL in Data Science 241

Getting a Handle on Relational Databases and SQL 242

Investing Some Effort into Database Design 245

Defining data types 246

Designing constraints properly 246

Normalizing your database 247

Integrating SQL, R, Python, and Excel into Your Data Science Strategy 249

Narrowing the Focus with SQL Functions 249

Chapter 17: Doing Data Science with Excel and Knime 255

Making Life Easier with Excel 255

Using Excel to quickly get to know your data 256

Reformatting and summarizing with pivot tables 261

Automating Excel tasks with macros 262

Using KNIME for Advanced Data Analytics 264

Reducing customer churn via KNIME 265

Using KNIME to make the most of your social data 265

Using KNIME for environmental good stewardship 266

Part 5: Applying Domain Expertise to Solve Real-World Problems Using Data Science 267

Chapter 18: Data Science in Journalism: Nailing Down the Five Ws (and an H) 269

Who Is the Audience? 270

Who made the data 271

Who comprises the audience 271

What: Getting Directly to the Point 272

Bringing Data Journalism to Life: The Black Budget 273

When Did It Happen? 274

When as the context to your story 274

When does the audience care the most? 275

Where Does the Story Matter? 275

Where is the story relevant? 276

Where should the story be published? 276

Why the Story Matters 277

Asking why in order to generate and augment a storyline 277

Why your audience should care 277

How to Develop, Tell, and Present the Story 278

Integrating how as a source of data and story context 278

Finding stories in your data 278

Presenting a data-driven story 279

Collecting Data for Your Story 279

Scraping data 279

Setting up data alerts 280

Finding and Telling Your Data’s Story 280

Spotting strange trends and outliers 281

Examining context to understand the significance of data 283

Emphasizing the story through visualization 284

Creating compelling and highly focused narratives 285

Chapter 19: Delving into Environmental Data Science 287

Modeling Environmental-Human Interactions with Environmental Intelligence 288

Examining the types of problems solved 288

Defining environmental intelligence 289

Identifying major organizations that work in environmental intelligence 290

Making positive impacts with environmental intelligence 291

Modeling Natural Resources in the Raw 293

Exploring natural resource modeling 293

Dabbling in data science 293

Modeling natural resources to solve environmental problems 294

Using Spatial Statistics to Predict for Environmental Variation across Space 295

Addressing environmental issues with spatial predictive analytics 296

Describing the data science that’s involved 296

Addressing environmental issues with spatial statistics 297

Chapter 20: Data Science for Driving Growth in E-Commerce 299

Making Sense of Data for E-Commerce Growth 302

Optimizing E-Commerce Business Systems 303

Angling in on analytics 304

Talking about testing your strategies 308

Segmenting and targeting for success 311

Chapter 21: Using Data Science to Describe and Predict Criminal Activity 315

Temporal Analysis for Crime Prevention and Monitoring 316

Spatial Crime Prediction and Monitoring 317

Crime mapping with GIS technology 317

Going one step further with location-allocation analysis 318

Analyzing complex spatial statistics to better understand crime 319

Probing the Problems with Data Science for Crime Analysis 322

Caving in on civil rights 322

Taking on technical limitations 323

Part 6: The Part of Tens 325

Chapter 22: Ten Phenomenal Resources for Open Data 327

Digging through 328

Checking Out Canada Open Data 329

Diving into 330

Checking Out U.S. Census Bureau Data 331

Knowing NASA Data 332

Wrangling World Bank Data 333

Getting to Know Knoema Data 334

Queuing Up with Quandl Data 335

Exploring Exversion Data 336

Mapping OpenStreetMap Spatial Data 337

Chapter 23: Ten Free Data Science Tools and Applications 339

Making Custom Web-Based Data Visualizations with Free R Packages 340

Getting Shiny by RStudio 340

Charting with rCharts 341

Mapping with rMaps 341

Examining Scraping, Collecting, and Handling Tools 342

Scraping data with 342

Collecting images with ImageQuilts 343

Wrangling data with DataWrangler 343

Looking into Data Exploration Tools 344

Getting up to speed in Gephi 345

Machine learning with the WEKA suite 347

Evaluating Web-Based Visualization Tools 347

Getting a little Weave up your sleeve 347

Checking out Knoema’s data visualization offerings 348

Index 351