Wiley.com
E-book

Machine Learning: Hands-On for Developers and Technical Professionals

ISBN: 978-1-118-88949-7
408 pages
October 2014

Description

Dig deep into the data with a hands-on guide to machine learning

Machine Learning: Hands-On for Developers and Technical Professionals provides hands-on instruction and fully coded working examples for the most common machine learning techniques used by developers and technical professionals. The book contains a breakdown of each ML variant, explaining how it works and how it is used within certain industries, allowing readers to incorporate the presented techniques into their own work as they follow along. A core tenet of machine learning is a strong focus on data preparation, and a full exploration of the various types of learning algorithms illustrates how the proper tools can help any developer extract information and insights from existing data. The book includes a full complement of Instructor's Materials to facilitate use in the classroom, making this resource useful for students and as a professional reference.

At its core, machine learning is a mathematical, algorithm-based technology that forms the basis of historical data mining and modern big data science. Scientific analysis of big data requires a working knowledge of machine learning, which forms predictions based on known properties learned from training data. Machine Learning is an accessible, comprehensive guide for the non-mathematician, providing clear guidance that allows readers to:

  • Learn the tools of machine learning, including Hadoop, Mahout, and Weka
  • Understand decision trees, Bayesian networks, and artificial neural networks
  • Implement Association Rule, Real Time, and Batch learning
  • Develop a strategic plan for safe, effective, and efficient machine learning

By learning to construct a system that can learn from data, readers can increase their utility across industries. Machine learning sits at the core of deep-dive data analysis and visualization, which is increasingly in demand as companies discover the goldmine hiding in their existing data. For the tech professional involved in data science, Machine Learning: Hands-On for Developers and Technical Professionals provides the skills and techniques required to dig deeper.

Table of Contents

Introduction xix

Chapter 1 What Is Machine Learning? 1

History of Machine Learning 1

Alan Turing 1

Arthur Samuel 2

Tom M. Mitchell 2

Summary Definition 2

Algorithm Types for Machine Learning 3

Supervised Learning 3

Unsupervised Learning 3

The Human Touch 4

Uses for Machine Learning 4

Software 4

Stock Trading 5

Robotics 6

Medicine and Healthcare 6

Advertising 6

Retail and E-Commerce 7

Gaming Analytics 8

The Internet of Things 9

Languages for Machine Learning 10

Python 10

R 10

Matlab 10

Scala 10

Clojure 11

Ruby 11

Software Used in This Book 11

Checking the Java Version 11

Weka Toolkit 12

Mahout 12

SpringXD 13

Hadoop 13

Using an IDE 14

Data Repositories 14

UC Irvine Machine Learning Repository 14

Infochimps 14

Kaggle 15

Summary 15

Chapter 2 Planning for Machine Learning 17

The Machine Learning Cycle 17

It All Starts with a Question 18

I Don’t Have Data! 19

Starting Local 19

Competitions 19

One Solution Fits All? 20

Defining the Process 20

Planning 20

Developing 21

Testing 21

Reporting 21

Refining 22

Production 22

Building a Data Team 22

Mathematics and Statistics 22

Programming 23

Graphic Design 23

Domain Knowledge 23

Data Processing 23

Using Your Computer 24

A Cluster of Machines 24

Cloud-Based Services 24

Data Storage 25

Physical Discs 25

Cloud-Based Storage 25

Data Privacy 25

Cultural Norms 25

Generational Expectations 26

The Anonymity of User Data 26

Don’t Cross “The Creepy Line” 27

Data Quality and Cleaning 28

Presence Checks 28

Type Checks 29

Length Checks 29

Range Checks 30

Format Checks 30

The Britney Dilemma 30

What’s in a Country Name? 33

Dates and Times 35

Final Thoughts on Data Cleaning 35

Thinking about Input Data 36

Raw Text 36

Comma Separated Variables 36

JSON 37

YAML 39

XML 39

Spreadsheets 40

Databases 41

Thinking about Output Data 42

Don’t Be Afraid to Experiment 42

Summary 43

Chapter 3 Working with Decision Trees 45

The Basics of Decision Trees 45

Uses for Decision Trees 45

Advantages of Decision Trees 46

Limitations of Decision Trees 46

Different Algorithm Types 47

How Decision Trees Work 48

Decision Trees in Weka 53

The Requirement 53

Training Data 53

Using Weka to Create a Decision Tree 55

Creating Java Code from the Classification 60

Testing the Classifier Code 64

Thinking about Future Iterations 66

Summary 67

Chapter 4 Bayesian Networks 69

Pilots to Paperclips 69

A Little Graph Theory 70

A Little Probability Theory 72

Coin Flips 72

Conditional Probability 72

Winning the Lottery 73

Bayes’ Theorem 73

How Bayesian Networks Work 75

Assigning Probabilities 76

Calculating Results 77

Node Counts 78

Using Domain Experts 78

A Bayesian Network Walkthrough 79

Java APIs for Bayesian Networks 79

Planning the Network 79

Coding Up the Network 81

Summary 90

Chapter 5 Artificial Neural Networks 91

What Is a Neural Network? 91

Artificial Neural Network Uses 92

High-Frequency Trading 92

Credit Applications 93

Data Center Management 93

Robotics 93

Medical Monitoring 93

Breaking Down the Artificial Neural Network 94

Perceptrons 94

Activation Functions 95

Multilayer Perceptrons 96

Back Propagation 98

Data Preparation for Artificial Neural Networks 99

Artificial Neural Networks with Weka 100

Generating a Dataset 100

Loading the Data into Weka 102

Configuring the Multilayer Perceptron 103

Training the Network 105

Altering the Network 108

Increasing the Test Data Size 108

Implementing a Neural Network in Java 109

Create the Project 109

The Code 111

Converting from CSV to Arff 114

Running the Neural Network 114

Summary 115

Chapter 6 Association Rules Learning 117

Where Is Association Rules Learning Used? 117

Web Usage Mining 118

Beer and Diapers 118

How Association Rules Learning Works 119

Support 121

Confidence 121

Lift 122

Conviction 122

Defining the Process 122

Algorithms 123

Apriori 123

FP-Growth 124

Mining the Baskets—A Walkthrough 124

Downloading the Raw Data 124

Setting Up the Project in Eclipse 125

Setting Up the Items Data File 126

Setting Up the Data 129

Running Mahout 131

Inspecting the Results 133

Putting It All Together 135

Further Development 136

Summary 137

Chapter 7 Support Vector Machines 139

What Is a Support Vector Machine? 139

Where Are Support Vector Machines Used? 140

The Basic Classification Principles 140

Binary and Multiclass Classification 140

Linear Classifiers 142

Confidence 143

Maximizing and Minimizing to Find the Line 143

How Support Vector Machines Approach Classification 144

Using Linear Classification 144

Using Non-Linear Classification 146

Using Support Vector Machines in Weka 147

Installing LibSVM 147

A Classification Walkthrough 148

Implementing LibSVM with Java 154

Summary 159

Chapter 8 Clustering 161

What Is Clustering? 161

Where Is Clustering Used? 162

The Internet 162

Business and Retail 163

Law Enforcement 163

Computing 163

Clustering Models 164

How the K-Means Works 164

Calculating the Number of Clusters in a Dataset 166

K-Means Clustering with Weka 168

Preparing the Data 168

The Workbench Method 169

The Command-Line Method 174

The Coded Method 178

Summary 186

Chapter 9 Machine Learning in Real Time with Spring XD 187

Capturing the Firehose of Data 187

Considerations of Using Data in Real Time 188

Potential Uses for a Real-Time System 188

Using Spring XD 189

Spring XD Streams 190

Input Sources, Sinks, and Processors 190

Learning from Twitter Data 193

The Development Plan 193

Configuring the Twitter API Developer Application 194

Configuring Spring XD 196

Starting the Spring XD Server 197

Creating Sample Data 198

The Spring XD Shell 198

Streams 101 199

Spring XD and Twitter 202

Setting the Twitter Credentials 202

Creating Your First Twitter Stream 203

Where to Go from Here 205

Introducing Processors 206

How Processors Work within a Stream 206

Creating Your Own Processor 207

Real-Time Sentiment Analysis 215

How the Basic Analysis Works 215

Creating a Sentiment Processor 217

Spring XD Taps 221

Summary 222

Chapter 10 Machine Learning as a Batch Process 223

Is It Big Data? 223

Considerations for Batch Processing Data 224

Volume and Frequency 224

How Much Data? 225

Which Process Method? 225

Practical Examples of Batch Processes 225

Hadoop 225

Sqoop 226

Pig 226

Mahout 226

Cloud-Based Elastic Map Reduce 226

A Note about the Walkthroughs 227

Using the Hadoop Framework 227

The Hadoop Architecture 227

Setting Up a Single-Node Cluster 229

How MapReduce Works 233

Mining the Hashtags 234

Hadoop Support in Spring XD 235

Objectives for This Walkthrough 235

What’s a Hashtag? 235

Creating the MapReduce Classes 236

Performing ETL on Existing Data 247

Product Recommendation with Mahout 250

Mining Sales Data 256

Welcome to My Coffee Shop! 257

Going Small Scale 258

Writing the Core Methods 258

Using Hadoop and MapReduce 260

Using Pig to Mine Sales Data 263

Scheduling Batch Jobs 273

Summary 274

Chapter 11 Apache Spark 275

Spark: A Hadoop Replacement? 275

Java, Scala, or Python? 276

Scala Crash Course 276

Installing Scala 276

Packages 277

Data Types 277

Classes 278

Calling Functions 278

Operators 279

Control Structures 279

Downloading and Installing Spark 280

A Quick Intro to Spark 280

Starting the Shell 281

Data Sources 282

Testing Spark 282

Spark Monitor 284

Comparing Hadoop MapReduce to Spark 285

Writing Standalone Programs with Spark 288

Spark Programs in Scala 288

Installing SBT 288

Spark Programs in Java 291

Spark Program Summary 295

Spark SQL 295

Basic Concepts 295

Using SparkSQL with RDDs 296

Spark Streaming 305

Basic Concepts 305

Creating Your First Stream with Scala 306

Creating Your First Stream with Java 309

MLib: The Machine Learning Library 311

Dependencies 311

Decision Trees 312

Clustering 313

Summary 313

Chapter 12 Machine Learning with R 315

Installing R 315

Mac OSX 315

Windows 316

Linux 316

Your First Run 316

Installing R-Studio 317

The R Basics 318

Variables and Vectors 318

Matrices 319

Lists 320

Data Frames 321

Installing Packages 322

Loading in Data 323

Plotting Data 324

Simple Statistics 327

Simple Linear Regression 329

Creating the Data 329

The Initial Graph 329

Regression with the Linear Model 330

Making a Prediction 331

Basic Sentiment Analysis 331

Functions to Load in Word Lists 331

Writing a Function to Score Sentiment 332

Testing the Function 333

Apriori Association Rules 333

Installing the ARules Package 334

The Training Data 334

Importing the Transaction Data 335

Running the Apriori Algorithm 336

Inspecting the Results 336

Accessing R from Java 337

Installing the rJava Package 337

Your First Java Code in R 337

Calling R from Java Programs 338

Setting Up an Eclipse Project 338

Creating the Java/R Class 339

Running the Example 340

Extending Your R Implementations 342

R and Hadoop 342

The RHadoop Project 342

A Sample Map Reduce Job in RHadoop 343

Connecting to Social Media with R 345

Summary 347

Appendix A SpringXD Quick Start 349

Installing Manually 349

Starting SpringXD 349

Creating a Stream 350

Adding a Twitter Application Key 350

Appendix B Hadoop 1.x Quick Start 351

Downloading and Installing Hadoop 351

Formatting the HDFS Filesystem 352

Starting and Stopping Hadoop 353

Process List of a Basic Job 353

Appendix C Useful Unix Commands 355

Using Sample Data 355

Showing the Contents: cat, more, and less 356

Example Command 356

Expected Output 356

Filtering Content: grep 357

Example Command for Finding Text 357

Example Output 357

Sorting Data: sort 358

Example Command for Basic Sorting 358

Example Output 358

Finding Unique Occurrences: uniq 360

Showing the Top of a File: head 361

Counting Words: wc 361

Locating Anything: find 362

Combining Commands and Redirecting Output 363

Picking a Text Editor 363

Colon Frenzy: Vi and Vim 363

Nano 364

Emacs 364

Appendix D Further Reading 367

Machine Learning 367

Statistics 368

Big Data and Data Science 368

Hadoop 368

Visualization 369

Making Decisions 369

Datasets 369

Blogs 370

Useful Websites 370

The Tools of the Trade 370

Index 373

Author Information

Jason Bell has been working with point of sale and customer loyalty data since 2002 and has been involved in software development for more than 25 years. He works as a senior technical architect and lecturer, and also advises startups that are just beginning their technical adventures.
