Ebook
Complex Surveys: A Guide to Analysis Using R
ISBN: 9781118210932
296 pages
September 2011

As survey analysis continues to serve as a core component of sociological research, researchers are increasingly relying upon data gathered from complex surveys to carry out traditional analyses. Complex Surveys is a practical guide to the analysis of this kind of data using R, the freely available and downloadable statistical programming language. As creator of the specific survey package for R, the author provides the ultimate presentation of how to successfully use the software for analyzing data from complex surveys while also utilizing the most current data from health and social sciences studies to demonstrate the application of survey research methods in these fields.
The book begins with coverage of basic tools and topics within survey analysis such as simple and stratified sampling, cluster sampling, linear regression, and categorical data regression. Subsequent chapters delve into more technical aspects of complex survey analysis, including poststratification, two-phase sampling, missing data, and causal inference. Throughout the book, an emphasis is placed on graphics, regression modeling, and two-phase designs. In addition, the author supplies a unique discussion of epidemiological two-phase designs as well as probability-weighting for causal inference. All of the book's examples and figures are generated using R, and a related Web site provides the R code that allows readers to reproduce the presented content. Each chapter concludes with exercises that vary in level of complexity, and detailed appendices outline additional mathematical and computational descriptions to assist readers with comparing results from various software systems.
Complex Surveys is an excellent book for courses on sampling and complex surveys at the upper-undergraduate and graduate levels. It is also a practical reference guide for applied statisticians and practitioners in the social and health sciences who use statistics in their everyday work.
Preface.
Acronyms.
1 Basic Tools.
1.1 Goals of Inference.
1.1.1 Population or Process?
1.1.2 Probability Samples.
1.1.3 Sampling Weights.
1.1.4 Design Effects.
1.2 An Introduction to the Data.
1.2.1 Real Surveys.
1.2.2 Populations.
1.3 Obtaining the Software.
1.3.1 Obtaining R.
1.3.2 Obtaining the Survey Package.
1.4 Using R.
1.4.1 Reading Plain Text Data.
1.4.2 Reading Data from Other Packages.
1.4.3 Simple Computation.
Exercises.
2 Simple and Stratified Sampling.
2.1 Analyzing Simple Random Samples.
2.1.1 Confidence Intervals.
2.1.2 Describing the Sample to R.
2.2 Stratified Sampling.
2.3 Replicate Weights.
2.3.1 Specifying Replicate Weights to R.
2.3.2 Creating Replicate Weights in R.
2.4 Other Population Summaries.
2.4.1 Quantiles.
2.4.2 Contingency Tables.
2.5 Estimates in Subpopulations.
2.6 Design of Stratified Samples.
Exercises.
3 Cluster Sampling.
3.1 Introduction.
3.1.1 Why Clusters: The NHANES II Design.
3.1.2 Single-Stage and Multistage Designs.
3.2 Describing Multistage Designs to R.
3.2.1 Strata with Only One PSU.
3.2.2 How Good is the Single-Stage Approximation?
3.2.3 Replicate Weights for Multistage Samples.
3.3 Sampling by Size.
3.3.1 Loss of Information from Sampling Clusters.
3.4 Repeated Measurements.
Exercises.
4 Graphics.
4.1 Why is Survey Data Different?
4.2 Plotting a Table.
4.3 One Continuous Variable.
4.3.1 Graphs Based on the Distribution Function.
4.3.2 Graphs Based on the Density.
4.4 Two Continuous Variables.
4.4.1 Scatterplots.
4.4.2 Aggregation and Smoothing.
4.4.3 Scatterplot Smoothers.
4.5 Conditioning Plots.
4.6 Maps.
4.6.1 Design and Estimation Issues.
4.6.2 Drawing Maps in R.
Exercises.
5 Ratios and Linear Regression.
5.1 Ratio Estimation.
5.1.1 Estimating Ratios.
5.1.2 Ratios for Subpopulation Estimates.
5.1.3 Ratio Estimators of Totals.
5.2 Linear Regression.
5.2.1 The Least-Squares Slope as an Estimated Population.
5.2.2 Regression Estimation of Population Totals.
5.2.3 Confounding and Other Criteria for Model Choice.
5.2.4 Linear Models in the Survey Package.
5.3 Is Weighting Needed in Regression Models?
Exercises.
6 Categorical Data Regression.
6.1 Logistic Regression.
6.1.1 Relative Risk Regression.
6.2 Ordinal Regression.
6.2.1 Other Cumulative Link Models.
6.3 Loglinear Models.
6.3.1 Choosing Models.
6.3.2 Linear Association Models.
Exercises.
7 Post-Stratification, Raking and Calibration.
7.1 Introduction.
7.2 Post-Stratification.
7.3 Raking.
7.4 Generalized Raking, GREG Estimation, and Calibration.
7.4.1 Calibration in R.
7.5 Basu’s Elephants.
7.6 Selecting Auxiliary Variables for Non-Response.
7.6.1 Direct Standardization.
7.6.2 Standard Error Estimation.
Exercises.
8 Two-Phase Sampling.
8.1 Multistage and Multiphase Sampling.
8.2 Sampling for Stratification.
8.3 The Case-Control Design.
8.3.1 Simulations: Efficiency of the Design-Based Estimator.
8.3.2 Frequency Matching.
8.4 Sampling from Existing Cohorts.
8.4.1 Logistic Regression.
8.4.2 Two-Phase Case-Control Designs in R.
8.4.3 Survival Analysis.
8.4.4 Case-Cohort Designs in R.
8.5 Using Auxiliary Information from Phase One.
8.5.1 Population Calibration for Regression Models.
8.5.2 Two-Phase Designs.
8.5.3 Some History of the Two-Phase Calibration Estimator.
Exercises.
9 Missing Data.
9.1 Item Non-Response.
9.2 Two-Phase Estimation for Missing Data.
9.2.1 Calibration for Item Non-Response.
9.2.2 Models for Response Probability.
9.2.3 Effect on Precision.
9.2.4 Doubly-Robust Estimators.
9.3 Imputation of Missing Data.
9.3.1 Describing Multiple Imputations to R.
9.3.2 Example: NHANES III Imputations.
Exercises.
10 Causal Inference.
10.1 IPTW Estimators.
10.1.1 Randomized Trials and Calibration.
10.1.2 Estimated Weights for IPTW.
10.1.3 Double Robustness.
10.2 Marginal Structural Models.
Appendix A: Analytic Details.
A.1 Asymptotics.
A.1.1 Embedding in an Infinite Sequence.
A.1.2 Asymptotic Unbiasedness.
A.1.3 Asymptotic Normality and Consistency.
A.2 Variances by Linearization.
A.2.1 Subpopulation Inference.
A.3 Tests in Contingency Tables.
A.4 Multiple Imputation.
A.5 Calibration and Influence Functions.
A.6 Calibration in Randomized Trials and ANCOVA.
Appendix B: Basic R.
B.1 Reading Data.
B.1.1 Plain Text Data.
B.2 Data Manipulation.
B.2.1 Merging.
B.2.2 Factors.
B.3 Randomness.
B.4 Methods and Objects.
B.5 Writing Functions.
B.5.1 Repetition.
B.5.2 Strings.
Appendix C: Computational Details.
C.1 Linearization.
C.1.1 Generalized Linear Models and Expected Information.
C.2 Replicate Weights.
C.2.1 Choice of Estimators.
C.2.2 Hadamard Matrices.
C.3 Scatterplot Smoothers.
C.4 Quantiles.
C.5 Bug Reports and Feature Requests.
Appendix D: Database-Backed Design Objects.
D.1 Large Data.
D.2 Setting Up Database Interfaces.
D.2.1 ODBC.
D.2.2 DBI.
Appendix E: Extending the Survey Package.
E.1 A Case Study: Negative Binomial Regression.
E.2 Using a Poisson Model.
E.3 Replicate Weights.
E.4 Linearization.
References.
Author Index.
Topic Index.
 Actual analysis code, in R, is presented throughout the book
 Realistically large, but not technologically inaccessible, data sets are showcased and special cases are avoided
 Formulas are developed or given only when illuminating (e.g., the Horvitz-Thompson variance estimator, but not the variance of a ratio estimator)
 Coverage of regression modeling in epidemiological two-phase designs is unique, as is probability-weighting for causal inference
 A related Web site houses additional data sets, figures, and code for reproducing the book's examples