Getting Started with the Survey of Consumer Finances

Eric Busboom.

A tutorial for getting started with the Survey of Consumer Finances, one of the most important surveys for studying Americans’ financial status and habits.

import pandas as pd
import numpy as np
import rowgenerators as rg
import seaborn as sns
sns.set_context('notebook')
import matplotlib.pyplot as plt
%matplotlib inline

from IPython.display import Markdown
from IPython.core.magic import register_cell_magic


@register_cell_magic
def markdown(line, cell):
    return Markdown(cell.format(**globals()))
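
The markdown cell magic defined above renders a cell as Markdown after interpolating global variables with str.format. A hypothetical usage, assuming a global variable named survey_year was defined in an earlier cell, would be:

%%markdown
This tutorial analyzes data from the **{survey_year}** Survey of Consumer Finances.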

The Survey of Consumer Finances is a large, cross-sectional, triennial survey of Americans that asks questions about families’ balance sheets, pensions, income, and demographic characteristics. It is sponsored and published by the Board of Governors of the Federal Reserve System and administered by NORC. It is one of the most widely used datasets1 for financial research and is a critical resource for studying a wide range of topics. The Federal Reserve has a chartbook application that you can use to get a sense of what data is in the SCF and what you can analyze with it.

In this guide we will analyze the full public dataset and the extract files using Python and Pandas.

About the Dataset

Nearly all of the files, scripts and documentation for the SCF are available on the web page for each year of the survey. The page for the 2019 survey is: https://www.federalreserve.gov/econres/scfindex.htm

Other survey pages are linked from the “Previous Surveys” link in the left-hand menu.

The main download page has many links to files and documentation in a variety of formats, and each collection of files is organized in a blue table. The first table, in the “Historic Tables” section, has compiled data from multiple years. These files are useful for basic reporting, but not for other types of analysis. We are interested in the Full Public Dataset and the Summary Extract Public Data. We’ll access these files in the Stata format using Pandas.

For our analysis, there are three important data files:

  • The Full Public Dataset (the main survey data).
  • The Summary Extract Public Data.
  • The Replicate Weights file.

There are also two important documentation files:

  • The Codebook, which describes the variables in the Main Survey Data.
  • The SAS macro that creates the Summary Extract Public Data. This macro code is the only documentation for the extract file variables.

The SCF is a very complex dataset, with many quirks that researchers should be aware of. The Hanna et al. article1 provides an authoritative discussion of the nuances of this dataset.

Dataset Structure and Organization

The SCF appears to collect financial information about households, but the main unit of analysis is actually the “primary economic unit”2 (PEU), which is defined as an economically dominant individual or couple and all of the people financially dependent on that individual or couple. This unit may be a subset of the household if some members of the household are financially independent. However, most documentation uses the terms “household” and “primary economic unit” interchangeably, and sometimes even SCF staff will use the term “family” instead. For convenience, we will use “household” to describe the primary economic unit.

Because the dataset is a survey, it does not have responses from every household in America. To use the dataset to infer statistics about the whole American population, each record is treated as representing around 20,000 American households. The number of households that a record represents is known as the weight; the sum of the weights in the dataset should equal the number of households in the US.

However, the weights for each record are statistical guesses, and there could be other possible weights for each household. The main data file holds one primary weight, but the weights file holds an additional 999 weights, the replicate weights, that can be used for more complex analysis. We will use the main weight in this tutorial, and demonstrate how to use the replicate weights later.

The data file also has one additional layer of structure: the implicates. The 2016 data file, which we analyze below, has records for 6,248 households, but it has five times as many rows, five rows for each household. Each of these rows is an implicate, a repetition of the record that serves to obscure the identity of the household and to provide a statistically sophisticated way of handling missing values. The implicates will be discussed later; for now we will use only the first implicate and ignore the rest.

There are two identifier values for the records. Variable YY1 identifies each household, and because of the implicates, each value will be repeated 5 times. The second identifier is Y1, which is constructed by multiplying YY1 by 10 and adding the implicate number.
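
As a minimal sketch of that relationship, assuming the main file has been loaded into a dataframe named scfp as shown in the Loading Files section below, the implicate number can be recovered from the two identifiers and used to keep a single row per household:

# Recover the implicate number (1 through 5): Y1 = YY1*10 + implicate
implicate = scfp['Y1'] - scfp['YY1'] * 10

# Keep only the first implicate, one row per household
scfp_first = scfp[implicate == 1]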

The main file, the Full Public Dataset, holds all of the publicly released variables, but these variables have cryptic names. For instance, X5907 is the variable for the respondent’s date of birth. The second important file, the Summary Extract Public Data, is created from the primary data with a SAS macro. Its variables have human-readable names; HHSEX is the variable for the sex of the household reference person. Additionally, the macro maintains the same variable names from year to year, with approximately the same meaning, so using this file is often much easier. For instance, the main dataset has many variables for household income, and a household’s income can vary a lot from year to year. In the Extract file, all of these values are combined and smoothed into the NORMINC (normal annual income) variable. Using this file can greatly reduce an analyst’s burden of data preparation.

Loading Files

To load the files, we will use the rowgenerators Python package, which provides support for getting Pandas DataFrames from URLs. The rowgenerators package caches the downloaded files, so re-running our program is faster, and it can select a specific file out of a ZIP archive.

import pandas as pd
import numpy as np
import rowgenerators as rg
# The  row generators package can extract the .dta file from the ZIP archive and
# produce a dataframe. It also caches them locally, to speed-up re-runs.

frb_site = 'https://www.federalreserve.gov/econres/files'

# Public File
scfp = rg.dataframe(frb_site+'/scf2016s.zip#p16i6.dta')

# Extract
scfe = rg.dataframe(frb_site+'/scfp2016s.zip#rscfp2016.dta')

# Weights
scfw = rg.dataframe(frb_site+'/scf2016rw1s.zip#p16_rw1.dta')

The public file is fairly large: 31K rows and 5K columns, and the column names are nearly all cryptic codes, which you will have to decipher with the Codebook. By the way, the Codebook is 850 pages, so don’t try to print it out.

scfp.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 31240 entries, 0 to 31239
Columns: 5320 entries, J7398 to X11572
dtypes: float32(293), float64(103), int16(2526), int32(376), int8(2022)
memory usage: 315.3 MB

The Summary Extract Public Data file is different. It is just as long as the main public file, with one row per row of the main file, but only has 348 columns, and the columns have sensible names. There doesn’t appear to be a codebook for this file. Instead, all of the columns are created by a SAS script, and you have to read the script to understand what the variables are for. Fortunately, the script is not hard to read.

scfe.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 31240 entries, 0 to 31239
Columns: 348 entries, YY1 to nincpctlecat
dtypes: float64(82), int16(21), int32(61), int8(184)
memory usage: 33.8 MB

Basic Analysis

The plan for our analysis is to first create an extract file that we’ll use for all of the later analysis, with sensible names for the variables, and then perform the analysis. Most often, these steps are in separate Jupyter notebooks, but here we will join them together.

Extracting data has a few steps. First, we have to locate the variables that we want to use in the codebook and extract them from the file. Since all of the variable names in the main file are codes, we will want to rename them to more memorable names. The variables from the Extract file already have somewhat sensible names, so we will extract those variables without renaming them. We will also have to either extract only the first implicate, or remember to divide sums of records by 5, or divide the weights by 5. But there is also one other option, if we have a lot of memory, that makes the analysis simpler.

Because the records are weighted (each record actually represents between roughly 10 and 65,000 real households), any summary statistics, such as means or medians, will have to account for the weighting. Because of the structure of the sampling for the SCF, proper weighting is very important.

However, there is another technique that can avoid many of the difficulties and produce reasonably accurate estimates with no complicated handling of weights: sampling. We will do the analysis in this tutorial using sampling, and demonstrate other handling of weights later. The idea behind sampling is very simple: we select a large number of records from the dataset, 100,000 or more, with a probability of selection proportional to the record weight. With a large number of sampled records, we can easily compute percentages, ratios, means and medians.
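
As a sketch of the idea, here is the selection step written directly with numpy; it assumes the merged dataframe df built in the next section, and later in the tutorial we will use the more convenient Pandas .sample() method instead:

# Weight-proportional resampling with numpy. wt0 is the primary weight.
probs = df.wt0 / df.wt0.sum()                      # selection probability per record
idx = np.random.choice(df.index, size=100_000, replace=True, p=probs)
sampled = df.loc[idx]                              # approximately self-weighting sample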

Extracting Records

Our first step is to extract the records we will need for analysis. For the main file, we will need to consult the codebook to get each variable’s code, and then we’ll want to assign it a sensible name. This information is best captured in a Python dict:

p_cols = {
    'case_id': 'YY1',
    'record_id': 'Y1',
    'age_1': 'X14', # Reconciled age
    'age_2': 'X19', # Reconciled age
    'hisp': 'X7004', # Do you consider yourself to be Hispanic or Latino in culture or origin?
    'race': 'X6809', # Race of respondent
    'additional_race': 'X6810', # Respondent offered another race category (1) or did not (5)
    # X6402 #In 2015, did (other adult) receive any income from wages or salaries?
    #'income': 'X5729', # How much was the total income you (and your family living here) received in 2015 from all sources, before taxes and other deductions were made?
    'unusual_income': 'X7650', # Is this income unusually high or low...
    'ed_1': 'X5931', # What is the highest level of school completed or the highest degree you have received?
    'ed_2': 'X6111', # What is the highest level of school completed or the highest degree you have received?
    'ed_mother_1': 'X6032', # What is the highest level of school or the highest degree mother completed?
    'ed_father_1': 'X6033', # What is the highest level of school or the highest degree father completed?
    'ed_mother_2': 'X6132', # What is the highest level of school or the highest degree mother completed?
    'ed_father_2': 'X6133', # What is the highest level of school or the highest degree father completed?  
    'occ_1': 'X7401', # What is the official title of your job?
    'occ_2': 'X7411', # What is the official title of your job?
    'gi_other_value': 'X5818', #How much altogether were any others (inheritances) you have received?
    'fin_risk': 'X7557', # Willingness to take financial risks, 1 to 10
    'shop_credit' : 'X7561', # Financial shopping
    'shop_credit_2' : 'X7562', # Financial Shopping
    'fin_know': 'X7556', # Financial Knowledge
    'borrow_vacation': 'X402', # Borrow for a vacation
    'plan_horizon': 'X3008', # which of the time periods listed on this page is most important to you
    'spend_exceeds':'X7510', # would you say that your (family's) spending exceeded your (family's) income,
    'spend_exceeds_excl_house':'X7508', # Spending exceeds, after purchase of house. 
    'wt0': 'X42001' # Weight   
}
 
# Select the columns by their SCF codes, then rename them to the sensible names
df_p = scfp[list(p_cols.values())].rename(columns={v: k for k, v in p_cols.items()})

# Derive the implicate number from the identifiers: Y1 = YY1*10 + implicate
df_p.insert(2, 'implicate_id', df_p.record_id - df_p.case_id.astype('int32')*10)

The variables from the Public Extract file are easier to use, because they already have sensible names, so they can be collected into a list.

# Extract columns we are interested in 
e_cols = ['y1',
'networth',  'income', 'nwcat', 'nwpctlecat', 
'norminc', 'ninccat',  'ninc2cat', 'nincpctlecat',
'occat1', 'occat2', 'edcl', 'lifecl', 
'famstruct', 'married',  'agecl', 'housecl', 
'racecl','racecl4',
'asset', 'liq','bond', 'fin','nfin', 'debt', 
'indcat', 'equity','homeeq', 'revpay','bnkruplast5', 
'debt2inc', 'hsaving' , 'saved' ]

# Lowercase so all cols are consistent case
scfe.columns = [c.lower() for c in scfe.columns] 

# make a sensible name for an important column
df_e = scfe[e_cols].rename(columns={'y1':'record_id'})

# Merge the public file and the extract file
df = df_p.merge(df_e, on='record_id')
df.to_csv('scf16.csv')

Weighting and Sampling

The SCF is a weighted dataset: each record represents more than one household in the US, and there are multiple columns that indicate how many households each record represents. We will focus on the wt0 column, the primary weight. As with most weighted datasets, adding up the weights for all of the records should result in the number of households in the US, about 126M. Remember that there are actually 5 implicates for each household. There are a few ways to deal with this: you can divide all of the weights by 5, or only use one of the implicates.

df.wt0.sum()/1e6/5, df[df.implicate_id==1].wt0.sum()/1e6
(125.98170068524078, 125.98170155281443)

Weighting makes basic calculations much more difficult. Means, medians and linear regressions will all require extra work to calculate properly. The equation for a weighted mean is fairly simple:

$$
\bar{x} = \frac{ \sum\limits_{i=1}^n w_i x_i}{\sum\limits_{i=1}^n w_i}
$$

To calculate the weighted mean, we multiply each value by its associated weight, sum the products, and divide by the sum of the weights. Calculating the weighted mean for NORMINC, we have:

(df.norminc*df.wt0).sum()/df.wt0.sum()
100594.40751649837

Compare this value to the mean of the unweighted NORMINC, which is about 7x larger.

df.norminc.mean()
761808.0683438

For this dataset, the unweighted mean is so much larger than the weighted mean because there are 605 households in the sample that have more than \$1M in normal income, and 85 with more than \$10M. Because money variables typically have highly skewed distributions, these extreme values significantly skew statistics. This effect is also the reason that money values are most sensibly summarized with medians rather than with means: the unweighted mean of NORMINC, at \$762,000, is larger than the normal income of 99\% of the population and 90\% of the surveyed households.
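
To make the skew concrete, here is a quick sketch that uses the weights to estimate the share of households whose normal income falls below the unweighted mean; the exact figure will depend on the survey year loaded:

# Weighted share of households with normal income below the unweighted mean
unweighted_mean = df.norminc.mean()
share_below = df.loc[df.norminc < unweighted_mean, 'wt0'].sum() / df.wt0.sum()
print(f"{share_below:.1%} of households fall below the unweighted mean income")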

Another useful trick is to rearrange the equation for the weighted mean by pre-dividing the weights by the sum of the weights. The result is that the weights sum to 1, and we can skip the final division.

df['nwt0'] = df.wt0 / df.wt0.sum() # Normalized weight

(df.norminc*df.nwt0).sum() # Simpler weighted mean of normal income. 
100594.40751649837

Additionally, because the weights sum to 1, summing the weights within a group gives us the proportion of that group in the total population. For instance, to calculate the proportions of racial groups:

df.groupby('racecl').nwt0.sum() # 1=White, 2=Non-white
racecl
1    0.647282
2    0.352718
Name: nwt0, dtype: float64

Compare this to the proportions in the unweighted statistic:

df.groupby('racecl').case_id.count()/len(df)
racecl
1    0.687772
2    0.312228
Name: case_id, dtype: float64

Since this is a survey of finances, we are going to do a lot of work with variables representing money, and as we saw previously, it is better to use the median or quantiles to summarize money variables. Unfortunately, the algorithm for weighted quantiles is much more complicated than the equation for the weighted mean, but there are several Python packages that implement weighted statistics, including:

import weightedstats as ws 
import wquantiles as wq

ws.weighted_median( df.norminc, df.wt0), wq.median(df.norminc, df.wt0)
(55593.739230327395, 55593.739230327395)

More commonly you will want to compute medians of groups. This procedure is a bit more complicated in Pandas. First, we will define some helper functions, one for each of the two packages. The weightedstats package doesn’t like NA values, so there is a bit of extra work to remove them.

import weightedstats as ws 
import wquantiles as wq

def wmedian(df, column_name, weights_name='wt0'):
    
    df = df.dropna(subset=[column_name,weights_name ])
    
    return ws.weighted_median( df[column_name], weights=df[weights_name])
    
def wmedian2(df, column_name, weights_name='wt0'):
    
    df = df.dropna(subset=[column_name,weights_name ])
    
    return wq.median( df[column_name], df[weights_name])
    
# Education class. 1=no high school diploma/GED, 2=high school diploma or GED,
# 3=some college or Assoc. degree, 4=Bachelors degree or higher;
df.groupby('edcl').apply(wmedian2, 'norminc') 
edcl
1    28353.819644
2    42530.729466
3    50631.820793
4    93162.550258
dtype: float64

Sampling the Sample

The odd way that these weighted median functions are called on groups is a bit disruptive, so it would be nice to use normal Pandas idioms instead. Fortunately we can, if we transform the weighted sample to an unweighted sample with random sampling. Using the Pandas .sample() method, we can draw a representative sample from the weighted sample and then just use regular statistics. Here, we will draw 1M samples from the weighted dataset and compute the group median incomes for EDCL.

dfs = df.sample(int(1e6),replace=True, weights=df.wt0)

dfs.groupby('edcl').norminc.median()
edcl
1    28353.819644
2    42530.729466
3    50631.820793
4    93162.550258
Name: norminc, dtype: float64

There are a few disadvantages with this technique:

  • It is a bit slow to generate the sampled dataset.
  • The sampled dataset is very large; for 1M records, it is about 60MB.
  • The results are probabilistic.

Fortunately, these are not significant disadvantages. The first two should not matter much on any modern computer, including laptops, and the third doesn’t matter much for computing medians, because the coarseness of the dataset means that nearly any large sample will return the same median for most variables. For instance, let’s compute the median of NORMINC for college-educated households with three draws at each sample size. All but one of the twelve draws return the identical value.

print([ df.sample(int(sz),replace=True, weights=df.wt0).groupby('edcl').norminc.median().loc[4] 
     for _ in range(3) for sz in (10e5, 100e5, 1e6, 10e6)])
[93162.55025847214, 93162.55025847214, 93162.55025847214, 93162.55025847214, 93162.55025847214, 93162.55025847214, 94175.1866743251, 93162.55025847214, 93162.55025847214, 93162.55025847214, 93162.55025847214, 93162.55025847214]

The means, however, have a range of values.


means=[ df.sample(int(sz),replace=True, weights=df.wt0).groupby('edcl').norminc.mean().loc[4] 
     for _ in range(3) for sz in (10e5, 100e5, 1e6, 10e6)]
plt.hist(means);

But if we average all of these different sampled values, we’ll get a result nearly identical to the directly computed weighted mean (within about $0.1\%$).

t = df[df.edcl==4]
direct_mean = (t.norminc*t.wt0).sum()/t.wt0.sum()
sampled_mean = np.mean(means)

direct_mean, sampled_mean, (direct_mean-sampled_mean)/direct_mean
(182595.659126974, 182750.25484013426, -0.0008466560152602499)

Using the sampling technique makes nearly all of the work with a weighted dataset much easier, and for sufficiently large samples, it returns nearly identical results. It also greatly simplifies other techniques, such as regression, as the sketch below shows. In future tutorials, we will use a sampling method for all exploratory work and weighted calculations for final, publishable results or for cases where the difference is negligible.
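
For example, here is a minimal sketch of a regression run on the resampled dataframe dfs; it assumes the statsmodels package, which is not otherwise used in this tutorial, and an ordinary, unweighted OLS fit is all that is needed because the sample was drawn in proportion to the weights:

import statsmodels.formula.api as smf

# Unweighted OLS on the resampled records; edcl is treated as a categorical variable
positive = dfs[dfs.norminc > 0]
model = smf.ols('np.log(norminc) ~ C(edcl) + agecl', data=positive).fit()
print(model.params)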

References

  • 1 Hanna, Sherman D., Kyoung Tae Kim, and Suzanne Lindamood. “Behind the Numbers: Understanding the Survey of Consumer Finances.” SSRN Scholarly Paper. Rochester, NY: Social Science Research Network, June 19, 2018. https://papers.ssrn.com/abstract=3199293.
  • 2 Lindamood, Suzanne, Sherman D. Hanna, and Lan Bi. “Using the Survey of Consumer Finances: Some Methodological Considerations and Issues.” The Journal of Consumer Affairs 41, no. 2 (2007): 195–214. https://www.jstor.org/stable/23860056.