1 Introduction



This document provides a fully reproducible workflow for creating the 2022 Reproducible Area Deprivation Index (ReADI) for U.S. census block groups, tracts, and counties. ReADI is a corrected and updated implementation of the Area Deprivation Index that addresses several limitations of the existing Neighborhood Atlas ADI (NA-ADI), while preserving the core goal of summarizing socioeconomic deprivation across small areas.

Broadly, this workflow:

  1. Retrieves 2018–2022 5-year American Community Survey (ACS) data at county, tract, and block group levels.
  2. Cleans and harmonizes the ACS variables required to construct deprivation components.
  3. Imputes missing values using spatial information (neighboring areas) where appropriate.
  4. Constructs a set of standardized component variables that represent different dimensions of deprivation (income, education, employment, housing, etc.).
  5. Fits a one-factor, population-weighted factor model and uses the resulting factor scores as the basis for ReADI.
  6. Converts ReADI scores to a 0–100 national rank, with higher values indicating greater deprivation.
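
The final ranking step (6) can be sketched in R as follows; the scores here are made-up placeholders, not actual ReADI values:

```r
library(dplyr)

# Hypothetical raw factor scores for five areas
readi_raw <- c(-1.2, 0.3, 2.1, -0.5, 1.0)

# percent_rank() returns values in [0, 1]; scale to a 0-100 national rank,
# with higher values indicating greater deprivation
readi_rank <- percent_rank(readi_raw) * 100
```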

The primary goal of this document is transparency and accountability: to make every step of ReADI construction visible to the field so that it can be scrutinized, improved, and iteratively refined into the most accurate and informative measure possible. In addition, the code is written so that users can, if they wish, rerun the full workflow to regenerate ReADI or adapt individual pieces of the code for their own analyses.

1.1 Overview


The original Area Deprivation Index (ADI), described by Singh (2003), is a factor-based deprivation index that combines 17 measures of poverty, education, housing, and employment drawn from U.S. Census data into a single summary of socioeconomic context for a census-based region, at relatively small area levels.

The Neighborhood Atlas ADI (NA-ADI) (Kind et al. 2018) is a widely used operationalization of the ADI and has become the de facto standard deprivation index in many U.S. health studies. However, it has several practical limitations for ongoing analytic work, including fixed historical coefficients, lack of regular updating to contemporary ACS data, and limited transparency around how certain decisions were made (e.g., standardization, thresholds, and handling of missing data).

The following subsections summarize the main conceptual issues with NA-ADI and how ReADI addresses each of them in practice.

1.1.1 Singh Methods


SINGH (2003)

1. Singh took 21 socioeconomic indicators and performed factor analysis on them using 1990 census data. This yielded two factors explaining 43% and 17% of the variance, respectively. Seventeen measures clustered with large loadings (>0.45) on the first factor, so these 17 were retained and the factor analysis was re-run with a one-factor solution, producing the factor loadings listed in the table below. This single factor accounted for 52% of the variance in the data.

2. The factor score coefficients were used to weight the 17 indicators comprising the index.

3. The factor scale was transformed into a standardized index by arbitrarily setting the index mean and standard deviation at 100 and 20, respectively.

4. The tract index scores were averaged to allow computation of index scores for each of the 3097 US counties.

5. Higher index scores denote higher levels of deprivation.
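
Steps 3 and 5 can be sketched in R as follows (assuming `factor_scores` holds the weighted scores from step 2; the values here are simulated placeholders):

```r
# Placeholder factor scores; in Singh's method these come from applying
# the factor score coefficients to the 17 standardized indicators
set.seed(42)
factor_scores <- rnorm(100)

# Rescale to the standardized index: mean 100, SD 20, higher = more deprived
adi <- 100 + 20 * as.numeric(scale(factor_scores))
```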


1.1.2 Kind Methods


Version 1 (5/1/2018)

1. Kind et al. (2014) used Singh (2003) methods, using the 17 indicators from 2000 census data and tract level factor score coefficients from Singh (2003).

2. The 17 US Census variables were multiplied by their factor weights and then summed for each geographic unit.

3. The result is then transformed into a standardized index (the ADI) by arbitrarily setting the index mean at 100 and standard deviation at 20.

4. Using this approach, neighborhoods with higher ADI scores have higher levels of deprivation. All coefficients are multiplied by -1 to ease interpretation (greater ADI means a greater disadvantage).


Version 2 (5/1/2019)

Every step in the methodology is the same as v1 except:

1. CBGs were suppressed if they had fewer than 100 people, fewer than 30 housing units, or more than 33% of the population living in group quarters, based on Diez et al. (2001).


Version 3.0 (11/19/2020)

Every step in the methodology is the same as v2 except:

1. Two methodological changes were implemented to address missing Census Block Group–level data and survey errors in the underlying ACS data: a) additional suppression of a small number of CBGs with survey errors acknowledged by the US Census Bureau, and b) imputation of missing ACS items using a geographically nested imputation methodology, consistent with other area deprivation indices such as the English Indices of Multiple Deprivation (IMD) and the Scottish Indices of Multiple Deprivation (SIMD).


Version 3.1 (7/14/2021)

Every step in the methodology is the same as v3.0 except:

1. Suppression criteria were updated to use population, housing, and group quarters estimates from the ACS in place of the decennial Census. This adjusts the ADI of approximately 70 block groups nationally.


Version 3.2 (9/16/2022)

Every step in the methodology is the same as v3.1 except:

1. The No Phone variable was removed from the ADI and replaced with a No Household Internet variable to reflect the change in household phone and internet service.


Version 4.0.0 (7/10/2023)

Every step in the methodology is the same as v3.2 except:

1. The v4 ADI includes minor shrinkage-based statistical updates to mitigate the effect of year-to-year sampling variation in block group–level component estimates within American Community Survey (ACS) data. This results in very little actual change in ADI rankings but buffers against known and expected future variation in ACS source data.


Version 4.0.1 (9/9/2023)

Every step in the methodology is the same as v4.0.0 except:

1. The v4.0.1 ADI includes minor refinements to the shrinkage updates to improve the precision of the block group–level component estimates derived from the American Community Survey (ACS) data. This results in slight shifts in block group rankings from the previous version while retaining the general pattern and improving the overall temporal stability of the ADI. This version is released for census block group and nine-digit ZIP code centroid geographies.


Factor Analysis Results 1990 Census Data (Singh 2003)

Census Variable                                              | Tract Score Coefficient | Tract Loading | County Loading
Population aged ≥25 y with <9 y of education, %              |  0.0849 |  0.7498 |  0.7885
Population aged ≥25 y with at least a high school diploma, % | -0.0970 | -0.8562 | -0.8231
Employed persons aged ≥16 y in white-collar occupations, %   | -0.0874 | -0.7721 | -0.6890
Median family income, $                                      | -0.0977 | -0.8629 | -0.9218
Income disparity\(^a\)                                       |  0.0936 |  0.8262 |  0.8827
Median home value, $                                         | -0.0688 | -0.6074 | -0.6740
Median gross rent, $                                         | -0.0781 | -0.6896 | -0.7876
Median monthly mortgage, $                                   | -0.0770 | -0.6795 | -0.7812
Owner-occupied housing units, %                              | -0.0615 | -0.5431 | -0.4408
Civilian labor force population aged ≥16 y unemployed, %     |  0.0806 |  0.7117 |  0.5679
Families below poverty level, %                              |  0.0977 |  0.8623 |  0.8796
Population below 150% of the poverty threshold, %            |  0.1037 |  0.9157 |  0.9266
Single-parent households with children aged <18 y, %         |  0.0719 |  0.6346 |  0.3329
Households without a motor vehicle, %                        |  0.0694 |  0.6126 |  0.4549
Households without a telephone, %                            |  0.0877 |  0.7748 |  0.7830
Occupied housing units without complete plumbing, % (log)    |  0.0510 |  0.4505 |  0.6392
Households with more than 1 person per room, % (crowding)    |  0.0556 |  0.4910 |  0.4018

\(^a\) Income disparity in 1990 was defined as the log of 100 × the ratio of the number of households with income <$10,000 to the number of households with income ≥$50,000.


1.2 NA-ADI Issues


Several limitations of the NA-ADI as implemented by Kind et al. (2014) and distributed via the Neighborhood Atlas have important implications for current analytic work. In the tabs below, we outline four key issues: standardization, use of historical versus concurrent coefficients, updating variable thresholds, and population size adjustment. We also describe how ReADI addresses each of them. This section is meant to make those decisions explicit and open to scrutiny so that the index can be transparent and iteratively improved.


1.2.1 Standardization

Problem

The most consequential technical problem with the NA-ADI is the lack of appropriate standardization when applying Singh’s factor loadings to new ACS data. This cannot be definitively confirmed because no code for the Neighborhood Atlas ADI has been published, but Petterson (2023) was able to reverse engineer the index to an R² ≥ 0.999 simply by standardizing the component variables before applying Singh’s published weights. This level of agreement strongly supports the conclusion that a missing standardization step is the main source of the discrepancy in the NA-ADI implementation.

Factor analysis assumes variables are on a comparable scale, and standard practice in software such as SAS PROC FACTOR, Stata, and R is to standardize variables internally before estimating loadings. This is especially important here because the ADI components are on very different scales: most are proportions between 0 and 1, while four are dollar-valued medians ranging from a few hundred dollars (median rents) to over a million dollars (median home values). Without standardization, variables with larger numeric ranges dominate the factor solution. Based on Singh’s description of using the v8.1 SAS/STAT FACTOR procedure and the documentation for that procedure, it is highly likely that the original ADI loadings were estimated on standardized variables, with that step handled automatically by the software.

In contrast, the NA-ADI implementation did not re-fit the factor model but instead took Singh’s published loadings and applied them directly to raw ACS variables. Because those new variables were not first standardized in the same way as in the original factor analysis, the resulting index is mathematically inconsistent with the model that produced the loadings, leading to an overemphasis of components such as median home value.
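
A toy example (all numbers invented for illustration) shows why the missing step matters: applying fixed loadings to raw versus standardized components gives structurally different indices when one component is dollar-valued.

```r
# Two components on very different scales
poverty  <- c(0.10, 0.25, 0.40)       # proportion of families in poverty
home_val <- c(150000, 90000, 60000)   # median home value, $

w <- c(0.10, -0.07)                   # illustrative fixed loadings

# Raw application: the dollar-valued component swamps the proportion
raw_index <- w[1] * poverty + w[2] * home_val

# Standardized application: both components contribute on a common scale
std_index <- w[1] * as.numeric(scale(poverty)) +
             w[2] * as.numeric(scale(home_val))

# raw_index is almost perfectly (negatively) correlated with home value alone
cor(raw_index, home_val)
```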

Solution

In ReADI, we fit the factor model separately for each geography (county, tract, block group) and ACS period, rather than reusing historical loadings. The factor analysis is implemented in R (using psych::fa), which operates on the correlation matrix of the input variables and therefore centers and scales them internally. As a result, the component variables are placed on a common standardized scale as part of the model fitting, and we can directly use the resulting factor scores from that function as the raw ReADI values for that geography and year.


1.2.2 Concurrent & Geographically Consistent Coefficients

Problem

Kind et al. (2014) report in their Table 1 that the coefficients (factor loadings) used for the NA-ADI are identical to those published by Singh (2003) based on 1990 census data. This implies that all subsequent NA-ADI vintages reuse the same 1990-era weights, even when applied to much more recent ACS data. Given substantial changes in the socioeconomic distribution over time, it is unlikely that factor loadings estimated on 1990 data remain appropriate for contemporary indices. Notably, when Singh evaluated temporal stability, he reconstructed the ADI using 1970 census data and recomputed the factor loadings for that decade, rather than reusing the 1990 coefficients. In other words, Singh’s own validation work did not rely on a single fixed set of weights across time.

In addition, the original ADI factor analysis was conducted at the census tract level, yet the Neighborhood Atlas NA-ADI is distributed and promoted primarily at the census block group level. This means tract-level loadings are being applied to block group–level variables, which assumes that the factor structure is invariant across spatial scale. The Neighborhood Atlas documentation also discourages users from aggregating NA-ADI to higher levels such as tracts or counties, even though its own coefficients were derived at the tract level. Taken together, this raises concerns about both temporal and geographic inconsistency in the NA-ADI weights.

Solution

In ReADI, we re-estimate the factor model for each ACS period and for each geographic level (county, tract, and block group) using the corresponding contemporary data. For every combination of year and geography, we fit a one-factor model in R (using psych::fa) with the chosen set of component variables and population-based weights. The resulting loadings and factor scores are therefore concurrent with the data to which they are applied and are specific to the spatial scale at which the index is used. This avoids relying on decades-old, tract-derived coefficients for block group–level analyses and ensures that ReADI reflects current relationships between the component variables and deprivation within each geography.
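
A minimal sketch of fitting one such model with psych::fa follows; the data are simulated with a single latent factor, whereas in the real workflow the columns are the ReADI component variables and the weights are log population:

```r
library(psych)

set.seed(1)
n <- 200
latent <- rnorm(n)   # simulated underlying deprivation

# Hypothetical component variables loading on one common factor
dat <- data.frame(
  pov   = latent + rnorm(n, sd = 0.5),
  unemp = latent + rnorm(n, sd = 0.5),
  no_hs = latent + rnorm(n, sd = 0.5))

popwt <- log(sample(500:50000, n, replace = TRUE))  # analytic weights

# One-factor model; fa() works from the (weighted) correlation matrix,
# so the components are effectively standardized during fitting
fit <- fa(dat, nfactors = 1, fm = "minres",
          scores = "regression", weight = popwt)

readi_raw <- as.numeric(fit$scores)   # one factor score per area
```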


1.2.3 Updating Variable Thresholds

Problem

Kind et al. (2014) list in their appendix the same variables and thresholds that Singh (2003) used with 1990 census data. As a result, most NA-ADI vintages rely on variable definitions and cut points that reflect socioeconomic conditions more than 30 years old. This is problematic for several reasons. Education thresholds based on less than 9 years of schooling and “higher education” defined as high school graduation are likely to underestimate educational deprivation in 2022, when completing high school is far more common and a bachelor’s degree is often a more appropriate benchmark. The income disparity bands used in the original ADI also do not account for changes in the income distribution or inflation (for example, very low income defined as less than $10,000 and high income as at least $50,000). Although the Neighborhood Atlas team has more recently updated the “telephone” variable to reflect internet access, many of the other original thresholds have not been modernized. Importantly, when Singh evaluated temporal stability using 1970 census data, he redefined at least the income disparity thresholds, which suggests that the original intent was to update these definitions over time rather than freeze them permanently.

Solution

In ReADI, we update key variable definitions and thresholds to better reflect contemporary socioeconomic conditions while preserving the underlying constructs. Specifically:

  • The original “percent of the population aged ≥ 25 years with less than 9 years of education” is replaced with “percent of the population aged ≥ 25 years with less than a high school diploma.”

  • The original “percent of the population aged ≥ 25 years with less than a high school diploma” is replaced with “percent of the population aged ≥ 25 years with less than a bachelor’s degree.”

  • The original income disparity measure

    \(\text{Income Disparity} = \log\left(100 \times \frac{\text{Households with income} < \$10{,}000}{\text{Households with income} \geq \$50{,}000}\right)\)

    is updated to

    \(\text{Income Disparity} = \log\left(100 \times \frac{\text{Households with income} < \$20{,}000}{\text{Households with income} \geq \$100{,}000}\right)\)

which better reflects the current income distribution.
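
Using the ACS household income table B19001 (pulled later in this document), the updated measure can be computed directly from the band counts; the counts below are invented for illustration:

```r
# Toy B19001 counts for one area
# <$20,000:  B19001_002 (<$10k) + B19001_003 ($10-14.9k) + B19001_004 ($15-19.9k)
under_20k <- 120 + 80 + 60
# >=$100,000: B19001_014 + B19001_015 + B19001_016 + B19001_017
over_100k <- 90 + 50 + 40 + 30

income_disparity <- log(100 * under_20k / over_100k)
```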


1.2.4 Population Size Adjustment

Problem

Counties, census tracts, and census block groups have highly variable and skewed population sizes. Some units have only a handful of residents, while others contain thousands (and counties can range from a few hundred to several million people). Without accounting for these differences, small areas can contribute as much to the factor solution as large areas, which can make estimates unstable and distort comparisons both within and across geographic levels.

Solution

To account for this, we transform population size using the natural log of the total population and use this transformed measure as an analytic weight in the factor analysis. In practice, we compute a log population term (POPWT) and pass it as the weight argument to the factor analysis function. This gives more influence to more populous areas when estimating loadings and factor scores, while still producing ReADI values for each individual geographic unit.
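
In code, the weight is simply the log of total population (ACS variable B01003_001); toy values shown:

```r
# Toy total-population values for four areas (ACS B01003_001)
total_pop <- c(850, 1200, 5400, 32000)

# Log transform compresses the skewed population distribution
POPWT <- log(total_pop)

# POPWT is later passed as the `weight` argument to psych::fa()
```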


2 ReADI Preparation


Libraries

library(tidyverse)   # data manipulation and plotting
library(censusapi)   # Census API access
library(reshape2)    # dcast() for wide reshaping
library(plyr) 
library(SciViews) 
library(readxl) 
library(tidycensus)  # get_acs()
library(psych)       # fa() factor analysis
options(tigris_use_cache = TRUE)
library(tigris)      # TIGER/Line geographies
library(spdep)       # spatial neighbor structures for imputation
library(statar)
library(dbplyr)
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.7.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Vancouver
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] dbplyr_2.5.0      statar_0.7.6      spdep_1.3-11      sf_1.0-21        
##  [5] spData_2.3.4      tigris_2.2.1      psych_2.5.3       tidycensus_1.7.1 
##  [9] readxl_1.4.5      SciViews_0.9-13.2 plyr_1.8.9        reshape2_1.4.4   
## [13] censusapi_0.9.0   lubridate_1.9.4   forcats_1.0.0     stringr_1.5.1    
## [17] dplyr_1.1.4       purrr_1.0.4       readr_2.1.5       tidyr_1.3.1      
## [21] tibble_3.2.1      ggplot2_4.0.0     tidyverse_2.0.0  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1   farver_2.1.2       S7_0.2.0           fastmap_1.2.0     
##  [5] digest_0.6.37      timechange_0.3.0   lifecycle_1.0.4    magrittr_2.0.3    
##  [9] compiler_4.5.0     rlang_1.1.6        sass_0.4.10        tools_4.5.0       
## [13] yaml_2.3.10        corrplot_0.95      data.table_1.17.8  knitr_1.50        
## [17] sp_2.2-0           classInt_0.4-11    mnormt_2.1.1       xml2_1.3.8        
## [21] RColorBrewer_1.1-3 KernSmooth_2.23-26 withr_3.0.2        Metrics_0.1.4     
## [25] grid_4.5.0         e1071_1.7-16       scales_1.4.0       dichromat_2.0-0.1 
## [29] cli_3.6.5          ellipse_0.5.0      rmarkdown_2.29     crayon_1.5.3      
## [33] generics_0.1.4     rstudioapi_0.17.1  httr_1.4.7         tzdb_0.5.0        
## [37] DBI_1.2.3          cachem_1.1.0       proxy_0.4-27       rvest_1.0.4       
## [41] parallel_4.5.0     s2_1.1.9           cellranger_1.1.0   MBESS_4.9.41      
## [45] vctrs_0.6.5        boot_1.3-31        jsonlite_2.0.0     hms_1.1.3         
## [49] jquerylib_0.1.4    units_0.8-7        glue_1.8.0         stringi_1.8.7     
## [53] gtable_0.3.6       deldir_2.0-4       pillar_1.10.2      rappdirs_0.3.3    
## [57] htmltools_0.5.8.1  R6_2.6.1           wk_0.9.4           evaluate_1.0.3    
## [61] lattice_0.22-7     bslib_0.9.0        class_7.3-23       Rcpp_1.0.14       
## [65] uuid_1.2-1         nlme_3.1-168       xfun_0.52          pkgconfig_2.0.3



2.1 Retrieving 2018-2022 5-Year ACS



Pulling Variables Required


In this section we obtain the ACS 2018–2022 5-year estimates for all variables used to construct ReADI. We work at three geographic levels: counties, census tracts, and census block groups. Definitions of these geographic units are provided in the U.S. Census Bureau documentation.

Accessing ACS data via the Census API requires an API key, which can be requested from the Census Bureau. Once you receive your key, install it in your R environment as the CENSUS_API_KEY environment variable (for example, by adding it to ~/.Renviron or a project .Renviron) and do not hard-code the key directly in shared code.
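
One way to install the key without hard-coding it is tidycensus's helper, which writes it to ~/.Renviron (the key string below is a placeholder):

```r
library(tidycensus)

# Writes CENSUS_API_KEY="..." to ~/.Renviron so it persists across sessions;
# replace the placeholder with your actual key
census_api_key("YOUR_KEY_HERE", install = TRUE)

# In a fresh R session the key is then available via
Sys.getenv("CENSUS_API_KEY")
```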

The code that follows pulls the required ACS variables for counties, census tracts, and census block groups using the Census API and saves them as R objects for use in later steps.

Below are the final numbers of areas at each geography:

Area   | Amount
CBG    | 242,336
CT     | 85,396
County | 3,222

if (!nzchar(Sys.getenv("CENSUS_API_KEY"))) {
  stop(
    "Census API key not found.\n",
    "Please set the CENSUS_API_KEY environment variable (e.g. in .Renviron) ",
    "before running this document.\n")
}

states <- setdiff(unique(fips_codes$state_code), c("60", "66", "69", "74", "78")) # Removing US territories

ADI_Vars <- c(
       "B01003_001",
       "B19013_001",
       "B19001_002", "B19001_003", "B19001_004", "B19001_014", "B19001_015", "B19001_016", "B19001_017", 
       "B17017_002", "B17017_001", 
       "C17002_005", "C17002_004", "C17002_003", "C17002_002", "C17002_001", 
       "B15003_001", "B15003_002", "B15003_003", "B15003_004", "B15003_005", "B15003_006", "B15003_007", "B15003_008", "B15003_009", "B15003_010", "B15003_011", "B15003_012", "B15003_013", "B15003_014", "B15003_015", "B15003_016", "B15003_022", "B15003_023", "B15003_024", "B15003_025",
       "C24010_003", "C24010_027", "C24010_039", "C24010_063", "C24010_001",
       "B23025_005", "B23025_003",
       "B25077_001", 
       "B25064_001", 
       "B25088_002", 
       "B25003_001", "B25003_002",
       "B11012_010", "B11012_015", "B11012_001", 
       "B25044_003", "B25044_010", "B25044_001", 
       "B28002_013", "B28002_001",
       "B25049_007", "B25049_004", "B25049_001", 
       "B25014_005", "B25014_006", "B25014_007", "B25014_011", "B25014_012", "B25014_013", "B25014_001")
  
# CBG level
ACS_CBG <- do.call(rbind, lapply(states, function(st) {
  get_acs(
    year = 2022,
    geography = "block group",
    state = st,
    key = Sys.getenv("CENSUS_API_KEY"),
    variables = ADI_Vars)
}))
## Getting data from the 2018-2022 5-year ACS
ACS_CBG <- dcast(ACS_CBG, GEOID + NAME ~ variable, value.var=c("estimate"))
ReADI_2022_CBG_Raw <- ACS_CBG %>% separate(NAME, c("Block_Group", "Tract", "County", "State"), sep = ";") 

# census tract level
ACS_CT <- do.call(rbind, lapply(states, function(st) {
  get_acs(
    year = 2022,
    geography = "tract",
    state = st,
    key = Sys.getenv("CENSUS_API_KEY"),
    variables = ADI_Vars)
}))
## Getting data from the 2018-2022 5-year ACS
ACS_CT <- dcast(ACS_CT, GEOID + NAME ~ variable, value.var=c("estimate"))
ReADI_2022_CT_Raw <- ACS_CT %>% separate(NAME, c("Tract", "County", "State"), sep = ";") 

# county level
ACS_C <- do.call(rbind, lapply(states, function(st) {
  get_acs(
    year = 2022,
    geography = "county",
    state = st,
    key = Sys.getenv("CENSUS_API_KEY"),
    variables = ADI_Vars)
}))
## Getting data from the 2018-2022 5-year ACS
ACS_C <- dcast(ACS_C, GEOID + NAME ~ variable, value.var=c("estimate"))
ReADI_2022_C_Raw <- ACS_C %>% separate(NAME, c("County", "State"), sep = ",") 



Pulling Geography Measures


In addition to the ACS variables, we retrieve the corresponding spatial geometries for each geography (counties, census tracts, and census block groups). These geometry objects are required for both the spatial imputation procedures and for any visualizations that involve mapping ReADI values. The code in this section downloads and stores the geometry data that will be linked to the ACS variables in later steps.

CBG_Geo <- do.call(rbind, lapply(states, function(st) {
  get_acs(
    year = 2022,
    geography = "block group",
    geometry = TRUE,
    state = st,
    key = Sys.getenv("CENSUS_API_KEY"),
    variables = "B01003_001")
}))
## Getting data from the 2018-2022 5-year ACS
CBG_Geo <- CBG_Geo[CBG_Geo$estimate != 0, ]  # drop zero-population block groups
ACS_CBG_2022_Geometry <- CBG_Geo[, c("GEOID", "geometry")]

CT_Geo <- do.call(rbind, lapply(states, function(st) {
  get_acs(
    year      = 2022,
    geography = "tract",
    geometry  = TRUE,
    state     = st,
    key       = Sys.getenv("CENSUS_API_KEY"),
    variables = "B01003_001")
}))
## Getting data from the 2018-2022 5-year ACS
CT_Geo <- CT_Geo[CT_Geo$estimate != 0, ]  # drop zero-population tracts
ACS_CT_2022_Geometry <- CT_Geo[, c("GEOID", "geometry")]

C_Geo <- do.call(rbind, lapply(states, function(st) {
  get_acs(
    year      = 2022,
    geography = "county",
    geometry  = TRUE,
    state     = st,
    key       = Sys.getenv("CENSUS_API_KEY"),
    variables = "B01003_001")
}))
## Getting data from the 2018-2022 5-year ACS
C_Geo <- C_Geo[C_Geo$estimate != 0, ]  # drop zero-population counties
ACS_C_2022_Geometry <- C_Geo[, c("GEOID", "geometry")]
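Because the three geometry downloads above are the slowest step in the workflow, it can be worth caching the results to disk so that reruns skip the ACS calls entirely. The sketch below shows one simple pattern using base R's saveRDS()/readRDS(); the cache path and the stand-in builder are illustrative only, and in practice the builder would wrap the get_acs() calls above.

```r
# Cache an expensive download so reruns read it from disk instead.
# The path and stand-in data frame here are illustrative, not part of ReADI.
cache_or_load <- function(path, build) {
  if (file.exists(path)) return(readRDS(path))
  obj <- build()
  saveRDS(obj, path)
  obj
}

path <- file.path(tempdir(), "geo_cache.rds")
geo  <- cache_or_load(path, function() data.frame(GEOID = c("01001", "01003")))
# Second call hits the cache; the builder is never evaluated:
geo2 <- cache_or_load(path, function() stop("builder should not run on cache hit"))
```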



2.2 Group Quarters & Small Populations



The Neighborhood Atlas ADI (NA-ADI) applies several suppression rules, excluding block groups with (1) ≥ 33% of residents in group quarters, (2) a population below 100 or fewer than 30 housing units, or (3) missing key ACS inputs (“QDI”), following guidance from Roux et al. (2001). These criteria remove a small fraction of areas nationally but disproportionately affect settings such as prisons, nursing homes, college dormitory tracts, and very small rural communities.

In ReADI, we do not automatically exclude these areas. Although estimates for group-quarters–heavy or very small-population units are inherently less precise, the factor analysis in ReADI is population-weighted, which substantially downweights their influence on the estimation of loadings and scores while still allowing these units to receive values. In many applications, these areas are themselves important to evaluate, and suppressing them by default can obscure relevant patterns.

Because we provide the code, raw factor scores and the nationally ranked ReADI scores, users who prefer to exclude or analyze these units separately can do so easily based on their own analytic requirements. This approach maximizes transparency and flexibility while minimizing the impact of imprecise units on the construction of the index itself.
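For users who do prefer NA-ADI-style suppression, the three rules above are straightforward to reimpose downstream. The sketch below assumes hypothetical column names (pct_group_quarters, population, housing_units); substitute whatever the merged dataset actually contains.

```r
# Sketch of reimposing NA-ADI-style suppression on the released ReADI data.
# Column names (pct_group_quarters, population, housing_units) are
# hypothetical stand-ins, not the actual ReADI output columns.
suppress_na_adi <- function(dat) {
  keep <- dat$pct_group_quarters < 1/3 &
          dat$population >= 100 &
          dat$housing_units >= 30
  dat[keep, , drop = FALSE]
}

toy <- data.frame(
  GEOID              = c("A", "B", "C"),
  pct_group_quarters = c(0.05, 0.50, 0.10),  # "B" is group-quarters heavy
  population         = c(500, 400, 80),      # "C" has population < 100
  housing_units      = c(200, 150, 40))
kept <- suppress_na_adi(toy)
kept  # only GEOID "A" survives all three rules
```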



2.3 Population Size



Some geographic units have no residents or correspond to areas such as open water, uninhabited land, or industrial zones. Because these units do not meaningfully contribute to the construction of a deprivation index and can interfere with both imputation and factor analysis, we remove them prior to processing. The table below summarizes the number of remaining units at each geographic level after this basic population filter is applied.

Area    Pop = 0  Remaining
CBG     2,164    240,172
CT      857      84,539
County  0        3,222
ReADI_2022_CBG_Raw <- ReADI_2022_CBG_Raw[ReADI_2022_CBG_Raw$B01003_001 != 0, ]
ReADI_2022_CT_Raw  <- ReADI_2022_CT_Raw[ReADI_2022_CT_Raw$B01003_001 != 0, ]
ReADI_2022_C_Raw   <- ReADI_2022_C_Raw[ReADI_2022_C_Raw$B01003_001 != 0, ]
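The filter pattern above can be checked on toy data: rows whose total-population estimate (B01003_001) is zero are dropped before any further processing.

```r
# Toy check of the zero-population filter: drop rows with B01003_001 == 0
# and count how many rows are removed vs retained.
toy  <- data.frame(GEOID = c("A", "B", "C"), B01003_001 = c(120, 0, 45))
kept <- toy[toy$B01003_001 != 0, ]
c(dropped = sum(toy$B01003_001 == 0), remaining = nrow(kept))  # 1 dropped, 2 remain
```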



2.4 Missing Values



Table 1 below summarizes the extent of missing data for variables with incomplete information at each geographic level. Only a subset of components have any missing values, and for most variables the proportion of missingness is small.

Variable CBG N (%) CT N (%) C N (%)
B19013_001 15042 (6) 724 (1) 1 (<1)
B25064_001 65920 (27) 4392 (5) 10 (<1)
B25077_001 21260 (9) 2344 (3) 3 (<1)
B25088_002 29910 (12) 2796 (3) 6 (<1)


Table 2 lists the 12 counties with missing values on at least one component variable. Nearly all are counties with very small populations, which appears to be the primary driver of ACS nonreporting. Kalawao County in Hawaii is a notable example: it is an isolated, historically protected community with a very small resident population, and several ACS estimates are suppressed or unavailable for it.

County, ST Population B19013_001 B25064_001 B25077_001 B25088_002
Alpine County, CA 1,515 $101,125 NA $463,900 $2,401
Kalawao County, HI 50 $87,813 $1,297 NA NA
Owsley County, KY 4,054 $32,844 NA $78,500 $947
Cameron Parish, LA 5,447 $69,847 NA $177,600 $1,270
Buffalo County, SD 1,859 $42,917 $575 $103,800 NA
Borden County, TX 686 $80,625 NA $104,200 NA
Glasscock County, TX 1,068 $112,188 NA $258,300 $859
Kenedy County, TX 116 $45,455 NA NA NA
King County, TX 216 $59,375 NA NA NA
Loving County, TX 96 NA NA $218,800 NA
Terrell County, TX 862 $52,813 NA $119,800 $1,274
Daggett County, UT 638 $61,250 NA $252,400 $1,238

Across geographies, virtually all systematic missingness occurs in the four currency variables that are central to ADI construction (median family income, median home value, median rent, and median mortgage). In other words, the variables at the core of the standardization issues in the NA-ADI are also the ones that require imputation in ReADI. This concentration of missingness makes it especially important to handle imputation carefully, as decisions for these variables directly affect the resulting deprivation scores.

# By Variables
na_stats <- function(dat, vars) {
  n_miss <- colSums(is.na(dat[, vars, drop = FALSE]))
  perc   <- round(n_miss / nrow(dat) * 100, 2)
  list(N = n_miss, P = perc)
}

vars_cbg <- names(which(colSums(is.na(ReADI_2022_CBG_Raw)) > 0))

stopifnot(
  identical(vars_cbg, names(which(colSums(is.na(ReADI_2022_CT_Raw)) > 0))),
  identical(vars_cbg, names(which(colSums(is.na(ReADI_2022_C_Raw)) > 0))))

cbg <- na_stats(ReADI_2022_CBG_Raw, vars_cbg)
ct  <- na_stats(ReADI_2022_CT_Raw,  vars_cbg)
cty <- na_stats(ReADI_2022_C_Raw,   vars_cbg)  # "cty", not "c", to avoid masking base::c()

## combine into one data.frame
NA_Vars <- data.frame(
  Variable    = vars_cbg,
  CBG_N       = cbg$N,
  Perc_NA_CBG = cbg$P,
  CT_N        = ct$N,
  Perc_NA_CT  = ct$P,
  C_N         = cty$N,
  Perc_NA_C   = cty$P,
  row.names   = vars_cbg,
  check.names = FALSE)

# Missing by county
CountiesMissing <- ReADI_2022_C_Raw[rowSums(is.na(ReADI_2022_C_Raw)) > 0,]

# By Geo Area
NA_CBG_GEOs <- ReADI_2022_CBG_Raw[rowSums(is.na(ReADI_2022_CBG_Raw)) > 0,]
NA_CBG_GEOs <- as.data.frame(rowSums(is.na(NA_CBG_GEOs))) 
table(NA_CBG_GEOs[1]) # 4 max
## rowSums(is.na(NA_CBG_GEOs))
##     1     2     3     4 
## 73696 19851  3902  1757
NA_CT_GEOs <- ReADI_2022_CT_Raw[rowSums(is.na(ReADI_2022_CT_Raw)) > 0,]
NA_CT_GEOs <- as.data.frame(rowSums(is.na(NA_CT_GEOs))) 
table(NA_CT_GEOs[1]) # 4 max
## rowSums(is.na(NA_CT_GEOs))
##    1    2    3    4 
## 5440 1418  104  417
NA_C_GEOs <- ReADI_2022_C_Raw[rowSums(is.na(ReADI_2022_C_Raw)) > 0,]
NA_C_GEOs <- as.data.frame(rowSums(is.na(NA_C_GEOs))) 
table(NA_C_GEOs[1]) # 4 max
## rowSums(is.na(NA_C_GEOs))
## 1 2 3 
## 7 2 3
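The na_stats() helper above can be sanity-checked on a toy data frame; the function is reproduced in the snippet so that it runs standalone.

```r
# Standalone check of na_stats(): per-variable missing-value counts and
# percentages (function reproduced from above so this snippet is self-contained).
na_stats <- function(dat, vars) {
  n_miss <- colSums(is.na(dat[, vars, drop = FALSE]))
  perc   <- round(n_miss / nrow(dat) * 100, 2)
  list(N = n_miss, P = perc)
}

toy <- data.frame(x = c(1, NA, 3, NA), y = c(NA, 2, 3, 4))
res <- na_stats(toy, c("x", "y"))
res$N  # x: 2 missing, y: 1 missing
res$P  # x: 50, y: 25
```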



2.4.1 Imputation



We use spatial imputation to address missing values in a small subset of key variables before constructing ReADI. For each geography (block group, tract, county), missing values are imputed using neighboring units defined by queen contiguity within state (polygons that share a border or a vertex).

In brief, the procedure:

  1. Joins the variable of interest to its corresponding geometry within each state.
  2. Builds a queen-contiguity neighbor list using spdep::poly2nb().
  3. Iteratively replaces missing values with the median of their neighbors’ values.
  4. Stops when no further values can be filled in using this rule.

Newly imputed values are allowed to contribute to subsequent iterations, so clusters of missingness are gradually filled in from their observed borders inward. Units that remain missing after this process are typically geographically isolated (for example, islands, national parks, or small federal facilities) or have neighbors that are all missing for that variable.
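The fill rule can be illustrated on a toy one-dimensional "chain" of units. This sketch hand-builds the neighbor list rather than deriving it from polygons with spdep::poly2nb(), but the iterative neighbor-median logic matches the steps above.

```r
# Five units in a chain (each unit neighbors the units adjacent to it); units
# 4 and 5 start missing. Unit 4 fills from units 3 and 5 in the first pass,
# then unit 5 fills from the newly imputed unit 4 in the second pass.
nb  <- list(2, c(1, 3), c(2, 4), c(3, 5), 4)  # hand-built neighbor list
est <- c(10, 12, 14, NA, NA)

repeat {
  missing_pre <- sum(is.na(est))
  med <- vapply(nb, function(z) median(est[z], na.rm = TRUE), numeric(1))
  est[is.na(est)] <- med[is.na(est)]       # only fill positions that are NA
  if (sum(is.na(est)) >= missing_pre) break  # stop when nothing more fills
}
est  # c(10, 12, 14, 14, 14)
```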

When running the spatial imputation function you may see warnings such as:

  • “some observations have no neighbours”
  • “neighbour object has X sub-graphs”

These messages are expected for real-world geographies and do not indicate that the imputation has failed. “No neighbors” means a polygon has no contiguous neighbors under queen contiguity and therefore cannot borrow information; “sub-graphs” indicates that the adjacency graph is split into separate clusters (for example, mainland vs islands), and imputation is carried out within each cluster separately.

We do not force additional neighbors (for example, by using large snapping tolerances or k-nearest-neighbor rules) because that would introduce arbitrary spatial relationships that are difficult to justify. A small number of units are therefore expected to remain missing after spatial imputation and are handled explicitly in later steps.

Imputation Functions

The following code defines the reusable functions used for spatial imputation at each geography:

impute.fun <- function(data, var, shape) {
  # pull GEOID + variable, rename to "estimate"
  variable <- data[, c("GEOID", var)]
  colnames(variable) <- c("GEOID", "estimate")
  message(paste0(sum(is.na(variable$estimate)), " missing before imputation."))

  # join onto shape and drop invalid geometries
  df <- dplyr::left_join(shape, variable, by = "GEOID")
  df <- df[!is.na(sf::st_dimension(df)), ]

  df$State <- substr(df$GEOID, 1, 2)
  df$raw.estimate <- df$estimate

  n <- nrow(df)
  states <- df$State
  state_levels <- unique(states)

  # precompute row indices per state
  idx_list <- split(seq_len(n), states)

  # precompute neighbour lists per state (geometry only)
  nb_list <- vector("list", length(state_levels))
  names(nb_list) <- state_levels
  for (st in state_levels) {
    idx <- idx_list[[st]]
    nb_list[[st]] <- spdep::poly2nb(df[idx, ], queen = TRUE)
  }

  imputed_last <- rep(NA_real_, n)  # will hold the last round of neighbour medians

  repeat {
    missing_pre <- sum(is.na(df$estimate))

    # compute neighbour medians for current estimates
    imputed <- rep(NA_real_, n)
    for (st in state_levels) {
      idx <- idx_list[[st]]
      est_state <- df$estimate[idx]
      nb_state  <- nb_list[[st]]

      imputed[idx] <- vapply(
        nb_state,
        function(z) stats::median(est_state[z], na.rm = TRUE),
        numeric(1)
      )
    }

    # keep a copy of the latest neighbour medians
    imputed_last <- imputed

    # same logic as before: only fill where estimate is NA
    na_pos <- which(is.na(df$estimate))
    df$estimate[na_pos] <- imputed[na_pos]

    missing_post <- sum(is.na(df$estimate))
    if (missing_post >= missing_pre) break
  }

  # attach the final neighbour-median predictions
  df$imputed <- imputed_last

  message(paste0(sum(is.na(df$estimate)), " missing after imputing."))
  df
}

impute_vars <- function(data, shape, vars, level_label = "") {
  # inner helper for a single variable
  impute_one <- function(var_name) {
    msg_suffix <- if (nzchar(level_label)) paste0(" at ", level_label, " level") else ""
    message("Imputing ", var_name, msg_suffix, " ...")

    df_imp <- impute.fun(
      data  = data,
      var   = var_name,
      shape = shape
    )

    # diagnostics: correlations before and after imputation
    message("Correlation (estimate vs raw) for ", var_name)
    print(cor.test(df_imp$estimate, df_imp$raw.estimate))

    message("Correlation (imputed vs raw) for ", var_name)
    print(cor.test(df_imp$imputed, df_imp$raw.estimate))

    # drop geometry, keep GEOID + estimate, rename back to original var name
    df_imp$geometry <- NULL
    out <- df_imp[, c("GEOID", "estimate")]
    names(out)[names(out) == "estimate"] <- var_name

    out
  }

  # run for all vars
  res_list <- lapply(vars, impute_one)

  # combine the per-variable results into one data frame keyed by GEOID
  imputed <- plyr::join_all(res_list, by = "GEOID", type = "left")

  # report remaining NAs per variable
  na_counts <- sapply(vars, function(v) sum(is.na(imputed[[v]])))
  message("Remaining NAs by variable:")
  print(na_counts)

  imputed
}

Block group (CBG) imputation

At the census block group level, spatial imputation is applied to the subset of variables with missing values. Only originally missing values are replaced; observed values are retained unchanged. The table below reports, for each imputed variable, the correlation between the imputed and observed values and the number of block groups still missing after CBG-level imputation.

Variable \(r\) p-value NAs Remaining
B19013_001 0.757 \(<2.2\times10^{-16}\) 14
B25064_001 0.831 \(<2.2\times10^{-16}\) 29
B25077_001 0.918 \(<2.2\times10^{-16}\) 14
B25088_002 0.869 \(<2.2\times10^{-16}\) 15

The code below runs the imputation for block groups and produces the post-imputation dataset used in subsequent ReADI construction steps.

vars <- c("B19013_001", "B25064_001", "B25077_001", "B25088_002")
CBG_2022_Imputed <- impute_vars(
  data        = ReADI_2022_CBG_Raw,
  shape       = ACS_CBG_2022_Geometry,
  vars        = vars,
  level_label = "CBG")
## Imputing B19013_001 at CBG level ...
## 15042 missing before imputation.
## 14 missing after imputing.
## Correlation (estimate vs raw) for B19013_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = Inf, df = 225128, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B19013_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 550.34, df = 225103, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7556302 0.7591527
## sample estimates:
##      cor 
## 0.757397
## Imputing B25064_001 at CBG level ...
## 65920 missing before imputation.
## 29 missing after imputing.
## Correlation (estimate vs raw) for B25064_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = Inf, df = 174250, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B25064_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 622.94, df = 174234, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8292831 0.8321931
## sample estimates:
##       cor 
## 0.8307438
## Imputing B25077_001 at CBG level ...
## 21260 missing before imputation.
## 14 missing after imputing.
## Correlation (estimate vs raw) for B25077_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = Inf, df = 218910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B25077_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 1082.6, df = 218885, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9172939 0.9186124
## sample estimates:
##       cor 
## 0.9179557
## Imputing B25088_002 at CBG level ...
## 29910 missing before imputation.
## 15 missing after imputing.
## Correlation (estimate vs raw) for B25088_002
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = Inf, df = 210260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B25088_002
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 804.77, df = 210236, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8678194 0.8699145
## sample estimates:
##       cor 
## 0.8688709
## Remaining NAs by variable:
## B19013_001 B25064_001 B25077_001 B25088_002 
##         14         29         14         15
ReADI_CBG_2022_Imp <- ReADI_2022_CBG_Raw[, !names(ReADI_2022_CBG_Raw) %in% vars]
ReADI_CBG_2022_Imp <- plyr::join(ReADI_CBG_2022_Imp, CBG_2022_Imputed, by = "GEOID")

Census tract (CT) imputation

The same spatial imputation strategy is applied at the census tract level, using tract geometries and state-specific neighbor lists. As at the block group level, only originally missing values are imputed, and imputation proceeds iteratively within state until no additional units can be filled based on their neighbors.

Variable \(r\) p-value NAs Remaining
B19013_001 0.802 \(<2.2\times10^{-16}\) 4
B25064_001 0.839 \(<2.2\times10^{-16}\) 4
B25077_001 0.920 \(<2.2\times10^{-16}\) 4
B25088_002 0.894 \(<2.2\times10^{-16}\) 4
CT_2022_Imputed <- impute_vars(
  data        = ReADI_2022_CT_Raw,
  shape       = ACS_CT_2022_Geometry,
  vars        = vars,
  level_label = "CT")
## Imputing B19013_001 at CT level ...
## 724 missing before imputation.
## 4 missing after imputing.
## Correlation (estimate vs raw) for B19013_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = 1.9428e+10, df = 83813, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B19013_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 388.56, df = 83794, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7994933 0.8043267
## sample estimates:
##       cor 
## 0.8019231
## Imputing B25064_001 at CT level ...
## 4392 missing before imputation.
## 4 missing after imputing.
## Correlation (estimate vs raw) for B25064_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = Inf, df = 80145, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B25064_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 436.16, df = 80126, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8367624 0.8408667
## sample estimates:
##       cor 
## 0.8388264
## Imputing B25077_001 at CT level ...
## 2344 missing before imputation.
## 4 missing after imputing.
## Correlation (estimate vs raw) for B25077_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = 1.3604e+10, df = 82193, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B25077_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 673.1, df = 82174, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9189838 0.9210832
## sample estimates:
##       cor 
## 0.9200401
## Imputing B25088_002 at CT level ...
## 2796 missing before imputation.
## 4 missing after imputing.
## Correlation (estimate vs raw) for B25088_002
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = Inf, df = 81741, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B25088_002
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 569.7, df = 81722, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8923977 0.8951560
## sample estimates:
##       cor 
## 0.8937853
## Remaining NAs by variable:
## B19013_001 B25064_001 B25077_001 B25088_002 
##          4          4          4          4
ReADI_CT_2022_Imp <- ReADI_2022_CT_Raw[, !names(ReADI_2022_CT_Raw) %in% vars]
ReADI_CT_2022_Imp <- plyr::join(ReADI_CT_2022_Imp, CT_2022_Imputed, by = "GEOID")

County imputation

At the county level, the same strategy is applied using county geometries and state-specific neighbor lists. The District of Columbia is excluded from the spatial step because it consists of a single county-equivalent, so no within-state neighbor list can be constructed for it; its original values are restored after imputation.

Variable \(r\) p-value NAs Remaining
B19013_001 0.782 \(<2.2\times10^{-16}\) 0
B25064_001 0.832 \(<2.2\times10^{-16}\) 0
B25077_001 0.862 \(<2.2\times10^{-16}\) 0
B25088_002 0.858 \(<2.2\times10^{-16}\) 0
## Remove DC for spatial imputation
data_no_dc  <- ReADI_2022_C_Raw[ReADI_2022_C_Raw$State != " District of Columbia", ]
geoid_dc    <- ReADI_2022_C_Raw[ReADI_2022_C_Raw$State == " District of Columbia", "GEOID"]
shape_no_dc <- ACS_C_2022_Geometry[!ACS_C_2022_Geometry$GEOID %in% geoid_dc, ]

C_2022_Imputed <- impute_vars(
  data        = data_no_dc,
  shape       = shape_no_dc,
  vars        = vars,
  level_label = "county (excluding DC)")
## Imputing B19013_001 at county (excluding DC) level ...
## 1 missing before imputation.
## 0 missing after imputing.
## Correlation (estimate vs raw) for B19013_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = Inf, df = 3218, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B19013_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 70.972, df = 3207, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7678229 0.7947632
## sample estimates:
##       cor 
## 0.7816574
## Imputing B25064_001 at county (excluding DC) level ...
## 10 missing before imputation.
## 0 missing after imputing.
## Correlation (estimate vs raw) for B25064_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = 2688126783, df = 3209, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B25064_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 84.971, df = 3198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8215354 0.8428257
## sample estimates:
##       cor 
## 0.8324876
## Imputing B25077_001 at county (excluding DC) level ...
## 3 missing before imputation.
## 0 missing after imputing.
## Correlation (estimate vs raw) for B25077_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = Inf, df = 3216, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B25077_001
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 96.467, df = 3205, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8533113 0.8710609
## sample estimates:
##      cor 
## 0.862451
## Imputing B25088_002 at county (excluding DC) level ...
## 6 missing before imputation.
## 0 missing after imputing.
## Correlation (estimate vs raw) for B25088_002
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$estimate and df_imp$raw.estimate
## t = 2689801629, df = 3213, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
## Correlation (imputed vs raw) for B25088_002
## 
##  Pearson's product-moment correlation
## 
## data:  df_imp$imputed and df_imp$raw.estimate
## t = 94.341, df = 3202, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8481233 0.8664635
## sample estimates:
##       cor 
## 0.8575657
## Remaining NAs by variable:
## B19013_001 B25064_001 B25077_001 B25088_002 
##          0          0          0          0
ReADI_C_2022_Imp <- ReADI_2022_C_Raw[, !names(ReADI_2022_C_Raw) %in% vars]
ReADI_C_2022_Imp <- plyr::join(ReADI_C_2022_Imp, C_2022_Imputed, by = "GEOID")

stopifnot(identical(ReADI_2022_C_Raw$GEOID, ReADI_C_2022_Imp$GEOID))

## Restore DC’s original values (they were excluded from imputation)
for (v in vars) {
  ReADI_C_2022_Imp[[v]] <- ifelse(
    is.na(ReADI_C_2022_Imp[[v]]),
    ReADI_2022_C_Raw[[v]],
    ReADI_C_2022_Imp[[v]]
  )
}
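The DC restoration above is an NA-coalesce: wherever the joined (imputed) column is missing because the row was excluded from imputation, fall back to the aligned raw value. A toy illustration of the pattern, with illustrative numbers:

```r
# Fill NAs in the imputed vector from the aligned raw vector; imputed values
# win where present, raw values only fill the gaps.
imputed  <- c(101125, NA, 32844)     # NA where the unit was excluded
raw      <- c(101125, 90000, 32844)  # original values, aligned by row
restored <- ifelse(is.na(imputed), raw, imputed)
restored  # c(101125, 90000, 32844)
```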



2.4.2 Remaining Missing Values After Spatial Imputation



After the primary spatial imputation step, a small number of geographic units still have missing values. These remaining NAs occur almost entirely in areas that either (1) have no bordering geographies to borrow information from (for example, isolated tracts or block groups and islands) or (2) are special-use areas such as national parks, military or naval facilities, or other atypically enumerated locations. To handle these cases, we use a simple hierarchical approach: where units are clearly non-residential or special-use, we remove them from the analytic dataset; otherwise, we borrow information from the next-highest relevant geography.



2.4.3 Counties



No counties have missing values after spatial imputation.



2.4.4 Census Tracts



Four census tracts remain with missing values after imputation. The table below summarizes these tracts and their characteristics based on Census Reporter:

GEOID CT Location Population CT Information
06083980100 9801 Santa Barbara County, CA 10 Channel Islands National Park.
06111980000 9800 Ventura County, CA 107 Naval Facility San Nicolas Island.
12087980100 9801 Monroe County, FL 2 Dry Tortugas National Park.
26083980100 9801 Keweenaw County, MI 3 Isle Royale National Park.

These tracts are either extremely small, geographically isolated, or special-use areas. For ReADI construction, these tracts are removed from the tract-level analytic dataset, and any block groups nested entirely within them are also excluded (see below).

na_list <- lapply(vars, function(v) ReADI_CT_2022_Imp[is.na(ReADI_CT_2022_Imp[[v]]), c("GEOID", "Tract", "County", "State")])
CT_Missing <- unique(do.call(rbind, na_list))
ReADI_CT_2022_Imp <- ReADI_CT_2022_Imp[!ReADI_CT_2022_Imp$GEOID %in% CT_Missing$GEOID, ]

Removing CBGs located within removed tracts

Thirty-two census block groups remain with missing values after block-group-level imputation. Of these, four are located within the census tracts removed above, and an additional three are clearly non-residential areas (a prison, a psychiatric facility, and a marine state park) that are not appropriate for inclusion in a residential deprivation index. The table below lists these CBGs and their characteristics, based on Census Reporter:

GEOID CBG.CT Location Population CT Information Removed
022200002006 6.2 Sitka City & Borough, AK 372 Series of Islands
060290057001 1.57 Kern County, CA 2,851 Wherry Housing
060375991001 1.5991 Los Angeles County, CA 201 San Clemente Islands
060839801001 1.9801 Santa Barbara County, CA 10 Channel Islands National Park X
061119800001 1.9800 Ventura County, CA 107 Naval Facility San Nicolas Island X
120379703041 1.9703.04 Franklin County, FL 970 Saint George Island
120879709001 1.9709 Monroe County, FL 1,033 Lower Matecumbe Beach
120879801001 1.9801 Monroe County, FL 2 Dry Tortugas is a National Park X
230050044022 2.44.02 Cumberland County, ME 562 Cousins Island
230139711004 4.9711 Knox County, ME 138 Isle au Haut and Matinicus Isle
230159756005 5.9756 Lincoln County, ME 119 Monhegan Island
250072004006 6.2004 Dukes County, MA 38 Gosnold
260839801001 1.9801 Keweenaw County, MI 3 Isle Royale Island is a National Park X
260979505001 1.9505 Mackinac County, MI 44 Bois Blanc Island
360050001001 1.1 Bronx County, NY 4,446 Rikers Island Prison X
360610240002 2.240 New York County, NY 2,278 Manhattan Psychiatric Center X
360610240003 3.240 New York County, NY 38 Waste Facility Plant
360810916034 4.916.03 Queens County, NY 2,230 Queens
360810916035 5.916.03 Queens County, NY 444 Queens
360810916036 6.916.03 Queens County, NY 684 Queens
361031702061 1.1702.06 Suffolk County, NY 53 Fishers Island
370959201013 3.9201.01 Hyde County, NC 715 Ocracoke Island
390430419004 4.419 Erie County, OH 221 Kelleys Island
440050401052 2.401.05 Newport County, RI 98 Prudence Park and Homestead on Island
510010906001 1.906 Accomack County, VA 345 Tangier Island
530350927043 3.927.04 Kitsap County, WA 14 Blake Island Marine State Park X
530530724101 1.724.10 Pierce County, WA 1,435 Fox Island
530530724102 2.724.10 Pierce County, WA 1,080 Fox Island
530530724103 3.724.10 Pierce County, WA 1,129 Fox Island
530530726033 3.726.03 Pierce County, WA 1,621 Anderson Island
530530726035 5.726.03 Pierce County, WA 371 McNeil Island
720531501062 2.1501.06 Fajardo Municipio, PR 38 Palomino Island

Block groups nested in removed tracts and clearly non-residential CBGs are dropped from the CBG-level analytic dataset.

bad_len <- which(is.na(ReADI_CBG_2022_Imp$GEOID) | nchar(ReADI_CBG_2022_Imp$GEOID) != 12)
if (length(bad_len) > 0) {
  stop("Found invalid CBG GEOIDs (NA or not 12 chars). Example rows: ",
       paste(head(bad_len, 10), collapse = ", "))
}
cbg_na_list <- lapply(vars, function(v) ReADI_CBG_2022_Imp[is.na(ReADI_CBG_2022_Imp[[v]]), c("GEOID", "Tract", "County", "State")])
CBG_Missing <- unique(do.call(rbind, cbg_na_list))
CBG_Missing$GEOID_CT <- substr(CBG_Missing$GEOID, 1, 11)

# CBGs nested in the removed tracts, plus the three clearly non-residential
# CBGs flagged in the table above (Rikers Island Prison, Manhattan
# Psychiatric Center, Blake Island Marine State Park)
CBG_nonres <- c("360050001001", "360610240002", "530350927043")
CBG_rm <- c(CBG_Missing[CBG_Missing$GEOID_CT %in% CT_Missing$GEOID, ]$GEOID, CBG_nonres)
ReADI_CBG_2022_Imp <- ReADI_CBG_2022_Imp[!ReADI_CBG_2022_Imp$GEOID %in% CBG_rm, ]



2.4.5 Final CBG Imputation Using Tract Values



The remaining CBGs with missing values after the steps above are imputed using the corresponding census tract level values for the same variables. This preserves information from the broader local context while avoiding arbitrary spatial borrowing across unrelated areas. The table below summarizes the final counts of units retained at each geographic level (block group, tract, county) after all imputation and exclusion steps.

Area Remaining NAs Units retained
CBG 0 240,165
CT 0 84,535
C 0 3,222
impute_CBG_geo <- CBG_Missing[!CBG_Missing$GEOID %in% CBG_rm,]
impute_data <- ReADI_CT_2022_Imp[ReADI_CT_2022_Imp$GEOID %in% impute_CBG_geo$GEOID_CT, c("GEOID", vars)]
colnames(impute_data) <- paste(colnames(impute_data), "_CT", sep = "")
ReADI_CBG_2022_Imp$GEOID_CT <- substring(ReADI_CBG_2022_Imp$GEOID, 1, 11)
ReADI_CBG_2022_Imp <- plyr::join(ReADI_CBG_2022_Imp, impute_data, by = "GEOID_CT", type = "left")
ReADI_CBG_2022_Imp[vars] <- lapply(vars, function(v) {
  ifelse(is.na(ReADI_CBG_2022_Imp[[v]]),
         ReADI_CBG_2022_Imp[[paste0(v, "_CT")]],
         ReADI_CBG_2022_Imp[[v]])
})
ReADI_CBG_2022_Imp[c("GEOID_CT", paste0(vars, "_CT"))] <- NULL

sum(is.na(ReADI_CBG_2022_Imp))
## [1] 0
sum(is.na(ReADI_CT_2022_Imp))
## [1] 0
sum(is.na(ReADI_C_2022_Imp))
## [1] 0

With all preprocessing complete, including variable assembly, population filtering, spatial imputation, and resolution of remaining missing values, the analytic datasets for counties, census tracts, and block groups are now fully prepared for index construction. The following section implements the ReADI model itself: calculating component variables, fitting the population-weighted factor model, extracting loadings and factor scores, and generating the final ReADI values and national percentile ranks for each geographic unit.



3 ReADI Production


3.1 Method


The steps below describe how ReADI is constructed at each geographic level (county, census tract, and census block group). Although the same procedure is applied across all three geographies, the models are estimated independently within each level so that the resulting loadings and scores are geographically appropriate.

1. Construct the 17 component variables: Using the imputed ACS data for each geography, we compute the 17 deprivation-related component variables used in ReADI. These variables represent domains such as income, education, employment, housing cost burden, household composition, and infrastructure access. The table below summarizes each variable, the corresponding ACS table or cell, and the formula used in its construction.

Census Variable Name Table Formula
Median family income, $ MEDFI B19013 \(001\)
Income disparity\(^a\) INCDIS B19001 \(\ln({\sum(002-004)+1\over\sum(014-017)+1}\times100)\)
Families below poverty level, % BELPL B17017 \({002\over001}\times100\)
Population below 150% of the poverty threshold, % PL150 C17002 \({\sum(002-005)\over 001}\times100\)
Population ≥25y with less than a high school diploma, % EDU9YR B15003 \({\sum(002-016)\over 001}\times100\)
Population ≥25y with at least a bachelor’s degree, % HSDPNHIGH B15003 \({\sum(022-025)\over 001}\times100\)
Employed persons ≥16y in white-collar occupations, % EMPWC C24010 \({\sum(003,027,039,063)\over 001}\times100\)
Civilian labor force population ≥16y unemployed, % UNEMP B23025 \({005\over003}\times100\)
Median home value, $ MEDHV B25077 \(001\)
Median gross rent, $ MEDRNT B25064 \(001\)
Median monthly mortgage, $ MEDMORT B25088 \(002\)
Owner-occupied housing units, % OWN B25003 \({002\over001}\times100\)
Single-parent households with children <18y, % SNGPNT B11012 \({\sum(010,015)\over 001}\times100\)
Households without a motor vehicle, % NOVEH B25044 \({\sum(010,003)\over 001}\times100\)
Households without internet, % NOINT B28002 \({013\over001}\times100\)
Occupied housing units without complete plumbing, % NOPLUM B25049 \(\ln({\sum(004,007)+1\over001+1}\times100)\)
Households with more than 1 person per room, % CROWD B25014 \({\sum(005-007,011-013)\over 001}\times100\)

\(^a\) Income disparity in 2022 was defined as the natural log of 100 × the ratio of the number of households with income <$20,000 to the number of households with income ≥$100,000, with 1 added to each count to avoid division by zero.
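As a worked example of the income-disparity formula (with made-up household counts, not ACS data):

```r
# Hypothetical counts: 50 households below $20,000, 200 at or above $100,000.
# Adding 1 to each count guards against division by zero.
low_income  <- 50
high_income <- 200
INCDIS <- log(100 * (low_income + 1) / (high_income + 1))
round(INCDIS, 3)  # 3.234
```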

2. Resolve structural zeros: Any remaining NAs after calculation are not true missing values. They occur only when both the numerator and denominator of a ratio are zero. These structural cases are converted to zero so that they do not propagate into later steps.
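These structural cases arise because 0/0 evaluates to NaN in R, and is.na() treats NaN as missing, so a blanket NA-to-zero replacement catches them. A minimal demonstration:

```r
pct <- 0 / 0 * 100   # a ratio whose numerator and denominator are both zero
is.na(pct)           # TRUE: NaN is treated as missing
pct[is.na(pct)] <- 0 # restore the structural zero
pct                  # 0
```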

3. Standardize the component variables: The 17 component variables are standardized using z-scores (mean = 0, standard deviation = 1). Standardization ensures that variables measured on different numerical scales (for example, percentages vs. dollar-valued medians) contribute appropriately to the factor analysis rather than being dominated by variables with large numeric ranges.
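Base R's scale() performs exactly this transformation; a toy illustration with hypothetical income values:

```r
medfi <- c(45000, 60000, 90000)  # hypothetical median family incomes, $
z <- as.numeric(scale(medfi))    # z-scores: mean 0, sample SD 1
```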

4. Compute population weights: Population size for each geographic unit is taken from ACS table B01003_001 (total population). To reduce the influence of extremely populous areas and to make population distributions more comparable across geographies, we transform the weights using the natural log: \({POPWT} = \ln(\text{population})\), matching the `POPWT = log(B01003_001)` step in the code below. These transformed weights are used as analytic weights in the factor model.
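The production code computes the weight as POPWT = log(B01003_001). To illustrate how this log transform compresses the population range (toy populations, not ACS values):

```r
pop   <- c(500, 50000, 5000000)  # hypothetical unit populations
popwt <- log(pop)
# the raw populations span a 10,000x range; the weights span only ~2.5x
max(popwt) / min(popwt)
```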

5. Fit the factor model: A one-factor model is estimated using the fa() function from the psych package, with the following key specifications (all other settings left at their defaults):

  • method: principal factor extraction
  • number of factors: 1
  • weights: population-based POPWT

This procedure produces the following core outputs for each geography:

  1. ReADI coefficients: the estimated scoring coefficients for each of the 17 component variables, as derived from the one-factor model.
  2. Raw ReADI score: a continuous ReADI score for each unit based on the factor solution (unscaled).
  3. Nationally ranked ReADI score: the raw ReADI score transformed to a 0–100 national rank, where higher values indicate greater deprivation.
  4. Factor loadings and weights: the underlying loadings and scoring weights that describe how each component contributes to the factor and how scores are computed.
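The 0–100 national rank in output 3 is a percentile-rank transform. A base-R sketch that mirrors dplyr::percent_rank, the function called in the production code:

```r
raw <- c(-1.2, 0.3, 2.5, 0.3)  # toy raw factor scores (note the tie)
# percent_rank = (min rank - 1) / (n - 1); tied values share the lower rank
pr <- (rank(raw, ties.method = "min") - 1) / (length(raw) - 1)
nr <- round(pr * 100, 0)
nr  # 0 33 100 33
```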

The tables in the subsections below show, for each geography, the number of NAs per component variable and the resulting factor weights.

3.1.1 Block group (CBG) ReADI


Below are the factor weights and scoring coefficients estimated for block groups using the 2022 ACS data. These coefficients are applied to the standardized component variables to generate both the raw and nationally ranked ReADI scores at the CBG level. The code below performs the component construction, weighting, factor model estimation, and score generation for block groups.

Variable NAs Weights
MEDFI 0 -0.280
INCDIS 0 0.145
BELPL 631 0.085
PL150 502 0.184
EDU9YR 81 0.065
HSDPNHIGH 81 -0.135
EMPWC 445 -0.104
UNEMP 424 0.023
MEDHV 0 -0.033
MEDRNT 0 -0.066
MEDMORT 0 -0.049
OWN 631 0.005
SNGPNT 631 0.035
NOVEH 631 0.025
NOINT 631 0.059
NOPLUM 0 0.014
CROWD 631 0.026
ReADI_CBG_2022 <- ReADI_CBG_2022_Imp %>%
  dplyr::mutate(MEDFI = B19013_001,
    INCDIS = log(100*((B19001_002 + B19001_003 + B19001_004 + 1)/(B19001_014 + B19001_015 + B19001_016 + B19001_017 + 1))),
    BELPL = B17017_002/B17017_001*100,
    PL150 = (C17002_005 + C17002_004 + C17002_003 + C17002_002)/C17002_001*100,
    EDU9YR = (B15003_002 + B15003_003 + B15003_004 + B15003_005 + B15003_006 + B15003_007 + B15003_008 + B15003_009 + B15003_010 + B15003_011 + B15003_012 + B15003_013 + B15003_014 + B15003_015 + B15003_016)/B15003_001*100,
    HSDPNHIGH = (B15003_022 + B15003_023 + B15003_024 + B15003_025)/B15003_001*100,
    EMPWC = (C24010_003 + C24010_027 + C24010_039 + C24010_063)/ C24010_001*100,
    UNEMP = B23025_005/B23025_003*100,
    MEDHV = B25077_001,
    MEDRNT = B25064_001,
    MEDMORT = B25088_002,
    OWN = B25003_002/B25003_001*100,
    SNGPNT = (B11012_010 + B11012_015)/B11012_001*100,
    NOVEH = (B25044_010 + B25044_003)/B25044_001*100,
    NOINT = B28002_013/B28002_001*100,
    NOPLUM = log((B25049_007 + B25049_004 + 1)/(B25049_001 + 1)*100),
    CROWD = (B25014_005 + B25014_006 + B25014_007 + B25014_011 + B25014_012 + B25014_013)/B25014_001*100,
    POPWT = log(B01003_001)
) %>%
  select(GEOID, Block_Group, Tract, County, State, MEDFI, INCDIS, BELPL, PL150, EDU9YR, HSDPNHIGH, EMPWC, UNEMP, MEDHV, MEDRNT, MEDMORT, OWN, SNGPNT, NOVEH, NOINT, NOPLUM, CROWD, POPWT)

# Inspect structural NAs (from 0/0 ratios), then set them back to 0.
colSums(is.na(ReADI_CBG_2022))
##       GEOID Block_Group       Tract      County       State       MEDFI 
##           0           0           0           0           0           0 
##      INCDIS       BELPL       PL150      EDU9YR   HSDPNHIGH       EMPWC 
##           0         633         503          81          81         446 
##       UNEMP       MEDHV      MEDRNT     MEDMORT         OWN      SNGPNT 
##         425           0           0           0         633         633 
##       NOVEH       NOINT      NOPLUM       CROWD       POPWT 
##         633         633           0         633           0
ReADI_CBG_2022[is.na(ReADI_CBG_2022)] <- 0
ReADI_2022_CBG_RawVars <- ReADI_CBG_2022

ADI_short <- ReADI_CBG_2022[, c("GEOID", "MEDFI", "INCDIS", "BELPL", "PL150", "EDU9YR", "HSDPNHIGH", "EMPWC", "UNEMP", "MEDHV", "MEDRNT", "MEDMORT", "OWN", "SNGPNT", "NOVEH", "NOINT", "NOPLUM", "CROWD")]
rownames(ADI_short) <- ADI_short$GEOID
ADI_short$GEOID <- NULL

# z-score transform (mean 0, SD 1)
ADI_short <- as.data.frame(scale(ADI_short))

# Factor analysis is performed with a one factor solution
stopifnot(identical(rownames(ADI_short), ReADI_CBG_2022$GEOID))
adi.fa <- fa(ADI_short, nfactors = 1, fm = "pa", weight = ReADI_CBG_2022$POPWT)

# Loadings
ReADI_2022_Loadings <- as.data.frame(loadings(adi.fa)[])
colnames(ReADI_2022_Loadings) <- "CBG"
ReADI_2022_Loadings$Vars <- rownames(ReADI_2022_Loadings)
ReADI_2022_Loadings <- ReADI_2022_Loadings[,c(2,1)]

# Weights
ReADI_2022_Wt <- as.data.frame(adi.fa$weights)
colnames(ReADI_2022_Wt) <- "CBG"
ReADI_2022_Wt$Vars <- rownames(ReADI_2022_Wt)
ReADI_2022_Wt <- ReADI_2022_Wt[,c(2,1)]

# Use the factor scores as the raw ReADI, then rank to a 0-100 national scale
ADI <- as.data.frame(adi.fa$scores)
colnames(ADI) <- c("ReADI_CBG_Raw")
stopifnot(identical(rownames(ADI), ReADI_CBG_2022$GEOID))
ADI$ReADI_CBG_NR <- round(percent_rank(ADI$ReADI_CBG_Raw)*100, 0)
ADI$GEOID <- rownames(ADI)
ReADI_CBG_2022 <- plyr::join(ReADI_CBG_2022, ADI, by = "GEOID")
save(ReADI_CBG_2022, file = "ReADI_CBG_2022.RData")
write.csv(ReADI_CBG_2022, file = "ReADI_CBG_2022.csv")
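As a usage sketch, downstream analysts can attach the saved CBG ranks to their own data by 12-digit GEOID; toy stand-ins are used below in place of the real ReADI_CBG_2022 output (the GEOIDs and values are hypothetical):

```r
# Hypothetical ReADI output: GEOID plus the 0-100 national rank column
readi <- data.frame(GEOID        = c("010010201001", "010010201002"),
                    ReADI_CBG_NR = c(72, 35),
                    stringsAsFactors = FALSE)
# Hypothetical user data keyed on the same 12-digit CBG GEOID
my_data <- data.frame(GEOID   = c("010010201002", "010010201001"),
                      outcome = c(1.4, 2.2),
                      stringsAsFactors = FALSE)
merged <- merge(my_data, readi, by = "GEOID")  # keeps only matching GEOIDs
```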

3.1.2 Census tract (CT) ReADI


This section applies the same ReADI procedure at the census tract level. The table below shows the tract-specific factor loadings and scoring coefficients. The code that follows constructs the 17 tract-level component variables, applies population weighting, fits the factor model, and produces both the raw and nationally ranked ReADI values for census tracts.

Variable NAs Weights
MEDFI 0 -0.308
INCDIS 0 0.171
BELPL 255 0.048
PL150 198 0.224
EDU9YR 21 0.059
HSDPNHIGH 21 -0.091
EMPWC 154 -0.129
UNEMP 153 0.019
MEDHV 0 -0.020
MEDRNT 0 -0.036
MEDMORT 0 -0.002
OWN 255 0.045
SNGPNT 255 0.045
NOVEH 255 0.015
NOINT 255 0.057
NOPLUM 0 0.010
CROWD 255 0.018
ReADI_CT_2022 <- ReADI_CT_2022_Imp %>%
  dplyr::mutate(MEDFI = B19013_001,
    INCDIS = log(100*((B19001_002 + B19001_003 + B19001_004 + 1)/(B19001_014 + B19001_015 + B19001_016 + B19001_017 + 1))),
    BELPL = B17017_002/B17017_001*100,
    PL150 = (C17002_005 + C17002_004 + C17002_003 + C17002_002)/C17002_001*100,
    EDU9YR = (B15003_002 + B15003_003 + B15003_004 + B15003_005 + B15003_006 + B15003_007 + B15003_008 + B15003_009 + B15003_010 + B15003_011 + B15003_012 + B15003_013 + B15003_014 + B15003_015 + B15003_016)/B15003_001*100,
    HSDPNHIGH = (B15003_022 + B15003_023 + B15003_024 + B15003_025)/B15003_001*100,
    EMPWC = (C24010_003 + C24010_027 + C24010_039 + C24010_063)/ C24010_001*100,
    UNEMP = B23025_005/B23025_003*100,
    MEDHV = B25077_001,
    MEDRNT = B25064_001,
    MEDMORT = B25088_002,
    OWN = B25003_002/B25003_001*100,
    SNGPNT = (B11012_010 + B11012_015)/B11012_001*100,
    NOVEH = (B25044_010 + B25044_003)/B25044_001*100,
    NOINT = B28002_013/B28002_001*100,
    NOPLUM = log((B25049_007 + B25049_004 + 1)/(B25049_001 + 1)*100),
    CROWD = (B25014_005 + B25014_006 + B25014_007 + B25014_011 + B25014_012 + B25014_013)/B25014_001*100,
    POPWT = log(B01003_001)
) %>%
  select(GEOID, Tract, County, State, MEDFI, INCDIS, BELPL, PL150, EDU9YR, HSDPNHIGH, EMPWC, UNEMP, MEDHV, MEDRNT, MEDMORT, OWN, SNGPNT, NOVEH, NOINT, NOPLUM, CROWD, POPWT)

colSums(is.na(ReADI_CT_2022))
##     GEOID     Tract    County     State     MEDFI    INCDIS     BELPL     PL150 
##         0         0         0         0         0         0       255       198 
##    EDU9YR HSDPNHIGH     EMPWC     UNEMP     MEDHV    MEDRNT   MEDMORT       OWN 
##        21        21       154       153         0         0         0       255 
##    SNGPNT     NOVEH     NOINT    NOPLUM     CROWD     POPWT 
##       255       255       255         0       255         0
ReADI_CT_2022[is.na(ReADI_CT_2022)] <- 0
ReADI_2022_CT_RawVars <- ReADI_CT_2022

ADI_short <- ReADI_CT_2022[, c("GEOID", "MEDFI", "INCDIS", "BELPL", "PL150", "EDU9YR", "HSDPNHIGH", "EMPWC", "UNEMP", "MEDHV", "MEDRNT", "MEDMORT", "OWN", "SNGPNT", "NOVEH", "NOINT", "NOPLUM", "CROWD")]
rownames(ADI_short) <- ADI_short$GEOID
ADI_short$GEOID <- NULL

ADI_short <- as.data.frame(scale(ADI_short))

stopifnot(identical(rownames(ADI_short), ReADI_CT_2022$GEOID))
adi.fa <- fa(ADI_short, nfactors = 1, fm = "pa", weight = ReADI_CT_2022$POPWT)

# Loadings
identical(colnames(adi.fa$residual),ReADI_2022_Loadings$Vars)
## [1] TRUE
ReADI_2022_Loadings$CT <- loadings(adi.fa)[,"PA1"]

# Weights
identical(colnames(adi.fa$residual),ReADI_2022_Wt$Vars)
## [1] TRUE
ReADI_2022_Wt$CT <- adi.fa$weights[,"PA1"]

# ReADI scores
ADI <- as.data.frame(adi.fa$scores)
colnames(ADI) <- c("ReADI_CT_Raw")
stopifnot(identical(rownames(ADI), ReADI_CT_2022$GEOID))
ADI$ReADI_CT_NR <- round(percent_rank(ADI$ReADI_CT_Raw)*100, 0)
ADI$GEOID <- rownames(ADI)
ReADI_CT_2022 <- plyr::join(ReADI_CT_2022, ADI, by = "GEOID")
save(ReADI_CT_2022, file = "ReADI_CT_2022.RData")
write.csv(ReADI_CT_2022, file = "ReADI_CT_2022.csv")

3.1.3 County (C) ReADI


Finally, the ReADI model is estimated at the county level. The table below displays the county-level factor loadings and scoring coefficients. The code below generates the component variables, applies analytic weights, fits the county factor model, and computes raw and nationally ranked ReADI scores for all counties.

Variable NAs Weights
MEDFI 0 -0.301
INCDIS 0 0.314
BELPL 0 -0.085
PL150 0 0.182
EDU9YR 0 0.055
HSDPNHIGH 0 -0.069
EMPWC 0 -0.069
UNEMP 0 0.013
MEDHV 0 -0.079
MEDRNT 0 -0.055
MEDMORT 0 0.019
OWN 0 0.051
SNGPNT 0 0.055
NOVEH 0 -0.024
NOINT 0 0.094
NOPLUM 0 0.016
CROWD 0 0.044
ReADI_C_2022 <- ReADI_C_2022_Imp %>%
  dplyr::mutate(MEDFI = B19013_001,
    INCDIS = log(100*((B19001_002 + B19001_003 + B19001_004 + 1)/(B19001_014 + B19001_015 + B19001_016 + B19001_017 + 1))),
    BELPL = B17017_002/B17017_001*100,
    PL150 = (C17002_005 + C17002_004 + C17002_003 + C17002_002)/C17002_001*100,
    EDU9YR = (B15003_002 + B15003_003 + B15003_004 + B15003_005 + B15003_006 + B15003_007 + B15003_008 + B15003_009 + B15003_010 + B15003_011 + B15003_012 + B15003_013 + B15003_014 + B15003_015 + B15003_016)/B15003_001*100,
    HSDPNHIGH = (B15003_022 + B15003_023 + B15003_024 + B15003_025)/B15003_001*100,
    EMPWC = (C24010_003 + C24010_027 + C24010_039 + C24010_063)/ C24010_001*100,
    UNEMP = B23025_005/B23025_003*100,
    MEDHV = B25077_001,
    MEDRNT = B25064_001,
    MEDMORT = B25088_002,
    OWN = B25003_002/B25003_001*100,
    SNGPNT = (B11012_010 + B11012_015)/B11012_001*100,
    NOVEH = (B25044_010 + B25044_003)/B25044_001*100,
    NOINT = B28002_013/B28002_001*100,
    NOPLUM = log((B25049_007 + B25049_004 + 1)/(B25049_001 + 1)*100),
    CROWD = (B25014_005 + B25014_006 + B25014_007 + B25014_011 + B25014_012 + B25014_013)/B25014_001*100,
    POPWT = log(B01003_001)
) %>%
  select(GEOID, County, State, MEDFI, INCDIS, BELPL, PL150, EDU9YR, HSDPNHIGH, EMPWC, UNEMP, MEDHV, MEDRNT, MEDMORT, OWN, SNGPNT, NOVEH, NOINT, NOPLUM, CROWD, POPWT)

colSums(is.na(ReADI_C_2022))
##     GEOID    County     State     MEDFI    INCDIS     BELPL     PL150    EDU9YR 
##         0         0         0         0         0         0         0         0 
## HSDPNHIGH     EMPWC     UNEMP     MEDHV    MEDRNT   MEDMORT       OWN    SNGPNT 
##         0         0         0         0         0         0         0         0 
##     NOVEH     NOINT    NOPLUM     CROWD     POPWT 
##         0         0         0         0         0
ReADI_2022_C_RawVars <- ReADI_C_2022

ADI_short <- ReADI_C_2022[, c("GEOID", "MEDFI", "INCDIS", "BELPL", "PL150", "EDU9YR", "HSDPNHIGH", "EMPWC", "UNEMP", "MEDHV", "MEDRNT", "MEDMORT", "OWN", "SNGPNT", "NOVEH", "NOINT", "NOPLUM", "CROWD")]
rownames(ADI_short) <- ADI_short$GEOID
ADI_short$GEOID <- NULL

ADI_short <- as.data.frame(scale(ADI_short))

stopifnot(identical(rownames(ADI_short), ReADI_C_2022$GEOID))
adi.fa <- fa(ADI_short, nfactors = 1, fm = "pa", weight = ReADI_C_2022$POPWT)

# Loadings
identical(colnames(adi.fa$residual),ReADI_2022_Loadings$Vars)
## [1] TRUE
ReADI_2022_Loadings$C <- loadings(adi.fa)[,"PA1"]
save(ReADI_2022_Loadings, file = "ReADI_2022_Loadings.RData")
write.csv(ReADI_2022_Loadings, file = "ReADI_2022_Loadings.csv")

# weights
identical(colnames(adi.fa$residual),ReADI_2022_Wt$Vars)
## [1] TRUE
ReADI_2022_Wt$C <- adi.fa$weights[,"PA1"]
save(ReADI_2022_Wt, file = "ReADI_2022_Wt.RData")
write.csv(ReADI_2022_Wt, file = "ReADI_2022_Wt.csv")

# ReADI Scores
ADI <- as.data.frame(adi.fa$scores)
colnames(ADI) <- c("ReADI_C_Raw")
stopifnot(identical(rownames(ADI), ReADI_C_2022$GEOID))
ADI$ReADI_C_NR <- round(percent_rank(ADI$ReADI_C_Raw)*100, 0)
ADI$GEOID <- rownames(ADI)
ReADI_C_2022 <- plyr::join(ReADI_C_2022, ADI, by = "GEOID")
save(ReADI_C_2022, file = "ReADI_C_2022.RData")
write.csv(ReADI_C_2022, file = "ReADI_C_2022.csv")



4 Summary


At this stage, ReADI has been fully constructed for all three geographic levels. For each level, the workflow produces the calculated scoring coefficients, raw ReADI scores, nationally ranked (0–100) ReADI scores, and the corresponding factor loadings and weights. These outputs form the basis for all subsequent mapping, validation, and comparison analyses. Because all steps are explicitly documented and reproducible, users can subset geographic levels, modify components, or re-estimate the factor model for alternative datasets as needed.