What is the difference between ZIP code "boundaries" and ZCTA areas?

Phil Hurvitz <phurvitz-at-u-dot-washington-dot-edu>

This document describes very briefly the differences between ZIP code "areas" and ZCTAs.


What are ZIP codes?

ZIP (an acronym for Zone Improvement Plan) codes were implemented in 1963 as a method to improve mail delivery service. ZIP code areas are technically not areas in the geometric sense; they are essentially lists of addresses that are used to make delivery of mail more efficient. ZIP code polygons can be constructed by grouping addresses and then circumscribing polygons around addresses with the same ZIP code. This is what commercial vendors of ZIP code boundaries do.
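As a rough illustration of circumscribing polygons around grouped addresses, the sketch below draws a convex hull around the addresses of each ZIP code. The coordinates are hypothetical, and a convex hull is only one simple way to build such a polygon; it is not necessarily the method any vendor uses.

```r
# Hypothetical geocoded addresses: x/y coordinates and the ZIP code of each.
addr <- data.frame(
  zip = c("98055", "98055", "98055", "98055", "98055",
          "98057", "98057", "98057", "98057"),
  x   = c(0, 2, 2, 0, 1, 3, 5, 5, 3),
  y   = c(0, 0, 2, 2, 1, 0, 0, 2, 2),
  stringsAsFactors = FALSE
)

# For each ZIP code, keep only the addresses that lie on the convex hull.
# chull() (package grDevices, loaded by default) returns the row indices
# of the hull vertices in clockwise order.
zip.hulls <- lapply(split(addr, addr$zip), function(a) a[chull(a$x, a$y), ])

# Polygon area by the shoelace formula, as a sanity check on each hull.
hull.area <- function(p) {
  n <- nrow(p)
  j <- c(2:n, 1)
  abs(sum(p$x * p$y[j] - p$x[j] * p$y)) / 2
}
zip.areas <- sapply(zip.hulls, hull.area)
print(zip.areas)
```

The interior address at (1, 1) is dropped from the first hull, which shows why such polygons describe where addresses are rather than any official boundary.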


What are ZCTAs?

ZCTA is an acronym for ZIP Code Tabulation Area. A good description of ZCTAs can be found at the US Census ZCTA FAQ. Specifically, from the FAQ:

Is there an equivalency or comparability data product that shows the relationship between Census 2000 ZCTAs™ (ZIP Code Tabulation Areas) and USPS 2000 ZIP Codes?

The Census Bureau is not planning to produce a 2000 ZIP Code to 2000 ZCTA relationship file. We created the ZCTAs specifically to address the inadequacies of ZIP Codes for census data tabulation.

For those who may want to do this, the TIGER/Line® files will continue to show address ranges with mailing ZIP Codes. These files can be processed using a GIS to compare the ZCTA code for a block to the mailing ZIP Code associated with the address ranges on each block side. Such a comparison can provide a general idea of how the two relate.

The relationship between ZIP Code and ZCTA can be determined fully only by comparing individual block-geocoded addresses to the ZCTAs. This process is quite involved. Some examples of why the process can become quite involved are as follows: ZCTAs follow census block boundaries. In contrast, USPS ZIP Codes serve addresses with no correlation to census block boundaries; therefore, the area covered by a ZCTA may include mailing addresses associated with ZIP Codes that are not the same as the ZCTA.

A ZCTA may include a mailing address with a unique or PO Box ZIP Code that is ineligible to become a ZCTA. Addresses with PO Box ZIP Codes generally cluster around a post office, but they may be widely scattered across several ZCTAs. Consequently, the relationships that exist between ZCTAs and ZIP Codes can become quite complicated, so that within the boundaries of a single ZCTA there may exist several ZIP Codes; likewise, within the boundaries of a single ZIP Code, there may exist more than one ZCTA.

Some addresses included in the census and used to define ZCTAs (typically in rural areas) have incomplete or, in some cases, no mailing ZIP Code, thus making it difficult to determine the full extent of the relationships between ZCTAs and ZIP Codes.
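The comparison the FAQ describes (block-geocoded addresses against ZCTAs) can be sketched with a simple cross-tabulation. The address table below is hypothetical; in practice it would come from geocoding real addresses to census blocks.

```r
# Hypothetical block-geocoded addresses: the mailing ZIP code of each address
# and the ZCTA of the census block that contains it.
addr <- data.frame(
  zip  = c("98055", "98055", "98057", "98057", "98056", "98056", "98059"),
  zcta = c("98055", "98055", "98055", "98055", "98056", "98059", "98059"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are ZCTAs, columns are ZIP codes, cells are address counts.
xtab <- table(zcta = addr$zcta, zip = addr$zip)
print(xtab)

# Number of distinct ZIP codes found inside each ZCTA, and vice versa.
zips.per.zcta <- rowSums(xtab > 0)
zctas.per.zip <- colSums(xtab > 0)
print(zips.per.zcta)
print(zctas.per.zip)
```

Any count greater than 1 in either vector marks a place where the relationship is not one-to-one.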


How different are ZIP code areas and ZCTAs?

In some cases they are very similar; in other cases they are not as similar as we would like. The following set of three images demonstrates this. Look at ZIP 98166 (Figure 1) and ZCTA 98116 (Figure 2) at the northwest corner of the map. The boundaries are nearly identical (a best-case scenario). Now look at ZIPs 98055 and 98057 (Figure 1, in the center of the map). Note that ZCTA 98057 does not exist.

Figure 1: ESRI ZIP code boundaries from the ESRI Maps and Data DVD for 2007


Figure 2: ZCTAs from US Census TIGER/Line files (2000)


These differences can be visualized more clearly by displaying both sets of data on one map (Figure 3). Specifically, look at ZCTAs 98055 (center), 98188 (just west of 98055), and 98031 (south-central). Some ZCTAs appear to be split rather cleanly into two ZIPs (e.g., ZCTA 98031) and others are less well behaved (e.g., ZIP 98027).

Figure 3: ZIP and ZCTA boundaries


Furthermore, many ZCTAs in the US Census data are represented by more than one polygon (e.g., where they span county boundaries), as shown in Figure 4.

Figure 4a: Multiple polygons for ZCTA 99013


Figure 4b: Multiple database records for ZCTA 99013
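A minimal sketch of how multi-record ZCTAs can be detected and collapsed in an attribute table is shown below. The table, the county labels, and the areas are hypothetical; only the idea that one ZCTA code can appear in several records comes from the figures above.

```r
# Hypothetical attribute table: ZCTA 99013 appears once per county it falls in.
zcta.polys <- data.frame(
  zcta       = c("99013", "99013", "99110"),
  county     = c("A", "B", "A"),
  shape.area = c(120.5, 30.5, 75.0),
  stringsAsFactors = FALSE
)

# ZCTAs that are represented by more than one record
n.records <- table(zcta.polys$zcta)
multi <- names(n.records[n.records > 1])
print(multi)

# One row per ZCTA, with the areas of its parts summed
zcta.area <- aggregate(shape.area ~ zcta, data = zcta.polys, FUN = sum)
print(zcta.area)
```

Any later join on the ZCTA code should use the collapsed table; otherwise counts for such ZCTAs are matched more than once.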


Why is this a problem?

We have detailed demographic data from the US Census for ZCTAs. We do not have demographic data for ZIP code areas. It is possible to use area-weighted methods to apportion statistics from ZCTAs to ZIP code areas. Suppose a sampling scheme were developed for randomly selecting households based on Census variables from ZCTAs (e.g., poverty status or race). A number of households will be specified for random selection from a given ZCTA or group of ZCTAs. Suppose that group includes ZCTA 98055. That list of ZCTAs is passed to a company that specializes in operationalizing surveys. They pass the list of ZCTA numbers to their telephone exchange number vendor. The telephone exchange provider selects exchanges that are in ZIP (not ZCTA) 98055. The problem is that neither the survey contractor nor the telephone number vendor has been told to also select telephone numbers from ZIP 98057 (which does not exist as a ZCTA). Thus, there will be "holes" in the list of selected telephone numbers.
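The "holes" can be found before a list is handed to a vendor by comparing the ZCTA numbers with the ZIP codes that actually overlap them. The sketch below assumes a hypothetical lookup table of overlapping ZCTA/ZIP pairs, such as one derived from a GIS overlay.

```r
# ZCTAs chosen by the sampling scheme
sampled.zctas <- c("98055")

# Hypothetical lookup of which ZIP code areas overlap each ZCTA
overlap <- data.frame(
  zcta = c("98055", "98055"),
  zip  = c("98055", "98057"),
  stringsAsFactors = FALSE
)

# ZIP codes a vendor would use if the ZCTA numbers are read as ZIP codes
naive.zips <- sampled.zctas

# ZIP codes that actually overlap the sampled ZCTAs
needed.zips <- unique(overlap$zip[overlap$zcta %in% sampled.zctas])

# The "holes": ZIP codes that are needed but would never be selected
missed.zips <- setdiff(needed.zips, naive.zips)
print(missed.zips)
```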

Overlaying the data sets in a GIS and computing area proportions makes it possible to assign ZCTA statistics to the ZIP code areas. For example, ZCTA 98031 is about 50% ZIP 98031 and 50% ZIP 98030. Also consider ZIP 98056, which completely contains ZCTA 98056 but also contains part of ZCTA 98059. ZIP 98027, in turn, is split between ZCTAs 98059 and 98027.
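A small worked example of area-weighted apportionment follows. All areas and population counts are invented for illustration; only the rough 50/50 split of ZCTA 98031 and the containment of ZCTA 98056 in ZIP 98056 follow the description above.

```r
# Hypothetical overlay of ZCTA and ZIP polygons: one row per overlap piece.
pieces <- data.frame(
  zcta       = c("98031", "98031", "98056", "98059", "98059"),
  zip        = c("98031", "98030", "98056", "98056", "98059"),
  piece.area = c(5, 5, 8, 2, 6),
  stringsAsFactors = FALSE
)

# Hypothetical ZCTA population counts
zcta.pop <- c("98031" = 60000, "98056" = 30000, "98059" = 32000)

# Original area of each ZCTA = sum of the areas of its pieces
zcta.area <- tapply(pieces$piece.area, pieces$zcta, sum)

# Proportion of each ZCTA that falls in each piece
pieces$prop <- pieces$piece.area / as.numeric(zcta.area[pieces$zcta])

# Apportion the ZCTA count to the pieces, then sum the pieces per ZIP
pieces$pop <- pieces$prop * as.numeric(zcta.pop[pieces$zcta])
zip.pop <- tapply(pieces$pop, pieces$zip, sum)
print(zip.pop)
```

With these numbers, ZIP 98056 receives all of ZCTA 98056 plus a quarter of ZCTA 98059, and the ZIP totals add up to the ZCTA totals, because apportioning only moves counts between areas.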


How to deal with this

Imagine a jigsaw puzzle made of cookie dough, with pieces in the shape of ZCTAs. Each puzzle piece has a different thickness (proportional to the value of a demographic variable). Now get a set of cookie cutters in the shape of ZIP code areas and cut the puzzle with them. Push the dough down inside each cutter so it is completely level within that cutter. These heights are the recalculated values. Repeat this for a series of puzzles, one per census demographic variable, each starting with its own set of piece thicknesses. Operationally:

  1. Use a personal geodatabase in ArcGIS. This is preferable because area is automatically calculated for polygon features. Using the geodatabase also allows ODBC connections from R, which avoids the need to create intermediate data files.
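  2. Have at hand the US Census ZCTA demographic data in Access format, as documented by the US Census (assume these are stored in ./zip_zcta/SF3_zcta.mdb). Create a query that collects the basic demographic fields, and name this query demographics. The FROM and GROUP BY clauses below assume that the three tables share the LOGRECNO key field; adjust the joins if your copy of the database is organized differently:

    SELECT SF3GEO.ZCTA5, Sum(SF30001.P001001) AS population, Sum(SF30001.P010001) AS numhh,
    Sum(SF30001.P006003) AS numblack, Sum(SF30001.P007010) AS numhispanic, Avg(SF30006.P053001) AS medhhinc
    FROM (SF3GEO INNER JOIN SF30001 ON SF3GEO.LOGRECNO = SF30001.LOGRECNO)
    INNER JOIN SF30006 ON SF3GEO.LOGRECNO = SF30006.LOGRECNO
    GROUP BY SF3GEO.ZCTA5
    HAVING (((SF3GEO.ZCTA5) Not Like "*HH"))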

  3. Perform the GIS analysis steps (note that this can all be done in an ArcGIS model).
    1. Dissolve ZIPs (by geometry) to create a statewide feature class circumscribing the complete coverage of ZIP code areas.
    2. Dissolve ZCTAs (by geometry) to create a statewide feature class circumscribing the complete coverage of ZCTA areas.
    3. Intersect the geometry-dissolved ZIP and ZCTA feature classes.
    4. Dissolve the intersected ZIP+ZCTA feature class to generate a single statewide clipping feature class.
    5. Dissolve ZCTAs by the zcta field (because there may be more than one polygon with the same ZCTA value).
    6. Clip the zcta-dissolved ZCTAs and original ZIPs with the statewide clipper, so each data set will share the same external boundary.
    7. Add fields to represent area to each data set (e.g., area_zcta_orig and area_zip_orig, the names the R code below expects).
    8. Calculate these new fields equal to shape_area. This is necessary because Arc will re-calculate the shape_area field when new geodatabase feature classes are created.
    9. Union the clipped data sets to a new polygon feature class (assume this is stored as ./zip_zcta.mdb/zip_zcta).
  4. Run the following R code to perform data extractions, proportion calculations, image creation, and data export (note you will need to set the working directory properly).

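    # handle proportions of zip & zctas

    library(RODBC)    # odbcConnectAccess(), sqlFetch(), odbcClose()
    library(lattice)  # xyplot(), trellis.device()

    # fix.colnames(), unfix.colnames() and extract.columns() are helper functions
    # that are not part of base R and are not listed in this document. The
    # definitions below are assumed stand-ins, inferred from how they are used here.
    fix.colnames <- function(x) gsub("_", ".", tolower(colnames(x)), fixed=TRUE)
    unfix.colnames <- function(x) gsub(".", "_", colnames(x), fixed=TRUE)
    extract.columns <- function(x, cols) x[, strsplit(cols, ",")[[1]], drop=FALSE]

    # read the data in from GDB
    wd <- "C:/zip_zcta"
    setwd(wd)  # the database file names below are relative to this directory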
    mdb <- odbcConnectAccess("zip_zcta.mdb")
    dat <- sqlFetch(mdb, "zip_zcta", as.is=T)
    colnames(dat) <- fix.colnames(dat)
    dat <- extract.columns(dat, "zcta,area.zcta.orig,zip,area.zip.orig,shape.area")
    dat <- with(dat, dat[zip!="" & zcta!="",])

    #ZCTA census data
    odbcClose(mdb)  # close the geodatabase connection before reusing the variable
    mdb <- odbcConnectAccess("SF3_zcta.mdb")
    zcta <- sqlFetch(mdb, "demographics", as.is=T)
    odbcClose(mdb)
    zcta$pct.black <- with(zcta, round(numblack / population * 100, 3))
    zcta$pct.hispanic <- with(zcta, round(numhispanic / population * 100, 3))
    colnames(zcta) <- fix.colnames(zcta)

    # merge
    dat.zcta <- merge(dat, zcta, by.x="zcta", by.y="zcta5", all.x=T, all.y=F)

    # proportion of ZCTA in this ZIP
    dat.zcta$prop.area.zcta <- round(with(dat.zcta, ifelse(area.zcta.orig > 0, shape.area / area.zcta.orig, 0)), 5)

    # multiply the proportion of ZCTA in this ZIP by the counts in the ZCTA
    dat.zcta$zip.population <- with(dat.zcta, prop.area.zcta * population)
    dat.zcta$zip.numhh <- with(dat.zcta, prop.area.zcta * numhh)
    dat.zcta$zip.numblack <- with(dat.zcta, prop.area.zcta * numblack)
    dat.zcta$zip.numhispanic <- with(dat.zcta, prop.area.zcta * numhispanic)
    dat.zcta$zip.totalincome <- with(dat.zcta, medhhinc * zip.numhh)

    # now sum each proportionalized variable per ZIP
    zip.population <- sapply(with(dat.zcta, by(zip.population, zip, sum)), sum)
    zip.numhh <- sapply(with(dat.zcta, by(zip.numhh, zip, sum)), sum)
    zip.numblack <- sapply(with(dat.zcta, by(zip.numblack, zip, sum)), sum)
    zip.numhispanic <- sapply(with(dat.zcta, by(zip.numhispanic, zip, sum)), sum)
    zip.totalincome <- sapply(with(dat.zcta, by(zip.totalincome, zip, sum)), sum)

    # combine variables into a data frame
    zip.medhhinc <- zip.totalincome/zip.numhh
    zip.demog <- as.data.frame(cbind(zip.population, zip.numhh, zip.numblack, zip.numhispanic, zip.medhhinc))

    # calculate percentages
    zip.demog$zip.pct.black <- round(zip.numblack / zip.population * 100, 3)
    zip.demog$zip.pct.hispanic <- round(zip.numhispanic / zip.population * 100, 3)
    zip.demog$zip <- rownames(zip.demog)

    # join the ZCTA values for comparison
    zip.demog1 <- merge(zip.demog, zcta, by.x="zip", by.y="zcta5")
    zip.demog1$diff.medhhinc <- with(zip.demog1, medhhinc - zip.medhhinc)
    zip.demog1$diff.pctblack <- with(zip.demog1, pct.black - zip.pct.black)
    zip.demog1$diff.pcthispanic <- with(zip.demog1, pct.hispanic - zip.pct.hispanic)
    zip.demog1$diff.numhh <- with(zip.demog1, numhh - zip.numhh)
    zip.demog1$diff.population <- with(zip.demog1, population - zip.population)

    # a plot of recalculated (ZIP) vs original (ZCTA) values
    # a trellis plot for multiple graphs on the same image
    p1 <- with(zip.demog1, xyplot(population ~ zip.population, ylab="ZCTA", xlab="ZIP calculated", main="population", col="black"))
    p2 <- with(zip.demog1, xyplot(numhh ~ zip.numhh, ylab="ZCTA", xlab="ZIP calculated", main="households", col="black"))
    p3 <- with(zip.demog1, xyplot(pct.hispanic ~ zip.pct.hispanic, ylab="ZCTA", xlab="ZIP calculated", main="% hispanic", col="black"))
    p4 <- with(zip.demog1, xyplot(pct.black ~ zip.pct.black, ylab="ZCTA", xlab="ZIP calculated", main="% black", col="black"))
    p5 <- with(zip.demog1, xyplot(medhhinc ~ zip.medhhinc, ylab="ZCTA", xlab="ZIP calculated", main="median HH income", col="black"))
    trellis.device("png", file="zip_zcta_comparison.png", width=6, height=9, units="in", res=92)
    print(p1, position=c(0,0,.5,.33), more=T)
    print(p2, position=c(.5,0,1,.33), more=T)
    print(p3, position=c(0,.33,.5,.66), more=T)
    print(p4, position=c(.5, .33, 1, .66), more=T)
    print(p5, position=c(0, .66, .5,1))
    dev.off()  # close the png device so the image file is written

    #write the values out
    colnames(zip.demog) <- unfix.colnames(zip.demog)
    write.csv(zip.demog, file=paste(wd, "zip_demographics.csv", sep="/"), row.names=F, quote=T)
    colnames(zip.demog) <- fix.colnames(zip.demog)

    colnames(zip.demog1) <- unfix.colnames(zip.demog1)
    write.csv(zip.demog1, file=paste(wd, "zip_demographics_comparison.csv", sep="/"), row.names=F, quote=T)
    colnames(zip.demog1) <- fix.colnames(zip.demog1)

  5. Now evaluate results. Figure 5 shows a scatterplot of original ZCTA values and re-aggregated ZIP code values. Figure 6 shows the difference between the data sets for two selected variables.

    Figure 5: Scatter plots of demographic variables comparing original ZCTA and re-aggregated ZIP code area data (statewide)

    Figure 6: Differences between ZCTA and recalculated ZIP code variables; note "holes" where new ZIP code numbers exist in ZCTAs that were subdivided
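To make the overlay table used by the R code more concrete, the toy sketch below mimics the result of GIS steps 7 to 9 with axis-aligned rectangles standing in for polygons. The shapes and identifiers are hypothetical, and a real analysis would use the GIS for arbitrary polygons; the point is only to show how shape.area and area.zcta.orig combine into the proportion of each ZCTA that falls in each ZIP.

```r
# Rectangles as xmin, ymin, xmax, ymax (hypothetical shapes)
zcta <- data.frame(id = c("Z1", "Z2"),
                   xmin = c(0, 4), ymin = c(0, 0), xmax = c(4, 8), ymax = c(4, 4),
                   stringsAsFactors = FALSE)
zip  <- data.frame(id = c("P1", "P2"),
                   xmin = c(0, 6), ymin = c(0, 0), xmax = c(6, 8), ymax = c(4, 4),
                   stringsAsFactors = FALSE)

# Every ZCTA/ZIP pair
idx <- expand.grid(i = seq_len(nrow(zcta)), j = seq_len(nrow(zip)))
a <- zcta[idx$i, ]
b <- zip[idx$j, ]

# Overlap width and height (zero when the rectangles do not intersect)
w <- pmax(0, pmin(a$xmax, b$xmax) - pmax(a$xmin, b$xmin))
h <- pmax(0, pmin(a$ymax, b$ymax) - pmax(a$ymin, b$ymin))

# One row per overlap piece, like the unioned feature class
overlay <- data.frame(
  zcta           = a$id,
  zip            = b$id,
  area.zcta.orig = (a$xmax - a$xmin) * (a$ymax - a$ymin),
  shape.area     = w * h,
  stringsAsFactors = FALSE
)
overlay <- overlay[overlay$shape.area > 0, ]

# Proportion of each ZCTA in each ZIP
overlay$prop.area.zcta <- overlay$shape.area / overlay$area.zcta.orig
print(overlay)
```

Because both data sets were clipped to the same outer boundary, the proportions for each ZCTA sum to 1, so no part of a count is lost when it is apportioned.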

Potentially better results can be obtained by aggregating from census block groups to ZIP code areas, because the original block group data contain more spatial variation than do ZCTA data. Figure 7 displays percent of residents that are black at the block group level (raw data; the smallest available spatial unit with these variables). Figure 8 shows a scatter plot of ZCTA data and re-aggregated block group data. How to do this is left as an exercise for the reader.

Figure 7: Percent of residents that are black by census block group (upper), block group data aggregated to ZIP code area (middle), and ZCTA data aggregated to ZIP code area (lower)

Figure 8: Scatter plots of demographic variables comparing original ZCTA and ZIP code area data re-aggregated from block groups (King County only)
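A minimal sketch of why block group data can give better results is shown below, using the same area-weighting idea as the ZCTA code. The block groups, proportions, and counts are all hypothetical, and the sketch assumes that the three block groups together make up a single ZCTA that spans two ZIP code areas.

```r
# Hypothetical block group / ZIP overlap pieces
bg <- data.frame(
  bg         = c("BG1", "BG2", "BG2", "BG3"),
  zip        = c("98055", "98055", "98057", "98057"),
  prop.area  = c(1, 0.5, 0.5, 1),   # share of the block group's area in this ZIP
  population = c(1000, 2000, 2000, 3000),
  numblack   = c(100, 600, 600, 1500),
  stringsAsFactors = FALSE
)

# Apportion block group counts to the pieces, then sum per ZIP
bg$zip.population <- bg$prop.area * bg$population
bg$zip.numblack   <- bg$prop.area * bg$numblack
zip.population <- tapply(bg$zip.population, bg$zip, sum)
zip.numblack   <- tapply(bg$zip.numblack, bg$zip, sum)
zip.pct.black  <- round(zip.numblack / zip.population * 100, 3)
print(zip.pct.black)

# The same ZIPs estimated from the enclosing ZCTA: area weighting scales the
# numerator and denominator equally, so every piece inherits the ZCTA-wide value.
first <- !duplicated(bg$bg)
zcta.pct.black <- round(sum(bg$numblack[first]) / sum(bg$population[first]) * 100, 3)
print(zcta.pct.black)
```

In this toy case the block group data separate the two ZIP code areas (20% and 45%), while the ZCTA-based estimate assigns both the same value of about 36.7%.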