Published On: Sat, Jul 25th, 2020

Hitting the Books: America needs a new public data system


MIT Press

Excerpted from Democratizing Our Data: A Manifesto by Julia Lane. Reprinted with permission from The MIT Press. Copyright 2020. On sale as an ebook now. On sale in print 9/1/2020.

Nowadays, when people have an appointment across town, their calendar app obligingly predicts how long it’s going to take to get there. When they go to Amazon to research books that might be of interest, Amazon makes helpful suggestions and asks for feedback on how to make its platform better. If they select photos from Google Photos, it suggests people to send them to, prompts with other photos it thinks are like the ones selected, and warns if the zip file is going to be especially big. Our apps today are aware of multiple dimensions of the data they manage for us, update that information in real time, and suggest options and possibilities based on those dimensions. In other words, the private sector sets itself up for success because it uses data to provide us with useful products and services.

The government, not so much. Lack of data makes Joe Salvo’s job much more difficult. He is New York City’s chief demographer, and he uses the Census Bureau’s American Community Survey (ACS) data to prepare for emergencies like Hurricane Sandy. He needs data to decide how to get older residents to physically accessible shelters: operationally, where to send a fleet of fifty buses to pick up and evacuate seniors. He needs data on the characteristics of the local population for the Mayor’s Office for People with Disabilities. He needs to identify areas with large senior populations to tell the Metropolitan Transportation Authority where to send buses. He needs to identify neighborhoods with significant vulnerable populations so that the Department of Health and Mental Hygiene can install emergency generators at Department of Health facilities. But the products of the federal statistical system do not provide him with the value that he needs. The most current data from the ACS, the prime source of information about the US population, are released two years after collection, and even those figures are five-year moving averages.
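To see why a five-year moving average can hide what a planner most needs to know, here is a minimal sketch in Python. The yearly counts are hypothetical, chosen only to illustrate the arithmetic, and the helper name `trailing_average` is an assumption made for this example; it is not how the Census Bureau actually produces ACS estimates.

```python
def trailing_average(values, window=5):
    """Return the mean of the last `window` entries of `values`."""
    if len(values) < window:
        raise ValueError("need at least `window` observations")
    recent = values[-window:]
    return sum(recent) / window


# Hypothetical yearly counts of seniors in one neighborhood (illustrative only).
# The population is flat for four years, then jumps in the most recent year.
yearly_seniors = [1000, 1000, 1000, 1000, 1500]

latest_year = yearly_seniors[-1]
estimate = trailing_average(yearly_seniors)

print(latest_year)  # 1500
print(estimate)     # 1100.0
```

In this made-up case the neighborhood has 1,500 seniors in the latest year, but the five-year figure reports 1,100, and that figure would reach the planner only after a further release lag.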

Creating value for the consumer is key to success in the private sector. The challenge for statistical agencies is to set themselves up for the same kind of success: producing high-quality data, measured against the same checklist, by providing access to data while protecting privacy and confidentiality.

The problem is that the checklist for agencies is even longer, with additional requirements that must be met so that Joe Salvo and his counterparts can do their jobs better. One requirement, given that the United States is a democracy, is that statistics should be as unbiased as possible, so that all residents, whatever their characteristics, are counted and treated equally in measurement. Correcting for the inevitable bias in source data is an important role for statistical agencies. Another requirement is that collecting the data is cost-effective, so that the taxpayer gets a good deal. A third requirement is that the information collected is consistent over time so that trends can easily be spotted and responded to. Agencies need outside help from both stakeholders and experts to ensure all these requirements are met. That requires access to data, which requires dealing with confidentiality issues.

The value that is generated when governmental agencies can straightforwardly provide access and produce new measures can be great. For example, the same people who bring you the National Weather Service and its weather predictions (the National Oceanic and Atmospheric Administration, or NOAA) have provided scientists and entrepreneurs with access to data to develop new products, such as predicting forest fires and providing real-time intelligence services for natural disasters in the United States and Canada. City transit agencies share transit data with private-sector app developers who produce high-quality apps that offer real-time maps of bus locations, expected arrival times at bus stops, and more.

But cases in which the government holds confidential data, as most statistical agencies do, are different. We need to be able to rely on our government to keep some data very private, but that will often mean that we have to give up some of the granularity of the government data that are produced. If, for example, the IRS provided so much information about taxpayers that it was possible to know how much money a given individual made, the public would be outraged.

So, many government agencies have to worry about two things: (1) producing data that have value and (2) at the same time ensuring that the confidentiality of data owners is protected. This can be done. Some smaller governments have succeeded better than others in creating data systems that live up to the checklist of desired features while at the same time protecting privacy.

Take the child services system as an example. To put child services in context, almost four in ten US children will be referred to their local government for possible child abuse or neglect by the time they’re eighteen. That’s almost four million referrals a year. Frontline caseworkers have to make quick decisions on these referrals. If they are wrong in either direction, the potential downside is enormous: Children incorrectly screened because of inadequate or inaccurate data could be ripped away from loving families. Or, conversely, also as a result of poor data, children could be left with abusive families and die. Furthermore, there could be bias in decisions, leaving black or LGBTQ parents more likely to be penalized, for example.

In 2014, Allegheny County’s Office of Children, Youth and Families (CYF) in Pennsylvania stepped up to the plate to use its internal data in a careful and ethical manner to help caseworkers do their job better. The results have captured national attention, as reported in a New York Times Magazine article. CYF brought in academic experts to design an automatic risk-scoring tool that summarizes information about a family to help the caseworker make better decisions. The risk score, a number between 1 and 20, makes use of a great deal of the information about the family in the county’s system, such as child welfare records, jail records, and behavioral health records, to predict adverse events that can lead to placing a child in foster care.
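The excerpt does not say how the tool turns a prediction into a number between 1 and 20. One common way to do this, shown here purely as an assumption, is to rank a model's predicted probability against a reference population and report which twentieth it falls into. The function names and the reference data below are hypothetical and are not taken from the county's system.

```python
from bisect import bisect_right


def ventile_cutoffs(reference_probs):
    """Return 19 cut points that split the reference values into 20 equal-size groups."""
    ordered = sorted(reference_probs)
    n = len(ordered)
    if n < 20:
        raise ValueError("need at least 20 reference values")
    return [ordered[(k * n) // 20] for k in range(1, 20)]


def risk_score(prob, cutoffs):
    """Map a predicted probability to a score from 1 (lowest) to 20 (highest)."""
    return bisect_right(cutoffs, prob) + 1


# Hypothetical reference population: 100 evenly spaced predicted probabilities.
reference = [i / 100 for i in range(100)]
cutoffs = ventile_cutoffs(reference)

print(risk_score(0.0, cutoffs))   # 1
print(risk_score(0.52, cutoffs))  # 11
print(risk_score(0.99, cutoffs))  # 20
```

A score built this way is relative: a 20 means "in the top twentieth of the reference group," not a specific probability of harm, which is one reason such a score is presented as an aid to the caseworker's judgment rather than a verdict.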

An analysis of the effectiveness of that tool showed that a child whose placement score at referral is the highest possible, 20, is twenty-one times more likely to be admitted to a hospital for a self-inflicted injury, seventeen times more likely to be admitted for being physically assaulted, and 1.4 times more likely to be admitted for suffering from an accidental fall than a child with a risk score of 1, the lowest possible. An independent evaluation found that caseworker decisions that were informed by the score were more accurate (cases were more likely to be correctly identified as needing help and less likely to be incorrectly identified as not needing help), case workloads decreased, and racial bias was likely to be reduced. Allegheny County hit every item on the eight-item checklist. It produced a new product that was used, was cost-effective, and produced real-time, accurate, complete, relevant, accessible, interpretable, granular, and consistent data. And CYF didn’t breach confidentiality. But most importantly, Allegheny County worked carefully and openly with advocates for parents, children, and civil rights to ensure that the program was not built behind closed doors. They worked, in other words, to ensure that the new measures were democratically developed and used.
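Statements such as "twenty-one times more likely" are relative risks: the rate of an outcome in one group divided by the rate in another. The sketch below shows that calculation in Python. The counts are hypothetical and deliberately do not reproduce the figures above, and the function name `relative_risk` is an assumption made for this example.

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Return (events_a / total_a) divided by (events_b / total_b).

    The ratio is computed by cross-multiplying, which avoids dividing
    two small floating-point rates by one another.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("group sizes must be positive")
    if events_b == 0:
        raise ValueError("comparison group has no events; ratio is undefined")
    return (events_a * total_b) / (events_b * total_a)


# Hypothetical counts (illustrative only):
# 30 hospital admissions among 1,000 children with the highest score,
# 10 admissions among 1,000 children with the lowest score.
rr = relative_risk(30, 1000, 10, 1000)
print(rr)  # 3.0
```

A relative risk describes how much more common an outcome is in one group than in another; it does not by itself say how common the outcome is in either group.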

The Allegheny County story is one illustration of how new technologies can be used to democratize the decision of how to balance the ever-present tradeoff between the utility of new measurement and the risk of compromising confidentiality. The county took advantage of the potential to create useful information that people and policy makers need while at the same time protecting privacy. That potential can be made real in other contexts by making the value of data clearer to the public. While that utility/cost tradeoff has typically been made by a small group of experts within an agency, there are many new tools that can democratize the decision by providing more information to the public. This chapter goes into more detail about the challenges of and new approaches to the utility/cost tradeoff. There are many lessons to be learned from past experiences.