Counting the Invisible: How Disparities Affect Sampling Methods Needed to Understand COVID-19

May 28, 2020

As policymakers across the country grapple with how to reopen local economies while still protecting the public’s health, many are looking to rapidly evolving evidence to mitigate risks. Policymakers need not be scientists to make evidence-based decisions, but they do need to appreciate the flaws or limitations that may underlie a particular study.

Recent community-based prevalence studies estimating how many people have been infected with SARS-CoV-2, the virus that causes COVID-19, have generated substantial interest. These studies often rely on antibody tests. Because antibodies circulate in the body long after recovery from the disease, seroprevalence studies (estimates of what percent of people have antibodies to the virus in their blood) can reveal what fraction of the population has previously been infected with the virus. Combining this with diagnosed case data provides a more complete picture of the size of the epidemic because asymptomatic and undiagnosed cases are counted.
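
To make that arithmetic concrete, the following Python sketch combines a seroprevalence estimate with a diagnosed case count. The function names and all of the numbers are hypothetical, chosen only for illustration; they do not come from any of the studies cited here.

```python
def estimated_infections(population_size, seroprevalence):
    """Estimated number of people ever infected, assuming the sample
    seroprevalence applies to the whole population."""
    if not 0.0 <= seroprevalence <= 1.0:
        raise ValueError("seroprevalence must be a fraction between 0 and 1")
    return population_size * seroprevalence


def infections_per_diagnosed_case(population_size, seroprevalence, diagnosed_cases):
    """How many estimated infections there are for every diagnosed case."""
    if diagnosed_cases <= 0:
        raise ValueError("diagnosed_cases must be positive")
    return estimated_infections(population_size, seroprevalence) / diagnosed_cases


# Hypothetical community: 1,000,000 residents, 3% with antibodies,
# and 5,000 diagnosed cases.
ratio = infections_per_diagnosed_case(1_000_000, 0.03, 5_000)
print(f"Roughly {ratio:.1f} estimated infections per diagnosed case")
```

With these made-up inputs, estimated infections are about six times the diagnosed count. The conclusion is only as good as the seroprevalence estimate that feeds it, which is why who gets sampled matters.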

Given the urgency created by the pandemic, many studies are being conducted on an ad hoc basis, with flawed study designs. We will not discuss some important caveats to these studies, such as the likelihood that a test detects antibodies (Caini et al.). Instead, we focus on who is included in the studies and how limitations of sampling methods affect the interpretation of results. The technical properties of antibody tests are likely to improve, but they will still not give us the right answer if our studies are not designed to achieve samples that closely represent the population.

Example Research Questions

Seroprevalence studies are used to directly answer questions like:

  • What percent of a population has already been infected and generated an immune response to the SARS-CoV-2 virus?
  • Are some groups of people differentially affected? For example, what are the racial/ethnic, geographic, or socioeconomic differences in who is infected? Are children as likely to be infected as adults?

Seroprevalence data, in combination with other information, inform calculations to answer other important questions, such as:

  • What fraction of people who are infected with SARS-CoV-2 become symptomatic, require medical treatment, or die?
  • Are some groups of people especially likely to become symptomatic if they are infected?
  • How contagious is the virus? That is, on average, how many additional people does each infected person infect?

Sampling shortcuts often used to expedite prevalence studies can systematically under-represent low-income individuals, certain racial/ethnic groups, and hidden or marginalized populations. These biases can give us the wrong answers to the above questions. For example, studies may incompletely represent the burden of the epidemic within important subgroups. There are massive differences in the disease burden of COVID-19 between populations, including differences in infection, hospitalization, and mortality risk (Azar et al.; Webb Hooper et al.). Sampling methods that under-represent populations at elevated risk of infection may lead to incorrect findings and misinformed policy decisions, thereby exacerbating health disparities.
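
The direction of this bias can be worked out with simple arithmetic. The Python sketch below is an illustration under stated assumptions, not an analysis of any real study: the two groups, their population shares, their prevalences, and their participation rates are all invented.

```python
def true_prevalence(groups):
    """Population prevalence: each group's prevalence weighted by its population share."""
    return sum(g["share"] * g["prevalence"] for g in groups)


def expected_sample_prevalence(groups):
    """Prevalence we expect to observe when groups join the study at different rates."""
    # A group's weight in the sample is its population share times its
    # participation rate, not its population share alone.
    weights = [g["share"] * g["participation"] for g in groups]
    positives = sum(w * g["prevalence"] for w, g in zip(weights, groups))
    return positives / sum(weights)


# Hypothetical two-group population (every value is an illustrative assumption).
groups = [
    {"name": "lower-risk, easy to reach", "share": 0.8, "prevalence": 0.02, "participation": 0.5},
    {"name": "higher-risk, hard to reach", "share": 0.2, "prevalence": 0.10, "participation": 0.1},
]

print(f"True prevalence:            {true_prevalence(groups):.3f}")
print(f"Expected sample prevalence: {expected_sample_prevalence(groups):.3f}")
```

In this made-up example the sample understates prevalence (about 2.4% observed against 3.6% in the population) because the higher-risk group is under-represented among participants.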

Possible Approaches

How has sampling been done in COVID-19 seroprevalence studies so far?

The first seroprevalence studies of COVID-19 infection history were fielded using ‘convenience samples’ of volunteers. One of the first studies used Facebook to advertise and enroll volunteers (Bendavid et al.); another enrolled people shopping at a specific set of grocery stores or big-box stores (New York Times); another used a marketing database (Los Angeles Times). These studies all generated substantial attention because the fraction of people with antibodies indicating a history of COVID-19 infection was much higher than the fraction of people who had been diagnosed with COVID-19 in those communities. More recent studies have tried to identify representative samples, for example using utility service registries as sampling frames (Miami Herald).

However, all of these sampling approaches may under-represent the people most affected by COVID-19. People experiencing homelessness are less likely to respond to a Facebook advertisement than people who are housed. College students residing on campus would not be included in utility service registries. Parents who are working full time in essential occupations may not have time in their day to schedule an appointment and travel to a test site. Individuals who face a threat of deportation may be unwilling to register with a research study.

The Ideal Sample

To estimate the population prevalence of any disease, we must start by defining the target population, for example, all residents of a specific geographic area. To know with certainty the prevalence of the disease, we would need to test everyone in the population, but we can do nearly as well by testing a random sample of members of the population. Drawing a random sample, however, requires a list (called a sampling frame) of everyone in the target population, a random sample of whom would then be selected, contacted, and tested. Some techniques (e.g., stratification or clustering during randomization) can improve feasibility without compromising representativeness. But if our list is incomplete, if we fail to contact everyone we select, or if not everyone agrees to participate, we no longer have a simple random sample. For many target populations, there is no available list of contact information for everyone in the population. Even with a list, it is difficult to successfully contact and enroll all of the selected individuals. For any sample, we must ask:

  • Was there a sampling frame that included everyone in the population?
  • Were all groups of individuals equally likely to participate in the study if they were selected?

For many seroprevalence studies to date, the answer to at least one of these questions is likely “no”.
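
As a minimal sketch of what drawing a random sample from a sampling frame looks like in practice, the Python below uses proportionate stratification by neighborhood. The frame, the `neighborhood` field, and the function name `stratified_sample` are hypothetical, and the sketch assumes the frame really does list everyone in the target population, which is exactly the assumption that often fails.

```python
import random
from collections import defaultdict


def stratified_sample(frame, stratum_of, fraction, seed=0):
    """Randomly draw the same fraction of people from every stratum of the frame."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced and audited

    # Group the frame into strata (for example, neighborhoods).
    strata = defaultdict(list)
    for person in frame:
        strata[stratum_of(person)].append(person)

    # Sample without replacement within each stratum, taking at least one person.
    sample = []
    for members in strata.values():
        n = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical sampling frame: 1,000 residents in two neighborhoods.
frame = [
    {"id": i, "neighborhood": "north" if i < 700 else "south"}
    for i in range(1000)
]

sample = stratified_sample(frame, lambda person: person["neighborhood"], fraction=0.1)
print(len(sample), "people selected for contact and testing")
```

Selection is only the first step. Anyone missing from `frame` has no chance of being chosen, and anyone selected who cannot be reached or declines to take part moves the realized sample away from the random one.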

Putting Methods Into Practice

What should we look for in best-practice studies? Ideal studies are guided by a sampling frame and often draw on multiple “lists” representing different subgroups of the population. Even when studies nominally have a sampling frame, it is important to ask who in the population might be omitted from that frame (e.g., not all residents would be listed on utility service registries). Because no study will be perfect, it is also important to track and report how often someone who was selected for the study could not be reached or decided not to participate. This information can be used to honestly express uncertainty about what we learned: given who was missing, how precise is our estimate?
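
One simple way to express that uncertainty, offered here as an illustration rather than as the method of any study discussed above, is to report the response rate alongside worst-case bounds: the prevalence if every person who was selected but not tested were negative, and the prevalence if every one of them were positive. The function names and counts in this Python sketch are hypothetical, and the bounds ignore test error and ordinary sampling error.

```python
def response_rate(n_selected, n_tested):
    """Fraction of selected individuals who were actually tested."""
    return n_tested / n_selected


def worst_case_bounds(n_selected, n_tested, n_positive):
    """Lowest and highest prevalence among everyone selected that is
    consistent with the observed results."""
    if not 0 <= n_positive <= n_tested <= n_selected or n_selected == 0:
        raise ValueError("expected 0 <= n_positive <= n_tested <= n_selected, n_selected > 0")
    n_missing = n_selected - n_tested
    lower = n_positive / n_selected                # all untested people negative
    upper = (n_positive + n_missing) / n_selected  # all untested people positive
    return lower, upper


# Hypothetical study: 1,000 people selected, 600 tested, 30 positive.
selected, tested, positive = 1000, 600, 30
low, high = worst_case_bounds(selected, tested, positive)

print(f"Response rate:       {response_rate(selected, tested):.0%}")
print(f"Observed prevalence: {positive / tested:.1%}")
print(f"Worst-case range:    {low:.1%} to {high:.1%}")
```

The range is deliberately pessimistic, and in practice researchers narrow it using what they know about who was missed. It does make plain that a low response rate leaves a great deal unknown.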

Another valuable approach is to randomly select, for extra recruitment efforts, some of the individuals who were not enrolled initially. Reaching out again to these individuals, offering incentives, and making participation more convenient are all strategies to improve study participation. No matter what approach is used, successfully reaching a diverse set of communities entails working with people who understand and are trusted in those communities, from the research design period through implementation.

Uncertainty is a natural part of conducting research. The goal is to employ well-thought-out sampling methods and record information that will tell us what can or cannot be learned from the findings. Ensuring our studies produce findings that represent all groups in a population, including marginalized or disadvantaged communities, is crucial to guiding good decision-making.

Tools, Resources, and Citations

The evidence on COVID-19 is unfolding so quickly that many of the most relevant studies are available only via a press release or in preprints, prior to peer review.

Antibody Seroprevalence Studies

Critiques about antibody seroprevalence studies

Other topics

About the Authors

Meghan Morris, PhD is an epidemiologist engaged in research at the intersection of social justice and infectious disease transmission. She has collaborated on community-centered epidemiologic research with people who inject drugs and medically underserved communities for over a decade.

Maria Glymour, ScD is a social epidemiologist and Associate Professor in the Department of Epidemiology and Biostatistics at the University of California, San Francisco. She has dedicated much of her career to overcoming methodological problems encountered in observational epidemiology, in particular analyses of social determinants of health and dementia risk.