Our data combines three public sources into a single dataset:
I. Common Core of Data (CCD): Used for district and subgroup enrollment counts.
II. EDFacts: U.S. Department of Education’s official accountability data system (2008–2019).
III. Zelma: A compiled database of state-reported accountability data (2022–2024).
Standardized Test Performance
The metadata centers on academic achievement metrics derived from state tests in mathematics and reading/language arts for grades 3–8. Each data file corresponds to one unit of aggregation (state or district) and one score scale, which supports consistent comparisons over time and across geography. The dataset prioritizes comparability across states: raw state test scores are converted into common nationwide metrics (NAEP-linked standardized scales), which makes it possible to track performance shifts before, during, and after COVID-19.
Educational Equity and Demographic Subgroups
Each dataset contains fields for subgroups (race/ethnicity, gender, and economic status) and subcategories (e.g., economically disadvantaged vs. non-disadvantaged), which let us measure disparities in achievement across demographic groups. Missing or suppressed subgroup results reflect a broader ethical and statistical focus on protecting student privacy and publishing only reliable subgroup comparisons.
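As a minimal sketch of how such a disparity could be computed, the snippet below takes the difference between two subgroup means and returns nothing when either value is missing or suppressed. The function name, subgroup keys, and numbers are hypothetical and are not taken from the dataset.

```python
from typing import Dict, Optional


def achievement_gap(
    means: Dict[str, Optional[float]], group_a: str, group_b: str
) -> Optional[float]:
    """Return mean(group_a) - mean(group_b).

    Returns None when either subgroup is absent or suppressed, so that
    no gap is reported from an unreliable comparison.
    """
    a = means.get(group_a)
    b = means.get(group_b)
    if a is None or b is None:
        return None
    return a - b


# Hypothetical district record; None marks a suppressed subgroup value.
district_means = {
    "not_econ_disadvantaged": 0.5,
    "econ_disadvantaged": -0.25,
    "small_subgroup": None,
}

econ_gap = achievement_gap(
    district_means, "not_econ_disadvantaged", "econ_disadvantaged"
)
suppressed_gap = achievement_gap(
    district_means, "small_subgroup", "econ_disadvantaged"
)
```

Returning None rather than zero keeps a suppressed comparison from being mistaken for "no gap".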
Data Reliability and Privacy Control
The archive’s metadata includes multiple flags and filters for estimated values, data reliability, and cross-year comparability. These elements indicate a prioritization of data quality, suppression rules, and anonymization, in line with U.S. Department of Education reporting standards.
How was the data generated?
The SEDA dataset was created by the Education Recovery Project (ERP) at Stanford University. Building the combined dataset involved:
- Collecting state accountability test data from EDFacts (2008–2019) and state-reported data from Zelma (2022–2024).
- Using National Assessment of Educational Progress (NAEP) data to link state-level results to a common national scale.
- Supplementing with Common Core of Data (CCD) to fill missing student counts.
- Aggregating results to state and district levels for grades 3–8 in math and reading/language arts.
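The linking and aggregation steps above can be illustrated with a simplified sketch. It assumes a linear link (a score standardized within its state is rescaled by that state's NAEP mean and standard deviation) and an enrollment-weighted average for aggregation. Both are illustrative assumptions rather than a description of the exact SEDA procedure, and all function names and numbers are hypothetical.

```python
from typing import Sequence


def link_to_naep(z_state: float, naep_mean: float, naep_sd: float) -> float:
    """Map a state-standardized score onto a NAEP-like scale.

    Assumes a simple linear link: z_state is in units of the state's own
    score distribution; naep_mean and naep_sd describe that state's
    NAEP distribution for the same grade, subject, and year.
    """
    return naep_mean + naep_sd * z_state


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Enrollment-weighted mean, e.g. for combining units into a larger one."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


# Made-up numbers: a unit scoring 0.5 state SDs above its state mean,
# in a state whose NAEP mean is 240 with an SD of 30.
linked_score = link_to_naep(0.5, naep_mean=240.0, naep_sd=30.0)

# Made-up numbers: two linked scores combined using enrollment counts
# (the kind of count the CCD supplies).
combined = weighted_mean([250.0, 260.0], [100, 300])
```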
Who or what organization funded the creation of the dataset?
Data construction was funded by the Bill & Melinda Gates Foundation. Additional acknowledged partners include Harvard University’s Center for Education Policy Research, Zelma, and researchers including Emily Oster and Clare Halloran.
What does our data not include?
The data does not capture factors that may help explain why scores vary over time, such as student mental states, parental involvement, or the diverse and often unequal environments in which students learn. It also excludes:
- Individual student-level data (only aggregated data used).
- Cells with fewer than 20 students to protect privacy.
- Highly imprecise estimates (e.g., standard error > 1).
- Data from subgroups or districts with incomplete or suppressed reporting.
- Years or grades with low participation (<94%) or non-standard tests.
- Detailed school-level data; only district and state aggregations are released.
- Some alternate assessments and non-primary test versions excluded for comparability.
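The numeric exclusion rules above can be expressed as a simple filter. The sketch below uses the three thresholds stated in the list (20 students, a standard error of 1, 94% participation); the record layout and field names are hypothetical and do not reflect the actual file schema.

```python
from dataclasses import dataclass
from typing import List

MIN_STUDENTS = 20          # cells with fewer students are withheld
MAX_STANDARD_ERROR = 1.0   # estimates with a larger SE are too imprecise
MIN_PARTICIPATION = 0.94   # years or grades below this rate are dropped


@dataclass
class Estimate:
    """One aggregated estimate (hypothetical layout)."""
    unit_id: str
    n_students: int
    standard_error: float
    participation_rate: float


def is_reportable(e: Estimate) -> bool:
    """True only if the estimate passes all three numeric rules."""
    return (
        e.n_students >= MIN_STUDENTS
        and e.standard_error <= MAX_STANDARD_ERROR
        and e.participation_rate >= MIN_PARTICIPATION
    )


def filter_reportable(estimates: List[Estimate]) -> List[Estimate]:
    """Keep only the estimates that would be released under these rules."""
    return [e for e in estimates if is_reportable(e)]


# Made-up records, one failing each rule and one passing all of them.
sample = [
    Estimate("A", n_students=150, standard_error=0.2, participation_rate=0.98),
    Estimate("B", n_students=12, standard_error=0.3, participation_rate=0.99),
    Estimate("C", n_students=80, standard_error=1.4, participation_rate=0.97),
    Estimate("D", n_students=60, standard_error=0.4, participation_rate=0.90),
]
kept_ids = [e.unit_id for e in filter_reportable(sample)]
```

The remaining exclusions (non-standard tests, alternate assessments, incomplete reporting) depend on flags in the source files and are not modeled here.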