“The tests are a sort of shadow play in which we pretend to measure ability while actually measuring advantage.”
— Jerome Karabel, The Chosen, 2005
The Design Brief — What the SAT Was Built to Measure
Carl Brigham designed the Scholastic Aptitude Test in 1926, drawing on Army Alpha intelligence tests developed during World War I. Brigham's original framework was explicitly eugenic: the test was intended to measure innate, heritable cognitive capacity — the kind of intelligence Brigham believed was distributed unequally across ethnic and racial groups. He published a 1923 book, A Study of American Intelligence, arguing on the basis of Army test data that immigration from Southern and Eastern Europe was degrading American cognitive stock. The College Board adopted his test as a selective admissions instrument.
Brigham subsequently repudiated his own work. In 1930, he published a retraction in Psychological Review, writing that he had made "a number of glaring errors" and that his earlier conclusions were "without foundation." He specifically rejected the assumption that the tests measured fixed, heritable intelligence rather than culturally specific educational preparation. His retraction did not stop the test's institutional adoption. By 1930, the SAT was already embedded in the admissions processes of the most selective universities, and the College Board had an institutional interest in its continued use.
The test's stated purpose was subsequently reframed: not innate intelligence, but academic aptitude — the capacity to succeed in college-level work. This framing required that the test demonstrate predictive validity against a college performance criterion. The criterion chosen was first-year GPA. The test's validation record against that criterion is the subject of Section II.
The Correlation Record — What the SAT Predicts
The SAT's predictive validity for first-year college GPA is r=0.35. This is a statistically significant but modest association: SAT scores explain approximately 12% of the variance in first-year academic performance, leaving roughly 88% of the variance to factors the test does not capture. This figure comes from the College Board's own validity studies, replicated across decades of research. It is not a contested number.
High school GPA has a predictive validity of approximately r=0.36–0.40 for first-year college performance, equal to or slightly better than SAT scores, depending on the study. Combining high school GPA with SAT scores improves predictive validity to approximately r=0.47, but the marginal contribution of the SAT beyond high school GPA is modest, as the calculation below makes explicit. A four-year record of actual academic performance in a real school environment predicts college success about as well as a four-hour test administered on a Saturday morning.
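As a check on the arithmetic, the sketch below computes the variance-explained figures and the SAT's increment over high school GPA from the correlations quoted above. The inputs are illustrative, since the exact coefficients vary by study and by whether range-restriction corrections are applied.

```python
# Coefficient of determination (r squared) and incremental validity,
# computed from the correlations quoted in the text.

r_sat = 0.35        # SAT alone vs. first-year college GPA
r_hsgpa = 0.40      # high school GPA alone (upper end of the quoted range)
r_combined = 0.47   # high school GPA and SAT together (multiple correlation)

print(f"SAT alone explains    {r_sat**2:.1%} of FYGPA variance")      # ~12%
print(f"HSGPA alone explains  {r_hsgpa**2:.1%}")                      # ~16%
print(f"Both together explain {r_combined**2:.1%}")                   # ~22%
print(f"SAT increment over HSGPA: {r_combined**2 - r_hsgpa**2:.1%}")  # ~6%
```

On these inputs, the SAT adds roughly six percentage points of explained variance once high school GPA is already in the model.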
The modest predictive validity is not surprising given what the test measures. The SAT is a timed test of specific cognitive skills — verbal reasoning, mathematical reasoning — under specific conditions. College academic success depends on a much wider set of capabilities: time management, sustained motivation, interest in subject matter, social integration, ability to seek help, and the kind of deep engagement with a discipline that manifests over months and years rather than hours. None of these are measured by the SAT. The test's predictive validity reflects the degree to which the specific skills it measures happen to overlap with the broader capability set that drives college success.
The Income Correlation — A Steady Rise Across Every Bracket
The College Board publishes annual data on SAT scores by family income bracket. The pattern is consistent across every year the data has been published, across all three sections of the test, and across all demographic subgroups: SAT scores rise with family income. The relationship is not merely that very high income produces very high scores and very low income produces very low scores — it is that each successive income bracket produces marginally higher average scores than the bracket below it, continuously, across the entire income distribution.
The aggregate correlation between SAT scores and family income is approximately r=0.43 — higher than the correlation between SAT scores and first-year college GPA. This is the central finding that this paper documents: a test designed to predict academic merit correlates more strongly with family economic resources than with the academic outcome it was designed to predict. The test is a better instrument for sorting by income than for predicting performance.
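Squaring the two coefficients makes the asymmetry concrete; this is simple arithmetic on the figures quoted above:

```latex
r_{\text{income}}^{2} = (0.43)^{2} \approx 0.185
\qquad
r_{\text{FYGPA}}^{2} = (0.35)^{2} \approx 0.123
```

On these figures, SAT scores share roughly half again as much variance with family income as with the first-year performance against which the test is validated.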
The income correlation has been documented since at least the 1980s and has not decreased over time. The College Board is aware of it — their published data makes it visible. Multiple reform efforts have attempted to address it: the addition of the essay section, the removal of obscure vocabulary questions, the "adversity score" proposal (withdrawn under political pressure in 2019), and various socioeconomic contextual scoring approaches. None has substantially altered the fundamental pattern. The income gradient in SAT scores is not a residual effect of an imperfect instrument. It is a structural feature of what the instrument measures.
The Preparation Effect — A $200M Industry Built on a Truth
The test preparation industry — Kaplan, Princeton Review, competitive private tutors, online platforms — exists because preparation for the SAT demonstrably improves scores. If the SAT measured fixed cognitive capacity, preparation would not work. The preparation industry's continued operation is empirical evidence that the test is, to a meaningful degree, measuring familiarity with specific question formats, test-taking strategies, and academic content that can be learned.
The Preparation Effect, as this paper names it, is the documented phenomenon in which SAT scores improve by 100–200 points with professional preparation, revealing that the test measures preparation quality alongside native ability, and that access to quality preparation rises with the same family income that correlates at r=0.43 with scores. The term names the mechanism through which the income correlation is produced: not simply that wealthy families have better-educated parents and more cognitively stimulating environments, but that they can purchase the specific preparation that converts existing ability into SAT performance. When a test can be prepared for, and preparation can be purchased, the test measures purchasing power alongside ability.
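A toy simulation makes the mechanism concrete. Everything in it is an illustrative assumption rather than an empirical estimate: ability is drawn independent of income (a deliberate simplification), preparation quality is largely purchased with income, and the weights are chosen so the resulting correlations land near the figures this paper quotes. The point is narrow: a purchasable-preparation channel alone is enough to make a score correlate more strongly with income than with the outcome it predicts.

```python
# Toy model of the Preparation Effect: purchasable preparation links
# family income to SAT scores, while first-year GPA depends on ability
# plus the broad capabilities the test does not measure. All weights
# are illustrative assumptions, not empirical estimates.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

income = rng.normal(size=n)    # standardized family income
ability = rng.normal(size=n)   # independent of income here (simplification)

# Preparation quality is largely purchased with income.
prep = 0.7 * income + 0.71 * rng.normal(size=n)

# SAT performance: ability converted through preparation, plus noise.
sat = 0.6 * ability + 0.6 * prep + 0.5 * rng.normal(size=n)

# First-year GPA: ability plus everything the test misses.
gpa = 0.55 * ability + 0.75 * rng.normal(size=n)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(f"corr(SAT, income) = {corr(sat, income):.2f}")   # ~0.43
print(f"corr(SAT, FYGPA)  = {corr(sat, gpa):.2f}")      # ~0.36
```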
The College Board's own research estimates that official preparation materials produce score improvements of approximately 20-30 points, and that the effect of comprehensive preparation courses is "modest." Independent research documents substantially larger effects for intensive preparation, particularly for students who began with scores in a range where specific skill gaps were addressable. The discrepancy between College Board estimates and independent research findings reflects, in part, the institutional interest the College Board has in not confirming that its flagship instrument can be purchased.
The preparation industry's annual revenue exceeds $200 million. Elite private tutors in major metropolitan areas charge $300-500 per hour. College-bound students from the top income quintile receive substantially more and higher-quality preparation than those from the bottom quintile — not because their parents value education more, but because they can afford the specific preparation that converts general cognitive ability into SAT performance. The result is a test score that reflects not merely what a student knows, but what their family could afford to teach them to demonstrate.
What Tests Miss — The Predictors That Don't Fit in a Bubble Sheet
The predictors of long-term academic and professional success that the SAT does not measure include, but are not limited to: growth mindset (the belief that ability can be developed through effort — Carol Dweck's research shows this predicts academic resilience more strongly than fixed ability measures); grit (sustained passion and perseverance toward long-term goals — Angela Duckworth's research documents its predictive validity for outcomes ranging from Army Special Forces completion to spelling bee performance); emotional intelligence; creative problem-solving; collaborative capacity; intrinsic motivation; and the ability to seek help when needed.
Research on non-cognitive predictors of college success consistently finds that these variables contribute independently to outcomes after controlling for standardized test scores and high school GPA. The National Bureau of Economic Research has published research finding that character skills — conscientiousness, openness to experience, emotional stability — predict educational attainment, labor market success, and health outcomes. None of these are captured by a four-hour timed test of verbal and mathematical reasoning.
The practical consequence is that selective college admissions processes that weight SAT scores heavily are optimizing admissions for a narrow slice of the capability space that predicts success. Students with exceptional creative, collaborative, or character-based capabilities who score below average on standardized tests are systematically deprioritized by processes that treat the test score as a primary filter. The test does not identify the full range of what it claims to identify. It identifies a specific, preparable, purchasable subset of it.
The High-Stakes Turn — No Child Left Behind and Goodhart's Law
The No Child Left Behind Act of 2001 transformed standardized testing from an admissions instrument into an accountability mechanism. Under NCLB, schools were ranked, funded, placed under corrective action, and restructured based on their students' performance on standardized tests aligned to state academic standards. The test score became not merely a signal about student learning but a determinant of institutional survival — a classic precondition for Goodhart's Law.
The documented consequence was instruction reorganized around test performance. Teachers reduced time devoted to subjects the tests did not cover: social studies, art, music, physical education. School days were restructured to maximize test preparation time. The content of instruction shifted toward the specific skills tested (question-format familiarity, vocabulary lists, mathematical procedures) rather than toward the deeper learning the tests were designed to measure. Test scores improved in many districts. Evidence from the National Assessment of Educational Progress, the "Nation's Report Card" administered independently of NCLB, showed substantially less improvement than state-administered tests over the same period.
The gap between state test improvement and NAEP improvement is the empirical signature of Goodhart's Law in education: when test scores became the target, they improved; when an independent measure of the underlying learning was applied, improvement was smaller. Schools learned to produce test scores. They did not learn, to the same degree, to produce the knowledge the test scores were supposed to represent.
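The dynamic can be sketched as a toy model, with all hours and weights as illustrative assumptions: a school splits instructional time between deep learning and test-specific drilling, the state test rewards both, and a NAEP-like audit reflects only the learning.

```python
# Goodhart's Law in miniature: once the state test becomes the target,
# time shifts into drilling, which the targeted metric rewards heavily
# and the independent audit does not. Hours and weights are illustrative.

def scores(hours_learning: float, hours_drilling: float) -> tuple[float, float]:
    """Return (state_test, independent_audit) for a given time split."""
    learning = hours_learning ** 0.5         # diminishing returns to deep learning
    drilling = 2.0 * hours_drilling ** 0.5   # cheap points on the targeted test
    return learning + drilling, learning     # the audit reflects learning only

# Before high stakes: most time goes to deep learning.
state_0, audit_0 = scores(hours_learning=80.0, hours_drilling=5.0)
# After the score becomes the target: a modest learning gain, a large
# reallocation of time into test preparation.
state_1, audit_1 = scores(hours_learning=85.0, hours_drilling=40.0)

print(f"state test: {state_0:5.1f} -> {state_1:5.1f}")   # rises sharply
print(f"audit:      {audit_0:5.1f} -> {audit_1:5.1f}")   # barely moves
```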
The Alternatives Record — What the Test-Optional Data Shows
The test-optional admissions movement accelerated during the COVID-19 pandemic, when standardized testing was unavailable to most applicants. Many universities that adopted test-optional admissions during 2020 and 2021 subsequently extended the policy or made it permanent, allowing researchers to compare outcomes for students admitted with and without submitted test scores. The data from these natural experiments is available and consistent: students admitted without submitting test scores perform comparably to those who submitted scores, controlling for high school GPA and other factors.
Research on test-optional schools finds that the absence of a submitted SAT score does not predict lower first-year GPA or higher attrition rates. The students who chose not to submit scores were not, on average, weaker academically than those who submitted scores — they were often students from under-resourced schools or low-income backgrounds who had lower test scores relative to their actual academic capability. Test-optional admissions increased the diversity of admitted classes without measurable decreases in academic performance. The finding is directionally consistent across multiple institutions and research teams.
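The analytic design behind these comparisons can be sketched as a regression of first-year GPA on high school GPA plus an indicator for whether the student submitted a score. The data below are synthetic and generated under the very assumption the studies report, namely that submission status carries no independent signal, so the sketch illustrates the design rather than re-deriving the finding.

```python
# Sketch of the test-optional comparison design: regress first-year GPA
# on high school GPA plus a submitted-score indicator. Synthetic data,
# generated under the assumption that submission carries no signal.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

hs_gpa = np.clip(rng.normal(3.4, 0.4, n), 0.0, 4.0)
submitted = (rng.random(n) < 0.6).astype(float)   # ~60% submit scores
fy_gpa = 0.8 * hs_gpa + rng.normal(0.0, 0.5, n)   # no submission effect

# Design matrix: intercept, HS GPA, submitted indicator.
X = np.column_stack([np.ones(n), hs_gpa, submitted])
beta, *_ = np.linalg.lstsq(X, fy_gpa, rcond=None)
print(f"coefficient on HS GPA:    {beta[1]:+.3f}")   # ~ +0.8
print(f"coefficient on submitted: {beta[2]:+.3f}")   # ~  0.0
```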
The argument for standardized testing is not that the SAT is a perfect instrument, but that it provides a common metric that corrects for grade inflation and cross-school inconsistency in grading standards. A 4.0 GPA from a rigorous private school and a 4.0 GPA from a school with inflated grading and limited course offerings are not equivalent — but they are indistinguishable from GPA data alone. A standardized test provides a consistent national benchmark that allows admissions committees to contextualize GPA data and identify students who may be outperforming their school's typical preparation level.
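A minimal sketch of that contextualization, with hypothetical applicants and school averages: the same 1380 reads very differently depending on the school it comes from.

```python
# Sketch of the "common metric" use case: an identical SAT score reads
# very differently once the applicant's school context is attached.
# Names, schools, and averages are hypothetical.
applicants = [
    {"name": "A", "school_avg_sat": 950,  "sat": 1380},  # under-resourced school
    {"name": "B", "school_avg_sat": 1350, "sat": 1380},  # well-resourced school
]

for a in applicants:
    lift = a["sat"] - a["school_avg_sat"]
    print(f"applicant {a['name']}: {lift:+d} points above school average")
```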
This argument has merit and is not dismissed by the evidence. The question is not whether standardized tests provide zero information — they provide some — but whether the information they provide, given its demonstrated income correlation and the documented preparation effect, justifies the weight assigned to it in admissions decisions. The evidence suggests the SAT is a useful supplementary signal. The institutional and cultural treatment of SAT scores as a primary merit indicator assigns it a weight the evidence does not support.
The Meritocracy Architecture — What the Test Performs
The function the SAT performs in the selective college admissions system is not merely predictive. It is legitimating. The test converts the socioeconomic sorting that selective admissions produces into a meritocratic narrative: the students who gain admission did not gain it because their families were wealthy enough to provide superior educational preparation and purchase test preparation services. They gained admission because they scored higher on an objective measure of academic ability. The test launders economic advantage as intellectual merit.
If the SAT correlated 0.00 with family income and 0.80 with academic performance, the legitimation function would be well-grounded. If it correlates 0.43 with income and 0.35 with performance — as the data shows — the legitimation function is performing work the instrument cannot justify. The test is claiming to measure merit while substantially measuring economic preparation. The claim and the reality have diverged in a way that the test's continued cultural authority obscures.
The named condition this paper documents — The Preparation Effect — is not an argument that standardized tests should be abolished or that academic preparation is irrelevant. It is a documentation of what the test actually measures when preparation is purchasable and when preparation quality correlates with family income. A test that can be substantially improved through preparation that costs thousands of dollars is not a pure measure of cognitive ability. It is a measure of cognitive ability plus the quality of preparation a family can purchase. Treating it as the former when it is substantially the latter is a category error with structural consequences — consequences that determine which 18-year-olds are labeled meritorious and which are not.
Selected References
- Brigham, C. C. (1923). A Study of American Intelligence. Princeton University Press.
- Brigham, C. C. (1930). Intelligence tests of immigrant groups. Psychological Review, 37(2), 158–165.
- Hezlett, S. A., et al. (2001). The effectiveness of the SAT in predicting success early and late in college: A comprehensive meta-analysis. Paper presented at the Annual Conference of the National Council on Measurement in Education.
- College Board. (2024). 2024 College-Bound Seniors Total Group Profile Report. Annual SAT score data by income bracket.
- Sackett, P. R., et al. (2012). The role of socioeconomic status in SAT-grade relationships and in college admissions decisions. Psychological Science, 23(9), 1000–1007.
- Karabel, J. (2005). The Chosen: The Hidden History of Admission and Exclusion at Harvard, Yale, and Princeton. Houghton Mifflin.
- Heckman, J., & Kautz, T. (2012). Hard evidence on soft skills. Labour Economics, 19(4), 451–464.
- Duckworth, A. L., et al. (2007). Grit: Perseverance and passion for long-term goals. Journal of Personality and Social Psychology, 92(6), 1087–1101.
- Dweck, C. S. (2006). Mindset: The New Psychology of Success. Random House.
- Atkinson, R. C., & Geiser, S. (2009). Reflections on a century of college admissions tests. Educational Researcher, 38(9), 665–676.
- Heilig, J. V., & Darling-Hammond, L. (2008). Accountability Texas-style: The progress and learning of urban minority students in a high-stakes testing context. Educational Evaluation and Policy Analysis, 30(2), 75–110.