SolandAssessment.org

Soland Psychometrics & Assessment

Welcome to my site. I am a psychometrician by training, which means I help develop assessments (tests, surveys, teacher observational protocols), find defensible ways to score them, and investigate whether they are being used appropriately (validly) for their intended purposes. I chose this (admittedly niche) career because I believe good social science begins with good measurement. While I've had some great teachers over the years, I've also spent a lot of time teaching myself and, just maybe, shaking my fists at the sky over the lack of available, digestible instructional material. This web site is meant to help others on their measurement journey by sharing some of the best resources it's taken me years to accrue.

As a heads up, I was once told that psychometricians, deep down, really wanted to be accountants, but couldn't stand the excitement. Some of this material is…detailed. Buyer beware.

Jim Soland, Ph.D., is an Associate Professor of Research, Statistics, and Evaluation at the University of Virginia School of Education and Human Development. He has a courtesy faculty appointment in UVA Psychology, is an Associate Editor at the journal Educational Assessment, serves on multiple state assessment Technical Advisory Committees (TACs), and is a Faculty Affiliate in the Max Planck Research School on The Life Course (LIFE). His research examines how measurement decisions shape our understanding of academic, psychological, and socio-emotional development, including how measurement affects evaluation of programs/interventions supporting that development. More recent work focuses on survey design using AI, and on measure development and validation in sub-Saharan Africa. Prior to joining the University of Virginia, Jim completed a doctorate in Educational Psychology at Stanford University with a concentration in measurement. He has also served as a Senior Research Scientist at NWEA, a policy analyst at the RAND Corporation, a Senior Policy Analyst at the Legislative Analyst's Office (LAO) in California, and an adjunct professor at Oregon State University.

Curriculum Vitae: CV (PDF)

Google Scholar Page: Google Scholar

Study with Us at UVA: MA or PhD

Favorite textbook or learning resource?

I have two favorite textbooks, both of which treat latent variables as a unified statistical framework. One is more conceptual—Bartholomew et al.'s Analysis of Multivariate Social Science Data. The other is much more technical—Skrondal and Rabe-Hesketh's Generalized Latent Variable Modeling. Together, they make clear that IRT, SEM, mixture models, and related approaches are not separate methods, but all instances of latent variable models.

One go-to article by someone else?

Two classic examples that explicitly crosswalk SEM and IRT are articles by Wirth & Edwards and Kamata & Bauer. I also particularly like the paper by Ferron & Hess because it makes very explicit what SEM is doing "under the hood" when estimating a model.
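As a rough illustration of what such a crosswalk looks like for a single binary item, the sketch below converts a standardized factor loading and threshold into normal-ogive IRT discrimination and difficulty parameters, then checks that both parameterizations give the same response probability. It is not code from either article. The function names are hypothetical, and it assumes a standard normal latent variable and a latent response standardized to unit variance.

```python
import math

def factor_to_irt(loading, threshold):
    """Convert standardized factor-analytic parameters for one binary item
    into normal-ogive IRT discrimination (a) and difficulty (b).

    Assumptions (illustrative): latent factor ~ N(0, 1); the latent response
    y* has unit variance, so its residual variance is 1 - loading**2; and
    the observed response is 1 when y* exceeds the threshold.
    """
    if not 0 < abs(loading) < 1:
        raise ValueError("loading must be nonzero and strictly between -1 and 1")
    a = loading / math.sqrt(1.0 - loading ** 2)
    b = threshold / loading
    return a, b

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_factor(theta, loading, threshold):
    """P(y* > threshold | theta) under the factor-analytic parameterization."""
    resid_sd = math.sqrt(1.0 - loading ** 2)
    return normal_cdf((loading * theta - threshold) / resid_sd)

def prob_irt(theta, a, b):
    """P(y = 1 | theta) under the normal-ogive IRT parameterization."""
    return normal_cdf(a * (theta - b))

a, b = factor_to_irt(loading=0.6, threshold=0.3)
for theta in (-2.0, 0.0, 1.5):
    # The two columns should agree up to floating-point error.
    print(theta, prob_factor(theta, 0.6, 0.3), prob_irt(theta, a, b))
```

The two probabilities match because (loading * theta - threshold) / resid_sd can be rearranged algebraically into a * (theta - b). Other identification choices, such as fixing the residual variance instead of the total variance, lead to different conversion formulas.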

Article of yours you wish more people knew about?

Megan Kuhfeld, Kelly Edwards, and I wrote a tutorial that walks through practical options for scoring measures in studies with multiple groups (e.g., treatment and control), multiple timepoints (e.g., pre–post designs or growth models), or both. The goal is to demystify how psychometricians think about measurement tradeoffs in real applied settings. I also wish more people knew about another article of mine that, to me, suggests growth mixture models shouldn't be a thing anymore.

Featured Articles

Why the replication crisis may be a measurement crisis…

Is the replication crisis a measurement crisis? Evidence from over 100 randomized trial outcomes

Gilbert, J. B., & Soland, J. PsyArXiv preprint.

Randomized controlled trials (RCTs) are the gold standard for causal inference, yet the validity of RCT conclusions depends not only on randomization but also on how outcomes are measured and scored. This study analyzes item-level data from 112 RCT outcome measures spanning psychology, medicine, public health, and education to test whether foundational measurement assumptions are met and whether alternative scoring approaches alter conclusions. Across disciplines, measurement assumptions are rarely evaluated and frequently violated. For nearly half of outcomes, the single score used is not likely a plausible representation of the data. Moreover, when outcomes are scored using statistical models aligned with the study design and better matching the data instead of sum scores, the proportion of pre/post trials reporting statistically significant treatment effects approximately doubles. These findings indicate that routine measurement decisions can systematically shift causal inferences, suggesting that a largely unexamined aspect of RCT analysis may be at the heart of replication failures.

Why growth mixture models should probably die a quiet death…

Evidence that growth mixture model results are highly sensitive to scoring decisions

Soland, J., Cole, V., Tavares, S., & Zhang, Q. (2025). Multivariate Behavioral Research, 60(3), 487–508.

Interest in identifying latent growth profiles to support the psychological and social-emotional development of individuals has translated into the widespread use of growth mixture models (GMMs). In most cases, GMMs are based on scores from item responses collected using survey scales or other measures. Research already shows that GMMs can be sensitive to departures from ideal modeling conditions and that growth model results outside of GMMs are sensitive to decisions about how item responses are scored, but the impact of scoring decisions on GMMs has never been investigated. This study begins to close that gap through empirical and Monte Carlo studies, showing that GMM results—including convergence, class enumeration, and latent growth trajectories within class—are extremely sensitive to seemingly arcane measurement decisions.

Exploring Whether Response Style Biases Manifest as Spurious Classes in Longitudinal Mixture Models

Cole, V. T., Soland, J., Zhang, Q., & Tavares, S. (2025). Structural Equation Modeling: A Multidisciplinary Journal.

Growth mixture models (GMMs) are used to identify unobserved groups based on longitudinal data. The current study investigated whether failing to model response styles, person-specific patterns of item responses unrelated to the latent variable, may lead researchers to overestimate the number of classes. In a simulation study, data were generated from a single-class model with socially desirable responding. We compared sum scores, multidimensional IRT scores, and scores from a model accounting for response styles. No method consistently recovered the correct single-class solution, though accounting for response styles helped when response style effects were strong. In an empirical study of growth mindset in middle-school children (N from 2,319 to 3,466), we found that the scoring method used to generate scores yielded some differences in predicted trajectories. Results suggest response styles may contribute to spurious class extraction in GMMs, but this effect is difficult to disaggregate from other known issues with GMMs.

How to avoid bias from scoring in intervention studies…

How survey scoring decisions can influence your study's results: A trip through the IRT looking glass

Soland, J., Kuhfeld, M., & Edwards, K. (2024). Psychological Methods, 29(5), 1003.

Though much effort is often put into designing psychological studies, the measurement model and scoring approach employed are often an afterthought, especially when short survey scales are used. One possible reason that measurement gets downplayed is that there is generally little understanding of how calibration and scoring approaches could impact common estimands of interest, including treatment effect estimates, beyond random noise due to measurement error. In this study, three motivating examples are provided in which surveys are used to understand individuals' underlying social-emotional and personality constructs, demonstrating the potential consequences of measurement and scoring decisions. As the analyses show, the decisions researchers make about how to calibrate and score the survey used have consequences that are often overlooked, with likely implications for both conclusions drawn from individual psychological studies and replications of those studies.

Evidence that selecting an appropriate IRT-based approach to scoring surveys can help avoid biased treatment effect estimates

Soland, J. (2022). Educational and Psychological Measurement, 82(2), 376–403.

Considerable thought is often put into designing randomized control trials (RCTs), yet when psychological constructs measured using survey scales are the outcome of interest, measurement is often an afterthought. The purpose of this study is to examine how choices about scoring and calibration of survey item responses affect recovery of true treatment effects. Simulation and empirical studies compare the performance of sum scores—which are frequently used in RCTs in psychology and education—to that of approaches rooted in item response theory (IRT) that better account for the longitudinal, multigroup nature of the data. Results indicate that a multigroup longitudinal IRT approach performs best, particularly when measurement noninvariance is present across treatment and control groups, with sum score approaches understating true treatment effects by 25% or more.

When Should Evaluators Lose Sleep Over Measurement? Toward Establishing Best Practices

Soland, J., Edwards, K., & Talbert, E. (2025). Journal of Research on Educational Effectiveness, 18(3), 474–506.

Evaluators often invest much effort into designing evaluation studies. However, there is evidence that less attention is paid to measurement. One possible explanation is that focus in applied psychometrics is on reliability, with less placed on measurement model misspecification and the bias it can introduce into estimates that use resultant scores. Another possible explanation is that evaluators frequently want to use the simplest scoring approach possible under the assumption that it is transparent and therefore relies on fewer assumptions—a mindset that is often, if not always, misguided. In this study, we walk through the decisions involved in producing scores for program evaluation studies in an attempt to demystify the psychometrics, as well as show how related decisions can be consequential. We use Monte Carlo simulations to illustrate the effects of those decisions in a randomized control trial, then show that these decisions can impact published evaluation results. Finally, we try to give evaluators best practices in scoring for evaluation, including understanding when deviating from those practices is most likely to impact their work.

How to avoid bias from scoring in growth models…

Avoiding bias from sum scores in growth estimates: An examination of IRT-based approaches to scoring longitudinal survey responses

Kuhfeld, M., & Soland, J. (2022). Psychological Methods, 27(2), 234–260.

A large portion of what we know about how humans develop, learn, behave, and interact is based on survey data. Researchers use longitudinal growth modeling to understand the development of students on psychological and social-emotional learning constructs across elementary and middle school, with growth typically measured using sum scores or scale scores produced by item response theory (IRT) methods. Although there is a great deal of guidance on scaling and linking IRT-based large-scale educational assessments to facilitate the estimation of examinee growth, little of this expertise is brought to bear in the scaling of psychological and social-emotional constructs. Through a series of simulation and empirical studies, this study produces scores using sum scores and multiple IRT approaches and compares the recovery of true latent growth parameters, demonstrating that certain IRT-based approaches substantially reduce bias relative to sum scores.

How Scoring Approaches Impact Estimates of Growth in the Presence of Survey Item Ceiling Effects

Edwards, K. D., & Soland, J. (2024). Applied Psychological Measurement, 48(3), 147–164.

Survey scores are often the basis for understanding how individuals grow psychologically and socio-emotionally. A known problem with many surveys is that the items are all "easy"—that is, individuals tend to use only the top one or two response categories on the Likert scale. Such an issue could be especially problematic, and lead to ceiling effects, when the same survey is administered repeatedly over time. In this study, we conduct simulation and empirical studies to (a) quantify the impact of these ceiling effects on growth estimates when using typical scoring approaches like sum scores and unidimensional item response theory (IRT) models and (b) examine whether approaches to survey design and scoring, including employing various longitudinal multidimensional IRT (MIRT) models, can mitigate any bias in growth estimates. We show that bias is substantial when using typical scoring approaches and that, while lengthening the survey helps somewhat, using a longitudinal MIRT model with plausible values scoring all but alleviates the issue. Results have implications for scoring surveys in growth studies going forward, as well as understanding how Likert item ceiling effects may be contributing to replication failures.
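To make the ceiling-effect idea concrete, here is a minimal simulation sketch. It is not the simulation design from the article. Everything in it (six four-category items, a loading of 0.7, the two threshold sets, and a true latent gain of one standard deviation) is an assumption chosen for illustration. It compares the standardized pre/post gain in sum scores when the items are "easy" with the gain when the items are targeted near the middle of the latent distribution.

```python
import random
import statistics

def simulate_sum_scores(theta, thresholds, n_items=6, loading=0.7, rng=None):
    """Return one sum score per person for a set of ordinal items.

    Each item response is the number of thresholds exceeded by a latent
    response y* = loading * theta + residual (unit total variance when
    theta ~ N(0, 1)).
    """
    rng = rng or random.Random()
    resid_sd = (1.0 - loading ** 2) ** 0.5
    scores = []
    for t in theta:
        total = 0
        for _ in range(n_items):
            y_star = loading * t + rng.gauss(0.0, resid_sd)
            total += sum(y_star > tau for tau in thresholds)
        scores.append(total)
    return scores

def standardized_gain(thresholds, true_gain=1.0, n=2000, seed=1):
    """Mean pre/post change in sum scores, in pre-test SD units."""
    rng = random.Random(seed)
    theta_pre = [rng.gauss(0.0, 1.0) for _ in range(n)]
    theta_post = [t + true_gain for t in theta_pre]  # everyone grows by true_gain
    pre = simulate_sum_scores(theta_pre, thresholds, rng=rng)
    post = simulate_sum_scores(theta_post, thresholds, rng=rng)
    return (statistics.mean(post) - statistics.mean(pre)) / statistics.stdev(pre)

# Hypothetical threshold sets on the latent-response scale.
EASY = (-2.0, -1.5, -1.0)      # most respondents start in the top category
TARGETED = (-1.0, 0.0, 1.0)    # categories spread across the distribution

gain_easy = standardized_gain(EASY)
gain_targeted = standardized_gain(TARGETED)
print(f"easy items: {gain_easy:.2f}, targeted items: {gain_targeted:.2f}")
```

Under these assumptions, the sum score built from easy items registers noticeably less of the same underlying growth, because many respondents are already at or near the top category before any growth occurs.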

Do response styles affect estimates of growth on social-emotional constructs? Evidence from four years of longitudinal survey scores

Soland, J., & Kuhfeld, M. (2021). Multivariate Behavioral Research, 56(6), 853–873.

Survey respondents employ different response styles when they use the categories of the Likert scale differently despite having the same true score on the construct of interest—for example, by being more likely to use the extremes of the response scale independent of their true score. Research already shows that differing response styles can create a construct-irrelevant source of bias that distorts fundamental inferences made from survey data. While some initial studies examine the effect of response styles on survey scores in longitudinal analyses, the issue of how response styles affect estimates of growth is underexamined. In this study, empirical and simulation analyses are conducted in which surveys are scored using IRT models that do and do not account for response styles, and those different scores are then used in growth models to quantify the impact on growth estimates.

Tutorials

Curated tutorials and exemplars, organized by topic. Titles link directly to the resource.

Growth Modeling

Latent class growth modeling: A tutorial

Authors: Andruff, Carraro, Thompson, Gaudreau, Louvet

12 frequently asked questions about growth curve models

Authors: Curran, Obeidat, Losardo

The separation of between-person and within-person components of individual change over time: A latent curve model with structured residuals

Authors: Curran, Howard, Bainter, Lane, McGinley

Item Response Theory

A note on the relation between factor analytic and item response theory models

Authors: Kamata, Bauer

How survey scoring decisions can influence your study's results: A trip through the IRT looking glass

Authors: Soland, Kuhfeld, Edwards

Item factor analysis: Current approaches and future directions

Authors: Wirth, Edwards

Modeling item-level heterogeneous treatment effects

Authors: Gilbert

Structural Equation Modeling

A trifactor model for integrating ratings across multiple informants

Authors: Bauer, Howard, Baldasaro, Curran, Hussong, Chassin, Zucker

A more general model for testing measurement invariance and differential item functioning

Authors: Bauer

Estimation in SEM: A concrete example

Authors: Ferron, Hess

A note on the relation between factor analytic and item response theory models

Authors: Kamata, Bauer

Regression discontinuity designs in a latent variable framework

Authors: Soland, Johnson, Talbert

Item factor analysis: Current approaches and future directions

Authors: Wirth, Edwards

Course Materials

High-quality course materials, lecture notes, and videos from trusted methods instructors and labs.

CenterStat (Formerly Curran Bauer Analytics)

Matrix Algebra Review · Intensive Longitudinal Data · CenterStat YouTube Channel · SEM in R Notes

Lesa Hoffman

Factor Analysis and SEM · Latent Traits and SEM

Jason Newsom

Teaching Materials

Jonathan Templin

Teaching Materials

IDRE at UCLA

Exploratory Factor Analysis · Confirmatory Factor Analysis · Structural Equation Modeling · Latent Growth Curve Models

Textbooks

Recommended textbooks for learning measurement, latent variable modeling, and related methods. Listed in alphabetical order by first author.

Measurement Theory and Applications for the Social Sciences

Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. Guilford Publications.

Amazon

Note. Excellent for walking through the basics of instrument design, as well as simple descriptive analyses that can be done to understand whether those instruments are working as intended.

Analysis of Multivariate Social Science Data (2nd ed.)

Bartholomew, D. J., Steele, F., Moustaki, I., & Galbraith, J. I. (2nd ed.). Analysis of Multivariate Social Science Data.

Amazon

Note. Outstanding and much more conceptual. Starts with cluster analysis and walks through regression, factor analysis, factor analysis with binary variables (IRT), SEM, and then multilevel models.

Historical and Conceptual Foundations of Measurement in the Human Sciences

Briggs, D. C. Historical and Conceptual Foundations of Measurement in the Human Sciences.

Amazon

Note. Gets into the history of psychometrics — the good, the bad, and the ugly.

Statistical Methods for the Social and Behavioural Sciences: A Model-Based Approach (1st ed.)

Flora, D. Statistical Methods for the Social and Behavioural Sciences: A Model-Based Approach (1st ed.).

Amazon

Designing Monte Carlo Simulations in R

Miratrix, L. W., & Pustejovsky, J. E. Designing Monte Carlo Simulations in R.

Online Book

Note. This is an excellent text on how to conduct high-quality, reproducible simulation studies.

Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models

Skrondal, A., & Rabe-Hesketh, S. Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models.

Amazon

Note. Outstanding but more advanced; makes strong connections among latent variable models and includes helpful matrix algebra examples. The first lecture in my SEM class draws heavily from Chapter 1.

Datasets

Below are data repositories highlighted in a recent Psychometrika special-issue call on data-intensive methods in psychometrics.

Highlighted Repositories

Item Response Warehouse (IRW)

A harmonized collection of item response datasets spanning many measures and domains.

Attentional Control Data Collection

Data from attentional control tasks.

openESM

Data from experience sampling (ESM) studies.

PrefLib

A library of preference data (rankings, choices, and related formats).

Wordbank

A database of children's vocabulary development.

Grant Proposals

Funded grant proposals shared here as examples and resources.

NSF: Understanding How Approaches to Calibrating and Scoring Survey Item Responses Affect Results from Growth Mixture Models

📄 Grant Proposal (PDF)

NCME Tools

Standards for Educational and Psychological Testing

APA/AERA/NCME joint standards — the foundational reference for testing practice.

ITEMS Portal (YouTube)

NCME's Instructional Topics in Educational Measurement Series — video tutorials on a wide range of measurement topics.

ITEMS Modules (NCME)

Written instructional modules from NCME's ITEMS series, covering foundational and advanced measurement topics.

NCME Task Force on Foundational Competencies in Educational Measurement

Resources from the NCME task force on defining and building core competencies in educational measurement.

Code & Appendices

My Articles

How survey scoring decisions can influence your study's results: A trip through the IRT looking glass

Soland, J., Kuhfeld, M., & Edwards, K. (2024). Psychological Methods, 29(5), 1003.

📄 Appendices with Code (.docx)

How scoring approaches impact estimates of growth in the presence of survey item ceiling effects

Edwards, K. D., & Soland, J. (2024). Applied Psychological Measurement, 48(3), 147–164.

📄 Appendices with Code (.docx)

Item response theory models for difference-in-difference estimates (and whether they are worth the trouble)

Soland, J. (2024). Journal of Research on Educational Effectiveness, 17(2), 391–421.

📄 Appendices with Code (.docx)

Scoring assessments in multisite randomized control trials: Examining the sensitivity of treatment effect estimates to measurement choices

Kuhfeld, M., & Soland, J. (2023). Psychological Methods.

📄 Appendices with Code (.rtf)

When should evaluators lose sleep over measurement? Toward establishing best practices

Soland, J., Edwards, K., & Talbert, E. (2025). Journal of Research on Educational Effectiveness, 18(3), 474–506.

📄 Appendices with Code (.docx)

Articles from Others

Dan Bauer

Li Cai

Patrick Curran

Lesa Hoffman

Kristopher Preacher

Jonathan Templin

Opinion / Media

Selected media mentions, op-eds, and blog posts. Titles link directly to each piece.

Highlights

OPINION: A more expansive approach to studying what works in education · Brookings Institution

OPINION: What Virginia families lose when research funding becomes politicized · Richmond Times-Dispatch

OPINION: When politics disrupt science, families pay the price · Special Education Today

New research is showing the high costs of long school closures in some communities · New York Times · 2022

Studies Show COVID's Toll on Students Living in Poverty · Washington Post · 2022

American Schools Got a $190 Billion Covid Windfall. Where Is It Going? · New York Times · 2022

Inside the new middle school math crisis · Hechinger Report · 2022

COVID-19 Has Left Millions Of Students Behind. Now What? · FiveThirtyEight · 2022

Why Districts' Initial Learning Recovery Efforts Missed the Mark · Education Week · 2022

Measuring COVID Learning Loss · UVA Today · 2021

The New Normal: Projecting The Impact Of COVID-19 On Education · National Public Radio · 2021

How is COVID-19 Affecting Student Learning? · Brookings Institution · 2020

Do we really have a covid-19 'lost generation'? One educator's message: 'Stop panicking. Get a grip!' · Washington Post · 2020

How One District Got Its Students Back Into Classrooms · New York Times · 2020

Many parents want it; few can afford it. Amid school uncertainty, private tutoring ramps up. · NBC News · 2020

How to Reopen America's Schools · New York Times Opinion · 2020

Study shows declines in new kindergartners' math skills · Education Dive · 2020

Research Shows Students Falling Months Behind During Virus Disruptions · New York Times · 2020

The impact of COVID-19 on student achievement and what it may mean for educators · Brookings Institution · 2020

50 Million Kids Can't Attend School. What Happens to Them? · New York Times Opinion · 2020

Homeschooling during the coronavirus will set back a generation of children · Washington Post · 2020

Oregon students face profound learning losses from school closures, especially in math, new research shows · The Oregonian · 2020

2019 NAEP Results Show There's Something Wrong Going On. 3 Theories About What Might Be Happening in Our Schools, and Beyond · The 74 Million · 2019

Can Test Metadata Help Schools Measure Social-Emotional Learning? · CPRE, University of Pennsylvania · 2019

Student Social and Emotional Learning Explored at Global Gathering · Diverse Issues in Higher Education · 2018

Attending to Issues of Equity in Evaluating Research-Practice Partnership Outcomes · NNERP Extra · 2019

Student Test Engagement and Its Impact on Achievement Gap Estimates · Brookings Institution · 2017

Design Challenge Winner: Using Test Metadata to Measure SEL · CASEL · 2017

New Tool Alerts Teachers When Students Give Up on Tests · Education Week · 2017

For English-Learners, an Effective Teacher in Any Language Is What Matters · Education Week · 2014

Economy Puts Squeeze on Education Promises · National Public Radio · 2010

Africa Work

I serve as a psychometrician and statistician on multiple projects in Africa that aim to improve assessment practices and translation. These projects focus on understanding autism in Kenya through the STAR Global Autism Initiative and on creating a psychometric and assessment center serving sub-Saharan Africa through a hub based in South Africa.

Measurement and Assessment Psychometric Center for Africa (MAP Center)

STAR Global Autism Initiative

Instruments & Survey Tools

Autism in the Context of Education – Kenya Survey (ACE-KS) Instrument

Related Publications

Developing Culturally Responsive Surveys on Neurodevelopmental Disabilities: Lessons from the Kenyan Context

Accruing validity evidence for the Autism in the Context of Education–Kenya Survey

Links

Curated external links coming soon.

Substack

Measured Response

Where Data Meets the Real World

Megan Kuhfeld, Josh Gilbert, Kyndra Middleton, and I have started a free Substack! Measured Response unpacks how social science turns human experience — whether achievement tests, anxiety screeners, or political polls — into numbers, and why getting that right matters more than most people realize. Each issue covers commentary on measurement in the news, accessible research findings, and plain-language explanations of key measurement concepts.

Subscribe — it's free