Soland Psychometrics & Assessment
Welcome to my site. I am a psychometrician by training, which means I help develop assessments (tests, surveys, teacher observational protocols), find defensible ways to score them, and investigate whether they are being used appropriately (validly) for their intended purposes. I chose this (admittedly niche) career because I believe good social science begins with good measurement. While I've had some great teachers over the years, I've also spent a lot of time teaching myself and, just maybe, shaking my fists at the sky over the lack of available, digestible instructional material. This website is meant to help others on their measurement journey by sharing some of the best resources it's taken me years to accrue.
As a heads up, I was once told that psychometricians, deep down, really wanted to be accountants, but couldn't stand the excitement. Some of this material is…detailed. Buyer beware.
Jim Soland, Ph.D., is an Associate Professor of Research, Statistics, and Evaluation at the University of Virginia School of Education and Human Development. He has a courtesy faculty appointment in UVA Psychology, is an Associate Editor at the journal Educational Assessment, serves on multiple state assessment Technical Advisory Committees (TACs), and is a Faculty Affiliate in the Max Planck Research School on The Life Course (LIFE). His research examines how measurement decisions shape our understanding of academic, psychological, and socio-emotional development, including how measurement affects evaluation of programs/interventions supporting that development. More recent work focuses on survey design using AI, and on measure development and validation in sub-Saharan Africa. Prior to joining the University of Virginia, Jim completed a doctorate in Educational Psychology at Stanford University with a concentration in measurement. He has also served as a Senior Research Scientist at NWEA, a policy analyst at the RAND Corporation, a Senior Policy Analyst at the Legislative Analyst's Office (LAO) in California, and an adjunct professor at Oregon State University.
Curriculum Vitae: CV (PDF)
Google Scholar Page: Google Scholar
Favorite textbook or learning resource?
I have two favorite textbooks, both of which treat latent variables as a unified statistical framework. One is more conceptual—Bartholomew et al.'s Analysis of Multivariate Social Science Data. The other is much more technical—Skrondal and Rabe-Hesketh's Generalized Latent Variable Modeling. Together, they make clear that IRT, SEM, mixture models, and related approaches are not separate methods, but all instances of latent variable models.
One go-to article by someone else?
Two classic examples that explicitly crosswalk SEM and IRT are articles by Wirth & Edwards and Kamata & Bauer. I also particularly like the paper by Ferron & Hess because it makes very explicit what SEM is doing "under the hood" when estimating a model.
Article of yours you wish more people knew about?
Megan Kuhfeld, Kelly Edwards, and I wrote a tutorial that walks through practical options for scoring measures in studies with multiple groups (e.g., treatment and control), multiple timepoints (e.g., pre–post designs or growth models), or both. The goal is to demystify how psychometricians think about measurement tradeoffs in real applied settings. I would also point to another article of mine that, to me, suggests growth mixture models shouldn't be a thing anymore.
Featured Articles
Why the replication crisis may be a measurement crisis…
Is the replication crisis a measurement crisis? Evidence from over 100 randomized trial outcomes
Gilbert, J. B., & Soland, J. PsyArXiv preprint.
Randomized controlled trials (RCTs) are the gold standard for causal inference, yet the validity of RCT conclusions depends not only on randomization but also on how outcomes are measured and scored. This study analyzes item-level data from 112 RCT outcome measures spanning psychology, medicine, public health, and education to test whether foundational measurement assumptions are met and whether alternative scoring approaches alter conclusions. Across disciplines, measurement assumptions are rarely evaluated and frequently violated. For nearly half of outcomes, the single score used is not likely a plausible representation of the data. Moreover, when outcomes are scored using statistical models aligned with the study design and better matching the data instead of sum scores, the proportion of pre/post trials reporting statistically significant treatment effects approximately doubles. These findings indicate that routine measurement decisions can systematically shift causal inferences, suggesting that a largely unexamined aspect of RCT analysis may be at the heart of replication failures.
Why growth mixture models should probably die a quiet death…
Evidence that growth mixture model results are highly sensitive to scoring decisions
Soland, J., Cole, V., Tavares, S., & Zhang, Q. (2025). Multivariate Behavioral Research, 60(3), 487–508.
Interest in identifying latent growth profiles to support the psychological and social-emotional development of individuals has translated into the widespread use of growth mixture models (GMMs). In most cases, GMMs are based on scores from item responses collected using survey scales or other measures. Research already shows that GMMs can be sensitive to departures from ideal modeling conditions and that growth model results outside of GMMs are sensitive to decisions about how item responses are scored, but the impact of scoring decisions on GMMs has never been investigated. This study begins to close that gap through empirical and Monte Carlo studies, showing that GMM results—including convergence, class enumeration, and latent growth trajectories within class—are extremely sensitive to seemingly arcane measurement decisions.
Exploring Whether Response Style Biases Manifest as Spurious Classes in Longitudinal Mixture Models
Cole, V. T., Soland, J., Zhang, Q., & Tavares, S. (2025). Structural Equation Modeling: A Multidisciplinary Journal.
Growth mixture models (GMMs) are used to identify unobserved groups based on longitudinal data. The current study investigated whether failing to model response styles, person-specific patterns of item responses unrelated to the latent variable, may lead researchers to overestimate the number of classes. In a simulation study, data were generated from a single-class model with socially desirable responding. We compared sum scores, multidimensional IRT scores, and scores from a model accounting for response styles. No method consistently recovered the correct single-class solution, though accounting for response styles helped when response style effects were strong. In an empirical study of growth mindset in middle-school children (N from 2,319 to 3,466), we found that the scoring method used to generate scores yielded some differences in predicted trajectories. Results suggest response styles may contribute to spurious class extraction in GMMs, but this effect is difficult to disaggregate from other known issues with GMMs.
How to avoid bias from scoring in intervention studies…
How survey scoring decisions can influence your study's results: A trip through the IRT looking glass
Soland, J., Kuhfeld, M., & Edwards, K. (2024). Psychological Methods, 29(5), 1003.
Though much effort is often put into designing psychological studies, the measurement model and scoring approach employed are often an afterthought, especially when short survey scales are used. One possible reason that measurement gets downplayed is that there is generally little understanding of how calibration and scoring approaches could impact common estimands of interest, including treatment effect estimates, beyond random noise due to measurement error. In this study, three motivating examples are provided in which surveys are used to understand individuals' underlying social-emotional and personality constructs, demonstrating the potential consequences of measurement and scoring decisions. As the analyses show, the decisions researchers make about how to calibrate and score the survey used have consequences that are often overlooked, with likely implications for both conclusions drawn from individual psychological studies and replications of those studies.
Soland, J. (2022). Educational and Psychological Measurement, 82(2), 376–403.
Considerable thought is often put into designing randomized control trials (RCTs), yet when psychological constructs measured using survey scales are the outcome of interest, measurement is often an afterthought. The purpose of this study is to examine how choices about scoring and calibration of survey item responses affect recovery of true treatment effects. Simulation and empirical studies compare the performance of sum scores—which are frequently used in RCTs in psychology and education—to that of approaches rooted in item response theory (IRT) that better account for the longitudinal, multigroup nature of the data. Results indicate that a multigroup longitudinal IRT approach performs best, particularly when measurement noninvariance is present across treatment and control groups, with sum score approaches understating true treatment effects by 25% or more.
When Should Evaluators Lose Sleep Over Measurement? Toward Establishing Best Practices
Soland, J., Edwards, K., & Talbert, E. (2025). Journal of Research on Educational Effectiveness, 18(3), 474–506.
Evaluators often invest much effort into designing evaluation studies. However, there is evidence that less attention is paid to measurement. One possible explanation is that focus in applied psychometrics is on reliability, with less placed on measurement model misspecification and the bias it can introduce into estimates that use resultant scores. Another possible explanation is that evaluators frequently want to use the simplest scoring approach possible under the assumption that it is transparent and therefore relies on fewer assumptions—a mindset that is often, if not always, misguided. In this study, we walk through the decisions involved in producing scores for program evaluation studies in an attempt to demystify the psychometrics, as well as show how related decisions can be consequential. We use Monte Carlo simulations to illustrate the effects of those decisions in a randomized control trial, then show that these decisions can impact published evaluation results. Finally, we try to give evaluators best practices in scoring for evaluation, including understanding when deviating from those practices is most likely to impact their work.
How to avoid bias from scoring in growth models…
Kuhfeld, M., & Soland, J. (2022). Psychological Methods, 27(2), 234–260.
A large portion of what we know about how humans develop, learn, behave, and interact is based on survey data. Researchers use longitudinal growth modeling to understand the development of students on psychological and social-emotional learning constructs across elementary and middle school, with growth typically measured using sum scores or scale scores produced by item response theory (IRT) methods. Although there is a great deal of guidance on scaling and linking IRT-based large-scale educational assessments to facilitate the estimation of examinee growth, little of this expertise is brought to bear in the scaling of psychological and social-emotional constructs. Through a series of simulation and empirical studies, this study produces scores using sum scores and multiple IRT approaches and compares the recovery of true latent growth parameters, demonstrating that certain IRT-based approaches substantially reduce bias relative to sum scores.
How Scoring Approaches Impact Estimates of Growth in the Presence of Survey Item Ceiling Effects
Edwards, K. D., & Soland, J. (2024). Applied Psychological Measurement, 48(3), 147–164.
Survey scores are often the basis for understanding how individuals grow psychologically and socio-emotionally. A known problem with many surveys is that the items are all "easy"—that is, individuals tend to use only the top one or two response categories on the Likert scale. Such an issue could be especially problematic, and lead to ceiling effects, when the same survey is administered repeatedly over time. In this study, we conduct simulation and empirical studies to (a) quantify the impact of these ceiling effects on growth estimates when using typical scoring approaches like sum scores and unidimensional item response theory (IRT) models and (b) examine whether approaches to survey design and scoring, including employing various longitudinal multidimensional IRT (MIRT) models, can mitigate any bias in growth estimates. We show that bias is substantial when using typical scoring approaches and that, while lengthening the survey helps somewhat, using a longitudinal MIRT model with plausible values scoring all but alleviates the issue. Results have implications for scoring surveys in growth studies going forward, as well as understanding how Likert item ceiling effects may be contributing to replication failures.
Soland, J., & Kuhfeld, M. (2021). Multivariate Behavioral Research, 56(6), 853–873.
Survey respondents employ different response styles when they use the categories of the Likert scale differently despite having the same true score on the construct of interest—for example, by being more likely to use the extremes of the response scale independent of their true score. Research already shows that differing response styles can create a construct-irrelevant source of bias that distorts fundamental inferences made from survey data. While some initial studies examine the effect of response styles on survey scores in longitudinal analyses, the issue of how response styles affect estimates of growth is underexamined. In this study, empirical and simulation analyses are conducted in which surveys are scored using IRT models that do and do not account for response styles, and those different scores are then used in growth models to quantify the impact on growth estimates.
Tutorials
Curated tutorials and exemplars, organized by topic. Titles link directly to the resource.
Growth Modeling
Item Response Theory
Structural Equation Modeling
Course Materials
High-quality course materials, lecture notes, and videos from trusted methods instructors and labs.
Textbooks
Recommended textbooks for learning measurement, latent variable modeling, and related methods. Listed in alphabetical order by first author.
Measurement Theory and Applications for the Social Sciences
Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. Guilford Publications.
Note. Excellent for walking through the basics of instrument design, as well as simple descriptive analyses that can be done to understand whether those instruments are working as intended.
Analysis of Multivariate Social Science Data (2nd ed.)
Bartholomew, D. J., Steele, F., Moustaki, I., & Galbraith, J. I. (2008). Analysis of multivariate social science data (2nd ed.). Chapman & Hall/CRC.
Note. Outstanding and much more conceptual. Starts with cluster analysis and walks through regression, factor analysis, factor analysis with binary variables (IRT), SEM, and then multilevel models.
Historical and Conceptual Foundations of Measurement in the Human Sciences
Briggs, D. C. Historical and Conceptual Foundations of Measurement in the Human Sciences.
Note. Gets into the history of psychometrics — the good, the bad, and the ugly.
Statistical Methods for the Social and Behavioural Sciences: A Model-Based Approach (1st ed.)
Flora, D. Statistical Methods for the Social and Behavioural Sciences: A Model-Based Approach (1st ed.).
Designing Monte Carlo Simulations in R
Miratrix, L. W., & Pustejovsky, J. E. Designing Monte Carlo Simulations in R.
Note. This is an excellent text on how to conduct high-quality, reproducible simulation studies.
Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models
Skrondal, A., & Rabe-Hesketh, S. Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models.
Note. Outstanding but more advanced; makes strong connections among latent variable models and includes helpful matrix algebra examples. The first lecture in my SEM class draws heavily from Chapter 1.
Datasets
Below are data repositories highlighted in a recent Psychometrika special-issue call on data-intensive methods in psychometrics.
Highlighted Repositories
A harmonized collection of item response datasets spanning many measures and domains.
Attentional Control Data Collection
Data from attentional control tasks.
Data from experience sampling (ESM) studies.
A library of preference data (rankings, choices, and related formats).
A database of children's vocabulary development.
Grant Proposals
Funded grant proposals shared here as examples and resources.
NSF: Understanding How Approaches to Calibrating and Scoring Survey Item Responses Affect Results from Growth Mixture Models
NCME Tools
Code & Appendices
My Articles
How survey scoring decisions can influence your study's results: A trip through the IRT looking glass
Soland, J., Kuhfeld, M., & Edwards, K. (2024). Psychological Methods, 29(5), 1003.
How scoring approaches impact estimates of growth in the presence of survey item ceiling effects
Edwards, K. D., & Soland, J. (2024). Applied Psychological Measurement, 48(3), 147–164.
Item response theory models for difference-in-difference estimates (and whether they are worth the trouble)
Soland, J. (2024). Journal of Research on Educational Effectiveness, 17(2), 391–421.
Scoring assessments in multisite randomized control trials: Examining the sensitivity of treatment effect estimates to measurement choices
Kuhfeld, M., & Soland, J. (2023). Psychological Methods.
When should evaluators lose sleep over measurement? Toward establishing best practices
Soland, J., Edwards, K., & Talbert, E. (2025). Journal of Research on Educational Effectiveness, 18(3), 474–506.
Articles from Others
Opinion / Media
Selected media mentions, op-eds, and blog posts. Titles link directly to each piece.
Highlights
Africa Work
I serve as a psychometrician and statistician on multiple projects in Africa that aim to improve assessment practices and translation. These projects focus on understanding autism in Kenya through the STAR Global Autism Initiative and on creating a psychometric and assessment center serving sub-Saharan Africa through a hub based in South Africa.