The Generalizability Crisis Revisited

In the summer of 2015, I remember coming across a New York Times article in a hotel lobby about a study by Brian Nosek and the Open Science Collaboration, recently published in Science. The researchers had tested whether some of the best-known effects in psychology could be replicated by independent teams; these included famous results such as the elderly-priming effect, in which subjects who read a list of words related to a concept such as old age would start to act as if they were indeed old, by walking slower, spending way too much time in the bathroom, and repeatedly asking where they left their glasses.

Out of the 100 studies they looked at, only 39 were successfully replicated, which according to my calculations is a success rate of approximately 39%, and on the subjective scale falls somewhere between “Uh-oh” and “We done goofed”. Clearly this was not good for science, and for psychology in particular, and although social psychologists made a convenient whipping boy, it became obvious that the problem ran deep in other branches of science as well. Neuroimaging in particular lost some of its prestige with the publication of Eklund et al.’s 2016 article on inflated false-positive rates in fMRI studies, which cast further doubt on the reliability of imaging research.

In the years that followed, many researchers tried to discern the cause of all these problems. Was it a flawed reward system, in which only statistically significant results are noticed and rewarded, while null results are ignored or, at best, seen as a sign of incompetence? Or was it a hangover from the small study designs of the 90s and early 2000s, which were adequate for detecting robust effects in the brain, for example, but which left studies underpowered when testing for more subtle effects?

One argument made by Yarkoni (2022) is that construct validity is necessary for results to mean anything, let alone to generalize. Construct validity refers to whether a measurement actually captures what it is supposed to be measuring; how much someone donated to charity, for example, might serve as a measure of a psychological construct such as selflessness, but it could also be confounded with the donor's personal interests - perhaps the donation was made for tax purposes, or given to a charity run by the donor's brother.

According to Yarkoni, many psychological studies fail this basic test of construct validity, and the gap between measurement and construct grows even wider - and more expensive - when more sophisticated techniques such as brain imaging are involved. MRI scan time typically costs anywhere from $500 to $1,000 per hour, and a full fMRI study can cost tens of thousands of dollars. Furthermore, fMRI is several steps removed from the underlying neural activity: the blood-oxygen-level-dependent (BOLD) signal measured in an fMRI scan is a downstream consequence of neural activity, and interpreting it requires several assumptions about blood flow and neurovascular coupling that may not hold equally well throughout the brain.

On top of this, virtually all statistics used in psychological science rely on what are called random effects. Treating subjects as a random effect means treating them as a random sample from a larger population, whose mean and standard deviation we estimate in order to generalize from our sample to that population. We then calculate a p-value - the probability that we would have randomly drawn a sample with statistics at least that extreme if there were truly no effect in the population - which will be small when the sample mean is large relative to its variability.
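To make that logic concrete, here is a minimal sketch in Python. The data are made up for illustration - nothing here comes from the studies discussed in this post - and the one-sample t-test it performs is simply the textbook version of the reasoning above.

```python
# Minimal sketch: estimate the sample mean and standard deviation, then ask how
# surprising a mean that large would be if the true population effect were zero.
# The "effect" values below are simulated, not real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
effect = rng.normal(loc=0.3, scale=1.0, size=30)   # 30 simulated subjects

mean, sd, n = effect.mean(), effect.std(ddof=1), len(effect)
t = mean / (sd / np.sqrt(n))                       # one-sample t statistic
p = 2 * stats.t.sf(abs(t), df=n - 1)               # two-tailed p-value

print(f"t({n - 1}) = {t:.2f}, p = {p:.3f}")
# Equivalent shortcut: stats.ttest_1samp(effect, popmean=0)
```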

The rub is that the validity of this inference depends on the researcher's assumptions: would the result generalize to a different age group, a different testing site, or even slightly different stimuli? The Stroop task furnishes a good example of a phenomenon that does generalize. Originally designed to test whether subjects could override their habit of reading a word and respond to the color of the font instead, the Stroop task has been modified to encompass many different types of incongruency, such as direction, location, and emotion. Across all of these variations, the underlying Stroop effect appears to hold: no matter what kind of task design or population you study, people tend to have more difficulty and commit more errors when responding to an incongruent stimulus than to a congruent one.

Note that for such an effect to be considered generalizable, it takes years or decades of replications across different populations, testing sites, and, ideally, different sets of stimuli. It also helps that the original effect and most of the follow-up studies showed large effect sizes. Many other psychological effects are not as robust, and even if they produce a statistically significant result in one study, care should be taken before generalizing them to the broader population. Yarkoni used a study by Alogna et al. (2014) as an example. This was an attempted replication of the “verbal overshadowing” effect, in which participants who described the physical appearance of a perpetrator caught on camera were less able to recognize the same perpetrator after a delay than participants who did a control task, such as naming as many states and capitals as they could.

Alogna et al. (2014) did find a significant effect in the same direction as the original study; however, Yarkoni maintains that this alone is not enough to call the effect generalizable. If we take it for granted that subjects should be treated as a random effect in order for our statistics to generalize to the population of people, then it stands to reason that other elements of the experiment, such as the stimuli and even the testing location, should also be treated as random factors. As it stands, virtually all psychological studies treat variables such as stimuli and testing location as fixed effects - meaning that the variability across stimuli and sites is not counted against the effect of interest (in this case, recognition of the perpetrator), which makes statistical significance easier to reach but, strictly speaking, licenses conclusions only about those particular stimuli and sites.
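To illustrate what that difference looks like in practice, here is a hedged sketch in Python using simulated Stroop-style reaction times - not the Alogna et al. data, and not Yarkoni's own analysis. The same trial-level data are modeled once with a random intercept for subjects only, and once with crossed random intercepts for subjects and stimuli, using statsmodels.

```python
# Subjects-only vs. crossed (subjects + stimuli) random intercepts,
# using simulated Stroop-style reaction times. All numbers are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj, n_stim = 20, 12
subj_shift = rng.normal(0, 50, n_subj)   # each subject's baseline speed (ms)
stim_shift = rng.normal(0, 30, n_stim)   # each stimulus's idiosyncratic difficulty (ms)

rows = []
for s in range(n_subj):
    for k in range(n_stim):
        for cond, delta in [("congruent", 0.0), ("incongruent", 60.0)]:
            rt = 600 + delta + subj_shift[s] + stim_shift[k] + rng.normal(0, 80)
            rows.append({"rt": rt, "condition": cond,
                         "subject": f"s{s:02d}", "stimulus": f"w{k:02d}"})
df = pd.DataFrame(rows)

# Common practice: random intercept for subjects only; stimuli are implicitly fixed.
m_subj = smf.mixedlm("rt ~ condition", df, groups="subject").fit()

# Yarkoni's point: also treat stimuli as a random factor. statsmodels fits
# crossed random intercepts via a single dummy group plus variance components.
df["one_group"] = 1
m_crossed = smf.mixedlm(
    "rt ~ condition", df, groups="one_group",
    vc_formula={"subject": "0 + C(subject)", "stimulus": "0 + C(stimulus)"},
).fit()

for name, res in [("subjects only", m_subj), ("subjects + stimuli", m_crossed)]:
    b = res.params["condition[T.incongruent]"]
    se = res.bse["condition[T.incongruent]"]
    print(f"{name}: incongruency effect = {b:.1f} ms (SE {se:.1f})")
```

This sketch only adds random intercepts; a fuller treatment of Yarkoni's argument would also include random condition slopes for subjects and stimuli, which is where treating stimuli as a random factor really starts to restrain the inference.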

The question, then, is how much of our experiment should be treated as random factors. Treating everything as a random effect makes statistical significance much harder to achieve, but on the other hand, any effect that survives such a threshold is much more likely to generalize to the broader population. Other considerations matter too: if many independent research groups using similar stimuli reach the same conclusion about a psychological phenomenon, that should weigh in favor of it being a true effect - that is, one sufficiently general throughout the population to be relied upon. Boundary conditions and special populations should always be kept in mind - nobody would expect a word-reading Stroop effect in someone who is illiterate - but often what we are looking for is an effect broad enough to merit consideration when, say, designing the ergonomics of a new office space or deciding whether a clinical intervention may be effective.

Regarding the latter, there is evidence that new neuromodulation interventions do work, and they are often based on fMRI studies that have articulated the functional architecture of the brain. A new technique called Stanford Neuromodulation Therapy, for example, targets the dorsolateral prefrontal cortex (DLPFC) with transcranial magnetic stimulation - a region chosen on the basis of several neuroimaging studies showing that it is hypoactive in people with major depressive disorder. The therapy appears to work, the remission of depressive symptoms tends to last, and it has been approved by the FDA. Similarly, a recent review of neuromodulation studies of addiction found that some of the strongest effects in clinical studies come from targeting the frontal pole and ventromedial prefrontal cortex - areas that previous fMRI studies have implicated in reward sensitivity and craving. Again, these studies did not treat every experimental factor as a random effect, but the converging evidence from multiple studies has nonetheless translated into meaningful clinical outcomes for different types of patients.

Yarkoni raises important points: many psychological studies are underpowered, poorly designed, and overgeneralized, and the field would benefit from greater rigor and larger sample sizes. However, we should also train our judgment about which effects appear to be real, which includes critically examining the study design, checking whether the methods and data are publicly available and reproducible, and asking whether independent studies have confirmed the effect. The recent success of neuromodulation as a therapy for different illnesses should also give us some confidence in the body of neuroimaging literature that has accumulated over the years on phenomena ranging from cognitive control to reward processing, as these findings have provided the foundation for efficacious clinical interventions. The growing normalization of sharing data and code, along with large open-access databases for analysis, will likely continue to yield important insights into how to treat different kinds of mental illness.