Finding the Right Subjects for Your FMRI Study

When asked, "What is the most important part of an experiment?", some will tell you that it lies in careful, considered deliberation about the design of the study and the ability to accurately tease apart alternative explanations of the results; others will say that the emphasis should be placed on technical finesse, statistical competence, and strict adherence to the rules governing good experimental behavior, including correcting your critical p-value every time you peek at the data - each viewing like another lashing from the scourge of science.

However, what these people fail to mention is the selection of subjects, which, if overlooked or neglected, will render all of the other facets of your experiment moot. Good subjects provide good data; or, at the very least, reliable data, since you can be certain that they performed the task as instructed and that they were alert, awake, and engaged. Any issues with your results must therefore be attributed to your design, your construct, or technical problems, while problems due to the individuals in your experiment can be ruled out.

To honor this observation, I am constantly on the lookout for fresh cerebrums to wheedle and coax into participating in my studies; during my walk to work I observe in a nearby pedestrian a particularly promising yet subtle eminence on the frontal bone, and silently estimate the number of cubic centimeters that must therefore be housed within Brodmann's Area Number Ten; I sidle up to a young girl at the bar, and after a few minutes of small talk and light banter, playfully brush aside a few unruly strands of her hair and tuck them behind her ear, taking the opportunity to lightly trace the arc of her squamous suture with my finger, feel the faint pulse of her temporal artery, and fantasize about the blood flowing to her auditory association cortex in response to strings of nonsense vowels. "Do you like playing with my hair?" she asks coyly. "Yes," I manage to stammer, roused from my reverie; "It is beautiful - Beautiful!"

There is one qualm I have with selecting good subjects, however. Often they are people I know, or they are referred by reliable friends, so that I have little doubt that they will be able to successfully carry out their charge. Often they are young, college-aged, healthy, right-handed, intelligent, motivated, and desperate for cash; and as I think about the generalizability of my results, I cannot help but conclude that they generalize only to people like this. A great number of people, either not attentive enough to follow the instructions, or not neurotic enough to care about the task the way they would about a test, perform at a suboptimal level and are thereby excluded; or else they are never recruited in the first place. This becomes more of a concern when moving beyond simple responses to visual and auditory stimuli, and into higher-level tasks such as decision-making, and I begin to question what meaning my results have for the great mass of humanity; but then I simply stir more laudanum into my coffee, drink deep from the dregs of Lethe, and sink into carefree oblivion.

In any case, once you have found a good subject, odds are that they also know good subjects; and it is prudent to have them contact their friends and acquaintances, in order to rapidly fill up your subject quota. However, when this approach fails me, and I am strapped for participants, I try a viral marketing approach: As each subject is paid about fifty dollars for two hours of scanning time, upon completion of the study and payment of the subject, I request that they convert their money into fifty one-dollar bills, go to some swank location - such as a hockey game, gentlemen's club, or monster truck rally - and take a picture of themselves holding the bills spread out like a fan in one hand and a thumbs-up in the other, while underneath the picture in Impact font are the words ANDY HOOKED ME UP. This leads to a noticeable spike in requests to participate in my study, although not always from the clientele that I would like.

Guidelines for Conducting FMRI Studies: Reliability Issues

Since FMRI data is (mostly) crap - but extremely expensive crap - there is much debate over how experiments should be designed in order to maximize both power and efficiency. My opinion is that most of these issues would be moot if we simply stopped doing experiments that study trivial or useless things. For example, I have in front of me a paper discussing the neural correlates of heterosexual attraction among females, which used a sample of thirty-nine subjects. Assuming that the experiment took about an hour and each scanning hour cost about $500, we can estimate that this study cost nearly $20,000 (39 × $500 = $19,500). And all this to study a question that was answered long ago, as common sense and my own observations suggest that females are irresistibly attracted to the soft, slightly pudgy build of the neuroscience blogger.

For those who must conduct such experiments, however, there are guidelines for balancing the tradeoff between efficiency and reliability (that is, the probability that an independent study will replicate your results). In a study by Thirion et al (2007), a large sample of 81 subjects was partitioned into disjoint subgroups of varying sizes - 2 groups of 40, 3 groups of 27, 4 groups of 20, and so on - to test whether there is a noticeable cutoff group size below which effects stop being reproducible. Further parameters were also examined, such as group-level variability and the sensitivity of different p-thresholds.
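If you want to try this sort of split-sample reliability check on your own data, the partitioning step itself is simple; here is a minimal Python sketch, where the subject IDs and group size are invented for illustration (this is not Thirion et al's actual code):

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fix the seed so the split is reproducible

# Hypothetical pool of 81 subject IDs (stand-ins for your real subject list)
subjects = np.array([f"sub-{i:03d}" for i in range(1, 82)])

def disjoint_groups(subjects, group_size, rng):
    """Shuffle subjects and carve them into disjoint groups of group_size.
    Leftover subjects that don't fill a complete group are dropped."""
    shuffled = rng.permutation(subjects)
    n_groups = len(shuffled) // group_size
    return [shuffled[i * group_size:(i + 1) * group_size] for i in range(n_groups)]

# e.g., three disjoint groups of 27, as in the paper's 81-subject example
for g, members in enumerate(disjoint_groups(subjects, 27, rng), start=1):
    print(f"Group {g}: {len(members)} subjects")
```

Each disjoint group then gets its own second-level analysis, and the agreement between the resulting maps tells you how reproducible your effect is at that sample size.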

Figure 1 reproduced from Thirion et al (2007). The top two rows in (a) represent activation maps for disjoint groups of subjects from the same sample; (b) is the group-level statistical map. Note the spread in activation profiles between each of the disjoint groups.


The authors found that the optimal number of subjects for balancing reliability and statistical sensitivity (that is, the ability to detect an effect that is actually present) is about N=25-27, with diminishing returns after that. In addition, the authors counsel the use of mixed-effects models, which take into account the variance from first-level analyses (i.e., individual subjects) and downweight subjects with high variability. This procedure is similar to the one employed in FSL's FLAME and AFNI's 3dMEMA.
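To make the downweighting idea concrete, here is a toy sketch of inverse-variance (precision) weighting at a single voxel; this is the general principle behind tools like FLAME and 3dMEMA, not their actual implementation, and all the numbers are invented:

```python
import numpy as np

# Hypothetical per-subject beta estimates at one voxel, with their
# first-level variances (all numbers invented for illustration)
betas = np.array([0.8, 1.2, 0.9, 3.5, 1.0])
variances = np.array([0.2, 0.3, 0.25, 4.0, 0.2])  # subject 4 is very noisy

# An ordinary (unweighted) group mean treats every subject equally
ols_mean = betas.mean()

# Precision weighting: each subject contributes in proportion to 1/variance,
# so the high-variance subject is downweighted
weights = 1.0 / variances
wls_mean = np.sum(weights * betas) / np.sum(weights)
wls_se = np.sqrt(1.0 / np.sum(weights))  # standard error of the weighted mean

print(f"Unweighted mean:          {ols_mean:.3f}")
print(f"Precision-weighted mean:  {wls_mean:.3f} (SE = {wls_se:.3f})")
```

A full mixed-effects model additionally estimates the between-subject variance and weights each subject by the inverse of the combined within- plus between-subject variance; the sketch above omits that step for brevity.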

Figure 8 reproduced from Thirion et al (2007). Both Kappa (a measure of reliability) and activated voxels increase significantly up to around 27 subjects, with a plateau shortly after that.
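As a brief aside on that figure, kappa measures the agreement between two thresholded maps over and above the overlap expected by chance. A minimal numpy sketch on two fake binary maps (a generic Cohen's kappa, not the authors' exact pipeline):

```python
import numpy as np

def kappa(map1, map2):
    """Cohen's kappa between two binarized activation maps (True = active)."""
    map1, map2 = map1.ravel().astype(bool), map2.ravel().astype(bool)
    p_observed = np.mean(map1 == map2)        # raw voxelwise agreement
    p1, p2 = map1.mean(), map2.mean()         # each map's activation rate
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)  # agreement expected by chance
    return (p_observed - p_chance) / (1 - p_chance)

# Two fake 'thresholded group maps' that mostly overlap
rng = np.random.default_rng(0)
truth = rng.random(10000) < 0.1              # ~10% of voxels truly active
map_a = truth ^ (rng.random(10000) < 0.02)   # flip ~2% of voxels as noise
map_b = truth ^ (rng.random(10000) < 0.02)

print(f"kappa = {kappa(map_a, map_b):.3f}")  # near 1 = highly reproducible maps
```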


As a side note, besides merely testing for statistical significance (which is virtually guaranteed with a large enough sample), effect sizes should also be calculated to measure the...well, size of your effect. Essentially, an effect size quantifies the magnitude of the difference between your calculated mean and the null-hypothesis mean, in terms of standard deviations. The following table will help you qualify how big your effect is when describing the result (a sketch of the calculation follows the table):

0-0.3: Wee
0.3-0.5: Not so wee
0.5+: Friggin' HUGE
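
If you would like to check which bin your own effect falls into, here is a minimal sketch of the one-sample version of Cohen's d; the contrast values are invented for illustration:

```python
import numpy as np

def cohens_d(sample, mu_null=0.0):
    """One-sample Cohen's d: distance from the null mean in standard deviations."""
    sample = np.asarray(sample, dtype=float)
    return (sample.mean() - mu_null) / sample.std(ddof=1)

def describe(d):
    """Translate |d| using the rigorous taxonomy above."""
    d = abs(d)
    if d < 0.3:
        return "Wee"
    elif d < 0.5:
        return "Not so wee"
    return "Friggin' HUGE"

# Hypothetical per-subject contrast estimates at a voxel (numbers invented)
betas = np.array([0.5, 0.8, 0.3, 0.9, 0.6, 0.7, 0.4])
d = cohens_d(betas)
print(f"d = {d:.2f} ({describe(d)})")
```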


More details, along with a discussion of why cluster thresholding is a better method than whole-brain voxel-wise correction, can be found in the original paper.