Replication, falsification, and the crisis of confidence in social psychology
Brian D. Earp and David Trafimow
Journal Name: Frontiers in Psychology
ISSN: 1664-1078
Article type: Hypothesis & Theory Article
First received on: 05 Mar 2015
Revised on: 22 Apr 2015
Frontiers website link: www.frontiersin.org
Quantitative Psychology and Measurement
Replication, Falsification, and the
Crisis of Confidence in Social Psychology
Brian D. Earp (University of Oxford; University of Cambridge) & David Trafimow (New Mexico State University)
Abstract
The (latest) "crisis in confidence" in social psychology has generated much heated
discussion about the importance of replication, including how such replication should be
carried out as well as interpreted by scholars in the field. What does it mean if a replication
attempt "fails" -- does it mean that the original results, or the theory that predicted them,
have been falsified? And how should "failed" replications affect our belief in the validity of
the original research? In this paper, we consider the "replication" debate from a historical
and philosophical perspective, and provide a conceptual analysis of both replication and
falsification as they pertain to this important discussion. Along the way, we introduce a
Bayesian framework for assessing "failed" replications in terms of how they should affect
our confidence in purported findings.
Key words: replication, falsifiability, crisis of confidence, social psychology, priming,
philosophy of science, Karl Popper
"Only when certain events recur in accordance with rules
or regularities, as in the case of repeatable experiments,
can our observations be tested -- in principle -- by anyone. ...
Only by such repetition can we convince ourselves that we
are not dealing with a mere isolated 'coincidence', but with
events which, on account of their regularity and
reproducibility, are in principle inter-subjectively
testable."
-- Karl Popper (1959, p. 45)
Introduction
Scientists pay lip-service to the importance of replication. It is the "coin of the scientific realm"
(Loscalzo, 2012, p. 1211); "one of the central issues in any empirical science" (Schmidt, 2009, p.
90); or even the "demarcation criterion between science and nonscience" (Braude, 1979, p. 2).
Similar declarations have been made about falsifiability, the "demarcation criterion" proposed by
Popper in his seminal work of 1959 (see epigraph). As we will discuss below, the concepts are
closely related -- and also frequently misunderstood. Nevertheless, their regular invocation
suggests a widespread if vague allegiance to Popperian ideals among contemporary scientists
working across a range of disciplines (Jordan, 2004; Jost, 2013). The cosmologist
Hermann Bondi once put it this way: "There is no more to science than its method, and there is
no more to its method than what Popper has said" (quoted in Magee, 1973, p. 2).
Experimental social psychologists have fallen in line. Perhaps in part to bolster our sense of
identity with the natural sciences (Danziger, 1997), we psychologists have been especially keen
to talk about replication. We want to trade in the "coin" of the realm. As Billig (2013) notes,
psychologists "cling fast to the belief that the route to knowledge is through the accumulation of
[replicable] experimental findings" (p. 179). The connection to Popper is often made explicit.
One recent example comes from Kepes and McDaniel (2013), from the field of industrial-organizational psychology: "The lack of exact replication studies [in our field] prevents the
opportunity to disconfirm research results and thus to falsify [contested] theories" (p. 257). They
cite The Logic of Scientific Discovery.
There are problems here. First, there is the "lack" of replication noted in the quote from Kepes
and McDaniel. If replication is so important, why isn't it being done? This question has become
a source of crisis-level anxiety among many psychologists in recent years, as we will explore in a
later section. The anxiety is due to a disconnect: between what is seen as being necessary for
scientific credibility -- i.e., careful replication of findings based on precisely stated theories -- and
what appears to be characteristic of the field in practice (Nosek, Spies, & Motyl, 2012). Part of
the problem is the lack of prestige associated with carrying out replications (Smith, 1970). To put
it simply, few would want to be seen by their peers as merely "copying" another's work (e.g.,
Mulkay & Gilbert, 1986); and few could afford to be seen in this way by tenure committees or
by the funding bodies that sponsor their research. Thus: while "a field that replicates its work is
[seen as] rigorous and scientifically sound" -- according to Makel, Plucker, and Hegarty (2012) --
psychologists who actually conduct those replications "are looked down on as bricklayers and
not [as] advancing [scientific] knowledge" (p. 537). In consequence, actual replication attempts
are rare.
A second problem is with the reliance on Popper -- or, at any rate, on a first-pass reading of Popper
that seems to be uninformed by subsequent debates in the philosophy of science. Indeed, as
critics of Popper have noted since the 1960s, and consistently thereafter, neither his notion of
falsification nor his account of experimental replicability seems strictly amenable to being put
into practice (e.g., Mulkay & Gilbert, 1981; see also Earp, 2011) -- at least not without
considerable ambiguity and confusion. What is more, they may not even be fully coherent as
stand-alone "abstract" theories, a point that has also been made repeatedly (cf. Cross, 1992).
The arguments here are familiar. Let us suppose that -- at the risk of being accused of laying
down bricks -- Researcher B sets up an experiment to try to "replicate" a controversial finding
that has been reported by Researcher A. She follows the original methods section as closely as
she can (assuming that this has been published in detail; or even better, she simply asks
Researcher A for precise instructions). She calibrates her equipment. She prepares the samples
and materials just so. And she collects and then analyzes the data. If she gets a different result
from what was reported by Researcher A -- what follows? Has she "falsified" the other lab's
theory? Has she even shown the original result to be erroneous in some way?
The answer to both of these questions, as we will demonstrate in some detail below, is "no."
Perhaps Researcher B made a mistake (see Trafimow, 2014). Perhaps the other lab did. Perhaps
one of B's research assistants wrote down the wrong number. Perhaps the original effect is a
genuine effect, but can only be obtained under specific conditions -- and we just don't know yet
what they are (Cesario, 2014). Perhaps it relies on "tacit" (Polanyi, 1962) or "unofficial"
(Westen, 1988) experimental knowledge that can only be acquired over the course of several
years, and perhaps Researcher B has not yet acquired this knowledge (Collins, 1975).
Or perhaps the original effect is not a genuine effect, but Researcher A's theory can actually
accommodate this fact. Perhaps Researcher A can abandon some auxiliary hypothesis, or take on
board another, or re-formulate a previously unacknowledged background assumption -- or
whatever (cf. Lakatos, 1970; Folger, 1989; Cross, 1992). As Lakatos (1970) once put it: "given
sufficient imagination, any theory ... can be permanently saved from 'refutation' by some
suitable adjustment in the background knowledge in which it is embedded" (p. 184). We will
discuss some of these potential "adjustments" below. The upshot, however, is that we simply do
not know, and cannot know, exactly what the implications of a given "replication" attempt are,
no matter which way the data come out. There are no critical tests of theories; and there are no
objectively decisive replications.
Popper (1959) was not blind to this problem. "In point of fact," he wrote, in an under-appreciated
passage of his famous book, "no conclusive disproof of a theory can ever be produced, for it is
always possible to say that the experimental results are not reliable, or that the discrepancies
which are asserted to exist between the experimental results and the theory are only apparent" (p.
50, emphasis added). Hence, as Mulkay and Gilbert (1981) explain:
... in relation to [actual] scientific practice, one can only talk of positive and negative
results, and not of proof or disproof. Negative results, that is, results which seem
inconsistent with a given hypothesis [or with a putative finding from a previous
experiment], may incline a scientist to abandon [the] hypothesis but they will never
require him to abandon it ... Whether or not he does so may depend on the amount and
quality of positive evidence, on his confidence in his own and others' experimental skills
and on his ability to conceive of alternative interpretations of the negative findings. (p.
391)
Drawing hard and fast conclusions, therefore, about "negative" results -- such as those that may
be produced by a "failed" replication attempt -- is much more difficult than Kepes and McDaniel
seem to imagine (see, e.g., Chow, 1988, for similarly problematic arguments). This difficulty
may be especially acute in the field of psychology. As Folger (1989) notes, "Popper himself
believed that too many theories, particularly in the social sciences, were constructed so loosely
that they could be stretched to fit any conceivable set of experimental results, making them ...
devoid of testable content" (p. 156, emphasis added). Furthermore, as Collins (1985) has argued,
the less secure a field's foundational theories -- and especially at the field's "frontier" -- the more
room there is for disagreement about what should "count" as a proper replication.[1]
Related to this problem is that it can be difficult to know in what specific sense a replication
study should be considered to be "the same" as the original (e.g., van IJzendoorn, 1994).
Consider that the goal for these kinds of studies is to rule out flukes and other types of error.
Thus we want to be able to say that the same experiment, if repeated one more time, would
produce the same result as was originally observed. But an original study and a replication study
cannot, by definition, be identical -- at the very least, some time will have passed and the
participants will all be new[2] -- and if we don't yet know which differences are theory-relevant,
we won't be able to control for their effects. The problem with a field like psychology, whose
theoretical predictions are often "constructed so loosely," as noted above, is precisely that we
simply do not know -- or at least we do not know in a large number of cases -- which differences
are in fact relevant to the theory.
Finally, human behavior is notoriously complex. We are not like billiard balls, or beavers, or
planets, or paramecia (that is, relatively simple objects or organisms with comparatively
circumscribed behavior). This means that we should expect our behavioral responses to vary
across a "wide range of moderating individual difference and experimental context variables"
(Cesario, 2014, p. 41) -- many of which are not yet known, and some of which may be difficult
or even impossible to uncover (Meehl, 1990). Thus, in the absence of "well-developed theories
for specifying such [moderating] variables, the conclusions of replication failures will be
ambiguous" (Cesario, 2014, p. 41; see also Meehl, 1978).
[1] There are two steps to understanding this idea. First, because the foundational theories are so insecure,
and the field's findings so under dispute, the "correct" empirical outcome of a given experimental design
is unlikely to have been firmly established. Second, and insofar as the first step applies, the standard by
which to judge whether a replication has been competently performed is equally unavailable -- since that
would depend upon knowing the "correct" outcome of just such an experiment. Thus a "competently
performed" experiment is one that produces the "correct" outcome; while the "correct" outcome is
defined by whatever it is that is produced by a "competently performed" experiment. As Collins (1985)
states: "Where there is disagreement about what counts as a competently performed experiment, the
ensuing debate is coextensive with the debate about what the proper outcome of the experiment is" (p.
89). This is the infamously circular experimenter's regress. Of course, as a reviewer for this paper notes,
a competently performed experiment should produce satisfactory (i.e., meaningful, useful) results on
"outcome neutral" tests.
[2] Assuming that it is a psychology experiment, of course. Note that even if the "same" participants are run
through the experiment one more time, they'll have changed in at least one essential way: they'll have
already gone through the experiment (opening the door for practice effects, etc.).
Summing up the problem
Hence we have two major points to consider. First, due to a lack of adequate incentives in the
reward structure of professional science (e.g., Nosek et al., 2012), actual replication attempts are
rarely carried out. Second, to the extent that they are carried out, it can be well-nigh impossible
to say conclusively what they mean, whether they are "successful" (i.e., showing similar, or
apparently similar, results to the original experiment) or "unsuccessful" (i.e., showing different,
or apparently different, results to the original experiment). Thus Collins (1985) came to the
conclusion that, in physics at least, disputes over contested findings are likelier to be resolved by
social and reputational negotiations -- over, e.g., who should be considered a competent
experimenter -- than by any "objective" consideration of the experiments themselves. Meehl
(1990) drew a similar conclusion about the field of social psychology, although he identified
sheer boredom (rather than social/reputational negotiation) as the alternative to decisive
experimentation:
... theories in the "soft areas" of psychology have a tendency to go through periods of
initial enthusiasm leading to large amounts of empirical investigation with ambiguous
over-all results. This period of infatuation is followed by various kinds of amendment and
the proliferation of ad hoc hypotheses. Finally, in the long run, experimenters lose
interest rather than deliberately discard a theory as clearly falsified. (p. 196)
So how shall we take stock of what has been said? A cynical reader might conclude that -- far
from being a "demarcation criterion between science and nonscience" -- replication is actually
closer to being a waste of time. Indeed, if even replications in physics are sometimes not
conclusive, as Collins (1975, 1981, 1985) has convincingly shown, then what hope is there for
replications in psychology?
Our answer is simply as follows. Replications do not need to be "conclusive" in order to be
informative. In this paper, we will highlight some of the ways in which replication attempts can
be more, rather than less, informative, and we will discuss -- using a Bayesian framework -- how
they can reasonably affect a researcher's confidence in the validity of an original finding. The
same is true of "falsification." While a scientist should not simply abandon her favorite theory on
account of a single (apparently) contradictory result -- as Popper himself was careful to point out[3]
(1959, pp. 66-67; see also Earp, 2011) -- she might reasonably be open to doubting it, given enough
disconfirmatory evidence, and assuming that she had stated the theory precisely. Rather than
being a "waste of time," therefore, experimental replication of one's own and others' findings can
be a useful tool for restoring confidence in the reliability of basic effects -- provided that certain
conditions are met. The work of the latter part of this essay will be to describe and to justify at
least a few of those essential conditions. In this context, we will draw a distinction between
"conceptual" or "reproductive" replications (cf. Cartwright, 1991) -- which may conceivably be
used to bolster confidence in a particular theory -- and "direct" or "close" replications, which
may be used to bolster confidence in a finding (Schmidt, 2009; see also Earp et al., 2014). Since
it is doubt about the findings that seems to have prompted the recent "crisis" in social
psychology, it is the latter that will be our focus. But first we must introduce the crisis.
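The Bayesian idea gestured at here can be illustrated with a toy calculation (our own sketch, not the authors' formal treatment; the function name and all probability values are hypothetical):

```python
def posterior_after_failed_replication(prior, p_fail_if_true, p_fail_if_false):
    """Update confidence in a finding after one 'failed' replication, via Bayes' rule.

    prior:           P(effect is real) before the replication attempt
    p_fail_if_true:  P(failure | effect is real), e.g. low power or hidden moderators
    p_fail_if_false: P(failure | effect is not real)
    """
    numerator = p_fail_if_true * prior
    evidence = numerator + p_fail_if_false * (1 - prior)
    return numerator / evidence

# Hypothetical numbers: a fairly confident prior, a modest chance that even a
# true effect fails to replicate, and a high chance that a spurious one does.
confidence = 0.8
for i in range(1, 4):  # three successive failed replications
    confidence = posterior_after_failed_replication(confidence, 0.3, 0.95)
    print(f"after failure {i}: P(effect is real) = {confidence:.2f}")
```

With these (arbitrary) inputs, a single failure lowers confidence only moderately (from 0.80 to about 0.56), which matches the point made above: no one negative result is decisive, but repeated failures compound and can reasonably erode belief in the original finding.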
The (Latest) Crisis in Social Psychology and Calls for Replication
"Is there currently a crisis of confidence in psychological science reflecting an unprecedented
level of doubt among practitioners about the reliability of research findings in the field? It would
certainly appear that there is." So write Harold Pashler and Eric-Jan Wagenmakers (2012, p.
529) in a recent issue of Perspectives on Psychological Science. The "crisis" is not unique to
psychology; it is rippling through biomedicine and other fields as well (Ioannidis, 2005; Earp &
Darby, 2014; Loscalzo, 2012) -- but psychology will be the focus of this paper, if for no other
reason than that the present authors have been closer to the facts on the ground.
Some of the causes of the crisis are fairly well known. In 2011, an eminent Dutch researcher
confessed to making up data and experiments, producing a résumé full of "findings" that he had
simply invented out of whole cloth (Carey, 2011). He was outed by his own students, however,
and not by peer review nor by any attempt to replicate his work. In other words, he might just as
well have not been found out, had he only been a little more careful (Stroebe, Postmes, &
Spears, 2012). An unsettling prospect was thus raised: could other fraudulent "findings" be
circulating -- undetected, and perhaps even undetectable -- throughout the published record? After
an exhaustive analysis of the Dutch fraud case, Stroebe et al. (2012) concluded that the notion of
self-correction in science was actually a "myth" (p. 670); and others have offered similar
pronouncements (Ioannidis, 2012a).
[3] On Popper's view, one must set up a "falsifying hypothesis," i.e., a hypothesis specifying how another
experimenter could recreate the falsifying evidence. But then, Popper says, the falsifying hypothesis itself
should be severely tested and corroborated before it is accepted as falsifying the main theory.
Interestingly, as a reviewer has suggested, the distinction between a falsifying hypothesis and the main
theory may also correspond to the distinction between direct vs. conceptual replications that we discuss in
a later section. On this view, direct replications (attempt to) reproduce what the falsifying hypothesis
states is necessary to generate the original predicted effect, whereas conceptual replications are attempts
to test the main theory.
But fraud, it is hoped, is rare. Nevertheless, as Ioannidis (2005, 2012a) and others have argued,
the line between explicitly fraudulent behavior and merely "questionable" research practices is
perilously thin, and the latter are probably common. John, Loewenstein, and Prelec (2012)
conducted a massive, anonymous survey of practicing psychologists and showed that this
conjecture is likely correct. Psychologists admitted to such questionable research practices as
failing to report all of the dependent measures for which they had collected data (78%[4]),
collecting additional data after checking to see whether preliminary results were statistically
significant (72%), selectively reporting studies that "worked" (67%), claiming to have predicted
an unexpected finding (54%), and failing to report all of the conditions that they ran (42%). Each
of these practices alone, and even more so when combined, reduces the interpretability of the
final reported statistics, casting doubt upon any claimed "effects" (e.g., Simmons, Nelson, &
Simonsohn, 2011).
The motivation behind these practices, though not necessarily conscious or deliberate, is also not
obscure. Professional journals have long had a tendency to publish only or primarily novel,
"statistically significant" effects, to the exclusion of replications -- and especially "failed"
replications -- or other null results. This problem, known as "publication bias," leads to a file-drawer effect whereby "negative" experimental outcomes are simply "filed away" in a
researcher's bottom drawer, rather than written up and submitted for publication (e.g., Rosenthal,
1979). Meanwhile, the "questionable research practices" carry on in full force, since they
increase the researcher's chances of obtaining a "statistically significant" finding -- whether it
turns out to be reliable or not.
[4] The percentages reported here are the geometric mean of self-admission rates, prevalence estimates by
the psychologists surveyed, and prevalence estimates derived by John et al. from the other two figures.
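The aggregation described in footnote 4 is a simple computation; a minimal sketch follows (the three input rates are hypothetical illustrations, not John et al.'s actual component figures):

```python
def geometric_mean(rates):
    """Geometric mean of a list of positive rates (here, proportions in [0, 1])."""
    product = 1.0
    for r in rates:
        product *= r
    return product ** (1.0 / len(rates))

# Hypothetical example for one practice: a self-admission rate, a surveyed
# prevalence estimate, and a prevalence estimate derived from the other two.
combined = geometric_mean([0.65, 0.80, 0.90])
print(f"combined prevalence estimate: {combined:.2f}")
```

The geometric mean is less sensitive than the arithmetic mean to one component being much larger than the others, which is one reason it is sometimes preferred for combining rate estimates of differing provenance.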
To add insult to injury, in 2012, an acrimonious public skirmish broke out in the form of dueling
blog posts between the distinguished author of a classic behavioral priming study[5] and a team of
researchers who had questioned his findings (Yong, 2012). The disputed results had already been
cited more than 2,000 times -- an extremely large number for the field -- and even been enshrined
in introductory textbooks. What if they did turn out to be a fluke? Should other "priming studies"
be double-checked as well? Coverage of the debate ensued in the mainstream media (e.g.,
Bartlett, 2013).
Another triggering event resulted in "widespread public mockery" (Pashler & Wagenmakers,
2012, p. 528). In contrast to the fraud case described above, which involved intentional,
unblushing deception, the psychologist Daryl Bem relied on well-established and widely-followed research and reporting practices to generate an apparently fantastic result, namely
evidence that participants' current responses could be influenced by future events (Bem, 2011).
Since such paranormal precognition is inconsistent with widely-held theories about "the
fundamental nature of time and causality" (Lebel & Peters, 2011, p. 371), few took the findings
seriously. Instead, they began to wonder about the 'well-established and widely-followed
research and reporting practices' that had sanctioned the findings in the first place (and allowed
for their publication in a leading journal). As Simmons et al. (2011) concluded -- reflecting
broadly on the state of the discipline -- "it is unacceptably easy to publish 'statistically
significant' evidence consistent with any hypothesis" (p. 1359).[6]
The main culprit for this phenomenon is what Simmons et al. (2011) identified as researcher
degrees of freedom:
In the course of collecting and analyzing data, researchers have many decisions to make:
Should more data be collected? Should some observations be excluded? Which
conditions should be combined and which ones compared? Which control variables
should be considered? Should specific measures be combined or transformed or both? ...
It is rare, and sometimes impractical, for researchers to make all these decisions
beforehand. Rather, it is common (and accepted practice) for researchers to explore
various analytic alternatives, to search for a combination that yields "statistical
significance" and to then report only what "worked." (p. 1359)
[5] Priming has been defined in a number of different ways. Typically, it refers to the ability of subtle cues in
the environment to affect an individual's thoughts and behavior, often outside of her awareness or control
(e.g., Bargh & Chartrand, 1999).
[6] Even more damning, Trafimow (2003; Trafimow & Rice, 2009; Trafimow & Marks, 2015) has argued
that the standard significance tests used in psychology are invalid even when they are done "correctly."
Thus, even if psychologists were to follow the prescriptions of Simmons et al. -- and reduce their
researcher degrees of freedom (see the discussion in the main text) -- this would still
fail to address the core problem that such tests should not be used in the first place.
One unfortunate consequence of such a strategy -- involving, as it does, some of the very same
"questionable research practices" later identified by John et al. (2012) in their survey of
psychologists -- is that it inflates the probability of producing a "false positive" (or a Type 1
error). Since such practices are "common" and even "accepted," the literature may be replete
with erroneous results. Thus, as Ioannidis (2005) declared after performing a similar analysis in
his own field of biomedicine, "most published research findings" may be "false" (p. 0696,
emphasis added). Hence the "unprecedented level of doubt" referred to by Pashler and
Wagenmakers (2012) in the opening quote to this section.
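The inflation in question is easy to demonstrate by simulation. The sketch below (our own illustration; all parameters are arbitrary) runs many "studies" of a purely null effect, with and without two common flexible practices -- an extra dependent measure and optional stopping:

```python
import math
import random

def z_significant(xs):
    """Two-sided z-test at alpha = .05 for mean 0, known sd = 1."""
    n = len(xs)
    z = (sum(xs) / n) * math.sqrt(n)
    return abs(z) > 1.96

def one_study(rng, flexible):
    """Simulate one study of a null effect (all data are pure noise).

    flexible=True mimics two 'researcher degrees of freedom': testing a
    second dependent measure, and optional stopping (testing, then adding
    participants and testing again if the first pass was not significant).
    """
    dv1 = [rng.gauss(0, 1) for _ in range(20)]
    if not flexible:
        return z_significant(dv1)
    dv2 = [rng.gauss(0, 1) for _ in range(20)]
    if z_significant(dv1) or z_significant(dv2):
        return True
    # Optional stopping: add 10 more observations per measure and re-test.
    dv1 += [rng.gauss(0, 1) for _ in range(10)]
    dv2 += [rng.gauss(0, 1) for _ in range(10)]
    return z_significant(dv1) or z_significant(dv2)

def false_positive_rate(flexible, n_studies=20000, seed=1):
    rng = random.Random(seed)
    hits = sum(one_study(rng, flexible) for _ in range(n_studies))
    return hits / n_studies

print("nominal alpha:     0.05")
print("rigid analysis:   ", false_positive_rate(flexible=False))
print("flexible analysis:", false_positive_rate(flexible=True))
```

With these settings the rigid analysis produces false positives at roughly the nominal 5% rate, while the flexible analysis produces them well above that rate, even though every individual test is conducted "correctly" -- the pattern Simmons et al. (2011) documented with their own, more extensive simulations.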
This is not the first crisis for psychology. Roger Giner-Sorolla (2012) points out that "crises" of
one sort or another "have been declared regularly at least since the time of Wilhelm Wundt" --
with turmoil as recent as the 1970s inspiring particular déjà vu (p. 563). Then, as now, a string of
embarrassing events -- including the publication in mainstream journals of literally unbelievable
findings[7] -- led to "soul searching" amongst leading practitioners. Standard experimental
methods, statistical strategies, reporting requirements, and norms of peer review were all put
under the microscope; numerous sources of bias were carefully rooted out (e.g., Greenwald,
1975). While various calls for reform were put forward -- some more energetically than others --
a single corrective strategy seemed to emerge from all the din: the need for psychologists to
replicate their work. Since "all flawed research practices yield findings that cannot be
reproduced," critics reasoned, replication could be used to separate the wheat from the chaff
(Koole & Lakens, 2012, p. 608, emphasis added; see also Elms, 1975).
[7] For example, a "study found that eating disorder patients were significantly more likely than others to
see frogs in a Rorschach test, which the author interpreted as showing unconscious fear of oral
impregnation and anal birth ..." (Giner-Sorolla, 2012, p. 562).
The same calls reverberate today. "For psychology to truly adhere to the principles of science,"
write Ferguson and Heene (2012), "the need for replication of research results [is] important ... to
consider" (p. 556). Lebel and Peters (2011) put it like this: "Across all scientific disciplines,
close replication is the gold standard for corroborating the discovery of an empirical
phenomenon" and "the importance of this point for psychology has been noted many times" (p.
375). Indeed, "leading researchers [in psychology]" agree, according to Francis (2012), that
"experimental replication is the final arbiter in determining whether effects are true or false" (p.
585).
We have already seen that such calls must be heeded with caution: replication is not
straightforward, and the outcome of replication studies may be difficult to interpret. Indeed, they
can never be conclusive on their own. But we suggested that replications could be more or less
informative; and in the following sections we discuss some strategies for making them "more"
rather than "less." We begin with a discussion of "direct" vs. "conceptual" replication.
Increasing Replication Informativeness: "Direct" vs. "Conceptual" Replication
In a systematic review of the literature, encompassing every conceivable discipline, Gómez,
Juristo, and Vegas (2010) identified 18 different types of replication. Three of these were from
Lykken (1968), who drew a distinction between "literal," "operational," and "constructive" --
which Schmidt (2009) then winnowed down (and re-labeled) to arrive at "direct" and
"conceptual" in an influential paper. As Makel et al. (2012) have pointed out, it is Schmidt's
particular framework that seems to have crystallized in the field of psychology, shaping most of
the subsequent discussion on this issue. We have no particular reason to rock the boat; indeed,
these categories will suit our argument just fine.
The first step in making a replication informative is to decide what specifically it is for. "Direct"
replications and "conceptual" replications are "for" different things; and assigning them their
proper role and function will be necessary for resolving the "crisis." First, some definitions:
A "direct" replication may be defined as an experiment that is intended to be as similar to the
original as possible (Schmidt, 2009; Makel et al., 2012). This means that along every conceivable
dimension -- from the equipment and materials used, to the procedure, to the time of day, to the
gender of the experimenter, etc. -- the replicating scientist should strive to avoid making any kind
of change or alteration. The purpose here is to "check" the original results. Some changes will be
inevitable, of course; but the point is that only the inevitable changes (such as the passage of time
between experiments) are ideally tolerated in this form of replication. In a "conceptual"
replication, by contrast, at least certain elements of the original experiment are intentionally
altered, ideally systematically so, toward the end of achieving a very different sort of purpose --
namely, to see whether a given phenomenon, assuming that it is reliable, might obtain across a
range of variable conditions. But as Doyen et al. (2014) note in a recent paper:
The problem with conceptual replication in the absence of direct replication is that there
is no such thing as a "conceptual failure to replicate." A failure to find the same "effect"
using a different operationalization can be attributed to the differences in method rather
than to the fragility of the original effect. Only the successful conceptual replications will
be published, and the unsuccessful ones can be dismissed without challenging the
underlying foundations of the claim. Consequently, conceptual replication without direct
replication is unlikely to change beliefs about the underlying effect. (p. 28)
In simplest terms, therefore, a "direct" replication seeks to validate a particular fact or finding,
whereas a "conceptual" replication seeks to validate the underlying theory or phenomenon -- i.e.,
the theory that has been proposed to "predict" the effect that was obtained by the initial
experiment -- as well as to establish the boundary conditions within which the theory holds true
(Nosek, Spies, & Motyl, 2012). The latter is impossible without the former. In other words, if we
cannot be sure that our finding is reliable to begin with (because it turns out to have been a
coincidence, or else a false alarm due to questionable research practices, publication bias, or
fraud), then we are in no position to begin testing the theory by which it is supposedly explained
(Cartwright, 1991; see also Earp et al., 2014).
Of course both types of replication are important, and there is no absolute line between them.
Rather, as Asendorpf et al. (2013) point out, "direct replicability [is] one extreme pole of a
continuous dimension extending to broad generalizability [via 'conceptual' replication] at the
other pole, ranging across multiple, theoretically relevant facets of study design" (p. 139).
Collins made a similar point in 1985 (e.g., p. 37). But so long as we remain largely ignorant
about exactly which "facets of study design" are "theoretically relevant" to begin with -- as is the
case with much of current social psychology (see Meehl, 1990), and nearly all of the most
heavily contested experimental findings -- we need to orient our attention more toward the
"direct" end of the spectrum.[8]
How else can replication be made more informative? Brandt et al.'s (2013) "Replication Recipe"
offers several important factors, one of which must be highlighted to begin with. This is their
contention that a "convincing" replication should be carried out outside the lab of origin. Clearly
this requirement shifts away from the "direct" extreme of the replication gradient that we have
emphasized so far, but such a change from the original experiment, in this case, is justified. As
Ioannidis (2012b) points out, replications by the original researchers -- while certainly important
and to be encouraged as a preliminary step -- are not sufficient to establish "convincing"
experimental reliability. This is because allegiance and confirmation biases, which apply
especially to the original team, are less of an issue for independent replicators.
Partially against this view, Schnall (2014, n.p.) argues that "authors of the original work should be
allowed to participate in the process of having their work replicated." On the one hand, this
might have the desirable effect of ensuring that the replication attempt faithfully reproduces the
original procedure. It seems reasonable to think that the original author would know more than
anyone else about how the original research was conducted—so her viewpoint is likely to be
helpful. On the other hand, however, too much input by the original author could compromise
the independence of the replication: she might have a strong motivation to make the replication a
success, which could subtly influence the results (see Earp & Darby, 2015). Whichever position
one takes on the appropriate degree of input and/or oversight from the original author, however,
Schnall (2014, n.p.) is certainly right to note that "the quality standards for replications need to be
at least as high as for the original findings. Competent evaluation by experts is absolutely
essential, and is especially important if replication authors have no prior expertise with a given
research topic."
[8] Asendorpf et al. (2013) explain why this is so: "[direct] replicability is a necessary condition for further
generalization and thus indispensible for building solid starting points for theoretical development.
Without such starting points, research may become lost in endless fluctuation between alternative
generalization studies that add numerous boundary conditions but fail to advance theory about why these
boundary conditions exist" (p. 140, emphasis added).
Other ingredients for increasing the informativeness of replication attempts include: (1) carefully
defining the effects and methods that the researchers intend to replicate; (2) following, as exactly as
possible, the methods of the original study (as described above); (3) having high statistical power
(i.e., an adequate sample size to detect an effect if one is really present); (4) making complete
details about the replication available, so that interested experts can fully evaluate the replication
attempt (or attempt another replication themselves); and (5) evaluating the replication results,
comparing them critically to the results of the original study (Brandt et al., 2013, p. 218, paraphrased).
This list is not exhaustive, but it gives a concrete sense of how "stabilizing" procedures (see
Radder, 1992) can be employed to increase the quality and informativeness of
replication efforts.
Falsification, Replication, and Auxiliary Assumptions
Brandt et al.'s (2013) "replication recipe" provides a vital tool for researchers seeking to conduct
high quality replications. In this section, we offer an additional "ingredient" to the discussion, by
highlighting the role of auxiliary assumptions in increasing replication informativeness,
specifically as these pertain to the relationship between replication and falsification. Consider the
logical fallacy of affirming the consequent, which provided an important basis for Popper's
falsification argument:
If the theory is true, an observation should occur. (Premise 1)
The observation occurs. (Premise 2)
Therefore, the theory is true. (Conclusion)
Obviously, the conclusion does not follow. Any number of things might have led to the
observation that have nothing to do with the theory being proposed (see Earp, 2015 for a similar
argument). On the other hand, denying the consequent (modus tollens) does invalidate the theory,
strictly according to the logic given:
If the theory is true, an observation should occur. (Premise 1)
The observation does not occur. (Premise 2)
Therefore, the theory is not true. (Conclusion)
Given this logical asymmetry, then, between affirming and denying the consequent of a
theoretical prediction (see Earp & Everett, 2013), Popper opted for the latter. By doing so, he
famously defended a strategy of disconfirming rather than confirming theories. Yet if the goal is
to disconfirm theories, then the theories must be capable of being disconfirmed in the first place;
hence, a basic requirement for a theory to count as properly scientific (see Earp
& Westermann, under revision) is that it be falsifiable.
As we hinted at above, however, this basic framework is an oversimplification. As Popper
himself noted, and as was made particularly clear by Lakatos (1978; also see Duhem, 1954;
Quine, 1980), scientists do not derive predictions only from a given theory, but rather from a
combination of the theory and auxiliary assumptions. The auxiliary assumptions are not part of
the theory proper, but they serve several important functions. One of these functions is to provide
the link between the sorts of outcomes that a scientist can actually observe (i.e., by running an
experiment) and the non-observable, "abstract" content of the theory itself. To pick one classic
example from psychology, according to the theory of reasoned action (e.g., Fishbein, 1980),
attitudes determine behavioral intentions. One implication of this theoretical assumption is that
researchers should be able to obtain strong correlations between attitudes and behavioral intentions.
But this assumes, among other things, that a check mark on an attitude scale really indicates the
person's attitude, and that a check mark on an intention scale really indicates the person's
intention. The theory of reasoned action has nothing to say about whether check marks on scales
indicate attitudes or intentions; these are assumptions that are peripheral to the basic theory.
They are auxiliary assumptions that researchers use to connect non-observational terms such as
"attitude" and "intention" to observable phenomena such as check marks. Fishbein and Ajzen
(e.g., 1975; Ajzen & Fishbein, 1980) recognized this and took great pains to spell out, as well as
possible, the auxiliary assumptions that best aid in measuring theoretically relevant variables.
The existence of auxiliary assumptions complicates the project of falsification. This is because
the major premise of the modus tollens argument—denying the consequent of the theoretical
prediction—must be stated somewhat differently. It must be stated like this: "If the theory is true
and a set of auxiliary assumptions is true, an observation should occur." Keeping the second
premise the same implies that either the theory is not true or that at least one auxiliary
assumption is not true, as the following syllogism (in symbols only) illustrates:
(T ∧ A1 ∧ A2 ∧ … ∧ An) → O (Premise 1)
¬O (Premise 2)
∴ ¬T ∨ ¬A1 ∨ ¬A2 ∨ … ∨ ¬An (Conclusion)

where T is the theory, A1 through An are the auxiliary assumptions, and O is the predicted observation.
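The force of this syllogism can be verified mechanically. The following sketch is ours, not the authors', and the variable names are purely illustrative: it brute-forces every truth assignment for a theory T, two auxiliary assumptions A1 and A2, and an observation O, and confirms that whenever both premises hold, the disjunctive conclusion holds as well.

```python
# Exhaustive check of the syllogism above for a theory T, two auxiliary
# assumptions A1 and A2, and a predicted observation O (illustrative only).
from itertools import product

for T, A1, A2, O in product([True, False], repeat=4):
    premise1 = (not (T and A1 and A2)) or O  # (T and A1 and A2) -> O
    premise2 = not O                         # the observation does not occur
    if premise1 and premise2:
        # In every model of the premises, the theory or at least one
        # auxiliary assumption must be false.
        assert (not T) or (not A1) or (not A2)

print("conclusion holds in every case")
```

The same check generalizes to any number of auxiliary assumptions, which is the point of the Duhem–Quine complication: the premises alone never tell us *which* conjunct failed.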
Consider an example. It is often said that Newton's gravitational theory predicted where planets
would be at particular times. This is not precisely accurate. It would be more accurate to say that
such predictions were derived from a combination of Newton's theory and auxiliary assumptions
not contained in that theory (e.g., about the present locations of the planets). To return to our
example about attitudes and intentions from psychology, consider the mini-crisis in social
psychology of the 1960s, when it became clear to researchers that attitudes—the kingly
construct—failed to predict behaviors. Much of the impetus for the theory of reasoned action
(e.g., Fishbein, 1980) was Fishbein's realization that there was a problem with attitude
measurement at the time: when this problem was fixed, strong attitude-behavior (or at least
attitude-intention) correlations became the rule rather than the exception. This episode provides a
compelling illustration of a case in which attention to the auxiliary assumptions that bore on
actual measurement played a larger role in resolving a crisis in psychology than debates over the
theory itself.
What is the lesson here? Because failures to obtain a predicted observation can be
blamed either on the theory itself or on at least one auxiliary assumption, absolute theory
falsification is about as problematic as absolute theory verification. In the Newton example,
when some of Newton's planetary predictions were shown to be wrong, he blamed the failures
on incorrect auxiliary assumptions rather than on his theory, arguing that there were additional
but unknown astronomical bodies that skewed his findings—which turned out to be a correct
defense of his theory. Likewise, in the attitude literature, the theoretical connection between
attitudes and behaviors turned out to be correct (as far as we know), with the problem having
been caused by incorrect auxiliary assumptions pertaining to attitude measurement.
There is an additional consequence of the necessity of giving explicit consideration to one's
auxiliary assumptions. Suppose, as often happens in psychology, that a researcher deems a
theory to be unfalsifiable because he or she does not see any testable predictions. Is the theory
really unfalsifiable, or is the problem that the researcher has not been sufficiently thorough in
identifying the necessary auxiliary assumptions that would lead to falsifiable predictions? Given
that absolute falsification is impossible, and that researchers are therefore limited to some kind of
"reasonable" falsification, Trafimow (2009) has argued that many allegedly unfalsifiable theories
are reasonably falsifiable after all: it is just a matter of researchers having to be more thoughtful
about considering auxiliary assumptions. Trafimow documented examples of theories that had
been described as unfalsifiable but that one could in fact falsify by proposing better auxiliary
assumptions than had been imagined by previous researchers.
The notion that auxiliary assumptions can vary in quality is relevant for replication. Consider, for
example, the case alluded to earlier regarding a purported failure to replicate Bargh et al.'s
(1996) famous priming results. In the replication attempt of this well-known "walking time"
study (Doyen et al., 2012), laser beams were used to measure the speed with which participants
left the laboratory, rather than students with stopwatches. Undoubtedly, this adjustment was
made on the basis of a reasonable auxiliary assumption: that methods of measuring time that are
less susceptible to human idiosyncrasies are superior to methods that are more subject to
them. Does the fact that the failed replication was not exactly like the original experiment
render it invalid? At least with regard to this specific feature of this specific replication
attempt, the answer is clearly "no." If a researcher uses a better auxiliary assumption than in the
original experiment, this should add to the replication's validity rather than subtract from it.[9]
But suppose, for a particular experiment, that we are not in a good position to judge the
superiority of alternative auxiliary assumptions. We might invoke what Meehl (1990) termed the
ceteris paribus (all else equal) assumption. This idea, applied to the issue of direct replications,
suggests that for researchers to be confident that a replication attempt is a valid one, the auxiliary
assumptions in the replication have to be sufficiently similar to those in the original experiment
that any differences in findings cannot reasonably be attributed to differences in the assumptions.
Put another way, all of the unconsidered auxiliary assumptions should be indistinguishable in the
relevant way: that is, all have to be sufficiently equal or sufficiently right or sufficiently
irrelevant so as not to matter to the final result.
[9] There may be other reasons why the "failed" replication by Doyen et al. should not be considered
conclusive, of course; for further discussion see, e.g., Lieberman (2012).
What makes it allowable for the researcher to make the ceteris paribus assumption? In a strict
philosophical sense, of course, it is not allowable. To see this, suppose that Researcher A has
published an experiment and Researcher B has replicated it, but the replication failed. If Researcher
A claims that Researcher B made a mistake in performing the replication, or just got unlucky,
there is no way to disprove Researcher A's argument absolutely. But suppose that Researchers C,
D, E, and F also attempt replications, and also fail. It becomes increasingly difficult to support
the contention that Researchers B through F all "did it wrong" or were unlucky, and that we should
continue to accept Researcher A's version of the experiment. Even if a million researchers
attempted replications, and all of them failed, it is theoretically possible that Researcher A's
version is the unflawed one and all the others are flawed. But most researchers would conclude
(and in our view, would be right to conclude) that it is more likely that Researcher A
got it wrong, and not the million researchers who failed to replicate the observation. Thus, we are
not arguing that replications, whether successful or not, are definitive. Rather, our argument is
that replications (of sufficient quality) are informative.
Introducing a Bayesian Framework
To see why this is the case, we shall employ a Bayesian framework similar to Trafimow (2010).
Suppose that an aficionado of Researcher A believes that the prior probability of anything
Researcher A said or did is very high. Researcher B attempts a replication of an experiment by
Researcher A and fails. The aficionado might continue confidently to believe in Researcher A's
version, but the aficionado's confidence likely would be decreased slightly. Well, then, as there
are more replication failures, the aficionado's confidence would continue to decrease
accordingly, and at some point the decrease in confidence would push the aficionado's
confidence below the 50% mark, in which case the aficionado would put more credence in the
replication failures than in the success obtained by Researcher A.
In the foregoing scenario, we would want to know the probability that the original result is
actually true given Researcher B's replication failure, P(T | F), where T denotes that the original
result is true and F denotes the failure to replicate. As Equation 1 shows, this depends on the
aficionado's prior level of confidence that the original result is true, P(T); the probability of
failing to replicate given that the original result is true, P(F | T); and the overall probability of
failing to replicate, P(F):

P(T | F) = P(T) × P(F | T) / P(F)   (1)
Alternatively, we could frame what we want to know in terms of a confidence ratio: the
probability that the original result is true versus not true given the failure to replicate,
P(T | F) / P(¬T | F). This would be a function of the aficionado's prior confidence ratio about the
truth of the finding, P(T) / P(¬T), and the ratio of the probabilities of failing to replicate given
that the original result is true or not, P(F | T) / P(F | ¬T). Thus, Equation 2 gives the posterior
confidence ratio:

P(T | F) / P(¬T | F) = [P(T) / P(¬T)] × [P(F | T) / P(F | ¬T)]   (2)
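The relationship between the two formulations can be checked numerically. In the sketch below the probabilities are invented for illustration (they are not values from the paper); the point is only that applying Bayes' theorem to T and to not-T separately and dividing gives the same answer as multiplying the prior confidence ratio by the likelihood ratio.

```python
# Illustrative check that the ratio form (Equation 2) agrees with
# applying Bayes' theorem (Equation 1) to T and not-T separately.
# All three probabilities below are made-up values for demonstration.
p_T = 0.98          # prior confidence that the original result is true
p_fail_T = 0.30     # P(F | T): a replication can fail even if T is true
p_fail_notT = 0.60  # P(F | not-T): failure is more likely if T is false

# P(F) by the law of total probability, then Equation 1 applied twice:
p_fail = p_T * p_fail_T + (1 - p_T) * p_fail_notT
post_T = p_T * p_fail_T / p_fail
post_notT = (1 - p_T) * p_fail_notT / p_fail

# Equation 2: posterior ratio = prior ratio x likelihood ratio
ratio_eq2 = (p_T / (1 - p_T)) * (p_fail_T / p_fail_notT)

assert abs(post_T / post_notT - ratio_eq2) < 1e-12
print(round(ratio_eq2, 2))
```

Note that P(F) cancels when the two applications of Equation 1 are divided, which is why the ratio form needs only the prior ratio and the likelihood ratio.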
Suppose that the aficionado is a very strong one, so that the prior confidence ratio is 50. In
addition, suppose the probability ratio pertaining to failing to replicate is .5. It is worthwhile to clarify
two points about this probability ratio. First, we assume that the probability of failing to replicate
is less if the original finding is true than if it is not true, so that the ratio ought to be substantially
less than 1. Second, how much less than 1 this ratio will be depends largely on the quality of the
replication; as the replication comes closer to meeting the ideal ceteris paribus condition, the
ratio will deviate increasingly from 1. Put more generally, as the quality of the auxiliary
assumptions going into the replication attempt increases, the ratio will decrease. Given these two
values of 50 and .5, the posterior confidence ratio is 50 × .5 = 25. Although this is a substantial decrease in
confidence from 50, the aficionado still believes that the finding is extremely likely to be true.
But suppose there is another replication failure and the probability ratio is .8. In that case, the
new confidence ratio is 25 × .8 = 20. The pattern should be clear here: as there are more
replication failures, a rational person, even if that person is an aficionado of the original
researcher, will experience continually decreasing confidence as the replication failures mount.
If we imagine that there are N attempts to replicate the original finding that fail, the process
described in the foregoing paragraph can be summarized in a single equation that gives the ratio
of posterior confidences in the original finding, given that there have been N failures to replicate.
This is a function of the prior confidence ratio and the probability ratios in the first replication
failure, the second replication failure, and so on.
P(T | F1, …, FN) / P(¬T | F1, …, FN) = [P(T) / P(¬T)] × [P(F1 | T) / P(F1 | ¬T)] × [P(F2 | T) / P(F2 | ¬T)] × … × [P(FN | T) / P(FN | ¬T)]   (3)
For example, staying with our aficionado with a prior confidence ratio of 50, imagine a set of 10
replication failures, with the following probability ratios: .5, .8, .7, .65, .75, .56, .69, .54, .73, and
.52. The final confidence ratio, according to Equation 3, would be:

50 × .5 × .8 × .7 × .65 × .75 × .56 × .69 × .54 × .73 × .52 ≈ .54
Note the following. First, even with an extreme prior confidence ratio (we had set it at 50 for the
aficionado), it is possible to overcome it with a reasonable number of replication failures,
provided that the person tallying the replication failures is a rational Bayesian (and there is
reason to think that those attempting the replications are sufficiently competent in the subject
area and methods to be qualified to undertake them). Second, it is possible to go from a state of
extreme confidence to one of substantial lack of confidence. To see this in the example, take the
reciprocal of the final confidence ratio (.54), which equals approximately 1.85. In other words, the Bayesian
aficionado now believes that the finding is about 1.85 times as likely to be not true as true. If we
imagine yet more failed attempts to replicate, it is easy to foresee that the belief that the
original finding is not true could eventually become as powerful, or more powerful, than the
prior belief that the original finding is true.
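The sequential updating just described can be sketched in a few lines of code. This is our illustration of Equation 3 using the worked example's numbers, not an implementation from the paper, and the function name is ours; small differences from the rounded figures quoted in the text are due to rounding of intermediate values.

```python
# A minimal sketch of the sequential Bayesian updating in Equation 3.

def update_confidence_ratio(prior_ratio, likelihood_ratios):
    """Multiply the prior confidence ratio P(T)/P(not-T) by each failed
    replication's likelihood ratio P(F_i | T) / P(F_i | not-T)."""
    ratio = prior_ratio
    for lr in likelihood_ratios:
        ratio *= lr
    return ratio

# The worked example: prior confidence ratio of 50, then ten failed
# replications with the stated probability ratios.
failures = [.5, .8, .7, .65, .75, .56, .69, .54, .73, .52]
posterior = update_confidence_ratio(50, failures)
print(round(posterior, 2))      # final confidence ratio, about .54
print(round(1 / posterior, 2))  # its reciprocal, about 1.85
```

Because each failure multiplies the ratio by a number below 1, the order of the failures does not matter; only their cumulative product does.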
In summary, auxiliary assumptions play a role not only in the original theory-testing
experiment but also in replications—even in replications concerned only with the original
finding and not with the underlying theory. An important auxiliary assumption is the ever-present
ceteris paribus assumption, and the extent to which it applies influences the
"convincingness" of the replication attempt. Thus, a change in confidence in the original finding
is influenced both by the quality and the quantity of the replication attempts, as Equation 3 illustrates.
In presenting Equations 1-3, we reduced the theoretical content as much as possible, and more
than is realistic in actual research,[10] in considering so-called "direct" replications. As
replications serve other purposes, such as "conceptual" replications, the amount of theoretical
content is likely to increase. To link that theoretical content to the replication attempt, more
auxiliary assumptions will become necessary. For example, in a conceptual replication of an
experiment finding that attitudes influence behavior, the researcher might use a different attitude
manipulation or a different behavior measure. How do we know that the different manipulation
and measure are sufficiently theoretically unimportant that the conceptual replication really is a
replication (i.e., a test of the underlying theory)? We need new auxiliary assumptions linking the
new manipulation and measure to the corresponding constructs in the theory, just as an original
set of auxiliary assumptions was necessary in the original experiment to link the original
manipulation and measure to those constructs. Auxiliary assumptions
always matter—and they should be made explicit so far as possible. In this way, it will be easier
to identify where in the chain of assumptions a "breakdown" must have occurred when attempting
to explain an apparent failure to replicate.
Conclusion
Replication is not a silver bullet. Even carefully designed replications, carried out in good faith
by expert investigators, will never be conclusive on their own. But as Tsang and Kwan (1999)
point out:

If replication is interpreted in a strict sense, [conclusive] replications or experiments are
also impossible in the natural sciences. ... So, even in the "hardest" science (i.e., physics)
complete closure is not possible. The best we can do is control for conditions that are
plausibly regarded to be relevant. (p. 763)
[10] Indeed, we have presented our analysis in this section in abstract terms so that the underlying reasoning
can be seen most clearly. However, this necessarily raises the question of how to go about implementing
these ideas in practice. As a reviewer points out, to calculate probabilities, the theory being tested would
need to be represented as a probability model; then in effect one would have Bayes factors to deal with.
We note that both Dienes (2014) and Verhagen and Wagenmakers (2014) have presented methods for
assessing the strength of evidence of a replication attempt (i.e., in confirming the original result) along
these lines, and we refer the reader to these papers for further consideration.
Nevertheless, "failed" replications, especially, might be dismissed by an original investigator as
being flawed or "incompetently" performed—but this sort of accusation is just too easy. The
original investigator should be able to describe exactly what parameters she sees as being
theoretically relevant, and under what conditions her "effect" should obtain. If a series of
replications is carried out, independently by different labs, and deliberately tailored to the
parameters and conditions so described—yet they reliably fail to produce the original result—
then this should be considered informative. At the very least, it will suggest that the effect is
sensitive to theoretically unspecified factors, whose specification is sorely needed. At most, it
should throw the existence of the effect into doubt, possibly justifying a shift in research
priorities. Thus, while "falsification" can in principle be avoided ad infinitum, with enough
creative effort by one who wishes to defend a favored theory, scientists should not seek to
"rescue" a given finding at any empirical cost.[11] Informative replications can reasonably factor
into scientists' assessment of just what that cost might be; and they should pursue such
replications as if the credibility of their field depended on it. In the case of experimental social
psychology, it does.
[11] As Doyen et al. (2014, p. 28, internal references omitted) recently argued: "Given the existence of
publication bias and the prevalence of questionable research practices, we know that the published
literature likely contains some false positive results. Direct replication is the only way to correct such
errors. The failure to find an effect with a well-powered direct replication must be taken as evidence
against the original effect. Of course, one failed direct replication does not mean the effect is
non-existent—science depends on the accumulation of evidence. But, treating direct replication as irrelevant
makes it impossible to correct Type 1 errors in the published literature."
Acknowledgements
Thanks are due to Anna Alexandrova for feedback on an earlier draft of this essay.
References
Ajzen, I., & Fishbein, M. (1980). Understanding attitudes and predicting social behavior.
Englewood Cliffs, NJ: Prentice-Hall.
Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., ... &
Wicherts, J. M. (2013). Replication is more than hitting the lottery twice. European Journal of
Personality, 27, 108-119.
Bargh, J. A., & Chartrand, T. L. (1999). The unbearable automaticity of being. American
Psychologist, 54(7), 462.
Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of
trait construct and stereotype activation on action. Journal of Personality and Social Psychology,
71(2), 230-244.
Bartlett, T. (2013, January). Power of suggestion. The Chronicle of Higher Education. Available
at: http://chronicle.com/article/Power-of-Suggestion/136907.
Bem, D. J. (2011). Feeling the future: experimental evidence for anomalous retroactive
influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407.
Billig, M. (2013). Learn to write badly: how to succeed in the social sciences. Cambridge:
Cambridge University Press.
Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., ... &
Van't Veer, A. (2013). The replication recipe: What makes for a convincing replication?
Journal of Experimental Social Psychology, in press.
Braude, S. E. (1979). ESP and psychokinesis: A philosophical examination. Philadelphia, PA:
Temple University Press.
Carey, B. (2011). Fraud case seen as a red flag for psychology research. The New York Times.
Cartwright, N. (1991). Replicability, reproducibility, and robustness: Comments on Harry
Collins. History of Political Economy, 23(1), 143-155.
Cesario, J. (2014). Priming, replication, and the hardest science. Perspectives on Psychological
Science, 9(1), 40-48.
Chow, S. L. (1988). Significance test or effect size? Psychological Bulletin, 103(1), 105.
Collins, H. M. (1975). The seven sexes: A study in the sociology of a phenomenon, or the
replication of experiments in physics. Sociology, 9(2), 205-224.
Collins, H. M. (1981). Son of seven sexes: The social destruction of a physical phenomenon.
Social Studies of Science, 11(1), 33-62.
Collins, H. M. (1985). Changing order: Replication and induction in scientific practice.
University of Chicago Press.
Cross, R. (1982). The Duhem-Quine thesis, Lakatos and the appraisal of theories in
macroeconomics. The Economic Journal, 92(366), 320-340.
Danziger, K. (1997). Naming the mind. London: Sage.
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in
Psychology, 5(Article 781), 1-17. doi: 10.3389/fpsyg.2014.00781
Doyen, S., Klein, O., Pichon, C. L., & Cleeremans, A. (2012). Behavioral priming: it's all in the
mind, but whose mind? PloS One, 7(1), e29081.
Doyen, S., Klein, O., Simons, D. J., & Cleeremans, A. (2014). On the other side
of the mirror: Priming in cognitive and social psychology. Social Cognition, 32, 12-32.
Duhem, P. (1954). The aim and structure of physical theory (P.P. Wiener, Trans.). Princeton, NJ:
Princeton University Press. (Original work published 1906)
Earp, B. D. (2015). Does religion deserve a place in secular medicine? Journal of Medical
Ethics. E-letter. Available at http://jme.bmj.com/content/41/3/229/reply#medethics_el_17551.
Earp, B. D. (2011). Can science tell us whatÕs objectively true? The New Collection, 6(1), 1-9.
Earp, B. D., Everett, J. A. C., Madva, E. N., & Hamlin, J. K. (2014). Out, damned spot: Can the
"Macbeth Effect" be replicated? Basic and Applied Social Psychology, 36(1), 91-98.
Earp, B. D., & Everett, J. A. C. (2013). Is the N170 face-specific? Controversy, context, and
theory. Neuropsychological Trends, 13(1), 7-26.
Earp, B. D., & Darby, R. J. (2015). Does science support infant circumcision? A skeptical reply
to Brian Morris. The Skeptic, 25(3), in press. Available at
https://www.academia.edu/9872471/Does_science_support_infant_circumcision.
Earp, B. D., & Westermann, G. (under revision). Connectionist vs. rule-based models in
cognitive science: Parsimony, falsifiability, and the curious case of the English past tense.
Cognitive Science.
Elms, A. C. (1975). The crisis of confidence in social psychology. American Psychologist,
30(10), 967.
Fanelli, D. (2013). Only reporting guidelines can save (soft) science. European Journal of
Personality, 27, 124-125.
Ferguson, C. J., & Heene, M. (2012). A vast graveyard of undead theories: Publication bias and
psychological science's aversion to the null. Perspectives on Psychological Science, 7(6), 555-561.
Fishbein, M. (1980). Theory of reasoned action: Some applications and implications.
In H. Howe & M. Page (Eds.), Nebraska Symposium on Motivation, 1979 (pp. 65-116).
Lincoln: University of Nebraska Press.
Fishbein, M., & Ajzen, I. (1975). Belief, attitude, intention and behavior: An introduction
to theory and research. Reading, MA: Addison-Wesley.
Folger, R. (1989). Significance tests and the duplicity of binary decisions. Psychological
Bulletin, 106(1), 155-160.
Francis, G. (2012). The psychology of replication and replication in psychology. Perspectives on
Psychological Science, 7(6), 585-594.
Giner-Sorolla, R. (2012). Science or art? How aesthetic standards grease the way through the
publication bottleneck but undermine science. Perspectives on Psychological Science, 7(6), 562-
571.
Gómez, O. S., Juristo, N., & Vegas, S. (2010, September). Replications types in experimental
disciplines. In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical
Software Engineering and Measurement (p. 3). ACM.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological
Bulletin, 82(1), 1.
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8),
e124.
Ioannidis, J. P. (2012a). Why science is not necessarily self-correcting. Perspectives on
Psychological Science, 7(6), 645-654.
Ioannidis, J. P. (2012b). Scientific inbreeding and same-team replication: type D personality as
an example. Journal of Psychosomatic Research, 73(6), 408-410.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable
research practices with incentives for truth telling. Psychological Science, 23(5), 524-532.
Jordon, G. (2004). Theory construction in second language acquisition. Philadelphia: John
Benjamins.
Jost, J. (2013). Introduction to: An additional future for psychological science. Perspectives on
Psychological Science, 8(4), 414-423.
Kepes, S., & McDaniel, M. A. (2013). How trustworthy is the scientific literature in industrial
and organizational psychology? Industrial and Organizational Psychology, 6(3), 252-268.
Koole, S. L., & Lakens, D. (2012). Rewarding replications: A sure and simple way to improve
psychological science. Perspectives on Psychological Science, 7(6), 608-614.
Lakatos, I. (1970). Falsification and the methodology of scientific research programmes. In I.
Lakatos & A. Musgrave (Eds.), Criticism and the growth of knowledge (pp. 91-196). London:
Cambridge University Press.
Lakatos, I. (1978). The methodology of scientific research programmes. Cambridge, UK:
Cambridge University Press.
LeBel, E. P., & Peters, K. R. (2011). Fearing the future of empirical psychology: Bem's (2011)
evidence of psi as a case study of deficiencies in modal research practice. Review of General
Psychology, 15(4), 371.
Lieberman, M. (2012). Does thinking of grandpa make you slow? What the failure to replicate
results does and does not mean. Psychology Today. Available at
http://www.psychologytoday.com/blog/social-brain-social-mind/201203/does-thinking-grandpa-make-you-slow.
Loscalzo, J. (2012). Irreproducible experimental results: Causes, (mis)interpretations, and
consequences. Circulation, 125(10), 1211-1214.
Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin,
70(3p1), 151.
Makel, M. C., Plucker, J. A., & Hegarty, B. (2012). Replications in psychology research: How
often do they really occur? Perspectives on Psychological Science, 7(6), 537-542.
Magee, B. (1973). Karl Popper. New York: Viking Press.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow
progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often
uninterpretable. Psychological Reports, 66(1), 195-244.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant using it. Psychological Inquiry, 1, 108-141.
Mulkay, M., & Gilbert, G. N. (1981). Putting philosophy to work: Karl Popper's influence on
scientific practice. Philosophy of the Social Sciences, 11, 389-407.
Mulkay, M., & Gilbert, G. N. (1986). Replication and mere replication. Philosophy of the Social
Sciences, 16(1), 21-37.
Nosek, B. A., & the Open Science Collaboration. (2012). An open, large-scale, collaborative
effort to estimate the reproducibility of psychological science. Perspectives on Psychological
Science, 7(6), 657-660.
Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615-631.
Pashler, H., & Wagenmakers, E. J. (2012). Editors' introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528-530.
Polanyi, M. (1962). Tacit knowing: Its bearing on some problems of philosophy. Reviews of
Modern Physics, 34(4), 601-615.
Popper, K. (1959). The logic of scientific discovery. London: Hutchinson.
Quine, W. V. O. (1980). Two dogmas of empiricism. In W. V. O. Quine (Ed.), From a logical point of view (2nd ed., pp. 20-46). Cambridge, MA: Harvard University Press. (Original work published 1953)
Radder, H. (1992, January). Experimental reproducibility and the experimenters' regress. In PSA:
Proceedings of the Biennial Meeting of the Philosophy of Science Association (pp. 63-73).
Philosophy of Science Association.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638.
Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected
in the social sciences. Review of General Psychology, 13(2), 90.
Schnall, S. (2014). Simone Schnall on her experience with a registered replication project. SPSP
Blog. Available at http://www.spspblog.org/simone-schnall-on-her-experience-with-a-registered-
replication-project/.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.
Smith, N. C. (1970). Replication studies: A neglected aspect of psychological research.
American Psychologist, 25(10), 970.
Stroebe, W., Postmes, T., & Spears, R. (2012). Scientific misconduct and the myth of self-
correction in science. Perspectives on Psychological Science, 7(6), 670-688.
Trafimow, D. (2003). Hypothesis testing and theory evaluation at the boundaries: Surprising insights from Bayes's theorem. Psychological Review, 110, 526-535.
Trafimow, D. (2009). The theory of reasoned action: A case study of falsification in psychology. Theory & Psychology, 19, 501-518.
Trafimow, D. (2010). On making assumptions about auxiliary assumptions: Reply to Wallach and Wallach. Theory & Psychology, 20, 707-711.
Trafimow, D. (2014). Editorial. Basic and Applied Social Psychology, 36(1), 1-2. DOI: 10.1080/01973533.2014.865505.
Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, in press.
Trafimow, D., & Rice, S. (2009). A test of the NHSTP correlation argument. Journal of General Psychology, 136, 261-269.
Tsang, E. W., & Kwan, K. M. (1999). Replication and theory development in organizational science: A critical realist perspective. Academy of Management Review, 24(4), 759-780.
Van IJzendoorn, M. H. (1994). A process model of replication studies: On the relation between different types of replication. Leiden University Library. Available at: https://openaccess.leidenuniv.nl/bitstream/handle/1887/1483/168_149.pdf?sequence=1.
Verhagen, J., & Wagenmakers, E. J. (2014). Bayesian tests to quantify the result of a replication
attempt. Journal of Experimental Psychology: General, 143(4), 1457-1475.
Westen, D. (1988). Official and unofficial data. New Ideas in Psychology, 6, 323-331.
Yong, E. (2012). A failed replication attempt draws a scathing personal attack from a psychology professor. Discover Magazine. Available at http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen.
Introduction
Scientists pay lip-service to the importance of replication. It is the "coin of the scientific realm" (Loscalzo, 2012, p. 1211); "one of the central issues in any empirical science" (Schmidt, 2009, p. 90); or even the "demarcation criterion between science and nonscience" (Braude, 1979, p. 2). Similar declarations have been made about falsifiability, the "demarcation criterion" proposed by Popper in his seminal work of 1959 (see epigraph). As we will discuss below, the two concepts are closely related—and also frequently misunderstood. Nevertheless, their regular invocation suggests a widespread if vague allegiance to Popperian ideals among contemporary scientists working in a range of different disciplines (Jordan, 2004; Jost, 2013). The cosmologist Hermann Bondi once put it this way: "There is no more to science than its method, and there is no more to its method than what Popper has said" (quoted in Magee, 1973, p. 2).
Experimental social psychologists have fallen in line. Perhaps in part to bolster our sense of identity with the natural sciences (Danziger, 1997), we psychologists have been especially keen to talk about replication. We want to trade in the "coin" of the realm. As Billig (2013) notes, psychologists "cling fast to the belief that the route to knowledge is through the accumulation of [replicable] experimental findings" (p. 179). The connection to Popper is often made explicit. One recent example comes from Kepes and McDaniel (2013), from the field of industrial-organizational psychology: "The lack of exact replication studies [in our field] prevents the opportunity to disconfirm research results and thus to falsify [contested] theories" (p. 257). They cite The Logic of Scientific Discovery.
There are problems here. First, there is the "lack" of replication noted in the quote from Kepes and McDaniel. If replication is so important, why isn't it being done? This question has become a source of crisis-level anxiety among many psychologists in recent years, as we will explore in a later section. The anxiety is due to a disconnect between what is seen as being necessary for scientific credibility—i.e., careful replication of findings based on precisely stated theories—and what appears to be characteristic of the field in practice (Nosek, Spies, & Motyl, 2012). Part of the problem is the lack of prestige associated with carrying out replications (Smith, 1970). To put it simply, few would want to be seen by their peers as merely "copying" another's work (e.g., Mulkay & Gilbert, 1986); and few could afford to be seen in this way by tenure committees or by the funding bodies that sponsor their research. Thus, while "a field that replicates its work is [seen as] rigorous and scientifically sound"—according to Makel, Plucker, and Hegarty (2012)—psychologists who actually conduct those replications "are looked down on as bricklayers and not [as] advancing [scientific] knowledge" (p. 537). In consequence, actual replication attempts are rare.
A second problem is with the reliance on Popper—or, at any rate, on a first-pass reading of Popper that seems uninformed by subsequent debates in the philosophy of science. Indeed, as critics of Popper have noted since the 1960s and consistently thereafter, neither his notion of falsification nor his account of experimental replicability is strictly amenable to being put into practice (e.g., Mulkay & Gilbert, 1981; see also Earp, 2011)—at least not without considerable ambiguity and confusion. What is more, they may not even be fully coherent as stand-alone "abstract" theories, as has been repeatedly noted as well (cf. Cross, 1992).
The arguments here are familiar. Let us suppose that—at the risk of being accused of laying down bricks—Researcher B sets up an experiment to try to "replicate" a controversial finding that has been reported by Researcher A. She follows the original methods section as closely as she can (assuming that it has been published in detail; or, even better, she simply asks Researcher A for precise instructions). She calibrates her equipment. She prepares the samples and materials just so. And she collects and then analyzes the data. If she gets a different result from what was reported by Researcher A—what follows? Has she "falsified" the other lab's theory? Has she even shown the original result to be erroneous in some way?
The answer to both of these questions, as we will demonstrate in some detail below, is "no." Perhaps Researcher B made a mistake (see Trafimow, 2014). Perhaps the other lab did. Perhaps one of B's research assistants wrote down the wrong number. Perhaps the original effect is a genuine effect, but can only be obtained under specific conditions—and we just don't know yet what they are (Cesario, 2014). Perhaps it relies on "tacit" (Polanyi, 1962) or "unofficial" (Westen, 1988) experimental knowledge that can only be acquired over the course of several years, and perhaps Researcher B has not yet acquired this knowledge (Collins, 1975).
Or perhaps the original effect is not a genuine effect, but Researcher A's theory can actually accommodate this fact. Perhaps Researcher A can abandon some auxiliary hypothesis, or take on board another, or re-formulate a previously unacknowledged background assumption—or whatever (cf. Lakatos, 1970; Folger, 1989; Cross, 1992). As Lakatos (1970) once put it: "given sufficient imagination, any theory ... can be permanently saved from 'refutation' by some suitable adjustment in the background knowledge in which it is embedded" (p. 184). We will discuss some of these potential "adjustments" below. The upshot, however, is that we simply do not know, and cannot know, exactly what the implications of a given "replication" attempt are, no matter which way the data come out. There are no critical tests of theories; and there are no objectively decisive replications.
Popper (1959) was not blind to this problem. "In point of fact," he wrote, in an under-appreciated passage of his famous book, "no conclusive disproof of a theory can ever be produced, for it is always possible to say that the experimental results are not reliable, or that the discrepancies which are asserted to exist between the experimental results and the theory are only apparent" (p. 50, emphasis added). Hence, as Mulkay and Gilbert (1981) explain:

... in relation to [actual] scientific practice, one can only talk of positive and negative results, and not of proof or disproof. Negative results, that is, results which seem inconsistent with a given hypothesis [or with a putative finding from a previous experiment], may incline a scientist to abandon [the] hypothesis but they will never require him to abandon it ... Whether or not he does so may depend on the amount and quality of positive evidence, on his confidence in his own and others' experimental skills and on his ability to conceive of alternative interpretations of the negative findings. (p. 391)
Drawing hard-and-fast conclusions about "negative" results—such as those that may be produced by a "failed" replication attempt—is therefore much more difficult than Kepes and McDaniel seem to imagine (see, e.g., Chow, 1988, for similarly problematic arguments). This difficulty may be especially acute in the field of psychology. As Folger (1989) notes, "Popper himself believed that too many theories, particularly in the social sciences, were constructed so loosely that they could be stretched to fit any conceivable set of experimental results, making them ... devoid of testable content" (p. 156, emphasis added). Furthermore, as Collins (1985) has argued, the less secure a field's foundational theories—and especially at the field's "frontier"—the more room there is for disagreement about what should "count" as a proper replication.[1]
Related to this problem is the difficulty of knowing in what specific sense a replication study should be considered "the same" as the original (e.g., van IJzendoorn, 1994). Consider that the goal of these kinds of studies is to rule out flukes and other types of error. Thus we want to be able to say that the same experiment, if repeated one more time, would produce the same result as was originally observed. But an original study and a replication study cannot, by definition, be identical—at the very least, some time will have passed and the participants will all be new[2]—and if we don't yet know which differences are theory-relevant, we won't be able to control for their effects. The problem with a field like psychology, whose theoretical predictions are often "constructed so loosely," as noted above, is precisely that we simply do not know—or at least we do not know in a large number of cases—which differences are in fact relevant to the theory.
Finally, human behavior is notoriously complex. We are not like billiard balls, or beavers, or planets, or paramecia (that is, relatively simple objects or organisms with comparatively circumscribed behavior). This means that we should expect our behavioral responses to vary across a "wide range of moderating individual difference and experimental context variables" (Cesario, 2014, p. 41)—many of which are not yet known, and some of which may be difficult or even impossible to uncover (Meehl, 1990). Thus, in the absence of "well-developed theories for specifying such [moderating] variables, the conclusions of replication failures will be ambiguous" (Cesario, 2014, p. 41; see also Meehl, 1978).
[1] There are two steps to understanding this idea. First, because the foundational theories are so insecure, and the field's findings so under dispute, the "correct" empirical outcome of a given experimental design is unlikely to have been firmly established. Second, insofar as the first step applies, the standard by which to judge whether a replication has been competently performed is equally unavailable—since that would depend upon knowing the "correct" outcome of just such an experiment. Thus a "competently performed" experiment is one that produces the "correct" outcome; while the "correct" outcome is defined by whatever is produced by a "competently performed" experiment. As Collins (1985) states: "Where there is disagreement about what counts as a competently performed experiment, the ensuing debate is coextensive with the debate about what the proper outcome of the experiment is" (p. 89). This is the infamously circular experimenter's regress. Of course, as a reviewer for this paper notes, a competently performed experiment should produce satisfactory (i.e., meaningful, useful) results on "outcome neutral" tests.
[2] Assuming that it is a psychology experiment, of course. Note that even if the "same" participants are run through the experiment one more time, they'll have changed in at least one essential way: they'll have already gone through the experiment (opening the door for practice effects, etc.).
Summing up the problem
Hence we have two major points to consider. First, due to a lack of adequate incentives in the reward structure of professional science (e.g., Nosek et al., 2012), actual replication attempts are rarely carried out. Second, to the extent that they are carried out, it can be well-nigh impossible to say conclusively what they mean, whether they are "successful" (i.e., showing similar, or apparently similar, results to the original experiment) or "unsuccessful" (i.e., showing different, or apparently different, results from the original experiment). Thus Collins (1985) came to the conclusion that, in physics at least, disputes over contested findings are likelier to be resolved by social and reputational negotiations—over, e.g., who should be considered a competent experimenter—than by any "objective" consideration of the experiments themselves. Meehl (1990) drew a similar conclusion about the field of social psychology, although he identified sheer boredom (rather than social/reputational negotiation) as the alternative to decisive experimentation:

... theories in the "soft areas" of psychology have a tendency to go through periods of initial enthusiasm leading to large amounts of empirical investigation with ambiguous over-all results. This period of infatuation is followed by various kinds of amendment and the proliferation of ad hoc hypotheses. Finally, in the long run, experimenters lose interest rather than deliberately discard a theory as clearly falsified. (p. 196)
So how shall we take stock of what has been said? A cynical reader might conclude that—far from being a "demarcation criterion between science and nonscience"—replication is actually closer to being a waste of time. Indeed, if even replications in physics are sometimes not conclusive, as Collins (1975, 1981, 1985) has convincingly shown, then what hope is there for replications in psychology?
Our answer is simply as follows. Replications do not need to be "conclusive" in order to be informative. In this paper, we will highlight some of the ways in which replication attempts can be more, rather than less, informative, and we will discuss—using a Bayesian framework—how they can reasonably affect a researcher's confidence in the validity of an original finding. The same is true of "falsification." While a scientist should not simply abandon her favorite theory on account of a single (apparently) contradictory result—as Popper himself was careful to point out[3] (1959, pp. 66-67; see also Earp, 2011)—she might reasonably be open to doubting it, given enough disconfirmatory evidence, and assuming that she had stated the theory precisely. Rather than being a "waste of time," therefore, experimental replication of one's own and others' findings can be a useful tool for restoring confidence in the reliability of basic effects—provided that certain conditions are met. The work of the latter part of this essay will be to describe and to justify at least a few of those essential conditions. In this context, we will draw a distinction between "conceptual" or "reproductive" replications (cf. Cartwright, 1991)—which may conceivably be used to bolster confidence in a particular theory—and "direct" or "close" replications, which may be used to bolster confidence in a finding (Schmidt, 2009; see also Earp et al., 2014). Since it is doubt about the findings that seems to have prompted the recent "crisis" in social psychology, it is the latter that will be our focus. But first we must introduce the crisis.
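The Bayesian idea gestured at here can be put concretely. The following sketch is our own illustration, not the framework developed later in the paper: the function and every probability value below are hypothetical assumptions. It treats confidence in a finding as a prior probability and updates it via Bayes' theorem each time a replication attempt "fails":

```python
def update_confidence(prior, p_fail_given_real, p_fail_given_spurious):
    """Posterior P(finding is real | replication failed), by Bayes' theorem."""
    numerator = p_fail_given_real * prior
    denominator = numerator + p_fail_given_spurious * (1 - prior)
    return numerator / denominator

# Hypothetical numbers: a replication of a real effect might still "fail"
# 30% of the time (low power, hidden moderators), whereas a replication
# of a spurious effect might fail 95% of the time.
confidence = 0.80  # initial confidence in the original finding
for attempt in range(1, 4):
    confidence = update_confidence(confidence, 0.30, 0.95)
    print(f"confidence after failed replication {attempt}: {confidence:.2f}")
```

On these made-up numbers, a single failure is far from decisive, but repeated failures steadily erode confidence, which is the sense in which "failed" replications can be informative without being conclusive.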
The (Latest) Crisis in Social Psychology and Calls for Replication
"Is there currently a crisis of confidence in psychological science reflecting an unprecedented level of doubt among practitioners about the reliability of research findings in the field? It would certainly appear that there is." So write Harold Pashler and Eric-Jan Wagenmakers (2012, p. 529) in a recent issue of Perspectives on Psychological Science. The "crisis" is not unique to psychology; it is rippling through biomedicine and other fields as well (Ioannidis, 2005; Earp & Darby, 2014; Loscalzo, 2012) – but psychology will be the focus of this paper, if for no other reason than that the present authors have been closer to the facts on the ground.
Some of the causes of the crisis are fairly well known. In 2011, an eminent Dutch researcher confessed to making up data and experiments, producing a résumé full of "findings" that he had simply invented out of whole cloth (Carey, 2011). He was outed by his own students, however, and not by peer review nor by any attempt to replicate his work. In other words, he might just as well have not been found out, had he only been a little more careful (Stroebe, Postmes, & Spears, 2012). An unsettling prospect was thus raised: Could other fraudulent "findings" be circulating—undetected, and perhaps even undetectable—throughout the published record? After an exhaustive analysis of the Dutch fraud case, Stroebe et al. (2012) concluded that the notion of self-correction in science was actually a "myth" (p. 670); and others have offered similar pronouncements (Ioannidis, 2012a).

[3] On Popper's view, one must set up a "falsifying hypothesis," i.e., a hypothesis specifying how another experimenter could recreate the falsifying evidence. But then, Popper says, the falsifying hypothesis itself should be severely tested and corroborated before it is accepted as falsifying the main theory. Interestingly, as a reviewer has suggested, the distinction between a falsifying hypothesis and the main theory may also correspond to the distinction between direct vs. conceptual replications that we discuss in a later section. On this view, direct replications (attempt to) reproduce what the falsifying hypothesis states is necessary to generate the original predicted effect, whereas conceptual replications are attempts to test the main theory.
But fraud, it is hoped, is rare. Nevertheless, as Ioannidis (2005, 2012a) and others have argued, the line between explicitly fraudulent behavior and merely "questionable" research practices is perilously thin, and the latter are probably common. John, Loewenstein, and Prelec (2012) conducted a massive, anonymous survey of practicing psychologists and showed that this conjecture is likely correct. Psychologists admitted to such questionable research practices as failing to report all of the dependent measures for which they had collected data (78%[4]), collecting additional data after checking to see whether preliminary results were statistically significant (72%), selectively reporting studies that "worked" (67%), claiming to have predicted an unexpected finding (54%), and failing to report all of the conditions that they ran (42%). Each of these practices alone, and even more so when combined, reduces the interpretability of the final reported statistics, casting doubt upon any claimed "effects" (e.g., Simmons, Nelson, & Simonsohn, 2011).
The motivation behind these practices, though not necessarily conscious or deliberate, is also not obscure. Professional journals have long had a tendency to publish only or primarily novel, "statistically significant" effects, to the exclusion of replications—and especially "failed" replications—or other null results. This problem, known as "publication bias," leads to a file-drawer effect whereby "negative" experimental outcomes are simply "filed away" in a researcher's bottom drawer, rather than written up and submitted for publication (e.g., Rosenthal, 1979). Meanwhile, the "questionable research practices" carry on in full force, since they increase the researcher's chances of obtaining a "statistically significant" finding—whether it turns out to be reliable or not.

[4] The percentages reported here are the geometric mean of self-admission rates, prevalence estimates by the psychologists surveyed, and prevalence estimates derived by John et al. from the other two figures.
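The aggregation described in the footnote is easy to state precisely. A minimal sketch (the input rates below are made-up placeholders, not the actual component estimates from John et al.):

```python
def geometric_mean(values):
    """Nth root of the product of n values."""
    product = 1.0
    for v in values:
        product *= v
    return product ** (1.0 / len(values))

# e.g. a self-admission rate, a peer-prevalence estimate, and a derived estimate
print(round(geometric_mean([0.65, 0.80, 0.90]), 3))
```

Unlike the arithmetic mean, the geometric mean is pulled toward the smallest component, so one conservative estimate keeps the aggregate figure conservative.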
To add insult to injury, in 2012, an acrimonious public skirmish broke out in the form of dueling blog posts between the distinguished author of a classic behavioral priming study[5] and a team of researchers who had questioned his findings (Yong, 2012). The disputed results had already been cited more than 2,000 times—an extremely large number for the field—and even been enshrined in introductory textbooks. What if they did turn out to be a fluke? Should other "priming studies" be double-checked as well? Coverage of the debate ensued in the mainstream media (e.g., Bartlett, 2013).
Another triggering event resulted in "widespread public mockery" (Pashler & Wagenmakers, 2012, p. 528). In contrast to the fraud case described above, which involved intentional, unblushing deception, the psychologist Daryl Bem relied on well-established and widely-followed research and reporting practices to generate an apparently fantastic result, namely evidence that participants' current responses could be influenced by future events (Bem, 2011). Since such paranormal precognition is inconsistent with widely-held theories about "the fundamental nature of time and causality" (LeBel & Peters, 2011, p. 371), few took the findings seriously. Instead, they began to wonder about the 'well-established and widely-followed research and reporting practices' that had sanctioned the findings in the first place (and allowed for their publication in a leading journal). As Simmons et al. (2011) concluded—reflecting broadly on the state of the discipline—"it is unacceptably easy to publish 'statistically significant' evidence consistent with any hypothesis" (p. 1359).[6]
The main culprit for this phenomenon is what Simmons et al. (2011) identified as researcher degrees of freedom:

In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both? ... It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields "statistical significance" and to then report only what "worked." (p. 1359)

[5] Priming has been defined in a number of different ways. Typically, it refers to the ability of subtle cues in the environment to affect an individual's thoughts and behavior, often outside of her awareness or control (e.g., Bargh & Chartrand, 1999).

[6] Even more damning, Trafimow (2003; Trafimow & Rice, 2009; Trafimow & Marks, 2015) has argued that the standard significance tests used in psychology are invalid even when they are done "correctly." Thus, even if psychologists were to follow the prescriptions of Simmons et al.—and reduce their researcher degrees of freedom (see the discussion following this footnote)—this would still fail to address the core problem that such tests should not be used in the first place.
One unfortunate consequence of such a strategy—involving, as it does, some of the very same "questionable research practices" later identified by John et al. (2012) in their survey of psychologists—is that it inflates the probability of producing a "false positive" (or Type 1 error). Since such practices are "common" and even "accepted," the literature may be replete with erroneous results. Thus, as Ioannidis (2005) declared after performing a similar analysis in his own field of biomedicine, "most published research findings" may be "false" (p. 0696, emphasis added). Hence the "unprecedented level of doubt" referred to by Pashler and Wagenmakers (2012) in the opening quote to this section.
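The way such analytic flexibility inflates the false-positive rate can be demonstrated with a small Monte Carlo sketch, in the spirit of the simulations reported by Simmons et al. (2011). Everything below is our own illustrative assumption: the sample sizes, the particular "flexible" behaviors (an extra dependent measure and one round of optional stopping), and the use of a large-sample z-test approximation.

```python
import random

random.seed(1)

def significant(sample_a, sample_b, z_crit=1.96):
    """Two-sample z-test approximation: True if |z| exceeds the critical value."""
    n = len(sample_a)
    mean_a = sum(sample_a) / n
    mean_b = sum(sample_b) / n
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n - 1)
    se = ((var_a + var_b) / n) ** 0.5
    return abs(mean_a - mean_b) / se > z_crit

def run_experiment(n=100, flexible=False):
    """Simulate one two-group experiment in which the null hypothesis is true."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    if not flexible:
        return significant(a, b)
    # Flexibility 1: also test a second, independent dependent measure,
    # and report a positive result if either test reaches significance.
    a2 = [random.gauss(0, 1) for _ in range(n)]
    b2 = [random.gauss(0, 1) for _ in range(n)]
    if significant(a, b) or significant(a2, b2):
        return True
    # Flexibility 2: optional stopping -- collect more data and test again.
    a += [random.gauss(0, 1) for _ in range(n)]
    b += [random.gauss(0, 1) for _ in range(n)]
    return significant(a, b)

trials = 2000
strict = sum(run_experiment() for _ in range(trials)) / trials
flexible = sum(run_experiment(flexible=True) for _ in range(trials)) / trials
print(f"false-positive rate, fixed analysis plan: {strict:.3f}")
print(f"false-positive rate, flexible analysis:   {flexible:.3f}")
```

Even these two modest liberties substantially inflate the nominal 5% false-positive rate under a true null, which is why "common" and "accepted" flexibility can seed the literature with unreliable effects.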
This is not the first crisis for psychology. Roger Giner-Sorolla (2012) points out that "crises" of one sort or another "have been declared regularly at least since the time of Wilhelm Wundt"—with turmoil as recent as the 1970s inspiring particular déjà vu (p. 563). Then, as now, a string of embarrassing events—including the publication in mainstream journals of literally unbelievable findings[7]—led to "soul searching" amongst leading practitioners. Standard experimental methods, statistical strategies, reporting requirements, and norms of peer review were all put under the microscope; numerous sources of bias were carefully rooted out (e.g., Greenwald, 1975). While various calls for reform were put forward—some more energetically than others—a single corrective strategy seemed to emerge from all the din: the need for psychologists to replicate their work. Since "all flawed research practices yield findings that cannot be reproduced," critics reasoned, replication could be used to separate the wheat from the chaff (Koole & Lakens, 2012, p. 608, emphasis added; see also Elms, 1975).
[7] For example, a "study found that eating disorder patients were significantly more likely than others to see frogs in a Rorschach test, which the author interpreted as showing unconscious fear of oral impregnation and anal birth ..." (Giner-Sorolla, 2012, p. 562).
The same calls reverberate today. "For psychology to truly adhere to the principles of science," write Ferguson and Heene (2012), "the need for replication of research results [is] important ... to consider" (p. 556). LeBel and Peters (2011) put it like this: "Across all scientific disciplines, close replication is the gold standard for corroborating the discovery of an empirical phenomenon" and "the importance of this point for psychology has been noted many times" (p. 375). Indeed, "leading researchers [in psychology]" agree, according to Francis (2012), that "experimental replication is the final arbiter in determining whether effects are true or false" (p. 585).
We have already seen that such calls must be heeded with caution: replication is not straightforward, and the outcome of replication studies may be difficult to interpret. Indeed, they can never be conclusive on their own. But we suggested that replications could be more or less informative; and in the following sections we discuss some strategies for making them "more" rather than "less." We begin with a discussion of "direct" vs. "conceptual" replication.
Increasing Replication Informativeness: "Direct" vs. "Conceptual" Replication
In a systematic review of the literature, encompassing every conceivable discipline, Gómez,
Juristo, and Vegas (2010) identified 18 different types of replication. Three of these were from
Lykken (1968), who drew a distinction between "literal," "operational," and "constructive"
replications, which Schmidt (2009) then winnowed down (and re-labeled) to arrive at "direct" and
"conceptual" in an influential paper. As Makel et al. (2012) have pointed out, it is Schmidt's
particular framework that seems to have crystallized in the field of psychology, shaping most of
the subsequent discussion of this issue. We have no particular reason to rock the boat; indeed,
these categories will suit our argument just fine.
The first step in making a replication informative is to decide what, specifically, it is for. "Direct"
replications and "conceptual" replications are "for" different things, and assigning them their
proper roles and functions will be necessary for resolving the "crisis." First, some definitions:
A "direct" replication may be defined as an experiment that is intended to be as similar to the
original as possible (Schmidt, 2009; Makel et al., 2012). This means that along every conceivable
dimension (from the equipment and materials used, to the procedure, to the time of day, to the
gender of the experimenter, and so on) the replicating scientist should strive to avoid making any kind
of change or alteration. The purpose here is to "check" the original results. Some changes will be
inevitable, of course; but the point is that only the inevitable changes (such as the passage of time
between experiments) are ideally tolerated in this form of replication. In a "conceptual"
replication, by contrast, at least certain elements of the original experiment are intentionally
altered, ideally systematically so, toward the end of achieving a very different sort of purpose:
namely, to see whether a given phenomenon, assuming that it is reliable, might obtain across a
range of variable conditions. But as Doyen et al. (2014) note in a recent paper:
The problem with conceptual replication in the absence of direct replication is that there
is no such thing as a "conceptual failure to replicate." A failure to find the same "effect"
using a different operationalization can be attributed to the differences in method rather
than to the fragility of the original effect. Only the successful conceptual replications will
be published, and the unsuccessful ones can be dismissed without challenging the
underlying foundations of the claim. Consequently, conceptual replication without direct
replication is unlikely to change beliefs about the underlying effect. (p. 28)
In simplest terms, therefore, a "direct" replication seeks to validate a particular fact or finding,
whereas a "conceptual" replication seeks to validate the underlying theory or phenomenon (i.e.,
the theory that has been proposed to "predict" the effect that was obtained by the initial
experiment), as well as to establish the boundary conditions within which the theory holds true
(Nosek, Spies, & Motyl, 2012). The latter is impossible without the former. In other words, if we
cannot be sure that our finding is reliable to begin with (because it turns out to have been a
coincidence, or else a false alarm due to questionable research practices, publication bias, or
fraud), then we are in no position to begin testing the theory by which it is supposedly explained
(Cartwright, 1991; see also Earp et al., 2014).
Of course, both types of replication are important, and there is no absolute line between them.
Rather, as Asendorpf et al. (2013) point out, "direct replicability [is] one extreme pole of a
continuous dimension extending to broad generalizability [via "conceptual" replication] at the
other pole, ranging across multiple, theoretically relevant facets of study design" (p. 139).
Collins made a similar point in 1985 (e.g., p. 37). But so long as we remain largely ignorant
about exactly which "facets of study design" are "theoretically relevant" to begin with (as is the
case with much of current social psychology, see Meehl, 1990, and with nearly all of the most
heavily-contested experimental findings), we need to orient our attention more toward the
"direct" end of the spectrum.[8]
How else can replication be made more informative? Brandt et al.'s (2013) "Replication Recipe"
offers several important factors, one of which must be highlighted to begin with. This is their
contention that a "convincing" replication should be carried out outside the lab of origin. Clearly
this requirement shifts away from the "direct" extreme of the replication gradient that we have
emphasized so far, but such a change from the original experiment, in this case, is justified. As
Ioannidis (2012b) points out, replications by the original researchers (while certainly important,
and to be encouraged as a preliminary step) are not sufficient to establish "convincing"
experimental reliability. This is because allegiance and confirmation biases, which may apply
especially to the original team, would be less of an issue for independent replicators.
Partially against this view, Schnall (2014, n.p.) argues that "authors of the original work should be
allowed to participate in the process of having their work replicated." On the one hand, this
might have the desirable effect of ensuring that the replication attempt faithfully reproduces the
original procedure. It seems reasonable to think that the original author would know more than
anyone else about how the original research was conducted, so her viewpoint is likely to be
helpful. On the other hand, however, too much input by the original author could compromise
the independence of the replication: she might have a strong motivation to make the replication a
success, which could subtly influence the results (see Earp & Darby, 2015). Whichever position
one takes on the appropriate degree of input and/or oversight from the original author, however,
Schnall (2014, n.p.) is certainly right to note that "the quality standards for replications need to be
at least as high as for the original findings. Competent evaluation by experts is absolutely
essential, and is especially important if replication authors have no prior expertise with a given
research topic."
[8] Asendorpf et al. (2013) explain why this is so: "[direct] replicability is a necessary condition for further
generalization and thus indispensible for building solid starting points for theoretical development.
Without such starting points, research may become lost in endless fluctuation between alternative
generalization studies that add numerous boundary conditions but fail to advance theory about why these
boundary conditions exist" (p. 140, emphasis added).
Other ingredients for increasing the informativeness of replication attempts include: (1) carefully
defining the effects and methods that the researchers intend to replicate; (2) following as exactly as
possible the methods of the original study (as described above); (3) having high statistical power
(i.e., an adequate sample size to detect an effect if one is really present); (4) making complete
details about the replication available, so that interested experts can fully evaluate the replication
attempt (or attempt another replication themselves); and (5) evaluating the replication results,
comparing them critically to the results of the original study (Brandt et al., 2013, p. 218, paraphrased).
This list is not exhaustive, but it gives a concrete sense of how "stabilizing" procedures (see
Radder, 1992) can be employed to increase the quality and informativeness of
replication efforts.
Falsification, Replication, and Auxiliary Assumptions
Brandt et al.'s (2013) "replication recipe" provides a vital tool for researchers seeking to conduct
high-quality replications. In this section, we offer an additional "ingredient" to the discussion, by
highlighting the role of auxiliary assumptions in increasing replication informativeness,
specifically as these pertain to the relationship between replication and falsification. Consider the
logical fallacy of affirming the consequent, which provided an important basis for Popper's
falsification argument:
If the theory is true, an observation should occur. (Premise 1)
The observation occurs. (Premise 2)
Therefore, the theory is true. (Conclusion)
Obviously, the conclusion does not follow. Any number of things might have led to the
observation that have nothing to do with the theory being proposed (see Earp, 2015, for a similar
argument). On the other hand, denying the consequent (modus tollens) does invalidate the theory,
strictly according to the logic given:
If the theory is true, an observation should occur. (Premise 1)
The observation does not occur. (Premise 2)
Therefore, the theory is not true. (Conclusion)
Given this logical asymmetry between affirming and denying the consequent of a
theoretical prediction (see Earp & Everett, 2013), Popper opted for the latter. By doing so, he
famously defended a strategy of disconfirming rather than confirming theories. Yet if the goal is
to disconfirm theories, then the theories must be capable of being disconfirmed in the first place;
hence, a basic requirement of scientific theories (in order to count as properly scientific; see Earp
& Westermann, under revision) is that they have this feature: they must be falsifiable.
As we hinted at above, however, this basic framework is an oversimplification. As Popper
himself noted, and as was made particularly clear by Lakatos (1978; see also Duhem, 1954;
Quine, 1980), scientists do not derive predictions only from a given theory, but rather from a
combination of the theory and auxiliary assumptions. The auxiliary assumptions are not part of
the theory proper, but they serve several important functions. One of these functions is to provide
the link between the sorts of outcomes that a scientist can actually observe (i.e., by running an
experiment) and the non-observable, "abstract" content of the theory itself. To pick one classic
example from psychology, according to the theory of reasoned action (e.g., Fishbein, 1980),
attitudes determine behavioral intentions. One implication of this theoretical assumption is that
researchers should be able to obtain strong correlations between attitudes and behavioral intentions.
But this assumes, among other things, that a check mark on an attitude scale really indicates the
person's attitude, and that a check mark on an intention scale really indicates the person's
intention. The theory of reasoned action has nothing to say about whether check marks on scales
indicate attitudes or intentions; these are assumptions that are peripheral to the basic theory.
They are auxiliary assumptions that researchers use to connect non-observational terms such as
"attitude" and "intention" to observable phenomena such as check marks. Fishbein and Ajzen
(e.g., 1975; Ajzen & Fishbein, 1980) recognized this and took great pains to spell out, as well as
possible, the auxiliary assumptions that best aid in measuring theoretically relevant variables.
The existence of auxiliary assumptions complicates the project of falsification. This is because
the major premise of the modus tollens argument (denying the consequent of the theoretical
prediction) must be stated somewhat differently. It must be stated like this: "If the theory is true
and a set of auxiliary assumptions is true, an observation should occur." Keeping the second
premise the same then implies either that the theory is not true or that at least one auxiliary
assumption is not true, as the following syllogism (in symbols) illustrates:
(T ∧ A1 ∧ A2 ∧ ... ∧ An) → O (Premise 1)
¬O (Premise 2)
∴ ¬T ∨ ¬A1 ∨ ¬A2 ∨ ... ∨ ¬An (Conclusion)

where T is the theory, A1 through An are the auxiliary assumptions, and O is the predicted observation.
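The validity of this schema can be verified mechanically. The following short Python sketch (an illustration of ours, not part of the original argument) enumerates every truth assignment for a theory T, two auxiliary assumptions A1 and A2, and an observation O, and confirms that no assignment satisfies both premises while violating the conclusion:

```python
from itertools import product

# Brute-force validity check of the modus tollens schema with auxiliary
# assumptions, here with two auxiliaries (A1, A2) standing in for A1 ... An.
for T, A1, A2, O in product([True, False], repeat=4):
    premise1 = (not (T and A1 and A2)) or O        # (T and A1 and A2) -> O
    premise2 = not O                               # the observation fails
    conclusion = (not T) or (not A1) or (not A2)   # not-T or not-A1 or not-A2
    if premise1 and premise2:
        # Whenever both premises hold, the conclusion must hold as well.
        assert conclusion, "counterexample found"

print("Valid: the premises never hold while the conclusion fails.")
```

The same exhaustive check goes through for any number of auxiliaries, which is precisely the point: a failed prediction indicts the conjunction of theory and auxiliaries, not the theory alone.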
Consider an example. It is often said that Newton's gravitational theory predicted where planets
would be at particular times. This is not precisely accurate. It would be more accurate to say that
such predictions were derived from a combination of Newton's theory and auxiliary assumptions
not contained in that theory (e.g., about the present locations of the planets). To return to our
example about attitudes and intentions from psychology, consider the mini-crisis in social
psychology of the 1960s, when it became clear to researchers that attitudes (the kingly
construct) failed to predict behaviors. Much of the impetus for the theory of reasoned action
(e.g., Fishbein, 1980) was Fishbein's realization that there was a problem with attitude
measurement at the time: when this problem was fixed, strong attitude-behavior (or at least
attitude-intention) correlations became the rule rather than the exception. This episode provides a
compelling illustration of a case in which attention to the auxiliary assumptions that bore on
actual measurement played a larger role in resolving a crisis in psychology than debates over the
theory itself.
What is the lesson here? Because failures to obtain a predicted observation can be
blamed either on the theory itself or on at least one auxiliary assumption, absolute theory
falsification is about as problematic as absolute theory verification. In the Newton example,
when some of Newton's planetary predictions were shown to be wrong, he blamed the failures
on incorrect auxiliary assumptions rather than on his theory, arguing that there were additional
but unknown astronomical bodies that skewed his findings; this turned out to be a correct
defense of his theory. Likewise, in the attitude literature, the theoretical connection between
attitudes and behaviors turned out to be correct (as far as we know), with the problem having
been caused by incorrect auxiliary assumptions pertaining to attitude measurement.
There is an additional consequence to the necessity of giving explicit consideration to one's
auxiliary assumptions. Suppose, as often happens in psychology, that a researcher deems a
theory to be unfalsifiable because he or she does not see any testable predictions. Is the theory
really unfalsifiable, or is the problem that the researcher has not been sufficiently thorough in
identifying the necessary auxiliary assumptions that would lead to falsifiable predictions? Given
that absolute falsification is impossible, and that researchers are therefore limited to some kind of
"reasonable" falsification, Trafimow (2009) has argued that many allegedly unfalsifiable theories
are reasonably falsifiable after all: it is just a matter of researchers having to be more thoughtful
about considering auxiliary assumptions. Trafimow documented examples of theories that had
been described as unfalsifiable but that one could in fact falsify by proposing better auxiliary
assumptions than had been imagined by previous researchers.
The notion that auxiliary assumptions can vary in quality is relevant for replication. Consider, for
example, the case alluded to earlier regarding a purported failure to replicate Bargh et al.'s
(1996) famous priming results. In the replication attempt of this well-known "walking time"
study (Doyen et al., 2012), laser beams were used to measure the speed with which participants
left the laboratory, rather than students with stopwatches. Undoubtedly, this adjustment was
made on the basis of a reasonable auxiliary assumption: that methods of measuring time that are
less susceptible to human idiosyncrasies are superior to methods that are more subject to
them. Does the fact that the failed replication was not exactly like the original experiment
render it invalid? At least with regard to this specific feature of this specific replication
attempt, the answer is clearly "no." If a researcher uses a better auxiliary assumption than in the
original experiment, this should add to the replication's validity rather than subtract from it.[9]
But suppose, for a particular experiment, that we are not in a good position to judge the
superiority of alternative auxiliary assumptions. We might then invoke what Meehl (1990) termed the
ceteris paribus (all else equal) assumption. This idea, applied to the issue of direct replications,
suggests that for researchers to be confident that a replication attempt is a valid one, the auxiliary
assumptions in the replication have to be sufficiently similar to those in the original experiment
that any differences in findings cannot reasonably be attributed to differences in the assumptions.
Put another way, all of the unconsidered auxiliary assumptions should be indistinguishable in the
relevant way: that is, all have to be sufficiently equal, or sufficiently right, or sufficiently
irrelevant, so as not to matter to the final result.
[9] There may be other reasons why the "failed" replication by Doyen et al. should not be considered
conclusive, of course; for further discussion see, e.g., Lieberman (2012).
What makes it allowable for the researcher to make the ceteris paribus assumption? In a strict
philosophical sense, of course, it is not allowable. To see this, suppose that Researcher A has
published an experiment and Researcher B has attempted to replicate it, but the replication failed. If Researcher
A claims that Researcher B made a mistake in performing the replication, or just got unlucky,
there is no way to disprove Researcher A's argument absolutely. But suppose that Researchers C,
D, E, and F also attempt replications, and also fail. It becomes increasingly difficult to support
the contention that Researchers B through F all "did it wrong" or were unlucky, and that we should
continue to accept Researcher A's version of the experiment. Even if a million researchers
attempted replications, and all of them failed, it is theoretically possible that Researcher A's
version is the unflawed one and all the others are flawed. But most researchers would conclude
(and in our view, would be right to conclude) that it is more likely that Researcher A
got it wrong than that a million researchers all failed to replicate a true observation. Thus, we are
not arguing that replications, whether successful or not, are definitive. Rather, our argument is
that replications (of sufficient quality) are informative.
Introducing a Bayesian Framework
To see why this is the case, we shall employ a Bayesian framework similar to that of Trafimow (2010).
Suppose that an aficionado of Researcher A believes that the prior probability of anything
Researcher A said or did being correct is very high. Researcher B attempts a replication of an experiment by
Researcher A and fails. The aficionado might continue confidently to believe in Researcher A's
version, but the aficionado's confidence would likely decrease slightly. As there are more
replication failures, then, the aficionado's confidence would continue to decrease
accordingly, and at some point the decrease would push the aficionado's
confidence below the 50% mark, in which case the aficionado would put more credence in the
replication failures than in the success obtained by Researcher A.
In the foregoing scenario, we would want to know the probability that the original result is
actually true given Researcher B's replication failure, P(T|F). As Equation 1 shows, this
depends on the aficionado's prior level of confidence that the original result is true, P(T), the
probability of failing to replicate given that the original result is true, P(F|T), and the overall
probability of failing to replicate, P(F):

P(T|F) = P(T) P(F|T) / P(F)    (1)
Alternatively, we could frame what we want to know in terms of a confidence ratio: the ratio of
the probability that the original result is true to the probability that it is not true, given the failure
to replicate, P(T|F) / P(¬T|F). This is a function of the aficionado's prior confidence ratio about the
truth of the finding, P(T) / P(¬T), and the ratio of the probabilities of failing to replicate given that
the original result is true or not true, P(F|T) / P(F|¬T). Thus, Equation 2 gives the posterior
confidence ratio:

P(T|F) / P(¬T|F) = [P(T) / P(¬T)] × [P(F|T) / P(F|¬T)]    (2)
Suppose that the aficionado is a very strong one, so that the prior confidence ratio is 50. In
addition, suppose the probability ratio pertaining to failing to replicate, P(F|T) / P(F|¬T), is .5. It is
worthwhile to clarify two points about this probability ratio. First, we assume that the probability
of failing to replicate is less if the original finding is true than if it is not true, so that the ratio
ought to be substantially less than 1. Second, how much less than 1 this ratio will be depends
largely on the quality of the replication; as the replication comes closer to meeting the ideal
ceteris paribus condition, the ratio will deviate increasingly from 1. Put more generally, as the
quality of the auxiliary assumptions going into the replication attempt increases, the ratio will
decrease. Given these two values of 50 and .5, the posterior confidence ratio is 25. Although this
is a substantial decrease in confidence from 50, the aficionado still believes that the finding is
extremely likely to be true. But suppose there is another replication failure and the probability
ratio is .8. In that case, the new confidence ratio is 25 × .8 = 20. The pattern should be clear: as
there are more replication failures, a rational person, even if that person is an aficionado of the
original researcher, will experience continually decreasing confidence.
If we imagine that there are N attempts to replicate the original finding that fail, the process
described in the foregoing paragraph can be summarized in a single equation that gives the ratio
of posterior confidences in the original finding, given that there have been N failures to replicate.
This is a function of the prior confidence ratio and the probability ratios associated with the first
replication failure, the second replication failure, and so on:

P(T|F1, ..., FN) / P(¬T|F1, ..., FN) = [P(T) / P(¬T)] × [P(F1|T) / P(F1|¬T)] × [P(F2|T) / P(F2|¬T)] × ... × [P(FN|T) / P(FN|¬T)]    (3)
For example, staying with our aficionado with a prior confidence ratio of 50, imagine a set of 10
replication failures, with the following probability ratios: .5, .8, .7, .65, .75, .56, .69, .54, .73, and
.52. The final confidence ratio, according to Equation 3, would be:

50 × .5 × .8 × .7 × .65 × .75 × .56 × .69 × .54 × .73 × .52 = .54
Note the following. First, even with an extreme prior confidence ratio (we had set it at 50 for the
aficionado), it is possible to overcome it with a reasonable number of replication failures,
provided that the person tallying the replication failures is a rational Bayesian (and that there is
reason to think that those attempting the replications are sufficiently competent in the subject
area and methods to be qualified to undertake them). Second, it is possible to go from a state of
extreme confidence to one of substantial lack of confidence. To see this in the example, take the
reciprocal of the final confidence ratio (.54), which equals approximately 1.85. In other words, the Bayesian
aficionado now believes that the finding is about 1.85 times as likely to be not true as true. If we
imagine yet more failed attempts to replicate, it is easy to foresee that the eventual belief that the
original finding is not true could become as powerful as, or more powerful than, the
prior belief that the original finding was true.
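The arithmetic of this ten-failure example can be reproduced in a few lines of Python (a sketch of ours, using the probability ratios given above):

```python
from functools import reduce

prior_ratio = 50.0  # the aficionado's extreme prior confidence ratio
failure_ratios = [0.5, 0.8, 0.7, 0.65, 0.75, 0.56, 0.69, 0.54, 0.73, 0.52]

# Equation 3: fold each per-failure likelihood ratio P(Fi|T)/P(Fi|not-T)
# into the prior odds.
posterior_ratio = reduce(lambda odds, r: odds * r, failure_ratios, prior_ratio)

print(round(posterior_ratio, 2))      # ~0.54: the odds now favor "not true"
print(round(1 / posterior_ratio, 2))  # ~1.85: roughly how much more likely
                                      # the finding is to be false than true
```

The order of the failures does not matter, only their product, so the same conclusion is reached however the evidence happens to arrive.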
In summary, auxiliary assumptions play a role not only in the original theory-testing
experiment but also in replications, even in replications concerned only with the original
finding and not with the underlying theory. An important auxiliary assumption is the ever-present
ceteris paribus assumption, and the extent to which it applies influences the
"convincingness" of the replication attempt. Thus, a change in confidence in the original finding
is influenced both by the quality and by the quantity of the replication attempts, as Equation 3 illustrates.
In presenting Equations 1-3, we reduced the theoretical content as much as possible, and more
than is realistic in actual research,[10] in considering so-called "direct" replications. As
replications come to serve other purposes, as in "conceptual" replications, the amount of theoretical
content is likely to increase. To link that theoretical content to the replication attempt, more
auxiliary assumptions will become necessary. For example, in a conceptual replication of an
experiment finding that attitudes influence behavior, the researcher might use a different attitude
manipulation or a different behavior measure. How do we know that the different manipulation
and measure are sufficiently theoretically unimportant that the conceptual replication really is a
replication (i.e., a test of the underlying theory)? We need new auxiliary assumptions linking the
new manipulation and measure to the corresponding constructs in the theory, just as an original
set of auxiliary assumptions was necessary in the original experiment to link the original
manipulation and measure to those constructs. Auxiliary assumptions
always matter, and they should be made explicit so far as possible. In this way, it will be easier
to identify where in the chain of assumptions a "breakdown" must have occurred when attempting
to explain an apparent failure to replicate.
Conclusion
Replication is not a silver bullet. Even carefully designed replications, carried out in good faith
by expert investigators, will never be conclusive on their own. But as Tsang and Kwan (1999)
point out:

If replication is interpreted in a strict sense, [conclusive] replications or experiments are
also impossible in the natural sciences. ... So, even in the "hardest" science (i.e., physics)
complete closure is not possible. The best we can do is control for conditions that are
plausibly regarded to be relevant. (p. 763)
[10] Indeed, we have presented our analysis in this section in abstract terms so that the underlying reasoning
can be seen most clearly. However, this necessarily raises the question of how to go about implementing
these ideas in practice. As a reviewer points out, to calculate probabilities, the theory being tested would
need to be represented as a probability model; then, in effect, one would have Bayes factors to deal with.
We note that both Dienes (2014) and Verhagen and Wagenmakers (2014) have presented methods for
assessing the strength of evidence of a replication attempt (i.e., in confirming the original result) along
these lines, and we refer the reader to those papers for further consideration.
Nevertheless, "failed" replications, especially, might be dismissed by an original investigator as
being flawed or "incompetently" performed, but this sort of accusation is just too easy. The
original investigator should be able to describe exactly what parameters she sees as being
theoretically relevant, and under what conditions her "effect" should obtain. If a series of
replications is then carried out, independently by different labs, and deliberately tailored to the
parameters and conditions so described, yet they reliably fail to produce the original result,
then this should be considered informative. At the very least, it will suggest that the effect is
sensitive to theoretically unspecified factors, whose specification is sorely needed. At most, it
should throw the existence of the effect into doubt, possibly justifying a shift in research
priorities. Thus, while "falsification" can in principle be avoided ad infinitum, with enough
creative effort by one who wishes to defend a favored theory, scientists should not seek to
"rescue" a given finding at any empirical cost.[11] Informative replications can reasonably factor
into scientists' assessments of just what that cost might be; and they should pursue such
replications as if the credibility of their field depended on it. In the case of experimental social
psychology, it does.
[11] As Doyen et al. (2014, p. 28, internal references omitted) recently argued: "Given the existence of
publication bias and the prevalence of questionable research practices, we know that the published
literature likely contains some false positive results. Direct replication is the only way to correct such
errors. The failure to find an effect with a well-powered direct replication must be taken as evidence
against the original effect. Of course, one failed direct replication does not mean the effect is non-existent—science
depends on the accumulation of evidence. But, treating direct replication as irrelevant
makes it impossible to correct Type 1 errors in the published literature."
Acknowledgements
Thanks are due to Anna Alexandrova for feedback on an earlier draft of this essay.
References
Ajzen, I., & Fishbein, M. (1980). Understanding attitudes and predicting social behavior.
Englewood Cliffs, NJ: Prentice-Hall.
Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., ... &
Wicherts, J. M. (2013). Replication is more than hitting the lottery twice. European Journal of
Personality, 27, 108-119.
Bargh, J. A., & Chartrand, T. L. (1999). The unbearable automaticity of being. American
Psychologist, 54(7), 462.
Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of
trait construct and stereotype activation on action. Journal of Personality and Social Psychology,
71(2), 230-244.
Bartlett, T. (2013, January). Power of suggestion. The Chronicle of Higher Education. Available
at: http://chronicle.com/article/Power-of-Suggestion/136907.
Bem, D. J. (2011). Feeling the future: experimental evidence for anomalous retroactive
influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407.
Billig, M. (2013). Learn to write badly: how to succeed in the social sciences. Cambridge:
Cambridge University Press.
Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., ... &
Van't Veer, A. (2013). The Replication Recipe: What makes for a convincing replication?
Journal of Experimental Social Psychology, in press.
Braude, S. E. (1979). ESP and psychokinesis: A philosophical examination. Philadelphia, PA:
Temple University Press.
Carey, B. (2011). Fraud case seen as a red flag for psychology research. The New York Times.
Cartwright, N. (1991). Replicability, reproducibility, and robustness: Comments on Harry
Collins. History of Political Economy, 23(1), 143-155.
Cesario, J. (2014). Priming, replication, and the hardest science. Perspectives on Psychological
Science, 9(1), 40-48.
Chow, S. L. (1988). Significance test or effect size? Psychological Bulletin, 103(1), 105.
Collins, H. M. (1975). The seven sexes: A study in the sociology of a phenomenon, or the
replication of experiments in physics. Sociology, 9(2), 205-224.
Collins, H. M. (1981). Son of seven sexes: The social destruction of a physical phenomenon.
Social Studies of Science, 11(1), 33-62.
Collins, H. M. (1985). Changing order: Replication and induction in scientific practice.
University of Chicago Press.
Cross, R. (1982). The Duhem-Quine thesis, Lakatos and the appraisal of theories in
macroeconomics. The Economic Journal, 92(366), 320-340.
Danziger, K. (1997). Naming the mind. London: Sage.
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in
Psychology, 5(Article 781), 1-17. doi: 10.3389/fpsyg.2014.00781
Doyen, S., Klein, O., Pichon, C. L., & Cleeremans, A. (2012). Behavioral priming: it's all in the
mind, but whose mind? PloS One, 7(1), e29081.
Doyen, S., Klein, O., Simons, D. J., & Cleeremans, A. (2014). On the other side
of the mirror: Priming in cognitive and social psychology. Social Cognition, 32, 12-32.
Duhem, P. (1954). The aim and structure of physical theory (P.P. Wiener, Trans.). Princeton, NJ:
Princeton University Press. (Original work published 1906)
Earp, B. D. (2011). Can science tell us what's objectively true? The New Collection, 6(1), 1-9.
Earp, B. D. (2015). Does religion deserve a place in secular medicine? Journal of Medical
Ethics. E-letter. Available at http://jme.bmj.com/content/41/3/229/reply#medethics_el_17551.
Earp, B. D., & Darby, R. J. (2015). Does science support infant circumcision? A skeptical reply
to Brian Morris. The Skeptic, 25(3), in press. Available at
https://www.academia.edu/9872471/Does_science_support_infant_circumcision.
Earp, B. D., & Everett, J. A. C. (2013). Is the N170 face-specific? Controversy, context, and
theory. Neuropsychological Trends, 13(1), 7-26.
Earp, B. D., Everett, J. A. C., Madva, E. N., & Hamlin, J. K. (2014). Out, damned spot: Can the
"Macbeth Effect" be replicated? Basic and Applied Social Psychology, 36(1), 91-98.
Earp, B. D., & Westermann, G. (under revision). Connectionist vs. rule-based models in
cognitive science: Parsimony, falsifiability, and the curious case of the English past tense.
Cognitive Science.
Elms, A. C. (1975). The crisis of confidence in social psychology. American Psychologist,
30(10), 967.
Fanelli, D. (2013). Only reporting guidelines can save (soft) science. European Journal of
Personality, 27, 124-125.
Ferguson, C. J., & Heene, M. (2012). A vast graveyard of undead theories: Publication bias and
psychological science's aversion to the null. Perspectives on Psychological Science, 7(6), 555-
561.
Fishbein, M. (1980). Theory of reasoned action: Some applications and implications.
In H. Howe & M. Page (Eds.), Nebraska Symposium on Motivation, 1979 (pp. 65-116).
Lincoln: University of Nebraska Press.
Fishbein, M., & Ajzen, I. (1975). Belief, attitude, intention and behavior: An introduction
to theory and research. Reading, MA: Addison-Wesley.
Folger, R. (1989). Significance tests and the duplicity of binary decisions. Psychological
Bulletin, 106(1), 155-160.
Francis, G. (2012). The psychology of replication and replication in psychology. Perspectives on
Psychological Science, 7(6), 585-594.
Giner-Sorolla, R. (2012). Science or art? How aesthetic standards grease the way through the
publication bottleneck but undermine science. Perspectives on Psychological Science, 7(6), 562-
571.
Gómez, O. S., Juristo, N., & Vegas, S. (2010, September). Replications types in experimental
disciplines. In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical
Software Engineering and Measurement (p. 3). ACM.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological
Bulletin, 82(1), 1.
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8),
e124.
Ioannidis, J. P. (2012a). Why science is not necessarily self-correcting. Perspectives on
Psychological Science, 7(6), 645-654.
Ioannidis, J. P. (2012b). Scientific inbreeding and same-team replication: type D personality as
an example. Journal of Psychosomatic Research, 73(6), 408-410.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable
research practices with incentives for truth telling. Psychological Science, 23(5), 524-532.
Jordan, G. (2004). Theory construction in second language acquisition. Philadelphia: John
Benjamins.
Jost, J. (2013). Introduction to: An additional future for psychological science. Perspectives on
Psychological Science, 8(4), 414-423.
Kepes, S., & McDaniel, M. A. (2013). How trustworthy is the scientific literature in industrial
and organizational psychology? Industrial and Organizational Psychology, 6(3), 252-268.
Koole, S. L., & Lakens, D. (2012). Rewarding replications: A sure and simple way to improve
psychological science. Perspectives on Psychological Science, 7(6), 608-614.
Lakatos, I. (1970). Falsification and the methodology of scientific research programmes. In I.
Lakatos & A. Musgrave (Eds.), Criticism and the growth of knowledge (pp. 91-196). London:
Cambridge University Press.
Lakatos, I. (1978). The methodology of scientific research programmes. Cambridge, UK:
Cambridge University Press.
LeBel, E. P., & Peters, K. R. (2011). Fearing the future of empirical psychology: Bem's (2011)
evidence of psi as a case study of deficiencies in modal research practice. Review of General
Psychology, 15(4), 371.
Lieberman, M. (2012). Does thinking of grandpa make you slow? What the failure to replicate
results does and does not mean. Psychology Today. Available at
http://www.psychologytoday.com/blog/social-brain-social-mind/201203/does-thinking-grandpa-
make-you-slow.
Loscalzo, J. (2012). Irreproducible experimental results: Causes, (mis)interpretations, and
consequences. Circulation, 125(10), 1211-1214.
Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin,
70(3, Pt. 1), 151.
Magee, B. (1973). Karl Popper. New York: Viking Press.
Makel, M. C., Plucker, J. A., & Hegarty, B. (2012). Replications in psychology research: How
often do they really occur? Perspectives on Psychological Science, 7(6), 537-542.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow
progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often
uninterpretable. Psychological Reports, 66(1), 195-244.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and
two principles that warrant using it. Psychological Inquiry, 1, 108-141.
Mulkay, M., & Gilbert, G. N. (1981). Putting philosophy to work: Karl Popper's influence on
scientific practice. Philosophy of the Social Sciences, 11, 389-407.
Mulkay, M., & Gilbert, G. N. (1986). Replication and mere replication. Philosophy of the Social
Sciences, 16(1), 21-37.
Nosek, B. A., & the Open Science Collaboration. (2012). An open, large-scale, collaborative
effort to estimate the reproducibility of psychological science. Perspectives on Psychological
Science, 7(6), 657-660.
Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and
practices to promote truth over publishability. Perspectives on Psychological Science, 7(6),
615-631.
Pashler, H., & Wagenmakers, E. J. (2012). Editors' introduction to the special section on
replicability in psychological science: A crisis of confidence? Perspectives on Psychological
Science, 7(6), 528-530.
Polanyi, M. (1962). Tacit knowing: Its bearing on some problems of philosophy. Reviews of
Modern Physics, 34(4), 601-615.
Popper, K. (1959). The logic of scientific discovery. London: Hutchinson.
Quine, W. V. O. (1980). Two dogmas of empiricism. In W. V. O. Quine (Ed.), From a logical
point of view (2nd ed., pp. 20-46). Cambridge, MA: Harvard University Press. (Original work
published 1953)
Radder, H. (1992, January). Experimental reproducibility and the experimenters' regress. In PSA:
Proceedings of the Biennial Meeting of the Philosophy of Science Association (pp. 63-73).
Philosophy of Science Association.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological
Bulletin, 86(3), 638.
Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected
in the social sciences. Review of General Psychology, 13(2), 90.
Schnall, S. (2014). Simone Schnall on her experience with a registered replication project. SPSP
Blog. Available at http://www.spspblog.org/simone-schnall-on-her-experience-with-a-registered-
replication-project/.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed
flexibility in data collection and analysis allows presenting anything as significant. Psychological
Science, 22(11), 1359-1366.
Smith, N. C. (1970). Replication studies: A neglected aspect of psychological research.
American Psychologist, 25(10), 970.
Stroebe, W., Postmes, T., & Spears, R. (2012). Scientific misconduct and the myth of self-
correction in science. Perspectives on Psychological Science, 7(6), 670-688.
Trafimow, D. (2003). Hypothesis testing and theory evaluation at the boundaries: Surprising
insights from Bayes's theorem. Psychological Review, 110, 526-535.
Trafimow, D. (2009). The theory of reasoned action: A case study of falsification in psychology.
Theory & Psychology, 19, 501-518.
Trafimow, D. (2010). On making assumptions about auxiliary assumptions: Reply to Wallach
and Wallach. Theory and Psychology, 20, 707-711.
Trafimow, D. (2014). Editorial. Basic and Applied Social Psychology, 36(1), 1-2. doi:
10.1080/01973533.2014.865505
Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, in press.
Trafimow, D., & Rice, S. (2009). A test of the NHSTP correlation argument. Journal of General
Psychology, 136, 261-269.
Tsang, E. W., & Kwan, K. M. (1999). Replication and theory development in organizational
science: A critical realist perspective. Academy of Management Review, 24(4), 759-780.
Van IJzendoorn, M. H. (1994). A process model of replication studies: On the relation between
different types of replication. Leiden University Library. Available at
https://openaccess.leidenuniv.nl/bitstream/handle/1887/1483/168_149.pdf?sequence=1.
Verhagen, J., & Wagenmakers, E. J. (2014). Bayesian tests to quantify the result of a replication
attempt. Journal of Experimental Psychology: General, 143(4), 1457-1475.
Westen, D. (1988). Official and unofficial data. New Ideas in Psychology, 6, 323-331.
Yong, E. (2012). A failed replication attempt draws a scathing personal attack from a psychology
professor. Discover Magazine. Available at
http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen.