Each year governments spend billions of dollars on programs addressing poverty, education, health, and other social issues. Unfortunately, only a small fraction of these funds are being spent on programs with strong evidence that they work.

Evidence-based policy—sometimes called "Moneyball for Government"—is a small, but growing, effort to change this. But, just as it is getting started, is it on the verge of a major shift? Yes, according to a group of contrarian thinkers who believe that the dominant evidence paradigm in US social policy is too narrowly focused on replicating programs evaluated with randomized controlled trials.

In an article by Srik Gopal and Lisbeth Schorr published earlier this year (and in an earlier article by Schorr), the authors say they are skeptical of evidence hierarchies like those that have been developed by the Obama administration to steer greater funding to programs that work—a view also shared by the group Friends of Evidence. These ideas, they say, need rethinking.

Is this analysis correct? It is an important question, because the analysis, while both thoughtful and provocative, is a direct challenge to the prevailing thinking on evidence-based policy.

To be sure, Schorr and her colleagues make many good points that command broad support. Their emphasis on continuous improvement is sound, as is their support for a broad and inclusive definition of evidence that also values qualitative research, case studies, results drawn from performance management efforts, insights from experience, and professional judgment.

But not all evidence is equal. Such evidence, while valuable, is often wrong, driven by biases that may be either overt or hidden. For example, facing possible backlash from funders, policymakers, and public opinion, practitioners are rarely brave enough to publicly admit failure. Even among those without vested interests, professional judgment is often clouded by deeply held ideological beliefs.

Are charter schools a good thing or bad thing? Do social impact bonds represent a step forward or step backward? The answer usually depends on who you ask.

The Friends of Evidence critique of the dominant evidence paradigm includes many legitimate criticisms, but it is too forgiving of the flaws inherent in its own proposed alternative. It seems to reject the test-and-replicate model favored by the Obama administration, instead favoring localized and context-driven approaches that emphasize continuous improvement. But it does so without proof that this alternative will produce better results. Indeed, Schorr rejects the very idea that such proof is needed, arguing, “We must be willing to shake the intuition that certainty should be our highest priority.”

Certainty? No. But greater confidence? Yes. Unfortunately, to reject evidence hierarchies is to promote a form of evidence relativism, where everyone is entitled to his or her own views about what constitutes good evidence in his or her own local or individualized context. By resisting the notion that some evidence is more valid than others, they are defining evidence down.

Such relativism would risk a return to the past, where change has too often been driven by fads, ideology, and politics, and where entrenched interests have often preserved the status quo. Regardless of one’s views, most can agree that the status quo is not good enough. The prevailing (although still relatively new) evidence paradigm represents a rationalist challenge to that status quo.

The Prevailing Evidence Paradigm

Evidence-based policy as it is currently defined by the Obama administration includes substantial investments in evaluations, the development of evidence clearinghouses to review and rate these and other outside studies, and a growing effort to push more federal funding into programs with demonstrated results.

This effort is substantially bipartisan in nature, with Republicans calling for similar policies, including tiered-evidence funding strategies that shift a greater share of federal spending to programs with proven results and pay-for-success initiatives that make payments contingent upon achieving pre-determined performance metrics. This bipartisanship is also evident in the work of organizations like Results for America, whose “Moneyball for Government” efforts routinely pair Democrats and Republicans in support of such policies.

These efforts have a long history (such as in welfare policy), but they accelerated during the Bush and Obama administrations. Their efforts have led to the development of evidence clearinghouses like the Department of Education’s What Works Clearinghouse, the Department of Labor’s Clearinghouse for Labor Evaluation and Research, and the Department of Justice’s CrimeSolutions.

The Obama administration also pioneered tiered-evidence initiatives, creating programs like the Social Innovation Fund, Investing in Innovation program at the Department of Education, and evidence-based teen pregnancy prevention programs. Evidence-based home visiting took root late in the Bush administration, but it embraces the same ideas.

More recently, such strategies, which typically emphasize evidence rooted in randomized controlled trials (RCTs), have also taken hold in education legislation that replaced No Child Left Behind. They are also at the center of new bipartisan proposals to reorient the nation’s foster care system by keeping more children with their families.

Similar work is taking place at the state and local levels. The John D. and Catherine T. MacArthur Foundation and Pew Charitable Trusts have invested heavily in the Results First Initiative, which is helping states develop evidence-based approaches to public policy. Bloomberg Philanthropies is making similar investments at the local level in an initiative called What Works Cities.

These efforts are arguably just a beginning, but they collectively represent the dominant evidence-based paradigm that Gopal, Schorr, and their colleagues are criticizing.

Evidence Hierarchies: Panning for Gold in the Evidence Stream

The core of this group’s argument is directed against the use of evidence hierarchies, particularly those that emphasize the use of RCTs, a widely used experimental method often referred to as the “gold standard” of evaluation.

Arguments for and against RCTs have a long history. Schorr herself accepts that RCTs have their uses, but views them as just one tool among many. The group's larger criticisms are directed not against RCTs, but against their dominant position in the evidence hierarchy.

She and Gopal argue that evidence hierarchies as they are currently constructed are flawed. “We are skeptical of the prevailing enthusiasm for a social change strategy that relies on scaling up model programs with preference going to those with the most elegant evaluation methodology,” they write.

In arguing for a broader evidence base, they are on solid ground. In fact, the Obama administration agrees, writing in its latest budget submitted to Congress earlier this year:

The best government programs use a broad range of analytical and management tools, which collectively comprise an “evidence infrastructure,” to learn what works (and what does not) for whom and under what circumstances, as well as improve results. Broadly speaking, “evidence” is the available body of facts or information indicating whether a belief or proposition is true or valid. Evidence can be quantitative or qualitative and may come from a variety of sources, including performance measurement, evaluations, statistical series, retrospective reviews, and other data analytics and research.

But supporting a broad definition of evidence is not the same thing as saying that all evidence is equally valid. According to the same presidential document, “rigorous impact evaluations, particularly randomized experiments, can provide the most credible information on the impact of the program on outcomes, isolated from the effects of other factors.”

According to the 2014 Economic Report of the President, which discussed the role of evaluation in improving federal programs:

It is well recognized within Congress and other branches of government (for example, GAO 2012, National Research Council 2009), in the private sector (Manzi 2012), in non-governmental research organizations (Coalition for Evidence-Based Policy 2012, Walker et al. 2006), and in academia (for example, Imbens 2010; Angrist and Krueger 1999; Burtless 1995) that evaluations measuring impacts on outcomes using random assignment provide the most definitive evidence of program effectiveness.

Not everyone is willing to take the administration’s word for it. Critics sometimes argue that observational studies can be as accurate as random assignment studies. Two of these critics once humorously suggested in a faux research study that RCT supporters should participate in a randomized evaluation of parachutes (some with, some without) to be certain of their usefulness. RCT proponents, they wrote, “need to come down to earth with a bump.”

Unfortunately, few issues in the social sciences are as straightforward as a parachute, and Schorr and Gopal themselves note that such issues are often complex. As Caroline Fiennes has previously argued, “The real purpose of the scientific method is to make sure nature hasn’t misled you into thinking you know something you actually don’t know.”

Such complexity can pose challenges, but it is usually not as challenging as Gopal and Schorr imply. In fact, one of the advantages of RCT-based designs, which utilize both comparison groups and randomization, is that they isolate program impacts and protect against other complex and often hidden influences on their results (such as participant motivation) that commonly undermine competing study designs and lead to misleading conclusions. The real challenges lie elsewhere.
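The protection that randomization offers can be illustrated with a small simulation. The sketch below is purely hypothetical and is not drawn from any study discussed here: the effect size, the influence of motivation, and the way people select into the program are all assumptions chosen for illustration. It compares a naive treated-versus-untreated comparison when motivated people opt in with the same comparison when a coin flip decides who participates.

```python
import random
from statistics import mean

TRUE_EFFECT = 2.0  # assumed true program impact (illustrative, not from any study)


def estimate_impact(n, randomized, seed):
    """Return the difference in mean outcomes between participants and non-participants."""
    rng = random.Random(seed)
    treated_outcomes, control_outcomes = [], []
    for _ in range(n):
        # Hidden trait that improves outcomes regardless of the program.
        motivation = rng.gauss(0, 1)
        if randomized:
            # Random assignment: participation is unrelated to motivation.
            treated = rng.random() < 0.5
        else:
            # Self-selection: more motivated people are more likely to enroll.
            treated = motivation + rng.gauss(0, 1) > 0
        outcome = 3.0 * motivation + rng.gauss(0, 1)
        if treated:
            treated_outcomes.append(outcome + TRUE_EFFECT)
        else:
            control_outcomes.append(outcome)
    return mean(treated_outcomes) - mean(control_outcomes)


observational_estimate = estimate_impact(20000, randomized=False, seed=1)
randomized_estimate = estimate_impact(20000, randomized=True, seed=1)

print(f"Assumed true effect:    {TRUE_EFFECT:.2f}")
print(f"Self-selected estimate: {observational_estimate:.2f}")
print(f"Randomized estimate:    {randomized_estimate:.2f}")
```

Under these assumptions, the self-selected comparison credits the program with gains that actually come from motivation, while the randomized comparison lands close to the assumed effect. Real evaluations involve many complications that this toy example leaves out.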

In their article, Gopal and Schorr point to John Ioannidis, who estimated that experimental replications in psychology may fail at a rate of 80 percent or more. But they ignore his real reasons why. According to both the article they cited and his earlier work, this panoply of wrongness is largely due to various forms of research bias. Among his conclusions: “The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.”

RCTs are not perfect, of course, nor are they universally applicable. If they are poorly designed they can be as misleading as any other study, which is why there are checklists to determine whether they have been well-conducted. Specialized skills are usually needed to review these and other studies, which is one of the reasons that evidence clearinghouses exist.

Evaluation is also an evolving field. Other research designs are beginning to emerge that are nearly as powerful as RCTs and may be usable when RCTs are not. Some researchers are using design replication studies to evaluate the validity of these designs, but (importantly) their accuracy is determined by comparing their results to those of gold standard, randomized experiments within the context of the same study.

However, agreeing that evaluation is a rapidly evolving field, that RCTs are not appropriate in all cases, that many forms of evidence are valuable, and that they have different (and often complementary) uses is not the same thing as denying the validity of the existing evidence hierarchies.

Replication: A Challenging but Obtainable Goal

What good is a study, even one using “gold standard” evaluation methods, if we cannot replicate the results? According to Gopal and Schorr, the dominant evidence paradigm “fetishizes the notion that if a program is certified as ‘effective’ in one setting, it will—when replicated with fidelity to the original model—be effective elsewhere.”

“Context can make or break an initiative,” they write. “What worked in Peru may not work in Poughkeepsie, even if the same intervention is used with the same ‘dosage’ and even in clinical trials, which are in many ways the intellectual forbearers of social sector RCTs.”

True enough, but this is not a new argument. External validity, as evaluators know it, is a major focus of evidence-based policy. This is one reason why the highest place in the various administration-backed evidence hierarchies is usually reserved for programs with two or more well-conducted RCT studies, or at least one large, multi-site study. Systematic reviews that synthesize results across multiple studies, such as those conducted by Cochrane or the Campbell Collaboration, take a similar approach.

Does every evidence-based initiative work the same way in all cases? No. Does context matter? Absolutely. Gopal and Schorr’s suggestion that clearinghouses should better document such conditions is well made. But they take their arguments too far if they are suggesting that rigorously evaluated initiatives cannot be successfully replicated. In fact, there have been many successful replications. One recent example is the successful five-year scale up of KIPP charter schools, which was rigorously evaluated by Mathematica Policy Research. Another is the Nurse Family Partnership, which was successfully replicated in Holland (a very different context from the United States, where it originated), among other places. The Coalition for Evidence-Based Policy’s Top Tier Evidence website lists many others and more are emerging every year.

Peering Inside “Black Boxes”: A Key to Continuous Improvement

One commonly made argument against randomized experiments is that their focus on “what works” is too simple and binary, and that they provide too little information about “why, how, for whom, and under what conditions.” Garnering such information often requires that we peek inside the so-called “black box” of a program’s inner workings to determine the relative importance of its various component parts.

Such information is often necessary for successful replication, because it identifies which components can be adapted to local conditions and which must be replicated faithfully. It also provides a basis for refinements and adaptations that could improve a program further.

This is a solid argument as far as it goes, but it ultimately fails for a simple reason: It is a false choice. There is no requirement that researchers conduct randomized experiments as stand-alone studies, with no additional analysis. In fact, most studies of this caliber include implementation studies, which are often substantially qualitative in nature. Mixed methods of this kind are standard operating procedure.

Researchers can also subject individual program components to more-rigorous analysis. Many evaluation techniques include randomization and other aspects of more-rigorous studies. Examples include implementation science, rapid-cycle evaluation, and factorial designs (such as this approach in Head Start). The Administration for Children and Families recently sponsored a meeting of experts devoted to reviewing such techniques, and the White House has an entire unit devoted to rigorous A/B testing of proposed program improvements to produce, in the words of The New York Times, “a better government, one tweak at a time.”

One federal effort in child welfare, called the Permanency Innovations Initiative, combines the best of both worlds. Under this demonstration program, local innovators spend the early phases of their grant developing and fine-tuning their initiatives through a process of continuous improvement, subjecting them to formal impact evaluations only when they are ready. The process does not end with this evaluation, however. It continues with ongoing continuous improvement activities coupled with occasional evaluations to validate the results.

Far from being a barrier to continuous improvement, rigorous evaluations, including those involving random assignment, have become a tool that enables improvement. Shaun Donovan, President Obama's director of the White House Office of Management and Budget, noted the shift at an event earlier this year: "Too often the history of performance management and evidence has used a ‘gotcha’ approach," he said. "We have started to focus on how you find ways to improve programs and make them work better."

Systems Change

Finally, Schorr supports the idea that the dominant evidence paradigm too often erroneously assumes that stand-alone “silver bullet” programs can achieve ambitious goals. Complex problems demand complex interventions, she says, and success often demands systems-level change.

A report Schorr co-authored with Frank Farrow and Joshua Sparrow cites Patrick McCarthy, the president of the Annie E. Casey Foundation, who wrote that "decades of experience tell us that a bad system will trump a good program—every time, all the time."

Schorr’s observations might sound reasonable, but again they represent a false choice. Both are important and both can be subjected to rigorous evaluation. Far from being in conflict, they are highly synergistic.

Much systems-level change can be (and has been) evaluated rigorously. Take schools: Whole-school reforms have been subjected to randomized studies at the school-wide level. Success for All is a recent example. A recent study by MDRC examined its impact by comparing results across 37 randomly assigned schools.

Counties are also complex systems. A rigorous study of the Positive Parenting Program that examined its impact across randomly assigned counties showed that it successfully prevented child maltreatment at the county level. Another study examined the impact of cash transfers on children’s health and education outcomes by comparing effects across 505 randomly assigned villages in Mexico.

Moreover, randomized evaluations are not the only rigorous way to examine systems-level change. Researchers can often use quasi-experimental evaluations to examine policy changes across counties or other complex systems, particularly when the launch or phase-in of new programs or policies is staggered, as often occurs naturally.

Changing Tides, Changing Prospects

Is the tide “shifting away from a narrow focus on experimental evidence of program impact,” as Schorr suggests? If emphasis is placed on the word “narrow,” then yes. But it is not clear that the focus was ever as narrow as she says. It certainly is not today.

In fact, there are areas of strong agreement between what Schorr advocates and the evidence paradigm she is criticizing. Both share a devotion to continuous improvement. Both acknowledge the value of a broad range of evidence.

But there are also areas of strong disagreement. The dominant view continues to place greater value on more-rigorous evidence, particularly evidence drawn from well-conducted, multi-site RCTs, than on other, less-rigorous evidence. Counterarguments based on replication difficulties, supposed conflict between rigorous evaluation and continuous improvement, and a reputed devotion to one-off “silver bullet” solutions to the exclusion of systems change all miss the mark.

If the tide is turning, there is little sign of it yet. But that does not mean it could not happen.

Setting aside the real possibility of improvement, evaluations can create winners and losers. Those deemed “losers” often do not go away quietly. We can expect entrenched interests that feel threatened by the evidence-based policy paradigm to fight back.

The bipartisanship that has been so prevalent on this issue may also come to an end. There are already differences between reformers and Democrats, who argue that evidence should be used to shift more funding to programs that work, and conservatives and Republicans, who often argue that it should be used to defund those that don’t.

But there is also a more optimistic scenario. A clear-eyed review of the history of evidence-based policy shows that the trend so far has been toward greater use of evidence, not less. Ideas that have been shown to work usually spread, even if not as quickly as their supporters might hope.

The development of credible evidence for social policy is still in infancy, and we must do more to spread and incentivize its use. But if the current trend continues, we may look forward to a brighter future, one where the status quo is not preserved, but overcome—one where societal problems, long seen as intractable, at long last yield to the forces of scientific progress.
