Measurement & Evaluation

Fueling Nonprofit Innovation: R&D Vigor Trumps Randomized Control Trial Rigor

Research and development can help more nonprofits learn, innovate, and reach goals faster and for less money.

At the Social Impact Exchange conference in New York a few months ago, I heard the leader of a rapidly growing, national youth-serving nonprofit proudly declare that the organization was about to begin a costly randomized control group study to evaluate its programs. The chief executive described this as the “gold standard” for assessing and refining program effectiveness in the nonprofit sector. But should it be? Or—for on-the-ground program designers—should the gold standard be a research and development (R&D) approach to evaluation that pulls apart a program to figure out the nuts and bolts of what works, for whom, and under what conditions?

Most nonprofit leaders believe that a comparison group study employing randomized control trials is the most valuable tool for evaluating program effectiveness. These rigorous research methods are used to prove whether a whole program (i.e., an intervention that has been completed and/or has crossed some dosage threshold that deems it fully implemented) led to greater results for those who participated versus those who did not. If the participants’ group average on a measurable outcome is “significantly” higher than the non-participant group’s average, the program is considered successful. But it is important to note that such a study has to prove only that the average outcome of those who received the whole program experience is statistically higher than the average of those who did not participate, and that the difference wasn’t due to chance.

Statistical significance in no way means that everyone in the intervention group succeeded! In fact, most statistically significant differences in comparison group studies are unremarkable when you look at the average scores of the intervention and control groups. For example, a research brief on Early Head Start, published on the US Department of Health and Human Services website, makes the following claim: “Early Head Start programs produced statistically significant, positive impacts on standardized measures of children’s cognitive and language development. When children were age 3, program children scored 91.4 on the Bayley Mental Development Index, compared with 89.9 for control group…” This finding may benefit funders and policy makers, but the statistically significant difference can’t be leveraged from a design standpoint. What is a program designer supposed to do with the knowledge that participants scored 1.5 points better?
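
The Early Head Start numbers above make the point concrete. The short Python sketch below shows how a 1.5-point gap can clear the significance bar while remaining tiny in practical terms. The group sizes and the standard deviation (the Bayley MDI is normed to a standard deviation of 15) are illustrative assumptions, not figures from the study.

```python
import math

# Why a statistically significant result can be practically tiny.
# The 91.4 vs. 89.9 means come from the Early Head Start brief quoted
# above; the sample sizes and common SD are assumptions for illustration.
mean_t, mean_c = 91.4, 89.9        # treatment and control group means
sd = 15.0                          # assumed common standard deviation
n_t = n_c = 1000                   # assumed group sizes

# Two-sample t statistic from summary statistics (equal variances)
se = sd * math.sqrt(1 / n_t + 1 / n_c)
t = (mean_t - mean_c) / se

# Cohen's d: the same difference expressed in standard-deviation units
d = (mean_t - mean_c) / sd

print(f"t = {t:.2f}")   # ≈ 2.24 with 1,000 per group: "significant" at p < .05
print(f"d = {d:.2f}")   # but the effect is only 0.10 SD: very small
```

With large enough samples, almost any nonzero difference becomes significant; the effect size is what tells a program designer whether the difference matters.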

Comparative group designs do not lend themselves to the real-time learning and fast program adaptations demanded by the complex and tumultuous environment in which nonprofits operate today. This type of evaluation is not required to examine why some program participants do not achieve a desired outcome. To continually refine their programs, nonprofit leaders need to know much more, including which members of the group benefited, which did not, why, and the explicit cause-and-effect relationships. And nonprofit leaders must be involved in interpreting the data. They cannot afford to be on the sidelines, waiting for a professional evaluator to collect data, draw conclusions, and write and deliver a report, while programmatic rigor mortis sets in.

In the private sector, R&D helps product and service designers analyze what is and is not working for different customers, understand the various contributing factors, and continually test new ways to serve more people better. What does an excellent R&D function look like in a nonprofit organization? Research begins by going straight to the source—the student, the theater-goer, the homeless person—to get direct feedback on how the program has affected their lives. It looks at which program elements worked and for whom, and designers pay attention to both strong and weak performance to decipher how particular program ingredients cause short-term results for specific sub-groups. After preliminary and often sophisticated data analyses are completed, program leaders are deeply engaged in interpreting the data and spearheading the innovation or re-design process, with an evaluator in a technically supportive role. And the testing process is ongoing.

Though these R&D practices can benefit nonprofits, few practice them. Recently, TCC Group examined the aggregate results of over 2,500 nonprofit organizations across the country and found that only 5 percent of nonprofits are engaging in R&D practices. The study also discovered that organizations that use R&D practices are almost two and a half times more likely to grow at or above the annual rate of inflation, regardless of the size of the organization’s budget. In particular, the following R&D behaviors are uniquely and significantly correlated with organizational sustainability and growth:

• Gathering data directly from program recipients to determine how to improve services
• Determining outcome metrics by listening to, documenting, and sharing actual client success stories and results
• Engaging key leaders and staff in interpreting the client-derived data
• Evaluating a program to figure out what aspects of it work, rather than whether the program as a whole makes an impact
• Bringing program design leaders together to assess and address the resources needed to deliver programs effectively
• Leveraging R&D insights to inform the program implementation team

Compared to rigid social science methods, R&D can help more nonprofits learn, innovate, and reach goals faster—for less money. It is also a more pragmatic way to assess program design replicability and costs, and to develop better business models that support the realistic expansion of high-impact nonprofit programs. To accelerate the scaling of social innovation over the next few years, nonprofits need to rely less on the rigor of academic experimental design and more on the vigor of R&D.

Peter York is hosting a free webinar on this topic on August 18. Register for “Success by Design: An R&D Approach to Evaluation” here.

Peter York is senior partner and director of research at TCC Group, a national management consulting firm.



  • BY Elizabeth Kronoff, Insaan Group

    ON August 16, 2011 10:51 PM

    Great point.

    I think that rather than look at non-profits, though, we need to look at funders. Bi-lateral and multi-lateral agencies, and governments, do not often fund NGOs’ playing around with ideas. They want specific plans, with specific goals that they can later evaluate.

    I would also point out that in this case there may be a confounding factor behind the link between growth and R&D expenditures; correlation is not causation. It may be that better-connected people in richer areas, who have access to more money, are more likely to spend time doing R&D, because they can. They don’t have rent to pay because they can room with affluent friends, or their parents pay, or their spouses do.

    R&D is, without a doubt, necessary and effective. But it’s not possible without at least a little bit of access to no-or-few-strings capital.

  • BY Chris Baca

    ON August 17, 2011 09:55 AM

    Great article and a perceptive, knowledgeable response from Elizabeth Kronoff, Insaan Group. I head up two substantial nonprofits here in New Mexico and have for the past 38 years. Almost no one will fund evaluation, and certainly not R&D activities, to the extent private for-profit entities can. What we are left with is the comparative studies, surveys, or programmatic delivery outcomes that funders require. Then we are criticized for not “proving” our services work. It is an exhausting and vicious cycle.

    The other point made by Elizabeth is access to funding from philanthropic and corporate organizations.  Unfortunately, states like New Mexico do not have access to those dollars because these large entities tend to be in the Northeast, the West Coast or in larger metropolitan areas.  Though our needs are great we often work with bare bones budgets with most resources mandated to go to service delivery.

  • BY Karl Murray, FW Business Ltd

    ON August 18, 2011 01:28 AM

    Too often funders want results for programmes of social change that cannot be delivered within 12 to 18 months, and to add insult to injury, they often have monitoring processes that then ensure you measure and report on outputs, which is not what they say they want; they want outcomes, which, certainly as they relate to attitudinal changes, take much longer to determine. R&D, if broken down, actually reflects action research methodology, which shares many of the key concepts indicated as advantages over pure and/or clinical quantitative approaches. What most non-profit organisations fail to do, due largely to size and capacity, is to invest sufficient time at the outset of programme design in monitoring and evaluation. This will drive the development potential and the success criteria, not whether a programme proves to be statistically significant by what could amount to a variance of 1 percent, which does not inform what that 1 percent represents in real terms to allow changes to either programme content or delivery mode.

  • BY Ellie Buteau, Vice President - Research, Center fo

    ON August 21, 2011 01:26 PM

    I appreciate the potential value of an R & D approach, Peter, but that shouldn’t negate the value of RCTs. So much of the debate about evaluation in the nonprofit sector ends up being about RCTs versus other approaches, rather than a focus on when a particular approach is the best choice and will yield the most useful information given the question at hand.

    Any design should be selected because it is the best way to answer a particular question, and the question to be answered should be directly related to the stage of the organization or program being tested. Not all questions in the field are best answered through an RCT. But some are.

    In addition, I was surprised to see part of the argument against RCTs be that “statistical significance in no way means that everyone in the intervention group succeeded!” But where in the field is the myth being perpetuated that statistical significance does mean that the relationships found in the data apply to everyone and in the same way? I’ve not heard it before. The fact that an RCT can pick up on statistical variability in the data – beyond whether or not a program may have been successful – based on certain characteristics of the individuals involved, is part of its value. If designed to do so, RCTs can allow us to test hypotheses about the ‘for whom’ and ‘under what conditions’ questions for which this post is advocating.

  • BY Chris Langston

    ON August 22, 2011 07:58 AM

    Dear Peter - Thanks for your provocative post, white paper, and the talk you gave on this material a few weeks ago in NYC.

    I find that I agree with your conclusion as to the value of R&D but strongly disagree with the reasoning whereby you got there.

    On the R&D aspect, I agree that simple (affordable) ongoing measurement of important process aspects of a program (e.g., proportion seen in a timely way, proportion participating as designed), beneficiary characteristics, and program outcomes (health status, educational attainment, etc.) is important. With these measures you can continue to monitor a program to be sure that it continues to be delivered as designed and results in promised benefits. It also enables you to look at subsets of the beneficiaries and tweak your program/intervention so that its effectiveness might be improved.

    However, I don’t think fielding a major new program and taking it to scale is a good idea without serious evidence of its efficacy and effectiveness from as strong a design as possible - i.e., a real experiment.

    I suspect that part of our difference in thinking stems from different experience with the value of interventions in general.  In your talk, I believe you said something in passing along the lines that most things work, at least a little.  This would not be my experience.  In my experience most things don’t work and using the R&D approach on an intervention that in truth has no true benefit, will be misleading at best and harmful at worst.  Even in the absence of any real effect, there is always some subset for which the intervention will seem to work by chance alone.  Trying to explain, understand, or refine a vapor effect is unlikely to get anywhere good.

    Secondly, the major strength of experiments (aka RCTs) is their ability to test causal relationships. Your proposed R&D approach cannot test causality, although you describe it as if it could. Unfortunately, amid all of the interrelated phenomena in human social systems, without external control of a hypothesized cause and random assignment of people/units to that cause, you just can’t be sure what is cause, what is effect, and what is spurious correlation.

    I think that another source of our difference in views is an assumption about what appropriate control conditions are. In careful, theory-driven, hypothesis-testing experimental research, you try to derive a simplified system (e.g., genetically identical lab rats) to enable you to manipulate one factor at a time (drug vs. placebo). In applied research, we can relax this simplification principle, at the cost of some theoretical clarity, and use more real-world control conditions. In modern medical research it is understood that comparing a new drug to a placebo fails to answer questions about comparative effectiveness (i.e., the new drug versus an old drug).

    Lastly, you oversimplify your characterization of experiments. I think the essence is specification of and control over the hypothesized cause, plus random assignment. There is nothing in the definition (or practice) of experimental research that prevents you from measuring (not manipulating) characteristics of participants, contexts, and hypothesized intermediate causes - and doing much of what you claim for R&D.

    For example, you might have an educational intervention that you hypothesize has its effect by increasing self efficacy for reading and is supposed to improve standardized test scores.  You can randomly assign people to the intervention or education as usual, measure efficacy for reading, and test scores.  You might find that the intervention increases scores but efficacy doesn’t seem to mediate the effect.  You might find that while there is an effect on average, it mostly works for girls rather than boys, etc.  All the advantages you claim for the R&D approach, plus strong evidence of the causal relationship of the intervention to the outcome - i.e., real reason to believe that (on average) people assigned to the intervention will benefit. 

    Admittedly it is expensive, but given the costs of getting it wrong, I think it is worthwhile.  As a funder, what I have learned is to buy the highest quality evaluation possible.  The costs of having to do an evaluation over again to rule out alternative interpretations or convince some sceptical stakeholder, is even more expensive in money and delay.

  • BY Kim Cook, Nonprofit Finance Fund

    ON August 23, 2011 08:16 AM

    I completely appreciate the value of the many layers of thinking in this post and subsequent comments. While not a researcher, I do endeavor to think about measurement as it relates to impact, and data as it relates to supporting (or negating) an argument about effectiveness, and I often work with clients to evaluate their financial model as well as articulate their case to the funding community. I appreciate Karl Murray’s comment about integrating monitoring and evaluation into the program design process. This, to me, creates the potential for a living systems approach to programs, wherein evaluation and monitoring are utilized and valued as a feedback loop for mid-course adjustments.

    In the nonprofit sector we rarely have a beginning, middle, and end with a product as a result; we are in process, and change processes at that, so thoughtful creation of tools that inform us on progress throughout delivery could provide us with a state of continual learning and improvement. For me, this is critical because I’d like to see the utility of monitoring and evaluation move beyond the demonstration of worth, or meeting funder reporting requirements. There are really exciting things going on in the world of data analytics and data visualization that may also move us forward in remarkable depth as we seek to understand what works, what doesn’t, and what happened.

    May I also say that I appreciate Chris Langston’s comments because they are so deeply informative in an arena that is more abstract for me. Finally, when it comes to the term R&D: it seems closely linked to corporate jargon and is typically understood as part of product development for sales. I completely value the term as metaphor, but wonder if it will be confusing in the nonprofit sector (and hence all the comments about money in the for-profit sector). Cheers - thanks for such a great post that brings about valuable interaction. I would say, as a measure of impact and worth, all the responses say thumbs up!

  • BY Isaac Castillo, Latin American Youth Center

    ON August 23, 2011 12:26 PM

    Peter (and others that have contributed in the comments) - I very much welcome and appreciate the comments made here.  My hope is that this sort of dialogue can continue to happen collectively across three important groups:  funders/philanthropists/grant-making agencies, evaluators/researchers, and the nonprofits/service providers. 

    Too often, these three groups talk past each other, rather than having an honest and frank discussion about the best form of evaluation for everyone’s needs and who ultimately is going to bear which portions of the cost.

    I agree with Chris Langston that RCTs are valid and useful in certain situations - but we all have a responsibility to communicate to funders that RCTs aren’t the most appropriate way to measure effectiveness in every instance, even if they are the most likely to eliminate other variables.  I’m guessing the chief executive in Peter’s introduction didn’t arrive at the language of RCTs as ‘gold standard’ from other non-profits, or even most modern evaluators, but rather from the large funder that is providing support only because an RCT is being done. 

    But while I agree with Chris in a theoretical sense - that we need to evaluate and measure all program interventions (since not all will work, and some may actually cause harm) - expecting every non-profit to do a high level evaluation (like an RCT) is simply beyond the financial and technical scope of all but the largest nonprofits.  Not every program intervention needs to ‘prove’ causality (which in itself is a bit of a misnomer, since we can only talk about ‘proving’ causality in a statistical sense) - they just need to know if their intervention works or not so they can make improvements.

    ‘Causality’ of programmatic impacts IS important to funders and others who want to scale or replicate programs, however. But if funders truly want results that can be replicated in other places (and this of course assumes that replication is even possible), then the funders should be willing to pay the millions of dollars necessary to do high-quality evaluation.

    And then my train of thought was lost due to the earthquake here in Washington.  More later.

  • BY Wendy McClanahan, Senior Vice President for Resear

    ON August 24, 2011 04:21 PM

    I agree with the essence of Peter’s argument, which I believe to be that randomized control trials (RCTs) are not the be-all and end-all of program evaluation. We are in a policy environment where there is much ado about program effectiveness but very limited common understanding of what is meant by “effective.” As such, our response should not be to categorically abandon the RCT but to fight for responsible and program-focused evaluation—which can and should include RCTs and alternative evaluation approaches. Here’s how we can do it:

    •  Stop using RCTs at the wrong times and for the wrong programs.

    There are many programs that are simply not appropriate for random assignment. For example: those that are too small or too new, those that are struggling with implementation challenges, programs that don’t turn any applicants away and thus can’t create a control group, or programs that provide broadly enriching experiences for young people (visiting museums, playing sports) rather than attempting to make a distinct measurable impact with a precisely defined intervention. Learning about what constitutes program effectiveness in each case is not just a matter of gauging causality, as RCTs do, nor of collecting data on outcomes, which is a frequent technique for programs that can’t afford (or aren’t ready for) an RCT. In fact, one could argue (and I will) that judging the effectiveness of a single program is really not what we should be focused on in social policy. To grow up happy and healthy and to become productive members of society, young people need MULTIPLE positive interventions along the way. We don’t expect any single experience to have life-changing impacts on middle-class youth, though we do hold programs that serve young people from high-poverty communities accountable for achieving this goal. We need to have a broader view of what it takes to change young people’s lives and to adjust our evaluation methods (as well as our definition of effective) accordingly. 

    •  Translate the results of RCTs into something programs can use.

    When RCTs do make sense, it is critical that researchers effectively translate the findings to actually improve program performance—and make a conscious effort to combine this rigorous approach with good implementation research that gathers information about the how, who and what of a program’s day-to-day execution.

    •  Use RCTs more creatively.

    Random assignment could be used—though it hasn’t typically been—to test the effectiveness of specific program practices. While evaluators generally approach these questions in an exploratory way, in some cases it is possible to answer them more rigorously—by experimentally manipulating different program components (for example, the length or intensity of a program, or the type of training and support provided to program staff). By combining this research with thoughtful cost/benefit analysis, we can determine if programs, funders and taxpayers are getting their money’s worth for various practices.

    •  Interpret the results of RCTs (and other evaluations) appropriately. 

    As Peter points out, we too often assume that a statistically significant positive finding indicates that a program works—but the size of those impacts may be small and, therefore, not that meaningful. This is a problem with many different research and evaluation approaches, not just RCTs. In fact, even an R&D approach—to the extent that one tallies and compares information from different groups—would have to face the same “meaningful difference” test. The good news is that the field is growing and beginning to expect that evaluators and programs demonstrate that the size of the differences detected using impact studies is meaningful.

    In sum, there is room in the social program evaluation world for a variety of approaches—including both R&D efforts and more traditional RCTs—and each should be used to meet programs where they are and then identify the correct evaluative method. All of these approaches can provide extremely valuable information to program leaders, policymakers and the field at large—and grantmakers should invest in them more frequently.

  • BY Mark Dynarski, President, Pemberton Research

    ON August 29, 2011 02:23 PM

    Like any carpenter, researchers have tools for different kinds of questions to be answered. Trials are powerful tools for measuring program effectiveness, if that is the question. Using other research tools can add to what is known.

    Contrasting trials with R&D creates a false dichotomy, like suggesting it’s preferable to practice medicine rather than surgery. The post argues that using R&D methods rather than trials is the way for programs to “pull apart a program to figure out the nuts and bolts of what works, for whom, and under what conditions… the explicit cause-and-effect relationships.” Knowing what’s working and what’s not working quickly becomes a complex question when more than one service or component is in the package. Doing the unpacking is desirable and the most valid way to do it is…with trials. Other research approaches can yield misleading or just plain wrong conclusions.

    The post notes that “statistical significance in no way means that everyone in the intervention group succeeded!” It’s true that trials compare means of two groups and do not provide findings for individuals. But the post goes on to argue that R&D methods can accomplish this by asking individuals whether they are succeeding. Individuals can know whether their situations are better than before, but they (and program designers) cannot know whether their success is because of the program. This kind of “post hoc” attribution is a fallacy and combating the fallacy is one of the primary reasons trial methods were created. Misinterpreting the informational yield of trials should not be a basis for arguing not to use them.

    The post reports its own research finding that “organizations that use R&D practices are almost two and a half times more likely to grow at or above the annual rate of inflation, regardless of the size of the organization’s budget.” This is a classic example of a correlation being confused with a cause. The kinds of R&D practices used by an organization are correlated with its culture, management style, staff experience, and other factors. But it’s a fallacy to imply, as the quote does, that using these methods will mean an organization will be more likely to grow.

    The post argues that “nonprofit leaders must be involved in interpreting the data.” But accountability and the possibility, if not the likelihood, of real or perceived conflicts of interest argues for caution when those whose ideas are being tested also are those who interpret the data about the effectiveness of the ideas. It’s useful for nonprofit leaders to examine the data and develop conclusions about what may or may not be happening in a program, but allowing researchers to arrive at their own conclusions is an important ingredient of objectivity.

    Asking how best to use powerful tools like trials is useful. Things can go wrong if tools are used wrong. Let’s discuss how to exploit the enormous potential of trials as tools for innovation. More trials, even if small and highly focused on a new service or new idea, will contribute to knowing what works.

  • BY Fikisha Thomas banning smoking on (AHEC)

    ON November 14, 2011 08:15 AM

    Research and development, along with comparing groups and having group leaders, can benefit nonprofit organizations; the trick is finding out how many nonprofit organizations are willing to give it a try. Few nonprofit organizations use research and development, according to the information in the article. Analyzing and studying people may seem a little lab-rat-ish, but it seems to contribute to how we understand ourselves a little better. Saving money in this economy is the thing to do right now, and if research and development can help with that, then that would be beneficial.
