Measurement & Evaluation

Most Charities Shouldn’t Evaluate Their Work: Part Two

Who should measure what? Mind the gap.

So what should happen if no one has properly evaluated an idea yet? If it’s important, an independent and suitably skilled researcher should evaluate it in enough detail and in enough contexts for other charities and donors to rely on the findings. The leading medical journal The Lancet cites a tenet of good clinical research: “Ask an important question, and answer it reliably.”

A countercultural implication follows from this. It’s often said that the evaluation of a grant should be proportionate to the size of the grant. It’s also often said that evaluations should be proportionate to the size of the charities. We can see now that both views are wrong. The aim of an evaluation is to provide a reliable answer to an important question. From there, the amount worth spending on an evaluation is proportionate to the size of the knowledge gap and the scale of the programs that might use the answer.

To illustrate, suppose a small company has developed a new drug for breast cancer. The “first-in-(wo)man studies,” as they’re called, involve only a few people, for obvious safety reasons. Relative to the cost of dispensing the drug to those few women, how much should the company spend on evaluating the effect on them? The answer is “a lot,” because the answer is important for many people. So the cost of the “pilot” is irrelevant. So too is the size of the company running the “pilot.” Often, the cost of robustly evaluating a program will exceed the cost of delivering that program—which is fine, if the results are useful to a wide audience.

Conflicted out

So not only are most charities unskilled at evaluations—and we wouldn’t want them to be—but also we wouldn’t want most charities to evaluate their own work even if they could. Despite their deep understanding of their work, charities are the worst people imaginable to evaluate it because they’re the protagonists. They’re selling. They’re conflicted. Hence, it’s hardly surprising that the Paul Hamlyn Foundation study found “some, though relatively few, instances of outcomes being reported with little or no evidence to back this up.”

I’m not saying that charities are corrupt or evil. It’s just unreasonable—possibly foolish—to expect that people can be impartial about their own work, salaries, or reputations. As a charity CEO, I’ve seen how “impact assessment” and fundraising are co-mingled: Charities are encouraged to parade their self-generated impact data in fundraising applications. No prizes for guessing what happens to self-generated impact data that isn’t flattering.

To my knowledge, nobody’s ever examined the effect of this self-reporting among charities. But they have in medicine, where independent studies produce strikingly different results to those produced by the protagonists. Published studies funded by pharmaceutical companies are four times more likely to give results favorable to the company than are independent studies. It’s thought that around half of all clinical trial results are unpublished, and it doesn’t take a genius to figure out which half that might be.

Who should do evaluations?

Skilled and independent researchers, such as academics, should normally take on evaluation of ideas. They should be funded independently and as a public good, such that charities, donors, and others can access them to decide which ideas to use. It’s no accident that The Wellcome Trust, the UK’s largest charity, requires that all the research it funds is published in open-access journals.

The charity itself can normally take on monitoring of implementation.

Useful resources and ideas for funders

Academics and others have evaluated ideas well and published the results. Smart funders use that material to avoid funding something that is known to be useless or even harmful. To be fair, much available information could be easier to find and understand. But some of it is already easily accessible:

In first-world education, The Education Endowment Foundation (a £125 million fund of UK government money to improve education for 5- to 16-year-olds in England) has collated and analysed evidence on interventions in many countries, and created a wonderful toolkit (“menu”) that shows the quality of the evidence and the apparent strength of each intervention. The foundation is rigorously evaluating all the interventions it funds, publishing the results, and adding them to the toolkit.

In international development, several entities publish all their evaluation findings. They include the Abdul Latif Jameel Poverty Action Lab at MIT (J-PAL) and Innovations for Poverty Action, which have collectively run some 400 impact evaluations. Also, the International Initiative for Impact Evaluation (3ie) database includes more than 600 impact evaluations, as well as systematic reviews of all the evidence on particular topics. Recent systematic reviews have looked at small-scale farmers, female genital cutting, and HIV testing.

In health, The Cochrane Collaboration uses a network of more than 28,000 medics in over 100 countries to produce systematic reviews of many types of intervention. Its database already has more than 5,000 such reviews, including some related to health care in disaster and emergency situations.

Various universities have centers that produce and publish such research. One example is Oxford University’s Centre for Evidence-Based Intervention, which looks at social and psychosocial problems.

The UK government’s new What Works Centres will create libraries of evidence about crime and policing, aging, and various other sectors in which UK charities and donors operate.

These are free resources. Smart donors and charities use them. And they publish their own evaluations—with full methodological detail—so that others can learn from them. It’s essential that charities’ work is evaluated properly so that resources can flow to the best. That means appropriately skilled and independent people should evaluate a charity’s work only when necessary.



  • BY Jacob M.

    ON May 30, 2013 01:18 PM

    While provocative, I think your blog posts miss the mark on a few fronts.

    Before I dive into my critique, there are several points you make that are spot on. 1) I think we can all agree that better implementation of interventions is probably a good thing. So let’s take that off the table. 2) There’s no question that self-evaluations by any organization - be it for-profit, nonprofit, or a government org - are not credible for external audiences. That is why, for example, the US government uses the GAO (among others) to evaluate efficacy.

    With that out of the way, as I understand it, your proposed solution is largely to have nonprofits select program models which have RCTs performed by a reputable 3rd party. This is problematic for two reasons.  First, it discourages innovation.  The fact is if nonprofits adopt the “best” programs of today, it will come at the expense of trying new models out that might be better.  While in some sectors we might be happy with the best programs we have now, in others - e.g., education - the state-of-the-art isn’t really that great. Personally, I don’t think the sector’s at a stage where we’d be happy homogenizing our program models.

    Second, your focus solely on implementation ignores the role of experimentation by nonprofits at a more micro level. For-profits do this type of small-scale experimentation all the time - in the tech sector they call it “A/B” testing. So while fidelity is one goal, it should be balanced by testing small tweaks, gathering feedback, and revising. For me, this is a form of self-evaluation that is valuable. It is along the lines of what my former colleague, Matt Forti, might call “measuring to improve”. I’d note that this deliberate testing requires an organization to be great at implementing its program; otherwise any comparison between groups would be meaningless - so we don’t disagree there.

    Basically, I don’t trust just a few fringe programs to be doing the bulk of the “R&D” for the sector.  I don’t think they have the capacity to do so and I think our base of evidence in most sectors is incredibly limited. Until we have a reasonable base, I’d favor diversity over homogeneity - and that’s exactly what local experimentation encourages.


  • BY kevin starr

    ON May 31, 2013 10:14 AM

    I think Jacob nailed it. While Caroline is right that it isn’t useful to continually answer the same questions, every organization tweaks the idea a bit, and unless an idea has been (a) thoroughly proven and (b) executed with disciplined exactitude, it’s not really safe to equate implementation with impact. It’s kind of like a business making a product that was lucrative for someone else and deciding to forgo the measurement of profit. Especially for start-ups, it is necessary to measure impact to iterate toward maximizing it. We’re not talking J-PAL-esque RCTs, just a way of keeping your finger on the pulse of impact that allows you to integrate an ongoing stream of information into operations and the iteration of the model.

    There are too many links in the idea-implementation-behavior change-impact chain for most organizations to simply measure implementation.  Caroline’s piece is carefully nuanced and makes some excellent and important points, and she clearly doesn’t think impact is unimportant, but nuance is something that is all too easy to ignore - and the last thing the social sector needs is an unintended meta-message that we don’t need to measure impact.

  • BY Richard Piper

    ON June 3, 2013 03:12 AM

    I’m pleased to see this from Caroline. I have been arguing for three years (and covering it in my Impact Leadership training courses) that UK civil society needs a purpose-led approach to evaluation (that is, research to answer specific questions). We also need less data, more learning, and a more collective approach to evaluation. The idea that everything should be evaluated is nuts. Sometimes, frontline workers find themselves asking beneficiaries again and again whether the same intervention is working, when they already know it is (or isn’t).

    Three points of disagreement with Caroline. First, I don’t understand why you keep equating ideas with evaluation and implementation with monitoring. Ideas can be monitored and implementation can be evaluated. I think it’s positively unhelpful to link these two elements in this way.

    Second, I think we need to separate the concept of design from implementation. Idea > design > implementation would be a better way of thinking about the success of a programme. In my experience, the design phase is often where too many assumptions are made and where the mistakes occur. Drawing attention to it would help.

    Third, we do need to accept that organisations need ‘impact evidence’ for funding purposes - at least as long as funders or donors expect it. Rather than being dismissive of this, we need to accept that it is one possible purpose for an evaluation. Not all evaluations are about service design and improvement - some are about proving ourselves to funders. Researchers might not like that, but they should embrace it.

    What I’d really like to see is organisations using evaluations done by others to show how their similar project is likely to fare, and using this to secure funding. Funders should be more explicit in asking for information about why a project might work and why it might not, and accepting evidence for that judgement from sources other than the organisation itself.

    Final point: while I’m with you on diversity over homogeneity, Jacob (and who couldn’t be, if you phrase it like that), I also favour smart efficiency over bloated waste, and we need a collective vision of how to move towards smarter, better-designed projects based on more accessible, context-rich, sector-wide learning about what has worked and what hasn’t.

  • BY Caroline Fiennes

    ON June 4, 2013 02:25 PM

    Caroline here.
    Great, thoughtful comments.

    I totally agree that “organisations [should use] evaluations done by others to show how their similar project is likely to fare, and ... Funders should be more explicit in asking for information about why a project might work and why it might not, and accepting evidence for that judgement from sources other than the organisation itself.”

    I also agree that it’s not the case that everything should be RCT’d. Not everything can be. Most obviously, where n=1, so there’s no control group, e.g., much advocacy and campaigning work, and some climate change / societal attitudinal change work. Nonetheless, even in those situations, it’s reasonable to (a) ask what we’ve learnt from other situations that indicates how the work may fare, (b) not ask the protagonist to evaluate itself, and (c) look for comparative data, since we’re allocating scarce resources and need to optimise as best we can.

    It’s not the case that the approach here impedes innovation or experimentation. Actually, it might aid it (by freeing up resources from repetitive, pointless ‘reporting’). Moreover, it helps us to see better which innovations are worth having. You can tell that rigorous evaluations don’t deter innovation by looking at health: that field has masses of innovations, each of which gets evaluated rigorously and independently.

  • BY Ryan Edwards

    ON June 7, 2013 11:29 PM

    Hi everyone,

    Good on you for writing these articles, Caroline.

    I broadly agree with all the points raised, by yourself and some of the commentators, and think you have done a nice job of demystifying ‘social impact measurement’.

    The most fundamental point you raise is that this is just quantitative impact evaluation: measurement means quantification, and impact, by definition, necessitates considering a change in outcomes that can be attributed to (caused by) something. It is not complicated, and I welcome domestic charitable sectors lining up with the international charitable and international development sectors in sharing expertise, the latest techniques, and lessons from different contexts - long overdue.

    There are four brief comments I would like to add though, as you do simplify things a bit (which is probably necessary):

    1. It is up to the funder and organization to determine what they ‘need’, as Jeremy Nichols put it, and this will likely result in different levels of rigor. But as you allude to, as rigor declines, the scope for self-biased results (even if they aren’t biased) creeps in quickly, and the results may lose credibility - not a risk I would be comfortable with, either as an investor or as a program manager.

    I do not see why you would not try to get the highest level of rigor possible all the time, at least for ‘new ideas’. This increases certainty for all involved, creates the right incentives for researchers to want to collaborate, and helps to build capacity in the sector and the stock of credible knowledge to be shared. Moreover, research needs to answer new questions and have novelty to have any chance of being published, so if practitioners are interested in social innovation, they should work with the researcher community as much as they can to leverage resources. The claims that experiments and replication can hinder innovation are misguided as well. Sure, preserving the reliability of your impact estimate limits mid-program adaptation and innovation, but this can be managed and worked around quite easily, and it is misguided to let this be a deal-breaker. The research-driven examples listed by Caroline above have not led to homogeneity or killed innovation - rather the opposite - and they have started this very healthy conversation.

    2. RCTs obviously aren’t the only quantitative impact evaluation method, and the method should be set by the program and context (what economists call ‘identifying assumptions’). This is important to stress, as many kick and scream at the thought of randomizing, which might not actually be necessary or feasible. That said, it is more often than not possible to find a way to randomize in an ethical and fair way, even including everyone, with a careful design.

    3. Impact evaluation design is rapidly evolving, and the most innovative designs can look at different types of activities at once, iterative changes, and all other kinds of complexities. Indeed, implementation evaluation is a critical part of any good impact evaluation too. Rather than criticizing good things that are happening in the community with the same old rhetoric, it is more productive to keep building on this to address current concerns, as is evidently being done.

    4. I would caution against saying that we only need to test an idea once, or that we can be certain that it works in different places with different organizations after a few replications, as the implementation arrangements and contexts may differ substantially. Beyond very specific mechanism studies, most evaluations have limited external validity, RCT or otherwise, and it is quite easy to argue why a particular demonstrated impact in one place from one activity may not hold in another. In this case, at least a replication evaluation would be prudent. 



  • BY Alan Ratermann

    ON December 1, 2013 02:22 PM

    Hello Caroline,
    I am very thankful for your posts and have found them insightful and helpful. I was just wondering how you would address the issue of the individuality of nonprofit organizations. While I agree that academics have a better whole picture understanding of theory and how to create the best program based on research, each city and community is unique and has its own twists on challenges and resources. Here I could see a need for nonprofit organizations to evaluate.

  • BY Steve Lurie

    ON January 8, 2014 01:53 PM

    Interesting articles and exchange of views. Part of charity program monitoring involves testing for quality, which is a must if you want your NGO accredited. Monitoring involves assessing inputs, processes, outputs, and outcomes. Without a focus on this, programs will just keep doing what they do without questioning what could be better and learning from failure.
    For example, some years ago we did an analysis of the relatively few client deaths in our service (we run a community mental health service) and found that most deaths (73%) were attributable to chronic disease. As a result, we have developed resources for our service users and staff to better manage chronic physical illnesses.

    In addition, we are now mining our client databases to identify predictors of eviction from supportive housing, rehospitalization, and so on.

    Without an internal focus on evaluation and intelligent use of the data we collect, it is difficult to discern quality or tell our funders what we are accomplishing and learning.
