Desks in a classroom (Photo via Unsplash/Feliphe Schiarolli)

Many nonprofits in low- and middle-income countries face a critical mismatch: urgent social problems demand rapid program iteration, yet organizations often wait years for externally produced evaluation results. When they do conduct rigorous evaluations, these are typically one-off studies that rarely keep pace with evolving implementation contexts or inform real-time decisions.

This tension between problem urgency and evidence generation speed is familiar to many implementers. After our organization, Youth Impact, ran an initial randomized controlled trial (RCT) in Botswana on an HIV and teen pregnancy prevention program, we faced new questions relevant to government scale-up. The RCT showed that near-peer educators effectively changed risky teen behavior while other messengers, such as public school teachers, did not, but government partners needed ongoing answers about cost-effectiveness, implementation variations, and program adaptations. Waiting years between evaluation cycles meant missing the window to influence program design and consequential government reforms.

We needed an approach that maintained rigorous standards but operated at implementation speed. The technology sector offered a model: Microsoft alone runs approximately 100,000 A/B tests each year to continuously optimize products. A famous Gmail experiment testing different advertising link colors reportedly generated $200 million annually for Google, showing how small, rigorously tested variations can have outsized impact.

While social impact programs present unique complexities, we have found that a similar underlying approach translates well to the social sector. Iterative A/B testing uses randomization to compare multiple program variations, answering questions about efficiency and cost-effectiveness in addition to questions about general effectiveness (as in a traditional RCT). It also produces causal evidence in weeks or months, instead of the years a traditional randomized trial takes. Iterative A/B testing has a critical role to play in unlocking social impact: causal evidence delivered rapidly enough to optimize programs during implementation and scale-up.

Closing the Gap Between Evidence Generation and the Pace of Implementation

At Youth Impact we have progressed from running one RCT in 2014 to running 75+ randomized tests (RCTs and A/B tests) cumulatively as of 2025, and this number continues to grow. We have identified three core principles that make A/B testing well-suited to implementation needs—what we call the “Three Rs.”

Rigorous: A/B testing uses randomization to generate causal evidence with the same rigor as RCTs. Beyond establishing proof of concept and answering the question “does the program work?”, iterative A/B tests focus on ongoing optimization questions like “how can the program work better and more cheaply?” This type of question centers efficiency and cost-effectiveness in service of refining programs for scale. Given the focus on optimization, A/B tests most often compare different versions of a program rather than treatment versus control. Because all participants receive programming, just different versions of it, evidence generation and rapid scale-up can take place simultaneously.
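
To make the mechanics concrete, here is a minimal sketch of cluster-level random assignment to two program versions. The helper function and school identifiers are hypothetical illustrations, not Youth Impact’s actual tooling:

```python
import random

def assign_versions(cluster_ids, seed=2025):
    """Randomly assign clusters (e.g., schools) to program versions A and B."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    shuffled = list(cluster_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # Every cluster receives programming; randomization only decides which version.
    return {c: ("A" if i < half else "B") for i, c in enumerate(shuffled)}

# Example: 80 schools split evenly between two versions of a program
assignment = assign_versions([f"school_{n:03d}" for n in range(80)])
print(sum(v == "A" for v in assignment.values()), "schools in version A")
```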

Rapid: A/B testing produces results in weeks or months rather than years, enabling real-time program adjustments during implementation cycles. “Golden indicators”—outcomes consequential enough to drive decisions yet measurable quickly enough to inform them—enable this speed. For education programs, this might be foundational literacy and numeracy outcomes; for health programs, it could be knowledge assessments or behavior measures. This rapid turnaround makes A/B testing especially valuable during scale-up, when small improvements in cost-effectiveness compound across thousands of beneficiaries and when waiting years for evidence may mean missing the window to influence program design.

Regular: Iterative A/B testing is not an event but a process, designed to test programmatic tweaks every implementation cycle and create a continuous feedback loop where each test informs the next and learning accelerates over time. For example, for one tutoring program, we conducted 12 successive A/B tests that yielded efficiency improvements in seven of the 12, ranging from 5 to 30 percent each, with the largest gains derived from cost-reducing modifications and caregiver engagement strategies. This regular cadence transforms organizational culture from viewing evaluation as a separate episode to seeing it as part of an iterative learning process core to program operations. Gains come both from occasional breakthroughs in individual tests and from small effects that compound cumulatively, typically over five to ten iterations.
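
To see why this compounding matters, consider an illustrative calculation, assuming seven successful tests that each deliver a 5 percent cost-effectiveness gain (the low end of the range above; the figures are hypothetical, not results from our data):

```python
# Illustrative compounding of efficiency gains across successive A/B tests.
# Assumes seven successful tests, each worth a 5% cost-effectiveness gain.
cumulative = 1.0
for gain in [0.05] * 7:
    cumulative *= 1 + gain
print(f"Cumulative improvement: {cumulative - 1:.0%}")  # ~41%
```

Even with no single breakthrough, modest gains multiply into a roughly 41 percent cumulative improvement.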

Operationalizing A/B Testing: Guidance for Fellow Implementers

Many organizations wonder whether they have the capacity to implement A/B testing. To get started, the keys are a willingness to experiment, solid data collection infrastructure, and a commitment to using evidence for decisions. Youth Impact started simple and gradually built more robust systems. Our partners, ranging from small NGOs to large international organizations, followed similar paths, each adapting A/B testing to their context and capacity. Here we share several tips, drawn from our in-depth toolkit, for how other organizations can integrate A/B testing into their own measurement and evaluation practices.

Get the Plumbing Right With Strong Data Systems

Because A/B testing is an ongoing process, the data it uses is typically internal and already part of an organization’s ongoing program monitoring. Three central characteristics make a monitoring system “A/B testing ready”: right-fit indicators, high-frequency data collection on those indicators, and sample sizes sufficient to detect meaningful differences between program versions A and B.

To make quick, consequential decisions, it is important to have a program indicator that is “golden”—that is, an indicator that both changes fast enough and is meaningful enough. Such an indicator ideally measures progress over a few weeks or months, rather than years, and sits sufficiently “far to the right” in a program’s theory of change that an improvement in the indicator also means important social outcomes are likely to improve. For an educational program like Teaching at the Right Level (TaRL), which Youth Impact supports across four countries, foundational literacy and numeracy are golden indicators. They directly measure the program’s goal (children’s learning), respond to implementation changes within a school term, and can be assessed through a simple learning assessment, such as ASER. Unlike input metrics such as teachers trained or distant outcomes like exam pass rates, foundational learning sits in the sweet spot of fast enough and meaningful enough for decision-making.

Not all programs have a ready-to-go golden indicator in this sweet spot. For example, in Choices, our HIV and teen pregnancy prevention program, knowledge of risky behavior is easy to measure, but it was not always clear how much it predicted later impact. On the other hand, HIV infections and pregnancies are rare outcomes that are expensive to measure, making them difficult to collect for routine monitoring. Organizations can invest in a process to develop their golden indicator, or use a tiered approach for rapid testing, starting with “bronze” indicators (e.g., knowledge of HIV risks) and “silver” indicators (e.g., changing dating behaviors) and over time validating that changes in these indicators lead to impact (e.g., reductions in HIV infections and pregnancies).

Once an organization has identified a primary indicator for A/B testing, it is important to have a strong system for data collection at sufficient scale and frequency. A/B testing generally needs at least 60-80 units for cluster randomization (e.g., schools or classrooms) or 1,200+ individuals for individual-level randomization to detect the effects of small program variations. This data should be collected at high frequency, ideally monthly or every program cycle (e.g., per school term for education programs), and processed quickly enough to inform real-time decisions. As an organization scales, serving more people allows for more A/B testing across geographies and programs, and in turn more learning and cost-effectiveness gains.
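
For a rough sense of where sample sizes like these come from, a standard power calculation takes only a few lines. This is a simplified sketch assuming individual-level randomization, a two-sided test, and a small standardized effect; clustered designs additionally require a design-effect adjustment for intra-cluster correlation:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per arm to detect a small standardized effect
# (Cohen's d = 0.15) with 80% power at a two-sided 5% alpha.
n_per_arm = TTestIndPower().solve_power(effect_size=0.15, power=0.8, alpha=0.05)
print(round(n_per_arm))  # ~699 per arm, roughly 1,400 individuals in total
```

Detecting smaller differences between versions A and B pushes required samples up quickly, which is why the thresholds above matter for routine monitoring data.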

Simple Starts, Substantial Gains

Organizations new to A/B testing understandably want to start generating transformative insights from the outset. Our experience is that the surest path to developing organizational capacity and confidence for A/B testing is to begin with “muscle-building” tests that prioritize the underlying learning methodology over having the perfect first test. An advantage of iterative A/B testing is that there are many chances to refine a design, so the first test is meant to grease the wheels; breakthrough insights will follow once the system is fully up and running.

As an example, we kicked off A/B testing for TaRL with a simple variation: adding icebreakers (participatory games and songs) at the beginning and end of class. This was an easy-to-implement, low-cost tweak that helped jumpstart our process. We later went on to examine how children should be further grouped during TaRL lessons (e.g., subgrouping students within a classroom by multiple learning competencies, such as operations and number recognition) and to introduce structured observation guides for teacher mentors.

Our partners Meerkat Learning, Save the Children Bangladesh, and Building Tomorrow took the same approach and tackled increasingly sophisticated questions. For example, Meerkat Learning, which supports the government of Namibia in scaling TaRL, started with phone call follow-ups to increase data submission by teachers. It later progressed to a more consequential A/B test that replaced in-person school visits with phone-based coaching to support teachers, reducing coaching costs at scale by over a third. Save the Children in Bangladesh and Building Tomorrow in Uganda started by sending text messages to parents and teachers to improve engagement in education. Later, Save the Children tested the optimal sequencing of numeracy and literacy programming to see which subject should come first; Building Tomorrow optimized the cadence of teacher coaching visits. By engaging in an ongoing learning journey, with multiple progressive tests, organizations can start simple and build toward breakthrough innovations over time.

Once an organization has an A/B testing rhythm, it can progress from tweaks to transformative insights. For example, with ConnectEd, a phone tutoring program using TaRL-inspired targeted instruction principles, we started simple: Does twice-weekly SMS outreach improve outcomes versus once-weekly? We found this tweak made no difference, so we moved on to other questions. As our testing maturity grew, so did the complexity of our questions and logistics. We tested whether encouraging caregiver participation in tutoring sessions improved outcomes, a change that doubled impact at almost no cost, making it one of the most cost-effective tweaks in global education. This breakthrough sparked additional tests on optimal caregiver engagement strategies.
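
For teams curious what the analysis step looks like, comparing two versions often reduces to a difference in means with a significance test. The sketch below uses simulated, hypothetical scores; a real analysis of a cluster-randomized design would also use cluster-robust inference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical endline scores on a "golden indicator" (e.g., a literacy
# assessment) for learners under version A vs. version B of a program.
scores_a = rng.normal(loc=50, scale=10, size=600)
scores_b = rng.normal(loc=52, scale=10, size=600)

diff = scores_b.mean() - scores_a.mean()
t_stat, p_value = stats.ttest_ind(scores_b, scores_a)
print(f"Difference in means: {diff:.2f} points (p = {p_value:.3f})")
```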

Iterative, Ongoing Learning

As learning agendas progress over time, A/B tests often fall into two broad categories: effectiveness-enhancing tests and cost-reducing tests. Effectiveness-enhancing tests typically add an element, with the aim of improving effectiveness at low marginal cost. Cost-reducing tests, on the other hand, remove or simplify a program component to reduce costs while aiming to preserve impact rather than improve it. Cost-reducing tests remain rare in the social sector, yet we have found they have a particularly high hit rate in identifying efficiency gains.

This typology addresses both sides of the scaling equation—effectiveness and cost—simultaneously: programs can become both cheaper and more effective when tested iteratively.

Over time, A/B testing evolves from single operational tweaks to continuous testing, unlocking cumulative program transformation and ongoing optimization. With a portfolio approach, an organization also places less pressure on individual tests, accepting that most individual tests will show modest or null effects. The cumulative impact of multiple small optimizations often exceeds the impact of any single large intervention.

An important part of moving from simple starting questions to more consequential ones as part of an ongoing learning agenda is gradually refining the questions over time. We have found several characteristics make for the best questions:

  • Feasible to implement: Version B should be easy and affordable to integrate into day-to-day operations.
  • Effectiveness-enhancing or cost-reducing: Tests should fall into one of these two categories.
  • Priority: The changes represented by Version B should be important to decision makers.
  • Implementer-driven: Those closest to the program and the front lines of implementation often have the best ideas for how to improve it.
  • Initial uncertainty: When teams have genuine curiosity about the question, the answer is both more interesting and more immediately informative for decision-making.

Accelerating Rigorous, Rapid, and Regular Decision-Making

A/B testing represents a fundamental shift in how organizations learn and operate. In 2014, Youth Impact’s leadership waited years for RCT results before making a programmatic decision. Today, the same team reviews evidence from multiple tests each school term, adjusting teacher training protocols, refining lessons, and optimizing parent engagement strategies in real time. Program staff no longer view evaluation as something external researchers conduct; they propose tests, interpret findings, and implement changes themselves.

Other organizations working in the social and government sectors are joining the A/B testing movement. In addition to the examples we have given from Save the Children, Meerkat Learning, and Building Tomorrow, IPA has launched a Right-Fit Evidence unit, and IDinsight and the Agency Fund recently developed a tool to help tech-oriented implementers automate their A/B testing processes. We are collaborating with several organizations in the portfolio of the Mulago Foundation, an early champion of the iterative A/B testing movement, as well as with the Jacobs Foundation, Agency Fund, and What Works Hub for Global Education networks, all leading lights in iterative learning, to grow A/B testing practices in the social sector. Several foundations have taken a keen interest in these approaches, including the Gates Foundation, the Marshall Foundation, Echidna Giving, and the Prevail Fund. Iterative learning techniques and tools, coupled with trust-based, numbers-based philanthropy, can be a powerful combination, encouraging cycles of continuous learning and improvement.

For organizations in low- and middle-income countries, where unmet needs vastly exceed available resources, A/B testing offers a path to closing the evidence gap faster. Moreover, programs typically lose effectiveness as they scale, resulting in a “voltage drop.” A/B testing may help counter this pattern. Through continuous experimentation during scale-up, organizations can identify which program elements to add to drive greater impact and which unnecessary costs to remove. Organizations that embrace rapid, rigorous, and regular testing can prevent the voltage drop—and even reverse it—dramatically improving cost-effectiveness over time as they scale.
