(Illustration by iStock/PaperFox)
Impact assessment, whether pre-investment or post-investment, is a critical component of robust impact measurement and management (IMM). As many social and environmental issues worsen, high-quality data and insights are needed more than ever to allocate resources effectively to solutions that address these challenges. However, impact assessment is often resource-intensive and difficult to do well. Artificial intelligence (AI) is an exciting umbrella of technologies with the potential to transform how investors think about IMM (for example, in deep listening).
Our firm, Better Society Capital (BSC), is an impact fund-of-funds with a mandate to build the UK impact investing market. We have spent many years developing our IMM toolkit and processes, and we were recently placed on the Bluemark Leaderboard with top-quartile scores across the eight categories of the Operating Principles for Impact Management (Bluemark is a leading impact management verification company, and the operating principles are a recognized framework outlining impact management best practices for impact investors). We are interested in how using AI alongside existing processes and judgment can bring additional insight, so we ran an experiment using our own portfolio to test the question: Can AI give investors the impact assessment rigor they crave at the speed they need?
What We Did
To test the value of AI, we ran an experiment alongside our annual impact performance assessment, conducted between May and June of this year. As part of this assessment, we use a rating scale (1-4, with definitions for each score) to score all our fund investments and their portfolios (where relevant) across three of Impact Frontiers' five dimensions of impact: who, depth, and contribution (we also consider scale, but that is based on objective numbers rather than our scoring). Our experiment focused on material entities in our venture portfolio (45 startups assessed), where we have a lot of information on the underlying companies and could more easily assess the quality of AI outputs.
First, we asked the individuals responsible for each investment, our “deal leads,” to score the investments they manage as usual. Three scores per startup (one for each dimension of impact: who, depth, and contribution) across the 45 startups generated 135 scores. Then we used AI (Perplexity Deep Research) to do the same, giving it a large amount of contextual information and very specific instructions (the prompts were over 200 lines long). The AI used only publicly available information on the companies.
The output was then reviewed by our Venture Impact Lead, who oversees all impact processes in the venture team. Any score where the deal lead and AI assessments weren’t an exact match was considered further, excluding cases where the AI had made obvious errors. This applied to 72 scores (53 percent of the 135 total). Our Venture Impact Lead and deal leads then looked at all the available information and decided whether an update to the score was needed. This generated a final set of scores based on all the information we had, against which we could compare the deal leads’ and the AI’s original scores.
What We Found and Why
The AI-driven scores matched the final set in 81 percent of cases; our deal leads’ scores matched it 64 percent of the time. This suggests AI can add value in assessing a portfolio for impact, although we would want to do more rigorous testing before making any relative claims. Teams spent anywhere from 15 minutes to a couple of hours on each assessment, depending on how much had changed since the last one. When optimized, the AI-driven process can be run in a fraction of that time (we estimate no more than 15-20 minutes per assessment), generating significant efficiency gains.
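To make the comparison arithmetic concrete, here is a minimal sketch in Python. The scores below are invented for illustration; our actual 135-score dataset is not reproduced here.

```python
def agreement_rate(scores, final):
    """Share of scores that exactly match the final reviewed set."""
    matches = sum(1 for s, f in zip(scores, final) if s == f)
    return matches / len(final)

# Toy 1-4 scores for two startups across who, depth, and contribution
# (hypothetical values, not our real data).
final_scores = [3, 2, 4, 1, 3, 2]
deal_lead    = [3, 2, 3, 1, 2, 2]  # 4 of 6 match the final set
ai_assessed  = [3, 2, 4, 1, 3, 1]  # 5 of 6 match the final set

print(round(agreement_rate(deal_lead, final_scores), 2))
print(round(agreement_rate(ai_assessed, final_scores), 2))
```

The same per-score exact-match calculation, applied to all 135 scores, yields the 64 percent and 81 percent figures above.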
We dug into the outputs and “reasoning logic” of Perplexity to better understand the results. Here are some reasons AI can deliver high-quality outputs:
- AI can find and analyze much more information in much more depth. For most investments, it was common for AI to conduct over 50 tasks while considering over 50 sources. It wouldn’t be possible for a human to do this in the time we have available for this exercise.
- AI applies the full process incredibly consistently. Our team, on the other hand, as any team does, had varying degrees of familiarity with the investments and process (for example, in the case of new joiners, deals being handed over to new team members, etc.). This introduced levels of variation in process, which we did not see with AI. There is an additional dimension here around being human and things like mood, biases, and external events, which have all been shown to affect decision-making, but that is for a much longer article!
- AI doesn’t make input errors such as pressing the wrong number on the keyboard.
However, there were also factors that led to poor outputs:
- Even with long and structured prompts, AI outputs included many incorrect facts and statements (known as “hallucinations” in the AI world), despite several lines of instructions not to produce them. In many of these cases, it took two to three further prompts for the AI to acknowledge it had made a mistake. This makes AI output particularly hard to check unless you already suspect an error. We didn’t track this, but would estimate that somewhere between 10-20 percent of assessments included hallucinations.
- Where there is less publicly available data, for example, in the case of an early-stage company, AI makes more mistakes.
There are other pros and cons to using AI beyond efficiency and quality of outputs. One benefit is around retaining the explanation of scores. Due to time constraints, our team is asked to include a short rationale for the score they provide, typically one to three sentences. This often doesn’t mean much to anyone who isn’t already familiar with the investment. AI can provide a more comprehensive output, with references, that can be used in several ways including handing deals over to new deal leads, discussions with investees, and broader communications.
A second benefit was the value of AI in iterating our thinking. AI tools like Perplexity also give you their logic, which enables you to understand how your rules are applied by a machine. This can show where there are gaps in your thinking and improve frameworks and definitions. For example, at BSC, we are particularly focused on who is experiencing the intended impact. Typically, in venture investments, we think about this through an affordability lens: can a person in need afford the product or service? However, the AI outputs often gave scores based on indirect effects on our target end user/beneficiary. One example would be preventative health solutions whereby mass-market solutions can reduce burden on our national health system, freeing up capacity for those who are otherwise unable to access private health care. This has reopened past conversations about these routes to change, but now with additional examples and data to inform our thinking.
Finally, a potential negative consequence: we want AI to drive higher rigor in impact assessment, but there is a possibility that the team trusts AI over their own judgment (AI can be very compelling, even when wrong), which could reduce overall quality. We are thinking more broadly about how to introduce checks and balances to mitigate this risk as we embed AI into our workflows.
Lessons for Others Experimenting With Impact Assessment (or Similar) Prompts
- Treat the prompt like a product—build a prototype, test, assess, iterate. Our (somewhat arbitrary) threshold was that each element of a prompt had to deliver consistently high-quality outputs on five cases before we could consider it part of our standard prompt. For example, the first companies we tested the prompt with were all business-to-consumer (B2C), such as a consumer mental health app, and after a few iterations, the prompt returned relevant and useful information, broken down by the Impact Frontiers dimensions of impact. When we then tested it with a B2B2C business (for example, a startup that uses digital technology to improve health care system efficiency), Perplexity assumed the employees were the key beneficiaries and applied the impact dimension analysis to them, measuring outcomes like time saved, reduction in workload, and therefore stress, etc. While this was useful, it’s not the full picture, and we also care about what health care-related outcomes the end users experience, such as shorter waiting times, faster diagnoses, and longer sessions with doctors.
Once we spotted this, we included a section in the prompt that “taught” the AI to think through the entire theory of change of an organization, all the way to end users and ultimate outcomes. Through this process of iteration, we went from a 10-line prompt to over 200 lines, across dozens of iterations and a lot of development time.
- Have very clear definitions for what key terms mean. AI tools like Perplexity apply your exact definitions. If you have a loosely defined version of “impact” or “affordability,” it will use this, and you might end up with results that go against your implicit understanding. For example, a personalized nutrition startup in our portfolio costs about £10 per month ($14). While their randomized controlled trials show significant health impact, we have traditionally believed this to be too expensive to reach the bottom deciles of income (based on analysis of disposable income and typical spend on health products in these groups). However, Perplexity was able to find many related offerings, for example, a national chain of social enterprise gyms in the UK, that were priced at higher rates yet still considered affordable. Therefore, it also classified the personalized nutrition company as affordable.
What this highlighted was a difference between absolute affordability (£10 is not affordable at X income) and relative affordability (product X is Y percent cheaper than mainstream equivalents, therefore we consider it to be affordable). As we hadn’t explicitly considered this in how we define affordability, Perplexity sometimes applied an absolute definition and sometimes a relative one. While that was initially quite confusing in the output, it was ultimately helpful, as it triggered a deep dive into what we mean by affordability and how we define it.
- Human in the loop. Human insight was essential in generating high-quality outputs: first in designing and iterating the prompt, then in validating final outputs, and finally in assimilating the AI outputs and our proprietary data. IMM and thematic expertise are crucial across the process, particularly in building the initial prompts. Humans remain core, but with more time freed up, they can focus on other higher-value tasks or dive into deeper levels of assessment using AI.
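The absolute-versus-relative affordability distinction from the lessons above can be sketched in a few lines of Python. All thresholds and prices here are invented for illustration; they are not BSC's actual definitions or data.

```python
def affordability(price, low_income_budget, mainstream_price,
                  relative_discount=0.25):
    """Evaluate a price under two readings of 'affordable'.

    Absolute: the price fits within what a low-income household
    typically spends on comparable products (hypothetical budget).
    Relative: the price undercuts the mainstream equivalent by at
    least `relative_discount` (hypothetical threshold).
    """
    return {
        "absolute": price <= low_income_budget,
        "relative": price <= mainstream_price * (1 - relative_discount),
    }

# The two definitions can disagree: a £10/month product may exceed a
# £6/month health budget (not absolutely affordable) while undercutting
# a £20/month mainstream offering (relatively affordable).
verdict = affordability(price=10, low_income_budget=6, mainstream_price=20)
print(verdict)
```

A tool applying one definition in some assessments and the other definition in others will produce inconsistent scores, which is why pinning down the intended reading in the prompt mattered.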
Where We Go From Here
We are actively thinking about how to adapt our usual process to better leverage this tool. We want to reduce load on our deal leads while at the same time helping them engage in a more sophisticated and rigorous assessment process. For example, could a centralized function run the AI analysis first and then pass this to our deal leads and Venture Impact Lead for review? This would dramatically reduce the amount of admin and fact-finding that the team currently engages in.
Could we build on this by adding in other sources of public data that tools like Perplexity don’t (currently) search well? These include LinkedIn, Trustpilot or Google reviews, Reddit comments, etc. What about other proprietary data sources that we currently can access but require a lot of manual labour to integrate into our impact analyses (for example, Crunchbase data)? We are now experimenting with this.
The ultimate goal would be to combine all this information and make it queryable by a deal lead who is looking for specific information. This means shifting team members away from being data collectors, aggregators, and analysts toward being sophisticated consumers of data, using their experience to ask the right questions and draw conclusions. Many might think that AI will replace humans; if anything, we think it could enhance their role and importance in the context of investing for positive impact.
As we begin to raise the bar on what good looks like, there is another potential benefit: impact measurement can sometimes feel extractive and not particularly value-add for several actors along the impact chain. AI could make the cost of producing impact insights and data very low. If this cost sits on the investor side, it could increase the value that investors offer to social innovators alongside their capital, freeing up more time for entrepreneurs and social innovators to focus on what matters most to them.
Conclusion
AI technologies are developing at a rapid pace, offering the prospect of rethinking what IMM looks like. Our experience has shown that with the right inputs, AI can deliver meaningful and useful outputs. However, this is far from straightforward or easy. We’re sharing our experience here in the hope that it proves useful for others navigating this space. As mentioned, we have a broader “market building” mandate. If you are interested in learning more about our work or collaborating with us, we would love to hear from you.
Read more stories by Nicholas Andreou, Philipp Essl & Jeremy Rogers.
