
Organizations of every size collect written texts, from documents and notes to surveys and forms. However, these rich sources of information can be overwhelming. For example, a community-oriented nonprofit may have gathered research from websites, financial reports, or news coverage, or conducted and transcribed hours of interviews with school administrators, community leaders, and local artists—but all that intelligence may be buried in piles of text data when it’s time to focus its strategy. Or when an international NGO’s field workers write their observations in a case management system, managers might notice that many of their teams seem to be encountering similar challenges. But how can they tell whether these challenges are coincidental or whether there are deeper, structural patterns?

Machine learning and “natural language processing” (NLP) can not only uncover previously invisible patterns in these kinds of datasets, but also automate certain tasks, freeing up people to do the higher-value, more creative work that machines can’t. And while these techniques may seem to require a high level of technical expertise and expensive infrastructure, many applications are broadly accessible and require no special hardware or expensive licenses. Beyond the tech talk, NLP simply applies algorithms to human language so we can analyze and process amounts of text that would otherwise be impossible for humans to handle efficiently. Nuance is always a struggle—ambiguity, metaphor, sarcasm—but the state of the art in NLP can be usefully applied: chatbots can answer questions from knowledge bases, apps can translate intelligibly between languages, algorithms can generate news stories from financial reports, spam filters can classify unwanted messages, and software can take one document as a point of reference and search for similar documents.

At DataKind, we have seen how relatively simple techniques can empower an organization. For example, working with Conservation International, volunteer data scientists from DataKind launched Colandr, a computer-assisted evidence synthesis tool that surfaces evidence relevant to a user-identified topic and helps researchers, practitioners, and policy makers find resources for evidence-based decisions. To date, Colandr has been adopted by over 200 other organizations—one example among many of how mission-driven organizations can collaboratively design innovative and impactful solutions to tough social challenges.

Here we present six mature, accessible NLP techniques, along with potential use cases and limitations, and access to online demos of each (including project data and sample code for those with a technical background). For our examples, we use a dataset of 28,000 bills from the past 10 years signed into law in five US states (California, New York, South Dakota, New Hampshire, and Pennsylvania). These are the kinds of texts that might interest an advocacy organization or think tank: publicly available (with some effort), but large and varied enough to challenge a human analyst.


1. “What Are People Talking About?”: Pre-Processing and Term Frequencies

You may have encountered word clouds, which visually represent the top words used in a text and give a quick snapshot of its contents. But simply listing the most frequent words can cause you to miss common nuances of written language. The first step in any NLP task is therefore pre-processing, which, depending on the text, can include a combination of steps like the following (a minimal code sketch appears after the list):

  • Removing common and uninformative words, known as stop words, such as “a,” “an,” “the” etc. (For specialized texts, a custom list of stop words may be used.)
  • Converting everything to lowercase—e.g., “HOUSES” to “houses.”
  • Standardizing words so that we treat “meal/meals” or “is/am” as the same. (One method is lemmatization, which replaces words with their root form—e.g., “studies” and “studying” are replaced with “study.”)
  • Replacing certain types of terms with a generic token—e.g., any number with “*NUMBER*”, any date with “*DATE*”.
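
To make these steps concrete, here is a minimal sketch using the spaCy library and its small English model (en_core_web_sm). The example sentence is invented for illustration, and the exact output depends on the model version.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    """Lowercase, lemmatize, drop stop words and punctuation, and genericize dates and numbers."""
    doc = nlp(text)
    tokens = []
    for token in doc:
        if token.is_stop or token.is_punct or token.is_space:
            continue                          # drop stop words, punctuation, whitespace
        if token.ent_type_ == "DATE":
            tokens.append("*DATE*")           # replace any date with a generic token
        elif token.like_num:
            tokens.append("*NUMBER*")         # replace any number with a generic token
        else:
            tokens.append(token.lemma_.lower())  # lemmatize and lowercase
    return tokens

print(preprocess("The committee reviewed 3 reports on June 5, 2019."))
# Roughly: ['committee', 'review', '*NUMBER*', 'report', '*DATE*', '*DATE*', '*DATE*']
```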

Once you have prepared the text, you can capture more nuance by weighting each term’s frequency score down the more texts/documents it appears in, i.e., by the term’s inverse document frequency, to get its “tf-idf” score: if a term occurs frequently in a particular text but in only a few of the texts in the collection, it’s probably more important to that specific text. Applying this technique to the bill text data, for example, terms like “include,” “bill,” and “follow” occur across many of the bills, but the highest tf-idf scores go to less frequent, more meaningful terms like “police,” “animal,” or “poverty.”
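
Building on the pre-processing above, here is a minimal sketch of tf-idf scoring using scikit-learn’s TfidfVectorizer. The three snippets below are made-up stand-ins, not bills from the actual dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Made-up stand-ins for pre-processed bill texts.
bills = [
    "this bill includes provisions for police training and police oversight",
    "this bill includes provisions for animal shelters and animal welfare",
    "this bill includes measures addressing poverty and affordable housing",
]

# Compute tf-idf scores; common English stop words are dropped automatically.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(bills)

# Show the three highest-scoring terms in each "bill".
terms = vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf.toarray()):
    top = np.argsort(row)[::-1][:3]
    print(f"Bill {i}:", [terms[j] for j in top])
# Terms that appear in every bill ("bill", "includes") are weighted down,
# while rarer terms such as "police", "animal", or "poverty" score highest.
```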

Tf-idf scoring can be useful when first exploring a collection of texts. For example, the top scoring terms from year to year might hint at qualitative trends in legislative interest. Applied to case notes, the technique might surface hotspot issues that caseworkers are recording. From survey responses, you may learn that your volunteers often describe the experience as “rewarding,” but there’s a cohort that also calls out “stress.” However, these would only be preliminary insights; tf-idf scores are more frequently used as inputs for other algorithms, since they offer information on which terms might be worth more weight. Tf-idf is also less useful for collections of short texts (e.g., tweets), in which it’s unlikely that a particular word will appear more than once or twice in any given text.

2. “Give Me Just the Nouns.”: Part-of-Speech Tagging

While tf-idf treats every word the same as every other, we can find new information by tagging words by their parts of speech (POS). Pre-trained models can automatically classify words based on grammatical and semantic context, and can even differentiate between ambiguous POS categories, as in the sentence “I can [verb] kick the can [noun].” (Try out this demo from spaCy, a popular NLP library.) Of course, POS tagging is hard for humans and doubly so for algorithms. Consider: “The hurried ask, the prompt reply.” Are “hurried” and “prompt” nouns or adjectives? Are “ask” and “reply” verbs or nouns? As humans, we can accept both readings, but a POS algorithm can only output a single assignment. POS taggers are therefore trained on large collections of text in which every word is labeled with its part of speech. The more sophisticated the model and the more training examples in the dataset, the better the model will account for the variations of natural language. Ideally, we would fine-tune a POS tagging model on data similar to the data of interest—bill text, in our case—in order to optimize for that particular style.

In our example, we can look for nouns modified by adjectives, as in the phrase: “Mental health and behavioral problems in New Hampshire children and students, as studied in June.” Here there are two instances of the pattern, namely “mental health” and “behavioral problems.” Such phrases are much more illustrative of the content of the source bill than the individual words we were counting above, and this extra layer of filtering could improve the tf-idf searches from before by answering, for example, which salient adjectives people use in their survey answers.
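
Here is a minimal sketch of that pattern with spaCy: tag the sample phrase and pull out nouns whose left-hand children include an adjective. The exact matches depend on the pre-trained model.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mental health and behavioral problems in New Hampshire "
          "children and students, as studied in June.")

# Collect nouns together with any adjectives that modify them directly.
for token in doc:
    if token.pos_ == "NOUN":
        adjectives = [child.text for child in token.lefts if child.pos_ == "ADJ"]
        if adjectives:
            print(" ".join(adjectives + [token.text]))
# Expected to surface "Mental health" and "behavioral problems",
# though exact results depend on the model version.
```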

3. “Who Did What to Whom?”: Named Entity Recognition

Named Entity Recognition (NER) is a technique that identifies and segments named entities—like an organization, a geographic entity, or a person—and classifies them into predefined categories. Popular NLP libraries allow NER to be performed in a few easy steps using strong pre-trained models. One such library, spaCy, has a long list of entity types that it recognizes, including times, named laws, and even works of art. SpaCy recognizes two named entities in our sample phrase from the bill text above: “Mental health and behavioral problems in New Hampshire children and students, as studied in June.” It recognizes New Hampshire as a geopolitical entity and June as a date. We can imagine automatically cataloging all of the geopolitical entities or companies referenced in these bills. SpaCy also recognizes money values, so with some clever filtering, it would also be possible to scan for budget allocations or minimum wages.
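
A minimal sketch of NER with spaCy’s small English pre-trained model, applied to the sample phrase above; exact results depend on the model version.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mental health and behavioral problems in New Hampshire "
          "children and students, as studied in June.")

# Print each named entity spaCy finds, along with its predicted label.
for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected: "New Hampshire GPE" and "June DATE",
# though results can vary with the model version.
```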

NER can also be handy for parsing referenced people, nationalities, and companies as metadata from news articles or legal documents. Such a database would permit more sophisticated searches, filtering for events, people, and other proper nouns across the full text of a knowledge base to find references that need a link or a definition. Combined with POS tagging and other filters, such searches could be quite specific.

As with POS tagging, NER can encounter ambiguous cases, misclassify, or fail to recognize an entity altogether. If stock NER models don’t perform well enough on your data, you can train your own specific model: a political news article NER model might be trained to identify politicians, newspapers, and dates, while a model for bills might identify agencies, section references, and named committees.

4. “I Think I Really Have Five Categories of Text Here.”: Topic Modeling

Another approach to discovering patterns in the content of a collection of texts is a class of techniques called topic modeling, the most popular of which is Latent Dirichlet Allocation (LDA). LDA generates a requested number of “topics” (5-20 is typical). Applying it to bills containing the word “education,” for example, we trained three topic models with 3, 5, and 10 topics, respectively. The 3-topic model generates topics that pertain roughly to School Governance, Campus Construction, and Taxes, based on the most relevant terms for each. (In our demo, you can explore the topics in all three models and decide for yourself which is the most useful or which discriminates topics the best.) We thus cut through the noise and quickly identified the main topics of our documents. Combined with domain expertise, we can apply this technique to other texts to answer questions such as: What’s the range of ideas in a chosen set of documents? Do they share similarities and/or differences compared with other texts? How do these topics allow us to organize these documents?
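
Here is a minimal sketch of LDA using scikit-learn. The documents are made-up stand-ins, and with such a tiny corpus the resulting topics won’t be as clean as those from the full bill dataset.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for education-related bill snippets.
docs = [
    "school board governance charter district election policy",
    "campus construction bond school facility building renovation",
    "property tax levy school funding revenue assessment",
    "school board superintendent district governance policy",
    "bond measure construction classroom facility renovation",
    "tax credit education funding revenue levy assessment",
]

# LDA works on raw term counts rather than tf-idf weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Request three topics; picking the "right" number usually takes experimentation.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(counts)

# Print the most heavily weighted terms per topic; a human still has to name them.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"Topic {i}:", [terms[j] for j in top])
```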

What if we create “texts” that encode a set of user actions? Imagine combining the titles and descriptions of all of the articles a user has read or all the resources they have downloaded into a single, strange document. The “topics” generated by LDA may then reflect categories of user interests. These can form the basis of interest-based user personas to help focus your product, fundraising, or strategic decision-making. A technique for understanding documents becomes a technique for understanding people.

Using LDA to provide useful insights can be challenging, since LDA topics aren’t assigned names, but are rather defined by the words they tend to produce (for example, a researcher would see a list of salient words for “Topic 1” and decide if there’s a coherent label that can be assigned). Compounding this difficulty, while the model will return the number of topics requested, the right number of topics is seldom obvious. There are some available metrics that can help, but choosing the best number (to minimize overlap but maximize coherence within each topic) is often a subjective matter of trial and error.

5. “What’s the TL;DR Version of This Text?”: Automatic Text Summarization

Automatic text summarization distills the most important information from a written source, producing an abridged version that preserves the overall meaning. Text summarization algorithms can be categorized into two types: extractive and abstractive. Extractive summarization uses a heuristic to select a requested number of the most representative sentences from the original text to form a summary (such as scoring sentences by how many of the top tf-idf-scoring keywords are included). Abstractive summarization, on the other hand, attempts to develop an understanding of a text’s main concepts, retrieving information and expressing its concepts in a human-understandable way by paraphrasing and shortening the source text.
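
To make the extractive heuristic concrete, here is a minimal sketch that scores each sentence by the sum of its tf-idf term weights and keeps the top few. This is a simplified heuristic, not the LexRank algorithm discussed below, and the example text is a paraphrased stand-in rather than the actual bill.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text, n_sentences=2):
    """Keep the n sentences with the highest summed tf-idf term weights."""
    # Naive sentence splitting; a real pipeline would use a proper sentence tokenizer.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) <= n_sentences:
        return text
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1                         # one score per sentence
    keep = sorted(scores.argsort()[::-1][:n_sentences])   # keep original order
    return ". ".join(sentences[i] for i in keep) + "."

# Paraphrased stand-in for a bill's text, purely for illustration.
example = ("The department shall develop an affidavit attesting to an applicant's "
           "status as a homeless person or homeless child. "
           "A fee of twenty-six dollars shall be paid to the department upon application. "
           "This act shall take effect immediately. "
           "Nothing in this section limits existing fee waivers.")
print(extractive_summary(example))
```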

LexRank, a popular extractive text summarization algorithm, produces the following two-sentence summary of the bill “AB-1733 Public records: fee waiver”:

(b) The State Department of Public Health shall develop an affidavit attesting to an applicant’s status as a homeless person or homeless child or youth.

(a) Except as otherwise provided in subdivisions (b), (c), and (d) of this section, subdivision (c) of Section 13002, and subdivision (c) of Section 14900, upon an application for an identification card a fee of twenty dollars ($20), and on and after January 1, 2010, a fee of twenty-six dollars ($26), shall be paid to the department.

Compare this text to the official summary:

An act to add Section 103577 to the Health and Safety Code, and to amend Section 14902 of the Vehicle Code, relating to public records.

The extraction reads awkwardly, since the algorithm doesn’t consider the flow between the extracted sentences, but the bill’s special emphasis on the homeless isn’t evident in the official summary.

This technique can accelerate the consumption of any collection of texts of moderate length. One organization may want summaries of a news stream, while another may want a synopsis of journal or conference abstracts. The technique could also be used to generate representative pull quotes—for example, highlighting research ideas from a call for proposals or scanning a decade’s worth of impact assessment surveys. Extractive summarization isn’t how humans write summaries, but it’s very easy to start with on any text. However, if the results aren’t proving useful on your dataset and you have abundant data and sufficient resources to test newer, experimental approaches, you may wish to try an abstractive algorithm.

6. “We’ve Seen Text Like This Before.”: Classification

Finally, we come to the classic and ubiquitous task that has made machine learning so successful: classification. Classification takes a set of input features and produces an output class, frequently a binary yes/no: whether an email is spam or not; the topic of a news article—e.g., politics, technology, education; or whether a message likely violates community standards. In a social sector context, text classification may be able to predict a user’s propensity to donate given their survey responses, whether a caseworker will likely require extra support on a case, the severity of a help-line text message, etc.

Applying this to the bill text, we demonstrate a classifier trained on Rhode Island bills labeled with a health-related topic and use the model to identify health-related bills in New York, which aren’t labeled. The model achieves high accuracy on Rhode Island data, although it fails to recognize actual health-related bills more often than we’d like. Applied to New York bills, the model does flag bills that superficially appear to match. However, the unusually high accuracy should tell us that this topic is easily discriminable, not that the technique is easily generalizable. And although the surfaced New York bills match our topic, we don’t know how many of the unsurfaced bills should also have matched. Since the New York data aren’t labeled, we may be missing some of the New York Health & Safety bills.
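
As an illustration of the workflow (not the actual model used in our demo), here is a minimal sketch that trains a tf-idf plus logistic regression classifier on a handful of made-up labeled bill titles and applies it to unlabeled ones. With real data, you would also hold out a labeled test set to estimate accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up stand-ins for labeled bill titles (1 = health-related, 0 = not).
train_texts = [
    "an act relating to hospital licensing and patient safety",
    "an act relating to immunization requirements for school entry",
    "an act relating to public health emergency preparedness",
    "an act relating to highway maintenance funding",
    "an act relating to motor vehicle registration fees",
    "an act relating to property tax assessment appeals",
]
train_labels = [1, 1, 1, 0, 0, 0]

# tf-idf features feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      LogisticRegression())
model.fit(train_texts, train_labels)

# Apply the trained model to unlabeled bills from another state.
new_texts = [
    "an act relating to mental health parity in insurance coverage",
    "an act relating to bridge construction and repair bonds",
]
print(model.predict(new_texts))  # 1 = flagged as health-related; a toy model
                                 # this small is illustrative, not reliable
```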

The performance of the model depends strongly on the quantity of labeled data available for training and the particular algorithm used. There are dozens of classification algorithms to choose from, some more amenable to text data than others, some better able to mix text with other inputs, and some that are specifically designed for text. There are also advanced techniques—including word embeddings (Word2vec from Google, GloVe from Stanford) and language models (BERT, ELMo, ULMFiT, GPT-2)—that can boost performance. These typically provide ready-to-use, downloadable models (pre-trained on large amounts of data) that can be fine-tuned on smaller (relevant) datasets, so you don’t need to train from scratch. Still, selecting and using these effectively takes special skill. Typically, the most straightforward way to improve the performance of a classification model is to give it more data for training.

How to Propose an NLP Project

As with any research project, it’s imperative to think through each step of your plan. In our work here at DataKind, we have observed three key contributing factors to success:

1. A Clear Problem Statement

Good problem statements address the actual problem you want to solve—which, in this case, requires data science capabilities. For example, suppose you want to understand what certain beneficiaries are saying about your organization on social media. A good problem statement would describe the need to understand the data and identify how these insights will have an impact.

Since research is, by nature, curiosity-driven, there’s an inherent risk for any group of researchers to meander down endless tributaries that are of interest to them, but of little use to the organization. A problem statement is vital to help guide data scientists in their efforts to judge what directions might have the greatest impact for the organization as a whole.

2. The Right Data Available—and Enough of It.

The “right” data for a task will vary, depending on the task—but it must capture the patterns or behaviors that you’re seeking to model. For example, state bill text won’t help you decide which states have the most potential donors, no matter how many bills you collect, so it’s not the right data. Finding state-by-state donation data for similar organizations would be far more useful.

If you don’t have the necessary data on hand, then you need to figure out how to acquire it. Aside from open data repositories, data can sometimes be scraped from the web (check the terms of service) or other databases, or purchased from vendors. You may need to use other methods, such as conducting field work, online surveys, or labeling the pre-existing data that you do have. The latter option can be expensive or time-consuming, but new tools such as Prodigy and Snorkel are making it faster, cheaper, and easier. It’s not always obvious what the right data are, or how much data is required to train a particular model to the necessary level of performance. Determining this should be part of initial feasibility studies.

3. The Right People Supporting Your NLP Project.

No project succeeds in isolation; you’ll need support from many parties to ensure lasting impact:

  • Executive support: The most successful NLP project may fall into oblivion if efforts aren’t made to incorporate it into your organization. Maintaining a model and ensuring appropriate use are ongoing efforts that may require training and changes to standard operating procedures. All of this work requires both human and material resources. Executive support is vital for this.
  • Project lead: Projects need a champion—someone who can make a case to leadership for taking risks and articulate how the potential rewards mitigate those risks. A project lead should also maintain momentum behind a potentially long project and keep everyone involved and focused on the goal. Since any NLP project includes abundant failure modes and opportunities for distraction, the project lead ensures that the project has continued support and direction.
  • Data scientist(s): To perform this work, you need someone with NLP knowledge. Even relatively novice data scientists have sufficient skills to develop working prototypes using the techniques above. This person can be a staff member, a contractor, or even a volunteer. (Data sensitivity is, of course, a concern in the latter cases.) If none of your current staff have NLP experience but do have programming skills and an interest in learning, there are several free online courses that teach these techniques in detail.
  • Tech support: The data scientists themselves will need technical support. If the data scientists are expected to take an organization’s data and produce a prototype, the organization needs to provide that data and integrate that prototype into the existing digital infrastructure. Providing either of these is nuanced and requires regular dialogue. Analysts, database administrators, and/or software engineers must partner with the data scientists to keep these needs satisfied. If there’s no technical support, even the most brilliant work will have no future.
  • End user: The project lead will define the problem and communicate this to the data scientist, but ultimately, it’s the end user who is best positioned to evaluate the utility of what the data scientist attempts to do, because they can give the most informed feedback on whether a particular NLP solution will address their actual needs.

Ethics

As with any business decision, the last thing you want is to harm the very people you’re trying to help, or to accomplish your mission at the expense of an already marginalized group. The following are common concerns, but are by no means exhaustive.

Widespread interest in data privacy continues to grow, as more light is shed on the exposure risks entailed in using online services. On the one hand, more granular data can lead to more accurate models. On the other hand, those data can also be exposed, putting the people represented at risk. The potential for harm can be reduced by capturing only the minimum data necessary, accepting lower performance to avoid collecting especially sensitive data, and following good information security practices.

Since data science research advances on high-quality datasets, opening datasets to the public is regarded as a positive contribution to the field—but even after anonymization and aggregation, open-sourcing a dataset carries privacy risks, since a surprisingly small amount of information is often sufficient to re-identify someone out of seemingly anonymized data. For example, in one famous study, MIT researchers found that just four fairly vague data points – the dates and locations of four purchases – are enough to identify 90% of people in a dataset of credit card transactions by 1.1 million users. More alarmingly, consider this demo created by the Computational Privacy Group, which indicates the probability that your demographics would be enough to identify you in a dataset. Aggregated datasets may risk exposing information about individuals belonging to groups that only contain a small number of records—e.g., a zip code with only two participants. Sensitive and identifying information can surface in unexpected ways. It’s best to learn from history to avoid repeating mistakes.

The latent information content of free-form text makes NLP particularly valuable. It also makes it particularly dangerous. Free-form text isn’t easily filtered for sensitive information including self-reported names, addresses, health conditions, political affiliations, relationships, and more. The very style patterns in the text may give clues to the identity of the writer, independent of any other information. These aren’t concerns in datasets like state bill text, which are public records. But for data like health records or transcripts, strong trust and data security must be established with the individuals handling this data.

Another important ethical concern in NLP projects is bias. Style patterns in text risk biasing NLP algorithms in a harmful way, especially if the input data are themselves biased. Even without inputs like ethnicity and gender, algorithms may pick up on regional dialects instead of semantic content, and end up working against people who speak a particular way. One mitigation is to ensure that the training data represent the same population that your model is intended to serve, not whatever data you happen to have. Another is to directly inspect the final model to understand whether it’s exhibiting biases, and why it makes the decisions it does. (Libraries like Shap and LIME can help with interpretability.)

When presented with an algorithm that seems to make fast and reliable decisions, the temptation is to automate entire processes. This is especially risky with NLP algorithms. Language is nuanced, and algorithms are dumb, so even when a model has very high accuracy, there may be systematic failures in particular cases—e.g., with a particular regional dialect. It’s important to perform a thorough error analysis to understand when a model performs well and when it doesn’t. It’s also advisable to deploy algorithms with a full audit trail and to keep humans in the loop of any decision-making, at least until the algorithm is demonstrably safe. (Think augmentation before automation.)

Finally, a subtle ethical concern around bias also arises when defining our variables—that is, how we represent the world as data. These choices are conscious statements about how we model reality, which may perpetuate structural biases in society. For example, recording gender as male or female forces non-binary people into a dyadic norm in which they don’t fit. Recording data in this way denies them their identity. Conversely, we might train a text classifier that classifies people as “kwertic” or not, and statistical fluctuations may support a working model, even if “kwertic” is completely made up and refers to nothing. But the existence of this classifier now legitimizes the concept, perpetuating a fiction. Replace “kwertic” with any category we apply to people, though, and the problem becomes clear.

It’s important to acknowledge and discuss these issues, and to acknowledge that ethics is a practice—checklists and toolkits are useful, but insufficient. We must pay attention to how our data do and don’t represent the world; we must design datasets, models, and processes that equitably serve the interests of users and indirect stakeholders. Otherwise, we’ll remain in a world which is designed to serve a minority of the population and neglect the rest. (For more on data ethics, read DataKind UK’s ethical principles, which links to other useful resources, and the book, Ethics and Data Science, co-written by DataKind advisor Hilary Mason.)

Conclusion

At DataKind, our hope is that more organizations in the social sector can begin to see how basic NLP techniques can address some of their real challenges. These tools aren’t perfect, but they are powerful, and they don’t tire. Begin by assessing what data you have, and initiate conversations. With a well-posed problem statement, the right data, the right people, and careful anticipation of possible unintended consequences, any organization can put NLP to work for real impact, as we work to improve people’s lives in ways both great and small.

Support for this article was provided by the Robert Wood Johnson Foundation. The views expressed here do not necessarily reflect the views of the Foundation.

