Learning with Big Data: The Future of Education

Viktor Mayer-Schönberger & Kenneth Cukier

60 pages, Houghton Mifflin Harcourt, 2014

Luis von Ahn looks like your typical American college student, and acts like one too. He likes to play video games. He speeds around in a blue sports car. And like a modern-day Tom Sawyer, he likes to get others to do his work for him. But looks are deceiving. In fact, von Ahn is one of the world’s most distinguished computer science professors. And he’s put about a billion people to work.

A decade ago, as a 22-year-old grad student, von Ahn helped create something called CAPTCHAs: squiggly text that people have to type into websites in order to sign up for things like free email. Doing so proves that they are humans and not spambots. An upgraded version (called reCAPTCHA) that von Ahn sold to Google had people type distorted text that wasn’t just invented for the purpose, but came from words in Google’s book-scanning project that a computer couldn’t decipher. It was a beautiful way to serve two goals with a single piece of data: register for things online, and decipher words at the same time. Since then, von Ahn, a professor at Carnegie Mellon University, has looked for other “twofers,” ways to get people to supply bits of data that can serve two purposes. He devised one in a startup that he launched in 2012 called Duolingo. The site and smartphone app help people learn foreign languages, something he can empathize with, having learned English as a young child in Guatemala. But the instruction happens in a very clever way.

The company has people translate texts a few phrases at a time, or evaluate and fix other people’s translations. Instead of presenting invented phrases, as is typical for translation software, Duolingo presents real sentences from documents that need translation, for which the company gets paid. After enough students have independently translated or verified a particular phrase, the system accepts it and compiles all the discrete sentences into a complete document. Among its customers are media companies such as CNN and BuzzFeed, which use it to translate their content for foreign markets. Like reCAPTCHA, Duolingo is a delightful “twin-win”: students get free foreign language instruction while producing something of economic value in return.

But there is a third benefit: all the “data exhaust” that Duolingo collects as a byproduct of people interacting with the site. This is information like how long it takes someone to become proficient in a certain aspect of a language, how much practice is optimal, the consequences of missing a few days, and so on. All this data, von Ahn realized, could be processed in a way that let him see how people learn best. That is hard to do in a nondigital setting. But considering that in 2013 Duolingo had around one million visitors a day, who spent more than 30 minutes each on the site, he had a huge population to study.

The most important insight von Ahn has uncovered is that the very question “how people learn best” is wrong. It’s not about how “people” learn best—but which people, specifically. There has been little empirical work on what is the best way to teach a foreign language, he explains. There are lots of theories, positing that, say, one should teach adjectives before adverbs. But there is little hard data. And even when data exists, von Ahn notes, it’s usually at such a small scale—a study of a few hundred students, for example—that using it to reach a generalizable finding is shaky at best. Why not base a conclusion on tens of millions of students over many years? With Duolingo, this is now becoming possible.

Crunching Duolingo’s data, von Ahn spotted a significant finding. The best way to teach a language differs, depending on the students’ native tongue and the language they’re trying to acquire. In the case of Spanish speakers learning English, it’s common to teach pronouns early on: words like “he,” “she,” and “it.” But he found that the term “it” tends to confuse and create anxiety for Spanish speakers, since the word doesn’t easily translate into their language. So von Ahn ran a few tests. Teaching “he” and “she” but delaying the introduction of “it” until a few weeks later dramatically increased the number of people who stuck with learning English rather than dropping out.

Some of his findings are counterintuitive: women do better at sports terms; men lead them in cooking- and food-related words. In Italy, women as a group learn English better than men. And more such insights are popping up all the time.

The story of Duolingo underscores one of the most promising ways that big data is reshaping education. It is a lens into three core qualities that will improve learning: feedback, individualization, and probabilistic predictions.

Excerpted from Learning with Big Data: The Future of Education by Viktor Mayer-Schönberger and Kenneth Cukier (Houghton Mifflin Harcourt, 2014)