Should an event coder be more like a baby?

Last evening, as is my wont, I was reading the current issue of Science [1]—nothing like a long article on, say, the latest findings on mantle convection beneath the Hawai’i hotspot to lull one to sleep—when an article titled “Basic Instincts: Some say AI needs to learn like a child” jolted me into one of those “OMG, this is exactly the issue I’ve been dealing with!” experiences.

That issue: whether there is any future to dictionary-based political event coders. Of late—welcome to my life, such as it is—I’ve been wrestling with whether to invest my time:

  • Writing a new coder based on universal dependency parsing and my mudflat proof-of-concept: seems like low-hanging fruit
  • Adapting an existing universal dependency coder (seems increasingly unlikely for an assortment of reasons)
  • Or just tossing the whole project since everybody—particularly every U.S. government funder—knows that dictionary-based coders are oh-so-1990s and from this point on everything will be done with machine learning (ML) classifiers

This article may tilt the scale back to the first option. At least for me.

The “baby” reference here and in the article comes from the almost irrefutable evidence that humans are born hard-wired to efficiently learn various skills, and probably the most complex of these is language. A normally developing human child picks up language at a phenomenal rate, typically through sound, though sign language is learned with equal facility. Outside the United States, that usually means multiple languages, which the child keeps distinct. Ask any three-year-old. And try to shut them up. Provide a chimpanzee with exactly the same stimuli (yes, the experiment has been tried, on multiple occasions) and it never achieves abilities remotely similar to those of humans.

However, there’s an attraction to ML classifiers in being, well, rather mindless. [2] But this comes with the [huge] problem of requiring an extraordinary number of labeled training cases, which we simply don’t have for event data, nor does anyone seem inclined to generate them, because that process is expensive and involves the recruitment, management, and, critically, successful retention of a large number of well-trained human coders. [3] Consequently, event data coding is in a totally different situation from ML problems where vast numbers of labeled cases are available, typically from the web at little expense.

It’s completely possible, of course, to generate essentially unlimited labeled event data cases from the existing coding systems, and it is certainly conceivable that the magic of neural networks (or your classifier of choice) will produce some wonderful generalization that cannot be obtained from those coders. Or, more likely, will produce one interesting generalization that we will then see repeated endlessly, much like the man-woman-king-queen example for word embeddings. But another possibility is that the classifiers will just sloppily approximate what the dictionary-based systems are already doing.
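To make the "generate labels from an existing coder, then train on them" idea concrete, here is a deliberately tiny sketch. Everything in it is an assumption for illustration: the `PATTERNS` table stands in for a real coder's dictionaries, the code labels are made up rather than taken from CAMEO, and the "classifier" is nothing more than word/label co-occurrence counts, not anything one would actually deploy.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for an existing dictionary-based coder's verb patterns
PATTERNS = {"met with": "MEET", "arrested": "ARREST", "attacked": "ATTACK"}


def dictionary_code(sentence):
    """Return the code of the first pattern found in the sentence, or None."""
    text = sentence.lower()
    for phrase, code in PATTERNS.items():
        if phrase in text:
            return code
    return None


def label_corpus(sentences):
    """Use the dictionary coder to produce (sentence, label) training pairs."""
    pairs = []
    for sentence in sentences:
        code = dictionary_code(sentence)
        if code is not None:
            pairs.append((sentence, code))
    return pairs


def train_word_counts(pairs):
    """Count how often each word co-occurs with each label."""
    counts = defaultdict(Counter)
    for sentence, label in pairs:
        for word in sentence.lower().split():
            counts[word][label] += 1
    return counts


def classify(counts, sentence):
    """Vote using the co-occurrence counts; None if no known word appears."""
    votes = Counter()
    for word in sentence.lower().split():
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


corpus = [
    "Police arrested the opposition leader",
    "The president met with the envoy",
    "Rebels attacked the convoy",
    "Markets were quiet on Tuesday",
]
pairs = label_corpus(corpus)        # the uncoded fourth sentence is dropped
model = train_word_counts(pairs)

print(classify(model, "Soldiers arrested a journalist"))
print(classify(model, "Troops detained a journalist"))
```

The second query uses a verb the dictionary never labeled, so the toy model has nothing to go on: a classifier trained only on a coder's output starts out knowing, at best, what that coder already knew.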

And doing reasonably well, because dictionary-based automated event coding has been around for more than a quarter century, and now benefits from a wide range of ongoing developments throughout the field of computational natural language processing. As a consequence, those programs start with a lot of “instinct.” Consider just how much comes embedded in a contemporary system:

  • The language model of the parser, which is the result of thousands of hours of experimentation across multiple major NLP research projects across decades
  • In some systems, notably VRA-Reader, PETRARCH-2 and Raytheon/BBN’s ACCENT/Serif, an explicit language model for political events
  • Models of language subcomponents such as dates, locations, and named entities
  • Two decades of human-coded dictionary development from the KEDS and TABARI projects [4]
  • The WordNet synonym sets, again the product of thousands of hours of effort, which have been incorporated into those dictionaries
  • A variety of very large data sets such as rulers.org, CIA World Leaders and Wikipedia for named-entity resolution
  • Extensive idiomatic human translation by native speakers of the Spanish and Arabic dictionaries currently being produced by the NSF RIDIR event data project

Okay, people, I know that your neural networks are cool—like they are really, really cool, fabulously cool, in fact you can’t even begin to tell me how cool they are, even if a four-variable logit model matches their performance out-of-sample—but frankly, I’ve just presented you with a rather extensive list of things that the dictionary-based coders are already starting with but which the ML programs have to learn on their own. [5] 

So in practical terms, for example, the VRA-Reader coder from the 1990s (now lost, alas, because it was proprietary…sad…) provided 128 templates for the possible structure of a sentence describing a political event. JABARI in the early 2010s (now lost, alas, because it was proprietary, and was successfully targeted by a duplicitous competitor…sad…) gained an additional 15% accuracy over TABARI using a set of very specific tweaks dealing with idiosyncratic characteristics of political events (e.g. the fact that the Red Cross rarely if ever engages in armed attacks). A dictionary-based system knows from the beginning that if A meets with B, B meets with A, but if A arrests B, B didn’t arrest A. More generally, the failure, in numerous attempts across decades, of generic event “triple” coding systems to compete in this space is almost certainly due to the fact that domain-specific information provides a very significant boost to performance.

Furthermore, the environment in which we are deploying dictionary-based coding programs is becoming increasingly friendly: In the 1990s KEDS and VRA-Reader only had the texts and small dictionaries to work with, and had to do this on very limited hardware. Contemporary systems, in contrast, have access to sophisticated parsers and huge dictionaries with hardware easily able to accommodate both. Continuing the childhood metaphor, this is the difference between riding a tricycle and riding a 20-speed bicycle. With an electric assist.

I don’t expect this simple metaphor to be the last word on the subject, and I may, in the end, decide that classifiers are going to rule us all. (In any case, that seems to be pretty much where all of the funding is going at the moment anyway, but if that’s the case, please, can’t someone, somewhere fund an open set of gold standard records??) But I’m also beginning to think that dictionary-based approaches, or more probably a hybrid of dictionary and classifier approaches, are more than an anachronistic “damn those neural nets: young whippersnappers don’t appreciate what it was like hacking into Nexis from the law school library account via an acoustical modem for weeks every morning from 2 a.m. to 5 a.m. [6]…get off my lawn”: given the remarkable resources we can now deploy on the problem, dictionary-based coding represents a hugely more efficient approach than learning by example.

Time (and experimentation) will tell.

Footnotes

1. Okay, so it was actually last week’s issue: I wait for the paper version to arrive and hope it doesn’t get too soaked in the mailbox. The Economist I read electronically as soon as it is available.

2. The article quotes in passing Oregon State CS professor Thomas Dietterich that “[academic] computer scientists…have an aversion to debugging complex code.” Yeah, tell me about it…followed closely by their aversion to following quality control practices that have been common in industry since the 1990s. I digress.

3. The relatively new Prodigy software is certainly a far more efficient approach to doing this than many earlier alternatives (I’ve also written a simple low-footprint variant of its annotation functions here), but human annotation remains vastly more labor intensive than, say, downloading millions of labeled images of cats and dogs.

4. Which, I’ve got pretty good empirical evidence, still provide most of the verb patterns for all of the CAMEO-based coding systems…figuring out the verb patterns used to generate any data where you know both the codings and the URL of the source text is relatively straightforward, at least for the frequent patterns.

5. The other fabulously cool recent application of deep learning, the ability to play Go at levels beyond that of the best human expert, depended on a closed environment with fixed rules: event data coding is nothing like this.

6. Not a joke: this is the KEDS project ca. 1990.

This entry was posted in Methodology, Programming.
