Seven Guidelines for Generating Data using Automated Coding [1]

Background:

I’m at yet another workshop on standards and lessons-learned for conflict data collections—I believe the formal title of this one is “So, punk, think you can manage a conflict data set??”—and in the near future may be writing a short journal article on the topic addressing the issues for automated systems, so rather than confine my prepared remarks to a small audience of specialists, I’ll develop it in a blog. This is almost certainly going to get modified over the next few days and weeks, and as the title implies, I will be extending it to another 7-point-post [7] on guidelines for using automated data. So if you are planning to assign this, say in a class, check back. [2]

The List

1. Write [and document] everything on the assumption you will be using—and re-using—it far longer than you expect. Doubly so for coding frameworks. Save everything in ASCII or Unicode formats, not binary.

There is a rule of thumb in computer programming that any code you return to after six months might as well have been written by someone else, so document accordingly. There is a large, if not entirely convergent, literature on what constitutes “adequate” documentation—too much can be as bad as too little, as the details get lost in excessive documentation, and it isn’t read—so aim for the happy medium. Meanwhile, share not just your primary documentation, but also those one- and two-page “cheat sheets” and codebook summaries that your project uses all the time.

Little utility programs that one originally writes as a one-off kludge to solve some pressing problem end up being central to the project, sometimes so central that they are unnoticed until someone tries to duplicate your work.[3] Take time to both document these and occasionally “refactor”—clean up the code without changing what it does—and combine them so that multiple steps become a single step.

The stability of coding frameworks, particularly coding frameworks which are considered a “first approximation,” is under-appreciated: Charles McClelland figured the WEIS framework would be rewritten and improved after five or so years; it was in use virtually unchanged forty years later. We wrote CAMEO for a specific project on mediation and it was adopted by ICEWS as a general purpose coding framework.

I recently learned a lesson the hard way on the importance of saving in text formats: I was contacted by a group in Spain concerning some data on the first Palestinian intifada that had been collected in 1988-1991 by a long disbanded Palestinian NGO and which Deborah Gerner and I had used in a 1995 article. I found the files easily enough—with inexpensive high-density media, it is easy to make multiple backups, which I’d done—but the main file, which had originally been in an obscure MS-DOS database, had been saved in an early version of Excel for the Mac. Fortunately, we’d saved enough extracts in tab-delimited files—no problems at all in reading those—that I think most of the data can be recovered, but I should have had the sense to save the entire file that way. You can pretty safely assume that any binary format will be unreadable after ten years [11]: plan accordingly.
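As a minimal sketch of the "save it as text" habit, the Python below writes a small table as tab-delimited text and reads it back. The field names and records are made-up placeholders, and an in-memory buffer stands in for a file on disk; none of this is the format of the original intifada files.

```python
import csv
import io


def write_tab_delimited(records, fieldnames, stream):
    """Write a list of dicts as tab-delimited text with a header row."""
    writer = csv.DictWriter(
        stream, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)


def read_tab_delimited(stream):
    """Read tab-delimited text with a header row back into a list of dicts."""
    return [dict(row) for row in csv.DictReader(stream, delimiter="\t")]


# Placeholder records (hypothetical, for illustration only).
FIELDS = ["date", "place", "note"]
records = [
    {"date": "1990-01-01", "place": "place_a", "note": "placeholder note one"},
    {"date": "1990-01-02", "place": "place_b", "note": "placeholder note two"},
]

# For a real file, use open(path, "w", encoding="utf-8", newline="") instead.
buffer = io.StringIO()
write_tab_delimited(records, FIELDS, buffer)

buffer.seek(0)
recovered = read_tab_delimited(buffer)
```

The point of the round trip is that nothing beyond a text reader is needed to get the data back, which is the property that made the tab-delimited extracts recoverable.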

2. Version control, version control, version control. On the programs, data and documentation.

As the Ionian rationalist Heraclitus said 2,600 years ago, the only thing permanent is change. [4] Automated data sets in particular are meant to be recoded as the dictionaries and coding systems improve, so unlike a survey, the data never become canonical: they are always subject to change. For the last few years, TABARI has had the ability to automatically prefix any data set it generates with a record of the date, dictionaries, and comments, a feature we will be incorporating into our new systems as well.
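The idea of prefixing a generated data set with a record of how it was made can be sketched in a few lines of Python. This is not TABARI's actual header format, which is not described here: the "#" comment convention, the field labels, the file names, and the use of a short SHA-256 digest to identify each dictionary are all assumptions made for illustration.

```python
import hashlib
from datetime import date


def provenance_header(run_date, dictionaries, comment):
    """Build comment lines recording the run date, dictionaries, and a note.

    `dictionaries` maps a dictionary file name to its contents as bytes.
    The digest lets a later reader check which version of a dictionary
    produced the data.
    """
    lines = [f"# generated: {run_date.isoformat()}"]
    for name in sorted(dictionaries):
        digest = hashlib.sha256(dictionaries[name]).hexdigest()[:12]
        lines.append(f"# dictionary: {name} sha256={digest}")
    lines.append(f"# comment: {comment}")
    return "\n".join(lines) + "\n"


# Hypothetical dictionary names and contents.
header = provenance_header(
    date(2000, 1, 1),
    {"verbs.txt": b"placeholder verbs", "actors.txt": b"placeholder actors"},
    "recoded after dictionary update",
)
```

Writing `header` at the top of every output file means the data carries its own version record even when it gets separated from the repository that produced it.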

The flexibility of automated coding also means that it is quite likely you will change your coding framework, particularly in the early phases: that’s a feature of the automated approach, not a bug. Just document it. The data sources are going to change over time as well, particularly right now: document those as well.

At the purely pragmatic level, at various times we have lost track of unintentional “forks” in the TABARI source code, the TABARI validation suite, and multiple versions of the CAMEO manual, and putting them back together was not fun. All projects with any reasonable level of complexity need version control; the software required to do this is open source and very mature, if occasionally counter-intuitive until you get the hang of it, so use it.

3. Reliably human validated records—both real cases and unit test cases—are an extremely valuable resource.

TABARI has an extended set of artificial “unit test” cases—about 700—which are run after every change to the program, however trivial, and if the new code fails on even one of these, it is fixed, promptly, before the code is used or modified further. This isn’t our invention: it is a standard practice in software development, and has saved us from many bugs which might have otherwise gone unnoticed.
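For readers who have not seen this practice, here is a toy version in Python. The coder, the verb dictionary, the event codes, and the test sentences are all hypothetical stand-ins, far simpler than TABARI and its roughly 700 cases; the sketch only shows the shape of the routine: keep a fixed list of inputs with known correct outputs and rerun all of them after every change.

```python
def toy_coder(sentence, verb_codes):
    """Return the code of the first dictionary verb in the sentence, else None."""
    for word in sentence.lower().split():
        word = word.strip(".,;:")
        if word in verb_codes:
            return verb_codes[word]
    return None


# Hypothetical dictionary and event codes (not CAMEO).
VERB_CODES = {"met": "MEET", "attacked": "ATTACK", "praised": "PRAISE"}

# Artificial test cases: (input sentence, expected code).
TEST_CASES = [
    ("Alpha met Beta on Tuesday.", "MEET"),
    ("Beta attacked Gamma.", "ATTACK"),
    ("Gamma issued a weather report.", None),
]


def run_suite(coder, cases, verb_codes=VERB_CODES):
    """Run every case and return the list of failures (empty means all pass)."""
    failures = []
    for sentence, expected in cases:
        actual = coder(sentence, verb_codes)
        if actual != expected:
            failures.append((sentence, expected, actual))
    return failures


failures = run_suite(toy_coder, TEST_CASES)
```

If `failures` is not empty after a change, the change gets fixed before anything else happens. Python's standard `unittest` module offers the same thing with more reporting.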

What TABARI—and our automated event data projects more generally—do not have is a very large set of genuine news articles where a “correct” coding has been determined and consequently can be used to test both the software and the dictionaries. We’ve certainly done those tests, but did not have the foresight to keep the inputs. And these are very expensive to generate: easy cases are cheap, but the program will get almost all of them anyway. The same for cases which the program will never get. But the difficult cases are frequently those which are also ambiguous to human coders—which then must be resolved in long discussions—or those which occur rarely but reveal systemic problems in the dictionaries. We really should have retained these over the years, and in the future we might be able to generate a set comparable to those we have for unit testing. Or perhaps, just perhaps, such sets which were generated with public funding might be made available?

4. Data are noisy: discuss and document the sources of noise rather than pretending that the data are perfect.

I addressed this in a recent post, and plan to return to it later, but I repeat: event data are noisy. And, by the way, human coded data are not perfect either: one of the presentations at this conference was discussing a method called MSE that is used for determining total casualty figures in conflict zones and it is based on the assumption of random errors in [human compiled] data.

Ideally, in fact, we need multiple measures of accuracy. In the case of event data, for example, I can think of at least ten [8] measures, and no system I know—including our efforts—has been evaluated on anything close to all of these. Because of these multiple indicators, single measures of “accuracy” are meaningless: for example the widely-reported measure in King and Lowe (2003) is quite quirky, and has been misinterpreted in multiple ways over the years.
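To make the "multiple measures" point concrete, the sketch below reports several numbers instead of one: accuracy per field on sentences where both the human and the machine coded an event, plus counts of false positives and false negatives. It covers only a few of the measures in footnote 8, and the record layout, field names, and sample codings are assumptions for illustration.

```python
FIELDS = ("source", "target", "event")


def evaluate(gold, coded):
    """Compare machine codings to human-validated codings, sentence by sentence.

    Each entry is either None (no event in the sentence) or a dict with the
    keys in FIELDS. Returns separate measures rather than a single score.
    """
    pairs = list(zip(gold, coded))
    matched = [(g, c) for g, c in pairs if g is not None and c is not None]
    report = {
        # Event coded when the validated record says there is none.
        "false_positives": sum(1 for g, c in pairs if g is None and c is not None),
        # No event coded although the validated record has one.
        "false_negatives": sum(1 for g, c in pairs if g is not None and c is None),
    }
    for field in FIELDS:
        hits = sum(1 for g, c in matched if g[field] == c[field])
        report[f"{field}_accuracy"] = hits / len(matched) if matched else None
    return report


# Hypothetical validated and machine codings for four sentences.
gold = [
    {"source": "A", "target": "B", "event": "MEET"},
    {"source": "A", "target": "C", "event": "ATTACK"},
    None,
    {"source": "B", "target": "C", "event": "PRAISE"},
]
coded = [
    {"source": "A", "target": "B", "event": "MEET"},
    {"source": "A", "target": "B", "event": "ATTACK"},
    {"source": "A", "target": "B", "event": "MEET"},
    None,
]

report = evaluate(gold, coded)
```

In this made-up sample the source and event codes are always right where both sides coded an event, the target is right half the time, and there is one false positive and one false negative. No single "accuracy" figure would convey that.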

By the way, there seems to be some serious rot—for want of a better term—in the claims of inter-coder reliability in human-coded projects. After reading the results found by Mikhaylov, Laver and Benoit (2012) and Ruggeri, Gizelis and Dorussen (2011) [12], I’ve started noticing multiple instances where claims of 80% reliability cannot be replicated, and true replicability is frequently as low as half that level. Furthermore, once I started noticing this, I’m seeing the problem pop up in multiple fields—it is certainly not confined to political science—and I’m beginning to wonder whether it needs to be addressed more seriously. It indicates sloppiness at best, and can be considered a mild form of fraud at worst.

More generally, I do not care what the intercoder reliability was on your first 200 data points: I want to know what it is on the last 200 data points. [5] And if the project involves multiple institutions, and the teams of coders have changed over time (particularly for multi-year—or multi-decade—projects) I want the inter-coder reliability between the original coding team and the latest coding at another institution.

We almost never see these figures. My guess is that if we had them, the reliability of coding for mature automated systems would already be substantially higher than human coding reliability, and it is going to improve further, whereas human coding is probably not going to improve further except to the extent it incorporates machine-assisted methods. I could be wrong, but based on the evidence from the few attempts at replication, those claims of 80% should be viewed with a high level of skepticism.
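Reporting reliability on the first and last stretch of a project is not much work once two coders' decisions on the same units are kept in order. The sketch below uses simple percent agreement and Cohen's kappa, a standard chance-corrected measure that I am choosing here for illustration; the codings are invented and the window is tiny, where a real project would use something like the 200 units mentioned above.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of units on which the two coders gave the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both coders used one and the same code throughout; kappa is
        # undefined, and returning 1.0 here is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def first_and_last(a, b, window):
    """Kappa on the first and on the last `window` units, in coding order."""
    return {
        "first": cohens_kappa(a[:window], b[:window]),
        "last": cohens_kappa(a[-window:], b[-window:]),
    }


# Invented codings by two coders on the same eight units, in the order coded.
coder_a = ["x", "y", "x", "y", "x", "y", "x", "y"]
coder_b = ["x", "y", "x", "y", "x", "y", "y", "x"]

drift = first_and_last(coder_a, coder_b, window=4)
```

In this toy case agreement is perfect early on and no better than chance at the end, which is the pattern a single project-wide figure would hide.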

5. Automate as many steps of your data generating process as possible.

At one point in the 2000s, I was systematically updating the KEDS Levant series every three months for the Swiss Peace Foundation—prior to this it tended to get updated whenever we had a conference paper to write—and I was noticing the process seemed fairly complex (in particular, sufficiently complex that it was not possible to delegate several of the tasks, though we had delegated some of it). So I actually wrote down all of the individual steps: There were about sixty.

With other demands—and most certainly, the fact that I was only doing this every three months—I never really automated this further, though it could have been considerably simplified even using the tools I had at the time, and most certainly could have been scripted as we moved into a Unix environment. GDELT, in contrast, gets updated every day around 2 a.m. using fully automated methods.

Scripts are your friends: they are little assistants who are completely reliable, do the same thing every time, and never show up late with a hangover or have a midterm they need to study for. Scripts also force you to systematize and document things which otherwise fall into that undocumented “Oh, I’ll just do it for you” category that will come back to haunt you five years later. They aren’t always appropriate, but I’m guessing most social scientists use these less frequently than they should.
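A script for an update process does not need to be elaborate. The sketch below, in Python, chains a few named steps and keeps a log of what each one did. The steps and the sample "stories" are hypothetical placeholders, not the sixty steps of the KEDS Levant update; the point is only that once the steps are written down as functions in a list, the list is both the procedure and its documentation.

```python
def drop_empty(stories):
    """Remove blank records."""
    return [s for s in stories if s.strip()]


def deduplicate(stories):
    """Remove exact duplicates, keeping the first occurrence."""
    seen = set()
    unique = []
    for story in stories:
        if story not in seen:
            seen.add(story)
            unique.append(story)
    return unique


# The ordered list of steps is the written-down procedure.
PIPELINE = [
    ("drop_empty", drop_empty),
    ("deduplicate", deduplicate),
    ("sort", sorted),
]


def run_pipeline(data, steps):
    """Apply each step in order and log the record count before and after."""
    log = []
    for name, step in steps:
        before = len(data)
        data = step(data)
        log.append(f"{name}: {before} -> {len(data)} records")
    return data, log


# Placeholder input: date-prefixed story lines with a blank and a duplicate.
raw = [
    "2000-01-02 story b",
    "",
    "2000-01-01 story a",
    "2000-01-02 story b",
]

cleaned, log = run_pipeline(raw, PIPELINE)
```

Adding a step means adding one entry to `PIPELINE`, and the log gives a record of every run that can be kept alongside the data.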

6. You will be lucky if two-thirds of the students you hire to work on the human-coding aspects of the project actually get it. For programmers, make that one-third.

Unless your system works entirely with unsupervised machine-learning methods—and thus far, no open coding system has achieved this, though there are a few undocumented and implausible claims that such systems exist—you are going to require the assistance of carbon-based life forms. Good luck.

As with other aspects of the coding projects, design for robustness and failure. Dictionary development and coding of validation sets is not as mind-numbing as repetitive coding, but it still requires sustained attention to oftentimes seemingly endless little details, and most humans are not good at this. In our experience, we see at least an 80/20 rule operating—and more likely this is a 90/10 rule—with 80% of the best dictionary development due to 20% of the coders. [6] Every once in a while we get an extraordinary coder—at the height of our early work on KEDS, there was a coder working on the PANDA project who could describe bugs in the program with such precision that I could sometimes almost identify the line of code that had to be involved—and then the project jumps forward. But most people aren’t at this level, and things move slowly and incrementally.

It would be great if we had a nice diagnostic test for identifying the really good people, but we’ve never come close to it. In the KEDS project, we learned to involve the coders themselves in the hiring process, which improved our yield but didn’t make it perfect. Consequently we had a fairly long “shake-down” period of coders working on supervised test cases before they did actual development, and figured we would get a 60% to 70% yield, if that. [13] Though once we had someone trained, they would usually work for us the remainder of their time at Kansas, and sometimes longer. Using experienced coders to supervise the less experienced also helps, though sometimes the less experienced coders are good at noticing the inconsistencies in your protocols.

Programmers?—I used to think I was uniquely bad at hiring programmers, or that good programmers didn’t want to work for political science projects, until I started reading the general management literature on this, and realized it is a problem that everyone has, and poor (and/or poorly managed) programming teams have caused the collapse of multi-billion-dollar projects. Our issues pale in comparison. In those cases when I get a good programmer, the project advances disproportionately; at other times, we just muddle through. But any project director who thinks “We’ll just hire a programmer” is living in a fantasy world.

7. Link to as many reliable standards and existing resources as possible, and of course, open source.

Finally, use open source materials, and contribute to the open source communities. The situation here is hugely different than when we started work 25 years ago, and our project is still catching up on using all of the readily-available material on the web: for example it took us a lot longer than it should have to make full use of WordNet, though in CountryInfo.txt, we have made good use of resources such as rulers.org and the CIA World Factbook. [9] Still, there is a lot more “dictionary fodder” we should be taking advantage of, both for the basis of dictionaries and updating them.

Some of these resources are easy to incorporate, others more difficult. Fortunately, any large-scale compendium available on the web will be embedded in some sort of standard HTML page which, after a bit of routine if customized programming to extract the fields, is effectively structured data. And despite my continual screeds on the obstinacy of academic quantitative conflict analysis—rapidly approaching the status of a methodological suicide cult [10]—in using their funny little COW country codes instead of ISO-3166, that’s simply a table lookup problem and there’s even an R package, countrycode, which solves it. Besides, COW codes are positively sane compared to FIPS codes.
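Both halves of that paragraph, pulling fields out of an HTML page and translating country codes by table lookup, fit in a short Python sketch using only the standard library. The HTML snippet is a made-up stand-in for a real compendium page, and the mapping holds just a handful of COW numeric codes with their ISO 3166-1 alpha-3 equivalents; a real project would load a complete table, for example from the countrycode package mentioned above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment; a real page needs its own customized handling.
PAGE = (
    "<table>"
    "<tr><th>Country</th><th>COW</th></tr>"
    "<tr><td>France</td><td>220</td></tr>"
    "<tr><td>Japan</td><td>740</td></tr>"
    "</table>"
)

# A few COW numeric codes and their ISO 3166-1 alpha-3 codes (partial table).
COW_TO_ISO3 = {2: "USA", 200: "GBR", 220: "FRA", 255: "DEU", 740: "JPN"}

parser = TableRows()
parser.feed(PAGE)
parser.close()

header, *body = parser.rows
records = [(name, COW_TO_ISO3.get(int(cow))) for name, cow in body]
```

Using `.get()` returns `None` for a code missing from the table instead of raising an error, so gaps in the lookup table show up in the output where they can be counted.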

And of course, it goes without saying, borrow from, and contribute to, the open source projects. There is some weak evidence that for really large projects—those above 1-million lines of code—closed projects may still have an edge, but the existing systems are nowhere remotely close to this level, nor will they ever be: the required processing simply isn’t that complicated, particularly once coding is broken out into discrete components, for example for geocoding, parsing, and feature extraction. Below that level, study after study has shown not just that open source is cheaper—if only “free as in puppy”—but better. What’s not to like? [14]

Addendum 14.01.28

In light of events of the past ten days [15], I’ll add another point that could apply in a couple of places in this list: document your sources. In the KEDS project, we always did this: data sets were actually defined by their data sources and search strings. PITF Atrocities, same thing: the data are defined by the sources and search string: it’s in the manual. ICEWS: you’betcha—even in the early research stages one of the many unreadable green-on-yellow (or was it yellow-on-green?) PowerPoint slides listed the sources, and as ICEWS moved closer to operational phases and licensed the source texts so that they could be shared within the U.S. government user community, those sources are precisely specified.

But this is a relatively recent development: I was surprised to find, when I became involved with automated filtering for the generation of MID-4, that there was not a common documentation format for the admittedly highly decentralized MID-3. Go back further in time and one gets to a level sometimes barely better than “We sent a bunch of undergraduate work-study students to the library and told them to look through micro-film, and we have reason to believe some of them actually did.” [16] Ideally somewhere the sources are documented—ideally, somewhere that does not require a ninja-style raid on a deserted garage to recover boxes of documents [17]—but in very large, extended, decentralized projects, standards are difficult to enforce.

In the absence of documentation of at least the corpus of documents—e.g. “Reuters downloaded from Factiva; Agence France Presse downloaded from Lexis-Nexis” [18]—you may find yourself in potentially murky territory, and certain data sets can be pretty darn murky. Just two days before the recent series of unfortunate events commenced, I was contacted by a computer science graduate student in California as to the TABARI/CAMEO codings of a specific article that he could trace through the URL [19]. He didn’t think the coding made a whole lot of sense, and about eight “events” had been generated. The article was a report on an automobile trade show in Detroit!

Look, folks, I have never claimed that TABARI/CAMEO was designed to code reports of trade shows. I do claim it can code reports of interactions dealing with political conflict, and it does a pretty good job of that, as reflected in two decades of NSF-funded research and refereed articles. But trade shows, no. Jerry Pournelle (Ph.D., political science) famously said that no cause is so virtuous that it will not attract idiots, and we seem to have an issue with that here.

Now, realistically, trade-show reports in an event data set just add noise (unless, I suppose, the Japanese exhibitors stage a ninja-style raid on the Chinese exhibitors, but press releases tend to gloss over that sort of thing). How much noise: well, as an analyst, that’s one of your issues, and it is true of any data set, and that’s why [some of us] prefer statistical analysis to postmodernist hermeneutic deconstructionism. But you could eliminate a whole lot of that noise with judicious application of white-listed sources and open-sourced filtering. Some data sets do that better than others.
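A white-list filter is about as simple as filtering gets. The Python sketch below keeps only records whose URL host appears on an approved list; the host names and records are hypothetical, and a real filter would likely combine this with checks on the content itself.

```python
from urllib.parse import urlparse

# Hypothetical white list of approved source hosts.
WHITELIST = {"news.example.org", "wire.example.com"}


def is_whitelisted(url, whitelist=WHITELIST):
    """True if the URL's host is on the white list."""
    host = urlparse(url).hostname or ""
    return host in whitelist


def filter_records(records, whitelist=WHITELIST):
    """Keep only records whose source URL comes from a white-listed host."""
    return [r for r in records if is_whitelisted(r["url"], whitelist)]


# Placeholder records: one approved source, one publicity site, one bad URL.
records = [
    {"id": 1, "url": "https://news.example.org/politics/story-1"},
    {"id": 2, "url": "http://autoshow.example.net/press-release"},
    {"id": 3, "url": "not a url"},
]

kept = filter_records(records)
```

Because the white list is a plain, readable set, publishing it alongside the data also documents the sources.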

Odin and the lawyers willing, perhaps there will be more said on this in the future.

Footnotes

1. After the last posting, I am told there was a tweet to the effect “Schrodt should write everything in ‘sevens.’” I now take that as a challenge!

2. No, I’m not just trying to drive traffic to the site. Hmmm, but this caveat is silly, right?: even if you do use this in a class, you’ll just provide a link, not print it off. But you might re-read it, so you can counter the individual who says “But Schrodt says that for government contracting work, best practice involves [1] vacuuming up everything you can in the open source world without attribution; [2] doing "hide and hoard" for anything original, using proprietary software; [3] “ghosting” open-source alternatives with proprietary and opaque tests with ludicrously implausible results whose details change every time they are challenged; [4] hiding everything under “unclassified but sensitive” and NDAs; [5] never forget the Prime Directive of the Two-Year-Old: “What’s mine is mine, and if I want it, what’s yours is mine as well.”; [6] open source is for chumps; and [7] consuming as much taxpayer funds as possible on this exercise, since Paul Krugman assures us this austerity thing is way over-rated, and if you need to spend money, let’s spend it here.” The greatest chump of all is the taxpayer: lambs to the slaughter, sheep to be sheared.

3. Evolution works the same way: by current accounts, only 10% of the DNA in your body is human; the rest is that of a complex biome of micro-organisms without which you would die.

4. The Buddha, with rather more substantial impact, said much the same thing at much the same time. Ah, the mystery of the Axial Age…what was going around then?…

5. Okay, in fact knowing these figures for the average coded unit would be the best indicator.

6. By convention, we’ve always called them “coders” even though they are actually figuring out patterns for the dictionaries.

7. Which would be a good name for a bluegrass band.

8. Here goes…

  1. Accuracy of the source actor code
  2. Accuracy of the source agent code
  3. Accuracy of the target actor code: note that this will likely be very different from the accuracy of the source, as the object of a complex verb phrase is more difficult to correctly identify than the subject of a sentence.
  4. Accuracy of the target agent code
  5. Accuracy of the event code
  6. Accuracy of the event quad code
  7. Absolute deviation of the Goldstein score on the event code
  8. False positives: event is coded when no event is actually present in the sentence
  9. False negatives: no event is coded despite one or more events in the sentence
  10. Global false negatives: an event occurs which is not coded in any of the multiple reports of the event

9. GDELT, in contrast, made very effective use of GeoNames and related resources to add a geospatial component to the data.

10. Remember that awkward conversation with your parents before you went off to college? “Dear, we know you are going to be out on your own, and we want you to experiment with new things, but please, honey, don’t join any suicide cults…” Yes, even a suicide cult devoted to the glorification of garbage can models run on obsolete data sets and hypotheses no one has taken seriously since before the era of Ronald Reagan, disco music and bell-bottom jeans. Suicide cults: JUST SAY NO. Unless you want tenure.

11. In fact, if you want to retrieve the data badly enough, there are commercial services that can do this—for a price—at least for formats as common as the variations of MS-Office. Though not necessarily for a database written by three guys in a garage in East Jerusalem in 1986. Still, it is a whole lot easier just to open a text file.

12. Paywalled.

13. Usually they just quit, or even more typically, just stop showing up: coding is like that. As a consequence, unlike Mitt Romney and Donald Trump, I’ve never developed a knack or taste for firing people. How sad.

14. We can also learn from the systematic studies of how the successful communities function: open source has been around sufficiently long that there are some fairly consistent results here, and we don’t need to re-invent the wheel.

15. As a number of people have correctly inferred—and others I have told directly—I have been advised by legal counsel not to comment on a current dispute involving third parties (neither I nor any of my software is at issue) who all have initials that would be integers in FORTRAN.[20] Some of those parties, in fact, would prefer that I not even comment upon the fact that I’ve been advised not to comment, or presumably comment upon the fact I am advised not to comment—turtles all the way down—but this is advice, not an active gag order. Though I’d rather like it to stay that way. When this issue is eventually resolved—there have been hints this will occur perhaps around the time the echinacea start blooming on the eastern Plains—I will comment on this if possible.

Advice of legal counsel, by the way, is also the reason eventdata.psu.edu has disappeared, though most of the content is available on eventdata.parusanalytics.com. Lost a #3 Google page ranking for the generic search “event data” on that one, accumulated over two decades of links. Sigh…

Welcome to my life: Shit happens. Though not nearly the shit that is happening in the places we code every month in the atrocities data set, so let us not feel too sorry for our privileged professional selves.

16. The COW site also has a god-awful frame-based structure that is the epitome of worst-practice in contemporary web design, but hey, not my problem.

17. Not an urban legend, but definitely legendary: ask someone in the field.

18. This is generally sufficient to locate conflict data: it is very rare that one cannot locate a set of stories using a source, date and two actors. You may not be able to locate the precise stories because of de-duplication, but you can find out what was going on (and if the event was incorrectly coded, often as not guess why). Ideally, one should have unique story identifiers: Factiva provides these (and we include them in the PITF atrocities data set) but Lexis-Nexis does not: this would involve maybe thirty lines of code, apparently not a priority. News aggregators such as Google News and European Media Monitor provide URLs, though the stability of these is always an issue [19].

19. Which now generates a 404-error, presumably unrelated to the series of unfortunate events, and instead due to the fact that publicity stories are very transient.

20. “B” designates a floating-point variable in FORTRAN.


