Seven Conjectures on the State of Event Data

[This essay was originally prepared as a memo in advance of the “Workshop on the Future of the CAMEO Ontology”, Washington DC, 11 October 2016, said workshop leading to the new PLOVER specification. I had intended to post the memo to the blog as well, but, obviously, neglected to do so at the time. Better late than never, and I’ve subsequently updated it a bit. It gets rather technical in places and assumes a fairly high familiarity with event data coding and methods. Which is to say, most people encountering this blog will probably want to skip past this one.]

The purpose of this somewhat unorthodox and opinionated document [1] is to put on the table an assortment of issues dealing with event data that have been floating around over the past year in various emails, discussions over beer and the like. None of these observations are definitive: please note the word “conjecture”.

1. The world according to CAMEO will look pretty much the same using any automated event coder and any global news source

The graph below shows the CAMEO frequencies across its major (two-digit) categories using three different coders—PETRARCH 1 and 2 [2], and Raytheon/BBN’s ACCENT (from the ICEWS data available on Dataverse)—for the year 2014. This also reflects two different news sources: the two PETRARCH cases use Lexis-Nexis; ICEWS/ACCENT uses Factiva, though of course there’s a lot of overlap between those.

[Figure: cameo_compare — CAMEO two-digit category frequencies for PETRARCH-1, PETRARCH-2, and ACCENT, 2014]
Basically, “CAMEO-World” looks pretty much the same whichever coder and news source you use: the between-coder variances are completely swamped by the between-category variances. The large differences we do see are probably due to differences in definitions: for example, PETRARCH-2 uses a more expansive definition of “express intent to cooperate” (CAMEO 03) than PETRARCH-1; I’m guessing BBN/ACCENT did a bunch of focused development on IEDs and/or suicide bombings, hence the very large spike in “Assault” (18), and they seem to have pretty much defined away the admittedly rather amorphous “Engage in material cooperation” (06).

I think this convergence is due to a combination of three factors:

  1. News source interest, particularly the tendency of news agencies (which all of the event data projects are now getting largely unfiltered) to always produce something: if the only thing going on in some country on a given day is a sister-city cultural exchange, that will be reported (hence the preponderance of events in the low categories). The age-old “if it bleeds, it leads” also accounts for the spike in reports of violence (CAMEO categories 17, 18, 19).
  2. In terms of the less frequent categories, the diversity of sources the event data community is using now—as opposed to the 1990s, when the only stories the KEDS and IDEA/PANDA projects coded were from Reuters, which is tightly edited—means that as you try to get more precise language models using parsing (ACCENT and PETRARCH-2), you start missing stories written in non-standard English that would be caught by looser systems (PETRARCH-1 and TABARI). Or at least this is true proportionally: on a case-by-case basis, ACCENT could well be getting a lot more stories than PETRARCH-2 (alas, without access to the corpus they are coding, I don’t know), but for whatever reason, once you look at proportions, nothing really changes except where there has been a really concentrated effort (e.g. category 18) or a change in definitions (ACCENT on category 06; PETRARCH-2 on category 03).
  3. I’m guessing (again, we’d need the ICEWS corpus to check, and that is unavailable due to the usual IP constraints) that all of the systems have similar performance in not coding sports stories, wedding announcements, recipes, etc.: I know PETRARCH-1 and PETRARCH-2 have about a 95% agreement on whether a story contains an event, but a much lower agreement on exactly what the event is. The various coding systems probably also have a fairly high agreement, at least at the nation-state level, on which actors are involved.
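The between-coder vs. between-category comparison can be made concrete with a few lines of code. A minimal sketch—the per-category counts below are invented for illustration, not the actual PETRARCH/ACCENT figures:

```python
# Invented per-category event counts for two coders; not the real data.
from statistics import pstdev

counts = {
    "P1":     {"01": 900, "03": 400, "04": 1200, "17": 300, "18": 500},
    "ACCENT": {"01": 1000, "03": 450, "04": 1350, "17": 320, "18": 700},
}

def proportions(c):
    """Normalize raw counts to proportions so corpora of different
    sizes can be compared."""
    total = sum(c.values())
    return {k: v / total for k, v in c.items()}

props = {coder: proportions(c) for coder, c in counts.items()}

# Spread across categories within one coder, vs. the largest
# between-coder gap within any single category:
between_category = pstdev(props["P1"].values())
between_coder = max(abs(props["P1"][k] - props["ACCENT"][k])
                    for k in props["P1"])
```

With proportions like these, the between-category spread is several times the largest between-coder gap, which is the pattern the graph shows for the real data.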

2. There is no point in coding an indicator unless it is reproducible, has utility, and can be coded from literal text

IMHO, a lot of the apparent disagreements within the event data community about the coding of specific texts, as well as the differences between the coding systems more generally, stem from trying to code things that either can’t be consistently coded at all—by human or automated systems—or will never be used. We should not try to code anything unless it satisfies the following criteria:

  • It can be consistently categorized by human coders on multiple projects, working with material from multiple sources, who are guided solely by the written documentation—i.e., no project-level “coding culture” or “I know it when I see it”; also see the discussion below on how little we know about true human coding accuracy.
  • The coded indicators are useful to someone in some model (which probably also puts a lower bound on the frequency with which a code will be found in the news texts). In particular, CAMEO has over 200 categories but I don’t think I’ve ever seen a published analysis that doesn’t either collapse these into the two-digit top-level cue categories, or more frequently the even more general “quad” or “penta” categories (“verbal cooperation” etc.), or else pick out one or two very specific categories. [3]
  • It can be derived from the literal text of a story (or, ideally, a sentence): the coding of the indicators should not require background knowledge except for information explicitly embedded in the patterns, dictionaries, models or whatever ancillary information is used by the automated system. Ideally, this information should be available in open source files that can be examined by users of the data.
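On the collapsing mentioned in the second criterion: reducing a full CAMEO code to its quad category is a trivial lookup on the two-digit cue category. The boundaries below follow the convention used in several public event datasets, but they should be checked against the documentation of whatever dataset you are actually using:

```python
def quad_class(cameo_code: str) -> int:
    """Map a CAMEO event code (e.g. "043" or "1823") to a quad class:
    1 = verbal cooperation, 2 = material cooperation,
    3 = verbal conflict,    4 = material conflict.
    Cue-category boundaries follow a common convention and may
    differ slightly between datasets."""
    cue = int(cameo_code[:2])
    if 1 <= cue <= 5:
        return 1
    if 6 <= cue <= 8:
        return 2
    if 9 <= cue <= 13:
        return 3
    if 14 <= cue <= 20:
        return 4
    raise ValueError(f"not a CAMEO cue category: {cameo_code!r}")
```

The “penta” variant simply pulls the statement/appeal cue categories out of class 1 into a class of their own.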

If an indicator satisfies those criteria, I think we usually will find we have the ability to create automated extractors/classifiers for it, and to do so without a lot of customized development: picking a number out of the air, one should be able to develop a coder/extractor using pre-existing code (and models or dictionaries, if needed) for at least 75% of the system.

3. There is a rapidly diminishing return on additional English-language news sources beyond the major international sources

Back in the 1990s, as news sources began to become available through aggregators and on the Web, the KEDS project at the University of Kansas was finally able to start using some local English-language sources in addition to Reuters, where we’d done our initial development. We were very surprised to find that while these occasionally contributed new events, they did not do so uniformly, and in most instances the international sources (Reuters and AFP at the time) actually gave us substantially more events, and event streams more consistent with what we’d expected to see (we were coding conflicts in the former Yugoslavia, eastern Mediterranean, and West Africa). This is probably due to the following:

  1. The best “international” reporters and the best “local” reporters are typically the same people: the international agencies don’t hire some whiskey-soaked character from a Graham Greene novel to sit in the bar of a fleabag hotel near the national palace, but instead hire local “stringers” who are established journalists, often the best in the country and delighted to be paid in hard currency. [19]
  2. Even if they don’t have stringers in place, international sources will reprint salient local stories, and this is probably even more true now that most of those print sources have web pages.
  3. The local media sources are frequently owned by elites who do not like to report bad news (or spin their own alt-fact version of it), and/or are subject to explicit or implicit government censorship.
  4. Wire-service sourcing is usually anonymous, which substantially enhances the life expectancy of reporters in areas where local interests have been known to react violently to coverage they do not like.
  5. The English usage and reporting style in local papers often differ significantly from international style, so even when these local stories contain nuggets of relevant information, automated systems that have been trained on international sources—or are dependent on components so trained (the Stanford CoreNLP system, for example, was trained on a Wall Street Journal corpus)—will not extract them correctly.

This is not to say that some selected local sources could not provide useful information, particularly if the automated extractor were explicitly trained to work with them. There is also quite a bit of evidence that in areas where a language other than English predominates, even among elites, non-English local sources may be very important: this is almost certainly true for Latin America and probably also true for parts of the Arabic-speaking world. But generally “more is better” doesn’t work, or at least it doesn’t have the sort of payoff people originally expected.

4. “One-a-day” (OAD) duplicate filtering is a really bad idea, but so is the absence of any duplicate filtering

I’m happy to trash OAD filtering without fear of attack by its inventor because I invented it. To the extent it was ever invented: like most things in this field, it was “in the air” and pretty obvious in the 1990s, when we first started using it.

But for reasons I’ve recently become painfully aware of, and have discussed in an assortment of papers over the past eighteen months (see http://eventdata.parusanalytics.com/papers.dir/Schrodt.TAD-NYU.EventData.pdf for the most recent rendition), OAD amplifies, rather than attenuates, the inevitable coding errors found in any system, automated or manual.
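The amplification mechanism can be illustrated with a toy simulation: if one true event is reported n times and each report is independently miscoded with some probability, OAD keeps every distinct coding as a separate “event.” All of the parameters below are invented for illustration:

```python
import random

random.seed(0)  # deterministic for illustration

def oad_survivors(n_reports=20, p_miscode=0.1, n_categories=20, true_code=4):
    """Number of distinct event records that survive OAD filtering
    when one true event generates n_reports noisy codings."""
    codes = set()
    for _ in range(n_reports):
        if random.random() < p_miscode:
            # a randomly drawn miscode (occasionally coincides with the truth)
            codes.add(random.randrange(n_categories))
        else:
            codes.add(true_code)
    return len(codes)

# One true event should yield one record; averaged over many trials,
# OAD keeps noticeably more than one:
avg = sum(oad_survivors() for _ in range(1000)) / 1000
```

With a 10% miscode rate and 20 reports, the average number of surviving records is close to three: the duplicates that should have reinforced a single event instead manufacture spurious ones.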

Unfortunately, the alternative of not filtering duplicates carries a different set of issues. While those unfamiliar with international coverage typically assume that an article which occurs multiple times will be somehow “more important” than an article that appears only once (or a small number of times), my experience is that this is swamped by the effects of

  • The number of competing news stories on a given day: on a slow news day, even a very trivial story will get substantial replications; when there is a major competing story, events which otherwise would get lots of repetition will get limited mentions.
  • Urban and capital-city bias. For example, when Boko Haram set off a car bomb in a market in Nigeria’s capital Abuja, the event generated in excess of 400 stories. Events of comparable magnitude in northeastern regional cities such as Maiduguri, Biu or Damaturu would get a dozen or so, if that. Coverage of terrorist attacks over the past year in Paris, Nice, Istanbul and Bangkok—if not Bowling Green—shows similar patterns.
  • Type of event. Official meetings generate a lot of events. Car bombings generate a lot of events, particularly by sources such as Agence France Press (AFP) which broadcast frequent updates.[4] Protracted low level conflicts only generate events on slow news days and when a reporter is in the area. Low-level agreements generate very few events compared to their likely true frequency. “Routine” occurrences, by definition, generate no reports—they are not “newsworthy”—or generate these on an almost random basis.
  • Editorial policy: AFP updates very frequently; the New York Times typically summarizes events outside the US and Western Europe in a single story at the end of the day; Reuters and BBC are in between. Local sources generally are published only daily, but there are a lot of them.
  • Media fatigue: Unusual events—notably the outbreak of political instability or violence in a previously quiet area—get lots of repetitions. As the media become accustomed to the new pattern, stories drop off.[18] This probably could be modeled—it likely follows an exponential decay—but I’ve rarely seen this applied systematically.
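The media-fatigue point could indeed be modeled as exponential decay: assume daily story counts follow counts ≈ a·exp(−b·day) and estimate b by least squares on the logged counts. The daily counts below are invented for illustration:

```python
import math

days   = [0, 1, 2, 3, 4, 5, 6]
counts = [420, 260, 150, 95, 60, 35, 22]   # hypothetical story counts

# Fit log(counts) = log(a) - b*day by ordinary least squares.
logs = [math.log(c) for c in counts]
n = len(days)
mean_x = sum(days) / n
mean_y = sum(logs) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs)) / \
        sum((x - mean_x) ** 2 for x in days)
b = -slope                             # daily decay rate
a = math.exp(mean_y - slope * mean_x)  # fitted day-0 count
half_life = math.log(2) / b            # days for coverage to halve
```

For these invented counts the fitted half-life of coverage is about a day and a half, which is roughly the time scale one sees anecdotally for all but the largest stories.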

So, what is to be done? IMHO, we need to do de-duplication at the level of the source texts, not at the level of the coded events. In fact, go beyond that: start by clustering stories, ideally run these through multiple coders—as noted earlier, I don’t think any of our existing coders is optimal for everything from a Reuters story written and edited by people trained at Oxford to a BBC transcript of a static-filled French radio report out of Goma, DRC, quickly translated by a non-native speaker of either language—and then base the coded events on those that occur frequently within each cluster of reports. Document clustering is one of the oldest applications in automated text analysis, and there are methods that could be applied here.
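A minimal sketch of the source-text approach: treat each story as a set of word shingles, merge stories whose Jaccard similarity exceeds a threshold, and code each cluster once. A production system would use MinHash/LSH to scale and a tuned threshold; the 0.5 used here is arbitrary:

```python
def shingles(text, k=3):
    """Set of k-word shingles from a story's text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a or b else 0.0

def cluster_stories(stories, threshold=0.5):
    """Greedy single-pass clustering by shingle similarity."""
    clusters = []
    for story in stories:
        sh = shingles(story)
        for c in clusters:
            if jaccard(sh, c["shingles"]) >= threshold:
                c["stories"].append(story)
                c["shingles"] |= sh
                break
        else:
            clusters.append({"stories": [story], "shingles": sh})
    return clusters

clusters = cluster_stories([
    "a car bomb exploded in the central market on tuesday morning",
    "a car bomb exploded in the central market early tuesday",
    "leaders met to discuss a trade agreement in the capital",
])
```

The two near-duplicate bombing reports land in one cluster and the unrelated meeting story in another; events would then be generated once per cluster, weighted by cluster size if desired.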

5. Human inter-coder reliability is really bad on event data, and actually we don’t even know how bad it is

We’ve got about fifty years of evidence that the human coding [5] of this material doesn’t have a particularly high correlation when you start, for example, comparing across projects, over time, and in the more ambiguous categories.[6] While the human coding projects typically started with coders at an 80% or 85% agreement at the end of their training (typically measured by Cronbach’s alpha) [7], no one realistically believes that was maintained over time (“coding drift”) and across a large group of coders who, as the semester rolled on, were usually on the verge of quitting. [8] And that is just within a single project.

The human-coded WEIS event data project [10] started out being coded by surfers [11] at UC Santa Barbara in the 1960s. During the 1980s WEIS was coded by Johns Hopkins SAIS graduate students working for CACI, and in Rodney Tomlinson’s final rendition of the project in the early 1990s [12], by undergraduate cadets at the U.S. Naval Academy. It defies belief that these disparate coding groups had 80% agreement, particularly when the canonical codebook for WEIS at the Inter-university Consortium for Political and Social Research was only about five (mimeographed) pages in length.

Cross-project correlations are probably more like 60% to 70% (if that) and, for example, a study of reliability on (I think [20]) some of the Uppsala (Sweden) Conflict Data Project conflict data a couple years ago found only 40% agreement on several variables, and 25% on one of them (which, obviously, must have been poorly defined).

The real kicker here is that because there is no commonly shared annotated corpus, we have no idea what these accuracy rates actually are, nor measures of how widely they vary across event categories. The human-coded projects rarely published any figures beyond a cursory invocation of the 0.8 Cronbach’s alpha for their newly-trained cohorts of human coders; the NSF-funded projects focusing on automated coding were simply not able to afford the huge cost of generating the large-scale samples of human-coded data required to get accurate measures, and various IP and corporate policy constraints have thus far precluded getting verifiable information on these measures for the proprietary coders.

6. Ten possible measures of coder accuracy

This isn’t a conjecture, just a point of reference. These are from https://asecondmouse.wordpress.com/2013/05/10/seven-guidelines-for-generating-data-using-automated-coding-1/

  1. Accuracy of the source actor code
  2. Accuracy of the source agent code
  3. Accuracy of the target actor code: note that this will likely be very different from the accuracy of the source, as the object of a complex verb phrase is more difficult to correctly identify than the subject of a sentence.
  4. Accuracy of the target agent code
  5. Accuracy of the event code
  6. Accuracy of the event quad code: verbal/material cooperation/conflict [13]
  7. Absolute deviation of the “Goldstein score” on the event code [14]
  8. False positives: event is coded when no event is actually present in the sentence
  9. False negatives: no event is coded despite one or more events in the sentence
  10. Global false negatives: an event occurs which is not coded in any of the multiple reports of the event

This list is by no means comprehensive, but it is a start.
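Given a shared annotated corpus, several of these measures reduce to simple per-field comparisons. A sketch with invented records: events here are (source, target, event code) tuples keyed by sentence id, with None meaning no event was coded in that sentence:

```python
# Invented codings of the same three sentences by two systems.
coder_a = {1: ("USA", "RUS", "042"), 2: ("SYR", "REB", "190"), 3: None}
coder_b = {1: ("USA", "RUS", "043"), 2: ("SYR", "REB", "190"),
           3: ("IGO", "SYR", "030")}

def field_agreement(a, b, field):
    """Agreement on one field, over sentences where both coded an event."""
    both = [k for k in a if a[k] is not None and b[k] is not None]
    return sum(a[k][field] == b[k][field] for k in both) / len(both)

source_accuracy = field_agreement(coder_a, coder_b, 0)  # measure 1
event_accuracy  = field_agreement(coder_a, coder_b, 2)  # measure 5

# Measure 9, taking coder_b as the reference annotation:
false_negatives = [k for k in coder_a
                   if coder_a[k] is None and coder_b[k] is not None]
```

Note how the two systems agree perfectly on the source actors but only half the time on the event codes—exactly the pattern described in conjecture 1.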

7. If event data were a start-up, it would be poised for success

Antonio Garcia Martinez’s highly entertaining, if somewhat misogynistic, Chaos Monkeys: Obscene Fortune and Random Failure in Silicon Valley quotes a Silicon Valley rule-of-thumb that a successful start-up at the “unicorn” level—at least a temporary billion-dollar-plus valuation—can rely on only a single “miracle.” That is, a unicorn needs to solve only a single heretofore unsolved problem. So for Amazon (and Etsy), it was persuading people that nearly unlimited choice was better than being able to examine something before they bought it; for AirBNB, persuading amateurs to rent space to strangers; for DocuSign [21], realizing that signing documents was such a pain that you could attain a $3-billion valuation just by providing a credible alternative [22].  If your idea requires multiple miracles, you are doomed.[15]

In the production of event data, as of 2016, we have open source solutions—or at least can see the necessary technology in open source—for all of the following parts of the low-cost near-real-time provision of event data:

  • Near-real-time acquisition and coding of news reports for a global set of sources
  • Automated updating of actor dictionaries through named-entity-recognition/resolution algorithms and global sources such as Wikipedia, the European Commission’s open source JRC-Names database, CIA World Leaders and rulers.org
  • Geolocation of texts using open gazetteers, famously geonames.org and resolution systems such as the Open Event Data Alliance’s mordecai.
  • Inexpensive cloud based servers (and processors) and the lingua franca of Linux-based systems and software
  • Multiple automated coders (open source and proprietary) that probably well exceed the inter-coder agreement of multi-institution human coding teams

More generally, in the past ten years an entire open source software ecosystem has developed that is relevant to this problem (though typically in contexts far removed from event data): general-purpose parsers, named-entity-recognition/resolution systems, geolocation gazetteers and text-to-location algorithms, near-duplicate text detection methods, phrase-proximity models (word2vec etc.) and so forth.

The remaining required miracle:

  • Automated generation of event models, patterns or dictionaries: that is, generating and updating software to handle new event categories and refine the performance on existing categories.

This last would also be far easier if we had an open reference set of annotated texts, and even Garcia Martinez allows that things don’t require exactly one miracle. And we don’t need a unicorn (or a start-up): we just need something more robust and flexible than what we’ve got at the moment.

SO…what happened???

The main result of the workshop—which covered a lot of issues beyond those discussed here—was the decision to develop the PLOVER coding and data interchange specification, which basically simplifies CAMEO to focus on the levels of detail people actually use (the CAMEO cue categories with some minor modifications [16]), as well as providing a systematic means—“modes” and “contexts”—for accommodating politically-significant behaviors not incorporated into CAMEO such as natural disasters, legislative and electoral behavior, and cyber events. This is being coordinated by the Open Event Data Alliance and involves an assortment of stakeholders (okay, mostly the usual suspects) from academia, government and the private sector. John Beieler and I are writing a paper on PLOVER that will be presented at the European Political Science Association meetings in Milan in June, but in the meantime you can track various outputs of this project at https://github.com/openeventdata/PLOVER. A second effort, funded by the National Science Foundation, will be producing a really large—it is aiming for about 20,000 cases, in Spanish and Arabic as well as English—set of PLOVER-coded “gold standard cases” which will both clearly define the coding system [17] and also simplify the task of developing and evaluating coding programs. Exciting times.[23]
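To give a feel for the interchange side, here is a purely hypothetical PLOVER-style record serialized with the standard library. The field names and values are illustrative guesses, not taken from the actual specification—consult the PLOVER GitHub repository for the real schema:

```python
import json

# Hypothetical event record; field names are illustrative only.
event = {
    "id": "example-0001",
    "date": "2016-10-11",
    "source": {"actor": "USA", "agent": "GOV"},
    "target": {"actor": "RUS", "agent": "GOV"},
    "event": "CONSULT",
    "mode": "visit",
    "context": ["diplomatic"],
    "text": "U.S. officials met with their Russian counterparts ...",
}

serialized = json.dumps(event, indent=2)   # what goes over the wire
round_trip = json.loads(serialized)        # lossless round trip
```

The point of a JSON interchange format is exactly this lossless round trip: a common set of tools can work across multiple data sets instead of counting fields in assorted tab-delimited formats.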

Footnotes:

1. Unorthodox and opinionated for a workshop memo. Pretty routine for a blog.

2. The blue bar shows the count of codings where PETRARCH-1 and PETRARCH-2 produce the same result; despite the common name, they are essentially two distinct coders with distinct verb phrase dictionaries.

3. Typically with no attention as to whether these were really implemented in the dictionaries: I cringe when I see someone trying to use the “kidnapping” category in our data, as we never paid attention to this in our own work because it wasn’t relevant to our research questions.

4. I read a lot of car bomb stories: http://eventdata.parusanalytics.com/data.dir/atrocities.html

5. When such things existed for event data: There really hasn’t been a major human coded project since Maryland’s GEDS event project shut down about 15 years ago. Also keep in mind that if one is generating on the order of two to four thousand events per day—the frequency of events in the ICEWS and Phoenix systems—human coding is completely out of the picture.

6. In some long-lost slide deck (or paper) from maybe five or ten years ago, I contrasted the requirements of human event data coding with research—this may have been out of Kahneman’s Thinking, Fast and Slow—on what the human brain is naturally good at. The upshot is that it would be difficult to design a more difficult and tedious task for humans to do than event data coding.

7. Small specialized groups operating for a shorter period, of course, can sustain a higher agreement, but small groups cannot code large corpora.

8. In our long experience at Kansas, we found that even after the best selection and training we knew how to do, about a third of our coders—actually, people developing coding dictionaries, but that’s a similar set of tasks—would quit in the first few weeks, and another sixth by the end of the semester. A project currently underway at the University of Oklahoma is finding exactly the same thing.

10. The WEIS (World Event/Interaction Survey) ontology, developed in the 1960s by Charles McClelland, formed the basis of CAMEO and was the de facto standard for DARPA projects from about 1965 to 2005.

11. Okay, “students” but at UCSB, particularly in the 1960s, that was the same thing.

12. Tomlinson actually wrote an entirely new, and more extensive, codebook for his implementation of WEIS, as well as adding a few minor categories and otherwise incrementally tweaking the system, much as we’ve been seeing happening to CAMEO. Just as CAMEO was a major re-boot of WEIS, PLOVER is intended to be a major modification of CAMEO, not merely a few incremental changes.

13. More recently, researchers have started pulling the high-frequency (and hence low-information) “Make public statement” and “Appeal” categories out of “verbal cooperation”, leading to a “pentacode” system. PLOVER drops these.

14. The “Goldstein scale” actually applies to WEIS, not CAMEO: the CAMEO scale typically referred to as “Goldstein” was actually an ad hoc effort around 2002 by a University of Kansas political science grad student named Uwe Reising, with some additional little tweaks by his advisor to accommodate later changes in CAMEO. Which is to say, a process about as random as that which was used to develop the original Goldstein scale by an assistant professor and a few buddies on a Friday afternoon in the basement of the political science department at the University of Southern California. Friends don’t let friends use event scales: Event data should be treated as counts.

15. Another of my favorite aphorisms from Garcia Martinez: “If you think your idea needs an NDA, you might as well tattoo ‘LOSER’ on your forehead to save people the trouble of talking to you. Truly original ideas in Silicon Valley aren’t copied: they require absolutely gargantuan efforts to get anyone to pay serious attention to them.” I’m guessing DocuSign went through this experience: it couldn’t possibly be worth billions of dollars.

16. To spare you the suspense, we eliminated the two purely verbal “comment” and “agree” categories, split “yield” into separate verbal and material categories, combined the two categories dealing with potentially lethal violence, and added a new category for various criminal behaviors. Much of the 3- and 4-digit detail is still retained in the “mode” variable, but this is optional. PLOVER also specifies an extensive JSON-based data interchange standard in hopes that we can get a common set of tools that will work across multiple data sets, rather than having to count fields in various tab-delimited formats.

17. CAMEO, in contrast, had only about 350 gold standard cases: these have been used to generate the initial cases for PLOVER and are available at the GitHub site.

18. For example, a recent UN report covering Afghanistan 2016 concluded there had been about 4,000 civilian casualties for the year. I would be very surprised if the major international news sources—which I monitor systematically for this area—got even 20% of these, and those covered were mostly major bombings in Kabul and a couple other major cities.

19. Which they may use to buy exported whiskey, but at least that’s not the only thing they do.

20. Because, of course, the article is paywalled. One can buy 24-hour access for a mere $42 and 30-day access for the bargain rate of $401. Worth every penny since, in my experience, the publisher’s editing probably involved moving three commas in the bibliography, and insisting that the abstract be so abbreviated one needs to buy the article.

21. The original example here was Uber, until I read this. Which you should as well. Then #DeleteUber. This is the same company, of course, where just a couple years ago one of their senior executives was threatening a [coincidentally, of course…] female journalist. #DeleteUber. Really, people, this whole brogrammer culture has gotten totally out of control, on multiple dimensions.

Besides, conventional cabs can be, well, interesting: just last week I took a Yellow Cab from the Charlottesville airport around midnight, and the driver—from a family of twelve in Nelson County, Virginia, and sporting very impressive dreadlocks—was extolling his personal religious philosophy, which happened to coincide almost precisely with the core beliefs of 2nd-century Gnosticism. Which is apparently experiencing a revival in Nelson County: Irenaeus of Lyon would be, like, so unbelievably pissed off at this.

22. Arguably the miracle here was simply this insight, though presumably there is some really clever security technology behind the curtains. Never heard of DocuSign? Right—that’s because they not only had a good idea but they didn’t screw it up. Having purchased houses in distant cities both before and after DocuSign, I am inordinately fond of this company.

23. PLOVER isn’t the required “miracle” alluded to in item 7, but almost certainly will provide a better foundation (and motivation) for the additional work needed in order for that to occur. Like WEIS, CAMEO became a de facto “standard” by more or less accident—it was originally developed largely as an experiment in support of some quantitative studies of mediation—whereas PLOVER is explicitly intended as a sustainable (and extendible) standard. That sort of baseline should make it easier to justify the development of further general tools.

This entry was posted in Methodology.