Seven observations on the newly released ICEWS data

Before we get to the topic of the post, the usual set of apologies about the absence of recent postings, starting with that Duke Nukem Forever style “Feral+well, whatever.” It isn’t that I’ve dropped out; it is rather that I’ve been too busy with other projects. And by the way, I haven’t retired,[1] unless logging 2,200 hours of work last calendar year is “retired.” Someday, things will slow down, and I’ll get to the backlog. But enough about me…that’s not what we’re up to today.

Instead, the point of today’s posting is to comment upon the long, long awaited release of a public version of the Integrated Conflict Early Warning System (ICEWS) dataset, which appeared without fanfare on Dataverse late in the afternoon on Friday, 27 March, with Jay Ulfelder probably responsible for first spotting it.

This is a massive resource: The investment in the ICEWS project, albeit not all of the funding going into the data, probably roughly equalled the whole of NSF spending on all international relations and comparative politics research during the time it was active. As with any large data set it is going to take a while to figure out all of its quirks. The ever-resourceful David Masad already has some excellent instructions and initial analyses and visualizations here, and I’m sure more will be forthcoming in the coming weeks (and years), but I wanted to use the occasion to alert my [ever declining] readership to this, and provide some initial observations.

1. It exists!

Long overdue, to be sure, given that at the ICEWS “Kick-off Meeting” in 2007 we were assured that everything in the project would be open, a concept almost immediately quashed, and then some, by the prime contractors. I’m pretty sure we have the persistent and unrelenting efforts of Mike Ward and Philippe Loustaunau to thank for the release. I’ve also got a pretty good idea of who is responsible for the delays but, well, let’s just focus on positive things right now.[2]

Here’s the link to the data: There are actually four “studies” involved:

  • 28075: This is the main data set: 26 files, most of these are around 30 Mb each. Took me a couple tries to get some of them, though that may have been due to a lousy wireless connection on my end. It’s Dataverse; it will work.
  • 28117: Aggregated data: this may be the quickest way to get into the data, assuming what you are interested in is covered by one of the very large number of aggregations that have already been computed. I’ve not really looked at these yet, and much of the documentation appears oriented to a proprietary dashboard which is not provided, but particularly for people not comfortable working with very large disaggregated data sets, it could be very useful.
  • 28118: These are the dictionaries; more on this below.
  • 28119: This was the big disappointment: we had been told the release would include a set of “gold standard cases”, which we assumed would be the much-needed gold standard cases for validating event coding systems, but these, alas, are just some sort of esoteric records associated with the much-disputed ICEWS “events of interest.” Hard to imagine it will be much use for anyone, but I’ve been wrong before.

2. Dictionaries!

As we’ve [3] been arguing for decades, the primary advantage of automated coding is the ability to maintain consistent coding across data sets being coded across a number of years and, through the dictionaries, to have a high level of transparency.[4] For that you need the dictionaries as well as the coder, and the KEDS project [5] has been consistently providing those as part of the data. To its great credit, ICEWS has followed this norm, and what dictionaries they are!: a primary actor dictionary with over 100,000 political actors. The format is derivative of that used in TABARI and PETRARCH—I’m guessing it will take about fifteen minutes to write a converter.

The agent dictionary—also provided, along with a somewhat cryptic “sectors” dictionary— on the other hand, is definitely a work in progress, though probably fine for the major sub-national actors, which for the most part are still those established by Doug and Joe Bond’s work on PANDA and IDEA back in the 1990s, and subsequently incorporated into CAMEO. The quirk that really sticks out for me is the treatment of religion: for ICEWS, it seems “Christian” and “Catholic” are separate primary categories [6]—granted, the late Ian Paisley [7] would agree—and the Great Schism of 1054 apparently was no big deal. The whole of Judaism gets only two entries, rather an oversimplification for the neck of the woods I’m usually studying. There is an extraordinarily eclectic set of ethnic groups—with a distinct oversampling in India—and, well, overall, the agent dictionary is sort of like rummaging through some old trunk in your grandparent’s attic [8], and I’m pretty sure we’re well ahead of this at the Open Event Data Alliance.

ICEWS did not provide event code dictionaries, which are presumably tightly linked with the proprietary BBN ACCENT coder. This is a bit of an issue, since ACCENT does not actually code CAMEO, but their own variant which they have documented in a very extensive manual. Not ideal but no worse than the situation with any human-coded data.

Update 10-Feb-2020: Uh, not so fast: it turns out that the ICEWS “actors” dictionary on Dataverse (updated from the original release) is not remotely close to comprehensive, focusing mostly on individuals and organizations but, for the most part, without national names and demonyms (e.g. “Nigeria”/“Nigerian”, “Iraq”/“Iraqi”). Though just to further provide inconsistency, it has a few, e.g. “South Sudan”/“South Sudanese” and “France” (6 times)/“French” (twice): go figure.

In a small sample test (with fairly simple pattern matching) I did with about 11,000 records from the FUOU version, which contains the source texts, there were matches to an actor in the text on only about 30% of the cases using the file, but with the same methodology, this goes up to 86% when one adds an open source list of national names, synonyms and demonyms: I used the Phoenix.Countries.actors.txt file from the various PETRARCH programs on GitHub, and it is derived from the CountryInfo.120116.txt file. I was not checking for unattached agents (that is, agents without an explicit national identity) so those probably account for most of the remaining cases.

The charitable explanation here is that BBN had some internal files with national names and demonyms which it routinely used in NLP work, and those were already part of the ACCENT system and thus, like the event dictionaries, were considered proprietary even though this information is readily available, e.g. from individual country entries in the CIA World Factbook, one of the sources for CountryInfo.120116.txt. So this problem can be readily corrected from open sources, but the key point remains that the ICEWS actors file should be considered only a supplemental dictionary for event coding, not a comprehensive one.
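The small-sample test described in this update can be sketched roughly as follows. This is a minimal illustration under loud assumptions: the phrase lists and sentences are made up, they stand in for the real dictionaries and the FUOU source texts, and the matching rule (whole-word, case-insensitive) is my guess at what "fairly simple pattern matching" might look like rather than the method actually used.

```python
import re

def build_matcher(phrases):
    """Compile one case-insensitive regex matching any phrase as a whole word."""
    # Longest phrases first so "South Sudanese" is tried before "South Sudan".
    ordered = sorted(set(phrases), key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

def match_rate(texts, phrases):
    """Fraction of texts containing at least one dictionary phrase."""
    matcher = build_matcher(phrases)
    hits = sum(1 for text in texts if matcher.search(text))
    return hits / len(texts)

# Hypothetical stand-ins, NOT the real ICEWS or PETRARCH dictionaries.
actor_phrases = ["Goodluck Jonathan", "Boko Haram", "South Sudan", "South Sudanese"]
country_phrases = ["Nigeria", "Nigerian", "Iraq", "Iraqi", "France", "French"]

# Made-up sentences standing in for source texts.
texts = [
    "Boko Haram claimed responsibility for the attack.",
    "Nigerian troops were deployed to the northeast.",
    "Iraqi officials met a French delegation in Baghdad.",
    "The committee postponed its vote until next week.",
]

rate_actors_only = match_rate(texts, actor_phrases)
rate_with_countries = match_rate(texts, actor_phrases + country_phrases)
print(f"actors only: {rate_actors_only:.0%}, with countries: {rate_with_countries:.0%}")
```

On this toy input the rate rises from one text in four to three in four once national names and demonyms are added, which mirrors the direction (though obviously not the magnitude) of the 30% versus 86% result reported above.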

3. You’ll need to convert it for statistical analysis but I’ve got a program for that.

The public-release ICEWS uses a very quirky format that is apparently designed to be read rather than analyzed, as the underlying codes are presented in verbose, English-language equivalents. Unless you are ready to settle into a few quiet evenings reading through the 5-million records, you’ll probably want to use the data in statistical analyses, which means you’ll want to get shorter codes. It just so happens, I’ve got an open-source program for that at and I’ve even provided you’all with COW codes as well as ISO-3166-alpha3 codes. I didn’t fully convert the sector dictionaries, but this will at least give you a good start.
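To give a flavor of what such a conversion involves, here is a minimal sketch. It is not the program mentioned above: the column names, the sample records and the tiny lookup table are all assumptions made for illustration, and a real converter would need a complete country table plus event and sector handling.

```python
import csv
import io

# Illustrative lookup only; a real table would cover every country.
# Values are (ISO-3166 alpha-3, COW numeric code).
COUNTRY_CODES = {
    "Nigeria": ("NGA", 475),
    "Iraq": ("IRQ", 645),
    "France": ("FRA", 220),
    "India": ("IND", 750),
}
UNKNOWN = ("---", -1)  # placeholder for names missing from the lookup

# Made-up, tab-delimited records; the column names are assumptions.
SAMPLE = (
    "Event Date\tSource Country\tEvent Text\tTarget Country\n"
    "2013-01-05\tNigeria\tMake statement\tFrance\n"
    "2013-01-06\tIraq\tConsult\tAtlantis\n"
)

def convert(rows):
    """Replace verbose country names with short ISO and COW codes."""
    converted = []
    for row in rows:
        src_iso, src_cow = COUNTRY_CODES.get(row["Source Country"], UNKNOWN)
        tgt_iso, tgt_cow = COUNTRY_CODES.get(row["Target Country"], UNKNOWN)
        converted.append({
            "date": row["Event Date"],
            "src_iso": src_iso, "src_cow": src_cow,
            "tgt_iso": tgt_iso, "tgt_cow": tgt_cow,
            "event_text": row["Event Text"],
        })
    return converted

records = convert(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
for rec in records:
    print(rec)
```

Unrecognized names are mapped to an explicit placeholder rather than dropped, so gaps in the lookup table show up in the output instead of silently shrinking the data.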

4. Massive use of local sources

That old criticism that event data are nothing but the world as viewed from the point of Western imperialists? This will be hard to sustain with ICEWS, which uses hundreds of local sources, and each event contains information on the source. I’ve only looked in detail at 2013, and here these follow more or less a rank-size distribution, with some of the major international sources (Xinhua, BBC) being major contributors, but the tail of that distribution is extremely long.

5. The distribution is flat

While the internet, and new social media more generally, are revolutionizing our ability to inexpensively generate large-scale datasets relevant to the study of political behavior, a serious problem has been dealing with the exogenous effects of the rapid expansion of internet-based sources that began in the mid-2000s. Any “dumpster diving the web” approach leads to an exponential increase starting about this time, which for any statistical analysis is a bug, not a feature.

ICEWS avoids this: they seem to be using a relatively fixed set of sources, and the total density is largely flat. As Masad’s visualizations and some others I’ve seen show, there appears to be a bit of variation—1995 and 1996 seem undersampled—and more will probably appear as further research is done, since there have been major changes in the international journalism environment beyond just the increase in the availability of reports, but these variations are not exponential, and can probably be accommodated with relatively simple statistical adjustments.

6. 80% precision, but no assessment of the accuracy

The release is accompanied by an extensive analysis showing that the “accuracy” of the ACCENT coder is around 80%. Which would be very nice, except that the study actually assesses not accuracy, but precision, which, while interesting, gives us no information whatsoever on the measure most people are interested in: the probability of correctly coding a randomly chosen sentence (accuracy), rather than the probability that a sentence that was coded was coded correctly (precision). Echoing the exchange between Col. Harry Summers and one of his Vietnamese counterparts over the unbroken string of US battlefield successes, the assessed precision “May be so, but it is also irrelevant.”

The arguments here are a bit technical, though involve nothing more than simple algebra, so I’ve relegated this to an appendix to this post. The upshot, to paraphrase Ray Stevens, “Yo selected on the dependent variable, and I can hear yo’ mama sayin’, “You in a heap o’ trouble son, now just look what you’ve done””

7. It should splice with Phoenix

The current dataset has a one-year embargo, though the word on the street is that the embargo will remain at just one year, more or less. That is, the data will be periodically updated, ideally monthly, perhaps quarterly. [Addendum: in a very promising sign, the March 2014 data were indeed made available on 1 April 2015.] This will be adequate for most retrospective studies, but still won’t help with the real-time forecasting that event data are increasingly used for.

Here the recently-released OEDA Phoenix data set comes to the rescue, or will once we’ve got another four or five months of ICEWS data, as Phoenix gets going around the beginning of July 2014. Provided ICEWS is updated regularly, within a fairly short period of time one should be able to use the ICEWS 1995-2014 data for calibration, and then use Phoenix to cover the end of ICEWS to the present (Phoenix is updated daily).

Assuming, of course, that the data are sufficiently similar that they can be spliced, possibly with some adjustments. The major distinction between the data sets is likely to be the sources, with ICEWS using Open Source Center feeds and proprietary data services, and Phoenix a white-list of Web-based sources. This is likely to make a big difference in some areas—in the very limited exploration I’ve done, ICEWS seems to disproportionately focus on India, for example, and for statutory reasons, contains no internal data on the US—and less in others. Actor dictionaries will not be an issue as the ICEWS dictionaries could be used to code the Phoenix sources, though this may not be necessary.

The different coding engines may or may not make a difference: in the absence of a confirmable set of gold standard cases for events, and verb dictionaries, we will need a significant period when the two sets overlap to find out whether the two systems perform significantly differently. My guess is that they won’t differ all that much, particularly if common actor dictionaries are used, since both coders are based on full parsing, and the differing sources will be the bigger issue. Both Phoenix and ICEWS provide information on the publications where the coded text came from, so these could be filtered to get similar source sets.

In the absence of a public version of ICEWS overlapping with the [still relatively brief] Phoenix data, we can only do indirect measures of the likely similarities, but some quick analyses I’ve done comparing marginals of the first six months of Phoenix with the last six months of ICEWS indicate two promising points of convergence: the density of data (events per day) was quite similar and—even more telling—the marginal densities of the event types were very similar (actors less so but again, that’s easily corrected since the ICEWS actor dictionaries are public).  Again, we won’t be able to do the more crucial test—the correlation of dyad-level event counts—until there is a substantial overlap in the public data, but initial indications are promising.
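One simple way to put a number on "the marginal densities of the event types were very similar" is the total variation distance between the two sets of event-type shares. The sketch below is only an illustration of that idea: the counts are invented, the choice of distance measure is mine rather than what was used in the quick analyses mentioned above, and the keys are just two-digit CAMEO root codes used as labels.

```python
def shares(counts):
    """Convert raw event counts by category into proportions."""
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}

def total_variation(p, q):
    """Half the sum of absolute differences between two distributions.

    0.0 means identical marginals; 1.0 means no overlap at all.
    """
    codes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in codes)

# Made-up event counts by CAMEO root code, for illustration only.
icews_counts = {"01": 400, "02": 150, "04": 250, "19": 200}
phoenix_counts = {"01": 380, "02": 170, "04": 240, "19": 210}

tvd = total_variation(shares(icews_counts), shares(phoenix_counts))
print(f"total variation distance: {tvd:.3f}")
```

A small distance on the marginals is encouraging but, as noted above, it is not a substitute for correlating dyad-level event counts once the two public series overlap.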

What needs to be done (all open)

Call me a greedy anti-intellectual knuckle-dragging Neanderthal (and you will), but when I read a recent article in Science about the construction of an esoteric scientific instrument whose construction cost was $300-million and annual operating costs are $30-million, and then compared that with the pittance that is being allocated (when we can avoid our programs being shut down altogether [9]) for event data which could contribute significantly, at the very least, to increasing the ability of NGOs to accurately anticipate situations where “bad things might happen” [10], or even to a reality-based foreign policy, I get a tad irritated. Consider these aspects of the instrument in question:

  • it may not work (its also-costly predecessor did not), and half of the project is situated in a place in Louisiana that makes it less likely to work, suggesting it is largely a mindless boondoggle. A boondoggle located in Louisiana, I’m shocked, shocked.
  • if it does work, it merely further confirms a century-old theory which we’ve got complete confidence in already, and as the Science article points out, is confirmed billions of times each day, for example as a smart phone displays inappropriate content having determined that you are a male walking within fifty meters of a Victoria’s Secret outlet store. [14]
  • and the predictions of the theory at issue were already confirmed by other observational evidence four decades ago, for which the discoverers got a nice trip to Stockholm.

Which is to say, this is just the natural sciences equivalent of a performance art project [13], but at a rather higher price tag. And unlike space telescopes and Mars rovers, we don’t even get nice pictures from it.

So, like, if we can spend what will probably eventually total some half-billion dollars before this thing winds down, presumably with the yawn-inducing equivalent of the umpteenth iteration of “Hey, ya’know, Mars once had water on it!!” how about spending 1%, just a lousy 1%, of that amount (which is probably also about 10% of the cost of ICEWS) on enhancing event data? And this time with social scientists in charge, not folks whose prime competence is raiding the public purse under the guise of protecting our national interests against opponents who disappeared decades ago. Oh, and every single line and file of the project open source. I’m just asking for 1%! A guy can dream, right?

So, say we’ve got $5-million. Here’s my list:

1. Open gold standard cases. Do it right: the baseline will be the openly available Linguistic Data Consortium GigaWord news files, use a realistically large set of coders with documented training protocols and inter-coder performance evaluation, do accuracy assessments, not just precision assessments. Sustained human coder performance is typically about 6 events per hour—probably faster on true negatives—and we will need at least 10,000 gold standard cases, double-coded, which comes to a nice even $50K for coders at $15/hour, double this amount for management, training and indirects, and we’re still at only $100K.

2. Solve—or at least improve upon—the open source geocoding issue. This is going to be the most expensive piece, and could easily absorb half the funds available. But the payoffs would be huge and apply in a wide number of domains, not just event data. I’d put $2M into this.

3. Extend CAMEO and standard sub-state actor codes, using open collaboration among assorted stakeholders with input from various coding groups working in related domains. We know, for example, that one of the main things missing in CAMEO are routine democratic processes such as elections, parliamentary coalition formation, and legislative debate, and there are people who know how to do this better than us bombs-and-bullets types. On sub-state actor coding, religious and ethnic groups are particularly important. I’m guessing one could usefully spend $250K here. Also call it something other than CAMEO.

4. Automated verb phrase recognition and extraction, which will be needed for extending the CAMEO successor ontology. I actually think we’re pretty close to solving this already, and we could get some really good software for $50K. [11]  If that software works as well as I hope it will, then spend another $250K getting verb-phrase dictionaries for the new comprehensive system.

5. Event-specific coding modules, for example for coding protests and electoral demonstrations. Open-ended, but one could get a couple templates for $100K.

6. Systematic assessment of the native language versus machine translation issue. That is, do we need coding systems (coders and dictionaries) specific to languages other than English, particularly French, Spanish, Arabic and Chinese [12], or is machine translation now sufficient (remember, we’re just coding events, not analyzing poetry or political manifestos), so that, given finite resources, we would be better off continuing the software development in English (perhaps with source-language-specific enhancement for the quirks of machine translation)? Hard to price this one, but it is really important, so I’d allocate $500K to it.

7. Insert your favorite additional requirements here: we’ve still got $1.75M remaining in our budget, which also allows a fair amount of slack for excessively optimistic estimates on the other parts of the project. Or if no one has better ideas, next on my list would be systematically exploring splicing and other multiple-data-set methods such as multiple systems estimation. And persuade Lockheed to dust off the unjustly maligned JABARI—or make the code open source if they have no further use for it—and give us another alternative sequence based on that program.
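For anyone who wants to check the arithmetic in the list above, here is a short tally. All figures come straight from the items; the variable names and the grouping into a dictionary are just my own bookkeeping.

```python
# Item 1: cost of the gold standard cases.
cases = 10_000
codings_per_case = 2      # double-coded
events_per_hour = 6       # sustained human coder performance
wage_per_hour = 15        # dollars

coder_cost = cases * codings_per_case * wage_per_hour / events_per_hour
gold_standard_total = 2 * coder_cost  # doubled for management, training, indirects

# Items 1 through 6, in dollars.
allocations = {
    "1. gold standard cases": gold_standard_total,
    "2. open source geocoding": 2_000_000,
    "3. CAMEO successor and sub-state actor codes": 250_000,
    "4a. verb phrase extraction software": 50_000,
    "4b. verb phrase dictionaries": 250_000,
    "5. event-specific coding modules": 100_000,
    "6. native language vs machine translation": 500_000,
}

budget = 5_000_000
committed = sum(allocations.values())
remaining = budget - committed

print(f"coder cost: ${coder_cost:,.0f}")
print(f"committed:  ${committed:,.0f}")
print(f"remaining:  ${remaining:,.0f}")
```

The tally reproduces the $50K coder cost in item 1 and the $1.75M left over in item 7.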

All this for only 1% of the cost of a single natural science performance art project! Come on, someone out there with access to the public trough—or even some New Gilded Age gadzillionaire—let’s go for it! Pretty please?


1. Yeah, I can just imagine the conversations at ISA in New Orleans (I was on Maui. Just on vacation. Really.)

“Hey, Schrodt really disappeared once he left Penn State. Figured that would happen…”

“Really, it’s bad: I heard that he was last seen on the side of the exit ramp off I-99 to Tyrone, looking really gaunt and holding a cardboard sign that said ‘Will analyze mass atrocities for food’.”

“Yes, that’s right: so sad. So keep that in mind if you are thinking about leaving academia, or even imagining the possibility of asking any senior faculty to get their fat Boomer butts out of the way.”

Well, no, that’s not really accurate. But we’ll save that for another blog entry. Meanwhile, you can follow me on GitHub. And I’ll be at EPSA in Vienna.

2. And keep our faith in the wheel of karma.

3. I’m not exactly sure who “we” is—I’m neither royalty nor, to my knowledge, have a tapeworm—but I’m trying to represent the views of a loose amalgam of people who have been working with machine-coded event data for a good quarter-century now.

4. Total transparency when the coding software is available, which is not the case here, but even without the software these dictionaries are a huge improvement over the transparency in most human coding projects, where too many decisions rest on an undocumented and ever-shifting lore known only to the coders.

5. Or whatever it should be called: it will always be KEDS—Kansas Event Data System—to me.

6. Two Protestant denominations get designations at the same level as “Herdswoman” and “Pirate Party”—Episcopal (but not Anglican) and Methodist—and there is an entry for “Maronite.” That’s it: no Lutherans, no Baptists, no Pentecostals, no Mormons, not even the ever-afflicted Jehovah’s Witnesses. In fact in the ICEWS agent ontology, the only religions worthy of subcategories are Christian, Catholic, Buddhist, Hindu and Moslem, though the latter has not been affected by that unpleasantness at Karbala in 680 CE.  The ontology developers, however, appear to have spent a bit too much time watching re-runs of the Kung-Fu television series—or more ambiguously, Batman Begins—as only Buddhism produces “warriors.”

7. To say nothing of the late Fred Phelps.

8. Yeah, yeah, they moved to a condo in Arizona two decades ago and the old place was torn down and replaced with a McMansion, but it still makes for a nice metaphor.

9. Though I did notice that Senator Jeff Flake was one of the few Republicans not to throw his lot in with the GOP efforts to provide free policy advice to the Islamic Republic of Iran, so perhaps his M.A. in Political Science did some good.

10. I think we are now at a point where these things can make a serious difference: The absence of major electoral violence in the 2013 Kenyan elections and (fingers crossed) the 2015 Burundi elections may eventually be seen as breakthroughs on this issue.

11. But meanwhile, don’t get me started on the vast amounts that are wasted on hiring programmers who never finish the job. Really, people, the $75 to $150 an hour to hire someone with a professional track record who will actually write the programs you need is a better deal than spending $25,000+ a semester (stipend, tuition and indirects, and this is actually a low estimate for many private institutions) for one or more GRAs who are supposed to be learning programming but who, in fact, stand a pretty good chance of getting absolutely nowhere because writing sophisticated research software does not, in many instances, provide a good pedagogical platform. No matter what your Office for the Suppression of Research says.

12. Hindi/Urdu is also important in terms of the number of speakers, but, for better or worse, the media elites in the region use English extensively.

13. If you aren’t familiar with this concept, Google “bad performance art.” NSFW.

14. To clarify, not the precise example used by Science.

Appendix: Why “precision” is not the same as “accuracy”

If I were appointed the Ultimate Education Czar, my first move would be to impose a requirement that no one should leave college without understanding the concept of “selection on the dependent variable.” And I might even make BBN’s assessment of the ACCENT coder, included in the DataVerse Study 28075, a required case study in the perils of ignoring this precept.

As noted in the body of the text, BBN has measured not the accuracy of the ACCENT coder, but the precision. Taking the usual classification matrix [A1]

Basic Classification Matrix

               Coded False            Coded True
Actual False   True Negative (TN)     False Positive (FP)
Actual True    False Negative (FN)    True Positive (TP)

the BBN research design selects on the positive cases: we only know the second column. This information can be used to compute $\text{Precision} = \frac{TP}{TP + FP}$ but not $\text{Accuracy} = \frac{TP + TN}{TN + FP + FN + TP}$, and since we have no idea what is in that first column, the accuracy, which is the value most people are interested in, could take any value.

Suppose, however, that the accuracy is equal to the precision. Some quick algebra will show that this can occur only if the ratio of TN to FN in the unobserved first column is equal to the ratio of TP to FP in the second column. In a hypothetical case where there are a total of 1,000 cases, and 10% of these were coded in an event category being assessed, this would look like

State of the World Where Accuracy = Precision

               Coded False   Coded True
Actual False   720           20
Actual True    180           80


This has a false negative rate $\text{FNR} = \frac{FN}{FN + TP}$ of 0.692.

Attaining this, however, requires the cooperation of the Data Fairy, who determines the row marginals. In this particular case, the roughly 3:1 ratio would almost never be observed except possibly in the two most frequent CAMEO categories, 01 and 02, and that only if one had an extraordinarily accurate pre-filter, a topic I will return to momentarily.
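For readers who want the "quick algebra" behind the accuracy-equals-precision condition spelled out, one way to write it (using only the four cell counts defined in the classification matrix, and assuming all denominators are nonzero) is:

```latex
\begin{align*}
\frac{TP + TN}{TP + FP + TN + FN} &= \frac{TP}{TP + FP} \\
(TP + TN)(TP + FP) &= TP\,(TP + FP + TN + FN) \\
TP^2 + TP \cdot FP + TN \cdot TP + TN \cdot FP &= TP^2 + TP \cdot FP + TP \cdot TN + TP \cdot FN \\
TN \cdot FP &= TP \cdot FN \\
\frac{TN}{FN} &= \frac{TP}{FP}
\end{align*}
```

In the hypothetical table, $TN/FN = 720/180 = 4$ and $TP/FP = 80/20 = 4$, so the condition holds and both accuracy and precision come out to 0.8.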

Turning to JABARI (apparently relegated to the dustbin of automated coding systems by the BBN analysis), what would be required to get an accuracy of 0.8 given a precision of 0.4 (roughly the reported values)? Again, provided we had the cooperation of the Data Fairy, a bit of algebra shows the required changes in the table are remarkably small:

Alternate State of the World Where
Accuracy = 0.8 while Precision = 0.4

               Coded False   Coded True
Actual False   760           60
Actual True    140           40

We’ve now increased the FNR to 0.78, but otherwise the classification matrix does not look particularly odd, and the ratio of the false to true cases has now shifted to a still-optimistic but at least somewhat more realistic 4.5:1. What is interesting here is that not only could the low precision still result in high accuracy, but the conditions under which that could occur are only modestly different than the conditions under which the high precision corresponds to a comparable level of accuracy.

As we are all aware, however, the Data Fairy is generally not very kind, and the reality of event coding is probably dramatically different. As Ben Bagozzi and I showed in 2012, in a broad set of news stories—rather than a selective set of news stories focused on conflict zones, where automated coding was originally developed—the proportion of stories that will not generate any CAMEO coding could easily be as high as two-thirds, because CAMEO was never intended as a general-purpose event ontology and there are a very large number of newsworthy political events that have no corresponding code in CAMEO.[A2] By definition, the average frequency in the 20-code CAMEO scheme is 5%, not 10% but, as the statistician George E.P. Box said, let us not be concerned with mice when there are tigers about, and for purposes of illustration stay with 10%.

If we hold the number of positive cases fixed, increase the sample size—the Data Fairy provides the row marginals—and hold the precision fixed at 80%, there is only one possible classification table:

Counter-factual State of the World with Precision = 0.8
when Positive Cases are 3% of the Data

               Coded False   Coded True
Actual False   2880          20
Actual True    20            80

If we actually saw this coming from a coding system, that would be truly stunning performance (the accuracy is 98.7%!), but remember, this is a counter-factual, and because they have selected on the dependent variable, the BBN test provides no evidence at all that ACCENT would have this performance under this more realistic scenario. We do know that neither human coding, nor the various tests of automated systems that have tested accuracy, not just precision, have come anywhere close to that accuracy. But if the accuracy is going to drop to a realistic level, say 80%, either TP has to decrease, or FP has to increase, and in either scenario, the precision drops.
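The three hypothetical tables in this appendix are easy to check mechanically. The sketch below simply applies the precision, accuracy and FNR formulas given earlier to the cell counts from each table; the function and dictionary names are my own.

```python
def metrics(tn, fp, fn, tp):
    """Precision, accuracy and false negative rate from the four cell counts."""
    total = tn + fp + fn + tp
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
        "fnr": fn / (fn + tp),
    }

# Cell counts copied from the three hypothetical tables in this appendix.
tables = {
    "accuracy = precision": dict(tn=720, fp=20, fn=180, tp=80),
    "accuracy 0.8, precision 0.4": dict(tn=760, fp=60, fn=140, tp=40),
    "counter-factual, 3% positive": dict(tn=2880, fp=20, fn=20, tp=80),
}

results = {name: metrics(**cells) for name, cells in tables.items()}
for name, m in results.items():
    print(f"{name}: precision={m['precision']:.3f} "
          f"accuracy={m['accuracy']:.3f} FNR={m['fnr']:.3f}")
```

The point the numbers make is the one in the text: the same precision of 0.8 is compatible with an accuracy of 0.8 or of nearly 0.99, depending entirely on the column that a precision-only design never observes.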

This postulated drop in accuracy does not, of course, necessarily translate into a corresponding drop in the number of events, since contemporary multi-source news streams have a massive level of redundancy. For example, a search on Factiva of only four international sources (Reuters, AFP, BBC and All-Africa) using the search term “abuja and (bombing or explosion or killed)” in the period 14 to 21 April 2014, the period following the bombing of a bus station in Nyanya on the outskirts of the Nigerian capital of Abuja, yields 384 stories.[A4] In a world where we had ideal de-duplication, only one of these needs to be coded in order for the incident to be part of the event stream. For this and other reasons, as I’ve noted elsewhere, there are at least ten distinct metrics that could be used to evaluate various dimensions of the effectiveness of an event coding system, not one.[A5]

Once again, the take-away point is that with a design that selects on the dependent variable, we know nothing about the accuracy: only a design which is based on the sample of inputs, which is trivial to implement, will do this.

A related issue here is whether this design inherently favors the precision of the BBN coder, and to what extent this occurs. With a few rather innocuous assumptions about the nature of the coding process, we can assess this.

The big issue in the establishment of gold standard measures is, of course, inter-coder reliability. In both the KEDS project, and the efforts at Lockheed I was familiar with during the research phase of ICEWS, inter-coder disagreements stopped just short of the level of homicide, but only just. BBN achieves an 80% inter-coder agreement, but through the mechanism of using only two coders, which is completely unrepresentative of a large scale human coding project, but again, those are mice, and the tigers are elsewhere. The critical variable here is not the BBN inter-coder reliability, but rather the BBN coder agreement with ACCENT and the  inter-project reliability: that is, to what extent would the BBN coders have agreed with the Lockheed coders?

As before, we are not provided with these numbers, but we can make some reasonable guesses. It seems fairly likely that the BBN coder/ACCENT intercoder reliability is higher than even the two-person intercoder reliability, since presumably the coders were working with CAMEO as it had been implemented in ACCENT. So let’s put this at 90%.

Inter-project agreement is much more problematic because BBN is not coding into CAMEO, but rather their modification of CAMEO, effectively a dialect, which by rights should probably be given another name, say CAMEO-B. Nothing wrong with modifying CAMEO (I’ve been advocating for that for some time now, though I think it should be done in an open collaborative fashion with input from the larger user community), but we can be pretty certain that CAMEO-B is not CAMEO, whereas Lockheed was coding into CAMEO.[A3] Consequently it is reasonable to assume that the inter-project reliability is substantially lower (which, by the way, occurs in all human-coding projects as well: this is an extraordinarily difficult measure to maintain). We’ll set this at 60%.

Now, suppose ACCENT and JABARI actually have the same precision, let’s say 80%. Assuming the validation sample consists of 50% each of ACCENT-coded and JABARI-coded records, we get the following:

ACCENT coding ACCENT records: 80% precision in coding x 90% agreement with the human coders = 72% overall

ACCENT coding JABARI records: 80% precision in coding x 60% agreement with BBN coders = 48% overall

ACCENT precision 50/50 sample: 60%

JABARI coding ACCENT records: 80% precision in coding x 60% agreement with the human coders = 48% overall

JABARI coding JABARI records: 80% precision in coding x 60% agreement with BBN coders = 48% overall

JABARI precision 50/50 sample: 48%

ACCENT, in other words, gains a full 12 percentage point advantage by chance alone, even if the two coders have identical precision. BBN shouldn’t be in Boston, they should be in Vegas! Those numbers don’t entirely correspond to what we are seeing, but except under completely unrealistic assumptions, the deck is stacked in favor of ACCENT, probably significantly.
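The arithmetic behind the stacked deck can be reproduced in a few lines. This is only a restatement of the guesses made above (80% true precision for both coders, 90% and 60% agreement rates, a 50/50 validation sample); none of these are measured values, and the function name is my own.

```python
def measured_precision(true_precision, agreement_by_source, mix):
    """Apparent precision when evaluators agree at different rates per record source."""
    return sum(
        share * true_precision * agreement_by_source[source]
        for source, share in mix.items()
    )

# Validation sample: half ACCENT-coded records, half JABARI-coded records.
mix = {"ACCENT records": 0.5, "JABARI records": 0.5}

# Both coders are assumed to have the same true precision of 0.8.
accent = measured_precision(
    0.8, {"ACCENT records": 0.9, "JABARI records": 0.6}, mix
)
jabari = measured_precision(
    0.8, {"ACCENT records": 0.6, "JABARI records": 0.6}, mix
)

print(f"ACCENT apparent precision: {accent:.2f}")
print(f"JABARI apparent precision: {jabari:.2f}")
print(f"gap: {accent - jabari:.2f}")
```

Changing the assumed agreement rates or the sample mix changes the size of the gap, but as long as the evaluators agree more readily with ACCENT on any share of the records, the gap stays in ACCENT's favor.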

But the more important point here is that this test conflates three separate factors:

  • The precision of the ACCENT and JABARI coding programs
  • The convergence of the human coders with ACCENT
  • The differences between the BBN interpretation of CAMEO and Lockheed’s interpretation

While it is perfectly possible to measure all three of these independently, the existing test does not do so. What is purported to be a decisive improvement in precision may in fact be nothing more than the fact that JABARI does an inferior job of coding into an alternative dialect, CAMEO-B, that it was never programmed to code in the first place: hardly an earthshaking finding.[A6]

There is, of course, a very simple—and in the grand scheme of research projects, not particularly expensive—way around both these issues. First, evaluate against the full text sample: this is research design 101 stuff. And, while we’re at it, do this using an openly accessible set of texts, and the Linguistic Data Consortium Gigaword series fits the bill perfectly.[A7] Second, use a neutral set of evaluators working from a single codebook, preferably under circumstances which much more realistically mirror those found in large scale human coding (or machine-assisted coding) projects such as ACLED, the various Uppsala conflict data sets and START/GTD. Until that exercise is done, we have basically no information on the true accuracy, and we have a statistically biased estimate of the relative performance of the two systems.

Appendix Footnotes

A1. And ignoring for the time being the fact that I can’t get WordPress to correctly format the column heading. I’m open to suggestions: I’m using HTML code that is rendered with the heading centered when I put it in a separate file and open it in Chrome.

A2. Depending on the filtering that is used, there is also a broad class of stories that have no political content at all—sports stories, recipes, media reviews, celebrity gossip—which could drive this number even higher.

A3. I’m in a pretty good position to know that Lockheed consulted extensively on the original CAMEO, and JABARI had the benefits of more than three years of experience, including work with actual coding teams managed by social scientists with extensive coding experience. BBN has never communicated with so much as a single email. They are perfectly within their rights to do so, but short of a miracle, that will substantially diminish the inter-project correlation, and consequently the reportedly questionable effectiveness of JABARI, independent of the true precision of JABARI. Funny coincidence, that.

A4. Granted, this is a high level of coverage, particularly for Africa, and I picked it because at the time I noted, in the process of doing coding for the Worldwide Atrocities Data Set, the disproportionate journalistic attention to this incident compared to various small villages which were being totally wiped out by Boko Haram attacks, though much to their credit the Nigerian press appears to do a pretty good job even on these. Attacks in the developed world, such as the Charlie Hebdo attack, will of course generate an order of magnitude more coverage for an order of magnitude fewer casualties.

A5. It is also possible that we are seeing a methodological shift here as we simultaneously transition from loose pattern-based coders such as KEDS and TABARI to stricter parser-based coders such as ACCENT and PETRARCH, while simultaneously shifting from single-stream news sources—typically Reuters, AFP or, historically, the New York Times—to massively eclectic source streams such as those used in ICEWS and Phoenix. While I can’t speak for ACCENT, I know in the development of PETRARCH there were many occasions when I expected a sentence to code—hey, it made perfect sense to me—but after staring at the parsed version long enough I realized, yep, PETRARCH (or, more generally, CoreNLP) was right, that’s not the object of the verb, and we’re not supposed to code it. Those sorts of decision-rules are needed to increase precision—apparently the sole objective for ACCENT—but at the cost, at least in PETRARCH, of a very high number of false negatives. (In the case of PETRARCH, this is further exacerbated by the fact we are still, for the most part, using verb dictionaries developed for TABARI.)

The introduction of multi-stream news sources further complicates matters. When we were developing the original KEDS dictionaries, we were working with a single source, Reuters, which was edited (more or less) to the standards of a 200-page style manual (which at one point we got our hands on, though it didn’t prove that useful). Because we were generally focused on a single region, the Levant, we were probably actually dealing with just a small number of reporters and editors [A8], and the dictionaries consequently accommodated the style of those individuals. We later added a second source, AFP, and coded two other regions, but this was still limited. In a somewhat similar manner, the research phase of ICEWS focused only on Asia—a focus still seemingly evident in the dictionaries—and initially on a relatively small number of news sources, mostly the usual international suspects.

The ICEWS public release and Phoenix, in contrast, both use hundreds of sources, the bulk of these small and written by non-native speakers of English. This is going to be far more challenging to code than Oxbridge-edited Reuters English, and PETRARCH, at least, misses large numbers of these just because of linguistic quirks. Maybe ACCENT does better, but they haven’t provided this information.

A6. The “dialect” analogy actually works pretty well here: Take an article in Norwegian and tell Google Translate to translate it from the closely-related Danish into English, and you’ll still get something that will allow you to figure out most of the article. [try it!] But translating the text from Norwegian works a lot better. Surely at least some of this is going on as JABARI tries to code into the CAMEO-B dialect.

A7. Though this doesn’t deal with the accuracy in multi-source streams. One could certainly do this as well using, say, texts from Open Source Center and URLs from Phoenix, but things start getting a whole lot more expensive at this point because of the sheer sample sizes involved. In the end, I daresay, there are just going to be a lot of things we aren’t going to know unless someone wants to spend a lot of money to find out. Though if we could set up our coding facility in the midst of some Louisiana clearcut, maybe we could get the money.

A8. I actually met one of Reuters’ Lebanon correspondents while having drinks with a mutual friend by the pool at some luxury hotel on the Nile in Cairo. While we were talking—no, I’m not making this up—no less than the then notorious NYT war correspondent Chris Hedges stopped by for a chat. Field research is like that, all the time! Yeah, right.

And then there was the time I asked the academic husband of a Moroccan friend what he thought of coverage of Morocco in The Economist. He paused and then said “I am The Economist correspondent for Morocco.”

8 Responses to Seven observations on the newly released ICEWS data

  5. Hi Phil, apparently I missed this a while back. I should revisit more carefully, but a quick note on precision and accuracy — in addition to precision having an unsatisfactory relationship to accuracy, it’s also the case that high accuracy doesn’t tell you much about precision, especially in these sparse data cases where true negatives dominate. Even if specificity is high you can get terrible precision: maybe you’re correct 99% of the time on actual negative instances, but if there are 100 times more negative than positive instances, at least half of your positive predictions are going to be false positives.*

    Low precision systems, of course, are also unsatisfying for users!

    I wish there was a good way to test the recall (sensitivity) of these things beyond human coding of uniform sampling of the dataset, which is just way too much effort for rare classes.

    * Less fuzzy math … say sensitivity = 100% and specificity = 99% and true data distribution is 100/101:
    100 neg ==> 1 FP prediction
    1 pos ==> 1 TP prediction

    ==> precision = 50% (and lower if sensitivity is lower)
    ==> accuracy = 99% (or so)
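The numbers in this comment can be reproduced with a small Python sketch. The function name and its parameters are hypothetical, chosen only for illustration; the inputs are the ones given above (100 negatives, 1 positive, sensitivity 100%, specificity 99%).

```python
def confusion_metrics(n_pos, n_neg, sensitivity, specificity):
    """Return (precision, accuracy) from class counts and per-class rates."""
    tp = n_pos * sensitivity      # positives correctly flagged
    tn = n_neg * specificity      # negatives correctly left alone
    fp = n_neg - tn               # negatives wrongly flagged
    if tp + fp == 0:
        raise ValueError("no positive predictions, precision is undefined")
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (n_pos + n_neg)
    return precision, accuracy


precision, accuracy = confusion_metrics(
    n_pos=1, n_neg=100, sensitivity=1.0, specificity=0.99
)
print(f"precision = {precision:.1%}, accuracy = {accuracy:.1%}")
```

Lowering the sensitivity argument below 1.0 while leaving the rest fixed pushes precision under 50%, which is the "lower if sensitivity is lower" case mentioned above.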
