Stuff I tell people about event data

Every few weeks—it’s a low-frequency event with a Poisson distribution, and thus exponentially distributed inter-arrival times—someone contacts me (typically from government, an NGO or a graduate student) who has discovered event data and wants to use it for some project. And I’ve gradually come to realize that there’s a now pretty standard set of pointers that I provide in terms of the “inside story” [1] unavailable in the published literature, which in political science tends to lag current practice by three to five years, and that’s essentially forever in the data science realm. While it would be rare for me to provide this entire list—seven items of course—all of these are potentially relevant if you are just getting into the field, so to save myself some typing in the future, here goes.

(Note, by the way, this is designed to be skimmed, not really read, and I expect to follow this list fairly soon with an updated entry—now available!—on seven priorities in event data research.)

1. Use ICEWS

Now that ICEWS  is available in near real time—updated daily, except when it isn’t— it’s really the only game in town and likely to remain so until the next generation of coding programs comes along (or, alas, its funding runs out).

ICEWS is not perfect:

  • the technology is about five years old now
  • the SERIF/ACCENT coding engine and verb/event dictionaries are proprietary (though they can be licensed for non-commercial use: I’ve been in touch with someone who has successfully done this)
  • the output is in a decidedly non-standard format, but see below
  • sources are not linked to traced back to specific URLs—arrgghhh, why not???
  • the coding scheme is CAMEO, never intended as a general ontology [2], and in a few places—largely to resolve ambiguities in the original—this is defined somewhat differently than the original University of Kansas CAMEO
  • the original DARPA ICEWS project was focused on Asia, and there is definitely still an Asia-centric bias to the news sources
  • due to legal constraints on the funding sources—no, not some dark conspiracy: this restriction dates to the post-Watergate 1970s!—it does not cover the US

But ICEWS has plenty of advantages as well:

  • it provides generally reliable daily updates
  • it has relatively consistent coverage across more than 20 years, though run frequency checks over time, as there as a couple quirks in there, particularly at the beginning of the series
  • it is archived in the universally-available and open-access Dataverse
  • it uses open (and occasionally updated) actor and sector/agent databases
  • there is reasonably decent (and openly accessible) documentation on how it works
  • it was written and refined by a professional programming team at BBN/Raytheon which had substantial resources over a number of years
  • it has excellent coverage across the major international news sources (though again, run some frequency checks: coverage is not completely consistent over time)
  • it has a tolerable false-positive rate

And more specifically, there is at least one large family of academic journals which now accepts event data research—presumably with the exception of studies comparing data sets—only if they are done using ICEWS: if you’ve done the analysis using anything else, you will be asked to re-do it with ICEWS. Save those scripts!

As for the non-standard data format: just use my text_to_CAMEO program to convert the output to something that looks like every other event data set.

The major downside to ICEWS is a lack of guaranteed long-term funding, which is problematic if you plan to rely on it for models intended to be used in the indefinite future. More generally, I don’t think there are plans for further development, beyond periodically updating the actor dictionaries: the BBN/Raytheon team which developed the coder left for greener pastures [3] and while Lockheed (the original ICEWS contractor) is updating the data, as far as I know they aren’t doing anything with the coder. For the present it seems that the ICEWS coder (and CAMEO ontology) are “good enough for government work” and it just is what it is. Which isn’t bad, just that it could be better with newer technology.

2. Don’t use one-a-day filtering

Yes, it seemed like a good idea at the time, around 1995, but it amplifies coding errors (which is to say, false positives): see the discussion in http://eventdata.parusanalytics.com/papers.dir/Schrodt.TAD-NYU.EventData.pdf (pp. 5-7). We need some sort of duplicate filtering, almost certainly based on clustering the original articles at the text level (which, alas, requires access to the texts, so it can’t be done as a post-coding step with the data alone), but the simple one-a-day approach is not it. Note that ICEWS does not use one-a-day filtering.

3. Don’t use the “Goldstein” scale

Which for starters, isn’t the Goldstein scale, which Joshua Goldstein developed in a very ad hoc manner back in the late 1980s [https://www.jstor.org/stable/174480: paywalled of course, this one at $40] for the World Events Interaction Survey (WEIS) ontology. The scale which is now called “Goldstein” is for the CAMEO ontology, and was an equally ad hoc effort initiated around 2002 by a University of Kansas graduate student named Uwe Reising for an M.A. thesis while CAMEO was still under development, primarily by Deborah Gerner and Ömür Yilmaz, and then brought into final form by me, maybe 2005 or so, after CAMEO had been finalized. But it rests entirely on ad hoc decisions: there’s nothing systematic about the development. [4]

The hypothetical argument that people make against using these scales—the WEIS- and CAMEO-based scales are pretty much comparable—is that positive (cooperative) and negative (conflictual) events in a dyad could cancel each other out, and one would see values near zero both in dyads where nothing was happening and in dyads where lots was happening. In fact, that perfectly balanced situation almost never occurs: instead  any violent—that is, material—conflict dominates the scaled time series, and completely lost is any cross-dyad or cross-time variation in verbal behavior—for example negotiations or threats—whether cooperative or conflictual.

The solution, which I think is common in most projects now, is to use “quad counts”: the counts of the events in the categories material-cooperation, verbal-cooperation, verbal-conflict and material-conflict.

4. The PETRARCH-2 coder is only a prototype

The PETRARCH-2 coder (PETR-2) was developed in the summer of 2015 by Clayton Norris, at the time an undergraduate (University of Chicago majoring in linguistics and computer science) intern at Caerus Analytics. [14] It took some of the framework of the PETRARCH-1 (PETR-1) coder, which John Beieler and I had written a year earlier—for example the use of a constituency parse generated by the Stanford CoreNLP system, and the input format and actor dictionaries are identical—but the event coding engine is completely new, and its verb-phrase dictionaries  are a radical simplification of the PETR-1 dictionaries, which were just the older TABARI dictionaries. The theoretical approach underlying the coder and the use of the constituency parse are far more sophisticated than those of the earlier program, and it contains prototypes for some pattern-based extensions such as verb transformations.  I did some additional work on the program a year later which made PETR-2 sufficiently robust as to be able to code a corpus of about twenty-million records without crashing. Even a record consisting of nothing but exam scores for a school somewhere in India.

So far, so good but…PETR-2 is only a prototype, a summer project, not a fully completed coding system! As I understand it, the original hope at Caerus had been to secure funding to get PETR-2 fully operational, on par with the SERIF/ACCENT coder used in ICEWS, but this never happened. So the project was left in limbo on at least the following dimensions

  • While a verb pattern transformation facility exists in PETR-2, it is only partially implemented for a single verb, ABANDON
  • If you get into the code, there are several dead-ends where Norris clearly had intended to do more work but ran out of time
  • There is no systematic test suite, just about seventy more or less random validation cases and a few Python unit-tests [5]
  • The new verb dictionaries and an internal transformation language called pico effectively defines yet-another dialect of CAMEO
  • The radically simplified verb dictionaries have not been subjected to any systematic validation and, for example, there was a bug in dictionaries—I’ve now corrected this on GitHub—which over-coded the CAMEO 03 category
  • The actor dictionaries are still essentially those of TABARI at the end of the ICEWS research phase, ca. 2011

This is not to criticize Norris’s original efforts—it was a summer project by an undergraduate for godsakes!—but the program has not had the long-term vetting that several other programs such as TABARI (and its Java descendent, JABARI [6]) and SERIF/ACCENT have had. [7]

Despite these issues, PETR-2 has been used to produce three major data sets—Cline Phoenix , TERRIER and UT/Dallas Phoenix. All of these could, at least in theory, be recoded at some point since all of these are based on legal copies of the relevant texts [8]

5. But all of these coders generate the same signal: The world according to CAMEO looks pretty much the same using any automated event coder and any global news source

Repeating a point I made in an earlier entry [https://asecondmouse.wordpress.com/2017/02/20/seven-conjectures-on-the-state-of-event-data/] which I simply repeat here with minimal updating as little has changed:

The graph below shows frequencies across the major (two-digit) categories of CAMEO using three different coders, PETRARCH 1 and 2 , and Raytheon/BBN’s ACCENT (from the ICEWS data available on Dataverse) for the year 2014. This also reflects two different news sources: the two PETRARCH cases are LexisNexis; ICEWS/ACCENT is Factiva, though of course there’s a lot of overlap between those.

Basically, “CAMEO-World” looks pretty much the same whichever coder and news source you use: the between-coder variances are completely swamped by the between-category variances. What large differences we do see are probably due to changes in definitions: for example PETR-2 over-coded “express intent to cooperate” (CAMEO 03) due to the aforementioned bug in the verb dictionaries; I’m guessing BBN/ACCENT did a bunch of focused development on IEDs and/or suicide bombings so has a very large spike in “Assault” (18) and they seem to have pretty much defined away the admittedly rather amorphous “Engage in material cooperation” (06).

I think this convergence is due to a combination of three factors:

  1. News source interest, particularly the tendency of news agencies (which all of the event data projects are now getting largely unfiltered) to always produce something, so if the only thing going on in some country on a given day is a sister-city cultural exchange, that will be reported  (hence the preponderance of events in the low categories). Also the age-old “when it bleeds, it leads” accounts for the spike on reports of violence (CAMEO categories 17, 18,19).
  2. In terms of the less frequent categories, the diversity of sources the event data community is using now—as opposed to the 1990s, when the only stories the KEDS and IDEA/PANDA projects coded were from Reuters, which is tightly edited—means that as you try to get more precise language models using parsing (ACCENT and PETR-2), you start missing stories that are written in non-standard English that would be caught by looser systems (PETR-1 and TABARI). Or at least this is true proportionally: on a case-by-case basis, ACCENT could well be getting a lot more stories than PETR-2 (alas, without access to the corpus they are coding, I don’t know) but for whatever reason, once you look at proportions, nothing really changes except where there is a really concentrated effort (e.g. category 18), or changes in definitions (ACCENT on category 06; PETR-2 unintentionally on category 03).
  3. I’m guessing (again, we’d need the ICEWS corpus to check, and that is unavailable due to the usual IP constraints) all of the systems have similar performance in not coding sports stories, wedding announcements, recipes, etc:  I know PETR-1 and PETR-2 have about a 95% agreement on whether a story contains an event, but a much lower agreement on exactly what the event is: again, their verb dictionaries are quite different. The various coding systems probably also have a fairly high agreement at least on the nation-state level of which actors are involved.

6. Quantity is not quality

Which is to say, event data coding is not a task where throwing gigabytes of digital offal at the problem is going to improve results, and we are almost certainly reaching a point where some of the inputs to the models have been deliberately and significantly manipulated. This also compounds the danger of focusing on where the data is most available, which tends to be areas where conflict has occurred in the past and state controls are weak. High levels of false positives are bad and contrary to commonly-held rosy scenarios, duplicate stories aren’t a reflection of importance but rather of convenience, urban, and other biases. But you need the texts to reliably eliminate duplicates.

The so-called web “inversion”—the point where more information on the web is fake than real, which we are either approaching or have already passed—probably marks the end of efforts to develop trigger models—the search for anticipatory needles-in-a-haystack in big data—in contemporary data. That said, a vast collection of texts from prior to the widespread manipulation of electronic news feeds exists (both in the large data aggregators—LexisNexis, Factiva, and ProQuest—and with the source texts held, under unavoidable IP restrictions, by ICEWS, Cline, the University of Oklahoma TERRIER project and presumably the EU JRC) and these are likely to be extremely valuable resources for developing filters which can distinguish real from fake news.

Due to the inversion, particularly when dealing with politically sensitive topics (or rather, topics that are considered sensitive by some group with reasonably good computer skills and an internet connection), social media are probably now a waste of time in terms of analyzing real-world events (they are still, obviously, useful in analyzing how events appear on social media), and likely will provide a systematically distorted signal.

7. There is an open source software singularity (but not the other singularity…)

Because I don’t live in Silicon Valley, some of the stuff coming out of there by the techno-utopians —Ray Kurzweil is the worst, with Peter Thiel (who has fled the Valley) and Elon Musk close seconds, and Thomas Friedman certainly an honorary East Coast participant—seems utterly delusional. Which, in fact, it is, but in my work as a programmer/data scientist I’ve begun to understand where at least some of this is coming from, and that is what I’ve come to call the “software singularity.” This being the fact that code—usually in multiple ever-improving variants—for doing almost anything you want is now available for free and has an effective support community on Stack Overflow: things that once took months now can be done in hours.

Some examples relevant to event data:

  • the newspaper3k library downloads, formats and updates news scrapping in 20 lines of Python
  • requests-HTML can handle downloads even when the content is generated by javascript code
  • universal dependency parses provide about 90% of the information required for event coding [9]
  • easily deployed data visualization dashboards are now too numerous to track [10]

And this is a tiny fraction of the relevant software: for example the vast analytical capabilities of the Python and R statistical and machine learning libraries would have, twenty years ago, cost tens if not hundreds of thousands of dollars (but the comparison is meaningless: the capabilities in these libraries simply didn’t exist at any price) and required hundreds of pounds—or if you prefer, linear-feet—of documentation.

To take newspaper3k as an illustrative example, the task of downloading news articles, even from a dedicated site such as Reuters, Factiva, or LexisNexis (and these are the relatively easy cases) requires hundreds of lines of code—and I spent countless hours over three decades writing and modifying such code variously in Pascal, Simula [11], C, Java, perl, and finally Python—to handle the web pipeline, filtering relevant articles, getting rid of formatting, and extracting relevant fields like the date, headline, and text. With newspaper3k , the task looks pretty much [READ THIS FOOTNOTE!!!] like this:

import newspaper

reut_filter = ["/photo/", "/video", "/health/", "/www.reuters.tv/",
"/jp.reuters.com/",...,  "/es.reuters.com/"] # exclude these

a_paper = newspaper.build("https://www.reuters.com/")
for article in a_paper.articles:
    if "/english/" not in article.url: # section rather than article
        continue
    for li in reut_filter:
        if li in article.url: break
    else
        article.download()
        article.parse()
        with open("reuters_" + article.url + ".txt") as fout:
            fout.write("URL: " + article.url + "\n")
            fout.write("Date: " + str(article.publish_date) + "\n")
            fout.write("Title: " + article.title + "\n")
            fout.write("Text:\n" + article.text + "\n")

An important corollary: The software singularity (and inexpensive web-based collaboration tools) enables development to be done very rapidly with small decentralized “remote” teams rather than the old model of large programming shops. In the software development community in Charlottesville, our CTO group [12] focuses on this as the single greatest current opportunity, and doing it correctly is the single greatest challenge, and I think Gen-Xers and Millennials in academia have also largely learned this: for research at least, the graduate “bull-pen” [13] is now global.

That other singularity?: no, sentient killer robots are not about to take over the world, and you’re going to die someday. Sorry.

A good note to end on.

Reference

Blog entries on event data in rough order of utility/popularity:

and the followup to this:

Footnotes

READ THIS FOOTNOTE!!!: I’ve pulled out the core code here from a working program which is about three times as long—for example it adjusts for the contingency that article.publish_date is sometimes missing—and this example code alone may or may not work. The full program is on GitHub: it definitely works and ran for days without crashing.

1. The working title for this entry was “S**t I tell people about event data.”

2. See the documentation for PLOVER —alas, still essentially another prototype—on problems with using CAMEO as a general coding framework.

3. Though I have heard this involved simply taking jobs with another company working out of the same anonymous Boston-area office park.

4. Around this same time, early 2000s, the VRA project undertook a very large web-based effort using a panel of experts to establish agreed-upon weights for their IDEA event coding ontology, but despite considerable effort they could never get these to converge. In the mid-1990s, I used a genetic algorithm to find optimal weights for a [admittedly somewhat quirky] clustering problem: again, no convergence, and wildly different sets of weights could produce more or less the same results.

5. TABARI, in contrast, has a validation suite—typically referred to as the “Lord of the Rings test suite” since most of the actor vocabulary is based on J.R.R. Tolkien’s masterwork, which didn’t stop a defense contractor from claiming “TABARI doesn’t work” after trying to code contemporary news articles based on a dictionary focused on hobbits, elves, orcs, and wizards—of about 250 records which systematically tests all features of the program as well as some difficult edge cases encountered in the past.

6. Lockheed’s JABARI, while initially just a Java version of TABARI—DARPA, then under the suzerainty of His Most Stable Genius Tony Tether, insisted that Lockheed’s original version duplicate not just the features of TABARI, but also a couple of bugs that were discovered in the conversion—was significantly extended by Lockheed’s ICEWS team, and was in fact an excellent coding program but was abandoned thanks to the usual duplicitous skullduggery that has plagued US defense procurement for decades: when elephants fight, mice get trampled. After witnessing a particularly egregious episode of this, I was in our research center at Kansas and darkly muttered to no one in particular “This is why you should make sure your kids learn Chinese.” To which a newly hired secretary perked up with “Of course my kids are learning Chinese!”

7. I will deal with the issue of UniversalPETRARCH—another partially-finished prototype—in the next entry. But in the meanwhile, note that the event coding engines of these three “PETRARCH” programs are completely distinct; the main thing they share in common is their actor dictionaries.

8. See in particular the Cline Center’s relatively recent “Global News Archive“:  70M unduplicated stories, 100M original, updated daily. The Cline Center has some new research in progress comparing several event data sets: a draft was presented at APSA-18 and a final version is near completion: you can contact them. Also there was a useful article comparing event data sets in Science about two years ago:  http://science.sciencemag.org/content/353/6307/1502

9. 90% in the sense that in my experiments so far, specifically with the proof-of-concept mudflat coder, code sufficient for most of the functionality required for event coding is about 10% the length of comparable code processing a constituency parse or a just doing an internal sparse parse. Since mudflat is just a prototype and edge cases consume lots of code, 90% reduction is probably overly generous, but still, UD parses are pretty close to providing all of the information you need for event coding.

10. Curiously, despite the proliferation of free visualization software, the US projects ICEWS, PITF and UT/D RIDIR never developed public-facing dashboards, compared to the extensive dashboards available at European-based sites such as ACLED, ViEWS, UCDP and EMM NewsBrief.

11. A short-lived simulation language developed at the University of Oslo in the 1960s that is considered the first object-oriented language and had a version which ran on early Macintosh computers that happened to have some good networking routines (the alternative at the time being BASIC). At least I think that’s why I was using it.

12. I’ve been designated an honorary CTO in this group because I’ve managed large projects in the past. And blog about software development. Most of the participants are genuine CTOs managing technology for companies doing millions of dollars of business per year, and were born long after the Beatles broke up.

13. I think this term is general: it refers to large rooms, typically in buildings decades past their intended lifetime dripping with rainwater, asbestos, and mold where graduate students are allocated a desk or table typically used, prior to its acquisition by the university sometime during the Truman administration, for plotting bombing raids against Japan. Resemblance to contemporary and considerably more expensive co-working spaces is anything but coincidental.

14. Norris was selected for this job by an exhaustive international search process consisting of someone in Texas who had once babysat for the lad asking the CEO of Caerus in the Greater Tyson’s Corner Metropolitan Area whether she by chance knew of any summer internship opportunities suitable for someone with his background. 

Advertisements
This entry was posted in Methodology. Bookmark the permalink.

1 Response to Stuff I tell people about event data

  1. Pingback: Seven current challenges in event data | asecondmouse

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s