Two followups, ISA edition

So those of you who follow this blog closely—yes, both of you…—have doubtlessly noticed the not-in-the-least subtle subtext of an earlier entry that something’s coming, and it’s gonna be big, really big, and I can’t wait to say more about it!

Well, finally, finally, the wait is over, with the presentation at the International Studies Association meetings in Montreal [1] of two papers:

Halterman, Andrew, Philip A. Schrodt, Andreas Beger, Benjamin E. Bagozzi, and Grace I. Scarborough. 2023. “Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks.” Working paper presented at the International Studies Association, Montreal, March 2023.

Halterman, Andrew, Benjamin E. Bagozzi, Andreas Beger, Philip A. Schrodt, and Grace I. Scarborough. 2023. “PLOVER and POLECAT: A New Political Event Ontology and Dataset.” Working paper presented at the International Studies Association, Montreal, March 2023.

There are 160 pages of material here,[2] including a nice glossary that defines all of the technical terms and acronyms I’m using here, plus some supplementary code in Github repos: not a complete data generation pipeline but, as I was told once at a meditation retreat, “If you know enough to ask that question, you know enough to find the answer.” And more to the point, we put this together pretty quickly, about eight months from a dead start to the point where the system was capable of producing data comparable to the ICEWS system, and this while creating both a radically new coder and a new ontology that had never been implemented; as noted in the papers, under such circumstances you’d want to refactor the thing anyway. Plus the large language model (LLM) space upon which our system depends is changing unbelievably rapidly right now, so the optimal techniques will change in coming months, if not weeks [or hours: [12]]: tens of billions of dollars are being invested in these approaches right now.

But, you say breathlessly, I’m a lazy sonofabitch, I just want your data! When do I get the data?!?

Good question, and this will be decided at levels far above the pay grade of any of us on the project, to say nothing of the decisions of legions of Gucci-shod lawyers at both private and public levels, and could go in any direction. Maybe the funders will just continue to generate ICEWS, maybe the POLECAT data stays internal to the US government, maybe, as was the pattern with ICEWS, it gradually goes public in near real time [3], maybe released with the backfiles coded to 2010, maybe not: who knows? Mommas, don’t let your babies grow up to be IC subcontractors.

Sigh. But the redeeming feature of which I’m completely confident is the Roger Bannister effect: in track, the four-minute mile stood as an unbroken barrier for decades, until Roger Bannister broke it in 1954. Two months later both Bannister and Australian John Landy ran under four minutes in regular competition. Within a scant ten years, a high school student, Kansan Jim Ryun, had run the mile under four minutes.[4]

A similar story comes from (OMG) Arnold Schwarzenegger on Medium:

For a long time, there was a “limit” on the Olympic lift, the clean and jerk. For decades, nobody ever lifted 500. But then, one of my heroes, Vasily Alekseyev did it. And you know what happened? Six other lifters did it that year.

It’s been done, and having once been done, it can be done again, and better. Event data never catches on, but it never goes away.

The [likely] soon-to-be-fulfilled quest for the IP-free training and validation sets

As the papers indicate at numerous points, in addition to not providing the full pipeline due to intellectual property (IP) ambiguities [5], we also have not provided the training cases due to decidedly unambiguous licensing requirements from the news story providers. This, of course, has been an issue for the development and assessment/replication of automated event data coders from the beginning: the sharing of news stories is generally subject to copyright and/or licensing limitations, even though the coded data are not, nor, of course, are the dictionaries when these are open source, as is true for the TABARI/PETRARCH coders.[6]

But that was then, and we now see light at the end of this tunnel, and it isn’t an on-coming train, it is LLMs. Which should be absolutely perfect for the generation of synthetic news stories which, for training and validation (T/V) purposes, will be indistinguishable from, and in fact likely preferable to, actual stories, and will be both timeless and IP-free. It’s not merely that LLMs are capable of producing realistic yet original texts; producing such text is the entire purpose of LLMs: we’re not on the periphery here, we’re at the absolute core of the technology. A technology in which tens of billions of dollars are currently being invested.

As discussed in the papers, we’ve already begun experimenting with synthetic cases to fill out event-mode types that were rare in our existing data, using the GPT-2 system. The results were mixed: we got about a 30% positive yield, which was far more efficient than the <5% yield (often <1%) we got from the corpus of true stories, but GPT-2 could only generate credible stories out to about two sentences, whereas typical inputs to POLECAT are 4 to 8 sentences, and POLECAT codes at the story level, not at the sentence level used by all previous automated coders that have produced event data used in published work in conflict analysis.[7] GPT-2 also tended to lock in to a few common sentence structures and event-mode descriptions—e.g. protesters attacking police with baseball bats—varying only the actors: after a few of these, additional similar cases were not that useful for training.
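
For the curious, the mechanics of this sort of experiment are not exotic. A minimal sketch along the following lines, using the open-source Hugging Face transformers library, generates candidate continuations from GPT-2; the prompt, sampling parameters, and number of candidates are illustrative, not our production settings.

    # Minimal GPT-2 generation sketch: illustrative prompt and parameters only,
    # not the settings from our actual pipeline.
    from transformers import pipeline, set_seed

    set_seed(42)
    generator = pipeline("text-generation", model="gpt2")

    prompt = ("Protesters clashed with police in the capital on Tuesday after "
              "the government announced new fuel price increases.")

    candidates = generator(prompt, max_new_tokens=120, num_return_sequences=5,
                           do_sample=True, top_p=0.95)
    for c in candidates:
        print(c["generated_text"], "\n---")

In practice this is exactly where the two-sentence limit shows up: the continuations start plausibly and then drift.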

While we’ve not done the experiments (yet), there is every reason to believe GPT-3 (the base model for ChatGPT)—and as of the date of this writing, the rumor mill says Microsoft will release a variant of GPT-4 next week, months earlier than originally anticipated (!)[8][12]—will easily be able to produce credible full stories comparable to those of international news agencies. Based on some limited, and rather esoteric (albeit still in the range of Wikipedia’s knowledge base), experiments I’ve done with ChatGPT, it is capable of producing highly coherent (and factually correct with only minor corrections) text in roughly the range of two detailed PowerPoint slides, and it is very unlikely it would fail at the task of producing short synthetic news articles, given, we note again for emphasis, that word/sentence/paragraph generation is the core capability of LLMs.

So this changes everything, solving two problems at once. First, it addresses the need to get sufficient rare events and corner cases. A current major issue in our system, for example, is distinguishing street protest from legislative and diplomatic protest: the content of the articles outside the word “protest” will clearly be different, but you’ve got to get the examples, which with real cases is labor-intensive. And second, it removes the IP concerns that currently prevent the sharing of these cases.
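
As a hedged illustration of how prompted generation could target exactly that street-vs-legislative-protest distinction, something like the following (using the 2023-era, v0.x openai Python package, with a placeholder API key and an illustrative prompt) would produce batches of candidate stories for subsequent human curation. This is a sketch of the approach, not our production code.

    # Hedged sketch: prompted generation of synthetic legislative-protest stories
    # with the 2023-era (v0.x) openai package; key, model, and prompt are placeholders.
    import openai

    openai.api_key = "YOUR_KEY_HERE"

    PROMPT = ("Write a five-sentence wire-service style news story in which a "
              "parliamentary opposition party formally protests a proposed law. "
              "Do not describe any street demonstrations.")

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.9,
        n=3,  # several candidate stories per prompt, to be human-curated later
    )
    stories = [choice.message.content for choice in response.choices]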

That said, these synthetic cases will still need human curation—LLMs are now notorious for generating textual “hallucinations”—and that’s an effort to which a decentralized community could contribute, and here we have three advantages over the older dictionary/parser systems. First, the level of training required for an individual, particularly someone already reasonably familiar with and interested in political behavior, to curate cases is far lower than that required for developing dictionaries, even if the task remains somewhat tedious. Second, training examples are “forever” and don’t require updating as new parsers are developed, whereas to be fully effective, dictionaries needed to be updated to use information provided by the new parsers.[9] Third, as we discuss at multiple points in the papers, we can readily deploy various commercial and open source “active learning” systems to drastically reduce the cognitive load, while increasing the accuracy and yield, of the curation.
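
To make the “active learning” point concrete, here is a minimal uncertainty-sampling sketch using scikit-learn; the seed cases, features, and review batch size are illustrative, and tools such as Prodigy wrap this same loop in an annotation interface.

    # Minimal uncertainty-sampling sketch: train on the curated seed cases, then
    # send the pool cases the classifier is least sure about to a human curator.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    labeled_texts = ["Protesters marched through the capital demanding lower fuel prices.",
                     "The foreign ministry issued a routine statement on trade talks."]
    labels = [1, 0]                     # 1 = PROTEST, 0 = not (illustrative seed set)
    pool_texts = ["Thousands rallied outside parliament against the new law.",
                  "Lawmakers walked out of the chamber in protest over the budget."]

    vec = TfidfVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(labeled_texts), labels)

    probs = clf.predict_proba(vec.transform(pool_texts))
    uncertainty = 1.0 - probs.max(axis=1)
    to_review = np.argsort(-uncertainty)[:25]   # indices of cases to curate next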

One and done. Really. It’s a big task at the beginning—given that it has over 100 distinct events, modes, and contexts, PLOVER probably needs a corpus of T/V cases numbering in the tens of thousands [14]—but once a set of these effectively defines an accepted and stable version of PLOVER—as the papers indicate, our existing training sets were generated simultaneously with the ongoing refinement of PLOVER, a necessary but by no means ideal situation—that set can hold through multiple generations of coder technology. In this respect, it should be rather like TeX/LaTeX, which originally ran on bulky mainframes and now, with the same core commands, runs on hardware, and produces standardized formats, inconceivable at the time; documents written for the original would still compile, or would do so with routine modifications.

PLOVER, obviously, isn’t as general purpose as LaTeX, but we’d like to think a sufficient community exists to put this together in a year or so of decentralized coordinated effort, ideally with a bit of seed funding from one or more of the usual suspects.

“Hey, what about the long-series and near-real-time data?? I don’t want to contribute any time, talent, or treasure to this effort—that’s for losers! I just want your damn data!!!”

Yeah, yeah, we hear you. Asshole. Once we’ve got a system running—and, by the way, as long as we’ve got versioning (another feature only loosely implemented in many prior event data systems), we can start coding at almost any point where we feel we’ve got reasonably credible T/V sets, rather than waiting until they are fully curated—near-real-time coding is easy: as noted a while back in this blog, with sophisticated open libraries, web scraping (for real-time news stories, in this application) is now so simple that it is used as an introductory exercise in at least one popular on-line Python class. At present, the coding system runs quite well with a single GPU—subsequent implementations could probably make use of multiple GPUs in the internal pipeline, though the near-100% efficiency of “embarrassingly parallel” file splitting is hard to beat—so those just need to be set up and run. And very gradually, a day at a time (which is indeed very gradual…), that does accumulate a long time series; and in any case, since far and away the most common application of event data has been conflict monitoring and fairly short-term forecasting, that’s adequate (at least for operations; model estimation could still be an issue).
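
For a sense of just how simple the scraping side has become, a minimal sketch with the requests and BeautifulSoup libraries looks like the following; the URL and CSS selector are placeholders, and any real ingest would add politeness (robots.txt, rate limiting) and attention to the source’s terms of use.

    # Minimal scraping sketch: placeholder URL and selector, not a real news feed.
    import requests
    from bs4 import BeautifulSoup

    page = requests.get("https://example.com/world-news", timeout=30)
    soup = BeautifulSoup(page.text, "html.parser")

    stories = [{"title": a.get_text(strip=True), "url": a.get("href")}
               for a in soup.select("a.headline")]      # placeholder selector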

Long-term sequences similar to the 1995-2023 ICEWS series on Dataverse are more difficult, due both to the cost of acquiring appropriate rights to some news archive and, per discussions in the papers, the fact that the computational requirements of these LLM-based models are far higher than those of dictionary/parser systems. There are numerous possibilities for resolving this. First, obviously, is just to splice on the existing ICEWS long series, which at least gets the event and mode codings, though not the contexts. Second, academic institutions that have already licensed various long-time-series corpora might be able to run this across those (though given the computational costs, I’d suggest waiting until the T/V set has had a fair amount of curating; if you’ve got access to one of those research machines with hundreds of GPUs, though, the coding could be done quite quickly once you’ve split the files). Finally, maybe some public or private benefactor would fund the appropriate licensing of an existing corpus.
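
The “split the files” step is as unglamorous as it sounds: a sketch like the one below (file names illustrative) chunks a one-record-per-line corpus so that one coder instance, and one GPU, can be run per chunk.

    # Embarrassingly parallel prep: round-robin a JSON-lines corpus into N chunks,
    # one chunk per coder instance/GPU. File names are illustrative.
    N_CHUNKS = 8
    with open("corpus.jsonl") as src:
        outs = [open(f"chunk_{i:02d}.jsonl", "w") for i in range(N_CHUNKS)]
        for i, line in enumerate(src):
            outs[i % N_CHUNKS].write(line)
        for f in outs:
            f.close()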

And then there’s my dream: You want a really long time series, like really, really long: code Wikipedia into PLOVER. Code that unpleasantness between the armies of Ramses II and Muwatalli II at Qadesh in late May 1274 BCE: we actually have pretty good accounts of this.[10] And code every other political interaction in Wikipedia, and that’s a lot of the content of Wikipedia. We can readily download all of the Wikipedia text, and since the PLOVER/POLECAT system uses Wikipedia as its actor dictionary, we’ve got the actors (getting locations may remain problematic, though most historical events are more or less localized to geographical features even if the named urban areas were long ago reduced to tall mounds of wind-blown rocks and mud). The format of Wikipedia differs sufficiently from that of news sources that this would take a fair amount of slogging work, but it’s doable.[11] 
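
As for “readily download all of the Wikipedia text,” that part at least is no exaggeration. A hedged sketch using the Hugging Face datasets library and one of its preprocessed dump snapshots (the snapshot name is illustrative) streams the full article text; the PLOVER coding of it is, of course, the aspirational part.

    # Hedged sketch: stream a preprocessed Wikipedia snapshot; snapshot name illustrative.
    from datasets import load_dataset

    wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

    for article in wiki:
        title, text = article["title"], article["text"]
        # ... segment into story-sized passages and hand them to the coder ...
        break   # just the first article, for illustration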

Footnotes 

1. The actual panel is at 8:00 a.m. on the Saturday morning following St. Patrick’s Day, thus ensuring a large and fully attentive audience [joke]. Whatever. I have a fond memory of being at an ISA in Montreal on St. Patrick’s Day and walking past a large group of rather young people waiting to get into a bar, and a cop telling them “Line up, line up: before you go in I have to look at your fake IDs.”

2. I’ve been out of the academic conference circuit for some years now, but back when I was, major academic organizations such as the ISA and APSA maintained servers for the secure deposit of conference papers, infrastructure which of course nowadays would cost tens of dollars per month on the cloud. For the whole thing, not per paper. But then some Prof. Harold Hill wannabees, who presumably amuse themselves on weekends by cruising in their Teslas snatching the tip jars from seven-year-olds running 25-cent lemonade stands, persuaded a number of these chumps to switch to their groovedelic new “not-for-profit” open resource—a.k.a. lobster trap—and then without warning pulled a switcheroo and took it, and all those papers, proprietary. Is this a great country or what!

So presumably you can get the papers by contacting the authors.

Meanwhile, the wheels of karma move slowly but inexorably: 

[Abrahamic traditions] May the evil paper-snatchers burn forever in the hottest fires of the lowest Hell along with the people who take not-for-profit hospitals private.

[everyone else] May they be reborn as adjunct professors in a wild dystopia where they teach for $5,000 a course at institutions where deans, like Hollywood moguls of old, viewing The Hunger Games as a guide to personnel management, sit in richly paneled rooms snorting lines of cocaine while salivating over the ever-increasing value of their unspent endowments, raising tuition at twice the rate of inflation, and budgeting for ever-increasing cadres of subservient associate deans, assistant deans, deanlets, and deanlings.  But I exaggerate: deans don’t snort coke at these meetings (just the trustees…), and the endorphin surge from untrammeled exercise of arbitrary power would swamp cocaine’s effects in any case.

3. For those who haven’t noticed, ICEWS is currently split across multiple Dataverse repositories, due to the transition of ICEWS production from Lockheed to Leidos. But as of this writing, the most recent ICEWS file on Dataverse is current as of yesterday [13-March-2023] and, FWIW, that’s the same level of currency I have with my contractor’s access to the Leidos server. I also see from Dataverse that these files are getting hundreds of downloads—currently 316 downloads for the data for the first week of the 2023 calendar year—so someone must be finding it interesting.

The inevitable story of automated coding: First they tell you it is impossible, then they tell you it is crap, then they just use it.

4. This paragraph was not written by ChatGPT, but probably could have been. It did, of course, benefit hugely from Wikipedia. I will respect Jim Ryun’s athletic prowess and refrain from commenting on his politics.

5. Why utterly mundane code funded entirely by U.S. taxpayers remains proprietary while billions of dollars—have we mentioned the billions of dollars?—of pathbreaking and exceedingly high quality state-of-the-art software generated by corporations such as Alphabet/Google, Meta/Facebook, Amazon, and Microsoft has been made open source is, well, a great mystery. Though as the periodic discourses in War on the Rocks on the utterly dysfunctional character of US defense procurement note repeatedly, the simple combination of Soviet-style central planning and US-style corporate incentives gets you most of the way: nothing personal, just business.

6. The only open resource I’m aware of that partially gets around this is the “Lord of the Rings” validation set for the TABARI/PETRARCH family, but it is designed merely to test the continuing proper functioning of a parser/coder, not the entire data-generation system, and contains only about 450 records, many of them obscure corner cases, and small subsets of the dictionaries.

As mentioned countless times across the years of this blog, this did not stop a contractor—not BBN— from once “testing” TABARI on current news feeds using these dictionaries and reporting out that “TABARI doesn’t work.” Yes, the Elves and Ring-bearers have departed from the Grey Havens, while Sauron, Saruman, and the orcs of Mordor have been cast down, and the remains of the Shire rest beneath a housing development somewhere in the Cotswolds: the validation dictionaries don’t work.

7. Which is to say, NLP systems from the likes of IBM and BBN and their academic collaborators have experimented with coding at the story level, particularly in the many DARPA data-extraction-from-text competitions, which go back more than three decades. But these systems appear to have largely remained at the research level and never, to my knowledge, produced event data used in publications, at least in conflict analysis (there are doubtlessly published toy examples/evaluations in the computer science literature). Human coders, of course, work at the story level.

8. AKA “let’s kick Google while they are still down…”

9. Or at least that’s how it worked through the evolution of the KEDS/TABARI/PETRARCH-X/ACCENT automated coding systems from 1990 to 2018: some elements of parsing, for example the detection of compound actors, remained more or less the same, but others changed substantially and the dictionaries needed to account for this. For example, even after the PETRARCH series shifted to the external parsers provided by the Stanford CoreNLP project, there was an additional, fairly radical shift in the parsing—never fully implemented in a coder—from constituency parsing to dependency parsing. ACCENT almost certainly—the event dictionaries have never been open-sourced—used parsing information based on decades of NLP experience within BBN and made modifications as their parsers improved.
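
For readers who haven’t looked at parser output recently, a minimal illustration of the dependency-parse view (here with spaCy rather than Stanford CoreNLP, and assuming its small English model is installed): the subject, verb, and object fall straight out of the dependency labels, which is the sort of information the dictionary-based coders had to be tuned to exploit.

    # Minimal dependency-parse illustration with spaCy; requires en_core_web_sm.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Government forces shelled the rebel-held city on Monday.")

    for tok in doc:
        if tok.dep_ in ("nsubj", "ROOT", "dobj"):
            print(tok.text, tok.dep_, "<-", tok.head.text)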

10. The Egyptians pretty much got their collective butts kicked and narrowly escaped a complete military disaster, with the area remaining under the control of the Hittites. Ramses II returned home and commissioned countless monuments extolling his great victory: some things never change.

11. Then move further to my next dream: take the Wikipedia codings (or heck, any sufficiently large event series) and apply exactly the LLM masking training and attention models (or whatever is next down the line: these are rapidly developing) to the dyadic event sequences, thereby solving the long-standing chronology-generator problem and creating purely event-driven predictive models: PLOVER coding effectively “chunks” Wikipedia into politically-meaningful segments that are far more compact than the original text. The required technology and algorithms are all in place (if not completely off-the-shelf…) and available as open source.
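
A sketch of just the serialization step, which is the easy part: turn a dyadic event stream into token “sentences” that standard masked-language-model tooling could then be trained on. The records and vocabulary here are purely illustrative.

    # Illustrative only: serialize dyadic PLOVER-style events into token sequences.
    events = [
        {"date": "2023-03-01", "source": "RUS", "target": "UKR", "event": "ASSAULT"},
        {"date": "2023-03-02", "source": "UKR", "target": "USA", "event": "CONSULT"},
    ]

    tokens = [f"{e['source']}_{e['event']}_{e['target']}" for e in events]
    sequence = " ".join(tokens)   # e.g. "RUS_ASSAULT_UKR UKR_CONSULT_USA"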

12. [Takes a short break and opens the Washington Post…] WTF, GPT-4 is getting released today. Albeit by OpenAI, which is only sort of Microsoft. Stepping out ahead of the rumor mill, I suppose. But fundamentally, [8]. And at the very same time when the “Most read…” story in the WP concerns Meta [13] laying off another 10,000 employees…cruel they are, tech giants. The article also notes, scathingly, that GPT-3 is “an older generation of technology that hasn’t been cutting-edge for more than a year.”…oh, a whole year…silly us…

13. Hey, naming your corporation after a feature in a thoroughly dystopian novel (and genre): how’s that working for you? At least when Steve Jobs released the Macintosh in 1984 he mocked, rather than glorified, the world of the corresponding novel. Besides, we’ve had a metaverse for fully two decades: it’s called Second Life and remains a commercially viable, if decidedly niche, application. Some bits managed remotely here in Charlottesville.

14. As noted in the papers, we’re currently working with training sets that aim for a total of around 500 cases per category, more or less balanced between positives and negatives (which may or may not be a good idea, and the representativeness of our negative cases probably needs some work). Given the high false-positive rates we’re getting, that may be insufficient, at least for the transformer-based models (but there are only 16 of these: the better-understood SVMs seem to be satisfactory for modes and contexts, though we still need to fill out some of the rarer modes). Using the fact that we can probably safely re-use some cases in multiple sets—in particular, all of the positive mode cases also need to correspond to a positive on their associated event, which provides considerably greater coverage for some of the events likely to be of greatest interest and/or frequency, notably CONSULT, PROTEST, COERCE, and ASSAULT—that’s roughly 40,000 to 50,000 cases. But these are relatively easy to code, requiring just a true/false decision.
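
For what it’s worth, the back-of-the-envelope arithmetic behind that 40,000-to-50,000 figure looks roughly like this; the split of the “over 100” categories and the re-use discount are illustrative assumptions, not numbers from the papers.

    # Illustrative case-budget arithmetic; category split and discount are assumptions.
    events, modes, contexts = 16, 50, 40          # roughly "over 100" categories in total
    cases_per_category = 500                      # balanced positives and negatives
    raw_total = (events + modes + contexts) * cases_per_category   # 53,000
    reuse_discount = 0.15                         # positive mode cases doubling as event positives
    print(round(raw_total * (1 - reuse_discount)))                 # ~45,000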

Validation cases are much more complex, requiring correct answers for all of the coded components of the story, which can be extensive given that POLECAT typically generates multiple events from its full-story coding, and each of these can have multiple entities (actor, recipient, location), and those, in turn, can have multiple components (albeit these generally are simply derived from Wikipedia and Geonames). Initially these need to be generated from the source stories—we have multiple custom platforms for doing this—but eventually, once the system has been properly seeded and is working most of the time, most of the work can be done by confirming the automated annotations and correcting only the codings that are in error. Nonetheless, this is a much slower and more cognitively taxing task than simply verifying training cases.
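
As a purely hypothetical illustration of what one such full-story validation record might contain (the field names here are illustrative, not the actual POLECAT schema):

    # Hypothetical full-story validation record; field names are illustrative.
    validation_record = {
        "story_id": "synthetic-000123",
        "text": "...",                              # the (synthetic) news story itself
        "events": [
            {
                "event": "PROTEST",
                "mode": "demonstrate",
                "contexts": ["economic"],
                "actor": {"name": "protesters", "country": "FRA"},
                "recipient": {"name": "Government of France", "country": "FRA"},
                "location": {"name": "Paris", "geonames_id": 2988507},
            },
        ],
    }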

How many validation cases do we need? Well, how many can you provide? But realistically, with 100 or so positive cases for each type of event, perhaps fewer for some of the more distinct modes, which are easy to code, and with a general set of perhaps 2,000 null cases, 10,000 to 12,000 validation cases would probably be a useful start, and that’s sufficient to embed a lot of corner cases.

That said, “active learning” components make both of these processes far more efficient than dictionary development, and in some instances (notably the assignment of contexts) these converge after just a couple of estimation iterations (or, in the case of the commercial Prodigy program, its ongoing evaluation) to a situation where most of the assignments are correct.

This also lends itself very well to decentralized development, which is particularly important given that curators/annotators tend to burn out pretty quickly on the exercise. This decentralization goes back to the ancient days, ca. 1990, of the first automated event data coder dictionary development, which was shared between our small KEDS team in Kansas and Doug Bond’s PANDA project at Harvard. In the current environment, tools, procedures, and norms for decentralized work are far more developed, and this should be relatively straightforward.
