[Okay, okay, I'm clearly in the "he just doesn't get it" camp on the "post brief/post often" aspect of blogging—and Twitter is beyond the pale for me—so sorry, yet again you're just going to have to use the vertical scroll bar.]
GDELT continues to get attention, and there is an article forthcoming on it in New Scientist. But as people get beyond the level of cool visualizations, some concerns are arising as well, particularly on the issue of the high level of false positives and noise. These are completely legitimate, but before they are used as excuses to ignore GDELT and just keep throwing taxpayer dollars down proprietary rat-holes, here are some points to consider, which I should have posted weeks ago.
GDELT is a disruptive innovation
Harvard Business School’s Clayton M. Christensen has, in an assortment of articles and books, explored the issue of “disruptive innovation”:
“Generally, disruptive innovations were technologically straightforward, consisting of off-the-shelf components put together in a product architecture that was often simpler than prior approaches. They offered less of what customers in established markets wanted and so could rarely be initially employed there. They offered a different package of attributes valued only in emerging markets remote from, and unimportant to, the mainstream.” [link to source]
Observe that this is a nearly perfect description of GDELT. The components consist largely of existing software: web-scraping, the 12-year-old [NSF-funded] open-source TABARI coder and its open-source dictionaries, servers and databases, Geonames and other geospatial resources, R and various other open-source packages for analysis and visualization. This is not to diminish the role that Kalev Leetaru had in properly assembling all of this, nor his insights in writing ancillary programs to significantly extend the capabilities of the existing systems, but in the end, this was the project of a single graduate student, not a multi-million-dollar investment. And massively complex, multi-million-dollar systems are precisely what GDELT could supplant. It will, of course, be rejected in the stratospheric precincts of the established users: it isn’t perfect [or costly] enough, and the massive sunk costs of the proprietary alternatives must be justified. But it opens a huge new “market” to users who could not break through the “hide and hoard” barriers of the existing near-real-time datasets, and is simple enough to be adapted as well as adopted. And let’s not forget the cost of acquisition: zero. 
Political forecasting has a different set of signal/noise issues than many engineering problems
People seem to be registering that properly using event data involves dealing with a signal-to-noise issue, but I sense that they don’t fully appreciate how the characteristics of those problems in political forecasting are different from those of many, if not most, engineering problems.
Just to choose an example at random, let’s take the problem of discerning signal from noise in a sonar system that is trying to determine whether there is a submarine in the vicinity. This poses remarkable technological challenges and can only be solved with a great deal of expertise, but compared to a political forecasting problem, one aspect is very simple: the submarine is either there, or it is not.
Contrast this to a very real [qualitative] forecasting issue from a few months back: on 2 October 2012, Israeli Prime Minister Netanyahu was reported to be pursuing increased sanctions against Iran, in contrast to constant rumors over the previous six months that Israel would attack Iran’s nuclear facilities.
Does this mean:
- Netanyahu changed his mind and decided sanctions are an effective approach
- Netanyahu concluded Obama would be re-elected [international considerations]
- The Israeli military finally persuaded Netanyahu that an attack was a bad idea [domestic considerations]
- Israel was going to attack Iran in the near future [deception strategy]
Furthermore, was there even a single answer to the question of Israeli plans at, say, a three-month lead time?  Any analyst could think of several dozen low-probability contingencies that could modify that probability. The “submarine” is not either “there” or “not there”, but rather exists in a probabilistic haze of possibilities, at least some of which are very low probability black swans.
Yes, messy it is, but it’s been this way for a very long time: The debates between Alcibiades and Nicias on the wisdom of invading Sicily—very much a forecasting problem—vary little from the debates [or absence of debate] 2,400 years later between Rumsfeld/Wolfowitz and Mearsheimer/Walt on the wisdom of invading Iraq, and Big Data isn’t going to change that. As the [presumably apocryphal] exchange at the Congress of Vienna went:
- Aide: Your excellency, the Russian Ambassador has just died!
- Prince von Metternich: Fascinating…now, what were his intentions?
A full discussion of this can—and has—filled many books, and not all of them by Nassim Nicholas Taleb, but the point here is the contrast with a sonar problem: there, if you reduced the noise to zero—for example, if you found some [fantasy] method by which water became as transparent as air—you would reduce your uncertainty to zero. In the political forecasting problem, by contrast, the uncertainty at realistic forecasting horizons never goes to zero, and in fact in both PITF and ICEWS it seems to stabilize pretty consistently around 20% for a wide variety of problems and methods.
But this, in turn, means that reducing the measurement error in the event data to arbitrarily small levels is not going to reduce the uncertainty of the forecast to arbitrarily small levels. To the contrary, as we’ve known for at least a decade, event indicators can be hugely simplified—typically with Goldstein scaling or quad counts—with little or no discernible loss in the predictive accuracy of the models.
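To make the quad-count simplification concrete, here is a minimal sketch: the mapping of CAMEO root codes into the four quad categories follows the usual convention, but the exact cut points vary a bit across studies, and the sample events are hypothetical.

```python
from collections import Counter

def quad_category(cameo_code):
    """Map a CAMEO event code (e.g. '043' or '190') to one of the four quad categories."""
    root = int(cameo_code[:2])          # two-digit CAMEO root code, 01..20
    if root <= 5:
        return "verbal cooperation"
    elif root <= 9:
        return "material cooperation"
    elif root <= 13:
        return "verbal conflict"
    else:
        return "material conflict"

def quad_counts(events):
    """events: iterable of (source, target, cameo_code) tuples for one dyad-period."""
    return Counter(quad_category(code) for _, _, code in events)

# hypothetical ISR -> IRN dyad-month
sample = [("ISR", "IRN", "042"), ("ISR", "IRN", "112"), ("ISR", "IRN", "138")]
print(quad_counts(sample))
# Counter({'verbal conflict': 2, 'verbal cooperation': 1})
```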
Furthermore, since both the genesis and virtually all of the applications of event data have been in forecasting, this has been built into the analytical approaches: once the data are “good enough”, it makes much more sense to invest in improving the models rather than in an infinite pursuit of greater precision in the measurement, as that will not contribute to the eventual objective.
Remember the origins of the dictionaries
The event data coding of GDELT used off-the-shelf technology (the geolocation coding was new and original), and in particular it used very general coding dictionaries. The verbs dictionary was the [NSF-funded] Kansas CAMEO dictionary, which had largely been developed to code events in the Middle East (with subsidiary development for the Balkans and West Africa, but never beyond that). CAMEO itself was originally designed as a specialized coding system to study mediation, and much of the detailed [NSF-funded] dictionary development was done with this objective, not for the study of all political behavior. The fact that a code exists in CAMEO does not mean that it has been thoroughly instantiated in the dictionaries. Finally, the actor dictionaries are from the general CountryInfo file—originally developed for the filtering programs used by the [NSF-funded] Militarized International Disputes project—supplemented with an assortment of NGO and MNC lists obtained from the Web, and a new WordNet-based [NSF-funded] agents dictionary.
Despite this re-use, the overall package works pretty well, but is certainly not all that could be done…bringing us to…
GDELT is a beta, not the limit of the technology
GDELT introduced several new features—beyond sheer scale—into event data coding. The obvious one is geolocation, but it also treated common nouns (“agents”) differently than in the past, did full-story coding, was the first major dataset based on a CountryInfo dictionary, used more sophisticated pre-processing than we had done previously, used a date-shifting feature of TABARI that has not been extensively tested, and used geo-specific duplicate filtering. These should all be considered experiments, not definitive answers: I think most of these worked but, for example, I remain skeptical about full-story coding, and as several people have noticed, there are some very odd agent-based codes in there, as TABARI assembled these on the fly based on some fairly simple rules.
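As an illustration of what geo-specific duplicate filtering amounts to, here is a rough sketch; GDELT’s actual de-duplication rules aren’t spelled out here, so the record fields and the definition of “duplicate” are my assumptions.

```python
def deduplicate(events):
    """Keep one record per (day, source, target, event code, lat, lon) key.
    The field names here are assumptions, not GDELT's actual schema."""
    seen = set()
    unique = []
    for ev in events:
        key = (ev["day"], ev["source"], ev["target"],
               ev["code"], ev["lat"], ev["lon"])
        if key not in seen:
            seen.add(key)
            unique.append(ev)
    return unique
```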
Beyond this, there are some additional enhancements developed in the course of the ICEWS project which are conceptually available in public sources, which could be incorporated, and which would certainly reduce the false positive problem (which was also a very big issue at some phases of ICEWS): these include filtering on certain actor/action combinations (NGOs and journalists rarely engage in material violence, and this is a common source of coding error); thorough filtering of historical, entertainment, and sports stories; and special treatment of certain “poison words”—mostly negatives—that are easily misinterpreted. Existing NSF-funded work at Penn State has also produced, though not quite fully implemented, a complete reorganization of the verbs dictionaries based on WordNet synonym sets.
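A minimal sketch of those post-coding filters, assuming each record carries an actor-role field and the source sentence is available; the role codes, code ranges, and word list are illustrative, and the real treatment of “poison words” would be subtler than simply dropping the sentence.

```python
MATERIAL_CONFLICT = range(14, 21)          # CAMEO root codes 14..20
NONVIOLENT_ROLES = {"NGO", "JRN"}          # NGOs and journalists (role codes assumed)
POISON_WORDS = {"not", "never", "denied"}  # easily misread negatives (illustrative)

def keep_event(event, sentence):
    """Return False for codings the filters described above would reject."""
    root = int(event["code"][:2])
    # an NGO or journalist coded as the source of material violence is
    # usually a coding error, not a real event
    if root in MATERIAL_CONFLICT and event["source_role"] in NONVIOLENT_ROLES:
        return False
    # a real system would treat negation more carefully than simply
    # dropping the sentence, but this shows the idea
    if POISON_WORDS & set(sentence.lower().split()):
        return False
    return True
```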
All of which is to say GDELT 2.0 (and 3.0…) will almost certainly have fewer false positives than GDELT 1.0. Because we’ve got more development coming down the pike…
Skating just ahead of the Coburn Amendment, Mike Ward, Jay Ulfelder and I have received funding from NSF for a project titled “Multiple Attribute Data Coded On Web”—MADCOW—which is assembling a series of tools to generate several web-based data sets, including a near-real-time event data set which will second-source GDELT, probably using slightly different technology.
MADCOW is funding the development of a new, Python-based coder which will replace TABARI and which will feature:
- an extensible parser based on the Python nltk package (see the toy sketch following this list)
- a far richer, JSON-formatted dictionary structure
- hooks for the incorporation of additional packages for feature extraction (for example, the topics and size of demonstrations)
- partially automated verb dictionary development, as well as named-entity-recognition systems in MADCOW more generally
- a design intended from the outset for distributed processing of very large datasets
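To give a flavor of the nltk-plus-JSON approach, here is a toy sketch; the dictionary entries, the matching logic, and the crude stemming are invented for illustration and say nothing about the actual dictionary format the new coder will use.

```python
import json
from nltk import pos_tag, word_tokenize   # requires the standard nltk data downloads

# A toy JSON-formatted verb dictionary; real entries would carry patterns,
# synonym sets, actor restrictions, and so on.
VERB_DICT = json.loads("""
{
  "meet":   {"cameo": "040"},
  "accuse": {"cameo": "112"},
  "attack": {"cameo": "190"}
}
""")

def code_sentence(sentence):
    """Return the CAMEO codes of any dictionary verbs found in the sentence."""
    codes = []
    for word, tag in pos_tag(word_tokenize(sentence)):
        if tag.startswith("VB"):
            lemma = word.lower().rstrip("des")   # crude stemming, good enough for the sketch
            if lemma in VERB_DICT:
                codes.append(VERB_DICT[lemma]["cameo"])
    return codes

print(code_sentence("Rebels attacked a government convoy near the border."))
# ['190']
```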
This will, of course, be open-source, and unlike TABARI, it will be hosted on GitHub: we will put out an announcement when we are ready for contributors beyond our initial core of developers. We also expect the Python code to be far easier to understand, maintain and extend than TABARI’s C++ codebase. Our current expectation is that this will be available for at least experimental use by the end of the summer.
MADCOW is also going to be hosting platforms for various aspects of crowd-sourced data validation, and we are hoping this can be used to refine the dictionaries. Need financial, criminal, and natural disaster categories in CAMEO? We’re adding those as well.
I know, people don’t read manuals, or much of anything, any more, but indulge me for a moment and at least consider looking at the following materials before treating GDELT (and automated coding methods generally) as though they were artifacts dredged out of a cave somewhere accompanied only by a few pottery shards with cryptic inscriptions in early Etruscan.
- the 200-page TABARI manual, with two chapters plus an appendix on general aspects of machine coding and dictionary development. It is not just a guide to menu and project file options (though it has that as well)
- the 200-page CAMEO manual, which likewise has extended discussions of why CAMEO has been constructed as it has
- a book-length manuscript on event data coding, Analyzing International Event Data: it’s from around 2000 and the later chapters are a bit dated, but I updated the first three chapters last year
- a history of the event data project, which will give you some sense of how all of this material came into being
and if you still won’t read, here’s a two-hour video (with thanks to Indiana University).
Though if you are inclined to read (or, in a project, have someone else read), there’s a lot of material here.
In the end, open source will win
Returning to the original conundrum, where do you put your chips?
- GDELT, with open-source software, dictionaries, and costs, an emerging technical community, but, like all disruptive innovations, a bit rough around the edges, and we still don’t fully understand how to use it effectively;
- The multi-million-dollar proprietary systems—assuming you can get access to them—with none of the above, except that, let us be honest, their analytical community is also only gradually learning how to use the data and, in many cases, is trying to learn social science modeling as it goes along.
We know how this story ends: the open source community will win. The basic technology works—and in fact it may already work as well as it needs to for forecasting, if not for monitoring—and it merely needs to be further refined and applied in a larger number of domains. But—Wile E. Coyote ten feet beyond the end of the cliff—you can sustain proprietary alternatives for a long, long time provided you’ve got unimaginable gobs of money, and the U.S. government is under no constraints whatsoever in that domain, right? Just don’t look down.
1. I had a brief discussion with someone associated with one of the government-supported projects who held out a glimmer of hope about the demise of GDELT: “So, how long do you have funds to sustain those updates?” My response: “We’ve got some NSF funding, and the marginal cost of updating the data is zero. Take a positive number, subtract N times zero from it, and the value of N where that goes negative is when we will be forced to shut down due to costs.”
Okay, I wasn’t quite that clever. And I get a monthly bill from Linode for $19.95 [!] to host some cloud resources we’re using. Drat…have to cut back on my Starbucks habit to keep this going!
2. I’d say it was a combination of the second and third factors.
3. In particular, GDELT makes no use of any ICEWS developments: it is entirely open-source components, largely accumulated through various NSF-funded projects. ICEWS has very extensive actor dictionaries to which it would be nice to get unrestricted access, though at the point when I was last associated with the project, it had not made any significant changes in the verbs dictionary.
4. Yeah, we will miss a few events involving Greenpeace as an NGO and Hunter S. Thompson as a journalist.
5. And in any case, with partial funding from Methodology, Measurement and Statistics, which thus far is not in the sights of Coburn and Flake, though probably soon will be, as it funds people who do politically relevant analysis using tools more sophisticated than Excel.
6. Blame Ward, not me, for the acronym.
7. Pointers…love ’em, but when they go five levels deep, the code is getting a bit obtuse. Though TABARI coded the 200 million records of GDELT without crashing.
9. Jay Ulfelder’s issue of using event data for monitoring, on the other hand, is closer to the submarine problem, with the difference that as long as we are depending on open sources (and in much of the world, that’s all we’ve got: intelligence resources are finite, despite what you see in the movies), the sources contain irreducible levels of error. Though we can certainly do better than we are currently doing on the false positives in GDELT.
That said, having spent much of the past month human coding—yes, I do human coding as well—atrocity reports (killings of five or more noncombatants in a single incident) rather than writing blog posts, I’ve noticed that the distinctive aspect of these events is that they only rarely generate a single report: instead, they usually generate a cluster of reports, and then a cluster of reactions to the report. A “bolt out of the blue” dyadic interaction of material conflict which does not generate any sort of followup, usually within a 24-hour period, is almost certainly incorrectly coded. Additional enhancements to the dictionaries—which at the CAMEO stage were developed to code mediation (we had coded conflict in the earlier dictionaries for WEIS, but WEIS had only a single violent conflict code)—would probably also help.
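As a rough sketch of how that “no follow-up” heuristic could be applied to coded data (the field names, the 24-hour window, and the material-conflict code range are all illustrative assumptions):

```python
from datetime import timedelta

def flag_isolated(events, window=timedelta(hours=24)):
    """events: list of dicts with 'time' (datetime), 'source', 'target', and
    'code' keys; returns the material-conflict events with no other report
    on the same dyad within the window."""
    suspicious = []
    for ev in events:
        if int(ev["code"][:2]) < 14:           # material conflict only (roots 14..20)
            continue
        dyad = {ev["source"], ev["target"]}
        has_followup = any(
            other is not ev
            and {other["source"], other["target"]} == dyad
            and abs(other["time"] - ev["time"]) <= window
            for other in events
        )
        if not has_followup:
            suspicious.append(ev)
    return suspicious
```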
10. One had the sense many of the pundits were so certain this would happen that they were listening on their cell phones for the sounds of Israeli bombers starting their engines.
11. Open source solutions are, famously, “free as in puppy.” But the marginal costs are very close to zero.