Entropy, Data Generating Processes and Event Data

Or more precisely, the Santa Fe Institute, Erin Simpson, and, well, event data. With a bit of evolutionary seasoning from Robert Wright, who is my current walking-commute listening.

Before we get going, let me make completely clear that there are perhaps ten people—if that—on this entire planet who will gain anything from reading this—particularly the footnotes, and particularly footnote 12—and probably only half of them will, and as for everyone else: this isn’t the blog you are looking for, you can go about your business, move along. Really: this isn’t even inside baseball, it’s inside curling.[1] TL;DR!

This blog is inspired by a series of synchronistic events over the past few days, during which I spent an appallingly large period—if during inclement weather—going through about 4,000 randomly-selected sentences to ascertain the degree to which some computer programs could accurately classify these into one of about 20 categories. Yes, welcome to my life.

The first 3,000 of these were from a corpus of Reuters lede sentences from 1979-2015 which are one of the sources for the Levant event data set which I’ve been maintaining for over a quarter-century. While the programs—both experimental—were hardly flawless, overall they were producing, as their multiple predecessors had produced, credible results, and even the classification errors were generally for sentences that contained political content.

I then switched to another more recent news corpus from 2017 which, while not ICEWS, was not dissimilar to ICEWS in the sense of encompassing a massive number of sources, essentially a data dump of the world’s English-language press. This resulted in a near total meltdown of the coding system, with most of the sentences themselves, never mind the codings, bordering on nonsense so far as meaningful political content was concerned. O…M…F…G… But, here and there, a nice clean little coding poked its little event datum head out of the detritus, as if to say “Hey, look, we’re still here!”, rather as seedlings sprout with the first rainfall in the wake of a forest fire.

So what gives? Synchronicity being what it is, up pops a link to Erin Simpson’s Monktoberfest talk where Dr. Simpson, as ever, pounds away at the importance of understanding the data generating process (DGP) before one blithely dives into any swamp presenting itself as “big data.” Particularly when that data is in the national security realm. Having wallowed in the coding of 4,000 randomly sequenced news ledes, particularly the last largely incoherent 1,000, her presentation resulted in an immediate AHA!: the difference I observed is accounted for, almost totally, by the fact that international news sources such as Reuters [2] have an almost completely different DGP than that of the local sources.

Specifically, the international sources, since the advent of modern international journalism around the middle of the 19th century [3] have fundamentally served one purpose: providing people who have considerable money with information that they can use to get even more money. Yes, there are some positive externalities attached to this in terms of culture and literature, but reviews of how a Puccini opera was received at La Scala didn’t pay the bills: information valuable in predicting future events did.

This objective conveniently aligns the DGP of the wire services quite precisely with the interests of [the half-dozen or so…] consumers of political event data, since in the applied realm event data is used almost exclusively for forecasting.

Now, turning our attention to local papers in the waning years of the second decade of the 21st century: O…M…F…G: these can be—and almost certainly are—pretty much absolutely anything except—via the process of competitive economic exclusion—they aren’t international wire services.[4] In the industrialized world, start with the massive economic changes that the internet has wrought on their old business models, leading to two decades of nearly continuous decline in staffing and bureaus. In the industrializing world, the curve goes the other way, with the internet (and cell phone networks) enabling access and outreach never before possible. Which can be a good thing—more on this below—but is not necessarily a good thing, as there is no guarantee of either the focus or, most certainly, the stability of these sources. The core point, however, is that the DGP of local sources is fundamentally different than the DGP of international sources.[5]

So, different DGPs, yeah, got that, in fact I had you at “Erin Simpson”, right? But what’s with the “entropy” stuff?

Well, I’m now really going to go weird—well, SFI/complexity theory weird, which is, well, pretty weird—on you here, so again, you’ve been warned, and thus you probably just want to break off here, and go read something about chief executives and dementia or somesuch. But if you are going to continue…

Last summer there was an article which despite including the phrase “Theory of reality” in the title—this is generally a signal to dive for the ditches—got me thinking—and by the way, I am probably about to utterly and completely distort the intent of the authors, who are likely a whole lot smarter than me even if they don’t realize that one should never, ever, under any circumstances put the phrase “theory of reality” on anything other than a Valentine’s Day candy heart or inside a fortune cookie…I digress…—on their concept of “effective information”:

With [Larissa] Albantakis and [Giulio] Tononi (both neuroscientists at Univ of Wisconsin-Madison), [Erik] Hoel (Columbia neuroscience)[6] formalized a measure of causal power called “effective information,” which indicates how effectively a particular state influences the future state of a system. … The researchers showed that in simple models of neural networks, the amount of effective information increases as you coarse-grain over the neurons in the network—that is, treat groups of them as single units. The possible states of these interlinked units form a causal structure, where transitions between states can be mathematically modeled using so-called Markov chains.[7] At a certain macroscopic scale, effective information peaks: This is the scale at which states of the system have the most causal power, predicting future states in the most reliable, effective manner. Coarse-grain further, and you start to lose important details about the system’s causal structure. Tononi and colleagues hypothesize that the scale of peak causation should correspond, in the brain, to the scale of conscious decisions; based on brain imaging studies, Albantakis guesses that this might happen at the scale of neuronal microcolumns, which consist of around 100 neurons.
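
For the ten readers still here, a minimal sketch of the quantity in that quote. This is an illustration under stated assumptions, not the authors' code: effective information is computed as the average KL divergence between each state's transition distribution and the effect distribution obtained when every state is set with equal probability. The toy eight-state chain, the grouping, and the names `effective_information` and `coarse_grain` are all hypothetical.

```python
from math import log2

def effective_information(tpm):
    """Effective information (bits) of a row-stochastic transition matrix."""
    n = len(tpm)
    # Effect distribution when each of the n states is set with probability 1/n
    avg = [sum(row[j] for row in tpm) / n for j in range(n)]
    ei = 0.0
    for row in tpm:
        # KL divergence of this state's effects from the average effects
        ei += sum(p * log2(p / q) for p, q in zip(row, avg) if p > 0) / n
    return ei

def coarse_grain(tpm, groups):
    """Lump micro states into macro states, averaging uniformly within a group."""
    return [[sum(tpm[i][j] for i in g for j in h) / len(g) for h in groups]
            for g in groups]

# Toy chain (assumption): states 0-6 hop uniformly among themselves,
# state 7 always stays put.
micro = [[1 / 7] * 7 + [0.0] for _ in range(7)] + [[0.0] * 7 + [1.0]]
macro = coarse_grain(micro, [list(range(7)), [7]])

ei_micro = effective_information(micro)
ei_macro = effective_information(macro)
print(round(ei_micro, 3), round(ei_macro, 3))
```

On this toy chain, lumping the seven interchangeable states into one macro state raises effective information from about 0.54 bits to 1 bit, which is the coarse-graining effect the quote describes. Whether anything like this carries over to organizations is, of course, the speculation that follows.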

With this block quote, we are moving from OMFG to your more basic WTF: why oh why would one have the slightest interest in this, or what oh what does this possibly have to do with event data???

Author pauses to take a long drink of Santa Fe Institute Kool-Aid…

The argument just presented is at the neural level but hey, self-organization is self-organization, right? Let’s just bump things up to the organizational level and suddenly we have answers to two (or three) puzzles:

  • why is the content of wire service news reports relatively stable across decades?
  • why can that information be used by both [some] humans and [most] machines to predict political events at a consistently high level?
  • why are reductionist approaches to modeling organizational behavior doomed to fail? [8]

My version of the Hoel-Albantakis-Tononi hypothesis is that there is a point in organizational structure where organizations, assuming they are under selective pressure, will settle on a scale (and mechanisms) which maximizes—or at least goes to some local maximum on an evolutionary landscape—the tradeoff between the utility of predictive power and the cost of information required to maintain that level of predictability. While the sorts of political organizations which are the focus of event data forecasts have costly private information, the international media provide the inexpensive shared public information that sustains this system. In particular, we can readily demonstrate through a variety of statistical and/or machine-learning models (or human “super-forecasters”) that this public information alone is sufficient to predict most of these political behaviors, typically to 80% to 85% accuracy. [9] Information to get the remaining 15% to 20%, however, is not going to be found in local sources (with one exception noted below) and, as I’ve argued elsewhere (for years…) most of the remaining random error is irreducible due to a set of about eight fundamental sources of uncertainty that will be found in any human (or, presumably, non-human) political system.[10][30]

In order to survive, organizations must be able to predict the consequences of their actions and the future state of the world with sufficient accuracy that they can take actions in the present that will have implications for their well-being far into the future: the feed-forward problem: Check. [11] You need information of some sort to do this: Check. Information is costly and information-collection detracts from other tasks required for the maintenance and perpetuation of the organization: Check.[12] Therefore, in the presence of noise and systems which are open and stochastic, a point is reached—and probably with a lot less information than we think we need [13]—where information at a disaggregated scale is more expensive than the benefits it provides for forecasting. QED. [14]

Take ICEWS. Please… The DARPA-sponsored research phase, not the operational version, such as it is. Consider the [in]famous ICEWS kick-off meeting in the fall of 2007, where the training data were breathlessly unveiled along with the fabulously difficult evaluation metrics for the prediction problems, vetted by no less than His Very Stable Genius Tony Tether.[15] Every social scientist in the room skipped the afternoon booze-up with the prime contractor staff, went back to their hotel rooms with their laptops and by, say, about 7 p.m. had auto-regressive models far exceeding the accuracy of the evaluation metrics. Can we just have the money and go home now? The subsequent history of the predictive modeling in ICEWS—leaving aside the fact that the Political Instability Task Force (PITF) modeling groups had already solved essentially the same problems five years earlier—was one of the social scientists finding straightforward models which passed the metrics, Tether and his minions (abetted, of course, by the prime contractors, who did not want the problems solved: there is no funding for solved problems) imposing additional and ever-more ludicrous constraints, and then new models developed which got around even those constraints.

But this only worked for the [perfectly plausible] ICEWS “events-of-interest,” which were at a very predictable nation-year scale. The ICEWS approach could be (and in the PITF research, to some degree has been) scaled downwards quite a bit, probably to geographical scales on the order of a typical province and temporal scales on the order of a month, but there is a point where the models will not scale down any further and, critically, adding the additional information available in local news sources will not reliably improve forecasting once the geo-temporal scale is below the level where organizations have optimized the Hoel-Albantakis-Tononi effective information. Below the Hoel-Albantakis-Tononi limit, more is not better: more is just noise, because organizations aren’t paying attention to this information and consequently it is having no effect on their future behaviors.

And in the case of event data, a particular type of noise. [Again, I warned you, we’re doing inside curling here.] There are basically three approaches to generating event data:

  • Codebook-based (used by human coders, and thus irrelevant to real-time coding)
  • Pattern-based (used by the various dictionary-based coding programs that are currently responsible for all of the existing data sets)
  • Example-based (used by the various machine-learning systems currently under development, though none, to my knowledge, currently produce operational data)

While at present there is great furor—mind you, among not a whole lot of frogs in not a particularly large pond—as to whether the pattern-based or example-based approaches will prove more effective [16], this turns out to be irrelevant to this issue of noise: Both the pattern-based and example-based systems, alas, are subject to the same weakness in generating noise [17] as each generates false positives when they encounter something in a politically-irrelevant news context that sufficiently resembles something from the politically-relevant cases they were originally developed to code that it triggers the production of an event. As more and more local data—which is almost but not quite always irrelevant—is thrown into the system, the false positive rate soars.[18][19]
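
A back-of-the-envelope sketch of that last point, with made-up numbers: hold the number of politically relevant stories fixed, assume the coder wrongly fires on a small fixed fraction of irrelevant stories, and watch the share of correct events fall as irrelevant stories are added. The `precision` function and every rate in it are hypothetical, not measurements from any actual coding system.

```python
def precision(relevant, irrelevant, recall=0.8, fire_rate=0.05):
    """Share of generated events that come from politically relevant stories.

    relevant, irrelevant: story counts.
    recall: fraction of relevant stories that are correctly coded.
    fire_rate: fraction of irrelevant stories that wrongly trigger an event.
    All default values are illustrative assumptions.
    """
    true_pos = relevant * recall
    false_pos = irrelevant * fire_rate
    return true_pos / (true_pos + false_pos)

# 1,000 relevant stories, with a growing pile of irrelevant local stories
for n_irrelevant in (0, 1_000, 10_000, 100_000):
    print(n_irrelevant, round(precision(1_000, n_irrelevant), 3))
# precision column: 1.0, 0.941, 0.615, 0.138
```

Even a coder that misfires on only one irrelevant story in twenty ends up producing mostly junk once the irrelevant stories outnumber the relevant ones by a hundred to one.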

CAVEAT: Yes, as I keep promising, there is a key caveat here: For a variety of reasons, most importantly institutional lag, language differences, and areas with a high risk and low interest (for example Darfur, South Sudan, or Mexican and Central American gang violence) the coverage of the international news sources is not uniform, and there are some places where local coverage can fill in gaps. Two places where I think this has been (or will be once the non-English coding systems come on-line) quite important are the “cell phone journalism” coverage of violence in Nigeria and southern Somalia, and Spanish and Portuguese language coverage in Latin America.[20] But by far, the vast bulk of the local sources currently used to generate event data do not have this characteristic.

Whew…so you’ve made it this far, what are the practical implications of this screed? I see five:

First, the contemporary mega-source data sets are a combination of two populations with radically different DGPs: the “thick head” of international sources, most of which are coded tolerably well by the techniques which, by and large, were originally developed for international sources, and the “thin tail” of local sources, which are generally neither coded particularly well, nor particularly relevant even when coded correctly.[21]

Second, as noted earlier, in event data, more is not necessarily better. “More” may be relatively harmless—well, for the consumers of the data; it remains at least somewhat costly to the producers [22]—when the models involve just central tendency (the Central Limit Theorem is, as ever, our friend) and the false positives are pretty much randomly distributed.[23] Models sensitive to measurement error, heterogeneous samples, and variance in the error terms—for example most regression-based approaches—are likely to experience problems.

Third—sorry DARPA [24]—naive reductionism is not the answer, nor is blindly throwing more machine cycles and mindless data-dumps at a problem. Any problem.[25] Figuring out the scale and the content of the effective information is important [26], and this requires substantive knowledge. Some aspects of the effective information problem are what political and organizational theories have been dealing with for at least a century. Might think about taking a look at that sort of thing, eh? Trust in the Data Fairy alone has, once again, proven woefully misplaced.

Fourth, keep in mind my CAVEAT! above: it is not the case that all local data are useless. But it is almost certainly the case that because the DGPs differ so greatly between contemporary local sources and international sources, it is very likely that separate customized coding protocols will be needed for these, at the very least well-tested filters to eliminate irrelevant texts and in many cases customized patterns/training cases. That said, the effective information scale can vary by process, and if, for example, one is focused on a localized conflict (say Boko Haram or al-Shabab) some of those sources could be quite useful, again possibly with customization. But the vast bulk of local sources are just generating noise.[28]

Finally, don’t listen to me: experiment! Most of the issues I’ve raised here can be readily tested in your approach of choice using existing event sequences: for your problem of choice, go out and actively test the extent to which the exclusion of local sources (or specific local sources) does or does not affect your results. And please publish these in some venue with a lag time of less than five years! [27]


1. As with everything in this blog, these opinions are mine and mine alone and no one I have ever worked for or with directly or indirectly anytime now or in the past or future bears any responsibility for them. And that includes Mrs. Chapman whose lawn I mowed when I was in seventh grade.

2. And the other major news wires such as Xinhua, Agence France-Presse, assorted BBC monitoring services and the Associated Press, but that’s pretty much the list.

3. This largely coincided with the proliferation of telegraph connections, though older precedents existed in the age of sail once a sufficiently independent business class—and weakening of state control of communications—existed to sustain it. 

4. Just two or three decades ago, “newspapers of record” such as the Times of London and the New York Times served much the same role that the international wire services do today by focusing on international coverage for a political elite, using their vast networks of foreign correspondents, proverbially gin- and/or -whiskey-soaked Graham Greene wannabees hanging out in the bars of cheap hotels convenient to the Ministry of Information of—hey, Trump’s making me do this, I can’t miss my one chance to use this word!—shithole countries. Or colonies. Those days are long gone, though for this same reason historical time series data based on the NYT such as that produced by the Cline Center may be quite useful.

5. In contrast, your typical breathlessly hyped commercial big data project involves data generated by a relatively uniform DGP: people looking at, and then eventually buying [or not] products on Amazon are, well, all looking at, and then eventually making a decision about, products on Amazon, and they are also mostly people (Amazon presumably having learned to filter out the price-comparison bots and deal with them separately). Actually, except for the human-bot distinction, it is hard to think of a comparable case in data science where the data generators are as divergent as a Reuters editor and the reporters for a small city newspaper in Sumatra. Unless it is the difference between the DGP of news media, even local, and the DGP of social media…

6. Affiliations provided to indicate that the authors’ qualifications go beyond “Part-time cannabis sales rep living in parents’ basement in Ft. Collins and in top 20 percentile of Star Wars: Battlefront II players.” Which would be the credentials of the typical author of a discourse on “theory of reality.”

7. Otherwise known as “Markov chains.”

8. This last is a puzzle only if you’ve had to sit through multiple agonizingly stupid meetings over the years with people who believe otherwise.

9. Thanks to the work of Tetlock and his colleagues, the breadth of this accuracy across a wide variety of political domains is far more systematically established for human superforecasters than it is for statistical and machine-learning methods, but I’m fairly confident this will generalize.

10. Dunbar’s number is probably another example of this in social groups generally. I’m currently involved in a voluntary organization whose size historically was well below Dunbar’s number and consequently was run quite informally, but is now beginning to push up against it. The effects are, well, interesting.

11. For an extended discourse on counter-examples, see this. Which for reasons I do not understand is the most consistently viewed posting on this blog.

12. Descending first into the footnotes, then into three interesting additional rather deep rabbit holes on this:

  1. I’m phrasing this in terms of “organizations,” which have institutionalized information-processing structures. But certainly we are also interested in mass mobilization, where information processing exists but is much more diffuse (and in particular, is almost certainly influenced more by networks than formal rules, though some of those networks, e.g. in families, are sufficiently common that they might as well be rules). I think the core argument for a prediction-maximizing scale is still relevant—in mass mobilization in authoritarian systems the selection pressures are very intense but even non-authoritarian mobilizations have the potential costs of collective action problems—but they are likely to be different than the scale for organizations, as well as differing with the scale of the action. That said, the incentives for the international wire services to collect this information remain the same, and the combination of [frequent] anonymity and correspondents not being dependent on local elites for a paycheck (the extent to which this is true varies widely and has certainly changed in recent years with the introduction of the internet) may result in these international sources being considerably more accurate than local sources. Local media sources which are under the control of local political and/or economic elites may actually be at their least informative when the conditions in an area are ripe for mass mobilization. [29]
  2. An interesting corollary is that liberal democracies have an advantage in being generally robust against the manipulation of information—the rise of fascist groups in the inter-war period in the 20th century and, possibly, recent Russian manipulation of European and US elections through social media are possible exceptions—and consequently they don’t incur the costs of controlling information. This is in contrast to most authoritarian regimes, and specifically the rather major example of China, which spends a great deal of effort doing this, presumably due to a [quite probably well-founded] fear by its elites that the system is not robust against uncontrolled information. Even if the Chinese authorities can economically afford this control—heck, they can afford the bullet trains the US seems totally incapable of creating—this suggests a brittleness to the system which is not trivial. Particularly in light of a rapidly changing information environment. Much the same can probably be said of Saudi Arabia and Russia.
  3. A really deep rabbit hole here, but given the fact that the support for the international news media is very diffuse, what we probably see here is essentially a generally stable [Nash? product-possibility-frontier? co-evolution landscape?] equilibrium between information producers and consumers where the costs and benefits of supply and demand end up with a situation where the organizations (both public and private) can work fairly well with the available information and the producers have learned to focus on providing information that will be useful. From the perspective of political event data, for example, the changes between the WEIS and COPDAB ontologies from the 1960s and the 2016 PLOVER ontology—all focusing on activities relevant to forecasting political conflict—are relatively minor compared to the total scope of human behaviors: political event ontologies have consistently used on the order of 10¹ primary categories, whereas ontologies covering all human behaviors tend to have on the order of 10². Furthermore, except for the introduction of idiomatic expressions like “ethnic cleansing” and “IED”, vocabulary developed for articles from the 1980s still works well for articles in the 2010s (and works vastly better than trying to cross levels of scale from international to local sources). Organizations, particularly those associated with large nation states, will of course have information beyond those public sources—this is the whole point of intelligence collection—but opinions vary widely—wow, do they ever vary widely!—as to the utility of such information at the “policy relevant forecasting interval” of 6 to 24 months. Meanwhile, given its level of decentralization, the system that has depended on this information ecosystem is phenomenally stable compared to any other period in human history.

13. In the early “AI” work in human-crafted “expert systems” in the 1980s, “knowledge engineers”—a job title ensuring generous compensation at the time, rather like “data scientist” today—generally found that if an expert said they needed some information that couldn’t be objectively measured, but they knew it by “intuitive feelings” or something equivalent, when the models were finally constructed and tested, it turned out these variables were not needed: the required classification information was contained in variables that could be measured. The positive interpretation of this is that the sub-cognitive “intuition” was in fact integrating these other factors through a process similar to, well, maybe neural networks? Ya-think? The negative interpretation is that the individuals were trying to preserve their jobs.

14. With a bit more work, one could probably align this—and the Hoel-Albantakis-Tononi approach more generally—with Hayek’s conceptualization of markets as information processing systems, and certainly the organizational approach is consistent with Hayek’s critique of central planning. Even if Hayek is probably the second-most ill-used thinker in contemporary so-called conservative discourse, after Machiavelli (and followed by Madison). Seriously. I digress.

15. Tether, of course, was the model for Snoke in The Last Jedi, presiding over DARPA seated on a throne of skulls in a cavernous torch-lit room surrounded by black-robed guandao-armed guards. His successor at DARPA, of course, was the model for Miranda Priestly in The Devil Wears Prada.

16. The answer, needless to say, is that hybrid approaches using both will be best. Part of the reason I annotated 4000 sentences over the weekend and am planning to do a lot more on a couple of upcoming transoceanic flights.

17. This is similar to the argument that all machine-learning systems are effectively using the same technique—partitioning very high dimensional spaces—and therefore, allowing for similar levels of complexity, will have similar levels of accuracy, particularly out of sample.

18. Even worse: the coding of direct quotations, where the DGP varies not just with the reporter but with the speaker. These are the perfect storm for computer-assisted self-deception: as social animals, we have evolved to consider direct quotations to be the single most important source of information we can possibly have, and thus our primate brains are absolutely screaming: “Code the quotations! Code the comments! Code the affect! General Inquirer could do this in the 1960s using punch cards, why aren’t you coding comments???”

But our primate brains evolved to interpret quotations deeply embedded in a social context that, during most of evolution, usually involved individuals in a small group with whom we’d spent our entire lives. A rather different situation than trying to interpret quotations first non-randomly selected, then often as not translated, out of a barely-known set of circumstances—possibly including “paraphrased or simply made up”—that were spoken by an unknown individual operating in a culture and context we may understand only vaguely. And that’s before we get to the issues of machine coding. “Friends don’t let friends code quotations.”

For this reason, by the way, PLOVER has eliminated the comment-based categories found in CAMEO, COPDAB, and WEIS.

19. Okay, it’s a bit more complicated: the false positive rate is obviously going to depend on the tradeoff a given coding system has made between precision and recall, and a system that was really optimized for precision could probably avoid this issue, or at least dramatically reduce it. But none of the systems I’m familiar with have done that.

20. Raising another split in the event data community, whether machine-translation combined with English-language coders will be sufficient or whether native-language coders are needed. Fortunately, once the appropriate multiple-language coding programs are available, this can be resolved empirically.

21. See this table which was generated from a sample of ICEWS sources from 2016. Of the sources which could be determined, 37.5% are from 10 international sources, and fully 27.2% from just four sources: Xinhua, Reuters, AFP and BBC. Incongruously, three Indian sources account for another 15.2%, then past this “thick head” we go to a “thin tail” of 372 sources accounting for the remaining 47.3% of the known sources (with an additional 25.8% of the total cases being unidentified: these could either be obscure local sources or garbled headers on the international sources, which occurs more frequently than one might expect).

22. Rather like bitcoin mining. I actually checked into the possibility that one could use bitcoin mining computers—which I suspect at some point will be flooding the market in vast quantities—to do, well, something, anything that could enhance the production or use of event data. Nope: they are specialized and optimized to such a degree that “door stop” and “boat anchor” seem to be about the only options.

23. They may or may not be: With a sufficiently representative set of gold standard records—which more than fifty years into event data coding we still don’t have—this becomes an empirical question. My own guess is that they are randomly distributed to a larger degree than one might expect, at least for pattern-based coders.

24. Yeah, right… “I feel a great disturbance in the Force, as if millions of agent-based-models suddenly cried out in terror and were suddenly silenced, their machine cycles repurposed to mining bitcoin. I fear something, well, really great, has happened. Except we all know it won’t.”

25. Sharon Weinberger’s The Imagineers of War (2017)—coming soon to the Virginia Festival of the Book!—is a pretty sobering assessment of DARPA’s listless intellectual drift in the post-Cold War period, particularly in dealing with prediction problems and anything involving human behavior. Though its record on that front during the Vietnam War also left a bit to be desired. Also see this advice from the developer of spaCy.

26. “Data mining” may identify useful surrogate indicators that provide substitutes for other more complex and less-accessible data: PITF’s discovery of the robustness of infant mortality rate as a surrogate for economic development and state capacity is probably the best example of this. These, however, are quite rare, and tend to be structural rather than dynamic. The fate of the supposed correlation between Google searches for flu symptoms and subsequent flu outbreaks (and a zillion other post-hoc correlations discovered via data mining) is a useful cautionary tale. Not to mention the number of such correlations that appear to be urban legends.

27. I attempted to get a reference from the American Political Science Review for a colleague a couple days ago and found that their web site doesn’t even allow one to access the table of contents without paying for a membership! Cut these suckers off: NSF (and tenure committees) should not allow any references or publications that are not open access. Jeff Flake is rapidly becoming my hero. And I hope Francis Bacon is haunting their dreams in revenge for their subversion of the concept of “science.”

Addendum: Shortly after writing this I was at a fairly high level meeting in Europe discussing the prospects for developing yet another quantitative conflict early warning system, and got into an extended discussion with a couple of quite intelligent, technically knowledgeable and diligent fellows who, alas, had been trying to learn about the state of the art of political science applications of machine-learning techniques by reading “major” political science journals. And were generally appalled at what they found: an almost uninterrupted series of thoroughly dumbed-down—recalling Dave Barry’s observation that every addition you make to a group of ten-year-old boys drops their effective IQ by ten points, that’s the effect of peer review these days—logistic regressions with 20 highly correlated “controls” and even these articles only available—well, paywalled—five years after the original research was done. So I tried to explain that there is plenty of state-of-the-art work going on in political science, but it’s mostly at the smaller specialized conferences like Political Methodology and Text-As-Data, though some of it will make it into a very small number of methodologically sophisticated journals like Political Analysis and Political Science Research and Methods. But if you are trying to impress people who are genuinely sympathetic to quantitative methods using the contents of the “major” journals, you’ll find that’s equivalent to demonstrating you can snare rats and cook them on a stick over an open fire, and using this as evidence that you should work in a kitchen that carries a Michelin star.

Alas, from the perspective of the typical department chair/head, dean, associate dean, assistant dean, deanlet and deanling, I suppose there is a certain rationale to encouraging this sort of thing, as it makes your faculty far less attractive for alternative employment, and maybe the explanation for the persistence of this problem is no more complicated than that. Though the 35% placement rate in political science may be an unfortunate side-effect. If it is indeed a side-effect. Another “side effect” may be the precipitous decline in public support for research universities.

Again, whoever is in charge of this circus needs to stop supporting the existing journal system, insist on publications which have roughly contemporaneous and open access—if people want to demonstrate their incompetence, let them do so where all can see, and the sooner the better—and let the folks currently trying to run journals get back to their core competency, managing urban real estate.

28. Also, as has been noted from the dawn of event data, “local” sources are incorporated into the international news services, as these depend heavily on local “stringers,” often among the most well-connected and politically savvy individuals in their region, and not necessarily full-time journalists. I was once in a conversation with the husband of a visiting academic from Morocco and asked what he thought about The Economist’s coverage of that country. He gave me a quizzical look and then said “You’ll have to ask someone else: I am The Economist’s correspondent for Morocco.”

29. In rare circumstances, this can be a signal: in 1979 the Soviet media source Pravda went suddenly quiet on the topic of Afghanistan a week or so before the invasion after having covered the country with increasing alarm for several months. This sort of thing, however, requires both stupidity and very tight editorial control, and I doubt that it is a common occurrence. At least the tight editorial control part.

30. Addendum (which probably deserves expansion into its own essay at some point): There’s an interesting confirmation of this from an [almost] entirely different domain in an article in Science (359:6373 19 Jan 2018, pg. 263; the original research is reported in Science Advances 10.1126/sciadv.aao5580 (2018)) which found that similar results on predicting criminal recidivism could be obtained from

  • A proprietary 137-variable black-box system costing $22,000 a year
  • Humans recruited from Mechanical Turk and provided with 7 variables
  • A two-variable regression model

It turns out that for this problem, there is a widely-recognized “speed limit” on accuracy of around 70%—the various methods in this specific study are a bit below that, particularly the non-expert humans—and, as with conflict forecasting, multiple methods can achieve this.
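To make the “speed limit” idea concrete, here is a minimal sketch in Python. It is not taken from the Science Advances study: it simply assumes that each case has a true outcome probability p, and that even a forecaster who knows p exactly can do no better than pick the more likely outcome. The uniform range for p is an assumption chosen purely so the ceiling lands near 70%, and `bayes_accuracy` and `simulate` are hypothetical helper names.

```python
import random


def bayes_accuracy(probs):
    """Best achievable accuracy when every case's true probability is known."""
    return sum(max(p, 1 - p) for p in probs) / len(probs)


def simulate(probs, rng):
    """Draw one outcome per case and score the oracle rule 'predict 1 iff p >= 0.5'."""
    hits = 0
    for p in probs:
        outcome = rng.random() < p
        prediction = p >= 0.5
        hits += int(outcome == prediction)
    return hits / len(probs)


rng = random.Random(0)
# ASSUMPTION: true probabilities spread uniformly over [0.1, 0.9];
# this range is illustrative and is not estimated from any real data.
probs = [rng.uniform(0.1, 0.9) for _ in range(20000)]

limit = bayes_accuracy(probs)    # theoretical ceiling, close to 0.70 here
observed = simulate(probs, rng)  # what the all-knowing forecaster actually scores

print(f"speed limit: {limit:.3f}  oracle accuracy: {observed:.3f}")
```

Under these assumptions no amount of extra variables or model complexity can push accuracy past the ceiling, because the remaining error is in the outcomes themselves, not in the model.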

On reading this, I realize that there is effectively a “PITF predictive modeling approach” which evolved over the quarter-century of that organization’s existence:

  • Accumulate a large number of variables and exhaustively explore combinations of these using a variety of statistical and machine-learning approaches: this establishes the out-of-sample “speed limit”
  • The “speed limit” should be similar to the accuracy of human “super-forecasters”
  • Construct operational models with “speed limit” performance using very simple sets of variables—typically fewer than five—using the most robustly measured of the relevant independent variables
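The first and third steps above can be sketched roughly as follows. This is a toy under loudly stated assumptions, not PITF’s actual procedure: the data are synthetic (only variables 0 and 1 carry signal), a nearest-centroid classifier stands in for the “variety of statistical and machine-learning approaches,” and the names `make_data`, `holdout_accuracy` and `tolerance` are hypothetical. The comparison with human super-forecasters in the second step is not represented.

```python
import itertools
import math
import random


def make_data(n, n_vars, rng):
    """Synthetic cases (ASSUMPTION): variables 0 and 1 matter, the rest are noise."""
    rows, labels = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(n_vars)]
        p = 1 / (1 + math.exp(-(1.2 * x[0] + 0.8 * x[1])))
        rows.append(x)
        labels.append(1 if rng.random() < p else 0)
    return rows, labels


def centroid(rows, labels, cls, cols):
    """Mean of the selected columns over the cases in one class."""
    members = [r for r, y in zip(rows, labels) if y == cls]
    return [sum(r[c] for r in members) / len(members) for c in cols]


def holdout_accuracy(cols, train, test):
    """Fit a nearest-centroid classifier on train, score it out-of-sample on test."""
    (xtr, ytr), (xte, yte) = train, test
    c0 = centroid(xtr, ytr, 0, cols)
    c1 = centroid(xtr, ytr, 1, cols)
    hits = 0
    for r, y in zip(xte, yte):
        d0 = sum((r[c] - m) ** 2 for c, m in zip(cols, c0))
        d1 = sum((r[c] - m) ** 2 for c, m in zip(cols, c1))
        hits += int((1 if d1 < d0 else 0) == y)
    return hits / len(yte)


rng = random.Random(42)
n_vars = 6
train = make_data(1000, n_vars, rng)
test = make_data(1000, n_vars, rng)

# Step 1: exhaustively score every non-empty subset of variables.
scores = {}
for k in range(1, n_vars + 1):
    for cols in itertools.combinations(range(n_vars), k):
        scores[cols] = holdout_accuracy(cols, train, test)
speed_limit = max(scores.values())

# Step 3: the smallest subset whose accuracy is within `tolerance` of the limit.
tolerance = 0.02
candidates = [c for c, s in scores.items() if s >= speed_limit - tolerance]
simple = min(candidates, key=lambda c: (len(c), -scores[c]))

print(f"speed limit: {speed_limit:.3f}")
print(f"simple model: variables {simple}, accuracy {scores[simple]:.3f}")
```

Taking the maximum over many subsets on a single hold-out set gives a slightly optimistic limit; a real exercise would presumably use cross-validation or a further held-out sample before trusting the number.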

This is, of course, quite a different approach than the modeling gigantism practiced by the organization-that-shall-not-be-named under the guidance of the clueless charlatans who have quite consistently been directing it down fruitless blind alleys—sort of a reverse Daniel Boone—for decades. Leaving aside the apparent principle that these folks don’t want to see the problems solved—there are no further contracting funds for a solved problem—I believe there are two take-aways here:

  • Anyone who is promising arbitrarily high levels of accuracy is either a fool or is getting ready to skin you. If government funding is involved, almost certainly the latter. There are “speed limits” to predictive accuracy in every open complex system.
  • Anyone who is trying to sell a black-boxed predictive model based on its complexity and data requirements is also either a fool or is getting ready to skin you: everything in our experience shows that simple models are the most robust models over the long term.

7 Responses to Entropy, Data Generating Processes and Event Data

  1. Good Reason says:

    “this requires substantive knowledge”

    I think that hits the nail on the head right there.

    Tangentially, it also explains why my FPA students this past semester seemed positively giddy to find themselves in an IR theory class that actually incorporated substantive knowledge as part of the analytic task.

  2. Jedburgh says:

    You remain stunningly awesome and hilarious at levels that defy description.

    Scott Longman

    CAS ’87.

  3. Renee Marlin-Bennett says:

    This is very cool, Phil. (Does that make me one of the 10?) Even though I decided early in my career that (a) I really did not want to spend a lot of time learning sophisticated math and (b) I did not want to spend the years testing boring hypotheses using what Hayward referred to as “stupid inferential statistics,” I have an appreciation for machine learning, complexity, and the like. And of course, I have a fondness for events data. (Yeah, “fondness” just about captures it. Reminds me of Ed Azar and my days cleaning COPDAB, which reminds me of Hayward & my days cleaning FACS (and of Frank & SherFACS) which, though not structured as events data per se, was generated using narratives that captured events…. ) Anyway, I have an inkling that what you have written here is very important for the theory building work I am doing on power. (Because there’s insufficient work out there on the nature of power? What am I thinking. Oh well. I have tenure.) Without saying that other approaches to defining power are wrong, I focus on the utility of defining power in terms of the ability to control the flow of information. Not assuming that A and B already exist (a known A trying to influence a known B), and focusing instead on information flow and control of that flow, allows the analyst to see an unexpected actor (Snowden, for example) as it becomes meaningful politically. There are other things this conceptualization is useful for, too. Parsing out how this works is my current task. I have published a couple articles, written a few conference papers, and now I’m writing the book. I have written about being able to extract information, access information, project it, and guard against receiving it as ways that agents emerge and engage in instances of power. What I have not yet worked out is the way having access to the right quantity and kind of information matters. That’s what your post reminded me of.

    So collecting events data and analyzing events data for fairly robust predictions become micro-instances of power. (This actually fits especially well with my chapter on surveillance.)

    I just realized that I’m writing a comment that is almost as long as your post. That’s obnoxious. I will stop now. Cheers!

    • schrodt735 says:

      The control of information could explain some of the “power of small actors” issues. E.g. in a well-run guerrilla campaign, the guerrillas should have an asymmetrical advantage of information (and if they lose this by alienating the population, they are typically defeated). The Netherlands vs Philip II of Spain might be another example, and Metternich (and his balance-of-power successors in the UK) another. Mongols probably also played this one pretty well. I think there might have been some writing on this sort of thing by the early international relations systems theorists back in the 1950s or 1960s, though nothing comes immediately to mind.

  4. Renee Marlin-Bennett says:

    Indeed, though I’m loath to see this in terms of control of info as yet another power resource. Rather, I’m thinking of a different perspective (not actor centric, but information flow centric). To switch metaphors: Think of the difference between studying nodes and seeing how they are connected by edges vs looking for edges and seeing if the edges lead you to identify nodes. But these are very neat examples, and Deutsch in the 1950s is relevant. Anyway, there were a lot of things to think about in your post. Thanks.

  5. Pingback: Stuff I tell people about event data | asecondmouse

  6. Pingback: Seven current challenges in event data | asecondmouse
