Seven thoughts on neural network transformers

If an elderly but distinguished scientist says that something is possible, he is almost certainly right; but if he says that it is impossible, he is very probably wrong.
Arthur C. Clarke. (1962)[1]

So, been a while, eh: last entry was posted here in March-2020—yes, the March-2020: how many things do we now date from March-2020 and probably will indefinitely?—when, like, everyone was suddenly doing remote work, which I’d been doing for six years. But, well, the remote-work revolution has taken on a life of its own—though my oh my do I enjoy watching those butts-on-seats “managers” [2] squirming upon the discovery that teams are actually more productive in the absence of their baleful gaze, random interruptions, and office political games, to say nothing of not spending three hours a day commuting—so no need for further contributions on that topic. Same on politics: so much shit flying through the air right now and making everyone thoroughly miserable that the world doesn’t need any more from me. At the moment…

And I was busy, or as busy as I wanted to be, and occasionally a bit busier, on some projects, the most important of which is the backdrop here. This is going to be an odd entry as I’m “burying the lead” on a lot of details, though I expect all of these to come out at some point in the future, generally with co-authors, in an assortment of open access media. But I’m not sure they will, and meanwhile things in the “space” are moving very rapidly and, as I’m writing this, are sort of in the headlines, such as this, this, this, this, and this [5-Aug-22: hits keep on coming…], so I’m going ahead.

So y’all are just going to have to trust that I’ve been following this stuff, and have gained a lot of direct experience, over the past year, as well as tracking it in assorted specialized newsletters, particularly TheSequence, which in turn links to a lot of developments in industry. However, as usual, my comments are going to be primarily directed to political science applications. Also see this groveling apology for length. [3]

So, what the heck is a transformer?? [4]

These are the massive neural networks you’ve been reading about in applications that range from the revolutionary to the utterly inane. While the field of computing is subject to periodic—okay, continuous—waves of hype, the past five years have seen a genuine technical revolution in the area of machine learning for natural language processing (NLP). This is summarized in the presumably apocryphal story that when Google saw the results of the first systematic test of a new NLP translation system, they dumped 5-million lines of carefully-crafted code developed at huge expense over a decade, and replaced it with 5,000 lines of neural network configuration code.

I’ve referred to this, with varying levels of skepticism, as a plausible path for political models for several years now, but the key defining attributes in the 2022 environment are the following:

  • These things are absolutely huge, with the current state of the art involving billions of parameters (a quick way of counting these for yourself is sketched after this list); they require weeks to estimate, and their estimation is beyond the capabilities of any organization except large corporations using vast quantities of specialized hardware.
  • Once a model has been estimated, however, it can be fine-tuned for a very wide variety of specific applications using relatively small amounts of additional computation—still hours or even days—and small numbers of training cases.
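For a concrete sense of scale, the parameter count of any pre-trained model can be tallied directly. A minimal sketch, assuming the HuggingFace transformers library and PyTorch are installed:

    from transformers import AutoModel

    # Count the trainable parameters of a (comparatively small!) model.
    model = AutoModel.from_pretrained("distilbert-base-uncased")
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params / 1e6:.0f}M parameters")  # on the order of 66M for distilBERT

Swap in a larger model name and the count climbs accordingly; the billion-parameter systems simply won't fit on most personal machines.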

The technology, however, is probably—probably—still in flux, though it has been argued that the basis is in place [5], and we’re now entering a period where the practical applications will be fleshed out. Which is to say, we’re entering a Model T phase: the basic technology is here and accessible, but the infrastructure hasn’t caught up with it, and thus we are just beginning to see the adaptations that will occur as secondary consequences.

The most widely used current models in the NLP realm appear to be Google’s 300-million parameter BERT and the more compact 100-million parameter distilBERT. However, new transformers with ever larger neural networks and training vocabularies are coming on-line with great frequency: the most advanced current model, OpenAI’s GPT-3 (funded, apparently to the tune of “billions of dollars”, by Microsoft), was trained on around 400-billion words and is thought to have cost millions of dollars just to estimate, but it is too large for practical use. China’s recent Yuan 1.0 system was trained on 5 terabytes of Chinese text and, with almost 250-billion estimated parameters in the network, is roughly 40% larger than the GPT-3 network. So the observations here can be seen as a starting point and not, by any means, the final capabilities, though we’re also at the limits of hardware implementations except for organizations with very high levels of resources. And all of these comparisons will be outdated by the time most of you are reading this.

So on to seven observations about these things relevant to political modeling.

1. Having been trained on Wikipedia, transformer base models have the long-sought “common sense” about political behavior.

This feature, ironically, occurred sort of by accident: the developers of these things wanted vast amounts of reasonably coherent text to train their language models, and inadvertently also ingested political knowledge. But having done so, this allows political models to exploit an odd, almost eerie, property called “zero-shot classification”: generalizing well beyond the training data. As one recent discussion [citation misplaced…] phrased this:

Arguably, one of the biggest mysteries of contemporary machine learning is understanding why functions learned by neural networks generalize to unseen data. We are all impressed with the performance of GPT-3, but we can’t quite explain it. 

In experiments I hope will someday be forthcoming, this is definitely happening in models related to political behavior. In all likelihood, this occurs because there is a reasonably good correspondence between BERT’s training corpus—largely Wikipedia—and political behaviors of interest:  Wikipedia contains a very large number of detailed descriptions of complex sequences of historical political events, and it appears these are sufficient to give general-purpose transformer models at least some “common sense” ability to infer behaviors that are not explicitly mentioned. 
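To make this concrete, here is roughly what zero-shot classification looks like in practice: a minimal sketch using the HuggingFace pipeline API, where the model choice, the example sentence, and the candidate labels are my own illustrative assumptions, not anything from the experiments alluded to above.

    from transformers import pipeline

    # Zero-shot classification: the model has never seen these labels as
    # training targets; it scores them from its general language knowledge.
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")

    text = "Protesters clashed with police outside the parliament building."
    labels = ["armed conflict", "protest", "diplomatic negotiation", "election"]

    result = classifier(text, candidate_labels=labels)
    for label, score in zip(result["labels"], result["scores"]):
        print(f"{label}: {score:.3f}")

Note that nothing in the model was ever explicitly trained on “protest” as a category: the ranking comes entirely out of the pre-trained language knowledge.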

2. These are relatively easy to deploy, both in terms of hardware and software

Transformer models have proven to be remarkably adaptable and robust, and are readily accessible through open source libraries, generally in Python. And again, per the Model T analogy—any color so long as it is black—in my experiments just using default hyperparameters gives decent results (see the sketch below), a useful aspect given that hyperparameter optimization on these things carries a substantial computational cost.
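A fine-tuning run with everything left at the library defaults looks more or less like the following. This is a hedged sketch: the two-sentence “dataset” and the conflict/cooperation labels are invented for illustration, and a real application would use hundreds of training cases.

    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    texts  = ["Troops shelled the border town overnight.",
              "The two ministers signed a trade agreement."]
    labels = [1, 0]   # toy labels: 1 = conflict, 0 = cooperation

    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    enc = tok(texts, truncation=True, padding=True)

    class ToyDataset(torch.utils.data.Dataset):
        """Wraps the tokenized texts and labels for the Trainer."""
        def __len__(self):
            return len(labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in enc.items()}
            item["labels"] = torch.tensor(labels[i])
            return item

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)

    # Model T mode: default hyperparameters throughout.
    trainer = Trainer(model=model,
                      args=TrainingArguments(output_dir="toy-model"),
                      train_dataset=ToyDataset())
    trainer.train()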

Three key developments here:

  • For whatever reason—probably pressure from below to retain staff, or corporate ego/showing off to investors, or figuring (quite accurately, I’m sure) that there are so many things that can be done that no one would have time to explore them all, and in any case they still retain the hardware edge, and network effects…the list goes on and on—the corporate giants—we’re mostly talking Google, Facebook, Amazon, and Microsoft—have open sourced and documented a huge amount of software representing billions of dollars of effort [6].
  • A specific company, HuggingFace, pivoted from creating chatbots to transformers and has made a huge amount of well-documented code available.
  • Google, love ’em, created an easy-to-use (Jupyter) cloud environment called Colaboratory, available with GPUs [7], and charges a grand total of $10/month for reasonable, if not unlimited, access to it. Which is useful, as the giants appear to be buying every available GPU otherwise. (A quick check that your notebook actually has a GPU is sketched below.)
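As a practical aside, it’s worth confirming that a Colab notebook has actually been given a GPU (Runtime > Change runtime type); a minimal check, assuming PyTorch:

    import torch

    # Verify that a CUDA-capable GPU is attached to this runtime.
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))
    else:
        print("No GPU attached; running on CPU only")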

That said, it’s not plug-and-play, particularly for a system that is going to be used operationally, rather than simply as an academic research project: there’s still development and integration involved, and the sheer computational load required, even with access to GPUs, is a bit daunting at times. But…this is the sort of thing that can be implemented by a fairly small team with basic programming and machine learning skills. [8]

3. The synonym/homonym problems are solved through word embeddings and context

Synonyms—distinct words with equivalent meanings—and homonyms—identical words with distinct meanings—are the bane of dictionary-based systems for NLP, where changes in phrasing that would not even be noticed by a human reader cause a dictionary-based pattern to fail to match. This is particularly an issue for texts which are machine-translated or written by non-native speakers, both of which will tend to use words that are literally correct but would not be the obvious choice of a native speaker. In dictionary-based systems, competing meanings must be disambiguated using a small number of usually proximate words, which can easily result in head-scratchingly odd codings. An early example from the KEDS automated event data coding project was a coded event claiming a US military attack on Australia that was eventually traced to the headline “Bush fires outside of Canberra”. This coding resulted from the misidentification of “Bush”—this usage is apparently standard Australian English; in the US English of our KEDS developers, it would have been “brush”—as the US president, and of the noun “fires” as a verb, as in “fires a missile.” Dictionary developers collect such howlers by the score; a toy rendition of this failure mode is sketched below.
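For a sense of how little it takes to produce such a howler, here is a toy version of the failure mode; the patterns are hypothetical stand-ins for illustration, not the actual KEDS dictionaries.

    import re

    headline = "Bush fires outside of Canberra"

    # Hypothetical dictionary entries: an actor pattern intended to match
    # the US president, and a verb pattern intended to match "fires a missile".
    actor_pattern = re.compile(r"\bBush\b")
    attack_pattern = re.compile(r"\bfires\b")

    if actor_pattern.search(headline) and attack_pattern.search(headline):
        print("CODED EVENT: USA attacks Australia")   # the howler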

The synonym problem is solved through the use of word embeddings, a related neural-network-based technology which became widely available a couple of years before transformers took off and which is useful in a number of contexts. Embeddings place words in a very high-dimensional space such that words with similar meanings—“little” and “small”—are located close to each other in that space. This positioning is determined, as with transformers, from word usage in a large corpus.

In terms of dealing with homonyms, transformer models look at words in context and thus can disambiguate multiple meanings of a word. For example, the English word “stock” would usually refer to a financial instrument if discussed in an economic report, but would refer to the base of a soup in a recipe, farm animals in a discussion of agriculture, a railroad car (“rolling stock”) in a discussion of transportation, or supplies in general in the context of logistics. Similarly, the phrase “rolled into” has different meanings if the context is “tanks” (=take territory) or “an aid convoy” (=provide aid). BERT is trained on 512-token segments of text—“tokens” rather than words, so numbers and punctuation also count—so there is usually sufficient context to disambiguate correctly.
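This is easy to see directly: pull the contextual vector for “stock” out of different sentences and compare them. A minimal sketch, assuming the HuggingFace transformers library and PyTorch, with sentences invented for illustration:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def word_vector(sentence, word):
        """Contextual embedding of the first occurrence of `word`."""
        enc = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]   # (tokens, 768)
        idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
        return hidden[idx]

    v_report = word_vector("The stock fell sharply after the earnings report.", "stock")
    v_market = word_vector("Investors sold the stock when the market opened.", "stock")
    v_soup   = word_vector("Simmer the bones for hours to make a rich stock.", "stock")

    cos = torch.nn.functional.cosine_similarity
    print(cos(v_report, v_market, dim=0))   # same sense: relatively high
    print(cos(v_report, v_soup, dim=0))     # different senses: lower

The same token gets a different vector in each sentence, which is precisely what a static dictionary entry cannot do.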

Both of these features, which involve very large amounts of effort when specialized dictionaries are involved, come as just part of the pre-trained system when transformers are used.

4. 3 parameters good; 12 parameters bad; billions of parameters, a whole new world

As I elucidated a number of years ago (click here for the paywalled version, which I see has 286 citations now…hit a nerve, I did…), I hates garbage can models with a dozen or two parameters, hates them forever I do. Chris Achen’s “Rule of Three” is just fine, but go much beyond this, particularly while pretending the extra variables are “controls” (we truly hates that forever!!!), and you are really asking for trouble [9]. Or, alas, for publication in a four-letter political science journal.

So, like, then what’s with endorsing models that have, say, a billion or so parameters? Which you never look at. Why is this okay?

It’s okay because you don’t look at the parameters, and hence are not indulging in the computer-assisted self-deception that folks running 20-variable regression/logit garbage can models do on a regular basis.

Essentially any model has a knowledge structure. This can be extremely simple: in a t-test it is two parameters (tests against zero require only estimates of the mean and standard deviation) or four (comparisons of two populations). In regression it is treated as 2(N+1) parameters—the coefficients, the constant, and their standard errors—though in fact it should also include the N(N+1)/2 covariances of the estimates (speaking of things never looked at…): for a 20-variable garbage can model, that’s 42 reported numbers plus another 210 covariances nobody inspects.

So instead of p-fishing—or, in contemporary publications, industrial-scale p-trawling—one could imagine getting at least some sense of what is driving a model by looking at the simple correspondence of inputs and outputs: the off-diagonal elements of the confusion matrix, the false positives and false negatives, are your friends! Like a gossipy office worker who hates their job, they will give you the real story!
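In practice this is a few lines of scikit-learn; the event categories and the toy labels/predictions here are, of course, invented:

    from sklearn.metrics import confusion_matrix, classification_report

    # Toy gold labels and model predictions for three event categories.
    y_true = ["conflict", "protest", "protest", "conflict", "neutral", "protest"]
    y_pred = ["conflict", "protest", "conflict", "conflict", "neutral", "neutral"]

    cats = ["conflict", "protest", "neutral"]
    print(confusion_matrix(y_true, y_pred, labels=cats))   # off-diagonals: the gossip
    print(classification_report(y_true, y_pred, labels=cats))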

Going further afield, and at a computational price, vary the training cases, and compare multiple foundation models. There is also a fairly sizable literature—no, I’ve not researched it—on trying to figure out causal links in neural network models, and this is only likely to develop further—academics (and, my tribe these days, the policy community) are not the only people who hate black-box models—and while some of these methods are going to be beyond the computational resources available to all but the Big Dogs, some ingenious and computationally efficient techniques may emerge. (Mind you, such diagnostics have been available for decades for regression models, and are almost never used.)

5. There’s plenty more room to develop new “neural” architectures

Per this [probably paywalled for most of you] Economist article, existing computational neural networks have basically taken a very simple binary version of a biological “neural” structure and enlarged it to the point where—to the chagrin of some ex-Google employees—it can do language and image tasks that appear to involve a fair amount of cognitive ability. But as the article indicates, nature is way ahead of that model, and not just in terms of size and power consumption (though it is hugely better on both).

For now. But just as simple (or not so simple) neural structures could be simulated, and had sufficient payoff that specialized hardware could be justified, some of these other characteristics—signals changing in time and intensity, still more layers and subunits of processing—could also be simulated, and this can be done (and doubtless is being done: the payoffs are potentially huge) incrementally. So if we are currently at the Model T stage, ten years from now we might be at a Toyota Camry stage. And this could open still more possibilities.

At the very least, we are clearly doing this neural network thing horribly inefficiently, since the human brain, a neural system some orders of magnitude more complex than even the largest of the neural networks, consumes about 20 watts, which is apparently—estimates vary widely depending on the task—about a half to a third of the energy used by Apple’s M1 chip. Which has a tiny fraction of the power of a brain. Suggesting that there is a long way to go in terms of more efficient architectures.

6. Prediction: Sequences, sequences, sequences, then chunk, chunk, chunk

Lisa Feldman Barrett’s Seven and a Half Lessons About the Brain revolves around the theme that prediction is the fundamental task of neural systems developed by evolution, both predicting their external environment (Is there something to eat here? Am I about to get eaten? Where do I find a mate?) and their internal environment, specifically maintaining homeostasis in systems that may have very long delays (e.g. hibernation). So the fact that neural networks, even at a tiny fraction of the level of interconnection of many biological networks, are good at prediction is not surprising. The specific types of sequence prediction can be a little quirky, but they aren’t terribly far removed.

Suggesting these might be really useful for that nice little international community doing political forecasting of international conflict, but, alas, those are relatively rare events, and novel conflicts are even rarer. So, as a little project, what about a parallel problem in business: predicting whether companies will fail (or whether their stock will fall: think it would be possible to make money on that?)? Presumably resources beyond our imagination are being invested in this, and perhaps some of the methods will spill over into conflict forecasting.

And arguably this is just a start. We’ve got Wikipedia—and by now, I’m sure, Lord of the Rings, A Song of Ice and Fire, and the entire Harry Potter corpus—in our pre-trained knowledge bases. But this is all text, which is quite useful but inefficient, and it is not how experts work: experts chunk and categorize. Given the cue “populist policies,” a political analyst can come up with a list of those common across time, those specific to various periods, right- and left-wing variants, etc., but these are phrased in concepts, not in specific texts. [10]

So could we chunk and then predict? As it happens, we are already doing chunking in the various layers of the neural networks; in particular, this is how word vectors—a form of chunking—were developed. Across a sufficiently large corpus of historical events, I’m guessing we will find a series of “super-events” which will eventually stabilize in forms not dissimilar to the concepts used by qualitative analysts. Along those same lines, I’m guessing we should generally expect to see human-like errors—that is, plausible near-misses from activities implied by the text, or bad analogies from the choice of training cases—rather than the oftentimes almost random errors found in existing systems.

7. No, they aren’t sentient, though it may be useful to treat them as such [11]

As usual, returning to the opening key, a few words on the current vociferous debate on whether these systems—or at least their close relatives, or near-term descendants—are sentient. Uh, no…but it’s okay, provided you can keep your name out of the Washington Post [13], and it may be useful to think of them this way.

For starters, we seem to go through this sentient-computer-program debate periodically, starting with reactions to the ELIZA program from the mid-1960s (!) and Searle’s Chinese Room argument from 1980 (!), and yet one almost never sees these mentioned in contemporary coverage.

But uh, no, they aren’t sentient, though just to toss a bit more grist into the mill—or perhaps that should be “debris into the chipper”—here are the recent Economist cites on the debate:

Pro: https://www.economist.com/by-invitation/2022/06/09/artificial-neural-networks-are-making-strides-towards-consciousness-according-to-blaise-aguera-y-arcas

Con: https://www.economist.com/by-invitation/2022/06/09/artificial-neural-networks-today-are-not-conscious-according-to-douglas-hofstadter

Now, despite firmly rejecting the notion that these or any other contemporary neural networks are sentient, I am guessing—and to date, we’ve insufficient institutional experience to know how this is going to play out—that we will treat them as such after a fashion. Consider the following from the late Marshall Sahlins’s The New Science of the Enchanted Universe: An Anthropology of Most of Humanity:

Claude Lévi-Strauss retells an incident reported by the Canadian ethnologist Diamond Jenness, apropos of the spiritual place masters or “bosses” known to Native Americans as rulers of localities, but who generally kept out of sight of people. “They are like the government in Ottawa,” an old Indian remarked. “An ordinary Indian can never see the ‘government.’ He is sent from one office to another, is introduced to this man and to that, each of whom sometimes claims to be the ‘boss,’ but he never sees the real government, who keeps himself hidden.”

Per the above, there might be something in our cognitive abilities that makes it useful to treat a transformer system as sentient, just as we [constantly] treat organizations as sentient, with human-like personalities and preferences, even though, as with consciousness, we can’t locate these. [14] From the perspective of “understanding”—a weasel-word almost as dangerous as “sentient”—one needs to think of these things as a somewhat lazy student coder whose knowledge of the world comes mostly from Wikipedia. Thus invoking the classical Programmer’s Lament, which I believe also dates from the 1960s:

I really hate this damn machine
I really wish they’d sell it
It never will do what I want
But only what I tell it

combined with the First Law of Machine Learning: 

A model trained only on examples of black crows will not conclude that all crows are black, but that all things are crows

And so, enough for now, and possibly, in the future, a bit more context and a few more experimental revelations. But in the meantime, get to work on these things!!

Footnotes

1. The first of “Clarke’s Laws”, the other two being:

  • The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
  • Any sufficiently advanced technology is indistinguishable from magic.

Clarke also noted that in physics and mathematics, “elderly” means “over 30.” Leading to another saying I recently encountered which should perhaps be the Parus Analytics LLC mission statement:

Beware of an old man in a profession where men die young.
Sean Kernan on mobster Michael Franzese

2.  Very shortly after COVID began, when people still thought it was going to be an inconvenience for a few weeks, I turned down a job opportunity when I discovered that the potential employer was a vanity project run by a tyrannical ex-hippie who was so committed to butts-on-seats that he (explicit use of gender-specific pronoun…) expected people in the office even if they’d returned the previous night on a flight from Europe. Mind you, it also didn’t help that they wanted twelve references—poor dears, probably never learned to read or use the internet or GitHub—and that the response of one of the references I contacted before abandoning the ambition was “Oh, you mean the motherfuckers who stole my models?” They never filled the position.

3. Groveling apology for length: once again, for a blog entry, this composition is too long and too disjointed. The editors at MouseCorp are going to be furious! Except, uh, they don’t exist. Like you couldn’t tell.

Hey, it’s the first entry in over two years! And I’ve been working on this transformer stuff for close to a year. So I hope, dear reader, you are engaging with this voluntarily, all the while knowing some of you may not be.

4. There are so many introductions available—Google it—with the selection changing all of the time, that I’m not going to recommend any one, and the “best” is going to depend a lot on where you are with machine learning. I glazed over quite a few until one—I’ve lost the citation but I’m pretty sure it was from an engineering school in the Czech Republic—worked for me. For those who are Python programmers, the sample code on HuggingFace also helps a lot.

5. Elaborating slightly, I got this observation from a citation-now-lost TheSequence interview with some VC superstar and programming genius who starts the interview with “after I finished college in Boston…”—like maybe the same Boston “college” Zuckerberg and the Winklevoss twins went to, you just suppose?—to the effect that we’re actually on the descending slope of the current AI wave, and the best new AI companies are probably already out there; it just isn’t clear who they are. The interview also contained an interesting observation about the choice between investing in physical infrastructure and investing in new software development: curiously, I’ve not thought about that much over the years, particularly recently, since, as elaborated further below, at my ground-level perspective Moore’s Law, then briefly a couple of little-used university “supercomputer” centers, then the cloud, painlessly took care of all imaginable hardware needs; but at the outer limits of AI, e.g. Watson and GPT-3, we’re definitely back to possible payoffs from significant investments in specialized hardware.

6. From the perspective of the research community as a whole, this is actually a huge deal, so it is worthy of some further speculation. I have zero inside track on the decision-making here, but I’m guessing three factors are the primary drivers:

Factor 1. (and this probably accounts for most of the behavior) By all accounts, talent in this area is exceedingly scarce. This spins off in at least three ways:

  • Whatever the big bosses may want (beyond, of course, butts-on-seats…), talented programmers live in an open source world, and given the choice between two jobs, will take the one which is more open. This is partly cultural, but it is also out of self-interest: you want as much of your current work (or something close to it) to be available in your next job. I recently received a query from a recruiter from Amazon—they obviously had not read the caveats in my utterly snarky “About” on my LinkedIn profile—asking about my interest in a machine learning position, and Amazon’s job description not only highlights the ability to publish in journals as one of the attractions of the job, but lists a number of journals where their researchers have published.
  • And speaking of the next job, the more your current work is visible, the better your talents can be assessed. Nothing says “Yeah, right…next please” like “I did this fantastic work but I can’t tell you anything about it.”
  • On the flip side of that, a company may be able to hire people, whether from other companies or out of school, already familiar with their systems: this can save at least weeks if not months of expensive training.

Factor 2. To explore these types of models in depth, you need massive amounts of equipment, which only the Big Dogs have, so they are going to have a comparative advantage in hardware even if the software is open. This, ironically, puts us back into the situation prior to maybe 1995 when a “supercomputer” was still a distinct thing and a relatively scarce resource, so a few large universities and federal research centers could use hardware to attract talent. Thanks to Moore’s Law, somewhere in the 2000s the capabilities of personal computers were sufficient for almost anything people wanted to do—university supercomputer centers I was familiar with spent almost all of their machine cycles on a tiny number of highly specialized problems, usually climate and cosmological models, despite desperately trying—descending even to the level of being nice to social scientists!—to broaden their applications. As Moore’s Law leveled off in the 2010s, cloud computing gave anyone with a credit card access to effectively unlimited computing power at affordable prices.

Factor 3. The potential applications are so broad, and the talent shortage so acute, that none of these giants is going to be able to fully ascertain the capabilities (and defects) of their software anyway, so better to let a wider community do this. If something interesting comes up, they will be able to quickly throw large engineering teams and large amounts of hardware at it: the costs of discovery are very low in this field, but the costs of deployment at scale are relatively high.

7. GPU = “graphics processing unit”: specialized hardware chips originally developed, and then produced in the hundreds of millions, to facilitate the calculations required for the real-time display of exploding zombie brains in video games but, by convenient coincidence, readily adapted to the estimation and implementation of very large neural networks.

On a bit of a side note, Apple’s new series of “Apple Silicon” chips incorporate, on the chip, a “neural engine” but, compared to the state-of-the-art GPUs, this has relatively limited capabilities and presumably is mostly intended to (and per Apple’s benchmarks, definitely does) improve the performance of various processes in Apple’s products such as the iPhone.

But the “neural engine” is not a GPU: in Apple’s benchmarks using the distilBERT model, the neural engine achieves an increase of 10x on inference, whereas my experiments with the various chips available in Google’s Colaboratory saw increases of 30x on inference and 60x on estimation, and the difference on inference (which is presumably all that the Apple products are doing, the model estimation having been done, and thoroughly optimized, at the development level) is almost entirely proportional to the difference in the number of on-chip processing units.

Having said that, Apple has made the code for using this hardware available in the widely-used PyTorch environment, so there might be some useful applications. Though it is hard to imagine this being cost-competitive against Google’s $10/month Colaboratory pricing.

A key difference between the Apple Silicon and the cloud GPUs is power consumption: this is absolutely critical in Apple’s portable devices but, at least at first, was not a concern in GPUs, though with these massive new models using very large amounts of energy—albeit not at the level of cryptocurrency mining—energy use has become a concern.

A final word on “Apple Silicon”: having discovered in February-2022, the hard way, that you do not (!!!) want to wait until one of your production machines completely dies—and, truth be told, I probably kind of killed the poor thing running transformer models 24/7 in the autumn of 2021, before I discovered how simple Colaboratory is to use—I replaced my ca. 2018 MacBook Air with the equivalent using the M2 chip, and the thing is so absolutely blindingly fast it is disorienting. Though I’m sure I will get used to it…

8. A word of caution: you almost certainly do not want to try to involve any academic computer scientists in this, and you most certainly don’t want to give them access to your grant money: this stuff is just engineering, of no interest or payoff to academics. Certainly of no payoff when the results are going to be published five years later in a paywalled social science journal. And hey, it’s not just me: here is another rendition.

Having had some really bad experiences in such “inter-disciplinary” “collaborations”, I used to think that when it comes to externally funded research, computer scientists were like a sixth grade bully shaking down a fourth grader for their lunch money. But now I think it is more primordial: computer scientists see themselves as cheetahs running down that lumbering old zebra at the back of the herd, and think no more of making off with social science grant money—from their perspective, “social science” is an oxymoron, and there is nothing about political behavior that can’t be gleaned from playing the Halo franchise and binge-watching Game of Thrones—than we think of polishing off the last donut in the coffee lounge. 

Don’t be that zebra.

I’m sure there are exceptions to this, but they are few. At the contemporary level of training in political methodology at major research universities, this stuff just isn’t that hard, so use your own people. Really.

9. Andrew Gelman’s variant on Clarke’s Third Law: “Any sufficiently crappy research is indistinguishable from fraud.”

10. Though it would be interesting to see whether a really big model could handle this, particularly, say, a model fine-tuned on a couple dozen comparative politics textbooks. More generally, textbooks may be useful fine-tuning fodder for political science modeling as they are far more concentrated than Wikipedia, though, as always, the computational demands of this might be daunting.

11. At this point we pause to make the obvious note that the issue of consciousness and sentience (which, depending on the author, may or may not be the same thing) goes way back into psychology, and was a key focus of William James (and, to a somewhat ambiguous extent, Carl Jung) and a bunch of other discussions that eventually got swamped in behavioralism (and by the fact that the materialist paradigm prevailing by the middle of the 20th century made zero progress on the matter).

These are really difficult concepts: the COVID virus is presumably not sentient, but is a mosquito? Or is a mosquito still just a tiny if exceedingly complex set of molecules, not qualitatively different than a virus? Where do we draw the line? Mammals, probably, and—having listened to this, I just can’t bring myself to eat those little guys any more—octos. But chickens and turkeys, which I do eat, albeit with tinges of guilt? Is domestication a sort of evolutionary tradeoff where the organism gives up free will and sentience for sheer numbers?

Is an adjunct professor sentient?—most deans don’t behave as though they are. [12] Is consciousness even a purely material phenomenon? As Marshall Sahlins argues, that’s been the working hypothesis among the elite for perhaps only the past five generations, but not the previous 2,000 generations of our ancestors, who were nonetheless successful enough to, well, be our ancestors.

12. Shout-out to the recently deceased Susan Welch, long-time Dean of the College of Liberal Arts at Penn State, who among many social science-friendly policies, in fact did treat adjuncts as not only sentient, but human.

13. An ultimate insider reference to a very strong warning provided at the kickoff meeting for the DARPA ICEWS competition in 2007.

14. Sahlins again, scare quotes in original:

It is not as if we did not live ourselves in a world largely populated by nonhuman persons. I am an emeritus professor of the University of Chicago, an institution (cum personage) which “prides itself” on its “commitment” to serious scholarship, for the “imparting” of which it “charges” undergraduate students a lot of money, the Administration “claiming” this is “needed” to “grow” the endowment, and in return for which it “promises” to “enrich” the students’ lives by “making” them into intellectually and morally superior persons. The University of Chicago is some kind of super-person itself, or at least the Administration “thinks” it is. [pg 71]
