Being a Critical Journalist in Digital Times

Geert Lovink, who does wonderful work at the Institute of Network Cultures, yesterday tweeted a cry of horror on finding out via TechCrunch that Google is funding the development of software that writes local news stories. Media the world over have parroted the same news, which seems largely based on a press release from the UK Press Association. The parroting is in itself an indicator of the dire situation in journalism, where uncritically reposting press releases has become a stand-in for actual in-depth, well-researched coverage. Those who at least attempted a stab at a perspective mostly stuck to the hackneyed criticism that Google is funding the development of robot journalists that will put human journalists out of a job: “Journalists, look out: Google is funding the rise of the AI news machine”.

In reality, things probably move both faster and slower. Let me explain.

Having been in the midst of my own little media tempest in a teacup, I tend to think that press releases and the follow-up roar in diverse media are as much ‘alternative fact’ as they are not. My case also involved a director of a publishing house announcing that we as researchers were ragingly enthusiastic about the effectiveness with which we are able to predict best sellers based on deep learning methods. I think I—“the researchers” in the story—said the results were encouraging, or some such; how that turned into “ragingly enthusiastic” remains a mystery to me too.

What I did was, in any case, not exactly rocket science in the realm of machine learning. Using the well-established open source Python libraries Theano and Keras, I built a straightforward neural network and fed it a tf-idf matrix of some 250,000 features derived from a set of 200 Dutch published novels for which sales numbers were known. We were then able to predict the ‘selling capability’ of unseen novels of the same publisher—novels for which sales numbers were also known, so the predictions could be checked—with some 80% accuracy. In plainer English: applying what are by now middle-of-the-road machine learning techniques, we can predict whether novels will sell or not, and eight out of ten times we will be correct.
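For the technically curious, the shape of such a pipeline can be sketched in a few dozen lines. This is emphatically not the actual research code—the real model was a Keras/Theano network over a ~250,000-feature tf-idf matrix of 200 novels—but a self-contained toy stand-in, with invented miniature “novels”, that illustrates the same principle: tf-idf features in, a learned seller/non-seller decision out.

```python
import math

def tfidf_matrix(docs):
    """Turn tokenised documents into sparse tf-idf vectors (one dict per doc)."""
    n = len(docs)
    df = {}  # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        vectors.append({t: (c / len(doc)) * math.log((1 + n) / (1 + df[t]))
                        for t, c in tf.items()})
    return vectors

def train_logistic(vectors, labels, epochs=200, lr=0.5):
    """Fit a single-neuron (logistic) classifier by stochastic gradient descent."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            z = b + sum(w.get(t, 0.0) * v for t, v in vec.items())
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of 'seller'
            err = y - p
            b += lr * err
            for t, v in vec.items():
                w[t] = w.get(t, 0.0) + lr * err * v
    return w, b

def predict(w, b, vec):
    """1 = predicted seller, 0 = predicted non-seller."""
    z = b + sum(w.get(t, 0.0) * v for t, v in vec.items())
    return 1 if z > 0 else 0

# Invented miniature corpus: bags of words standing in for whole novels.
corpus = [
    ["murder", "detective", "thriller", "secret"],   # sold well
    ["secret", "murder", "night", "thriller"],       # sold well
    ["meadow", "quiet", "memory", "childhood"],      # did not sell
    ["childhood", "meadow", "slow", "memory"],       # did not sell
]
labels = [1, 1, 0, 0]

# Vectorise the training novels plus one 'unseen' novel; train on the first four.
unseen = ["detective", "secret", "night", "murder"]
vectors = tfidf_matrix(corpus + [unseen])
w, b = train_logistic(vectors[:4], labels)
print(predict(w, b, vectors[4]))  # → 1 (classified as a likely seller)
```

The real network was of course larger and the feature space vastly bigger, but the clerk-like mechanism is the same: weigh a few hundred thousand word scores, sum them, and compare against what successful novels looked like.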

Machine learning techniques such as deep learning using neural networks can be extremely sensitive to patterns in large data sets—patterns too distributed throughout the data for humans to be able to pick up on them. Given enough training data, such technologies infer models, or sets of features, that are very good at telling you to which categories unseen examples belong: to the category of best sellers or of non-sellers, for instance. In our case the model picked up on the words that are common to the best sellers of recent years and was able to find matches of such word use in novels it had not been trained on. Like a meticulous clerk with unlimited time, it compared the scores on some 250,000 variables per novel, averaged those scores, and compared them to the scores of successful selling novels. Not exactly rocket science really, just an immense amount of work impossible to pull off in feasible time with mere human capacity. For a well-programmed algorithm, though, a matter of mere minutes.

Classifiers, as such algorithms are also called, are already crunching real-world data all the time. This is what I mean by the ‘faster’ part. In many respects what PA is going to try to do is not that new at all. Stock exchange predictions, flight fuel consumption patterns, internet store customer interest, the likelihood a person on Facebook will lean to a certain political conviction, and so forth: all have been measured and predicted by similar algorithms for the better part of a decade at least.

The information that citizens experience is more often than not already tailored to their needs by such algorithms. Although many point this out all the time (for instance here, or here, or here, or here), it is somehow still a huge surprise when the processes that these algorithms support break out of their otherwise mostly covert and invisible existence. To those who are aware of how widespread the application of these algorithms is, it is rather this very surprise that is surprising: you are living in a highly automated information world already, and have been for a long time. Better get used to it—or at least become aware of it. And yes, this also happens in news story generation already, as TechCrunch did not fail to point out. Automated Insights provides this type of service at an impressive industrial scale.

The ‘slower’ bit is that press releases like the one from PA systematically overclaim. Although I have to admit this is conjecture, it is unlikely that Google’s €700,000+ funding will lead PA and Urbs Media to eradicate local news gathering and publishing. First of all this does not seem to be their aim, but moreover the quality and effects of “a new service” churning out “up to 30,000 localised stories each month from open data sets” remain to be seen. What are these open data sets, and what will these news stories be? There can certainly be sense in informing the public about community-level decisions on construction and development based on town council decisions mined from public service databases. Inferring and precisely directing a message like “Town council planning bypass to relieve your neighbourhood of cut-through traffic” can be relevant news to a small number of locals, but could well be too tedious and too low-impact for journalists to have to bother with. Automated services like that might thus actually be well placed.

The big problem, and the reason progress in developing such a service will be slow, is that the closer you get to the community and the individual, the more heterogeneous and specific news needs and interests become. Inferring automatically that a plan for a motorway bypass will affect people in some area is one thing; deciding what this means to those people is a whole different ball game—one that machine learning is still terribly bad at. What open data sets are you going to use to get some sense of how to frame your news story, so important if it is to be tailored to local needs? Are we turning to what is available on the Web, for instance, produced by the community itself? I sincerely doubt that will result in actually “fact-based insights into local communities”, which is the “increasing demand” PA says it targets. These are challenges of automated inference that are not easily solved, resulting in the slower bit: the output and impact of projects like these are usually far more modest—yet still usable—than press releases tend to suggest. A very nice prototype will be derived. Probably, maybe.

Another part of the slower bit is that it is easy enough to generate high-level, mainstream-interest, stock exchange news items from well-formalised statistics, but a lot harder to generate daily real-life human interest stories. “First quarter income was reported at 15 million USD” is a sentence easily generated from well-groomed statistics and databases. Generating “Bob’s farm shop will be closed temporarily in October” is of more immediate interest to some local community, but it involves complexities of automated inference from real-world information far beyond current capabilities. Although generating language is getting intriguingly easy—almost as easy as predicting best sellers—it is still a far cry from the heterogeneous specificity that is relevant at the local or individual level. There is a meaningful and relevant difference between “Bob’s farm shop will be closed temporarily in October” and “Bob’s farm shop will be closing in October”. The first is an ordinary well-formed announcement; the second is a potential source of hazardous assumptions about Bob and his commercial and physical well-being. Current deep learning algorithms are rather insensitive to such subtleties, and one has little control over which of the two will be generated. But such subtleties are exactly what starts to matter when you get down to the less formulaic, less high-level-pattern-based nature of language in local community real life.
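The asymmetry is easy to demonstrate. A sketch like the following—field names and figures invented for illustration—shows why the first kind of sentence is cheap to produce: given a clean, well-groomed record, a template fills itself. No comparable record states why Bob’s shop is closing, for how long, and what that means to his neighbours; that information would first have to be inferred.

```python
# A hedged sketch of template-based news generation from structured data.
# The record layout is invented; real services work on the same principle
# at far greater scale and sophistication.

def finance_item(record):
    """Render one financial-news sentence from a database-style record."""
    millions = record["amount"] / 1_000_000
    return (f"{record['period']} income was reported at "
            f"{millions:g} million {record['currency']}")

print(finance_item({"period": "First quarter",
                    "amount": 15_000_000,
                    "currency": "USD"}))
# → First quarter income was reported at 15 million USD
```

The hard part of local news generation is everything that happens before such a record exists: deciding that an event is news, to whom, and with which framing.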

So all in all it will be a while, and I would rather be interested in the results of the project than in shout-outs about the potential demise of journalism writ large. At the same time I do not want to downplay the hazardous situation that these technologies may eventually put journalists in. This involves the even harder question of how we wield our digital technology ethically. Yes, it is possible to predict best sellers. Even the simplest possible application of deep learning yields 80% success. As a publisher you would be a rather poor businessman if you did not at least scout out the possible edge this might give you in the publishing industry. But that does not mean you need to do away with your editors. In fact that would be the intellectually poorest and most unethical choice. You can choose to do so, and you can even keep making a profit that way. The downside, however, is that your predictive algorithm will force creative writing into an unpalatable sludge of same-plot-same-style novels. The more clever and ethical option is to use such an algorithm as a support tool to avert the worst potential non-sellers. Each averted non-seller—even if we get one in five wrong—is money saved to pour into more promising, more interesting projects. Responsibly deployed, this algorithm enhances the ability of a publisher to support new and truly interesting work. An ethical businessman—I am aware this qualifies as a modern oxymoron—would turn losses prevented into an investment in new, interesting literature that deviates from the by now plain old literary thriller genre. That is something machine learning is still far worse at than humans: predicting which outlier might not be a non-seller after all, but an example of the brilliant new, different style and narrative that will hit it big.

The rationale for how we choose to wield our technologies comes from critical thinking and reflection—or the lack thereof, of course. It is the more deplorable, therefore, that PA’s press release resulted in a quite predictable flood of reactions in the “journalists’ jobs are under threat” genre. It means that journalists mostly did not do their job of critical news mongering. If they had, they would not have talked so much about their jobs—for, as argued, it will be a while until these are really under threat, if at all. Rather, they could have talked about the highly arguable motives of an industry giant like Google being involved in developing algorithms that select and write news items. Do we really believe that these algorithms will be impartial? Will we yet again believe the stories that data science and natural language parsing algorithms are neutral because they are based on mathematics and logic? How likely is it that Google, more specifically its board, will be ethical about deploying these algorithms? Those are the questions the press should have set out to answer. Instead of devoting time to being truly investigative and critical, they chose to repeat the press release.

So don’t be surprised if you sometime soon read a local announcement: “Bob’s store will be closed temporarily in October, but you can find fine online stores on Google”.



Willard McCarty on Humanist pointed me to a quite silly article in the Economist entitled “March of the Machines”. It can almost be called a genre piece. The author very much downplays the possible negative effects of artificial intelligence and then argues that society should find an ‘intelligent response’ to AI—as opposed, I assume, to uninformed dystopian stories.

But I do hope the intelligent response society will seek to AI will be less intellectually lazy than the author of said contribution. To be honest, I think someone needed to crank out a 1,000-word piece quickly, and resorted to sad stopgap rhetoric.

In this type of article there is invariably a variation on this sentence: “Each time, in fact, technology ultimately created more jobs than it destroyed”. As if—not denying here any of a job’s power to be meaningful and fulfilling for many people—a job were the single quality of existence.

Worse is that such multi-purpose filler arguments ignore the unintended side effects of technological development. Mass production was brought on by mechanisation. We know that it also brought mass destruction. It is always sensible to consider both the possible dystopian and utopian scenarios. No matter what Andrew Ng (quoted in the article) as an AI researcher is obviously bound to say, it is actually very sensible to consider the overpopulation of Mars before you colonise it. Before conditions there are improved for human life—at whatever expense—even a few persons will effectively constitute such an overpopulation. Ng’s argument is a non sequitur anyway. If the premise of the article is correct, we are not decades away from ubiquitous application of AI. Quite the opposite: the conditions on Earth for AI have been very favourable for more than a decade already. We can hardly wait to try out all our new toys.

No doubt AI will bring some good, and no doubt it will also bring a lot of awful bad. This is not inherent in the technology, but in the people who wield it. Thus it is useful to keep critically examining all applications of all technologies while we develop them, instead of downplaying their unintended side effects without evidence.

If we do not, we may create our own foolish utopian illusions—for instance when we start using arguments such as “AI may itself help, by personalising computer-based learning and by identifying workers’ skills gaps and opportunities for retraining.” Which effectively means asking the machines what the machines think the non-machines should do. Well, if you ask a machine, chances are you will get a machine-like answer and eventually a machine-like society. Which might be fine for all I know, but I would like that to be a very well-informed choice.

I am not a believer in the Singularity. Chances that machines and AI will aggressively push out humankind are in all likelihood grossly exaggerated. But a realistic possibility is the covert permeation of human society by AI. We change society by our use of technology, and the technology changes us too. This has been and will always be the case, and it is far from some moral or ethical wrong. But we should be conscious and informed of these changes, so that we hold the choice and not the machine. If a dialogue between man and (semi-)intelligent machine were started as naively as the author of the Economist piece suggests, then humankind might indeed very naively be set to become machine-like.

Machines and AI are, certainly until now, extensions and models of human behaviour. They are models and simulations of such behaviour; they are never humans. This can improve human existence manifold. But having the heater on is something quite different from asking a model of yourself: “What gives my life meaning? How should I come to a fulfilling existence?” Asking that of a machine, even a very intelligent one, is still asking a machine what it is to be human. It is not at all excluded that a machine will one day find a reasonable or valuable answer to that. But I would certainly wait beyond the first few iterations of this technology before buying into any of the answers we might get.

It is deceptively easy to be unaware of such influences. In 1995 most people found cell phones marginally useful and far too expensive. A mere 20 years later almost no one wants to part with his or her smartphone. This has changed how we communicate, when we communicate, how we live, who we are. AI will have similar consequences. Those might be good, those might be bad. They should not, however, be covert.

Thus I am not saying at all that a machine should never enter a dialogue with humans on human existence. But when we enter that dialogue, we considerably change the character of the interaction we have had with technologies for as long as we can remember. Humans have always defined technology, and our use of it has in part defined us. By changing technology we change ourselves. This acts out on the individual level—I am a different person now, through using programming languages, than I was when I did not—and on the scale of society, where we are part of socio-technical ecosystems comprising technologies, communities, and individuals.

But these interactions have always been a monologue on the intellectual level. As soon as this becomes a dialogue, because the technology can now literally speak to us, we need to be aware that it is not a human speaking to us, but a model of a human.

I for one would be excited to learn what that means, what riches it may bring. But I would always enter such a conversation well aware that I am talking not to another human, but to a machine, and I would weigh that fact into the value and evaluation of the conversation. To assume that AI will answer questions on what course of action would lead me to improving my skills and my being may be too heavy a buy-in into the abilities of AI models to understand human life.

Sure, AI can help. Even more so if we are aware that its helpful qualities are by definition limited to the realm of what the machine can understand.


Methodological Safety Pin

There is a trope in digital humanities related articles that I find particularly awkward. Just now I stumbled across another example, and maybe it is a good thing to muse about it for a short bit. Where the example comes from I do not think is important, as I am interested in the trope in general and not in this particular instance per se. Besides, I like the authors and have nothing against their research, but before you know it flames are flying everywhere. So in the interest of all I file this one for posterity anonymised.

This is the quote in question: “The first step towards the development of an open-source mining technology that can be used by historians without specific computer skills is to obtain a hands-on experience with research groups that use currently available open-source mining tools.”

Readers of digital humanities related essays, articles, reviews, and so on will have found ample variations on this theme in the literature. From where I am sitting, such statements rig up a dangerous strawman or facade. There are a number of hidden (and often not so hidden) assumptions that are glossed over with such statements.

First of all there is the assumption that it is obvious that, as a scholar without specific computer skills, you should still be able to use computer technology. This is a nice democratic principle, I guess, but is it a wise one too?

Second, there is the suggestion that all computer technology is homogeneous—that there is no need to differentiate between levels and types of interfaces and technologies, and that it can all sweepingly be represented as this amorphous mass of “open-source mining technology”. I know it is not entirely fair to pin this on the authors of such statements. Indeed, the authors may be well aware that they are generalising a bit in service of the less experienced reader. However, the scholarly equivalent would be to say that the first step for a computer scientist who wants to understand history is to obtain hands-on experience with historians. Even if that might in general be true, I expect more precision from scholarly argument. You do not ‘understand history’. One understands tiny, very specific parts of it, maybe, when approached with very specific, narrowly formulated research questions and meticulous methodology. I do not understand why the wide brush is suddenly allowed once the methodology turns digital.

Third—and this is the assumption I find most problematic—there is the assumption (rather, axiom maybe) that there shall be a middle man: a bridge builder, a guide, a mediator, or go-between who shall translate the expertise of the computer-skilled persons involved to the scholar. You hardly ever read it the other way round, by the way; it is never the computer scientist in need of some scholarly wisdom. This in particular is a reflex and a trope I do not understand. When you need expertise, you talk to the expert, and you try to acquire the expertise. But when it comes to computational expertise we (scholars) are suddenly in need of a mediator—someone who goes in between and translates between expertises. In much literature—which is itself part of this process of expertise exchange—this is now a sine qua non that does not get questioned at all: of course you do not talk to the experts directly, and of course you do not engage with the technology directly. When your car stalls, you don’t dive into the engine compartment with your scholarly hands, do you?!

Maybe not—though I at least try to determine, even with my limited knowledge of car engines, what might be the trouble. But I sure as hell talk to the expert directly. The mechanic is going to fix my car; I want to know what the trouble is and what he is going to do. Yes, well, the scholar retorts, but quite frankly I do not talk that much about the engine trouble to my mechanic at all! Fair enough—it might not be your cup of tea. But the methodology of your research should be. Suppose you are diagnosed with cancer: do you want to talk only to the secretary of your doctor?

Besides, it is about the skills. A standard technique to expose logical fallacies in reasoning is to substitute object phrases. Let me play this little game with these tropes too: “The first step towards the development of a hand grenade that can be used by historians without specific combat skills is to obtain hands-on experience with soldiers that use currently available hand grenades.”

This does not invalidate the general truthiness of the logic, but it does serve to lay bare its methodological fallacy: if you want to use that technology and rely safely on the outcome of its use, you had better acquire some basic skills from the experts.

Intellectual Glue and Computational Narrative

There exist several recurring debates in the digital humanities—or rather, maybe, between the digital humanities and the humanities proper. One that is particularly thorny is the “Do you need to know how to code?” debate, in my experience also frequently aliased as the “Should all humanists become programmers?” debate. One memorable event in the debate was Stephen Ramsay’s (2011a) remark: “Do you have to know how to code? I’m a tenured professor of Digital Humanities and I say ‘yes.’” A sure fire starter. Ramsay used the metaphor of building to describe coding work done in DH. Taking this up, Andrew Prescott (2012) argued that in most humanities software building DH researchers seemed to be uncomfortably in the back seat. Most non-digital-humanities PIs seem to regard developing software as a support act without intrinsic scientific merit; Prescott used the word ‘donkeywork’ to express what he generally experienced humanities researchers were thinking of software development. Prescott reasoned that as long as digital humanities researchers were not in the driver’s seat, DH would remain a field lacking an intellectual agenda of its own.

I agree: in a service or support role neither DH nor coding will ever develop their full intellectual potential for the humanities. As long as it is donkeywork, it will be a mere re-expression and remediation of what went before. The problem there is that the donkey has to cast his or her epistemic phenomenology towards the concepts and relations of the phenomenology of the humanities PI. In such casting there will be mismatches and unrealised possibilities for modeling the domain, the problem, the data, and the relations between them. It is quite literally like a translation, but a warped and skewed one—like what would result if the PI were to request a German translation of his English text but required it to be written according to English syntax, ignoring the partial incommensurability of semantic items like ‘Dasein’ and ‘being’. Or compare it to commissioning a painting from Van Gogh but requiring it be done in the style of Rembrandt. The result would be interesting no doubt, but neither something that would satisfy the patron nor the artist. The benefactor would get a not-quite-proper Rembrandt. And, more essential for the argument here, the artist under these restrictions would not be able to develop his own language of forms and style. He would be severely hampered in his expression and interpretation.

This discrepancy between the contexts of interpretation through code and through humanistic inquiry is, I think, reflected in the way DH-ers tend to talk about their analytical methods as two separate realms. The best known of these metaphors is the contrast between ‘close’ and ‘distant’ reading, initiated by the works of Franco Moretti (2013). Ramsay (2011b) and Kirschenbaum (2008) also clearly differentiate between two levels or modes of analysis: one a micro perspective, the other operating within a macro-level scope. Kirschenbaum described the switching from computational analysis of bit-level data to putting up a meaningful perspective on the hermeneutic level of a synthesis narrative as “shuttling” back and forth between micro and macro modes of analysis. Martin Mueller (2012) in turn wished for an approach of “scalable reading” that would make this switching between ‘close’ and ‘distant’ forms of analysis less hard, the shuttling more seamless.

We have microscopes and telescopes; what we lack is a tele-zoom lens—a way of seamlessly connecting the close with the distant. Without it these modes of analysis will stay well apart, because the ‘scientistic’ view of computer analysis as objective forsakes the rich tradition of humanistic inquiry, as Hayles (2012) remarks. Distant reading as analytic coding does gear towards an intellectual deployment of code (Ramsay 2011b). But the analytic reach of quantitative approaches is still quite unimpressive. I say this while stressing that this is not the same as ‘inadequate’; I dare bet there is beef in our methods (Scheinfeldt 2010). But although we count words in ever more subtle statistical ways to analyse, for instance, style, the reductive nature of these methods seems to kill the noise that is often relevant to much scholarly research (McGann 2015). For now it remains striking that the results of these approaches are confirmation oriented more than productive of new questions or hypotheses; mostly they seem to reiterate well-known hypotheses. Nevertheless, current examples of computational analyses could very well be the baby steps on a road towards a data-driven approach to humanities research.

Thus if there is intellectual merit in a non-service role of code, why do the realms of coding and humanistic inquiry stay so well apart as they seem to do? Let us for a moment pass by the easy arguments that are all too often just there to serve the agenda of some sub-cultures in the humanities community. It is not a lack of transferable skills: I can teach 10-year-old girls HTML in 30 minutes; everyone can learn to code. It is not an inherently conservative and technology-averse nature of the humanities (Courant 2006). Like any community, the humanities has its conservative pockets and its idealist innovators. No, somehow the problem lies with computation and coding itself. Apparently we have not yet found the principles and form of computing that allow it to treat the complex nature of noisy humanities data and the even more complex nature of humanities’ abductive reasoning—that is, reasoning based more on what is plausible than on what is provable or solvable as an equation. The humanities are about problematising what we see, feel, and experience; about creating various and diverse perspectives, so that the one interpretation can be compared to the other, enriching us with various informed views. Such various but differing views and interpretations are a type of knowledge too, albeit a different kind of knowledge than that which results from quantification (Ramsay 2011b:7). These views acquire a scholarly or scientific status once they are rigorously tried, debated, and peer reviewed.

One of the aspects that sets humanities arguments apart from other types of scientific reasoning and analysis is their strong relation to, and reliance on, narrative. Narrative is the glue of the humanities’ abductive logic. But code has narratological aspects too: as Donald Knuth has argued, there is a literacy of code (Knuth 1984). Most humanities scholars are quite literally illiterate in this realm. Yet many of the illiterate claim intellectual primacy over code-reliant research in the humanities. But to create an adequate intellectual narrative you need to be well versed in the language you are using: you must be literate. I am not a tenured professor of digital humanities, but just the same I dare posit that you cannot wield code as an intellectual tool if you are not literate in it.

Does this mean that the realms of humanities-oriented computation and of humanistic abductive inquiry must stay apart? No. It means that non-code-literate humanists should grant those literate in both code and the humanities the time and space to develop the intellectual agenda of code in the humanities. But at the same time, those literate in code should reflect on their mimicry of a ‘scientistic’ empiricism. The intellectual agenda of the humanities is not to plough aimlessly through ever more data. Number crunching is a mere base prerequisite, even within its own narrow understanding of scientific style. Only when we get to making sense of these numbers, to applying interpretation to them, do we unleash the full power of the humanistic tradition. And making sense is all about building meaningful perspectives through the creation of narratives. The computationally literate in the humanities need to figure out the intellectual agenda of digital humanities, and they need to develop their own style of scientific and intellectual narrative that connects it to the mainstream intellectual effort of the humanities.

With all this in mind it is encouraging to learn that the Jupyter Notebook project has acquired substantial funding for further development (Perez 2015). We do not have that dreamed-of tele-zoom, that scalable mode of reading. But Jupyter Notebooks may well be an ingredient of the glue needed to link the intellectual effort of humanities coding to mainstream humanities discourse. These Notebooks started out as a tool for the interactive teaching of Python coding. The IPython Notebooks developed into the language-agnostic Jupyter Notebooks that allow the mixing of computer and human language narrative. In Jupyter Notebooks, text and code integrate to clarify and support each other; the performative aspects of code and text are bundled to express the intellectual merit of both. Fernando Perez and Brian Granger (2015) built their funding proposal strongly around the concept of computational narrative: “Computers are good at consuming, producing and processing data. Humans, on the other hand, process the world through narratives. Thus, in order for data, and the computations that process and visualize that data, to be useful for humans, they must be embedded into a narrative—a computational narrative—that tells a story for a particular audience and context.”
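The flavour of such a computational narrative can be suggested even outside a notebook: a fragment in which the prose (here reduced to comments, with invented numbers) carries the argument and the computation merely backs it up—exactly the text-and-code interleaving a notebook makes first-class.

```python
# A toy computational narrative. Suppose we ask a classic stylometric
# question: do two authors differ in average sentence length? The
# sentence lengths below are invented for the sake of illustration.

sentence_lengths_a = [12, 15, 11, 14, 13]   # words per sentence, author A
sentence_lengths_b = [22, 25, 19, 24, 21]   # words per sentence, author B

# The computation itself is trivial; it is the narrative around it that
# turns two numbers into an argument about style.
mean_a = sum(sentence_lengths_a) / len(sentence_lengths_a)
mean_b = sum(sentence_lengths_b) / len(sentence_lengths_b)

print(f"Author A averages {mean_a} words per sentence, author B {mean_b}.")
# → Author A averages 13.0 words per sentence, author B 22.2.
```

In a notebook the comments would be proper prose cells, citations and all, and the reader could rerun the evidence while following the argument.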

Hopefully Jupyter Notebooks will be part of a levelling of the playing field for both narratively inclined and computationally oriented humanities scholars. Hopefully they will become a true middle ground where computational and humanistic narrative can meet, mix, and grow from a methodological pidgin into a mature new semiotic system for humanistic intellectual inquiry.



Courant, P.N. et al., 2006. Our Cultural Commonwealth: The report of the American Council of Learned Societies’ Commission on Cyberinfrastructure for Humanities and Social Sciences. University of Southern California.

Hayles, N.K., 2012. How We Think: Digital Media and Contemporary Technogenesis, Chicago (US): University of Chicago Press.

Kirschenbaum, M., 2008. Mechanisms: New Media and the Forensic Imagination, Cambridge (US): MIT Press.

Knuth, D.E., 1984. Literate Programming. The Computer Journal, 27(1), pp.97–111.

McGann, J., 2015. Truth and Method: Humanities Scholarship as a Science of Exceptions. Interdisciplinary Science Reviews, 40(2), pp.204–218.

Moretti, F., 2013. Distant Reading, London: Verso.

Mueller, M., 2012. Scalable Reading. Scalable Reading—dedicated to DATA: digitally assisted text analysis. Available at: [Accessed September 22, 2015].

Perez, F., 2015. New funding for Jupyter. Project Jupyter: Interactive Computing. Available at: [Accessed October 1, 2015].

Perez, F. & Granger, B.E., 2015. Project Jupyter: Computational Narratives as the Engine of Collaborative Data Science. Project Jupyter: Interactive Computing. Available at: [Accessed October 1, 2015].

Prescott, A., 2012. To Code or Not to Code? Digital Riffs: extemporisations, excursions, and explorations in the digital humanities. Available at: [Accessed October 1, 2015].

Ramsay, S., 2011a. On Building. Stephen Ramsay — Blog. Available at:

Ramsay, S., 2011b. Reading Machines: Toward an Algorithmic Criticism (Topics in the Digital Humanities), Chicago (US): University of Illinois Press.

Scheinfeldt, T., 2010. Where’s the Beef? Does Digital Humanities Have to Answer Questions? Found History. Available at: [Accessed October 1, 2015].