Being a Critical Journalist in Digital Times

Geert Lovink, who does wonderful work at the Institute of Network Cultures, yesterday tweeted a cry of horror on finding out via TechCrunch that Google is funding the development of software that writes local news stories. Media the world over have parroted the same news, which seems largely based on a press release from the UK Press Association. The parroting in itself is an indicator of the dire situation in journalism, where uncritically posting press releases has become a stand-in for actual in-depth and well-researched coverage. Those who at least attempted a stab at a perspective mostly seem to have stuck to the hackneyed criticism that Google funds the development of robot journalists that will put human journalists out of a job: “Journalists, look out: Google is funding the rise of the AI news machine”.

In reality things probably move both faster and slower. Let me explain that.

Having been in the midst of my own little media tempest in a teacup, I tend to think that press releases and the follow-up roar in diverse media are as much ‘alternative fact’ as they are not. My case also involved a director of a publishing house announcing that we as researchers were “ragingly enthusiastic” about the effectiveness with which we are able to predict best sellers based on deep learning methods. I think I (which is “the researchers” in the story) said results were encouraging or some such; how that turned into “ragingly enthusiastic” remains a mystery to me too.

What I did in any case was not exactly rocket science in the realm of machine learning. Using the well-established open source Python libraries Theano and Keras, I built a straightforward neural network and fed it a Tf·idf matrix of some 250,000 features, derived from a set of 200 published Dutch novels of which sales numbers were known. We were then able to predict the ‘selling capability’ of unseen novels of the same publisher, for which sales numbers were also known, with some 80% accuracy. In plainer English: applying what are by now middle-of-the-road machine learning techniques, we can predict whether novels will sell or not, and eight out of ten times we will be correct.

Machine learning techniques such as deep learning using neural networks can be extremely sensitive to patterns in large data sets, patterns that are too distributed throughout the data for humans to really be able to pick up on them. Given enough training data, such technologies infer models, or sets of features, that will be very good at telling you to what categories unseen examples belong: whether they belong to the category of best sellers or non sellers, for instance. In our case the model picked up on the words that are common to the best sellers of recent years and was able to find matches of such word use in novels it had not been trained on. Like a meticulous clerk with unlimited time, it compared the scores on some 250k variables per novel, averaged the scores, and compared them to those of successfully selling novels. Not exactly rocket science really, just an immense amount of work impossible to pull off in feasible time with mere human capacity. For a well-programmed algorithm, though, it is the work of mere minutes.
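
To make the clerk's routine a bit more tangible, here is a minimal sketch in plain Python. It is emphatically not the Theano and Keras network described above: it swaps in a far simpler nearest-centroid comparison over Tf·idf weights, which happens to follow the "score, average, compare" description quite literally. The four toy "novels", their labels, and all function names are made up for illustration.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Inverse document frequency for every word in the training novels."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / freq) for word, freq in doc_freq.items()}


def tfidf(doc, idf):
    """Tf-idf weights for one tokenised novel; words unseen in training are ignored."""
    counts = Counter(doc)
    return {w: (c / len(doc)) * idf[w] for w, c in counts.items() if w in idf}


def centroid(vectors):
    """Average a list of sparse vectors: the 'averaged scores' of one category."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # Counter.update adds the values per word
    return {w: v / len(vectors) for w, v in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labelled_docs):
    """Learn idf weights and one averaged Tf-idf profile per label."""
    idf = idf_weights([doc for doc, _ in labelled_docs])
    by_label = {}
    for doc, label in labelled_docs:
        by_label.setdefault(label, []).append(tfidf(doc, idf))
    return idf, {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(doc, idf, centroids):
    """Return the label whose averaged profile is most similar to the novel."""
    vec = tfidf(doc, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy data: four "novels" of four words each.
TOY_NOVELS = [
    ("detective murder secret night".split(), "seller"),
    ("murder secret harbour detective".split(), "seller"),
    ("meadow poem silence morning".split(), "non-seller"),
    ("silence meadow river poem".split(), "non-seller"),
]

idf, centroids = train(TOY_NOVELS)
print(predict("detective secret river night".split(), idf, centroids))
```

A real experiment differs in scale rather than in kind: hundreds of novels, hundreds of thousands of features, and a trained network in place of the simple averaging done here.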

Classifiers, as such algorithms are also called, are already crunching real world data all the time. This is what I mean by the ‘faster’ part. In many respects what PA is going to try to do is not that new at all. Stock exchange predictions, flight fuel consumption patterns, internet store customer interest, the likelihood that a person on Facebook will lean towards a certain political conviction, and so forth: all have been measured and predicted by similar algorithms for close to a decade now at least.

The information that citizens experience is more often than not tailored to their needs already by such algorithms. Although many point this out all the time (for instance here, or here, or here, or here), it is somehow still a huge surprise when sometimes the processes that these algorithms support break out of their otherwise mostly covert and invisible existence. To those who are aware of how widespread the application of these algorithms is, it is rather this very surprise that is surprising: you are living in a highly automated information world already. Long time. Better get used to it. Or at least become aware of it. And yes, this also happens in news story generation already, as TechCrunch did not fail to point out. Automated Insights provides this type of service on an impressive industry scale.

The ‘slower’ bit is that press releases like the one from PA systematically overclaim. Although I have to admit this is conjecture, it is unlikely that Google’s 700kEuro+ funding will lead PA and Urbs Media to eradicate local news gathering and publishing. First of all this does not seem to be their aim, but moreover the quality and effects of “a new service” crunching out “up to 30,000 localised stories each month from open data sets” remain to be seen. What are these open data sets, and what will these news stories be? There sure can be sense in informing the public about community level decisions on construction and development based on town council decisions mined from public service databases. Inferring and precisely directing a message like “Town council planning bypass to relieve your neighbourhood of cut-through traffic” can be relevant news to a small number of locals, but could well be too tedious and too low-impact for journalists to have to bother with. Automated services like that might thus actually be well placed.

The big reason why progress in developing such a service will be slow is that the closer you get to the community and the individual, the more heterogeneous and specific news needs and interests become. Inferring automatically that a plan for a motorway bypass will affect people in some area is one thing, but deciding what this means to people is a whole different ball game, one that machine learning is still terribly bad at. What open data sets are you going to use to have some sense of how to frame your news story, which is so important if it is to be tailored to local needs? Are we turning to what is available on the Web, for instance, produced by the community itself? I sincerely doubt that will result in actual “fact-based insights into local communities”, which is the “increasing demand” PA says it targets. These are challenges of automated inference that are not easily solved, resulting in the slower bit: the output and impact of projects like these are usually far more modest (yet still usable) than press releases tend to suggest. A very nice prototype will be derived. Probably, maybe.

Another part of the slower bit is that it is easy enough to generate high level, mainstream interest, stock exchange news items from well formalised statistics, but that it is a lot harder to generate daily real life human interest stories. “First quarter income was reported at 15 million USD” is a sentence easily enough generated from well groomed statistics and databases. Generating “Bob’s farm house store will be closed temporarily in October” is of more immediate interest to some local community, but it involves complexities of automated inference from real world information far beyond current capabilities. Although generating language is getting intriguingly easy (almost as easy as predicting best sellers), it is still a far cry from the heterogeneous specificity that is relevant if you get to local or individual level. There is a meaningful and relevant difference between “Bob’s farm shop will be closed temporarily in October” and “Bob’s farm shop will be closing in October”. The first is an ordinary well formed announcement, the second is a potential source of hazardous assumptions about Bob and his commercial and physical well being. Current deep learning algorithms are rather insensitive to such subtleties and one has little control over whether one or the other will be generated. But such subtleties are what start to matter if you get down to the less formulaic, less high level pattern based, nature of language in local community real life.
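
As a rough illustration of why the stock exchange sentence is the easy case, here is a small sketch using plain templates from the Python standard library. This is a deliberately simple stand-in and not the deep learning generation discussed above; the record fields (`quarter`, `income_millions`, `permanent`, and so on) and both function names are hypothetical.

```python
from string import Template

QUARTERS = {1: "First", 2: "Second", 3: "Third", 4: "Fourth"}

INCOME_TEMPLATE = Template(
    "$quarter quarter income was reported at $amount million $currency"
)


def income_sentence(record):
    """Well groomed statistics map straight onto a fixed sentence."""
    return INCOME_TEMPLATE.substitute(
        quarter=QUARTERS[record["quarter"]],
        amount=record["income_millions"],
        currency=record["currency"],
    )


def closure_sentence(record):
    """The local item hinges on a fact that tidy databases rarely hold."""
    permanent = record.get("permanent")
    if permanent is None:
        # Guessing here is exactly how hazardous assumptions get published.
        raise ValueError("cannot tell a temporary closure from a permanent one")
    verb = "will be closing" if permanent else "will be closed temporarily"
    return f"{record['name']} {verb} in {record['month']}"


print(income_sentence({"quarter": 1, "income_millions": 15, "currency": "USD"}))
print(closure_sentence({"name": "Bob’s farm shop", "permanent": False, "month": "October"}))
```

The template itself is trivial in both cases. The hard part for the local item is obtaining a trustworthy value for something like `permanent` from messy real world information in the first place.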

So in all it will be a while, and I would rather be interested in the results of the project than in shout outs about the potential demise of journalism writ large. At the same time I do not want to downplay the hazardous situation that these technologies may eventually put journalists in. This involves the even harder question of how we wield our digital technology ethically. Yes, it is possible to predict best sellers. Even the simplest possible application of deep learning yields 80% success. As a publisher you would be a rather poor businessman if you did not at least scout out the possible edge this might give you in the publishing industry. But that does not mean you need to do away with your editors. In fact that would be the intellectually poorest and most unethical choice. You can choose to do so, and you can even keep making a profit that way. The downside, however, is that your predictive algorithm will force creative writing into an unpalatable sludge of same-plot-same-style novels. The more clever and ethical option is to use such an algorithm as a support tool to avert the worst potential non sellers. Each averted non seller, even if we get one in five wrong, is money saved to pour into more promising, more interesting projects. Responsibly deployed, this algorithm enhances the ability of a publisher to support new and truly interesting work. An ethical businessman (I am aware this counts as a modern oxymoron) would turn losses prevented into an investment in new interesting literature that deviates from the by now plain old literary thriller genre. In machine learning that is something we are still far worse at than humans: predicting which outlier might actually not be a non seller, but an example of the new, brilliant, different style and narrative that will hit it big.

The rationale for how we choose to wield our technologies comes from critical thinking and reflection, or the lack thereof of course. It is all the more deplorable therefore that PA’s press release resulted in a quite predictable flood of reactions in the “journalists’ jobs are under threat” genre, because it means that journalists mostly did not do their jobs of critical news mongering. If they had, they would not have talked so much about their jobs. For as argued, it will be a while until these are really under threat, if at all. Rather they could have talked about the highly arguable motives of an industry giant like Google being involved in developing algorithms that select and write news items. Do we really believe that these algorithms will be impartial? Will we yet again believe the stories that data science and natural language parsing algorithms are neutral because they are based on mathematics and logic? How likely is it that Google, more specifically its board, will be ethical about deploying these algorithms? Those are the type of questions the press should have set out to answer. Instead of devoting time to being really investigative and critical, they chose to repeat the press release.

So don’t be surprised if you sometime soon read a local announcement: “Bob’s store will be closed temporarily in October, but you can find fine online stores on Google”.

–JZ_20170710_2017