
Careful what you wish for

At the Social, Digital, Scholarly Editing 2013 conference Peter Robinson made a statement that he repeated a week later at the Digital Humanities 2013 conference. This statement is going to be taken so far out of context, and misused, abused, perverted, and corrupted so thoroughly, that we had better deal with it now.

What Peter said was: “Digital humanists should get out of textual scholarship; and if they don’t, textual scholars should throw them out”. (http://tinyurl.com/mjvf7vj)

This is going to be taken out of context by clueless humanities scholars who hold that there is no role for the digital medium or for computing in the humanities. It will be gratefully seized upon by even more clueless policy makers to cut research possibilities, collaboration, and funding for humanities and computer science researchers.

That is not what Peter was after.

Peter, as far as I understood, first of all means that textual scholars should in the near future be sufficiently computer literate not to need the support of digital humanists for every digitally enhanced task. Scholarly tasks like putting a facsimile online, transcribing a text, annotating it, and publishing the whole as a digital edition should be second nature, and digital humanists should not have to be pestered to support that on any level —that we should dispense with the well-known project team triad of scholar, digital humanist, and developer. Peter presumes that, as we speak, the tools to support these tasks have grown so generic and usable that this really should not be a digital humanities research issue anymore.

Peter did not say: we do not need digital humanists. He also did not say: we do not need tool building. He did imply that tool building in digital textual scholarship is by and large done, and that the textual scholarly community should just use those tools and return to scholarly editing full stop.

Peter is wrong.

Judging from the tools Peter has been building, he believes that a textual scholar’s tool set in the digital environment should consist of a means to discover and store base materials such as images and draft transcriptions. There ought to be a tool for transcribing those, and tools for annotating the transcription. He emphasized in both talks that the result of that workflow should be textual data. In Peter’s particular vision of a tool environment any textual resource is available for discovery and reusable under a CC license. Given robust and generic tools, editors and researchers can transform discoverable data according to particular hierarchies, using likewise stable computer logic, to answer various scholarly needs. For instance a scholar who wants to consider the document structure can reproduce that particular XML hierarchy; another scholar could request a different hierarchy for visualization, and so on.
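To make the idea of "one text, many hierarchies" a little more tangible, here is a minimal sketch. Everything in it (the element names, the toy token list) is invented purely for illustration; it simply serializes the same token stream once along the document structure of pages and once along a verse structure of lines.

```python
# Illustrative sketch only: the same token stream rendered under two different
# XML hierarchies. Element names and the tiny sample are invented.
import xml.etree.ElementTree as ET

# A minimal stand-off-ish representation: each token knows the page and the
# verse line it belongs to.
tokens = [
    {"text": "Sing", "page": "1r", "line": 1},
    {"text": "goddess", "page": "1r", "line": 1},
    {"text": "the", "page": "1v", "line": 2},
    {"text": "wrath", "page": "1v", "line": 2},
]

def by_document_structure(tokens):
    """Group tokens by physical page (a 'document' hierarchy)."""
    root = ET.Element("document")
    for tok in tokens:
        page = root.find(f"./page[@n='{tok['page']}']")
        if page is None:
            page = ET.SubElement(root, "page", n=tok["page"])
        ET.SubElement(page, "w").text = tok["text"]
    return root

def by_verse_structure(tokens):
    """Group the same tokens by verse line (a 'textual' hierarchy)."""
    root = ET.Element("text")
    for tok in tokens:
        line = root.find(f"./l[@n='{tok['line']}']")
        if line is None:
            line = ET.SubElement(root, "l", n=str(tok["line"]))
        ET.SubElement(line, "w").text = tok["text"]
    return root

print(ET.tostring(by_document_structure(tokens), encoding="unicode"))
print(ET.tostring(by_verse_structure(tokens), encoding="unicode"))
```

The data stays the same; only the hierarchy requested by the scholar changes, which is exactly the kind of transformation Peter has in mind.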

Peter’s description of the digital ecosystem surrounding the digital scholarly edition stresses the separation of interest between text as (discoverable, online) data and text as publication. This is in my view a very useful distinction, and one that Patrick Sahle has also pointed out in his recent thesis (http://tinyurl.com/mfoucoh). This separation of concerns ensures that a text is available in a repository in a basic format so that it can be further used and processed by other tools tied to other scholarly tasks, however divergent these uses may be. Peter therefore emphasized the pivotal importance of API access to digital texts, because “Your interface is everybody else’s enemy”.

I think Peter is absolutely right there. Offering an API (http://tinyurl.com/oosk5) to your web-based text delivers it to us for any conceivable further use. This might be annotating and interpreting it. It might be analyzing it algorithmically. It might be producing a simple ePub to give a larger audience access to a particular interpretation of the work by a particular scholar.
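As a hedged sketch of what that reuse looks like in practice: the endpoint URL and the JSON field below are entirely hypothetical, the point is only that a plain HTTP interface lets anyone pull the text straight into their own workflow.

```python
# Hypothetical example of consuming a text repository's API. The URL and the
# "plain_text" field are assumptions made up for this sketch.
from collections import Counter

import requests

API_URL = "https://repository.example.org/api/texts/my-edition"  # hypothetical

response = requests.get(API_URL, params={"format": "json"}, timeout=30)
response.raise_for_status()
text = response.json()["plain_text"]  # assumed field name

# One "conceivable further use": a crude word-frequency profile of the edition.
words = [w.strip(".,;:!?").lower() for w in text.split()]
print(Counter(words).most_common(10))
```

An ePub pipeline, an annotation client, or a collation service could consume the same endpoint in exactly the same way, without ever touching the edition’s own interface.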

But Peter is wrong in presuming we can serve all possible uses with our current tools. Only the tools to create a digital edition as we currently (sort of) understand it are done. And really, they are not even entirely done. They are for the most part rickety in terms of robustness, model capabilities, sustainability, interoperability, community support, institutional support, open access, etc. Peter is proclaiming victory over a problem that we have hardly begun to uncover. A little too soon, a little too ferociously.

What is more important: Peter is only right if you agree that there is exactly one way to describe textual data, that there is only one model for digital text —i.e. overlaid structural hierarchies— and only if you hold that we do not want to record our interactions with text as new digital scholarly data to be added to a shared network of digital scholarly data, only if you believe we will not discover new uses and applications of digital text, if you accept that our tools need be no more than taped-and-wired-together improvised contraptions, and if you agree that there is such a thing as a generic tool in the first place… only then is Peter right.

To say that digital humanists can now leave the digital textual scholarly arena to concentrate on more interesting work is to abandon a computational and intellectual effort at exactly the wrong time. Now is as crucial a time as there ever was to apply new models to text. It seems to me that what Peter fails to see is that the digital environment is a highly intellectually creative and prolific one. It is an environment that allows us to express a multitude of forms of digital textual scholarship and of intellectual argument about text in tools and interfaces. Current models inspire new thinking about engaging with text through new models. Rethinking our models, creating and recreating text and text interaction based on new models, is both an explorative and an intellectual task, essential to rethinking how we work with and experience text in the digital environment, which by now is more a part of our lives than the automobile.

But remodeling requires an evolution of tools and interfaces, a perpetual cycle of invention and reinvention, which is as much an expression of intellectual and scholarly engagement with text as it is the interface to a scholarly publication. The interaction between scholars, digital humanists, computer science researchers, and developers is pivotal to this endeavor. Severing the tie of collaboration between digital humanists and textual scholarship would be tantamount to killing one of the most promising fields of research. And in its wake it would kill textual scholarship too, for textual scholarship without the digital paradigm has meanwhile become rather unsustainable. And this is the paradox: textual scholarship that ignores the digital paradigm has little future, but the textual scholarship community is at this point in time certainly not computer literate enough to meet its own digital needs.

Peter provided Luddites and conservatives with the perfect ammunition for another round of paradigmatic regression. Statements like these are mostly only useful for pushing everybody back into their comfort-zone boxes. But it is in the exploration of new tools and possibilities, in the collaboration on thinking up and creating what is nearly possible and not yet possible at all, that we uncover new ways of thinking about text: what it means, how it behaves, what its uses are.

Our texts and that which allows them to interact with us –that is to say our tools– are not static. To proclaim that the development of the digital scholarly edition is done is to say that there is no further point in researching and exploring text in the digital environment.

Nederlab: Concerns from the Research Perspective

Nederlab is a recently awarded Dutch Science Foundation large infrastructure investment. The successful proposal for a 2.4 million euro subsidy was carried by an impressive consortium of leading researchers in the fields of linguistics, history, and literary scholarship. Reading the proposal, however, I find several serious issues that may cripple the project from the start. I think both the researchers and the technicians involved should tackle these issues now rather than later.

Nederlab according to the proposal should be a “laboratory for research on the patterns of change in the Dutch language and culture” (Nederlab 2011, p.2). “Scholars in the humanities – linguists, literary scholars, historians – try to understand processes of change and variation […]. To study these processes, large quantities of data are needed […].” The goal of Nederlab is “to enable scholars in the humanities to find answers to new, longitudinal research questions.” “Until now […] research has of necessity mainly consisted of case studies detailing comparatively brief periods of time. As more historical texts become available in digital form, new longitudinal research questions emerge, and systematic research on the interaction of changes in culture, society, literature and language can be mooted.” Let’s recap this in a bulleted list. Nederlab…

  • is for linguists, literary scholars, and historians
  • is meant to support longitudinal humanities research
  • uses detection of patterns in language and culture
  • relies on the historical texts that have become available

The subsidy application continues: “From an international perspective, scholars of Dutch history, literature and language are in an excellent position to develop new perspectives on topics as the ones described above. The Netherlands is a central player in the international eHumanities debate, in particular with respect to infrastructure and research tools (cf. CLARIN and DARIAH).”

So we add to our list of Nederlab characteristics:

  • relies on CLARIN and DARIAH related infrastructure and tools

The main body of the plan describes research cases and questions that could be answered given a historical corpus spanning ten centuries. Most of the use cases focus on the post-18th-century period. Most are linguistics-oriented, or use linguistic parsing as a means to infer observations from digital data. So we again expand our list:

  • focuses on post-1700 use cases and material
  • relies on automated parsing and tagging for its research paradigm

The Nederlab application then also states a contingency strategy: “There is heterogeneity in the data: a large slice of the corpus is of poor quality since it consists of poor OCR. In compensation, a number of representative high-quality corpora will directly be incorporated as core corpora in the demonstrator. Data curation will gradually revalue the poor parts of the corpus, a necessary measure as weaker data is difficult to access by tools. The problem can be mitigated by designing the tools in such a way that they bypass the problems of weaker data, and by enabling researchers to compose subsets of the corpus from which weaker data are barred.” (Nederlab 2011, p.27) Thus we can also add to our list:

  • uses a historical corpus that is for a large part of poor quality
  • cannot use automated tools to remedy this (or only to a very limited extent)
  • will rely on manual curation to remedy this over time
  • ignores poor quality data (“weaker data are barred”)

Now, let’s combine these characteristics with what we know from experience about the current state of digital affairs in humanities research from a longitudinal perspective. First, let’s take a look at how representative the sources are that we have. The farther we move back in time, the less coverage we have (lots of sources got lost). The sources we do have are luckily somewhat representative (they are a good enough ‘sample’ of what went on), but their quality also gets poorer and poorer the farther we move back in time. Simply put, the availability of our material is skewed towards roughly post-1800 material. This means that the material available, and thus potentially usable for diachronic research, is in itself biased towards the post-1800 period.

Now let’s look at the level of curation and curatability. Here the situation is even more skewed. Automated tools for scanning, OCR-ing, and POS-tagging are only effective on more recent material. Roughly speaking, material from before 1800 needs a costly and labour-intensive manual effort to be digitized: first because of the form, format, and language (spelling and orthography) of the documents, and secondly because the form and amount of material do not allow for automatic parsing. Patterns of culture cannot currently be detected in older material by a purely automated approach. The common linguistics strategy of auto-tag, then reduce noise, then infer patterns does not work here. Instead a strategy of manually annotate, curate noise into data, and then infer is needed; and even if we want to apply the quicker automated strategy, we need training corpora for it… which need to be manually curated.

So the conclusion from a research perspective must be that the material we have, or can fairly easily curate, ‘favors’ later-period (post-1800) research. Let’s take a look at the composition of our research audience over the time period:

Historians are interested in (surprise?) history; linguists are predominantly interested in reasonably recent affairs of language, which follows logically from the fact that data must be readily available to do any meaningful empirical linguistic research. Literary scholars have a fairly evenly distributed interest through time, but they depend on available material, which means they only come into ‘full swing’ after roughly 1200 AD. In any case, the available material and period of study again seem to skew towards the linguists.

How is the tool situation? Given that commercially employed software engineers need to focus on added economic value, language tools and pattern-recognition tools have been blueprinted for current-era human language and culture artifacts (texts, audio, film, etc.). Almost no editing, capturing, or annotation software is readily applicable, or even available, for older language periods. In part this is also why the predominant ‘digital paradigm’ of humanities work on the older language phases still relies on manual tag, link, and infer strategies: no serious effort in automated recognition, capturing, or tagging of older language has ever been undertaken. This means there are plenty of automated taggers and parsers… for modern Dutch. For older Dutch there are at most some suitable transcription tools (for instance eLaborate). The tool categories listed in the proposal (tokenization, spelling normalization, part-of-speech tagging, lemmatization) (Nederlab 2011, p.7) are effectively available only for roughly post-1700 material.
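To make those tool categories concrete, here is a minimal sketch of such a pipeline applied to modern Dutch. spaCy and its small Dutch model are used here purely as an example (the proposal does not prescribe any library), and the "older" sentence is an invented pseudo-historical spelling; the point is simply that a model trained on modern word forms degrades quickly on pre-1700 material.

```python
# Minimal sketch of tokenization, POS-tagging, and lemmatization on modern
# Dutch. Requires: pip install spacy && python -m spacy download nl_core_news_sm
import spacy

nlp = spacy.load("nl_core_news_sm")

modern = "De kat zat gisteren op de mat."
for token in nlp(modern):
    print(token.text, token.pos_, token.lemma_)

# The same pipeline on invented pseudo-historical spelling: the model has never
# seen these word forms, so tags and lemmata become unreliable, which is the
# point made above about pre-1700 material.
older = "Die catte sat ghisteren op die matte."
for token in nlp(older):
    print(token.text, token.pos_, token.lemma_)
```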

The paradigm issue is of particular interest here. The more statistics-oriented paradigm of linguistics is facilitated because automatic formalization of modern language material is highly feasible. Few doubt that such a paradigm will also be useful for the (pattern) analysis of historical material by historians and literary scholars. But we utterly lack the tools, and the knowledge of which properties should be formalized. Nederlab actually provides an excellent occasion to study this problem and make significant headway in these types of formalization.

However, it will by now be clear that in terms of research interest, available digital material, prospects for digitization, and the availability and applicability of existing tools, the situation highly favors modern-age linguistics research. The older phases have problems with the amount of available sources and data, with the ability to digitize data because of the heterogeneity of form and format of the sources, with the ability to apply automated analysis to the data, and with the prospect of tools being fitted for such analysis. In short: as far as the technical solutions are concerned, the problems and difficulties predominantly lie with the older language phases. This is where we need an extra effort to resolve the situation, so that computational approaches can be applied and the full advantage of a diachronic corpus can be realized. The contingency plan of the proposal, however, rather suggests that we are to expect this:

The proposal states in no uncertain terms that the contingency plan is to ignore poor-quality data. This strategy, applied to just about any data from before 1800, will result in no data, or far too little data, for any useful automated pattern analysis.

The major advantage, headway, and opportunity the Nederlab project could create is the ability to digitally manipulate and analyze a true diachronic corpus with appropriate tools. The way things are formulated now, however, it rather looks like we will end up with yet another 19th/20th-century-focused linguistics sandbox. That would be a major missed opportunity. From the run-up to and preparation of the proposal I personally know that minds were aligned and that the overall intention indeed was to make a decisive push for longitudinal resources for longitudinal language and culture research. To make this happen we need to focus on the left side of the sketch-graphs I drew here, that is, on the older periods; it is pretty clear that the right side, the modern period, is already well supported. This is therefore a call to arms for all researchers and developers involved to put the focus where it ought to be, and not on the low-hanging fruit.

References
(Nederlab 2011) Nederlab: Laboratory for research on the patterns of change in the Dutch language and culture. Application for Investment Subsidy NWO Large. Meertens Institute, Amsterdam/The Hague, October 2011. http://www.nederlab.nl/docs/Nederlab_NWO_Groot_English_aanvraagformulier.pdf (accessed Sunday 14 October 2012).

The human face of interoperability

On 19 and 20 March 2012 a symposium in the context of Interedition was held at the Huygens Institute for the History of the Netherlands. Alongside excellent papers presented by, for instance, Gregor Middell (http://bit.ly/KTBJp9) and Stan Ruecker (http://bit.ly/KTBVVm), a number of development projects were presented that showed very well what Interedition is all about. Interedition promotes and advances the interoperability of tools used in Digital Textual Scholarship. A nice example was given by Tara Andrews (http://bit.ly/LjhKi6) and Troy Griffiths (http://bit.ly/NW1WCC). They showed how the same software component was used in separate local projects to realize various tool sets for stemmatological analysis, transcription, and regularization of variant texts. The tool that they have developed –in collaboration with me– enables textual scholars to correct and adapt the results of a text collation through an interactive graph visualization (cf. Fig.1). I think their example showed well how collaboration can lead to efficient tool development and to the reuse of digital solutions.

At the same time, the collaborative work on this nifty graph-based regularization tool might be a good example of why we need to interrogate some of the basic assumptions we tend to have about interoperability. Before getting to that, though (because at its core Interedition is about interoperability), let me first explain what my understanding of interoperability is. Not as trivial a task as one might expect. I think it was back in 2008 that I gave a ‘grab bag’ lecture on Interedition. This is the type of lecture where you list a large number of topics and the number of minutes you could talk about each, and the audience picks their favorite selection. A little disappointingly, at the time nobody was interested in interoperability –I tend to think because no one in the audience knew what the term even meant.

Interoperability (http://en.wikipedia.org/wiki/Interoperability) can be defined along various lines, but simply put it is the ability of computer programs to share their information and to use each other’s functionality. A familiar example is mail access. These days it doesn’t matter whether I check my mail using GMail, Outlook, Thunderbird, the built-in mail client of my smartphone, or the iPad. All those programs and devices present the same mail data from the same mail server, and all of them can send mail through that server. The advantage is that everybody can work with the program he or she is most comfortable with, while the mail still keeps arriving. In short, interoperability ensures that we can use data any time, any place, so people can conveniently collaborate.
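A tiny sketch of why the mail example works as an interoperability story: every client speaks the same protocol (IMAP here) to the same server, so the data is not tied to any one program. The server name and credentials below are placeholders, of course.

```python
# Placeholder host and credentials; any IMAP-speaking client could read the
# same mailbox, which is the whole point of the protocol.
import imaplib

with imaplib.IMAP4_SSL("imap.example.org") as mailbox:      # placeholder host
    mailbox.login("user@example.org", "app-password")        # placeholder creds
    mailbox.select("INBOX", readonly=True)
    status, data = mailbox.search(None, "ALL")
    print(f"{len(data[0].split())} messages, readable by any IMAP client")
```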

That it is about using the same data any time, any place probably inspired the common assumption that interoperability is a matter of data format standards: pick one, and all technology will work together nicely. That, of course, is only very partially true. The format of data does indeed say something about its potential for exchange and multi-channeled use. But mostly, format says something about applicability. “Nice that you marked up every stanza in that text file, but what I really needed was all the pronouns…” Standardized formats unfortunately don’t necessarily add up to data being useful for other applications. For that, it turns out to be more important that computer (or digital) processes, rather than data, are exchangeable. Within Interedition we were therefore interested in the ability to exchange functionality. Could we re-utilize the processes we run on data for others, in ways that are more generalizable? Apart from all the research possibilities that would arise from this, there was also a very pragmatic rationale for the re-utilization of digital processes: with so few people in the field of Digital Humanities actually able to create new digital functionality, it would be a pity to let them invent the same wheels independently.

In practice, to reach interoperable solutions it is paramount to establish how digital functionality in humanities research can be crossbred, so that the same code is reused for several processes: “Where in your automated text collation workflow can my spelling normalizer tool be utilized?” Most essentially, that is a question about our ability to collaborate. Re-utilization and interoperability are driven by collaboration between people. Interoperability in that sense becomes a matter of collaboratively formalizing and building discrete components of digital workflows, rather than of selecting a few standards. Standards are abundantly available anyway (http://www.dlib.indiana.edu/~jenlrile/metadatamap/). Connectable and reusable workflow components, not so much.
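As a toy illustration of what such a connectable workflow component might look like: the normalization table, the witnesses, and the deliberately naive collation stand-in are all invented for this sketch; a real project would plug the same normalizer into a proper collation tool.

```python
# A spelling normalizer written as a plain function, so that any collation
# (or other) workflow can slot it in as a reusable component.
NORMALIZATION_TABLE = {"giue": "give", "ye": "the", "loue": "love", "vnto": "unto"}

def normalize(token: str) -> str:
    """One reusable step: map historical spellings onto modern forms."""
    return NORMALIZATION_TABLE.get(token.lower(), token.lower())

def naive_collate(witness_a: list[str], witness_b: list[str]) -> list[tuple[str, str]]:
    """A deliberately naive stand-in for a real collation tool: pair tokens
    position by position and report where they differ."""
    return [(a, b) for a, b in zip(witness_a, witness_b) if a != b]

witness_a = "Giue ye loue vnto him".split()
witness_b = "Give the love unto them".split()

# Without normalization every token pair counts as a variant...
print(naive_collate(witness_a, witness_b))
# ...with the shared component plugged in, only the substantive variant remains.
print(naive_collate([normalize(t) for t in witness_a],
                    [normalize(t) for t in witness_b]))
```

The design point is not the collation itself but the seam: because the normalizer is a discrete, well-described step, my workflow and yours can share it instead of each reinventing it.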

I therefore think that the most important lesson learned from Interedition so far is that interoperability is not essentially about tools and data, or about tools over data. It is about humans and interaction over tools and data. Interedition succeeded in gathering a unique group of developers and researchers who were prepared and daring enough to go as far as it would take to understand each other’s languages, aims, and methods. That attitude generated new, useful, and innovative co-operations. The successful development of shared and interoperable software was almost a mere side effect of that intent to understand and collaborate. Stimulating and supporting a creative community full of initiative turns out to be pivotal –far more so than maintaining any technical focus. Thus interoperability is foremost about the interoperability of people. The fact that it is the white ravens –those who combine research and coding skills in one person– who push the boundaries of Digital Humanities time and time again is the ultimate proof of that.

(This is a translation of a blog post that appeared originally in Dutch on http://www.huygens.knaw.nl/de-menselijke-kant-van-interoperabiliteit/#more-4377.)

Bring out yer dead!

Tristan Louis blogged about how we are killing the Internet by moving our data from discoverable, networked HTTP spaces to closed-off databases at Facebook and in apps on Android and Apple devices. And sure enough, the big dot-com companies are playing a serious game of land grab. Currently that game is played with applications and data. With those two elements the dot-coms form silos within the Internet, and we can’t play nice with them in the open data sense. However, data that is entered into an application by a user is of course morally still the intellectual property of that user, no matter what infamous privacy statements may proclaim.

To save the Internet, Louis suggests that we shouldn’t use apps that close data off from the wider community. But that is at odds with an Internet that increasingly extends into hardware gadgets like iPads and smartphones and into reality-augmenting apps. Keeping the Internet an open place should not be pursued by boycotting the Internet of Things. The problem here is not caused by the Internet moving into new realms and functionalities, as Louis argues. It is rather the current inability to openly share the data of ‘Things and Apps’.

Technically, fundamentally nothing impedes the open sharing of data. Many apps actually do share their user data happily –through Dropbox, for instance. But undeniably, big companies like Facebook and Google are indeed founded on the digital gold represented by user data. Companies are mainly interested in user tracking data, that is, the statistical analysis data that tells them something about the behavior of users. The other part –the user generated content– is of course interesting too, but mainly as the analysis base for tracking and profiling. User profile data and user generated content are separable streams of data within any app. Thus it should be possible to expose user generated content over standard HTTP protocols so that it is discoverable and shareable. And in fact there are initiatives in the open source realm already creating blueprint technology for the fair sharing of user content, such as Unhosted. User tracking data is rightfully the property of Internet companies; if users don’t want that, they should not use the applications of such companies. In any case, a clean separation between the two types of data is therefore possible.
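As a sketch of that separation (the routes, field names, and in-memory stores are invented for illustration, and Flask is merely a convenient example): user generated content can be served over plain HTTP, while tracking data never leaves the company.

```python
# Hypothetical minimal service: user content is exposed and shareable,
# tracking/analytics data stays internal.
from flask import Flask, jsonify, abort

app = Flask(__name__)

# Two deliberately separate stores (invented sample data).
USER_CONTENT = {"alice": [{"id": 1, "text": "My holiday notes", "license": "CC-BY"}]}
TRACKING_DATA = {"alice": {"sessions": 42, "avg_session_seconds": 310}}  # never exposed

@app.route("/users/<user_id>/content")
def get_user_content(user_id):
    """Serve only the user's own content, discoverable and shareable over HTTP."""
    if user_id not in USER_CONTENT:
        abort(404)
    return jsonify(USER_CONTENT[user_id])

if __name__ == "__main__":
    app.run(port=5000)
```

The point of the design is the boundary: there is simply no route that serves the tracking store, while the content store is reachable by anyone, or any other app, that speaks HTTP.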

A problem, of course, is that there is no real intrinsic reason for apps and Facebook-like Internet applications to serve user generated content openly, because doing so requires implementing extra services. Moreover, dot-com companies will still tend to think of the user data as the gold of their business model. But as said, in essence the gold is not the user content as such but the profiles derived from it. An intrinsic motivation might at some point be found with one or two start-ups that don’t mind sharing their user content in order to rapidly grow their content base, so as to become competitive with well-established Internet businesses faster.

We’re also looking –I tend to think– at a typical situation where legislation might enhance democratic behavior. I am as cautious about suggesting legislation as about the use of medicine: if one can do without, please do. But in this case we might have an actual need for legislation. Companies have been successful for decades in limiting the use of data through copyright law. Such laws were not even so much in the interest of actual artists as in the interest of people who derived money from the work of artists, dead or alive. Some initiative to also protect the right of ordinary citizens to keep their data from being closed off by companies would be opportune. Much currently sought legislation is indeed moving in that direction.

Open data and the lawful confirmation of the right to guard one’s own data are part of bringing about a smarter society that respects the needs of both companies and citizens.

(This post originally started as a comment on ‘I killed the Internet’ at TextualScholarship.nl but it grew into its own blogpost size. The title of this blogpost is of course a reference to the famous Monty Python scene…)