
This Graph is my Graph, this Graph is your Graph

There is no better way to acknowledge that you have arrived as an academic or digital humanist than finding yourself on the receiving end of a class act hatchet job on your work. This year at DH2014 Stefan Jänicke, Annette Geßner, Marco Büchler, and Gerik Scheuermann of Leipzig University’s Image and Signal Processing group and Göttingen’s DH Center presented a paper firmly criticizing the graph layout that CollateX and StemmaWeb apply. Tara Andrews (who is responsible for the programming that makes the backend kick) and I presented some work on graph interactivity for StemmaWeb at DH2013. We have been working on the graph representation and interactivity because it is fun and pretty cool, a primary but often overlooked motivator of much DH work, as Melissa Terras pointed out in her DHBenelux keynote. The academically more serious part is in the interaction between scholars and text. That is for me in any case; for Andrews it is the stemmatological component. I want to know how tools and interfaces do or do not support scholar-text interaction. Hint: GUIs do more harm than good. Mostly I want to know how they, and the code that makes them tick, affect scholarly interpretation. Not many may know this, but I am the one who programmed the graph visualization and interaction in StemmaWeb. So I guess I am entitled to say one or two things on the work of Jänicke et al.

Let me first of all point out that I think their work is a very welcome and timely addition to the thinking and practice of graph interactivity. Not much work has been done on how graphs can be read as an aggregative representation of witness texts. Yet that work is essential, because it pertains to how scholars perceive their material. So the more, the merrier, which is my general attitude anyway. Above and beyond that, it is great to learn that someone is working on an actual JavaScript library for this type of scholarly layout. I think Ronald Dekker, who works on CollateX, is already compiling the information needed to allow TRAViz to interoperate with CollateX.

So much for the good. Now for the bad and the puzzling. Jänicke et al. argue for a number of design rules. Some of these make sense to me, like “Bundle major edges!”. In fact in StemmaWeb we wanted to have weighted edges, we… just didn’t get around to it the first time round. Apparently this feature, now present (the edge width is a function of the number of witnesses), wasn’t there yet when Jänicke et al. checked. The thing is, all the work was done in our copious free time, so weighted edges took a little longer. That brings me to the first puzzlement. It took two computer scientists and two digital humanists to pull off the creation of a set of rules? The StemmaWeb graph visualization and interactivity components came about a little more economically, and with less ceremony in any case. The good part is that apparently there is much programming capacity around. It cannot be stressed enough how valuable it is to have some computing effort going into creating and maintaining a reusable code library specific to this type of visualization. We built the graph interactivity on top of the standard GraphViz layout engine, which is based on a number of default graph design principles. Abstracting that graph interaction and decoupling it as a separate library from the specifics of StemmaWeb logic has been a long-standing wish for us, but given our primary research concerns we never got around to it. This summer we are trying again. Unfortunately I have a suspicion that again other academic and not so academic stuff will get in the way. And oh, did I point out we really seriously value collaboration?
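As an aside, the weighted-edge idea itself is not much code. Below is a minimal sketch, not the actual StemmaWeb implementation, of emitting a variant graph as GraphViz DOT with the edge width scaled to the number of witnesses sharing a transition; the weighting formula and the demo data are purely illustrative.

```python
# Minimal sketch (not the actual StemmaWeb code): emit a GraphViz DOT variant
# graph in which edge width grows with the number of witnesses sharing a
# transition, i.e. the visual equivalent of "bundling" major edges.

def variant_graph_to_dot(edges):
    """edges: list of (source_reading, target_reading, [witness sigla])."""
    lines = ["digraph variants {", '  rankdir="LR";']
    for source, target, witnesses in edges:
        width = 1 + 0.5 * len(witnesses)   # illustrative weighting only
        label = ",".join(witnesses)        # keep the witness sigla readable
        lines.append(
            f'  "{source}" -> "{target}" [penwidth={width}, label="{label}"];'
        )
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    demo = [
        ("the", "cat", ["A", "B", "C"]),
        ("the", "dog", ["D"]),
        ("cat", "sat", ["A", "B", "C"]),
        ("dog", "sat", ["D"]),
    ]
    print(variant_graph_to_dot(demo))  # pipe the output into `dot -Tsvg` to render
```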

Other rules really do not make sense to me. Not labeling edges? Why not? It is pretty essential for scholars to know which witnesses coincide. Color coding that instead? We tried; it doesn’t work beyond seven witnesses, which is when you run out of the colors of the rainbow, and humans are very bad at distinguishing shades of green. Thus color coding is limited, and on top of that it hides actual information. Abolish backward edges because of cognitive load? Reading James Joyce is a cognitive load, not following a backward edge. But going along for the sake of argument: how do Jänicke et al. treat transpositions in that case? “Rule 5: insert line breaks!” Again that exclamation mark. “Why shouldn’t we adopt the behavior of a text flowing in a book (with line breaks) for Text Variant Graphs?” Well, because it is not a book, and I am interested in new ways of reading. That is my take, but there is no reason why we could not differ in opinion on good arguments.

That brings me to the core of what I find unhelpful about this set of rules. They are hammered out in the style of god-given dogmas. Even if this rule set has empirical user studies and design principles underpinning it, the dogmatism is still unpalatable. Especially the “line breaks” rule seems to suggest that the print paradigm was heavily overrepresented in the user survey. The point is that if you ask users from the print paradigm what they want and what they like, chances are you will end up with a digital mimesis of the print paradigm. I don’t mind if you want to get what you got by doing what you did, but raising that to an absolute rule is not very exploratory, to say the least. What Jänicke et al. fail to appreciate is the digital humanities and human-computer interaction potential here for experimenting with design choices and learning how they affect our reading, use, and interpretation of variant text representations. Instead they boilerplate a number of rules mostly based on print paradigm assumptions. And again: why does this need to come in this forbidding shouting of dogmatic rules? Will I be shot the next time I violate them for the sake of experimentation? I do like the remarks they make on an iterative approach. We have been reading the Agile Manifesto since 2001, and we have fully embraced evolutionary development. But this goes for rules too: they are there to be iteratively questioned and purposely broken for the sake of progressing our knowledge.

For years there has been a quote under my email signature, which kind of says it all: “Jack Sparrow: I thought you were supposed to keep to the code. Mr. Gibbs: We figured they were more actual guidelines.”

 

Careful what you wish for

At the Social, Digital, Scholarly Editing 2013 conference Peter Robinson made a statement that he repeated a week later at the Digital Humanities 2013 conference. This statement is going to be taken so far out of context, and is going to be misused, abused, perverted, and corrupted so much, that we had better deal with it now.

What Peter said was: “Digital humanists should get out of textual scholarship; and if they don’t, textual scholars should throw them out”. (http://tinyurl.com/mjvf7vj)

This is going to be taken out of context by clueless humanities scholars who hold that there is no role for the digital medium or for computing in the humanities. It will be gratefully taken up by even more clueless policy makers to cut into research possibilities, collaboration, and funding for humanities and computer science researchers.

That is not what Peter was after.

Peter, as far as I understood, first of all means that textual scholars should in the near future be sufficiently computer literate not to need the support of digital humanists for every digitally enhanced task. Scholarly tasks like putting a facsimile online, transcribing a text, annotating it, and publishing the whole as a digital edition should be second nature, and digital humanists should not have to be pestered to support that on any level; that is, we should dispense with the well-known project team triad of scholar, digital humanist, and developer. Peter presumes that, as we speak, the tools to support these tasks have grown so generic and usable that this really should not be a digital humanities research issue anymore.

Peter did not say: we do not need digital humanists. He also did not say: we do not need tool building. He did imply that tool building in digital textual scholarship is by and large done, and that the textual scholarly community should just use those tools and return to scholarly editing full stop.

Peter is wrong.

Judging from the tools Peter has been building, he believes that a textual scholar’s tool set in the digital environment should consist of a means to discover and store base materials such as images and draft transcriptions. There ought to be a tool for transcribing those, and tools for annotating the transcription. He emphasized in both talks that the result of that workflow should be textual data. In Peter’s particular vision of a tool environment any textual resource is available for discovery and reusable under a CC license. Given robust and generic tools, editors and researchers can transform discoverable data according to particular hierarchies, using likewise stable computer logic, to answer various scholarly needs. For instance a scholar who wants to consider the document structure can reproduce that particular XML hierarchy; another scholar could request a particular hierarchy for visualization, and so on.
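To make that “one resource, several hierarchies” idea a little more concrete, here is a toy sketch of projecting the same XML data into whichever level a scholar asks for. The element names and the flat page/line encoding are invented for illustration; a real edition would involve genuinely overlapping hierarchies and far more robust transforms.

```python
# Toy sketch: project one XML source into different views on request.
# The schema below is invented for illustration, not a real TEI encoding.
import xml.etree.ElementTree as ET

SOURCE = """<text>
  <page n="1"><line n="1">In the beginning</line><line n="2">was the word</line></page>
  <page n="2"><line n="1">and the word</line></page>
</text>"""

def project(xml_string, tag):
    """Return (identifier, text content) pairs for the requested hierarchy level."""
    root = ET.fromstring(xml_string)
    return [
        (el.get("n"), " ".join(t.strip() for t in el.itertext() if t.strip()))
        for el in root.iter(tag)
    ]

# One scholar asks for the page structure, another for the line structure.
print(project(SOURCE, "page"))
print(project(SOURCE, "line"))
```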

Peter’s description of the digital ecosystem surrounding the digital scholarly edition stresses the separation of concerns between text as (discoverable, online) data and text as publication. This is in my view a very useful distinction, and one that Patrick Sahle has also pointed out in his recent thesis (http://tinyurl.com/mfoucoh). This separation of concerns ensures that a text is available in a repository in a basic format so that it can be further used and processed by other tools tied to other scholarly tasks, however divergent these uses may be. Peter therefore emphasized the pivotal importance of API access to digital texts, because “Your interface is everybody else’s enemy”.

I think Peter is absolutely right there. Offering an API (http://tinyurl.com/oosk5) to your web-based text delivers it to us for any conceivable further use. This might be annotating and interpreting it. This might be analyzing it algorithmically. It might be producing a simple ePub to give a larger audience access to a particular interpretation of the work by a particular scholar.
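A minimal sketch of what that means in practice is given below. The endpoint is hypothetical, standing in for whatever repository API a project exposes, but the point is that once the text is reachable as data, every further use starts from a handful of lines of code rather than from someone else’s interface.

```python
# Sketch only: fetch a text from an (assumed, hypothetical) repository API and
# hand it on to whatever further use a scholar has in mind.
import json
from urllib.request import urlopen

BASE_URL = "https://example.org/api/texts"  # hypothetical endpoint

def fetch_text(text_id):
    """Retrieve the plain text of an edition or witness as JSON."""
    with urlopen(f"{BASE_URL}/{text_id}?format=json") as response:
        return json.load(response)["text"]

if __name__ == "__main__":
    text = fetch_text("my-edition-42")  # placeholder identifier
    # From here on: annotation, algorithmic analysis, or packaging as an ePub.
    print(len(text.split()), "words retrieved")
```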

But Peter is wrong in presuming we can serve all possible uses with our current tools. Only the tools to create a digital edition as we currently (sort of) understand it are done. And really, they are not even entirely done. They are for the most part rickety in terms of robustness, modeling capabilities, sustainability, interoperability, community support, institutional support, open access, and so on. Peter is proclaiming victory over a problem that we have hardly begun to uncover. A little too soon, a little too ferociously.

What is more important: Peter is only right if you agree that there is exactly one way to describe textual data, that there is only one model for digital text (i.e. overlaid structural hierarchies); only if you hold that we do not want to record our interactions with text as new digital scholarly data to be added to a shared network of digital scholarly data; only if you believe we will not discover new uses and applications of digital text; if you accept that our tools need be no more than taped and wired together improvised contraptions; and if you agree that there is such a thing as a generic tool in the first place… only then is Peter right.

To say that digital humanists can now leave the digital textual scholarship arena to concentrate on more interesting work is to abandon a computational and intellectual effort at exactly the wrong time. Now is as crucial a time as there ever was to apply new models to text. It seems to me that what Peter fails to see is that the digital environment is highly intellectually creative and prolific. It is an environment that allows us to express a multitude of forms of digital textual scholarship, and of intellectual argument on text, in tools and interfaces. Current models inspire new thought on engaging text with new ones. Rethinking our models, and creating and recreating text and text interaction based on new models, is both an exploratory and an intellectual task essential to rethinking how we work with and experience text in the digital environment, which by now is more a part of our lives than the automobile.

But remodeling requires an evolution of tools and interfaces, a perpetual cycle of invention and reinvention, which is as much an expression of intellectual and scholarly engagement with text as it is the interface to a scholarly publication. The interaction between scholars, digital humanists, computer science researchers, and developers is pivotal to this endeavor. Severing the tie of collaboration between digital humanists and textual scholarship would be tantamount to killing one of the most promising fields of research. And in its wake it would kill textual scholarship too, for textual scholarship without the digital paradigm has meanwhile become rather unsustainable. And this is the paradox: textual scholarship that ignores the digital paradigm has little future, but the textual scholarship community is at this point in time certainly not computer literate enough to meet its own digital needs.

Peter provided Luddites and conservatives with the perfect ammunition for another round of paradigmatic regression. Statements like these are mostly useful only for pushing everybody back into their comfort zone boxes. But it is in the exploration of new tools and possibilities, in the collaboration on thinking up and creating what is nearly possible and not yet possible at all, that we uncover new ways of thinking about text, what it means, how it behaves, and what its uses are.

Our texts and that which allows them to interact with us –that is to say our tools– are not static. To proclaim that the development of the digital scholarly edition is done is to say that there is no further point in researching and exploring text in the digital environment.

Nederlab: Concerns from the Research Perspective.

Nederlab is a recently awarded large infrastructure investment by the Dutch Science Foundation (NWO). The successful proposal for a 2.4 million euro subsidy was carried by an impressive consortium of leading researchers in the fields of linguistics, history, and literary scholarship. Reading the proposal, I find there are several serious issues that may cripple the project from the start. I think both the researchers and the technicians involved should tackle these issues now rather than later.

Nederlab according to the proposal should be a “laboratory for research on the patterns of change in the Dutch language and culture” (Nederlab 2011, p.2). “Scholars in the humanities – linguists, literary scholars, historians – try to understand processes of change and variation […]. To study these processes, large quantities of data are needed […].” The goal of Nederlab is “to enable scholars in the humanities to find answers to new, longitudinal research questions.” “Until now […] research has of necessity mainly consisted of case studies detailing comparatively brief periods of time. As more historical texts become available in digital form, new longitudinal research questions emerge, and systematic research on the interaction of changes in culture, society, literature and language can be mooted.” Let’s recap this in a bulleted list. Nederlab…

  • is for linguists, literary scholars, and historians
  • is meant to support longitudinal humanities research
  • uses detection of patterns in language and culture
  • relies on the historical texts that have become available

The subsidy application continues: “From an international perspective, scholars of Dutch history, literature and language are in an excellent position to develop new perspectives on topics as the ones described above. The Netherlands is a central player in the international eHumanities debate, in particular with respect to infrastructure and research tools (cf. CLARIN and DARIAH).”

So we add to our list of Nederlab characteristics:

  • relies on CLARIN and DARIAH related infrastructure and tools

The main body of the plan describes research cases and questions that could be answered given a historical corpus spanning ten centuries. Most of the use cases focus on the post-18th-century period. Most are linguistics-oriented, or use linguistic parsing as a means to infer observations from digital data. So we again expand our list:

  • focuses on post 1700 use cases / material
  • relies on automated parsing and tagging for its research paradigm

The Nederlab application then also states a contingency strategy: “There is heterogeneity in the data: a large slice of the corpus is of poor quality since it consists of poor OCR. In compensation, a number of representative high-quality corpora will directly be incorporated as core corpora in the demonstrator. Data curation will gradually revalue the poor parts of the corpus, a necessary measure as weaker data is difficult to access by tools. The problem can be mitigated by designing the tools in such a way that they bypass the problems of weaker data, and by enabling researchers to compose subsets of the corpus from which weaker data are barred.” (Nederlab 2011, p.27) Thus we can also add to our list:

  • uses a historical corpus that is for a large part of poor quality
  • cannot use automated tools to remedy this (or only to a very limited extent)
  • will rely on manual curation to remedy this over time
  • ignores poor quality data (“weaker data are barred”)

Now, let’s combine these characteristics with what we know from experience about the current state of digital affairs in humanities research from a longitudinal perspective. First, let’s take a look at how representative the sources are that we have. The farther we move back in time, the less coverage we have (lots of sources were lost). The sources we do have are luckily somewhat representative (they are a good enough ‘sample’ of what went on), but their quality also gets poorer and poorer the farther we move back in time. Simply put, the availability of our material is skewed towards roughly post-1800 material. This means that the research material available, and thus potentially available for diachronic research, is in itself biased towards the post-1800 period.

Now let’s look at the level of curation and curatability. The situation here is even more skewed. Automated tools for scanning, OCR-ing, and POS-tagging are only effective on more recent material. Roughly speaking, material from before 1800 needs a costly and capacity-intensive manual effort to be digitized. First because of the form, format, and language (spelling and orthography) of the documents. Second because the form and amount of the material do not allow for automatic parsing. Patterns of culture currently cannot be detected in older material by a purely automated approach. The common linguistics strategy of auto-tagging, then reducing noise, and then inferring patterns does not work here. Rather, a strategy of manually annotating, curating noise into data, and then inferring is needed; even if we want to apply the quicker automated strategy, we need training corpora for that… corpora that need to be manually curated.

So the conclusion from a research perspective must be that the material we have, or can fairly easily curate, ‘favors’ later-period (post-1800) research. Let’s take a look at the composition of our research audience over the time period:

Historians are interested in (surprise?) history; linguists are predominantly interested in reasonably recent stages of the language, which follows logically from the fact that data must be readily available to do any meaningful empirical linguistic research. Literary scholars have a fairly evenly distributed interest through time, but depend on available material, which means they only come into ‘full swing’ after roughly 1200 AD. In any case, the available material and periods of study again seem to skew towards the linguists.

How is the tool situation? Given that commercially employed software engineers need to focus on added economic value, language tools and pattern recognition tools have been blueprinted for current-era human language and culture artifacts (texts, audio, film, etc.). Almost no editing, capturing, or annotation software is readily applicable, or even available, for older language periods. In part this is also the reason why the predominant ‘digital paradigm’ of humanities work on the older language phases still relies on manual tag, link, and infer strategies: no serious effort in automated recognition, capturing, or tagging of older language has ever been undertaken. This means there are plenty of automated taggers and parsers… for modern Dutch. For older Dutch there are at most some suitable transcription tools (for instance eLaborate). The tool categories listed in the proposal (i.e. tokenization, spelling normalization, part-of-speech tagging, lemmatization) (Nederlab 2011, p.7) are effectively available only for roughly post-1700 material.
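For what it is worth, the shape of such a pipeline is simple enough; the problem is the resources it presupposes. The sketch below strings the four listed tool categories together with toy lookup tables of my own invention; real modules of this kind exist mainly for modern Dutch, which is exactly the point.

```python
# Schematic sketch of the listed tool categories: tokenization, spelling
# normalization, POS-tagging, lemmatization. The tables are toy placeholders;
# for historical Dutch no broad-coverage equivalents exist.

NORMALIZATION = {"ende": "en", "vrouwe": "vrouw"}             # toy spelling table
LEXICON = {"en": ("CONJ", "en"), "vrouw": ("NOUN", "vrouw")}  # toy tagger/lemmatizer

def tokenize(text):
    return text.lower().split()

def normalize(tokens):
    return [NORMALIZATION.get(t, t) for t in tokens]

def tag_and_lemmatize(tokens):
    # Unknown historical forms fall through untagged: the "noise" that only
    # manual curation (or manually curated training corpora) can turn into data.
    return [(t, *LEXICON.get(t, ("UNK", t))) for t in tokens]

if __name__ == "__main__":
    for token, pos, lemma in tag_and_lemmatize(normalize(tokenize("Ende die vrouwe"))):
        print(f"{token}\t{pos}\t{lemma}")
```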

The paradigm matter is of particular interest here. The more statistics-oriented paradigm of linguistics is well served because automatic formalization of modern language material is highly feasible. Few doubt that such a paradigm will also be useful for the (pattern) analysis of historical material by historians and literary scholars. But we utterly lack the tools and the knowledge of the properties that should be formalized. Nederlab actually provides an excellent occasion to study this problem and make significant headway in these types of formalizations.

However, it will be clear by now that in terms of research interest, available digital material, prospects for digitization, and the availability and applicability of existing tools, the situation highly favors modern-age linguistics research. The older phases have problems with the amount of available sources and data, with the ability to digitize data because of the heterogeneity of form and format of the sources, with the ability to apply automated analysis to the data, and with the prospect of tools being fitted for such. In short: the problems and difficulties with regard to technical solutions predominantly lie with the older language phases. This is where we need an extra effort to resolve the situation, so that computational approaches can be applied and the full advantage of a diachronic corpus can be realized. However, the contingency plan of the proposal rather suggests what we are to expect:

The proposal states in no uncertain terms that the contingency plan is to ignore poor quality data. Applied to just about any data from before 1800, this strategy will result in no data, or far too little data, for any useful automated pattern analysis.

The major advantage, headway, and opportunity the Nederlab project could create is the ability to digitally manipulate and analyze a true diachronic corpus with appropriate tools. The way things are formulated now, however, it rather looks like we will end up with yet another 19th/20th-century-focused linguistics sandbox. That would be a major missed opportunity. From the run-up to and preparation of the proposal I personally know that minds were alike and that the overall intention was indeed to make a decisive push for longitudinal resources for longitudinal language and culture research. To make this happen we need to focus on the left side, the older periods, of the skraphs (sketch-graphs) I drew here. It is pretty clear that the right side is already well supported. This is therefore a call to arms for all researchers and developers involved to put the focus where it ought to be, and not on the low-hanging fruit.

References
(Nederlab 2011) Nederlab: Laboratory for research on the patterns of change in the Dutch language and culture. Application for Investment Subsidy NWO Large. Meertens Institute, Amsterdam/The Hague, October 2011. http://www.nederlab.nl/docs/Nederlab_NWO_Groot_English_aanvraagformulier.pdf (accessed 14 October 2012).

The human face of interoperability

On 19 and 20 March 2012 a symposium in the context of Interedition was held at the Huygens Institute for the History of the Netherlands. Next to excellent papers presented by, for instance, Gregor Middell (http://bit.ly/KTBJp9) and Stan Ruecker (http://bit.ly/KTBVVm), a number of development projects were presented that showed very well what Interedition is all about. Interedition promotes and advances the interoperability of tools used in Digital Textual Scholarship. A nice example was given by Tara Andrews (http://bit.ly/LjhKi6) and Troy Griffiths (http://bit.ly/NW1WCC). They showed how the same software component was used in separate local projects to realize various tool sets for stemmatological analysis, transcription, and regularization of variant texts. The tool that they have developed, in collaboration with me, enables textual scholars to correct and adapt the results of a text collation through an interactive graph visualization (cf. Fig. 1). I think their example showed well how collaboration can lead to efficient tool development and the reuse of digital solutions.

At the same time, the collaborative work on this nifty graph-based regularization tool might be a good example showing that we need to interrogate some of the basic assumptions we tend to have about interoperability. Before getting to that, though, and because at its core Interedition is about interoperability, let me first explain what my understanding of interoperability is. Not as trivial a task as one might expect. I think it was already back in 2008 that I did a ‘grab bag’ lecture on Interedition. This is the type of lecture where you list a large number of topics and the number of minutes you could do a little talking on each. The audience then picks their favorite selection. A little disappointingly, at the time nobody was interested in interoperability; I tend to think because no one in the audience knew what the term even meant.

Interoperability (http://en.wikipedia.org/wiki/Interoperability) can be defined along various lines, but simply put it is the ability of computer programs to share their information and to use each other’s functionality. A familiar example is mail access. These days it does not matter whether I check my mail using GMail, Outlook, Thunderbird, the built-in mail client of my smartphone, or the iPad. All those programs and devices present the same mail data from the same mail server, and all of them can send mail through that server. The advantage is that everybody can work with the program he or she is most comfortable with, while the mail still keeps arriving. In short, interoperability ensures that we can use data any time, any place, so people can conveniently collaborate.

That it is about using the same data any time, any place probably inspired the common notion that interoperability is a matter of data format standards: pick one and all technology will work together nicely. That of course is only very partially true. The format of data indeed says something about its potential for exchange and multi-channeled use. But mostly the format says something about applicability. “Nice that you marked up every stanza in that text file, but I kind of needed all the pronouns really…” Standardized formats unfortunately do not necessarily add up to data being useful for other applications. For that, it turns out to be more important that computational (or digital) processes rather than data are exchangeable. Within Interedition we were therefore interested in the ability to exchange functionality. Could we re-utilize the processes we run on data for others, in ways that are more generalizable? Apart from all the research possibilities that would arise from that, there was also a very pragmatic rationale for the re-utilization of digital processes: with so few people available in the field of Digital Humanities who are actually able to create new digital functionality, it would be a pity to let them invent the same wheels independently.

In practice, to reach interoperable solutions it is paramount to establish how digital functionality in humanities research can be crossbred so that the same code is reused for several processes: “Where in your automated text collation workflow can my spelling normalizer tool be utilized?” Most essentially that is a question about our ability to collaborate. Re-utilization and interoperability are driven by collaboration between people. Interoperability in that sense becomes more a matter of collaboratively formalizing and building discrete components of digital workflows than of selecting a few standards. Standards are abundantly available anyway (http://www.dlib.indiana.edu/~jenlrile/metadatamap/). Connectable and reusable workflow components, not so much.
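What such a discrete, connectable component amounts to can be shown in a few lines. The sketch below is purely illustrative: the function names, the normalization table, and the toy collation are mine, not an existing library’s API. The point is only that a spelling normalizer written as a self-contained step can be slotted into someone else’s collation workflow, or into any other pipeline, without rewriting either.

```python
# Illustrative sketch of a connectable workflow component: a spelling
# normalizer that a (toy) collation workflow can take as a pluggable step.

def normalize_spelling(tokens, table):
    """Discrete, reusable step: map variant spellings onto a canonical form."""
    return [table.get(token.lower(), token.lower()) for token in tokens]

def collate(witnesses, prepare=None):
    """Toy collation: align witness tokens position by position."""
    prepared = {
        siglum: prepare(text.split()) if prepare else text.split()
        for siglum, text in witnesses.items()
    }
    length = max(len(tokens) for tokens in prepared.values())
    return [
        {siglum: tokens[i] if i < len(tokens) else None
         for siglum, tokens in prepared.items()}
        for i in range(length)
    ]

if __name__ == "__main__":
    table = {"coulde": "could", "kan": "can"}
    witnesses = {"A": "I coulde not", "B": "I could not"}
    for row in collate(witnesses, lambda tokens: normalize_spelling(tokens, table)):
        print(row)
```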

I think, therefore, that the most important lesson learned from Interedition so far is that interoperability is not essentially about tools and data, or about tools over data. It is about humans and interaction over tools and data. Interedition succeeded in gathering together a unique group of developers and researchers who were prepared, and daring enough, to go as far as it would take to understand each other’s languages, aims, and methods. That attitude generated new, useful, and innovative collaborations. The successful development of shared and interoperable software was almost a mere side effect of that intent to understand and collaborate. Stimulating and supporting a creative community full of initiative turns out to be pivotal, far more so than maintaining any technical focus. Thus interoperability is foremost about the interoperability of people. The fact that it is the white ravens, those combining research and coding skills in one person, who push the boundaries of Digital Humanities time and time again is the ultimate proof of that.

(This is a translation of a blog post that appeared originally in Dutch on http://www.huygens.knaw.nl/de-menselijke-kant-van-interoperabiliteit/#more-4377.)