Computer-aided Scientific Workflow


In Hamburg (at least ten years ago), when you went to Gymnasium, at some point you would learn about something called methodological competences: a “learn to learn” course showing you how to research a topic of interest, how to read texts efficiently, how to take and organize notes, and finally, in preparation for the finals, how to learn and present the material.

It was this course, which at our school was taught in a two-week seminar on Sylt, that got me interested in this topic. Of course, in 1999, when we had that course, it was still all about the analogue world: literature was obtained at the library (we even learned how books are sorted in the Staatsbibliothek), texts were photocopied, we were told to use highlighters, write flashcards, etc. Still, every now and then, when my interest in this topic reawakened, I started reading about new ideas on how to efficiently and effectively process the knowledge you obtain while reading, and how to preserve it in a way that keeps it usable even in ten years’ time. The most famous person to manage this would be Niklas Luhmann, who meticulously wrote down every thought on every book he ever read onto small slips of paper, numbering and tagging them and keeping handwritten indices as lookup tables. In many interviews he explained how he worked, stating that he wouldn’t even need to write a book anymore: with this system, called the Zettelkasten, he would look up the important terms and the story would tell itself. He even spoke of it as a person with whom he was having a discussion, his alter ego (see this German article on Sciencegarden as well as the following interview).

After his death in 1998, Bielefeld University bought the Zettelkasten from his heirs: 20,000 sheets in 80 boxes, everything Luhmann thought during his lifetime. As he was well recognized for his work, it was treated as the holy grail of wisdom on sociological systems theory, and as far as I know it is still being analyzed today.

But what do you do when you don’t have that much time? When you simply cannot afford to spend every evening creating, tagging, indexing and sorting new slips of paper into a gigantic pile that most people would rather consider a symptom of a hoarding disorder?

Luckily, today we have computers, and they become more helpful to us with every passing day. After working on some projects concerning corporate knowledge management (when I was still employed full-time, before I decided to study), I became interested in how to make personal use of the ideas and methods used in companies. So I stumbled upon the term personal knowledge management and found a lot of ideas, ranging from fundamental skills that one should acquire (time management, organisational skills, networking, learning skills, etc.) and learning methods (SQ3R, mind and concept maps, the method of loci and other mnemonic systems, etc.) to automated knowledge acquisition systems (knowledge harvesting) and higher-level self-management methods such as Management by Objectives. Most of the old things I learned at school are still valid, just transferred into the digital world: use personal information managers, use personal wikis, use social bookmarking (always be interconnected!), keep a reading/learning journal (best done as a blog), use online discussion groups, learn to search efficiently for publications online, etc.

So I tried. I used a lot of tools and tried different systems and approaches, e.g. a learning blog on WordPress, Office OneNote, Journler, Evernote, Zim, Zettelkasten, synapsen, Zettels, VoodooPad, Delicious Library, del.icio.us, BibSonomy, ResearchGate, Papers, Citavi, EndNote and so many more. Still, it never really worked out for me. I guess the biggest problem for me is the “overhead”. Of course, transferring Luhmann’s Zettelkasten into a wiki is a great improvement (if you think about it, Luhmann’s system is not that different from a wiki at all!). Still, installing and managing a wiki (which normally also requires installing and managing a web server as well as a DBMS), learning a new syntax, and especially the routine of reading something, making notes, marking interesting quotes, and then digitizing them, i.e. typing them from paper into a wiki (at least the quotes, if you already keep your notes digitally) – that is just too much for me. I would do it if I had the time, but most of the time I just don’t. Besides, every additional tool is an additional burden if it cannot interface with other tools. For instance, if I feed my global BibTeX file with bibliographic data and cannot access this data from my personal wiki, then I have to add that information there manually, too. If I want to share these things online, I at least have to copy and paste. If it is a service like CiteULike or BibSonomy, I am presented with an HTML form, so no easy copy and pasting – again, data to enter more than once. Most of the above tools cannot be interfaced at all. But even worse: what happens if I decide to switch? Can I still access all my data from my former tool? Can I export it into an open format that is understandable, or at least easy to parse into another format?

So, after trying out a lot and searching for tools whose handling I actually like, I finally stumbled upon “academic workflows”. This is a term I discovered only recently on the web, and if you look for it, it seems to be fairly recently coined – at least for computer-based workflows that simplify the process of acquiring, reading, processing and finally storing information. There are even blogs, e.g. Organo Gnosi or Academic Workflows on Mac, which focus solely on tools and possibilities that allow you to create a good and easy-to-use workflow. The focus is mostly on how different tools can interact, where and why to use something, and how to transfer the results to other tools. I often take a look at some of the suggestions and try out this and that. But the real revelation is something I stumbled across by accident, just two days ago:

An ingenious way of “gluing” together some pretty neat tools on Mac OS X. Yeah, unfortunately it’s just an OS X solution; but I think (though I’d like to keep things generally open to all systems) it’s still worth looking into and writing about, for three reasons:

  1. All presented tools are open source, so except for the operating system (and the AppleScript parts), porting them should be possible. If not, I’m sure most of them have an equivalent on Linux, so the presented workflow should be more or less reproducible on a Linux system. Get a feel for why this workflow is – in my opinion – ingenious, and then try porting it to Linux. The community of scientists using Linux will most definitely appreciate it. Unfortunately, I cannot picture a port to a Windows-based system, due to the lack of scripting capabilities. But hey, prove me wrong. Just because I can’t do it doesn’t mean it’s impossible 😉
  2. What makes this workflow interesting: it is fast and lightweight (unlike Papers or Citavi), it is flexible because it uses good and open standards, making it reusable, and it is exhaustive, covering everything from easy acquisition through taxonomy and reading (note-taking) to presenting – all in fast, neat little programs that are easily accessible through shortcuts. But there is still a lot missing, like integration with social bookmarking, or notifying peers of a reference when it is used elsewhere. Also, although part of its charm is that it consists of all these little software gems, having a single tool capable of all these tasks wouldn’t be wrong either. Compare it to an ERP system that covers the entire supply chain and additionally generates controlling information and automated payments (or rather, the corresponding artifacts of the scientific world) – no such software is available for science. So, if you are a software developer, here’s an opportunity.
  3. In the past years I have noticed that – especially if you ignore computer scientists, who generally prefer Linux over all other operating systems – a lot of scientists seem to prefer Macs over Windows systems. And if you think about it, it makes perfect sense. As a scientist, you probably mainly use your computer as an aiding tool, so you need it to be as easy to use as possible. In this category, Macs are truly the winner. On the other hand, you probably also need a great deal of power from the system, i.e. easy scriptability, mass data processing, automation, high computing power, etc. Here Unix systems clearly win, whereas Macs are somewhere in between: they provide these features as well, with the restriction that you cannot directly access the operating system (to be more specific: the kernel) itself. As a computer scientist or a system maintainer you probably want this – you might need to program at a low level, directly access hardware, change kernel modules or even write your own. For everyone else, Macs are a similarly suitable choice (with the easy UI and highly integrated tools that Linux lacks). So, if you’re an academic reading this, the probability isn’t that small that you already use a Mac anyway 😉

But let’s finally cut to the chase: the Canadian PhD student Stian Håklev developed a system consisting of Chrome, BibDesk, Skim and DokuWiki, glued together by a number of Ruby scripts that are invoked via AppleScript (apparently Stian doesn’t like writing AppleScript; otherwise the Ruby parts could have been skipped and everything done in AppleScript directly).

Using Chrome and Google Scholar, Stian’s system is able – invoked by a keystroke – to automagically download the paper he is searching for, while simultaneously fetching the BibTeX data provided by Google Scholar and adding it to the global BibTeX file accessed by BibDesk. In BibDesk a cite-key is created that is also used as the name of the PDF file, and in addition the PDF is linked to the BibDesk entry, so it can be viewed from within BibDesk. Regardless of how you open them, those files are opened with Skim, a PDF viewer that lets you mark interesting parts and add annotations. Unlike in Preview or Adobe Reader, which have the same capability, those annotations become searchable (even through Spotlight), yet they are still treated as metadata: opening the PDF in Preview or Adobe Reader will not show the annotations (while Skim does – it even prints them!). Notes can be tagged, and BibDesk picks up these notes and tags and creates groups for them, so in BibDesk you can group your references by the tags you assigned in your Skim annotations.
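To give a feel for the kind of glue involved, here is a minimal Ruby sketch of the filing step, under my own assumptions (the BibTeX entry arriving on standard input, the newest PDF in ~/Downloads being the paper just fetched). It is meant to illustrate the idea, not to reproduce Stian’s actual scripts:

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch of the "file a new paper" step. The paths, the
# cite-key convention and the idea of reading the BibTeX entry from
# standard input are my assumptions, not Stian's actual code.

require 'fileutils'

BIB_FILE   = File.expand_path('~/Documents/Bibliography/global.bib')
PDF_FOLDER = File.expand_path('~/Documents/Papers')
DOWNLOADS  = File.expand_path('~/Downloads')

# BibTeX entry as handed over by Google Scholar (e.g. via the clipboard).
bibtex = STDIN.read

# Extract the cite-key, e.g. "luhmann1995social" from "@book{luhmann1995social,".
cite_key = bibtex[/@\w+\{([^,]+),/, 1] or abort 'No cite-key found in BibTeX entry'

# Append the entry to the global .bib file that BibDesk has open.
File.open(BIB_FILE, 'a') { |f| f.puts("\n" + bibtex) }

# Treat the most recently downloaded PDF as the paper that was just fetched,
# rename it to <cite-key>.pdf and move it next to the other papers so that
# BibDesk can link it to the new entry.
latest_pdf = Dir[File.join(DOWNLOADS, '*.pdf')].max_by { |f| File.mtime(f) }
abort 'No PDF found in Downloads' unless latest_pdf

FileUtils.mkdir_p(PDF_FOLDER)
FileUtils.mv(latest_pdf, File.join(PDF_FOLDER, "#{cite_key}.pdf"))

puts "Filed #{cite_key}"
```

The key design idea is that the cite-key becomes the single identifier tying the BibTeX entry, the PDF file and, later, the wiki page together.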

To go even further, you can now automagically – again by issuing a key sequence – generate a DokuWiki document. DokuWiki uses plain files instead of a database, making it easier to maintain and to sync, for example with rsync. DokuWiki takes all the information provided by BibDesk in the BibTeX file, as well as everything you added via Skim, and produces a page that shows all bibliographic information, including a link to the PDF file, all quotations (everything you marked in Skim) and all your notes. Of course, you can further edit this page (efficiently with Side Wiki), adding additional thoughts, tags and a synopsis. All references are linked and will open the PDF at the exact page they are taken from! DokuWiki also lets you search by author (you can even create author pages) or by tags, and of course lets you add links between pages; conveniently, every page is again named after the cite-key, so handling gets pretty easy.
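The page generation can be pictured the same way. The following Ruby sketch, again with assumed file locations and an assumed plain-text export of the Skim highlights (one quote per line in a <cite-key>.txt file), writes a minimal page in DokuWiki syntax; researchr itself produces considerably richer pages:

```ruby
#!/usr/bin/env ruby
# Minimal sketch of the "BibTeX entry + Skim notes -> DokuWiki page" step.
# The wiki location, the ref: namespace and the notes-export format are
# my assumptions for illustration only.

require 'fileutils'

cite_key   = ARGV[0] or abort 'usage: make_wiki_page.rb <cite-key>'
wiki_pages = File.expand_path('~/wiki/data/pages/ref')
pdf_path   = File.expand_path("~/Documents/Papers/#{cite_key}.pdf")
notes_file = File.expand_path("~/Documents/Papers/#{cite_key}.txt")

quotes = File.exist?(notes_file) ? File.readlines(notes_file).map(&:chomp) : []

# Assemble the page in DokuWiki syntax: heading, link to the PDF,
# the marked passages as a bullet list, and an empty notes section.
page = []
page << "====== #{cite_key} ======"
page << ''
page << "PDF: [[file://#{pdf_path}|#{cite_key}.pdf]]"
page << ''
page << '===== Quotes ====='
quotes.each { |q| page << "  * #{q}" }
page << ''
page << '===== Notes ====='
page << ''

# DokuWiki stores every page as a plain text file under data/pages/.
FileUtils.mkdir_p(wiki_pages)
File.write(File.join(wiki_pages, "#{cite_key}.txt"), page.join("\n"))
puts "Wrote wiki page for #{cite_key}"
```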

If you’re into being an Open Scholar (which is really interesting and goes hand in hand with Open Access and Open Source Software, so have a look at these philosophies) and want to make all of your progress available, you can push your offline wiki to an online site – as it consists of plain text files, any file synchronization tool such as rsync will do the job. And Stian has even written a plugin for WordPress that makes it easy to reference your papers (and wiki pages) in your articles, using – yet again – the cite-key.
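Because the pages are just files, the push itself can be as small as a wrapper around rsync; a sketch, with placeholder host and paths:

```ruby
#!/usr/bin/env ruby
# Sketch of pushing the local wiki pages to a public DokuWiki instance.
# Host name and both paths are placeholders, not Stian's actual setup.

LOCAL_PAGES  = File.expand_path('~/wiki/data/pages') + '/'
REMOTE_PAGES = 'user@example.org:/var/www/dokuwiki/data/pages/'

# rsync only transfers files that have changed; --delete mirrors removals too.
system('rsync', '-avz', '--delete', LOCAL_PAGES, REMOTE_PAGES) or abort 'rsync failed'
```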

Compared to all the tools I have used so far, this seems to be by far the most intuitive and efficient way of processing scientific knowledge. The automation and the exchange of metadata mean data has to be entered only once, and documents are created that already contain your thoughts and excerpts and invite you to process the knowledge further and develop your ideas. That is so intriguing that I can confidently picture myself actually doing it. There is even Kindle integration, so you could also use your e-book reader (or maybe even your phone or tablet?) to read your papers, making the whole thing even leaner.

I, for my part, am pretty excited about getting everything up and running and maybe even improving it further. Stian shares his DokuWiki on the web and in it also describes his setup, which he calls researchr. Everything is open source and free to use, and all his scripts can be found in his GitHub repository. So everything is provided to rebuild this setup (unfortunately there is no easy way around doing it by hand, though I cannot see why it shouldn’t be possible to write a script that sets up all the scripts and pieces). And last but not least, there are two interesting videos: one is a talk (I like that one better) demonstrating the workflow (round about the middle of the video), the other a pure screen-capture presentation with a demo of the workflow.

Enjoy 😉

After forking and implementing the workflow I hope to be able to improve it. In my opinion it should be possible to write an installation script, maybe something that involves using GitHub (as e.g. Homebrew or pathogen do). I would also rather use Firefox than Chrome, and making the setup system-independent would be nice as well, though I think breaking up the perfect couple of Skim and BibDesk would be hard (see also here or here).
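Such an installation script would not need to be complicated. Here is a speculative Ruby sketch of a bootstrap that clones the repository and links the scripts into place; the repository URL, the target directories and the assumption that the scripts live as *.rb files in the repository root are placeholders, not the actual layout:

```ruby
#!/usr/bin/env ruby
# Speculative bootstrap sketch, in the spirit of the Homebrew installer:
# clone the researchr repository and symlink its scripts into ~/bin.
# Repository URL and directory layout are placeholders - check the real repo.

require 'fileutils'

REPO   = 'https://github.com/USER/researchr.git' # placeholder URL
TARGET = File.expand_path('~/researchr')
BIN    = File.expand_path('~/bin')

unless Dir.exist?(TARGET)
  system('git', 'clone', REPO, TARGET) or abort 'git clone failed'
end

FileUtils.mkdir_p(BIN)
Dir[File.join(TARGET, '*.rb')].each do |script|
  FileUtils.ln_sf(script, File.join(BIN, File.basename(script)))
end

puts 'researchr scripts linked into ~/bin'
```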


3 thoughts on “Computer-aided Scientific Workflow”

  1. Pingback: researchr on Linux (Addendum to Computer-aided Scientific Workflow) | ~ PygoscelisPapua | Profession~

  2. Hi Kannan, I am glad I found your blog through a GAnalytics referrer log, because we should definitely talk! :) I’ve had a lot of the same thoughts that you mention here about information systems (my PhD is in digital learning, so I am also very interested in how to facilitate individual and collaborative work with ideas, both for students and for researchers – to my mind, there’s a lot of overlap), and I’ve also tried a whole bunch of programs that never quite felt right.

    I’m quite a hacky programmer, and Researchr is really built with duct tape and held together with string 🙂 It’s awesome to see it work, and a lot of people really like it when they see demos (some have even adopted it, although usually I have to sit down with their laptop for an hour to install it). I really want to develop it, or something like it, into a tool that “normal people” can use… the first step (not easy!) would be to get it into a structure where it is actually much easier to install, upgrade, etc., and I also have a lot of ideas for how to develop it further (you can see some of my screencasts and blog posts, for example linking between individual research wikis, streaming read articles like Last.fm does for music, sharing metadata, etc.). In fact, once I finish my PhD, I would love to work on tools for open academic workflows full time – if I can find funding, and good people to work with…

    I’ve thought a lot about how to push the project forwards, and discussed it with some other tech-interested people. In addition to Bodong, who is a good friend of mine, Gabby (http://losingtime.ca/) was also experimenting with Linux, and Ryan Muller (http://wiki.learnstream.org/) helped me build the prototype social platform, although he then quit academia 🙂 One possible way forward would be to turn it into a web platform, but there are some significant issues, including hosting the PDFs (unless you want to limit to only truly open articles, but then it’s not useful to most people) – and reading and marking up PDFs online is still not great (ideally we’d transition to a better format than PDFs, but that seems far off). And I love having access to stuff offline…

    So far I’ve been the only one working on this system, and it’s been built up very gradually – the github code is basically just a dump of my working version, where I keep fiddling to add or fix things. No tests, no long-term plan, the documentation is quite out-of-date etc. I’m happy to spend time on this, turning it into a proper software project, moving the wiki documentation to Github, etc, if I thought that it would be something useful to others. Even if we managed to extract a subset of functionality that would be useful to others… I’d love to discuss.

Please comment. I really enjoy your thoughts!
