Whether to Refactor or Rewrite SubSift

We have recently wrestled with the question of whether to refactor the original SubSift software, upon which the intended functionality of JISCRI SubSift Services is based, or to rewrite this functionality completely from scratch. On balance we opted to rewrite, for the reasons set out below. We hope our analysis proves useful to other projects that aim to re-use software produced as a “side-effect” of academic research.

Language Soup

SubSift was written as a bespoke set of tools to support the peer review process for KDD’09, a major data mining conference, aiding in the allocation of over 500 submitted research papers to circa 200 reviewers. The tools were variously written in Prolog, Java, C++ and MATLAB M-code, depending on the main area of expertise of the specific researcher who wrote each tool. The emphasis was on hitting the hard (immovable) deadlines of the KDD’09 paper review process rather than on producing generic re-usable software. Remarkably, the deadlines were met and the tools delivered the required functionality. In the SubSift Services project, this tool set is being repurposed as a set of re-usable REST services to support peer review at other conferences, for workshops, for journals, and for mash-ups with tools like Yahoo! Pipes.

Although the SubSift Services project has access to expertise in all the languages and environments used to produce the original software, there are at least three good reasons why this technological mix is incompatible with our goal of producing easily repurposed open source software. Firstly, not many programmers have this mix of skills, which restricts the pool of potential contributors. Secondly, the installation and support overhead for this combination of environments is a considerable obstacle to adoption and re-use of the open source software. Thirdly, the interface between these tools is fragile in the sense that changes to the underlying environments, especially their input/output libraries, are likely to break the processing pipeline. Rewriting SubSift in fewer languages, preferably in a single well-known and freely available language, overcomes all these problems, albeit at the risk of the rewrite taking longer than the project schedule and budget allow.

Leaving aside the eclectic mix of languages, there is a deeper and more compelling reason to rewrite rather than refactor SubSift, one which is likely to hold for much research-produced software regardless of how many languages it is written in. The original SubSift software was an exploration of multiple possible solutions to the problem being addressed rather than an implementation of a single known solution. As such, the design is far more complex and involved than it needs to be for the single solution that was eventually chosen as the best one. This is in no way a criticism of the software; it is an inevitable consequence of exploring a design space while implementing a solution. After reviewing the SubSift code we concluded that, by treating the existing implementation as a detailed functional specification, it should be possible to produce a cleaner, better engineered implementation that is both simpler to understand and more easily maintained. Furthermore, the rewrite affords an opportunity to implement SubSift Services in a single language, which simultaneously mitigates the aforementioned problems arising from the mix of languages.