Piping Hot (or not) for SubSift

One of the ideas behind the SubSift project is to create a number of Yahoo! Pipes that may be combined as a mash-up to support peer review of conference and journal papers, matching submitted publications with their most appropriate reviewers. So it will come as no surprise that we have recently been experimenting with Yahoo! Pipes, with mixed success.

Yahoo! Pipe to extract publication titles from DBLP

On the win side, it is surprisingly easy to create pipes that do the sort of things Yahoo! Pipes was designed to do. For example, the screenshot above shows a pipe that extracts the titles of all publications for a given author from their DBLP page and returns them as an RSS news feed. This ease of creation matters for applications like SubSift because text extraction based on the (current) page layout of a third-party website is brittle: any change to the underlying HTML can break it. Being able to update or recreate a pipe quickly is therefore a big plus for the maintainability and longevity of the pipes.
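For readers without access to the pipe editor, the same extract-and-republish idea can be sketched in a few lines of Python. This is only an illustration, not the pipe itself: it assumes that each publication title sits in a `<span class="title">` element, which is a guess about the page markup and exactly the kind of layout detail that can change. The names `TitleExtractor` and `titles_to_rss` are made up for this example, and the feed is stripped down to titles only.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TitleExtractor(HTMLParser):
    """Collects the text of every <span class="title"> element.

    The class name "title" is an assumption about the page layout.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0    # > 0 while we are inside a title span
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track nested spans so we know which </span> closes the title.
            if tag == "span":
                self._depth += 1
        elif tag == "span" and ("class", "title") in attrs:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def titles_to_rss(author, titles):
    """Wrap the titles in a minimal RSS 2.0 style document (titles only)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Publications of {author}"
    for title in titles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
    return ET.tostring(rss, encoding="unicode")
```

To use it, feed the downloaded page text to `TitleExtractor().feed(...)` and pass the resulting `titles` list to `titles_to_rss`. If the page layout changes, only the tag and class test in `handle_starttag` should need updating, which mirrors the maintainability argument above.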

On the lose side, Yahoo! have placed a number of resource restrictions on pipes that seriously limit the range of applications for which the approach makes sense. We hit two of these restrictions in our initial foray into this technology: the maximum execution time and the maximum size of file that can be retrieved. In the DBLP publication titles pipe, fetching the publications of too many reviewers at once causes a timeout. This is a problem because SubSift needs to handle hundreds of reviewers (e.g. a whole programme committee for a conference). Similarly, authors with a lot of publications have large DBLP pages, and retrieving those pages failed because the file size exceeds the Yahoo! limit.

If the project is to continue with the strategy of using Yahoo! Pipes for most of the web services then some way of working around these limitations will be required.
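One direction we could explore for the timeout, offered purely as a sketch and not as something the project has adopted, is to split a long reviewer list into small batches and run the pipe once per batch. The helper name `batches` is hypothetical, and the batch size would have to be found by experiment, since we have not measured where the execution-time limit bites. This would do nothing for the file size limit, which applies to a single author's page.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Each batch would then be passed to a separate pipe run, with the resulting feeds merged afterwards.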