Fun with Pipes for SubSift

Currently SubSift uses the DBLP website to feed its term frequency algorithms. Ignoring disambiguation issues for a moment, once each researcher is known, the system fetches their publication listing from DBLP:

This returns a chronologically ordered list of publications. SubSift internally parses the contents of this page (a technique called screen scraping), strips out the unnecessary information (HTML tags, stop words and so on) and extracts the title of each publication. This forms the basis of the bag-of-words: the collection of terms that represents each researcher.
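
To make that step concrete, here is a minimal Python sketch of turning scraped titles into a bag-of-words. It is not SubSift's actual Perl code: the function names, the tiny stop-word list and the sample titles are all made up for illustration.

```python
import re
from collections import Counter
from html import unescape

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def strip_tags(html):
    """Replace anything that looks like an HTML tag with a space and decode entities."""
    return unescape(re.sub(r"<[^>]+>", " ", html))

def bag_of_words(titles):
    """Count the non-stop-word terms across a list of paper titles."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", strip_tags(title).lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

# Made-up titles standing in for a scraped publication listing.
sample_titles = [
    "<i>Learning</i> to Rank Reviewers for Papers",
    "A Study of Term Frequency in Paper Titles",
]
bag = bag_of_words(sample_titles)
```

Note that this sketch does no stemming, so "paper" and "papers" are counted as separate terms.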

One of the goals of the SubSift project has been to modularise the system and, in particular, to extract this DBLP-fetching component and replace it with a call to Yahoo! Pipes.

There are several reasons for doing this.

Firstly, it lets SubSift adapt more easily to any changes in document structure made by the DBLP website. Regular expression-based screen scraping is a brittle technique (likely to break if the smallest change is made to the source document) and a bit of a black art at the best of times. Yahoo! Pipes uses the same techniques. However, in six months' or a year's time, realising that DBLP has changed the structure of its HTML and then hunting inside the Perl code for the relevant regex string can be very difficult. Having it all compartmentalised makes this task a lot easier (Pipes can even be modified through the web without needing to touch any code at all).
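
As a small illustration of that brittleness, the Python sketch below runs one regex against two versions of the same listing entry. The markup and class names are invented for the example and are not DBLP's real HTML.

```python
import re

# Hypothetical pattern written against the "old" markup.
TITLE_PATTERN = re.compile(r'<span class="title">(.*?)</span>')

old_markup = '<li><span class="title">Mining Paper Titles</span></li>'
# Same content, but the site has renamed the class attribute.
new_markup = '<li><span class="paper-title">Mining Paper Titles</span></li>'

old_titles = TITLE_PATTERN.findall(old_markup)  # the title is found
new_titles = TITLE_PATTERN.findall(new_markup)  # nothing matches any more
```

The scraper does not raise an error on the new markup; it just quietly returns no titles, which is part of what makes this kind of breakage hard to notice.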

Secondly, Pipes is a rapid data mashup creator and we can quickly and easily create new pipes for many different types of data sources. Not only can we adapt to changes easily, but if a new service becomes popular, or a specific conference uses a different website to store its community member information, then we can quickly build a new pipe to pull in the required information. This is also important when you consider that SubSift has a much wider range of applications than just scientific conferences, and the data for those other uses might come from a much more diverse set of sources.

Finally, one of the nice things about Pipes is that it has a user-friendly interface, one that doesn't require extensive and detailed knowledge of programming to use. While I would emphasise that some programming experience is needed (and especially familiarity with regular expressions), we hope that it could be maintained by mere mortals (and not just Perl gurus).

I have developed three different pipes that we can use for SubSift.

[1 a] The first mimics the existing DBLP lookup. It takes a researcher name as input and returns a list of the titles of their publications.

[1 b] I have created an extension to this pipe which does the same as the above but also filters papers based on a provided year.
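
The year filter in [1 b] can be pictured with a short Python sketch. The record layout and function name are hypothetical, and the sketch assumes "filter by year" means keeping papers published in or after the given year; the real pipe may instead match a single year.

```python
def filter_by_year(publications, min_year):
    """Keep only publications from min_year onwards (assumed semantics)."""
    return [p for p in publications if p["year"] >= min_year]

# Made-up records standing in for the pipe's output.
publications = [
    {"year": 2010, "title": "Matching Reviewers to Submissions"},
    {"year": 2007, "title": "Profiling Researchers from Titles"},
    {"year": 2004, "title": "An Early Workshop Paper"},
]
recent_titles = [p["title"] for p in filter_by_year(publications, 2007)]
```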

[2] The second pipe queries the Microsoft Academic Search (MAS) site. An advantage of using this service is that it also provides brief abstract information in its search results. We can therefore return more information about a given researcher and help improve the term frequency extraction. Another advantage is that the search service itself allows us to supply a date restriction, which means we don't need to do any date-based filtering in the pipe. The downside of using MAS, however, is that it's a web search, not a front end onto a database like DBLP. This means that a search is not guaranteed to return results specific to an individual researcher, and disambiguation becomes harder to solve.

[3] The final pipe uses the staff listing page here at ILRT. You provide the name of a member of staff to the pipe and it queries the listing page to find that person's personal page. It then uses a second pipe to retrieve the personal page and extract all the useful information (a description of the person, projects they've worked on, papers they've written, etc.).

All these pipes can be found on the Yahoo! Pipes site here: