API Method: document items import

POST /:user_id/documents/:folder_id/import/:bookmarks_id

Requests the creation of multiple document items in the specified documents folder of the specified user account. Returns an OK message with a 202 Accepted HTTP status code.

Each item is created by fetching the web page at each bookmark URL in the specified bookmarks folder. Optionally, specifying a depth value greater than the default of zero causes the pages linked to by any <a href="..."> elements in the bookmarked web page and its child pages to be fetched as well and appended to the text of that document. The breadth value sets the maximum number of pages fetched at any one depth. These optional "crawl" parameters allow a set of linked pages to be fetched rather than a single web page per bookmark, but the amount of text fetched will increase, possibly dramatically, as depth and breadth are increased.
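The effect of depth and breadth can be illustrated with a small offline sketch. It is not the SubSift harvester's code: the function name crawl_order, the links dictionary standing in for real page fetches, the level-by-level fetch order and the rule that a page is fetched at most once are all assumptions made for illustration.

```python
def crawl_order(start_url, links, depth=0, breadth=None):
    """Return the pages fetched for one bookmark, in fetch order.

    links   -- dict mapping a URL to the URLs its <a href> elements point to
               (a stand-in for real fetching, so the sketch runs offline)
    depth   -- number of link levels to follow below the bookmarked page
    breadth -- maximum number of pages fetched at any one depth (None = no cap)
    """
    fetched = [start_url]
    seen = {start_url}
    current_level = [start_url]
    for _ in range(depth):
        next_level = []
        for page in current_level:
            for target in links.get(page, []):
                if target in seen:
                    continue  # assumption: each page is fetched at most once
                if breadth is not None and len(next_level) >= breadth:
                    break  # this depth has reached its breadth limit
                seen.add(target)
                next_level.append(target)
        if not next_level:
            break  # nothing new to follow
        fetched.extend(next_level)
        current_level = next_level
    return fetched


# Hypothetical link graph: page "a" is the bookmarked page.
example_links = {"a": ["b", "c", "d"], "b": ["e"], "c": ["f", "a"]}
print(crawl_order("a", example_links))                      # default depth of zero
print(crawl_order("a", example_links, depth=2, breadth=2))  # two levels, capped
```

With the default depth of zero only the bookmarked page is fetched; raising depth or breadth lengthens the list, which is the growth in fetched text described above.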

Fetching web pages takes time, so this import method only adds the list of URLs to be fetched to the SubSift web harvester robot, which runs continuously as a background task. Once a page is fetched by the robot, a document item is created in the specified documents folder. Pages are fetched at a maximum rate of one per second, although in practice each page may take several seconds to fetch, so a long list of bookmarks can take an hour or more before all the pages are turned into the document items requested by this SubSift API method.
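As a rough guide to how long an import may take, the arithmetic can be sketched as follows. The function name estimate_import_seconds and the per-page timings are illustrative assumptions; the only figure taken from the description above is the maximum rate of one page per second.

```python
def estimate_import_seconds(page_count, seconds_per_page=1.0):
    """Rough estimate of the time needed to fetch page_count pages one at a time."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    # The robot never fetches faster than one page per second,
    # so each page costs at least one second.
    per_page = max(1.0, seconds_per_page)
    return page_count * per_page


# Illustrative figures only: 1200 pages at an assumed 3 seconds each.
print(estimate_import_seconds(1200, 3.0) / 60, "minutes")
```

Remember that page_count is the total number of pages fetched, which exceeds the number of bookmarks whenever depth is greater than zero.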

Note that not all web sites permit the use of robots (also known as "bots", "webots" and "crawlers") on all or parts of their site. The SubSift harvester robot checks and obeys the rules defined in the robots.txt of each site it visits and will not retrieve any page to which robot access is denied. Please avoid heavy mining of any single site, otherwise the SubSift robot may be permanently banned from that site.
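To check in advance whether a bookmarked page is likely to be skipped, the same kind of robots.txt test can be run locally. The sketch below uses Python's standard urllib.robotparser module; it is not the SubSift robot's implementation, and the user agent string "example-bot", the example.org URLs and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: every robot is denied access to /private/.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS_TXT, "example-bot", "http://example.org/index.html"))
print(is_allowed(EXAMPLE_ROBOTS_TXT, "example-bot", "http://example.org/private/page.html"))
```

The rules are passed in as text so that the sketch runs offline; in practice they would be read from the robots.txt file at the root of the site in question.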

URL:

http://subsift.ilrt.bris.ac.uk/user_id/documents/folder_id/import/bookmarks_id.format

Formats:

csv, json, rdf, terms, xml, yaml
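The URL template and the list of formats can be combined in a small helper, sketched here under stated assumptions. The name import_url is hypothetical, and treating the format suffix as optional is an inference from the cURL usage example, which omits it.

```python
from urllib.parse import quote

BASE_URL = "http://subsift.ilrt.bris.ac.uk"
FORMATS = ("csv", "json", "rdf", "terms", "xml", "yaml")


def import_url(user_id, folder_id, bookmarks_id, fmt=None):
    """Build the import URL, optionally with a return-format suffix."""
    # Percent-encode each identifier so it is safe to use as one path segment.
    parts = [quote(part, safe="") for part in (user_id, folder_id, bookmarks_id)]
    url = f"{BASE_URL}/{parts[0]}/documents/{parts[1]}/import/{parts[2]}"
    if fmt is None:
        return url  # assumption: the suffix may be omitted
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{url}.{fmt}"


print(import_url("kdd09", "pc", "pc-senior", "xml"))
```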

HTTP Methods:

POST

Requires Authentication:

true

Parameters:

Usage Examples:

cURL:

curl -X POST -H "Token:mytoken" http://subsift.ilrt.bris.ac.uk/kdd09/documents/pc/import/pc-senior
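
The same request can be prepared from Python's standard library. This is a minimal sketch mirroring the cURL call above: the helper name build_import_request is hypothetical, "mytoken" is the placeholder token from that example, and the request is only constructed here, not sent.

```python
import urllib.request


def build_import_request(url, token):
    """Prepare a POST request that carries the API token in a Token header."""
    return urllib.request.Request(url, headers={"Token": token}, method="POST")


request = build_import_request(
    "http://subsift.ilrt.bris.ac.uk/kdd09/documents/pc/import/pc-senior",
    "mytoken",
)
print(request.get_method(), request.full_url)

# Sending is left commented out so that the sketch runs without network access:
# with urllib.request.urlopen(request) as response:
#     print(response.status)  # 202 is the status documented for this method
```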

Response:

XML example:

<?xml version="1.0" encoding="UTF-8"?>
<result>
  <ok>true</ok>
</result>
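
A client can check a response of this shape with a few lines of standard-library Python. The sketch below assumes the body matches the XML example above; the function name import_accepted is hypothetical, and because the format of an error response is not shown here, anything other than <ok>true</ok> is simply treated as not accepted.

```python
import xml.etree.ElementTree as ET


def import_accepted(xml_bytes):
    """Return True if the XML response body reports <ok>true</ok>."""
    root = ET.fromstring(xml_bytes)
    ok = root.findtext("ok")
    return root.tag == "result" and ok is not None and ok.strip() == "true"


# The XML example from this page, as it would arrive in the response body.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<result>\n"
    b"  <ok>true</ok>\n"
    b"</result>\n"
)
print(import_accepted(sample))
```

A true result means only that the import request was accepted; the document items themselves appear later, as the robot fetches each page.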