  1. <table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
  2.  
  3. <tr>
  4.  
  5. <td>
  6.  
  7.  
  8.  
  9. <!-- BEGIN MAIN CONTENT TABLE -->
  10.  
  11.  
  12.  
  13. <table width="100%" border="0" cellspacing="0" cellpadding="10" bgcolor="#ffffff">
  14.  
  15. <tr>
  16.  
  17.  
  18.  
  19. <td width="10"><img src="../../../img2/space.gif" alt="" width="1" height="1" /></td>
  20.  
  21.  
  22.  
  23. <td valign="top">
  24.  
  25.  
  26.  
  27. <h3 class="blue-space">D-Lib Magazine</h3>
  28.  
  29. <p class="blue">November/December 2014<br />
  30.  
  31. Volume 20, Number 11/12<br />
  32.  
  33. <a href="../11contents.html">Table of Contents</a>
  34.  
  35. </p>
  36.  
  37.  
  38.  
  39. <div class="divider-full">&#160;</div>
  40.  
  41.  
  42.  
  43. <h3 class="blue-space">The Architecture and Datasets of Docear's Research Paper Recommender System</h3>
  44.  
  45.  
  46.  
  47. <p class="blue">
  48.  
  49. Joeran Beel and Stefan Langer<br />
  50.  
  51. Docear, Magdeburg, Germany<br />
  52.  
  53. {beel, langer}@docear.org<br /><br />
  54.  
  55.  
  56.  
  57. Bela Gipp<br />
  58.  
  59. University of California Berkeley, USA &amp; National Institute of Informatics, Tokyo, Japan<br />
  60.  
  61. gipp@nii.ac.jp<br /><br />
  62.  
  63.  
  64.  
  65. Andreas N&#252;rnberger<br />
  66.  
  67. Otto-von-Guericke University, Magdeburg, Germany<br />
  68.  
  69. andreas.nuernberger@ovgu.de
  70.  
  71.  
  72.  
  73. <br /><br />doi:10.1045/november14-beel
  74.  
  75. </p>
  76.  
  77.  
  78.  
  79. <div class="divider-full">&#160;</div>
  80.  
  81.  
  82.  
  83. <p class="blue"><a href="11beel.print.html" class="fc">Printer-friendly Version</a></p>
  84.  
  85.  
  86.  
  87. <div class="divider-full">&#160;</div>
  88.  
  89.  
  90.  
  91. <!-- Abstract or TOC goes here -->
  92.  
  93.  
  94.  
  95. <h3 class="blue">Abstract</h3>
  96.  
  97.  
  98.  
  99. <p class="blue">
  100.  
  101. In the past few years, we have developed a research paper recommender system for our reference management software Docear. In this paper, we introduce the architecture of the recommender system and four datasets. The architecture comprises multiple components, e.g. for crawling PDFs, generating user models, and calculating content-based recommendations. It supports researchers and developers in building their own research paper recommender systems, and is, to the best of our knowledge, the most comprehensive architecture that has been released in this field. The four datasets contain metadata of 9.4 million academic articles, including 1.8 million articles publicly available on the Web; the articles' citation network; anonymized information on 8,059 Docear users; information about the users' 52,202 mind-maps and personal libraries; and details on the 308,146 recommendations that the recommender system delivered. The datasets are a unique source of information to enable, for instance, research on collaborative filtering, content-based filtering, and the use of reference-management and mind-mapping software.
  102.  
  103. </p>
  104.  
  105.  
  106.  
  107. <p class="blue">Keywords: Dataset, Recommender System, Mind-map, Reference Manager, Framework, Architecture</p>
  108.  
  109.  
  110.  
  111. <!-- Article goes next -->
  112.  
  113.  
  114.  
  115. <div class="divider-full">&#160;</div>
  116.  
  117. <h3>1. Introduction</h3>
  118.  
  119.  
  120.  
  121. <p>Researchers and developers in the field of recommender systems can benefit from publicly available architectures and datasets.<span style="vertical-align: super;"><a href="#n1">1</a></span> <i>Architectures</i> help with the understanding and building of recommender systems, and are available in various recommendation domains such as e-commerce [<a href="#1">1</a>], marketing [<a href="#2">2</a>], and engineering [<a href="#3">3</a>]. <i>Datasets</i> support the evaluation of recommender systems by enabling researchers to evaluate their systems on the same data. Datasets are available in several recommendation domains, including <a href="http://g...content-available-to-author-only...s.org/datasets/movielens/">movies</a>, <a href="http://l...content-available-to-author-only...a.edu/millionsong/">music</a>, and <a href="http://w...content-available-to-author-only...l.de/ws/dc13/">baby names</a>. </p>
  122.  
  123.  
  124.  
  125. <p>In this paper, we present the architecture of <i><a href="http://d...content-available-to-author-only...r.org">Docear's</a></i> research paper recommender system. In addition, we present four datasets containing information about a large corpus of research articles, and Docear's users, their mind-maps, and the recommendations they received. By publishing the recommender system's architecture and datasets, we pursue three goals.</p>
  126.  
  127.  
  128.  
  129. <p>First, we want researchers to be able to understand, validate, and reproduce our research on Docear's recommender system [<a href="#4">4</a>-<a href="#10">10</a>]. In our previous papers, we often could not describe the recommender system in detail due to space restrictions. This paper provides the information on Docear's recommender system that is necessary to re-implement our approaches and to reproduce our findings.</p>
  130.  
  131.  
  132.  
  133. <p>Second, we want to support researchers when building their own research paper recommender systems. Docear's architecture and datasets ease the process of designing one's own system, estimating the required development times, determining the required hardware resources to run the system, and crawling full-text papers to use as recommendation candidates.</p>
  134.  
  135.  
  136.  
  137. <p>Third, we want to provide real-world data to researchers who have no access to such data. This is of particular importance, since the majority of researchers in the field of research paper recommender systems have no access to real-world recommender systems [<a href="#11">11</a>]. Our datasets allow analyses beyond the analyses we have already published, for instance to evaluate collaborative filtering algorithms, perform citation analysis, or explore the use of reference managers.</p>
  138.  
  139.  
  140.  
  141. <p>This paper will present related work, provide a general overview of Docear and its recommender system, introduce the architecture, and present the datasets.</p>
  142.  
  143.  
  144.  
  145. <div class="divider-full">&#160;</div>
  146.  
  147. <h3>2. Related Work</h3>
  148.  
  149.  
  150.  
  151. <p>Several academic services have published datasets and have thereby eased the process of researching and developing research paper recommender systems. <i><a href="http://w...content-available-to-author-only...e.org/faq/data.adp">CiteULike</a></i> and <i><a href="https://w...content-available-to-author-only...l.de/bibsonomy/dumps/">Bibsonomy</a></i> published datasets containing the social tags that their users added to research articles. The datasets were not originally intended for recommender system research but are frequently used for this purpose [<a href="#12">12</a>-<a href="#14">14</a>]. <i>CiteSeer</i> made its corpus of research papers <a href="http://c...content-available-to-author-only...u.edu/about/data">public</a>, as well as the citation graph of the articles, data for author name disambiguation, and the co-author network [<a href="#15">15</a>]. CiteSeer's dataset has been frequently used by researchers for evaluating research paper recommender systems [<a href="#12">12</a>], [<a href="#14">14</a>], [<a href="#16">16</a>-<a href="#22">22</a>]. Kris Jack, <i>et al.</i>, compiled a dataset based on the reference management software <i>Mendeley</i> [<a href="#23">23</a>]. The dataset includes 50,000 randomly selected personal libraries from 1.5 million users. These 50,000 libraries contain 4.4 million articles, 3.6 million of which are unique. For privacy reasons, Jack, <i>et al.</i> published only the unique IDs of the articles and no titles or author names. In addition, only libraries with at least 20 articles were included in the dataset. Sugiyama and Kan released <a href="http://w...content-available-to-author-only...u.sg/~sugiyama/SchPaperRecData.html">two small datasets</a>, which they created for their academic recommender system [<a href="#24">24</a>]. The datasets include some research papers, and the interests of 50 researchers. 
The CORE project released a <a href="http://c...content-available-to-author-only...c.uk/intro/data_dumps">dataset</a> with enriched metadata and full-texts of academic articles, which could be helpful in building a recommendation candidate corpus. </p>
  152.  
  153.  
  154.  
  155. <p>Architectures of research paper recommender systems have only been published by a few authors. The developers of the academic search engine <i>CiteSeer(x)</i> published an architecture that focused on crawling and searching academic PDFs [<a href="#25">25</a>], [<a href="#26">26</a>]. This architecture has some relevance for recommender systems since many tasks in academic search are related to recommender systems (e.g. crawling and indexing PDFs, and matching user models or search-queries with research papers). Bollen and van de Sompel published an architecture that later served as the foundation for the research paper recommender system <i>bX</i> [<a href="#27">27</a>]. This architecture focuses on recording, processing, and exchanging scholarly usage data. The developers of <i>BibTip</i> [28] also published an architecture that is similar to the architecture of bX (both bX and BibTip utilize usage data to generate recommendations). </p>
  156.  
  157.  
  158.  
  159. <div class="divider-full">&#160;</div>
  160.  
  161. <h3>3. Docear and Its Recommender System</h3>
  162.  
  163.  
  164.  
  165. <p><a href="http://d...content-available-to-author-only...r.org">Docear</a> is an open source literature suite for organizing references and PDFs, including the PDFs' annotations. Docear is available for Windows, Mac OS, and Linux and offers a recommender system for publicly available research papers on the Web. In contrast to most other reference managers, Docear uses mind-maps for information management. Figure 1 shows a screenshot depicting the management of PDFs in Docear, including annotations. A user can create several categories (e.g. "Academic Search Engines") and sub-categories (e.g. "Google Scholar"). Each category contains a number of PDFs, and for each PDF, the annotations made by the user (e.g. highlighted text, comments, and bookmarks) are displayed. If the cursor is moved over a PDF or annotation, the PDF's bibliographic data, such as the title and authors, is shown.</p>
  166.  
  167.  
  168.  
  169. <p>For the remainder of this paper, it is important to note that each element in the mind-map (i.e. each category, PDF, or annotation) is called a "node". Each node has some descriptive text (e.g. the category label or PDF's file name), an optional link to a file or web page (a click on the node opens the linked file or web page), and some further attributes such as the bibliographic data. For each node, the dates when the node was created, modified, and moved are sto