Plans
The DPLA aims to scale to 10 million books in its first 18 months. The Proteus infrastructure has been designed to support efficient processing, indexing, and retrieval for collections of this size.
To meet this goal, we propose a range of work to improve Proteus. Our top priorities are:
•stabilize APIs and modify the tools so that they are more convenient for us and more accessible by others wishing to build interfaces
•scale up and improve the accuracy of the document-document relation extraction to handle 10 million books. Currently, the automatic alignment-based duplicate detection can search all pairs in a 100,000-book collection using a large cluster.
•improve infrastructure for distributed search, exploiting Galago’s in-memory indices for low latency
•improve existing capabilities such as OCR, parsing, entity disambiguation, picture search, quotation detection, topic modeling, duplicate detection, and text reuse analysis
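To give a rough sense of the scaling gap behind the duplicate-detection priority above, the short sketch below counts the unordered book pairs at the current and target collection sizes. It assumes, purely for illustration, that the exhaustive all-pairs search were kept unchanged; the function name `unordered_pairs` is hypothetical and not part of Proteus.

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Collection size for which all pairs can currently be searched (from the text).
current = unordered_pairs(100_000)

# DPLA target collection size (from the text).
target = unordered_pairs(10_000_000)

# Factor by which the number of candidate pairs grows (illustrative only).
growth = target / current

print(f"pairs at 100,000 books:    {current:,}")
print(f"pairs at 10,000,000 books: {target:,}")
print(f"growth factor:             {growth:,.1f}")
```

Under that assumption, a 100-fold increase in collection size corresponds to roughly a 10,000-fold increase in candidate pairs, which is why scaling this component is listed among the top priorities.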
In addition, we are interested in some or all of the following areas. Priority among them (and the previous four) will be based on discussion with DPLA stakeholders and will depend upon available resources.
•expanding evidence finding techniques for adding book citations to Wikipedia and other texts
•developing more case studies and evaluation collections to measure progress
•providing support for collecting, indexing, searching, and learning from user annotations
•using NLP techniques to automatically enhance metadata
•improving models for translation detection for major languages
•adapting to varying OCR quality across books in the collection and mitigating poor OCR by re-running recognition with new models or by performing post-correction
•supporting topical similarity among documents and passages to aid exploration and speed up retrieval
•providing time-based searching and browsing for entities and topics
We are interested in ideas from the DPLA community about new interfaces, which would suggest new capabilities for the Proteus infrastructure.