Exploiting WWW Resources in Experimental Document Analysis Research
01 January 2002
Many large collections of document images are becoming available online as part of digital library initiatives, fueled by the explosive growth of the World Wide Web. In this paper, we examine protocols and system-related issues that arise in attempting to make use of these new resources, both as a target application (building better search engines) and as a way of overcoming the problem of acquiring ground truth to support experimental document analysis research. We also report on our experiences running a simple test involving data drawn from one such collection. The potential synergies between document analysis and digital libraries could lead to substantial benefits for both communities.