Convolutional Neural Networks for Figure Extraction in Historical Technical Documents
09 November 2017
We present a novel method for extracting figures and images from scanned document pages, with a focus on historical technical documents. The method relies on an existing tool, PDF Figures, to extract figures from modern, structured PDF documents, and uses that output as training data for a convolutional neural network that locates figures on a scanned page. The network outputs a bitmap image whose non-zero pixels mark predicted figure locations. The method is easily extensible: the training data are generated fully automatically, with no hand labeling, so a larger training set can be obtained simply by extracting figures from a larger collection of structured documents. We have implemented prototypes of the method in Torch. The network is exceptionally effective at extracting figures from modern conference and journal papers, and it also adapts well to a test set of historical technical documents that we downloaded from the Bell Labs Records.
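The bitmap representation described above can be illustrated with a minimal sketch. It is not the Torch prototype and assumes nothing about the actual output format of PDF Figures: the function names `boxes_to_mask` and `mask_to_boxes`, the end-exclusive `(x0, y0, x1, y1)` pixel box convention, and the use of 4-connected regions to recover boxes are all assumptions made for illustration. The first function turns known figure boxes into a 0/1 training target; the second turns a bitmap of the same kind back into figure boxes.

```python
from collections import deque


def boxes_to_mask(width, height, boxes):
    """Rasterize figure boxes into a 0/1 bitmap (a list of rows).

    Each box is a hypothetical (x0, y0, x1, y1) tuple in pixel
    coordinates, end-exclusive; boxes are clipped to the page.
    """
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask


def mask_to_boxes(mask):
    """Return the bounding box of every 4-connected non-zero region.

    Regions are reported in row-major order of their first pixel,
    using the same end-exclusive (x0, y0, x1, y1) convention.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for sy in range(height):
        for sx in range(width):
            if mask[sy][sx] == 0 or seen[sy][sx]:
                continue
            # Breadth-first flood fill from this unvisited figure pixel.
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0, y0, x1, y1 = sx, sy, sx + 1, sy + 1
            while queue:
                x, y = queue.popleft()
                x0, y0 = min(x0, x), min(y0, y)
                x1, y1 = max(x1, x + 1), max(y1, y + 1)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] != 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


# Two made-up figure boxes on a 10 x 6 pixel "page".
figure_boxes = [(1, 1, 4, 3), (6, 2, 9, 5)]
target = boxes_to_mask(10, 6, figure_boxes)
recovered = mask_to_boxes(target)
```

In this toy case the recovered boxes equal the input boxes because the two rectangles do not touch. A real network output would be noisier, so some thresholding or filtering of small regions would likely be needed before this step.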