Convolutional Neural Networks for Figure Extraction in Historical Technical Documents

We present a novel method for extracting figures and images from scanned document pages, with a focus on historical technical documents. Our method relies on existing tools, in particular PDF Figures, to extract figures from modern structured PDF documents. The extracted figures serve as training data for a convolutional neural network that locates figures on a scanned page. The output of the network is a bitmap image whose non-zero pixels correspond to predicted figure locations. The method is easily extensible: the training data are generated in a fully automated fashion with no hand labeling, so larger training sets can be obtained simply by extracting figures from a larger set of structured documents. We have implemented prototypes of this method in Torch. The network is highly effective at extracting figures from modern conference and journal papers, and it also adapts well to a test set of historical technical documents that we downloaded from the Bell Labs Records.
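To make the bitmap representation concrete, the following is a minimal sketch in Python, not the authors' Torch implementation. It assumes that figure locations are available as pixel-coordinate boxes (x0, y0, x1, y1), with the right and bottom edges exclusive. The abstract does not describe the PDF Figures output format or how predicted bitmaps are post-processed, so the function names boxes_to_mask and mask_to_boxes, and the use of connected regions to recover boxes, are illustrative assumptions.

```python
from collections import deque


def boxes_to_mask(height, width, boxes):
    """Rasterize figure boxes into a 0/1 bitmap (a possible training target).

    Assumption: each box is (x0, y0, x1, y1) in pixels, end-exclusive.
    """
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the page so out-of-range boxes cannot raise.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask


def mask_to_boxes(mask):
    """Return one bounding box per 4-connected region of non-zero pixels.

    Assumption: this post-processing step is hypothetical; it is one simple
    way to turn a predicted bitmap back into figure rectangles.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for sy in range(height):
        for sx in range(width):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first flood fill from this unvisited figure pixel.
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0, y0, x1, y1 = sx, sy, sx + 1, sy + 1
            while queue:
                x, y = queue.popleft()
                x0, y0 = min(x0, x), min(y0, y)
                x1, y1 = max(x1, x + 1), max(y1, y + 1)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


if __name__ == "__main__":
    # A tiny 6 x 8 "page" with two non-touching figure boxes.
    page_boxes = [(1, 1, 4, 3), (5, 4, 8, 6)]
    target = boxes_to_mask(6, 8, page_boxes)
    for row in target:
        print("".join(str(v) for v in row))
    print(mask_to_boxes(target))
```

In this sketch the same bitmap format serves both as the training target built from automatically extracted boxes and as the form of the network's prediction, which is what allows the training data to be produced without hand labeling.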

Authors

Iraj Saniee

Group Leader

Chun-Nam Yu

Principal Researcher
