Please read the README.txt file carefully before deploying.
The application is freely available for research purposes only. Commercial use in any form is strictly prohibited. For further information, see the paper below:
Please cite this paper when you use the HDPA framework in your experiments.
Historical German OCR Corpus v 0.1 contains image representations of sentences from the historical German corpus https://cl.lingfil.uu.se/histcorp/index.html. There are 5 different types of generated synthetic datasets, each containing 25,000 samples with appropriate labels. There is also an annotated dataset which contains 1386 samples. The image labels are stored in TXT files with UTF-8 encoding.
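As a sketch of how the UTF-8 label files might be read, assuming a hypothetical layout in which each image sample has a same-named .txt label file (the actual archive layout and filenames may differ):

```python
import tempfile
from pathlib import Path

# Hypothetical layout: image sample 00001.png is labelled by 00001.txt (UTF-8).
root = Path(tempfile.mkdtemp())
(root / "00001.txt").write_text("Die alte Straße", encoding="utf-8")

def load_label(image_stem: str, label_dir: Path) -> str:
    """Read the UTF-8 label text for one image sample."""
    return (label_dir / f"{image_stem}.txt").read_text(encoding="utf-8").strip()

print(load_label("00001", root))  # -> Die alte Straße
```

Reading with an explicit `encoding="utf-8"` matters here, since the historical German labels contain non-ASCII characters.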
https://github.com/Belval/TextRecognitionDataGenerator
The generator, together with usage instructions, is available in the archive below.
The datasets and tools are freely available for research purposes only. Commercial use in any form is strictly prohibited. For further information, see the paper below:
Please cite this paper when you use this database or these tools in your experiments.
If you have any questions or comments related to this corpus or the tools, please do not hesitate to contact the authors: Jiri Martinek jimar@kiv.zcu.cz, Ladislav Lenc llenc@kiv.zcu.cz or Pavel Kral pkral@kiv.zcu.cz.
Czech OCR Corpus v 0.1 is a collection of documents for optical character recognition. It is composed of 20 documents from the Czech Wikipedia. Every document was printed and scanned. The scanning was done at three different resolutions (150, 300 and 600 DPI), and the scans at each resolution are placed in their own folder. The scanned documents are stored in PDF format. Each document is at most one page long (the longest document contains 523 words, the shortest 119 words, and the average is 299 words). Each document also has a text representation in a .txt file. There is additionally a single file containing all the documents in .docx format (corpus.docx). The text representations of the documents are stored as individual text files with UTF-8 encoding in the TXT folder. Each filename in the TXT folder corresponds to the filenames in the 150DPI, 300DPI and 600DPI folders.
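Since the filenames in TXT correspond to those in the resolution folders, pairing a ground-truth file with its scans is a matter of path manipulation. A minimal sketch, using a hypothetical document name (`doc01`) and assuming the scans carry a `.pdf` extension as described above:

```python
import tempfile
from pathlib import Path

# Recreate the described folder layout with one hypothetical document name.
root = Path(tempfile.mkdtemp())
for folder in ("TXT", "150DPI", "300DPI", "600DPI"):
    (root / folder).mkdir()
(root / "TXT" / "doc01.txt").write_text("Ukázkový text.", encoding="utf-8")
for dpi in ("150DPI", "300DPI", "600DPI"):
    (root / dpi / "doc01.pdf").write_bytes(b"%PDF-1.4")  # placeholder scan

def scans_for(txt_file: Path) -> dict:
    """Map each resolution folder to the scan matching a TXT ground-truth file."""
    return {dpi: txt_file.parents[1] / dpi / f"{txt_file.stem}.pdf"
            for dpi in ("150DPI", "300DPI", "600DPI")}

pairs = scans_for(root / "TXT" / "doc01.txt")
print(all(p.exists() for p in pairs.values()))  # -> True
```

The same stem-matching works for iterating over the whole corpus: glob `TXT/*.txt` and resolve each stem into the three resolution folders.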
This corpus is freely available for research purposes only. Commercial use in any form is strictly prohibited. For further information, see the paper below:
Please cite this paper when you use this database in your experiments.
If you have any questions or comments related to this corpus, please do not hesitate to contact the authors: Jiri Martinek jimar@kiv.zcu.cz or Pavel Kral pkral@kiv.zcu.cz.
This dataset was created from document images digitised within the Porta fontium project. We have selected a newspaper called “Ascher Zeitung”. This newspaper dates back to the second half of the nineteenth century and is printed in German using a Fraktur font. We collected 25 pages in total. The pages were annotated using Aletheia and the layout description is saved in the PAGE XML format. The transcription was produced using the tools presented in the paper “Tools for semi-automatic preparation of training data for OCR”. All pages are annotated on the paragraph level by bounding polygons. The test set also contains text lines with corresponding baselines.
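The paragraph-level bounding polygons can be pulled out of the PAGE XML with the standard library alone. The following is a sketch against a minimal, hand-made PAGE XML fragment; the real files produced by Aletheia carry far more metadata, and the schema version in the namespace may differ from the one assumed here:

```python
import xml.etree.ElementTree as ET

# Minimal hand-made PAGE XML fragment (namespace version is an assumption).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
sample = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="ascher_zeitung_p01.png" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="10,10 500,10 500,200 10,200"/>
    </TextRegion>
  </Page>
</PcGts>"""

def region_polygons(xml_text: str) -> dict:
    """Extract the bounding polygon (list of (x, y) points) of each text region."""
    root = ET.fromstring(xml_text)
    ns = {"pc": PAGE_NS}
    polygons = {}
    for region in root.iterfind(".//pc:TextRegion", ns):
        points = region.find("pc:Coords", ns).get("points")
        polygons[region.get("id")] = [tuple(map(int, p.split(",")))
                                      for p in points.split()]
    return polygons

print(region_polygons(sample))
# {'r1': [(10, 10), (500, 10), (500, 200), (10, 200)]}
```

The `points` attribute is a space-separated list of `x,y` pairs, so each region comes back as an ordered list of vertices ready for cropping or mask rasterisation.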
This corpus is freely available for research purposes only. Commercial use in any form is strictly prohibited. For further information, see the paper below:
Please cite this paper when you use this database in your experiments.
If you have any questions or comments related to this corpus, please do not hesitate to contact the authors: Ladislav Lenc llenc@kiv.zcu.cz, Jiri Martinek jimar@kiv.zcu.cz or Pavel Kral pkral@kiv.zcu.cz.