Please read the README.txt file carefully before deploying.
The application is freely available for research purposes only. Commercial use in any form is strictly prohibited. For further information, see the paper below:
Please cite this paper when you use the HDPA framework in your experiments.
Historical German OCR Corpus v 0.1 contains image representations of sentences from the historical German corpus https://cl.lingfil.uu.se/histcorp/index.html. There are 5 different types of generated synthetic datasets, each containing 25,000 samples with appropriate labels. There is also an annotated dataset which contains 1386 samples. The image labels are stored in TXT files with UTF-8 encoding.
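As a sketch of how the UTF-8 label files might be read, assuming a hypothetical layout in which each image sample has a same-named .txt label file (the actual archive layout and filenames may differ):

```python
import tempfile
from pathlib import Path

# Hypothetical layout: image sample 00001.png is labelled by 00001.txt (UTF-8).
root = Path(tempfile.mkdtemp())
(root / "00001.txt").write_text("Die alte Straße", encoding="utf-8")

def load_label(image_stem: str, label_dir: Path) -> str:
    """Read the UTF-8 label text for one image sample."""
    return (label_dir / f"{image_stem}.txt").read_text(encoding="utf-8").strip()

print(load_label("00001", root))  # -> Die alte Straße
```

Reading with an explicit `encoding="utf-8"` matters here, since the historical German labels contain non-ASCII characters.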
https://github.com/Belval/TextRecognitionDataGenerator
The generator, together with usage instructions, is available in the archive below.
The datasets and tools are freely available for research purposes only. Commercial use in any form is strictly prohibited. For further information, see the paper below:
Please cite this paper when you use this database or these tools in your experiments.
If you have any questions or comments related to this corpus or the tools, please do not hesitate to contact the authors: Jiri Martinek jimar@kiv.zcu.cz, Ladislav Lenc llenc@kiv.zcu.cz or Pavel Kral pkral@kiv.zcu.cz.
Czech OCR Corpus v 0.1 is a collection of documents for optical character recognition. It is composed of 20 documents from the Czech Wikipedia. Every document was printed and scanned. The scanning was done at three different resolutions (150, 300 and 600 DPI), and the scans at each resolution are placed in their own folder. The scanned documents are stored in PDF format. Each document is at most one page long (the longest document contains 523 words, the shortest 119 words, and the average is 299 words). Each document also has a text representation in a .txt file. There is additionally a single file containing all the documents in .docx format (corpus.docx). The text representations of the documents are stored as individual text files with UTF-8 encoding in the TXT folder. Each filename in the TXT folder corresponds to the filenames in the 150DPI, 300DPI and 600DPI folders.
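Since the filenames in TXT correspond to those in the resolution folders, pairing a ground-truth file with its scans is a matter of path manipulation. A minimal sketch, using a hypothetical document name (`doc01`) and assuming the scans carry a `.pdf` extension as described above:

```python
import tempfile
from pathlib import Path

# Recreate the described folder layout with one hypothetical document name.
root = Path(tempfile.mkdtemp())
for folder in ("TXT", "150DPI", "300DPI", "600DPI"):
    (root / folder).mkdir()
(root / "TXT" / "doc01.txt").write_text("Ukázkový text.", encoding="utf-8")
for dpi in ("150DPI", "300DPI", "600DPI"):
    (root / dpi / "doc01.pdf").write_bytes(b"%PDF-1.4")  # placeholder scan

def scans_for(txt_file: Path) -> dict:
    """Map each resolution folder to the scan matching a TXT ground-truth file."""
    return {dpi: txt_file.parents[1] / dpi / f"{txt_file.stem}.pdf"
            for dpi in ("150DPI", "300DPI", "600DPI")}

pairs = scans_for(root / "TXT" / "doc01.txt")
print(all(p.exists() for p in pairs.values()))  # -> True
```

The same stem-matching works for iterating over the whole corpus: glob `TXT/*.txt` and resolve each stem into the three resolution folders.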
This corpus is freely available for research purposes only. Commercial use in any form is strictly prohibited. For further information, see the paper below:
Please cite this paper when you use this database in your experiments.
If you have any questions or comments related to this corpus, please do not hesitate to contact the authors: Jiri Martinek jimar@kiv.zcu.cz or Pavel Kral pkral@kiv.zcu.cz.
This dataset was created from document images digitised within the Porta fontium project. We have selected a newspaper called “Ascher Zeitung”. This newspaper dates back to the second half of the nineteenth century and is printed in German using a Fraktur font. We collected 25 pages in total. The pages were annotated using Aletheia and the layout description is saved in the PAGE XML format. The transcription was produced using the tools presented in the paper “Tools for semi-automatic preparation of training data for OCR”. All pages are annotated on the paragraph level by bounding polygons. The test set also contains text lines with corresponding baselines.
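The paragraph-level bounding polygons can be pulled out of the PAGE XML with the standard library alone. The following is a sketch against a minimal, hand-made PAGE XML fragment; the real files produced by Aletheia carry far more metadata, and the schema version in the namespace may differ from the one assumed here:

```python
import xml.etree.ElementTree as ET

# Minimal hand-made PAGE XML fragment (namespace version is an assumption).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
sample = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="ascher_zeitung_p01.png" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="10,10 500,10 500,200 10,200"/>
    </TextRegion>
  </Page>
</PcGts>"""

def region_polygons(xml_text: str) -> dict:
    """Extract the bounding polygon (list of (x, y) points) of each text region."""
    root = ET.fromstring(xml_text)
    ns = {"pc": PAGE_NS}
    polygons = {}
    for region in root.iterfind(".//pc:TextRegion", ns):
        points = region.find("pc:Coords", ns).get("points")
        polygons[region.get("id")] = [tuple(map(int, p.split(",")))
                                      for p in points.split()]
    return polygons

print(region_polygons(sample))
# {'r1': [(10, 10), (500, 10), (500, 200), (10, 200)]}
```

The `points` attribute is a space-separated list of `x,y` pairs, so each region comes back as an ordered list of vertices ready for cropping or mask rasterisation.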
This corpus is freely available for research purposes only. Commercial use in any form is strictly prohibited. For further information, see the paper below:
Please cite this paper when you use this database in your experiments.
If you have any questions or comments related to this corpus, please do not hesitate to contact the authors: Ladislav Lenc llenc@kiv.zcu.cz, Jiri Martinek jimar@kiv.zcu.cz or Pavel Kral pkral@kiv.zcu.cz.