==========================================================================================================

      Czech Document OCR Corpus v 0.1
==========================================================================================================


BASIC INFORMATION
--------------------

Czech Document OCR Corpus v 0.1 is a collection of documents for optical character recognition (OCR). It consists of 20 documents taken from Czech Wikipedia. Every document was printed and then scanned at three different resolutions (150, 300 and 600 DPI). The scans for each resolution are placed in their own folder and are stored in PDF format. Each document is at most one page long: the longest contains 523 words, the shortest 119 words, and the average length is 299 words. Each document also has a text representation in a .txt file. In addition, a single file, corpus.docx, contains all the documents in .docx format.

TECHNICAL DETAILS
--------------------

The text representation of each document is stored in an individual UTF-8 encoded text file in the TXT folder. Each filename in the TXT folder corresponds to the filenames in the 150DPI, 300DPI and 600DPI folders.
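
The folder layout described above can be walked with a short script. The following is a minimal Python sketch, not part of the corpus itself. It assumes that a text file and its scans share the same base name and differ only in extension (.txt versus .pdf); the README only says that the filenames "correspond", so this rule may need adjusting. The function names pair_texts_with_scans and read_text are hypothetical helpers introduced for illustration.

```python
from pathlib import Path

# Folder names are taken from this README.
SCAN_FOLDERS = ("150DPI", "300DPI", "600DPI")


def pair_texts_with_scans(corpus_root):
    """Map each document name to its text file and its scans per resolution.

    Assumption: TXT/<name>.txt corresponds to <folder>/<name>.pdf.
    """
    root = Path(corpus_root)
    pairs = {}
    for txt_path in sorted((root / "TXT").glob("*.txt")):
        scans = {}
        for folder in SCAN_FOLDERS:
            pdf_path = root / folder / (txt_path.stem + ".pdf")
            # Record only the scans that actually exist on disk.
            if pdf_path.is_file():
                scans[folder] = pdf_path
        pairs[txt_path.stem] = {"text": txt_path, "scans": scans}
    return pairs


def read_text(txt_path):
    """Read a ground-truth text file using the UTF-8 encoding stated above."""
    return Path(txt_path).read_text(encoding="utf-8")
```

For example, calling pair_texts_with_scans("path/to/corpus") (the path is a placeholder) returns a dictionary keyed by document name. Any document whose "scans" entry has fewer than three resolutions indicates that the assumed naming rule does not hold for that file.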


This corpus is available free of charge for research purposes only. Commercial use in any form is strictly prohibited.



AUTHORS
--------------------
Jiri Martinek jimar@kiv.zcu.cz
Pavel Kral pkral@kiv.zcu.cz

Date: November, 2017
