How to save an edited document in ABBYY FineReader. How to recognize scanned text with ABBYY FineReader. Working with the program

This article is about ABBYY FineReader 12, the latest version of the program. We did not have to look far: this is ABBYY's best-known product, and one of its merits is an excellent Russian localization. Even at first glance, FineReader (FR) gives the impression of a program with good Russian-language support: everything in this respect, including the help files, is done to a very decent standard.

First, a digression. The question of how to convert all or part of an archive into digital form (and what exactly "digital" means) is always relevant. Buying a scanner hardly solves every problem. Scanners often do ship with one or more discs of bundled software. However, on first acquaintance it usually becomes clear that the quality of the scanning program leaves much to be desired, or that the format it saves in is not suitable for storage. Why? Most graphic formats do not separate text from the rest of the document image, so it is impossible to copy an excerpt from such a file.

It is in such cases that functional text recognition programs come to the rescue, whose capabilities, in particular, include extracting text from an image.

Introducing ABBYY FineReader

ABBYY FineReader 12 is an optical character recognition (OCR) system. It is designed both for automatically entering printed documents into a computer and for converting PDF documents and photographs into editable formats (from the program manual).

The abbreviation "OCR" applies to all data recognition applications (not just text). The data can be extracted from a printed or an electronic document. Not so long ago, few people had heard of OCR in any form, and converting text into electronic form was pure routine, up to and including retyping the original by hand. Today, with a flatbed scanner (few people use a hand-held scanner at home) and FineReader 12, scanning and recognition should not cause any difficulty.

Starting with the sixth version, FineReader supports import and export in PDF, the format developed by Adobe. Many readers have probably had trouble converting from this format to another (DOC and so on), since there are few really useful programs in this area (only ABBYY's own PDF Transformer deserves attention). The problem is that such programs recognize the text in a single pass, so the fidelity of the result is poor (depending on the complexity of the document), and much of the document's formatting is lost.

This is not the case with FineReader. The ninth version of the program includes a technology called Document OCR. It is based on the principle of integral document recognition: it is analyzed and recognized as a whole, and not page by page. At the same time, all kinds of columns, headers and footers, fonts, styles, footnotes and images remain intact or are replaced by those close to the original.

Installing the package

A demo version of FineReader 12 can be downloaded from the Abbyy.ru website, in the Download section; the full licensed version is distributed on CD-ROM. Purchase options are listed on the same website, in the "Buy" section.

On the ABBYY developers website you can download a demo version of ABBYY FineReader version 12 (or another one that is relevant for today)

ABBYY FineReader is distributed in several editions: Professional Edition, Corporate Edition, Site License Edition and others. The Professional edition differs from the rest in that the others are designed for a corporate network, where several people can work on document recognition together. Otherwise the differences are minor and come down to the terms of the license agreement.

It's hard to imagine that 12 years ago there was FineReader 2.0, which took up about 10 MB of disk space. Over time the package has grown tenfold and now occupies up to 300 MB when installed. Whether that is a lot or a little, judge for yourself. The new FR supports 179 recognition languages, including little-known artificial languages (Ido, Interlingua, Occidental and Esperanto), programming languages, formulas, etc. Let's not forget the support for various formats and scripts. So, if for some reason you want to limit the space the package occupies, select only the components you will actually need during installation.

The choice of components affects the installation time, which, however, should not take long. During the installation process, you will be introduced to the basic features of FR. After activation (via the Internet, via E-mail, using the received code, etc.), the program is ready for fully functional work. In demo mode, you will certainly encounter various restrictions, which, unfortunately, do not allow you to fully use the package.

FineReader interface. Functionality

Access to the program's capabilities is available both through the scripts that appear in the main menu immediately after the installation process, and, in fact, through the main interface.


FineReader startup splash screen

The appearance of the program changes little from version to version: the developers see no reason to alter it radically. Considerable attention is paid to ergonomics, which is noticeable in all ABBYY products (Lingvo, PDF Transformer, FlexiCapture...). In other words, the FineReader 12 interface is well thought out and accessible to all users, including beginners. The principle of "Get the result in one click" will appeal to those who are not used to adjusting settings. On the other hand, more experienced users can fine-tune FineReader through the preferences dialog (Tools -> Options...). The only caveat: for comfortable work in the application, it is advisable to set the screen resolution to at least 1280 × 800, so that all the tools are always at hand.

After FineReader starts, a window appears with buttons for quick access to the program's functions. This window is also available through the Tools -> ABBYY FineReader menu, the "Basic Scripts" button in the far right corner of the program, or the Ctrl + N keyboard shortcut (by analogy with Word, where this combination opens a new document).

Scan to Microsoft Word: the ninth version of FineReader added support for Microsoft Word 2007, which had not yet become popular at the time. In turn, after FR is installed, ABBYY's signature red icon appears on the toolbar of Microsoft Office applications, in the add-ins section.


Menu for exporting a recognized FineReader document
Choosing languages for scanning and recognizing documents

In addition to Microsoft Office, FR integrates with Microsoft Outlook and exports recognition results to Microsoft Word, Excel, Lotus Word Pro, Corel WordPerfect and Adobe Acrobat. These features make working with the program somewhat easier and faster, especially if you use it regularly.

PDF or images to Microsoft Word: recognize data from a PDF or another type of graphic file supported by FineReader 12. Note that extracting text from a PDF file in FR is not just a matter of "peeling" the text content off the graphic content (a PDF may have no text layer at all). In fact, the recognition technology is quite sophisticated: after analyzing the content of the document, the program decides what to do with the text, whether to simply extract it or to recognize it, and does so for each text fragment.

Scan to Microsoft Excel: scanning to XLS (Microsoft Excel format) can be justified if the scanned image contains tables.

Scan to PDF: There are many reasons to scan to PDF. One of them is security: this is the only format familiar to FR, in the settings of which you can set a password lock. The password is set not only for opening a document, but also for printing it and other operations. It is possible to choose one of three encryption levels: 40-bit, 128-bit based on RC4 standard, 128-bit level based on AES (Advanced Encryption Standard).

Convert photo to Microsoft Word: conversion of a file from a graphic format (which can be PDF or a multi-page image) to DOC / DOCX.

Open in Fine Reader: open a graphic file (PDF, BMP, PCX, DCX, JPEG, JPEG 2000, TIFF, PNG) for FineReader recognition.

Working in FineReader

Now, briefly, about how the program works. The whole process is divided into scanning, recognition and saving the results. Once you have chosen the type of task and specified the file or scanning device, FineReader carries out the job step by step; note that this is quite demanding on the CPU.

If you are the happy owner of a dual-core processor, FineReader 12 will let you appreciate what your computer can do. Having detected a dual-core processor, FR recognizes not one but two pages of a document in parallel. A small thing, but a pleasant one.

First, there is scanning, then recognition and export of the temporary document in the selected format.


PDF document recognition process

Scanning. You do not need to make any settings in FineReader (other than selecting the scanning device) before scanning. This is exactly what scenarios were invented for: they simplify the execution of repetitive actions.

Recognition. Other details have been simplified as well. In previous versions of the program, the language (or languages, if there were several) of the document had to be changed manually. Now this happens automatically, though not always; when it does not, FR discreetly suggests checking the document language.

Returning to FR's recognition technology: why does the program first analyze the document as a whole rather than page by page? As already mentioned, the text is recognized on the basis of the entire content: fonts of the same size and typeface, tables and borders, indents and so on are identified across the document.

Do not be surprised if FineReader 12 reports that a page cannot be recognized because no text area was found. As an experiment, we used a mobile phone to photograph part of a text document displayed on an LCD screen (knowing the result in advance). FineReader 12 did not recognize the text, because the image quality was clearly insufficient. For the second run, we photographed a page of text with a digital camera under normal lighting.

FineReader recognized the passage without any problems, retaining the formatting and marking with markers some questionable points or characters that may have variable spelling.

As you can see in the image, these are mainly dots, hyphens and commas, that is, small characters. It is also clear that the program took into account the unevenness and curvature of the photographed page and straightened the lines of text. Conclusion: FR did an excellent job, even if the task was not a very difficult one.

Occasionally, some minor points may go unnoticed by the Fine Reader program, but they are easy to correct manually. Fortunately, the package has its own WYSIWYG editor, the capabilities of which are quite enough for the final editing of the document. A spell check is also available.

How can you improve recognition accuracy so that less editing is needed afterwards? First, you can connect a custom Microsoft Word dictionary. Admittedly, it is hard to judge the gain in accuracy, apart from the larger vocabulary available to the spell checker (the module that checks spelling and grammar). Beyond that, it makes sense to look through the program settings (Tools -> Options) and choose one of two modes:

  • thorough recognition - can be selected for documents of any "complexity": tables without grid lines, text, graphs, tables on a colored background, etc. It can also help with a poor-quality source;
  • fast recognition - recommended for processing large volumes of documents with a simple layout, or when time does not allow for thorough recognition. In most cases, when you have black printed text on a white background, fast recognition is sufficient.

In general, improving the quality of FineReader's work is a separate topic for conversation, the details of which you can learn from the official help, namely in the section "How to improve the results obtained".

Saving the document. The last stage of work in FineReader 12 is saving the final result in a particular graphic or text format. Save settings can be specified in advance in the FR options: Tools -> Options, the "Save" tab. Each format has its own settings. When saving in DOCX, take care over format compatibility (DOCX files are not recognized by Word 2003 and earlier). For TXT files, remember to check that the encoding is correct (especially for Cyrillic text).
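The encoding of an exported TXT file can also be checked outside FineReader. Below is a minimal Python sketch, assuming the file is either UTF-8 or Windows-1251; the candidate list and the function name `pick_encoding` are illustrative assumptions and have nothing to do with FineReader itself.

```python
from typing import Optional


def pick_encoding(data: bytes, candidates=("utf-8", "cp1251")) -> Optional[str]:
    """Return the first candidate encoding that decodes the bytes without errors."""
    for name in candidates:
        try:
            data.decode(name)  # strict mode: raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        return name
    return None


# Illustrative sample text, encoded two different ways
sample = "Привет, мир"
utf8_guess = pick_encoding(sample.encode("utf-8"))
cp1251_guess = pick_encoding(sample.encode("cp1251"))
```

UTF-8 is tried first on purpose: it rejects most byte sequences that are not valid UTF-8, whereas a single-byte encoding such as Windows-1251 accepts almost anything, so it is only useful as a fallback.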

ABBYY Screenshot Reader

Developers of large packages often like to add small service utilities. For example, the well-known disc-burning application Nero includes a set of three to five utilities that can do things even Nero itself cannot.

As for FineReader, it includes one small application, Screenshot Reader. With it you can take a screenshot and quickly convert it to the desired format using FR. The program is available through the Start menu (Start -> All Programs -> ABBYY FineReader 12.0 -> ABBYY Screenshot Reader).

Screenshot Reader's capabilities are somewhat wider than they might seem at first glance (otherwise a simple press of the "PrintScreen" key would do). Besides taking a screenshot (or, more precisely, capturing a selected area of the screen), the program is tightly integrated with FR.

When you press the "Snapshot" button on the Screenshot Reader panel, the cursor changes shape and the screen area selection tool is activated. The selected area of the image is enclosed in a frame for further text recognition (it starts automatically).

In the drop-down list, you can select the desired action: in fact, Screenshot Reader duplicates FR quick scripts with the difference that instead of a snapshot from the scanner, a screenshot is sent to the input.

Note that the program, like the rest of the package, requires activation. When you register ABBYY FineReader 12 Professional Edition, Screenshot Reader is provided free of charge as a "bonus".

Conclusion

FineReader is an indispensable program for scanning and recognizing graphic data. The Russian-language interface and the availability of settings will not scare off an inexperienced user. Support for the latest formats, innovative technologies and, as a result, high-quality recognition make the program the best choice, especially since ABBYY FineReader still has no competitors in this area.

FineReader 12 hotkeys

  • Create a new ABBYY FineReader document - CTRL + N
  • Open an ABBYY FineReader 12 document - CTRL + SHIFT + N
  • Save Pages - CTRL + S
  • Save image to file - CTRL + ALT + S
  • Recognize all pages of a document - CTRL + SHIFT + R
  • Close current page - CTRL + F4
  • Recognize selected pages of an ABBYY FineReader document - CTRL + R
  • Open Scenario Manager - CTRL + T
  • Open the FineReader Options dialog - CTRL + SHIFT + O
  • Open Help - F1
  • Go to the Document window - ALT + 1
  • Go to the Image window - ALT + 2
  • Go to the Text window - ALT + 3
  • Go to the Close-up window - ALT + 4

Although the advances in artificial intelligence (AI) over the past 50 years have not brought smart machines one iota closer to human cognitive capabilities, it would be unfair to deny all progress in this direction. The most obvious and striking example is chess (not to mention simpler games). The computer cannot yet imitate our thinking, but it is quite capable of compensating for this gap with a large amount of specialized memory and search speed. Vladimir Kramnik described the play of the Deep Fritz program that defeated him in 2006 as "inhuman", in the sense that it often contradicted the established (human) rules of strategy and tactics.

And a little over a year ago Watson, another brainchild of IBM (the company that once laid the foundation for the computers' triumphant chess victories with the famous Deep Blue), made a new breakthrough, beating two champions of the popular American quiz show Jeopardy! by a wide margin. It is significant, however, that although Watson voiced its answers itself, the questions were still passed to it in text form. This suggests that successes in many areas of applied AI, such as speech and image recognition or machine translation, are rather modest, although that does not prevent us from using them in practice today. The greatest successes are perhaps demonstrated by optical character recognition (OCR) systems, with which almost all PC users are probably familiar in one way or another. Moreover, Russian developments occupy a worthy place in the world in this area: I mean ABBYY FineReader.

A bit of history

The current version of ABBYY FineReader is number 11, so the application has come a long way, and the history of its development is of some interest in itself. Without pretending to offer an exhaustive chronicle, I will cite only the main milestones of the last decade, during which I have followed FineReader more or less closely:

Year | Version | Key features
2003 | 7.0 | Increase in recognition accuracy up to 25%. Most of all this was reflected in tables, especially complex ones, with colored cells, hidden dividers, etc.
2005 | 8.0 | Further optimization of recognition algorithms, primarily aimed at working not with document scans, but with digital photographs. For this, additional functions for preparing originals (elimination of distortions, alignment of lines, etc.) have appeared.
2007 | 9.0 | The emergence of ADRT technology, which takes into account the logical structure of the entire processed (multi-page) document and is able to highlight repeating elements (headers and footers), connect "flowing" objects (tables), etc.
2009 | 10.0 | Further improvement of ADRT and recognition algorithms, increasing the accuracy of processing originals with low resolution up to 30%.
2011 | 11.0 | The main attention is paid to the speed of the program. "Second coming" of black and white mode, which on originals of good quality gives an additional acceleration up to 30%.

Naturally, over the same period FineReader has also expanded its support for document formats, improved its built-in tools and interface, got better at reconstructing the structure of originals, and so on. However, the points highlighted above relate directly to OCR technology and illustrate well the step-wise development typical of complex, science-intensive systems, where each "breakthrough" is followed by a "lull" needed to refine the new algorithms. These algorithms are the main asset of any OCR program, which is why detailed information about them rarely reaches users. However, ABBYY has kindly agreed to lift the veil of secrecy, and today we have the opportunity to look into FineReader's inner sanctum.

Basic principles

So, since OCR belongs to the field of AI, it is quite logical that developers try to imitate the activity of our brain at least to some extent. Of course, the structure of our visual system is incredibly complex, but the basic "large-block" principles of its functioning have been studied well enough. Usually three of them are distinguished:

  1. Integrity - an object is considered as a set of its parts and (for visual images) spatial relationships between them. In turn, the parts are interpreted only as part of the entire object. This principle helps to build and refine hypotheses, quickly rejecting the unlikely ones.
  2. Purposefulness - since any interpretation of data pursues a specific goal, recognition is a process of putting forward hypotheses about an object and purposefully testing them. A system that works according to this principle not only uses computing power more economically but is also less likely to make mistakes.
  3. Adaptability - the system saves the information accumulated in the process of work and uses it repeatedly, that is, it learns itself. This principle allows you to create and accumulate new knowledge and avoid the repeated solution of the same problems.

FineReader is the only OCR system in the world that operates in accordance with the above principles at all stages of document processing. The corresponding technology is called IPA, after the first letters of the English terms (Integrity, Purposefulness, Adaptability). For example, according to the principle of integrity, a fragment of an image is interpreted as a character only if it contains all the structural parts of such a character and those parts stand in the proper relationships to each other. This makes it possible to replace the enumeration of a large number of templates (in search of a more or less suitable one) with a purposeful test of a reasonable number of hypotheses, relying on previously accumulated information about the character styles that may occur in the document being recognized.

However, the principles of IPA apply not only to the fragments corresponding to (presumably) individual characters, but also to the entire image of the original page. Most OCR systems rely on recognizing the hierarchical structure of the document: the page is divided into basic structural elements, such as tables, images and blocks of text, which in turn are divided into other characteristic objects (cells, paragraphs) and so on, down to individual characters.

Such an analysis can be carried out in two main ways: top-down, that is, from the constituent elements to individual symbols, or, conversely, bottom-up. One of them is most often used, but ABBYY has developed a special algorithm MDA (multilevel document analysis) that combines both. In short, it looks like this: the structure of the page is analyzed by the top-down method, and the reconstruction of the electronic document at the end of recognition occurs from the bottom-up, however, at all levels, an additional feedback mechanism operates. As a result, the probability of gross errors associated with incorrect recognition of high-level objects is sharply reduced.

ADRT

Historically, OCR systems have evolved from recognizing individual characters. This task is still the most important and most difficult one; the most complex algorithms are associated with it. However, it soon became clear that higher-level information (for example, about the document language and the correct spelling of recognized words) could help in solving it - this is how context and dictionary checks appeared. Then the desire to preserve the formatting and recreate the physical structure (that is, the relative position of various objects) of the document led to the need for detailed analysis of the entire page. It is clear that this also noticeably affects the overall quality of recognition, since it helps to correctly process multi-column layout, tables and other techniques of "non-linear" text layout.

Most modern OCRs operate precisely on these three levels - characters, words, pages - practicing, as already mentioned, top-down or bottom-up approaches. However, ABBYY, in accordance with the principles of IPA, introduced another level into FineReader - the entire multipage document. First of all, this was necessary for the correct reproduction of the logical structure, which is becoming more and more complicated in modern documents. But there are additional bonuses: increased accuracy and faster processing of repeating objects, more correct identification (and therefore recognition) of objects "flowing" from page to page.

This is exactly what ADRT (Adaptive Document Recognition Technology) was developed for: it is a technology for analyzing and synthesizing a document at the logical level. Ultimately, it helps make the result of FineReader's work as similar as possible to the original. To do this, the image of the entire document is analyzed, and the recognized words are combined into groups (clusters) depending on their style, surroundings and location on the page. Thus the program sees, as it were, the "logic" of the document's markup and can then unify the formatting of the result.

Thanks to ADRT, FineReader, starting from version 9.0, has learned to detect, recognize and reproduce the following structural parts and elements of document formatting:

  • main text;
  • headers and footers;
  • page numbers;
  • headings of the same level;
  • table of contents;
  • text inserts;
  • figure captions;
  • tables;
  • footnotes;
  • signature / seal areas;
  • fonts and styles.
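To give a feel for why whole-document analysis helps with repeating elements, here is a deliberately naive Python sketch of spotting a running header: a line is treated as a header if the same text occupies the same position on most pages. This is only an illustration under simple assumptions, not ABBYY's ADRT algorithm; the function name, the `min_share` parameter and the sample pages are all made up.

```python
from collections import Counter


def find_running_lines(pages, position=0, min_share=0.6):
    """Return texts that occupy the same line position on at least min_share of pages."""
    counts = Counter(page[position] for page in pages if page)
    needed = min_share * len(pages)
    return {text for text, n in counts.items() if n >= needed}


# Hypothetical recognized pages, each given as a list of text lines
pages = [
    ["Annual Report 2011", "Revenue grew in all regions.", "1"],
    ["Annual Report 2011", "Costs remained stable.", "2"],
    ["Annual Report 2011", "Outlook for next year.", "3"],
]

headers = find_running_lines(pages, position=0)    # first line of every page
footers = find_running_lines(pages, position=-1)   # last line of every page
```

Note that the last lines here are page numbers, which differ from page to page, so exact text matching does not flag them; recognizing them as a page-number sequence needs a separate rule. A page-by-page recognizer has no access to this kind of evidence at all.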

Recognition process

According to the MDA algorithm, the actual recognition proceeds from top to bottom, starting at the page level. Clearly, the more wrong decisions are made in the early stages of this process, the more there will be in the later ones. That is why recognition accuracy depends so much on the quality of the originals, and why the algorithms that preprocess them can be essential. Thus, as color documents grew in popularity, FineReader introduced adaptive binarization (AB). If a document with watermarks, or with text on a textured or colored background, is scanned directly in black and white mode, "garbage" will inevitably appear in the image, and it will then be quite difficult to separate from the "useful" image (since the original information about it is already lost). That is why FineReader prefers to work with color or grayscale images, converting them to black and white itself (this process is called binarization). But that's not all. Since the colors of the text and background can differ within a page and even within individual lines, AB picks out words with more or less the same characteristics and selects, for each group, the binarization parameters that are optimal in terms of recognition quality. This is precisely what makes the algorithm adaptive, and it is thus an example of the use of feedback in MDA. Clearly, the effectiveness of AB depends strongly on the design of the source documents; on the ABBYY test base, this algorithm increased recognition accuracy by 14.5%.
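The difference between a single page-wide threshold and locally chosen thresholds can be shown with a small Python sketch. It is a simplification under stated assumptions: thresholds are chosen per fixed-size tile rather than per group of similar words as the text describes, and the function names, tile size, contrast limit and sample "page" are all illustrative.

```python
def binarize_global(img, threshold=128):
    """One threshold for the whole page: 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in img]


def binarize_adaptive(img, tile=4, min_contrast=30):
    """Pick a separate threshold for every tile x tile block of the page."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y0 in range(0, h, tile):
        for x0 in range(0, w, tile):
            ys = range(y0, min(y0 + tile, h))
            xs = range(x0, min(x0 + tile, w))
            block = [img[y][x] for y in ys for x in xs]
            lo, hi = min(block), max(block)
            if hi - lo < min_contrast:
                continue  # flat block: treat it as pure background
            local_threshold = (lo + hi) / 2
            for y in ys:
                for x in xs:
                    out[y][x] = 1 if img[y][x] < local_threshold else 0
    return out


# Hypothetical 4x8 grayscale page: a gray panel on the left, a light panel on the right
page = [[90] * 4 + [230] * 4 for _ in range(4)]
page[1][1] = 30    # dark stroke on the gray panel
page[1][5] = 160   # pale stroke on the light panel

global_result = binarize_global(page)
adaptive_result = binarize_adaptive(page)
```

With the global threshold the whole gray panel turns into "ink" and the pale stroke on the light panel disappears, while the per-tile version keeps exactly the two strokes. This is the kind of "garbage" versus "useful image" separation the paragraph above refers to.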

But the most interesting part, of course, begins when the recognition process descends to the lowest levels. The so-called linear division procedure breaks lines into words and words into individual letters; then, in accordance with the IPA principles, it forms a set of hypotheses (that is, possible answers to what a given character is, into which characters a word is split, and so on) and, having assigned each a probability estimate, passes them to the input of the character recognition mechanism. The latter consists of a number of so-called classifiers, each of which also produces a series of hypotheses ranked by their estimated probability. The most important characteristic of any classifier is the average position of the correct hypothesis in this ranking. Clearly, the closer it is to the top, the less work remains for subsequent algorithms, for example the dictionary check. For classifiers that already work well enough, however, other characteristics are usually measured: recognition accuracy over the first three hypotheses or over the first one alone, that is, roughly speaking, the ability to guess the correct answer in three tries or in one. ABBYY uses the following types of classifiers in its systems: raster, feature, feature differential, contour, structural and structural differential, which are grouped into two logical levels.
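The three characteristics just mentioned are easy to compute once a classifier's ranked output is available. A small Python sketch follows; the function name and the sample hypotheses are illustrative assumptions, not ABBYY test data.

```python
def rank_metrics(ranked_hypotheses, truths):
    """Return (top-1 accuracy, top-3 accuracy, mean rank of the correct answer)."""
    ranks = []
    for hypotheses, truth in zip(ranked_hypotheses, truths):
        if truth in hypotheses:
            rank = hypotheses.index(truth) + 1   # 1-based position in the list
        else:
            rank = len(hypotheses) + 1           # a miss counts as one past the end
        ranks.append(rank)
    n = len(ranks)
    top1 = sum(r == 1 for r in ranks) / n
    top3 = sum(r <= 3 for r in ranks) / n
    return top1, top3, sum(ranks) / n


# Hypothetical ranked outputs for four character images and their true labels
outputs = [["o", "0", "c"], ["l", "1", "i"], ["m", "rn", "n"], ["c", "e", "o"]]
truths = ["o", "1", "rn", "o"]

top1, top3, mean_rank = rank_metrics(outputs, truths)
```

In this made-up sample the correct answer is always somewhere in the first three hypotheses but comes first only once, which is exactly the situation where later stages such as the dictionary check still have work to do.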

The raster classifier (RC) works by comparing the character image pixel by pixel with reference templates. The templates are produced by averaging images from the training sample and are reduced to a standard form; accordingly, the size, stroke thickness and slant of the image being recognized are normalized first as well. This classifier is simple to implement, fast and tolerant of image defects, but its accuracy is relatively low, which is why it is used at the first stage, to quickly generate a list of hypotheses.
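The idea of pixel-by-pixel comparison with templates can be sketched in a few lines of Python. This is a toy illustration only: the 3x3 binary templates are invented, normalization is assumed to have been done already, and the Hamming distance stands in for whatever similarity measure a real system uses.

```python
def hamming(a, b):
    """Number of pixels in which two equally sized binary bitmaps differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def raster_rank(glyph, templates):
    """Rank candidate labels from the most to the least similar template."""
    return sorted(templates, key=lambda label: hamming(glyph, templates[label]))


# Hypothetical 3x3 templates; each string is one row of pixels
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}

noisy_t = ["111", "010", "011"]   # a "T" with one stray pixel
ranking = raster_rank(noisy_t, TEMPLATES)
```

The stray pixel changes the distance to every template only slightly, so "T" still comes out on top. This illustrates the tolerance to image defects noted above, while the coarse comparison also explains the modest accuracy.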

The feature classifier (FC), as its name implies, relies on the presence in the image of features typical of a particular character. If there are N such features in total, each hypothesis can be represented as a point in N-dimensional space; the accuracy of the hypothesis is then estimated by the distance from that point to the point corresponding to the reference (which is also accumulated from the training sample). Clearly, the types and number of features largely determine the quality of recognition, so there are usually a great many of them. This classifier is also relatively fast and simple, but not very resistant to image defects. In addition, the FC works not with the original image but with a model of it, an abstraction, so it ignores some of the information: for example, the mere presence of certain important elements says nothing about their mutual arrangement. For this reason, the FC is used not instead of the raster classifier but together with it.
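The "distance to a reference point in N-dimensional space" can be illustrated with a nearest-prototype sketch in Python. The three features and the prototype values below are hypothetical, chosen only to make the example readable; a real feature set would be much larger.

```python
import math

# Hypothetical features: (closed loops, vertical strokes, horizontal strokes)
PROTOTYPES = {
    "B": (2.0, 1.0, 3.0),
    "D": (1.0, 1.0, 2.0),
    "E": (0.0, 1.0, 3.0),
}


def feature_rank(features, prototypes):
    """Rank labels by Euclidean distance between the observed point and each prototype."""
    scored = [(math.dist(features, proto), label) for label, proto in prototypes.items()]
    return [label for _, label in sorted(scored)]


observed = (1.2, 1.0, 2.2)   # feature vector measured on the character image
ranking = feature_rank(observed, PROTOTYPES)
```

The sketch also shows the limitation described above: the vector records how many loops and strokes were found, but not how they are arranged relative to one another.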

The contour classifier (CC) is a special case of the feature classifier: it analyzes the contours of the presumed character extracted from the original image. In general, its accuracy is lower than that of the full feature classifier.

The feature differential classifier (FDC) is also similar to the feature classifier, but is used solely to distinguish similar objects such as "m" and "rn". Accordingly, it analyzes only the areas where the differences lie, and its input includes not only the original images but also the hypotheses formed at the earlier stages of recognition. Its principle of operation, however, is somewhat different. During training, two "clouds" (groups of points) of possible values are formed in N-dimensional space, one for each of the two options; then a hyperplane is constructed that separates the "clouds" from each other and is approximately equidistant from them. The recognition result depends on which half-space the point corresponding to the original image falls into.
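A hyperplane between two "clouds" can be sketched in Python as follows. The simplest construction is assumed here: the plane is perpendicular to the line joining the two cloud centroids and passes through its midpoint. Real systems may fit the separating plane differently; the two features and the training points are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a list of equally sized points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def train_pair(cloud_a, cloud_b):
    """Return (w, bias) of the plane w.x + bias = 0 halfway between the centroids."""
    ca, cb = centroid(cloud_a), centroid(cloud_b)
    w = [a - b for a, b in zip(ca, cb)]              # normal points from B toward A
    mid = [(a + b) / 2 for a, b in zip(ca, cb)]
    bias = -sum(wi * mi for wi, mi in zip(w, mid))   # plane passes through the midpoint
    return w, bias


def decide(point, plane, label_a, label_b):
    """Choose the label according to the half-space the point falls into."""
    w, bias = plane
    side = sum(wi * xi for wi, xi in zip(w, point)) + bias
    return label_a if side > 0 else label_b


# Hypothetical features: (gap between strokes, relative height of the middle stroke)
cloud_m = [(0.10, 1.0), (0.20, 0.9), (0.15, 1.1)]
cloud_rn = [(0.80, 0.6), (0.90, 0.7), (0.85, 0.5)]

plane = train_pair(cloud_m, cloud_rn)
first = decide((0.20, 1.00), plane, "m", "rn")
second = decide((0.80, 0.65), plane, "m", "rn")
```

Because such a classifier only answers "which of these two?", it cannot propose new hypotheses on its own; it can only reorder a pair that earlier stages have already produced.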

By itself, the feature differential classifier does not put forward hypotheses; it only refines the existing ones (the list of which is, generally speaking, sorted by the bubble method). For this reason its effectiveness is not assessed directly but is equated with the characteristics of the entire first level of OCR recognition. It is clear, however, that its effectiveness depends on how well the features are chosen and how representative the sample of references is, and providing such a sample is a rather laborious task.

The structural differential classifier (SDC) was originally used to process handwritten texts. Its task is to distinguish between such similar objects as "C" and "G". The SDC relies on features specific to each pair of characters, so its training process is even more complicated than that of the feature differential classifier, and it is slower than all the previous classifiers.

The structural classifier (SC) is a source of pride for ABBYY. It was originally developed to recognize so-called hand-printed text, that is, text a person writes in "block" letters, but was later applied to printed text as well. It is used at the final stages of recognition and comes into play quite rarely, namely only when at least two hypotheses with sufficiently high probabilities reach it.

The qualitative characteristics of all classifiers are summarized in the following table. However, they only allow us to evaluate the effectiveness of the algorithms relative to each other, since they are not absolute but were obtained by processing a specific test sample. One might get the impression that at the last stages of recognition the struggle is literally for fractions of a percent, but in fact each classifier makes a significant contribution to improving the recognition accuracy: for example, the SC reduces the number of errors by a noticeable 20%.

Classifier                                RK      PC      QC      MPC*    SDC**   SC**
Accuracy for the first three options, %   99.29   99.81   99.30   99.87   99.88   -
Accuracy for the first option, %          97.57   99.13   95.10   99.26   99.69   99.73

* assessment of the entire first level of ABBYY OCR-algorithm
** estimate for the whole algorithm after adding the appropriate classifier

It is curious, however, that, despite the rather high accuracy, the recognition algorithm itself does not make a final decision. In accordance with the MDA principle, hypotheses are put forward at every logical level, and their number can grow exponentially. Accordingly, sequential testing of all hypotheses is unlikely to be effective, and therefore ABBYY OCR systems use the method of structuring hypotheses, i.e. referring them to certain models. There are a couple of dozen of the latter, here are just a few of their types: dictionary word, non-dictionary word, Arabic numerals, Roman numerals, URL, regular expression - and each one can include many specific models (for example, a word in one of the well-known languages, Latin, Cyrillic etc.).
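As a rough illustration of assigning hypotheses to models, the Python sketch below tests a candidate string against a few of the model types named above. The regular expressions are assumptions made for the example; the source does not describe how the real models are defined.

```python
import re

# Illustrative patterns only; the real models are not described in detail.
MODELS = [
    ("Arabic numerals", re.compile(r"\d+")),
    ("Roman numerals", re.compile(r"[IVXLCDM]+")),
    ("URL", re.compile(r"(https?://|www\.)\S+")),
    ("word", re.compile(r"[A-Za-z]+")),
]


def matching_models(hypothesis):
    """Return the names of all models the whole hypothesis string fits."""
    return [name for name, pattern in MODELS if pattern.fullmatch(hypothesis)]


for text in ["2014", "XII", "www.abbyy.com", "turn", "t~rn"]:
    print(text, "->", matching_models(text) or ["no model"])
```

Note that "XII" fits both the Roman-numeral and the word pattern: one hypothesis can belong to several models, which is why further checks are still needed to choose between them.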

All final actions are performed on hypotheses grouped by model. For example, the context check determines the language of the document and immediately reduces the likelihood of models that use the wrong alphabets, and the dictionary check compensates for errors in the uncertain recognition of some characters: for example, the word "turn" is present in the English dictionary, unlike "tum" (in any case, it is not among the common words). Although the dictionary has a higher priority than any classifier, it is not necessarily the final authority and, in general, does not stop further checks: firstly, as mentioned above, there is a model for non-dictionary words, and secondly, the special organization of the dictionaries makes it possible to guess with high probability whether an unknown word could belong to a particular language. Nevertheless, the dictionary check (and the completeness of the dictionaries) has a significant impact on the recognition result: in ABBYY's own tests it practically halves the number of errors.
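The "turn" versus "tum" case can be sketched in a few lines of Python. This is only an illustration of the principle: the tiny dictionary, the confidence values and the size of the dictionary bonus are all assumptions, not values taken from FineReader.

```python
# Tiny stand-in dictionary; a real one would hold a full word list.
DICTIONARY = {"turn", "the", "page"}
DICTIONARY_BONUS = 0.3  # assumed weight, chosen only for illustration


def rescore(hypotheses):
    """hypotheses: list of (word, classifier_confidence) pairs.
    Dictionary words get a bonus; non-dictionary words stay in the list,
    mirroring the 'non-dictionary word' model."""
    scored = [
        (word, conf + (DICTIONARY_BONUS if word.lower() in DICTIONARY else 0.0))
        for word, conf in hypotheses
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# The classifier slightly prefers "tum", but the dictionary tips the balance.
ranked = rescore([("tum", 0.55), ("turn", 0.45)])
print(ranked[0][0])
```

Because non-dictionary words are only penalized relative to dictionary ones rather than discarded, an unknown term can still win when no dictionary candidate competes with it.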

Not only OCR

Printed documents are far from the only ones of interest from the point of view of their digitization and automatic processing. Quite often you have to work with forms, that is, documents with predefined and fixed fields that are filled in manually, but relatively neatly (so-called hand-printed characters) - an example is various questionnaires. Their processing technology has a separate name - ICR (intelligent character recognition) - and quite significantly differs from OCR. So, since in this case the task is not to recreate the entire document, but to extract specific data from it, it splits into two main subtasks: finding the required fields and actually recognizing their content.

This is a rather specific area, and ABBYY offers a completely separate software product for it, ABBYY FlexiCapture. It is designed for building automated and semi-automated systems, assumes customization for specific types of documents (for which special templates are created), and is able to intelligently find various fields on pages, verify the data in them, and so on. At its core, though, it relies on character recognition algorithms similar to those used in FineReader, and the general scheme is very similar.

However, there is still an important difference: the structural classifier is an obligatory participant in the process - this is due to the specifics of hand-printed characters. In addition, ICR involves a large number of specific additional checks: for example, whether a character is strikethrough, or whether the recognized characters actually form a date.
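A check such as "do the recognized characters actually form a date" can be sketched with the Python standard library. The day.month.year format is an assumption for the example; a real form template would define its own field formats.

```python
from datetime import datetime


def is_valid_date(text, fmt="%d.%m.%Y"):
    """Return True if the recognized string parses as a real calendar date."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


print(is_valid_date("29.02.2012"))  # leap day in a leap year
print(is_valid_date("31.02.2014"))  # February has no 31st day
print(is_valid_date("O1.03.2014"))  # letter "O" misread in place of zero
```

A field that fails such a check can be sent back for another recognition attempt or flagged for manual verification.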

Digitizing text is a fairly common task for those who work with documents. Abbyy Finereader will help you save a lot of time by automatically converting inscriptions from bitmap pictures or "readers" into editable text.

In this article, we'll look at how to use Abbyy Finereader for OCR.

How to recognize text from a picture using Abbyy Finereader

To recognize text in a bitmap image, you just need to load the image into the program, and Abbyy Finereader will recognize the text automatically. All that remains is to edit the result, select the fragment you need, and save it in the required format or copy it into a text editor.

The text can be recognized directly from the connected scanner.


How to create PDF and FB2 document with Abbyy Finereader

Abbyy Finereader allows you to convert images to universal PDF and FB2 formats for reading on e-books and tablets.

The process for creating such documents is similar.

1. In the main menu of the program, select the E-Book section and press FB2. Select the type of original document - scan, document or photo.

2. Find and open the required document. It will be loaded into the program page by page (this may take some time).

3. When the recognition process is completed, the program will offer to select a format to save. We choose FB2. If necessary, go to "Options" and enter additional information (author, title, keywords, description).

After saving, you can stay in text editing mode and convert it to Word or PDF format.

Features of text editing in Abbyy Finereader

There are several options for text that Abbyy Finereader recognizes.

Save the pictures and headers and footers in the original document so that they can be transferred to the new document.

Analyze the document to see what errors and problems may arise during the conversion process.

Edit the page image. Available options for cropping, photo correction, changing the resolution.

So we told you how to use Abbyy Finereader. It has a fairly wide range of editing and converting capabilities. Let this program help you create any documents you need.

One of the most popular programs for scanning and processing files of various types is Fine Reader. The product was developed by the Russian company ABBYY; it allows not only recognizing documents but also processing them (translating, changing formats, etc.). Many users manage to install it but cannot figure out how to use ABBYY FineReader. You can find answers to many questions in this article.

The program allows you to scan and recognize text - and not only

To understand what kind of program ABBYY FineReader 12 is, you need to look at all of its features in detail. The first and simplest function is scanning a document. There are two options for scanning: with and without recognition. With a normal scan of a printed sheet, you receive the scanned image in the folder you specified on your computer.

ATTENTION. The sheet must be laid straight on the scanning surface of the device, along the guide lines marked on it. Do not let the original wrinkle, as this can lead to a poor-quality scan.

You must decide for yourself what you need FineReader for, since the utility has extensive functionality. For example, you can choose the color mode of the resulting image, and all photos can be converted to black and white. Recognition is faster in black and white, and the processing quality increases.

If you are interested in ABBYY FineReader's OCR function, you need to press a special button before scanning. In this case, there are several options for obtaining information. By default, a recognized piece of the sheet will appear on your screen, which you can copy or edit manually.

If you choose other functions, you can immediately get the file as a Word document or Excel spreadsheet. The selection of functions is very simple, the menu is intuitive, easy to configure due to the fact that all the buttons you need are in front of your eyes.

IMPORTANT. Before recognizing text in ABBYY FineReader, you need to select the recognition language correctly. Although the utility can work fully automatically, a low-quality original sometimes prevents it from determining which language the source is in. This greatly reduces the quality of the final result.

Several operating modes

To fully understand how to use ABBYY FineReader 12, you need to try its two modes of operation, "Careful recognition" and "Fast recognition". The second mode is suitable for high-quality images, while the first is suitable for low-quality files. Careful mode takes 3-5 times longer to process files.

The illustration shows the result of the program - OCR from an image

What other features are there?

OCR in ABBYY FineReader is not the only useful feature. For greater convenience of users, it is possible to translate the document into the formats required by the user (pdf, doc, xls, etc.).

Change text

To change the text in Fine Reader, open the "Tools" - "Check" tab. A window will open that lets you edit the font, change symbols, colors, etc. If you are editing an image, open the "Image Editor"; it closely resembles a simple paint program and allows you to make minimal edits.

ATTENTION. If you still could not figure out how to use ABBYY FineReader productively, you can read the Help section, which can be found in the application window, in the About tab.

Now you know what FineReader is for and you can use it correctly in your home or office. The functionality of the application is huge, use it and you can be convinced of the indispensability of this software product when processing documents and files during office work.

Image recognition work consists of the following stages:

  1. Receive scanned images (scans).
  2. Open them in an OCR program (FineReader).
  3. Mark up the page into blocks, that is, split the page into areas, each of which contains either text, or pictures, or tables, or other homogeneous content.
  4. Run the recognition itself.
  5. Proofread the recognized text, checking it against the original scans.
  6. Save the results in one of the document formats (DOC, RTF, PDF, HTML, etc.).

When recognizing texts, there are two options: either you scan the material yourself, or work with already scanned text.

In the first case, the “Receive scans” and “Open” stages are combined into one: FineReader immediately opens the received scans in its package. In the second case, the “Receive scans” stage has already been completed, and you just need to open the images in the program.

Let's consider both options in turn.

Scan text into FineReader

Scanning is started via "File → Scan Pages" or the "Scan" menu button, or Ctrl-K.

Figure: 1 Scan interface

However, before you start scanning, it is worth figuring out how to get the best scans for recognition, and for that, understanding how a “good” scan (from FineReader's point of view) differs from a “not very good” one.

The program requires three things for high-quality recognition. First, the ability to reliably distinguish text and illustrations from the background of the page. Second, letters, numbers and other content must be clear and legible, so that there are no situations where "even the human eye cannot always make out what is printed here". Third, the lines of text on the scan should run exactly as they are printed on the page of the book, without skew or distortion. There are other requirements for a high-quality scan, but these can be considered the key ones.

1. To reliably distinguish "here is the text, and here is the background of the page", the transition between the two must be sharp, not blurry. Here are examples of pages with poor and good legibility. The first, of course, will be recognized worse, with a large number of errors.


Figure: 2. Blurred boundaries of letters



Figure: 3. Clear boundaries of letters

A common cause of blurry boundaries between text and background is scanning out of focus. Therefore, before starting work, it is advisable to check your scanner for this.

Another reason that can interfere with distinguishing between text and background is that the background of the page is too "dense". Normally, it should be either pure white or white with a slight admixture of some color. If you scan old books, where the paper is often yellowed, then the background may also be yellowish (but moderately).

If the background looks noticeably shaded, then such pages will again be worse recognized.

The appearance of the background depends on the set scan brightness. It can be adjusted through the "Brightness" slider. To begin with, it makes sense to put 50%, check what will happen in this case, correct if necessary.

2. The legibility of text is mainly dependent on brightness and scan resolution.

If the brightness is too high, the lines of the letters will be torn, they will sort of crumble into separate pieces. If the brightness is low, then the details of the letters begin to merge with each other, formless spots appear. Both are not very edible "food" for recognition programs.

The brightness here is adjusted in the same way as in the previous case - we set 50% in the scanning interface for a start, and then according to the situation.


Figure: 4. Page with too much brightness



Figure: 5. Page with too low brightness (overshadowed page background)



Figure: 6. And here is the same page, but in its normal form

The scan resolution determines how many pixels in the scan will be for each letter. If these pixels are enough to draw the outline of the letter, then there will be no problems with recognition. If not enough, the letters can become difficult to distinguish even for the human eye, let alone recognition programs.


Figure: 8. The same, but at 200 dpi



Figure: 9. The same, but at 400 dpi

When choosing a resolution, the following rules are usually followed:

  • 300 dpi is chosen for mass-market books (pages filled with text of the usual size, almost without pictures);
  • 400 dpi is chosen for books and magazines with a noticeable amount of small-size text (notes, figure captions, tables, small text boxes);
  • 600 dpi is chosen for books printed in very small type (many reference books and encyclopedias, miniature books) or with finely detailed drawings, for example engravings. This also covers many books published in the 1990s, when publishers saved on paper and often printed in very tiny letters.
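The reasoning behind these rules can be made concrete with a little arithmetic. The Python sketch below estimates how many pixels the scanner gives each letter, using the standard definition of a typographic point (1/72 inch) and treating the resolution values as dots per inch. The font sizes of 10 and 6 points are hypothetical examples of ordinary body text and small print.

```python
def glyph_height_px(font_size_pt, dpi):
    """Approximate height in pixels of a font's em box at a given scan
    resolution (1 typographic point = 1/72 inch)."""
    return font_size_pt / 72 * dpi


for size_pt in (10, 6):  # hypothetical body text and small print
    for dpi in (200, 300, 400, 600):
        print(f"{size_pt} pt at {dpi} dpi: {glyph_height_px(size_pt, dpi):.1f} px")
```

Small print scanned at a higher resolution ends up with roughly as many pixels per letter as ordinary text at a lower one, which is why the recommended resolution rises as the type gets smaller.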

FineReader's scanning interface lets you select only 300 or 600 dpi (the "Resolution" line). Therefore, if you have a lot of material that is best scanned at 400 dpi, it is better to scan not from FineReader but from the program that comes with the scanner.

Or, in the FineReader settings, switch from the program's own interface to the TWAIN interface of your scanner (“Tools → Options”, the “Scan / Open” tab, then click “Use the scanner interface” at the bottom). You can then still scan from FineReader, but you will work in the scanner's interface (which usually has more settings and functions).

3. Smooth, neat looking lines of text are mainly provided by image preprocessing (“pre-” in this case means “performed after scanning, but before recognition”). After the correct preprocessing, the content of the pages will be recognized with a higher quality.

For this, FineReader has a fairly rich set of functions, which can be seen in the program's settings, on the "Scan / Open" tab. Also, this window can be called through the "Settings" button in the window of the scanning interface.


Figure: 10. Preprocessing settings

“Divide the spread of the book” should be selected when the book was scanned not page by page but in spreads. The spreads will then be split into separate pages for recognition.

"Determine page orientation" is used when the book was scanned sideways. The pages will then be rotated to their normal position. But if the book contains pages that are printed rotated 90 degrees relative to the rest, it is better to uncheck this box. Otherwise, when you output the recognized text to PDF, some pages may come out in "portrait" orientation and some in "landscape". In this case, it is better to rotate the required pages manually in the built-in image editor.

"Correct Skews" straightens skewed pages. The setting is definitely necessary, but keep in mind that a PDF of the "Text under the page image" type obtained from such scans will not look very neat: grayish wedges appear at the edges of the page (where the rotation was made).

"Correct Line Distortion" straightens the curved lines of text (also known as "whiskers") that often appear near the binding during scanning.


Figure: 11. Example of a page with line bends

"Correct Keystone" corrects the page distortion that occurs when a book is not pressed firmly enough against the scanner glass.

"Invert Images" is necessary if the scanned material contains a lot of "light letters on a dark background" text and you want to convert it to regular "dark letters on a light background".

"Remove colored elements" is useful if you need to remove various unwanted marks from a "black letters on a white background" page, such as pen marks in the margins, signatures and seals (office documents), or even just stains. But if the same page contains colored elements that you do need (graphs, diagrams or photographs), do not check this box; otherwise they will be deleted too.

"Correct image resolution" - a point requiring a more detailed explanation than the previous ones. The fact is that the recognition process in FineReader is very sensitive to the resolution set in the properties of a given image. It essentially depends on how accurately the point sizes of the letters of the text, letter and line spacing, and so on will be determined. Therefore, a check mark is necessary here. In addition, you should not be surprised if during the recognition process you constantly receive FineReader messages “on page such and such, the resolution is incorrectly set and it would be good to fix it”.

In addition to preprocessing settings, the Scan / Open tab contains the General settings block. This sets a set of basic actions that will be performed on the opened pages. Options for such actions can be as follows:

  1. just open scanned images without doing anything with them. To do this, uncheck the "Automatically process added pages" checkbox.
    This makes sense only if your scans are of such high quality that nothing can really improve them. You can immediately send for recognition. Of course, this happens, but much less often than we would like :-), so it is better to leave a tick.
  2. open images, do preprocessing, but don't do anything else until your command. To do this, select the "Image preprocessing" item.
    This is usually done if you do not need to start recognition immediately, but first see what happened as a result of preprocessing, how well it worked for a given set of images.
  3. open images, perform preprocessing, markup into blocks, do not start recognition yet. To do this, select the item "Image analysis (including preprocessing)".
    The most frequently chosen option. Your scans are of quite decent quality, you can well imagine what preprocessing will do with them, there is no need to check after it. So we combine the three stages of working with images described above into one and start looking at how well the markup is done.
  4. all stages of recognition are carried out automatically, without any intermediate control. You immediately get the finished result and start to read it. To do this, select the item "Image recognition (including preprocessing)". It makes sense to do this only if you have good quality scans and with a very simple appearance - for example, solid text in one language and nothing more. In all other cases, it is better to choose option 2 or 3. Especially if you have pages with complex formatting, tables, charts, pictures, etc.


Figure: 12. An example of a page with a complex layout



Figure: 13. An example of a page with a complex layout

Open images in FineReader

This is the second option for working with images: do not scan them yourself, but get them ready-made and open them in FineReader. It is done through the "Open" button in the main window menu or through "File → Open PDF or Image", or through Ctrl-O.


Figure: 14. Window "Open image"

In the Explorer window that opens, select images, specify the necessary settings (the "Settings" button) and click "Open". The settings here are the same as described for scanning, you need to work with them in the same way.

When the pages are opened in FineReader, the default package is created unnamed ("Untitled Document") and stored in the TMP folder, only within the current session. In order not to accidentally lose the results of your work, it is recommended to save the package under some permanent name immediately after creation ("File → Save FineReader Document").

Layout of pages into blocks

After you have opened the scans, you need to mark up the pages into blocks. This is done through "Document → Document Analysis" or through Ctrl-Shift-E.

Layout has two main working goals.

The first is to separate what is text on the page from what is not. "Text" in this case is everything that FineReader is able to recognize. Accordingly, everything that it is unable to recognize is considered "non-text". Basically, this is the illustrative part of the page: pictures, drawings, graphs, diagrams and the like. From this point of view, formulas and handwritten notes are also considered non-text, since FineReader is not yet able to recognize them. This means that during markup they must be marked as "picture".

The second is to categorize what is text: plain text, tables, notes (footnotes), headers and footers, tables of contents, and the like. Then, when you read the recognized text in a text editor, all these elements will look exactly the way you are used to (they will be formatted accordingly).

A marked-up page might look something like this:


Figure: 15. "Image" window with a markup page

Now you need to look at the markup made by the program on each of the pages and, if necessary, correct it.

Marking errors are usually of the following types.

1. Some part of the page content (text, image, etc.) is highlighted correctly in terms of the area boundaries, but the wrong content is assigned to it. For example, a piece of text is marked up as a picture or vice versa.

In this case, click on the area, open the context menu, select “Change area type”, and in the submenu that opens select the required type (“Text”, “Table”, “Picture”, “Background image”, “Barcode”).


Figure: 16. Context menu "Change area type"

You can quickly see which area is of which type by the color of its frame. "Text" is highlighted with dark green frames, "Table" - blue, "Picture" - light red, "Background picture" - dark red, "Barcode" - light green.

2. The area is selected correctly in terms of content, but not in terms of size (borders): not everything that should have been included was selected, or, on the contrary, a piece of an adjacent area with different content was included.


Figure: 17. Page with incorrect markup

The captions surrounding the picture (which should be marked as “text”) are attached to the upper “picture” area.

Part of the image was left out of the lower "picture" area during markup.

To fix this, you must first click in the "Image" window on the "Arrow" button.

Then click on each incorrectly marked area and move its borders, in much the same way as you resize program windows.

3. Some part of the page content was skipped by the markup, did not fall into any of the created areas.


Figure: 18. The formula dropped out of the markup (did not get into any of the blocks)

Here you will need to create a new area on the page (highlight the missing part of the page with a frame), and then assign the desired type to the created area.

To do this, first click the "Select recognition zone" icon in the "Image" window.

After that, draw a frame around the desired area (as usual in a graphics editor, select a part of the picture) and finally set the type of area. The last operation has already been described in point 1.

If you just need the text of the page as continuous text (which is most often the case), this is quite enough. If you want the design elements of the recognized pages (notes, headers and footers) to appear as real notes, headers and footers in Word, then you need to check this as well.

This is controlled through the context menu. Click on the required "Text" area on the page being checked, select the "Text assignment" item in the context menu, and look in its submenu to see which item is ticked (usually it is "Autodetect"). If the tick is not where it should be, switch to the required element.


Figure: 19. Context menu "Text assignment"

Recognition

After the errors in the markup have been fixed, you can start recognition. This is done via "Document → Recognize Document" or via Ctrl-Shift-R. Before that, do not forget to set the recognition language and configure the necessary settings.

The language is set through the "Document language" box in the button bar of the main program window.


Figure: 20. Language selection via the main menu

Or in the settings ("Tools → Options", the "Document" tab).


Figure: 21. Choosing a language through the FineReader settings

If the language you need is not in the list that opens, then click "Select languages" at the bottom of the list and in the window that opens, check the box next to the language (set of languages) you need. After that, it will be added to the list.

In the recognition settings (“Tools → Options”, the “Recognize” tab), it is better to leave the recognition mode at its default value (“Careful recognition”). It makes sense to set "Fast recognition" only if you have something simple in appearance and with very good scanning quality. For example, a scanned black and white printout of a text document without illustrations.


Figure: 22. Settings, "Recognize" tab

Of the rest of the settings, the group "Definition of structural elements" is of primary importance. Details of page design are listed here: footnotes (notes), headers and footers, lists, tables of contents. When an element is checked, it will be recognized and saved in DOC / RTF / DOCX not just as part of the text on the page, but as a footnote, footer, list or table of contents.

Just don't forget an important point. If you need such elements to be recognized, a check mark in the "Recognize" tab settings alone may not be enough. In addition, at the markup stage you also need to mark the corresponding areas correctly with the "Text assignment" item from the context menu.

Proofreading

Proofreading of recognized text in FineReader can be done in two ways: either with the "Check" function, or in the usual way, by viewing the pages in FineReader's built-in editor. In the latter case, we compare the text with the scan in the "Close-up" window and fix errors wherever we find them.

The "Check" function is launched by the button in the upper right corner of the menu or via Ctrl-F7. Its work is based on the fact that during recognition, FineReader marks characters and words that have been recognized with an insufficiently high level of reliability. That is, the program has some doubts about them "maybe this is really the symbol that is presented to you, but there may be something else." During the check, such questionable places are shown to the user in turn, so that he can correct them if necessary.
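The principle of flagging uncertain characters can be sketched as follows. This is only an illustration: the confidence scale from 0 to 1 and the 0.80 threshold are assumptions, since the source does not say how FineReader measures reliability.

```python
CONFIDENCE_THRESHOLD = 0.80  # assumed cut-off, for illustration only


def uncertain_positions(chars):
    """chars: list of (character, confidence) pairs in reading order.
    Return the indexes of characters a reviewer should be shown."""
    return [i for i, (_, conf) in enumerate(chars) if conf < CONFIDENCE_THRESHOLD]


# Hypothetical recognition output for the fragment "tum 1".
recognized = [("t", 0.99), ("u", 0.97), ("m", 0.62), (" ", 1.0), ("1", 0.75)]
for i in uncertain_positions(recognized):
    ch, conf = recognized[i]
    print(f"position {i}: {ch!r} (confidence {conf:.2f})")
```

Only the flagged positions are presented during the check, so a character recognized wrongly but with high confidence would pass unnoticed.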

The verification window is simple enough. In its upper part, a fragment of the page is shown, which contains the checked character. At the bottom, a line of recognized text with this character is displayed, as well as several buttons for easy editing.


Figure: 23. "Check" window

If everything is in order, the symbol is defined correctly, then click on "Skip". If it is defined incorrectly, then enter the correct value either using the keyboard, or if there is no such value on the keyboard, then using the "Insert Symbol" button (Greek letter "omega"). Then click on "Confirm".

We act in the same way if the symbol is recognized correctly, but its formatting is incorrect. For example, in the text of the book, italics appear in some place, but it was recognized as a regular font. To reformat, use the buttons at the bottom of the window.

But the possibilities of the check window are still quite limited, both in the size of the page fragment that can be shown at the top of the window and in the editing options available here. Therefore, all movements through the text, from one checkpoint to another, are also tracked in the "Text" and "Close-up" windows. While the work is in progress, the cursors in "Text" and "Close-up" move in sync with the position in "Check".

If you suddenly need to see more of the checked page fragment (of its scan) than the few words shown in "Check", you can do this in "Close-up". If editing the current error requires the capabilities of the "Text" editor, you can temporarily switch to it (simply by clicking on its window), do the necessary work and return to "Check" (by clicking on its window). After you return, "Check" will display all the changes you made in "Text".


Figure: 24. An example of work in the simultaneously open windows "Check", "Text" and "Close-up"

If the "Check" window with its limited capabilities is not very convenient for you (you are used to working with all the conveniences of text editors and are not going to change your habits), then you can do this work from the very beginning in the "Text" window.

The places requiring verification are displayed there in full - these are symbols and words highlighted in light blue. The ability to move from error to error without looking at the entire page is also available - the buttons "Next error" and "Previous error" on the button bar on the left side of the window.

Theoretically, as conceived by the creators of FineReader, the "Check" window should be sufficient for full-fledged proofreading of the recognized text. All doubtful places are marked, we move along them, correct mistakes, and at the output we get a completely cleaned up text.

But, as is often the case, theory here is at odds with everyday work practice. In the recognized texts, there are systematically erroneous passages that are not marked as errors. That is, FineReader recognizes some character / word incorrectly, but at the same time with full confidence that it recognized it correctly.

Therefore, for a full-fledged proofreading, the "Check" window alone is usually not enough, especially if the text contains a lot of scientific or technical terms, professional jargon, and similar "non-dictionary" words. We still have to go through the recognized text manually: look through it carefully in the "Text" window and check all the more or less dubious places.

Proofreading text in the "Text" window is not much different from ordinary proofreading work. Adjust the "Text" and "Close-up" windows so that they occupy the largest part of the working window of the program, go to the next checked page, view its text. If you find a dubious or clearly erroneous place, then click on it - while the cursor in the "Close-up" is set exactly in the same place in the original (scan). Compare the original and the recognized one, correct if necessary, move on.


Figure: 25. Proofreading using the "Text" and "Close-up" windows

The functionality of the "Text" window editor is no different from that of any text editor of medium complexity. The buttons in the menu look quite typical, so there should be no problems working with them. If you need to enter a character that is not on the keyboard, then, as in the "Check" window, click the button with the Greek "omega" and select the required character in the table that opens.

Saving results

When the scanned material has been recognized and proofread, it must be saved in one of the document formats - DOC, DOCX, RTF, PDF, HTML, etc. This is done through "File → Save document as → select the desired format" or with the "Save" button in the FineReader main menu.

In the save dialog that opens, select the format, use the "Settings" button to set the saving parameters, and click "OK". If you want to see right away whether the saved text has any noticeable errors in its appearance, also check the "Open document after saving" box. The document will then open immediately in the corresponding editor (browser, viewer).


Figure: 26. Window for saving recognized text

The usual practice is that the scanned text of a book or magazine goes in, and all its pages come out saved in a single file named after that book. This setting, "Create one file for all pages", is the default in the "File options" line. If you are recognizing not one continuous text but a scattering of pages (for example, office documents), set "Save a separate file for each page" here instead.

Save settings in DOC, DOCX, RTF formats


Figure: 27. Settings for saving to DOC / DOCX / RTF

The main thing to choose here is how accurately the saved document will reproduce the original appearance (one of the saving modes in the "Document Formatting" window). All the other settings merely refine and detail this choice.

There are four choices here: "Exact copy", "Editable copy", "Formatted text", and "Plain text".

1. "Exact copy".

As conceived by the developers, this should be almost a mirror image of the recognized page, hence the name: fonts, point sizes, letter spacing within words, spacing between words, lines and paragraphs, and other layout details are all meant to be reproduced accurately. The idea is, in general, not a bad one, but FineReader usually falls short of implementing it to the intended extent.

Fonts and their styles (Normal, Italic, Bold) are often reproduced on the principle of "however it comes out". They may be reproduced accurately. The font used on the recognized page may be replaced by another one (similar in appearance, but different). A Normal style may be recognized as Bold, or vice versa. And so on.

The situation with point sizes, spacing, and other formatting is not much better - the appearance (layout) of the recognized page can usually be reproduced more or less accurately only when it is not very complicated.

The result is something of uncertain purpose - a Word document that can only be read (or have text copied out of it). Editing it beyond "remove a couple of letters, insert a couple of letters" is unrealistic. Yet editing is still required: the document will go on to further work, which means its formatting will have to be redone for that future use.

On the one hand, all the text here is scattered across numerous frames, which makes it quite difficult to work with. On the other hand, during recognition the program generates a mass of Word styles - all formatting in the text is done exclusively through styles. It is quite common for several hundred different styles to be generated for the text of a medium-sized book (300-400 pages), which makes editing even more difficult.
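If you are curious how many styles a saved file actually contains, this can be checked outside Word for the DOCX format (not DOC or RTF). A DOCX file is a ZIP archive, and its style definitions live in `word/styles.xml`. The following is a minimal sketch using only the Python standard library; the function name `count_docx_styles` is an assumption made for illustration, and it counts the styles defined in the file, not only those actually applied to text.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the <w:style> elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def count_docx_styles(docx_source):
    """Count <w:style> definitions in word/styles.xml, grouped by style type.

    docx_source may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        root = ET.fromstring(archive.read("word/styles.xml"))
    counts = {}
    for style in root.iter(f"{{{W_NS}}}style"):
        # Typical types are "paragraph", "character", "table", "numbering".
        style_type = style.get(f"{{{W_NS}}}type", "unknown")
        counts[style_type] = counts.get(style_type, 0) + 1
    return counts
```

For example, `count_docx_styles("book.docx")` (a hypothetical file name) returns a dictionary such as a count of paragraph styles and a count of character styles, which gives a quick idea of how much cleanup lies ahead.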

Summary - it doesn't make much sense to choose this save mode, it is rather inconvenient to work with saved text here.

If you need a complete reproduction of the original appearance, it is both easier and more practical to produce it as a PDF in the "Text under the page image" or "Text and pictures only" mode (these output methods are described a little below).

2. "Editable copy".

By definition, this is a lightweight version of "Exact copy". The appearance of the original is reproduced less meticulously than in the previous case, and there are noticeably fewer frames with text (although they still occur occasionally). However, although this option is called "editable", working with it is not exactly convenient either.

If you need a Word document as it is, just to view its contents and copy the desired piece of text, then this option can be used as well. If you need to redo a lot, reformat, and so on, then it is better to choose something else.

The reason is the same - too much fuss converting the text from the form that "Editable copy" produces to the form you might need. There is still a certain amount of text in frames, and the formatting still tends to reproduce the appearance (layout) of the original exactly. And the habit of generating a mass of styles has not gone away.

Summary - working with the text here is not as troublesome as in "Exact copy", but still leaves much to be desired.

3. "Formatted text".

Correspondence to the original is kept to a minimum here: only the fonts and point sizes, the approximate placement of the material on the original pages, and the general appearance of the text and tables are reproduced.

This option is much easier to work with than the previous ones, but the large number of styles still gets in the way. However, this is quite easy to fix - you can quickly go through the text and apply your own set of styles to it.

4. "Plain text".

Although it is called "Plain Text", you can save both the text itself and the text with pictures here. Formatting in this version is minimized - ordinary Word paragraphs from one side of the page to the other, plus pictures stuck between them. A bunch of styles familiar from the previous options are also not generated.

But if you wish, even here you can keep the original line breaks and page breaks, and also keep the font styles - regular, italic, bold.

Usually either "Formatted text" or "Plain text" is selected for saving, depending on what you are going to do next and how you will use the recognized text.

Now about the rest of the settings of this window.

  1. "Default paper size".
    This sets the Word option "Page settings → Paper size", that is, the paper format you will print on. Usually A4 is set. Keep in mind, however, that the "Exact copy" and "Editable copy" modes save not only the contents of the recognized page one-to-one but also its original size. As a result, if you choose a paper size larger than the page, there will be empty margins around the text when printing. If you choose a smaller one, part of the page material may be lost (it will fall outside the edges of the sheet).
  2. "Preserve line breaks and hyphens".
    If the checkbox is checked, the line breaks of the original are kept; they are saved as soft line breaks. If the checkbox is not checked, the text flows in ordinary Word paragraphs, with lines running from one edge of the page to the other.
  3. "Maintain pagination".
    If the checkbox is checked, the original page breaks are kept. If the checkbox is not checked, Word itself breaks the text into pages.
  4. "Preserve headers and footers and page numbers".
    If the checkbox is checked, the text marked and recognized as headers, footers, and page numbers is saved and placed in the corresponding Word fields. If the checkbox is not checked, this part of the text is not output at all.
  5. "Preserve line numbers".
    If the checkbox is checked, the line numbering is preserved in texts with numbered lines.
  6. "Maintain background and letter colors".
    If the checkbox is checked, text printed in color (or on a colored background) is output as in the original. If the checkbox is not checked, all the text is output in the usual way - black on a white background (or white on a black background).
  7. "Preserve bold, italic, and underline in plain text".
    Output to "Plain text" can follow the principle "everything in one style, Normal", or it can keep the font styles of the original. This checkbox controls that choice.
  8. "Highlight uncertainly recognized characters".
    Check this box if you prefer to proofread the recognized text not in FineReader but in some text editor. All the marks on characters and words that you had in the "Text" window will then be reproduced in the saved document.
  9. "Save pictures".
    Determines whether images are saved in addition to the text.
  10. "Picture quality".
    Determines how strongly the images from the original are compressed. This can be adjusted in three ways - through the compression algorithm, through the resolution of the saved image, and through its color depth. You can see the details by selecting the "Custom" option in the "Picture quality" line. It is more practical to use this option than the "Small size (150 dpi)" and "High quality (original image resolution)" presets.


Figure: 28. Image quality setting window

Since lowering the original resolution and subsequent compression may result in poorly predictable distortions, it is better to uncheck the "Reduce original image resolution" checkbox.

Set the color depth according to the situation. If the images are needed as they are, select "Do not change the color of the image". If a general impression is enough and accurate color reproduction is not required, select "Convert color images to grays". It is better not to convert color and gray images to black and white, because binarization can introduce a lot of poorly predictable distortion. It is also better not to select "Automatic" - it is not clear what logic it follows or what you will get at the output.
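To get a feel for how much resolution and color depth affect the amount of image data, here is a rough arithmetic sketch. It estimates only the uncompressed size; the bit depths (24 for color, 8 for gray, 1 for black and white) are typical values assumed for illustration, and the real size of a saved file depends on the compression algorithm chosen.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate uncompressed size of an image in bytes.

    width_in, height_in - picture size in inches
    dpi                 - resolution in dots per inch
    bits_per_pixel      - assumed color depth (24 color, 8 gray, 1 black and white)
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# A 4 x 6 inch illustration scanned at 300 dpi:
print(raw_image_bytes(4, 6, 300, 24))  # 6480000 (color)
print(raw_image_bytes(4, 6, 300, 8))   # 2160000 (gray)
print(raw_image_bytes(4, 6, 300, 1))   # 270000  (black and white)

# The same illustration reduced to 150 dpi, still in color:
print(raw_image_bytes(4, 6, 150, 24))  # 1620000
```

Halving the resolution cuts the data to a quarter, and converting color to gray cuts it to a third, which is why these two settings are tempting and why their effect on quality deserves a check before use.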

Save settings in PDF and PDF / A formats


Figure: 29. Settings for saving to PDF

There are also four saving modes here: "Text and pictures only", "Text over the page image", "Text under the page image", "Image only".

  1. "Text and pictures only".
    Here you actually get a PDF version of what "Exact copy" produces - the recognized text and illustrations from the "Text" window in a form as close as possible to the original. The original is reproduced more faithfully here than in DOC / DOCX / RTF, since the PDF format offers far more possibilities for this.
  2. "Text over the page image".
    This is a PDF consisting of two layers - the original image (bottom layer), on which the recognized text is superimposed (top layer). This option is quite convenient if the PDF will later be edited.
  3. "Text under the page image".
    This is a PDF composed of the same two layers - the original image and the recognized text - only in reverse order: the image is the top layer, and the text is the bottom (invisible) layer. This output method is also called "PDF with text background" and is used when you need an exact copy of the original's appearance on the one hand, and the ability to copy the text of the original on the other.
  4. "Image only".
    This is a PDF compiled from the original images. It contains nothing besides the images themselves.

Now about the rest of the settings of this window.

1. "Default paper size".

In PDF output, this setting means the same as in the previous case - the format of the sheet on which the page will be printed.

The previous case mentioned the rule "if the page is smaller than the specified format, there will be empty margins around the text; if it is larger, part of the text will be cut off". In PDF it applies even more strictly, since here the original page is reproduced one-to-one in every mode. Therefore, it is most reasonable to set "Use original size" here.
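The margins-or-cropping rule is simple arithmetic, and the following sketch spells it out. It is purely illustrative: the function name `fit_page_on_paper` is an assumption, sizes are given in millimeters, and the page is compared with the paper dimension by dimension without any scaling or centering logic.

```python
def fit_page_on_paper(page_mm, paper_mm):
    """Compare an original page with the chosen paper size.

    Returns how much empty space appears (paper larger than page)
    and how much is cut off (page larger than paper), per dimension.
    """
    page_w, page_h = page_mm
    paper_w, paper_h = paper_mm
    extra_margin = (max(paper_w - page_w, 0), max(paper_h - page_h, 0))
    cut_off = (max(page_w - paper_w, 0), max(page_h - paper_h, 0))
    return {"extra_margin_mm": extra_margin, "cut_off_mm": cut_off}

A4 = (210, 297)  # ISO 216 sizes, width x height in mm
A5 = (148, 210)

# An A5 book page output on A4 paper: empty space, nothing lost.
print(fit_page_on_paper(A5, A4))
# {'extra_margin_mm': (62, 87), 'cut_off_mm': (0, 0)}

# An A4 page output on A5 paper: part of the page is lost.
print(fit_page_on_paper(A4, A5))
# {'extra_margin_mm': (0, 0), 'cut_off_mm': (62, 87)}
```

With the original size used as the output size, both values are zero, which is exactly why that setting is the safest choice.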

2. "Maintain background and letter colors."

3. "Save headers and footers".

The meaning of these two settings is the same as in the previous case.

4. "Create a table of contents".

If the "Definition of structural elements → Table of contents" checkbox was selected in the recognition settings, then the book's table of contents recognized in this way can be used to automatically create a table of contents in a PDF file.

5. "Allow PDF Tags".

In PDF, tags are a functional analogue of Word styles, a way of structurally marking up the contents of a PDF file. With their help, information is saved about breaking the text into chapters, about headings, table of contents, illustrations, tables, notes, hyperlinks, mathematical formulas and the like.

If you often need to copy pieces of text from PDF, then check this box. Then the copied text will be much more consistent with how it looks on the PDF page.

Tags are also useful if the PDF has to be viewed on screens of various sizes - from desktops to smartphones. In such cases, PDF readers have to reformat the page content to fit the current screen size, and with tagging, this is much more accurate, without noticeable distortion of the original view.

6. "Use Mixed Raster Content (MRC)".

MRC (Mixed Raster Content) is a compression technology capable of producing significantly higher compression ratios than the well-known JPEG and JPEG 2000. Many people are familiar with it from the DjVu format, which is based on MRC. Whether to check this box is not a clear-cut choice; it depends on your situation.

The main plus is the size of the resulting PDF. It can be several times smaller than PDF obtained with the same compression settings, but without MRC.

The disadvantages are as follows.

MRC compression always introduces a poorly predictable amount of distortion, because the distortion depends only partly on the compression settings and to a large extent on the page content. Text, drawings, graphics, photographs - under MRC compression they all behave noticeably differently and are distorted to different degrees.

Compressing and viewing such PDFs is noticeably resource-hungry. Even on today's computers, an MRC PDF may open and scroll not smoothly but in jumps, with the next page appearing on the screen in parts rather than all at once.

7. "Save pictures".

8. "Image quality".

These settings mean the same as in the previous case - whether to save images when creating the PDF and at what compression level. The recommendations are similar too: uncheck the "Reduce original image resolution" checkbox, preferably leave the color depth unchanged, and set the "Quality" slider as you would for JPEG 2000 compression.

9. "Fonts".

If you select "Use Windows fonts", the fonts installed on your computer will be used for recognition and subsequent output. If you select "Use predefined fonts", only the fonts installed together with FineReader will be used.

It is preferable to set the first option, since this will use a much wider variety of fonts and it will be easier for the program to match the fonts of the recognized books.

10. "Embed Fonts".

If you need the PDF file to look on another computer exactly as it does on yours (in the same fonts), check this box.

11. "PDF security settings".

Here you can set password protection for PDF viewing, printing, copying text and pictures from it, editing.

If you have any questions about FineReader that you have not found an answer to in the text of the article, you can ask the program developers.