Andrey Lаppo - Technical Writing Portfolio / Writing Samples
Note: This is an anonymized translation of internal documentation created for Russian-speaking colleagues at an e-commerce company. Interface elements and software names appear in Russian as they were used in the original workflow. The company managed 50,000+ product documentation files requiring OCR processing for searchability.
Document Type: Technical Process Documentation
Audience: content managers, technical staff
Original Language: Russian
This workflow handles OCR (Optical Character Recognition) for PDFs in a large documentation archive. Three types of documents need different approaches:
The third type is the tricky one. Print-ready PDFs often convert text to vector curves so fonts print consistently. These files look like they contain text, but the letters are actually vector graphics. You can’t OCR them until you convert the vectors to raster images.
Preparation:
Install PDF-XChange Pro
Open the PDF in PDF-XChange Editor Plus
Do basic cleanup if needed:
Split facing pages: Split Pages… (Split pages in the active document)

Crop printer’s marks: Crop Page Tool (Draw boundaries over the page, to define the crop box)

Fix skew: Deskew Pages Content (Deskew scanned images in the document to improve reading and text recognition)

Check page numbering: drag pages in the thumbnail panel or use Move Pages (Move pages of the active document)
Try selecting text with the cursor. (When processing many documents, it helps to set File > Preferences > Tools > Default Tool to Select Text so that tool is already active whenever a document opens.)

In the toolbar, click the bottom part of the Edit Objects combo button, choose Shapes, and try selecting text by dragging with the left mouse button. If individual letters get selected, go to the next section. Otherwise, skip to the OCR Methods section.

Sometimes text is saved as curves. Usually this means the file was prepared for printing. You’ll need to convert the curves into something OCR can handle: raster images. Pick any method below, then move to the OCR Methods section.
Method 1: PDF-XChange Editor > Sanitize Document (Removes sensitive information, such as metadata, form data, invisible contents…)

Check all boxes except Rasterize content with overlapping objects. Save the document. File size won’t change much. This probably won’t work, but it’s worth trying.
Method 2: Same as above, but check Rasterize content with overlapping objects.
The Rasterization resolution field will unlock. Enter at least 200 DPI; 300 DPI gives more reliable recognition. For high-quality layouts, for printing, or for documents with small text, use 400 to 600 DPI. Rough estimate for the required resolution: aim for 30 to 40 pixels of height for the smallest text in the document (30 is good, 40 is excellent). File size can increase 15x to 60x (!).
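The pixel-height estimate above can be turned into a quick calculation. This is a minimal sketch, assuming the smallest text size is known in points (1 pt = 1/72 inch) and treating the nominal font size as the text height, which slightly overstates the height of the actual glyphs. The function name and the rounding up to 50 DPI steps are illustrative choices, not PDF-XChange settings.

```python
import math

POINTS_PER_INCH = 72  # definition of the PDF/PostScript point


def dpi_for_text(smallest_pt: float, target_px: int = 30, step: int = 50) -> int:
    """Return a rasterization DPI that renders `smallest_pt` text at `target_px` pixels.

    Assumption: the nominal font size stands in for the text height.
    The result is rounded up to the next multiple of `step`.
    """
    if smallest_pt <= 0:
        raise ValueError("font size must be positive")
    exact = target_px * POINTS_PER_INCH / smallest_pt
    return math.ceil(exact / step) * step


# Print the "good" (30 px) and "excellent" (40 px) resolution for a few sizes.
for pt in (6, 8, 10):
    print(f"{pt} pt: {dpi_for_text(pt, 30)} to {dpi_for_text(pt, 40)} DPI")
```

For 6 pt text this lands in the 400 to 600 DPI band recommended for small text, and for 8 pt text the lower figure matches the 300 DPI baseline.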
Method 3: PDF-XChange Editor > Export > Export to Image(s)

Reassemble the resulting images using PDF-XChange Editor or PDF-Tools.
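If the exported file names carry unpadded page numbers, plain alphabetical order puts page 10 before page 2, which scrambles the reassembled document. Below is a minimal sketch of natural ordering for a file list prepared outside the editor. The naming pattern is hypothetical; check what your export settings actually produce.

```python
import re


def natural_key(name: str) -> list:
    """Split a file name into text and integer chunks for human-style sorting.

    re.split with a capturing group puts digit runs at odd indices,
    so those are converted to int and compared numerically.
    """
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


# Hypothetical export names; the real pattern depends on the export settings.
exported = ["catalog_page_10.png", "catalog_page_2.png", "catalog_page_1.png"]
ordered = sorted(exported, key=natural_key)
print(ordered)
```

Zero-padded names (page_001, page_002) sort correctly without this step.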
Method 4: ABBYY FineReader PDF > Edit > Delete Objects and Data… > check all boxes > Apply
File size can increase ~50x.
Method 5: There’s another way to preserve the original appearance completely, barely change the file size, work regardless of font-background contrast, and add an invisible text layer. I regret only learning about it after writing this workflow. It requires more advanced techniques not covered here.
Choose whichever method seems more convenient, accurate, or faster for your situation.
In the OCR Pages (Enhanced) window, use the checkboxes to select the needed languages and enable Ignore existing text on page, Ignore comments on page, and Ignore form fields on page. Under Output Options, set Type to Searchable Image.

Settings will be saved. You won’t need to go into them in the future.
Check boxes for Batch Processing Mode, Multi-Threaded Processing Mode, Do Not Ask for Passwords, Allow select multiple files, Show extended dialog for files selecting, OCR Pages, Show setup dialog while running

Choose Input Files > File types: PDF Documents (*.pdf)
OCR Pages > If document contains text: OCR document
In the Save Documents module:
%[FileName]
For OCR Pages (Enhanced) settings, use the same ones from the OCR in PDF-XChange Editor section above.
Unfortunately, manually typing model numbers, article codes, and product codes leads to errors. Similar-looking characters get mixed up: the numeral 1 vs. lowercase l vs. uppercase I, Latin O vs. Cyrillic О vs. the numeral 0, and so on. When working with technical documents, I recommend using select-copy-paste whenever possible. If you spot a suspected or obvious error, send it immediately to the archive administrator with the filename, page number, and a brief description.
Another issue: text copied from a PDF often carries hard line breaks (in some cases only the first line of the selection gets pasted), and automatic recognition sometimes detects text block boundaries incorrectly.
For the first issue, use any lightweight text editor. Enter the PDF text block edit mode, copy the text to the text editor, remove unwanted line breaks, copy the cleaned text, paste it back where you got it.
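The line-break cleanup can also be scripted when there is a lot of text to fix. This is a minimal sketch, assuming that a blank line marks a real paragraph break and every other line break is a layout artifact; the function name is illustrative. It does not rejoin words hyphenated across lines, because in product codes a trailing hyphen may be part of the code, so check those by eye.

```python
import re


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: blank lines separate paragraphs; single line breaks
    (and any run of whitespace) collapse to one space.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


sample = "Rated voltage:\n230 V,\n50 Hz\n\nProtection class:\nIP44"
print(unwrap_lines(sample))
```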
For the second issue, use ABBYY Screenshot Reader (included with ABBYY FineReader PDF). Launch it from the Windows Start menu. If no window appears, click the red icon with a white frame in the system tray. In the dropdown list under Снимок (Capture) choose Области (Area), and under Передать (Send) choose Текст в буфер обмена (Text to Clipboard). Select the language; avoid Авто (Auto) unless the language is truly unknown. Press Alt + Enter, draw the selection as tightly as possible around the “dead” text without cutting into it, press Enter, and paste the result where you need it. For offline recognition using a single language, you can also use ShareX. It’s FOSS but a bit complicated to set up initially.
When copying technical specifications, these character confusions happen frequently:
1 (one) vs. l (lowercase L) vs. I (uppercase i)
0 (zero) vs. O (capital o) vs. О (Cyrillic о)
8 (eight) vs. B (capital b) vs. В (Cyrillic в)
5 (five) vs. S (capital s)
- (hyphen) missing or replaced with — (em dash)
Always use copy-paste for product codes and model numbers. Don’t manually retype them. Report suspected OCR errors to the archive administrator immediately.
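A mixed-alphabet check catches some of these confusions automatically. The sketch below assumes product codes should contain only ASCII characters; that is an assumption about the catalog, not a rule from this workflow, and the function name is illustrative. It reports each offending character with its Unicode name, so Cyrillic look-alikes and dash substitutions become visible.

```python
import unicodedata


def suspicious_chars(code: str) -> list:
    """Return (position, character, Unicode name) for every non-ASCII character.

    Assumption: a valid product code is pure ASCII, so Cyrillic look-alikes
    and typographic dashes are flagged.
    """
    return [(i, ch, unicodedata.name(ch, "UNKNOWN"))
            for i, ch in enumerate(code)
            if ord(ch) > 127]


# \u0412 is Cyrillic "Ve", \u2014 is an em dash, \u041e is Cyrillic "O".
sample_code = "X\u0412\u2014100\u041e"
for hit in suspicious_chars(sample_code):
    print(hit)
```

This check cannot tell 1 from l or 0 from O within ASCII; those still need a visual comparison against the page.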
Resolution guidelines:
Rule of thumb: if you can’t read it comfortably on screen at 100% zoom, OCR probably can’t either.
This workflow was developed for processing mixed-format PDF documents including scans, vector PDFs, and print-ready files with outlined text. It’s been tested on technical specifications, product catalogs, brochures, and multi-language documentation.