Comments on: Mindhacks For Fingertips Follow-Up – Plus Earhacks

By: Ray Davis

Ray Davis — Mon, 02 Nov 2009 15:06:07 +0000

I do a fair amount of OCR-ing, and I invested in ABBYY FineReader many years and versions ago. One of its selling points is "trainability," so I'm pleased to hear about Readiris's support. For those of us drawn to older texts printed variably or in not-so-typical typefaces, that feature is well worth the extra tinkering. (You can see what happens without it in massive library trashings like the grant-funded "The Million Illegible Books Project" and the Google-funded whatchamacallit.)

By: Cian O'Connor

Cian O'Connor — Fri, 30 Oct 2009 15:20:24 +0000

http://unpaper.berlios.de/
Is what I was thinking of. It will do a pretty good job of cleaning up PDF scans prior to OCR.
The following app has a version that works on Windows, and probably the Mac as well:
http://www.mobileread.com/forums/showthread.php?t=21906

By: Cian O'Connor

Cian O'Connor — Fri, 30 Oct 2009 15:05:25 +0000

Abbyy Finereader is supposed to be the best OCR system out there, though it’s PC only. I have version 9 (I think), which is excellent. Version 10 is supposed to be better. It handles OCR, converts PDFs to HTML/Word, and is pretty reasonable with footnotes/diagrams/fonts. There’s also a cut-price version for $50-60 which is supposed to be pretty decent for stuff that is just text, or for layouts that aren’t too complex. The favourite of librarians and e-book pirates, apparently.
Omni is pretty crap by all accounts, unless you’re an expert tweaker.

There’s some piece of freeware that I’ve lost (mostly I search the pirate sites for scans, as it’s easier than scanning them in myself) which will clean up scans and so improve OCR. It’s part of the Linux OCR suite that uses Tesseract (which is okay for free, but if time is money…), and is very useful.

Incidentally, a LOT of academic books are out there on filesharing sites these days. I’m not sure what that’s an indicator of.

By: Andrew Brown

Andrew Brown — Thu, 29 Oct 2009 19:09:04 +0000

Abbyy Finereader is pretty competent at all the things you want. PC only, so far as I know, but very quick, multilingual, and dead accurate on high quality scans. I have version 7, and have never felt any need to upgrade. So I don’t know if it has been bloated all to hell since then.

By: Salient

Salient — Wed, 28 Oct 2009 12:07:53 +0000

I have become a bit confused. If anyone wishes to help orient me, it’s much appreciated:

* Which of these programs is good at recognizing when symbols don’t make sense to it (e.g. an equation) and storing the resultant information in a picture, and just including that picture in the resultant file?

* Which of these programs does things like recognize word frequency in order to improve its ability to OCR correctly in ambiguous cases? (Or, alternatively, can be trained to prefer certain words over others.)

* Which of these programs can be “trained” to recognize new symbols and substitute them, for example seeing ║ and OCR’ing it as \norm, or seeing β and OCR’ing it as \beta ?

* Which of these programs can be trained to recognize formatting data and encode that: for example, scan “what _dreams_ may come” and OCR this as “what \begin{boldface}dreams\end{boldface} may come” ?

* Which of these programs can be trained to recognize relative position, for example notice e^x^ and OCR this as e\superscript{x} ?

I am planning to achieve world domination by developing an open-source OCR tool that scans old math texts and converts them to .TeX files which, when compiled, faithfully recover the original text and formatting, but as I currently have exactly none of the skill set necessary to accomplish this, I’d rather get the next closest OCR program created by someone else and tinker with it a bit.

Thanks in advance to anyone who responds!
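
The symbol-substitution question above can at least be approximated as a post-processing pass over whatever text an OCR program emits. What follows is a minimal sketch under that assumption, not a feature of any program named in this thread. The table `SYMBOL_TO_TEX` and the function `substitute_symbols` are hypothetical names made up for illustration, and `\norm` is the macro name from the comment, which would still need to be defined in the TeX preamble.

```python
# Hypothetical lookup table: recognized character -> TeX macro.
# The two entries are the examples given in the comment above.
SYMBOL_TO_TEX = {
    "β": r"\beta",
    "║": r"\norm",
}


def substitute_symbols(text, table=SYMBOL_TO_TEX):
    """Replace each character found in the table with its TeX macro.

    An empty group "{}" is appended so that the macro name does not run
    into any letters that follow it in the OCR output.
    """
    return "".join(table[ch] + "{}" if ch in table else ch for ch in text)


if __name__ == "__main__":
    print(substitute_symbols("2β+1"))
```

This only covers symbols the OCR engine already recognizes as distinct characters. Detecting bold type or superscript position needs layout information that plain text output does not carry.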

By: Brendan

Brendan — Wed, 28 Oct 2009 04:56:09 +0000

Probably not of much use or interest to people here, but Acrobat’s OCR for Chinese texts is surprisingly good. It still requires correcting (particularly for texts that mix traditional and simplified characters, and anything featuring Chinese characters plus, e.g., transliterated Sanskrit is going to be a world of pain), but is much, much better than the other options (all…one of them?) for Mac OCR of Chinese. Thanks for the heads-up — hadn’t known about this before.

By: jre

jre — Tue, 27 Oct 2009 21:24:31 +0000

Tesseract, or “tesseract-ocr”, is free (in both senses), works on all platforms and does not have a problem with grayscale or binary images. I have used it to scan multi-generation copies of wretched quality and gotten good results. Downside: you need to scan to a tiff or convert a pdf file to tiff.
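
The PDF-to-TIFF step mentioned above can be scripted. Here is a rough sketch, assuming Ghostscript (`gs`) and `tesseract` are both installed and on the PATH; the Ghostscript route, the 300 dpi setting, the `page-%03d.tif` naming, and the helper names `pdf_to_tiff_cmd`, `tesseract_cmd` and `ocr_pdf` are choices made for this illustration, not anything from the comment.

```python
import shutil
import subprocess
from pathlib import Path


def pdf_to_tiff_cmd(pdf_path, out_pattern, dpi=300):
    """Ghostscript command that renders each PDF page to a grayscale TIFF."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-q",
        "-sDEVICE=tiffgray",            # 8-bit grayscale TIFF output
        f"-r{dpi}",                     # resolution in dots per inch
        f"-sOutputFile={out_pattern}",  # %03d is replaced by the page number
        str(pdf_path),
    ]


def tesseract_cmd(tiff_path, lang="eng"):
    """Tesseract command: 'tesseract IMAGE OUTPUTBASE -l LANG' writes OUTPUTBASE.txt."""
    tiff_path = Path(tiff_path)
    out_base = tiff_path.with_suffix("")  # page-001.tif -> page-001
    return ["tesseract", str(tiff_path), str(out_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir="."):
    """Run both steps in order; raises if either tool is missing."""
    for tool in ("gs", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    work = Path(work_dir)
    subprocess.run(pdf_to_tiff_cmd(pdf_path, work / "page-%03d.tif"), check=True)
    for tiff in sorted(work.glob("page-*.tif")):
        subprocess.run(tesseract_cmd(tiff), check=True)
```

Calling `ocr_pdf("book.pdf")` would leave one `.txt` file per page next to the TIFFs. On Windows the Ghostscript executable usually has a different name, so the `"gs"` entry would need adjusting.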

By: Jon H

Jon H — Mon, 26 Oct 2009 23:17:23 +0000

"Everyone with Acrobat should know this: there’s an OCR option under the document menu. (No great shakes, but it works.)" With Acrobat, I found it best to use the option to retain the bitmap text, rather than to replace the bitmap with the OCR'd text. The text is searchable, and copyable, but the original font remains, and the OCR mistakes are hidden. (Which could be a problem, but might not be.)

By: Zora

Zora — Mon, 26 Oct 2009 18:16:27 +0000

A heavy-duty guillotine gets rid of the covers and the spine. A sheet-fed scanner (expensive!) can scan a whole book, both sides of the pages, in less than half an hour. Then OCR the whole batch.

The tools are somewhat expensive, but once the investment is made, conversion from deadtree to e is fast … assuming a recent printing with clear, crisp print.

By: John Holbo

John Holbo — Mon, 26 Oct 2009 16:53:50 +0000

“Any tips on turning a bound book into scannable pages?”

Be prepared to break its spine. As to sheet feeding: I’m sure you don’t want to be feeding actual book pages through one of those, even if the whole spine has dissolved and you just have a pile of paper. You could photocopy the whole, fairly quickly, then sheetfeed the photocopies, more slowly. That might produce perfectly good results. I have access to a really high quality flatbed scanner at work and a less high quality flatbed at home. I think flatbed is probably the way to go with old books.

By: Jon H

Jon H — Mon, 26 Oct 2009 16:34:13 +0000

Any tips on turning a bound book into scannable pages?

I picked up an old scanner with a sheet feeder at a ‘freecycling’ event here at work. Had to buy a semi-compatible power supply, and need to solder it in, but it works, more or less.

Some of my books are starting to turn yellow, so retaining the physical objects isn’t a strong motivator.

By: jholbo

jholbo — Mon, 26 Oct 2009 12:42:28 +0000

Hmmm, this isn’t good. Quiggin sent me his PDF for conversion and Readiris couldn’t read it. Total blank. It’s the black&white problem I mentioned above, I presume. I have successfully converted lots of PDFs I’ve found here and there, and I was starting to think that as long as I personally didn’t select b&w while scanning, that other PDFs floating around would be no problem. But an OCR app that draws a complete blank with any degree of frequency is a serious problem. Not good. (Since 90% of the stuff I want to convert has been scanned by me, it’s not a problem for me. But for other people? Not so good, I expect.)

Fortunately, Acrobat did a pretty serviceable job on the thing Quiggin sent me. Everyone with Acrobat should know this: there’s an OCR option under the document menu. (No great shakes, but it works.)

By: John Holbo

John Holbo — Mon, 26 Oct 2009 00:38:09 +0000

I have also noticed one quirk about Readiris, which I discovered early enough that it’s not going to bother me, but it’s worth mentioning. For test purposes, I scanned some docs in b&w and grayscale. I figured the former would be just as good, and might even be better. And scanning in b&w is slightly faster. To my surprise, the b&w scans didn’t show up at all. They looked fine in any PDF reader. But Readiris couldn’t read them. Since OCR starts by converting grayscale to b&w (I believe), this is sort of an odd bug. I emailed them and a support person asked for a sample problematic file. I haven’t heard back since then (that was a week ago). So I recommend: scanning in grayscale, or at least testing any b&w setting before using. I haven’t had any trouble with old PDFs that I didn’t make, but I think those are mostly grayscale, in effect.
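
For readers wondering what "converting grayscale to b&w" involves: one widely used approach is Otsu's method, which picks the cutoff that best separates dark pixels from light ones. The sketch below illustrates that general technique only; nothing in this thread says Readiris or any other program mentioned here works this way. The function names are made up for the example, and pixels are assumed to be integers from 0 (black) to 255 (white).

```python
def otsu_threshold(pixels):
    """Return the cutoff t that maximizes between-class variance.

    Pixels with value <= t are treated as dark, the rest as light.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate cutoff
    sum_dark = 0  # sum of their values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue              # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break                 # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    page = [10, 12, 11, 200, 210, 205]  # dark ink values, then light paper values
    print(binarize(page, otsu_threshold(page)))
```

A scanner's own b&w mode makes this decision at scan time and throws the gray levels away, which is one reason scanning in grayscale leaves more room for the OCR software to make its own choice.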

By: John Holbo

John Holbo — Mon, 26 Oct 2009 00:23:12 +0000

Hmmm, I have Acrobat with the CS Suite. Is there any way to get OCR in CS Suite without just going into Acrobat and doing the quick OCR thing? Because that works OK, and may be fine for many people’s purposes. But Readiris works a lot better. I considered going for Finereader instead – and there is a free trial version you can download, I believe. But it’s $200, so I went for the slightly cheaper one. And I’m very happy.

John Quiggin: Send me that PDF you want converted. I’ll send you the RTF conversion that is produced with 1-click. And you see whether you like the results.

By: Zora

Zora — Sun, 25 Oct 2009 23:02:37 +0000

The folks at Distributed Proofreaders (16,000 books scanned or harvested) are fond of Abbyy Finereader.