Following up this post, here’s the way to do the scan-and-OCR thing (if you are a Mac user). First, DEVONthink seems a very worthwhile application, which I’m disciplining myself to use. But that’s a time investment. Here’s the time saver: Readiris turns out to be a great OCR application, recommended for those who think they might pay a bit over $100, to spare them all the damn data-entry, but aren’t ready to plunk down the $550, or whatever, for OmniPro, because that’s obviously nuts. I was hoping DEVONthink would do double-duty in the OCR department, and I wouldn’t have to buy a second app. But the results were disappointing. Readiris, on the other hand, works great for two kinds of projects I have.
[UPDATE: see below in comments for a possible, major drawback: namely, Readiris has trouble with PDF’s based on black&white, as opposed to grayscale, scans.]
First, processing really long documents that might be of borderline quality, threatening you with a huge amount of clean-up. (I’m assuming you have decent scan quality, otherwise you should rescan it; but maybe the original print quality was so-so.) You have to spend maybe 15 minutes training Readiris to process the distinctively dubious quality of whatever specific thing you’ve got. But it actually seems to learn. Then the app pretty much just chews through, in 50-page chunks, surprisingly error-free. Not perfect. But better than I had expected. I was on the fence about some projects for making e-editions of old public-domain books I’ve got kicking around the place. Now it actually seems like a do-able thing.
But what most scholars need more than the ability to convert whole old books into e-books is the quick-and-dirty (but surprisingly clean) capacity of Drop2Read, which is part of Readiris. It sits in the dock. You just drag-and-drop a PDF onto it; it creates and autosaves an RTF conversion into the same folder as the PDF original, then opens it for you to check. (You can set preferences about the details of all this.) It works well, often even with multi-column text with figures and illustrations. (It doesn’t preserve and place illustrations for you, but it isn’t driven mad by the presence of such things.) It’s 90%, for formatting and for the text itself. For one measly click, that’s a bargain. Text masticated via Drop2Read is more wholesome for feeding DEVONthink. More to the point, for most people, it’s ready for you just to cut&paste, later, when you want that block quote. Keep the RTF version alongside the PDF, which can be used as a reading copy and/or a thing against which you check OCR problems. (Next: I need a faster scanner.)
But what should I listen to while performing these night-time chores? What plangent sounds to soothe my ears, as I watch the scanner weave its gentle path of light ‘neath the closed cover? If you don’t like Philip Glass … then you’ll probably hate the Doveman I’ve now got on high rotation. I love the new album, The Conformist [amazon]. (And before that, I liked The Acrobat a lot, too.) You can stream various tracks here. And there are a couple freebies floating around here and here (oh, and don’t miss the Footloose cover.) And if you have listened to that, and want something else breathy and a bit wimpy (but maybe not so Belle and Sebastian tweecore), I notice that Amazon is selling The Zombies, Odessey and Oracle for a lousy $1.99. As I believe I have mentioned: I like Colin Blunstone’s voice. (You like Ray Lamontagne’s voice? It’s like that.) Seriously, this is some classic 60’s slightly not rocking quite enough but still great stuff; right up there with the Beach Boys, Pet Sounds. But more breathy and proto-proggy. Here’s a Pitchfork review of this particular release, with which I am in substantial agreement. (Although I think giving it a 9.3 might be generous.) But it seems like you don’t get all the bonus tracks if you just buy the mp3’s. Hmmmm. Still, a good deal.
Doveman and The Zombies have this in common: the lead singer is basically breathing in your ear, from, like, 3 inches away, sounds like. This could get silly – to say nothing of the threat of looming emo; but instead it’s … steady and consistent. Which may mean that you get a bit tired of it. Me? I’m playing it over and over. [UPDATE: Oh, Pitchfork pretty much agrees about the Doveman, too.]
{ 21 comments }
John S. Wilkins 10.25.09 at 4:20 pm
Thanks for the recommendations. I have an old copy of OmniPro which is at best unreliable so I was looking for a new OCR app.
John Holbo 10.25.09 at 4:33 pm
Glad to be of assistance. I should maybe update the post to dampen people’s expectations just a bit. For people who don’t know: OCR has pretty much always been somewhat disappointing. These packages promise to do this thing, but nothing quite does. It’s just hard. When I say Readiris is great I mean: it works noticeably better than older stuff I’ve used, including older versions of OmniPage (but it’s been many years since I used that app.) So: Readiris is very pleasing, if you approach with the soft bigotry of low expectations that old-timer OCR users rightly suffer from. But folks who aren’t familiar with this sort of thing shouldn’t expect the thing to get absolutely everything right. 90%+ right is still a lot of clean-up, if you really want everything totally clean.
mistersquid 10.25.09 at 4:38 pm
Thanks for the remarks regarding ReadIris and Drop2Read’s features and functionalities.
OCR is certainly becoming a standard feature of many programs and program suites and I expect that before too long we may see this functionality rolled into mainstream OSes. Look alive once Apple releases its much anticipated tablet device.
The specialized applications–DEVONThink, ReadIris, and OmniPro–are possible go-tos for OCR, but many people may already have a very very good OCR application if they have a license to the Adobe CS suite. My experience begins with Adobe CS2 which OCRs standard PDFs scanned at 140dpi very capably and very quickly, depending on the typeface scanned.
People who have avoided Adobe Acrobat and are considering an OCR-capable piece of software should give Acrobat (bundled with Adobe CS2 or later) a try if they already have access to it.
Zach 10.25.09 at 5:21 pm
Microsoft’s OneNote software is pretty amazing and probably already available in your copy of Office. It works best in conjunction with a tablet PC so that you can write in it, but it seems like it would suit your purposes just fine. It allows for all kinds of note taking and organisation, makes text searchable (with surprising accuracy) and even allows for searching within audio files. The last feature is nice if you just want to record some thoughts by speaking out loud, but if you’re a frequent audiobook listener you could theoretically use it to jump straight to any point based on a keyword search. I haven’t tried the search with files that large though.
Also, it is available for Windows or OSX.
John Quiggin 10.25.09 at 7:33 pm
Hmm. Just last night, I was thinking, ‘why did these fools put up a scanned print of their (Sep 2009) Memorandum of Understanding instead of a text PDF, and how can I avoid retyping it without spending a fortune?’. I wish and the Intertubes provide. Thanks, John
Bill Benzon 10.25.09 at 10:53 pm
Hi John S. Wilkins! Fancy meeting you here.
For those who don’t know him, John S W is a philosopher of biology and a damned good one too. He’s just published two books, one on the concept of a species and the other on . . . just what exactly? He’s now working on a book on the evolutionary underpinnings of religion — a hot topic these days to which John will bring a bit of sanity. I hope.
I met him online some years ago and we’ve been corresponding off and on ever since. Like some in these here virtual parts, John’s a meatworld Aussie. At the moment, however, he’s on the US leg of his Europe & US tour. We spent a delightful afternoon last Thursday chatting about this and that.
Note: I second mistersquid’s endorsement of Adobe’s OCR app in Creative Suite. I’ve got CS2 and am quite happy with the OCR component.
Zora 10.25.09 at 11:02 pm
The folks at Distributed Proofreaders (16,000 books scanned or harvested) are fond of Abbyy Finereader.
John Holbo 10.26.09 at 12:23 am
Hmmm, I have Acrobat with the CS Suite. Is there any way to get OCR in CS Suite without just going into Acrobat and doing the quick OCR thing? Because that works OK, and may be fine for many people’s purposes. But Readiris works a lot better. I considered going for Finereader instead – and there is a free trial version you can download, I believe. But it’s $200, so I went for the slightly cheaper one. And I’m very happy.
John Quiggin: Send me that PDF you want converted. I’ll send you the RTF conversion that is produced with 1-click. And you see whether you like the results.
John Holbo 10.26.09 at 12:38 am
I have also noticed one quirk about Readiris, which I discovered early enough that it’s not going to bother me, but it’s worth mentioning. For test purposes, I scanned some docs in b&w and grayscale. I figured the former would be just as good, and might even be better. And scanning in b&w is slightly faster. To my surprise, the b&w scans didn’t show up at all. They looked fine in any PDF reader. But Readiris couldn’t read them. Since OCR starts by converting grayscale to b&w (I believe), this is sort of an odd bug. I emailed them and a support person asked for a sample problematic file. I haven’t heard back since then (that was a week ago.) So I recommend: scanning in grayscale, or at least testing any b&w setting before using. I haven’t had any trouble with old PDF’s that I didn’t make, but I think those are mostly grayscale, in effect.
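For the curious, here is a rough sketch of what that grayscale-to-b&w step can look like. It uses Otsu's method, a standard way of picking the cutoff automatically; I have no idea whether Readiris does anything like this internally, and the function names and toy pixel values are made up for illustration.

```python
def otsu_threshold(pixels):
    """Pick the 0-255 cutoff that best separates dark ink from light paper.

    `pixels` is a flat list of grayscale values (0 = black, 255 = white).
    This is textbook Otsu thresholding, not any particular OCR app's method.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels this dark yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when the two groups are far apart.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Return 1 for ink (at or below the threshold) and 0 for paper."""
    return [1 if value <= threshold else 0 for value in pixels]


# Toy "scan": three dark ink pixels followed by three light paper pixels.
scan = [20, 30, 25, 220, 230, 240]
cutoff = otsu_threshold(scan)
bitmap = binarize(scan, cutoff)
```

A scan made in b&w mode has already been through something like this inside the scanner, which is why it is odd that an OCR app would choke on it.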
jholbo 10.26.09 at 12:42 pm
Hmmm, this isn’t good. Quiggin sent me his PDF for conversion and Readiris couldn’t read it. Total blank. It’s the black&white problem I mentioned above, I presume. I have successfully converted lots of PDF’s I’ve found here and there, and I was starting to think that as long as I personally didn’t select b&w while scanning, that other PDF’s floating around would be no problem. But an OCR app that draws a complete blank with any degree of frequency is a serious problem. Not good. (Since 90% of the stuff I want to convert has been scanned by me, it’s not a problem for me. But for other people? Not so good, I expect.)
Fortunately, Acrobat did a pretty serviceable job on the thing Quiggin sent me. Everyone with Acrobat should know this: there’s an OCR option under the document menu. (No great shakes, but it works.)
Jon H 10.26.09 at 4:34 pm
Any tips on turning a bound book into scannable pages?
I picked up an old scanner with a sheet feeder at a ‘freecycling’ event here at work. Had to buy a semi-compatible power supply, and need to solder it in, but it works, more or less.
Some of my books are starting to turn yellow, so retaining the physical objects isn’t a strong motivator.
John Holbo 10.26.09 at 4:53 pm
“Any tips on turning a bound book into scannable pages?”
Be prepared to break its spine. As to sheet feeding: I’m sure you don’t want to be feeding actual book pages through one of those, even if the whole spine has dissolved and you just have a pile of paper. You could photocopy the whole, fairly quickly, then sheetfeed the photocopies, more slowly. That might produce perfectly good results. I have access to a really high quality flatbed scanner at work and a less high quality flatbed at home. I think flatbed is probably the way to go with old books.
Zora 10.26.09 at 6:16 pm
A heavy-duty guillotine gets rid of the covers and the spine. A sheet-fed scanner (expensive!) can scan a whole book, both sides of the pages, in less than half an hour. Then OCR the whole batch.
The tools are somewhat expensive, but once the investment is made, conversion from deadtree to e is fast … assuming a recent printing with clear, crisp print.
Jon H 10.26.09 at 11:17 pm
“Everyone with Acrobat should know this: there’s an OCR option under the document menu. (No great shakes, but it works.)”
With Acrobat, I found it best to use the option to retain the bitmap text, rather than to replace the bitmap with the OCR’d text. The text is searchable, and copyable, but the original font remains, and the OCR mistakes are hidden. (Which could be a problem, but might not be.)
jre 10.27.09 at 9:24 pm
Tesseract, or “tesseract-ocr”, is free (in both senses), works on all platforms and does not have a problem with grayscale or binary images. I have used it to scan multi-generation copies of wretched quality and gotten good results. Downside: you need to scan to a tiff or convert a pdf file to tiff.
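A minimal sketch of that PDF-to-TIFF-to-text route, in case it helps. It assumes Ghostscript (`gs`) and `tesseract` are both installed and on your path; the function names, the 300 dpi default and the `page-001.tif` naming pattern are my own choices for illustration, not anything Tesseract requires.

```python
import subprocess
from pathlib import Path


def pdf_to_tiff_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that writes one grayscale TIFF per page."""
    pattern = str(Path(out_dir) / "page-%03d.tif")
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",          # grayscale TIFF output
        f"-r{dpi}",                   # scan resolution in dpi
        f"-sOutputFile={pattern}",    # page-001.tif, page-002.tif, ...
        str(pdf_path),
    ]


def ocr_command(tiff_path):
    """Build a Tesseract command: `tesseract image outputbase` writes outputbase.txt."""
    tiff_path = Path(tiff_path)
    return ["tesseract", str(tiff_path), str(tiff_path.with_suffix(""))]


def ocr_pdf(pdf_path, out_dir):
    """Convert a PDF to TIFF pages, OCR each page, and return the text files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdf_to_tiff_command(pdf_path, out_dir), check=True)
    for tiff in sorted(out_dir.glob("page-*.tif")):
        subprocess.run(ocr_command(tiff), check=True)
    return sorted(out_dir.glob("page-*.txt"))
```

Calling `ocr_pdf("memo.pdf", "memo_pages")` (both names hypothetical) would leave one text file per page in `memo_pages`. Depending on your Tesseract version you may need uncompressed TIFFs or a language flag, so treat this as a starting point.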
Brendan 10.28.09 at 4:56 am
Probably not of much use or interest to people here, but Acrobat’s OCR for Chinese texts is surprisingly good. It still requires correcting (particularly for texts that mix traditional and simplified characters, and anything featuring Chinese characters plus, e.g., transliterated Sanskrit is going to be a world of pain), but is much, much better than the other options (all…one of them?) for Mac OCR of Chinese. Thanks for the heads-up — hadn’t known about this before.
Salient 10.28.09 at 12:07 pm
I have become a bit confused. If anyone wishes to help orient me, it’s much appreciated:
* Which of these programs is good at recognizing when symbols don’t make sense to it (e.g. an equation) and storing the resultant information in a picture, and just including that picture in the resultant file?
* Which of these programs does things like recognize word frequency in order to improve its ability to OCR correctly in ambiguous cases? (Or, alternatively, can be trained to prefer certain words over others.)
* Which of these programs can be “trained” to recognize new symbols and substitute them, for example seeing ║ and OCR’ing it as \norm, or seeing β and OCR’ing it as \beta ?
* Which of these programs can be trained to recognize formatting data and encode that: for example, scan “what _dreams_ may come” and OCR this as “what \begin{boldface}dreams\end{boldface} may come” ?
* Which of these programs can be trained to recognize relative position, for example notice e^x^ and OCR this as e\superscript{x} ?
I am planning to achieve world domination by developing an open-source OCR tool that scans old math texts and converts them to .TeX files which, when compiled, faithfully recover the original text and formatting, but as I currently have exactly none of the skill set necessary to accomplish this, I’d rather get the next closest OCR program created by someone else and tinker with it a bit.
Thanks in advance to anyone who responds!
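Two of the items on that wish list (symbol substitution and word-frequency tie-breaking) can at least be approximated as a clean-up pass over whatever text an OCR program hands back. The sketch below is only that: a post-processing toy, not a feature of any package named in this thread. The symbol table, the word counts and the function names are all invented for illustration.

```python
# Hypothetical lookup table: recognized symbol -> TeX command.
SYMBOL_TO_TEX = {
    "β": r"\beta",
    "α": r"\alpha",
    "∞": r"\infty",
}


def substitute_symbols(text, table=SYMBOL_TO_TEX):
    """Replace each known symbol with its TeX command.

    Plain string replacement only; spacing after the command is left
    for a later pass or a human to tidy up.
    """
    for symbol, command in table.items():
        text = text.replace(symbol, command)
    return text


def pick_by_frequency(candidates, word_counts):
    """Among ambiguous readings of one word, prefer the most frequent one.

    `word_counts` maps words to how often they occur in some reference
    text; unseen words count as zero.
    """
    return max(candidates, key=lambda word: word_counts.get(word, 0))


# The classic "rn" versus "m" confusion, settled by made-up counts.
counts = {"modern": 50, "modem": 3}
best = pick_by_frequency(["modem", "modern"], counts)
tex = substitute_symbols("α + β")
```

Recognizing formatting and relative position (the boldface and superscript cases) is a different kettle of fish, since that information is gone by the time you have plain text.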
Andrew Brown 10.29.09 at 7:09 pm
Abbyy Finereader is pretty competent at all the things you want. PC only, so far as I know, but very quick, multilingual, and dead accurate on high quality scans. I have version 7, and have never felt any need to upgrade. So I don’t know if it has been bloated all to hell since then.
Cian O'Connor 10.30.09 at 3:05 pm
Abbyy Finereader is supposed to be the best OCR system out there, though it’s PC only. I have version 9 (I think), which is excellent. Version 10 is supposed to be better. Handles OCR, and converts PDFs to HTML/Word -> and is pretty reasonable with footnotes/diagrams/fonts. There’s also a cut price version for $50-60 which is supposed to be pretty decent for stuff that is just text, or not too complex layouts. The favourite of librarians and e-book pirates apparently.
Omni is pretty crap by all accounts, unless you’re an expert tweaker.
There’s some piece of freeware that I’ve lost (mostly I search the pirate sites for scans, as it’s easier than scanning in myself) which will clean up scans and so improve OCR. It’s part of the Linux OCR suite that uses Tesseract (which is okay for free, but if time is money…), and is very useful.
Incidentally, a LOT of academic books are out there on filesharing sites these days. I’m not sure what that’s an indicator of.
Cian O'Connor 10.30.09 at 3:20 pm
http://unpaper.berlios.de/
Is what I was thinking of. It will do a pretty good job of cleaning up PDF scans prior to OCR scans.
The following app has a version that works on windows, and probably the Mac as well:
http://www.mobileread.com/forums/showthread.php?t=21906
Ray Davis 11.02.09 at 3:06 pm
I do a fair amount of OCR-ing, and I invested in ABBYY FineReader many years and versions ago. One of its selling points is “trainability,” so I’m pleased to hear about Readiris’s support. For those of us drawn to older texts printed variably or in not-so-typical typefaces, that feature is well worth the extra tinkering. (You can see what happens without it in massive library trashings like the grant-funded “The Million Illegible Books Project” and the Google-funded whatchamacallit.)