Regexps Rule

by Kieran Healy on May 29, 2005

Regular readers will know that my list of indispensable applications includes the Emacs text editor, the TeX/LaTeX typesetting system, and a whole array of ancillary utilities that make the two play nice together. The goal is to produce beautiful and maintainable documents. Also it gives Dan further opportunity to defend Microsoft Office. I am happy to admit that a love of getting the text to come out just so can lead to long-run irrationalities. The more complex the underlying document gets, the harder it is to convert it to some other format. And we all know which format we mean.

Well, yesterday morning the long run arrived: I finished the revisions to my book manuscript and it was now ready to send to the publisher for copyediting. Except for one thing. The University of Chicago Press is not interested in parsing complex LaTeX files. They are quite clear about what they want, and it isn’t unreasonable. I had a horrible vision of spending weeks manually futzing with a book’s worth of formatted text. But thanks largely to the awesome power of regular expressions, or regexps, and the availability of free tools that implement them, the whole thing was pretty painless.

There is no truly satisfactory way to convert a LaTeX document to something that retains the document’s structure and formatting and that Microsoft Word can reliably read. By far the best option is to convert to HTML first, using something like Hevea. LaTeX and HTML are both markup languages and their specifications are publicly available (unlike Microsoft’s DOC format), so the conversion tools are good. If your LaTeX document was fairly straightforward, then Hevea would probably do the trick for you, even if you fed it a book manuscript. But if, like me, you use fancy-pants LaTeX stuff like Jurabib and the Memoir Class then you are out of luck, because Hevea doesn’t know about any of that. I wanted to keep the bells and whistles because they allowed for a sane approach to the structure of the document, especially the notes. The author-year citation method is terrible for notes in a book, as is the insane ibid/idem system. Jurabib can automatically create notes that cite the full reference the first time a work is cited and a shortened reference (the author’s surname and an abbreviated title, say) thereafter. So with that as a given, Hevea was out.

The Memoir document class is a superb piece of work, and one of the many things it can do is produce manuscript-style output: that is, something that looks like it was prepared on a typewriter, set ragged right and double spaced in a monospace font with no hyphenation. My original naive hope was that Adobe Acrobat could take the PDF output in this form and produce something serviceable using its “Save as RTF” or “Save as Microsoft Word” options. Acrobat dutifully saved the PDF as a Word file, all right. It took forever, but when I opened the result in Word it looked fantastic, like a carbon copy of the original. This is because it was a carbon copy. Acrobat had simply converted every page to an image and stuck the result in a Word file! I should have known better, I suppose. Getting a structured, editable document out of a PDF file seems like a hard task: it’s been likened to producing a live pig from a packet of sausages. The best Acrobat could do was save the text, without any formatting at all, or even any spaces between the words. Not very helpful.

Things weren’t looking good. The manuscript-like output was great; I just needed a way to get it into an editable format. Then I wondered whether the manuscript could be marked up so that the most important formatting was signaled in some obvious way. The most important formatting elements to keep were the paragraph breaks, section headings and the italics that made the notes and bibliography readable. No way was I manually typing any of those. They were all in a BibTeX database, anyway. A quick refresher course later, and the solution began to take shape. I redefined LaTeX’s section commands so that they had an appropriate callout signaling themselves (e.g., three equals signs) as required. I wrote simple expressions to do stuff like put three asterisks at the beginning of every paragraph and three exclamation points on either side of every italicized word.
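For anyone who wants to see roughly what those expressions look like, here is a minimal sketch in Python. It is not the set of expressions I actually ran; it assumes that italics are written as `\emph{...}` or `\textit{...}` with no nested braces, and that paragraphs in the source are separated by blank lines. The function names are made up for the illustration.

```python
import re

PARA_MARK = "***"    # goes at the start of every paragraph
ITALIC_MARK = "!!!"  # goes on either side of italicized text

# Assumption: italics appear as \emph{...} or \textit{...} without nested braces.
ITALIC_RE = re.compile(r"\\(?:emph|textit)\{([^{}]*)\}")


def mark_italics(tex):
    """Replace each italic command with its contents wrapped in ITALIC_MARK."""
    return ITALIC_RE.sub(
        lambda m: ITALIC_MARK + m.group(1) + ITALIC_MARK, tex
    )


def mark_paragraphs(tex):
    """Prefix every blank-line-separated paragraph with PARA_MARK."""
    paragraphs = re.split(r"\n\s*\n", tex.strip())
    return "\n\n".join(PARA_MARK + p for p in paragraphs if p.strip())


def add_markers(tex):
    """Apply both kinds of marker to a chunk of LaTeX source."""
    return mark_paragraphs(mark_italics(tex))
```

Calling `add_markers` on a chapter's source gives text where the structure survives as plain characters, which is all that matters once the formatting itself is thrown away.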

That just left the problem of the notes and bibliography. These are automatically generated and so are not available in the tex file anywhere to have formatting tags attached via a regexp. It turns out, though, that the remarkable BibTool has the ability to do regular expression operations on individual fields in your BibTeX database. It can do much else besides, too: it was able, for example, to extract all and only the references cited in the book, make a .bib file out of them, and then insert the relevant manual markup in the right places for books, articles and so on. I am constantly amazed by how much excellent free software there is in the world. Maybe the next book will be about that.
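BibTool has its own resource-file syntax for this, which I won't try to reproduce here. The idea of a regexp operation on a single field can be sketched in Python instead. This is only an illustration under some assumptions: each field sits on one line, its value is wrapped in braces, and the function name `mark_field` and the sample entry are invented for the example.

```python
import re


def mark_field(bib, field, mark="!!!"):
    """Wrap the braced value of one BibTeX field in a marker.

    Assumes the field and its value sit on a single line, as in
    `  title = {Some Title},`.
    """
    pattern = re.compile(
        r"^([ \t]*" + re.escape(field) + r"[ \t]*=[ \t]*\{)(.*)(\},?[ \t]*)$",
        re.IGNORECASE | re.MULTILINE,
    )
    return pattern.sub(
        lambda m: m.group(1) + mark + m.group(2) + mark + m.group(3), bib
    )


# A made-up entry, purely for illustration.
entry = "@book{example,\n  author = {Author, Ann},\n  title = {An Example Title},\n}"
marked = mark_field(entry, "title")
```

Which field gets the italics marker depends on the entry type (the title for a book, the journal name for an article), so in practice you would call something like this once per type.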

Having done all that, I re-created the manuscript in its new Frankenmarkup form. It was a big mess. But it had structure, which was the important thing. It turns out that Word has its own regular expression engine, and I was able to use that—along with the macro recording feature—to take out the thousands of useless paragraph symbols, reconstruct the real paragraphs and formatted text, put the section headings in, and reconstitute superscripted note numbering and formatted notes, and create the bibliography right down to the six hyphens preferred when citing multiple works by the same author. I did use Hevea for one thing—translating the four or five tables into HTML. Some of them were quite complex, but it worked perfectly. Word read them without a problem and its ‘Autoformat’ feature gave them a standard look.
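I did the reverse step inside Word, but the logic is easy to sketch outside it. The Python below is a stand-in for the Word find-and-replace passes, not a record of them. It assumes the text extracted from the PDF has a hard line break at the end of every line, `***` at the start of each real paragraph, and `!!!` around italic text; the `<i>` tags are just one convenient way to show where the italics go back.

```python
import re


def rebuild_paragraphs(text, mark="***"):
    """Drop hard line breaks and split the text at paragraph markers."""
    # Collapse every run of whitespace, including line breaks, to one space.
    # This is safe here because the manuscript output has no hyphenation.
    flat = re.sub(r"\s+", " ", text).strip()
    return [p.strip() for p in flat.split(mark) if p.strip()]


def restore_italics(paragraph, mark="!!!"):
    """Turn marker-delimited spans back into italic markup."""
    m = re.escape(mark)
    return re.sub(m + r"(.+?)" + m, r"<i>\1</i>", paragraph)


raw = "***First line of the\nparagraph with !!!Title!!!\ninside.\n***Second\nparagraph."
paragraphs = [restore_italics(p) for p in rebuild_paragraphs(raw)]
```

The section headings and note numbers work the same way: one distinctive marker going in, one search-and-replace coming out.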

The whole thing went far more smoothly, and far faster, than I had any right to expect. In retrospect, of course, I’d urge LaTeX dweebs to lay off the bells and whistles when producing long manuscripts. It’s a waste of time. But so is watching TV, and you probably do a lot of that, too. If you have a standard book or article with little in the way of add-ons, it should be relatively easy to convert it to HTML. (Daniel will be along momentarily in the comments to tell you to start using Word in the first place.) But even a complex document can be handled quickly and with relatively little fuss. The result is now sitting in a burn folder on my Desktop. The Word files look uglier than a NASCAR driver afraid of losing to a girl, but they conform to the submission guidelines and the whole thing is ready to be put in the post first thing after Memorial Day.

Update: Moving something up from the comments below: I should say that the people at Chicago have been absolutely great throughout the process, and I’ve known for ages that the Day of Reckoning would come when I had to convert the manuscript. I also know that, in the end, Chicago’s designers and typesetters will do a much better job than me when it comes to producing a good-looking book. They really know how to do it, whereas all my efforts to get TeX to do nice things are really just amateurish messing about.

{ 26 comments }

1

Jacques Distler 05.29.05 at 1:49 am

An academic publisher that doesn’t accept LaTeX documents?

Egad!

What you need, is not better RegExps, but a better publisher.

I could see a publisher saying: Thou Shalt use our DocumentClass. (Thus producing camera-ready copy with little effort on their part.) But to not accept LaTeX, and the huge labour-saving it would represent for them, is positively asinine.

Very, very 1980’s!

2

Keith M Ellis 05.29.05 at 3:04 am

It would be hard to overstate the advantages of being adept with regexes. As Kieran demonstrates, it’s not just computer scientists and IT people who will greatly benefit.

I highly recommend O’Reilly’s book, Mastering Regular Expressions.

3

lakelobos 05.29.05 at 3:27 am

Oh God, Kieran is a hacker and we were so optimistic. You are a better person than most of us, but please don’t have an entry extolling the value of fountain pens or plumes.

4

bi 05.29.05 at 4:03 am

If I were faced with the same requirements, I’d probably write in HTML and leave it at that…

5

Chris 05.29.05 at 4:06 am

Not my experience of converting from PDF to Word, but then I wasn’t using the Memoir documentclass. The only real trouble I had was that quotation marks can come out funny and ligatures get screwed. But I was only trying to fix a 7000 word article, an altogether easier proposition.

6

tlr 05.29.05 at 5:11 am

Have a look at Bert Bos’ HTML utilities if you ever contemplate writing large documents directly in HTML: http://www.w3.org/People/Bos/#htmlutils

7

Kieran Healy 05.29.05 at 8:29 am

The only real trouble I had was that quotation marks can come out funny and ligatures get screwed.

To avoid this, compile the LaTeX document in a monospace “typewriter” font: then you have no ligatures. It’s easy to search and replace the quotation marks.

8

bza 05.29.05 at 10:19 am

Ligature problems usually arise when you get from LaTeX to Word via dvips and ps2pdf. If you use pdflatex to produce PDF directly you shouldn’t need a work-around.

9

Chris 05.29.05 at 10:30 am

Oh and thanks for drawing my attention to memoir. I’ve already been playing around with it … a great improvement.

10

EKR 05.29.05 at 11:12 am

Kieran,

A few comments from someone who’s been through this process before (with Addison-Wesley, not University of Chicago). If you want to work in something like TeX or Troff, it’s best to simply produce camera-ready submissions. This isn’t enormously fun, but it has a number of advantages:

1. It’s the only way you’ll get the final book to look exactly the way you want it to look. This is especially true if you want figures, which the publisher is very likely to botch in translation.

2. If you produce camera ready, then you can typically provide the publisher with the copy-editable ms. on paper and get a traditional pen-on-paper copy-edit. I actually prefer this.

3. Once the publisher has started the copy-edit/proofread/typeset cycle they’re under enormous time pressure. If they control the document, they’ll just make whatever decisions are most convenient. If they need you to produce the final copy-edited version, you get a lot more control at this stage. In my case, I found a last minute mistake which the publisher only let me fix because I was able to make sure it only affected two or three pages, rather than flowing through to the entire chapter. I would never have been allowed to do this if I didn’t control this phase of production.

11

Adam Kotsko 05.29.05 at 11:38 am

I’m skeptical that he’s going to find a “better” publisher than University of Chicago Press.

12

Kieran Healy 05.29.05 at 2:24 pm

I should say that the people at Chicago have been absolutely great throughout the process, and I’ve known for ages that the Day of Reckoning would come when I had to convert the manuscript. I also know that, in the end, Chicago’s designers and typesetters will do a much better job than me when it comes to producing a good-looking book. They really know how to do it, whereas all my efforts to get TeX to do nice things are really just amateurish messing about.

13

lago 05.29.05 at 3:26 pm

For less ambitious projects, I’ve used latex2rtf combined with Word’s regular expression engine. Might be a bit more accessible to the less ambitious hacker.

14

nick 05.29.05 at 4:35 pm

If you want to work in something like TeX or Troff, it’s best to simply produce camera-ready submissions.

Big University Presses (or at least the humanities/soc-sci depts thereof) really don’t want camera-ready copy. They have their One True Format, and your MS is going to be marked up into it by professional typesetters whether you like it or not. (And by the time you see the galleys, you’ll probably like it.)

15

Nick Caldwell 05.29.05 at 6:16 pm

On a slight tangent, one of the big problems I had with using LaTeX was bibliographic. As best as I can tell, the MLA style isn’t well supported in LaTeX-land, and, in any case, god help you if you want to be able to record other media like film or television in your bibliography. The — what was it? — “other” category doesn’t cut it. I messed around with a couple of different extension packages, including one that promised to generate a custom bib file, but no real luck. I ended up generating the damned thing manually.

16

ben wolfson 05.29.05 at 6:29 pm

I think the U of C journals all want, or at least accept, LaTeX submissions. At least job requirements for working in production on them include a familiarity with both it and SGML; you have to clean up submitted .tex files or some such.

Kotsko: you would say that.

17

Kieran Healy 05.29.05 at 7:29 pm

I think the U of C journals all want, or at least accept, LaTeX submissions.

Yes, I’ve done this with them. As you say, they just want you to submit a cleaned-up file.

18

Lakshmi Sadasiv 05.29.05 at 8:37 pm

“the U of C journals all want, or at least accept, LaTeX submissions. At least job requirements for working in production on them include a familiarity with both it and SGML; you have to clean up submitted .tex files or some such.”

This is true; I worked at the University of Chicago Press (a few years back) and all of the journals are produced via sgml files. AASTeX was the format of choice though we could take Word and WordPerfect files as well. The last project I did there was a book that was going to be produced via the sgml-based workflow and I seem to recall that the press’s business plan was to integrate the books division into the sgml workflow. I wonder what happened.

It could just be that you got a dumb and/or lazy editor who couldn’t walk over to the other side of the building and ask if there was a way to accept LaTeX files. (I am several years out of date as to how exactly the file translation process runs there now but from what I remember you could run the chapters as AASTeX files and include the auxiliary files to make the cross references).

This in no way is to say that regexps are not the best thing ever. Of course, Word still sucks with regard to mathematics (MathType and Equation Editor are both stupid and cumbersome and don’t deal well with regexps) so it is worth it to argue with your publisher about accepting LaTeX files. (If they really don’t like it, let them send it to the third world to be rekeyed like they do all the time anyway.)

19

EKR 05.29.05 at 9:14 pm

Vis-a-vis professional typesetting….

Interesting. The commercial houses (Pearson, Wiley, etc.) all seem quite happy with camera-ready, even in the professional series. And clearly some of those come out looking very nice (cf. The Art of Computer Programming, or the Stevens books). At least to my eye, those books look as good as or better than those produced by professional typesetters. For example, I far prefer the typesetting in TCP/IP Illustrated (done with groff) to that in the Chicago Manual of Style. Of course, this is clearly a taste thing, and I’m fanatical about having control of the look and feel, so I was willing to put in the time to get exactly the layout I wanted.

20

ben wolfson 05.31.05 at 10:42 am

This in no way is to say that regexps are not the best thing ever.

Then there’s Jamie Zawinski’s observation: Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

21

Tim McGovern 05.31.05 at 4:18 pm

I’m the dumb and/or lazy guy (no offense taken, really!) at the U of C Press who’s putting Kieran through all the separation anxiety of leaving his beloved LaTeX. I’ve got to say thanks, first on behalf of the Press for the props, and second, for your insight on our electronic manuscript requirements.

A bit of an apologia, as well, for those requirements:

There certainly is a big dichotomy between the requirements and capabilities of the Books and Journals divisions. The determining factor in all of this is the editing process, not production (typesetting and printing).

There was a great deal of talk, as Lakshmi says, about bringing books’ editing and production processes more in line with journals’ when I first started working at the Press (~5 years ago), but the practicalities of doing so quickly overcame the cool factor of “Online editing! SGML!”

I can only really speak to the Books end of this process, but what I can tell you is that our manuscript editors are words people, not computer people, and it’s a fairly rare individual who is turned on by both (though becoming less so!).

Kieran’s editor will edit on screen, generally in Word because it’s what she’s familiar with, but also because it’s what most (I’d say about 80% in the fields I work in — history, sociology, rhetoric, and sex) of our authors use.

Inertia and the Lowest Common Denominator factor are not the only reasons we use Word. We use Word’s “track changes” to produce a redlined hard copy manuscript, which the author gets to review, and accept/tweak edits (small edits, e.g., spelling, are made transparently). Then edits are finally accepted/rejected here by the manuscript editor.

You’ll note that this process takes a lot out of the hands of the author: final production of the electronic manuscript for the typesetter is the responsibility of the manuscript editor (to say nothing of production of typeset pages!). Some authors (ekr?) would love to have complete control over the final product, but we in the publishing business like to think we add value beyond our imprint, binding, and distribution!

Now I hie me back to work, getting ready for Kieran’s ms arrival.

22

Jonathan Goldberg 05.31.05 at 4:44 pm

I get a bit of a kick out of Friedl (the author of the excellent Mastering Regular Expressions, recommended above) saying (on page 30) “When it comes right down to it, regular expressions are not difficult.” And then writing a 430 page book to prove it.

Although, bizarrely enough, I know what he means…

23

George Colpitts 06.01.05 at 4:46 pm

Very interesting, particularly along with the previous post on indispensable applications. I hate Word so much that I have learned some LaTeX, as Word has a mind of its own and I can never get it to do what I want. Nevertheless it would be really nice if I could learn to use Word in a better way. Can somebody please post a good reference on how to use Word properly so it doesn’t drive me crazy?

24

teekay 06.02.05 at 1:42 am

Tim points to a critical shortcoming of LaTeX and its various front ends — collaborative drafting and editing. The easiest workaround is to produce in LaTeX and share the PDF output, which can then be marked up using Preview (on Mac) or Acrobat Professional. This is unsatisfactory for a whole bunch of reasons: (i) the markup then needs to be manually transferred back to the source file, which is fine for a small article with only one back and forth between author and editor (or between two authors) but not for a book with frequent iterations of the process; (ii) the markup tools even in Acrobat are rather primitive — there’s no equivalent of Word’s track changes; (iii) I resent the fact that in order to add functionality to an open source application I need to buy an expensive and unwieldy proprietary solution like Acrobat.

An additional problem is that Word is almost universal in the sense that the vast majority of people who use computers have access to it and have a basic idea of how to use it, while LaTeX might be rather offputting to learn. In other words, I can’t ask my co-author to spend days learning LaTeX just because I happen to prefer it to Word…

Any thoughts by people who use LaTeX in such collaborative contexts?

25

theCoach 06.02.05 at 8:56 am

Office 12 formats are pure XML, open and with a royalty free license, and updates will be released so that earlier versions can read the new format.
Thought you might want to know that.
http://www.microsoft.com/presspass/features/2005/jun05/06-01XMLFileFormat.asp

26

ben wolfson 06.02.05 at 11:28 am

teekay, you could go all out and host your .tex files on a CVS or Subversion server. Segregation of duties! Let the version tracker track versions, the editor edit text, and the typesetting program handle layout. No need for one program to do all the jobs.

Comments on this entry are closed.