
Convert Many Word Documents to ASCII At Once

In a recent tip, we showed you how to convert a Microsoft Word document to plain ASCII text using a Cygwin command called CATDOC. At the end of that article, we promised to show you how to do multiple conversions in one fell swoop. Well, hang on tight, because this will go lickety-split.

This job requires the help of a small shell script, as follows:

#!/bin/bash

# Finds all Word document files (i.e. named
# *.doc) at or below the current folder and
# converts each one to plain ASCII text. The
# text goes into a new file named by adding
# '.txt' to the original name. For example,
# Research.doc yields Research.doc.txt (if
# Research.doc.txt already exists, it will be
# overwritten). The original Word document is
# left alone.
# Caveat: file or folder names that contain
# spaces are not handled by this simple loop.

for f in $( find . -name '*.doc' ); do
    catdoc -w $f > $f.txt
done


Open your favorite text editor, start a new file, and paste the above text into it. Save the file in your Cygwin home directory under the name “catdoc_all”, or whatever name you like. (On my system, I installed Cygwin in C:\sys\cygwin, so my home directory is C:\sys\cygwin\home\craig.) At this point, you should be all set.

Running the Script: To run the conversion, first use the CD command to navigate to the parent folder that contains all of the Word documents to be converted. Then run your new script by typing “. ~/catdoc_all” (without the quotes). That’s a period, followed by a space, followed by a tilde, followed by a slash, followed by the name of your script. Hit Enter, and you’re done. Ta da! It’s that easy.

If you care to know why this works, then keep reading.

The Source Command: When a period stands alone as the first word on a command line, Cygwin (bash) treats it as a shortcut for the “source” command. That command names a script file whose commands are to be read and executed in the current shell session, rather than in a separate process. (A period elsewhere in a command line usually refers to the current directory.) The tilde refers to your home directory, and the slash is, of course, the standard separator between a directory and a file name. So this tells Cygwin to read the contents of the catdoc_all file located in your home folder and execute the commands therein.
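To see what "executed in the current shell" means in practice, here is a minimal sketch. The throwaway file name and the "greeting" variable are made up for illustration; they are not part of the catdoc_all script.

```shell
# Start from a clean slate so the comparison is fair.
unset greeting

# Create a tiny throwaway script (hypothetical name) that sets one variable.
tmpdir=$(mktemp -d)
demo="$tmpdir/set_greeting"
echo 'greeting="hello from the sourced file"' > "$demo"

# Run it as a separate process: the variable is set in a child shell,
# which exits right away, so nothing is left behind here.
sh "$demo"
echo "after running as a program: '${greeting:-}'"

# Source it with the period: the commands run in the current shell,
# so the variable is still set afterwards.
. "$demo"
echo "after sourcing:             '$greeting'"

rm -rf "$tmpdir"
```

The same thing happens with catdoc_all: because it is sourced, its for-loop runs right in your interactive session.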

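How the Script Works: The first line of the script (#!/bin/bash) asks for the bash shell to be the one that interprets this script. It only takes effect if you make the script executable and run it directly by name. When you run the script with the source command, as we do here, the current shell reads the file itself and treats that first line as just another comment, so the line is optional.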

The lines that begin with a pound sign (#) are merely comments. bash ignores those.

The meat of this script consists of three parts: a FIND command (“find . -name '*.doc'”) that locates all of the Word document files to be converted, a for-loop (“for f in $( … ); do … done”) that iterates over those findings, and the CATDOC command (“catdoc -w $f > $f.txt”) that processes each one.

When the shell interpreter sees a for-loop, it looks ahead to the “in” part, in this case the find command wrapped in $( … ), and executes that first. The find command, as written here, searches the current directory (.) and everything below it for all files with a name that matches the pattern “*.doc”. The result, a potentially long list of specific filenames, is then processed by the for-loop. (bash splits that list into separate items wherever it sees a space or a line break, so a name that contains a space would be mistaken for two names.) The “f” specified in the for-loop (i.e. between the “for” and the “in”) is known as a control variable. We named ours “f” (short for “filename”), but we could have named it anything. Each time through the loop, this “f” variable takes on the value of the next filename in the list. In other words, it is a placeholder for the real filenames.
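One limitation of feeding find's output straight into a for-loop is that the shell breaks the list apart at every space, so a folder called "my research" would not survive the trip. The following is only a sketch of one common workaround, not the script from this tip: it reads find's output one line at a time and quotes the variable. The scratch folder, its file names, and the stand-in catdoc function are assumptions made so that the sketch can run anywhere.

```shell
# Hypothetical stand-in so the sketch runs even where catdoc is missing.
# Where a real catdoc is on the PATH, that one is used instead.
if ! command -v catdoc >/dev/null 2>&1; then
    catdoc() { cat "$2"; }   # fake: ignores -w ($1), copies the file ($2)
fi

# Scratch tree with a space in a folder name (names are illustrative).
work=$(mktemp -d)
mkdir -p "$work/my research"
echo "placeholder" > "$work/my research/data.doc"

# Read one filename per line instead of one per word, and quote "$f"
# so that the space inside the name is kept intact.
(
    cd "$work" &&
    find . -name '*.doc' |
    while IFS= read -r f; do
        catdoc -w "$f" > "$f.txt"
    done
)

ls "$work/my research"
```

This variant still assumes that no filename contains a line break, which is rare in practice.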

Finally, the “$f” notation within the CATDOC command (“catdoc -w $f > $f.txt”) tells bash to replace the placeholder with whatever filename the f variable currently holds.

For example, say that the current folder has two subfolders called “research” and “publish”, and that each one holds a single Word document: “data.doc” and “results.doc”, respectively. Then the find command will find the two files (./research/data.doc and ./publish/results.doc; find puts the starting directory, “.”, in front of each name) and hand those two names off to the for-loop. The for-loop will therefore act twice, once for each name, and each time it executes the CATDOC command. Suppose find lists research first. The first time through, the f variable is set to the first of the two names (./research/data.doc). Just before the command executes, the two places that specify $f get replaced with the actual filename. So, what actually gets executed is “catdoc -w ./research/data.doc > ./research/data.doc.txt”. The for-loop then repeats in order to process the second name: “catdoc -w ./publish/results.doc > ./publish/results.doc.txt”.
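If you would like to watch the substitution happen without converting anything, a dry run along these lines can help. It is a sketch under a few assumptions: the scratch folder is created just for the demonstration, the "plan" variable is a made-up name, and a sort step is added only to make the order predictable (find itself promises no particular order).

```shell
# Build a scratch tree that mirrors the example above.
work=$(mktemp -d)
mkdir -p "$work/research" "$work/publish"
touch "$work/research/data.doc" "$work/publish/results.doc"

# Same loop as the script, except that it echoes each command
# instead of running it, so catdoc is never actually called.
plan=$(
    cd "$work" &&
    for f in $( find . -name '*.doc' | sort ); do
        echo "catdoc -w $f > $f.txt"
    done
)

printf '%s\n' "$plan"
rm -rf "$work"
```

Each printed line is one pass through the loop, with the filename (including its leading “./”) filled in at both places where $f appears.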

And there we have it.

