Perhaps a better title for this tip would have been, “Automatic Typographical Standards Compliance with SED”, but I wasn’t sure how to spell “Typographical” (just kidding). Let me explain what I’m talking about and then you can decide.
If you are an editor, marketeer, webmaster, technical writer, or copywriter, then your job may include making sure that submitted text complies with typographical standards. For example, do you often see misspellings of proper names like “Microsoft” and “Photoshop”? Perhaps they appear as two words (Photo Shop), or they are incorrectly capitalized (PhotoShop)? Well, here’s a trick you can use to enforce compliance by filtering the incoming text and automatically replacing all of the bad forms. This trick works for any kind of ASCII file:
- Web pages (HTML, PHP)
- Raw text (*.TXT)
- XML data files, or any other type of “markup” (*.XML)
- Flat data files – comma separated or tab separated (*.CSV)
- Rich Text (*.RTF)
By the way, this trick also works for fixing up an entire body of existing text — for example, a whole website that is acquired through a merger, say.
The SED command: SED is an extremely powerful UNIX tool for filtering ASCII files. (For Windows users, see the link below for our tip on using CygWin to execute UNIX commands on Windows.) SED stands for “stream editor”. In short, that means that it can be automated as part of a shell script without the need for a GUI, and in fact, you’ll find just such a shell script below that you can use as a starting point. The power of SED is that, among other things, it allows for a whole set of substitution pairs to be executed at one time, like this:
s/MicroSoft/Microsoft/
s/Micro Soft/Microsoft/
s/Micro soft/Microsoft/
s/PhotoShop/Photoshop/
s/Photo Shop/Photoshop/
s/Photo shop/Photoshop/
The leading “s” stands for “substitute.” It says to find occurrences of whatever appears between the first two slashes and replace them with whatever appears between the second and third slashes. Matches are normally case-sensitive, which is why the third and sixth lines exist as necessary variations of the second and fifth, respectively. Note that the trailing slash is required.
Running SED: There are a number of different ways to run SED. The most convenient way to do it when a list of multiple commands is involved (as we are doing here), is to place those commands in their own ASCII file, which we’ll refer to as a “SED script.” We’ll name this script “comply.sed” and place it in our home folder. (In my case, I’m using CygWin on Windows which is installed at C:\sys\cygwin, so my home folder is C:\sys\cygwin\home\craig). Remember: the UNIX shortcut to refer to your home folder is “~/”, or in this case “~/comply.sed”.
The actual command line to invoke SED looks something like this:
sed -f ~/comply.sed submission.txt > revised.txt
The -f switch tells SED that the next argument is the filename of the sed script to use. After that is the name of the file to process (submission.txt). The “> revised.txt” portion of the command line tells the command shell to take the output from SED and send it to a file called revised.txt. In other words, the SED command, when invoked as shown, makes a copy of submission.txt as revised.txt, applying the substitutions that are listed in ~/comply.sed as it makes the copy.
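To try the -f form without touching real files, here is a minimal sketch that runs in a throwaway directory. The script contents and the sample sentence are made up for illustration, and the script is a cut-down version of the list shown earlier.

```shell
# Throwaway directory so nothing real is modified.
workdir=$(mktemp -d)

# A shortened sed script (illustrative contents only).
cat > "$workdir/comply.sed" <<'EOF'
s/MicroSoft/Microsoft/
s/Micro Soft/Microsoft/
s/PhotoShop/Photoshop/
s/Photo Shop/Photoshop/
EOF

# A made-up submission containing two of the bad forms.
printf '%s\n' 'Micro Soft Word and PhotoShop are installed.' > "$workdir/submission.txt"

# Copy submission.txt to revised.txt, applying every substitution on the way.
sed -f "$workdir/comply.sed" "$workdir/submission.txt" > "$workdir/revised.txt"

cat "$workdir/revised.txt"
# prints: Microsoft Word and Photoshop are installed.
```

The original submission.txt is left exactly as it was; only revised.txt carries the corrections.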
Alternatively, instead of creating a second file (revised.txt) that contains the substitutions, you can use SED’s “-i” switch to have it edit the original file in place:
sed -i -f ~/comply.sed submission.txt
Furthermore, this in-place editing mode can be told to make a backup copy of the original file first. To do this, specify a filename extension along with the -i switch, like so:
sed -i.bak -f ~/comply.sed submission.txt
(Important: Note that there is no space between the “-i” and the “.bak”.) Finally, the SED command can be combined with the FIND command in order to process multiple source files at one time.
find . -name "*.txt" -exec sed -i.bak -f ~/comply.sed {} \;
This says, starting in the current directory (.) and descending into any subdirectories, find all files with names that match the pattern “*.txt” and, for each one found, do whatever appears between the “-exec” and the “\;”. The pattern is quoted so that the shell passes it to FIND intact instead of expanding it first. As you will note, when we inserted our SED command there, we changed the specific “submission.txt” reference to a pair of braces ({}). That is special notation for the FIND command. It’s a placeholder for the found file that is currently being processed. (We talked about the find command before in our tip about converting many Microsoft Word documents to plain ASCII at once. See below.)
As promised, to make running SED more convenient, we can place the combination find/sed command in a shell script. For this example, I named the shell script file “comply” (no extension) and saved it to my home folder. Thus, instead of having to remember to type “find . -name "*.txt" -exec sed -i.bak -f ~/comply.sed {} \;”, I can just type “~/comply”.
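Here is a small sketch of the find/sed combination run against a scratch directory tree. The directory layout and file contents are invented for the demonstration. It uses FIND's -name test with a quoted pattern, which is what makes FIND descend into subdirectories and do the name matching itself. It also assumes GNU sed (as shipped with CygWin and Linux), since the -i.bak spelling is a GNU convention.

```shell
# Scratch tree with one file at the top level and one in a subdirectory.
workdir=$(mktemp -d)
mkdir -p "$workdir/site/docs"

printf '%s\n' 's/PhotoShop/Photoshop/' > "$workdir/comply.sed"
printf '%s\n' 'Made with PhotoShop.'   > "$workdir/site/index.txt"
printf '%s\n' 'PhotoShop tips.'        > "$workdir/site/docs/tips.txt"

# Quote the pattern so the shell hands "*.txt" to find unexpanded.
# {} is replaced by each file found; \; ends the -exec command.
find "$workdir/site" -name '*.txt' -exec sed -i.bak -f "$workdir/comply.sed" {} \;

cat "$workdir/site/index.txt" "$workdir/site/docs/tips.txt"
# prints:
# Made with Photoshop.
# Photoshop tips.
```

Each edited file also gets a sibling with the .bak extension (index.txt.bak, tips.txt.bak) holding the text as it was before the edit.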
Special Symbols: So far, we’ve only shown the substitution of exact literals. But each pattern being matched is actually a regular expression. We just have not done anything to take advantage of this yet. In regular expressions, most punctuation symbols have a special meaning. These include period (.), question mark (?), plus (+), asterisk (*), backslash (\), square brackets ([]), braces ({}), and others. (One caveat: in SED’s default “basic” syntax, the question mark, plus, and braces are already literal when written plain, so they need escaping only in extended mode, sed -E.) In order to include any of the special symbols in a pattern and have it matched literally, character escaping can be used. Simply precede the punctuation symbol with a backslash. This goes for escaping a backslash as well. (In other words, use a double backslash to match a single backslash.)
s/sf\.net/sourceforge.net/
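To see why the backslash before the period matters, compare an unescaped and an escaped pattern on a made-up input where the period's place is taken by another character. The sample sentences are invented for illustration.

```shell
# Unescaped: the period matches ANY character, so "sfxnet" is rewritten too.
printf '%s\n' 'visit sfxnet today' | sed 's/sf.net/sourceforge.net/'
# prints: visit sourceforge.net today

# Escaped: only a literal period matches, so this line passes through unchanged.
printf '%s\n' 'visit sfxnet today' | sed 's/sf\.net/sourceforge.net/'
# prints: visit sfxnet today

# The escaped pattern still catches the real abbreviation.
printf '%s\n' 'visit sf.net today' | sed 's/sf\.net/sourceforge.net/'
# prints: visit sourceforge.net today
```

Only the pattern side needs the escape. In the replacement, a period is just a period.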
Whole Words: As we just mentioned, most punctuation symbols have a special meaning and need to be escaped if they are to be treated literally. Conversely, letters and numbers are treated literally by default, but gain special meaning when escaped. For example, “\n” says not to match on the letter n, but instead to match on a newline (the break between two lines). Similarly, “\t” matches a tab, rather than the letter t.
One such special regex element that comes in handy here is “\b”, which ensures that we only replace whole words. “\b” stands for word boundary. What that means is that \b doesn’t actually match any text, per se. It’s one of a class of patterns that matches an invisible nothing. In this case, it matches any transition from a word character (a letter, digit, or underscore) to a non-word character, or vice versa. Say, for example, we are concerned that somebody might submit text that says, “Photo shopping is fun.” Well, using our comply.sed script as shown above, that sentence would be changed to “Photoshopping is fun.”, thanks to the sixth command. To prevent that, we would change that command to look like this:
s/Photo shop\b/Photoshop/
This means that “shop” has to be a whole word, not the start of a longer word, otherwise it won’t match. To be even safer, we might add \b at the beginning of our pattern, to ensure that Photo is a whole word as well.
s/\bPhoto shop\b/Photoshop/
This is especially important when matching on acronyms or other tiny bits of text that could easily be found embedded in longer words. For example, if it’s your standard to always spell out Microsoft instead of using the abbreviation MS, then the s-command should look like this:
s/\bMS\b/Microsoft/
Otherwise, if you leave out the two “\b”s, then you’ll find the word Microsoft showing up in plenty of strange places (e.g. “ARMS & LEGS” becoming “ARMicrosoft & LEGS”).
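The effect of the word-boundary markers can be checked directly from the command line. This sketch assumes GNU sed, where \b is available as a regex extension, and the sample phrases are the ones discussed above plus one invented phrase ("MS Office").

```shell
# Without \b, "MS" is found inside "ARMS".
printf '%s\n' 'ARMS & LEGS' | sed 's/MS/Microsoft/'
# prints: ARMicrosoft & LEGS

# With \b on both sides, "MS" must stand alone, so nothing changes here.
printf '%s\n' 'ARMS & LEGS' | sed 's/\bMS\b/Microsoft/'
# prints: ARMS & LEGS

# A standalone "MS" is still replaced.
printf '%s\n' 'MS Office' | sed 's/\bMS\b/Microsoft/'
# prints: Microsoft Office

# "shop" inside "shopping" no longer matches.
printf '%s\n' 'Photo shopping is fun.' | sed 's/\bPhoto shop\b/Photoshop/'
# prints: Photo shopping is fun.
```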
Catching All Occurrences: There’s a necessary option to the s-command that we have not discussed yet. The default behavior for the s-command is to stop at the first occurrence that it finds on each line (paragraph). If you want it to carry on and continue looking for additional matches on the line, then you have to add the letter g (which stands for searching “globally”) to the tail end of the s command, like this:
s/PhotoShop/Photoshop/g
Aha! Now it becomes clear why a third slash delimiter is necessary. It separates the replacement value from the option(s).
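A quick way to see the difference the g option makes is to feed SED a line with two occurrences. The sample sentence is made up.

```shell
# Without g: only the first match on the line is replaced.
printf '%s\n' 'PhotoShop files open in PhotoShop.' | sed 's/PhotoShop/Photoshop/'
# prints: Photoshop files open in PhotoShop.

# With g: every match on the line is replaced.
printf '%s\n' 'PhotoShop files open in PhotoShop.' | sed 's/PhotoShop/Photoshop/g'
# prints: Photoshop files open in Photoshop.
```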
Case Insensitive Matches: Another option that we can add to the end of an s-command is the letter “i”, which stands for insensitive. The “g” and the “i” can be in either order. Be aware that, not only will the following change “PhotoShop” to “Photoshop”, it will also change “PHOTOSHOP” to “Photoshop”, which is probably not desired.
s/photoshop/Photoshop/gi
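The following sketch shows how broadly the case-insensitive form reaches. It assumes GNU sed, where the i option on the s-command is supported as an extension. The input line is invented.

```shell
# Every capitalization variant is normalized, including the all-caps one.
printf '%s\n' 'PHOTOSHOP, photoshop and PhotoShop' | sed 's/photoshop/Photoshop/gi'
# prints: Photoshop, Photoshop and Photoshop
```

If all-caps headings need to survive, sticking to explicit case-sensitive variants is probably the safer choice.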
Alternative Delimiters: What if the item you want to change contains a slash? There are two ways to handle this. One is to use character escaping by placing a backslash in front of any slash that is to be taken literally. The other is to use a different character than slash for the delimiters. It turns out that whatever character immediately follows the “s” in the first delimiter position is assumed to be the same character that appears in the second and third delimiter positions. A common choice is the pipe character (|). For example, these two s-commands are functionally equivalent:
s/him\/her/him/
s|him/her|him|
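Both spellings can be tried on the same made-up sentence to confirm that they behave identically:

```shell
# Slash delimiters, with the literal slash escaped by a backslash.
printf '%s\n' 'Ask him/her for help.' | sed 's/him\/her/him/'
# prints: Ask him for help.

# Pipe delimiters, so the slash needs no escaping.
printf '%s\n' 'Ask him/her for help.' | sed 's|him/her|him|'
# prints: Ask him for help.
```

The pipe form tends to be easier to read once a pattern contains more than one slash, as in file paths or URLs.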
Working with Microsoft Word Documents: If the text you want to standardize is already in a Word document, you can use a technique we described earlier to convert the document to plain ASCII (see below), and run SED against that. Then, use a comparison tool such as WinMerge to compare the before and after versions to see if anything actually changed. You would have to manually re-create those changes in the Word document, but at least this will show you exactly what needs changing and where. Alternatively, instead of using SED, you could write VBA code that executes directly in Microsoft Word, but that’s a completely different topic.
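If a graphical compare tool is not at hand, the standard diff command can stand in for it. This is only a sketch of the before/after comparison step, not of the Word conversion itself. The file names and contents are hypothetical.

```shell
workdir=$(mktemp -d)

# Pretend this is the plain-text export of a Word document.
printf '%s\n' 'Edited in PhotoShop.' 'No issues on this line.' > "$workdir/before.txt"

# Apply a substitution, writing the result to a second file.
sed 's/PhotoShop/Photoshop/g' "$workdir/before.txt" > "$workdir/after.txt"

# diff exits non-zero when the files differ, which is the expected case here.
diff "$workdir/before.txt" "$workdir/after.txt" || true
# prints:
# 1c1
# < Edited in PhotoShop.
# ---
# > Edited in Photoshop.
```

The report lists only line 1, which is the one place that would need to be corrected by hand in the Word document.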
Download Links:
Related articles: