RegEx Pattern Matching on Dates

RegEx is a common abbreviation for “regular expression.” If you have had any exposure at all to regular expressions, then you may be under the impression that regular expressions are complicated and hard to write. While it is true that they can get complicated quickly, it is also true that it doesn’t take much to get started with regular expressions, and that simple regular expressions can be quite powerful.

Here’s a for-instance. Say that you have a log file where all of the entries are dated, and that you want to process the log file in some way, but the dates interfere with that process. Perhaps you want to reduce a copy of the log file down to one example of each different message (which can be done with the UNIX commands sort and uniq, since uniq by itself only drops adjacent duplicate lines, or with the TextPad sorting option that drops duplicate lines). Or perhaps you have two such log files that you want to compare, as shown here:

04/01/07 23:44 QC Job Started
04/01/07 23:44 Preprocessing 118 orders
04/01/07 23:45 Preprocess complete
04/01/07 23:45 Optimization Started
04/01/07 23:49 Optimization Complete
04/01/07 23:49 Allocation Started
04/02/07 00:14 Allocation Complete
04/02/07 00:14 Job Complete

vs.

04/04/07 23:32 QC Job Started
04/04/07 23:32 Preprocessing 139 orders
04/04/07 23:33 Error PO# 3425 - Invalid SKU: 89265
04/04/07 23:34 Preprocess complete
04/04/07 23:34 Optimization Started
04/04/07 23:38 Optimization Complete
04/04/07 23:38 Allocation Started
04/04/07 23:54 Allocation Complete
04/04/07 23:54 Job Complete

As they are, if you were to bring these two files up in a compare tool such as WinMerge, you wouldn’t learn much. Because the dates and times are different, the compare tool would not find any common ground. It would only report that the entire first file is different from the entire second file.

[Image: Date_Compare_1.jpg]

So, let’s perform a search and replace on the two files to take the uniqueness out of the time stamps by replacing anything that resembles a date and time with the words “date time”. (Exactly how to do that is what I will demonstrate in a moment.) Now, when we compare the two files, the significant differences pop right out (139 orders vs. 118 orders, and the error in PO number 3425).

[Image: Date_Compare_2.jpg]

By the way, yes, in this particular example we could have accomplished the same thing by merely deleting all of the dates and times, and since they happen to be lined up nice and neat in a perfect column, a column-select operation would have done the job. You’ll just have to imagine that the log files are “dirtier” than these tiny examples.

[Image: Date_Replace.jpg (TextPad search and replace dialog)]

So, what is the regular expression pattern we’ll need to match on these timestamps? First of all, you may want to review the article we posted last week entitled “Instant Gratification for Regular Expression Writers.” In that article, we gave an example of a regex pattern that finds any word that begins with the letter P and ends with the letter R. (Remember “Peter Piper picked a peck of pickled peppers”?) At the end of the article there is a blow-by-blow description of how that example pattern was constructed. Well, here we are going to do essentially the same thing, except we’ll be matching on digits rather than letters of the alphabet. The simplest regex pattern that will do the job here is:

[0-9]*/[0-9]*/[0-9]* [0-9]*:[0-9]*

The pattern element “[0-9]” matches any single digit in the range of zero to nine. An asterisk means that the preceding pattern element can be repeated any number of times (or even zero times for that matter). The slashes within the dates, the space between the date and time, and the colon within the time are all matched literally. (In other words, those three characters do not have a special meaning within the regular expression, the way that the square brackets and the hyphens do.) All of the following are examples of strings that will match this pattern:

04/04/2007 23:32
04/04/07 23:32
04/04/07 3:32
4/4/07 23:32
4/4/07 43:32
4/4/107 23:32
999999/999999/999999 999999:999999
// :
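
If you’d like to verify that list without opening an editor, here is a minimal sketch in Python (our choice of language for illustration; nothing in the examples depends on it), using the standard re module. The names LOOSE and candidates are made up for this sketch.

```python
import re

# The first pattern from the article: every numeric field is "zero or more digits".
LOOSE = re.compile(r"[0-9]*/[0-9]*/[0-9]* [0-9]*:[0-9]*")

candidates = [
    "04/04/2007 23:32",
    "04/04/07 23:32",
    "04/04/07 3:32",
    "4/4/07 23:32",
    "4/4/07 43:32",
    "4/4/107 23:32",
    "999999/999999/999999 999999:999999",
    "// :",
]

# fullmatch() succeeds only if the whole string fits the pattern,
# not merely some part of it.
loose_results = {s: LOOSE.fullmatch(s) is not None for s in candidates}

for s, matched in loose_results.items():
    print(f"{s!r}: {'match' if matched else 'no match'}")
```

Every string in the list should be reported as a match, including the ones that are clearly not sensible dates.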

Notice that the second half of the list consists of strings that are not reasonable dates and times, yet they match the pattern. Depending on what we are trying to accomplish, that may or may not be a problem. In the case of our log files, the timestamps are computer generated, so we can assume that they’ll all have the right number of digits. We know that the likelihood of there being a string that consists of two slashes, a space, and a colon (amid some number of digits) that is anything but a timestamp is extremely low. If, on the other hand, we needed to use a regex like this to validate the user input of a date and time, then we’d probably want to be more precise.

One thing we could do is change all of the asterisks (*) to plus-signs (+). Where an asterisk says that the preceding element can match zero or more times, a plus-sign says that the preceding element must match at least one time.

[0-9]+/[0-9]+/[0-9]+ [0-9]+:[0-9]+
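
To see the difference the plus-sign makes, here is a small Python sketch that runs the asterisk version and the plus-sign version side by side. STAR and PLUS are just illustrative names, and Python’s re module is an assumption on our part; any regex-capable tool should behave the same way on these patterns.

```python
import re

STAR = re.compile(r"[0-9]*/[0-9]*/[0-9]* [0-9]*:[0-9]*")  # zero or more digits
PLUS = re.compile(r"[0-9]+/[0-9]+/[0-9]+ [0-9]+:[0-9]+")  # one or more digits

for s in ["// :", "9/9/9 9:9", "04/04/07 23:32"]:
    star_ok = STAR.fullmatch(s) is not None
    plus_ok = PLUS.fullmatch(s) is not None
    print(f"{s!r}: asterisk={star_ok}, plus={plus_ok}")
```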

Now, “// :” will no longer match the pattern, because there must be at least one digit in each position (“9/9/9 9:9”). Better yet, let’s use a notation that allows us to specify the exact number of digits that are required. This is done by specifying the number within a pair of curly braces.

[0-9]{2}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}

Now, the pattern will only match if all five of the numeric positions (month, day, year, hour, minute) are exactly 2 digits each.
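
A quick way to see how strict the exact-count version is: the Python sketch below (again an assumed language, with illustrative names EXACT and samples) tests a well-formed two-digit timestamp against a few perfectly reasonable variations.

```python
import re

# Exactly two digits in each of the five positions.
EXACT = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}")

samples = [
    "04/04/07 23:32",    # two digits everywhere
    "04/04/2007 23:32",  # four-digit year
    "4/4/07 23:32",      # one-digit month and day
    "04/04/07 3:32",     # one-digit hour
]

exact_results = {s: EXACT.fullmatch(s) is not None for s in samples}

for s, matched in exact_results.items():
    print(f"{s!r}: {'match' if matched else 'no match'}")
```

Only the first sample matches; the other three are rejected purely because of their digit counts.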

Oops, now we’ve gone too far. What if the year is four digits? What if the hour is one digit? Or the month? Or the day? No problem. As an alternative to specifying a single number within the curly braces (to mean an exact match), we can specify a minimum count and a maximum count within the curly braces, by separating them with a comma.

[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4} [0-9]{1,2}:[0-9]{2}

Now, we have it where the month can be one or two digits, the day can be one or two digits, the year must be between two and four digits, the hour can be one or two digits, and the minutes must be exactly 2 digits.
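
Here is the same kind of check for the min/max version, as a hedged Python sketch (RANGED and samples are illustrative names). Keep in mind that the pattern only counts digits; it knows nothing about what a valid hour or year is.

```python
import re

# Month 1-2 digits, day 1-2, year 2-4, hour 1-2, minutes exactly 2.
RANGED = re.compile(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4} [0-9]{1,2}:[0-9]{2}")

samples = [
    "04/04/2007 23:32",                    # four-digit year
    "4/4/07 3:32",                         # one-digit month, day, and hour
    "4/4/107 23:32",                       # three-digit year
    "4/4/07 43:32",                        # hour of 43
    "04/04/07 23:3",                       # one-digit minutes
    "999999/999999/999999 999999:999999",  # far too many digits
    "// :",                                # no digits at all
]

ranged_results = {s: RANGED.fullmatch(s) is not None for s in samples}

for s, matched in ranged_results.items():
    print(f"{s!r}: {'match' if matched else 'no match'}")
```

The first four samples match and the last three do not. The three-digit year and the hour of 43 still get through, because they have an acceptable number of digits.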

Tip #1: The search and replace dialog shown in the snapshot above comes from TextPad, but these principles hold for any tool that can perform a regex search and replace.

Tip #2: Sometimes it is necessary to escape the curly braces (along with other symbols such as parentheses) in a regex by preceding them with a backslash. In TextPad, for example, the default is that they must be escaped in order to act as regex symbols; however, TextPad has an option to turn on “POSIX-style regex matching,” in which case curly braces and parentheses need to be escaped only when they are NOT to be interpreted as regex symbols.

Tip #3: In our final variation of the regex pattern above, we left it so that a three-digit year is possible (being within the scope of {2,4}). One way to handle this is to take that year portion of the pattern (“[0-9]{2,4}”), change it back to requiring exactly 2 digits (“[0-9]{2}”), but then enclose the whole thing in parentheses and add a second modifier to that to say that it can occur one or two times (“([0-9]{2}){1,2}”), since 2 x 1 = 2 and 2 x 2 = 4.
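
To pull the tips together, here is a minimal sketch of the whole search-and-replace job outside of an editor. It assumes Python and its standard re and difflib modules (our choice, not a requirement), it uses the Tip #3 year variant, and the names TIMESTAMP, mask_timestamps, first_log, and second_log are made up for the example. In Python’s regex flavor, curly braces and parentheses act as regex symbols without being escaped.

```python
import difflib
import re

# Final pattern from the article with the Tip #3 year: ([0-9]{2}){1,2}
# allows two or four year digits, but not three.
TIMESTAMP = re.compile(r"[0-9]{1,2}/[0-9]{1,2}/([0-9]{2}){1,2} [0-9]{1,2}:[0-9]{2}")


def mask_timestamps(lines):
    """Replace every timestamp-like substring with the words 'date time'."""
    return [TIMESTAMP.sub("date time", line) for line in lines]


first_log = """\
04/01/07 23:44 QC Job Started
04/01/07 23:44 Preprocessing 118 orders
04/01/07 23:45 Preprocess complete
04/01/07 23:45 Optimization Started
04/01/07 23:49 Optimization Complete
04/01/07 23:49 Allocation Started
04/02/07 00:14 Allocation Complete
04/02/07 00:14 Job Complete
""".splitlines()

second_log = """\
04/04/07 23:32 QC Job Started
04/04/07 23:32 Preprocessing 139 orders
04/04/07 23:33 Error PO# 3425 - Invalid SKU: 89265
04/04/07 23:34 Preprocess complete
04/04/07 23:34 Optimization Started
04/04/07 23:38 Optimization Complete
04/04/07 23:38 Allocation Started
04/04/07 23:54 Allocation Complete
04/04/07 23:54 Job Complete
""".splitlines()

# Compare the masked copies, much as a compare tool would.
diff = list(
    difflib.unified_diff(
        mask_timestamps(first_log),
        mask_timestamps(second_log),
        fromfile="first",
        tofile="second",
        lineterm="",
    )
)

# Keep only the added/removed lines, skipping the "---" and "+++" file headers.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]

for line in changed:
    print(line)
```

With the timestamps masked, the only lines left in the comparison are the two different order counts and the error for PO# 3425.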

