Regular Expression Basics

Posted on March 21, 2002

in Site Development

by Chris Spruck (sprocket)

Rated 4.53 (Ratings: 47)


Regular expressions, sometimes referred to as regex, grep, or pattern matching, can be a very powerful tool and a tremendous time-saver with a broad range of application. As an extended form of find-and-replace, you can use a regular expression to do things such as perform client-side validation of email addresses and phone numbers, search multiple documents for strings and patterns you wish to change or remove, or extract a list of links from source code. Regex is supported by most languages and tools, but because there can be varying implementations, this article will cover basic principles that are commonly used.

Literals and Metacharacters

If you've seen a regular expression before and thought it looked like alien space-algebra, it does, but have no fear - you'll be fluent in alien space-algebra in no time! To make the most of the power of regex, you need to be familiar with a few classifications of characters. Literals are normal text characters and can include whitespace (tabs, spaces, newlines, etc.). Unless modified by a metacharacter, a literal will match itself on a one-for-one basis. Metacharacters' power lies in how they are arranged and interpreted as wildcards. To find a literal instance of a metacharacter, such as a caret (^) or a backslash, escape it with a backslash (\). Metacharacters can also be used in nested groups or other combinations.

Below is a list of some metacharacters and character classes for a quick glance - each will be explained in further detail with examples. Keep in mind that a "match" can be as simple as a single character or as complex as a sequence of literals and metacharacters in nested and compounded combinations.

Metacharacter	Match
\	the escape character - used to find an instance of a metacharacter like a period, brackets, etc.
. (period)	match any character except newline
x	match any instance of x
^x	match any character except x
[x]	match any instance of x in the bracketed range - [abxyz] will match any instance of a, b, x, y, or z
| (pipe)	an OR operator - (x|y) will match an instance of x or y
()	used to group sequences of characters or matches
{}	used to define numeric quantifiers
{x}	match must occur exactly x times
{x,}	match must occur at least x times
{x,y}	match must occur at least x times, but no more than y times
?	preceding match is optional or one only, same as {0,1}
*	find 0 or more of preceding match, same as {0,}
+	find 1 or more of preceding match, same as {1,}
^	match the beginning of the line
$	match the end of a line

Detailed descriptions of regex operators

Within these descriptions, x is used as a placeholder for examples - x can be an actual x or it can be an entire sequence like href="http://www.evolt.org", <DIV>, or ((\.\.)?/[a-z]+\.jpg).

. - Matches any one character except newline and is generally used with quantifiers, which will be explained below. For instance, .{3} would match any three consecutive characters, so it finds three-letter words but also any other run of three characters.

x - Matches any instance of x and can include specific character sets or ranges, for instance, [wxyz] would match any instance of w, x, y, or z, but not wz, yx, or other combinations of the given character set, unless it was followed by a quantifier.

^x - Inside square brackets, a leading caret matches any character that is not x, and it can also negate a range. For example, <[^abel]+> would match one or more characters that are not a, b, e, or l, and which are surrounded by < and >, thus it would match <font> but not <table>.

[x] - Matches any character in the given range. Examples of a range would be the expression [0-9], which would find a single digit, or [a-z], which would find a single lower case character. You can combine ranges as well - [A-Za-z0-9] will find a single upper or lower case character or digit. You may also place partial ranges side by side, such as [0-35-8], which would find any digit that isn't 4 or 9. In most engines a comma or space inside the brackets is treated as one more character to match, not as a separator.
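As a rough illustration of bracketed ranges, here is a minimal sketch using Python's built-in re module. Python is an assumption made for these examples (the article itself was tested in ColdFusion Studio), and the sample strings are made up.

```python
import re

# [0-9] matches one digit at a time.
single_digits = re.findall(r"[0-9]", "a1b22")

# [A-Za-z0-9]+ matches runs of letters and digits.
alnum_runs = re.findall(r"[A-Za-z0-9]+", "ab-12 Z")

# Two ranges side by side: any digit except 4 and 9.
not_4_or_9 = re.findall(r"[0-35-8]", "0123456789")

# In Python, a comma or space inside the brackets is a literal member.
with_separators = re.findall(r"[0-3, 5-8]", "4,9 7")

print(single_digits, alnum_runs, not_4_or_9, with_separators)
```

Other engines may treat the contents of a bracketed class slightly differently, so it is worth running a similar check in the tool you actually use.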

() - Parentheses are used to group operators much like basic algebra and are also used to delineate a backreference, which is the way you can do replaces with matches. (Backreferences get their own section below). A simple example would look something like: www\.([a-z]+)\.com which will find www.anycharactersathroughzhere.com.

{} - Curly brackets (or braces) are used to define numeric quantifiers, which allow you to specify the optional, minimum, or maximum number of occurrences in the match. x{3} would find exactly 3 occurrences of x. x{3,} matches on at least 3 occurrences of x. x{3,5} matches at least 3 occurrences of x and no more than 5.
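The three quantifier forms can be checked with a short sketch in Python's re module (Python is an assumption here, not something the article uses). The helper name fits is hypothetical, and re.fullmatch is used so that the whole string has to satisfy the pattern.

```python
import re

def fits(pattern, text):
    """Return True when the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(fits(r"x{3}", "xxx"))        # exactly three
print(fits(r"x{3}", "xxxx"))       # four is one too many
print(fits(r"x{3,}", "xxxxxxx"))   # three or more
print(fits(r"x{3,5}", "xx"))       # fewer than three
print(fits(r"x{3,5}", "xxxxxx"))   # more than five
```

Without fullmatch, a plain search for x{3} would happily find three x's inside a longer run, which is easy to overlook.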

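? - The preceding match is optional: it may occur once or not at all. An example would be: ((\.\.)?/[a-z]+\.jpg) which matches a path to an image file ending in .jpg and could start with a ../ or just a /. A path starting with ./ or ../../ would not match that expression as a whole, although an unanchored search could still find a shorter piece inside it, such as the part beginning at the last /.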

* - Matches the preceding character or group 0 or more times. Note that this is not the same as the use of the ? listed above. z* can match no z, z, or for those readers who have already fallen asleep, zzzzzzzzzzzzzzzzzzzzzzz.

+ - Matches the preceding character or group 1 or more times. In comparison to the previous example, z+ would have to match at least z or zz or zzz and so on.
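To compare ?, * and + side by side, here is a small sketch in Python's re module (an assumption, since the article is tool-neutral). The file names are invented, and fullmatch is used so each test string must match in its entirety.

```python
import re

# The image path expression from the ? description.
image_path = re.compile(r"((\.\.)?/[a-z]+\.jpg)")
candidates = ["/logo.jpg", "../logo.jpg", "./logo.jpg", "../../logo.jpg"]

# Keep only the candidates that match as a whole.
whole_matches = [p for p in candidates if image_path.fullmatch(p)]
print(whole_matches)

# z* accepts an empty string, z+ needs at least one z.
star_accepts_empty = re.fullmatch(r"z*", "") is not None
plus_accepts_empty = re.fullmatch(r"z+", "") is not None
plus_accepts_run = re.fullmatch(r"z+", "zzzzzz") is not None
print(star_accepts_empty, plus_accepts_empty, plus_accepts_run)
```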

^ - Used to force a match to the beginning of a line. Note that this is not the same as a character exclusion such as [^xyz], which would match any characters that are not x, y, or z. ^Hello would match at the beginning of a line such as Hello Chris and would not match Chris said Hello.

$ - Used to force a match to the end of a line. end$ would match at the end of a line such as This is the end and would not match end this article already!
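The two anchors can be tried out with a minimal sketch in Python's re module (Python being an assumption for illustration). Note that in Python the anchors apply to the whole string unless the re.MULTILINE flag is given, which is a detail of that engine and not a general rule.

```python
import re

starts_with_hello = re.compile(r"^Hello")
ends_with_end = re.compile(r"end$")

hello_first = starts_with_hello.search("Hello Chris") is not None
hello_last = starts_with_hello.search("Chris said Hello") is not None
end_last = ends_with_end.search("This is the end") is not None
end_first = ends_with_end.search("end this article already!") is not None
print(hello_first, hello_last, end_last, end_first)

# With re.MULTILINE, ^ also matches right after each newline.
two_lines = "Hello\nHello again"
per_string = re.findall(r"^Hello", two_lines)
per_line = re.findall(r"^Hello", two_lines, flags=re.MULTILINE)
print(len(per_string), len(per_line))
```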

The various operators and metacharacters listed above are pretty standard across most implementations of regex. POSIX class names and character class shorthands are shortcuts to specify character types like digits, whitespace, and so on.

POSIX (Portable Operating System Interface) classes should be more consistent across languages and applications but there may not be an exact parallel between certain class shorthands and POSIX classes, and either class type may not always be fully supported. If they are supported, POSIX classes can be useful since they have a little more precision when it comes to things like whitespace and other non-alphanumeric characters.

POSIX Class	Match
[:alnum:]	alphabetic and numeric characters
[:alpha:]	alphabetic characters
[:blank:]	space and tab
[:cntrl:]	control characters
[:digit:]	digits
[:graph:]	non-blank (not spaces and control characters)
[:lower:]	lowercase alphabetic characters
[:print:]	any printable characters
[:punct:]	punctuation characters
[:space:]	all whitespace characters (includes [:blank:], newline, carriage return)
[:upper:]	uppercase alphabetic characters
[:xdigit:]	digits allowed in a hexadecimal number (i.e. 0-9, a-f, A-F)

Character class	Match
\d	matches a digit, same as [0-9]
\D	matches a non-digit, same as [^0-9]
\s	matches a whitespace character (space, tab, newline, etc.)
\S	matches a non-whitespace character
\w	matches a word character (typically a letter, digit, or underscore)
\W	matches a non-word character
\b	matches a word-boundary (NOTE: within a class, matches a backspace)
\B	matches a non-wordboundary
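Here is a brief sketch of the shorthand classes in Python's re module (an assumption for illustration; the sample text is made up). Python's built-in re module does not implement the POSIX class names, so only the backslash shorthands are shown.

```python
import re

text = "Order 66 shipped to catalog cat, room 101"

# \d+ collects runs of digits.
numbers = re.findall(r"\d+", text)

# \w+ collects runs of word characters; the underscore counts as one.
words = re.findall(r"\w+", "snake_case is-2 words")

# \b keeps "cat" from matching inside "catalog".
loose_cat = re.findall(r"cat", text)
whole_cat = re.findall(r"\bcat\b", text)

print(numbers, words, len(loose_cat), len(whole_cat))
```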

Think dif{2}erently

Many Macintosh applications can easily handle regular expressions, but that's not what I mean here. The philosophy of regex is one of surgical precision and extreme logic, and you have to play by the rules. Like doing a complex database query, you have to know exactly what you want and exactly how to get it or you'll end up with either way more data than you need or not enough information. The concepts of AND, OR, wildcards, and the liberal use of parentheses all come into play with regex. You have to carefully create an expression that meets your needs but is neither too restrictive nor too inclusive or the dark side of regular expressions will rear its ugly head.

A warning about "greediness"

With true power comes an unhealthy dose of greed. Regular expressions are very greedy. They may seem nice and friendly, but they'll take all they can get. What this means is that a quantifier such as * or + will try to match as much as it can, rather than stopping at the earliest possible match. It assumes you want the "whole thing", which is why you need to create a surgical strike of an expression. You can take care of a broken toe by amputating above the knee, but then where does that leave you? (Hopping mad, probably).

A great example of regex greediness is the expression:

<a href=".*">.*</a>

At first glance, it appears this expression will find an href tag (having no extra attributes) with a reference containing just about any URL, followed by ">, then anything in the link text, then the closing </a>. You could use this to get a list of all the links in a web page. Sounds useful and looks mostly harmless, right? What you end up with is something like this:

<a href="http://sample.url.here">Click this!</a>. Some text goes <a href="../text.htm">here</a>. Maybe several paragraphs go here. More text goes <a href="/less/is/more.htm">here</a>. Another big block of text, text, and more text. <a href="end.htm">The End</a>

The reason you get a whole block of text mixed with links as a single match instead of a simple list containing each link is because the sub-expression .* is where the greed kicks in. The .* really does mean "match anything" so it merrily goes along until it can't match anything else, which matches up to the very last </a> it can find and grabs everything in between along the way. It started at the toe and went straight to the thigh, without even thinking about slowing down at the knee.

Here's where we put a splint on the toe instead of amputating the whole leg. Break down the parts of this expression:

<a href="[^"]+">[^<]+</a>

You start with the <a href=" and then you see [^"]+">. If you've been following along with the rules, you know that this means find at least one of any character except a double-quote, then find the first instance of a double-quote, then a >. The same principle applies to the next part - [^<]+</a> finds at least one of any character except a <, then matches the first literal instance of </a>. Search with this expression and you get a nice short list of complete href tags. Conquer the greed! A clear understanding of the rules of regex and the various operators is paramount and it will take patience as well as experimentation with your logic to learn to tune an expression to yield exactly what you need.
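The difference between the greedy and the more precise expression can be seen in a short sketch using Python's re module (Python is an assumption; the sample markup is a shortened version of the block above).

```python
import re

html = (
    '<a href="http://sample.url.here">Click this!</a>. Some text goes '
    '<a href="../text.htm">here</a>. More text goes '
    '<a href="end.htm">The End</a>'
)

# Greedy version: .* runs to the last "> and the last </a> it can find.
greedy = re.findall(r'<a href=".*">.*</a>', html)

# Precise version: stop at the first double-quote and the first <.
precise = re.findall(r'<a href="[^"]+">[^<]+</a>', html)

print(len(greedy))   # one oversized match
for link in precise:
    print(link)      # one line per link
```

Many engines also offer non-greedy quantifiers such as .*? for the same purpose, but the negated character class works almost everywhere.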

Backreferences

Using a backreference is how you finally get to witness the real power of regular expressions. Extracting a list of links from a page of source is useful, but nowhere as useful as being able to do something with that data. Parentheses can be used to "remember" a subexpression, and a backreference in the form of \digit is how you refer to that particular group. Parentheses are counted from left to right within the expression, so the first open parentheses group has a backreference of \1, the second has a backreference of \2, and so on. You can use the memory-like functionality of a backreference in a replace string.

A good example of this uses the href expression from above. You can get a list of complete hrefs from some source with the expression <a href="[^"]+">[^<]+</a>. Let's say you need to find all external links on a web site and remove the href tag, but leave the link text intact, and we'll assume for this example that none of your local links start with http://. You would add parentheses to your expression like this:

<a href="http://[^"]+">([^<]+)</a>

You would then perform a find with this expression and simply replace with \1. The parentheses "memorize" the link text and the \1 calls it into the replace, leaving you with just the link text e.g. some text about <a href="http://www.evolt.org">evolt</a> results in some text about evolt.
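As a sketch of that find-and-replace, here is how it might look in Python's re module (an assumption; the article's own testing was done in ColdFusion Studio). The sample sentence adds a made-up local link to show that it is left alone.

```python
import re

# Group 1 remembers the link text of an external (http://) link.
external_link = re.compile(r'<a href="http://[^"]+">([^<]+)</a>')

source = (
    'some text about <a href="http://www.evolt.org">evolt</a> '
    'and a <a href="/local.htm">local page</a>'
)

# Replace the whole match with just the remembered link text.
stripped = external_link.sub(r"\1", source)
print(stripped)
```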

A more interesting example might be a transposition using more than one backreference. Pretend you have a text list of web site users in the form of LastName, FirstName and you want a list of names in a FirstName LastName format. The expression, ([^,]+),\s(.+) would find Spruck, Chris, since ([^,]+), matches any number of characters that aren't a comma, followed by a comma, then a space, then (.+) finds any number of characters again. Notice where I placed both sets of parentheses. To change Spruck, Chris to the preferred format, you would replace that with \2 \1 (with a literal space between the two backreferences, since \s is a search shorthand that most replace strings do not understand), yielding Chris Spruck.
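The transposition can be sketched in Python's re module as follows (Python is an assumption, and the second name is invented). The replacement string uses a literal space between the two backreferences.

```python
import re

names = ["Spruck, Chris", "Doe, Jane"]

# Group 1 is the last name, group 2 the first name.
flipped = [re.sub(r"([^,]+),\s(.+)", r"\2 \1", name) for name in names]
print(flipped)
```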

When you're doing replaces, it's very important that you test your expressions on backup copies of files, or even a dummy test file of your own creation, so if your expression is off by a parenthesis or something else, you haven't ruined your files permanently. Once you know your expression works on a sample, then go ahead and work on all your files. If you do run an expression that gives you unintended results, you can probably run another one to correct the mishap. Don't ask how I know this.

Sometimes it may also be useful to run more than one expression over the same set of data to make it easier to catch every last bit that you need with a second expression. For instance, you might want to add quotes to all your tag attributes if some are unquoted, then run another expression that somehow operates based on the quotes.

A few practical examples

Get a list of IP addresses from a server log:

(\d{1,3}\.){3}\d{1,3} - This expression will find three instances of a one to three digit number followed by a period, then one to three more digits, e.g. 206.159.10.1
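Here is a minimal sketch of pulling addresses out of a log with Python's re module (an assumption; the two log lines are made up). Because the expression contains a parenthesized group, the sketch reads the whole match from each match object instead of relying on findall, which in Python would return only the group.

```python
import re

log = (
    '206.159.10.1 - - "GET /index.html" 200\n'
    '10.0.0.7 - - "GET /a.jpg" 404'
)

ip_pattern = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# group(0) is the entire matched address, not just the last repeated group.
addresses = [m.group(0) for m in ip_pattern.finditer(log)]
print(addresses)
```

The expression checks only the shape of an address, so something like 999.999.999.999 would also match.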

Find doubled words in text such as "Rate this article high high, please!":

\s([A-Za-z]+)\s\1 - This expression will match a space, followed by a word of any length (which is later recalled by using the parentheses for a backreference), then a space again. The backreference, \1, then picks up the second instance of the same word. You could then simply replace the match with a space followed by \1, which keeps the first instance of the word and removes the second.
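A quick sketch of the doubled-word cleanup in Python's re module (an assumption for illustration). The replacement keeps one leading space in front of the remembered word so the surrounding spacing stays intact.

```python
import re

sentence = "Rate this article high high, please!"

# The match is " high high"; put back a space plus the remembered word.
fixed = re.sub(r"\s([A-Za-z]+)\s\1", r" \1", sentence)
print(fixed)
```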

Remove FONT tags from your web pages:

<(FONT|font)([ ]([a-zA-Z]+)=("|')[^"\']+("|'))*[^>]*>([^<]+)(</FONT>|</font>) and replace with the backreference \6 - This expression looks quite complicated, but I wanted to show an example with some more involved logic. A simpler example that finds the same string will follow this explanation. <(FONT|font) accounts for an upper or lower case tag. ([ ]([a-zA-Z]+)= matches a space followed by any attribute name and an =. The next subexpression, ("|')[^"\']+("|'), finds the leading double or single quote on the attribute, then any attribute value that contains no double or single quote, i.e. Arial, +5, #c3d4ff, etc., then the closing double or single quote. Notice that the subexpression for the entire attribute is enclosed in parentheses and followed by an asterisk - ([ ]([a-zA-Z]+)=("|')[^"\']+("|'))*. This allows you to find a tag with no attributes or with any number of them. [^>]*> then matches anything left over up to the first > (similar to the "greediness" example above), such as an unquoted attribute. The backreference is defined next as ([^<]+), which captures the text between the opening and closing font tags, and is referred to as \6 because it's the sixth parenthetical group in the entire expression, counting opening parentheses from left to right. Then (</FONT>|</font>) accounts for the closing font tag in either case.

<(FONT|font)[^>]*>([^<]*)(</FONT>|</font>), replaced with the backreference \2, is a simpler example that accomplishes the same thing as the expression explained above. The difference is that it is much less picky about what is inside the opening font tag, so if you have inconsistent tag syntax, it will probably capture the various instances you may have. On the other hand, if you have any extra junk characters in your search data, you may catch things you didn't intend, which is why you should test your expressions ahead of time.
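To try the simpler approach, here is a sketch in Python's re module (an assumption; the sample markup is invented). It uses a variant of the simple expression with parentheses around the text between the tags, so that the text can be put back with \2.

```python
import re

page = '<p><font face="Arial" size="2">Hello</font> and <FONT>bye</FONT></p>'

# Group 1: tag name, group 2: text between the tags, group 3: closing tag.
font_tag = re.compile(r"<(FONT|font)[^>]*>([^<]*)(</FONT>|</font>)")

# Keep only the text that sat between the font tags.
without_fonts = font_tag.sub(r"\2", page)
print(without_fonts)
```

Like the article's expressions, this sketch does not handle font tags that wrap other tags, since [^<]* stops at the first <.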

A brief history of the 31 Flavors

There are a number of applications and languages that support regular expressions, but unfortunately, not all of them support regex in quite the same way. Although regular expressions had their origins in neurophysiology in the 1940s and were developed by theoretical mathematicians in the 1950s and 1960s, the evolution and subsequent divergence of regex implementations was due to the independent development of various Unix tools such as grep, awk, sed, Emacs, and others. [1]

Today, it's probably safe to say that Perl has the most robust regex engine in common use. Other languages and applications that have some form of regex support or pattern matching (and this by no means is a complete list) include: JavaScript, VBScript, PHP, Python, Tcl, Java, C, Macromedia Dreamweaver/Ultradev, ColdFusion and ColdFusion Studio, BBEdit, NoteTabPro, TextPad, UltraEdit, the XML Schema and XPath Recommendations, the various Unix tools used for text processing and their clones, and just about any modern application with a Find function.

Conclusion

Regular expressions are a powerful tool to keep in your web belt. They can appear daunting, but by learning a few simple rules, you can save yourself hours of manual find-and-replace done the slow, boring way.

I'll close with what may be the world's first (and undoubtedly the world's worst) regular expression joke:

What did one regex say to the other?

.+

Other Resources

[1] Mastering Regular Expressions - Friedl
www.regexlib.com
www.webreference.com/js/column5/

All the regular expressions in this article were tested using ColdFusion Studio 4.5.2, so you may encounter slight differences in different applications or languages. Thanks to Sean Palmer for some expression testing.

Chris' favorite regular expression is a smirk with an optional wink. He lives in Atlanta, Georgia and dreams of being back on the coast. He probably needs more info in his bio. (He almost never refers to himself in the third person.)

77 comments on this article.

sweet

Submitted by taftman on March 22, 2002 - 11:49.

Great article, thanks.
- rob

regexes and the amazing vim

Submitted by jeduthun on March 22, 2002 - 17:40.

Included in "various Unix tools used for text processing and their clones" is the text editor vi, and its modern day equivalent, vim (vi improved). I learned most of what I know about regular expressions from using vim (which comes in Linux and Windows flavors, among others). Its documentation has very thorough coverage of regexes and their usage. I have started to use vim at work for almost all my everyday editing, partially because of its strong regex support.

Anyhow, if you want to learn regexes, vim makes a good playground, and you can't beat the cost. Open up a file and then type in

:%s/searchregex/replaceregex/g

to run a search/replace on a whole file using regular expressions. (The % means 'all lines in the file' and the g means 'all instances on each line' -- both parameters are customizable.)

In vim, you can type

:help regex

for the full manual on regular expressions.

call me stupid if you like but...

Submitted by notabene on March 24, 2002 - 13:46.

... I didn't get this:

What did one regex say to the other?
.+

Don't laugh, will you? :-)

"joke" explained

Submitted by sprocket on March 24, 2002 - 15:06.

.+ would match at least any one character, so the content of the punchline is really irrelevant. The fact that the punchline can be anything, due to the expression, is the joke itself. Now that I've made it completely unfunny by explaining it, any other questions? Maybe I should just stick to articles. :-)

Chris

aaaaaaaaaaaaaaaah

Submitted by notabene on March 24, 2002 - 17:09.

He he.

He he he.

more regex jokes

Submitted by jeduthun on March 25, 2002 - 09:26.

Just so Chris can't claim authorship of the world's only regex joke...

What regex are you most likely to see at Christmas?

[^L]

Why couldn't Chris try out the regular expressions he created until he left home?

His mom wouldn't let him play with matches.

please, stop

Submitted by jesteruk on March 25, 2002 - 20:46.

They are the worst jokes I ever heard in my life. Trust me, that's an achievement there, my uncle was a vicar who liked to think of himself as a funny man *shudders at the memories*

Brilliant article mate, clear, concise, well constructed and supported. Well done, keep it up. Just leave out the jokes - please... hehe.

-J

Are backreferences really an enhancement?

Submitted by sebkostal on May 15, 2002 - 17:23.

Do you really need the backreference or is this just another way to put things? Let's take the remove font tags example:

Instead of remembering the text between the font tags to insert it later, can't you just ignore the text instead to leave it there?? Are there things, you just cannot do without backreference?

I ask this, because I could not get the backreference to function in Dreamweaver.

Thanks for the article by the way. I have been doing a lot of simple searching and replacing to save time, but this will surely be helpful in the future.

backreference in Dreamweaver

Submitted by EvilCHELU on May 16, 2002 - 09:41.

Dreamweaver uses $1, $2, ... for backreferences instead of the \1, \2, ... that the examples use.

Backreferences

Submitted by jeduthun on May 16, 2002 - 10:40.

Backreferences are hugely powerful in regular expression search and replace operations.

For instance, let's suppose you have a list of names that look like this:

Doe, John
Doe, Jane
Foo, Fred
...

And you want it to look like this:

John Doe
Jane Doe
Fred Foo

This is easy with backreferences. Just replace this:
\(\w\+\),\s\(\w\+\) with this: \2 \1

Or suppose you want to double-quote all your attributes that are currently single-quoted (but want to leave all the other quotes in your document alone). You could use backreferences to do that too. Replace this: ='\([^']*\)'
with this:
="\1"

Need help with a regular expression.....

Submitted by rajan11 on May 20, 2002 - 15:24.

I need an expression to test for an optional occurrence of the string ABCD. I thought (ABCD)? would do the trick, but unfortunately it didn't help. I tested the following strings against that expression: 1. "ABCD", 2. "" (empty string), 3. "ABDC". I get true for all three. I am expecting the regular expression to match only "ABCD" and ""; "ABDC" should not have matched. Any help will be greatly appreciated. Thanks

Re: need help with a regular expression.

Submitted by luminosity on July 28, 2002 - 05:31.

ABDC matches precisely because the match is optional. You can't test for a simple optional part.. optionals are meant to be included within more complex regexps. If you want to only match the empty string or abcd you should use (ABCD|), I believe.

Re: Need help with a regular expression.....

Submitted by vor0nwe on August 22, 2002 - 01:39.

Do you want to match the complete string? Then you need to test for ^(ABCD)?$; the following strings will match this regexp: "ABCD", "", and nothing else...
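The effect of the anchors can be checked with a small sketch in Python's re module (an assumption, since the original poster did not say which engine was in use).

```python
import re

candidates = ["ABCD", "", "ABDC"]

# Unanchored: (ABCD)? can match an empty stretch, so every string "matches".
unanchored = [re.search(r"(ABCD)?", s) is not None for s in candidates]

# Anchored: the whole string must be either ABCD or empty.
anchored = [re.search(r"^(ABCD)?$", s) is not None for s in candidates]

print(unanchored)
print(anchored)
```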

I ain't getting it, need some help!

Submitted by jssingh on October 4, 2002 - 16:34.

Hi. I am trying to validate some input and it looks pretty simple, but I am just not getting it. I need to validate according to these criteria -- it must be 6 to 12 chars long -- it must include at least one number or symbol (any symbol on the keyboard) -- it must include at least one letter (A-Z) -- it must not have spaces. I am using this: "^([a-z]+[^\s[a-z]]+){6,12}$". But it ain't working. Appreciate any help, thanks.

No help?

Submitted by jssingh on October 7, 2002 - 16:31.

Looks like I am out of luck. The last msg was posted ages ago. Is someone home?

Re: I ain't getting it, need some help!

Submitted by vor0nwe on October 7, 2002 - 22:33.

Hi jssingh,

You're running into a 'limitation' of regexes: the format has to be (more or less) fixed. AFAICS, you're going to have to use a couple of regular expressions to do what you want to do; since the order of the required elements is not fixed...

Here's how I would do it (I think):
1. ^[^\s]{6,12}$ -- to check that there are no spaces and that the size is correct.
2. [a-z]+ -- to check for one or more letters
3. [^\sa-z]+ -- to check for one or more non-letter characters (number or symbol)

All three checks must be true.
The reason your ^([a-z]+[^\s[a-z]]+){6,12}$ doesn't work is that the order of the elements is fixed: it has to begin with one or more letters, and then it must have some numbers or symbols. You can't really mix the two.
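The three checks above could be combined along these lines in Python's re module. Python is an assumption, the function name is_valid is hypothetical, and re.IGNORECASE is added on the guess that upper case letters should count as letters too. The first check uses fullmatch instead of ^ and $, because in Python $ also matches just before a trailing newline.

```python
import re

def is_valid(candidate):
    """Hypothetical helper: apply all three checks to one input string."""
    # 1. 6 to 12 characters, none of them whitespace.
    right_length = re.fullmatch(r"[^\s]{6,12}", candidate) is not None
    # 2. At least one letter.
    has_letter = re.search(r"[a-z]", candidate, re.IGNORECASE) is not None
    # 3. At least one character that is neither whitespace nor a letter.
    has_other = re.search(r"[^\sa-z]", candidate, re.IGNORECASE) is not None
    return right_length and has_letter and has_other

for sample in ["abc123", "abcdef", "123456", "ab 1234", "ab1"]:
    print(sample, is_valid(sample))
```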

Re: I ain't getting it, need some help!

Submitted by jssingh on October 8, 2002 - 12:03.

Thanks a ton vor0nwe, it's much appreciated. I bet you're right, something that I was struggling with for a couple of days. Thanks again.

^x?

Submitted by trfc791 on January 15, 2003 - 07:06.

I am a bit confused about ^x matching any letter except x, because in PHP, which is the only language I've used regex in, I have always used ^x to mean "starts with x". However, in some regular expressions like jssingh's, ^([a-z ... blah blah blah means that it starts with a letter, but in Chris's example (the anchor tags one), [^"] means any character except ". Both seem to work. Why is this so?

^L

Submitted by trfc791 on January 15, 2003 - 07:07.

Ah yes, and because of that, I don't know if [^L] in the joke is supposed to mean "starts with L" or "anything but L". Explain this joke please :P

Re: ^x?

Submitted by vor0nwe on January 16, 2003 - 10:22.

I can imagine your getting confused. I would say that Chris messed up there.

In fact, you're right: ^x does indeed mean a string starting with x. It's only within the square brackets that ^ means 'everything except'. So [^x] means everything except the letter 'x'.

So [^L] means "anything but L". But even then I don't get the joke... :(

This joke I understood :-)

Submitted by notabene on January 17, 2003 - 07:38.

Hey people,

In French Christmas is Noel (no L, hence the [^L]). Funny, eh?

Re: ^x?

Submitted by sprocket on January 17, 2003 - 15:53.

vor0nwe - How have *I* 'messed up'? It's in the article and you explained it yourself, too. Outside of brackets, ^x matches something starting with x and inside a range (brackets like [^x]), it means something except x.

If you read the table of metacharacters, you'll see that ^ has its own entry, described as "match the beginning of the line". The full description of this metacharacter (bold is my emphasis here) says " ^ - Used to force a match to the beginning of a line. Note that this is not the same as a character exclusion such as [^xyz], which would match any characters that are not x, y, or z. ^Hello would match at the beginning of a line such as Hello Chris and would not match Chris said Hello."

And just to make this even more confusing, hehe, if you wanted to match the beginning of a sequence that didn't start with x, y, or z, an expression for that could be ^[^xyz].

Also, remember that this article is about general principles and not specific to PHP, Perl, or any particular implementation of regex, so you may encounter differences in the regex engines.

Thanks all for reading and for your input! HTH!

Chris

Re: ^x?

Submitted by vor0nwe on January 17, 2003 - 16:10.

Hehe... Hi Chris, no offense intended. :-)

The 'mess-up' I was referring to is actually only in the table of metacharacters and what they match. In there, it says that '^x' (without any brackets) will 'match any character except x', which isn't true, AFAIK. ^x will match an x at the beginning. [^x], however (mind the brackets), will match any character except x. It's really a small detail, but nonetheless confusing when you think you already got the basics, are too hasty to read through the whole article, and only check out the table (which is what I do when I'm in a hurry and need a solution yesterday).

...Are you saying that in some implementations of regex, ^x (outside square brackets) can also mean 'everything except x'? Fascinating!

Anyway, thanks and keep up the good work... :-)

Re: ^x?

Submitted by sprocket on January 17, 2003 - 17:02.

OK, I now see what you mean and as shown in that table, yes, it should have been displayed more accurately. I guess the thinking (and a few words to this effect in the text) was that any discrete element listed in the metacharacter table could be used at any point within a set of brackets, in a numeric quantifier, etc. My sincere apologies.

I never meant to imply that ^x without brackets itself can also mean 'everything except x'. Doesn't mean it's impossible, although it would be rather strange and trying to parse matches like that would be pretty sketchy, IMO.

Chris

Good Article - and need some help

Submitted by grusshauf on February 20, 2003 - 18:05.

Very good article. I've used some simple expressions so far, and some that were a little more complicated (after hours of trial and error)...but that was months ago and I've forgotten just about everything I've learned (I can hardly read my own code that worked those months ago). And I hadn't learned backreferences.
Anyway, what I'm trying to do is to find a specific link. So say I have a link, in a string/file:

Say I know that I need to find the href section preceding "Sat, Sep 28" (I'll know the exact date). How would I come up with this part only:

I want to replace it with something else, and leave the
Sat, Sep 28
Part intact.

I have actually tried some things - I know how it is when someone just wants something done for them. I want to learn.
Thank you,
Russell

Re: Good Article - and need some help

Submitted by vor0nwe on February 21, 2003 - 01:09.

Russel,

I think you don't need backreferences so much as positive lookahead.

<A.*?>(?=Sat, Sep 28)

will match any <A> tag that precedes "Sat, Sep 28", but will not include that "Sat, Sep 28" in the match.

The ? question mark just next to the * asterisk means it's a non-greedy match; it'll only match the last <A tag that precedes the given text; otherwise it would match everything between the first available "<A" and the last ">" to precede the given text.

I don't know very much about the different RegEx syntaxes; this one works in MS's .Net framework and VBScript/JScript regexes. I'm not sure if and how this works in other implementations, though.
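For anyone who wants to experiment with the lookahead, here is a sketch in Python's re module (an assumption; Python supports the same (?=...) syntax). The sample string with two links is made up. It also tries a tighter variant, <A[^>]*>, which cannot run past the end of a tag.

```python
import re

html = '<A href="a.htm">Fri, Sep 27</A> <A href="b.htm">Sat, Sep 28</A>'

# Non-greedy .*? still starts at the first <A it can, so it spans both tags.
lazy = re.sub(r"<A.*?>(?=Sat, Sep 28)", "{test}", html, flags=re.IGNORECASE)

# [^>]* stays inside a single tag, so only the second <A ...> is replaced.
tight = re.sub(r"<A[^>]*>(?=Sat, Sep 28)", "{test}", html, flags=re.IGNORECASE)

print(lazy)
print(tight)
```

With only one link in the text, both expressions give the same result.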

Re: Good Article - and need some help

Submitted by grusshauf on February 21, 2003 - 10:42.

Thank you for responding. I'm using VBScript (through VB), but it gave me an error when I tried to test it. Then I remembered that I've needed to put \s in for a space in the past. But this didn't work either. My final try was:

<A.*?>(?=Sat,\sSep\s28)

Any suggestions?
Thanks again, Russell

Re: Good Article - and need some help

Submitted by grusshauf on February 21, 2003 - 10:57.

When I test what you have with what I'm wanting to test - at (http://regexlib.com), it works (finds a match). So it must be a VB/VBScript issue...

Re: Good Article - and need some help

Submitted by vor0nwe on February 21, 2003 - 11:10.

Ok, so I've set a reference to Microsoft VBScript Regular Expressions 5.5 (I've also tested it with 1.0, and that also works).

    Dim rx As RegExp
    Dim strTest As String
    
    strTest = "Testing <A href=""/b1/show?page=SomePage"">Sat, Sep 28</A> to see if it works..."
    
    Set rx = New RegExp
    With rx
        .Pattern = "<A.*?>(?=Sat, Sep 28)"
        .IgnoreCase = True
        Debug.Print .Replace(strTest, "{test}")
    End With

results in:

Testing {test}Sat, Sep 28</A> to see if it works...

Isn't this what you wanted? Or did you just forget to set the IgnoreCase flag?

HTH,

Vor0nwe

Re: Good Article - and need some help

Submitted by grusshauf on February 21, 2003 - 11:36.

Thank you. I've found the problem. I have a reference set to "Microsoft VBScript Regular Expressions". When I remove that reference and try to add "Microsoft VBScript Regular Expressions 5.5" it tells me that it cannot load the DLL. I've tried this both in VB and VBA. I will try it again from home (at work now) to see if I can add the reference there (but would really like to get this working here). With the first reference, I get an error when trying to test (or replace, or match, etc.).

I really appreciate your help! I'm sure that it will work if I can work out the reference thing.

-Russell

Password Pattern

Submitted by pickle on March 3, 2003 - 04:38.

How about a pattern for a password that allows special characters, alphanumeric except white space and tab?
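
One possible reading of this request, sketched in JavaScript: accept any non-empty string of letters, digits, and special characters as long as it contains no whitespace at all. The function name is hypothetical, and the rule itself is an assumption about what was being asked.

```javascript
// Hypothetical helper: \S matches any character that is not whitespace
// (spaces, tabs, and newlines are all rejected).
function isPasswordWithoutWhitespace(password) {
  return /^\S+$/.test(password);
}

console.log(isPasswordWithoutWhitespace("p@ss-w0rd!"));
console.log(isPasswordWithoutWhitespace("pass word"));
console.log(isPasswordWithoutWhitespace("pass\tword"));
```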

OK, you gurus of the Regular Expression....

Submitted by jdmaynard on April 5, 2003 - 16:07.

I would like to lose all the whitespace characters in about a bazillion anchor tags - like f'rinstance:

and

It's obviously pretty easy to find all the whitespace in the documents, but I'm getting pretty confused just picking out the whitespace in the offending tags. Thanks in advance - Doug

excellent article!

Submitted by mwarden on April 8, 2003 - 12:20.

Excellent article, Chris! It helped out a lot with html-specific stuff I was doing. Thanks!

Excellent Article, I Stand Before You a Broken Man

Submitted by sholmes on April 8, 2003 - 20:52.

I thought I had mastered the ancient art of regexp, but alas, they karate chopped me into yahoosville tonight. I have been slaving over a very simple regexp I simply cannot get to work. Possibly another set of eyes may help.

I have a simple string in which I need to match:

1) The letter W at the start
2) 0 or 1 occurrences of 'est'
3) Absolutely nothing else. No symbols, characters, numbers...nothing.

So:

1) "W" = should succeed
2) "West" = should succeed
3) "Way" = should fail

I have tried every angle from W(est|) to W(est)?[^.*] and even W($est)?[^.*] and beyond, all to no avail. Any insight or rudimentary bops on the head in my general direction would be greatly appreciated.

- S. Holmes

I Stand Before You a Broken Man

Submitted by jdmaynard on April 9, 2003 - 06:03.

Is this what you want?

\bW\b|\bWest\b

No Longer Broken

Submitted by sholmes on April 9, 2003 - 07:15.

Thanks, jdmaynard, that did the trick. For the record, I am doing it in ColdFusion 5.0, and the CFML version of what you provided me (for anyone else interested) is:

^W$|^West$

You have saved me from a fate worse than death. ^_^

-S. Holmes
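
For comparison, the same requirement (a W, optionally followed by "est", and nothing else) can be sketched in JavaScript with a single anchored pattern. The function name is made up for illustration.

```javascript
// ^ and $ anchor the match to the whole string;
// (est)? makes the "est" part optional (0 or 1 occurrences).
function isWOrWest(text) {
  return /^W(est)?$/.test(text);
}

console.log(isWOrWest("W"));
console.log(isWOrWest("West"));
console.log(isWOrWest("Way"));
```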

suitable question

Submitted by minivip on April 13, 2003 - 01:48.

W or West - that's the question. The formulation of the match criteria has been a little bit too complicated.

Enjoyed the Article but have questions...

Submitted by dougcranston on April 22, 2003 - 04:50.

I have a problem with a regexp that I found. Its intent is to ensure that a password has at least one uppercase letter, one lowercase letter, and at least one numeral.

I have tried two different approaches and both fail. Can someone point out my errors? I'm sure it is a simple case of shooting myself in the foot, as I just don't get regexp and its nuances.

First attempt:
var strng = document.frmLogon.sPswrd.value;

if (!((strng.search(/(a-z)+/)) && (strng.search(/(A-Z)+/)) && (strng.search(/(0-9)+/)))) {
alert("The password must contain at least one uppercase letter, one lowercase letter, and one numeral.\n");
document.frmLogon.sPswrd.focus();
document.frmLogon.sPswrd.blur();
document.frmLogon.sPswrd.select();
return false;
}

Second attempt and much longer:

var strng = document.frmLogon.sPswrd.value;

if (strng.search(/[a-z]+/)==false)
{
alert("lcThe password must contain at least one uppercase letter, one lowercase letter, and one numeral.\n");
document.frmLogon.sPswrd.focus();
document.frmLogon.sPswrd.blur();
document.frmLogon.sPswrd.select();
return false;
}
else
{
if (strng.search(/[A-Z]+/)==false)
{
alert("ucThe password must contain at least one uppercase letter, one lowercase letter, and one numeral.\n");
document.frmLogon.sPswrd.focus();
document.frmLogon.sPswrd.blur();
document.frmLogon.sPswrd.select();
return false;
}
else
{
if (strng.search(/[0-9]+/)==false)
{
alert("noThe password must contain at least one uppercase letter, one lowercase letter, and one numeral.\n");
document.frmLogon.sPswrd.focus();
document.frmLogon.sPswrd.blur();
document.frmLogon.sPswrd.select();
return false;
}
else
{return true;}
}
}

Any suggestions or comments would be greatly appreciated.

Thanks,
DougCranston

Fixed it...

Submitted by dougcranston on April 22, 2003 - 08:03.

The problem was that I was testing for false, when search() returns -1 if the pattern is not found, or otherwise the position of the match in the string. Thanks, DougCranston
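
A minimal sketch of that fix, assuming the same three requirements. Note that a match at position 0 makes search() return 0, and 0 == false is true in JavaScript, which is why the comparison against false misfires. The function names are hypothetical; test() sidesteps the issue by returning a plain boolean.

```javascript
// Hypothetical helper: true only if all three character kinds are present.
function hasRequiredMix(password) {
  return /[a-z]/.test(password) &&
         /[A-Z]/.test(password) &&
         /[0-9]/.test(password);
}

// The same check written with search(), comparing against -1 explicitly.
function hasRequiredMixUsingSearch(password) {
  return password.search(/[a-z]/) !== -1 &&
         password.search(/[A-Z]/) !== -1 &&
         password.search(/[0-9]/) !== -1;
}

console.log(hasRequiredMix("abcDEF123"));
console.log(hasRequiredMix("abcdef123"));
```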

How do you search for special chars '{' or '['

Submitted by rossm on May 6, 2003 - 02:14.

Hi, I'm new to this Regex game. Can anyone tell me if you can search for the special chars in a string, i.e. "hello Mr {name}, how are you?", where "{name}" should be found? Thanks, Ross

How do you search for special chars '{' or '['

Submitted by jdmaynard on May 6, 2003 - 10:47.

This is described in the article at the top of this list. Read about the escape character, which I think is what you're looking for - something like \{ or \[ I think ought to work for you.
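
As a small JavaScript sketch of that suggestion, using the string from the question (the variable names and the replacement value are made up for illustration):

```javascript
const greeting = "hello Mr {name}, how are you?";

// The backslashes make { and } literal characters instead of metacharacters.
const placeholder = /\{name\}/;

const found = placeholder.test(greeting);
const position = greeting.search(placeholder); // index of the opening brace
const filled = greeting.replace(placeholder, "Ross");

console.log(found, position, filled);

// An opening square bracket is escaped the same way.
const hasBracket = /\[/.test("items[0]");
console.log(hasBracket);
```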

Formatting a string value

Submitted by quasimodo on May 19, 2003 - 07:59.

I have written a JavaScript function using regex to accept a string value and convert it to a currency value. What I'd like to do is have a function that accepts a value and a pattern and formats the value according to the pattern. Can anyone help me? Thanks

Hard to catch?

Submitted by matte99 on May 23, 2003 - 04:42.

Didn't really understand this, I think it's way over my head :-)

Corrections

Submitted by cgibbard on July 14, 2003 - 10:43.

^x match any character except x

This isn't true, and is due to confusion with the special case use of character-set notation (that is, square brackets), where if the first character in the square brackets is ^, then the meaning of the set is reversed. i.e. [^A-Za-z] will match anything other than a letter. Caret is not considered special in any other location inside square brackets - for example [A-Za-z^] matches any letter, or caret. Also a good point to note is that the only characters that are considered special (and hence need escaping) inside square brackets are square brackets themselves, the caret if it appears as the first character, backslash as it is used to escape characters, or hyphen, if it would otherwise denote a range of characters. [()] will match an opening or closing paren, as the parens are not special. [a-] will match an 'a' or a hyphen, but [a-z] will match any lowercase letter (and not hyphen).
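
The bracket rules described above can be checked with a few quick tests. This is a JavaScript sketch; the variable names are illustrative only.

```javascript
const notLetter = /[^A-Za-z]/;     // leading ^ negates the set
const letterOrCaret = /[A-Za-z^]/; // ^ anywhere else is a literal caret
const aOrHyphen = /[a-]/;          // trailing - is a literal hyphen
const anyParen = /[()]/;           // parens are not special inside brackets

console.log(notLetter.test("abc"));   // only letters, so nothing matches
console.log(notLetter.test("ab1"));   // the digit matches
console.log(letterOrCaret.test("^"));
console.log(aOrHyphen.test("-"));
console.log(/[a-z]/.test("-"));       // here - denotes a range, not a hyphen
console.log(anyParen.test("f(x)"));
```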

| (pipe) an OR operator - [x|y] will match an instance of x or y

While it's true that the pipe symbol acts as an or operator, the example shown is incorrect. The pipe is not considered special inside square brackets, so [x|y] will match an 'x', a pipe character or a 'y'. The regular expression x|y will match an 'x' or a 'y' however, as will [xy]. [w|x|y|z] in the article should be [wxyz], or w|x|y|z.
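
A short JavaScript sketch of the difference, with illustrative variable names:

```javascript
const bracketed = /^[x|y]$/;   // a set containing x, the pipe character, and y
const alternation = /^(x|y)$/; // a real OR: x or y
const plainSet = /^[xy]$/;     // equivalent set without the stray pipe

console.log(bracketed.test("|"));   // the pipe itself is accepted
console.log(alternation.test("|")); // the pipe is rejected
console.log(alternation.test("x"), plainSet.test("y"));
```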

You may also combine ranges with commas, such as [0-3, 5-8] which would find any digit that isn't 4 or 9.

The commas and spaces are unnecessary, and potentially harmful. [0-3, 5-8] as shown, will match '0', '1', '2', '3', comma, space, '5', '6', '7', or '8'. The right way to do this is to simply put [0-35-8]
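
To see the stray comma and space being matched, here is a JavaScript sketch that collects every character each set accepts from a sample string (the sample and variable names are made up for illustration):

```javascript
const sample = "0123456789, ";

// With the comma and space inside the brackets, both become members of the set.
const loose = sample.match(/[0-3, 5-8]/g).join("");

// Two ranges written back to back, with nothing in between.
const strict = sample.match(/[0-35-8]/g).join("");

console.log(JSON.stringify(loose));
console.log(JSON.stringify(strict));
```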

$end would match at the end of a line such as This is the end and would not match end this article already!

$end will never match anything, as the $ indicates the end of the string, and there won't be any characters past the end of the string. The right way to do this is end$, which will match strings ending in "end", and not match others.
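
The two sample sentences from the article make a convenient test. A JavaScript sketch, with illustrative variable names:

```javascript
const endsWithEnd = /end$/; // "end" followed by the end of the string
const misplaced = /$end/;   // end of the string followed by "end": impossible

const first = "This is the end";
const second = "end this article already!";

console.log(endsWithEnd.test(first), endsWithEnd.test(second));
console.log(misplaced.test(first), misplaced.test(second));
```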

capturing nested {{}} + literals w escaped quotes

Submitted by susan on August 6, 2003 - 15:42.

excellent article + comments even!

it's obvious you can use the non-greedy [^x]+x trick to capture the following string (it's a generic anonymous function where a, b, c are variables, function calls, statements, etc.):

function(){a;b;c;}

captured by

(function\(\)\{[^}]+\})

less obvious (to me at least) is, how do you capture the following, with nested {}?:

function(){a;{b;c;};d;}

a similar problem is capturing literals with single and/or double quotes containing their escaped characters:

"abc\""
'susan\'s'

tia
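
Unlike nested braces, quoted literals with escaped characters can be handled by an ordinary regular expression. One common approach, sketched here in JavaScript with illustrative names, treats a backslash plus the following character as a single unit so that an escaped quote never ends the literal.

```javascript
// Inside the quotes, accept either a character that is neither the quote
// nor a backslash, or a backslash followed by any one character.
const doubleQuoted = /"(?:[^"\\]|\\.)*"/;
const singleQuoted = /'(?:[^'\\]|\\.)*'/;

// The two samples from the comment, written as JavaScript strings.
const sampleDouble = 'x = "abc\\"" + y';
const sampleSingle = "name = 'susan\\'s';";

const matchedDouble = sampleDouble.match(doubleQuoted)[0];
const matchedSingle = sampleSingle.match(singleQuoted)[0];

console.log(matchedDouble);
console.log(matchedSingle);
```

This sketch assumes the literal sits on one line, since . does not match a newline by default.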

Limitations of Regular Expressions

Submitted by cgibbard on August 7, 2003 - 00:48.

It's not possible in general to match strings using regular expressions which are recursively defined. Things like a regular expression for matched brackets, or valid HTML, or even something as seemingly simple as one which matches a number of a's and then the same number of b's don't exist.

See the following links for a description of what regular expressions can express.

http://www.wikipedia.org/wiki/Regular_language
http://www.wikipedia.org/wiki/Regular_expression
http://www.wikipedia.org/wiki/Regular_grammar

To make sense of that, you'll want to know what a grammar is. A grammar is a specification of 4 things:

Σ, a finite set of symbols (i.e. the English alphabet, one and zero, or ASCII) called terminals.
N, a finite set of abstract symbols (completely disjoint from the terminals) called nonterminals.
P, a finite set of production rules that tell us how to turn a substring of terminals and nonterminals into another string of terminals and nonterminals
S, a symbol picked from the nonterminals in N as the start symbol.

Essentially what we do, is begin with the start symbol, and follow rules however we want until we get a string of terminals. For example, here's a grammar: a and b are the terminals, S and A are nonterminals, S being the start symbol, the production rules are as follows:

S -> aAb
A -> aAb
A -> (the empty string)

For example, here is a derivation of a string in the language:
S
aAb (by rule 1)
aaAbb (by rule 2)
aaaAbbb (by rule 2)
aaabbb (by rule 3)

With a little thought, you can see that this grammar generates exactly the nonempty strings made of some number of a's followed by the same number of b's.
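
The derivation can also be played out mechanically. The following JavaScript sketch applies rule 1 once, rule 2 as often as needed, and rule 3 last; the function name and the idea of recording each step are illustrative additions, not part of the grammar itself.

```javascript
// Hypothetical helper: list the derivation steps for n a's followed by n b's (n >= 1).
function derive(n) {
  const steps = ["S"];
  let current = "aAb";                     // rule 1: S -> aAb
  steps.push(current);
  for (let i = 1; i < n; i++) {
    current = current.replace("A", "aAb"); // rule 2: A -> aAb
    steps.push(current);
  }
  current = current.replace("A", "");      // rule 3: A -> (the empty string)
  steps.push(current);
  return steps;
}

console.log(derive(3).join("\n"));
```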

Now what is a regular grammar? A regular grammar is one where the production rules are all in one of three forms:

A -> a where A is some non-terminal in N and a is some terminal in Σ
A -> aB where A and B are some nonterminals in N and a is some terminal in Σ
A -> (the empty string) where A is some nonterminal in N

We can immediately see that the grammar above for matching a's and b's is not regular, but seeing that it's not possible to do something equivalent with a regular grammar takes a little more thought.

I could give a rigorous mathematical proof here, but it would lengthen the article, and I'm not sure that many people here are really interested. A few different proofs can be found easily by searching on Google. The main point is that as type 2 rules drop off a's, the rule selected to start dropping off b's has to change (as the productions that follow must drop off an increasingly large number of b's). To account for an arbitrarily large number of a's, there must be an infinite number of production rules to choose from to start taking b's, and hence we are in trouble, as that is not possible.

Now, all that being said, there are a few concessions to make. Placing a finite limit on the number of a's and b's keeps the language regular; serious problems arise only when the number must be arbitrary. So if you pick a finite number of nestings that you need, you can work out a regexp that does it. (For larger numbers of nestings, you can expect your solution to be longer.)
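
As an illustration of the bounded case, here is a JavaScript sketch of a pattern that accepts at most one level of nested braces inside the function body. The depth limit of one is an arbitrary assumption; each extra level would need another layer in the pattern.

```javascript
// The body is a run of either non-brace characters
// or one inner {...} group that itself contains no braces.
const oneLevel = /function\(\)\{(?:[^{}]|\{[^{}]*\})*\}/;

const flat = "function(){a;b;c;}";
const nestedOnce = "function(){a;{b;c;};d;}";
const nestedTwice = "function(){a;{b;{c;}};d;}";

console.log(oneLevel.test(flat));
console.log(nestedOnce.match(oneLevel)[0] === nestedOnce);
console.log(oneLevel.test(nestedTwice)); // too deep for this pattern
```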

Also, if you're working with Perl, there's a secret to Perl's regular expressions. Perl's "regular expressions" haven't been real regular expressions for quite some time now due to some extended features. These allow any grammar to be expressed. In any event, a Perl 5 "regexp" to match matching parens is as follows:

    $paren = qr/
      \(                   # Actual open paren symbol
        (                  # A group of either
           [^()]+          # Not parens
         |                 # Or
           (??{ $paren })  # Another balanced group (not interpolated yet).
        )*                 # Match as many times as needed.
      \)                   # Actual close paren
    /x;

The (??{ $str }) notation allows for strings to remain uninterpolated at the time of assigning the expression. This means that rather than expanding $paren right then and there to its value, which is empty since it hasn't quite been assigned yet, it waits to expand the value until the expression is actually used, and it comes time to match something. In this way you can get recursion of any sort happening with a little work.
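
JavaScript's regular expressions have no comparable recursive construct, as far as I know, so there the usual workaround is to step outside regex and count nesting depth directly. A minimal sketch, with a hypothetical function name:

```javascript
// Hypothetical helper: true if every ( has a matching ) in the right order.
function isBalanced(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) return false; // a close paren with no open one before it
    }
  }
  return depth === 0;              // anything left open means unbalanced
}

console.log(isBalanced("(a(b)c)"));
console.log(isBalanced("(()"));
console.log(isBalanced(")("));
```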

RegExp

Submitted by jttaggart on August 10, 2003 - 19:44.

Fantastic article!! And real world examples. Thank you.

RegExp

Submitted by srbulls on October 6, 2003 - 14:26.

Great article! Have tinkered with regular expressions... this was a BIG help getting thru the murky spots.. =) Thanks!

help with RegExp

Submitted by playdoh on October 27, 2003 - 17:41.

Only need to write any string input in a [A-Z][a-z] format...that is every first letter of every/any word capitalized!

If I borrow the idea of writing a function to put all the input entered into lowercase letters, and then replace it with the first-letter-capitalized words - same as the input - would this be correct?

var myStr=prompt("anything you want to enter goes here: ", "")
function capitalize(){ myStr.input.toLowerCase().replace(/\b([a-z]/g,m) function (newStr){ return newStr.toUpperCase()} }

I can't seem to get this.
I am so confused after reading all the rules about these expressions! HELP please....???

Thanks in advance.

re: help with RegExp

Submitted by cgibbard on October 27, 2003 - 21:25.

http://www.visibone.com/regular-expressions/ seems to have a good run-down on using regular expressions inside Javascript (which I take it is the language you're using).

Here's some fairly flexible code for doing what you want:

function mapMatches(r,f,x) {
	while (r.test(x)) {
		x = RegExp.leftContext + f(RegExp.lastMatch) + RegExp.rightContext;
	}
	return x;
}

function toUpper(x) {
	return x.toUpperCase();
}

function capitalise(x) {
	return mapMatches(/\b[a-z]/,toUpper,x);
}

print (capitalise("abc def ghi jkl mno pqr stu vwx yz"));

mapMatches takes a regular expression, a function, and a string, and repeatedly matches the regular expression against the string, each time replacing the matching bit how the function given tells it to. In this case, /\b[a-z]/ is the regular expression - that is, a lower case letter at the start of a word and toUpper is the replacement function, which just makes the string matched uppercase. So what happens is each time mapMatches loops, it kills off another initial lowercase letter - the intermediate values of x looking something like:

abc def ghi jkl mno pqr stu vwx yz
Abc def ghi jkl mno pqr stu vwx yz
Abc Def ghi jkl mno pqr stu vwx yz
Abc Def Ghi jkl mno pqr stu vwx yz
Abc Def Ghi Jkl mno pqr stu vwx yz
Abc Def Ghi Jkl Mno pqr stu vwx yz
Abc Def Ghi Jkl Mno Pqr stu vwx yz
Abc Def Ghi Jkl Mno Pqr Stu vwx yz
Abc Def Ghi Jkl Mno Pqr Stu Vwx yz
Abc Def Ghi Jkl Mno Pqr Stu Vwx Yz
[at which point there are no more matches]
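
For this particular task, JavaScript's built-in replace() can also do the looping itself: with the g flag and a function as the second argument, it calls the function once per match. A minimal sketch, with an illustrative function name:

```javascript
// Upper-case every lowercase letter that sits at the start of a word.
function capitaliseWords(text) {
  return text.replace(/\b[a-z]/g, function (match) {
    return match.toUpperCase();
  });
}

const capitalised = capitaliseWords("abc def ghi jkl mno pqr stu vwx yz");
console.log(capitalised);
```

One caveat that applies to both versions: \b also sees a word boundary after an apostrophe, so "susan's" would come out as "Susan'S".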

I'm sure you can probably think of other uses of such a thing. Just be careful to not create infinite loops, as it is easy to do so if the function given does not change the string in such a way as to reduce the matching problem. Things can get quite loopy - try:

function watchMapMatches(r,f,x) {
	print(x);
	while (r.test(x)) {
		x = RegExp.leftContext + f(RegExp.lastMatch) + RegExp.rightContext;
		print(x);	// show each intermediate value, not just the first
	}
	return x;
}

function wonderous(x) {
	var len = x.length;
	if (len < 2) { return ""; }
	if (len % 2 == 0) { return x.substring(0,len>>1); }
	else { return x+x+x + x.substring(0,1); }
}

function strange(x) {
	return watchMapMatches(/.+/,wonderous,x);
}

strange("What wonderous replacement!");

Try it! Will this strange and wonderous function finish for every string you could put in? It's a good (and to the best of my knowledge, open) question.
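
Since /.+/ matches the whole string each time, only the length matters: an even length is halved and an odd length n becomes 3n + 1, which is the rule of the Collatz (3n + 1) problem. The sketch below tracks just the lengths instead of building the strings; the function name is made up for illustration.

```javascript
// Hypothetical helper: the sequence of string lengths, starting from startLength.
// It stops once the length drops below 2, where wonderous() returns "".
function lengthTrace(startLength) {
  const trace = [startLength];
  let len = startLength;
  while (len >= 2) {
    len = (len % 2 === 0) ? len / 2 : 3 * len + 1;
    trace.push(len);
  }
  return trace;
}

const sampleLength = "What wonderous replacement!".length;
const sampleTrace = lengthTrace(sampleLength);
console.log(sampleLength, sampleTrace.length, Math.max.apply(null, sampleTrace));
```

The sample string is 27 characters long, a starting value known for producing a long run before it finally comes down.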

Anyway, cheers!

Cale Gibbard
