Sambar Server Documentation
Sambar Server Regular Expression Help
Regular expressions (regexes) are useful as a way to match inexact sequences of characters. A regex is a specification of a pattern to be matched in the searched text. This pattern consists of a sequence of tokens, each able to match a single character or a sequence of characters in the text, or to assert that a specific position within the text has been reached (the latter is called an anchor). Tokens (also called atoms) can be modified by adding one of a number of special quantifier tokens immediately after the token. A quantifier token specifies how many times the previous token must be matched (see below).
Tokens can be grouped together using one of a number of grouping constructs, the most common being plain parentheses. Tokens that are grouped in this way are also collectively considered to be a regex atom, since this new larger atom may also be modified by a quantifier.
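As a rough illustration of a parenthesized group acting as a single atom, the sketch below uses Python's `re` module. Python is only a stand-in chosen for demonstration; the Sambar engine may differ in details, and the variable names are hypothetical.

```python
import re

# Without parentheses the quantifier applies only to the preceding token 'b'.
ungrouped = re.fullmatch(r"ab+", "abbb")

# With parentheses, 'ab' becomes one atom and the quantifier repeats the pair.
grouped = re.fullmatch(r"(ab)+", "ababab")

# The grouped pattern rejects a run of b's, because 'ab' repeats as a unit.
grouped_on_run = re.fullmatch(r"(ab)+", "abbb")

print(bool(ungrouped), bool(grouped), bool(grouped_on_run))
```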
A regex can also be organized into a list of alternatives by separating
each alternative with pipe characters, `|'. This
is called alternation. A match will be attempted for each alternative
listed, in the order specified, until a match results or the list of
alternatives is exhausted.
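The effect of trying alternatives in the order listed can be seen in a small sketch. It uses Python's `re` module as an assumed stand-in for the engine described here, so the behavior shown is Python's and is only expected to be similar.

```python
import re

# Alternatives are attempted left to right; the first one that matches wins.
first = re.match(r"a|ab", "ab").group()    # 'a' is tried first and succeeds
second = re.match(r"ab|a", "ab").group()   # 'ab' is tried first and succeeds

print(first, second)
```

Listing the longer alternative first is one way to prefer the longer match when one alternative is a prefix of another.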
If an un-escaped dot (`.') appears in a regex, it means to match
any character exactly once. By default dot will not match a newline
character.
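A minimal sketch of the dot's default behavior, written with Python's `re` module as an assumed stand-in (the names `same_line` and `across_newline` are illustrative only):

```python
import re

# The dot matches any single character on the same line...
same_line = re.search(r"a.c", "abc")

# ...but by default it does not match a newline character.
across_newline = re.search(r"a.c", "a\nc")

print(same_line is not None, across_newline is not None)
```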
A character class, or range, matches exactly one character of text, but the candidates for matching are limited to those specified by the class. Classes come in two flavors as described below:
A bracket expression is a list of characters enclosed in `[]'. It normally matches any single character from the list. If the list begins with `^', it matches any single character not from the rest of the list. If two characters in the list are separated by `-', this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, e.g. `[0-9]' in ASCII matches any decimal digit. It is illegal for two ranges to share an endpoint, e.g. `a-c-e'.
The characters that are considered special within a class specification
differ from those in the rest of regex syntax, as follows. If the first
character in a class is the `]' character (second character if the
first character is `^') it is a literal character and part of the
class character set. This also applies if the first or last character
is `-'. Outside of these rules, two characters separated by
`-' form a character range which includes all the characters
between the two characters as well. For example, `[^f-j]' is the
same as `[^fghij]' and means to match any character that is
not `f', `g', `h', `i', or `j'.
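The class rules above can be tried out with a short sketch. It relies on Python's `re` module, which happens to follow the same conventions for a leading `]` and a trailing `-`; whether every detail matches the Sambar engine is an assumption.

```python
import re

# A range inside brackets: any decimal digit.
digits = re.findall(r"[0-9]", "a1b22")

# A negated range and its spelled-out equivalent pick the same characters.
not_f_to_j = re.findall(r"[^f-j]", "efghijk")
not_listed = re.findall(r"[^fghij]", "efghijk")

# ']' placed first in the class is a literal member of the set.
literal_bracket = re.findall(r"[]x]", "a]xb")

# '-' placed last in the class is a literal member of the set.
literal_hyphen = re.findall(r"[a-]", "b-a")

print(digits, not_f_to_j, literal_bracket, literal_hyphen)
```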
Anchors are assertions that you are at a very specific position within the search text. Regular expressions support the following anchor tokens:
Quantifiers specify how many times the previous regular expression atom may be matched in the search text. Some quantifiers can produce a large performance penalty, and can in some instances completely lock up NEdit. To prevent this, avoid nested quantifiers, especially those of the maximal matching type.
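One common way to avoid a nested quantifier is to flatten it when the group adds nothing. The sketch below, using Python's `re` module as an assumed stand-in, checks on a few short strings that `(a+)+b` and `a+b` accept the same input; the sample strings are kept short on purpose, since the nested form can become very slow on long non-matching input.

```python
import re

nested = re.compile(r"(a+)+b")   # quantifier applied to an already quantified group
flat = re.compile(r"a+b")        # single quantifier, accepts the same strings

samples = ["ab", "aaab", "b", "aaac", ""]

# Compare the accept/reject decision of both patterns on every sample.
agree = all(
    bool(nested.fullmatch(s)) == bool(flat.fullmatch(s)) for s in samples
)

print(agree)
```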
The following quantifiers are maximal matching, or "greedy", in that they match as much text as possible.
Examples:
The following quantifiers are minimal matching, or "lazy", in that they match as little text as possible.
One final quantifier is the counting quantifier, or brace quantifier. It takes the following basic form:
The quantifiers `{1}' and `{1,1}' are accepted by the syntax, but are optimized away since they mean to match exactly once, which is redundant information. Also, for efficiency, certain combinations of `min' and `max' are converted to either `*', `+', or `?' as follows:
Note that {0} and {0,0} are meaningless and will generate an error message at regular expression compile time.
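The usual equivalences between brace quantifiers and the shorthand quantifiers are `{0,}` for `*`, `{1,}` for `+`, and `{0,1}` for `?`. That these are the conversions meant above is an assumption. The sketch below checks them on a few strings with Python's `re` module, used here only as a stand-in; the helper `same_language` is hypothetical.

```python
import re

# (brace form, shorthand form) pairs assumed to be equivalent.
pairs = [(r"a{0,}", r"a*"), (r"a{1,}", r"a+"), (r"a{0,1}", r"a?")]
samples = ["", "a", "aa", "aaa", "b"]

def same_language(p, q):
    """Return True if both patterns accept or reject every sample alike."""
    return all(
        bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s)) for s in samples
    )

results = [same_language(p, q) for p, q in pairs]
print(results)
```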
Brace quantifiers can also be "lazy". For example {2,5}? would
try to match 2 times if possible, and will only match 3, 4, or 5 times
if that is what is necessary to achieve an overall match.
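The difference between the greedy and lazy brace forms can be sketched as follows, with Python's `re` module as an assumed stand-in for the engine described here:

```python
import re

text = "aaaaa"

# Greedy: takes as many repetitions as the upper bound allows.
greedy = re.match(r"a{2,5}", text).group()

# Lazy: stops at the minimum of two repetitions.
lazy = re.match(r"a{2,5}?", text).group()

# Lazy, but forced to take four repetitions so the trailing 'b' can match.
forced = re.match(r"a{2,5}?b", "aaaab").group()

print(greedy, lazy, forced)
```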
A series of alternative patterns to match can be specified by
separating them with vertical pipes, `|'. An
example of alternation would be `a|be|sea'. This
will match `a', or `be', or `sea'. Each alternative
can be an arbitrarily complex regular expression. The alternatives are
attempted in the order specified. An empty alternative can be specified
if desired, e.g. `a|b|'. Since an empty alternative can match
nothingness (the empty string), this guarantees that the expression will match.
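A small sketch of the `a|be|sea` example and of a trailing empty alternative, using Python's `re` module as an assumed stand-in (the word list is illustrative only):

```python
import re

words = ["a", "be", "sea", "bee"]

# Keep only the words that match one of the three alternatives in full.
hits = [w for w in words if re.fullmatch(r"a|be|sea", w)]

# The trailing empty alternative matches the empty string, so a match
# is found even when neither 'a' nor 'b' is present.
empty = re.match(r"a|b|", "xyz")

print(hits, repr(empty.group()))
```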
Comments are of the form `(?#<comment text>)' and can be
inserted anywhere and have no effect on the
execution of the regular expression. They can be handy for documenting
very complex regular expressions. Note that a comment begins with
`(?#' and ends at the first occurrence of an ending
parenthesis, or the end of the regular expression... period. Comments do
not recognize any escape sequences.
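Python's `re` module accepts the same `(?#...)` comment form, so it can serve as an assumed stand-in to show that comments leave the match unchanged. The pattern and sample text below are illustrative only.

```python
import re

# The same pattern written with and without embedded comments.
commented = r"[0-9]+(?# one or more digits)px(?# literal unit suffix)"
plain = r"[0-9]+px"

text = "width: 40px"

with_comments = re.search(commented, text).group()
without_comments = re.search(plain, text).group()

print(with_comments, without_comments)
```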
What to Match | Operator | Effect |
Any single character | ? | g?t finds get, got, gut |
Any string of characters (one or more) | + | w+e finds wide, white, write but not we |
Any string of characters (or none) | * | w*e finds wide, white, write and we |
One of the specified characters | [] | g[eo]t finds get and got but not gut |
One of the characters in a range | [-] | [b-p]at finds bat, cat, fat, hat, mat but not rat or sat |
All characters | [] | i[] finds line, list, late |
One expression or another | (|) | W(in|indows) will find Win or Windows |
One or more expressions | +() | +(at) will find atat in catatonic and at in battle |
All characters (perhaps on different lines) | *[] | h[]d finds helped, Hello World, and Hello (cr lf) Win95 World. /\**[]\*/ will match C style comments, on several lines if necessary (*[] will span across multiple lines up to 32767 characters). |
A string that doesn't start with an expression | !() | : !(http) finds : in "following:" but not in "http://www.funduc.com". Note: Syntax for pre-3.1 versions would be !(http): |
One of the characters not in a range | ![-] | [a-z]at!([b-p]at) matches r in "rat" & s in "sat" but nothing in "bat", "cat", "hat". Note: Syntax for pre-3.1 versions would be ![b-p]at |
An expression at the beginning of a line | ^ | ^the finds the at the beginning of a line and The (if case sensitive is turned off) |
An expression at the end of a line | $ | end$ finds end when it is the last string on a line. |
One or more column(s) before or after a string | +n | [h]+4// finds http:// but not https:// |
Using Special Characters | \ | \(\*\) will find (*) |
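The operators in this table use a wildcard-style notation that differs from the regex syntax described earlier. As a rough guide, the sketch below pairs several rows with what are assumed to be equivalent patterns in Python's `re` module and checks them against the words listed in the table. The mapping (for example `?` to `.` and `+` to `.+`) is an interpretation, not something this documentation states, and the helper `check` is hypothetical.

```python
import re

# (table pattern, assumed Python equivalent, should match, should not match)
rows = [
    ("g?t", r"g.t", ["get", "got", "gut"], []),
    ("w+e", r"w.+e", ["wide", "white", "write"], ["we"]),
    ("w*e", r"w.*e", ["wide", "white", "write", "we"], []),
    ("g[eo]t", r"g[eo]t", ["get", "got"], ["gut"]),
    ("[b-p]at", r"[b-p]at", ["bat", "cat", "fat", "hat", "mat"], ["rat", "sat"]),
    ("W(in|indows)", r"W(in|indows)", ["Win", "Windows"], []),
    (r"\(\*\)", r"\(\*\)", ["(*)"], []),
]

def check(row):
    """Return True if the Python pattern accepts and rejects as the table says."""
    _, py_pattern, should_match, should_not_match = row
    accepted = all(re.fullmatch(py_pattern, s) for s in should_match)
    rejected = not any(re.fullmatch(py_pattern, s) for s in should_not_match)
    return accepted and rejected

results = [check(row) for row in rows]
print(results)
```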
© 2001 Sambar Technologies. All rights reserved.