Quality Testing

Quality is delighting customers

Can anybody tell me in which situations you use "REGULAR EXPRESSIONS" in your project?

Replies to This Discussion

For example, suppose you use an automation script to check or delete mail. The Inbox counter may change every day or every hour, going up or down, so you have to use a regular expression to find the exact integer value, like this:

Inbox(500) : You have to detect "(500)", including the brackets, with a regular expression.
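
As a rough sketch of the Inbox counter case, here is how it could look in Python's re module. The label text and the function name inbox_count are assumptions made for illustration; a real script would read the label from whatever mail client it automates.

```python
import re

# Hypothetical label text, as an automation script might read it from a mail UI.
label = "Inbox(500)"

# \( and \) match the literal brackets; (\d+) captures the digits between them.
INBOX_COUNT = re.compile(r"Inbox\((\d+)\)")

def inbox_count(text):
    """Return the counter as an int, or None if the label has no counter."""
    match = INBOX_COUNT.search(text)
    return int(match.group(1)) if match else None

print(inbox_count(label))
```

Because the digits are captured rather than hard-coded, the same check keeps working when the counter changes.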

A Regular Expression is the term used to describe a codified method of searching invented, or defined, by the American mathematician Stephen Kleene.

The syntax (language format) described on this page is compliant with extended regular expressions (EREs) defined in IEEE POSIX 1003.2 (Section 2.8). EREs are now commonly supported by Apache, PERL, PHP4, Javascript 1.3+, MS Visual Studio, MS Frontpage, most visual editors, vi, emacs, the GNU family of tools (including grep, awk and sed) as well as many others. Extended Regular Expressions (EREs) will support Basic Regular Expressions (BREs are essentially a subset of EREs). Most applications, utilities and languages that implement REs, especially PERL, extend the capabilities defined. The appropriate documentation should always be consulted.

Contents

A Gentle Introduction: - the Basics
POSIX Standard Character Classes:
Commonly Available extensions: - \w etc
Submatches, Groups and Backreferences:
Regular Expression Tester: - Experiment with your own target strings and search expressions in your browser
Apache browser recognition: - a worked example
Common examples: - regular expression examples
Notes: - general notes when using utilities and languages
Utility notes: - using Visual Studio regular expressions
Utility notes: - using sed for file manipulation (not for the faint hearted)

A Gentle Introduction: The Basics

The title is a misnomer - there is no gentle beginning to regular expressions. You are either into hieroglyphics big time - in which case you will love this stuff - or you need to use them, in which case your only reward may be a headache.

Some Definitions before we start

We are going to be using the terms literal, metacharacter, target string, escape sequence and search string in this overview. Here is a definition of our terms:

literal: A literal is any character we use in a search or matching expression. For example, to find ind in windows, ind is a literal string: each character plays a part in the search, and it is literally the string we want to find.
metacharacter: A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example, the character ^ (circumflex or caret) is a metacharacter.
target string: This term describes the string that we will be searching, that is, the string in which we want to find our match or search pattern.
search expression: This term describes the expression that we will be using to search our target string, that is, the pattern we use to find what we want.
escape sequence: An escape sequence is a way of indicating that we want to use one of our metacharacters as a literal. In a regular expression an escape sequence involves placing the metacharacter \ (backslash) in front of the metacharacter that we want to use as a literal. For example, if we want to find (s) in the target string window(s) then we use the search expression \(s\), and if we want to find \\file in the target string c:\\file then we would need to use the search expression \\\\file (each of the 2 \ characters we want to search for as a literal is preceded by an escape \).
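
The escape sequence examples above can be tried out with a short Python sketch. Python's re module is only one of many implementations, and the variable names here are made up for illustration.

```python
import re

# Escaping the parentheses makes them literals instead of a group.
found_s = re.search(r"\(s\)", "window(s)")

# The target holds two literal backslashes, so the pattern needs four.
target = "c:" + "\\" * 2 + "file"      # the text c:\\file
found_file = re.search(r"\\\\file", target)

# re.escape builds the escaped form of a literal string for us.
escaped = re.escape("(s)")

print(found_s.group())
print(found_file.group())
print(escaped)
```

Raw strings (the r prefix) stop Python itself from interpreting the backslashes before the regular expression engine sees them.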

Our Example Target Strings

Throughout this guide we will use the following as our target strings:

STRING1   Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
STRING2   Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)

These are Browser ID Strings and appear as the Apache environment variable HTTP_USER_AGENT.

Simple Matching

We are going to try some simple matching against our example target strings:

Note: You can also experiment as you go through the examples.

Search for   Target    Result     Notes
m            STRING1   match      Finds the m in compatible.
             STRING2   no match   There is no lower case m in this string. Searches are case sensitive unless you take special action.
a/4          STRING1   match      Found in Mozilla/4.0 - any combination of characters can be used for the match.
             STRING2   match      Found in the same place as in STRING1.
5 [          STRING1   no match   The search is looking for a pattern of '5 [' and this does NOT exist in STRING1. Spaces are valid in searches.
             STRING2   match      Found in Mozilla/4.75 [en].
in           STRING1   match      Found in Windows.
             STRING2   match      Found in Linux.
le           STRING1   match      Found in compatible.
             STRING2   no match   There is an l and an e in this string but they are not adjacent (or contiguous).
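
If you want to check the simple matches yourself, a small Python sketch like the following will do. The helper name found is hypothetical, and note one difference from the list above: Python's re module needs the [ in '5 [' escaped.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def found(pattern, target):
    """True if the pattern occurs anywhere in the target string."""
    return re.search(pattern, target) is not None

# "5 [" is written as r"5 \[" because Python's re treats a bare [ as the
# start of a bracket expression and rejects the unterminated pattern.
SEARCHES = ["m", "a/4", r"5 \[", "in", "le"]

results = {p: (found(p, STRING1), found(p, STRING2)) for p in SEARCHES}

for pattern, (in_s1, in_s2) in results.items():
    print(f"{pattern!r:10} STRING1={in_s1} STRING2={in_s2}")
```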

Brackets, Ranges and Negation

Bracket expressions introduce our first metacharacters, in this case the square brackets, which allow us to define a list of things to test for rather than the single characters we have been checking up until now. These lists can be grouped into what are known as Character Classes, typically comprising well known groups such as all numbers etc.

Metacharacter

Meaning

[ ]

Match anything inside the square brackets for ONE character position once and only once, for example, [12] means match the target to 1 and if that does not match then match the target to 2 while [0123456789] means match to any character in the range 0 to 9.

-

The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9].

You can define more than one range inside a list, for example, [0-9A-C] means check for 0 to 9 and A to C (but not a to c).

NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.

^

The ^ (circumflex or caret) inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later), for example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z.

NOTE: Spaces, or in this case the lack of them, between ranges are very important.

NOTE: There are some special range values (Character Classes) that are built-in to most regular expression software and have to be if it claims POSIX 1003.2 compliance for either BRE or ERE.

So let's try this new stuff with our target strings.

Search for   Target    Result     Notes
in[du]       STRING1   match      Finds ind in Windows.
             STRING2   match      Finds inu in Linux.
x[0-9A-Z]    STRING1   no match   Again the tests are case sensitive: to find the xt in DigExt we would need to use [0-9a-z] or [0-9A-Zt]. We can also use this format for testing upper and lower case e.g. [Ff] will check for lower and upper case F.
             STRING2   match      Finds x2 in Linux2.
[^A-M]in     STRING1   match      Finds Win in Windows.
             STRING2   no match   We have excluded the range A to M in our search so Linux is not found but linux (if it were present) would be found.
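
The bracket tests can be reproduced with a few lines of Python. This is only a sketch: first_match is a made-up helper that returns the text matched, or None when there is no match.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def first_match(pattern, target):
    """Return the first matched text, or None if nothing matches."""
    m = re.search(pattern, target)
    return m.group() if m else None

# A list: the third character may be d or u.
list_hits = (first_match(r"in[du]", STRING1), first_match(r"in[du]", STRING2))

# Two ranges in one list; the ranges are case sensitive.
range_hits = (first_match(r"x[0-9A-Z]", STRING1), first_match(r"x[0-9A-Z]", STRING2))
lower_hit = first_match(r"x[0-9a-z]", STRING1)

# Negation: any character except A to M, followed by in.
negated_hits = (first_match(r"[^A-M]in", STRING1), first_match(r"[^A-M]in", STRING2))

print(list_hits, range_hits, lower_hit, negated_hits)
```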

Positioning (or Anchors)

We can control where in our target strings the matches are valid. The following is a list of metacharacters that affect the position of the search:

Metacharacter

Meaning

^ The ^ (circumflex or caret) outside square brackets means look only at the beginning of the target string, for example, ^Win will not find Windows in STRING1 but ^Moz will find Mozilla.
$ The $ (dollar) means look only at the end of the target string, for example, fox$ will find a match in 'silver fox' since it appears at the end of the string but not in 'the fox jumped over the moon'.
. The . (period) means any single character in this position, for example, ton. will find tons, tone and tonneau but not wanton because it has no following character.

NOTE: Many systems and utilities, but not all, support special positioning macros, for example \< match at beginning of word, \> match at end of word, \b match at the beginning OR end of word, \B except at the beginning or end of a word.

So let's try this lot out with our example target strings.

Search for   Target    Result     Notes
[a-z]\)$     STRING1   match      Finds t) in DigExt). Note: The \ is an escape character and is required to treat the ) as a literal.
             STRING2   no match   We have a numeric value before the ) at the end of this string, so we would need [0-9a-z]\)$ to find it.
.in          STRING1   match      Finds Win in Windows.
             STRING2   match      Finds Lin in Linux.
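
Here is a small Python sketch of the anchors and the period, using the same target strings. The helper first_match is hypothetical and simply returns the matched text or None.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def first_match(pattern, target):
    """Return the first matched text, or None if nothing matches."""
    m = re.search(pattern, target)
    return m.group() if m else None

# ^ anchors to the start of the target string.
starts = (first_match(r"^Moz", STRING1), first_match(r"^Win", STRING1))

# $ anchors to the end; the closing parenthesis must be escaped.
ends = (first_match(r"[a-z]\)$", STRING1), first_match(r"[a-z]\)$", STRING2))
ends_with_digit = first_match(r"[0-9a-z]\)$", STRING2)

# . stands for exactly one character.
dots = (first_match(r".in", STRING1), first_match(r".in", STRING2))
ton = (first_match(r"ton.", "tons"), first_match(r"ton.", "wanton"))

print(starts, ends, ends_with_digit, dots, ton)
```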

Iteration 'metacharacters'

The following is a set of iteration metacharacters (a.k.a. quantifiers) that can control the number of times a character or string is found in our searches.

Metacharacter

Meaning

? The ? (question mark) matches the preceding character 0 or 1 times only, for example, colou?r will find both color (0 times) and colour (1 time).
* The * (asterisk or star) matches the preceding character 0 or more times, for example, tre* will find tree (2 times) and tread (1 time) and trough (0 times).
+ The + (plus) matches the previous character 1 or more times, for example, tre+ will find tree (2 times) and tread (1 time) but not trough (0 times).
{n} Matches the preceding character, or character range, n times exactly, for example, to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567. Note: The - (dash) in this case, because it is outside the square brackets, is a literal. Value is enclosed in braces (curly brackets).

{n,m} Matches the preceding character at least n times but not more than m times, for example, 'ba{2,3}b' will find 'baab' and 'baaab' but NOT 'bab' or 'baaaab'. Values are enclosed in braces (curly brackets).

So let's try them out with our example target strings.

Search for        Target    Result     Notes
\(.*l             STRING1   match      Finds l in compatible. (Note: The opening \ is an escape sequence used to indicate the ( it precedes is a literal not a metacharacter.)
                  STRING2   no match   Mozilla contains ll but not preceded by an open parenthesis (no match) and Linux has an upper case L (no match).
W*in              STRING1   match      Finds the Win in Windows.
                  STRING2   match      Finds in in Linux preceded by W zero times - so a match.
[xX][0-9a-z]{2}   STRING1   no match   Finds x in DigExt but only one t.
                  STRING2   match      Finds X and 11 in X11.

Note: We had previously defined the first test above using the search value l? (thanks to David Werner Wiebe for pointing out our error). The search expression l? actually means find anything, even if it has no l (l 0 or 1 times), so would match on both strings. We had been looking for a method to find a single l and exclude ll which, without lookahead (a relatively new extension to regular expressions pioneered by PERL) is pretty difficult. Well that is our excuse.
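
The iteration metacharacters are easy to experiment with in Python. The sketch below assumes Python's re module and a made-up helper first_match; the sample words are the ones used in the definitions above, plus a hypothetical sentence for the phone number.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def first_match(pattern, target):
    """Return the first matched text, or None if nothing matches."""
    m = re.search(pattern, target)
    return m.group() if m else None

# ? : the u may appear 0 or 1 times.
spellings = [w for w in ("color", "colour", "colouur") if re.fullmatch(r"colou?r", w)]

# * accepts zero e's, + insists on at least one.
star = [first_match(r"tre*", w) for w in ("tree", "tread", "trough")]
plus = [first_match(r"tre+", w) for w in ("tree", "tread", "trough")]

# {n} and {n,m} count repetitions exactly or within bounds.
phone = first_match(r"[0-9]{3}-[0-9]{4}", "call 123-4567 today")
bounded = [w for w in ("bab", "baab", "baaab", "baaaab") if re.fullmatch(r"ba{2,3}b", w)]

# The three searches tried against the target strings.
paren = (first_match(r"\(.*l", STRING1), first_match(r"\(.*l", STRING2))
w_star = (first_match(r"W*in", STRING1), first_match(r"W*in", STRING2))
two_after_x = (first_match(r"[xX][0-9a-z]{2}", STRING1),
               first_match(r"[xX][0-9a-z]{2}", STRING2))

print(spellings, star, plus, phone, bounded, paren, w_star, two_after_x)
```

Note that .* is greedy, so \(.*l runs from the open parenthesis to the last lower case l it can reach.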

More 'metacharacters'

The following is a set of additional metacharacters that provide added power to our searches:

Metacharacter

Meaning

() The ( (open parenthesis) and ) (close parenthesis) may be used to group (or bind) parts of our search expression together.
| The | (vertical bar or pipe) is called alternation in techspeak and means find the left hand OR right values, for example, gr(a|e)y will find 'gray' or 'grey'.

<humblepie> In our examples, we blew this expression ^([L-Z]in), we incorrectly stated that this would negate the tests [L-Z], the '^' only performs this function inside square brackets, here it is outside the square brackets and is an anchor indicating 'start from first character'. Many thanks to Mirko Stojanovic for pointing it out and apologies to one and all.</humblepie>

So let's try these out with our example strings.

Search for                Target    Result     Notes
^([L-Z]in)                STRING1   no match   The '^' is an anchor indicating first position. Win does not start the string so no match.
                          STRING2   no match   The '^' is an anchor indicating first position. Linux does not start the string so no match.
((4\.[0-3])|(2\.[0-3]))   STRING1   match      Finds the 4.0 in Mozilla/4.0.
                          STRING2   match      Finds the 2.2 in Linux2.2.16-22.
(W|L)in                   STRING1   match      Finds Win in Windows.
                          STRING2   match      Finds Lin in Linux.
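
Grouping and alternation behave the same way in Python's re module, so the three searches can be checked with a sketch like this one (first_match is again a hypothetical helper).

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def first_match(pattern, target):
    """Return the first matched text, or None if nothing matches."""
    m = re.search(pattern, target)
    return m.group() if m else None

# Outside square brackets ^ is an anchor, not a negation.
anchored = (first_match(r"^([L-Z]in)", STRING1), first_match(r"^([L-Z]in)", STRING2))

# Alternation between two grouped version patterns.
versions = (first_match(r"((4\.[0-3])|(2\.[0-3]))", STRING1),
            first_match(r"((4\.[0-3])|(2\.[0-3]))", STRING2))

# Alternation inside a group, followed by a literal.
w_or_l = (first_match(r"(W|L)in", STRING1), first_match(r"(W|L)in", STRING2))
greys = [w for w in ("gray", "grey", "groy") if re.fullmatch(r"gr(a|e)y", w)]

print(anchored, versions, w_or_l, greys)
```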

More Stuff

For more information on regular expressions go to our links pages under Languages/regex. There are lots of folks who get a real buzz out of making any search a 'one liner' and they are incredibly helpful at telling you how they did it. Welcome to the wonderful, if arcane, world of Regular Expressions. You may want to play around with your new found knowledge using this tool.

POSIX Character Class Definitions

POSIX 1003.2 section 2.8.3.2 (6) defines a set of character classes that denote certain common ranges. They tend to look very ugly but have the advantage that they also take into account the 'locale', that is, any variant of the local language/coding system. Many utilities/languages provide short-hand ways of invoking these classes. Strictly the names used and hence their contents reference the LC_CTYPE POSIX definition (1003.2 section 2.5.2.1).

Value

Meaning

[:digit:] Only the digits 0 to 9
[:alnum:] Any alphanumeric character 0 to 9 OR A to Z or a to z.
[:alpha:] Any alpha character A to Z or a to z.
[:blank:] Space and TAB characters only.
[:xdigit:] Hexadecimal notation 0-9, A-F, a-f.
[:punct:] Punctuation symbols . , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ { } | ~
[:print:] Any printable character.
[:space:] Any whitespace characters (space, tab, NL, FF, VT, CR). Many systems abbreviate as \s.
[:graph:] Any visible character, that is, any printable character except the space.
[:upper:] Any alpha character A to Z.
[:lower:] Any alpha character a to z.
[:cntrl:] Control Characters NL CR LF TAB VT FF NUL SOH STX ETX EOT ENQ ACK SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC IS1 IS2 IS3 IS4 DEL.

These are always used inside square brackets in the form [[:alnum:]] or combined as [[:digit:]a-d]
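
Not every implementation accepts the POSIX class names. Python's re module, for one, does not, so a workaround is to expand the names into plain ranges first. The sketch below is an assumption-laden illustration: the table POSIX_ASCII and the helper expand_posix are made up, cover only a few classes, and use fixed ASCII ranges, so the locale advantage mentioned above is lost.

```python
import re

# Hypothetical ASCII-only stand-ins for some POSIX classes (locale is ignored).
POSIX_ASCII = {
    "digit": "0-9",
    "alpha": "A-Za-z",
    "alnum": "0-9A-Za-z",
    "upper": "A-Z",
    "lower": "a-z",
    "xdigit": "0-9A-Fa-f",
    "blank": r" \t",
}

def expand_posix(pattern):
    """Replace [:name:] inside a bracket expression with explicit ranges."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

combined = expand_posix("[[:digit:]a-d]")
hex_word = re.fullmatch(expand_posix("[[:xdigit:]]+"), "DEADbeef42")

print(combined)
print(hex_word is not None)
```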

Common Extensions and Abbreviations

Some utilities and most languages provide extensions or abbreviations to simplify(!) regular expressions. These tend to fall into Character Classes or position extensions and the most common are listed below. In general these extensions are defined by PERL and implemented in what is called PCRE (Perl Compatible Regular Expressions), which has been implemented in the form of a library that has been ported to many systems.

While the \x type syntax can look initially confusing, the backslash precedes a character that does not normally need escaping and hence can be interpreted correctly by the utility or language - whereas we simple humans tend to become confused more easily. The following are supported by: .NET, PHP, PERL, RUBY, PYTHON, Javascript as well as many others.

Character Class Abbreviations
\d Match any character in the range 0 - 9 (equivalent of POSIX [:digit:])
\D Match any character NOT in the range 0 - 9 (equivalent of POSIX [^[:digit:]])
\s Match any whitespace characters (space, tab etc.). (equivalent of POSIX [:space:] EXCEPT VT is not recognized)
\S Match any character NOT whitespace (space, tab). (equivalent of POSIX [^[:space:]])
\w Match any character in the range 0 - 9, A - Z, a - z and the underscore _ (equivalent of POSIX [[:alnum:]_])
\W Match any character NOT in the range 0 - 9, A - Z, a - z or the underscore _ (equivalent of POSIX [^[:alnum:]_])
Positional Abbreviations
\b Word boundary. Match any character(s) at the beginning (\bxx) and/or end (xx\b) of a word, thus \bton\b will find ton but not tons, but \bton will find tons.
\B Not word boundary. Match any character(s) NOT at the beginning(\Bxx) and/or end (xx\B) of a word, thus \Bton\B will find wantons but not tons, but ton\B will find both wantons and tons.
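
The abbreviations can be tried in Python as a quick sketch. The re.ASCII flag is used here so that \d and \w mean the plain ranges listed above; without it Python also accepts non-ASCII digits and letters. The helper name has and the sample texts are made up for illustration.

```python
import re

def has(pattern, text):
    """True if the pattern is found in the text (ASCII-only classes)."""
    return re.search(pattern, text, re.ASCII) is not None

# \d+ pulls runs of digits out of a browser string.
digits = re.findall(r"\d+", "Mozilla/4.0 (compatible; MSIE 5.0)", re.ASCII)

# \w includes the underscore, which a plain alphanumeric range does not.
word_chars = has(r"^\w+$", "isW3C_flag")
alnum_only = has(r"^[0-9A-Za-z]+$", "isW3C_flag")

# \b and \B with the ton examples.
boundary = {
    r"\bton\b": [w for w in ("ton", "tons", "wantons") if has(r"\bton\b", w)],
    r"\bton": [w for w in ("ton", "tons", "wantons") if has(r"\bton", w)],
    r"\Bton\B": [w for w in ("ton", "tons", "wantons") if has(r"\Bton\B", w)],
    r"ton\B": [w for w in ("ton", "tons", "wantons") if has(r"ton\B", w)],
}

print(digits, word_chars, alnum_only)
print(boundary)
```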

Submatches, Groups and Backreferences

Some regular expression implementations provide the last results of each separate match enclosed in parentheses (called a submatch, group or backreference) in variables that may subsequently be used or substituted in an expression. There may be one or more such groupings in an expression. These variables are usually numbered $1 to $9, where $1 will contain the first submatch, $2 will contain the second submatch and so on. The $x format typically persists until it is referenced in some expression or until another regular expression is encountered. Example:

# assume target string = "cat"
search expression = (c|a)(t|z)
$1 will contain "a"
# $1 contains "a" because the overall match is "at":
# (c|a) first tries "c", but the next character "a" is not matched
# by (t|z), so the search moves on one position
# if the target string was "act"
# $1 would contain "c"
$2 will contain "t"

# OpenLDAP 'access to' directive example: assume target dn
# is "ou=something,cn=my name,dc=example,dc=com"
# then $1 = 'my name' at end of match below
# because first regular expression does not have ()
access to dn.regex="ou=[^,]+,cn=([^,]+),dc=example,dc=com"
 by dn.exact,expand="cn=$1,dc=example,dc=com"

PERL, Ruby and the OpenLDAP access to directive support submatches.
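
Python also supports submatches, although it exposes them through the match object rather than $1 style variables. The following sketch reruns the cat and dn examples with Python's re module; only the pattern and the dn value come from the examples above, the variable names are made up.

```python
import re

# Submatches of (c|a)(t|z): group 1 plays the role of $1, group 2 of $2.
cat_groups = re.search(r"(c|a)(t|z)", "cat").groups()
act_groups = re.search(r"(c|a)(t|z)", "act").groups()

# The dn pattern from the OpenLDAP example; only the cn part is in parentheses.
dn = "ou=something,cn=my name,dc=example,dc=com"
dn_match = re.search(r"ou=[^,]+,cn=([^,]+),dc=example,dc=com", dn)
common_name = dn_match.group(1)

# Substituting the submatch into a new string, as the expand= clause does.
expanded = "cn={},dc=example,dc=com".format(common_name)

print(cat_groups, act_groups)
print(expanded)
```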

When used within a single expression these submatches are typically called groups or backreferences and are placed in numeric variables (typically addressed using \1 to \9). These groups or backreferences (variables) may be substituted within the regular expression. The following demonstrates usage:

# the following expression finds any occurrence of double characters
(.)\1
# the parenthesis creates the grouping (or submatch or backreference)
# in this case it is the first (only), so is referenced by \1
# the . (dot) finds any character and the \1 substitutes whatever
# character was found by the dot in the next character position,
# thus to match it must find two consecutive characters which are the same
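
The double character expression works unchanged in Python, as this short sketch shows. The sample words are assumptions chosen for illustration.

```python
import re

DOUBLE = re.compile(r"(.)\1")

# The first doubled character in each word, or None if there is none.
samples = ("Mozilla", "Linux", "aardvark", "committee")
doubles = {w: (m.group() if (m := DOUBLE.search(w)) else None) for w in samples}

# \1 can also be used in the replacement text: collapse each pair to one character.
collapsed = DOUBLE.sub(r"\1", "committee")

print(doubles)
print(collapsed)
```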

Apache Browser Identification - a Worked Example

All we ever wanted to do with Regular Expressions was to find enough about visiting browsers arriving at our Apache powered web site to decide what HTML/CSS to supply or not for our pop-out menus. The Apache BrowserMatch directives will set a variable if the expression matches the USER_AGENT string.

We want to know:

  • If we have any browser that supports Javascript (isJS).
  • If we have any browser that supports the MSIE DHTML Object Model (isIE).
  • If we have any browser that supports the W3C DOM (isW3C).

Here in their glory are the Apache regular expression statements we used (maybe you can understand them now)

BrowserMatchNoCase [Mm]ozilla/[4-6] isJS
BrowserMatchNoCase MSIE isIE
BrowserMatchNoCase [Gg]ecko isW3C
BrowserMatchNoCase MSIE.((5\.[5-9])|([6-9])) isW3C
BrowserMatchNoCase W3C_ isW3C

Notes:

  • Line 1 checks for any upper or lower case variant of Mozilla/4-6 (MSIE also sets this value). This test sets the variable isJS for all version 4-6 browsers (we assume that version 3 and lower do not support Javascript or at least not a sensible Javascript).
  • Line 2 checks for MSIE only (line 1 will take out any MSIE 1-3 browsers even if this variable is set).
  • Line 3 checks for any upper or lower case variant of the Gecko browser which includes Firefox, Netscape 6, 7 and now 8 and the Moz clones (all of which are Mozilla/5).
  • Line 4 checks for MSIE 5.5 (or greater) OR MSIE 6+.
    NOTE about binding: This expression does not work:

    BrowserMatchNoCase MSIE.(5\.[5-9])|([6-9]) isW3C 

    It incorrectly sets variable isW3C if any digit 6 - 9 appears anywhere in the string. Alternation (|) has the lowest precedence of the regular expression operators, so the first parenthesis binds directly to the MSIE expression and the OR and second parenthesis are treated as a separate expression. Adding the enclosing parentheses fixed the problem.

  • Line 5 checks for W3C_ in any part of the line. This allows us to identify the W3C validation services (either CSS or HTML/XHTML page validation).
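
The binding problem from the note on line 4 can be demonstrated outside Apache. In the sketch below Python's re module stands in for Apache's regular expression engine, re.IGNORECASE stands in for the NoCase part of the directive, and the MSIE 6.0 browser string is a made-up example.

```python
import re

NETSCAPE = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"
MSIE5 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
MSIE6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"  # hypothetical string

# Without the enclosing parentheses: (MSIE.5.5-5.9) OR (any digit 6 to 9).
BROKEN = re.compile(r"MSIE.(5\.[5-9])|([6-9])", re.IGNORECASE)

# With them: MSIE followed by (5.5-5.9 OR a digit 6 to 9).
FIXED = re.compile(r"MSIE.((5\.[5-9])|([6-9]))", re.IGNORECASE)

def is_w3c(regex, user_agent):
    """True if the expression would set isW3C for this browser string."""
    return regex.search(user_agent) is not None

broken_results = [is_w3c(BROKEN, ua) for ua in (NETSCAPE, MSIE5, MSIE6)]
fixed_results = [is_w3c(FIXED, ua) for ua in (NETSCAPE, MSIE5, MSIE6)]

print(broken_results)
print(fixed_results)
```

The broken form flags the Netscape string purely because of the 7 in 4.75, which is the symptom described in the note.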

Some of the above checks may be a bit excessive, for example, is Mozilla ever spelled mozilla?, but it is also pretty silly to have code fail just because of this 'easy to prevent' condition. There is apparently no final consensus that all Gecko browsers will have to use Gecko in their 'user-agent' string but it would be extremely foolish not to since this would force guys like us to make huge numbers of tests for branded products and the more likely outcome would be that we would not.

Regular Expression - Experiments and Testing

This simple regular expression tester lets you experiment using your browser's regular expression Javascript function (use View Source in your browser for the Javascript source code).

Enter or copy/paste the string you want to search in the box labeled String: and the regular expression in the box labeled RE:, click the Search button and results will appear in the box labeled Results:. If you are very lucky the results may even be what you expect. This tester displays the whole searched string in the Results: field and encloses in < > the first found result. That may not be terribly helpful if you are dealing with HTML - but our heart is in the right place. All matches are displayed separately showing the found text and its character position in the string.

Checking the Case Insensitive: box makes the search case insensitive thus [AZ] will find the "a" in "cat", whereas without checking the box [aZ] would be required to find the "a" in "cat". Note: Not all regular expression systems provide a case insensitivity feature and therefore the regular expression may not be portable.

Checking Results only will suppress display of the marked up original string and only show the results found, undoing all our helpful work, but which can make things a little less complicated especially if dealing with HTML strings or anything else with multiple < > symbols. Clear will zap all the fields - including the regular expression that you just took 6 hours to develop. Use with care. See the notes below for limitations, support and capabilities.
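
If the browser tester is not to hand, its behaviour is easy to approximate. The sketch below is a Python stand-in, not the original Javascript: the function name run_test and its parameters are made up, and it only mimics the marking of the first result and the list of matches with their positions.

```python
import re

def run_test(pattern, target, ignore_case=False):
    """Return (marked string, list of (found text, position)) for a search."""
    flags = re.IGNORECASE if ignore_case else 0
    found = [(m.group(), m.start()) for m in re.finditer(pattern, target, flags)]
    first = re.search(pattern, target, flags)
    if first is None:
        return target, found
    # Enclose only the first result in < >, as the tester does.
    marked = (target[:first.start()] + "<" + first.group() + ">"
              + target[first.end():])
    return marked, found

insensitive = run_test("[AZ]", "cat", ignore_case=True)
sensitive = run_test("[AZ]", "cat")
at_start = run_test("Moz", "Mozilla/4.0")

print(insensitive)
print(sensitive)
print(at_start)
```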

<ouch> We had an error such that if the match occurred in the first position the enclosing <> was incorrectly displayed.</ouch>

Hi Malay,

Nice explanation, Thanks for sharing.

Lokesh

 

 

Thank you so much for your reply. Could you please explain the recovery scenarios concept as well?

Hi,

Recovery scenarios handle unexpected situations that might occur while the test is running.

E.g. 1: If you have scheduled an automation script and there is a sudden power cut (non-functional), your PC switches off and you have to restart all the scripts. With the help of a recovery scenario you can solve this problem using testing tools, logic and coding. You can consider many scenarios (functional and non-functional).

E.g. 2: You can use a recovery scenario with a performance-related application to recover from a bad request or something similar.

 

===============================================================

 

 

Regards,

Malay

 



© 2022   Created by Quality Testing.   Powered by