Just What Does a Regular Expression Match? (Unix Power Tools, 3rd Edition)
32.17. Just What Does a Regular Expression Match?
One of the toughest things to learn about regular expressions is just what they do match. The problem is that a regular expression tends to find the longest possible match -- which can be more than you want.
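To see the longest-match rule in action, here is a minimal sketch using sed. The sample line and the variable names are made up for illustration. The first substitution brackets everything from the first quote to the last one; the second shows one common way to narrow the match.

```sh
line='say "one" and "two" now'

# ".*" runs from the first double quote to the LAST one on the line
greedy=$(printf '%s\n' "$line" | sed 's/".*"/[&]/')
echo "$greedy"     # say ["one" and "two"] now

# "[^"]*" cannot cross a quote, so it stops at the first closing quote
narrow=$(printf '%s\n' "$line" | sed 's/"[^"]*"/[&]/')
echo "$narrow"     # say ["one"] and "two" now
```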
Go to http://examples.oreilly.com/upt3 for more information on: showmatch
Here's a simple script called showmatch that is useful for testing regular expressions when you're writing sed scripts and the like. Given a regular expression and a filename, it finds lines in the file matching that expression, just like grep, but it uses a row of carets (^^^^) to highlight the portion of the line that was actually matched. Depending on your system, you may need to call nawk instead of awk; most modern systems, however, have an awk that supports the syntax introduced by nawk.
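#! /bin/sh
# showmatch - mark string that matches pattern
pattern=$1; shift
awk 'match($0,pattern) > 0 {
    s = substr($0,1,RSTART-1)        # text before the match
    m = substr($0,RSTART,RLENGTH)    # the matched text itself
    gsub (/[^\b- ]/, " ", s)         # blank out s, keeping tabs so columns line up
    gsub (/./, "^", m)               # one caret per matched character
    printf "%s\n%s%s\n", $0, s, m
}' pattern="$pattern" "$@"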
For example:
% showmatch 'CD-...' mbox
and CD-ROM publishing. We have recognized
    ^^^^^^
that documentation will be shipped on CD-ROM; however,
                                      ^^^^^^
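The carets come straight from the two variables that awk's match function sets: RSTART (where the match begins) and RLENGTH (how long it is). As a quick check, this sketch prints both for the first sample line above; the variable names line and where are only for illustration.

```sh
line='and CD-ROM publishing. We have recognized'

# match() returns the start position (0 if no match) and sets RSTART, RLENGTH
where=$(printf '%s\n' "$line" |
    awk '{ if (match($0, /CD-.../) > 0)
               print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }')
echo "$where"      # 5 6 CD-ROM
```

The match starts at column 5, so showmatch prints four blanks and then six carets.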
Go to http://examples.oreilly.com/upt3 for more information on: xgrep
xgrep is a related script that retrieves only the matched text. This allows you to extract patterned data from a file. For example, you could extract only the numbers from a table containing both text and numbers. It's also great for counting the number of occurrences of some pattern in your file, as shown below. Just be sure that your expression matches only what you want. If you aren't sure, leave off the wc command and glance at the output. For example, the regular expression [0-9]* will match numbers like 3.2 twice: once for the 3 and again for the 2! You'll want to include a dot (.) and/or comma (,), depending on how your numbers are written. For example, [0-9][.0-9]* matches a leading digit, possibly followed by more dots and digits.
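If you don't have xgrep handy, you can try the two expressions with a rough stand-in written in awk. This is a sketch, not the xgrep script itself: the function name extract_matches is hypothetical, it reads standard input only, and it skips empty matches on purpose so that it always terminates.

```sh
extract_matches() {
    # $1 = regular expression (avoid backslashes: awk -v interprets them)
    awk -v re="$1" '{
        line = $0
        # print each non-empty match, then search the rest of the line
        while (match(line, re) > 0 && RLENGTH > 0) {
            print substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
        }
    }'
}

printf 'pi is 3.2\n' | extract_matches '[0-9][0-9]*'    # prints 3, then 2
printf 'pi is 3.2\n' | extract_matches '[0-9][.0-9]*'   # prints 3.2
```

Note that [0-9][.0-9]* would also swallow a period that ends a sentence right after a number, so glance at the output before you count anything.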
NOTE:
Remember that an expression like [0-9]* will also match zero digits (because * means "zero or more of the preceding character"). That expression can make xgrep run for a very long time! The following expression, which matches one or more digits, is probably what you want instead:
xgrep "[0-9][0-9]*" files | wc -l
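You can watch the empty match happen with awk's match function. In this sketch, the helper name probe is made up; it prints the starting position and the length of the first match.

```sh
probe() {
    # $1 = regular expression, $2 = text to search
    printf '%s\n' "$2" | awk -v re="$1" '{ print match($0, re), RLENGTH }'
}

probe '[0-9]*' 'abc 42'         # 1 0  (an empty match at the very start)
probe '[0-9][0-9]*' 'abc 42'    # 5 2  (the digits themselves)
```

The first expression "succeeds" at column 1 with a length of zero, before it ever reaches the digits.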
The xgrep shell script runs the
sed commands below, replacing
$re with the regular expression from the command
line and $x with a CTRL-b character (which is used
as a delimiter). We've shown the
sed commands numbered, like
5>; these are only for reference and
aren't part of the script:
1> \$x$re$x!d
2> s//$x&$x/g
3> s/[^$x]*$x//
4> s/$x[^$x]*$x/\
/g
5> s/$x.*//
Command 1 deletes all input lines that don't contain
a match. On the remaining lines (which do match), command 2 surrounds
the matching text with CTRL-b delimiter characters. Command 3 removes
all characters (including the first delimiter) before the first match
on a line. When there's more than one match on a
line, command 4 breaks the multiple matches onto separate lines.
Command 5 removes the last delimiter, and any text after it, from
every output line.
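Put together, the five commands might be wrapped in a shell function along these lines. This is only a sketch built from the description above, not the xgrep script from the archive: the function name xgrep_sketch and the use of printf to produce the CTRL-b character are assumptions, and it assumes CTRL-b never appears in the input.

```sh
xgrep_sketch() {
    re=$1; shift
    x=$(printf '\002')      # CTRL-b, assumed not to occur in the input
    # Inside double quotes, \\ becomes a single backslash for sed
    sed "\\${x}${re}${x}!d
s//${x}&${x}/g
s/[^${x}]*${x}//
s/${x}[^${x}]*${x}/\\
/g
s/${x}.*//" "$@"
}

printf 'pi is 3.2\nno digits\n' | xgrep_sketch '[0-9][0-9]*'   # prints 3, then 2
```

Because the expression itself appears only between CTRL-b delimiters (the later commands reuse it through the empty pattern //), it can contain slashes without any escaping, as in xgrep_sketch '[a-z]/[a-z]'.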
Greg Ubben revised showmatch and wrote
xgrep.
--JP, DD, and TOR
Copyright © 2003 O'Reilly & Associates. All rights reserved.