Quick Reference: awk (Unix Power Tools, 3rd Edition)

20.10. Quick Reference: awk

Up to this point, we've shown you tools to do basic batch editing of text files. These tools, although powerful, have limitations. Although you can script ex commands, the range of text manipulation is quite limited. If you need more powerful and flexible batch editing tools, you need to look at programming languages that are designed for text manipulation. One of the earliest Unix languages to do this is awk, created by Al Aho, Peter Weinberger, and Brian Kernighan. Even if you've never programmed before, there are some simple but powerful ways that you can use awk. Whenever you have a text file that's arranged in columns from which you need to extract data, awk should come to mind.

For example, every Red Hat Linux system stores its version number in /etc/redhat-release. On my system, it looks like this:

    Red Hat Linux release 7.1 (Seawolf)

When applying new RPM files to your system, it is often helpful to know which Red Hat version you're using. On the command line, you can retrieve just that number with:

    awk '{print $5}' /etc/redhat-release

What's going on here? By default, awk splits each line of input on whitespace, as is explained below. In effect, it's like you are looking at one row of a spreadsheet. In spreadsheets, columns are usually named with letters. In awk, columns are numbered and you can see only one row (that is, one line of input) at a time. The Red Hat version number is in the fifth column. Similar to the way shells use $ for variable interpolation, the values of columns in awk are retrieved using variables that start with $ and are followed by an integer.

As you can guess, this is a fairly simple demonstration of awk, which includes support for regular expressions, branching and looping, and subroutines. For a more complete reference on using awk, see Effective awk Programming or sed & awk Pocket Reference, both published by O'Reilly.
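To try the field splitting described above without a Red Hat system at hand, the sample line can be fed to awk on standard input. This is only a minimal sketch: the shell variable names (line, version, count, last) are made up for illustration, and NF (the number of fields) is one of the system variables listed in Section 20.10.3.

```shell
# The sample /etc/redhat-release line from the text, kept in a variable
# so the example runs on any system.
line='Red Hat Linux release 7.1 (Seawolf)'

# $5 is the fifth whitespace-separated field.
version=$(printf '%s\n' "$line" | awk '{ print $5 }')

# NF holds the number of fields, so $NF is the last field.
count=$(printf '%s\n' "$line" | awk '{ print NF }')
last=$(printf '%s\n' "$line" | awk '{ print $NF }')

echo "version=$version fields=$count last=$last"
```

Here the version string lands in field 5 only because the release line has exactly this layout; a file with a different wording would need a different field number.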
Since there are many flavors of awk, such as nawk and gawk (Section 18.11), this article tries to provide a usable reference for the most common elements of the language. Dialect differences, when they occur, are noted. With the exception of array subscripts, values in [brackets] are optional; don't type the [ or ].

20.10.1. Command-Line Syntax

awk can be invoked in one of two ways:

    awk [options] 'script' [var=value] [file(s)]
    awk [options] -f scriptfile [var=value] [file(s)]

You can specify a script directly on the command line, or you can store a script in a scriptfile and specify it with -f. In most versions, the -f option can be used multiple times. The variable var can be assigned a value on the command line. The value can be a literal, a shell variable ($name), or a command substitution (`cmd`), but the value is not yet available in the BEGIN procedure; it is assigned only when awk starts reading input (i.e., after the BEGIN statement has run). awk operates on one or more file(s). If none are specified (or if - is specified), awk reads from the standard input (Section 43.1).

The other recognized options are:

-Fc
    Set the field separator to character c. This is the same as setting the system variable FS. nawk allows c to be a regular expression (Section 32.4). Each record (by default, one input line) is divided into fields by whitespace (blanks or tabs) or by some other user-definable field separator. Fields are referred to by the variables $1, $2, . . . $n. $0 refers to the entire record. For example, to print the first three (colon-separated) fields on separate lines:

    % awk -F: '{print $1; print $2; print $3}' /etc/passwd

-v var=value
    Assign a value to variable var. This allows assignment before the script begins execution. (Available in nawk only.)

20.10.2. Patterns and Procedures

awk scripts consist of patterns and procedures:

    pattern {procedure}

Both are optional. If pattern is missing, {procedure} is applied to all records. If {procedure} is missing, the matched record is written to the standard output.
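The difference between -F, -v, and a var=value operand is easiest to see side by side. The sketch below assumes a nawk-compatible awk (nawk, gawk, or another POSIX awk) because it uses -v; the passwd-style record and the variable names (record, name, early, late, who) are invented for the example.

```shell
# A made-up passwd-style record (hypothetical data, not from a real system).
record='jdoe:x:1001:100:Jane Doe:/home/jdoe:/bin/sh'

# -F: makes the colon the field separator, so $1 is the login name and
# $5 is the full-name field (the space inside it is preserved).
name=$(printf '%s\n' "$record" | awk -F: '{ print $1 " is " $5 }')

# -v assigns before the script starts, so BEGIN can already see the value.
early=$(awk -v who=set 'BEGIN { print "BEGIN sees [" who "]" }')

# A var=value operand is assigned only when awk reaches it in the argument
# list, which happens after BEGIN has run.
late=$(printf 'x\n' | awk 'BEGIN { print "BEGIN sees [" who "]" }
                                 { print "main sees [" who "]" }' who=set)

printf '%s\n' "$name" "$early" "$late"
```

In the last command the BEGIN procedure prints empty brackets, while the main procedure, run once for the single input line, sees the assigned value.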
20.10.2.1. Patterns

pattern can be any of the following:

    /regular expression/
    relational expression
    pattern-matching expression
    BEGIN
    END

Expressions can be composed of quoted strings, numbers, operators, functions, defined variables, and any of the predefined variables described later in Section 20.10.3.

Regular expressions use the extended set of metacharacters, as described in Section 32.15. In addition, ^ and $ (Section 32.5) can be used to refer to the beginning and end of a field, respectively, rather than the beginning and end of a record (line).

Relational expressions use the relational operators listed in Section 20.10.4 later in this article. Comparisons can be either string or numeric. For example, $2 > $1 selects records for which the second field is greater than the first.

Pattern-matching expressions use the operators ~ (match) and !~ (don't match). See Section 20.10.4 later in this article.

The BEGIN pattern lets you specify procedures that will take place before the first input record is processed. (Generally, you set global variables here.)

The END pattern lets you specify procedures that will take place after the last input record is read.

Except for BEGIN and END, patterns can be combined with the Boolean operators || (OR), && (AND), and ! (NOT). A range of lines can also be specified using comma-separated patterns:

    pattern,pattern

20.10.2.2. Procedures

procedure can consist of one or more commands, functions, or variable assignments, separated by newlines or semicolons (;), and contained within curly braces ({}). Commands fall into four groups:

    Variable or array assignments
    Printing commands
    Built-in functions
    Control-flow commands

20.10.2.3. Simple pattern-procedure examples

Print the first field of each line:

    { print $1 }

Print all lines that contain pattern:

    /pattern/

Print first field of lines that contain pattern:

    /pattern/{ print $1 }

Print records containing more than two fields:

    NF > 2

Interpret input records as a group of lines up to a blank line:

    BEGIN { FS = "\n"; RS = "" }
    { ...process records... }

Print fields 2 and 3 in switched order, but only on lines whose first field matches the string URGENT:

    $1 ~ /URGENT/ { print $3, $2 }

Count the lines that match pattern and print the total:

    /pattern/ { ++x }
    END { print x }

Add numbers in second column and print total:

    { total += $2 }
    END { print "column total is", total }

Print lines that contain fewer than 20 characters:

    length($0) < 20

Print each line that begins with Name: and that contains exactly seven fields:

    NF == 7 && /^Name:/

20.10.3. awk System Variables

nawk supports all awk variables. gawk supports both nawk and awk.

    Version  Variable  Description
    awk      FILENAME  Current filename
             FS        Field separator (default is whitespace)
             NF        Number of fields in current record
             NR        Number of the current record
             OFMT      Output format for numbers (default is %.6g)
             OFS       Output field separator (default is a blank)
             ORS       Output record separator (default is a newline)
             RS        Record separator (default is a newline)
             $0        Entire input record
             $n        nth field in current record; fields are separated by FS
    nawk     ARGC      Number of arguments on command line
             ARGV      An array containing the command-line arguments
             ENVIRON   An associative array of environment variables
             FNR       Like NR, but relative to the current file
             RSTART    First position in the string matched by match function
             RLENGTH   Length of the string matched by match function
             SUBSEP    Separator character for array subscripts (default is \034)

20.10.4. Operators

This table lists the operators, in increasing precedence, that are available in awk.
    Symbol                Meaning
    = += -= *= /= %= ^=   Assignment (^= only in nawk and gawk)
    ?:                    C conditional expression (nawk and gawk)
    ||                    Logical OR
    &&                    Logical AND
    ~ !~                  Match regular expression and negation
    < <= > >= != ==       Relational operators
    (blank)               Concatenation
    + -                   Addition, subtraction
    * / %                 Multiplication, division, and modulus
    + - !                 Unary plus and minus, and logical negation
    ^                     Exponentiation (nawk and gawk)
    ++ --                 Increment and decrement, either prefix or postfix
    $                     Field reference

20.10.5. Variables and Array Assignments

Variables can be assigned a value with an equal sign (=). For example:

    FS = ","

Expressions using the operators +, -, *, /, and % (modulus) can be assigned to variables.

Arrays can be created with the split function (see below), or they can simply be named in an assignment statement. Array elements can be subscripted with numbers (array[1], . . . ,array[n]) or with names (as associative arrays). For example, to count the number of occurrences of a pattern, you could use the following script:

    /pattern/ { array["pattern"]++ }
    END { print array["pattern"] }

20.10.6. Group Listing of awk Commands

awk commands may be classified as follows:

    Arithmetic functions  String functions  Control flow statements  Input/Output processing
    atan2[57]             gsub[57]          break                    close[57]
    cos[57]               index             continue                 delete[57]
    exp                   length            do/while[57]             getline[57]
    int                   match[57]         exit                     next
    log                   split             for                      print
    rand[57]              sub[57]           if                       printf
    sin[57]               substr            return[57]               sprintf
    sqrt                  tolower[57]       while                    system[57]
    srand[57]             toupper[57]

    [57] Not in original awk.

20.10.7. Alphabetical Summary of Commands

The following alphabetical list of statements and functions includes all that are available in awk, nawk, or gawk. Unless otherwise mentioned, the statement or function is found in all versions. New statements and functions introduced with nawk are also found in gawk.

atan2
    atan2(y,x)
    Returns the arctangent of y/x in radians. (nawk)

break
    Exit from a while, for, or do loop.
close
    close(filename-expr)
    close(command-expr)
    In some implementations of awk, you can have only ten files open simultaneously and one pipe; modern versions allow more than one pipe open. Therefore, nawk provides a close statement that allows you to close a file or a pipe. close takes as an argument the same expression that opened the pipe or file. (nawk)

continue
    Begin next iteration of while, for, or do loop immediately.

cos
    cos(x)
    Return cosine of x (in radians). (nawk)

delete
    delete array[element]
    Delete element of array. (nawk)

do
    do
        body
    while (expr)
    Looping statement. Execute statements in body, then evaluate expr. If expr is true, execute body again. More than one command must be put inside braces ({}). (nawk)

exit
    exit [expr]
    Do not execute remaining instructions and do not read new input. END procedure, if any, will be executed. The expr, if any, becomes awk's exit status (Section 34.12).

exp
    exp(arg)
    Return e raised to the power arg (the natural exponential of arg).

for
    for ([init-expr]; [test-expr]; [incr-expr])
        command
    C-language-style looping construct. Typically, init-expr assigns the initial value of a counter variable. test-expr is a relational expression that is evaluated each time before executing the command. When test-expr is false, the loop is exited. incr-expr is used to increment the counter variable after each pass. A series of commands must be put within braces ({}). For example:

    for (i = 1; i <= 10; i++)
        printf "Element %d is %s.\n", i, array[i]

for
    for (item in array)
        command
    For each item in an associative array, do command. More than one command must be put inside braces ({}). Refer to each element of the array as array[item].

getline
    getline [var] [<file]
        or
    command | getline [var]
    Read next line of input. Original awk does not support the syntax to open multiple input streams. The first form reads input from file, and the second form reads the standard output of a Unix command.
    Both forms read one line at a time, and each time the statement is executed, it gets the next line of input. The line of input is assigned to $0, and it is parsed into fields, setting NF, NR, and FNR. If var is specified, the result is assigned to var and $0 is not changed. Thus, if the result is assigned to a variable, the current line does not change. getline is actually a function, and it returns 1 if it reads a record successfully, 0 if end-of-file is encountered, and -1 if for some reason it is otherwise unsuccessful. (nawk)

gsub
    gsub(r,s[,t])
    Globally substitute s for each match of the regular expression r in the string t. Return the number of substitutions. If t is not supplied, defaults to $0. (nawk)

if
    if (condition)
        command
    [else
        command]
    If condition is true, do command(s), otherwise do command(s) in else clause (if any). condition can be an expression that uses any of the relational operators <, <=, ==, !=, >=, or >, as well as the pattern-matching operators ~ or !~ (e.g., if ($1 ~ /[Aa].*[Zz]/)). A series of commands must be put within braces ({}).

index
    index(str,substr)
    Return the position of the first occurrence of substr in string str, or 0 if not found.

int
    int(arg)
    Return the integer part of arg, truncated toward zero.

length
    length(arg)
    Return the length of arg.

log
    log(arg)
    Return the natural logarithm of arg.

match
    match(s,r)
    Function that matches the pattern, specified by the regular expression r, in the string s and returns either the position in s where the match begins or 0 if no occurrences are found. Sets the values of RSTART and RLENGTH. (nawk)

next
    Read next input line and start new cycle through pattern/procedures statements.

print
    print [args] [destination]
    Print args on output, followed by a newline. args is usually one or more fields, but it may also be one or more of the predefined variables or arbitrary expressions. If no args are given, prints $0 (the current input record). Literal strings must be quoted. Fields are printed in the order they are listed.
    If separated by commas (,) in the argument list, they are separated in the output by the OFS character. If separated by spaces, they are concatenated in the output. destination is a Unix redirection or pipe expression (e.g., > file) that redirects the default standard output.

printf
    printf format [, expression(s)] [destination]
    Formatted print statement. Fields or variables can be formatted according to instructions in the format argument. The number of expressions must match the number of format specifiers in format. format follows the conventions of the C-language printf statement. Here are a few of the most common formats:

    %s      A string.
    %d      A decimal number.
    %n.mf   A floating-point number, where n is the minimum total field width (including the decimal point) and m is the number of digits after the decimal point.
    %[-]nc  n specifies minimum field length for format type c, while - left-justifies value in field; otherwise value is right-justified.

    format can also contain embedded escape sequences: \n (newline) or \t (tab) are the most common. destination is a Unix redirection or pipe expression (e.g., > file) that redirects the default standard output.

    For example, using the following script:

    {printf "The sum on line %s is %d.\n", NR, $1+$2}

    and the following input line:

    5 5

    produces this output, followed by a newline:

    The sum on line 1 is 10.

rand
    rand()
    Generate a random number between 0 and 1. This function returns the same series of numbers each time the script is executed, unless the random number generator is seeded using the srand() function. (nawk)

return
    return [expr]
    Used in user-defined functions to exit the function, returning value of expression expr, if any. (nawk)

sin
    sin(x)
    Return sine of x (in radians). (nawk)

split
    split(string,array[,sep])
    Split string into elements of array array[1], . . . ,array[n]. string is split at each occurrence of separator sep. (In nawk, the separator may be a regular expression.) If sep is not specified, FS is used.
    The number of array elements created is returned.

sprintf
    sprintf(format [, expression(s)])
    Return the value of expression(s), using the specified format (see printf). Data is formatted but not printed.

sqrt
    sqrt(arg)
    Return square root of arg.

srand
    srand(expr)
    Use expr to set a new seed for random number generator. Default is time of day. Returns the old seed. (nawk)

sub
    sub(r,s[,t])
    Substitute s for first match of the regular expression r in the string t. Return 1 if successful; 0 otherwise. If t is not supplied, defaults to $0. (nawk)

substr
    substr(string,m[,n])
    Return substring of string, beginning at character position m and consisting of the next n characters. If n is omitted, include all characters to the end of string.

system
    system(command)
    Function that executes the specified Unix command and returns its status (Section 34.12). The status of the command that is executed typically indicates its success (0) or failure (nonzero). The output of the command is not available for processing within the nawk script. Use command | getline to read the output of the command into the script. (nawk)

tolower
    tolower(str)
    Translate all uppercase characters in str to lowercase and return the new string. (nawk)

toupper
    toupper(str)
    Translate all lowercase characters in str to uppercase and return the new string. (nawk)

while
    while (condition)
        command
    Do command while condition is true (see if for a description of allowable conditions). A series of commands must be put within braces ({}).

-- DG

Copyright © 2003 O'Reilly & Associates. All rights reserved.
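Several of the string functions summarized in Section 20.10.7 can be exercised on a single input line. The following is a rough sketch, assuming a nawk-compatible awk (gsub, sub, match, and toupper are not in original awk); the input string and the variable names (out, part, copy, hits, first) are invented for illustration.

```shell
out=$(printf '%s\n' 'alpha,beta,gamma' | awk '
{
    n = split($0, part, ",")       # fills part[1]..part[n], returns n
    copy = $0
    hits = gsub(/a/, "A", copy)    # replaces every "a" in copy, returns the count
    first = $0
    sub(/,/, ";", first)           # replaces only the first comma
    match($0, /m+/)                # sets RSTART and RLENGTH

    printf "split=%d second=%s\n", n, part[2]
    printf "gsub=%d result=%s\n", hits, copy
    printf "sub=%s\n", first
    printf "index=%d substr=%s\n", index($0, "beta"), substr($0, 7, 4)
    printf "match=%d,%d upper=%s length=%d\n", RSTART, RLENGTH, toupper(part[1]), length($0)
}')

printf '%s\n' "$out"
```

Because gsub and sub modify their third argument in place, the sketch works on copies (copy, first) so that $0 stays intact for the later calls.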
