Confusion with Whitespace Field Delimiters (Unix Power Tools, 3rd Edition)
22.4. Confusion with Whitespace Field Delimiters
One would hope that a simple task like sorting would be relatively
unambiguous. Unfortunately, it isn't. The behavior
of sort can be very puzzling.
I'll try to straighten out some of the
confusion -- at the same time, I'll be leaving
myself open to abuse by the real sort experts. I
hope you appreciate this! Seriously, though: if you know of any other
wrinkles to the story, please let us know and we'll
add them in the next edition.
The trouble with sort is figuring out where one
field ends and another begins. It's simplest if you
can specify an explicit field
delimiter (Section 22.3). This makes it
easy to tell where fields end and begin. But by default,
sort uses whitespace characters (tabs and spaces)
to separate fields, and the rules for interpreting whitespace field
delimiters are unfortunately complicated. As I see them, they are:
1. The first whitespace character you encounter is a "field delimiter"; it marks the end of the old field and the beginning of the next field.
2. Any whitespace character following a field delimiter is part of the new field. That is, if you have two or more whitespace characters in a row, the first one is used as a field delimiter and isn't sorted. The remainder are sorted, as part of the next field.
3. Every field has at least one nonwhitespace character, unless you're at the end of the line. (That is, null fields only occur when you've reached the end of a line.)
4. Not all whitespace is equal. Sorting is done according to the ASCII collating sequence, in which TAB (value 9) comes before space (value 32). Therefore, TABs are sorted before spaces.
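The last rule is easy to check from the shell. What follows is a minimal sketch, assuming a POSIX shell and a sort that honors LC_ALL=C (which requests plain byte order instead of locale collation); the two sample lines are made up for illustration.

```shell
# Two made-up lines: one begins with a space (ASCII 32), one with a TAB (ASCII 9).
# LC_ALL=C asks sort for plain byte order rather than locale collation.
sorted=$(printf ' starts with a space\n\tstarts with a TAB\n' | LC_ALL=C sort)

# Remember which line came out first; the TAB line should win.
first_line=$(printf '%s\n' "$sorted" | head -n 1)

# Display the result with sed's "l" command, which shows TABs as \t.
printf '%s\n' "$sorted" | sed -n l
```

In a locale other than C the collation rules may differ, so the ordering of whitespace is worth confirming on your own system.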
Here is a silly but instructive example that demonstrates most of the
hard cases. We'll sort the file
sortme, which is:
        apple   Fruit shipment
20      beta    beta test sites
 5              Something or other
All is not as it seems -- cat -t
-v (Section 12.5, Section 12.4) shows that the file really looks like this:
^Iapple^IFruit shipment
20^Ibeta^Ibeta test sites
 5^I^ISomething or other
^I indicates a tab character. Before showing you
what sort does with this file,
let's break it into fields, being very careful to
apply the rules above. In the table, we use quotes to show exactly
where each field begins and ends:
          Field 0      Field 1          Field 2       Field 3
Line 1    "^Iapple"    "Fruit"          "shipment"    null (no more data)
Line 2    "20"         "beta"           "beta"        "test"
Line 3    " 5"         "^ISomething"    "or"          "other"
OK, now let's try some sort
commands; I've added annotations on the right,
showing what character the "sort"
was based on. First, we'll sort on field
zero -- that is, the first field in each line:
% sort sortme                          ...sort on field zero
        apple   Fruit shipment         field 0, first character: TAB
 5              Something or other     field 0, first character: SPACE
20      beta    beta test sites        field 0, first character: 2
As I noted earlier, a TAB precedes a space in the collating sequence.
Everything is as expected. Now let's try another,
this time sorting on field 1 (the second field):
% sort +1 sortme                       ...sort on field 1
 5              Something or other     field 1, first character: TAB
        apple   Fruit shipment         field 1, first character: F
20      beta    beta test sites        field 1, first character: b
Again, the initial TAB causes "Something or
other" to appear first. "Fruit
shipment" precedes
"beta" because in the ASCII table,
uppercase letters precede lowercase letters. Now,
let's sort on the next field:
% sort +2 sortme                       ...sort on field 2
20      beta    beta test sites        field 2, first character: b
 5              Something or other     field 2, first character: o
        apple   Fruit shipment         field 2, first character: s
No surprises here. And finally, sort on field 3 (the
"fourth" field):
% sort +3 sortme                       ...sort on field 3
        apple   Fruit shipment         field 3, NULL
 5              Something or other     field 3, first character: o
20      beta    beta test sites        field 3, first character: t
The only surprise here is that the NULL field gets sorted first.
That's really no surprise, though: an empty field compares lower than
any field that contains a character (you can think of NULL as having the
ASCII value zero), so we should expect it to come first.
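The zero-based +N notation used in these commands is the historical syntax; POSIX sort writes the same key as -k N+1 (so +1 becomes -k 2), and some current implementations may treat +1 as a filename instead. The sketch below reruns the four sorts in the newer notation. It assumes a POSIX shell, a sort that honors LC_ALL=C, and a cat that accepts -t and -v; the helper make_sortme is a hypothetical stand-in for the sortme file, and its leading space on the third line follows the " 5" entry in the table.

```shell
# Same sample data as sortme, produced on the fly (\t is a TAB).
# make_sortme is a hypothetical helper, not part of any tool.
make_sortme() {
    printf '\tapple\tFruit shipment\n20\tbeta\tbeta test sites\n 5\t\tSomething or other\n'
}

# The old zero-based +N corresponds to the one-based -k N+1.
# LC_ALL=C requests plain ASCII (byte) ordering.
by_field0=$(make_sortme | LC_ALL=C sort)        # like: sort sortme
by_field1=$(make_sortme | LC_ALL=C sort -k 2)   # like: sort +1 sortme
by_field2=$(make_sortme | LC_ALL=C sort -k 3)   # like: sort +2 sortme
by_field3=$(make_sortme | LC_ALL=C sort -k 4)   # like: sort +3 sortme

# Print each result, followed by a blank line, with TABs shown as ^I.
for result in "$by_field0" "$by_field1" "$by_field2" "$by_field3"; do
    printf '%s\n\n' "$result" | cat -t -v
done
```

POSIX states the field rules a little differently from the list given earlier (blanks in front of a field are counted as part of that field), but for this file the resulting order should match the listings above. It is still worth running on your own system before relying on it.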
OK, this was a silly example. But it was a difficult one; a casual
understanding of what sort "ought to
do" won't explain any of these
cases, which leads to another point. If someone tells you to sort
some terrible mess of a data file, you could be heading for a
nightmare. But often, you're not just sorting;
you're also designing the data
file you want to sort. If you get to design the format for the input
data, a little bit of care will save you lots of headaches. If you
have a choice, never allow TABs in the file. And
be careful of leading spaces; a word with an extra space before it
will be sorted before other words. Therefore,
use an explicit delimiter between fields (like a colon), or use the
-b option (and an explicit sort field), which tells
sort to ignore initial whitespace.
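Both suggestions can be tried on a few lines of throwaway data. This is only a sketch: the records and the words helper are hypothetical, and it assumes a POSIX shell and a sort that honors LC_ALL=C, -t, -b and -k.

```shell
# Hypothetical colon-delimited records (name:count), sorted on the second field.
by_count=$(printf 'zebra:1\napple:3\nmango:2\n' | LC_ALL=C sort -t : -k 2,2)

# Hypothetical word list in which one word has a stray leading space.
words() { printf ' zebra\napple\nmango\n'; }

plain=$(words | LC_ALL=C sort)               # the space pulls " zebra" to the top
ignoring=$(words | LC_ALL=C sort -b -k 1,1)  # -b skips leading blanks in the key

# Show the three results one after another.
printf '%s\n' "$by_count" "$plain" "$ignoring"
```

With the colon as delimiter there is no doubt about where a field starts, and with -b plus an explicit key the stray space no longer decides the order.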
-- ML
Copyright © 2003 O'Reilly & Associates. All rights reserved.