Wednesday, March 16, 2011

Linux sort : bug with , separator and confusing period?

user@host:~/$ echo -e "alan,20,3,0\ngeorge,5,0,0\nalice,3,5,0\ndora,4,0.9,5" | sort -n -k 2 -t ,
dora,4,0.9,5
alice,3,5,0
george,5,0,0
alan,20,3,0
user@host:~/$ 

The line with "dora" as the first term should be printed after "alice" and before "george", as we are asking sort to sort on the second column. The 3rd column value of "0.9" seems to confuse sort on this.

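This is not a bug in sort; it is a consequence of the locale setting, which differs between systems. In locales where the comma is the thousands separator, a numeric sort reads a key such as "4,0.9,5" as a single number instead of stopping at the first comma.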

On the above link, look for "Sort does not sort in normal order".

Setting LC_ALL=C sorts as expected:

user@host:~/$ echo -e "alan,20,3,0\ngeorge,5,0,0\nalice,3,5,0\ndora,4,0.9,5" | LC_ALL=C sort -n -k 2 -t ,
alice,3,5,0
dora,4,0.9,5
george,5,0,0
alan,20,3,0
user@host:~/$ 

3 comments:

Michael said...

This will solve the sort problem

sort -t , -k 2n,2 file-to-be-sorted.txt

You need "-k 2,2" to tell sort to only consider the second column. The "n" in "-k 2n,2" tells the sort command to sort the column numerically. If you omit the "n", 20 gets sorted before 3.
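Applied to the data from the post, this can be sketched as follows. The variable names are only illustrative, and printf is used instead of echo -e as an assumption for portability:

```shell
# Data from the post, one record per line.
data='alan,20,3,0
george,5,0,0
alice,3,5,0
dora,4,0.9,5'

# -t , splits on commas; -k 2,2n starts and ends the key at field 2
# and compares it numerically, so the later columns are never looked at.
sorted_by_col2=$(printf '%s\n' "$data" | sort -t , -k 2,2n)
printf '%s\n' "$sorted_by_col2"
```

Because the key contains no comma, the result should not depend on the locale.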

I did some testing and it seems that when you sort multiple numerical columns the numbers get combined. When you specify "-k 2" you tell sort to use all columns from column two until the end of the line. You also specified comma to be the column separator. When you do this, I think sort sees the data this way:

the original data

dora,4,0.9,5
alice,3,5,0
george,5,0,0
alan,20,3,0

becomes

dora,40.95
alice,350
george,500
alan,2030

Read as numbers, that is already in ascending order, which matches the output sort produced. I did a test with different numbers and still got the same behavior.
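One way to check this hypothesis is to build the combined number by hand and sort on it. This is only a rough emulation of the idea above, not how sort is implemented. The awk step and the variable names are assumptions added for illustration:

```shell
# Data in the order the post feeds it to sort.
data='alan,20,3,0
george,5,0,0
alice,3,5,0
dora,4,0.9,5'

# Glue columns 2-4 together (e.g. 4, 0.9, 5 -> 40.95), put that number in
# front of each line, sort numerically on it, then drop the helper column.
merged_order=$(printf '%s\n' "$data" |
  awk -F, '{ print $2 $3 $4 "\t" $0 }' |
  LC_ALL=C sort -n |
  cut -f 2-)
printf '%s\n' "$merged_order"
```

The lines come out with dora first and alan last, the same order as the surprising output in the post.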

It's interesting that LC_ALL=C affects the sort.

thushara said...

Actually I found a simpler case where LC_ALL has no effect:

echo -e "alan\t20\t3\t0\ngeorge\t5\t0\t0\nalice\t3\t5\t0\ndora\t4\t9\t5" | LC_ALL=C sort -n -k 2 -t \t

thushara said...

Although, that is due to the shell not sending a proper tab character to the sort program: the unquoted \t reaches sort as a plain t.

This works:

[~] echo -e "alan\t20\t3\t0\ngeorge\t5\t0\t0\nalice\t3\t5\t0\ndora\t4\t9\t5" | LC_ALL=C sort -n -k 2 -t"`echo -e "\t"`"
alice 3 5 0
dora 4 9 5
george 5 0 0
alan 20 3 0
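A slightly tidier way to pass the tab, sketched here on the assumption that a POSIX shell with printf is available, is to store it in a variable first. The variable names are only illustrative:

```shell
# Command substitution only strips trailing newlines, so the tab survives.
# In bash, -t $'\t' should work as well.
tab=$(printf '\t')

# Same tab-separated data as above, sorted numerically on column 2 only.
tab_sorted=$(printf 'alan\t20\t3\t0\ngeorge\t5\t0\t0\nalice\t3\t5\t0\ndora\t4\t9\t5\n' |
  LC_ALL=C sort -t "$tab" -k 2,2n)
printf '%s\n' "$tab_sorted"
```

Quoting "$tab" matters here: unquoted, the shell would treat the tab as whitespace and drop it before sort sees it.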