Wednesday, March 23, 2011

Insanely compressed HTML files

Today, I discovered a URL that served some insanely compressed content. The server delivered it with Content-Encoding: gzip and Transfer-Encoding: chunked. The compressed size of the content was 2,921,925 bytes and it decompressed to 1,004,263,982 bytes, roughly 344 times the compressed size.

This caused things to go wrong in the production process. I had set a limit of a few megabytes on all fetches, assuming that no single fetch could amount to much more than that; the limit, however, applied to the bytes on the wire, not to the decompressed result. This was the first time I had seen such a huge decompression ratio, and it caused a subsequent file mapping to fail due to inadequate memory.
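In hindsight, one guard is to cap the number of decompressed bytes you are willing to accept, rather than the number of bytes downloaded. A minimal sketch (the 64 KiB cap and the synthetic gzipped body are made up for illustration):

```shell
# A stand-in for a fetched gzip-encoded body: 1 MB of a single
# repeated byte, which gzips down to about a kilobyte.
resp=$(mktemp)
head -c 1000000 /dev/zero | tr '\0' 'a' | gzip > "$resp"

# Cap the decompressed stream at 64 KiB, however large it really is;
# head closes the pipe once the cap is reached, stopping gunzip early.
body=$(mktemp)
gunzip -c "$resp" | head -c 65536 > "$body"
wc -c < "$body"
```

The same idea applies whatever tool does the fetching: enforce the limit on the output side of the decompressor, not on the download.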

The downloaded content suggested why it would compress so well. There seems to be a dynamically generated part on this page. If you examine its source, you will see a marker like this:


Content after that seems dynamically generated. You will find markup like this:

<h2></h2> - <br/><h4>... <a href="">read more</a></h4>

In this particular instance, an unusually large amount of fake content had been generated. The downloaded file had just 33 lines, but the last, very long line was a huge repeating pattern of:

<a href="">read more</a></h4><br/><br/><h2></h2> - <br/><h4>... 

This would of course compress well.
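The effect is easy to reproduce: a megabyte or so built from a short repeating fragment gzips down to a few kilobytes. (The fragment below mimics the pattern above; the repeat count is arbitrary.)

```shell
# Repeat a short markup fragment ~20,000 times (over 1 MB of input)
# and compare the raw size with the gzipped size:
pattern='<a href="">read more</a></h4><br/><br/><h2></h2> - <br/><h4>... '
raw=$(mktemp)
for i in $(seq 20000); do printf '%s' "$pattern"; done > "$raw"

wc -c < "$raw"          # raw size: over a megabyte
gzip -c "$raw" | wc -c  # compressed size: a few kilobytes
```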

Wednesday, March 16, 2011

Linux sort: a bug with the comma separator and a confusing period?

user@host:~/$ echo -e "alan,20,3,0\ngeorge,5,0,0\nalice,3,5,0\ndora,4,0.9,5" | sort -n -k 2 -t ,

The line with "dora" as the first term should come after "alice" and before "george", since we are asking sort to sort numerically on the second column. The value "0.9" in the third column seems to confuse sort here.

This is not a bug in sort; it is down to the locale settings, which differ across operating systems. The GNU coreutils FAQ covers this under the entry "Sort does not sort in normal order".

Setting LC_ALL=C sorts as expected:

user@host:~/$ echo -e "alan,20,3,0\ngeorge,5,0,0\nalice,3,5,0\ndora,4,0.9,5" | LC_ALL=C sort -n -k 2 -t ,
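Independent of the locale issue, note that -k 2 makes the sort key run from field 2 to the end of the line, so the later fields also take part in the comparison. Restricting the key to the second field alone with -k 2,2 (with the numeric flag attached to the key) sorts on just that column:

```shell
# Sort on field 2 only, numerically, with "," as the separator:
printf 'alan,20,3,0\ngeorge,5,0,0\nalice,3,5,0\ndora,4,0.9,5\n' |
    LC_ALL=C sort -t, -k2,2n
# alice,3,5,0
# dora,4,0.9,5
# george,5,0,0
# alan,20,3,0
```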

Friday, March 11, 2011

A useful, scriptable way to remove offending known_hosts keys

You can use ssh-keygen -R to remove invalid keys from the known_hosts file. This is especially useful when the host names are hashed in the file, since you cannot simply search for a host by name; hashing is the default on Ubuntu Lucid.
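A small demonstration against a throwaway known_hosts file (the host name old.example.com is made up; drop the -f option to act on the default ~/.ssh/known_hosts):

```shell
# Build a scratch known_hosts with one entry for a made-up host:
demo=$(mktemp -d)
ssh-keygen -q -t ed25519 -N '' -f "$demo/key"
printf 'old.example.com %s\n' "$(cut -d' ' -f1-2 "$demo/key.pub")" > "$demo/known_hosts"

# Remove every key recorded for that host (hashed entries included):
ssh-keygen -R old.example.com -f "$demo/known_hosts"

# -F looks a host up; after -R it prints nothing and exits non-zero:
ssh-keygen -F old.example.com -f "$demo/known_hosts" || echo "entry removed"
```

ssh-keygen -R also leaves a backup of the original file as known_hosts.old, so a bad removal is easy to undo.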