Monday, December 21, 2009

An In-Memory Dictionary with Java

I needed to build an in-memory dictionary to do fast lookups on terms retreived from a web page. The dictionary is around 13K words from English, including names, abbreviations etc. It was compiled from the SCOWL project.

The first thing I realized was that Java uses around 90M of memory to store 65M of string data. At first glance thus seems to be due to the overhead in String class - here is the relevant class structure showing the overhead:

public final class String  implements, java.lang.Comparable<java.lang.String>, java.lang.CharSequence{
private final char[] value;
private final int offset;
private final int count;
private int hash;

As you can see, apart from the char array, there is an offset, count and the hash of the string.

The offset, count fields are used to limit memory usage. Say, if lots of strings could be represented as substrings of others, there need to be only one copy of the char[] value array, the individual strings will reference the same array with different offset, count values.

The hash is used to speed up certain data access operations when Strings are used as a key in a collection like a Hash. Rather than computing the hash each time a String is searched for in a hash, the hash of the string can be stored in the String object. Since a string is immutable, this hash value does not need to be updated once set, making it easy to maintain.

But of course, it adds an extra 4 bytes of overhead.

However, the reason for the biggest increase seems to be the size of the char data type. Since it should be able to store a Unicode character, it takes 2 bytes. So even without the other overhead, we should expect twice the number of bytes as would be needed to store this with a byte array.

And interestingly, the offset, count trick seems to find enough substrings that the amount of memory required is below double the amount one would expect with a byte array. So what we thought was overhead in fact helped us here.

Next I tried to figure out the collisions rate, when the Strings are stored in a HashSet. I was using a HashSet for my dictionary. I found found 938 collisions from 628033 terms [0.14935522%] which is quite ok. Just to make sure that there was no unusual clustering around certain keys, I checked how many slots had more than one collision, meaning three or more keys would be stored on these slots. There were only 7 such slots, and they all had just 3 keys hashed onto them.

So the maximum length of the list at each hash slot was 3. This was quite acceptable.

Here is the code used to find the collision info.

import java.util.Set;
import java.util.HashSet;
import java.util.Map;
import java.util.HashMap;

public class Collider {
static public void main(String[] args) {
String dictDir = "/Users/thushara/code/platform/AdXpose/src/com/adxpose/affinity/en_dict";
File[] files = new File(dictDir).listFiles();
Set<Integer> set = new HashSet<Integer>();
Map<Integer, Integer> collisions = new HashMap<Integer, Integer>();
int dups = 0;
int tot = 0;
for (File file : files) {
try {
FileInputStream fstream = new FileInputStream(file);
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String word;
while ((word = br.readLine()) != null) {
int hash = word.hashCode();
if (set.contains(hash)) {
collisions.put(hash, collisions.get(hash) == null ? 1 : collisions.get(hash)+1);
} else {
} catch (Exception e) {}
System.out.println("found " + dups + " collisions from " + tot + " terms [" + ((float)dups)/tot*100 + "%]");
for (Map.Entry<Integer, Integer> entry : collisions.entrySet()) {
if (entry.getValue()>1) System.out.println(entry.getKey() + ":" + entry.getValue());

with this output:

found 938 collisions from 628033 terms [0.14935522%]

Of course the problem of collisons is addressed in the Birthday Paradox. We could also use this theory to determine if String.hashCode() is optimal.

Next I will make a simple light string class based on a byte array using the last byte as a terminator (like in the traditional C string, storing 0 as the last byte).