Thursday, December 07, 2017

Python : join with care

Code:
print ("running %s" % ' '.join(cmd))



Error:

Traceback (most recent call last):
  File "convert_outstanding_zsd.py", line 29, in <module>
    print ("running %s" % ' '.join(cmd))
TypeError: sequence item 4: expected string, int found


Cause:
str.join() accepts only strings, and the list cmd contains a non-string: item 4 (port) is an int.

Ex:
cmd = ["runthis.py", "--host", host, "--port", port]

To make join happy:
cmd = ["runthis.py", "--host", host, "--port", str(port)]
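
When the list is built from mixed values, converting each item at the point of joining avoids having to remember str() for every element. A minimal sketch, assuming made-up values for host and port (the original does not show them):

```python
# host and port are hypothetical values for illustration.
host = "localhost"
port = 8080

cmd = ["runthis.py", "--host", host, "--port", port]

# str.join() accepts only strings, so convert every item first.
command_line = ' '.join(str(item) for item in cmd)
print("running %s" % command_line)
```

This only changes the string that is printed. The int is still in cmd, so if the list is later handed to something that also expects strings, fixing the list itself may still be needed.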


Sunday, November 05, 2017

A Beautiful Soupy Exercise in Scraping Interesting Integers

Integers can be very interesting, at least if you are a mathematician. Even for a lay person like me, interesting integers can be used to spice up some data that is of interest. My interest here is the bike counter installed on the Fremont Bridge sidewalk, which counts the number of bicycles crossing the bridge in both directions.

Inspired by this idea of blogjects and twittering houses, I wanted to send out an early morning tweet of the number of cyclists who braved the streets across Fremont the day before. SDOT uploads the data early in the morning, and a cron task would request it and tweet it.

Simple and a tad boring. Now what if I could map the count to something interesting? Researching interesting integers, I came upon the theorem that there are no uninteresting integers: if there were any, one of them would have to be the smallest of the lot, and that fact itself would make this number interesting.

Energized in no small measure by this revelation, I sallied forth to find a list of interesting integers in the thousands range, as every day there were ~ 3000 cyclists being logged. It didn't take long for me to reach a comprehensive page of integers. The pattern is an integer within a <font> tag, followed by a phrase that describes it.

<font size=+3 color=gray>0</font> is the <a href="http://mathworld.wolfram.com/AdditiveIdentity.html">additive identity</a>.<br>
<font size=+3 color=gray>1</font> is the <a href="http://mathworld.wolfram.com/MultiplicativeIdentity.html">multiplicative identity</a>.<br>
<font size=+3 color=darkblue>2</font> is the only even <a href="http://mathworld.wolfram.com/PrimeNumber.html">prime</a>.<br>

At first glance, it seemed a simple matter of using Beautiful Soup to get each font tag, extract its text, then look for the font's sibling to extract the phrase.

However, the font tag has multiple siblings that make up the complete phrase. In the soup these are represented as NavigableString objects. It's a matter of moving across the document until we hit a <br> tag, collecting all the text as we go along.
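
To make the idea of collecting text until a <br> concrete without depending on any one API, here is a minimal Python 3 sketch that uses only the standard library's html.parser rather than Beautiful Soup. It is an event-driven alternative, not the approach taken in this post, and the class name PhraseCollector is made up for illustration:

```python
from html.parser import HTMLParser

class PhraseCollector(HTMLParser):
    """Text inside <font> is the integer; text up to the next <br> is its phrase."""

    def __init__(self):
        super().__init__()
        self.in_font = False
        self.number = None   # integer currently being described, if any
        self.parts = []      # text fragments collected for that integer
        self.pairs = []      # finished (integer, phrase) pairs

    def handle_starttag(self, tag, attrs):
        if tag == 'font':
            self.in_font = True
        elif tag == 'br' and self.number is not None:
            # <br> marks the end of the phrase for the current integer.
            self.pairs.append((self.number, ''.join(self.parts).strip()))
            self.number, self.parts = None, []

    def handle_endtag(self, tag):
        if tag == 'font':
            self.in_font = False

    def handle_data(self, data):
        if self.in_font:
            self.number = int(data)
        elif self.number is not None:
            self.parts.append(data)

sample = ('<font size=+3 color=gray>0</font> is the '
          '<a href="http://mathworld.wolfram.com/AdditiveIdentity.html">additive identity</a>.<br>\n'
          '<font size=+3 color=darkblue>2</font> is the only even '
          '<a href="http://mathworld.wolfram.com/PrimeNumber.html">prime</a>.<br>')

collector = PhraseCollector()
collector.feed(sample)
collector.close()
print(collector.pairs)
```

Because this parser reports tags as a flat stream of events, it has no notion of siblings; the Beautiful Soup code in this post walks next_sibling instead.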

Now since all of this needs to be part of the tweet, I quickly realized that not much can be said in 140 characters. So I didn't bother keeping the URLs. I used a Jupyter notebook to quickly prototype the outline, and I can't stress enough how useful this is, especially when you are dealing with an unfamiliar API (which Beautiful Soup was to me).

Here is how I used the notebook to understand the basic structure of the page:


So getting to the integers was quite trivial, as BeautifulSoup provides a way to search for a specific tag (font) with a specific value for a given attribute (size=+3). Since the phrase for the integer is spread over a number of contiguous elements, we need to construct it by visiting the siblings of the font tag until we hit a <br> tag.


unexpected_tags = {}
def get_text_to_eol(font_section):
    text_parts = []
    section = font_section.next_sibling
    while section.name != 'br':
        if section.name == 'a':
            text_parts.append(section.string)
        elif section.name is None:
            text_parts.append(str(section))
        else:
            print ("found %s tag" % section.name)
            unexpected_tags[section.name] = unexpected_tags.get(section.name, 0)+1
        section = section.next_sibling    
    return ' '.join(text_parts)  

Now I ran through the results in jupyter, and the first few are shown below:

for number, text in map(lambda section: (section.get_text(), get_text_to_eol(section)), integer_sections):
    print (number, text)

0  is the  additive identity .
1  is the  multiplicative identity .
2  is the only even  prime .
3  is the number of spatial dimensions we live in.
4  is the smallest number of colors sufficient to color all planar maps.
5  is the number of  Platonic solids .
6  is the smallest  perfect number .
7  is the smallest number of sides of a  regular  polygon that is not  constructible  by straightedge and compass.
8  is the largest  cube  in the  Fibonacci sequence .
9  is the maximum number of  cubes  that are needed to sum to any positive  integer .
10  is the base of our number system.
11  is the largest known  multiplicative persistence .
12  is the smallest  abundant number .
13  is the number of  Archimedean solids .
14  is the smallest even number n with no solutions to  φ (m) = n.
15  is the smallest  composite number  n with the property that there is only one  group  of order n.
found sup tag
found sup tag
16  is the only number of the form x  = y  with x and y being different  integers .
17  is the number of  wallpaper groups .
18  is the only positive number that is twice the sum of its digits.
found sup tag
19  is the maximum number of 4  powers needed to sum to any number.
20  is the number of  rooted trees  with 6 vertices.
21  is the smallest number of distinct  squares  needed to tile a  square .
22  is the number of  partitions  of 8.

Since I collected the unknown tags, I could see what they were:


Here is an example of a superscript being used:
<font size=+3 color=FF6699>16</font> is the only number of the form x<sup>y</sup> = y<sup>x</sup> with x and y being different <a href="http://mathworld.wolfram.com/Integer.html">integers</a>.<br>

Now isn't it interesting that, out of all the numbers, this is the only one?

Here is how the subscript is being used:

<font size=+3 color=brown>126</font> = <sub>9</sub><a href="http://mathworld.wolfram.com/Combination.html">C</a><sub>4</sub>.<br>

With this insight, I augmented the superscripts with the ^ symbol, and left the subscripts as is.


def get_text_to_eol(font_section):
    text_parts = []
    section = font_section.next_sibling
    while section.name != 'br':
        if section.name == 'a':
            text_parts.append(section.string)
        elif section.name is None:
            text_parts.append(str(section))
        else:
            if section.name == 'sup':
                text_parts.append('^')
            text_parts.append(section.string)
        section = section.next_sibling    
    return ' '.join(text_parts)       

And I again digressed on a merry tangent where people were using non-ascii characters to tweet subscripts and superscripts.
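
As a rough illustration of that tangent: Unicode has dedicated superscript code points for the digits and a few symbols, so an exponent can be written without the ^ marker. The Python 3 sketch below is an assumption about how one might do it, not what the tweets ended up using, and the helper name to_superscript is made up:

```python
# Map ASCII digits and a few symbols to their Unicode superscript forms.
SUPERSCRIPTS = str.maketrans("0123456789+-=()n", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ")

def to_superscript(text):
    """Replace every character that has a superscript form; leave the rest."""
    return text.translate(SUPERSCRIPTS)

# e raised to 1351, as in the entry for 1351 on the page.
print("e" + to_superscript("1351"))
```

Coverage for letters is less complete than for digits, so ^ remains a reasonable fallback for cases like x^y.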

However, somewhat surprisingly, the sibling list did not always end in a <br>. I hit a None for a section for integer 248 with the html:

<font size=+3 color=006600>248</font> is the smallest number n>1 for which the <a href="http://mathworld.wolfram.com/ArithmeticMean.html">arithmetic</a>, <a href="http://mathworld.wolfram.com/GeometricMean.html">geometric</a>, and <a href="http://mathworld.wolfram.com/HarmonicMean.html">harmonic means<a/> of <a href="http://mathworld.wolfram.com/TotientFunction.html">&phi;</a>(n) and <a href="http://mathworld.wolfram.com/DivisorFunction.html">&sigma;</a>(n) are all <a href="http://mathworld.wolfram.com/Integer.html">integers</a>.<br>

Can you spot the problem? It is subtle.

Notice that the tag after "harmonic means" is written <a/> instead of </a>, so the link is never properly closed. Beautiful Soup replaces this dangling tag with a beautiful pair <a></a>:

<a href="http://mathworld.wolfram.com/HarmonicMean.html">harmonic means<a></a> of <a href="http://mathworld.wolfram.com/TotientFunction.html">φ</a>(n) and <a href="http://mathworld.wolfram.com/DivisorFunction.html">σ</a>(n) are all <a href="http://mathworld.wolfram.com/Integer.html">integers</a>.<br/></a>

This is all very nice, except that we were relying on a <br> tag to be an eventual sibling, and Beautiful Soup is on a soupy wake trying to find the matching </a> for the tag it started with, finally finding it at:

<font size=+3 color=FF6699>1351</font> has the property that <a href="http://mathworld.wolfram.com/e.html">e</a><sup>1351</a></sup> is within .0009 of an <a href="http://mathworld.wolfram.com/Integer.html">integer</a>.<br>

We are getting all the integers from 248 through 1351 in one unbroken block.

Is there then an easier way to solve this problem? It's tempting to think of regular expressions when it comes to HTML parsing issues of this sort. What if we use a regular expression to split apart the sections containing the integer and phrase? After all, the top-down parser snagged on a mismatched tag, but maybe a regular expression can give us a better behaved set of HTML fragments, which we can then parse individually with Beautiful Soup.

We can get the HTML lines with a regular expression split.

Since we decided to forgo the URLs, we could construct a BeautifulSoup instance for each line, and call the get_text() method to strip all tags.

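import re
from bs4 import BeautifulSoup

# html holds the page source as a string; split it into one chunk per <br>
lines = re.split(r"<br>[\r\n\s]*", html)
# keep only the chunks that contain an integer entry
list_lines = list(filter(lambda x: x is not None, [re.search(r"<font size=\+3 .*", line) for line in lines]))
text_lines = [BeautifulSoup(l.group(0), 'html.parser').get_text() for l in list_lines]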


['0 is the additive identity.',
 '1 is the multiplicative identity.',
 '2 is the only even prime.',
 '3 is the number of spatial dimensions we live in.',
 '4 is the smallest number of colors sufficient to color all planar maps.',
 '5 is the number of Platonic solids.',
 '6 is the smallest perfect number.',
 '7 is the smallest number of sides of a regular polygon that is not constructible by straightedge and compass.',
 '8 is the largest cube in the Fibonacci sequence.',
 '9 is the maximum number of cubes that are needed to sum to any positive integer.',
 '10 is the base of our number system.',
 '11 is the largest known multiplicative persistence.',
 '12 is the smallest abundant number.',
 '13 is the number of Archimedean solids.',
 '14 is the smallest even number n with no solutions to φ(m) = n.',
 '15 is the smallest composite number n with the property that there is only one group of order n.',
 '16 is the only number of the form xy = yx with x and y being different integers.',
 '17 is the number of wallpaper groups.',

But now we are not identifying the superscripts and subscripts, as you can see from the output for integer 16.

What we should do instead is use the regular expression to get the lines, apply the parser to each line, then use the function we wrote earlier to get the text within each line. Since the parser never sees more than one line at a time, it can't run past a <br>, and this just might result in a better extraction of phrases.


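import re
from bs4 import BeautifulSoup

lines = re.split(r"<br>[\r\n\s]*", html)
list_lines = list(filter(lambda x: x is not None, [re.search(r"<font size=\+3 .*", line) for line in lines]))
# parse each line on its own and keep its <font> tag as the starting point
soups = [BeautifulSoup(l.group(0), 'html.parser').font for l in list_lines]
# pair each integer with the phrase that follows it
np_list = [(int(s.get_text()), get_text_to_eol(s)) for s in soups]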


(7,
  ' is the smallest number of sides of a  regular  polygon that is not  constructible  by straightedge and compass.'),
 (8, ' is the largest  cube  in the  Fibonacci sequence .'),
 (9,
  ' is the maximum number of  cubes  that are needed to sum to any positive  integer .'),
 (10, ' is the base of our number system.'),
 (11, ' is the largest known  multiplicative persistence .'),
 (12, ' is the smallest  abundant number .'),
 (13, ' is the number of  Archimedean solids .'),
 (14, ' is the smallest even number n with no solutions to  φ (m) = n.'),
 (15,
  ' is the smallest  composite number  n with the property that there is only one  group  of order n.'),
 (16,
  ' is the only number of the form x ^ y  = y ^ x  with x and y being different  integers .'),
 (17, ' is the number of  wallpaper groups .'),

Now we convert this list of pairs to a dictionary, so we can quickly look up the integer:

hash = dict(np_list)


{0: ' is the  additive identity .',
 1: ' is the  multiplicative identity .',
 2: ' is the only even  prime .',
 3: ' is the number of spatial dimensions we live in.',
 4: ' is the smallest number of colors sufficient to color all planar maps.',
 5: ' is the number of  Platonic solids .',
 6: ' is the smallest  perfect number .',
 7: ' is the smallest number of sides of a  regular  polygon that is not  constructible  by straightedge and compass.',
 8: ' is the largest  cube  in the  Fibonacci sequence .',
 9: ' is the maximum number of  cubes  that are needed to sum to any positive  integer .',
 10: ' is the base of our number system.',
 11: ' is the largest known  multiplicative persistence .',
 12: ' is the smallest  abundant number .',
 13: ' is the number of  Archimedean solids .',
 14: ' is the smallest even number n with no solutions to  φ (m) = n.',
 15: ' is the smallest  composite number  n with the property that there is only one  group  of order n.',
 16: ' is the only number of the form x ^ y  = y ^ x  with x and y being different  integers .',
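
The phrases above carry doubled spaces and a stray space before the period, because each part already brings its own spacing and is then joined with ' '. Here is a minimal sketch of how a lookup might tidy this up and respect the 140-character limit. The helper names tidy and describe, and the fallback for a count with no entry, are my assumptions rather than part of the original code:

```python
import re

def tidy(phrase):
    """Collapse runs of whitespace and drop the space before punctuation."""
    collapsed = ' '.join(phrase.split())
    return re.sub(r'\s+([.,;:])', r'\1', collapsed)

def describe(count, phrases, limit=140):
    """Return 'count phrase' cut to limit characters, or just the count."""
    phrase = phrases.get(count)
    if phrase is None:
        return str(count)
    return ("%d %s" % (count, tidy(phrase)))[:limit]

# A two-entry stand-in for the scraped dictionary.
sample = {
    2: ' is the only even  prime .',
    16: ' is the only number of the form x ^ y  = y ^ x  with x and y being different  integers .',
}
print(describe(2, sample))
print(describe(3000, sample))
```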

The get_text_to_eol() function was modified to handle reaching the end of the sibling list without hitting a <br>. Also, we keep all strings as unicode until they need to be output, at which point they are converted to UTF-8.

def get_text_to_eol(font_section):
    text_parts = []
    section = font_section.next_sibling
    while section is not None:
        if section.name == 'a':
            text_parts.append(section.get_text())
        elif section.name is None:
            text_parts.append(unicode(section))
        else:
            if section.name == 'sup':
                text_parts.append('^')
            text_parts.append(section.get_text())
        section = section.next_sibling    
    return ' '.join(text_parts)     
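
The rule of staying in unicode until output can be sketched as follows. This is a Python 3 sketch (where str is already unicode), and the helper name to_output_bytes is hypothetical:

```python
def to_output_bytes(number, phrase):
    """Build the line as text and encode to UTF-8 only at the output boundary."""
    line = "%d%s" % (number, phrase)
    return line.encode("utf-8")

encoded = to_output_bytes(14, " is the smallest even number n with no solutions to φ(m) = n.")
print(encoded)
```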

I think this will do for our purposes. In the next post I will show how this was used along with the bike counter uploads to tweet early morning updates to twitter.

The full source code, along with the twitter updates can be found here.

Saturday, August 12, 2017

A Trie


This is a trie that uses a sentinel node to denote the end of a word. This is more space efficient than having to flag each node as to whether it denotes an end of a word. To quickly find the number of prefix matches, it stores the prefix count in the node.
import java.util.HashMap;
import java.util.Map;

class Trie {
    char ch;
    int count = 0;
    Map<Character, Trie> list = new HashMap<Character, Trie>();
    
    public Trie(char ch) {
        this.ch = ch;
    }
    
    public Trie add(char ch) {
        Trie node = this.list.get(ch);
        if (node == null) {
            Trie newNode = new Trie(ch);
            this.list.put(ch, newNode);
            node = newNode;
        }

        //adding the count to the current node is preferable
        //to adding to the node that matches the character.
        //This way, we won't add to the sentinel node
        //and we add only in one place.
        this.count++;
        return node;
    }

    public int size() {
        return this.count;
    }

    private Trie findChar(char ch) {
        return this.list.get(ch);
    }

    public boolean findWord(String word) {
        Trie node = this;
        for (char ch: word.toCharArray()) {
            node = node.findChar(ch);
            if (node == null) {
                return false;
            }
        }
        //we may have found a prefix, make sure it is a word
        //if it's a word, the list must have the sentinel.
        return node.list.get((char)0) != null;
    }
    
    public int findPartial(String prefix) {
        Trie node = this;
        for (char ch : prefix.toCharArray()) {
            node = node.list.get(ch);
            if (node == null) {
                return 0;
            }
        }
        return node.size(); 
    }
    
    public void add(String s) {
        Trie node = this;
        for (char ch : s.toCharArray()) {
            node = node.add(ch);
        }
        //add the sentinel to mark the end of the word.
        node.add((char)0);
    }
}

Now it is possible to reduce the space taken by the trie further by using an array instead of the map. Knowing that we need only lower case letters, we can use the character before 'a' (the backtick, `) as the sentinel, so that the array length is 27.

Another space optimization comes from using a single 32-bit word to store both the character and the prefix count. Java uses two bytes for the char type, though we could do with one byte. Keeping a one-byte character next to a four-byte count would still use 5 bytes per Trie node, and we don't need the 2 billion range possible with 32 bits to represent the count of all prefixes for any English substring.

The prefix count is highest on the root node, as all words have the head node character as the prefix. So the highest prefix count is the number of words in the dictionary. This is generally never more than 250,000. We can safely use 24 bits, which can represent about 8 million even as a signed integer (about 16 million unsigned).

So we can combine the character and the prefix count to a single word.
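
The packing arithmetic can be sketched in a few lines. This is written in Python for brevity (the trie itself is Java), and the names pack, unpack_char and unpack_count are illustrative only. It assumes one-byte characters and counts that fit in 24 bits:

```python
CHAR_SHIFT = 24          # the character lives in the top byte
COUNT_MASK = 0x00FFFFFF  # the prefix count lives in the low 24 bits

def pack(ch, count):
    """Combine a one-byte character and a 24-bit count into one 32-bit word."""
    assert ord(ch) < 256 and 0 <= count <= COUNT_MASK
    return (ord(ch) << CHAR_SHIFT) | count

def unpack_char(word):
    """Shift the top byte down, then mask it."""
    return chr((word >> CHAR_SHIFT) & 0xFF)

def unpack_count(word):
    """Mask off the character byte."""
    return word & COUNT_MASK

word = pack('q', 250000)
word += 1  # incrementing the packed word bumps only the count
print(unpack_char(word), unpack_count(word))
```

Incrementing the packed word directly is safe only while the count stays below 2^24; beyond that it would carry into the character byte.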

Is there anything else we could do? Yes: we could read all the words into our Trie and then trim the list on each Trie node. This follows from the observation that we rarely use all the slots in a list. Especially as the trie fans out, there are fewer new words below each node. Thus we can find the last used index on the list and create a new, shorter list.
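
The array indexing and the trimming step can be illustrated with a short sketch. It is in Python for brevity rather than Java, the function names slot, trim and get are made up, and plain strings stand in for child nodes:

```python
SENTINEL = '`'  # the ASCII character just before 'a'

def slot(ch):
    """Index of a character in the child array: '`' -> 0, 'a' -> 1, ..., 'z' -> 26."""
    return ord(ch) - ord(SENTINEL)

def trim(children):
    """Drop the unused slots after the last child that is actually present."""
    last_used = -1
    for i, child in enumerate(children):
        if child is not None:
            last_used = i
    return children[:last_used + 1]

def get(children, ch):
    """Look up a child, treating a slot that was trimmed away as missing."""
    i = slot(ch)
    return children[i] if i < len(children) else None

children = [None] * 27
children[slot('a')] = 'node-a'
children[slot('c')] = 'node-c'
trimmed = trim(children)
print(len(trimmed), get(trimmed, 'c'), get(trimmed, 'z'))
```

Trimming only removes the tail of the array, so a node whose only child is 'z' saves nothing, while a node with just the sentinel shrinks to a single slot.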

Doing all of these drops the size of the trie from ~ 228M to ~ 68M.

Here is an implementation.

I store a random word list in pastebin for testing. There is code here that uses this, as well as pulling a dictionary of lower case words. If you use this, you will need to make sure the dictionary you substitute has only lower case words, so some pre-processing might be necessary. In particular, you are likely to find a hyphen (-) in some words, which you will need to remove.

Last but not least, the JVM memory stats don't give a reliable idea of the space saving, because the garbage collector is not deterministic. I use sizeInBytes() to recursively calculate the memory footprint of the Trie.



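import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Trie {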

    interface GetAndPut {
        public void put(Character ch, Trie trie);
        public Trie get(Character ch);
        public int lastUsedIndex();
        public Trie[] children();
        public void trim();
    }
    class SuffixCharsWithMap implements  GetAndPut {
        Map<Character, Trie> list = new HashMap<Character, Trie>();
        public void put(Character ch, Trie node) {
            list.put(ch, node);
        }
        public Trie get(Character ch) {
            return list.get(ch);
        }
        public int lastUsedIndex() {
            return list.size()-1;
        }
        public Trie[] children() {
            return list.values().toArray(new Trie[list.size()]);
        }
        public void trim() {

        }
    }
    class SuffixCharsWithArray implements GetAndPut {
        public Trie[] list;

        public SuffixCharsWithArray() {
            int sz = (int)('z') - (int)'`' +1;
            list = new Trie[sz];
        }
        public void put(Character ch, Trie node) {
            list[(int)ch - (int)'`'] = node;
        }
        public Trie get(Character ch) {
            try {
                return list[(int) ch - (int) '`'];
            } catch (ArrayIndexOutOfBoundsException e) {
                // we hit an index that got trimmed out
                return null;
            }
        }
        public int lastUsedIndex() {
            int lue = -1;
            for (int i=0; i<list.length; i++) {
                if (list[i] != null) {
                    lue = i;
                }
            }
            return lue;
        }
        public void trim() {
            if (lastUsedIndex()+1 < list.length) {
                Trie[] newList = new Trie[lastUsedIndex() + 1];
                for (int i = 0; i < newList.length; i++) {
                    newList[i] = list[i];
                }
                list = newList;
            }
        }
        public Trie[] children() {
            List<Trie> l = new ArrayList<Trie>();
            for (Trie t: list) {
                if (t != null && t.getChar() != '`') {
                    l.add(t);
                }
            }
            return l.toArray(new Trie[l.size()]);
        }
    }

    //store char and the prefix count using 32 bits
    //the first byte is the character, the next 3 bytes get the prefix count
    //3 bytes can hold ~ 16 million, and there aren't that many english
    //words. the total word count is less than 250,000, and the prefix count
    //of any substring is less than that.
    private int count = 0;

    public char getChar() {
        // shift first, then mask: in Java, >> binds tighter than &
        return (char)((count >>> 24) & 0xFF);
    }

    public void setChar(char ch) {
        count = ((int)ch) << 24 | (count & 0x00FFFFFF);
    }

    public int getCount() {
        return count & 0x00FFFFFF;
    }

    //this is safe as the count will never get high enough
    //to push over into the most significant byte holding the character
    public void incCount() {
        count++;
    }

    GetAndPut suffixChars = new SuffixCharsWithArray();
    //GetAndPut suffixChars = new SuffixCharsWithMap();

    public Trie(char ch) {
        this.setChar(ch);
    }

    public Trie add(char ch) {
        Trie node = this.suffixChars.get(ch);
        if (node == null) {
            Trie newNode = new Trie(ch);
            this.suffixChars.put(ch, newNode);
            node = newNode;
        }

        //adding the count to the current node is preferable
        //to adding to the node that matches the character.
        //This way, we won't add to the sentinel node
        //and we add only in one place.

        this.incCount();
        return node;
    }

    public int size() {
        // mask off the character byte so only the prefix count is returned
        return this.getCount();
    }

    private Trie findChar(char ch) {
        return this.suffixChars.get(ch);
    }

    public boolean findWord(String word) {
        Trie node = this;
        for (char ch: word.toCharArray()) {
            node = node.findChar(ch);
            if (node == null) {
                return false;
            }
        }
        //we may have found a prefix, make sure it is a word
        //if it's a word, the list must have the sentinel.
        return node.suffixChars.get('`') != null;
    }

    public int findPartial(String prefix) {
        Trie node = this;
        for (char ch : prefix.toCharArray()) {
            node = node.suffixChars.get(ch);
            if (node == null) {
                return 0;
            }
        }
        return node.size();
    }

    public void add(String s) {
        Trie node = this;
        for (char ch : s.toCharArray()) {
            node = node.add(ch);
        }
        //add the sentinel to mark the end of the word.
        node.add('`');
    }

    private void walk(int[] indices) {
        indices[this.suffixChars.lastUsedIndex()] ++;
        for (Trie ch : suffixChars.children()) {
            ch.walk(indices);
        }
    }

    public int[] lastUsedIndices() {
        int[] indices = new int[(int)'z' - (int)'`' + 1];
        walk(indices);
        return indices;
    }

    private void walk2() {
        suffixChars.trim();
        for (Trie ch : suffixChars.children()) {
            ch.walk2();
        }
    }

    static private int walk3(Trie t) {
        if (t == null) return 0;
        // 4 = size of `count`
        // 8 = size of each reference to a Trie

        int acc = 4 + 8 * (((SuffixCharsWithArray)t.suffixChars).list.length);
        for (Trie node: t.suffixChars.children()) {
            acc += walk3(node);
        }
        return acc;
    }

    public void trim() {
        walk2();
    }

    public int sizeInBytes() {
        return walk3(this);
    }

    public void read() throws FileNotFoundException {
        String wordFilePath = "/Users/thushara/lcwords.txt";
        BufferedReader br = new BufferedReader(new FileReader(wordFilePath));
        String word;
        try {
            while ((word = br.readLine()) != null) {
                add(word);
            }
        } catch (IOException e) {
            System.err.format("disk error! %s", e.getMessage());
        }
    }

    static public String getRandomWordList() throws MalformedURLException, IOException {
        Pattern alpha = Pattern.compile("^[A-Za-z]+$");
        String url = "https://pastebin.com/raw/NXH7UAr1";
        URL obj = new URL(url);
        HttpURLConnection con = (HttpURLConnection) obj.openConnection();
        con.setRequestMethod("GET");
        int responseCode = con.getResponseCode();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream()));
        String inputLine;
        StringBuffer response = new StringBuffer();

        while ((inputLine = in.readLine()) != null) {
            Matcher m = alpha.matcher(inputLine);
            if (m.matches()) {
                // separate the words so that split(" ") in main() can recover them
                response.append(inputLine.toLowerCase()).append(' ');
            }
        }
        in.close();
        return response.toString();
    }

    static public void main(String[] args) throws FileNotFoundException, IOException {
        long mem1 = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        System.out.format("memory usage at start %d\n", mem1);
        Trie trie = new Trie('$');
        trie.read();
        System.out.format("size of trie in bytes: %d\n", trie.sizeInBytes());
        trie.trim();
        System.out.format("size of trimmed trie in bytes: %d\n", trie.sizeInBytes());

        Scanner in = new Scanner(System.in);
        System.out.println("type a word in lower case (upper case char to exit)> ");
        while (true) {
            String s = in.next();
            if (Character.isUpperCase(s.charAt(0))) break;
            boolean found = trie.findPartial(s) > 0;
            System.out.println(found ? "yes" : "no");
        }

        long st = System.currentTimeMillis();

        String words = getRandomWordList();

        String[] arr = words.split(" ");
        for (String s: arr) {
             if (!s.isEmpty() && !trie.findWord(s)) System.out.println("couldn't find " + s);
        }
        long elapsed = System.currentTimeMillis() - st;
        long mem2 = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        System.out.format("memory usage at end   %d\n", mem2);
        System.out.format("took %d ms for %d words using %d MB\n", elapsed, arr.length, (mem2 - mem1)/1024/1024);
    }

}