clj-tightly-packed-trie/README.org

#+TITLE: Clojure Tightly Packed Trie

[[https://img.shields.io/clojars/v/com.owoga/tightly-packed-trie.svg]]

* What does this do?

Tries as hash-maps are common, but hash-maps take up a lot of memory (relatively speaking).

For example, creating a hash-map trie of 1, 2, and 3-grams of short story by Edgar Allen Poe results in a hash-map that consumes over 2 megabytes of memory. [[file:examples/markov_language_model.clj][See this markov language model example]].

If you're dealing with much larger corpuses, the memory footprint could become an issue.

A tightly packed trie, on the other hand, is tiny. A tightly packed trie on the same corpus is only 37 kilobytes. That's ~4% of the original trie's size, even after the original trie's keys/values have all been condensed to numbers!

** How do you use library?

A trie is created similar to a hash-map by passing a variable number of "trie entries" to ~trie~.

A "trie entry" is basically the same thing as a map entry. It's just a key and a value.

But for a Trie, the key *must* be seqable and for a tightly-packed trie all keys *must* be comparable.

#+begin_src clojure :results none :session usage-example
(require '[com.owoga.trie :as trie])

(def loosely-packed-trie (trie/make-trie "dog" :dog "dot" :dot "do" :do "day" :day))
loosely-packed-trie
;; => {[\d \a \y] :day, [\d \o \g] :dog, [\d \o \t] :dot, [\d \o] :do}
#+end_src

You'll see from the output of that last line above that the default REPL representation of a Trie
is a flat hash-map-looking-thing. It's actually a sorted-hash-map-looking-thing, because if you seq
over it, you'll get the trie-entries in depth-first post-order traversal.

In some ways, a Trie behaves a lot like a map.

`get` returns the value at the key.

#+begin_src clojure :results none :session usage-example
(get loosely-packed-trie "dog")
;; => :dog
(get loosely-packed-trie "do")
;; => :do
(get (assoc loosely-packed-trie "dove" {:value "dove" :count 10}) "dove")
;; => {:value "dove", :count 10}
#+end_src

But there's a couple cool Trie-specific functions.

`lookup` returns the *Trie* at the key. This way, you have access to all of the node's descendants.

#+begin_src clojure :results none :session usage-example
(trie/lookup loosely-packed-trie "do")
;; => {[\g] :dog, [\t] :dot}
(seq (trie/lookup loosely-packed-trie "do"))
#+end_src

`children` returns the direct children of a node.

#+begin_src clojure :results none :session usage-example
(trie/children (trie/lookup loosely-packed-trie "do"))
;; => ({} {})
#+end_src

That's odd... there's two things in there that look like empty maps.

#+begin_src clojure :results none :session usage-example
(map #(get % []) (trie/children (trie/lookup loosely-packed-trie "do")))
;; => (:dog :dot)
#+end_src

The REPL representation of a Trie only shows children key/values. The "root" node (not necessarily the "true" root node if you've travsersed down with `lookup`) doesn't print any data to REPL.
So if you're looking ata node with no children, you'll see `{}` in the REPL. But you can get the value of that node with `(get node [])`

* Tightly Packed Tries

The trie above is backed by regular old Clojure data structures: hash-maps and vectors.

It's not very efficient. All of the strings, nested maps, pointers... it all adds up to a lot of wasted memory.

A tightly packed trie provides the same functionality at an impressively small fraction of the memory footprint.

One restriction though: all keys and values must be integers. To convert them from integer identifiers back into the values that your biological self can process, you'll need to keep some type of database or in-memory map of ids to human-parseable things.

Here's a similar example to that above, but with values that we can tightly pack.

#+begin_src clojure :results none :session usage-example
(require '[com.owoga.tightly-packed-trie :as tpt]
         '[com.owoga.tightly-packed-trie.encoding :as encoding])

(defn encode-fn [v]
  (if (nil? v)
    (encoding/encode 0)
    (encoding/encode v)))

(defn decode-fn [byte-buffer]
  (let [v (encoding/decode byte-buffer)]
    v
    (if (zero? v) nil v)))

(def tight-ready-loosely-packed-trie
  (trie/make-trie '(1 2 3) 123 '(1 2 1) 121 '(1 2 2) 122 '(1 3 1) 131))

(def tightly-packed-trie
  (tpt/tightly-packed-trie
   tight-ready-loosely-packed-trie
   encode-fn
   decode-fn))

(get tightly-packed-trie [1 2 3])
;; => 123

(map #(get % []) (trie/children (trie/lookup tightly-packed-trie [1 2])))
;; => (121 122 123)

(seq tightly-packed-trie)
;; => ([[1 2 1] 121]
;;     [[1 2 2] 122]
;;     [[1 2 3] 123]
;;     [[1 2] nil]
;;     [[1 3 1] 131]
;;     [[1 3] nil]
;;     [[1] nil])
#+end_src


Instead of a map with all of its pointers, we are storing all of the information
necessary for this trie in just 39 bytes!

#+begin_src clojure :results none :session usage-example
(require '[cljol.dig9 :as d])

(.capacity (.byte-buffer tightly-packed-trie))
;; => 39
#+end_src

It's backed by a byte-buffer so saving to disk is trivial, but there's a helper for that.

Here's the process of saving to and loading from disk. (Only works for tightly-packed tries.)

#+begin_src clojure :results none :session usage-example
(tpt/save-tightly-packed-trie-to-file "/tmp/tpt.bin" tightly-packed-trie)

(def saved-and-loaded-tpt
  (tpt/load-tightly-packed-trie-from-file "/tmp/tpt.bin" decode-fn))

(get saved-and-loaded-tpt '(1 2 3))
;; => 123
#+end_src

* Credits

Ulrich Germann, Eric Joanis, and Samuel Larkin of the National Research Institute of Canada for the paper [[https://www.aclweb.org/anthology/W09-1505.pdf][Tightly Packed Tries: How to Fit Large Models into Memory,and Make them Load Fast, Too]].

Lots of credit also goes to the Clojurians community.

* TODO Why would you want a trie data structure?

TODO: The below is closer to a CSCI lesson than library documentation. If it's necessary, figure out where to put it, how to word it, etc... It might not be worth cluttering documentation with so much detail.

** Autocomplete

A user types in the characters "D" "O" and you want to show all possible autocompletions.

*** Typical "List" data structure

- Iterate through each word starting from the beginning.
- When you get to the first word that starts with the letters "D" "O", start keeping track
  of words
- When you get to the next word that doesn't start with "D" "O", you have all the words you want to use for autocomplete.

#+begin_src clojure :results none
(def dictionary ["Apple" "Banana" "Carrot" "Do" "Dog" "Dot" "Dude" "Egg"])
#+end_src

**** Problems with a list.

It's slow if you have a big list. If you have a dictionary with hundreds of thousands of words
and the user is typing in letters that don't show up until the end of the list, then you're searching
through the first few hundred thousand items in the list before you get to what you need.

If you're familiar with binary search over sorted lists, you'll know this is a contrived example.

*** Typical "Trie" in Clojure

#+begin_src clojure :results none
{"A" {:children {"P" {,,,} :value nil}}
 "D" {:children {"O"
                 :children {"G" {:children {} :value "DOG"}
                            "T" {:children {} :value "DOT"}}
                 :value "DO"}
      :value nil}}
#+end_src

**** How is a trie faster?

* Development