Eric Ihli 56be9e9898 Implement hash-map and byte-array tries
The hash-map trie is convenient to work with at the REPL since the
key/values are human readable and the backing data is traversable in a
fashion familiar to Clojure.

The byte-array backed trie has a slightly different API but is far more
memory efficient.

The paper that the tightly packed trie is based on can be viewed at
https://www.aclweb.org/anthology/W09-1505.pdf
4 years ago

README.org

Clojure Tightly Packed Trie

What does this do?

Tries as hash-maps are common, but hash-maps take up a lot of memory (relatively speaking).

For example, creating a hash-map trie of the 1-, 2-, and 3-grams of a short story by Edgar Allan Poe results in a hash-map that consumes over 2 megabytes of memory. See this markov language model example.

If you're dealing with much larger corpora, the memory footprint could become an issue.

A tightly packed trie, on the other hand, is tiny. A tightly packed trie on the same corpus is only 37 kilobytes. That's 1.7% of the hash-map backed trie's size!

How do you use the library?

A hash-map-backed trie is created by passing a variable number of "trie entries" to trie.

A "trie entry" is a list of keys with the last element of the list being the "value" of the node.

(require '[com.owoga.tightly-packed-trie.core :as tpt])

;;  [trie-entry       '(k1  k2  k3  val)
(def trie-entries
  (let [trie-entry-1  '("D" "O" "G" "DOG")
        trie-entry-2  '("D" "O" "T" "DOT")
        trie-entry-3  '("D" "O"     "DO")
        trie-entry-4  '("D" "A" "Y" "DAY")]
    [trie-entry-1
     trie-entry-2
     trie-entry-3
     trie-entry-4]))

;; (trie trie-entry-1 trie-entry-2 trie-entry-n ,,,)
(def hash-map-backed-trie
  (apply tpt/trie trie-entries))

Once the trie is created, you can get a seq of all of the descendants below a certain key by using get.

;; All of the nodes that are descendants of '("D" "O")
(seq (get hash-map-backed-trie '("D" "O")))
;; => ({"G" {:value "DOG", :count 1}}
;;     {"T" {:value "DOT", :count 1}}
;;     {"O" {:count 1, :value "DO"}})

;; All of the nodes that are descendants of '("D" "A")
(seq (get hash-map-backed-trie '("D" "A")))
;; => ({"Y" {:value "DAY", :count 1}})

New nodes can be conj'ed into the trie.

(let [new-trie (conj hash-map-backed-trie
                     '("D" "A" "D" "DAD"))]
  (seq (get new-trie '("D" "A"))))
;; => ({"D" {:value "DAD", :count 1}} {"Y" {:value "DAY", :count 1}})

The entire map can be viewed with as-map.

There's also as-vec, which returns the trie as a vector that can be passed directly to clojure.zip/vector-zip.

(tpt/as-map hash-map-backed-trie)
;; => {:root
;;     {:children
;;      {"D"
;;       {:children
;;        {"A" {:children {"Y" {:value "DAY", :count 1}}},
;;         "O"
;;         {:children {"G" {:value "DOG", :count 1}, "T" {:value "DOT", :count 1}},
;;          :count 1,
;;          :value "DO"}}}}}}
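
As a rough sketch of what walking a vector with clojure.zip/vector-zip looks like: the nested vector below is made up for illustration, since this README doesn't show the exact shape that as-vec returns, and the leaves helper is hypothetical too.

```clojure
(require '[clojure.zip :as zip])

;; Hypothetical nested vector, NOT the real output of tpt/as-vec.
(def example-vec ["D" ["O" ["G"] ["T"]] ["A" ["Y"]]])

(defn leaves
  "Walks a vector zipper depth-first and returns every non-vector node."
  [v]
  (->> (zip/vector-zip v)
       (iterate zip/next)
       (take-while (complement zip/end?))
       (map zip/node)
       (remove vector?)))

(leaves example-vec)
;; => ("D" "O" "G" "T" "A" "Y")
```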

get returns a Trie, so all of the ITrie protocol functions work on the value that is returned by get.

(tpt/as-map (get hash-map-backed-trie '("D" "O")))
;; => {"O"
;;     {:children {"G" {:value "DOG", :count 1}, "T" {:value "DOT", :count 1}},
;;      :count 1,
;;      :value "DO"}}

There's also a transform function in the ITrie protocol that iterates over each loc in the zippered Trie and calls your given function on the loc.

This is useful, as the name suggests, for performing transformations.

(require '[clojure.zip :as zip]
         '[clojure.string :as string])

(let [lower-cased-keys-trie
      (tpt/transform
       hash-map-backed-trie
       (fn [loc]
         (if (map? (zip/node loc))
           (zip/edit
            loc
            (fn [node]
              ;; Each map node holds a single key -> data entry.
              (let [[k v] (first node)]
                {(string/lower-case k) v})))
           loc)))]
  (seq lower-cased-keys-trie))
;; => ({"y" {:value "DAY", :count 1}}
;;     {"g" {:value "DOG", :count 1}}
;;     {"t" {:value "DOT", :count 1}}
;;     {"o" {:count 1, :value "DO"}})

Tightly Packed Tries

The trie above is backed by a Clojure hash-map.

It's not very efficient. All of the strings, nested maps, pointers… it all adds up to a lot of wasted memory.

A tightly packed trie provides the same functionality at an impressively small fraction of the memory footprint.

One restriction though: all keys and values must be integers. To convert them from integer identifiers back into the values that your biological self can process, you'll need to keep some type of database or in-memory map of ids to human-parseable things.
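
A minimal sketch of such an in-memory map, using plain Clojure only. None of these helpers (build-vocab, encode-path, decode-path) come from this library; they're hypothetical names for illustration. Ids start at 1 only so they line up with the example that follows; that choice is arbitrary.

```clojure
(require '[clojure.set :as set])

(defn build-vocab
  "Assigns each distinct token an integer id (starting at 1) in order of
  first appearance, and keeps the inverse map for decoding."
  [tokens]
  (let [token->id (zipmap (distinct tokens) (iterate inc 1))]
    {:token->id token->id
     :id->token (set/map-invert token->id)}))

(defn encode-path [vocab tokens] (mapv (:token->id vocab) tokens))
(defn decode-path [vocab ids]    (mapv (:id->token vocab) ids))

(def vocab (build-vocab ["D" "O" "G" "D" "O" "T" "D" "A" "Y"]))

(encode-path vocab ["D" "O" "G"])
;; => [1 2 3]
(decode-path vocab [1 2 4])
;; => ["D" "O" "T"]
```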

Here's a similar example to that above, but with values that we can tightly pack.

(require '[com.owoga.tightly-packed-trie.core :as tpt])

;;  [trie-entry    '(path     value)
(def trie-entries
  (let [trie-entry-1  '(1 2 3      123)
        trie-entry-2  '(1 2 1      121)
        trie-entry-3  '(1 2 2      122)
        trie-entry-4  '(1 3 1      131)]
    [trie-entry-1
     trie-entry-2
     trie-entry-3
     trie-entry-4]))

(def non-tightly-packed-trie
  (apply tpt/trie trie-entries))

(tpt/as-map non-tightly-packed-trie)
;; => {:root
;;     {:children
;;      {1
;;       {:children
;;        {2
;;         {:children
;;          {1 {:value 121, :count 1},
;;           2 {:value 122, :count 1},
;;           3 {:value 123, :count 1}}},
;;         3 {:children {1 {:value 131, :count 1}}}}}}}}

There's a slightly mis-named function that creates a byte-array representation of each node.

as-byte-array is named similarly to as-map and as-vec. But it's mis-named because it doesn't actually return a byte-array like the name suggests. I may fix that in the future.

Instead, it adds some keys to each value, byte-address and byte-array.

The byte-address is the offset that this node is going to be at in the final contiguous byte-array that makes up the tightly packed trie.

The byte-array is the byte-encoded value of the node's key, value, size of the node's children index, and encoded values for each child's key and byte-address-offset from the current node.

The byte-addresses and byte-arrays are calculated on the assumption that the nodes are written to the final contiguous byte array in depth-first post-order over the vector representation of the trie.

That requirement means the child nodes of each node need to be sorted!

Even though the Trie code looks like it's just backed by regular old hash-maps, it's actually backed by sorted-maps!
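
To make the post-order layout concrete, here's a small sketch that assigns addresses the same way. It is not the library's code: assign-addresses and the :size key are hypothetical, the sizes are simply the lengths of the byte-arrays in the example output in this README, and the starting offset of 8 is taken from that output too (what occupies the first 8 bytes is not stated here).

```clojure
;; Node sizes copied from the lengths of the :byte-array values in the
;; example output; the tree shape mirrors the integer trie above.
(def example-tree
  {:size 5
   :children
   (sorted-map
    1 {:size 7
       :children
       (sorted-map
        2 {:size 9
           :children (sorted-map 1 {:size 3} 2 {:size 3} 3 {:size 3})}
        3 {:size 5
           :children (sorted-map 1 {:size 4})})})})

(defn assign-addresses
  "Post-order walk: children (in sorted key order) get their addresses
  first, then the node itself. Returns [node-with-addresses next-offset]."
  [node offset]
  (let [[children offset]
        (reduce (fn [[acc off] [k child]]
                  (let [[child' off'] (assign-addresses child off)]
                    [(assoc acc k child') off']))
                [(sorted-map) offset]
                (:children node))]
    [(assoc node :children children :byte-address offset)
     (+ offset (:size node))]))

(let [[tree total] (assign-addresses example-tree 8)]
  [(:byte-address tree) total])
;; => [42 47]
```

The root ends up at address 42 and the total comes to 47 bytes, which agrees with the byte-addresses and buffer capacity shown in this README.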

(def non-tightly-packed-trie-with-raw-byte-info-added
  (tpt/as-byte-array non-tightly-packed-trie))

(tpt/as-map non-tightly-packed-trie-with-raw-byte-info-added)
;; => {:root
;;     {:byte-address 42,
;;      :byte-array [-128, -128, -126, -127, 7],
;;      :children
;;      {1
;;       {:byte-address 35,
;;        :byte-array [-128, -128, -124, -126, 18, -125, 5],
;;        :children
;;        {2
;;         {:byte-address 17,
;;          :byte-array [-128, -128, -122, -127, 9, -126, 6, -125, 3],
;;          :children
;;          {1
;;           {:value 121,
;;            :count 1,
;;            :byte-address 8,
;;            :byte-array [-7, -127, -128],
;;            :children {}},
;;           2
;;           {:value 122,
;;            :count 1,
;;            :byte-address 11,
;;            :byte-array [-6, -127, -128],
;;            :children {}},
;;           3
;;           {:value 123,
;;            :count 1,
;;            :byte-address 14,
;;            :byte-array [-5, -127, -128],
;;            :children {}}}},
;;         3
;;         {:byte-address 30,
;;          :byte-array [-128, -128, -126, -127, 4],
;;          :children
;;          {1
;;           {:value 131,
;;            :count 1,
;;            :byte-address 26,
;;            :byte-array [1, -125, -127, -128],
;;            :children {}}}}}}}}}
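
The leaf byte-arrays above look like a variable-length integer encoding: 7 bits per byte, most significant group first, with the high bit set on the last byte of each number. That's inferred from the output, not taken from the library source, so treat the following as a sketch. The names encode-vb and decode-vb are hypothetical. (Inner nodes also carry a child index whose offsets seem to be flagged differently, so only leaves are reproduced here.)

```clojure
(defn encode-vb
  "Encodes a non-negative integer as 7-bit groups, most significant first,
  with the high bit set on the final byte."
  [n]
  (loop [n   n
         out (list (unchecked-byte (bit-or 0x80 (bit-and n 0x7f))))]
    (let [n (bit-shift-right n 7)]
      (if (zero? n)
        (vec out)
        (recur n (conj out (unchecked-byte (bit-and n 0x7f))))))))

(defn decode-vb
  "Inverse of encode-vb for a single encoded number."
  [bs]
  (reduce (fn [acc b] (+ (* acc 128) (bit-and b 0x7f))) 0 bs))

;; A leaf appears to be: value, count, size of the (empty) child index.
(into [] (mapcat encode-vb) [121 1 0])
;; => [-7 -127 -128]
(into [] (mapcat encode-vb) [131 1 0])
;; => [1 -125 -127 -128]
```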

Once the trie is transformed to have the byte-array info on each node, you can pass that trie to tightly-packed-trie to get a MUCH more memory-efficient trie.

This trie is backed by a ByteBuffer rather than a hash-map.

(def tightly-packed-trie
  (tpt/tightly-packed-trie non-tightly-packed-trie-with-raw-byte-info-added))

(.capacity (.byte-buffer tightly-packed-trie))
;; => 47
;;
;;;; Instead of a map with all of its pointers, we are storing
;;;; all of the information necessary for this trie in
;;;; just 47 bytes!

;;;; Hash-map-backed and Tightly-packed comparison
;; The APIs are slightly different. But you have access to basically the same data.

;;;; Getting the value of a node in a hash-map-backed trie.
;;
(-> (get non-tightly-packed-trie '(1 2 3))
    tpt/as-map
    seq
    first
    second
    (select-keys [:value :count]))
;; => {:value 123, :count 1}

;;;; Getting the value of a node in a tightly-packed trie.
;;
(tpt/value (get tightly-packed-trie '(1 2 3)))
;; => {:value 123, :count 1}

It's backed by a byte-buffer so saving to disk is trivial, but there's a helper for that.

Here's the process of saving to and loading from disk. (Only works for tightly-packed tries.)

(tpt/save-tightly-packed-trie-to-file "/tmp/tpt.bin" tightly-packed-trie)

(def saved-and-loaded-tpt
  (tpt/load-tightly-packed-trie-from-file "/tmp/tpt.bin"))

(tpt/value (get saved-and-loaded-tpt '(1 2 3)))
;; => {:value 123, :count 1}
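
For a feel of why a byte-buffer is easy to persist, here's a sketch of a round trip using nothing but JDK interop. This is not how save-tightly-packed-trie-to-file is implemented (that isn't shown in this README); write-buffer!, read-buffer and buffer->bytes are hypothetical helpers.

```clojure
(require '[clojure.java.io :as io])
(import '(java.nio ByteBuffer)
        '(java.nio.file Files))

(defn buffer->bytes
  "Copies the full contents of buf into a fresh byte array. Works on a
  duplicate so the original buffer's position is left untouched."
  ^bytes [^ByteBuffer buf]
  (let [view (.duplicate buf)]
    (.rewind view)
    (let [bs (byte-array (.remaining view))]
      (.get view bs)
      bs)))

(defn write-buffer! [^ByteBuffer buf path]
  (with-open [out (io/output-stream path)]
    (.write out (buffer->bytes buf))))

(defn read-buffer ^ByteBuffer [path]
  (ByteBuffer/wrap (Files/readAllBytes (.toPath (io/file path)))))

;; Round trip with a few arbitrary bytes.
(let [f   (java.io.File/createTempFile "buffer-demo" ".bin")
      buf (ByteBuffer/wrap (byte-array [-7 -127 -128]))]
  (write-buffer! buf f)
  (vec (buffer->bytes (read-buffer f))))
;; => [-7 -127 -128]
```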

TODO Why would you want a trie data structure?

TODO: The below is closer to a CSCI lesson than library documentation. If it's necessary, figure out where to put it, how to word it, etc… It might not be worth cluttering documentation with so much detail.

Autocomplete

A user types in the characters "D" "O" and you want to show all possible autocompletions.

Typical "List" data structure

  • Iterate through each word starting from the beginning.
  • When you get to the first word that starts with the letters "D" "O", start keeping track of words
  • When you get to the next word that doesn't start with "D" "O", you have all the words you want to use for autocomplete.

(def dictionary ["Apple" "Banana" "Carrot" "Do" "Dog" "Dot" "Dude" "Egg"])
;; => #'markov-language-model/dictionary

Problems with a list.

It's slow if you have a big list. If you have a dictionary with hundreds of thousands of words and the user is typing in letters that don't show up until the end of the list, then you're searching through the first few hundred thousand items in the list before you get to what you need.

If you're familiar with binary search over sorted lists, you'll know this is a contrived example.
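
The list scan described above can be sketched like this, next to a sorted-set version that skips straight to the first candidate instead of walking from the start. Both helper names are hypothetical, and both assume the words are in sorted order.

```clojure
(require '[clojure.string :as string])

(def dictionary ["Apple" "Banana" "Carrot" "Do" "Dog" "Dot" "Dude" "Egg"])

(defn completions-by-scan
  "Linear scan: skip words until the prefix matches, then keep words
  until it stops matching."
  [words prefix]
  (let [match? #(string/starts-with? % prefix)]
    (->> words
         (drop-while (complement match?))
         (take-while match?))))

(defn completions-by-sorted-set
  "Uses subseq on a sorted set to jump to the first word >= prefix."
  [sorted-words prefix]
  (take-while #(string/starts-with? % prefix)
              (subseq sorted-words >= prefix)))

(completions-by-scan dictionary "Do")
;; => ("Do" "Dog" "Dot")
(completions-by-sorted-set (apply sorted-set dictionary) "Do")
;; => ("Do" "Dog" "Dot")
```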

Typical "Trie" in Clojure

{"A" {:children {"P" {,,,}}
      :value nil}
 "D" {:children {"O" {:children {"G" {:children {} :value "DOG"}
                                 "T" {:children {} :value "DOT"}}
                      :value "DO"}}
      :value nil}}

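
Here's a sketch of an autocomplete lookup over a nested map of that shape. It uses plain maps rather than this library's Trie type, and find-node and completions are hypothetical names. The lookup only follows one key per typed character, no matter how many other words are in the dictionary.

```clojure
(def example-trie
  {"D" {:children {"O" {:children {"G" {:children {} :value "DOG"}
                                   "T" {:children {} :value "DOT"}}
                        :value "DO"}
                   "A" {:children {"Y" {:children {} :value "DAY"}}
                        :value nil}}
        :value nil}})

(defn find-node
  "Follows the (non-empty) key path down the trie, one map lookup per key."
  [trie ks]
  (get-in trie (interpose :children ks)))

(defn completions
  "Returns the values stored at or below the node for the given key path,
  or nil if the path isn't in the trie."
  [trie ks]
  (when-let [node (find-node trie ks)]
    (->> (tree-seq map? (comp vals :children) node)
         (keep :value))))

(set (completions example-trie ["D" "O"]))
;; => #{"DO" "DOG" "DOT"}
```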
How is a trie faster?