Redesign API, add tests, add example useage

- Add tests
- Refactor new version of Trie to be more coll-like
- Add an example use-case using a Markov chain to generate text
main
Eric Ihli 4 years ago
parent 56be9e9898
commit 77475e0c13

@ -8,115 +8,70 @@ For example, creating a hash-map trie of 1, 2, and 3-grams of short story by Edg
If you're dealing with much larger corpuses, the memory footprint could become an issue.
A tightly packed trie, on the other hand, is tiny. A tightly packed trie on the same corpus is only 37 kilobytes. That's 1.7% of the hash-map backed trie's size!
A tightly packed trie, on the other hand, is tiny. A tightly packed trie on the same corpus is only 37 kilobytes. That's ~4% of the original trie's size, even after the original trie's keys/values have all been condensed to numbers!
** How do you use library?
A hash-map-backed trie is created by passing a variable number of "trie entries" to ~trie~.
A trie is created similar to a hash-map by passing a variable number of "trie entries" to ~trie~.
A "trie entry" is a list of keys with the last element of the list being the "value" of the node.
A "trie entry" is basically the same thing as a map entry. It's just a key and a value.
But for a Trie, the key *must* be seqable and for a tightly-packed trie all keys *must* be comparable.
#+begin_src clojure :results none :session usage-example
(require '[com.owoga.tightly-packed-trie.core :as tpt])
;; [trie-entry '(k1 k2 k3 val)
(def trie-entries
(let [trie-entry-1 '("D" "O" "G" "DOG")
trie-entry-2 '("D" "O" "T" "DOT")
trie-entry-3 '("D" "O" "DO")
trie-entry-4 '("D" "A" "Y" "DAY")]
[trie-entry-1
trie-entry-2
trie-entry-3
trie-entry-4]))
;; (trie trie-entry-1 trie-entry-2 trie-entry-n ,,,)
(def hash-map-backed-trie
(apply tpt/trie trie-entries))
(require '[com.owoga.trie :as trie])
(def loosely-packed-trie (trie/make-trie "dog" :dog "dot" :dot "do" :do "day" :day))
loosely-packed-trie
;; => {[\d \a \y] :day, [\d \o \g] :dog, [\d \o \t] :dot, [\d \o] :do}
#+end_src
Once the trie is created, you can get a seq of all of the descendants of below a certain key
by using ~get~.
You'll see from the output of that last line above that the default REPL representation of a Trie
is a flat hash-map-looking-thing. It's actually a sorted-hash-map-looking-thing, because if you seq
over it, you'll get the trie-entries in depth-first post-order traversal.
#+begin_src clojure :results none :session usage-example
;; All of the nodes that are descendants of '("D" "O")
(seq (get hash-map-backed-trie '("D" "O")))
;; => ({"G" {:value "DOG", :count 1}}
;; {"T" {:value "DOT", :count 1}}
;; {"O" {:count 1, :value "DO"}})
;; All of the nodes that are descendants of '("D" "A")
(seq (get hash-map-backed-trie '("D" "A")))
;; => ({"Y" {:value "DAY", :count 1}})
#+end_src
In some ways, a Trie behaves a lot like a map.
New nodes can be ~conj~ed into the trie.
`get` returns the value at the key.
#+begin_src clojure :results none :session usage-example
(let [new-trie (conj hash-map-backed-trie
'("D" "A" "D" "DAD"))]
(seq (get new-trie '("D" "A"))))
;; => ({"D" {:value "DAD", :count 1}} {"Y" {:value "DAY", :count 1}})
(get loosely-packed-trie "dog")
;; => :dog
(get loosely-packed-trie "do")
;; => :do
(get (assoc loosely-packed-trie "dove" {:value "dove" :count 10}) "dove")
;; => {:value "dove", :count 10}
#+end_src
The entire map can be viewed with ~as-map~.
But there's a couple cool Trie-specific functions.
There's also ~as-vec~ which returns the trie as a vector that can be
passed directly to ~clojure.zipper/vector-zip~.
`lookup` returns the *Trie* at the key. This way, you have access to all of the node's descendants.
#+begin_src clojure :results none :session usage-example
(tpt/as-map hash-map-backed-trie)
;; => {:root
;; {:children
;; {"D"
;; {:children
;; {"A" {:children {"Y" {:value "DAY", :count 1}}},
;; "O"
;; {:children {"G" {:value "DOG", :count 1}, "T" {:value "DOT", :count 1}},
;; :count 1,
;; :value "DO"}}}}}}
(trie/lookup loosely-packed-trie "do")
;; => {[\g] :dog, [\t] :dot}
#+end_src
~get~ returns a ~Trie~, so all of the ~ITrie~ protocol functions work on the value that is returned by ~get~.
`children` returns the direct children of a node.
#+begin_src clojure :results none :session usage-example
(tpt/as-map (get hash-map-backed-trie '("D" "O")))
;; => {"O"
;; {:children {"G" {:value "DOG", :count 1}, "T" {:value "DOT", :count 1}},
;; :count 1,
;; :value "DO"}}
(trie/children (trie/lookup loosely-packed-trie "do"))
;; => ({} {})
#+end_src
There's also a ~transform~ function in the ~ITrie~ protocol that iterates over each
loc in the zippered Trie and calls your given function on the loc.
This is useful, as the name suggests, for performing transformations.
That's odd... there's two things in there that look like empty maps.
#+begin_src clojure :results none :session usage-example
(require '[clojure.zip :as zip]
'[clojure.string :as string])
(let [lower-cased-keys-trie
(tpt/transform
hash-map-backed-trie
(fn [loc]
(if (map? (zip/node loc))
(zip/edit
loc
(fn [node]
(let [[k v] (first (seq (zip/node loc)))]
{(string/lower-case k) v})))
loc)))]
(seq lower-cased-keys-trie))
;; => ({"y" {:value "DAY", :count 1}}
;; {"g" {:value "DOG", :count 1}}
;; {"t" {:value "DOT", :count 1}}
;; {"o" {:count 1, :value "DO"}})
(map #(get % []) (trie/children (trie/lookup loosely-packed-trie "do")))
;; => (:dog :dot)
#+end_src
The REPL representation of a Trie only shows children key/values. The "root" node (not necessarily the "true" root node if you've travsersed down with `lookup`) doesn't print any data to REPL.
So if you're looking ata node with no children, you'll see `{}` in the REPL. But you can get the value of that node with `(get node [])`
* Tightly Packed Tries
The trie above is backed by a Clojure hash-map.
The trie above is backed by regular old Clojure data structures: hash-maps and vectors.
It's not very efficient. All of the strings, nested maps, pointers... it all adds up to a lot of wasted memory.
@ -127,133 +82,53 @@ One restriction though: all keys and values must be integers. To convert them fr
Here's a similar example to that above, but with values that we can tightly pack.
#+begin_src clojure :results none :session usage-example
(require '[com.owoga.tightly-packed-trie.core :as tpt])
;; [trie-entry '(path value)
(def trie-entries
(let [trie-entry-1 '(1 2 3 123)
trie-entry-2 '(1 2 1 121)
trie-entry-3 '(1 2 2 122)
trie-entry-4 '(1 3 1 131)]
[trie-entry-1
trie-entry-2
trie-entry-3
trie-entry-4]))
(def non-tightly-packed-trie
(apply tpt/trie trie-entries))
(tpt/as-map non-tightly-packed-trie)
;; => {:root
;; {:children
;; {1
;; {:children
;; {2
;; {:children
;; {1 {:value 121, :count 1},
;; 2 {:value 122, :count 1},
;; 3 {:value 123, :count 1}}},
;; 3 {:children {1 {:value 131, :count 1}}}}}}}}
#+end_src
There's a slightly mis-named function that creates a byte-array representation of each node.
~as-byte-array~ is named similarly to ~as-map~ and ~as-vec~. But it's mis-named because it doesn't
actually return a ~byte-array~ like the name suggests. I may fix that in the future.
Instead, it adds some keys to each value, ~byte-address~ and ~byte-array~.
(require '[com.owoga.tightly-packed-trie :as tpt]
'[com.owoga.tightly-packed-trie.encoding :as encoding])
The ~byte-address~ is the offset that this node is going to be at in the final contiguous byte-array that makes up the tightly packed trie.
(defn encode-fn [v]
(if (nil? v)
(encoding/encode 0)
(encoding/encode v)))
The ~byte-array~ is the byte-encoded value of the node's key, value, size of the node's children index, and encoded values for each child's key and byte-address-offset from the current node.
(defn decode-fn [byte-buffer]
(let [v (encoding/decode byte-buffer)]
v
(if (zero? v) nil v)))
The ~byte-addresses~ and ~byte-arrays~ are calculated assuming that the depth-first post-order traversal of the vector representation of the trie is the correct order that the nodes need to be written to the contiguous array of bytes that make up the final tightly-packed-trie.
(def loosely-packed-trie
(trie/make-trie '(1 2 3) 123 '(1 2 1) 121 '(1 2 2) 122 '(1 3 1) 131))
Part of that requirement means that the child nodes of each node need to be sorted!
Even though the Trie code looks like it's just backed by regular old hash-maps, it's actually backed by sorted-maps!
#+begin_src clojure :results none :session usage-example
(def non-tightly-packed-trie-with-raw-byte-info-added
(tpt/as-byte-array non-tightly-packed-trie))
(tpt/as-map non-tightly-packed-trie-with-raw-byte-info-added)
;; => {:root
;; {:byte-address 42,
;; :byte-array [-128, -128, -126, -127, 7],
;; :children
;; {1
;; {:byte-address 35,
;; :byte-array [-128, -128, -124, -126, 18, -125, 5],
;; :children
;; {2
;; {:byte-address 17,
;; :byte-array [-128, -128, -122, -127, 9, -126, 6, -125, 3],
;; :children
;; {1
;; {:value 121,
;; :count 1,
;; :byte-address 8,
;; :byte-array [-7, -127, -128],
;; :children {}},
;; 2
;; {:value 122,
;; :count 1,
;; :byte-address 11,
;; :byte-array [-6, -127, -128],
;; :children {}},
;; 3
;; {:value 123,
;; :count 1,
;; :byte-address 14,
;; :byte-array [-5, -127, -128],
;; :children {}}}},
;; 3
;; {:byte-address 30,
;; :byte-array [-128, -128, -126, -127, 4],
;; :children
;; {1
;; {:value 131,
;; :count 1,
;; :byte-address 26,
;; :byte-array [1, -125, -127, -128],
;; :children {}}}}}}}}}
(def tightly-packed-trie
(tpt/tightly-packed-trie
loosely-packed-trie
encode-fn
decode-fn))
(get tightly-packed-trie [1 2 3])
;; => 123
(map #(get % []) (trie/children (trie/lookup tightly-packed-trie [1 2])))
;; => (121 122 123)
(seq tightly-packed-trie)
;; => ([[1 2 1] 121]
;; [[1 2 2] 122]
;; [[1 2 3] 123]
;; [[1 2] nil]
;; [[1 3 1] 131]
;; [[1 3] nil]
;; [[1] nil])
#+end_src
Once the trie is transformed to have the byte-array info on each node, you can pass that
trie to ~tightly-packed-trie~ to get a MUCH more memory-efficient trie.
This trie is backed by a ByteBuffer rather than a hash-map.
Instead of a map with all of its pointers, we are storing all of the information
necessary for this trie in just 39 bytes!
#+begin_src clojure :results none :session usage-example
(def tightly-packed-trie
(tpt/tightly-packed-trie non-tightly-packed-trie-with-raw-byte-info-added))
(require '[cljol.dig9 :as d])
(.capacity (.byte-buffer tightly-packed-trie))
;; => 47
;;
;;;; Instead of a map with all of its pointers, we are storing
;;;; all of the information necessary for this trie in
;;;; just 47 bytes!
;;;; Hash-map-backed and Tightly-packed comparson
;; The apis are slightly different. But you have access to basically the same data.
;;;; Getting the value of a node in a hash-map-backed trie.
;;
(-> (get non-tightly-packed-trie '(1 2 3))
tpt/as-map
seq
first
second
(select-keys [:value :count]))
;; => {:value 123, :count 1}
;;;; Getting the value of a node in a tightly-packed trie.
;;
(tpt/value (get tightly-packed-trie '(1 2 3)))
;; => {:value 123, :count 1}
;; => 39
#+end_src
It's backed by a byte-buffer so saving to disk is trivial, but there's a helper for that.
@ -264,15 +139,17 @@ Here's the process of saving to and loading from disk. (Only works for tightly-p
(tpt/save-tightly-packed-trie-to-file "/tmp/tpt.bin" tightly-packed-trie)
(def saved-and-loaded-tpt
(tpt/load-tightly-packed-trie-from-file "/tmp/tpt.bin"))
(tpt/load-tightly-packed-trie-from-file "/tmp/tpt.bin" decode-fn))
(tpt/value (get saved-and-loaded-tpt '(1 2 3)))
;; => {:value 123, :count 1}
(get saved-and-loaded-tpt '(1 2 3))
;; => 123
#+end_src
* TODO Tests
* Credits
Ulrich Germann, Eric Joanis, and Samuel Larkin of the National Research Institute of Canada for the paper [[https://www.aclweb.org/anthology/W09-1505.pdf][Tightly Packed Tries: How to Fit Large Models into Memory,and Make them Load Fast, Too]].
https://practicalli.github.io/clojure/clojure-spec/
Lots of credit also goes to the Clojurians community.
* TODO Why would you want a trie data structure?
@ -289,13 +166,10 @@ A user types in the characters "D" "O" and you want to show all possible autocom
of words
- When you get to the next word that doesn't start with "D" "O", you have all the words you want to use for autocomplete.
#+begin_src clojure
#+begin_src clojure :results none
(def dictionary ["Apple" "Banana" "Carrot" "Do" "Dog" "Dot" "Dude" "Egg"])
#+end_src
#+RESULTS:
: #'markov-language-model/dictionary
**** Problems with a list.
It's slow if you have a big list. If you have a dictionary with hundreds of thousands of words
@ -306,7 +180,7 @@ If you're familiar with binary search over sorted lists, you'll know this is a c
*** Typical "Trie" in Clojure
#+begin_src clojure
#+begin_src clojure :results none
{"A" {:children {"P" {,,,} :value nil}}
"D" {:children {"O"
:children {"G" {:children {} :value "DOG"}
@ -315,9 +189,5 @@ If you're familiar with binary search over sorted lists, you'll know this is a c
:value nil}}
#+end_src
#+RESULTS:
: class java.lang.RuntimeException
**** How is a trie faster?
-

@ -1,16 +1,16 @@
(ns markov-language-model
(:require [clojure.string :as string]
[clojure.java.io :as io]
[com.owoga.tightly-packed-trie.core :as tpt]
[com.owoga.trie.math :as math]
[com.owoga.tightly-packed-trie :as tpt]
[com.owoga.tightly-packed-trie.encoding :as encoding]
[com.owoga.trie :as tr]
[cljol.dig9 :as d]
[clojure.zip :as zip]
[com.owoga.tightly-packed-trie.bit-manip :as bm]))
(def corpus (slurp (io/resource "cask_of_amontillado.txt")))
;; For better generation of text, you'll probably want to pad the starts
;; of sentences with n-1 "start-of-sentence" tokens.
(defn prep-punctuation-for-tokenization
"Puts spaces around punctuation so that they aren't
tokenized with the words they are attached to.
@ -20,6 +20,17 @@
[text]
(string/replace text #"([\.,!?])" " $1 "))
;; For better generation of text, you'll probably want to pad the starts
;; of sentences with n-1 "start-of-sentence" tokens.
(defn add-bol-and-eol-tokens [text]
(-> text
(string/replace #"(\.)" "</s> . <s>")
(#(str "<s> " %))))
(defn remove-quotes
[text]
(string/replace text #"\"" ""))
(defn remove-formatting-characters
"Input has underscores, presumably because the text
might be rendered by something that can italicize or bold text.
@ -31,6 +42,9 @@
(-> text
remove-formatting-characters
prep-punctuation-for-tokenization
remove-quotes
add-bol-and-eol-tokens
string/lower-case
(string/split #"[\n ]+")))
(defn interleave-all
@ -59,7 +73,16 @@
p1 (partition 1 1 tokens)
p2 (partition 2 1 tokens)
p3 (partition 3 1 tokens)]
(interleave-all p1 p2 p3)))
[p1
p2
p3
(interleave-all p1 p2 p3)])
;; => [((1) (2) (3) (4) (5))
;; ((1 2) (2 3) (3 4) (4 5))
;; ((1 2 3) (2 3 4) (3 4 5))
;; ((1) (1 2) (1 2 3) (2) (2 3) (2 3 4) (3) (3 4) (3 4 5) (4) (4 5) (5))]
)
(defn ngramify-tokens [n m tokens]
(let [partition-colls (map #(partition % 1 tokens) (range n m))
@ -69,38 +92,21 @@
(comment
(->> (tokenize corpus)
(take 5)
(ngramify-tokens 1 4)) ;; => (("The")
;; ("The" "thousand")
;; ("The" "thousand" "injuries")
(ngramify-tokens 1 4))
;; => (("the")
;; ("the" "thousand")
;; ("the" "thousand" "injuries")
;; ("thousand")
;; ("thousand" "injuries")
;; ("thousand" "injuries" "of")
;; ("injuries")
;; ("injuries" "of")
;; ("injuries" "of" "Fortunato")
;; ("injuries" "of" "fortunato")
;; ("of")
;; ("of" "Fortunato")
;; ("Fortunato"))
;; ("of" "fortunato")
;; ("fortunato"))
)
(defn make-trie
([] (tpt/->Trie
(fn update-fn [prev cur]
(if (nil? prev)
(sorted-map
:value (last cur)
:count 1)
(-> prev
(update :count (fnil inc 0))
(assoc :value (last cur)))))
(sorted-map)))
([& ks]
(reduce
(fn [t k]
(conj t k))
(make-trie)
ks)))
(defn add-terminal-value-to-ngram
"The Trie expects entries to be of the form '(k1 k2 k3 value).
The ngrams generated above are just '(k1 k2 k3).
@ -111,81 +117,10 @@
(comment
(let [ngrams (->> corpus
tokenize
(take 200)
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))
trie (apply make-trie ngrams)]
(tpt/as-map trie))
;; {:root
;; {:children
;; {","
;; {:children
;; {"I"
;; {:children {"vowed" {:count 1, :value ("," "I" "vowed")}},
;; :count 1,
;; :value ("," "I")},
;; "and"
;; {:children {"he" {:count 1, :value ("," "and" "he")}},
;; :count 1,
;; :value ("," "and")},
;; ,,,}}}}}}
)
;; TODO: Move this to ITrie?
(defn trie->seq-of-nodes
"Returns a seq of every terminal node. Useful for things like
doing aggregation calculations."
[trie]
(->> trie
tpt/as-vec
zip/vector-zip
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(filter map?)))
;; The tightly packed trie uses an encoding where integers are encoded with
;; variable lengths. To maximize memory efficiency, the most commonly used values
;; should have the smallest integer IDs. That way the values that most commonly appear
;; are encoded with the fewest bytes.
(defn seq-of-nodes->sorted-by-count
"Sorted first by the rank of the ngram, lowest ranks first.
Sorted second by the frequency of the ngram, highest frequencies first.
This is the order that you'd populate a mapping of keys to IDs."
[nodes]
(->> nodes
(map (comp first seq))
(map (fn [[k v]]
(vector (:value v) (:count v))))
;; root node and padded starts
(remove (comp nil? second))
(sort-by #(vector (count (first %))
(- (second %))))))
(comment
(let [ngrams (->> corpus
tokenize
(take 200)
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))
trie (apply make-trie ngrams)]
(->> trie
trie->seq-of-nodes
seq-of-nodes->sorted-by-count
(take 10)))
;; => ([(",") 11]
;; [(".") 9]
;; [("I") 8]
;; [("the") 6]
;; [("to") 6]
;; [("was") 5]
;; [("a") 4]
;; [("my") 4]
;; [("of") 4]
;; [("as") 3])
)
tokenize
(take 200)
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))]))
(defn trie->database
"It's convenient to work with a trie that has keys and values as
@ -194,10 +129,14 @@
we need every value to be an integer that we can variable-length-encode.
This creates a database for conveniently converting the human-readable
entries to ids and back from ids to human-readable entries."
entries to ids and back from ids to human-readable entries.
Ids will start at 1 so that 0 can be reserved for the root node."
[trie]
(let [sorted-keys (->> (trie->seq-of-nodes trie)
seq-of-nodes->sorted-by-count)]
(let [sorted-keys (->> (seq trie)
(sort-by (fn [[k v]]
(:count v)))
(reverse))]
(loop [sorted-keys sorted-keys
database {}
i 1]
@ -207,12 +146,12 @@
(rest sorted-keys)
(-> database
(assoc (first (first sorted-keys))
{:count (second (first sorted-keys))
:id i})
(assoc (second (first sorted-keys)) :id i))
(assoc i (first (first sorted-keys))))
(inc i))))))
(comment
(take 10 (trie->database trie))
(let [ngrams (->> corpus
tokenize
(take 200)
@ -220,56 +159,157 @@
(map add-terminal-value-to-ngram))
trie (apply make-trie ngrams)]
(trie->database trie))
;; {("at") {:count 1, :id 39},
;; 453 ("revenge" "." "You"),
;; ("The") {:count 1, :id 37},
;; ("resolved" ",") {:count 1, :id 256},
;; 487 ("very" "definitiveness" "with"),
;; ("be" "respected") {:count 1, :id 170},
;; ("a" "point") {:count 1, :id 158},
;; 357 ("and" "he" "did"),
;; 275 ("the" "very"),
;; ("doubt" "my" "good") {:count 1, :id 381},
;; ,,,}
;; {("at") {:count 1, :id 39},
;; 453 ("revenge" "." "You"),
;; ("The") {:count 1, :id 37},
;; ("resolved" ",") {:count 1, :id 256},
;; 487 ("very" "definitiveness" "with"),
;; ("be" "respected") {:count 1, :id 170},
;; ("a" "point") {:count 1, :id 158},
;; 357 ("and" "he" "did"),
;; 275 ("the" "very"),
;; ("doubt" "my" "good") {:count 1, :id 381},
;; ,,,}
)
(seq {"and" {:count 1 :value '("foo")}});; => (["and" {:count 1, :value ("foo")}])
(defn transform-trie->ids
"Once we have a database to convert from string-keys to integer-ids and back,
we can traverse the trie using its `transform` zipper and `zip/edit` each
node replacing the string-keys with their integer-ids."
we can traverse the trie replacing the string-keys with their integer-ids."
[trie database]
(let [transform-p #(map? (zip/node %))
transform-f
(fn tf [loc]
(zip/edit
loc
(fn [node]
;; {"And {:count 1, :value (! " "And)}} ;; <- Node
(let [[k v] (first (seq node))]
{(get-in database [(list k) :id] (if (= k :root) :root))
(assoc v :value (get-in database [(:value v) :id] 0))}))))]
(tpt/transform trie (tpt/visitor-filter transform-p transform-f))))
(->> trie
(map
(fn [[k v]]
[(vec (map #(get (get database [%]) :id) k))
{:id (get-in database [k :id])
:count (get-in database [k :count])}]))
(into (tr/make-trie))))
(def trie
(let [ngrams (->> corpus
tokenize
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))]
(apply make-trie ngrams)))
(map add-terminal-value-to-ngram)
(map (fn [entry]
(list (butlast entry)
(last entry)))))]
(->> ngrams
(reduce
(fn [acc [k v]]
(update
acc
k
(fnil
(fn [existing]
(update existing :count inc))
{:value v
:count 0})))
(tr/make-trie)))))
(comment
(take 10 (drop 1000 trie))
;; => ([["be" "awaiting"] {:value ("be" "awaiting"), :count 1}]
;; [["be" "cautious" "as"] {:value ("be" "cautious" "as"), :count 1}]
;; [["be" "gone"] {:value ("be" "gone"), :count 2}]
;; [["be" "ill" ","] {:value ("be" "ill" ","), :count 1}])
)
(def trie-database
(trie->database trie))
(comment
(take 4 trie-database)
;; => ([0 ["."]]
;; [["to" "your" "long"] {:value ("to" "your" "long"), :count 1, :id 1119}]
;; [["an" "instant" "he"] {:value ("an" "instant" "he"), :count 1, :id 4800}]
;; [["fifth" "," "the"] {:value ("fifth" "," "the"), :count 1, :id 3919}])
)
(def tpt-ready-trie (transform-trie->ids trie trie-database))
(comment
(take 4 tpt-ready-trie)
;; => ([[0 1 27] {:id 5082, :count 1}]
;; [[0 1 104] {:id 5072, :count 1}]
;; [[0 1 112] {:id 5075, :count 1}]
;; [[0 1 146] {:id 5077, :count 1}])
)
(defn value-encode-fn [v]
(if (= v ::tpt/root)
(encoding/encode 0)
(byte-array
(concat (encoding/encode (:id v))
(encoding/encode (:count v))))))
(defn value-decode-fn [byte-buffer]
(let [id (encoding/decode byte-buffer)]
(if (zero? id)
{:id :root}
{:id id
:count (encoding/decode byte-buffer)})))
(def tightly-packed-trie
(tpt/tightly-packed-trie tpt-ready-trie))
(tpt/tightly-packed-trie tpt-ready-trie value-encode-fn value-decode-fn))
;;;; DEMO
;;
;;;; ** Out of date since new TrieAgain code
(comment
;;;; Size comparisons
;;
;; Original trie, keys and values are lists and strings.
;; ~1,900 kb
(d/sum [trie])
;; 61112 objects
;; 103249 references between them
;; 1901656 bytes total in all objects
;; no cycles
;; 8421 leaf objects (no references to other objects)
;; Original trie, keys and values numbers
;; ~900 kb
(d/sum [tpt-ready-trie])
;; 30008 objects
;; 62543 references between them
;; 907992 bytes total in all objects
;; no cycles
;; 5438 leaf objects (no references to other objects)
;; Tightly-packed-trie, keys and values numbers (backed by var-len encoded ints)
;; ~36 kb
(d/sum [tightly-packed-trie])
;; 6 objects
;; 5 references between them
;; 36736 bytes total in all objects
;; no cycles
;; 4 leaf objects (no references to other objects)
;;;; Size comparison summary
;;
;; Condensed original: 900 kb
;; Tightly packed: 36 kb
;; Compression: ~96% !!!
;;;; Getting value from each type of trie
;;
(get trie ["<s>" "i" "was"])
;; => {:value ("<s>" "i" "was"), :count 1}
(get tpt-ready-trie [0 8 21])
;; => {:id 5116, :count 1}
(get tightly-packed-trie [0 8 21])
;; => {:id 5116, :count 1}
;; And then to get back to a string version, use the database.
(->> [0 8 21]
(get tightly-packed-trie)
:id
(get trie-database)
(get trie-database))
;; => {:value ("<s>" "i" "was"), :count 1, :id 5116}
;;;; Our "database" (just a hash-map) serves a dual purpose.
;;
;; It maps n-grams to their frequency counts and to an integer ID.
@ -286,33 +326,14 @@
;; [("the" "feeble") {:count 1, :id 2693}]
;; [("intermingling" "," "into") {:count 1, :id 4488}])
;;;; We can `get` the value of an n-gram from a Trie.
;; The value returned will be a Trie that has as its root node the
;; value at the n-gram. This gives you access to all of the descendants.
;;
;; Having access to the descendants is useful for something like
;; auto-complete. You can get in the trie the input to the completion, the prefix.
;; Then you can get the completions by simple seq-ing over the child nodes.
(tpt/as-map (get trie '("," "I")))
;; => {"I"
;; {:count 10,
;; :value ("," "I"),
;; :children
;; {"am" {:count 1, :value ("," "I" "am")},
;; "began" {:count 2, :value ("," "I" "began")},
;; ,,,
;; "well" {:count 1, :value ("," "I" "well")}}}}
;;;; Database
;; Each n-gram has its own integer ID. The integer IDs should be handed
;; out to n-grams in order of frequency. Therefore, you're 1-grams will probably
;; have lower IDs than the higher-order n-grams.
;;
;; Here we see "," is the 2nd most-common n-gram.
(get-in trie-database ['(",") :id]) ;; => 2
(get-in trie-database ['("I") :id]) ;; => 4
(get-in trie-database ['(",") :id]) ;; => 7
(get-in trie-database ['("i") :id]) ;; => 8
;; The ID of a 2-gram is not related in any way to
;; the two 1-grams that make it up. Every n-gram is unique
;; and gets its own unique ID.
@ -326,9 +347,9 @@
;; To re-iterate: The keys are all 1-grams at every nesting level.
;; The values are the higher-order n-grams the lower in the nesting
;; that you go.
(get-in trie-database ['("," "I") :id]) ;; => 911
(get-in trie-database ['("," "i") :id]) ;; => 23
;;;; Map-based Trie vs Tightly Packed Trie
;;;; Trie vs Tightly Packed Trie
;;;;
;; The interface is *almost* the same between the two.
;; Tightly packed tries can't be updated or written to.
@ -338,141 +359,48 @@
;;
;; Other than that though, let's see it in action!
;;
;;;; Here is the map-based trie.
(->> (tpt/as-map (get trie '("," "I")))
(#(get-in % ["I" :children]))
(map seq)
(map first))
;; => ("am" "began" "ceased" "had" "resumed" "soon" "suffered" "vowed" "well")
;;;; Here is the Trie
(get trie '("i"))
;; => {:value ("i"), :count 107}
(->> (tr/lookup trie '("i"))
(take 5))
;; => ([["," "i"] {:value ("i" "," "i"), :count 1}]
;; [[","] {:value ("i" ","), :count 1}]
;; [["again" "paused"] {:value ("i" "again" "paused"), :count 1}]
;; [["again"] {:value ("i" "again"), :count 1}]
;; [["am" "on"] {:value ("i" "am" "on"), :count 1}])
(->> (tr/lookup trie '("i"))
(tr/children)
(map #(get % []))
(take 5))
;; => ({:value ("i" ","), :count 1}
;; {:value ("i" "again"), :count 1}
;; {:value ("i" "am"), :count 1}
;; {:value ("i" "answered"), :count 1}
;; {:value ("i" "began"), :count 2})
;;;; And here is the tightly-packed trie
(->> (get tightly-packed-trie '(2 4))
tpt/children
(map tpt/value)
(map :value)
(map #(get trie-database %))
(map last)
sort)
;; => ("am" "began" "ceased" "had" "resumed" "soon" "suffered" "vowed" "well")
(->> (tr/lookup tightly-packed-trie '(8))
(tr/children)
(map #(get % []))
(take 5))
;; => ({:id 3392, :count 1}
;; {:id 3270, :count 1}
;; {:id 129, :count 5}
;; {:id 70, :count 9}
;; {:id 69, :count 9})
(->> (tr/lookup tightly-packed-trie '(8))
(tr/children)
(map #(get % []))
(take 5)
(map #(get trie-database (:id %))))
;; => (["i" ","] ["i" "to"] ["i" "was"] ["i" "had"] ["i" "said"])
;;;; Ta-da!
;;;; Let's check the size difference
;; Memory footprint comparison
;; 2.2mb -> 37kb.
;; 1.7% of its original Clojure map size!!!
(->> trie (.trie) vector d/sum)
;; 65485 objects
;; 109687 references between them
;; 2179088 bytes total in all objects
;; no cycles
;; 8413 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
(->> tightly-packed-trie (.byte-buffer) vector d/sum)
;; 2 objects
;; 1 references between them
;; 37680 bytes total in all objects
;; no cycles
;; 1 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
(let [trie-at-2 (get tightly-packed-trie '(2))
address (.address trie-at-2)
byte-buffer (.byte-buffer trie-at-2)
limit (.limit trie-at-2)]
(tpt/wrap-byte-buffer
byte-buffer
(.position byte-buffer address)
(.limit byte-buffer limit)
(println "Address of node at 2" address)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(println "Size of index at 2" (encoding/decode byte-buffer))
(println "position of first key in index" (.position byte-buffer))
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(.position byte-buffer (- address 1037)) ;; Position of '("," "the")
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer) ;; 11 size of index
;; 1462 position of buffer
;; max-address of index = 1473 (or 1472?)
(.position byte-buffer)
(.position byte-buffer 2618) ;; First mid of broken binary search
(tpt/rewind-to-key byte-buffer 2500)
(println (.position byte-buffer))
(println (bm/to-binary-string (.get byte-buffer (.position byte-buffer))))
(println (bm/to-binary-string (.get byte-buffer (dec (.position byte-buffer)))))
))
(let [trie-at-2 (get tightly-packed-trie '(2))
address (.address trie-at-2)
byte-buffer (.byte-buffer trie-at-2)
limit (.limit trie-at-2)]
(tpt/wrap-byte-buffer
byte-buffer
(.position byte-buffer address)
(.limit byte-buffer limit)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(.position byte-buffer (- address 1037))
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)))
;; I's offset, 986.
;; See below. ID at offset 986 is 2! Same as at above offset!
;; And the count is 4? The count coincidentally is the ID we expect?
(let [trie-at-2 (get tightly-packed-trie '(2))
address 986
byte-buffer (.byte-buffer trie-at-2)
limit (.limit trie-at-2)]
(tpt/wrap-byte-buffer
byte-buffer
(.position byte-buffer address)
(.limit byte-buffer limit)
(encoding/decode byte-buffer)))
(get trie-database 4)
(let [byte-buffer (.byte-buffer tightly-packed-trie)]
(.position byte-buffer)))
;; Memory footprint comparison
;; 2.2mb -> 32kb.
;; 1.5% of its original Clojure map size!
(comment
(->> trie (.trie) vector d/sum)
;; 65485 objects
;; 109687 references between them
;; 2179088 bytes total in all objects
;; no cycles
;; 8413 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
(->> tightly-packed-trie (.byte-buffer) vector d/sum)
;; 2 objects
;; 1 references between them
;; 32896 bytes total in all objects
;; no cycles
;; 1 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
)
(defn key-get-in-tpt [tpt db ks]
(let [id (map #(get-in db [(list %) :id]) ks)
v (get tpt id)]
@ -488,45 +416,54 @@
(key-get-in-tpt
tightly-packed-trie
trie-database
'("another"))
'("i" "will"))
;; => {(8 49) {:id 3257, :count 1}}
;; => {(2 2 3) {:value 3263, :count 462}}
(id-get-in-tpt
tightly-packed-trie
trie-database
'(2 2 3))
;; => {("<s>" "<s>" "the") {:value ("<s>" "<s>" "the"), :count 462}}
'(8 49))
;; => {("i" "will") {:id 3257, :count 1, :value ["i" "will"]}}
)
;;;; Markov-generating text from trie
(comment
;; database
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)]
(->> (trie->database trie)
(#(get % 3))))
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)]
(tpt/as-map (transform-trie->ids trie)))
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)
tightly-packed-trie (tpt/tightly-packed-trie
(transform-trie->ids trie))]
(get tightly-packed-trie '(2 2 3)))
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)]
(tpt/as-map trie))
(let [text (slurp (first (dark-corpus-file-seq 500 1)))]
(->> text
util/clean-text
(#(string/split % #"\n+"))))
(def example-story
(loop [generated-text [(:id (get trie-database ["<s>"]))]
i 0]
(if (> i 100)
generated-text
(recur
(conj
generated-text
(tpt/.key
(math/weighted-selection
#(:count (get % []))
(loop [i 3
children
(tr/children
(tr/lookup
tightly-packed-trie
(vec (take-last i generated-text))))]
(if (empty? children)
(recur (dec i)
(tr/children
(tr/lookup
tightly-packed-trie
(vec (take-last i generated-text)))))
children)))))
(inc i)))))
(->> example-story
(map #(get trie-database %))
(apply concat)
(remove #{"<s>" "</s>"})
(string/join " ")
(#(string/replace % #" ([\.,\?])" "$1"))
((fn [txt]
(string/replace txt #"(^|\. |\? )([a-z])" (fn [[a b c]]
(str b (.toUpperCase c)))))))
;; => "I broke and reached him a flagon of de grave. We came at length. He again took my arm, and holding the flambeaux over the wall; i replied, were a great and numerous family. Whither? to your long life. Putting on a tight-fitting parti-striped dress, and descending again, and had given them explicit orders not to be found, and this time i made bold to seize fortunato by an arm above the elbow. In its destined position."
)

@ -0,0 +1,326 @@
(ns com.owoga.tightly-packed-trie
(:require [com.owoga.trie :as trie]
[com.owoga.tightly-packed-trie.encoding :as encoding]
[clojure.java.io :as io]
[clojure.string :as string]
[com.owoga.tightly-packed-trie.bit-manip :as bm]
[clojure.zip :as zip])
(:import (java.io ByteArrayOutputStream ByteArrayInputStream
DataOutputStream DataInputStream)))
;; A trie data structure that can be converted to
;; a contiguous array of bytes while maintaining
;; efficient lookups.
;;
;; A regular Clojure hash-map can be used as a trie,
;; but there's a lot of memory overhead that comes along
;; with hash-maps.
;;
;; To conveniently get the benefit of packing a trie into a contiguous array of
;; bytes, there are a few restrictions.
;;
;; Everything must be numeric IDs. Keys must be numeric IDs.
;; Values must be numeric IDs. Everything must be a number.
;;
;; This lets us encode everything as variable-length-encoded byte arrays.
;;
;; To maximize efficiency, your most common keys should have the
;; smallest IDs.
(defmacro wrap-byte-buffer
"Saves the position and limit of a byte buffer, runs body,
returns byte buffer to original position and limit."
[byte-buffer & body]
`(let [original-position# (.position ~byte-buffer)
original-limit# (.limit ~byte-buffer)]
(try (do ~@body)
(finally
(.limit ~byte-buffer original-limit#)
(.position ~byte-buffer original-position#)))))
(defn trie->depth-first-post-order-traversable-zipperable-vector
([path node decode-value-fn]
(vec
(map
(fn [child]
[(trie->depth-first-post-order-traversable-zipperable-vector
(conj path (.key child))
child
decode-value-fn)
(wrap-byte-buffer
(.byte-buffer child)
(.limit (.byte-buffer child) (.limit child))
(.position (.byte-buffer child) (.address child))
(clojure.lang.MapEntry.
(conj path (.key child))
(decode-value-fn (.byte-buffer child))))])
(trie/children node)))))
(defn rewind-to-key [bb stop]
(loop []
(let [current (.get bb (.position bb))
previous (.get bb (dec (.position bb)))]
(if (or (= stop (.position bb))
(and (encoding/key-byte? current)
(encoding/offset-byte? previous)))
bb
(do (.position bb (dec (.position bb)))
(recur))))))
(defn forward-to-key [bb stop]
(loop []
(if (or (= stop (.position bb))
(and (encoding/key-byte? (.get bb (.position bb)))
(encoding/offset-byte?
(.get bb (inc (.position bb))))))
bb
(do (.position bb (inc (.position bb)))
(recur)))))
(defn find-key-in-index
[bb target-key max-address not-found]
(.limit bb max-address)
(let [key
(loop [previous-key nil
min-position (.position bb)
max-position max-address]
(if (zero? (- max-position min-position))
not-found
(let [mid-position (+ min-position (quot (- max-position min-position) 2))]
(.position bb mid-position)
(let [bb (rewind-to-key bb min-position)
current-key
(encoding/decode-number-from-tightly-packed-trie-index bb)]
(cond
(= current-key target-key)
(encoding/decode-number-from-tightly-packed-trie-index bb)
(= current-key previous-key)
(do
(encoding/decode-number-from-tightly-packed-trie-index bb)
(let [final-key (encoding/decode-number-from-tightly-packed-trie-index bb)]
(if (= target-key final-key)
(encoding/decode-number-from-tightly-packed-trie-index bb)
(throw (Exception. "Key not found.")))))
(< current-key target-key)
;; Chew the next decoded number. It's a useless offset.
(do
(encoding/decode-number-from-tightly-packed-trie-index bb)
(recur
current-key
(.position bb)
max-position))
(> current-key target-key)
;; This could also be rewound.
(do
(rewind-to-key bb min-position)
(recur
current-key
min-position
(.position bb))))))))]
(.limit bb (.capacity bb))
key))
(defn tightly-packed-trie-node-value
[byte-buffer]
(let [value (encoding/decode byte-buffer)
freq (encoding/decode byte-buffer)]
{:id value
:count freq}))
(defn -value [trie value-decode-fn]
(wrap-byte-buffer
(.byte-buffer trie)
(.limit (.byte-buffer trie) (.limit trie))
(.position (.byte-buffer trie) (.address trie))
(value-decode-fn (.byte-buffer trie))))
(deftype TightlyPackedTrie [byte-buffer key address limit value-decode-fn]
trie/ITrie
(lookup [self ks]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(if (empty? ks)
self
(let [val (value-decode-fn byte-buffer)
size-of-index (encoding/decode byte-buffer)
offset (find-key-in-index
byte-buffer
(first ks)
(+ (.position byte-buffer) size-of-index)
:not-found)]
(if (= offset :not-found)
nil
(let [child (TightlyPackedTrie.
byte-buffer
(first ks)
(- address offset)
(.capacity byte-buffer)
value-decode-fn)]
(trie/lookup child (rest ks))))))))
(children [self]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(let [val (value-decode-fn byte-buffer)
size-of-index (encoding/decode byte-buffer)]
(.limit byte-buffer (+ (.position byte-buffer)
size-of-index))
(loop [children []]
(if (= (.position byte-buffer) (.limit byte-buffer))
children
(let [child-key (encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
child-offset (encoding/decode-number-from-tightly-packed-trie-index byte-buffer)]
(recur
(conj
children
(TightlyPackedTrie.
byte-buffer
child-key
(- address child-offset)
(.capacity byte-buffer)
value-decode-fn)))))))))
clojure.lang.ILookup
(valAt [self ks]
(if-let [node (trie/lookup self ks)]
(-value node value-decode-fn)
nil))
(valAt [self ks not-found]
(or (get self ks) not-found))
clojure.lang.Seqable
(seq [trie]
(->> trie
(#(trie->depth-first-post-order-traversable-zipperable-vector
[]
%
value-decode-fn))
zip/vector-zip
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(filter (partial instance? clojure.lang.MapEntry))
(#(if (empty? %) nil %)))))
(defn tightly-packed-trie
[trie value-encode-fn value-decode-fn]
(let [baos (ByteArrayOutputStream.)]
(loop [nodes (seq trie)
current-offset 8
previous-depth 0
child-indexes []]
(let [current-node (first nodes)
current-depth (count (first current-node))]
(cond
(empty? nodes)
(let [child-index (first child-indexes)
child-index-baos (ByteArrayOutputStream.)
_ (->> child-index
(run!
(fn [[key offset]]
(.write
child-index-baos
(encoding/encode-key-to-tightly-packed-trie-index key))
(.write
child-index-baos
(encoding/encode-offset-to-tightly-packed-trie-index
(- current-offset offset))))))
child-index-byte-array (.toByteArray child-index-baos)
size-of-child-index (encoding/encode (count child-index-byte-array))
root-address current-offset
value (value-encode-fn 0)]
(.write baos value)
(.write baos size-of-child-index)
(.write baos child-index-byte-array)
(let [ba (.toByteArray baos)
byte-buf (java.nio.ByteBuffer/allocate (+ 8 (count ba)))]
(do (.putLong byte-buf root-address)
(.put byte-buf ba)
(.rewind byte-buf)
(->TightlyPackedTrie
byte-buf
0
(.getLong byte-buf)
(.capacity byte-buf)
value-decode-fn))))
;; Gone up from depth to a parent.
;; Process index of children.
(> previous-depth current-depth)
(do (let [[k v] (first nodes)
value (value-encode-fn v)
child-index (first child-indexes)
child-index-baos (ByteArrayOutputStream.)
_ (->> child-index
(run!
(fn [[key offset]]
(.write
child-index-baos
(encoding/encode-key-to-tightly-packed-trie-index key))
(.write
child-index-baos
(encoding/encode-offset-to-tightly-packed-trie-index
(- current-offset offset))))))
child-index-byte-array (.toByteArray child-index-baos)
size-of-child-index (encoding/encode (count child-index-byte-array))
current-index (second child-indexes)]
(.write baos value)
(.write baos size-of-child-index)
(.write baos child-index-byte-array)
(recur (rest nodes)
(+ current-offset
(count value)
(count size-of-child-index)
(count child-index-byte-array))
current-depth
(cons (conj current-index
[(last k)
current-offset])
(drop 2 child-indexes)))))
;; Down or even in depth to children
;; Start keeping track of new children index
:else
(do (let [[k v] (first nodes)
value (value-encode-fn v)
size-of-child-index (encoding/encode 0)
child-indexes (concat (repeat (- current-depth previous-depth) [])
child-indexes)
current-child-index (first child-indexes)]
(.write baos value)
(.write baos size-of-child-index)
(recur (rest nodes)
(+ current-offset
(count value)
(count size-of-child-index))
current-depth
(cons (conj current-child-index
[(last k)
current-offset])
(rest child-indexes))))))))))
;; TODO: Shared "save" interface for Trie?
(defn save-tightly-packed-trie-to-file
[filepath trie]
(with-open [o (io/output-stream filepath)]
(.write o (.array (.byte-buffer trie)))))
(defn load-tightly-packed-trie-from-file
[filepath value-decode-fn]
(with-open [i (io/input-stream filepath)
baos (ByteArrayOutputStream.)]
(io/copy i baos)
(let [byte-buffer (java.nio.ByteBuffer/wrap (.toByteArray baos))]
(.rewind byte-buffer)
(->TightlyPackedTrie
byte-buffer
0
(.getLong byte-buffer)
(.capacity byte-buffer)
value-decode-fn))))

@ -1,773 +0,0 @@
(ns com.owoga.tightly-packed-trie.core
(:require [clojure.zip :as zip]
[com.owoga.tightly-packed-trie.encoding :as encoding]
[com.owoga.tightly-packed-trie.bit-manip :as bm]
[clojure.java.io :as io]
[clojure.string :as string])
(:import (java.io ByteArrayOutputStream ByteArrayInputStream
DataOutputStream DataInputStream)))
;; A trie data structure that can be converted to
;; a contiguous array of bytes while maintaining
;; efficient lookups.
;;
;; A regular Clojure hash-map can be used as a trie,
;; but there's a lot of memory overhead that comes along
;; with hash-maps.
;;
;; To conveniently get the benefit of packing a trie into a contiguous array of
;; bytes, there are a few restrictions.
;;
;; Everything must be numeric IDs. Keys must be numeric IDs.
;; Values must be numeric IDs. Everything must be a number.
;;
;; This lets us encode everything as variable-length-encoded byte arrays.
;;
;; To maximize efficiency, your most common keys should have the
;; smallest IDs.
;; It's convenient to work with hash-map representations
;; while developing.
;;
;; {"T" {:children:
;; "I": {:value "IT"}
;; "A": {:value "AT"}}}
;;
;; If you want to work with a hash-map representation and
;; want this library to handle conversion of the trie,
;; then your hash-map version will need to follow a few conventions.
;;
;; For example, this code expects child nodes to be key/value pairs under
;; the :children key.
;; To pack a trie into a contiguous array of bytes
;; and still be able to find a key in a list of
;; child indexes in an efficient way, the child
;; indexes need to be sorted so they can be
;; binary-searched.
;;
;; When working with a hash-map-backed Trie,
;; it's convenient to use update-in to add
;; new values to the trie. But the default
;; update-in creates unsorted hash-maps when it encounters
;; a new key. This has the same functionality of update-in,
;; but new keys are given sorted-map values.
(defn update-in-sorted
"'Updates' a value in a nested associative structure, where ks is a
sequence of keys and f is a function that will take the old value
and any supplied args and return the new value, and returns a new
nested structure. If any levels do not exist, hash-maps will be
created."
{:added "1.0"
:static true}
([m ks f & args]
(let [up (fn up [m ks f args]
(let [m (or m (sorted-map))
[k & ks] ks]
(assert (instance? clojure.lang.PersistentTreeMap m)
(apply str
"A non-sorted hash-map in a sorted"
"hash-map will probably be the seed of some problems."))
(if ks
(assoc m k (up (get m k) ks f args))
(assoc m k (apply f (get m k) args)))))]
(up m ks f args))))
(defn children-of-map-trie-node
"Grab the children of a node.
A node is a map of {node-key {:children {,,,} ,,,}}.
This functions gives you the child nodes not as a single map
where all the child keys are part of the same map, but instead as a
a seq where each value is a single node.
This is a useful helper for turning a Trie from a map into a
depth-first post-order traversable zipper."
[node]
(let [[k {:keys [children]}] (first (seq node))]
(->> children
(map (partial apply hash-map)))))
(comment
(let [root-node {:root {:children {"T" {:some 'val} "U" {:other 'val}}}}]
(children-of-map-trie-node root-node))
;; => ({"T" {:some val}} {"U" {:other val}})
)
(defn without-children [node]
(let [[k v] (first (seq node))]
{k (dissoc v :children)}))
(defn map->depth-first-post-order-traversable-zipperable-vector
[node]
[(vec
(map
map->depth-first-post-order-traversable-zipperable-vector
(children-of-map-trie-node node)))
(without-children node)])
(comment
;; This comment demonstrates how we change the order in which
;; we traverse the map. If we don't turn each node into a vector
;; where the list of children are first, then the parent node would
;; get traversed befor the children. In the example below, the
;; "AT" node would be traversed befor the "SAT" and "TAT" nodes.
(let [m {:root
{:children
{"T"
{:children
{"A" {:children
{"T" {:value "TAT", :count 1}
"S" {:value "SAT" :count 1}}
:value "AT"
:count 1},
"U" {:children {"T" {:value "TUT", :count 1}}}}}}}}]
(let [z (zip/vector-zip
(map->depth-first-post-order-traversable-zipperable-vector m))]
(->> z
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(filter map?))))
;; => ({"T" {:value "TAT", :count 1}}
;; {"S" {:value "SAT", :count 1}}
;; {"A" {:value "AT", :count 1}}
;; {"T" {:value "TUT", :count 1}}
;; {"U" {}}
;; {"T" {}}
;; {:root {}})
)
(defn depth-first-post-order-traversable-zipperable-vector->map
"Parity reversal of the code above. Maps are easier to reason about
while developing. It's just inconvenient to traverse them in the
order needed by the algorithms we use to pack them into a contiguous
array of byte for the tightly packed trie."
[node]
(let [children (first node)