Redesign API, add tests, add example useage

- Add tests
- Refactor new version of Trie to be more coll-like
- Add an example use-case using a Markov chain to generate text
main
Eric Ihli 4 years ago
parent 56be9e9898
commit 77475e0c13

@ -8,115 +8,70 @@ For example, creating a hash-map trie of 1, 2, and 3-grams of short story by Edg
If you're dealing with much larger corpuses, the memory footprint could become an issue.
A tightly packed trie, on the other hand, is tiny. A tightly packed trie on the same corpus is only 37 kilobytes. That's 1.7% of the hash-map backed trie's size!
A tightly packed trie, on the other hand, is tiny. A tightly packed trie on the same corpus is only 37 kilobytes. That's ~4% of the original trie's size, even after the original trie's keys/values have all been condensed to numbers!
** How do you use library?
A hash-map-backed trie is created by passing a variable number of "trie entries" to ~trie~.
A trie is created similar to a hash-map by passing a variable number of "trie entries" to ~trie~.
A "trie entry" is a list of keys with the last element of the list being the "value" of the node.
A "trie entry" is basically the same thing as a map entry. It's just a key and a value.
But for a Trie, the key *must* be seqable and for a tightly-packed trie all keys *must* be comparable.
#+begin_src clojure :results none :session usage-example
(require '[com.owoga.tightly-packed-trie.core :as tpt])
;; [trie-entry '(k1 k2 k3 val)
(def trie-entries
(let [trie-entry-1 '("D" "O" "G" "DOG")
trie-entry-2 '("D" "O" "T" "DOT")
trie-entry-3 '("D" "O" "DO")
trie-entry-4 '("D" "A" "Y" "DAY")]
[trie-entry-1
trie-entry-2
trie-entry-3
trie-entry-4]))
;; (trie trie-entry-1 trie-entry-2 trie-entry-n ,,,)
(def hash-map-backed-trie
(apply tpt/trie trie-entries))
(require '[com.owoga.trie :as trie])
(def loosely-packed-trie (trie/make-trie "dog" :dog "dot" :dot "do" :do "day" :day))
loosely-packed-trie
;; => {[\d \a \y] :day, [\d \o \g] :dog, [\d \o \t] :dot, [\d \o] :do}
#+end_src
Once the trie is created, you can get a seq of all of the descendants of below a certain key
by using ~get~.
You'll see from the output of that last line above that the default REPL representation of a Trie
is a flat hash-map-looking-thing. It's actually a sorted-hash-map-looking-thing, because if you seq
over it, you'll get the trie-entries in depth-first post-order traversal.
#+begin_src clojure :results none :session usage-example
;; All of the nodes that are descendants of '("D" "O")
(seq (get hash-map-backed-trie '("D" "O")))
;; => ({"G" {:value "DOG", :count 1}}
;; {"T" {:value "DOT", :count 1}}
;; {"O" {:count 1, :value "DO"}})
;; All of the nodes that are descendants of '("D" "A")
(seq (get hash-map-backed-trie '("D" "A")))
;; => ({"Y" {:value "DAY", :count 1}})
#+end_src
In some ways, a Trie behaves a lot like a map.
New nodes can be ~conj~ed into the trie.
`get` returns the value at the key.
#+begin_src clojure :results none :session usage-example
(let [new-trie (conj hash-map-backed-trie
'("D" "A" "D" "DAD"))]
(seq (get new-trie '("D" "A"))))
;; => ({"D" {:value "DAD", :count 1}} {"Y" {:value "DAY", :count 1}})
(get loosely-packed-trie "dog")
;; => :dog
(get loosely-packed-trie "do")
;; => :do
(get (assoc loosely-packed-trie "dove" {:value "dove" :count 10}) "dove")
;; => {:value "dove", :count 10}
#+end_src
The entire map can be viewed with ~as-map~.
But there's a couple cool Trie-specific functions.
There's also ~as-vec~ which returns the trie as a vector that can be
passed directly to ~clojure.zipper/vector-zip~.
`lookup` returns the *Trie* at the key. This way, you have access to all of the node's descendants.
#+begin_src clojure :results none :session usage-example
(tpt/as-map hash-map-backed-trie)
;; => {:root
;; {:children
;; {"D"
;; {:children
;; {"A" {:children {"Y" {:value "DAY", :count 1}}},
;; "O"
;; {:children {"G" {:value "DOG", :count 1}, "T" {:value "DOT", :count 1}},
;; :count 1,
;; :value "DO"}}}}}}
(trie/lookup loosely-packed-trie "do")
;; => {[\g] :dog, [\t] :dot}
#+end_src
~get~ returns a ~Trie~, so all of the ~ITrie~ protocol functions work on the value that is returned by ~get~.
`children` returns the direct children of a node.
#+begin_src clojure :results none :session usage-example
(tpt/as-map (get hash-map-backed-trie '("D" "O")))
;; => {"O"
;; {:children {"G" {:value "DOG", :count 1}, "T" {:value "DOT", :count 1}},
;; :count 1,
;; :value "DO"}}
(trie/children (trie/lookup loosely-packed-trie "do"))
;; => ({} {})
#+end_src
There's also a ~transform~ function in the ~ITrie~ protocol that iterates over each
loc in the zippered Trie and calls your given function on the loc.
This is useful, as the name suggests, for performing transformations.
That's odd... there's two things in there that look like empty maps.
#+begin_src clojure :results none :session usage-example
(require '[clojure.zip :as zip]
'[clojure.string :as string])
(let [lower-cased-keys-trie
(tpt/transform
hash-map-backed-trie
(fn [loc]
(if (map? (zip/node loc))
(zip/edit
loc
(fn [node]
(let [[k v] (first (seq (zip/node loc)))]
{(string/lower-case k) v})))
loc)))]
(seq lower-cased-keys-trie))
;; => ({"y" {:value "DAY", :count 1}}
;; {"g" {:value "DOG", :count 1}}
;; {"t" {:value "DOT", :count 1}}
;; {"o" {:count 1, :value "DO"}})
(map #(get % []) (trie/children (trie/lookup loosely-packed-trie "do")))
;; => (:dog :dot)
#+end_src
The REPL representation of a Trie only shows children key/values. The "root" node (not necessarily the "true" root node if you've travsersed down with `lookup`) doesn't print any data to REPL.
So if you're looking ata node with no children, you'll see `{}` in the REPL. But you can get the value of that node with `(get node [])`
* Tightly Packed Tries
The trie above is backed by a Clojure hash-map.
The trie above is backed by regular old Clojure data structures: hash-maps and vectors.
It's not very efficient. All of the strings, nested maps, pointers... it all adds up to a lot of wasted memory.
@ -127,133 +82,53 @@ One restriction though: all keys and values must be integers. To convert them fr
Here's a similar example to that above, but with values that we can tightly pack.
#+begin_src clojure :results none :session usage-example
(require '[com.owoga.tightly-packed-trie.core :as tpt])
;; [trie-entry '(path value)
(def trie-entries
(let [trie-entry-1 '(1 2 3 123)
trie-entry-2 '(1 2 1 121)
trie-entry-3 '(1 2 2 122)
trie-entry-4 '(1 3 1 131)]
[trie-entry-1
trie-entry-2
trie-entry-3
trie-entry-4]))
(def non-tightly-packed-trie
(apply tpt/trie trie-entries))
(tpt/as-map non-tightly-packed-trie)
;; => {:root
;; {:children
;; {1
;; {:children
;; {2
;; {:children
;; {1 {:value 121, :count 1},
;; 2 {:value 122, :count 1},
;; 3 {:value 123, :count 1}}},
;; 3 {:children {1 {:value 131, :count 1}}}}}}}}
#+end_src
There's a slightly mis-named function that creates a byte-array representation of each node.
~as-byte-array~ is named similarly to ~as-map~ and ~as-vec~. But it's mis-named because it doesn't
actually return a ~byte-array~ like the name suggests. I may fix that in the future.
Instead, it adds some keys to each value, ~byte-address~ and ~byte-array~.
(require '[com.owoga.tightly-packed-trie :as tpt]
'[com.owoga.tightly-packed-trie.encoding :as encoding])
The ~byte-address~ is the offset that this node is going to be at in the final contiguous byte-array that makes up the tightly packed trie.
(defn encode-fn [v]
(if (nil? v)
(encoding/encode 0)
(encoding/encode v)))
The ~byte-array~ is the byte-encoded value of the node's key, value, size of the node's children index, and encoded values for each child's key and byte-address-offset from the current node.
(defn decode-fn [byte-buffer]
(let [v (encoding/decode byte-buffer)]
v
(if (zero? v) nil v)))
The ~byte-addresses~ and ~byte-arrays~ are calculated assuming that the depth-first post-order traversal of the vector representation of the trie is the correct order that the nodes need to be written to the contiguous array of bytes that make up the final tightly-packed-trie.
(def loosely-packed-trie
(trie/make-trie '(1 2 3) 123 '(1 2 1) 121 '(1 2 2) 122 '(1 3 1) 131))
Part of that requirement means that the child nodes of each node need to be sorted!
Even though the Trie code looks like it's just backed by regular old hash-maps, it's actually backed by sorted-maps!
#+begin_src clojure :results none :session usage-example
(def non-tightly-packed-trie-with-raw-byte-info-added
(tpt/as-byte-array non-tightly-packed-trie))
(tpt/as-map non-tightly-packed-trie-with-raw-byte-info-added)
;; => {:root
;; {:byte-address 42,
;; :byte-array [-128, -128, -126, -127, 7],
;; :children
;; {1
;; {:byte-address 35,
;; :byte-array [-128, -128, -124, -126, 18, -125, 5],
;; :children
;; {2
;; {:byte-address 17,
;; :byte-array [-128, -128, -122, -127, 9, -126, 6, -125, 3],
;; :children
;; {1
;; {:value 121,
;; :count 1,
;; :byte-address 8,
;; :byte-array [-7, -127, -128],
;; :children {}},
;; 2
;; {:value 122,
;; :count 1,
;; :byte-address 11,
;; :byte-array [-6, -127, -128],
;; :children {}},
;; 3
;; {:value 123,
;; :count 1,
;; :byte-address 14,
;; :byte-array [-5, -127, -128],
;; :children {}}}},
;; 3
;; {:byte-address 30,
;; :byte-array [-128, -128, -126, -127, 4],
;; :children
;; {1
;; {:value 131,
;; :count 1,
;; :byte-address 26,
;; :byte-array [1, -125, -127, -128],
;; :children {}}}}}}}}}
(def tightly-packed-trie
(tpt/tightly-packed-trie
loosely-packed-trie
encode-fn
decode-fn))
(get tightly-packed-trie [1 2 3])
;; => 123
(map #(get % []) (trie/children (trie/lookup tightly-packed-trie [1 2])))
;; => (121 122 123)
(seq tightly-packed-trie)
;; => ([[1 2 1] 121]
;; [[1 2 2] 122]
;; [[1 2 3] 123]
;; [[1 2] nil]
;; [[1 3 1] 131]
;; [[1 3] nil]
;; [[1] nil])
#+end_src
Once the trie is transformed to have the byte-array info on each node, you can pass that
trie to ~tightly-packed-trie~ to get a MUCH more memory-efficient trie.
This trie is backed by a ByteBuffer rather than a hash-map.
Instead of a map with all of its pointers, we are storing all of the information
necessary for this trie in just 39 bytes!
#+begin_src clojure :results none :session usage-example
(def tightly-packed-trie
(tpt/tightly-packed-trie non-tightly-packed-trie-with-raw-byte-info-added))
(require '[cljol.dig9 :as d])
(.capacity (.byte-buffer tightly-packed-trie))
;; => 47
;;
;;;; Instead of a map with all of its pointers, we are storing
;;;; all of the information necessary for this trie in
;;;; just 47 bytes!
;;;; Hash-map-backed and Tightly-packed comparson
;; The apis are slightly different. But you have access to basically the same data.
;;;; Getting the value of a node in a hash-map-backed trie.
;;
(-> (get non-tightly-packed-trie '(1 2 3))
tpt/as-map
seq
first
second
(select-keys [:value :count]))
;; => {:value 123, :count 1}
;;;; Getting the value of a node in a tightly-packed trie.
;;
(tpt/value (get tightly-packed-trie '(1 2 3)))
;; => {:value 123, :count 1}
;; => 39
#+end_src
It's backed by a byte-buffer so saving to disk is trivial, but there's a helper for that.
@ -264,15 +139,17 @@ Here's the process of saving to and loading from disk. (Only works for tightly-p
(tpt/save-tightly-packed-trie-to-file "/tmp/tpt.bin" tightly-packed-trie)
(def saved-and-loaded-tpt
(tpt/load-tightly-packed-trie-from-file "/tmp/tpt.bin"))
(tpt/load-tightly-packed-trie-from-file "/tmp/tpt.bin" decode-fn))
(tpt/value (get saved-and-loaded-tpt '(1 2 3)))
;; => {:value 123, :count 1}
(get saved-and-loaded-tpt '(1 2 3))
;; => 123
#+end_src
* TODO Tests
* Credits
Ulrich Germann, Eric Joanis, and Samuel Larkin of the National Research Institute of Canada for the paper [[https://www.aclweb.org/anthology/W09-1505.pdf][Tightly Packed Tries: How to Fit Large Models into Memory,and Make them Load Fast, Too]].
https://practicalli.github.io/clojure/clojure-spec/
Lots of credit also goes to the Clojurians community.
* TODO Why would you want a trie data structure?
@ -289,13 +166,10 @@ A user types in the characters "D" "O" and you want to show all possible autocom
of words
- When you get to the next word that doesn't start with "D" "O", you have all the words you want to use for autocomplete.
#+begin_src clojure
#+begin_src clojure :results none
(def dictionary ["Apple" "Banana" "Carrot" "Do" "Dog" "Dot" "Dude" "Egg"])
#+end_src
#+RESULTS:
: #'markov-language-model/dictionary
**** Problems with a list.
It's slow if you have a big list. If you have a dictionary with hundreds of thousands of words
@ -306,7 +180,7 @@ If you're familiar with binary search over sorted lists, you'll know this is a c
*** Typical "Trie" in Clojure
#+begin_src clojure
#+begin_src clojure :results none
{"A" {:children {"P" {,,,} :value nil}}
"D" {:children {"O"
:children {"G" {:children {} :value "DOG"}
@ -315,9 +189,5 @@ If you're familiar with binary search over sorted lists, you'll know this is a c
:value nil}}
#+end_src
#+RESULTS:
: class java.lang.RuntimeException
**** How is a trie faster?
-

@ -1,16 +1,16 @@
(ns markov-language-model
(:require [clojure.string :as string]
[clojure.java.io :as io]
[com.owoga.tightly-packed-trie.core :as tpt]
[com.owoga.trie.math :as math]
[com.owoga.tightly-packed-trie :as tpt]
[com.owoga.tightly-packed-trie.encoding :as encoding]
[com.owoga.trie :as tr]
[cljol.dig9 :as d]
[clojure.zip :as zip]
[com.owoga.tightly-packed-trie.bit-manip :as bm]))
(def corpus (slurp (io/resource "cask_of_amontillado.txt")))
;; For better generation of text, you'll probably want to pad the starts
;; of sentences with n-1 "start-of-sentence" tokens.
(defn prep-punctuation-for-tokenization
"Puts spaces around punctuation so that they aren't
tokenized with the words they are attached to.
@ -20,6 +20,17 @@
[text]
(string/replace text #"([\.,!?])" " $1 "))
;; For better generation of text, you'll probably want to pad the starts
;; of sentences with n-1 "start-of-sentence" tokens.
(defn add-bol-and-eol-tokens [text]
(-> text
(string/replace #"(\.)" "</s> . <s>")
(#(str "<s> " %))))
(defn remove-quotes
[text]
(string/replace text #"\"" ""))
(defn remove-formatting-characters
"Input has underscores, presumably because the text
might be rendered by something that can italicize or bold text.
@ -31,6 +42,9 @@
(-> text
remove-formatting-characters
prep-punctuation-for-tokenization
remove-quotes
add-bol-and-eol-tokens
string/lower-case
(string/split #"[\n ]+")))
(defn interleave-all
@ -59,7 +73,16 @@
p1 (partition 1 1 tokens)
p2 (partition 2 1 tokens)
p3 (partition 3 1 tokens)]
(interleave-all p1 p2 p3)))
[p1
p2
p3
(interleave-all p1 p2 p3)])
;; => [((1) (2) (3) (4) (5))
;; ((1 2) (2 3) (3 4) (4 5))
;; ((1 2 3) (2 3 4) (3 4 5))
;; ((1) (1 2) (1 2 3) (2) (2 3) (2 3 4) (3) (3 4) (3 4 5) (4) (4 5) (5))]
)
(defn ngramify-tokens [n m tokens]
(let [partition-colls (map #(partition % 1 tokens) (range n m))
@ -69,38 +92,21 @@
(comment
(->> (tokenize corpus)
(take 5)
(ngramify-tokens 1 4)) ;; => (("The")
;; ("The" "thousand")
;; ("The" "thousand" "injuries")
(ngramify-tokens 1 4))
;; => (("the")
;; ("the" "thousand")
;; ("the" "thousand" "injuries")
;; ("thousand")
;; ("thousand" "injuries")
;; ("thousand" "injuries" "of")
;; ("injuries")
;; ("injuries" "of")
;; ("injuries" "of" "Fortunato")
;; ("injuries" "of" "fortunato")
;; ("of")
;; ("of" "Fortunato")
;; ("Fortunato"))
;; ("of" "fortunato")
;; ("fortunato"))
)
(defn make-trie
([] (tpt/->Trie
(fn update-fn [prev cur]
(if (nil? prev)
(sorted-map
:value (last cur)
:count 1)
(-> prev
(update :count (fnil inc 0))
(assoc :value (last cur)))))
(sorted-map)))
([& ks]
(reduce
(fn [t k]
(conj t k))
(make-trie)
ks)))
(defn add-terminal-value-to-ngram
"The Trie expects entries to be of the form '(k1 k2 k3 value).
The ngrams generated above are just '(k1 k2 k3).
@ -111,81 +117,10 @@
(comment
(let [ngrams (->> corpus
tokenize
(take 200)
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))
trie (apply make-trie ngrams)]
(tpt/as-map trie))
;; {:root
;; {:children
;; {","
;; {:children
;; {"I"
;; {:children {"vowed" {:count 1, :value ("," "I" "vowed")}},
;; :count 1,
;; :value ("," "I")},
;; "and"
;; {:children {"he" {:count 1, :value ("," "and" "he")}},
;; :count 1,
;; :value ("," "and")},
;; ,,,}}}}}}
)
;; TODO: Move this to ITrie?
(defn trie->seq-of-nodes
"Returns a seq of every terminal node. Useful for things like
doing aggregation calculations."
[trie]
(->> trie
tpt/as-vec
zip/vector-zip
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(filter map?)))
;; The tightly packed trie uses an encoding where integers are encoded with
;; variable lengths. To maximize memory efficiency, the most commonly used values
;; should have the smallest integer IDs. That way the values that most commonly appear
;; are encoded with the fewest bytes.
(defn seq-of-nodes->sorted-by-count
"Sorted first by the rank of the ngram, lowest ranks first.
Sorted second by the frequency of the ngram, highest frequencies first.
This is the order that you'd populate a mapping of keys to IDs."
[nodes]
(->> nodes
(map (comp first seq))
(map (fn [[k v]]
(vector (:value v) (:count v))))
;; root node and padded starts
(remove (comp nil? second))
(sort-by #(vector (count (first %))
(- (second %))))))
(comment
(let [ngrams (->> corpus
tokenize
(take 200)
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))
trie (apply make-trie ngrams)]
(->> trie
trie->seq-of-nodes
seq-of-nodes->sorted-by-count
(take 10)))
;; => ([(",") 11]
;; [(".") 9]
;; [("I") 8]
;; [("the") 6]
;; [("to") 6]
;; [("was") 5]
;; [("a") 4]
;; [("my") 4]
;; [("of") 4]
;; [("as") 3])
)
tokenize
(take 200)
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))]))
(defn trie->database
"It's convenient to work with a trie that has keys and values as
@ -194,10 +129,14 @@
we need every value to be an integer that we can variable-length-encode.
This creates a database for conveniently converting the human-readable
entries to ids and back from ids to human-readable entries."
entries to ids and back from ids to human-readable entries.
Ids will start at 1 so that 0 can be reserved for the root node."
[trie]
(let [sorted-keys (->> (trie->seq-of-nodes trie)
seq-of-nodes->sorted-by-count)]
(let [sorted-keys (->> (seq trie)
(sort-by (fn [[k v]]
(:count v)))
(reverse))]
(loop [sorted-keys sorted-keys
database {}
i 1]
@ -207,12 +146,12 @@
(rest sorted-keys)
(-> database
(assoc (first (first sorted-keys))
{:count (second (first sorted-keys))
:id i})
(assoc (second (first sorted-keys)) :id i))
(assoc i (first (first sorted-keys))))
(inc i))))))
(comment
(take 10 (trie->database trie))
(let [ngrams (->> corpus
tokenize
(take 200)
@ -220,56 +159,157 @@
(map add-terminal-value-to-ngram))
trie (apply make-trie ngrams)]
(trie->database trie))
;; {("at") {:count 1, :id 39},
;; 453 ("revenge" "." "You"),
;; ("The") {:count 1, :id 37},
;; ("resolved" ",") {:count 1, :id 256},
;; 487 ("very" "definitiveness" "with"),
;; ("be" "respected") {:count 1, :id 170},
;; ("a" "point") {:count 1, :id 158},
;; 357 ("and" "he" "did"),
;; 275 ("the" "very"),
;; ("doubt" "my" "good") {:count 1, :id 381},
;; ,,,}
;; {("at") {:count 1, :id 39},
;; 453 ("revenge" "." "You"),
;; ("The") {:count 1, :id 37},
;; ("resolved" ",") {:count 1, :id 256},
;; 487 ("very" "definitiveness" "with"),
;; ("be" "respected") {:count 1, :id 170},
;; ("a" "point") {:count 1, :id 158},
;; 357 ("and" "he" "did"),
;; 275 ("the" "very"),
;; ("doubt" "my" "good") {:count 1, :id 381},
;; ,,,}
)
(seq {"and" {:count 1 :value '("foo")}});; => (["and" {:count 1, :value ("foo")}])
(defn transform-trie->ids
"Once we have a database to convert from string-keys to integer-ids and back,
we can traverse the trie using its `transform` zipper and `zip/edit` each
node replacing the string-keys with their integer-ids."
we can traverse the trie replacing the string-keys with their integer-ids."
[trie database]
(let [transform-p #(map? (zip/node %))
transform-f
(fn tf [loc]
(zip/edit
loc
(fn [node]
;; {"And {:count 1, :value (! " "And)}} ;; <- Node
(let [[k v] (first (seq node))]
{(get-in database [(list k) :id] (if (= k :root) :root))
(assoc v :value (get-in database [(:value v) :id] 0))}))))]
(tpt/transform trie (tpt/visitor-filter transform-p transform-f))))
(->> trie
(map
(fn [[k v]]
[(vec (map #(get (get database [%]) :id) k))
{:id (get-in database [k :id])
:count (get-in database [k :count])}]))
(into (tr/make-trie))))
(def trie
(let [ngrams (->> corpus
tokenize
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))]
(apply make-trie ngrams)))
(map add-terminal-value-to-ngram)
(map (fn [entry]
(list (butlast entry)
(last entry)))))]
(->> ngrams
(reduce
(fn [acc [k v]]
(update
acc
k
(fnil
(fn [existing]
(update existing :count inc))
{:value v
:count 0})))
(tr/make-trie)))))
(comment
(take 10 (drop 1000 trie))
;; => ([["be" "awaiting"] {:value ("be" "awaiting"), :count 1}]
;; [["be" "cautious" "as"] {:value ("be" "cautious" "as"), :count 1}]
;; [["be" "gone"] {:value ("be" "gone"), :count 2}]
;; [["be" "ill" ","] {:value ("be" "ill" ","), :count 1}])
)
(def trie-database
(trie->database trie))
(comment
(take 4 trie-database)
;; => ([0 ["."]]
;; [["to" "your" "long"] {:value ("to" "your" "long"), :count 1, :id 1119}]
;; [["an" "instant" "he"] {:value ("an" "instant" "he"), :count 1, :id 4800}]
;; [["fifth" "," "the"] {:value ("fifth" "," "the"), :count 1, :id 3919}])
)
(def tpt-ready-trie (transform-trie->ids trie trie-database))
(comment
(take 4 tpt-ready-trie)
;; => ([[0 1 27] {:id 5082, :count 1}]
;; [[0 1 104] {:id 5072, :count 1}]
;; [[0 1 112] {:id 5075, :count 1}]
;; [[0 1 146] {:id 5077, :count 1}])
)
(defn value-encode-fn [v]
(if (= v ::tpt/root)
(encoding/encode 0)
(byte-array
(concat (encoding/encode (:id v))
(encoding/encode (:count v))))))
(defn value-decode-fn [byte-buffer]
(let [id (encoding/decode byte-buffer)]
(if (zero? id)
{:id :root}
{:id id
:count (encoding/decode byte-buffer)})))
(def tightly-packed-trie
(tpt/tightly-packed-trie tpt-ready-trie))
(tpt/tightly-packed-trie tpt-ready-trie value-encode-fn value-decode-fn))
;;;; DEMO
;;
;;;; ** Out of date since new TrieAgain code
(comment
;;;; Size comparisons
;;
;; Original trie, keys and values are lists and strings.
;; ~1,900 kb
(d/sum [trie])
;; 61112 objects
;; 103249 references between them
;; 1901656 bytes total in all objects
;; no cycles
;; 8421 leaf objects (no references to other objects)
;; Original trie, keys and values numbers
;; ~900 kb
(d/sum [tpt-ready-trie])
;; 30008 objects
;; 62543 references between them
;; 907992 bytes total in all objects
;; no cycles
;; 5438 leaf objects (no references to other objects)
;; Tightly-packed-trie, keys and values numbers (backed by var-len encoded ints)
;; ~36 kb
(d/sum [tightly-packed-trie])
;; 6 objects
;; 5 references between them
;; 36736 bytes total in all objects
;; no cycles
;; 4 leaf objects (no references to other objects)
;;;; Size comparison summary
;;
;; Condensed original: 900 kb
;; Tightly packed: 36 kb
;; Compression: ~96% !!!
;;;; Getting value from each type of trie
;;
(get trie ["<s>" "i" "was"])
;; => {:value ("<s>" "i" "was"), :count 1}
(get tpt-ready-trie [0 8 21])
;; => {:id 5116, :count 1}
(get tightly-packed-trie [0 8 21])
;; => {:id 5116, :count 1}
;; And then to get back to a string version, use the database.
(->> [0 8 21]
(get tightly-packed-trie)
:id
(get trie-database)
(get trie-database))
;; => {:value ("<s>" "i" "was"), :count 1, :id 5116}
;;;; Our "database" (just a hash-map) serves a dual purpose.
;;
;; It maps n-grams to their frequency counts and to an integer ID.
@ -286,33 +326,14 @@
;; [("the" "feeble") {:count 1, :id 2693}]
;; [("intermingling" "," "into") {:count 1, :id 4488}])
;;;; We can `get` the value of an n-gram from a Trie.
;; The value returned will be a Trie that has as its root node the
;; value at the n-gram. This gives you access to all of the descendants.
;;
;; Having access to the descendants is useful for something like
;; auto-complete. You can get in the trie the input to the completion, the prefix.
;; Then you can get the completions by simple seq-ing over the child nodes.
(tpt/as-map (get trie '("," "I")))
;; => {"I"
;; {:count 10,
;; :value ("," "I"),
;; :children
;; {"am" {:count 1, :value ("," "I" "am")},
;; "began" {:count 2, :value ("," "I" "began")},
;; ,,,
;; "well" {:count 1, :value ("," "I" "well")}}}}
;;;; Database
;; Each n-gram has its own integer ID. The integer IDs should be handed
;; out to n-grams in order of frequency. Therefore, you're 1-grams will probably
;; have lower IDs than the higher-order n-grams.
;;
;; Here we see "," is the 2nd most-common n-gram.
(get-in trie-database ['(",") :id]) ;; => 2
(get-in trie-database ['("I") :id]) ;; => 4
(get-in trie-database ['(",") :id]) ;; => 7
(get-in trie-database ['("i") :id]) ;; => 8
;; The ID of a 2-gram is not related in any way to
;; the two 1-grams that make it up. Every n-gram is unique
;; and gets its own unique ID.
@ -326,9 +347,9 @@
;; To re-iterate: The keys are all 1-grams at every nesting level.
;; The values are the higher-order n-grams the lower in the nesting
;; that you go.
(get-in trie-database ['("," "I") :id]) ;; => 911
(get-in trie-database ['("," "i") :id]) ;; => 23
;;;; Map-based Trie vs Tightly Packed Trie
;;;; Trie vs Tightly Packed Trie
;;;;
;; The interface is *almost* the same between the two.
;; Tightly packed tries can't be updated or written to.
@ -338,141 +359,48 @@
;;
;; Other than that though, let's see it in action!
;;
;;;; Here is the map-based trie.
(->> (tpt/as-map (get trie '("," "I")))
(#(get-in % ["I" :children]))
(map seq)
(map first))
;; => ("am" "began" "ceased" "had" "resumed" "soon" "suffered" "vowed" "well")
;;;; Here is the Trie
(get trie '("i"))
;; => {:value ("i"), :count 107}
(->> (tr/lookup trie '("i"))
(take 5))
;; => ([["," "i"] {:value ("i" "," "i"), :count 1}]
;; [[","] {:value ("i" ","), :count 1}]
;; [["again" "paused"] {:value ("i" "again" "paused"), :count 1}]
;; [["again"] {:value ("i" "again"), :count 1}]
;; [["am" "on"] {:value ("i" "am" "on"), :count 1}])
(->> (tr/lookup trie '("i"))
(tr/children)
(map #(get % []))
(take 5))
;; => ({:value ("i" ","), :count 1}
;; {:value ("i" "again"), :count 1}
;; {:value ("i" "am"), :count 1}
;; {:value ("i" "answered"), :count 1}
;; {:value ("i" "began"), :count 2})
;;;; And here is the tightly-packed trie
(->> (get tightly-packed-trie '(2 4))
tpt/children
(map tpt/value)
(map :value)
(map #(get trie-database %))
(map last)
sort)
;; => ("am" "began" "ceased" "had" "resumed" "soon" "suffered" "vowed" "well")
(->> (tr/lookup tightly-packed-trie '(8))
(tr/children)
(map #(get % []))
(take 5))
;; => ({:id 3392, :count 1}
;; {:id 3270, :count 1}
;; {:id 129, :count 5}
;; {:id 70, :count 9}
;; {:id 69, :count 9})
(->> (tr/lookup tightly-packed-trie '(8))
(tr/children)
(map #(get % []))
(take 5)
(map #(get trie-database (:id %))))
;; => (["i" ","] ["i" "to"] ["i" "was"] ["i" "had"] ["i" "said"])
;;;; Ta-da!
;;;; Let's check the size difference
;; Memory footprint comparison
;; 2.2mb -> 37kb.
;; 1.7% of its original Clojure map size!!!
(->> trie (.trie) vector d/sum)
;; 65485 objects
;; 109687 references between them
;; 2179088 bytes total in all objects
;; no cycles
;; 8413 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
(->> tightly-packed-trie (.byte-buffer) vector d/sum)
;; 2 objects
;; 1 references between them
;; 37680 bytes total in all objects
;; no cycles
;; 1 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
(let [trie-at-2 (get tightly-packed-trie '(2))
address (.address trie-at-2)
byte-buffer (.byte-buffer trie-at-2)
limit (.limit trie-at-2)]
(tpt/wrap-byte-buffer
byte-buffer
(.position byte-buffer address)
(.limit byte-buffer limit)
(println "Address of node at 2" address)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(println "Size of index at 2" (encoding/decode byte-buffer))
(println "position of first key in index" (.position byte-buffer))
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(.position byte-buffer (- address 1037)) ;; Position of '("," "the")
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer) ;; 11 size of index
;; 1462 position of buffer
;; max-address of index = 1473 (or 1472?)
(.position byte-buffer)
(.position byte-buffer 2618) ;; First mid of broken binary search
(tpt/rewind-to-key byte-buffer 2500)
(println (.position byte-buffer))
(println (bm/to-binary-string (.get byte-buffer (.position byte-buffer))))
(println (bm/to-binary-string (.get byte-buffer (dec (.position byte-buffer)))))
))
(let [trie-at-2 (get tightly-packed-trie '(2))
address (.address trie-at-2)
byte-buffer (.byte-buffer trie-at-2)
limit (.limit trie-at-2)]
(tpt/wrap-byte-buffer
byte-buffer
(.position byte-buffer address)
(.limit byte-buffer limit)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(.position byte-buffer (- address 1037))
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)))
;; I's offset, 986.
;; See below. ID at offset 986 is 2! Same as at above offset!
;; And the count is 4? The count coincidentally is the ID we expect?
(let [trie-at-2 (get tightly-packed-trie '(2))
address 986
byte-buffer (.byte-buffer trie-at-2)
limit (.limit trie-at-2)]
(tpt/wrap-byte-buffer
byte-buffer
(.position byte-buffer address)
(.limit byte-buffer limit)
(encoding/decode byte-buffer)))
(get trie-database 4)
(let [byte-buffer (.byte-buffer tightly-packed-trie)]
(.position byte-buffer)))
;; Memory footprint comparison
;; 2.2mb -> 32kb.
;; 1.5% of its original Clojure map size!
(comment
(->> trie (.trie) vector d/sum)
;; 65485 objects
;; 109687 references between them
;; 2179088 bytes total in all objects
;; no cycles
;; 8413 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
(->> tightly-packed-trie (.byte-buffer) vector d/sum)
;; 2 objects
;; 1 references between them
;; 32896 bytes total in all objects
;; no cycles
;; 1 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
)
(defn key-get-in-tpt [tpt db ks]
(let [id (map #(get-in db [(list %) :id]) ks)
v (get tpt id)]
@ -488,45 +416,54 @@
(key-get-in-tpt
tightly-packed-trie
trie-database
'("another"))
'("i" "will"))
;; => {(8 49) {:id 3257, :count 1}}
;; => {(2 2 3) {:value 3263, :count 462}}
(id-get-in-tpt
tightly-packed-trie
trie-database
'(2 2 3))
;; => {("<s>" "<s>" "the") {:value ("<s>" "<s>" "the"), :count 462}}
'(8 49))
;; => {("i" "will") {:id 3257, :count 1, :value ["i" "will"]}}
)
;;;; Markov-generating text from trie
(comment
;; database
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)]
(->> (trie->database trie)
(#(get % 3))))
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)]
(tpt/as-map (transform-trie->ids trie)))
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)
tightly-packed-trie (tpt/tightly-packed-trie
(transform-trie->ids trie))]
(get tightly-packed-trie '(2 2 3)))
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)]
(tpt/as-map trie))
(let [text (slurp (first (dark-corpus-file-seq 500 1)))]
(->> text
util/clean-text
(#(string/split % #"\n+"))))
(def example-story
(loop [generated-text [(:id (get trie-database ["<s>"]))]
i 0]
(if (> i 100)
generated-text
(recur
(conj
generated-text
(tpt/.key
(math/weighted-selection
#(:count (get % []))
(loop [i 3
children
(tr/children
(tr/lookup
tightly-packed-trie
(vec (take-last i generated-text))))]
(if (empty? children)
(recur (dec i)
(tr/children
(tr/lookup
tightly-packed-trie
(vec (take-last i generated-text)))))
children)))))
(inc i)))))
(->> example-story
(map #(get trie-database %))
(apply concat)
(remove #{"<s>" "</s>"})
(string/join " ")
(#(string/replace % #" ([\.,\?])" "$1"))
((fn [txt]
(string/replace txt #"(^|\. |\? )([a-z])" (fn [[a b c]]
(str b (.toUpperCase c)))))))
;; => "I broke and reached him a flagon of de grave. We came at length. He again took my arm, and holding the flambeaux over the wall; i replied, were a great and numerous family. Whither? to your long life. Putting on a tight-fitting parti-striped dress, and descending again, and had given them explicit orders not to be found, and this time i made bold to seize fortunato by an arm above the elbow. In its destined position."
)

@ -0,0 +1,326 @@
(ns com.owoga.tightly-packed-trie
(:require [com.owoga.trie :as trie]
[com.owoga.tightly-packed-trie.encoding :as encoding]
[clojure.java.io :as io]
[clojure.string :as string]
[com.owoga.tightly-packed-trie.bit-manip :as bm]
[clojure.zip :as zip])
(:import (java.io ByteArrayOutputStream ByteArrayInputStream
DataOutputStream DataInputStream)))
;; A trie data structure that can be converted to
;; a contiguous array of bytes while maintaining
;; efficient lookups.
;;
;; A regular Clojure hash-map can be used as a trie,
;; but there's a lot of memory overhead that comes along
;; with hash-maps.
;;
;; To conveniently get the benefit of packing a trie into a contiguous array of
;; bytes, there are a few restrictions.
;;
;; Everything must be numeric IDs. Keys must be numeric IDs.
;; Values must be numeric IDs. Everything must be a number.
;;
;; This lets us encode everything as variable-length-encoded byte arrays.
;;
;; To maximize efficiency, your most common keys should have the
;; smallest IDs.
(defmacro wrap-byte-buffer
"Saves the position and limit of a byte buffer, runs body,
returns byte buffer to original position and limit."
[byte-buffer & body]
`(let [original-position# (.position ~byte-buffer)
original-limit# (.limit ~byte-buffer)]
(try (do ~@body)
(finally
(.limit ~byte-buffer original-limit#)
(.position ~byte-buffer original-position#)))))
(defn trie->depth-first-post-order-traversable-zipperable-vector
([path node decode-value-fn]
(vec
(map
(fn [child]
[(trie->depth-first-post-order-traversable-zipperable-vector
(conj path (.key child))
child
decode-value-fn)
(wrap-byte-buffer
(.byte-buffer child)
(.limit (.byte-buffer child) (.limit child))
(.position (.byte-buffer child) (.address child))
(clojure.lang.MapEntry.
(conj path (.key child))
(decode-value-fn (.byte-buffer child))))])
(trie/children node)))))
(defn rewind-to-key [bb stop]
(loop []
(let [current (.get bb (.position bb))
previous (.get bb (dec (.position bb)))]
(if (or (= stop (.position bb))
(and (encoding/key-byte? current)
(encoding/offset-byte? previous)))
bb
(do (.position bb (dec (.position bb)))
(recur))))))
(defn forward-to-key [bb stop]
(loop []
(if (or (= stop (.position bb))
(and (encoding/key-byte? (.get bb (.position bb)))
(encoding/offset-byte?
(.get bb (inc (.position bb))))))
bb
(do (.position bb (inc (.position bb)))
(recur)))))
(defn find-key-in-index
[bb target-key max-address not-found]
(.limit bb max-address)
(let [key
(loop [previous-key nil
min-position (.position bb)
max-position max-address]
(if (zero? (- max-position min-position))
not-found
(let [mid-position (+ min-position (quot (- max-position min-position) 2))]
(.position bb mid-position)
(let [bb (rewind-to-key bb min-position)
current-key
(encoding/decode-number-from-tightly-packed-trie-index bb)]
(cond
(= current-key target-key)
(encoding/decode-number-from-tightly-packed-trie-index bb)
(= current-key previous-key)
(do
(encoding/decode-number-from-tightly-packed-trie-index bb)
(let [final-key (encoding/decode-number-from-tightly-packed-trie-index bb)]
(if (= target-key final-key)
(encoding/decode-number-from-tightly-packed-trie-index bb)
(throw (Exception. "Key not found.")))))
(< current-key target-key)
;; Chew the next decoded number. It's a useless offset.
(do
(encoding/decode-number-from-tightly-packed-trie-index bb)
(recur
current-key
(.position bb)
max-position))
(> current-key target-key)
;; This could also be rewound.
(do
(rewind-to-key bb min-position)
(recur
current-key
min-position
(.position bb))))))))]
(.limit bb (.capacity bb))
key))
(defn tightly-packed-trie-node-value
[byte-buffer]
(let [value (encoding/decode byte-buffer)
freq (encoding/decode byte-buffer)]
{:id value
:count freq}))
(defn -value [trie value-decode-fn]
(wrap-byte-buffer
(.byte-buffer trie)
(.limit (.byte-buffer trie) (.limit trie))
(.position (.byte-buffer trie) (.address trie))
(value-decode-fn (.byte-buffer trie))))
(deftype TightlyPackedTrie [byte-buffer key address limit value-decode-fn]
trie/ITrie
(lookup [self ks]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(if (empty? ks)
self
(let [val (value-decode-fn byte-buffer)
size-of-index (encoding/decode byte-buffer)
offset (find-key-in-index
byte-buffer
(first ks)
(+ (.position byte-buffer) size-of-index)
:not-found)]
(if (= offset :not-found)
nil
(let [child (TightlyPackedTrie.
byte-buffer
(first ks)
(- address offset)
(.capacity byte-buffer)
value-decode-fn)]
(trie/lookup child (rest ks))))))))
(children [self]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(let [val (value-decode-fn byte-buffer)
size-of-index (encoding/decode byte-buffer)]
(.limit byte-buffer (+ (.position byte-buffer)
size-of-index))
(loop [children []]
(if (= (.position byte-buffer) (.limit byte-buffer))
children
(let [child-key (encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
child-offset (encoding/decode-number-from-tightly-packed-trie-index byte-buffer)]
(recur
(conj
children
(TightlyPackedTrie.
byte-buffer
child-key
(- address child-offset)
(.capacity byte-buffer)
value-decode-fn)))))))))
clojure.lang.ILookup
(valAt [self ks]
(if-let [node (trie/lookup self ks)]
(-value node value-decode-fn)
nil))
(valAt [self ks not-found]
(or (get self ks) not-found))
clojure.lang.Seqable
(seq [trie]
(->> trie
(#(trie->depth-first-post-order-traversable-zipperable-vector
[]
%
value-decode-fn))
zip/vector-zip
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(filter (partial instance? clojure.lang.MapEntry))
(#(if (empty? %) nil %)))))
(defn tightly-packed-trie
[trie value-encode-fn value-decode-fn]
(let [baos (ByteArrayOutputStream.)]
(loop [nodes (seq trie)
current-offset 8
previous-depth 0
child-indexes []]
(let [current-node (first nodes)
current-depth (count (first current-node))]
(cond
(empty? nodes)
(let [child-index (first child-indexes)
child-index-baos (ByteArrayOutputStream.)
_ (->> child-index
(run!
(fn [[key offset]]
(.write
child-index-baos
(encoding/encode-key-to-tightly-packed-trie-index key))
(.write
child-index-baos
(encoding/encode-offset-to-tightly-packed-trie-index
(- current-offset offset))))))
child-index-byte-array (.toByteArray child-index-baos)
size-of-child-index (encoding/encode (count child-index-byte-array))
root-address current-offset
value (value-encode-fn 0)]
(.write baos value)
(.write baos size-of-child-index)
(.write baos child-index-byte-array)
(let [ba (.toByteArray baos)
byte-buf (java.nio.ByteBuffer/allocate (+ 8 (count ba)))]
(do (.putLong byte-buf root-address)
(.put byte-buf ba)
(.rewind byte-buf)
(->TightlyPackedTrie
byte-buf
0
(.getLong byte-buf)
(.capacity byte-buf)
value-decode-fn))))
;; Gone up from depth to a parent.
;; Process index of children.
(> previous-depth current-depth)
(do (let [[k v] (first nodes)
value (value-encode-fn v)
child-index (first child-indexes)
child-index-baos (ByteArrayOutputStream.)
_ (->> child-index
(run!
(fn [[key offset]]
(.write
child-index-baos
(encoding/encode-key-to-tightly-packed-trie-index key))
(.write
child-index-baos
(encoding/encode-offset-to-tightly-packed-trie-index
(- current-offset offset))))))
child-index-byte-array (.toByteArray child-index-baos)
size-of-child-index (encoding/encode (count child-index-byte-array))
current-index (second child-indexes)]
(.write baos value)
(.write baos size-of-child-index)
(.write baos child-index-byte-array)
(recur (rest nodes)
(+ current-offset
(count value)
(count size-of-child-index)
(count child-index-byte-array))
current-depth
(cons (conj current-index
[(last k)
current-offset])
(drop 2 child-indexes)))))
;; Down or even in depth to children
;; Start keeping track of new children index
:else
(do (let [[k v] (first nodes)
value (value-encode-fn v)
size-of-child-index (encoding/encode 0)
child-indexes (concat (repeat (- current-depth previous-depth) [])
child-indexes)
current-child-index (first child-indexes)]
(.write baos value)
(.write baos size-of-child-index)
(recur (rest nodes)
(+ current-offset
(count value)
(count size-of-child-index))
current-depth
(cons (conj current-child-index
[(last k)
current-offset])
(rest child-indexes))))))))))
;; TODO: Shared "save" interface for Trie?
(defn save-tightly-packed-trie-to-file
[filepath trie]
(with-open [o (io/output-stream filepath)]
(.write o (.array (.byte-buffer trie)))))
(defn load-tightly-packed-trie-from-file
[filepath value-decode-fn]
(with-open [i (io/input-stream filepath)
baos (ByteArrayOutputStream.)]
(io/copy i baos)
(let [byte-buffer (java.nio.ByteBuffer/wrap (.toByteArray baos))]
(.rewind byte-buffer)
(->TightlyPackedTrie
byte-buffer
0
(.getLong byte-buffer)
(.capacity byte-buffer)
value-decode-fn))))

@ -1,773 +0,0 @@
(ns com.owoga.tightly-packed-trie.core
(:require [clojure.zip :as zip]
[com.owoga.tightly-packed-trie.encoding :as encoding]
[com.owoga.tightly-packed-trie.bit-manip :as bm]
[clojure.java.io :as io]
[clojure.string :as string])
(:import (java.io ByteArrayOutputStream ByteArrayInputStream
DataOutputStream DataInputStream)))
;; A trie data structure that can be converted to
;; a contiguous array of bytes while maintaining
;; efficient lookups.
;;
;; A regular Clojure hash-map can be used as a trie,
;; but there's a lot of memory overhead that comes along
;; with hash-maps.
;;
;; To conveniently get the benefit of packing a trie into a contiguous array of
;; bytes, there are a few restrictions.
;;
;; Everything must be numeric IDs. Keys must be numeric IDs.
;; Values must be numeric IDs. Everything must be a number.
;;
;; This lets us encode everything as variable-length-encoded byte arrays.
;;
;; To maximize efficiency, your most common keys should have the
;; smallest IDs.
;; It's convenient to work with hash-map representations
;; while developing.
;;
;; {"T" {:children:
;; "I": {:value "IT"}
;; "A": {:value "AT"}}}
;;
;; If you want to work with a hash-map representation and
;; want this library to handle conversion of the trie,
;; then your hash-map version will need to follow a few conventions.
;;
;; For example, this code expects child nodes to be key/value pairs under
;; the :children key.
;; To pack a trie into a contiguous array of bytes
;; and still be able to find a key in a list of
;; child indexes in an efficient way, the child
;; indexes need to be sorted so they can be
;; binary-searched.
;;
;; When working with a hash-map-backed Trie,
;; it's convenient to use update-in to add
;; new values to the trie. But the default
;; update-in creates unsorted hash-maps when it encounters
;; a new key. This has the same functionality of update-in,
;; but new keys are given sorted-map values.
(defn update-in-sorted
"'Updates' a value in a nested associative structure, where ks is a
sequence of keys and f is a function that will take the old value
and any supplied args and return the new value, and returns a new
nested structure. If any levels do not exist, hash-maps will be
created."
{:added "1.0"
:static true}
([m ks f & args]
(let [up (fn up [m ks f args]
(let [m (or m (sorted-map))
[k & ks] ks]
(assert (instance? clojure.lang.PersistentTreeMap m)
(apply str
"A non-sorted hash-map in a sorted"
"hash-map will probably be the seed of some problems."))
(if ks
(assoc m k (up (get m k) ks f args))
(assoc m k (apply f (get m k) args)))))]
(up m ks f args))))
(defn children-of-map-trie-node
"Grab the children of a node.
A node is a map of {node-key {:children {,,,} ,,,}}.
This functions gives you the child nodes not as a single map
where all the child keys are part of the same map, but instead as a
a seq where each value is a single node.
This is a useful helper for turning a Trie from a map into a
depth-first post-order traversable zipper."
[node]
(let [[k {:keys [children]}] (first (seq node))]
(->> children
(map (partial apply hash-map)))))
(comment
(let [root-node {:root {:children {"T" {:some 'val} "U" {:other 'val}}}}]
(children-of-map-trie-node root-node))
;; => ({"T" {:some val}} {"U" {:other val}})
)
(defn without-children [node]
(let [[k v] (first (seq node))]
{k (dissoc v :children)}))
(defn map->depth-first-post-order-traversable-zipperable-vector
[node]
[(vec
(map
map->depth-first-post-order-traversable-zipperable-vector
(children-of-map-trie-node node)))
(without-children node)])
(comment
;; This comment demonstrates how we change the order in which
;; we traverse the map. If we don't turn each node into a vector
;; where the list of children are first, then the parent node would
;; get traversed befor the children. In the example below, the
;; "AT" node would be traversed befor the "SAT" and "TAT" nodes.
(let [m {:root
{:children
{"T"
{:children
{"A" {:children
{"T" {:value "TAT", :count 1}
"S" {:value "SAT" :count 1}}
:value "AT"
:count 1},
"U" {:children {"T" {:value "TUT", :count 1}}}}}}}}]
(let [z (zip/vector-zip
(map->depth-first-post-order-traversable-zipperable-vector m))]
(->> z
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(filter map?))))
;; => ({"T" {:value "TAT", :count 1}}
;; {"S" {:value "SAT", :count 1}}
;; {"A" {:value "AT", :count 1}}
;; {"T" {:value "TUT", :count 1}}
;; {"U" {}}
;; {"T" {}}
;; {:root {}})
)
(defn depth-first-post-order-traversable-zipperable-vector->map
"Parity reversal of the code above. Maps are easier to reason about
while developing. It's just inconvenient to traverse them in the
order needed by the algorithms we use to pack them into a contiguous
array of byte for the tightly packed trie."
[node]
(let [children (first node)
parent (second node)
[parent-key parent-val] (first (seq parent))]
(sorted-map
parent-key
(assoc
parent-val
:children
(into
(sorted-map)
(map depth-first-post-order-traversable-zipperable-vector->map children))))))
(comment
(let [m {:root
{:children
{"T"
{:children
{"A" {:children
{"T" {:value "TAT", :count 1}
"S" {:value "SAT" :count 1}}
:value "AT"
:count 1},
"U" {:children {"T" {:value "TUT", :count 1}}}}}}}}]
(let [vect (map->depth-first-post-order-traversable-zipperable-vector m)]
(depth-first-post-order-traversable-zipperable-vector->map vect)))
;; => {:root
;; {:children
;; {"T"
;; {:children
;; {"A"
;; {:value "AT",
;; :count 1,
;; :children
;; {"S" {:value "SAT", :count 1, :children {}},
;; "T" {:value "TAT", :count 1, :children {}}}},
;; "U" {:children {"T" {:value "TUT", :count 1, :children {}}}}}}}}}
)
;;;; Zipper utilities
;;
;; Some useful functions for traversing/transforming zippers.
;;
;; This is what a tree looks like.
;;
;; [1 [2 [3 4 5] [6 7]]]
;; { }
;; / \
;; / \
;; 1 [2 [3 4 5] [6 7]]
;; / | \
;; / | \
;; 2 [3 4 5] [6 7]
;;
;; Clojure's zipper gives us convenient ways
;; to iterate over the nodes of the zipper.
;;
;; We can use the functions below to
;; recreate the functionality of
;; (map #(if (even? %) (* % %) %) coll)
;; but with zippers instead of collections.
(defn visitor-next
"Visits every loc in a zipper and calls zip/next with the result of applying f
to the loc.
Goes without saying that f should return a loc."
[zipper f]
(loop [loc zipper]
(if (zip/end? loc)
(zip/root loc)
(recur (zip/next (f loc))))))
(defn visitor-prev
"visitor-next in reverse."
[zipper f]
(loop [loc zipper]
(if (nil? (zip/prev loc))
(zip/root (f loc))
(recur (zip/prev (f loc))))))
(defn visitor-filter
"Helper for traversing a zipper with a visitor function.
Convenient for pulling filtering logic out of a visitor function
into composable and specific filter functions."
[pred visitor]
(fn [loc]
(if (pred loc)
(visitor loc)
loc)))
(comment
;; This comment shows an example of using the visitor
;; helpers to recreate the functionality similar to
;; (->> [1 2 3 4 5]
;; (map #(if (even? %) (* % %) %))
;;
;; This first example has the conditional logic inside
;; the visitor function.
(let [zipper (zip/vector-zip [1 [2 [3 4 5] [6 7]]])]
(visitor-next
zipper
#(if (and (int? (zip/node %)) (even? (zip/node %)))
(zip/edit % (fn [n] (int (Math/pow n 2))))
%)))
;; => [1 [4 [3 16 5] [36 7]]]
;;
;; This second example has a seperate filter predicate
;; from the transformation visitor and combines them
;; with the visitor-filter function.
(let [zipper (zip/vector-zip [1 [2 [3 4 5] [6 7]]])
pred (fn [loc]
(let [node (zip/node loc)]
(and (int? node)
(even? node))))
transform (fn [loc]
(zip/edit loc (fn [n] (int (Math/pow n 2)))))]
(visitor-next zipper (visitor-filter pred transform)))
;; => [1 [4 [3 16 5] [36 7]]]
)
(defn loc-children
"Takes a zipper loc and returns seq of children locs.
Written to work with zippers of a particular structure:
[[child1, child2, ,,,] parent]"
[loc]
(if (and (zip/left loc)
(zip/down (zip/left loc)))
(let [children (zip/down (zip/left loc))]
(->> children
(iterate zip/right)
(take-while (complement nil?))))))
(comment
(let [v [[1 2 3] :parent]
z (zip/vector-zip v)]
(->> z
zip/down
zip/right
loc-children
(map zip/node)))
;; => (1 2 3)
)
;;;; Tightly Packing Tries
;;
;; These next functions are all helpers
;; related to byte-packing nodes in preperation
;; for writing them to a byte stream.
(defn previous-node [loc]
(loop [loc (zip/prev loc)]
(cond
(nil? loc) nil
(map? (zip/node loc)) loc
:else (recur (zip/prev loc)))))
(defn loc->byte-address
"Given a loc without a byte-address, calculate it from the previous loc.
0x00|node val of previous loc |
0x01|size of child index |
0x02|child1 key |
0x03|child1 byte address offset|
0x04|child2 key |
0x05|child2 byte address offset|
0x??|node val of current loc |
We obviously wouldn't need this if we were writing directly to a
ByteBuffer. Whatever position we are at is our address.
But if we want to maintain the byte-packed data as
part of the map- or vector-like trie structure, then
we need this."
([loc]
(loc->byte-address loc 0))
([loc starting-offset]
(let [prev (previous-node loc)]
(if prev
(let [[k {:keys [byte-address byte-array]}] (first (seq (zip/node prev)))]
(+ byte-address (count byte-array)))
starting-offset))))
(defn child->index
"Given a child gets a map with info needed to build an index.
The below info is just a little bit of lagniappe. The only thing we're
pulling off the child are the values of the keys :byte-address and :key.
The encoding will happen later.
But for reference, the encoding will be:
The index will be a list of pairs of variable-length encoded bytes.
The first number of the pair, the bytes will be encoded with a flag bit of 1.
The second number of the pair, the bytes will be encoded with a flag bit of 0."
[child]
(let [[k {:keys [byte-address byte-array] :as v}] (first (seq child))]
{:byte-address byte-address
:key k}))
(defn child-node-key-and-offset-from-parent-as-byte-array
[{:keys [key offset]}]
(let [baos (ByteArrayOutputStream.)]
(.write baos (encoding/encode-key-to-tightly-packed-trie-index key))
(.write baos (encoding/encode-offset-to-tightly-packed-trie-index offset))
(.toByteArray baos)))
(defn pack-node-value
"Returns byte-array of node value.
Byte-array is 2 variable-length encoded numbers.
For a markov trie, this would be an number ID
of the n-gram and an number of the frequency.
Nodes without terminal values get a value and count of 0."
[node]
(let [baos (ByteArrayOutputStream.)]
(.write baos (encoding/encode (get node :value 0)))
(.write baos (encoding/encode (get node :count 0)))
(.toByteArray baos)))
(defn transform-trie-add-byte-pack-to-each-node
"Visitor for a vector-based trie of structure [[child1, child2, ,,,,] parent].
Transforms each node adding keys for a byte-array of the node and its children index key/offsets
as well as a key for its own offset.
Starts at byte 8, reserving the first 8 bytes for the for root address won't be known
until the end of the zipper.
Transforming the trie to add these keys is an intermediary step that can probably be
bypassed in the future by writing directly to a ByteArrayOutputStream."
[loc]
(let [baos (ByteArrayOutputStream.)
;; Byte-address of the current node. Gets calculated from the
;; previous node's byte address and the size of the previous
;; node's byte array.
byte-address (loc->byte-address loc 8)
child-nodes (->> loc
loc-children
(map (comp second zip/node)))
;; For child, we need to know the offset of the child node's address
;; from this parent node's address.
children (map
(fn [child-node]
(let [child-index (child->index child-node)]
(assoc
child-index
:offset
(- byte-address (:byte-address child-index)))))
child-nodes)
;; Create the byte array of the index of the children
index-ba (let [index-baos (ByteArrayOutputStream.)
child-byte-arrays
(map
child-node-key-and-offset-from-parent-as-byte-array
children)]
(loop [bas child-byte-arrays]
(if (empty? bas)
(.toByteArray index-baos)
(do (.write index-baos (first bas))
(recur (rest bas))))))]
(zip/edit
loc
(fn [node]
(let [[k v] (first (seq node))]
(.write baos (pack-node-value v))
(.write baos (encoding/encode (count index-ba)))
(.write baos index-ba)
{k (conj v {:byte-address byte-address
:byte-array (.toByteArray baos)})})))))
(defprotocol ITrie
(as-map [this] "Map that underlies trie.")
(as-vec [this] "Depth-first post-order vector.")
(as-byte-array [this] (str "Add key/values to each node containing"
" the tightly-packed byte-array representation of the node."))
(transform [this f] "Depth-first post-order apply each function to each node."))
(def not-found# (gensym))
(deftype Trie [f trie]
ITrie
(as-map [_] trie)
(as-vec [_] (map->depth-first-post-order-traversable-zipperable-vector trie))
(as-byte-array [self]
(->> (transform
self
(visitor-filter
#(map? (zip/node %))
transform-trie-add-byte-pack-to-each-node))))
(transform [self f]
(->> self
as-vec
zip/vector-zip
(#(visitor-next % f))
depth-first-post-order-traversable-zipperable-vector->map
(Trie. f)))
;; By returning lookups in the same strucure as root-level nodes
;; but with the root at the found descendent, we can treat this
;; descendant as a new root Trie node. This gives us the advantage
;; of being able to re-use the Trie functions on sub-nodes.
clojure.lang.ILookup
(valAt [_ ks]
(let [v (get-in trie (cons :root (interleave (repeat :children) ks)) not-found#)]
(if (= v not-found#)
(throw (Exception. (format "Key not found: %s" ks)))
(Trie. f (sorted-map (last ks) v)))))
(valAt [_ ks not-found]
(Trie. f (sorted-map
(last ks)
(get-in trie (cons :root (interleave (repeat :children) ks)) not-found))))
clojure.lang.IPersistentCollection
(seq [self]
(->> self
as-vec
zip/vector-zip
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(filter map?)
(filter (comp :value second first seq))))
(cons [_ o]
(let [path (cons :root (interleave (repeat :children) (butlast o)))
node (get-in trie path)]
(Trie. f (update-in-sorted trie path f o))))
(empty [_] (Trie. f {}))
(equiv [_ o]
(and (isa? (class o) Trie)
(= (as-map o) trie))))
(defn trie
([] (->Trie
(fn update-fn [prev cur]
(if (nil? prev)
{:value (last cur)
:count 1}
(-> prev
(update :count (fnil inc 0))
(assoc :value (last cur)))))
(sorted-map)))
([& ks]
(reduce
(fn [t k]
(conj t k))
(trie)
ks)))
(comment
(let [v1 '(1 2 3 123)
v2 '(1 2 12)
v3 '(1 2 2 122)
v4 '(1 3 1 131)
t1 (trie v1 v2 v3 v4)]
{:as-vec (as-vec (get t1 '(1 2)))
:as-map (as-map (get t1 '(1 2)))
:as-byte-array (as-byte-array (get t1 '(1 2)))})
;; => {:as-vec
;; [[[[] {2 {:value 122, :count 1}}] [[] {3 {:value 123, :count 1}}]]
;; {2 {:count 1, :value 12}}],
;; :as-map
;; {2
;; {:children {2 {:value 122, :count 1}, 3 {:value 123, :count 1}},
;; :count 1,
;; :value 12}},
;; :as-byte-array
;; {2
;; {:byte-address 14,
;; :byte-array [-116, -127, -124, -126, 6, -125, 3],
;; :children
;; {2
;; {:value 122,
;; :count 1,
;; :byte-address 8,
;; :byte-array [-6, -127, -128],
;; :children {}},
;; 3
;; {:value 123,
;; :count 1,
;; :byte-address 11,
;; :byte-array [-5, -127, -128],
;; :children {}}},
;; :count 1,
;; :value 12}}}
)
(defn rewind-to-key [bb stop]
(loop []
(let [current (.get bb (.position bb))
previous (.get bb (dec (.position bb)))]
(if (or (= stop (.position bb))
(and (encoding/key-byte? current)
(encoding/offset-byte? previous)))
bb
(do (.position bb (dec (.position bb)))
(recur))))))
(defn forward-to-key [bb stop]
(loop []
(if (or (= stop (.position bb))
(and (encoding/key-byte? (.get bb (.position bb)))
(encoding/offset-byte?
(.get bb (inc (.position bb))))))
bb
(do (.position bb (inc (.position bb)))
(recur)))))
(defn find-key-in-index
[bb target-key max-address not-found]
(.limit bb max-address)
(let [key
(loop [previous-key nil
min-position (.position bb)
max-position max-address]
(if (zero? (- max-position min-position))
not-found
(let [mid-position (+ min-position (quot (- max-position min-position) 2))]
(.position bb mid-position)
(let [bb (rewind-to-key bb min-position)
current-key
(encoding/decode-number-from-tightly-packed-trie-index bb)]
(cond
(= current-key target-key)
(encoding/decode-number-from-tightly-packed-trie-index bb)
(= current-key previous-key)
(do
(encoding/decode-number-from-tightly-packed-trie-index bb)
(let [final-key (encoding/decode-number-from-tightly-packed-trie-index bb)]
(if (= target-key final-key)
(encoding/decode-number-from-tightly-packed-trie-index bb)
(throw (Exception. "Key not found.")))))
(< current-key target-key)
;; Chew the next decoded number. It's a useless offset.
(do
(encoding/decode-number-from-tightly-packed-trie-index bb)
(recur
current-key
(.position bb)
max-position))
(> current-key target-key)
;; This could also be rewound.
(do
(rewind-to-key bb min-position)
(recur
current-key
min-position
(.position bb))))))))]
(.limit bb (.capacity bb))
key))
(defn tightly-packed-trie-node-value
[byte-buffer]
(let [value (encoding/decode byte-buffer)
freq (encoding/decode byte-buffer)]
{:value value
:count freq}))
(defmacro wrap-byte-buffer
"Saves the position and limit of a byte buffer, runs body,
returns byte buffer to original position and limit."
[byte-buffer & body]
`(let [original-position# (.position ~byte-buffer)
original-limit# (.limit ~byte-buffer)]
(try (do ~@body)
(finally
(.limit ~byte-buffer original-limit#)
(.position ~byte-buffer original-position#)))))
(defprotocol ITightlyPackedTrie
(children [self] "Immediate children of a node.")
(value [self] "Return node value, disassociated with children."))
(deftype TightlyPackedTrie [byte-buffer key address limit]
ITightlyPackedTrie
(value [self]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(tightly-packed-trie-node-value byte-buffer)))
(children [self]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(let [val (encoding/decode byte-buffer)
freq (encoding/decode byte-buffer)
size-of-index (encoding/decode byte-buffer)]
(.limit byte-buffer (+ (.position byte-buffer)
size-of-index))
(loop [children []]
(if (= (.position byte-buffer) (.limit byte-buffer))
children
(let [child-key (encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
child-offset (encoding/decode-number-from-tightly-packed-trie-index byte-buffer)]
(recur
(conj
children
(TightlyPackedTrie.
byte-buffer
child-key
(- address child-offset)
(.capacity byte-buffer))))))))))
clojure.lang.ILookup
(valAt [self ks]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(if (empty? ks)
self
(let [val (encoding/decode byte-buffer)
freq (encoding/decode byte-buffer)
size-of-index (encoding/decode byte-buffer)
offset (find-key-in-index
byte-buffer
(first ks)
(+ (.position byte-buffer) size-of-index)
:not-found)]
(if (= offset :not-found)
(throw (Exception. (format "Index not found %s" ks)))
(let [child (TightlyPackedTrie.
byte-buffer
(first ks)
(- address offset)
(.capacity byte-buffer))]
(get child (rest ks))))))))
(valAt [self ks not-found]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(if (empty? ks)
self
(let [val (encoding/decode byte-buffer)
freq (encoding/decode byte-buffer)
size-of-index (encoding/decode byte-buffer)
offset (find-key-in-index
byte-buffer
(first ks)
(+ (.position byte-buffer) size-of-index)
:not-found)]
(if (= offset :not-found)
not-found
(let [child (TightlyPackedTrie.
byte-buffer
(first ks)
(- address offset)
(.capacity byte-buffer))]
(get child (rest ks)))))))))
(defn tightly-packed-trie
"Assumes the trie has been transformed so that each node
includes a :byte-array key to the byte array that needs to be written
for that node and a :byte-address key that has been calculated
with an offset of 8. (The first 8 bytes are reserved for the root address.)"
[trie]
(let [baos (ByteArrayOutputStream.)
trie (as-byte-array trie)]
;; This transform writes to the ByteArrayOutputStream.
(transform
trie
(visitor-filter
#(map? (zip/node %))
(fn [loc]
(let [{:keys [byte-array]} (second (first (seq (zip/node loc))))]
(.write baos byte-array)
loc))))
(let [ba (.toByteArray baos)
root-address (get-in (as-map trie) [:root :byte-address])
byte-buf (java.nio.ByteBuffer/allocate (+ 8 (count ba)))]
(.putLong byte-buf root-address)
(.put byte-buf ba)
(.rewind byte-buf)
(->TightlyPackedTrie byte-buf 0 (.getLong byte-buf) (.capacity byte-buf)))))
(defn zipper-tpt
"Turns a tightly-packed trie into a zipper.
Since the byte buffer that backs the trie can't be edited,
`make-node` and the zipper edit functions won't work."
[tpt]
(let [branch? (fn branch? [node]
(and (instance? TightlyPackedTrie node)
(not-empty (children node))))
zipper-children (fn zippper-children [node]
(children node))
make-node (fn make-node [node children]
(throw (Exception. "Can't add children to Tightly Packed Trie nodes.")))]
(zip/zipper branch? zipper-children make-node tpt)))
(comment
(let [v1 '(1 2 3 123)
v2 '(1 2 12)
v3 '(1 2 2 122)
v4 '(1 3 1 131)
t1 (trie v1 v2 v3 v4)
tpt (tightly-packed-trie t1)]
(->> tpt
zipper-tpt
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(map #(hash-map (.key %) (value %)))))
)
;; TODO: Shared "save" interface for Trie?
(defn save-tightly-packed-trie-to-file
[filepath trie]
(with-open [o (io/output-stream filepath)]
(.write o (.array (.byte-buffer trie)))))
(defn load-tightly-packed-trie-from-file
[filepath]
(with-open [i (io/input-stream filepath)
baos (ByteArrayOutputStream.)]
(io/copy i baos)
(let [byte-buffer (java.nio.ByteBuffer/wrap (.toByteArray baos))]
(.rewind byte-buffer)
(->TightlyPackedTrie byte-buffer 0 (.getLong byte-buffer) (.capacity byte-buffer)))))

@ -0,0 +1,192 @@
(ns com.owoga.trie
(:require [clojure.zip :as zip]))
(defn trie->depth-first-post-order-traversable-zipperable-vector
([path node]
(vec
(map
(fn [[k v]]
[(trie->depth-first-post-order-traversable-zipperable-vector (conj path k) v)
(clojure.lang.MapEntry. (conj path k) (.value v))])
(.children- node)))))
(defn depth-first-post-order-traversable-zipperable-vector->trie
[cls [children [key node]]]
(sorted-map
(last key)
(cls (.key node) (.value node)
(into (sorted-map)
(map depth-first-post-order-traversable-zipperable-vector->trie children)))))
(declare ->Trie)
(defn -without
[trie [k & ks]]
(if k
(if-let [next-trie (get (.children- trie) k)]
(let [next-trie-without (-without next-trie ks)
new-trie (->Trie (.key trie)
(.value trie)
(if next-trie-without
(assoc (.children- trie) k next-trie-without)
(dissoc (.children- trie) k)))]
(if (and (empty? new-trie)
(nil? (.value new-trie)))
nil
new-trie)))
(if (seq (.children- trie))
(->Trie
(.key trie)
nil
(.children- trie))
nil)))
(defprotocol ITrie
(children [self] "Immediate children of a node.")
(lookup [self ks] "Return node at key."))
(deftype Trie [key value children-]
ITrie
(children [trie]
(map
(fn [[k child]]
(Trie. k
(.value child)
#_(sorted-map)
(.children- child)))
children-))
(lookup [trie k]
(loop [k' k
trie' trie]
(cond
;; Allows `update` to work the same as with maps... can use `fnil`.
;; (nil? trie') (throw (Exception. (format "Key not found: %s" k)))
(nil? trie') nil
(empty? k')
(Trie. (.key trie')
(.value trie')
(.children- trie'))
:else (recur
(rest k')
(get (.children- trie') (first k'))))))
clojure.lang.ILookup
(valAt [trie k]
(loop [k' k
trie' trie]
(cond
;; Allows `update` to work the same as with maps... can use `fnil`.
;; (nil? trie') (throw (Exception. (format "Key not found: %s" k)))
(nil? trie') nil
(empty? k') (.value trie')
:else (recur
(rest k')
(get (.children- trie') (first k'))))))
(valAt [trie k not-found]
(loop [k' k
trie' trie]
(cond
(nil? trie') not-found
(empty? k') (.value trie')
:else (recur
(rest k')
(get (.children- trie') (first k'))))))
clojure.lang.IPersistentCollection
(cons [trie entry]
(cond
(instance? Trie (second entry))
(assoc trie (first entry) (.value (second entry)))
:else
(assoc trie (first entry) (second entry))))
(empty [trie]
(Trie. key nil (sorted-map)))
(equiv [trie o]
(and (= (.value trie)
(.value o))
(= (.children- trie)
(.children- o))
(= (.key trie)
(.key o))))
clojure.lang.Associative
(assoc [trie opath ovalue]
(if (empty? opath)
(Trie. key ovalue children-)
(Trie. key value (update
children-
(first opath)
(fnil assoc (Trie. (first opath) nil (sorted-map)))
(rest opath)
ovalue))))
(entryAt [trie key]
(clojure.lang.MapEntry. key (get trie key)))
(containsKey [trie key]
(boolean (get trie key)))
clojure.lang.IPersistentMap
(assocEx [trie key val]
(if (contains? trie key)
(throw (Exception. (format "Value already exists at key %s." key)))
(assoc trie key val)))
(without [trie key]
(-without trie key))
clojure.lang.Counted
(count [trie]
(count (seq trie)))
clojure.lang.Seqable
(seq [trie]
(->> trie
((partial trie->depth-first-post-order-traversable-zipperable-vector []))
zip/vector-zip
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(filter (partial instance? clojure.lang.MapEntry))
(#(if (empty? %) nil %)))))
(defn make-trie
([]
(->Trie '() nil (sorted-map)))
([& ks]
(reduce
(fn [t kv]
(conj t kv))
(make-trie)
(partition 2 ks))))
(comment
(make-trie "do" "do" "dot" "dot" "dog" "dog")
;; => {[\d \o \g] "dog", [\d \o \t] "dot", [\d \o] "do"}
;; Access values at a particular key with get
;; and the key (as any seqable).
(let [trie (make-trie "do" "do" "dot" "dot" "dog" "dog")]
(get trie [\d \o]))
;; => "do"
;; Access children with `traverse`
(let [trie (make-trie "do" "do" "dot" "dot" "dog" "dog" "doggo" "fluffy")]
(traverse trie "do"))
;; => {[\g \g \o] "fluffy", [\g] "dog", [\t] "dot"}
;; Any seq of comparable elements will work for keys.
;; But they *must* be comparable, so you can't mix-and-match
;; different types.
(make-trie '(:k :e :y) 1 '(:k :e :e) 2)
;; => {[:k :e :e] 2, [:k :e :y] 1}
;; Seqing over a trie will return the elements in
;; depth-first post-order traversal with children sorted by key.
(->> (make-trie '(1 2 4) 124 '(1 2) 12 '(1 2 3) 123 '(1 2 2) 122)
(map (fn [[k v]]
[k (* 2 v)]))
(into (make-trie)))
;; => {[1 2 2] 244, [1 2 3] 246, [1 2 4] 248, [1 2] 24}
)

@ -0,0 +1,99 @@
;; Fast weighted random selection thanks to the Vose algorithm.
;; https://gist.github.com/ghadishayban/a26cc402958ef3c7ce61
(ns com.owoga.trie.math
(:import clojure.lang.PersistentQueue))
;; Vose's alias method
;; http://www.keithschwarz.com/darts-dice-coins/
(defprotocol Rand
(nextr [_ rng]))
(deftype Vose [n ^ints alias ^doubles prob]
Rand
;; returns the index of the chosen weight
(nextr [_ rng] ;; not using the rng for now
(let [i (rand-int n)
p (aget prob i)]
(if (or (= p 1.0)
(< (rand) p))
i
(aget alias i)))))
(defn ^:private make-vose [dist]
(let [N (count dist)
alias (int-array N)
prob (double-array N)]
(if (zero? N)
(->Vose N alias prob)
(let [^doubles ps (->> dist
(map (partial * N))
(into-array Double/TYPE))
[small large] (loop [i 0
[small large] [PersistentQueue/EMPTY
PersistentQueue/EMPTY]
ps (seq ps)]
(if (seq ps)
(let [p (first ps)]
(if (< p 1)
(recur (inc i)
[(conj small i) large]
(rest ps))
(recur (inc i)
[small (conj large i)]
(rest ps))))
[small large]))
[small large] (loop [small small
large large]
(if (and (seq small) (seq large))
(let [l (first small)
g (first large)
small (pop small)
large (pop large)]
(aset-double prob l (aget ps l))
(aset-int alias l g)
(let [pg (- (+ (aget ps g) (aget ps l))
1.0)]
(aset-double ps g pg)
(if (< pg 1)
(recur (conj small g) large)
(recur small (conj large g)))))
[small large]))]
(doseq [g (concat large small)]
(aset-double prob g 1))
(->Vose N alias prob)))))
(defn from-weights [ws]
(let [tot (reduce + 0.0 ws)]
(assert (> tot 0) "Can't Vose RNG from 0 weights.")
(let [dist (map #(/ % tot) ws)]
(make-vose (vec dist)))))
(comment
(let [ws [1 2 1 3 3]
rng (from-weights ws)
chosen (repeatedly 1000000 #(nextr rng nil))
accuracy (mapv (comp float
#(/ % 100000)
(frequencies chosen))
(range (count ws)))]
accuracy))
(defn weighted-selection
"If given a coll, assumes the coll is weights and returns the selected index by
weighted random selection.
If given a key function and a collection, uses the key function to get a
collection of weights and returns the value at the randomly selected index."
([coll]
(assert (not-empty coll) "Can't select from empty coll")
(let [rng (from-weights coll)
index (nextr rng nil)]
index))
([key-fn coll]
(assert (not-empty coll) "Can't select from empty coll")
(let [rng (from-weights (map key-fn coll))
index (nextr rng nil)
selection (nth coll index)]
selection)))

@ -0,0 +1,21 @@
(ns tightly-packed-trie.core-test
"Basic tests for the primary API of `next.jdbc`."
(:require [clojure.test :refer [deftest is testing use-fixtures]]
[com.owoga.tightly-packed-trie.core :as tpt]))
(deftest basic-tests
;; use ds-opts instead of (ds) anywhere you want default options applied:
(testing "map-based trie"
(let [trie (tpt/trie)]
(testing "key not found"
(is (thrown-with-msg?
Exception
#"Key not found"
(get trie '("foo"))))
(is (= :not-found
(get trie '("foo") :not-found))))
(testing "conjing to trie"
(let [trie (conj trie '("d" "o" "g" "dog"))]
(is (instance? com.owoga.tightly_packed_trie.core.Trie (get trie '("d" "o" "g"))))
(is (= (tpt/as-map (get trie '("d" "o" "g")))
{"g" {:value "dog" :count 1}})))))))

@ -0,0 +1,89 @@
(ns tightly-packed-trie-test
(:require [clojure.test :refer [deftest is testing] :as t]
[com.owoga.trie :as trie]
[com.owoga.tightly-packed-trie :as tpt]
[com.owoga.tightly-packed-trie.encoding :as encode]
[com.owoga.tightly-packed-trie.encoding :as encoding]
[com.owoga.tightly-packed-trie.bit-manip :as bm]))
(defn value-encode-fn [v]
(if (or (= v ::tpt/root)
(nil? v))
(encode/encode 0)
(encode/encode v)))
(defn value-decode-fn [byte-buffer]
(let [v (encode/decode byte-buffer)]
(if (zero? v)
nil
v)))
(comment
(let [t (->> '([1 3] 13 [1] 1 [1 2] 12)
(apply trie/make-trie)
(#(tpt/tightly-packed-trie % value-encode-fn value-decode-fn)))
bb (.byte-buffer t)]
(trie/lookup t [1 2]))
)
(deftest tightly-packed-trie-tests
(let [empty-trie (-> (trie/make-trie)
(#(tpt/tightly-packed-trie % value-encode-fn value-decode-fn)))
initialized-trie (->> '([1 3] 13 [1] 1 [1 2] 12)
(apply trie/make-trie)
(#(tpt/tightly-packed-trie % value-encode-fn value-decode-fn)))]
(testing "ILookup"
(is (= 13 (get initialized-trie [1 3])))
(is (= :not-found (get initialized-trie [4] :not-found)))
(is (= nil (get initialized-trie [4]))))
(testing "ITrie"
(testing "lookup"
(is (= nil (trie/lookup empty-trie [1])))
(is (= 1 (get (trie/lookup initialized-trie [1]) [])))
(is (= 12 (get (trie/lookup initialized-trie [1]) [2]))))
(testing "children"
(is (= '(12 13)
(map #(get % [])
(trie/children (trie/lookup initialized-trie [1])))))))))
(deftest extended-tightly-packed-trie-tests
(let [initialized-trie (->> (trie/make-trie '(1 2 3) 123 '(1 2 1) 121 '(1 2 2) 122 '(1 3 1) 131)
(#(tpt/tightly-packed-trie % value-encode-fn value-decode-fn)))]
(testing "ILookup"
(is (= 123 (get initialized-trie [1 2 3])))
(is (= :not-found (get initialized-trie [4] :not-found)))
(is (= nil (get initialized-trie [4]))))
(testing "ITrie"
(testing "lookup"
(is (= nil (trie/lookup initialized-trie [4])))
(is (= nil (get (trie/lookup initialized-trie [1]) [])))
(is (= nil (get (trie/lookup initialized-trie [1]) [2]))))
(testing "children"
(is (= '(121 122 123)
(map #(get % [])
(trie/children (trie/lookup initialized-trie [1 2])))))))
(testing "Seq"
(is (= '([[1 2 1] 121]
[[1 2 2] 122]
[[1 2 3] 123]
[[1 2] nil]
[[1 3 1] 131]
[[1 3] nil]
[[1] nil])
(seq initialized-trie))))))
(comment
(let [trie (trie/make-trie '(1 2 3) 123 '(1 2 1) 121 '(1 2 2) 122 '(1 3 1) 131)
tpt (tpt/tightly-packed-trie trie value-encode-fn value-decode-fn)
byte-buffer (.byte-buffer tpt)]
(tpt/wrap-byte-buffer
byte-buffer
(.limit byte-buffer (.limit tpt))
(.position byte-buffer (.address tpt))
[(value-decode-fn byte-buffer)
(value-decode-fn byte-buffer)
(encode/decode-number-from-tightly-packed-trie-index byte-buffer)
(encode/decode-number-from-tightly-packed-trie-index byte-buffer)]))
)

@ -0,0 +1,34 @@
(ns trie-test
(:require [clojure.test :refer [deftest is testing use-fixtures] :as t]
[com.owoga.trie :as trie]
[clojure.zip :as zip]
[clojure.main :as main]))
(deftest trie-tests
(let [empty-trie (trie/make-trie)
initialized-trie (trie/make-trie '(1 2) 12)]
(testing "assoc"
(is (= (assoc empty-trie '(1 2) 12)
initialized-trie)))
(testing "dissoc"
(let [expected (-> (trie/make-trie)
(assoc '(1) 1))
trie (-> (trie/make-trie)
(assoc '(1) 1)
(assoc '(1 3) 13))]
(is (= expected (dissoc trie '(1 3))))))
(testing "ILookup"
(is (= 12 (get initialized-trie '(1 2))))
(is (= :not-found (get initialized-trie '(1 3) :not-found)))
(is (nil? (get initialized-trie '(1 3)))))
(testing "IPersistentCollection"
(is (empty? empty-trie))
(is (= (conj empty-trie ['(1 2) 12])
initialized-trie)))
(testing "Counted"
(is (zero? (count empty-trie)))
(is (= 2 (count initialized-trie))))
(testing "Seqable"
(is (= '([[1 2] 12] [[1] nil])
(seq initialized-trie))))))
Loading…
Cancel
Save