Implement hash-map and byte-array tries

The hash-map trie is convenient to work with at the REPL since the
key/values are human readable and the backing data is traversible in a
fashion familiar to Clojure.

The byte-array backed trie has a slightly different API but is far more
memory efficient.

The paper that the tightly backed trie is based on can be viewed at
https://www.aclweb.org/anthology/W09-1505.pdf
main
Eric Ihli 4 years ago
commit 56be9e9898

7
.gitignore vendored

@ -0,0 +1,7 @@
.DS_Store
.idea
*.log
tmp/
.nrepl-port
.cpcache

@ -0,0 +1,323 @@
#+TITLE: Clojure Tightly Packed Trie
* What does this do?
Tries as hash-maps are common, but hash-maps take up a lot of memory (relatively speaking).
For example, creating a hash-map trie of 1, 2, and 3-grams of short story by Edgar Allen Poe results in a hash-map that consumes over 2 megabytes of memory. [[file:examples/markov_language_model.clj][See this markov language model example]].
If you're dealing with much larger corpuses, the memory footprint could become an issue.
A tightly packed trie, on the other hand, is tiny. A tightly packed trie on the same corpus is only 37 kilobytes. That's 1.7% of the hash-map backed trie's size!
** How do you use library?
A hash-map-backed trie is created by passing a variable number of "trie entries" to ~trie~.
A "trie entry" is a list of keys with the last element of the list being the "value" of the node.
#+begin_src clojure :results none :session usage-example
(require '[com.owoga.tightly-packed-trie.core :as tpt])
;; [trie-entry '(k1 k2 k3 val)
(def trie-entries
(let [trie-entry-1 '("D" "O" "G" "DOG")
trie-entry-2 '("D" "O" "T" "DOT")
trie-entry-3 '("D" "O" "DO")
trie-entry-4 '("D" "A" "Y" "DAY")]
[trie-entry-1
trie-entry-2
trie-entry-3
trie-entry-4]))
;; (trie trie-entry-1 trie-entry-2 trie-entry-n ,,,)
(def hash-map-backed-trie
(apply tpt/trie trie-entries))
#+end_src
Once the trie is created, you can get a seq of all of the descendants of below a certain key
by using ~get~.
#+begin_src clojure :results none :session usage-example
;; All of the nodes that are descendants of '("D" "O")
(seq (get hash-map-backed-trie '("D" "O")))
;; => ({"G" {:value "DOG", :count 1}}
;; {"T" {:value "DOT", :count 1}}
;; {"O" {:count 1, :value "DO"}})
;; All of the nodes that are descendants of '("D" "A")
(seq (get hash-map-backed-trie '("D" "A")))
;; => ({"Y" {:value "DAY", :count 1}})
#+end_src
New nodes can be ~conj~ed into the trie.
#+begin_src clojure :results none :session usage-example
(let [new-trie (conj hash-map-backed-trie
'("D" "A" "D" "DAD"))]
(seq (get new-trie '("D" "A"))))
;; => ({"D" {:value "DAD", :count 1}} {"Y" {:value "DAY", :count 1}})
#+end_src
The entire map can be viewed with ~as-map~.
There's also ~as-vec~ which returns the trie as a vector that can be
passed directly to ~clojure.zipper/vector-zip~.
#+begin_src clojure :results none :session usage-example
(tpt/as-map hash-map-backed-trie)
;; => {:root
;; {:children
;; {"D"
;; {:children
;; {"A" {:children {"Y" {:value "DAY", :count 1}}},
;; "O"
;; {:children {"G" {:value "DOG", :count 1}, "T" {:value "DOT", :count 1}},
;; :count 1,
;; :value "DO"}}}}}}
#+end_src
~get~ returns a ~Trie~, so all of the ~ITrie~ protocol functions work on the value that is returned by ~get~.
#+begin_src clojure :results none :session usage-example
(tpt/as-map (get hash-map-backed-trie '("D" "O")))
;; => {"O"
;; {:children {"G" {:value "DOG", :count 1}, "T" {:value "DOT", :count 1}},
;; :count 1,
;; :value "DO"}}
#+end_src
There's also a ~transform~ function in the ~ITrie~ protocol that iterates over each
loc in the zippered Trie and calls your given function on the loc.
This is useful, as the name suggests, for performing transformations.
#+begin_src clojure :results none :session usage-example
(require '[clojure.zip :as zip]
'[clojure.string :as string])
(let [lower-cased-keys-trie
(tpt/transform
hash-map-backed-trie
(fn [loc]
(if (map? (zip/node loc))
(zip/edit
loc
(fn [node]
(let [[k v] (first (seq (zip/node loc)))]
{(string/lower-case k) v})))
loc)))]
(seq lower-cased-keys-trie))
;; => ({"y" {:value "DAY", :count 1}}
;; {"g" {:value "DOG", :count 1}}
;; {"t" {:value "DOT", :count 1}}
;; {"o" {:count 1, :value "DO"}})
#+end_src
* Tightly Packed Tries
The trie above is backed by a Clojure hash-map.
It's not very efficient. All of the strings, nested maps, pointers... it all adds up to a lot of wasted memory.
A tightly packed trie provides the same functionality at an impressively small fraction of the memory footprint.
One restriction though: all keys and values must be integers. To convert them from integer identifiers back into the values that your biological self can process, you'll need to keep some type of database or in-memory map of ids to human-parseable things.
Here's a similar example to that above, but with values that we can tightly pack.
#+begin_src clojure :results none :session usage-example
(require '[com.owoga.tightly-packed-trie.core :as tpt])
;; [trie-entry '(path value)
(def trie-entries
(let [trie-entry-1 '(1 2 3 123)
trie-entry-2 '(1 2 1 121)
trie-entry-3 '(1 2 2 122)
trie-entry-4 '(1 3 1 131)]
[trie-entry-1
trie-entry-2
trie-entry-3
trie-entry-4]))
(def non-tightly-packed-trie
(apply tpt/trie trie-entries))
(tpt/as-map non-tightly-packed-trie)
;; => {:root
;; {:children
;; {1
;; {:children
;; {2
;; {:children
;; {1 {:value 121, :count 1},
;; 2 {:value 122, :count 1},
;; 3 {:value 123, :count 1}}},
;; 3 {:children {1 {:value 131, :count 1}}}}}}}}
#+end_src
There's a slightly mis-named function that creates a byte-array representation of each node.
~as-byte-array~ is named similarly to ~as-map~ and ~as-vec~. But it's mis-named because it doesn't
actually return a ~byte-array~ like the name suggests. I may fix that in the future.
Instead, it adds some keys to each value, ~byte-address~ and ~byte-array~.
The ~byte-address~ is the offset that this node is going to be at in the final contiguous byte-array that makes up the tightly packed trie.
The ~byte-array~ is the byte-encoded value of the node's key, value, size of the node's children index, and encoded values for each child's key and byte-address-offset from the current node.
The ~byte-addresses~ and ~byte-arrays~ are calculated assuming that the depth-first post-order traversal of the vector representation of the trie is the correct order that the nodes need to be written to the contiguous array of bytes that make up the final tightly-packed-trie.
Part of that requirement means that the child nodes of each node need to be sorted!
Even though the Trie code looks like it's just backed by regular old hash-maps, it's actually backed by sorted-maps!
#+begin_src clojure :results none :session usage-example
(def non-tightly-packed-trie-with-raw-byte-info-added
(tpt/as-byte-array non-tightly-packed-trie))
(tpt/as-map non-tightly-packed-trie-with-raw-byte-info-added)
;; => {:root
;; {:byte-address 42,
;; :byte-array [-128, -128, -126, -127, 7],
;; :children
;; {1
;; {:byte-address 35,
;; :byte-array [-128, -128, -124, -126, 18, -125, 5],
;; :children
;; {2
;; {:byte-address 17,
;; :byte-array [-128, -128, -122, -127, 9, -126, 6, -125, 3],
;; :children
;; {1
;; {:value 121,
;; :count 1,
;; :byte-address 8,
;; :byte-array [-7, -127, -128],
;; :children {}},
;; 2
;; {:value 122,
;; :count 1,
;; :byte-address 11,
;; :byte-array [-6, -127, -128],
;; :children {}},
;; 3
;; {:value 123,
;; :count 1,
;; :byte-address 14,
;; :byte-array [-5, -127, -128],
;; :children {}}}},
;; 3
;; {:byte-address 30,
;; :byte-array [-128, -128, -126, -127, 4],
;; :children
;; {1
;; {:value 131,
;; :count 1,
;; :byte-address 26,
;; :byte-array [1, -125, -127, -128],
;; :children {}}}}}}}}}
#+end_src
Once the trie is transformed to have the byte-array info on each node, you can pass that
trie to ~tightly-packed-trie~ to get a MUCH more memory-efficient trie.
This trie is backed by a ByteBuffer rather than a hash-map.
#+begin_src clojure :results none :session usage-example
(def tightly-packed-trie
(tpt/tightly-packed-trie non-tightly-packed-trie-with-raw-byte-info-added))
(.capacity (.byte-buffer tightly-packed-trie))
;; => 47
;;
;;;; Instead of a map with all of its pointers, we are storing
;;;; all of the information necessary for this trie in
;;;; just 47 bytes!
;;;; Hash-map-backed and Tightly-packed comparson
;; The apis are slightly different. But you have access to basically the same data.
;;;; Getting the value of a node in a hash-map-backed trie.
;;
(-> (get non-tightly-packed-trie '(1 2 3))
tpt/as-map
seq
first
second
(select-keys [:value :count]))
;; => {:value 123, :count 1}
;;;; Getting the value of a node in a tightly-packed trie.
;;
(tpt/value (get tightly-packed-trie '(1 2 3)))
;; => {:value 123, :count 1}
#+end_src
It's backed by a byte-buffer so saving to disk is trivial, but there's a helper for that.
Here's the process of saving to and loading from disk. (Only works for tightly-packed tries.)
#+begin_src clojure :results none :session usage-example
(tpt/save-tightly-packed-trie-to-file "/tmp/tpt.bin" tightly-packed-trie)
(def saved-and-loaded-tpt
(tpt/load-tightly-packed-trie-from-file "/tmp/tpt.bin"))
(tpt/value (get saved-and-loaded-tpt '(1 2 3)))
;; => {:value 123, :count 1}
#+end_src
* TODO Tests
https://practicalli.github.io/clojure/clojure-spec/
* TODO Why would you want a trie data structure?
TODO: The below is closer to a CSCI lesson than library documentation. If it's necessary, figure out where to put it, how to word it, etc... It might not be worth cluttering documentation with so much detail.
** Autocomplete
A user types in the characters "D" "O" and you want to show all possible autocompletions.
*** Typical "List" data structure
- Iterate through each word starting from the beginning.
- When you get to the first word that starts with the letters "D" "O", start keeping track
of words
- When you get to the next word that doesn't start with "D" "O", you have all the words you want to use for autocomplete.
#+begin_src clojure
(def dictionary ["Apple" "Banana" "Carrot" "Do" "Dog" "Dot" "Dude" "Egg"])
#+end_src
#+RESULTS:
: #'markov-language-model/dictionary
**** Problems with a list.
It's slow if you have a big list. If you have a dictionary with hundreds of thousands of words
and the user is typing in letters that don't show up until the end of the list, then you're searching
through the first few hundred thousand items in the list before you get to what you need.
If you're familiar with binary search over sorted lists, you'll know this is a contrived example.
*** Typical "Trie" in Clojure
#+begin_src clojure
{"A" {:children {"P" {,,,} :value nil}}
"D" {:children {"O"
:children {"G" {:children {} :value "DOG"}
"T" {:children {} :value "DOT"}}
:value "DO"}
:value nil}}
#+end_src
#+RESULTS:
: class java.lang.RuntimeException
**** How is a trie faster?
-

@ -0,0 +1,17 @@
{:paths ["src" "resources"]
:jvm-opts ["-Xmx6g"]
:deps {org.clojure/clojure {:mvn/version "1.10.0"}
;; profiling
cljol/cljol {:git/url "https://github.com/jafingerhut/cljol"
:sha "11d4aa72fdd19248bd7600fb7b5cde7189f32938"}
com.taoensso/tufte {:mvn/version "2.2.0"}
;; logging
com.taoensso/timbre {:mvn/version "4.10.0"}}
:aliases {:dev {:extra-paths ["test" "examples" "dev"]
:extra-deps {}}
:jar {:replace-deps {com.github.seancorfield/depstar {:mvn/version "2.0.193"}}
:exec-fn hf.depstar/jar
:exec-args {:jar "tightly-packed-trie.jar" :sync-pom true}}
:deploy {:replace-deps {slipset/deps-deploy {:mvn/version "0.1.5"}}
:exec-fn deps-deploy.deps-deploy/deploy
:exec-args {:installer :remote :artifact "tightly-packed-trie.jar"}}}}

@ -0,0 +1,532 @@
(ns markov-language-model
(:require [clojure.string :as string]
[clojure.java.io :as io]
[com.owoga.tightly-packed-trie.core :as tpt]
[com.owoga.tightly-packed-trie.encoding :as encoding]
[cljol.dig9 :as d]
[clojure.zip :as zip]
[com.owoga.tightly-packed-trie.bit-manip :as bm]))
(def corpus (slurp (io/resource "cask_of_amontillado.txt")))
;; For better generation of text, you'll probably want to pad the starts
;; of sentences with n-1 "start-of-sentence" tokens.
(defn prep-punctuation-for-tokenization
"Puts spaces around punctuation so that they aren't
tokenized with the words they are attached to.
Might add extraneous whitespace, but presumedly that will be ignored/removed
during tokenization."
[text]
(string/replace text #"([\.,!?])" " $1 "))
(defn remove-formatting-characters
"Input has underscores, presumably because the text
might be rendered by something that can italicize or bold text.
We'll just ignore them for now."
[text]
(string/replace text #"[_*]" ""))
(defn tokenize [text]
(-> text
remove-formatting-characters
prep-punctuation-for-tokenization
(string/split #"[\n ]+")))
(defn interleave-all
"Like interleave, but instead of ending the interleave when the shortest collection
has been consumed, continues to interleave the remaining collections."
{:added "1.0"
:static true}
([] ())
([c1] (lazy-seq c1))
([c1 c2]
(lazy-seq
(let [s1 (seq c1) s2 (seq c2)]
(if (and s1 s2)
(cons (first s1) (cons (first s2)
(interleave-all (rest s1) (rest s2))))
(lazy-seq (or s1 s2))))))
([c1 c2 & colls]
(lazy-seq
(let [ss (->> (map seq (conj colls c2 c1))
(remove nil?))]
(when ss
(concat (map first ss) (apply interleave-all (map rest ss))))))))
(comment
(let [tokens [1 2 3 4 5]
p1 (partition 1 1 tokens)
p2 (partition 2 1 tokens)
p3 (partition 3 1 tokens)]
(interleave-all p1 p2 p3)))
(defn ngramify-tokens [n m tokens]
(let [partition-colls (map #(partition % 1 tokens) (range n m))
ngrams (apply interleave-all partition-colls)]
ngrams))
(comment
(->> (tokenize corpus)
(take 5)
(ngramify-tokens 1 4)) ;; => (("The")
;; ("The" "thousand")
;; ("The" "thousand" "injuries")
;; ("thousand")
;; ("thousand" "injuries")
;; ("thousand" "injuries" "of")
;; ("injuries")
;; ("injuries" "of")
;; ("injuries" "of" "Fortunato")
;; ("of")
;; ("of" "Fortunato")
;; ("Fortunato"))
)
(defn make-trie
([] (tpt/->Trie
(fn update-fn [prev cur]
(if (nil? prev)
(sorted-map
:value (last cur)
:count 1)
(-> prev
(update :count (fnil inc 0))
(assoc :value (last cur)))))
(sorted-map)))
([& ks]
(reduce
(fn [t k]
(conj t k))
(make-trie)
ks)))
(defn add-terminal-value-to-ngram
"The Trie expects entries to be of the form '(k1 k2 k3 value).
The ngrams generated above are just '(k1 k2 k3).
This adds a value that is simply the ngram itself:
'(k1 k2 k3 '(k1 k2 k3))."
[ngram]
(concat ngram (list ngram)))
(comment
(let [ngrams (->> corpus
tokenize
(take 200)
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))
trie (apply make-trie ngrams)]
(tpt/as-map trie))
;; {:root
;; {:children
;; {","
;; {:children
;; {"I"
;; {:children {"vowed" {:count 1, :value ("," "I" "vowed")}},
;; :count 1,
;; :value ("," "I")},
;; "and"
;; {:children {"he" {:count 1, :value ("," "and" "he")}},
;; :count 1,
;; :value ("," "and")},
;; ,,,}}}}}}
)
;; TODO: Move this to ITrie?
(defn trie->seq-of-nodes
"Returns a seq of every terminal node. Useful for things like
doing aggregation calculations."
[trie]
(->> trie
tpt/as-vec
zip/vector-zip
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(filter map?)))
;; The tightly packed trie uses an encoding where integers are encoded with
;; variable lengths. To maximize memory efficiency, the most commonly used values
;; should have the smallest integer IDs. That way the values that most commonly appear
;; are encoded with the fewest bytes.
(defn seq-of-nodes->sorted-by-count
"Sorted first by the rank of the ngram, lowest ranks first.
Sorted second by the frequency of the ngram, highest frequencies first.
This is the order that you'd populate a mapping of keys to IDs."
[nodes]
(->> nodes
(map (comp first seq))
(map (fn [[k v]]
(vector (:value v) (:count v))))
;; root node and padded starts
(remove (comp nil? second))
(sort-by #(vector (count (first %))
(- (second %))))))
(comment
(let [ngrams (->> corpus
tokenize
(take 200)
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))
trie (apply make-trie ngrams)]
(->> trie
trie->seq-of-nodes
seq-of-nodes->sorted-by-count
(take 10)))
;; => ([(",") 11]
;; [(".") 9]
;; [("I") 8]
;; [("the") 6]
;; [("to") 6]
;; [("was") 5]
;; [("a") 4]
;; [("my") 4]
;; [("of") 4]
;; [("as") 3])
)
(defn trie->database
"It's convenient to work with a trie that has keys and values as
human-readable strings, as pulled straight from a corpus in the case
of a markov trie. But to tightly pack the trie into a byte array,
we need every value to be an integer that we can variable-length-encode.
This creates a database for conveniently converting the human-readable
entries to ids and back from ids to human-readable entries."
[trie]
(let [sorted-keys (->> (trie->seq-of-nodes trie)
seq-of-nodes->sorted-by-count)]
(loop [sorted-keys sorted-keys
database {}
i 1]
(if (empty? sorted-keys)
database
(recur
(rest sorted-keys)
(-> database
(assoc (first (first sorted-keys))
{:count (second (first sorted-keys))
:id i})
(assoc i (first (first sorted-keys))))
(inc i))))))
(comment
(let [ngrams (->> corpus
tokenize
(take 200)
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))
trie (apply make-trie ngrams)]
(trie->database trie))
;; {("at") {:count 1, :id 39},
;; 453 ("revenge" "." "You"),
;; ("The") {:count 1, :id 37},
;; ("resolved" ",") {:count 1, :id 256},
;; 487 ("very" "definitiveness" "with"),
;; ("be" "respected") {:count 1, :id 170},
;; ("a" "point") {:count 1, :id 158},
;; 357 ("and" "he" "did"),
;; 275 ("the" "very"),
;; ("doubt" "my" "good") {:count 1, :id 381},
;; ,,,}
)
(seq {"and" {:count 1 :value '("foo")}});; => (["and" {:count 1, :value ("foo")}])
(defn transform-trie->ids
"Once we have a database to convert from string-keys to integer-ids and back,
we can traverse the trie using its `transform` zipper and `zip/edit` each
node replacing the string-keys with their integer-ids."
[trie database]
(let [transform-p #(map? (zip/node %))
transform-f
(fn tf [loc]
(zip/edit
loc
(fn [node]
;; {"And {:count 1, :value (! " "And)}} ;; <- Node
(let [[k v] (first (seq node))]
{(get-in database [(list k) :id] (if (= k :root) :root))
(assoc v :value (get-in database [(:value v) :id] 0))}))))]
(tpt/transform trie (tpt/visitor-filter transform-p transform-f))))
(def trie
(let [ngrams (->> corpus
tokenize
(ngramify-tokens 1 4)
(map add-terminal-value-to-ngram))]
(apply make-trie ngrams)))
(def trie-database
(trie->database trie))
(def tpt-ready-trie (transform-trie->ids trie trie-database))
(def tightly-packed-trie
(tpt/tightly-packed-trie tpt-ready-trie))
;;;; DEMO
;;
(comment
;;;; Our "database" (just a hash-map) serves a dual purpose.
;;
;; It maps n-grams to their frequency counts and to an integer ID.
;; It also maps that integer ID back to the n-gram.
(take 10 trie-database)
;; => ([("Then" "I") {:count 1, :id 1475}]
;; [("to" "your" "long") {:count 1, :id 5371}]
;; [("an" "instant" "he") {:count 1, :id 3842}]
;; [("fifth" "," "the") {:count 1, :id 4209}]
;; [("from" "the" "depth") {:count 1, :id 4270}]
;; [2721 ("the" "more")]
;; [("during" "which" ",") {:count 1, :id 4144}]
;; [("nodded") {:count 1, :id 674}]
;; [("the" "feeble") {:count 1, :id 2693}]
;; [("intermingling" "," "into") {:count 1, :id 4488}])
;;;; We can `get` the value of an n-gram from a Trie.
;; The value returned will be a Trie that has as its root node the
;; value at the n-gram. This gives you access to all of the descendants.
;;
;; Having access to the descendants is useful for something like
;; auto-complete. You can get in the trie the input to the completion, the prefix.
;; Then you can get the completions by simple seq-ing over the child nodes.
(tpt/as-map (get trie '("," "I")))
;; => {"I"
;; {:count 10,
;; :value ("," "I"),
;; :children
;; {"am" {:count 1, :value ("," "I" "am")},
;; "began" {:count 2, :value ("," "I" "began")},
;; ,,,
;; "well" {:count 1, :value ("," "I" "well")}}}}
;;;; Database
;; Each n-gram has its own integer ID. The integer IDs should be handed
;; out to n-grams in order of frequency. Therefore, you're 1-grams will probably
;; have lower IDs than the higher-order n-grams.
;;
;; Here we see "," is the 2nd most-common n-gram.
(get-in trie-database ['(",") :id]) ;; => 2
(get-in trie-database ['("I") :id]) ;; => 4
;; The ID of a 2-gram is not related in any way to
;; the two 1-grams that make it up. Every n-gram is unique
;; and gets its own unique ID.
;;
;; BUT... Every node is referenced by a 1-gram key.
;; So the 2-gram '("," "I") is referenced from
;; the :root key's children by the 1-gram key '(",")
;; and then by that 1-gram key's children by the 1-gram '("I").
;; The VALUE of that node though is the 2-gram '("," "I").
;;
;; To re-iterate: The keys are all 1-grams at every nesting level.
;; The values are the higher-order n-grams the lower in the nesting
;; that you go.
(get-in trie-database ['("," "I") :id]) ;; => 911
;;;; Map-based Trie vs Tightly Packed Trie
;;;;
;; The interface is *almost* the same between the two.
;; Tightly packed tries can't be updated or written to.
;; They can only be read.
;; And to get from integer IDs to human-readable strings,
;; you need to go through the database.
;;
;; Other than that though, let's see it in action!
;;
;;;; Here is the map-based trie.
(->> (tpt/as-map (get trie '("," "I")))
(#(get-in % ["I" :children]))
(map seq)
(map first))
;; => ("am" "began" "ceased" "had" "resumed" "soon" "suffered" "vowed" "well")
;;;; And here is the tightly-packed trie
(->> (get tightly-packed-trie '(2 4))
tpt/children
(map tpt/value)
(map :value)
(map #(get trie-database %))
(map last)
sort)
;; => ("am" "began" "ceased" "had" "resumed" "soon" "suffered" "vowed" "well")
;;;; Ta-da!
;;;; Let's check the size difference
;; Memory footprint comparison
;; 2.2mb -> 37kb.
;; 1.7% of its original Clojure map size!!!
(->> trie (.trie) vector d/sum)
;; 65485 objects
;; 109687 references between them
;; 2179088 bytes total in all objects
;; no cycles
;; 8413 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
(->> tightly-packed-trie (.byte-buffer) vector d/sum)
;; 2 objects
;; 1 references between them
;; 37680 bytes total in all objects
;; no cycles
;; 1 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
(let [trie-at-2 (get tightly-packed-trie '(2))
address (.address trie-at-2)
byte-buffer (.byte-buffer trie-at-2)
limit (.limit trie-at-2)]
(tpt/wrap-byte-buffer
byte-buffer
(.position byte-buffer address)
(.limit byte-buffer limit)
(println "Address of node at 2" address)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(println "Size of index at 2" (encoding/decode byte-buffer))
(println "position of first key in index" (.position byte-buffer))
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(.position byte-buffer (- address 1037)) ;; Position of '("," "the")
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer) ;; 11 size of index
;; 1462 position of buffer
;; max-address of index = 1473 (or 1472?)
(.position byte-buffer)
(.position byte-buffer 2618) ;; First mid of broken binary search
(tpt/rewind-to-key byte-buffer 2500)
(println (.position byte-buffer))
(println (bm/to-binary-string (.get byte-buffer (.position byte-buffer))))
(println (bm/to-binary-string (.get byte-buffer (dec (.position byte-buffer)))))
))
(let [trie-at-2 (get tightly-packed-trie '(2))
address (.address trie-at-2)
byte-buffer (.byte-buffer trie-at-2)
limit (.limit trie-at-2)]
(tpt/wrap-byte-buffer
byte-buffer
(.position byte-buffer address)
(.limit byte-buffer limit)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
(.position byte-buffer (- address 1037))
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)
(encoding/decode byte-buffer)))
;; I's offset, 986.
;; See below. ID at offset 986 is 2! Same as at above offset!
;; And the count is 4? The count coincidentally is the ID we expect?
(let [trie-at-2 (get tightly-packed-trie '(2))
address 986
byte-buffer (.byte-buffer trie-at-2)
limit (.limit trie-at-2)]
(tpt/wrap-byte-buffer
byte-buffer
(.position byte-buffer address)
(.limit byte-buffer limit)
(encoding/decode byte-buffer)))
(get trie-database 4)
(let [byte-buffer (.byte-buffer tightly-packed-trie)]
(.position byte-buffer)))
;; Memory footprint comparison
;; 2.2mb -> 32kb.
;; 1.5% of its original Clojure map size!
(comment
(->> trie (.trie) vector d/sum)
;; 65485 objects
;; 109687 references between them
;; 2179088 bytes total in all objects
;; no cycles
;; 8413 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
(->> tightly-packed-trie (.byte-buffer) vector d/sum)
;; 2 objects
;; 1 references between them
;; 32896 bytes total in all objects
;; no cycles
;; 1 leaf objects (no references to other objects)
;; 1 root nodes (no reference to them from other objects _in this graph_)
)
(defn key-get-in-tpt [tpt db ks]
(let [id (map #(get-in db [(list %) :id]) ks)
v (get tpt id)]
{id v}))
(defn id-get-in-tpt [tpt db ids]
(let [ks (apply concat (map #(get db %) ids))
v (get tpt ids)
id (get-in db [ks :id])]
{ks (assoc v :value (get db id))}))
(comment
(key-get-in-tpt
tightly-packed-trie
trie-database
'("another"))
;; => {(2 2 3) {:value 3263, :count 462}}
(id-get-in-tpt
tightly-packed-trie
trie-database
'(2 2 3))
;; => {("<s>" "<s>" "the") {:value ("<s>" "<s>" "the"), :count 462}}
)
(comment
;; database
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)]
(->> (trie->database trie)
(#(get % 3))))
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)]
(tpt/as-map (transform-trie->ids trie)))
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)
tightly-packed-trie (tpt/tightly-packed-trie
(transform-trie->ids trie))]
(get tightly-packed-trie '(2 2 3)))
(let [texts (->> (dark-corpus-file-seq 500 2)
(map slurp))
trie (create-trie-from-texts texts)]
(tpt/as-map trie))
(let [text (slurp (first (dark-corpus-file-seq 500 1)))]
(->> text
util/clean-text
(#(string/split % #"\n+"))))
)

@ -0,0 +1,45 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<packaging>jar</packaging>
<groupId>com.owoga</groupId>
<artifactId>tightly-packed-trie</artifactId>
<version>0.1.1</version>
<name>tightly-packed-trie</name>
<dependencies>
<dependency>
<groupId>org.clojure</groupId>
<artifactId>clojure</artifactId>
<version>1.10.0</version>
</dependency>
<dependency>
<groupId>com.taoensso</groupId>
<artifactId>tufte</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.clojure</groupId>
<artifactId>data.fressian</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>com.taoensso</groupId>
<artifactId>timbre</artifactId>
<version>4.10.0</version>
</dependency>
<dependency>
<groupId>com.taoensso</groupId>
<artifactId>nippy</artifactId>
<version>3.0.0</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src</sourceDirectory>
</build>
<repositories>
<repository>
<id>clojars</id>
<url>https://repo.clojars.org/</url>
</repository>
</repositories>
</project>

@ -0,0 +1,321 @@
The thousand injuries of Fortunato I had borne as I best could, but
when he ventured upon insult, I vowed revenge. You, who so well know
the nature of my soul, will not suppose, however, that I gave utterance
to a threat. _At length_ I would be avenged; this was a point definitely
settled--but the very definitiveness with which it was resolved,
precluded the idea of risk. I must not only punish, but punish with
impunity. A wrong is unredressed when retribution overtakes its
redresser. It is equally unredressed when the avenger fails to make
himself felt as such to him who has done the wrong.
It must be understood that neither by word nor deed had I given
Fortunato cause to doubt my good will. I continued, as was my wont, to
smile in his face, and he did not perceive that my smile _now_ was at
the thought of his immolation.
He had a weak point--this Fortunato--although in other regards he was a
man to be respected and even feared. He prided himself on his
connoisseurship in wine. Few Italians have the true virtuoso spirit.
For the most part their enthusiasm is adopted to suit the time and
opportunity--to practise imposture upon the British and Austrian
_millionaires_. In painting and gemmary, Fortunato, like his countrymen,
was a quack--but in the matter of old wines he was sincere. In this
respect I did not differ from him materially: I was skillful in the
Italian vintages myself, and bought largely whenever I could.
It was about dusk, one evening during the supreme madness of the
carnival season, that I encountered my friend. He accosted me with
excessive warmth, for he had been drinking much. The man wore motley.
He had on a tight-fitting parti-striped dress, and his head was
surmounted by the conical cap and bells. I was so pleased to see him,
that I thought I should never have done wringing his hand.
I said to him--"My dear Fortunato, you are luckily met. How remarkably
well you are looking to-day! But I have received a pipe of what passes
for Amontillado, and I have my doubts."
"How?" said he. "Amontillado? A pipe? Impossible! And in the middle
of the carnival!"
"I have my doubts," I replied; "and I was silly enough to pay the full
Amontillado price without consulting you in the matter. You were not to
be found, and I was fearful of losing a bargain."
"Amontillado!"
"I have my doubts."
"Amontillado!"
"And I must satisfy them."
"Amontillado!"
"As you are engaged, I am on my way to Luchesi. If any one has a
critical turn, it is he. He will tell me--"
"Luchesi cannot tell Amontillado from Sherry."
"And yet some fools will have it that his taste is a match for your
own."
"Come, let us go."
"Whither?"
"To your vaults."
"My friend, no; I will not impose upon your good nature. I perceive
you have an engagement. Luchesi--"
"I have no engagement;--come."
"My friend, no. It is not the engagement, but the severe cold with
which I perceive you are afflicted. The vaults are insufferably damp.
They are encrusted with nitre."
"Let us go, nevertheless. The cold is merely nothing. Amontillado!
You have been imposed upon. And as for Luchesi, he cannot distinguish
Sherry from Amontillado."
Thus speaking, Fortunato possessed himself of my arm. Putting on a mask
of black silk, and drawing a _roquelaire_ closely about my person, I
suffered him to hurry me to my palazzo.
There were no attendants at home; they had absconded to make merry in
honour of the time. I had told them that I should not return until the
morning, and had given them explicit orders not to stir from the house.
These orders were sufficient, I well knew, to insure their immediate
disappearance, one and all, as soon as my back was turned.
I took from their sconces two flambeaux, and giving one to Fortunato,
bowed him through several suites of rooms to the archway that led into
the vaults. I passed down a long and winding staircase, requesting him
to be cautious as he followed. We came at length to the foot of the
descent, and stood together on the damp ground of the catacombs of the
Montresors.
The gait of my friend was unsteady, and the bells upon his cap jingled
as he strode.
"The pipe," said he.
"It is farther on," said I; "but observe the white web-work which
gleams from these cavern walls."
He turned towards me, and looked into my eyes with two filmy orbs that
distilled the rheum of intoxication.
"Nitre?" he asked, at length.
"Nitre," I replied. "How long have you had that cough?"
"Ugh! ugh! ugh!--ugh! ugh! ugh!--ugh! ugh! ugh!--ugh! ugh! ugh!--ugh!
ugh! ugh!"
My poor friend found it impossible to reply for many minutes.
"It is nothing," he said, at last.
"Come," I said, with decision, "we will go back; your health is
precious. You are rich, respected, admired, beloved; you are happy, as
once I was. You are a man to be missed. For me it is no matter. We
will go back; you will be ill, and I cannot be responsible. Besides,
there is Luchesi--"
"Enough," he said; "the cough is a mere nothing; it will not kill me.
I shall not die of a cough."
"True--true," I replied; "and, indeed, I had no intention of alarming
you unnecessarily--but you should use all proper caution. A draught of
this Medoc will defend us from the damps."
Here I knocked off the neck of a bottle which I drew from a long row of
its fellows that lay upon the mould.
"Drink," I said, presenting him the wine.
He raised it to his lips with a leer. He paused and nodded to me
familiarly, while his bells jingled.
"I drink," he said, "to the buried that repose around us."
"And I to your long life."
He again took my arm, and we proceeded.
"These vaults," he said, "are extensive."
"The Montresors," I replied, "were a great and numerous family."
"I forget your arms."
"A huge human foot d'or, in a field azure; the foot crushes a serpent
rampant whose fangs are imbedded in the heel."
"And the motto?"
"_Nemo me impune lacessit_."
"Good!" he said.
The wine sparkled in his eyes and the bells jingled. My own fancy grew
warm with the Medoc. We had passed through walls of piled bones, with
casks and puncheons intermingling, into the inmost recesses of
catacombs. I paused again, and this time I made bold to seize
Fortunato by an arm above the elbow.
"The nitre!" I said; "see, it increases. It hangs like moss upon the
vaults. We are below the river's bed. The drops of moisture trickle
among the bones. Come, we will go back ere it is too late. Your
cough--"
"It is nothing," he said; "let us go on. But first, another draught of
the Medoc."
I broke and reached him a flagon of De Grave. He emptied it at a
breath. His eyes flashed with a fierce light. He laughed and threw
the bottle upwards with a gesticulation I did not understand.
I looked at him in surprise. He repeated the movement--a grotesque one.
"You do not comprehend?" he said.
"Not I," I replied.
"Then you are not of the brotherhood."
"How?"
"You are not of the masons."
"Yes, yes," I said; "yes, yes."
"You? Impossible! A mason?"
"A mason," I replied.
"A sign," he said, "a sign."
"It is this," I answered, producing a trowel from beneath the folds of
my _roquelaire_.
"You jest," he exclaimed, recoiling a few paces. "But let us proceed
to the Amontillado."
"Be it so," I said, replacing the tool beneath the cloak and again
offering him my arm. He leaned upon it heavily. We continued our
route in search of the Amontillado. We passed through a range of low
arches, descended, passed on, and descending again, arrived at a deep
crypt, in which the foulness of the air caused our flambeaux rather to
glow than flame.
At the most remote end of the crypt there appeared another less
spacious. Its walls had been lined with human remains, piled to the
vault overhead, in the fashion of the great catacombs of Paris. Three
sides of this interior crypt were still ornamented in this manner.
From the fourth side the bones had been thrown down, and lay
promiscuously upon the earth, forming at one point a mound of some
size. Within the wall thus exposed by the displacing of the bones, we
perceived a still interior recess, in depth about four feet in width
three, in height six or seven. It seemed to have been constructed for
no especial use within itself, but formed merely the interval between
two of the colossal supports of the roof of the catacombs, and was
backed by one of their circumscribing walls of solid granite.
It was in vain that Fortunato, uplifting his dull torch, endeavoured to
pry into the depth of the recess. Its termination the feeble light did
not enable us to see.
"Proceed," I said; "herein is the Amontillado. As for Luchesi--"
"He is an ignoramus," interrupted my friend, as he stepped unsteadily
forward, while I followed immediately at his heels. In an instant he
had reached the extremity of the niche, and finding his progress
arrested by the rock, stood stupidly bewildered. A moment more and I
had fettered him to the granite. In its surface were two iron staples,
distant from each other about two feet, horizontally. From one of
these depended a short chain, from the other a padlock. Throwing the
links about his waist, it was but the work of a few seconds to secure
it. He was too much astounded to resist. Withdrawing the key I
stepped back from the recess.
"Pass your hand," I said, "over the wall; you cannot help feeling the
nitre. Indeed, it is _very_ damp. Once more let me _implore_ you to
return. No? Then I must positively leave you. But I must first
render you all the little attentions in my power."
"The Amontillado!" ejaculated my friend, not yet recovered from his
astonishment.
"True," I replied; "the Amontillado."
As I said these words I busied myself among the pile of bones of which
I have before spoken. Throwing them aside, I soon uncovered a quantity
of building stone and mortar. With these materials and with the aid of
my trowel, I began vigorously to wall up the entrance of the niche.
I had scarcely laid the first tier of the masonry when I discovered
that the intoxication of Fortunato had in a great measure worn off. The
earliest indication I had of this was a low moaning cry from the depth
of the recess. It was _not_ the cry of a drunken man. There was then a
long and obstinate silence. I laid the second tier, and the third, and
the fourth; and then I heard the furious vibrations of the chain. The
noise lasted for several minutes, during which, that I might hearken to
it with the more satisfaction, I ceased my labours and sat down upon
the bones. When at last the clanking subsided, I resumed the trowel,
and finished without interruption the fifth, the sixth, and the seventh
tier. The wall was now nearly upon a level with my breast. I again
paused, and holding the flambeaux over the mason-work, threw a few
feeble rays upon the figure within.
A succession of loud and shrill screams, bursting suddenly from the
throat of the chained form, seemed to thrust me violently back. For a
brief moment I hesitated--I trembled. Unsheathing my rapier, I began
to grope with it about the recess; but the thought of an instant
reassured me. I placed my hand upon the solid fabric of the catacombs,
and felt satisfied. I reapproached the wall; I replied to the yells of
him who clamoured. I re-echoed--I aided--I surpassed them in volume
and in strength. I did this, and the clamourer grew still.
It was now midnight, and my task was drawing to a close. I had
completed the eighth, the ninth, and the tenth tier. I had finished a
portion of the last and the eleventh; there remained but a single stone
to be fitted and plastered in. I struggled with its weight; I placed
it partially in its destined position. But now there came from out the
niche a low laugh that erected the hairs upon my head. It was
succeeded by a sad voice, which I had difficulty in recognizing as that
of the noble Fortunato. The voice said--
"Ha! ha! ha!--he! he! he!--a very good joke indeed--an excellent jest.
We shall have many a rich laugh about it at the palazzo--he! he!
he!--over our wine--he! he! he!"
"The Amontillado!" I said.
"He! he! he!--he! he! he!--yes, the Amontillado. But is it not getting
late? Will not they be awaiting us at the palazzo, the Lady Fortunato
and the rest? Let us be gone."
"Yes," I said, "let us be gone."
"_For the love of God, Montresor!_"
"Yes," I said, "for the love of God!"
But to these words I hearkened in vain for a reply. I grew impatient.
I called aloud--
"Fortunato!"
No answer. I called again--
"Fortunato--"
No answer still. I thrust a torch through the remaining aperture and
let it fall within. There came forth in reply only a jingling of the
bells. My heart grew sick on account of the dampness of the catacombs.
I hastened to make an end of my labour. I forced the last stone into
its position; I plastered it up. Against the new masonry I re-erected
the old rampart of bones. For the half of a century no mortal has
disturbed them.

@ -0,0 +1,125 @@
(ns com.owoga.tightly-packed-trie.bit-manip
(:require [clojure.string :as string]))
(defn bitstring->int
"Turns a binary string representation of an integer into an integer.
Throws at Integer/MAX_VALUE."
[binary-string]
(Integer/parseInt binary-string 2))
(comment
(map
bitstring->int
["0" "1" "10" "11" "100" (Integer/toBinaryString (bit-shift-right 0xFFFFFFFF 1))])
;; => (0 1 2 3 4 2147483647)
(bitstring->int (Integer/toBinaryString (inc (bit-shift-right 0xFFFFFFFF 1))))
;; Value out of range for int: 2147483648
)
(defn to-binary-string
"For consistency when viewing bytes, 0-pads Integer/toBinaryString as 8 characters.
Assumes b is an unsigned byte!
Truncates all but 8 least significant bits.
(Integer/toBinaryString 3) -> 11
(to-binary-string 3) -> 00000011
(to-binary-string 128 -> 00000000)
"
[b]
(let [s (string/replace
(format "%8s" (Integer/toBinaryString b))
#" "
"0")]
(subs s (- (count s) 8))))
(defn ubyte
"Clojures `byte` is signed, making the max value 127.
This gives us a 'byte'-like thing (It's actually a java.lang.Long),
but it has the same bits as a byte. It just takes up more space in memory."
[b]
(if (> b 255)
(throw (java.lang.IllegalArgumentException.
(format "Value ouf of range for ubyte: %d" b)))
(bit-and 0xff b)))
(defn sbyte
"The reverse of ubyte. Turns a long representation of a byte into
an actual signed byte. If the long has significant bits beyond 8,
they are left as-is. If you're giving this a long that has significant
bits beyond 8, take note: that might not be what you want?"
[b]
(byte (bit-or -0x100 b)))
(defn nth-bit
"Returns as a byte the nth bit of b indexed from the least-significant-bit."
[b n]
(if (bit-test b n)
(byte 1)
(byte 0)))
(comment
(let [n (bitstring->int "0101")]
(->> (range 4)
(map (partial nth-bit n))))
;; => (1 0 1 0)
)
(defn ones-mask
"Returns a long that is reprsented by the binary string of n 1s."
[n]
(reduce (fn [a _] (bit-or 1 (bit-shift-left a 1))) 0 (range n)))
;; Decoding variable-length-encoded numbers.
;;
;; bit-slice and combine-significant-bits
;; are a useful combination when you need to decode
;; a number that is encoded as bytes with flag bits set.
;;
;; Slice the flag bits off of each byte, then
;; combine the significant digits of each byte.
(defn bit-slice
"Start is least-significant bit.
(bit-slice 2 6 10101010)
-> ,1010,
"
[start end b]
(let [mask (bit-shift-left (ones-mask (- end start)) start)]
(bit-shift-right (bit-and b mask) start)))
(defn combine-significant-bits
"Chops off all but the significant bits of each byte
and then 'concats' the bits together into a long.
(combine-significant-bits 7 [10001010 11001011])
-> 0001010 1001011
-> 00010101001011
-> 1355 (As a long...)"
[num-significant-bits & bytes]
(reduce
(fn [a b]
(bit-or b (bit-shift-left a num-significant-bits)))
0
bytes))
(comment
(let [b1 (bitstring->int "0110110")
b2 (bitstring->int "1001001")
;; remove 2 flag bits
slice (partial bit-slice 0 6)
b1' (slice b1)
b2' (slice b2)]
(map
to-binary-string
[b1
b1'
b2
b2'
(combine-significant-bits 6 b1' b2' )]))
;; => ("00110110"
;; "00110110"
;; "01001001"
;; "00001001"
;; "110110001001")
)

@ -0,0 +1,773 @@
(ns com.owoga.tightly-packed-trie.core
(:require [clojure.zip :as zip]
[com.owoga.tightly-packed-trie.encoding :as encoding]
[com.owoga.tightly-packed-trie.bit-manip :as bm]
[clojure.java.io :as io]
[clojure.string :as string])
(:import (java.io ByteArrayOutputStream ByteArrayInputStream
DataOutputStream DataInputStream)))
;; A trie data structure that can be converted to
;; a contiguous array of bytes while maintaining
;; efficient lookups.
;;
;; A regular Clojure hash-map can be used as a trie,
;; but there's a lot of memory overhead that comes along
;; with hash-maps.
;;
;; To conveniently get the benefit of packing a trie into a contiguous array of
;; bytes, there are a few restrictions.
;;
;; Everything must be numeric IDs. Keys must be numeric IDs.
;; Values must be numeric IDs. Everything must be a number.
;;
;; This lets us encode everything as variable-length-encoded byte arrays.
;;
;; To maximize efficiency, your most common keys should have the
;; smallest IDs.
;; It's convenient to work with hash-map representations
;; while developing.
;;
;; {"T" {:children:
;; "I": {:value "IT"}
;; "A": {:value "AT"}}}
;;
;; If you want to work with a hash-map representation and
;; want this library to handle conversion of the trie,
;; then your hash-map version will need to follow a few conventions.
;;
;; For example, this code expects child nodes to be key/value pairs under
;; the :children key.
;; To pack a trie into a contiguous array of bytes
;; and still be able to find a key in a list of
;; child indexes in an efficient way, the child
;; indexes need to be sorted so they can be
;; binary-searched.
;;
;; When working with a hash-map-backed Trie,
;; it's convenient to use update-in to add
;; new values to the trie. But the default
;; update-in creates unsorted hash-maps when it encounters
;; a new key. This has the same functionality of update-in,
;; but new keys are given sorted-map values.
(defn update-in-sorted
"'Updates' a value in a nested associative structure, where ks is a
sequence of keys and f is a function that will take the old value
and any supplied args and return the new value, and returns a new
nested structure. If any levels do not exist, hash-maps will be
created."
{:added "1.0"
:static true}
([m ks f & args]
(let [up (fn up [m ks f args]
(let [m (or m (sorted-map))
[k & ks] ks]
(assert (instance? clojure.lang.PersistentTreeMap m)
(apply str
"A non-sorted hash-map in a sorted"
"hash-map will probably be the seed of some problems."))
(if ks
(assoc m k (up (get m k) ks f args))
(assoc m k (apply f (get m k) args)))))]
(up m ks f args))))
(defn children-of-map-trie-node
"Grab the children of a node.
A node is a map of {node-key {:children {,,,} ,,,}}.
This functions gives you the child nodes not as a single map
where all the child keys are part of the same map, but instead as a
a seq where each value is a single node.
This is a useful helper for turning a Trie from a map into a
depth-first post-order traversable zipper."
[node]
(let [[k {:keys [children]}] (first (seq node))]
(->> children
(map (partial apply hash-map)))))
(comment
(let [root-node {:root {:children {"T" {:some 'val} "U" {:other 'val}}}}]
(children-of-map-trie-node root-node))
;; => ({"T" {:some val}} {"U" {:other val}})
)
(defn without-children [node]
(let [[k v] (first (seq node))]
{k (dissoc v :children)}))
(defn map->depth-first-post-order-traversable-zipperable-vector
[node]
[(vec
(map
map->depth-first-post-order-traversable-zipperable-vector
(children-of-map-trie-node node)))
(without-children node)])
(comment
;; This comment demonstrates how we change the order in which
;; we traverse the map. If we don't turn each node into a vector
;; where the list of children are first, then the parent node would
;; get traversed befor the children. In the example below, the
;; "AT" node would be traversed befor the "SAT" and "TAT" nodes.
(let [m {:root
{:children
{"T"
{:children
{"A" {:children
{"T" {:value "TAT", :count 1}
"S" {:value "SAT" :count 1}}
:value "AT"
:count 1},
"U" {:children {"T" {:value "TUT", :count 1}}}}}}}}]
(let [z (zip/vector-zip
(map->depth-first-post-order-traversable-zipperable-vector m))]
(->> z
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(filter map?))))
;; => ({"T" {:value "TAT", :count 1}}
;; {"S" {:value "SAT", :count 1}}
;; {"A" {:value "AT", :count 1}}
;; {"T" {:value "TUT", :count 1}}
;; {"U" {}}
;; {"T" {}}
;; {:root {}})
)
(defn depth-first-post-order-traversable-zipperable-vector->map
"Parity reversal of the code above. Maps are easier to reason about
while developing. It's just inconvenient to traverse them in the
order needed by the algorithms we use to pack them into a contiguous
array of byte for the tightly packed trie."
[node]
(let [children (first node)
parent (second node)
[parent-key parent-val] (first (seq parent))]
(sorted-map
parent-key
(assoc
parent-val
:children
(into
(sorted-map)
(map depth-first-post-order-traversable-zipperable-vector->map children))))))
(comment
(let [m {:root
{:children
{"T"
{:children
{"A" {:children
{"T" {:value "TAT", :count 1}
"S" {:value "SAT" :count 1}}
:value "AT"
:count 1},
"U" {:children {"T" {:value "TUT", :count 1}}}}}}}}]
(let [vect (map->depth-first-post-order-traversable-zipperable-vector m)]
(depth-first-post-order-traversable-zipperable-vector->map vect)))
;; => {:root
;; {:children
;; {"T"
;; {:children
;; {"A"
;; {:value "AT",
;; :count 1,
;; :children
;; {"S" {:value "SAT", :count 1, :children {}},
;; "T" {:value "TAT", :count 1, :children {}}}},
;; "U" {:children {"T" {:value "TUT", :count 1, :children {}}}}}}}}}
)
;;;; Zipper utilities
;;
;; Some useful functions for traversing/transforming zippers.
;;
;; This is what a tree looks like.
;;
;; [1 [2 [3 4 5] [6 7]]]
;; { }
;; / \
;; / \
;; 1 [2 [3 4 5] [6 7]]
;; / | \
;; / | \
;; 2 [3 4 5] [6 7]
;;
;; Clojure's zipper gives us convenient ways
;; to iterate over the nodes of the zipper.
;;
;; We can use the functions below to
;; recreate the functionality of
;; (map #(if (even? %) (* % %) %) coll)
;; but with zippers instead of collections.
(defn visitor-next
"Visits every loc in a zipper and calls zip/next with the result of applying f
to the loc.
Goes without saying that f should return a loc."
[zipper f]
(loop [loc zipper]
(if (zip/end? loc)
(zip/root loc)
(recur (zip/next (f loc))))))
(defn visitor-prev
"visitor-next in reverse."
[zipper f]
(loop [loc zipper]
(if (nil? (zip/prev loc))
(zip/root (f loc))
(recur (zip/prev (f loc))))))
(defn visitor-filter
"Helper for traversing a zipper with a visitor function.
Convenient for pulling filtering logic out of a visitor function
into composable and specific filter functions."
[pred visitor]
(fn [loc]
(if (pred loc)
(visitor loc)
loc)))
(comment
;; This comment shows an example of using the visitor
;; helpers to recreate the functionality similar to
;; (->> [1 2 3 4 5]
;; (map #(if (even? %) (* % %) %))
;;
;; This first example has the conditional logic inside
;; the visitor function.
(let [zipper (zip/vector-zip [1 [2 [3 4 5] [6 7]]])]
(visitor-next
zipper
#(if (and (int? (zip/node %)) (even? (zip/node %)))
(zip/edit % (fn [n] (int (Math/pow n 2))))
%)))
;; => [1 [4 [3 16 5] [36 7]]]
;;
;; This second example has a seperate filter predicate
;; from the transformation visitor and combines them
;; with the visitor-filter function.
(let [zipper (zip/vector-zip [1 [2 [3 4 5] [6 7]]])
pred (fn [loc]
(let [node (zip/node loc)]
(and (int? node)
(even? node))))
transform (fn [loc]
(zip/edit loc (fn [n] (int (Math/pow n 2)))))]
(visitor-next zipper (visitor-filter pred transform)))
;; => [1 [4 [3 16 5] [36 7]]]
)
(defn loc-children
"Takes a zipper loc and returns seq of children locs.
Written to work with zippers of a particular structure:
[[child1, child2, ,,,] parent]"
[loc]
(if (and (zip/left loc)
(zip/down (zip/left loc)))
(let [children (zip/down (zip/left loc))]
(->> children
(iterate zip/right)
(take-while (complement nil?))))))
(comment
(let [v [[1 2 3] :parent]
z (zip/vector-zip v)]
(->> z
zip/down
zip/right
loc-children
(map zip/node)))
;; => (1 2 3)
)
;;;; Tightly Packing Tries
;;
;; These next functions are all helpers
;; related to byte-packing nodes in preperation
;; for writing them to a byte stream.
(defn previous-node [loc]
(loop [loc (zip/prev loc)]
(cond
(nil? loc) nil
(map? (zip/node loc)) loc
:else (recur (zip/prev loc)))))
(defn loc->byte-address
"Given a loc without a byte-address, calculate it from the previous loc.
0x00|node val of previous loc |
0x01|size of child index |
0x02|child1 key |
0x03|child1 byte address offset|
0x04|child2 key |
0x05|child2 byte address offset|
0x??|node val of current loc |
We obviously wouldn't need this if we were writing directly to a
ByteBuffer. Whatever position we are at is our address.
But if we want to maintain the byte-packed data as
part of the map- or vector-like trie structure, then
we need this."
([loc]
(loc->byte-address loc 0))
([loc starting-offset]
(let [prev (previous-node loc)]
(if prev
(let [[k {:keys [byte-address byte-array]}] (first (seq (zip/node prev)))]
(+ byte-address (count byte-array)))
starting-offset))))
(defn child->index
"Given a child gets a map with info needed to build an index.
The below info is just a little bit of lagniappe. The only thing we're
pulling off the child are the values of the keys :byte-address and :key.
The encoding will happen later.
But for reference, the encoding will be:
The index will be a list of pairs of variable-length encoded bytes.
The first number of the pair, the bytes will be encoded with a flag bit of 1.
The second number of the pair, the bytes will be encoded with a flag bit of 0."
[child]
(let [[k {:keys [byte-address byte-array] :as v}] (first (seq child))]
{:byte-address byte-address
:key k}))
(defn child-node-key-and-offset-from-parent-as-byte-array
[{:keys [key offset]}]
(let [baos (ByteArrayOutputStream.)]
(.write baos (encoding/encode-key-to-tightly-packed-trie-index key))
(.write baos (encoding/encode-offset-to-tightly-packed-trie-index offset))
(.toByteArray baos)))
(defn pack-node-value
"Returns byte-array of node value.
Byte-array is 2 variable-length encoded numbers.
For a markov trie, this would be an number ID
of the n-gram and an number of the frequency.
Nodes without terminal values get a value and count of 0."
[node]
(let [baos (ByteArrayOutputStream.)]
(.write baos (encoding/encode (get node :value 0)))
(.write baos (encoding/encode (get node :count 0)))
(.toByteArray baos)))
(defn transform-trie-add-byte-pack-to-each-node
"Visitor for a vector-based trie of structure [[child1, child2, ,,,,] parent].
Transforms each node adding keys for a byte-array of the node and its children index key/offsets
as well as a key for its own offset.
Starts at byte 8, reserving the first 8 bytes for the for root address won't be known
until the end of the zipper.
Transforming the trie to add these keys is an intermediary step that can probably be
bypassed in the future by writing directly to a ByteArrayOutputStream."
[loc]
(let [baos (ByteArrayOutputStream.)
;; Byte-address of the current node. Gets calculated from the
;; previous node's byte address and the size of the previous
;; node's byte array.
byte-address (loc->byte-address loc 8)
child-nodes (->> loc
loc-children
(map (comp second zip/node)))
;; For child, we need to know the offset of the child node's address
;; from this parent node's address.
children (map
(fn [child-node]
(let [child-index (child->index child-node)]
(assoc
child-index
:offset
(- byte-address (:byte-address child-index)))))
child-nodes)
;; Create the byte array of the index of the children
index-ba (let [index-baos (ByteArrayOutputStream.)
child-byte-arrays
(map
child-node-key-and-offset-from-parent-as-byte-array
children)]
(loop [bas child-byte-arrays]
(if (empty? bas)
(.toByteArray index-baos)
(do (.write index-baos (first bas))
(recur (rest bas))))))]
(zip/edit
loc
(fn [node]
(let [[k v] (first (seq node))]
(.write baos (pack-node-value v))
(.write baos (encoding/encode (count index-ba)))
(.write baos index-ba)
{k (conj v {:byte-address byte-address
:byte-array (.toByteArray baos)})})))))
(defprotocol ITrie
(as-map [this] "Map that underlies trie.")
(as-vec [this] "Depth-first post-order vector.")
(as-byte-array [this] (str "Add key/values to each node containing"
" the tightly-packed byte-array representation of the node."))
(transform [this f] "Depth-first post-order apply each function to each node."))
(def not-found# (gensym))
(deftype Trie [f trie]
ITrie
(as-map [_] trie)
(as-vec [_] (map->depth-first-post-order-traversable-zipperable-vector trie))
(as-byte-array [self]
(->> (transform
self
(visitor-filter
#(map? (zip/node %))
transform-trie-add-byte-pack-to-each-node))))
(transform [self f]
(->> self
as-vec
zip/vector-zip
(#(visitor-next % f))
depth-first-post-order-traversable-zipperable-vector->map
(Trie. f)))
;; By returning lookups in the same strucure as root-level nodes
;; but with the root at the found descendent, we can treat this
;; descendant as a new root Trie node. This gives us the advantage
;; of being able to re-use the Trie functions on sub-nodes.
clojure.lang.ILookup
(valAt [_ ks]
(let [v (get-in trie (cons :root (interleave (repeat :children) ks)) not-found#)]
(if (= v not-found#)
(throw (Exception. (format "Key not found: %s" ks)))
(Trie. f (sorted-map (last ks) v)))))
(valAt [_ ks not-found]
(Trie. f (sorted-map
(last ks)
(get-in trie (cons :root (interleave (repeat :children) ks)) not-found))))
clojure.lang.IPersistentCollection
(seq [self]
(->> self
as-vec
zip/vector-zip
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(filter map?)
(filter (comp :value second first seq))))
(cons [_ o]
(let [path (cons :root (interleave (repeat :children) (butlast o)))
node (get-in trie path)]
(Trie. f (update-in-sorted trie path f o))))
(empty [_] (Trie. f {}))
(equiv [_ o]
(and (isa? (class o) Trie)
(= (as-map o) trie))))
(defn trie
([] (->Trie
(fn update-fn [prev cur]
(if (nil? prev)
{:value (last cur)
:count 1}
(-> prev
(update :count (fnil inc 0))
(assoc :value (last cur)))))
(sorted-map)))
([& ks]
(reduce
(fn [t k]
(conj t k))
(trie)
ks)))
(comment
(let [v1 '(1 2 3 123)
v2 '(1 2 12)
v3 '(1 2 2 122)
v4 '(1 3 1 131)
t1 (trie v1 v2 v3 v4)]
{:as-vec (as-vec (get t1 '(1 2)))
:as-map (as-map (get t1 '(1 2)))
:as-byte-array (as-byte-array (get t1 '(1 2)))})
;; => {:as-vec
;; [[[[] {2 {:value 122, :count 1}}] [[] {3 {:value 123, :count 1}}]]
;; {2 {:count 1, :value 12}}],
;; :as-map
;; {2
;; {:children {2 {:value 122, :count 1}, 3 {:value 123, :count 1}},
;; :count 1,
;; :value 12}},
;; :as-byte-array
;; {2
;; {:byte-address 14,
;; :byte-array [-116, -127, -124, -126, 6, -125, 3],
;; :children
;; {2
;; {:value 122,
;; :count 1,
;; :byte-address 8,
;; :byte-array [-6, -127, -128],
;; :children {}},
;; 3
;; {:value 123,
;; :count 1,
;; :byte-address 11,
;; :byte-array [-5, -127, -128],
;; :children {}}},
;; :count 1,
;; :value 12}}}
)
(defn rewind-to-key [bb stop]
(loop []
(let [current (.get bb (.position bb))
previous (.get bb (dec (.position bb)))]
(if (or (= stop (.position bb))
(and (encoding/key-byte? current)
(encoding/offset-byte? previous)))
bb
(do (.position bb (dec (.position bb)))
(recur))))))
(defn forward-to-key [bb stop]
(loop []
(if (or (= stop (.position bb))
(and (encoding/key-byte? (.get bb (.position bb)))
(encoding/offset-byte?
(.get bb (inc (.position bb))))))
bb
(do (.position bb (inc (.position bb)))
(recur)))))
(defn find-key-in-index
[bb target-key max-address not-found]
(.limit bb max-address)
(let [key
(loop [previous-key nil
min-position (.position bb)
max-position max-address]
(if (zero? (- max-position min-position))
not-found
(let [mid-position (+ min-position (quot (- max-position min-position) 2))]
(.position bb mid-position)
(let [bb (rewind-to-key bb min-position)
current-key
(encoding/decode-number-from-tightly-packed-trie-index bb)]
(cond
(= current-key target-key)
(encoding/decode-number-from-tightly-packed-trie-index bb)
(= current-key previous-key)
(do
(encoding/decode-number-from-tightly-packed-trie-index bb)
(let [final-key (encoding/decode-number-from-tightly-packed-trie-index bb)]
(if (= target-key final-key)
(encoding/decode-number-from-tightly-packed-trie-index bb)
(throw (Exception. "Key not found.")))))
(< current-key target-key)
;; Chew the next decoded number. It's a useless offset.
(do
(encoding/decode-number-from-tightly-packed-trie-index bb)
(recur
current-key
(.position bb)
max-position))
(> current-key target-key)
;; This could also be rewound.
(do
(rewind-to-key bb min-position)
(recur
current-key
min-position
(.position bb))))))))]
(.limit bb (.capacity bb))
key))
(defn tightly-packed-trie-node-value
[byte-buffer]
(let [value (encoding/decode byte-buffer)
freq (encoding/decode byte-buffer)]
{:value value
:count freq}))
(defmacro wrap-byte-buffer
"Saves the position and limit of a byte buffer, runs body,
returns byte buffer to original position and limit."
[byte-buffer & body]
`(let [original-position# (.position ~byte-buffer)
original-limit# (.limit ~byte-buffer)]
(try (do ~@body)
(finally
(.limit ~byte-buffer original-limit#)
(.position ~byte-buffer original-position#)))))
(defprotocol ITightlyPackedTrie
(children [self] "Immediate children of a node.")
(value [self] "Return node value, disassociated with children."))
(deftype TightlyPackedTrie [byte-buffer key address limit]
ITightlyPackedTrie
(value [self]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(tightly-packed-trie-node-value byte-buffer)))
(children [self]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(let [val (encoding/decode byte-buffer)
freq (encoding/decode byte-buffer)
size-of-index (encoding/decode byte-buffer)]
(.limit byte-buffer (+ (.position byte-buffer)
size-of-index))
(loop [children []]
(if (= (.position byte-buffer) (.limit byte-buffer))
children
(let [child-key (encoding/decode-number-from-tightly-packed-trie-index byte-buffer)
child-offset (encoding/decode-number-from-tightly-packed-trie-index byte-buffer)]
(recur
(conj
children
(TightlyPackedTrie.
byte-buffer
child-key
(- address child-offset)
(.capacity byte-buffer))))))))))
clojure.lang.ILookup
(valAt [self ks]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(if (empty? ks)
self
(let [val (encoding/decode byte-buffer)
freq (encoding/decode byte-buffer)
size-of-index (encoding/decode byte-buffer)
offset (find-key-in-index
byte-buffer
(first ks)
(+ (.position byte-buffer) size-of-index)
:not-found)]
(if (= offset :not-found)
(throw (Exception. (format "Index not found %s" ks)))
(let [child (TightlyPackedTrie.
byte-buffer
(first ks)
(- address offset)
(.capacity byte-buffer))]
(get child (rest ks))))))))
(valAt [self ks not-found]
(wrap-byte-buffer
byte-buffer
(.limit byte-buffer limit)
(.position byte-buffer address)
(if (empty? ks)
self
(let [val (encoding/decode byte-buffer)
freq (encoding/decode byte-buffer)
size-of-index (encoding/decode byte-buffer)
offset (find-key-in-index
byte-buffer
(first ks)
(+ (.position byte-buffer) size-of-index)
:not-found)]
(if (= offset :not-found)
not-found
(let [child (TightlyPackedTrie.
byte-buffer
(first ks)
(- address offset)
(.capacity byte-buffer))]
(get child (rest ks)))))))))
(defn tightly-packed-trie
"Assumes the trie has been transformed so that each node
includes a :byte-array key to the byte array that needs to be written
for that node and a :byte-address key that has been calculated
with an offset of 8. (The first 8 bytes are reserved for the root address.)"
[trie]
(let [baos (ByteArrayOutputStream.)
trie (as-byte-array trie)]
;; This transform writes to the ByteArrayOutputStream.
(transform
trie
(visitor-filter
#(map? (zip/node %))
(fn [loc]
(let [{:keys [byte-array]} (second (first (seq (zip/node loc))))]
(.write baos byte-array)
loc))))
(let [ba (.toByteArray baos)
root-address (get-in (as-map trie) [:root :byte-address])
byte-buf (java.nio.ByteBuffer/allocate (+ 8 (count ba)))]
(.putLong byte-buf root-address)
(.put byte-buf ba)
(.rewind byte-buf)
(->TightlyPackedTrie byte-buf 0 (.getLong byte-buf) (.capacity byte-buf)))))
(defn zipper-tpt
"Turns a tightly-packed trie into a zipper.
Since the byte buffer that backs the trie can't be edited,
`make-node` and the zipper edit functions won't work."
[tpt]
(let [branch? (fn branch? [node]
(and (instance? TightlyPackedTrie node)
(not-empty (children node))))
zipper-children (fn zippper-children [node]
(children node))
make-node (fn make-node [node children]
(throw (Exception. "Can't add children to Tightly Packed Trie nodes.")))]
(zip/zipper branch? zipper-children make-node tpt)))
(comment
(let [v1 '(1 2 3 123)
v2 '(1 2 12)
v3 '(1 2 2 122)
v4 '(1 3 1 131)
t1 (trie v1 v2 v3 v4)
tpt (tightly-packed-trie t1)]
(->> tpt
zipper-tpt
(iterate zip/next)
(take-while (complement zip/end?))
(map zip/node)
(map #(hash-map (.key %) (value %)))))
)
;; TODO: Shared "save" interface for Trie?
(defn save-tightly-packed-trie-to-file
[filepath trie]
(with-open [o (io/output-stream filepath)]
(.write o (.array (.byte-buffer trie)))))
(defn load-tightly-packed-trie-from-file
[filepath]
(with-open [i (io/input-stream filepath)
baos (ByteArrayOutputStream.)]
(io/copy i baos)
(let [byte-buffer (java.nio.ByteBuffer/wrap (.toByteArray baos))]
(.rewind byte-buffer)
(->TightlyPackedTrie byte-buffer 0 (.getLong byte-buffer) (.capacity byte-buffer)))))

@ -0,0 +1,121 @@
(ns com.owoga.tightly-packed-trie.encoding
(:require [clojure.string :as string]
[clojure.java.io :as io]
[com.owoga.tightly-packed-trie.bit-manip :as bm])
(:import (java.nio ByteBuffer)))
(defn encode
"Encodes number as a variable-length byte-array.
The first bit of each byte in the array is a flag bit
that specifies whether to read the next byte.
To encode: Take the first 7 bits and put it in a byte
that has the most significant bit set to 1. This is the flag
bit that tells us that this byte is the last byte in the encoded number.
If the number didn't fit into those first 7 bits, bit-shift-right 7 bits
and grow the list of bytes by taking 7 bits at a time from the number
while leaving the flag bit on every byte other than the first as 0.
(->> [0 1 2 127 128 129]
(map encode)
(map #(map bm/to-binary-string %)))
;; => ((\"10000000\")
;; (\"10000001\")
;; (\"10000010\")
;; (\"11111111\")
;; (\"00000001\" \"10000000\")
;; (\"00000001\" \"10000001\"))
To decode: if the flag bit is not set, read the next byte and
concat the last 7 bits of the current byte to
the last 7 bits of the next byte."
[n]
(loop [b (list (bit-set (mod n 0x80) 7))
n (quot n 0x80)]
(if (zero? n)
(byte-array b)
(recur (cons (mod n 0x80) b) (quot n 0x80)))))
(comment
(->> [0 1 2 127 128 129]
(map encode)
(map #(map bm/to-binary-string %)))
;; => (("10000000")
;; ("10000001")
;; ("10000010")
;; ("11111111")
;; ("00000001" "10000000")
;; ("00000001" "10000001"))
)
(defn decode
"Decode one variable-length-encoded number from a ByteBuffer,
advancing the buffer's position to the byte following the encoded number."
[byte-buffer]
(loop [bytes (list (.get byte-buffer))]
(if (bit-test (first bytes) 7)
(->> (cons (bit-clear (first bytes) 7) (rest bytes))
reverse
(map bm/ubyte)
(apply (partial bm/combine-significant-bits 7)))
(recur (cons (.get byte-buffer) bytes)))))
(comment
(->> [0 1 2 127 128 129 9876543210]
(map encode)
(map #(java.nio.ByteBuffer/wrap %))
(map decode))
;; => (0 1 2 127 128 129 9876543210)
)
(defn key-byte? [b]
(bit-test b 7))
(def offset-byte? (complement key-byte?))
(defn encode-key-to-tightly-packed-trie-index
[n]
(->> n encode (map #(bit-set % 7)) byte-array))
(defn encode-offset-to-tightly-packed-trie-index
[n]
(->> n encode (map #(bit-clear % 7)) byte-array))
(defn decode-number-from-tightly-packed-trie-index
([byte-buffer]
(let [first-byte (.get byte-buffer)
continue? (fn []
(and (.hasRemaining byte-buffer)
(= (key-byte? (.get byte-buffer (.position byte-buffer)))
(key-byte? first-byte))))]
(loop [bytes [first-byte]]
(if (continue?)
(recur (conj bytes (.get byte-buffer)))
(->> bytes
(map (partial bit-and 0xFF))
(map #(bit-clear % 7))
(apply (partial bm/combine-significant-bits 7))))))))
(bm/to-binary-string 0xff)
(comment
(let [byte-buffer (java.nio.ByteBuffer/allocate 64)]
(.put byte-buffer (encode-key-to-tightly-packed-trie-index 0))
(.put byte-buffer (encode-offset-to-tightly-packed-trie-index 1))
(.put byte-buffer (encode-key-to-tightly-packed-trie-index 9876543210))
(.put byte-buffer (encode-offset-to-tightly-packed-trie-index 1234567890))
(.limit byte-buffer (.position byte-buffer))
(.rewind byte-buffer)
[(decode-number-from-tightly-packed-trie-index byte-buffer)
(decode-number-from-tightly-packed-trie-index byte-buffer)
(decode-number-from-tightly-packed-trie-index byte-buffer)
(decode-number-from-tightly-packed-trie-index byte-buffer)])
;; => [0 1 9876543210 1234567890]
)
(defn slurp-bytes [x]
(with-open [out (java.io.ByteArrayOutputStream.)]
(io/copy (io/input-stream x) out)
(.toByteArray out)))
Loading…
Cancel
Save