You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

11 KiB

Capstone Documentation

C

Design and develop a fully functional data product that addresses your identified business problem or organizational need. Include each of the following attributes as they are the minimum required elements for the product:

one descriptive method and one non-descriptive (predictive or prescriptive) method

Descriptive Method

Most common sentence structures

Here is the code to generate a report on the most common sentence structures given a directory of lyrics files.

(require '[com.owoga.corpus.markov :as markov]
         '[com.owoga.prhyme.nlp.core :as nlp]
         '[clojure.string :as string]
         '[clojure.java.io :as io])

(let [lines (transduce
             (comp
              (map slurp)
              (map #(string/split % #"\n"))
              (map (partial remove empty?))
              (map nlp/structure-freqs))
             merge
             {}
             (eduction (markov/xf-file-seq 0 10) (file-seq (io/file "/home/eihli/src/prhyme/dark-corpus"))))]
  (take 5 (sort-by (comp - second) lines)))
(TOP (NP (NNP) (.))) 6
(TOP (S (NP (PRP)) (VP (VBP) (ADJP (JJ))) (.))) 6
(INC (NP (JJ) (NN)) nil (IN) (NP (DT)) (NP (PRP)) (VBP)) 4
(TOP (NP (NP (JJ) (NN)) nil (NP (NN) (CC) (NN)))) 4
(TOP (S (NP (JJ) (NN)) nil (VP (VBG) (ADJP (JJ))))) 4

Prescriptive Method

Most likely next words
(require '[com.darklimericks.server.models :as models]
         '[com.owoga.trie :as trie])

(let [seed ["bother" "me"]
      seed-ids (map models/database seed)
      lookup (reverse seed-ids)
      results (trie/children (trie/lookup models/markov-trie lookup))]
  (->> results
       (map #(get % []))
       (sort-by (comp - second))
       (map #(update % 0 models/database))
       (take 10)))
don't 36
doesn't 21
to 14
won't 9
really 5
not 4
you 4
it 3
even 3
shouldn't 3

collected or available datasets

The dataset currently in use is in /dark-corpus. Further dataset will need to be provided by the end-user.

Decision support functionality

Choosing words for a lyric based on markov likelihood

Choosing words to complete a lyric based on rhyme quality

(require '[com.darklimericks.linguistics.core :as linguistics])

(let [results
      (linguistics/rhymes-with-frequencies-and-rhyme-quality
       "bother me"
       models/markov-trie
       models/database)]
  (->> results
       (map
        (fn [[rhyming-word
              rhyming-word-phones
              frequency-count-of-rhyming-word
              target-word
              target-word-phones
              rhyme-quality]]
          [rhyming-word frequency-count-of-rhyming-word rhyme-quality]))
       (take 10)
       (vec)
       (into [["rhyme" "frequency count" "rhyme quality"]])))
rhyme frequency count rhyme quality
honoree 2 7
referee 3 6
repartee 2 6
nominee 2 6
undersea 1 6
oversea 1 6
rosemarie 0 6
disagree 180 5
poverty 175 5
mockery 122 5

Ability to support featurizing, parsing, cleaning, and wrangling datasets

The data processing code is in prhyme

Each line gets tokenized using a regular expression to split the string into tokens.

(def re-word
  "Regex for tokenizing a string into words
  (including contractions and hyphenations),
  commas, periods, and newlines."
  #"(?s).*?([a-zA-Z\d]+(?:['\-]?[a-zA-Z]+)?|,|\.|\?|\n)")

Along with tokenization, the lines get stripped of whitespace and converted to lowercase. This conversion is done so that words can be compared: "Foo" is the same as "foo".

(def xf-tokenize
  (comp
   (map string/trim)
   (map (partial re-seq re-word))
   (map (partial map second))
   (map (partial mapv string/lower-case))))

methods and algorithms supporting data exploration and preparation

The primary data structure and algorithms supporting exploration of the data are a Markov Trie

The Trie data structure suppors a lookup function that returns the child trie at a certain lookup key and a children function that returns all of the immediate children of a particular Trie.

(defprotocol ITrie
  (children [self] "Immediate children of a node.")
  (lookup [self ^clojure.lang.PersistentList ks] "Return node at key."))

(deftype Trie [key value ^clojure.lang.PersistentTreeMap children-]
  ITrie
  (children [trie]
    (map
     (fn [[k ^Trie child]]
       (Trie. k
              (.value child)
              (.children- child)))
     children-))

  (lookup [trie k]
    (loop [k k
           trie trie]
      (cond
        ;; Allows `update` to work the same as with maps... can use `fnil`.
        ;; (nil? trie') (throw (Exception. (format "Key not found: %s" k)))
        (nil? trie) nil
        (empty? k)
        (Trie. (.key trie)
               (.value trie)
               (.children- trie))
        :else (recur
               (rest k)
               (get (.children- trie) (first k))))))

data visualization functionalities for data exploration and inspection

implementation of interactive queries

Interactive query capability at https://darklimericks.com/wgu.

implementation of machine-learning methods and algorithms

Functions for training both forwards and backwards

(defn file-seq->markov-trie
  "For forwards markov."
  [database files n m]
  (transduce
   (comp
    (map slurp)
    (map #(string/split % #"[\n+\?\.]"))
    (map (partial transduce data-transform/xf-tokenize conj))
    (map (partial transduce data-transform/xf-filter-english conj))
    (map (partial remove empty?))
    (map (partial into [] (data-transform/xf-pad-tokens (dec m) "<s>" 1 "</s>")))
    (map (partial mapcat (partial data-transform/n-to-m-partitions n (inc m))))
    (mapcat (partial mapv (data-transform/make-database-processor database))))
   (completing
    (fn [trie lookup]
      (update trie lookup (fnil #(update % 1 inc) [lookup 0]))))
   (trie/make-trie)
   files))

(comment
  (let [files (->> "dark-corpus"
                   io/file
                   file-seq
                   (eduction (xf-file-seq 501 2)))
        database (atom {:next-id 1})
        trie (file-seq->markov-trie database files 1 3)]
    [(take 20 trie)
     (map (comp (partial map @database) first) (take 20 (drop 105 trie)))
     (take 10 @database)])
  ;; [([(1 1 2) [[1 1 2] 1]]
  ;;   [(1 1 3) [[1 1 3] 1]]
  ;;   [(1 1 7) [[1 1 7] 2]]
  ;;   [(1 1 9) [[1 1 9] 3]]
  ;;   [(1 1 16) [[1 1 16] 4]])
  ;;  (("<s>" "call" "me")
  ;;   ("<s>" "call")
  ;;   ("<s>" "right" "</s>")
  ;;   ("<s>" "right")
  ;;   ("<s>" "that's" "proportional")
  ;;   ("<s>" "that's")
  ;;   ("<s>" "don't" "</s>")
  ;;   ("<s>" "don't")
  ;;   ("<s>" "yourself" "in")
  ;;   ("<s>" "yourself")
  ;;   ("<s>" "transformation" "</s>")
  ;;   ("<s>" "transformation")
  ;;   ("<s>")
  ;;   ("them" "from" "their")
  ;;   ("them" "from")
  ;;   ("them")
  ;;   ("from" "their" "pain")
  ;;   ("from" "their")
  ;;   ("from" "your" "side")
  ;;   ("from" "your"))
  ;;  (["come" 92]
  ;;   ["summer" 17]
  ;;   ["more" 101]
  ;;   [121 "that's"]
  ;;   [65 "by"]
  ;;   ["dust" 133]
  ;;   [70 "said"]
  ;;   ["misery" 128]
  ;;   [62 "get"]
  ;;   [74 "gone"])]
  )
(defn train-backwards
  "For building lines backwards so they can be seeded with a target rhyme."
  [files n m trie-filepath database-filepath tightly-packed-trie-filepath]
  (let [database (atom {:next-id 1})
        trie (file-seq->backwards-markov-trie database files n m)]
    (nippy/freeze-to-file trie-filepath (seq trie))
    (println "Froze" trie-filepath)
    (nippy/freeze-to-file database-filepath @database)
    (println "Froze" database-filepath)
    (save-tightly-packed-trie trie database tightly-packed-trie-filepath)
    (let [loaded-trie (->> trie-filepath
                           nippy/thaw-from-file
                           (into (trie/make-trie)))
          loaded-db (->> database-filepath
                         nippy/thaw-from-file)
          loaded-tightly-packed-trie (tpt/load-tightly-packed-trie-from-file
                                      tightly-packed-trie-filepath
                                      (decode-fn loaded-db))]
      (println "Loaded trie:" (take 5 loaded-trie))
      (println "Loaded database:" (take 5 loaded-db))
      (println "Loaded tightly-packed-trie:" (take 5 loaded-tightly-packed-trie))
      (println "Successfully loaded trie and database."))))

(comment
  (time
   (let [files (->> "dark-corpus"
                    io/file
                    file-seq
                    (eduction (xf-file-seq 0 250000)))
         [trie database] (train-backwards
                          files
                          1
                          5
                          "/home/eihli/.models/markov-trie-4-gram-backwards.bin"
                          "/home/eihli/.models/markov-database-4-gram-backwards.bin"
                          "/home/eihli/.models/markov-tightly-packed-trie-4-gram-backwards.bin")]))

  (time
   (def markov-trie (into (trie/make-trie) (nippy/thaw-from-file "/home/eihli/.models/markov-trie-4-gram-backwards.bin"))))
  (time
   (def database (nippy/thaw-from-file "/home/eihli/.models/markov-database-4-gram-backwards.bin")))
  (time
   (def markov-tight-trie
     (tpt/load-tightly-packed-trie-from-file
      "/home/eihli/.models/markov-tightly-packed-trie-4-gram-backwards.bin"
      (decode-fn database))))
  (take 20 markov-tight-trie)
  )

functionalities to evaluate the accuracy of the data product

industry-appropriate security features

tools to monitor and maintain the product

a user-friendly, functional dashboard that includes at least three visualization types

Documentation

  1. Create each of the following forms of documentation for the product you have developed:

Business Vision

Provide rhyming lyric suggestions optionally constrained by syllable count.

Data Sets

See resources/darklyrics-markov.tpt

Data Analysis

See src/com/owoga/darklyrics/core.clj

See https://github.com/eihli/prhyme

Assessment

See visualization of rhyme suggestion in action.

See perplexity?

Visualizations

See visualization of smoothing technique.

See wordcloud

Accuracy

• assessment of the products accuracy

Testing

• the results from the data product testing, revisions, and optimization based on the provided plans, including screenshots

Source

• source code and executable file(s)

Quick Start

• a quick start guide summarizing the steps necessary to install and use the product

Notes

http-kit doesn't support https so no need to bother with keystore stuff like you would with jetty. Just proxy from haproxy.