
#+TITLE: RhymeStorm - WGU CSCI Capstone Project

:PROPERTIES:
:END:

* A. Letter Of Transmittal

Create a letter of transmittal and a project proposal to convince senior, non-technical managers and executives to implement the data product you have designed. The proposal should include each of the following:

** Problem Summary

Songwriters, artists, and record labels can save time and discover better lyrics with the help of a machine learning tool that supports their creative endeavours.

Songwriters have several old-fashioned tools at their disposal including dictionaries and thesauruses. But machine learning exposes a new set of powerful possibilities. Using simple machine learning techniques, it is possible to automatically generate vast numbers of lyrics that match specified criteria for rhyming, syllable count, genre, and more.

** a description of how the data product benefits the customer and supports the decision-making process

How many sensible phrases can you think of that rhyme with "war on poverty"? What if the phrases must also be exactly 14 syllables long? That's a common restriction when a songwriter is trying to match the meter of a previous line. What if there must also be primary stress at certain positions in that 14-syllable phrase?

This is the process that songwriters go through all day. It's a process that gets little help from traditional tools like dictionaries and thesauruses.

And this is a process that is perfect for machine learning. Machine learning can learn the most likely grammatical structure of phrases and can make predictions about likely words that follow a given sequence of other words. Computers can iterate through millions of words, checking for restrictions on rhyme, syllable count, and more. The most tedious part of lyric generation can be automated with machine learning software, leaving the songwriter free to cherry-pick from the best lyrics and make minor touch-ups to make them perfect.

** an outline of the data product

The machine learning part of the software described above can be implemented with a simple machine learning technique known as a Markov model.

Without getting too technical, building the Markov model involves using an existing lyrics database as input; the output is a function that returns the likelihood of a word following a sequence of previous words.

Many different programming languages and algorithms are sufficient to handle the other parts of the product, like splitting a word into phonetic sounds, finding rhymes, and matching stress between phrases.

An initial version of the software will be trained on the heavy metal lyrics database at http://darklyrics.com and a website will be created where users can type in a "seed" sequence of word(s) and the model will output a variety of possible completions.

This auto-complete functionality will be similar to the auto-complete commonly found in phone keyboard applications, which helps users type faster on touchscreens.

** a description of the data that will be used to construct the data product

The initial model will be trained on the lyrics from http://darklyrics.com. This is a publicly available data set with minimal meta-data. Record labels will have more valuable datasets that include meta-data along with lyrics, such as the date the song was popular, the number of radio plays of the song, and the profit of the song/artist.

The software can be augmented with additional algorithms to account for the type of meta-data that a record label may have. The augmentations can happen in iterative software development cycles, using Agile methodologies.

** the objectives and hypotheses of the project

This software will accomplish its primary objective if it makes its way into the daily toolkit of a handful of singers/songwriters.

Several secondary objectives are also desirable and reasonably expected. The architecture of the software lends itself to existing as several independently useful modules.

For example, the Markov Model can be conveniently backed by a Trie data structure. This Trie data structure can be released as its own software package and used in any application that benefits from prefix matching.

Another example is the package that turns phrases into phones. That package can be useful for a number of natural language processing and natural language generation tasks, aside from the task required by this particular project.

** an outline of the project methodology

** funding requirements

** the impact of the solution on stakeholders

** ethical and legal considerations and precautions that will be used when working with and communicating about sensitive data

** your expertise relevant to the solution you propose


Note: Expertise described here could be real or hypothetical to fit the project topic you have created.


* B. Executive Summary - Technical Notes And Requirements

Write an executive summary directed to IT professionals that addresses each of the following requirements:

** the decision-support problem or opportunity you are solving for

** a description of the customers and why this product will fulfill their needs

** existing gaps in the data products you are replacing or modifying (if applicable)

** the data available or the data that needs to be collected to support the data product lifecycle

** the methodology you use to guide and support the data product design and development

** deliverables associated with the design and development of the data product

** the plan for implementation of your data product, including the anticipated outcomes from this development

** the methods for validating and verifying that the developed data product meets the requirements and subsequently the needs of the customers

** the programming environments and any related costs, as well as the human resources that are necessary to execute each phase in the development of the data product

** a projected timeline, including milestones, start and end dates, duration for each milestone, dependencies, and resources assigned to each task

* C. RhymeStorm Capstone Requirements Documentation

RhymeStorm is an application to help singers and songwriters brainstorm new lyrics.

** Descriptive And Predictive Methods

*** Descriptive Method

**** Most Common Grammatical Structures In A Set Of Lyrics

By filtering songs by metrics such as popularity or number of awards, we can use this software package to determine the most common grammatical phrase structures in each filtered category.

Since much of the data a record label might want to categorize songs by is likely proprietary, filtering the songs by any such metric is the responsibility of the user.

Once the songs are filtered/categorized, they can be passed to this software, which returns a list of the most popular grammar structures.

In the example below, you'll see that a simple noun phrase (a proper noun followed by a period) is the most popular structure with 6 occurrences, tied with a sentence composed of a personal pronoun, a verb, and an adjective.

#+begin_src clojure :results value :session main :exports both
(require '[com.owoga.corpus.markov :as markov]
         '[com.owoga.prhyme.nlp.core :as nlp]
         '[clojure.string :as string]
         '[clojure.java.io :as io])

(let [lines (transduce
             (comp
              (map slurp)
              (map #(string/split % #"\n"))
              (map (partial remove empty?))
              (map nlp/structure-freqs))
             merge
             {}
             (eduction (markov/xf-file-seq 0 10) (file-seq (io/file "/home/eihli/src/prhyme/dark-corpus"))))]
  (take 5 (sort-by (comp - second) lines)))
#+end_src

#+RESULTS:
| (TOP (NP (NNP) (.)))                                     | 6 |
| (TOP (S (NP (PRP)) (VP (VBP) (ADJP (JJ))) (.)))          | 6 |
| (INC (NP (JJ) (NN)) nil (IN) (NP (DT)) (NP (PRP)) (VBP)) | 4 |
| (TOP (NP (NP (JJ) (NN)) nil (NP (NN) (CC) (NN))))        | 4 |
| (TOP (S (NP (JJ) (NN)) nil (VP (VBG) (ADJP (JJ)))))      | 4 |

*** Predictive Method

**** Most Likely Word To Follow A Given Phrase

To help songwriters think of new lyrics, we provide an API to receive a list of words that commonly follow/precede a given phrase.

Models can be trained on different genres or categories of songs. This will ensure that recommended lyric completions are apt.

In the example below, we provide a seed suffix of "bother me" and ask the software to predict the most likely words that precede that phrase. The resulting most popular phrases are "don't bother me", "doesn't bother me", "to bother me", "won't bother me", and so on.

#+begin_src clojure :session main :exports both
(require '[com.darklimericks.server.models :as models]
         '[com.owoga.trie :as trie])

(let [seed ["bother" "me"]
      seed-ids (map models/database seed)
      lookup (reverse seed-ids)
      results (trie/children (trie/lookup models/markov-trie lookup))]
  (->> results
       (map #(get % []))
       (sort-by (comp - second))
       (map #(update % 0 models/database))
       (take 10)))
#+end_src

#+RESULTS:
| don't     | 36 |
| doesn't   | 21 |
| to        | 14 |
| won't     |  9 |
| really    |  5 |
| not       |  4 |
| you       |  4 |
| it        |  3 |
| even      |  3 |
| shouldn't |  3 |

** Datasets

The dataset currently in use was generated from the publicly available lyrics at http://darklyrics.com.

Further datasets will need to be provided by the end-user.

** Decision Support Functionality

*** Choosing Words For A Lyric Based On Markov Likelihood

Entire phrases can be generated using the previously mentioned functionality of generating lists of likely prefix/suffix words.

The software can be seeded with a simple "end-of-sentence" or "beginning-of-sentence" token and can be asked to work backwards to build a phrase that meets certain criteria.

The user can supply criteria such as restrictions on the number of syllables, the number of words, and the rhyme scheme.
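
A minimal sketch of this backwards generation, assuming a hypothetical ~predecessors~ map of backward bigram counts and a hypothetical ~syllables~ lookup table (neither is part of the project's API). It greedily picks the most frequent preceding word that still fits a syllable budget; the real software could instead sample among candidates and apply further criteria such as rhyme.

#+begin_src clojure
;; Hypothetical backward bigram counts: word -> {preceding-word count}.
(def predecessors
  {"</s>"   {"me" 5 "you" 2}
   "me"     {"bother" 4 "call" 1}
   "bother" {"don't" 36 "to" 14}
   "don't"  {"<s>" 3}})

;; Hypothetical syllable counts for the toy vocabulary.
(def syllables
  {"me" 1 "you" 1 "bother" 2 "call" 1 "don't" 1 "to" 1})

(defn generate-backwards
  "Starting from the end-of-sentence token, repeatedly prepend the most
  frequent preceding word that fits in the remaining syllable budget."
  [max-syllables]
  (loop [word "</s>"
         phrase ()
         remaining max-syllables]
    (let [candidates (->> (predecessors word)
                          ;; tokens without a syllable count (e.g. <s>) cost 0
                          (filter (fn [[w _]] (<= (syllables w 0) remaining)))
                          (sort-by (comp - second)))
          [best _] (first candidates)]
      (if (or (nil? best) (= best "<s>"))
        phrase
        (recur best (cons best phrase) (- remaining (syllables best)))))))

(generate-backwards 4)
;; => ("don't" "bother" "me")
#+end_src

With a budget of 3 syllables the same call stops one word earlier, because neither candidate before "bother" fits in the remaining budget.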

*** Choosing Words To Complete A Lyric Based On Rhyme Quality

Another part of the decision support functionality is filtering and ordering predicted words based on their rhyme quality.

By the conventional definition, two words form a "perfect" rhyme when their phonemes match from the primary stressed vowel to the end of the word.

For example: technology and ecology. Both of those words have a stress on the second syllable. The first syllables differ. But from the stressed syllable on, they have exactly matching phones.

A rhyme that might be useful to a songwriter but that doesn't fit the definition of a "perfect" rhyme would be "technology" and "economy". Those two words just barely break the rules for a perfect rhyme. Their vowel phones match from their primary stress to their ends, but the consonant phones between those vowels differ.

Singers and songwriters have some flexibility and artistic freedom, and imperfect rhymes can be a fallback.

Therefore, this software provides functionality to sort rhymes so that rhymes that are closer to perfect are first in the ordering.

In the example below, you'll see that the target word itself scores 8 and the next 18 rhymes score 7 (perfect), but then "hypocrisy" is listed as rhyming with "technology" with a score of 6. This is for the reason just mentioned. It's close to a perfect rhyme and it's of interest to singers/songwriters.

#+begin_src clojure :results value table :colnames yes :session main :exports both
(require '[com.darklimericks.linguistics.core :as linguistics]
         '[com.darklimericks.server.models :as models])

(let [results
      (linguistics/rhymes-with-frequencies-and-rhyme-quality
       "technology"
       models/markov-trie
       models/database)]
  (->> results
       (map
        (fn [[rhyming-word
              rhyming-word-phones
              frequency-count-of-rhyming-word
              target-word
              target-word-phones
              rhyme-quality]]
          [rhyming-word frequency-count-of-rhyming-word rhyme-quality]))
       (take 25)
       (vec)
       (into [["rhyme" "frequency count" "rhyme quality"]])))
#+end_src

#+RESULTS:
| rhyme          | frequency count | rhyme quality |
| technology     |             318 |             8 |
| apology        |              68 |             7 |
| pathology      |              42 |             7 |
| mythology      |              27 |             7 |
| psychology     |              24 |             7 |
| theology       |              23 |             7 |
| biology        |              20 |             7 |
| ecology        |              11 |             7 |
| chronology     |              10 |             7 |
| astrology      |               9 |             7 |
| biotechnology  |               8 |             7 |
| nanotechnology |               5 |             7 |
| geology        |               3 |             7 |
| ontology       |               2 |             7 |
| morphology     |               2 |             7 |
| seismology     |               1 |             7 |
| urology        |               1 |             7 |
| doxology       |               0 |             7 |
| neurology      |               0 |             7 |
| hypocrisy      |             723 |             6 |
| democracy      |             238 |             6 |
| atrocity       |             224 |             6 |
| philosophy     |             181 |             6 |
| equality       |             109 |             6 |
| ideology       |             105 |             6 |
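
The ordering above could be approximated by a scoring function like the following minimal sketch. It is not the project's ~rhymes-with-frequencies-and-rhyme-quality~ implementation: the ~phones~ table holds hand-entered, CMU-dictionary-style pronunciations (stress digits on vowels), and the 2/1/0 ranking is an assumption made for illustration.

#+begin_src clojure
(require '[clojure.string :as string])

;; Hand-entered pronunciations, assumed for this example only.
(def phones
  {"technology" ["T" "EH0" "K" "N" "AA1" "L" "AH0" "JH" "IY0"]
   "ecology"    ["IH0" "K" "AA1" "L" "AH0" "JH" "IY0"]
   "economy"    ["IH0" "K" "AA1" "N" "AH0" "M" "IY0"]
   "guitar"     ["G" "IH0" "T" "AA1" "R"]})

(defn rhyme-tail
  "Phones from the primary-stressed vowel (marked with 1) to the end."
  [ps]
  (drop-while #(not (string/ends-with? % "1")) ps))

(defn vowel?
  "Vowel phones carry a trailing stress digit; consonants do not."
  [p]
  (boolean (re-find #"\d$" p)))

(defn rhyme-rank
  "2 = perfect rhyme, 1 = only the vowels of the tails match, 0 = neither."
  [a b]
  (let [ta (rhyme-tail (phones a))
        tb (rhyme-tail (phones b))]
    (cond
      (and (seq ta) (= ta tb)) 2
      (and (seq ta) (= (filter vowel? ta) (filter vowel? tb))) 1
      :else 0)))

(defn sort-by-rhyme
  "Order candidates so that closer rhymes of target come first."
  [target candidates]
  (sort-by #(- (rhyme-rank target %)) candidates))

(sort-by-rhyme "technology" ["guitar" "economy" "ecology"])
;; => ("ecology" "economy" "guitar")
#+end_src

A finer-grained score, like the 6 to 8 scale in the table, would need to weigh how many phones match rather than sort into three buckets.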

** Featurizing, Parsing, Cleaning, And Wrangling Data

The data processing code is in [[https://github.com/eihli/prhyme]].

Each line gets tokenized using a regular expression to split the string into tokens.

#+begin_src clojure :session main
(def re-word
  "Regex for tokenizing a string into words
  (including contractions and hyphenations),
  commas, periods, question marks, and newlines."
  #"(?s).*?([a-zA-Z\d]+(?:['\-]?[a-zA-Z]+)?|,|\.|\?|\n)")
#+end_src

Along with tokenization, the lines get trimmed of leading and trailing whitespace and converted to lowercase. This conversion is done so that words can be compared: "Foo" is the same as "foo".

#+begin_src clojure
(def xf-tokenize
  (comp
   (map string/trim)
   (map (partial re-seq re-word))
   (map (partial map second))
   (map (partial mapv string/lower-case))))
#+end_src
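
As a quick illustration of what the tokenizer produces, the sketch below repeats the two definitions above so that it can be evaluated on its own, then runs them over one made-up lyric line. The input line is an assumption for the example and does not come from the corpus.

#+begin_src clojure
(require '[clojure.string :as string])

;; Same regex and transducer as above, repeated to keep the example standalone.
(def re-word
  #"(?s).*?([a-zA-Z\d]+(?:['\-]?[a-zA-Z]+)?|,|\.|\?|\n)")

(def xf-tokenize
  (comp
   (map string/trim)
   (map (partial re-seq re-word))          ; each match is [whole-match token]
   (map (partial map second))              ; keep only the captured token
   (map (partial mapv string/lower-case))))

(into [] xf-tokenize ["  Don't bother me, please.  "])
;; => [["don't" "bother" "me" "," "please" "."]]
#+end_src

Note that the contraction stays a single token while the comma and period become tokens of their own.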


** Data Exploration And Preparation

The primary data structure supporting exploration of the data is a Markov Trie.

The Trie data structure supports a ~lookup~ function that returns the child trie at a certain lookup key and a ~children~ function that returns all of the immediate children of a particular Trie.

All Trie code is hosted in the git repo located at [[https://github.com/eihli/clj-tightly-packed-trie]].

#+begin_src clojure :eval no
(defprotocol ITrie
  (children [self] "Immediate children of a node.")
  (lookup [self ^clojure.lang.PersistentList ks] "Return node at key."))

(deftype Trie [key value ^clojure.lang.PersistentTreeMap children-]
  ITrie
  (children [trie]
    (map
     (fn [[k ^Trie child]]
       (Trie. k
              (.value child)
              (.children- child)))
     children-))

  (lookup [trie k]
    (loop [k k
           trie trie]
      (cond
        ;; Allows `update` to work the same as with maps... can use `fnil`.
        ;; (nil? trie') (throw (Exception. (format "Key not found: %s" k)))
        (nil? trie) nil
        (empty? k)
        (Trie. (.key trie)
               (.value trie)
               (.children- trie))
        :else (recur
               (rest k)
               (get (.children- trie) (first k)))))))
#+end_src
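
To make the ~lookup~ and ~children~ operations concrete, here is a minimal sketch that models a trie with plain nested maps instead of the ~Trie~ type above. The names ~trie-insert~, ~trie-lookup~, and ~trie-children~ are hypothetical and are not part of clj-tightly-packed-trie.

#+begin_src clojure
(defn- path
  "Turn a key sequence into a nested-map path: [a b] -> (:children a :children b)."
  [ks]
  (mapcat (fn [k] [:children k]) ks))

(defn trie-insert
  "Increment the count stored at the node addressed by ks."
  [trie ks]
  (update-in trie (concat (path ks) [:count]) (fnil inc 0)))

(defn trie-lookup
  "Return the sub-trie at ks, or nil when the key is absent."
  [trie ks]
  (get-in trie (path ks)))

(defn trie-children
  "Map of immediate child key -> child node."
  [node]
  (:children node))

;; Keys are stored reversed, as in the backwards model: the seed word first,
;; then the word that preceded it in the lyric.
(def toy-trie
  (-> {}
      (trie-insert ["bother" "don't"])
      (trie-insert ["bother" "don't"])
      (trie-insert ["bother" "to"])))

(->> (trie-children (trie-lookup toy-trie ["bother"]))
     (map (fn [[k node]] [k (:count node)]))
     (sort-by (comp - second)))
;; => (["don't" 2] ["to" 1])
#+end_src

Nested maps are easy to inspect at the REPL, but they are far less memory-efficient than a packed representation, which is presumably why the project uses a dedicated trie type.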

** TODO Data Visualization Functionalities For Data Exploration And Inspection

- graph of phrase complexity on one axis and rhyme quality on another axis.

** TODO Implementation Of Interactive Queries

Interactive query capability at [[https://darklimericks.com/wgu]].

** TODO Implementation Of Machine-Learning Methods And Algorithms

The machine learning method chosen for this software is an n-gram Markov model: the states are the observed tokens themselves, so no hidden states need to be inferred.

Each line of each song is split into "tokens" (words) and then the previous ~n - 1~ tokens are used to predict the ~nth~ token.

The algorithm is implemented in several parts which are demonstrated below.

1. Read each song line-by-line.
2. Split each line into tokens.
3. Partition the tokens into sequences of length ~n~.
4. Insert each sequence into a Trie and increment the value representing the number of times that sequence has been encountered.

That is the process for building the Markov model.

The algorithm for generating predictions from the model is as follows.

1. Look up the ~n - 1~ tokens in the Trie.
2. Normalize the frequencies of the children of the ~n - 1~ tokens into percentage likelihoods.
3. Account for unseen n-grams (Simple Good-Turing smoothing).
4. Sort results by maximum likelihood.

#+begin_src clojure :session main :results output :exports both
(require '[com.owoga.prhyme.data-transform :as data-transform]
         '[clojure.pprint :as pprint])

(defn file-seq->markov-trie
  "For forwards markov."
  [database files n m]
  (transduce
   (comp
    (map slurp)
    (map #(string/split % #"[\n+\?\.]"))
    (map (partial transduce data-transform/xf-tokenize conj))
    (map (partial transduce data-transform/xf-filter-english conj))
    (map (partial remove empty?))
    (map (partial into [] (data-transform/xf-pad-tokens (dec m) "<s>" 1 "</s>")))
    (map (partial mapcat (partial data-transform/n-to-m-partitions n (inc m))))
    (mapcat (partial mapv (data-transform/make-database-processor database))))
   (completing
    (fn [trie lookup]
      (update trie lookup (fnil #(update % 1 inc) [lookup 0]))))
   (trie/make-trie)
   files))

(let [files (->> "/home/eihli/src/prhyme/dark-corpus"
                 io/file
                 file-seq
                 (eduction (data-transform/xf-file-seq 501 2)))
      database (atom {:next-id 1})
      trie (file-seq->markov-trie database files 1 3)]
 (pprint/pprint [(map (comp (partial map @database) first) (take 10 (drop 105 trie)))]))
#+end_src

#+RESULTS:
#+begin_example
[(("<s>" "call" "me")
  ("<s>" "call")
  ("<s>" "right" "</s>")
  ("<s>" "right")
  ("<s>" "that's" "proportional")
  ("<s>" "that's")
  ("<s>" "don't" "</s>")
  ("<s>" "don't")
  ("<s>" "yourself" "in")
  ("<s>" "yourself"))]
#+end_example

The results above show a sample of 10 elements in a 1-to-3-gram trie.
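
The training and prediction steps listed earlier can also be sketched on a toy corpus with nothing but core functions. This is a simplified illustration, not the project's pipeline: it stores counts in a flat map rather than a Trie, it skips Simple Good-Turing smoothing, and the ~ngram-counts~ and ~next-token-probs~ names and the three sample lines are made up for the example.

#+begin_src clojure
(defn ngram-counts
  "Count every n-gram (sliding window of n tokens) across tokenized lines."
  [n lines]
  (->> lines
       (mapcat #(partition n 1 %))
       frequencies))

(defn next-token-probs
  "Given n-gram counts and a context of n - 1 tokens, return
  [token probability] pairs sorted from most to least likely."
  [counts context]
  (let [k (count context)
        matches (filter (fn [[gram _]] (= (take k gram) context)) counts)
        total (reduce + (map second matches))]
    (->> matches
         (map (fn [[gram c]] [(last gram) (/ c total)]))
         (sort-by (comp - second)))))

(def toy-counts
  (ngram-counts 3 [["<s>" "don't" "bother" "me" "</s>"]
                   ["<s>" "don't" "bother" "him" "</s>"]
                   ["<s>" "don't" "bother" "me" "</s>"]]))

(next-token-probs toy-counts ["don't" "bother"])
;; => (["me" 2/3] ["him" 1/3])
#+end_src

Without smoothing, any continuation that never appeared in training gets no probability at all, which is the gap the smoothing step is meant to fill.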

The code sample below demonstrates training a Markov model on a set of lyrics where each line gets reversed. This model is useful for predicting words backwards, so that you can start with the rhyming end of a word or phrase and generate backwards to the start of the lyric.

It also performs compaction and serialization. Song lyrics are typically provided as text files. Reading and tokenizing many files from disk is expensive, so training is performed only once and the resulting Markov model is saved in a more memory-efficient format.

#+begin_src clojure
(defn train-backwards
  "For building lines backwards so they can be seeded with a target rhyme."
  [files n m trie-filepath database-filepath tightly-packed-trie-filepath]
  (let [database (atom {:next-id 1})
        trie (file-seq->backwards-markov-trie database files n m)]
    (nippy/freeze-to-file trie-filepath (seq trie))
    (println "Froze" trie-filepath)
    (nippy/freeze-to-file database-filepath @database)
    (println "Froze" database-filepath)
    (save-tightly-packed-trie trie database tightly-packed-trie-filepath)
    (let [loaded-trie (->> trie-filepath
                           nippy/thaw-from-file
                           (into (trie/make-trie)))
          loaded-db (->> database-filepath
                         nippy/thaw-from-file)
          loaded-tightly-packed-trie (tpt/load-tightly-packed-trie-from-file
                                      tightly-packed-trie-filepath
                                      (decode-fn loaded-db))]
      (println "Loaded trie:" (take 5 loaded-trie))
      (println "Loaded database:" (take 5 loaded-db))
      (println "Loaded tightly-packed-trie:" (take 5 loaded-tightly-packed-trie))
      (println "Successfully loaded trie and database."))))

(comment
  (time
   (let [files (->> "dark-corpus"
                    io/file
                    file-seq
                    (eduction (xf-file-seq 0 250000)))
         [trie database] (train-backwards
                          files
                          1
                          5
                          "/home/eihli/.models/markov-trie-4-gram-backwards.bin"
                          "/home/eihli/.models/markov-database-4-gram-backwards.bin"
                          "/home/eihli/.models/markov-tightly-packed-trie-4-gram-backwards.bin")]))

  (time
   (def markov-trie (into (trie/make-trie) (nippy/thaw-from-file "/home/eihli/.models/markov-trie-4-gram-backwards.bin"))))
  (time
   (def database (nippy/thaw-from-file "/home/eihli/.models/markov-database-4-gram-backwards.bin")))
  (time
   (def markov-tight-trie
     (tpt/load-tightly-packed-trie-from-file
      "/home/eihli/.models/markov-tightly-packed-trie-4-gram-backwards.bin"
      (decode-fn database))))
  (take 20 markov-tight-trie)
  )
#+end_src

** Functionalities To Evaluate The Accuracy Of The Data Product

Since creative brainstorming is the goal, "accuracy" is subjective.

We can, however, measure and compare language generation algorithms against how "expected" a phrase is given the training data. This measurement is "perplexity".

#+begin_src clojure :session main :exports both :results output
(require '[taoensso.nippy :as nippy]
         '[com.owoga.tightly-packed-trie :as tpt]
         '[com.owoga.corpus.markov :as markov])

(defonce database (nippy/thaw-from-file "/home/eihli/.models/markov-database-4-gram-backwards.bin"))

(defonce markov-tight-trie
  (tpt/load-tightly-packed-trie-from-file
   "/home/eihli/.models/markov-tightly-packed-trie-4-gram-backwards.bin"
   (markov/decode-fn database)))

(let [likely-phrase ["a" "hole" "</s>" "</s>"]
      less-likely-phrase ["this" "hole" "</s>" "</s>"]
      least-likely-phrase ["that" "hole" "</s>" "</s>"]]
  (run!
   (fn [word]
     (println
      (format
       "\"%s\" has preceded \"hole\" \"</s>\" \"</s>\" a total of %s times"
       word
       (second (get markov-tight-trie (map database ["</s>" "</s>" "hole" word]))))))
   ["a" "this" "that"])
  (run!
   (fn [word]
     (let [seed ["</s>" "</s>" "hole" word]]
       (println
        (format
         "%s is the perplexity of \"%s\" \"hole\" \"</s>\" \"</s>\""
         (->> seed
              (map database)
              (markov/perplexity 4 markov-tight-trie))
         word))))
   ["a" "this" "that"])
  nil)
#+end_src

#+RESULTS:
: "a" has preceded "hole" "</s>" "</s>" a total of 250 times
: "this" has preceded "hole" "</s>" "</s>" a total of 173 times
: "that" has preceded "hole" "</s>" "</s>" a total of 45 times
: -12.184088569934774 is the perplexity of "a" "hole" "</s>" "</s>"
: -12.552930899563904 is the perplexity of "this" "hole" "</s>" "</s>"
: -13.905719644461469 is the perplexity of "that" "hole" "</s>" "</s>"

The results above make intuitive sense. The most common word to precede "hole" at the end of a sentence is the word "a". There are 250 sentences ending in "... a hole.", compared to 173 ending in "... this hole." and 45 ending in "... that hole.".

Therefore, "... a hole." has the lowest "perplexity". The values printed above are negative, so the one closest to zero marks the most expected phrase.

This standardized measure of accuracy can be used to compare different language generation algorithms.
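
For comparison, the textbook definition of perplexity for a sequence of N tokens is the exponential of the negative mean log-probability, exp(-(1/N) Σ ln p_i), so lower values mean the phrase is less surprising to the model. The sketch below computes that definition from a hypothetical list of per-token probabilities; it is not the ~markov/perplexity~ function used above, which may report its result on a different scale.

#+begin_src clojure
(defn perplexity
  "Textbook perplexity of a token sequence, given the model's
  probability for each token in context."
  [probs]
  (let [n (count probs)
        log-sum (reduce + (map #(Math/log (double %)) probs))]
    (Math/exp (- (/ log-sum n)))))

;; Hypothetical probabilities: a phrase the model finds likely versus
;; one it finds unlikely.
(< (perplexity [0.5 0.5 0.5])
   (perplexity [0.1 0.1 0.1]))
;; => true
#+end_src

A model that assigns every token a probability of 0.5 has a perplexity of about 2, meaning it is as uncertain as if it were choosing between two equally likely words at each step.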

** Security Features

Artists/Songwriters place a lot of value in the secrecy of their content. Therefore, all communication with the web-based interface occurs over a secure connection using HTTPS.

Security certificates are generated using Let's Encrypt and an Nginx web server handles the SSL termination.

With this precaution in place, attackers on the network cannot read the content that songwriters send to or receive from the servers while it is in transit.

** TODO Tools To Monitor And Maintain The Product

- Script to auto-update SSL cert
- Enable NGINX dashboard?

** TODO A User-Friendly, Functional Dashboard That Includes At Least Three Visualization Types


* D. Documentation

Create each of the following forms of documentation for the product you have developed:

** Business Vision

Provide rhyming lyric suggestions optionally constrained by syllable count.

** Data Sets

See ~resources/darklyrics-markov.tpt~

** Data Analysis

See ~src/com/owoga/darklyrics/core.clj~

See https://github.com/eihli/prhyme

** Assessment

See visualization of rhyme suggestion in action.

See perplexity?

** Visualizations

See visualization of smoothing technique.

See wordcloud

** Accuracy

- assessment of the product’s accuracy

** Testing

- the results from the data product testing, revisions, and optimization based on the provided plans, including screenshots

** Source

- source code and executable file(s)

** Quick Start

- a quick start guide summarizing the steps necessary to install and use the product

* Notes

http-kit doesn't support HTTPS, so there's no need to bother with keystore setup like you would with Jetty. Just proxy from HAProxy.