Update WGU Readme

Eric Ihli · branch main · commit 57263684e1 (parent a5d37f5185)

This is the process that songwriters go through all day.
And this is a process that is perfect for machine learning. Machine learning can learn the most likely grammatical structure of phrases and can make predictions about likely words that follow a given sequence of other words. Computers can iterate through millions of words, checking for restrictions on rhyme, syllable count, and more. The most tedious part of lyric generation can be automated with machine learning software, leaving the songwriter free to cherry-pick from the best lyrics and make minor touch-ups to make them perfect.
** Product - RhymeStorm
RhymeStorm is a tool to help songwriters brainstorm. It provides lyrics automatically generated based on training data from existing songs while adhering to restrictions based on rhyme scheme, meter, genre, and more.
The machine learning component of the software described above can be implemented with a simple technique known as a Hidden Markov Model.
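
To make the idea concrete, here is a toy sketch of the Markov idea in Clojure: a bigram model as a plain map from each word to the frequencies of the words that follow it. This is illustrative only; the project's actual n-gram implementation, backed by a trie, is shown in section C.

#+begin_src clojure :eval no
;; Toy bigram model: map each word to counts of the words that follow it.
(defn train-bigrams [tokens]
  (reduce (fn [model [w1 w2]]
            (update-in model [w1 w2] (fnil inc 0)))
          {}
          (partition 2 1 tokens)))

;; Predict the most frequent continuation of a word.
(defn most-likely-next [model word]
  (key (apply max-key val (model word))))

(def model (train-bigrams ["the" "night" "falls" "and" "the" "night" "calls"]))
;; (model "night")                => {"falls" 1, "calls" 1}
;; (most-likely-next model "the") => "night"
#+end_src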
I have 10 years of experience as a programmer.
I've also been writing limericks my entire life and hold the International Limerick Imaginative Enthusiast's (ILIE) award for the years 2013 and 2019.
* B. Executive Summary - RhymeStorm Technical Notes And Requirements
Write an executive summary directed to IT professionals that addresses each of the following requirements:
** Decision Support Opportunity
Songwriters expend a lot of time and effort finding the perfect rhyming word or phrase. RhymeStorm is going to amplify users' creative abilities by searching its machine learning model for sensible, proven-successful words and phrases that meet the rhyme scheme and meter requirements requested by the user.
When a songwriter needs likely phrases that rhyme with "war on poverty" and have 14 syllables, RhymeStorm will automatically generate dozens of possibilities and rank them by "perplexity" and rhyme quality. The songwriter can then focus their efforts on simple touch-ups to perfect the automatically generated lyrics.
** Customer Needs And Product Description
RhymeZone is limited in its capability. It doesn't do well finding rhymes for phrases.
The initial dataset will be gathered by downloading lyrics from http://darklyrics.com, and future models can be generated by downloading lyrics from other websites. Alternatively, data can be provided by record labels and combined with metadata the record label may have, such as how many radio plays each song gets and how much profit each song earns.
RhymeStorm can offer multiple models depending on the genre or theme the songwriter is looking for. With the initial dataset from http://darklyrics.com, all suggestions will have a heavy metal theme, but future models can be trained on rap, pop, or other genres.
New songs aren't released quickly enough for training to need to be an automated, ongoing process. Perhaps once a year, or whenever a new dataset becomes available, someone can run a script that updates the data models, as sketched below.
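
A sketch of what such a script could look like, assuming the train-backwards function shown in section C below; the corpus directory and output paths are hypothetical examples.

#+begin_src clojure :eval no
(require '[clojure.java.io :as io])

;; Hypothetical once-a-year retraining entry point. `train-backwards` is
;; defined in section C; the paths here are placeholders.
(defn -main [& [corpus-dir]]
  (let [files (->> (io/file (or corpus-dir "new-corpus"))
                   file-seq
                   (filter #(.isFile %)))]
    (train-backwards
     files 1 5
     "models/markov-trie-4-gram-backwards.bin"
     "models/markov-database-4-gram-backwards.bin"
     "models/markov-tightly-packed-trie-4-gram-backwards.bin")))
#+end_src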
Each new model can be uploaded to the web server, and users can select which model they want to use.
** Methodology - Agile
RhymeStorm development will proceed with an iterative Agile methodology. It will be composed of several independent modules that can be developed in parallel and iteratively.
The Trie data structure that will back the Hidden Markov Model can be worked on in isolation from the rest of the project. The first iteration can use a simple hash-map as a backing store. The second iteration can improve memory efficiency by using a ByteBuffer as a [[https://aclanthology.org/W09-1505.pdf][Tightly Packed Trie]]. Future iterations can continue to improve performance.
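
As a sketch of that first iteration, a trie backed by nothing more than nested hash-maps can support the needed insert and lookup operations (this is not the final com.owoga.trie API, just an illustration of the starting point):

#+begin_src clojure :eval no
;; First-iteration trie as nested hash-maps: children live under the keys
;; of each node, and a node's value lives under ::value.
(defn trie-assoc [trie ks v]
  (assoc-in trie (conj (vec ks) ::value) v))

(defn trie-get [trie ks]
  (get-in trie (conj (vec ks) ::value)))

(-> {}
    (trie-assoc [1 2 3] {:freq 2})
    (trie-assoc [1 2 4] {:freq 1})
    (trie-get [1 2 3]))
;; => {:freq 2}
#+end_src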
Much of data science is exploratory, and an iterative Agile approach is a natural fit for that kind of work.
** Deliverables
Three aspects of this project are available as open source repositories on GitHub.
[[https://github.com/eihli/clj-tightly-packed-trie][Tightly Packed Trie]]
[[https://github.com/eihli/phonetics][Phonetics and Syllabification]]
[[https://github.com/eihli/prhyme][Data Processing, Markov, and Rhyme Algorithms]]
The trained data model and web interface have been deployed at the following address, and the code will be provided in an archive file.
[[https://darklimericks.com/wgu][Web GUI and Documentation]]
** Implementation Plan And Anticipations
I'll start by writing and releasing the supporting libraries and packages: the trie, phonetics/syllabification, and data-processing/rhyme libraries listed above under Deliverables.
Then I'll write a website that imports and uses those libraries.
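
For example, once the libraries are released on GitHub, the website can pull them in as git dependencies in its deps.edn. The coordinates below follow the repository URLs listed under Deliverables; the :git/sha values are placeholders.

#+begin_src clojure :eval no
;; Hypothetical deps.edn for the website; shas are placeholders for the
;; released versions of each library.
{:deps
 {io.github.eihli/clj-tightly-packed-trie {:git/sha "<release-sha>"}
  io.github.eihli/phonetics               {:git/sha "<release-sha>"}
  io.github.eihli/prhyme                  {:git/sha "<release-sha>"}}}
#+end_src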
Since I'll be writing and releasing these packages iteratively as open source, I'll share them publicly as I progress and can use feedback to improve them before RhymeStorm takes its final form.
In anticipation of user growth, I'll be deploying the final product on DigitalOcean Droplets. They are virtual machines with resources that can be resized to meet growing demands or shrunk to save money in times of low traffic.
All code was written and all models were trained on a Lenovo T15G with an Intel processor.
| 5 | 2021-08-14 | 2021-08-21 | Create Web Interface And Visualizations |
| 6 | 2021-08-21 | 2021-09-07 | QA - Testing - Deploy And Release Web App |
* C. RhymeStorm Capstone Requirements Documentation
RhymeStorm is an application to help singers and songwriters brainstorm new lyrics.
** Descriptive And Predictive Methods
The data processing code is in [[https://github.com/eihli/prhyme]].
Each line gets tokenized using a regular expression that splits the string into words and punctuation.
#+begin_src clojure :session main :eval no
(def re-word
  "Regex for tokenizing a string into words
  (including contractions and hyphenations),
  commas, periods, and other punctuation."
  ;; The regex body is cut off in this diff; a plausible reconstruction
  ;; with a single capture group, as consumed by xf-tokenize below:
  #"(?s).*?([a-zA-Z\d]+(?:['\-]?[a-zA-Z\d]+)*|[.,!?]|$)")
#+end_src
Along with tokenization, the lines get stripped of whitespace and converted to lowercase. This conversion is done so that
words can be compared: "Foo" is the same as "foo".
#+begin_src clojure :eval no
(def xf-tokenize
  (comp
   (map string/trim)
   ;; The remaining steps are cut off in this diff; reconstructed from the
   ;; prose above: tokenize with re-word, then lowercase for comparison.
   (map (partial re-seq re-word))
   (map (partial map second))
   (map (partial mapv string/lower-case))))
#+end_src
All Trie code is hosted in the git repo located at [[https://github.com/eihli/clj-tightly-packed-trie]].
Interactive query capability at [[https://darklimericks.com/wgu]].
** Implementation Of Machine Learning Methods
The machine learning method chosen for this software is a Hidden Markov Model.
The algorithm for generating predictions from the HMM is as follows.
#+begin_src clojure :session main
;; The opening lines of this block fall outside the diff hunk; they are
;; reconstructed from the matching training code later in this document
;; (and assume the requires shown in the next code block).
(let [files (->> "dark-corpus"
                 io/file
                 file-seq
                 (eduction (data-transform/xf-file-seq 501 2)))
      database (atom {:next-id 1})
      trie (file-seq->markov-trie database files 1 3)]
  (pprint/pprint [(map (comp (partial map @database) first) (take 10 (drop 105 trie)))]))
#+end_src
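
Given entries of the form [ngram-ids [id freq]], as printed above, ranking candidate next words for a context reduces to filtering and sorting. The sketch below uses only the seq interface demonstrated in the block above; the real implementation can query the trie directly rather than scanning it.

#+begin_src clojure :eval no
;; Rank possible continuations of a prefix of word ids by frequency,
;; scanning the seq of [ngram-ids [id freq]] entries shown above.
(defn rank-continuations [trie prefix-ids]
  (->> (seq trie)
       (filter (fn [[ngram-ids _]]
                 (and (= (count ngram-ids) (inc (count prefix-ids)))
                      (= (vec prefix-ids) (vec (butlast ngram-ids))))))
       (sort-by (fn [[_ [_ freq]]] freq) >)))
#+end_src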
The code sample below demonstrates training a Hidden Markov Model on a set of lyrics.
It also performs compaction and serialization. Song lyrics are typically provided as text files. Reading files on a hard drive is an expensive process, but we can perform that expensive training process only once and save the resulting Markov Model in a more memory-efficient format.
#+begin_src clojure :session main :results output pp
(require '[com.owoga.corpus.markov :as markov]
         '[taoensso.nippy :as nippy]
         '[com.owoga.prhyme.data-transform :as data-transform]
         '[clojure.java.io :as io] ;; needed for io/file below
         '[clojure.pprint :as pprint]
         '[clojure.string :as string]
         '[com.owoga.trie :as trie]
         '[com.owoga.tightly-packed-trie :as tpt])
(defn train-backwards
  "For building lines backwards so they can be seeded with a target rhyme."
  [files n m trie-filepath database-filepath tightly-packed-trie-filepath]
  (let [database (atom {:next-id 1})
        trie (markov/file-seq->backwards-markov-trie database files n m)]
    (nippy/freeze-to-file trie-filepath (seq trie))
    (println "Froze" trie-filepath)
    (nippy/freeze-to-file database-filepath @database)
    (println "Froze" database-filepath)
    (markov/save-tightly-packed-trie trie database tightly-packed-trie-filepath)
    (let [loaded-trie (->> trie-filepath
                           nippy/thaw-from-file
                           (into (trie/make-trie)))
          loaded-db (->> database-filepath
                         nippy/thaw-from-file)
          loaded-tightly-packed-trie (tpt/load-tightly-packed-trie-from-file
                                      tightly-packed-trie-filepath
                                      (markov/decode-fn loaded-db))]
      (println "Loaded trie:" (take 5 loaded-trie))
      (println "Loaded database:" (take 5 loaded-db))
      (println "Loaded tightly-packed-trie:" (take 5 loaded-tightly-packed-trie))
      (println "Successfully loaded trie and database."))))
(let [files (->> "/home/eihli/src/prhyme/dark-corpus"
                 io/file
                 file-seq
                 (eduction (data-transform/xf-file-seq 0 4)))
      [trie database] (train-backwards
                       files
                       1
                       5
                       "/tmp/markov-trie-4-gram-backwards.bin"
                       "/tmp/markov-database-4-gram-backwards.bin"
                       "/tmp/markov-tightly-packed-trie-4-gram-backwards.bin")])

(def markov-trie (into (trie/make-trie) (nippy/thaw-from-file "/tmp/markov-trie-4-gram-backwards.bin")))

(def database (nippy/thaw-from-file "/tmp/markov-database-4-gram-backwards.bin"))

(def markov-tight-trie
  (tpt/load-tightly-packed-trie-from-file
   "/tmp/markov-tightly-packed-trie-4-gram-backwards.bin"
   (markov/decode-fn database)))

(println "\n\n Example n-grams frequencies from Hidden Markov Model:\n")

(pprint/pprint
 (->> markov-tight-trie
      (drop 600)
      (take 10)
      (map
       (fn [[ngram-ids [id freq]]]
         [(string/join " " (map database ngram-ids)) freq]))))
#+end_src
#+RESULTS:
#+begin_example
Froze /tmp/markov-trie-4-gram-backwards.bin
Froze /tmp/markov-database-4-gram-backwards.bin
Loaded trie: ([(1 1 1 1 2) [2 2]] [(1 1 1 1 11) [11 1]] [(1 1 1 1 14) [14 2]] [(1 1 1 1 17) [17 1]] [(1 1 1 1 22) [22 1]])
Loaded database: ([hole 7] [trash 227] [come 87] [275 overkill] [breaking 205])
Loaded tightly-packed-trie: ([(1 1 1 1 2) [2 2]] [(1 1 1 1 11) [11 1]] [(1 1 1 1 14) [14 2]] [(1 1 1 1 17) [17 1]] [(1 1 1 1 22) [22 1]])
Successfully loaded trie and database.
Example n-grams frequencies from Hidden Markov Model:
(["</s> behind from attack cowards" 1]
["</s> behind from attack" 1]
["</s> behind from" 1]
["</s> behind" 1]
["</s> hate recharging , crushing" 1]
["</s> hate recharging ," 1]
["</s> hate recharging" 1]
["</s> hate" 1]
["</s> bills and sins pay" 1]
["</s> bills and sins" 1])
#+end_example
** Functionalities To Evaluate The Accuracy Of The Data Product
Since creative brainstorming is the goal, "accuracy" is subjective.
We can, however, measure and compare language generation algorithms against how "expected" a phrase is given the training data. This measurement is "perplexity".
#+begin_src clojure :session main :exports both :results output pp
(require '[taoensso.nippy :as nippy]
         '[com.owoga.tightly-packed-trie :as tpt]
         '[com.owoga.corpus.markov :as markov])
(def database (nippy/thaw-from-file "/home/eihli/.models/markov-database-4-gram-backwards.bin"))
(def markov-tight-trie
  (tpt/load-tightly-packed-trie-from-file
   "/home/eihli/.models/markov-tightly-packed-trie-4-gram-backwards.bin"
   (markov/decode-fn database)))
;; The opening of this form falls outside the diff hunk; it is
;; reconstructed here to match the #+RESULTS below.
(run!
 (fn [word]
   (let [phrase [word "hole" "</s>" "</s>"]]
     (println
      (format
       "%s is the perplexity of \"%s\" \"hole\" \"</s>\" \"</s>\""
       (->> phrase
            (map database)
            (markov/perplexity 4 markov-tight-trie))
       word))))
 ["a" "this" "that"])
#+end_src
#+RESULTS:
: -12.184088569934774 is the perplexity of "a" "hole" "</s>" "</s>"
: -12.552930899563904 is the perplexity of "this" "hole" "</s>" "</s>"
: -13.905719644461469 is the perplexity of "that" "hole" "</s>" "</s>"
:
:
The results above make intuitive sense. The most common word to precede "hole" at the end of a sentence is "a". There are 250 instances of sentences ending in "... a hole.", compared to 173 instances of "... this hole." and 45 instances of "... that hole.".
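
A quick sanity check of that intuition, using the counts quoted above (the log-probabilities are illustrative, not the model's exact perplexity computation):

#+begin_src clojure :eval no
;; Log-probabilities from the quoted counts of "... a/this/that hole."
(let [counts {"a" 250 "this" 173 "that" 45}
      total  (reduce + (vals counts))] ; 468
  (sort-by second >
           (map (fn [[w c]] [w (Math/log (double (/ c total)))])
                counts)))
;; => (["a" ~-0.63] ["this" ~-1.00] ["that" ~-2.34])
;; Same ordering as the reported perplexities: "a" before "this" before "that".
#+end_src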
