diff --git a/README.org b/README.org index ed8b6d4..e438657 100644 --- a/README.org +++ b/README.org @@ -4,8 +4,39 @@ See [[file:web/README_WGU.org][the WGU Readme]]. + * How To Initialize Development Environment +** Required Software + +- [[https://www.docker.com/][Docker]] +- [[https://clojure.org/releases/downloads][Clojure Version 1.10+]] +- [[https://github.com/clojure-emacs/cider][Emacs and CIDER]] + +** Steps + +1. Run ~./db/run.sh && ./kv/run.sh~ to start the docker containers for the database and key-value store. + a. The ~run.sh~ scripts only need to run once. They initialize development data containers. Subsequent development can continue with ~docker start db && docker start kv~. +2. Start a Clojure REPL in Emacs, evaluate the ~dev/user.clj~ namespace, and run ~(init)~ +3. Visit ~http://localhost:8000/wgu~ + + +* How To Run Software Locally + +** Requirements + +- [[https://www.java.com/download/ie_manual.jsp][Java]] +- [[https://www.docker.com/][Docker]] + +** Steps +1. Run ~./db/run.sh && ./kv/run.sh~ to start the docker containers for the database and key-value store. + a. The ~run.sh~ scripts only need to run once. They initialize development data containers. Subsequent development can continue with ~docker start db && docker start kv~. +2. The application's ~jar~ builds with a ~make~ run from the root directory. (See [[file:../Makefile][Makefile]]). +3. Navigate to the root directory of this git repo and run ~java -jar darklimericks.jar~ +4. Visit http://localhost:8000/wgu + + + * Development Requires [[https://github.com/tachyons-css/tachyons/][Tachyons CSS]]. There is a symlink in ~web/resources/public~ to the pre-built ~tachyons.css~ and ~tachyons.min.css~ found in the repo. diff --git a/web/README_WGU.org b/web/README_WGU.org index 750d900..87bb49a 100644 --- a/web/README_WGU.org +++ b/web/README_WGU.org @@ -34,7 +34,6 @@ It's probably not necessary for you to replicate my development environment in o 2. 
Start a Clojure REPL in Emacs, evaluate the ~dev/user.clj~ namespace, and run ~(init)~ 3. Visit ~http://localhost:8000/wgu~ - ** How To Run Software Locally *** Requirements @@ -125,7 +124,6 @@ These are my estimates for the time and cost of different aspects of initial dev | Quality Assurance | 20 | $200 | | Total | 330 | $3,300 | - ** Stakeholder Impact The only stakeholders in the project will be the record labels or songwriters. I describe the only impact to them in the [[Benefits]] section above. @@ -450,7 +448,6 @@ words can be compared: "Foo" is the same as "foo". (map (partial mapv string/lower-case)))) #+end_src - ** Data Exploration And Preparation The primary data structure and algorithms supporting exploration of the data are a Markov Trie @@ -872,30 +869,165 @@ For example, there is natural language processing code at [[https://github.com/e ** Assessment -See visualization of rhyme suggestion in action. ** Visualizations -See visualization of smoothing technique. +[[file:resources/images/rhyme-scatterplot.png]] + +[[file:resources/images/wordcloud.png]] -See wordcloud? +[[file:resources/images/rhyme-table.png]] ** Accuracy -• assessment of the product’s accuracy +It's difficult to objectively test the models accuracy since the goal of "brainstorm new lyric" is such a subjective goal. A valid test of that goal will require many human subjects to subjectively evaluate their performance while using the tool compared to their performance without the tool. + +If we allow ourselves the assumption that the close a generated phrase is to a valid english sentence then the better the generated phrase is at helping a songwriter brainstorm, then one objective assessment measure can be the percentage of generated lyrics that are valid English sentences. 
+ +*** Percentage Of Generated Lines That Are Valid English Sentences + +We can use [[https://opennlp.apache.org/][Apache OpenNLP]] to parse sentences into a grammar structure conforming to the parts of speech specified by the [[https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html][University of Pennsylvania's Treebank Project]]. + +If OpenNLP parses a line of text into a "simple declarative clause" from the Treebank Tag Set, as described [[https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html][here]], then we consider it a valid sentence. + +Using this technique on a (small) sample of 100 generated sentences reveals that ~47 are valid. + +This is just one of many possible assessment techniques we could use. It's simple but could be expanded to include valid phrases other than Treebank's clauses. For the purpose of having a measurement by which to compare changes to the algorithm, this suffices. + +#+begin_src clojure :session main :eval no-export :results output +(require '[com.darklimericks.linguistics.core :as linguistics] + '[com.owoga.prhyme.nlp.core :as nlp]) + +;; wgu-lyric-suggestion returns 20 suggestions. Each suggestion is a vector of +;; the rhyming word/quality/frequency and the sentence/parse. This function +;; returns just the sentences. The sentences can be further filtered using +;; OpenNLP to only those that are grammatically valid english sentences. + +(defn sample-of-20 + [] + (->> "technology" + linguistics/wgu-lyric-suggestions + (map (comp first second)))) + +(defn average-valid-of-100-suggestions [] + (let [generated-suggestions (apply concat (repeatedly 5 sample-of-20)) + valid-english (filter nlp/valid-sentence? generated-suggestions)] + (/ (count valid-english) 100))) + +(println (average-valid-of-100-suggestions)) +#+end_src + +#+RESULTS: +: 47/100 + +Where ~nlp/valid-sentence?~ is defined as follows. + +#+begin_src clojure +(defn valid-sentence? 
+ "Tokenizes and parses the phrase using OpenNLP models from + http://opennlp.sourceforge.net/models-1.5/ + + If the parse tree has a clause as the top-level tag, then + we consider it a valid English sentence." + [phrase] + (->> phrase + tokenize + (string/join " ") + vector + parse + first + tb/make-tree + :chunk + first + :tag + tb2/clauses + boolean)) +#+end_src ** Testing -• the results from the data product testing, revisions, and optimization based on the provided plans, including screenshots +My language of choice for this project encourages a programming technique or paradigm known as REPL-driven development. REPL stands for Read-Eval-Print-Loop. This is a way to write and test code in real-time without a compilation step. Individual code chunks can be evaluated inside an editor, resulting in rapid feedback. + +Therefore, many "tests" exist as comments immediately following the code under test. For example: + +#+begin_src clojure :eval no +(defn perfect-rhyme + [phones] + (->> phones + reverse + (util/take-through stress-manip/primary-stress?) + first + reverse + (#(cons (first %) + (stress-manip/remove-any-stress-signifiers (rest %)))))) + +(comment + (perfect-rhyme (first (phonetics/get-phones "technology"))) + ;; => ("AA1" "L" "AH" "JH" "IY") + ) +#+end_src + +The code inside that comment can be evaluated with a simple keystroke while +inside an editor. It serves as both a test and a form of documentation, as you +can see the input and the expected output. + +Supporting libraries have a more robust test suite, since their purpose is to be used more widely across other projects with contributions accepted from anyone. + +Here is an example of the test suite for the code related to syllabification: [[https://github.com/eihli/phonetics/blob/main/test/com/owoga/phonetics/syllabify_test.clj]]. + +** Source Code -** Source +*** Tightly Packed Trie -• source code and executable file(s) +This is the data structure that backs the Hidden Markov Model. 
+ +https://github.com/eihli/clj-tightly-packed-trie + +*** Phonetics + +This is the helper library that syllabifies and manipulates words, phones, and syllables. + +https://github.com/eihli/phonetics + +*** Rhyming + +This library contains code for analyzing rhymes, sentence structure, and manipulating corpuses. + +https://github.com/eihli/prhyme + +*** Web Server And User Interface + +This application is not publicly available. I'll upload it with submission of the project. ** Quick Start -• a quick start guide summarizing the steps necessary to install and use the product +*** How To Initialize Development Environment + +**** Required Software + +- [[https://www.docker.com/][Docker]] +- [[https://clojure.org/releases/downloads][Clojure Version 1.10+]] +- [[https://github.com/clojure-emacs/cider][Emacs and CIDER]] + +**** Steps -* Notes +1. Run ~./db/run.sh && ./kv/run.sh~ to start the docker containers for the database and key-value store. + a. The ~run.sh~ scripts only need to run once. They initialize development data containers. Subsequent development can continue with ~docker start db && docker start kv~. +2. Start a Clojure REPL in Emacs, evaluate the ~dev/user.clj~ namespace, and run ~(init)~ +3. Visit ~http://localhost:8000/wgu~ + +*** How To Run Software Locally + +**** Requirements + +- [[https://www.java.com/download/ie_manual.jsp][Java]] +- [[https://www.docker.com/][Docker]] + +**** Steps + +1. Run ~./db/run.sh && ./kv/run.sh~ to start the docker containers for the database and key-value store. + a. The ~run.sh~ scripts only need to run once. They initialize development data containers. Subsequent development can continue with ~docker start db && docker start kv~. +2. The application's ~jar~ builds with a ~make~ run from the root directory. (See [[file:../Makefile][Makefile]]). +3. Navigate to the root directory of this git repo and run ~java -jar darklimericks.jar~ +4. 
Visit http://localhost:8000/wgu -http-kit doesn't support https so no need to bother with keystore stuff like you would with jetty. Just proxy from haproxy. diff --git a/web/resources/images/rhyme-scatterplot.png b/web/resources/images/rhyme-scatterplot.png new file mode 100644 index 0000000..567393d Binary files /dev/null and b/web/resources/images/rhyme-scatterplot.png differ diff --git a/web/resources/images/rhyme-table.png b/web/resources/images/rhyme-table.png new file mode 100644 index 0000000..f76dfc0 Binary files /dev/null and b/web/resources/images/rhyme-table.png differ diff --git a/web/resources/images/wordcloud.png b/web/resources/images/wordcloud.png new file mode 100644 index 0000000..a26e7dc Binary files /dev/null and b/web/resources/images/wordcloud.png differ diff --git a/web/resources/public/README_WGU.htm b/web/resources/public/README_WGU.htm index f9f067d..3d20be6 100644 --- a/web/resources/public/README_WGU.htm +++ b/web/resources/public/README_WGU.htm @@ -3,7 +3,7 @@ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
- +Hello! I hope you enjoy your time with this evaluation! @@ -346,20 +361,20 @@ After I describe the steps to initialize a development environment, you’ll
It’s probably not necessary for you to replicate my development environment in order to evaluate this project. You can access the deployed application at https://darklimericks.com/wgu and the libraries and supporting code that I wrote for this project at https://github.com/eihli/clj-tightly-packed-trie, https://github.com/eihli/syllabify, and https://github.com/eihli/prhyme. The web server and web application is not hosted publicly but you will find it uploaded with my submission as a .tar
archive.
./db/run.sh && ./kv/run.sh
to start the docker containers for the database and key-value store.
@@ -384,13 +399,12 @@ It’s probably not necessary for you to replicate my development environmen
./db/run.sh && ./kv/run.sh
to start the docker containers for the database and key-value store.
@@ -421,8 +435,8 @@ It’s probably not necessary for you to replicate my development environmen
Songwriters, artists, and record labels can save time and discover better lyrics with the help of a machine learning tool that supports their creative endeavours. @@ -434,8 +448,8 @@ Songwriters have several old-fashioned tools at their disposal including diction
How many sensible phrases can you think of that rhyme with “war on poverty”? What if I say that there’s a restriction to only come up with phrases that are exactly 14 syllables? That’s a common restriction when a songwriter is trying to match the meter of a previous line. What if I add another restriction that there must be primary stress at certain spots in that 14 syllable phrase? @@ -451,8 +465,8 @@ And this is a process that is perfect for machine learning. Machine learning can
RhymeStorm™ is a tool to help songwriters brainstorm. It provides lyrics automatically generated based on training data from existing songs while adhering to restrictions based on rhyme scheme, meter, genre, and more. @@ -480,8 +494,8 @@ This auto-complete functionality will be similar to the auto-complete that is co
The initial model will be trained on the lyrics from http://darklyrics.com. This is a publicly available data set with minimal meta-data. Record labels will have more valuable datasets that will include meta-data along with lyrics, such as the date the song was popular, the number of radio plays of the song, the profit of the song/artist, etc… @@ -493,8 +507,8 @@ The software can be augmented with additional algorithms to account for the type
This software will accomplish its primary objective if it makes its way into the daily toolkit of a handful of singers/songwriters. @@ -514,8 +528,8 @@ Another example is the package that turns phrases into phones (symbols of pronun
This project will be developed with an iterative Agile methodology. Since a large part of data science and machine learning is exploration, this project will benefit from ongoing exploration in tandem with development. @@ -531,8 +545,8 @@ The prices quoted below are for an initial minimum-viable-product that will serv
Funding requirements are minimal. The initial dataset is public and freely available. On a typical consumer laptop, Hidden Markov Models can be trained on fairly large datasets in short time and the training doesn’t require the use of expensive hardware like the GPUs used to train Deep Neural Networks. @@ -616,18 +630,17 @@ These are my estimates for the time and cost of different aspects of initial dev
Web scraping, the method used to obtain the initial dataset from http://darklyrics.com, is protected given the ruling in https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn. @@ -639,8 +652,8 @@ The use of publicly available data in generative works is less clear. But Micros
I have 10 years experience as a programmer and have worked extensively on both frontend technologies like HTML/JavaScript, backend technologies like Django, and building libraries/packages/frameworks. @@ -661,8 +674,8 @@ Write an executive summary directed to IT professionals that addresses each of t
Songwriters expend a lot of time and effort finding the perfect rhyming word or phrase. RhymeStorm™ is going to amplify user’s creative abilities by searching its machine learning model for sensible and proven-successful words and phrases that meet the rhyme scheme and meter requirements requested by the user. @@ -674,8 +687,8 @@ When a songwriter needs to find likely phrases that rhyme with “war on pov
Songwriters spend money on dictionaries, compilations of slang, thesauruses, and phrase dictionaries. They spend their time daydreaming, brainstorming, contemplating, and mixing and matching the knowledge they acquire through these traditional means. @@ -695,8 +708,8 @@ Computers can process and sort this information and sort the results by quality
We’re all familiar with dictionaries, thesauruses, and their shortcomings. @@ -712,8 +725,8 @@ RhymeZone is limited in its capability. It doesn’t do well finding rhymes
The initial dataset will be gathered by downloading lyrics from http://darklyrics.com and future models can be generated by downloading lyrics from other websites. Alternatively, data can be provided by record labels and combined with meta-data that the record label may have, such as how many radio plays each song gets and how much profit they make from each song. @@ -737,8 +750,8 @@ Each new model can be uploaded to the web server and users can select which mode
RhymeStorm™ development will proceed with an iterative Agile methodology. It will be composed of several independent modules that can be worked on independently, in parallel, and iteratively. @@ -762,8 +775,8 @@ Much of data science is exploratory and taking an iterative Agile approach can t
I’ll start by writing and releasing the supporting libraries and packages: Tries, Syllabification/Phonetics, Rhyming. @@ -818,8 +831,8 @@ In anticipation of user growth, I’ll be deploying the final product on Dig
the methods for validating and verifying that the developed data product meets the requirements and subsequently the needs of the customers @@ -839,8 +852,8 @@ The final website will integrate multiple technologies and the integrations won&
the programming environments and any related costs, as well as the human resources that are necessary to execute each phase in the development of the data product @@ -864,8 +877,8 @@ All code was written and all models were trained on a Lenovo T15G with an Intel
class java.lang.IllegalStateException
+| rhyme          | frequency count | rhyme quality |
+|----------------+-----------------+---------------|
+| technology     | 318             | 8             |
+| apology        | 68              | 7             |
+| pathology      | 42              | 7             |
+| mythology      | 27              | 7             |
+| psychology     | 24              | 7             |
+| theology       | 23              | 7             |
+| biology        | 20              | 7             |
+| ecology        | 11              | 7             |
+| chronology     | 10              | 7             |
+| astrology      | 9               | 7             |
+| biotechnology  | 8               | 7             |
+| nanotechnology | 5               | 7             |
+| geology        | 3               | 7             |
+| ontology       | 2               | 7             |
+| morphology     | 2               | 7             |
+| seismology     | 1               | 7             |
+| urology        | 1               | 7             |
+| doxology       | 0               | 7             |
+| neurology      | 0               | 7             |
+| hypocrisy      | 723             | 6             |
+| democracy      | 238             | 6             |
[[“rhyme” “frequency count” “rhyme quality”] [“technology” 318 8] [“apology” 68 7] [“pathology” 42 7] [“mythology” 27 7] [“psychology” 24 7] [“theology” 23 7] [“biology” 20 7] [“ecology” 11 7] [“chronology” 10 7] [“astrology” 9 7] [“biotechnology” 8 7] [“nanotechnology” 5 7] [“geology” 3 7] [“ontology” 2 7] [“morphology” 2 7] [“seismology” 1 7] [“urology” 1 7] [“doxology” 0 7] [“neurology” 0 7] [“hypocrisy” 723 6] [“democracy” 238 6] [“atrocity” 224 6] [“philosophy” 181 6] [“equality” 109 6] [“ideology” 105 6]]
+| atrocity       | 224             | 6             |
+| philosophy     | 181             | 6             |
+| equality       | 109             | 6             |
+| ideology       | 105             | 6             |
The data processing code is in https://github.com/eihli/prhyme @@ -1276,9 +1441,8 @@ words can be compared: “Foo” is the same as “foo”.
The primary data structure and algorithms supporting exploration of the data are a Markov Trie
@@ -1326,8 +1490,8 @@ All Trie code is hosted in the git repo located at
-
The functionality to explore and visualize data is baked into the Trie data structure.
@@ -1337,7 +1501,7 @@ The functionality to explore and visualize data is baked into the Trie data stru
By simply viewing the Trie in a Clojure REPL, you can inspect the Trie’s structure.
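To make the idea of inspecting a trie at the REPL concrete, here is a minimal sketch that uses plain nested maps. It is not the tightly packed trie from the repo: `insert-word`, `sketch-trie`, and the `:value` key are hypothetical names chosen for illustration, and the real library's API and printed form may differ.

```clojure
;; A toy trie built from nested maps. Each character is a key and the
;; stored word lives under :value at the node where the word ends.
(defn insert-word
  "Adds word to the nested-map trie, one character per level."
  [trie word]
  (assoc-in trie (conj (vec word) :value) word))

(def sketch-trie
  (reduce insert-word {} ["dog" "dot" "do"]))

;; Looking up a prefix is a walk down the nested maps.
(get-in sketch-trie [\d \o :value])    ;; => "do"
(get-in sketch-trie [\d \o \g :value]) ;; => "dog"

;; The children of a node are simply its keys.
(set (keys (get-in sketch-trie [\d \o]))) ;; => #{\g \t :value}
```

Because the structure is ordinary Clojure data, evaluating `sketch-trie` at the REPL prints the whole tree, which is the kind of inspection the paragraph above describes.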
This interactive query will return a list of rhyming phrases to any word or phrase you enter.
@@ -1527,8 +1691,8 @@ The interactive query for the above can be found at
-
This interactive query will return a list of lyrics completing the given suffix with randomly generated prefixes.
@@ -1630,8 +1794,8 @@ The interactive query for the above can be found at
-
The machine learning method chosen for this software is a Hidden Markov Model.
@@ -1701,11 +1865,19 @@ The algorithm for generating predictions from the HMM is as follows.
The results above show a sample of 10 elements in a 1-to-3-gram trie
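As a rough illustration of what "1-to-3-gram" means here, the sketch below pads a tokenized line with start and end markers and lists every 1-, 2-, and 3-gram. The function names (`pad`, `n-grams`, `one-to-three-grams`) are hypothetical, and padding with a single `<s>` and `</s>` is an assumption based on the sample output; the project's own code may pad and store n-grams differently.

```clojure
(defn pad
  "Wraps a token sequence in start and end markers (assumed: one of each)."
  [tokens]
  (concat ["<s>"] tokens ["</s>"]))

(defn n-grams
  "Returns every run of n adjacent tokens, sliding one token at a time."
  [n tokens]
  (partition n 1 tokens))

(defn one-to-three-grams
  "Returns all 1-, 2-, and 3-grams of the padded token sequence."
  [tokens]
  (let [padded (pad tokens)]
    (mapcat #(n-grams % padded) [1 2 3])))

(one-to-three-grams ["call" "me"])
;; => (("<s>") ("call") ("me") ("</s>")
;;     ("<s>" "call") ("call" "me") ("me" "</s>")
;;     ("<s>" "call" "me") ("call" "me" "</s>"))
```

Each of these sequences would become a key path in the trie, with a count stored at the end of the path.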
Since creative brainstorming is the goal, “accuracy” is subjective.
@@ -1850,8 +2022,8 @@ This standardized measure of accuracy can be used to compare different language
Artists/Songwriters place a lot of value in the secrecy of their content. Therefore, all communication with the web-based interface occurs over a secure connection using HTTPS.
@@ -1867,8 +2039,8 @@ With this precaution in place, attackers will not be able to snoop the content t
+You can access an example of the user interface at https://darklimericks.com/wgu.
+
+You’ll see 3 input fields.
+
+The first input field is for a word or phrase for which you wish to find a rhyme. Submitting that field will return three visualizations to help you pick a rhyme.
+
+The first visualization is a scatter plot of rhyming words with the “quality” of the rhyme on the Y axis and the number of times that rhyming word/phrase occurs in the training corpus on the X axis.
+
+
+The second visualization is a word cloud where the size of each word is based on the frequency with which the word appears in the training corpus.
+
+
+The third visualization is a table that lists all of the rhymes, their pronunciations, the rhyme quality, and the frequency. The table is sorted first by the rhyme quality then by the frequency.
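The two-level ordering described for the table can be sketched in a few lines. This is only an illustration: the map keys (`:rhyme`, `:frequency`, `:quality`) and the function name `sort-rhymes` are assumptions, and the rows are a small subset of the sample results for "technology".

```clojure
(def rhymes
  [{:rhyme "apology"    :frequency 68  :quality 7}
   {:rhyme "hypocrisy"  :frequency 723 :quality 6}
   {:rhyme "technology" :frequency 318 :quality 8}
   {:rhyme "pathology"  :frequency 42  :quality 7}])

(defn sort-rhymes
  "Sorts by rhyme quality (highest first), then by frequency (highest first)."
  [rhymes]
  ;; juxt builds a [quality frequency] sort key; negating both values
  ;; turns the default ascending sort into a descending one.
  (sort-by (juxt (comp - :quality) (comp - :frequency)) rhymes))

(map :rhyme (sort-rhymes rhymes))
;; => ("technology" "apology" "pathology" "hypocrisy")
```

Note that "hypocrisy" lands last despite having the highest frequency, because quality is compared first.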
+
+
Provide rhyming lyric suggestions optionally constrained by syllable count.
+I obtained the dataset from http://darklyrics.com.
+
+The code that I used to download all of the lyrics is at https://github.com/eihli/prhyme/blob/master/src/com/owoga/corpus/darklyrics.clj.
+
+In the interest of being nice to the owners of http://darklyrics.com, I’m keeping the files containing the lyrics private.
+
+The trained data model is available.
+
See
-See
-See https://github.com/eihli/prhyme
+For example, there is natural language processing code at https://github.com/eihli/prhyme/blob/master/src/com/owoga/prhyme/nlp/core.clj that parses a line into a grammar tree. I wrote several functions to manipulate and aggregate information about the grammar trees that compose the corpus. But I didn’t use any of that information in creation of the n-gram Hidden Markov Model nor in the user display. For tasks related to brainstorming rhyming lyrics, that extra information lacked significant value.
-See visualization of rhyme suggestion in action.
+
-See perplexity?
+
+
+
-See visualization of smoothing technique.
+It’s difficult to objectively test the model’s accuracy since the goal of “brainstorm new lyrics” is such a subjective one. A valid test of that goal would require many human subjects to subjectively evaluate their performance while using the tool compared to their performance without the tool.
-See wordcloud
+If we allow ourselves the assumption that the closer a generated phrase is to a valid English sentence, the better it is at helping a songwriter brainstorm, then one objective assessment measure can be the percentage of generated lyrics that are valid English sentences.
+We can use Apache OpenNLP to parse sentences into a grammar structure conforming to the parts of speech specified by the University of Pennsylvania’s Treebank Project.
+
+If OpenNLP parses a line of text into a “simple declarative clause” from the Treebank Tag Set, as described here, then we consider it a valid sentence.
+
+Using this technique on a (small) sample of 100 generated sentences reveals that ~47 are valid.
+
+This is just one of many possible assessment techniques we could use. It’s simple but could be expanded to include valid phrases other than Treebank’s clauses. For the purpose of having a measurement by which to compare changes to the algorithm, this suffices.
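Since the stated purpose is to compare changes to the algorithm, it may help to compute the measure for any sample size and any validity check. The following is a sketch under that assumption: `proportion-valid` is a hypothetical helper, and `ends-with-punctuation?` is a deliberately trivial stand-in predicate, not the OpenNLP-based check.

```clojure
(require '[clojure.string :as string])

(defn proportion-valid
  "Returns the fraction of lines for which valid? is truthy.
  Returns 0 for an empty sample to avoid dividing by zero."
  [valid? lines]
  (if (empty? lines)
    0
    (/ (count (filter valid? lines)) (count lines))))

;; Stand-in predicate for illustration only.
(defn ends-with-punctuation?
  [line]
  (boolean (re-find #"[.!?]$" (string/trim line))))

(proportion-valid ends-with-punctuation?
                  ["This is a line." "no punctuation here" "Really?" "nope"])
;; => 1/2
```

Passing a different predicate, such as one that also accepts other Treebank phrase types, would give a second measurement over the same sample without changing the counting code.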
+
-• assessment of the product’s accuracy
+Where
-• the results from the data product testing, revisions, and optimization based on the provided plans, including screenshots
+My language of choice for this project encourages a programming technique or paradigm known as REPL-driven development. REPL stands for Read-Eval-Print-Loop. This is a way to write and test code in real-time without a compilation step. Individual code chunks can be evaluated inside an editor, resulting in rapid feedback.
+
+Therefore, many “tests” exist as comments immediately following the code under test. For example:
+
+The code inside that comment can be evaluated with a simple keystroke while
+inside an editor. It serves as both a test and a form of documentation, as you
+can see the input and the expected output.
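A comment-style check like the one above can be promoted to an automated test with `clojure.test`, which ships with Clojure. The sketch below is self-contained and does not use the project's `util` or `stress-manip` namespaces: `primary-stress?`, `strip-stress`, and `perfect-rhyme-sketch` are hypothetical stand-ins, and the phone sequence for "technology" is assumed to follow the CMU Pronouncing Dictionary convention of a trailing stress digit.

```clojure
(require '[clojure.string :as string]
         '[clojure.test :refer [deftest is run-tests]])

(defn primary-stress?
  "True when the phone carries the primary-stress marker (a trailing 1)."
  [phone]
  (string/ends-with? phone "1"))

(defn strip-stress
  "Removes any stress digit from a phone: \"AH0\" becomes \"AH\"."
  [phone]
  (string/replace phone #"\d" ""))

(defn perfect-rhyme-sketch
  "Returns the phones from the last primary-stressed phone to the end,
  keeping the stress marker only on the first phone returned.
  Returns nil for empty input; input without primary stress is kept whole."
  [phones]
  (let [idx  (->> phones
                  (map-indexed vector)
                  (filter (comp primary-stress? second))
                  last
                  first)
        tail (if idx (drop idx phones) phones)]
    (when (seq tail)
      (cons (first tail) (map strip-stress (rest tail))))))

(deftest perfect-rhyme-sketch-test
  (is (= ["AA1" "L" "AH" "JH" "IY"]
         (perfect-rhyme-sketch
          ["T" "EH0" "K" "N" "AA1" "L" "AH0" "JH" "IY0"]))))

(run-tests)
```

The `(comment ...)` form remains useful for exploration; the `deftest` simply records the same input and expected output so it can be rerun automatically.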
+
+Supporting libraries have a more robust test suite, since their purpose is to be used more widely across other projects with contributions accepted from anyone.
+
+Here is an example of the test suite for the code related to syllabification: https://github.com/eihli/phonetics/blob/main/test/com/owoga/phonetics/syllabify_test.clj.
+This is the data structure that backs the Hidden Markov Model.
+
-• source code and executable file(s)
+https://github.com/eihli/clj-tightly-packed-trie
+This is the helper library that syllabifies and manipulates words, phones, and syllables.
+
-• a quick start guide summarizing the steps necessary to install and use the product
+https://github.com/eihli/phonetics
+This library contains code for analyzing rhymes and sentence structure and for manipulating corpora.
+
-http-kit doesn’t support https so no need to bother with keystore stuff like you would with jetty. Just proxy from haproxy.
+This application is not publicly available. I’ll upload it with submission of the project.
Created: 2021-07-20 Tue 16:38 Created: 2021-07-22 Thu 16:365.6 Data Visualization Functionalities For Data Exploration And Inspection
+5.6 Data Visualization Functionalities For Data Exploration And Inspection
+
(let [initialized-trie (->> (trie/make-trie "dog" "dog" "dot" "dot" "do" "do"))]
initialized-trie)
;; => {(\d \o \g) "dog", (\d \o \t) "dot", (\d \o) "do", (\d) nil}
@@ -1379,12 +1543,12 @@ The Hidden Markov Model data structure doesn’t lend itself to any useful g
5.7 Implementation Of Interactive Queries
+5.7 Implementation Of Interactive Queries
5.7.1 Generate Rhyming Lyrics
+5.7.1 Generate Rhyming Lyrics
5.7.2 Complete Lyric Containing Suffix
+5.7.2 Complete Lyric Containing Suffix
5.8 Implementation Of Machine Learning Methods
+5.8 Implementation Of Machine Learning Methods
-class java.lang.IllegalStateException
+
+[(("<s>" "call" "me")
+ ("<s>" "call")
+ ("<s>" "right" "</s>")
+ ("<s>" "right")
+ ("<s>" "that's" "proportional")
+ ("<s>" "that's")
+ ("<s>" "don't" "</s>")
+ ("<s>" "don't")
+ ("<s>" "yourself" "in")
+ ("<s>" "yourself"))]
-
5.9 Functionalities To Evaluate The Accuracy Of The Data Product
+5.9 Functionalities To Evaluate The Accuracy Of The Data Product
5.10 Security Features
+5.10 Security Features
5.11 TODO Tools To Monitor And Maintain The Product
+5.11 TODO Tools To Monitor And Maintain The Product
5.12 TODO A User-Friendly, Functional Dashboard That Includes At Least Three Visualization Types
+5.12 A User-Friendly, Functional Dashboard That Includes At Least Three Visualization Types
-
+6.1 Business Vision
+6.1 Business Vision
6.1.1 Requirements
+6.1.1 Requirements
-
[ ]
Given a word or phrase, suggest rhymes (ranked by quality) (Trie)[ ]
Given a word or phrase, suggest lyric completion (Hidden Markov Model)
+[X]
Given a word or phrase, suggest rhymes (ranked by quality) (Trie)[-]
Given a word or phrase, suggest lyric completion (Hidden Markov Model)
-
[ ]
Restrict suggestion by syllable count[ ]
Restrict suggestion by rhyme quality[ ]
Show graph of suggestions with perplexity on one axis and rhyme quality on the other[ ]
(Future iteration) Restrict suggestion by syllable count[X]
Sort suggestions by frequency of occurrence in training corpus[X]
Sort suggestions by rhyme quality[ ]
(Future iteration) Show graph of suggestions with perplexity on one axis and rhyme quality on the other6.2 Data Sets
+6.2 Data Sets
resources/darklyrics-markov.tpt
6.3 Data Analysis
+6.3 Data Analysis
src/com/owoga/darklyrics/core.clj
+I wrote code to perform certain types of data analysis, but I didn’t find it useful to meet the business requirements of this project.
6.4 Assessment
-6.4 Assessment
+6.5 Visualizations
+6.5 Visualizations
-6.6 Accuracy
+6.6.1 Percentage Of Generated Lines That Are Valid English Sentences
+(require '[com.darklimericks.linguistics.core :as linguistics]
+ '[com.owoga.prhyme.nlp.core :as nlp])
+
+;; wgu-lyric-suggestions returns 20 suggestions. Each suggestion is a vector of
+;; the rhyming word/quality/frequency and the sentence/parse. sample-of-20
+;; returns just the sentences. The sentences can be further filtered using
+;; OpenNLP to only those that are grammatically valid English sentences.
+
+(defn sample-of-20
+ []
+ (->> "technology"
+ linguistics/wgu-lyric-suggestions
+ (map (comp first second))))
+
+(defn average-valid-of-100-suggestions []
+ (let [generated-suggestions (apply concat (repeatedly 5 sample-of-20))
+ valid-english (filter nlp/valid-sentence? generated-suggestions)]
+ (/ (count valid-english) 100)))
+
+(println (average-valid-of-100-suggestions))
+
6.6 Accuracy
-nlp/valid-sentence?
is defined as follows.
(defn valid-sentence?
+ "Tokenizes and parses the phrase using OpenNLP models from
+ http://opennlp.sourceforge.net/models-1.5/
+
+ If the parse tree has a clause as the top-level tag, then
+ we consider it a valid English sentence."
+ [phrase]
+ (->> phrase
+ tokenize
+ (string/join " ")
+ vector
+ parse
+ first
+ tb/make-tree
+ :chunk
+ first
+ :tag
+ tb2/clauses
+ boolean))
+
+6.7 Testing
+6.7 Testing
(defn perfect-rhyme
+ [phones]
+ (->> phones
+ reverse
+ (util/take-through stress-manip/primary-stress?)
+ first
+ reverse
+ (#(cons (first %)
+ (stress-manip/remove-any-stress-signifiers (rest %))))))
+
+(comment
+ (perfect-rhyme (first (phonetics/get-phones "technology")))
+ ;; => ("AA1" "L" "AH" "JH" "IY")
+ )
+
+6.8 Source
+6.8 Source Code
6.8.1 Tightly Packed Trie
+6.9 Quick Start
-6.8.2 Phonetics
+6.8.3 Rhyming
+7 Notes
-6.8.4 Web Server And User Interface
+6.9 Quick Start
+6.9.1 How To Initialize Development Environment
+
+
+
+
+
+
+./db/run.sh && ./kv/run.sh
to start the docker containers for the database and key-value store.
+
+
run.sh
scripts only need to run once. They initialize development data containers. Subsequent development can continue with docker start db && docker start kv
.dev/user.clj
namespace, and run (init)
http://localhost:8000/wgu
6.9.2 How To Run Software Locally
+
+
+
+
+
+
+
+./db/run.sh && ./kv/run.sh
to start the docker containers for the database and key-value store.
+
+
run.sh
scripts only need to run once. They initialize development data containers. Subsequent development can continue with docker start db && docker start kv
.jar
builds with a make
run from the root directory. (See Makefile).java -jar darklimericks.jar