diff --git a/web/README_WGU.org b/web/README_WGU.org index bf2fdd0..5e576be 100644 --- a/web/README_WGU.org +++ b/web/README_WGU.org @@ -809,7 +809,7 @@ With this precaution in place, attackers will not be able to snoop the content t By having the application server behind an HAProxy load balancer, we can take advantage of the built-in HAProxy stats page for monitoring the amount of traffic and the health of the application servers. -[[file:resources/public/images/stats.png]] +[[file:assets/images/stats.png]] http://darklimericks.com:8404/stats @@ -827,15 +827,15 @@ The first input field is for a word or phrase for which you wish to find a rhyme The first visualization is a scatter plot of rhyming words with the "quality" of the rhyme on the Y axis and the number of times that rhyming word/phrase occurs in the training corpus on the X axis. -[[file:resources/public/images/wgu-vis.png]] +[[file:assets/images/wgu-vis.png]] The second visualization is a word cloud where the size of each word is based on the frequency with which the word appears in the training corpus. -[[file:resources/public/images/wgu-vis-cloud.png]] +[[file:assets/images/wgu-vis-cloud.png]] The third visualization is a table that lists all of the rhymes, their pronunciations, the rhyme quality, and the frequency. The table is sorted first by the rhyme quality then by the frequency. -[[file:resources/public/images/wgu-vis-table.png]] +[[file:assets/images/wgu-vis-table.png]] * D. Documentation :PROPERTIES: @@ -875,16 +875,61 @@ I wrote code to perform certain types of data analysis, but I didn't find it use For example, there is natural language processing code at [[https://github.com/eihli/prhyme/blob/master/src/com/owoga/prhyme/nlp/core.clj]] that parses a line into a grammar tree. I wrote several functions to manipulate and aggregate information about the grammar trees that compose the corpus. But I didn't use any of that information in the creation of the n-gram Hidden Markov Model or in the user display.
For tasks related to brainstorming rhyming lyrics, that extra information lacked significant value. -** Assessment +** Assessment Of Hypothesis +I'll use an example output to subjectively assess the results of the project. + +Below are some of the lyrics suggested to rhyme with the word "technologies". + +| Rhyme | Quality | Lyric | Perplexity | +| technologies | 8 | you will tear the skin from the nuclear technologies | -0.04695091652785746 | +| pathologies | 7 | there's no hope for body's pathologies | -0.09800371561934312 | +| apologies | 7 | swimming in a grey world dying it's time for apologies | -0.14781111654643642 | +| chronologies | 7 | damn god damn the seed lurks in chronologies | -0.20912909334441387 | +| anomalies | 6 | yesterday was born i encounter the anomalies | -0.19578505194217627 | +| atrocities | 6 | there's no return and and the pimp your atrocities | -0.21516240668167685 | +| ideologies | 6 | entrenched ideologies | -0.27407234083849513 | +| monopolies | 6 | monopolies | -0.8472654185540912 | +| qualities | 5 | with such qualities | -0.0793752454750395 | +| policies | 5 | stop looking at insurance policies | -0.11580898408112054 | +| colonies | 5 | betwixt my heels, through the tears you collapse the colonies | -0.1610184959356118 | +| harmonies | 5 | broken harmonies | -0.18655087962492334 | +| prophecies | 5 | seek the truth prophecies | -0.24506696021938001 | +| festivities | 4 | you have touching the festivities | -0.09271388814221376 | +| delicacies | 4 | grey that consumes what it never was sun and the delicacies | -0.14553081854920977 | +| anybody's | 4 | your eyes, will remain violent the anybody's | -0.17560987263626957 | +| extremities | 4 | i am missing extremities | -0.30386279996641197 | +| casualties | 3 | feed the casualties | -0.23600199637494926 | + +Do these lyrics provide benefit to the brainstorming process? + +The lines "make sense" to varying degrees. 
+ +The "pathologies" line, for example, contains a sensible 2-gram of "body's pathologies". The model has learned that the possessive form of "body" is a reasonable prefix to the word "pathologies". + +| pathologies | 7 | there's no hope for body's pathologies | -0.09800371561934312 | + +And the beginning of that line contains a phrase, "there's no hope", that fits perfectly with the genre/context of the training set (dark heavy metal). + +It's clear that the training worked. The output is relevant to the genre and grammatically reasonable. + +There's also a wide variety in the output, which is beneficial for +brainstorming. Suggestions range from clean and clear rhymes, like +"technologies" and "pathologies", to more abstract rhymes like "technologies" +and "anybody's", which some artists can creatively manipulate effectively. + +I assess that this version of the product is viable, and there are exciting +possibilities for improvement, such as making suggestions that meet +certain stress patterns, preferring phrases that contain synonyms or antonyms, +and more. ** Visualizations -[[file:resources/public/images/rhyme-scatterplot.png]] +[[file:assets/images/rhyme-scatterplot.png]] -[[file:resources/public/images/wordcloud.png]] +[[file:assets/images/wordcloud.png]] -[[file:resources/public/images/rhyme-table.png]] +[[file:assets/images/rhyme-table.png]] ** Accuracy @@ -902,7 +947,7 @@ Using this technique on a (small) sample of 100 generated sentences reveals that This is just one of many possible assessment techniques we could use. It's simple but could be expanded to include valid phrases other than Treebank's clauses. For the purpose of having a measurement by which to compare changes to the algorithm, this suffices.
-#+begin_src clojure :session main :eval no-export :results output +#+begin_src clojure :session main :eval no-export :results output :exports both (require '[com.darklimericks.linguistics.core :as linguistics] '[com.owoga.prhyme.nlp.core :as nlp]) @@ -923,6 +968,7 @@ This is just one of many possible assessment techniques we could use. It's simpl (/ (count valid-english) 100))) (println (average-valid-of-100-suggestions)) +;; 47/100 #+end_src #+RESULTS: diff --git a/web/resources/public/README_WGU.htm b/web/resources/public/README_WGU.htm index a14b6a7..322e519 100644 --- a/web/resources/public/README_WGU.htm +++ b/web/resources/public/README_WGU.htm @@ -3,7 +3,7 @@ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
- +Hello! I hope you enjoy your time with this evaluation! @@ -361,20 +361,20 @@ After I describe the steps to initialize a development environment, you’ll
It’s probably not necessary for you to replicate my development environment in order to evaluate this project. You can access the deployed application at https://darklimericks.com/wgu and the libraries and supporting code that I wrote for this project at https://github.com/eihli/clj-tightly-packed-trie, https://github.com/eihli/syllabify, and https://github.com/eihli/prhyme. The web server and web application are not hosted publicly, but you will find them uploaded with my submission as a .tar
archive.
./db/run.sh && ./kv/run.sh
to start the docker containers for the database and key-value store.
@@ -399,12 +399,12 @@ It’s probably not necessary for you to replicate my development environmen
./db/run.sh && ./kv/run.sh
to start the docker containers for the database and key-value store.
@@ -435,8 +435,8 @@ It’s probably not necessary for you to replicate my development environmen
Songwriters, artists, and record labels can save time and discover better lyrics with the help of a machine learning tool that supports their creative endeavours. @@ -448,8 +448,8 @@ Songwriters have several old-fashioned tools at their disposal including diction
How many sensible phrases can you think of that rhyme with “war on poverty”? What if I say that there’s a restriction to only come up with phrases that are exactly 14 syllables? That’s a common restriction when a songwriter is trying to match the meter of a previous line. What if I add another restriction that there must be primary stress at certain spots in that 14 syllable phrase? @@ -465,8 +465,8 @@ And this is a process that is perfect for machine learning. Machine learning can
RhymeStorm™ is a tool to help songwriters brainstorm. It provides lyrics automatically generated based on training data from existing songs while adhering to restrictions based on rhyme scheme, meter, genre, and more. @@ -494,8 +494,8 @@ This auto-complete functionality will be similar to the auto-complete that is co
The initial model will be trained on the lyrics from http://darklyrics.com. This is a publicly available data set with minimal meta-data. Record labels will have more valuable datasets that will include meta-data along with lyrics, such as the date the song was popular, the number of radio plays of the song, the profit of the song/artist, etc… @@ -507,8 +507,8 @@ The software can be augmented with additional algorithms to account for the type
This software will accomplish its primary objective if it makes its way into the daily toolkit of a handful of singers/songwriters. @@ -528,8 +528,8 @@ Another example is the package that turns phrases into phones (symbols of pronun
This project will be developed with an iterative Agile methodology. Since a large part of data science and machine learning is exploration, this project will benefit from ongoing exploration in tandem with development. @@ -545,8 +545,8 @@ The prices quoted below are for an initial minimum-viable-product that will serv
Funding requirements are minimal. The initial dataset is public and freely available. On a typical consumer laptop, Hidden Markov Models can be trained on fairly large datasets in short time and the training doesn’t require the use of expensive hardware like the GPUs used to train Deep Neural Networks. @@ -630,17 +630,17 @@ These are my estimates for the time and cost of different aspects of initial dev
Web scraping, the method used to obtain the initial dataset from http://darklyrics.com, is protected given the ruling in https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn. @@ -652,8 +652,8 @@ The use of publicly available data in generative works is less clear. But Micros
I have 10 years of experience as a programmer and have worked extensively on frontend technologies like HTML/JavaScript, on backend technologies like Django, and on building libraries/packages/frameworks. @@ -674,8 +674,8 @@ Write an executive summary directed to IT professionals that addresses each of t
Songwriters expend a lot of time and effort finding the perfect rhyming word or phrase. RhymeStorm™ is going to amplify users’ creative abilities by searching its machine learning model for sensible and proven-successful words and phrases that meet the rhyme scheme and meter requirements requested by the user. @@ -687,8 +687,8 @@ When a songwriter needs to find likely phrases that rhyme with “war on pov
Songwriters spend money on dictionaries, compilations of slang, thesauruses, and phrase dictionaries. They spend their time daydreaming, brainstorming, contemplating, and mixing and matching the knowledge they acquire through these traditional means. @@ -708,8 +708,8 @@ Computers can process and sort this information and sort the results by quality
We’re all familiar with dictionaries, thesauruses, and their shortcomings. @@ -725,8 +725,8 @@ RhymeZone is limited in its capability. It doesn’t do well finding rhymes
The initial dataset will be gathered by downloading lyrics from http://darklyrics.com and future models can be generated by downloading lyrics from other websites. Alternatively, data can be provided by record labels and combined with meta-data that the record label may have, such as how many radio plays each song gets and how much profit they make from each song. @@ -750,8 +750,8 @@ Each new model can be uploaded to the web server and users can select which mode
RhymeStorm™ development will proceed with an iterative Agile methodology. It will be composed of several independent modules that can be worked on independently, in parallel, and iteratively. @@ -775,8 +775,8 @@ Much of data science is exploratory and taking an iterative Agile approach can t
I’ll start by writing and releasing the supporting libraries and packages: Tries, Syllabification/Phonetics, Rhyming. @@ -831,8 +831,8 @@ In anticipation of user growth, I’ll be deploying the final product on Dig
the methods for validating and verifying that the developed data product meets the requirements and subsequently the needs of the customers @@ -852,8 +852,8 @@ The final website will integrate multiple technologies and the integrations won&
the programming environments and any related costs, as well as the human resources that are necessary to execute each phase in the development of the data product @@ -877,8 +877,8 @@ All code was written and all models were trained on a Lenovo T15G with an Intel