<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<!-- 2021-07-23 Fri 17:16 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>RhymeStorm™ - RHYMESTORM CSCI Capstone Project</title>
<meta name="author" content="Eric Ihli" />
<meta name="generator" content="Org Mode" />
<style type="text/css">
<!--/*--><![CDATA[/*><!--*/
.title { text-align: center;
margin-bottom: .2em; }
.subtitle { text-align: center;
font-size: medium;
font-weight: bold;
margin-top:0; }
.todo { font-family: monospace; color: red; }
.done { font-family: monospace; color: green; }
.priority { font-family: monospace; color: orange; }
.tag { background-color: #eee; font-family: monospace;
padding: 2px; font-size: 80%; font-weight: normal; }
.timestamp { color: #bebebe; }
.timestamp-kwd { color: #5f9ea0; }
.org-right { margin-left: auto; margin-right: 0px; text-align: right; }
.org-left { margin-left: 0px; margin-right: auto; text-align: left; }
.org-center { margin-left: auto; margin-right: auto; text-align: center; }
.underline { text-decoration: underline; }
#postamble p, #preamble p { font-size: 90%; margin: .2em; }
p.verse { margin-left: 3%; }
pre {
border: 1px solid #ccc;
box-shadow: 3px 3px 3px #eee;
padding: 8pt;
font-family: monospace;
overflow: auto;
margin: 1.2em;
}
pre.src {
position: relative;
overflow: auto;
padding-top: 1.2em;
}
pre.src:before {
display: none;
position: absolute;
background-color: white;
top: -10px;
right: 10px;
padding: 3px;
border: 1px solid black;
}
pre.src:hover:before { display: inline; margin-top: 14px;}
/* Languages per Org manual */
pre.src-asymptote:before { content: 'Asymptote'; }
pre.src-awk:before { content: 'Awk'; }
pre.src-C:before { content: 'C'; }
/* pre.src-C++ doesn't work in CSS */
pre.src-clojure:before { content: 'Clojure'; }
pre.src-css:before { content: 'CSS'; }
pre.src-D:before { content: 'D'; }
pre.src-ditaa:before { content: 'ditaa'; }
pre.src-dot:before { content: 'Graphviz'; }
pre.src-calc:before { content: 'Emacs Calc'; }
pre.src-emacs-lisp:before { content: 'Emacs Lisp'; }
pre.src-fortran:before { content: 'Fortran'; }
pre.src-gnuplot:before { content: 'gnuplot'; }
pre.src-haskell:before { content: 'Haskell'; }
pre.src-hledger:before { content: 'hledger'; }
pre.src-java:before { content: 'Java'; }
pre.src-js:before { content: 'Javascript'; }
pre.src-latex:before { content: 'LaTeX'; }
pre.src-ledger:before { content: 'Ledger'; }
pre.src-lisp:before { content: 'Lisp'; }
pre.src-lilypond:before { content: 'Lilypond'; }
pre.src-lua:before { content: 'Lua'; }
pre.src-matlab:before { content: 'MATLAB'; }
pre.src-mscgen:before { content: 'Mscgen'; }
pre.src-ocaml:before { content: 'Objective Caml'; }
pre.src-octave:before { content: 'Octave'; }
pre.src-org:before { content: 'Org mode'; }
pre.src-oz:before { content: 'OZ'; }
pre.src-plantuml:before { content: 'Plantuml'; }
pre.src-processing:before { content: 'Processing.js'; }
pre.src-python:before { content: 'Python'; }
pre.src-R:before { content: 'R'; }
pre.src-ruby:before { content: 'Ruby'; }
pre.src-sass:before { content: 'Sass'; }
pre.src-scheme:before { content: 'Scheme'; }
pre.src-screen:before { content: 'Gnu Screen'; }
pre.src-sed:before { content: 'Sed'; }
pre.src-sh:before { content: 'shell'; }
pre.src-sql:before { content: 'SQL'; }
pre.src-sqlite:before { content: 'SQLite'; }
/* additional languages in org.el's org-babel-load-languages alist */
pre.src-forth:before { content: 'Forth'; }
pre.src-io:before { content: 'IO'; }
pre.src-J:before { content: 'J'; }
pre.src-makefile:before { content: 'Makefile'; }
pre.src-maxima:before { content: 'Maxima'; }
pre.src-perl:before { content: 'Perl'; }
pre.src-picolisp:before { content: 'Pico Lisp'; }
pre.src-scala:before { content: 'Scala'; }
pre.src-shell:before { content: 'Shell Script'; }
pre.src-ebnf2ps:before { content: 'ebfn2ps'; }
/* additional language identifiers per "defun org-babel-execute"
in ob-*.el */
pre.src-cpp:before { content: 'C++'; }
pre.src-abc:before { content: 'ABC'; }
pre.src-coq:before { content: 'Coq'; }
pre.src-groovy:before { content: 'Groovy'; }
/* additional language identifiers from org-babel-shell-names in
ob-shell.el: ob-shell is the only babel language using a lambda to put
the execution function name together. */
pre.src-bash:before { content: 'bash'; }
pre.src-csh:before { content: 'csh'; }
pre.src-ash:before { content: 'ash'; }
pre.src-dash:before { content: 'dash'; }
pre.src-ksh:before { content: 'ksh'; }
pre.src-mksh:before { content: 'mksh'; }
pre.src-posh:before { content: 'posh'; }
/* Additional Emacs modes also supported by the LaTeX listings package */
pre.src-ada:before { content: 'Ada'; }
pre.src-asm:before { content: 'Assembler'; }
pre.src-caml:before { content: 'Caml'; }
pre.src-delphi:before { content: 'Delphi'; }
pre.src-html:before { content: 'HTML'; }
pre.src-idl:before { content: 'IDL'; }
pre.src-mercury:before { content: 'Mercury'; }
pre.src-metapost:before { content: 'MetaPost'; }
pre.src-modula-2:before { content: 'Modula-2'; }
pre.src-pascal:before { content: 'Pascal'; }
pre.src-ps:before { content: 'PostScript'; }
pre.src-prolog:before { content: 'Prolog'; }
pre.src-simula:before { content: 'Simula'; }
pre.src-tcl:before { content: 'tcl'; }
pre.src-tex:before { content: 'TeX'; }
pre.src-plain-tex:before { content: 'Plain TeX'; }
pre.src-verilog:before { content: 'Verilog'; }
pre.src-vhdl:before { content: 'VHDL'; }
pre.src-xml:before { content: 'XML'; }
pre.src-nxml:before { content: 'XML'; }
/* add a generic configuration mode; LaTeX export needs an additional
(add-to-list 'org-latex-listings-langs '(conf " ")) in .emacs */
pre.src-conf:before { content: 'Configuration File'; }
table { border-collapse:collapse; }
caption.t-above { caption-side: top; }
caption.t-bottom { caption-side: bottom; }
td, th { vertical-align:top; }
th.org-right { text-align: center; }
th.org-left { text-align: center; }
th.org-center { text-align: center; }
td.org-right { text-align: right; }
td.org-left { text-align: left; }
td.org-center { text-align: center; }
dt { font-weight: bold; }
.footpara { display: inline; }
.footdef { margin-bottom: 1em; }
.figure { padding: 1em; }
.figure p { text-align: center; }
.equation-container {
display: table;
text-align: center;
width: 100%;
}
.equation {
vertical-align: middle;
}
.equation-label {
display: table-cell;
text-align: right;
vertical-align: middle;
}
.inlinetask {
padding: 10px;
border: 2px solid gray;
margin: 10px;
background: #ffffcc;
}
#org-div-home-and-up
{ text-align: right; font-size: 70%; white-space: nowrap; }
textarea { overflow-x: auto; }
.linenr { font-size: smaller }
.code-highlighted { background-color: #ffff00; }
.org-info-js_info-navigation { border-style: none; }
#org-info-js_console-label
{ font-size: 10px; font-weight: bold; white-space: nowrap; }
.org-info-js_search-highlight
{ background-color: #ffff00; color: #000000; font-weight: bold; }
.org-svg { width: 90%; }
/*]]>*/-->
</style>
<script type="text/javascript">
// @license magnet:?xt=urn:btih:e95b018ef3580986a04669f1b5879592219e2a7a&dn=public-domain.txt Public Domain
<!--/*--><![CDATA[/*><!--*/
function CodeHighlightOn(elem, id)
{
var target = document.getElementById(id);
if(null != target) {
elem.classList.add("code-highlighted");
target.classList.add("code-highlighted");
}
}
function CodeHighlightOff(elem, id)
{
var target = document.getElementById(id);
if(null != target) {
elem.classList.remove("code-highlighted");
target.classList.remove("code-highlighted");
}
}
/*]]>*///-->
// @license-end
</script>
</head>
<body>
<div id="content">
<h1 class="title">RhymeStorm™ - RHYMESTORM CSCI Capstone Project</h1>
<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
<li><a href="#orgd38ae05">1. RHYMESTORM Evaluator Notes</a></li>
<li><a href="#orgc9caa35">2. Evaluation Technical Documentation</a>
<ul>
<li><a href="#org4f30257">2.1. How To Initialize Development Environment</a>
<ul>
<li><a href="#orgf188cf0">2.1.1. Required Software</a></li>
<li><a href="#orgf197a7a">2.1.2. Steps</a></li>
</ul>
</li>
<li><a href="#org853b767">2.2. How To Run Software Locally</a>
<ul>
<li><a href="#org6ee0f4f">2.2.1. Requirements</a></li>
<li><a href="#orgedf3725">2.2.2. Steps</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#letter-of-transmittal">3. A. Letter Of Transmittal</a>
<ul>
<li><a href="#org5ee843e">3.1. Problem Summary</a></li>
<li><a href="#org890f8eb">3.2. Benefits</a></li>
<li><a href="#orgf459037">3.3. Product - RhymeStorm™</a></li>
<li><a href="#org853971e">3.4. Data</a></li>
<li><a href="#org1e5fdb9">3.5. Objectives</a></li>
<li><a href="#org8b76d5a">3.6. Development Methodology - Agile</a></li>
<li><a href="#orgc9a5773">3.7. Costs</a></li>
<li><a href="#orgc102cbe">3.8. Stakeholder Impact</a></li>
<li><a href="#org63d5a71">3.9. Ethical And Legal Considerations</a></li>
<li><a href="#orge7ed6b6">3.10. Expertise</a></li>
</ul>
</li>
<li><a href="#executive-summary">4. B. Executive Summary - RhymeStorm™ Technical Notes And Requirements</a>
<ul>
<li><a href="#org0ffe6ee">4.1. Decision Support Opportunity</a></li>
<li><a href="#org24903e6">4.2. Customer Needs And Product Description</a></li>
<li><a href="#orgc7e0d50">4.3. Existing Products</a></li>
<li><a href="#orgd471480">4.4. Available Data And Future Data Lifecycle</a></li>
<li><a href="#org46d6de3">4.5. Methodology - Agile</a></li>
<li><a href="#orga321efb">4.6. Deliverables</a></li>
<li><a href="#orgada24b3">4.7. Implementation Plan And Anticipations</a></li>
<li><a href="#org8467485">4.8. Requirements Validation And Verification</a></li>
<li><a href="#orga48f74d">4.9. Programming Environments And Costs</a></li>
<li><a href="#org1712f4e">4.10. Timeline And Milestones</a></li>
</ul>
</li>
<li><a href="#requirements-documentation">5. C. RhymeStorm™ Capstone Requirements Documentation</a>
<ul>
<li><a href="#orgda35db8">5.1. Descriptive And Predictive Methods</a>
<ul>
<li><a href="#orgab98aaf">5.1.1. Descriptive Method</a></li>
<li><a href="#orgc07d72f">5.1.2. Prescriptive Method</a></li>
</ul>
</li>
<li><a href="#org8f499c5">5.2. Datasets</a></li>
<li><a href="#org2d4eaec">5.3. Decision Support Functionality</a>
<ul>
<li><a href="#org7c927a3">5.3.1. Choosing Words For A Lyric Based On Markov Likelihood</a></li>
<li><a href="#org0a51a02">5.3.2. Choosing Words To Complete A Lyric Based On Rhyme Quality</a></li>
</ul>
</li>
<li><a href="#orgc667065">5.4. Featurizing, Parsing, Cleaning, And Wrangling Data</a></li>
<li><a href="#org6b7a95d">5.5. Data Exploration And Preparation</a></li>
<li><a href="#org1d3435f">5.6. Data Visualization Functionalities For Data Exploration And Inspection</a></li>
<li><a href="#orgec327c6">5.7. Implementation Of Interactive Queries</a>
<ul>
<li><a href="#org92a52fa">5.7.1. Generate Rhyming Lyrics</a></li>
<li><a href="#org4eb310c">5.7.2. Complete Lyric Containing Suffix</a></li>
</ul>
</li>
<li><a href="#org875011a">5.8. Implementation Of Machine Learning Methods</a></li>
<li><a href="#org5824f12">5.9. Functionalities To Evaluate The Accuracy Of The Data Product</a></li>
<li><a href="#org88dc329">5.10. Security Features</a></li>
<li><a href="#org613bd8f">5.11. Tools To Monitor And Maintain The Product</a></li>
<li><a href="#orgc6266b7">5.12. A User-Friendly, Functional Dashboard That Includes At Least Three Visualization Types</a></li>
</ul>
</li>
<li><a href="#remaining-documentation">6. D. Documentation</a>
<ul>
<li><a href="#org9df4605">6.1. Business Vision</a>
<ul>
<li><a href="#orga3bdd1c">6.1.1. Requirements</a></li>
</ul>
</li>
<li><a href="#orgd136d58">6.2. Data Sets</a></li>
<li><a href="#orgf736042">6.3. Data Analysis</a></li>
<li><a href="#org407721c">6.4. Assessment Of Hypothesis</a></li>
<li><a href="#org2d951c6">6.5. Visualizations</a></li>
<li><a href="#org60086e9">6.6. Accuracy</a>
<ul>
<li><a href="#orgd2e3d30">6.6.1. Percentage Of Generated Lines That Are Valid English Sentences</a></li>
</ul>
</li>
<li><a href="#org8d29ef2">6.7. Testing</a></li>
<li><a href="#orgbcd20cb">6.8. Source Code</a>
<ul>
<li><a href="#orgb5bde0d">6.8.1. Tightly Packed Trie</a></li>
<li><a href="#org68009bd">6.8.2. Phonetics</a></li>
<li><a href="#org615c902">6.8.3. Rhyming</a></li>
<li><a href="#org8ffc320">6.8.4. Web Server And User Interface</a></li>
</ul>
</li>
<li><a href="#org9010313">6.9. Quick Start</a>
<ul>
<li><a href="#org00f3e76">6.9.1. How To Initialize Development Environment</a></li>
<li><a href="#org7cd2611">6.9.2. How To Run Software Locally</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#orgffa2fb6">7. Citations</a></li>
</ul>
</div>
</div>
<div id="outline-container-orgd38ae05" class="outline-2">
<h2 id="orgd38ae05"><span class="section-number-2">1</span> RHYMESTORM Evaluator Notes</h2>
<div class="outline-text-2" id="text-1">
<p>
Hello! I hope you enjoy your time with this evaluation!
</p>
<p>
Here’s a quick introduction to help you navigate this project.
</p>
<p>
The document you are reading now contains or points to each of the requirements listed at the course task overview page for C964.
</p>
<p>
The section immediately following this contains notes on how to view and run the software locally. In addition, I’m hosting a demo of the application at <a href="https://darklimericks.com/rhymestorm">https://darklimericks.com/rhymestorm</a>.
</p>
<p>
After I describe the steps to initialize a development environment, you’ll find a <a href="#letter-of-transmittal">Letter Of Transmittal</a>, <a href="#executive-summary">Technical Executive Summary</a>, <a href="#requirements-documentation">links to the final product and details of how it meets each requirement</a>, and the <a href="#remaining-documentation">remaining required documentation</a>.
</p>
</div>
</div>
<div id="outline-container-orgc9caa35" class="outline-2">
<h2 id="orgc9caa35"><span class="section-number-2">2</span> Evaluation Technical Documentation</h2>
<div class="outline-text-2" id="text-2">
<p>
It’s probably not necessary for you to replicate my development environment in order to evaluate this project. You can access the deployed application at <a href="https://darklimericks.com/rhymestorm">https://darklimericks.com/rhymestorm</a> and the libraries and supporting code that I wrote for this project at <a href="https://github.com/eihli/clj-tightly-packed-trie">https://github.com/eihli/clj-tightly-packed-trie</a>, <a href="https://github.com/eihli/syllabify">https://github.com/eihli/syllabify</a>, and <a href="https://github.com/eihli/prhyme">https://github.com/eihli/prhyme</a>. The code for the web server and web application is not hosted publicly, but you will find it uploaded with my submission as a <code>.tar</code> archive.
</p>
</div>
<div id="outline-container-org4f30257" class="outline-3">
<h3 id="org4f30257"><span class="section-number-3">2.1</span> How To Initialize Development Environment</h3>
<div class="outline-text-3" id="text-2-1">
</div>
<div id="outline-container-orgf188cf0" class="outline-4">
<h4 id="orgf188cf0"><span class="section-number-4">2.1.1</span> Required Software</h4>
<div class="outline-text-4" id="text-2-1-1">
<ul class="org-ul">
<li><a href="https://www.docker.com/">Docker</a></li>
<li><a href="https://clojure.org/releases/downloads">Clojure Version 1.10+</a></li>
<li><a href="https://github.com/clojure-emacs/cider">Emacs and CIDER</a></li>
</ul>
</div>
</div>
<div id="outline-container-orgf197a7a" class="outline-4">
<h4 id="orgf197a7a"><span class="section-number-4">2.1.2</span> Steps</h4>
<div class="outline-text-4" id="text-2-1-2">
<ol class="org-ol">
<li>Run <code>./db/run.sh && ./kv/run.sh</code> to start the docker containers for the database and key-value store.
<ol class="org-ol">
<li>The <code>run.sh</code> scripts only need to run once. They initialize development data containers. Subsequent development can continue with <code>docker start db && docker start kv</code>.</li>
</ol></li>
<li>Start a Clojure REPL in Emacs, evaluate the <code>dev/user.clj</code> namespace, and run <code>(init)</code>.</li>
<li>Visit <code>http://localhost:8000/rhymestorm</code></li>
</ol>
</div>
</div>
</div>
<div id="outline-container-org853b767" class="outline-3">
<h3 id="org853b767"><span class="section-number-3">2.2</span> How To Run Software Locally</h3>
<div class="outline-text-3" id="text-2-2">
</div>
<div id="outline-container-org6ee0f4f" class="outline-4">
<h4 id="org6ee0f4f"><span class="section-number-4">2.2.1</span> Requirements</h4>
<div class="outline-text-4" id="text-2-2-1">
<ul class="org-ul">
<li><a href="https://www.java.com/download/ie_manual.jsp">Java</a></li>
<li><a href="https://www.docker.com/">Docker</a></li>
</ul>
</div>
</div>
<div id="outline-container-orgedf3725" class="outline-4">
<h4 id="orgedf3725"><span class="section-number-4">2.2.2</span> Steps</h4>
<div class="outline-text-4" id="text-2-2-2">
<ol class="org-ol">
<li>Run <code>./db/run.sh && ./kv/run.sh</code> to start the docker containers for the database and key-value store.
<ol class="org-ol">
<li>The <code>run.sh</code> scripts only need to run once. They initialize development data containers. Subsequent development can continue with <code>docker start db && docker start kv</code>.</li>
</ol></li>
<li>Build the application’s <code>jar</code> by running <code>make</code> from the root directory (see the <a href="../Makefile">Makefile</a>).</li>
<li>Navigate to the root directory of this git repo and run <code>java -jar darklimericks.jar</code></li>
<li>Visit <a href="http://localhost:8000/rhymestorm">http://localhost:8000/rhymestorm</a></li>
</ol>
</div>
</div>
</div>
</div>
<div id="outline-container-letter-of-transmittal" class="outline-2">
<h2 id="letter-of-transmittal"><span class="section-number-2">3</span> A. Letter Of Transmittal</h2>
<div class="outline-text-2" id="text-letter-of-transmittal">
</div>
<div id="outline-container-org5ee843e" class="outline-3">
<h3 id="org5ee843e"><span class="section-number-3">3.1</span> Problem Summary</h3>
<div class="outline-text-3" id="text-3-1">
<p>
Songwriters, artists, and record labels can save time and discover better lyrics with the help of a machine learning tool that supports their creative endeavours.
</p>
<p>
Songwriters have several old-fashioned tools at their disposal including dictionaries and thesauruses. But machine learning exposes a new set of powerful possibilities. Using simple machine learning techniques, it is possible to automatically generate vast numbers of lyrics that match specified criteria for rhyming, syllable count, genre, and more.
</p>
</div>
</div>
<div id="outline-container-org890f8eb" class="outline-3">
<h3 id="org890f8eb"><span class="section-number-3">3.2</span> Benefits</h3>
<div class="outline-text-3" id="text-3-2">
<p>
How many sensible phrases can you think of that rhyme with “war on poverty”? What if I say that there’s a restriction to only come up with phrases that are exactly 14 syllables? That’s a common restriction when a songwriter is trying to match the meter of a previous line. What if I add another restriction that there must be primary stress at certain spots in that 14 syllable phrase?
</p>
<p>
This is the process that songwriters go through all day. It’s a process that gets little help from traditional tools like dictionaries and thesauruses.
</p>
<p>
And this is a process that is perfect for machine learning. Machine learning can learn the most likely grammatical structure of phrases and can make predictions about likely words that follow a given sequence of other words. Computers can iterate through millions of words, checking for restrictions on rhyme, syllable count, and more. The most tedious part of lyric generation can be automated with machine learning software, leaving the songwriter free to cherry-pick from the best lyrics and make minor touch-ups to make them perfect.
</p>
</div>
</div>
<div id="outline-container-orgf459037" class="outline-3">
<h3 id="orgf459037"><span class="section-number-3">3.3</span> Product - RhymeStorm™</h3>
<div class="outline-text-3" id="text-3-3">
<p>
RhymeStorm™ is a tool to help songwriters brainstorm. It provides lyrics automatically generated based on training data from existing songs while adhering to restrictions based on rhyme scheme, meter, genre, and more.
</p>
<p>
The machine learning part of the software that I described above can be implemented with a simple machine learning technique known as a Hidden Markov Model.
</p>
<p>
Without getting too technical, the Hidden Markov Model takes an existing lyrics database as input and produces a function that returns the likelihood of a word following a sequence of previous words.
</p>
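<p>
To make the idea of “a function that returns the likelihood of a word following a sequence of previous words” concrete, here is a minimal sketch in Python. It is an illustration only: it uses a plain n-gram Markov model (a simplification), a three-line toy corpus in place of a lyrics database, and hypothetical names (<code>train</code>, <code>likelihood</code>) that are not part of RhymeStorm™.
</p>

```python
from collections import Counter, defaultdict


def train(lines, order=2):
    """Count which word follows each sequence of `order` previous words."""
    counts = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        for i in range(len(words) - order):
            context = tuple(words[i:i + order])
            counts[context][words[i + order]] += 1
    return counts


def likelihood(counts, context, word):
    """Return the estimated probability that `word` follows `context`."""
    followers = counts.get(tuple(context))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())


# Toy stand-in for a lyrics database (assumption for illustration).
lyrics = [
    "into the fire we ride",
    "into the fire we fall",
    "into the night we ride",
]
model = train(lyrics)
print(likelihood(model, ["fire", "we"], "ride"))  # 0.5
```

<p>
In this toy corpus “fire we” is followed once by “ride” and once by “fall”, so each gets a likelihood of one half. A larger corpus gives the same function more meaningful estimates.
</p>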
<p>
Many different programming languages and algorithms are sufficient to handle the other parts of the product, such as splitting a word into phonetic sounds, finding rhymes, and matching stress between phrases.
</p>
<p>
An initial version of the software will be trained on the heavy metal lyrics database at <a href="http://darklyrics.com">http://darklyrics.com</a>. A website will be created where users can type in a “seed” sequence of one or more words, and the model will output a variety of possible completions.
</p>
<p>
This auto-complete functionality will be similar to the auto-complete commonly found in phone keyboard applications, which helps users type faster on touchscreens.
</p>
</div>
</div>
<div id="outline-container-org853971e" class="outline-3">
<h3 id="org853971e"><span class="section-number-3">3.4</span> Data</h3>
<div class="outline-text-3" id="text-3-4">
<p>
The initial model will be trained on the lyrics from <a href="http://darklyrics.com">http://darklyrics.com</a>. This is a publicly available data set with minimal meta-data. Record labels will have more valuable datasets that will include meta-data along with lyrics, such as the date the song was popular, the number of radio plays of the song, the profit of the song/artist, etc…
</p>
<p>
The software can be augmented with additional algorithms to account for the type of meta-data that a record label may have. The augmentations can happen in iterative software development cycles, using Agile methodologies.
</p>
</div>
</div>
<div id="outline-container-org1e5fdb9" class="outline-3">
<h3 id="org1e5fdb9"><span class="section-number-3">3.5</span> Objectives</h3>
<div class="outline-text-3" id="text-3-5">
<p>
This software will accomplish its primary objective if it makes its way into the daily toolkit of a handful of singers/songwriters.
</p>
<p>
Several secondary objectives are also desirable and reasonably expected. The architecture of the software lends itself to existing as several independently useful modules.
</p>
<p>
For example, the <a href="https://en.wikipedia.org/wiki/Hidden_Markov_model">Markov Model</a> (Markov Model 2021) can be conveniently backed by a <a href="https://en.wikipedia.org/wiki/Trie">Trie data structure</a> (Trie 2021). This Trie data structure can be released as its own software package and used in any application that benefits from prefix matching.
</p>
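<p>
As a rough illustration of why a Trie suits prefix matching, the Python sketch below stores word sequences in nested nodes and lists the words that follow a given prefix. The class and method names are hypothetical and the in-memory dictionary layout is an assumption for clarity. It is not the packaged Trie described here.
</p>

```python
class Trie:
    """A minimal trie keyed on sequences of words, storing a count per node."""

    def __init__(self):
        self.children = {}
        self.count = 0

    def insert(self, key):
        """Walk or create one node per element of `key`, then bump the last count."""
        node = self
        for part in key:
            node = node.children.setdefault(part, Trie())
        node.count += 1

    def lookup(self, prefix):
        """Return the node reached by `prefix`, or None if it is absent."""
        node = self
        for part in prefix:
            node = node.children.get(part)
            if node is None:
                return None
        return node

    def completions(self, prefix):
        """Return the words that directly follow `prefix`, with their counts."""
        node = self.lookup(prefix)
        if node is None:
            return {}
        return {part: child.count for part, child in node.children.items()}


trie = Trie()
for trigram in [
    ("into", "the", "fire"),
    ("into", "the", "fire"),
    ("into", "the", "night"),
]:
    trie.insert(trigram)
print(trie.completions(("into", "the")))  # {'fire': 2, 'night': 1}
```

<p>
Because every sequence sharing a prefix shares the same path, looking up the followers of a context costs one step per word in the context, which is the access pattern a Markov model needs.
</p>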
<p>
Another example is the package that turns phrases into phones (symbols of pronunciation). That package can find use for a number of natural language processing and natural language generation tasks, aside from the task required by this particular project.
</p>
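<p>
The sketch below shows, in Python, what turning a phrase into phones makes possible. The three-word pronunciation table is hand-written in ARPAbet-style notation for this example only, and the function names are hypothetical. A real package would load a full pronouncing dictionary and handle unknown words.
</p>

```python
# Toy pronunciation table in ARPAbet-style notation (hand-written for this
# example; a real system would load a full pronouncing dictionary).
# A trailing digit marks a vowel and its stress: 1 primary, 2 secondary, 0 none.
PHONES = {
    "war": ["W", "AO1", "R"],
    "on": ["AA1", "N"],
    "poverty": ["P", "AA1", "V", "ER0", "T", "IY0"],
}


def phrase_to_phones(phrase):
    """Look up each word and return one list of phones per word."""
    return [PHONES[word] for word in phrase.lower().split()]


def syllable_count(phrase):
    """Count syllables as the number of vowel phones (those ending in a digit)."""
    return sum(
        1
        for word_phones in phrase_to_phones(phrase)
        for phone in word_phones
        if phone[-1].isdigit()
    )


def stress_pattern(phrase):
    """Return the stress digit of every vowel phone, in order."""
    return [
        int(phone[-1])
        for word_phones in phrase_to_phones(phrase)
        for phone in word_phones
        if phone[-1].isdigit()
    ]


print(syllable_count("war on poverty"))  # 5
print(stress_pattern("war on poverty"))  # [1, 1, 1, 0, 0]
```

<p>
Once a phrase is a list of phones, syllable counts and stress positions fall out of simple filters, which is what makes the package reusable for other language tasks.
</p>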
</div>
</div>
<div id="outline-container-org8b76d5a" class="outline-3">
<h3 id="org8b76d5a"><span class="section-number-3">3.6</span> Development Methodology - Agile</h3>
<div class="outline-text-3" id="text-3-6">
<p>
This project will be developed with an iterative Agile methodology. Since a large part of data science and machine learning is exploration, this project will benefit from ongoing exploration in tandem with development.
</p>
<p>
Additionally, the developer(s) working on the project won’t have (and won’t need to have) access to the data sets that songwriters and record labels may have. Work can begin immediately with an iterative approach and future data sets can be integrated as they become available.
</p>
<p>
The prices quoted below are for an initial minimum-viable-product that will serve as a proof-of-concept. Future contracts can be negotiated for ongoing development at similar rates.
</p>
</div>
</div>
<div id="outline-container-orgc9a5773" class="outline-3">
<h3 id="orgc9a5773"><span class="section-number-3">3.7</span> Costs</h3>
<div class="outline-text-3" id="text-3-7">
<p>
Funding requirements are minimal. The initial dataset is public and freely available. On a typical consumer laptop, Hidden Markov Models can be trained on fairly large datasets in short time and the training doesn’t require the use of expensive hardware like the GPUs used to train Deep Neural Networks.
</p>
<p>
For the initial product, the only development expense would be the hourly rate of a full-stack developer. The ongoing expense for the website hosting the user interface would be roughly $20 to $200 per month depending on how many users access the site at the same time.
</p>
<p>
These are my estimates for the time and cost of different aspects of initial development.
</p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-left" />
<col class="org-right" />
<col class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Task</th>
<th scope="col" class="org-right">Hours</th>
<th scope="col" class="org-left">Cost</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Trie</td>
<td class="org-right">60</td>
<td class="org-left">$600</td>
</tr>
<tr>
<td class="org-left">Phonetics</td>
<td class="org-right">30</td>
<td class="org-left">$300</td>
</tr>
<tr>
<td class="org-left">HMM Training Algorithms</td>
<td class="org-right">60</td>
<td class="org-left">$600</td>
</tr>
<tr>
<td class="org-left">Web User Interface</td>
<td class="org-right">80</td>
<td class="org-left">$800</td>
</tr>
<tr>
<td class="org-left">Web Server</td>
<td class="org-right">60</td>
<td class="org-left">$600</td>
</tr>
<tr>
<td class="org-left">Testing</td>
<td class="org-right">20</td>
<td class="org-left">$200</td>
</tr>
<tr>
<td class="org-left">Quality Assurance</td>
<td class="org-right">20</td>
<td class="org-left">$200</td>
</tr>
<tr>
<td class="org-left">Total</td>
<td class="org-right">330</td>
<td class="org-left">$3,300</td>
</tr>
</tbody>
</table>
</div>
</div>
<div id="outline-container-orgc102cbe" class="outline-3">
<h3 id="orgc102cbe"><span class="section-number-3">3.8</span> Stakeholder Impact</h3>
<div class="outline-text-3" id="text-3-8">
<p>
The only stakeholders in the project will be the record labels or songwriters. The impact on them is described in section <a href="#org890f8eb">3.2</a> above.
</p>
</div>
</div>
<div id="outline-container-org63d5a71" class="outline-3">
<h3 id="org63d5a71"><span class="section-number-3">3.9</span> Ethical And Legal Considerations</h3>
<div class="outline-text-3" id="text-3-9">
<p>
Web scraping, the method used to obtain the initial dataset from <a href="http://darklyrics.com">http://darklyrics.com</a>, is protected given the ruling in <a href="https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn">https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn</a> (HiQ Labs v. LinkedIn 2021).
</p>
<p>
The use of publicly available data in generative works is less clear. But Microsoft’s lawyers deemed it sound given their recent release of GitHub Copilot (Gershgorn, 2021).
</p>
</div>
</div>
<div id="outline-container-orge7ed6b6" class="outline-3">
<h3 id="orge7ed6b6"><span class="section-number-3">3.10</span> Expertise</h3>
<div class="outline-text-3" id="text-3-10">
<p>
I have 10 years of experience as a programmer and have worked extensively on frontend technologies like HTML/JavaScript, backend technologies like Django, and building libraries/packages/frameworks.
</p>
<p>
I’ve also been writing limericks my entire life and hold the International Limerick Imaginative Enthusiast’s ILIE award for the years 2013 and 2019.
</p>
</div>
</div>
</div>
<div id="outline-container-executive-summary" class="outline-2">
<h2 id="executive-summary"><span class="section-number-2">4</span> B. Executive Summary - RhymeStorm™ Technical Notes And Requirements</h2>
<div class="outline-text-2" id="text-executive-summary">
</div>
<div id="outline-container-org0ffe6ee" class="outline-3">
<h3 id="org0ffe6ee"><span class="section-number-3">4.1</span> Decision Support Opportunity</h3>
<div class="outline-text-3" id="text-4-1">
<p>
Songwriters expend a lot of time and effort finding the perfect rhyming word or phrase. RhymeStorm™ is going to amplify users’ creative abilities by searching its machine learning model for sensible and proven-successful words and phrases that meet the rhyme scheme and meter requirements requested by the user.
</p>
<p>
When a songwriter needs to find likely phrases that rhyme with “war on poverty” and have 14 syllables, RhymeStorm™ will automatically generate dozens of possibilities and rank them by “perplexity” and rhyme quality. The songwriter can focus their efforts on simple touch-ups to perfect the automatically generated lyrics.
</p>
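<p>
For readers unfamiliar with the term, perplexity is a standard measure of how surprised a language model is by a phrase; lower values mean a more likely phrase. The Python sketch below uses the usual definition, the exponential of the negative mean log probability per word. The candidate phrases and their per-word probabilities are made up for illustration, and how RhymeStorm™ combines this with rhyme quality is not shown.
</p>

```python
import math


def perplexity(word_probabilities):
    """Perplexity of a phrase given the model probability of each of its words."""
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / len(word_probabilities))


# Hypothetical candidates with made-up per-word probabilities.
candidates = {
    "phrase a": [0.5, 0.5, 0.5],
    "phrase b": [0.25, 0.25],
}

# Rank from least to most surprising.
ranked = sorted(candidates, key=lambda phrase: perplexity(candidates[phrase]))
print(ranked)  # ['phrase a', 'phrase b']
```

<p>
Averaging over the number of words keeps long and short candidates comparable, which matters when completions of different lengths are ranked together.
</p>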
</div>
</div>
<div id="outline-container-org24903e6" class="outline-3">
<h3 id="org24903e6"><span class="section-number-3">4.2</span> Customer Needs And Product Description</h3>
<div class="outline-text-3" id="text-4-2">
<p>
Songwriters spend money on dictionaries, compilations of slang, thesauruses, and phrase dictionaries. They spend their time daydreaming, brainstorming, contemplating, and mixing and matching the knowledge they acquire through these traditional means.
</p>
<p>
A simple experiment you can try yourself will show that it takes between 5 and 30 seconds to look up a word in a dictionary or thesaurus. Then it takes an equal amount of time to look up each synonym, antonym, or other word that comes to mind. A few of those words may rhyme, but each word requires building an entire sentence around it that meets restrictions for sensibility, meter, and scheme.
</p>
<p>
This process can take a person hours for a single line and weeks for a single song.
</p>
<p>
Computers can process this information and sort the results by quality millions of times faster. A few minutes of a songwriter specifying filters, restrictions, and requirements can save them days of traditional brainstorming.
</p>
</div>
</div>
<div id="outline-container-orgc7e0d50" class="outline-3">
<h3 id="orgc7e0d50"><span class="section-number-3">4.3</span> Existing Products</h3>
<div class="outline-text-3" id="text-4-3">
<p>
We’re all familiar with dictionaries, thesauruses, and their shortcomings.
</p>
<p>
There is a small amount of technology being applied to this problem. A popular site to find rhymes is <a href="https://www.rhymezone.com">https://www.rhymezone.com</a>.
</p>
<p>
RhymeZone is limited in its capability. It doesn’t do well finding rhymes for phrases more than a couple of words and it can’t generate suggestions for lyric completions.
</p>
</div>
</div>
<div id="outline-container-orgd471480" class="outline-3">
<h3 id="orgd471480"><span class="section-number-3">4.4</span> Available Data And Future Data Lifecycle</h3>
<div class="outline-text-3" id="text-4-4">
<p>
The initial dataset will be gathered by downloading lyrics from <a href="http://darklyrics.com">http://darklyrics.com</a> and future models can be generated by downloading lyrics from other websites. Alternatively, data can be provided by record labels and combined with meta-data that the record label may have, such as how many radio plays each song gets and how much profit they make from each song.
</p>
<p>
RhymeStorm™ can offer multiple models depending on the genre or theme that the songwriter is looking for. With the initial dataset from <a href="http://darklyrics.com">http://darklyrics.com</a>, all suggestions will have a heavy metal theme. But future models can be trained on rap, pop, or other genres.
</p>
<p>
Songs aren’t released quickly enough to warrant an automated, ongoing training process. Perhaps once a year, or whenever a new dataset becomes available, someone can run a script that will update the data models.
</p>
<p>
The script to generate data models will accept three arguments: a directory containing files of songs, a filepath at which to save the completed model, and the “rank” of the Hidden Markov Model. It will generate a Trie representing the HMM and save it to disk at the specified location.
</p>
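<p>
As a rough sketch of how such a script might handle its arguments, the code below counts n-grams and writes them to disk. The names <code>ngram-counts</code> and <code>train-model</code> are hypothetical, whitespace splitting stands in for real tokenization, and a flat map keyed by n-gram stands in for the Trie.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(require '[clojure.java.io :as io]
         '[clojure.string :as string])

(defn ngram-counts
  "Adds the frequency of every n-gram of length `rank` in `tokens` to `counts`."
  [counts rank tokens]
  (reduce (fn [acc ngram] (update acc (vec ngram) (fnil inc 0)))
          counts
          (partition rank 1 tokens)))

(defn train-model
  "Hypothetical entry point. Reads every file under `corpus-dir`, counts the
  n-grams of length `rank` on each non-blank line, writes the counts to
  `model-path` as EDN, and returns them."
  [corpus-dir model-path rank]
  (let [files (filter (fn [^java.io.File f] (.isFile f))
                      (file-seq (io/file corpus-dir)))
        lines (remove string/blank?
                      (mapcat (comp string/split-lines slurp) files))
        model (reduce (fn [counts line]
                        (ngram-counts counts
                                      rank
                                      ;; Naive whitespace tokenization (assumption).
                                      (string/split (string/lower-case (string/trim line))
                                                    #"\s+")))
                      {}
                      lines)]
    (spit model-path (pr-str model))
    model))
</pre>
</div>
<p>
A call such as <code>(train-model "corpus/" "model.edn" 4)</code>, with placeholder paths, would write 4-gram counts that can later be read back with <code>clojure.edn/read-string</code>.
</p>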
<p>
Each new model can be uploaded to the web server and users can select which model they want to use.
</p>
</div>
</div>
<div id="outline-container-org46d6de3" class="outline-3">
<h3 id="org46d6de3"><span class="section-number-3">4.5</span> Methodology - Agile</h3>
<div class="outline-text-3" id="text-4-5">
<p>
RhymeStorm™ development will proceed with an iterative Agile methodology. The project will be composed of several modules that can be worked on independently, in parallel, and iteratively.
</p>
<p>
The Trie data structure that will back the Hidden Markov Model can be worked on in isolation from any other aspect of the project. The first iteration can use a simple hash-map as a backing store. The second iteration can improve memory efficiency by using a ByteBuffer as a <a href="https://aclanthology.org/W09-1505.pdf">Tightly Packed Trie</a> (Germann et al., 2009). Future iterations can continue to improve performance metrics.
</p>
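<p>
A minimal sketch of what a hash-map-backed first iteration could look like follows. The function names <code>trie-insert</code>, <code>trie-lookup</code>, and <code>trie-children</code> are illustrative only and are not the API of the published library.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(defn trie-insert
  "Increments the count stored at the node reached by following `ks`."
  [trie ks]
  (update-in trie
             (concat (interleave (repeat :children) ks) [:count])
             (fnil inc 0)))

(defn trie-lookup
  "Returns the node at `ks`, or nil when the path does not exist."
  [trie ks]
  (get-in trie (interleave (repeat :children) ks)))

(defn trie-children
  "Returns a map of key to child node for the immediate children of `node`."
  [node]
  (:children node))

;; Keys are stored in reverse so that a lookup of a suffix
;; yields the words that precede it.
(def example-trie
  (reduce trie-insert {} [["me" "bother" "don't"]
                          ["me" "bother" "don't"]
                          ["me" "bother" "to"]]))

(trie-children (trie-lookup example-trie ["me" "bother"]))
;; => {"don't" {:count 2}, "to" {:count 1}}
</pre>
</div>
<p>
Because the nodes are ordinary nested maps, this version is easy to inspect at a REPL, at the cost of the memory overhead that a later packed representation is meant to remove.
</p>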
<p>
The web server can be implemented initially without security measures like HTTPS and performance measures like load balancing. Future iterations can add these features as they become necessary.
</p>
<p>
The user interface can be implemented as a wireframe and extended as new functionality becomes available from the backend.
</p>
<p>
Much of data science is exploratory, and an iterative Agile approach makes it possible to delay decisions until more information has been gathered.
</p>
</div>
</div>
<div id="outline-container-orga321efb" class="outline-3">
<h3 id="orga321efb"><span class="section-number-3">4.6</span> Deliverables</h3>
<div class="outline-text-3" id="text-4-6">
<ul class="org-ul">
<li>Supporting libraries source code</li>
<li>Application source code</li>
<li>Deployed application</li>
</ul>
<p>
The supporting libraries of this project are available as open source repositories on GitHub.
</p>
<p>
<a href="https://github.com/eihli/clj-tightly-packed-trie">Tightly Packed Trie</a>
</p>
<p>
<a href="https://github.com/eihli/phonetics">Phonetics and Syllabification</a>
</p>
<p>
<a href="https://github.com/eihli/prhyme">Data Processing, Markov, and Rhyme Algorithms</a>
</p>
<p>
The trained data model and web interface have been deployed at the following address, and the code will be provided in an archive file.
</p>
<p>
<a href="https://darklimericks.com/rhymestorm">Web GUI and Documentation</a>
</p>
</div>
</div>
<div id="outline-container-orgada24b3" class="outline-3">
<h3 id="orgada24b3"><span class="section-number-3">4.7</span> Implementation Plan And Anticipations</h3>
<div class="outline-text-3" id="text-4-7">
<p>
I’ll start by writing and releasing the supporting libraries and packages: Tries, Syllabification/Phonetics, Rhyming.
</p>
<p>
Then I’ll write a website that imports and uses those libraries.
</p>
<p>
Since I’ll be writing and releasing these packages iteratively as open source, I’ll share them publicly as I progress and can use feedback to improve them before RhymeStorm™ takes its final form.
</p>
<p>
In anticipation of user growth, I’ll be deploying the final product on DigitalOcean Droplets. They are virtual machines with resources that can be resized to meet growing demands or shrunk to save money in times of low traffic.
</p>
</div>
</div>
<div id="outline-container-org8467485" class="outline-3">
<h3 id="org8467485"><span class="section-number-3">4.8</span> Requirements Validation And Verification</h3>
<div class="outline-text-3" id="text-4-8">
<p>
For the known requirements, I’ll personally perform manual tests and quality assurance. This is a small enough project that one individual can thoroughly test all of the primary requirements.
</p>
<p>
Since the project is broken down into isolated sub-projects, unit tests will be added to the sub-projects to make sure they meet their own goals and performance standards.
</p>
<p>
The final website will integrate multiple technologies, and the integrations won’t be ideal for unit testing. But as mentioned, the user acceptance requirements are not major and can be verified manually.
</p>
</div>
</div>
<div id="outline-container-orga48f74d" class="outline-3">
<h3 id="orga48f74d"><span class="section-number-3">4.9</span> Programming Environments And Costs</h3>
<div class="outline-text-3" id="text-4-9">
<p>
One of the benefits of a Hidden Markov Model is its relative computational affordability when compared to other machine learning techniques, like Deep Neural Networks.
</p>
<p>
We don’t require a GPU or long training times on powerful computers. A 4-gram Hidden Markov Model can be trained on the more than 200,000 songs obtained from <a href="http://darklyrics.com">http://darklyrics.com</a> in just a few hours on a consumer laptop.
</p>
<p>
The training process never uses more than 20 gigabytes of RAM.
</p>
<p>
All code was written and all models were trained on a Lenovo T15G with a 2.4 GHz Intel i9 processor and 32 GB of RAM.
</p>
</div>
</div>
<div id="outline-container-org1712f4e" class="outline-3">
<h3 id="org1712f4e"><span class="section-number-3">4.10</span> Timeline And Milestones</h3>
<div class="outline-text-3" id="text-4-10">
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-right" />
<col class="org-right" />
<col class="org-right" />
<col class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-right">Sprint</th>
<th scope="col" class="org-right">Start</th>
<th scope="col" class="org-right">End</th>
<th scope="col" class="org-left">Tasks</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-right">1</td>
<td class="org-right">2021-07-01</td>
<td class="org-right">2021-07-07</td>
<td class="org-left">Acquire corpus - Explore Modelling - Review Existing Material</td>
</tr>
<tr>
<td class="org-right">2</td>
<td class="org-right">2021-07-07</td>
<td class="org-right">2021-07-21</td>
<td class="org-left">Data Cleanup - Feature Extraction - Lyric Generation (POC)</td>
</tr>
<tr>
<td class="org-right">3</td>
<td class="org-right">2021-07-21</td>
<td class="org-right">2021-07-28</td>
<td class="org-left">Lyric Generation Restrictions (Syllable-count, Rhyme, Etc…)</td>
</tr>
<tr>
<td class="org-right">4</td>
<td class="org-right">2021-07-28</td>
<td class="org-right">2021-08-14</td>
<td class="org-left">Train Full-scale Model - Performance Tuning</td>
</tr>
<tr>
<td class="org-right">5</td>
<td class="org-right">2021-08-14</td>
<td class="org-right">2021-08-21</td>
<td class="org-left">Create Web Interface And Visualizations</td>
</tr>
<tr>
<td class="org-right">6</td>
<td class="org-right">2021-08-21</td>
<td class="org-right">2021-09-07</td>
<td class="org-left">QA - Testing - Deploy And Release Web App</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div id="outline-container-requirements-documentation" class="outline-2">
<h2 id="requirements-documentation"><span class="section-number-2">5</span> C. RhymeStorm™ Capstone Requirements Documentation</h2>
<div class="outline-text-2" id="text-requirements-documentation">
<p>
RhymeStorm™ is an application to help singers and songwriters brainstorm new lyrics.
</p>
</div>
<div id="outline-container-orgda35db8" class="outline-3">
<h3 id="orgda35db8"><span class="section-number-3">5.1</span> Descriptive And Predictive Methods</h3>
<div class="outline-text-3" id="text-5-1">
</div>
<div id="outline-container-orgab98aaf" class="outline-4">
<h4 id="orgab98aaf"><span class="section-number-4">5.1.1</span> Descriptive Method</h4>
<div class="outline-text-4" id="text-5-1-1">
</div>
<ol class="org-ol">
<li><a id="org00830d9"></a>Most Common Grammatical Structures In A Set Of Lyrics<br />
<div class="outline-text-5" id="text-5-1-1-1">
<p>
By filtering songs by metrics such as popularity, number of awards, etc… we can use this software package to determine the most common grammatical phrase structure for different filtered categories.
</p>
<p>
Since much of the data a record label might want to categorize songs by is likely proprietary, filtering the songs by a chosen metric is the responsibility of the user.
</p>
<p>
Once the songs are filtered/categorized, they can be passed to this software where a list of the most popular grammar structures will be returned.
</p>
<p>
In the example below, you’ll see that a simple noun-phrase is the most popular structure with 6 occurrences, tied with a sentence composed of a pronoun noun-phrase, a verb-phrase, and an adjective.
</p>
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span>require '<span style="color: #c678dd;">[</span>com.owoga.corpus.markov <span style="color: #a9a1e1;">:as</span> markov<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>com.owoga.prhyme.nlp.core <span style="color: #a9a1e1;">:as</span> nlp<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>clojure.string <span style="color: #a9a1e1;">:as</span> string<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>clojure.java.io <span style="color: #a9a1e1;">:as</span> io<span style="color: #c678dd;">]</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">let</span> <span style="color: #c678dd;">[</span>lines <span style="color: #98be65;">(</span>transduce
<span style="color: #a9a1e1;">(</span>comp
<span style="color: #51afef;">(</span>map slurp<span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>map #<span style="color: #c678dd;">(</span><span style="color: #ECBE7B;">string</span>/split <span style="color: #dcaeea;">%</span> #<span style="color: #98be65;">"</span><span style="color: #98be65; font-weight: bold;">\n</span><span style="color: #98be65;">"</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>map <span style="color: #c678dd;">(</span>partial remove empty?<span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>map <span style="color: #ECBE7B;">nlp</span>/structure-freqs<span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
merge
<span style="color: #a9a1e1;">{}</span>
<span style="color: #a9a1e1;">(</span>eduction <span style="color: #51afef;">(</span><span style="color: #ECBE7B;">markov</span>/xf-file-seq <span style="color: #da8548; font-weight: bold;">0</span> <span style="color: #da8548; font-weight: bold;">10</span><span style="color: #51afef;">)</span> <span style="color: #51afef;">(</span>file-seq <span style="color: #c678dd;">(</span><span style="color: #ECBE7B;">io</span>/file <span style="color: #98be65;">"/home/eihli/src/prhyme/dark-corpus"</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span>take <span style="color: #da8548; font-weight: bold;">5</span> <span style="color: #98be65;">(</span>sort-by <span style="color: #a9a1e1;">(</span>comp - second<span style="color: #a9a1e1;">)</span> lines<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
</pre>
</div>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-left" />
<col class="org-right" />
</colgroup>
<tbody>
<tr>
<td class="org-left">(TOP (NP (NNP) (.)))</td>
<td class="org-right">6</td>
</tr>
<tr>
<td class="org-left">(TOP (S (NP (PRP)) (VP (VBP) (ADJP (JJ))) (.)))</td>
<td class="org-right">6</td>
</tr>
<tr>
<td class="org-left">(INC (NP (JJ) (NN)) nil (IN) (NP (DT)) (NP (PRP)) (VBP))</td>
<td class="org-right">4</td>
</tr>
<tr>
<td class="org-left">(TOP (NP (NP (JJ) (NN)) nil (NP (NN) (CC) (NN))))</td>
<td class="org-right">4</td>
</tr>
<tr>
<td class="org-left">(TOP (S (NP (JJ) (NN)) nil (VP (VBG) (ADJP (JJ)))))</td>
<td class="org-right">4</td>
</tr>
</tbody>
</table>
</div>
</li>
</ol>
</div>
<div id="outline-container-orgc07d72f" class="outline-4">
<h4 id="orgc07d72f"><span class="section-number-4">5.1.2</span> Prescriptive Method</h4>
<div class="outline-text-4" id="text-5-1-2">
</div>
<ol class="org-ol">
<li><a id="org79ddf78"></a>Most Likely Word To Follow A Given Phrase<br />
<div class="outline-text-5" id="text-5-1-2-1">
<p>
To help songwriters think of new lyrics, we provide an API to receive a list of words that commonly follow/precede a given phrase.
</p>
<p>
Models can be trained on different genres or categories of songs. This helps keep recommended lyric completions appropriate to the genre.
</p>
<p>
In the example below, we provide a seed suffix of “bother me” and ask the software to predict the most likely words that precede that phrase. The resulting most popular phrases are “don’t bother me”, “doesn’t bother me”, “to bother me”, “won’t bother me”, etc…
</p>
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span>require '<span style="color: #c678dd;">[</span>com.darklimericks.server.models <span style="color: #a9a1e1;">:as</span> models<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>com.owoga.trie <span style="color: #a9a1e1;">:as</span> trie<span style="color: #c678dd;">]</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">let</span> <span style="color: #c678dd;">[</span>seed <span style="color: #98be65;">[</span><span style="color: #98be65;">"bother"</span> <span style="color: #98be65;">"me"</span><span style="color: #98be65;">]</span>
seed-ids <span style="color: #98be65;">(</span>map <span style="color: #ECBE7B;">models</span>/database seed<span style="color: #98be65;">)</span>
lookup <span style="color: #98be65;">(</span>reverse seed-ids<span style="color: #98be65;">)</span>
results <span style="color: #98be65;">(</span><span style="color: #ECBE7B;">trie</span>/children <span style="color: #a9a1e1;">(</span><span style="color: #ECBE7B;">trie</span>/lookup <span style="color: #ECBE7B;">models</span>/markov-trie lookup<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span><span style="color: #51afef;">->></span> results
<span style="color: #98be65;">(</span>map #<span style="color: #a9a1e1;">(</span>get <span style="color: #dcaeea;">%</span> <span style="color: #51afef;">[]</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>sort-by <span style="color: #a9a1e1;">(</span>comp - second<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>map #<span style="color: #a9a1e1;">(</span>update <span style="color: #dcaeea;">%</span> <span style="color: #da8548; font-weight: bold;">0</span> <span style="color: #ECBE7B;">models</span>/database<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>take <span style="color: #da8548; font-weight: bold;">10</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
</pre>
</div>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-left" />
<col class="org-right" />
</colgroup>
<tbody>
<tr>
<td class="org-left">don’t</td>
<td class="org-right">36</td>
</tr>
<tr>
<td class="org-left">doesn’t</td>
<td class="org-right">21</td>
</tr>
<tr>
<td class="org-left">to</td>
<td class="org-right">14</td>
</tr>
<tr>
<td class="org-left">won’t</td>
<td class="org-right">9</td>
</tr>
<tr>
<td class="org-left">really</td>
<td class="org-right">5</td>
</tr>
<tr>
<td class="org-left">not</td>
<td class="org-right">4</td>
</tr>
<tr>
<td class="org-left">you</td>
<td class="org-right">4</td>
</tr>
<tr>
<td class="org-left">it</td>
<td class="org-right">3</td>
</tr>
<tr>
<td class="org-left">even</td>
<td class="org-right">3</td>
</tr>
<tr>
<td class="org-left">shouldn’t</td>
<td class="org-right">3</td>
</tr>
</tbody>
</table>
</div>
</li>
</ol>
</div>
</div>
<div id="outline-container-org8f499c5" class="outline-3">
<h3 id="org8f499c5"><span class="section-number-3">5.2</span> Datasets</h3>
<div class="outline-text-3" id="text-5-2">
<p>
The dataset currently in use was generated from the publicly available lyrics at <a href="http://darklyrics.com">http://darklyrics.com</a>.
</p>
<p>
Further datasets will need to be provided by the end-user.
</p>
<p>
The trained model is available as a resource in this repository at <code>web/resources/models/</code>.
</p>
</div>
</div>
<div id="outline-container-org2d4eaec" class="outline-3">
<h3 id="org2d4eaec"><span class="section-number-3">5.3</span> Decision Support Functionality</h3>
<div class="outline-text-3" id="text-5-3">
</div>
<div id="outline-container-org7c927a3" class="outline-4">
<h4 id="org7c927a3"><span class="section-number-4">5.3.1</span> Choosing Words For A Lyric Based On Markov Likelihood</h4>
<div class="outline-text-4" id="text-5-3-1">
<p>
Entire phrases can be generated using the previously mentioned functionality of generating lists of likely prefix/suffix words.
</p>
<p>
The software can be seeded with a simple “end-of-sentence” or “beginning-of-sentence” token and can be asked to work backwards or forwards, respectively, to build a phrase that meets certain criteria.
</p>
<p>
The user can supply criteria such as restrictions on the number of syllables, number of words, rhyme scheme, etc…
</p>
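<p>
The idea of working backwards from an end-of-sentence token under a syllable restriction can be sketched as follows. Everything here is an assumption made for illustration: the toy <code>predecessors</code> model and its counts are made up, the syllable counts are hand-entered, and always taking the most frequent candidate is a simplification of whatever selection strategy the application actually uses.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(def predecessors
  "Toy backward model with made-up counts: word -> {preceding word -> count}."
  {:end     {"me" 5 "you" 3}
   "me"     {"bother" 4 "with" 2}
   "bother" {"don't" 36 "doesn't" 21 "to" 14}
   "don't"  {"please" 2 "they" 1}})

(def syllables
  "Hand-entered syllable counts for the toy vocabulary."
  {"me" 1 "you" 1 "bother" 2 "with" 1 "don't" 1
   "doesn't" 2 "to" 1 "please" 1 "they" 1})

(defn generate-backwards
  "Starting from the end-of-sentence token, repeatedly prepends the most
  frequent predecessor that still fits in the remaining syllable budget.
  Stops when the budget is spent or no candidate fits."
  [target-syllables]
  (loop [phrase '()
         current :end
         remaining target-syllables]
    (let [candidates (filter (fn [[word _]] (>= remaining (syllables word)))
                             (sort-by (comp - second) (predecessors current)))]
      (if (or (zero? remaining) (empty? candidates))
        (vec phrase)
        (let [[word _] (first candidates)]
          ;; conj on a list prepends, so the phrase stays in reading order.
          (recur (conj phrase word) word (- remaining (syllables word))))))))

(generate-backwards 4)
;; => ["don't" "bother" "me"]
</pre>
</div>
<p>
Other restrictions, such as a maximum word count or a required rhyme for the final word, could be added as further predicates in the same <code>filter</code> step.
</p>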
</div>
</div>
<div id="outline-container-org0a51a02" class="outline-4">
<h4 id="org0a51a02"><span class="section-number-4">5.3.2</span> Choosing Words To Complete A Lyric Based On Rhyme Quality</h4>
<div class="outline-text-4" id="text-5-3-2">
<p>
Another part of the decision support functionality is filtering and ordering predicted words based on their rhyme quality.
</p>
<p>
The official definition of a “perfect” rhyme is when two words have matching phonemes starting from their primary stress.
</p>
<p>
For example: technology and ecology. Both of those words have a stress on the second syllable. The first syllables differ. But from the stressed syllable on, they have exactly matching phones.
</p>
<p>
A rhyme that might be useful to a songwriter but that doesn’t fit the definition of a “perfect” rhyme would be “technology” and “economy”. Those two words just barely break the rules for a perfect rhyme. Their vowel phones match from their primary stress to their ends. But the consonant phones between those vowels don’t all match.
</p>
<p>
Singers and songwriters have some flexibility and artistic freedom and imperfect rhymes can be a fallback.
</p>
<p>
Therefore, this software provides functionality to sort rhymes so that rhymes that are closer to perfect are first in the ordering.
</p>
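<p>
The distinction between a perfect rhyme and a near rhyme can be illustrated with a small sketch. The transcriptions are hand-entered in an ARPAbet-like notation where a trailing <code>1</code> marks the primary-stressed vowel, and the function names are hypothetical. This is not the scoring used by the library, only one way to express the definitions above.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(require '[clojure.string :as string])

(def phones
  "Hand-entered ARPAbet-style transcriptions (an assumption of this sketch)."
  {"technology" ["T" "EH0" "K" "N" "AA1" "L" "AH0" "JH" "IY0"]
   "ecology"    ["IH0" "K" "AA1" "L" "AH0" "JH" "IY0"]
   "economy"    ["IH0" "K" "AA1" "N" "AH0" "M" "IY0"]})

(defn stressed-tail
  "Phones from the primary-stressed vowel to the end of the word."
  [word]
  (drop-while #(not (string/ends-with? % "1")) (phones word)))

(defn vowel?
  "Vowel phones carry a trailing stress digit in this notation."
  [phone]
  (boolean (re-find #"\d$" phone)))

(defn perfect-rhyme?
  "True when every phone matches from the primary stress onward."
  [a b]
  (= (stressed-tail a) (stressed-tail b)))

(defn vowel-rhyme?
  "True when only the vowel phones match from the primary stress onward."
  [a b]
  (= (filter vowel? (stressed-tail a))
     (filter vowel? (stressed-tail b))))

(defn rhyme-rank
  "Lower is better: 0 for perfect, 1 for vowel-only, 2 otherwise."
  [target word]
  (cond (perfect-rhyme? target word) 0
        (vowel-rhyme? target word)   1
        :else                        2))

(sort-by (partial rhyme-rank "technology") ["economy" "ecology"])
;; => ("ecology" "economy")
</pre>
</div>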
<p>
In the example below, you’ll see that the first 20 or so rhymes are perfect, but then “hypocrisy” is listed as rhyming with “technology”. This is for the reason just mentioned. It’s close to a perfect rhyme and it’s of interest to singers/songwriters.
</p>
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span>require '<span style="color: #c678dd;">[</span>com.darklimericks.linguistics.core <span style="color: #a9a1e1;">:as</span> linguistics<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>com.darklimericks.server.models <span style="color: #a9a1e1;">:as</span> models<span style="color: #c678dd;">]</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">let</span> <span style="color: #c678dd;">[</span>results
<span style="color: #98be65;">(</span><span style="color: #ECBE7B;">linguistics</span>/rhymes-with-frequencies-and-rhyme-quality
<span style="color: #98be65;">"technology"</span>
<span style="color: #ECBE7B;">models</span>/markov-trie
<span style="color: #ECBE7B;">models</span>/database<span style="color: #98be65;">)</span><span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span><span style="color: #51afef;">->></span> results
<span style="color: #98be65;">(</span>map
<span style="color: #a9a1e1;">(</span><span style="color: #51afef;">fn</span> <span style="color: #51afef;">[</span><span style="color: #c678dd;">[</span>rhyming-word
rhyming-word-phones
frequency-count-of-rhyming-word
target-word
target-word-phones
rhyme-quality<span style="color: #c678dd;">]</span><span style="color: #51afef;">]</span>
<span style="color: #51afef;">[</span>rhyming-word frequency-count-of-rhyming-word rhyme-quality<span style="color: #51afef;">]</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>take <span style="color: #da8548; font-weight: bold;">25</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>vec<span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>into <span style="color: #a9a1e1;">[</span><span style="color: #51afef;">[</span><span style="color: #98be65;">"rhyme"</span> <span style="color: #98be65;">"frequency count"</span> <span style="color: #98be65;">"rhyme quality"</span><span style="color: #51afef;">]</span><span style="color: #a9a1e1;">]</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
</pre>
</div>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-left" />
<col class="org-right" />
<col class="org-right" />
</colgroup>
<tbody>
<tr>
<td class="org-left">rhyme</td>
<td class="org-right">frequency count</td>
<td class="org-right">rhyme quality</td>
</tr>
<tr>
<td class="org-left">technology</td>
<td class="org-right">318</td>
<td class="org-right">8</td>
</tr>
<tr>
<td class="org-left">apology</td>
<td class="org-right">68</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">pathology</td>
<td class="org-right">42</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">mythology</td>
<td class="org-right">27</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">psychology</td>
<td class="org-right">24</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">theology</td>
<td class="org-right">23</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">biology</td>
<td class="org-right">20</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">ecology</td>
<td class="org-right">11</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">chronology</td>
<td class="org-right">10</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">astrology</td>
<td class="org-right">9</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">biotechnology</td>
<td class="org-right">8</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">nanotechnology</td>
<td class="org-right">5</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">geology</td>
<td class="org-right">3</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">ontology</td>
<td class="org-right">2</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">morphology</td>
<td class="org-right">2</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">seismology</td>
<td class="org-right">1</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">urology</td>
<td class="org-right">1</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">doxology</td>
<td class="org-right">0</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">neurology</td>
<td class="org-right">0</td>
<td class="org-right">7</td>
</tr>
<tr>
<td class="org-left">hypocrisy</td>
<td class="org-right">723</td>
<td class="org-right">6</td>
</tr>
<tr>
<td class="org-left">democracy</td>
<td class="org-right">238</td>
<td class="org-right">6</td>
</tr>
<tr>
<td class="org-left">atrocity</td>
<td class="org-right">224</td>
<td class="org-right">6</td>
</tr>
<tr>
<td class="org-left">philosophy</td>
<td class="org-right">181</td>
<td class="org-right">6</td>
</tr>
<tr>
<td class="org-left">equality</td>
<td class="org-right">109</td>
<td class="org-right">6</td>
</tr>
<tr>
<td class="org-left">ideology</td>
<td class="org-right">105</td>
<td class="org-right">6</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div id="outline-container-orgc667065" class="outline-3">
<h3 id="orgc667065"><span class="section-number-3">5.4</span> Featurizing, Parsing, Cleaning, And Wrangling Data</h3>
<div class="outline-text-3" id="text-5-4">
<p>
The data processing code is in <a href="https://github.com/eihli/prhyme">https://github.com/eihli/prhyme</a>.
</p>
<p>
Each line gets tokenized using a regular expression to split the string into tokens.
</p>
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span><span style="color: #51afef;">def</span> <span style="color: #dcaeea;">re-word</span>
<span style="color: #83898d;">"Regex for tokenizing a string into words</span>
<span style="color: #83898d;"> (including contractions and hyphenations),</span>
<span style="color: #83898d;"> commas, periods, question marks, and newlines."</span>
#<span style="color: #98be65;">"</span><span style="color: #51afef; font-weight: bold;">(</span><span style="color: #98be65;">?s</span><span style="color: #51afef; font-weight: bold;">)</span><span style="color: #98be65;">.*?</span><span style="color: #51afef; font-weight: bold;">(</span><span style="color: #98be65;">[a-zA-Z</span><span style="color: #98be65; font-weight: bold;">\d</span><span style="color: #98be65;">]+</span><span style="color: #51afef; font-weight: bold;">(?:</span><span style="color: #98be65;">['</span><span style="color: #98be65; font-weight: bold;">\-</span><span style="color: #98be65;">]?[a-zA-Z]+</span><span style="color: #51afef; font-weight: bold;">)</span><span style="color: #98be65;">?</span><span style="color: #51afef; font-weight: bold;">|</span><span style="color: #98be65;">,</span><span style="color: #51afef; font-weight: bold;">|</span><span style="color: #98be65; font-weight: bold;">\.</span><span style="color: #51afef; font-weight: bold;">|</span><span style="color: #98be65; font-weight: bold;">\?</span><span style="color: #51afef; font-weight: bold;">|</span><span style="color: #98be65; font-weight: bold;">\n</span><span style="color: #51afef; font-weight: bold;">)</span><span style="color: #98be65;">"</span><span style="color: #51afef;">)</span>
</pre>
</div>
<p>
Along with tokenization, the lines get stripped of whitespace and converted to lowercase. This conversion is done so that
words can be compared: “Foo” is the same as “foo”.
</p>
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span><span style="color: #51afef;">def</span> <span style="color: #dcaeea;">xf-tokenize</span>
<span style="color: #c678dd;">(</span>comp
<span style="color: #98be65;">(</span>map <span style="color: #ECBE7B;">string</span>/trim<span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>map <span style="color: #a9a1e1;">(</span>partial re-seq re-word<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>map <span style="color: #a9a1e1;">(</span>partial map second<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>map <span style="color: #a9a1e1;">(</span>partial mapv <span style="color: #ECBE7B;">string</span>/lower-case<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
</pre>
</div>
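<p>
To show what the transducer produces, the sketch below repeats the two definitions above so that it can be evaluated on its own, then applies them to two sample lines. The sample lyrics are made up for illustration.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(require '[clojure.string :as string])

(def re-word
  #"(?s).*?([a-zA-Z\d]+(?:['\-]?[a-zA-Z]+)?|,|\.|\?|\n)")

(def xf-tokenize
  (comp
   (map string/trim)
   (map (partial re-seq re-word))
   ;; Each match is [whole-match captured-token]; keep only the token.
   (map (partial map second))
   (map (partial mapv string/lower-case))))

(into [] xf-tokenize ["  Don't bother me, please.  "
                      "Blood-red skies?"])
;; => [["don't" "bother" "me" "," "please" "."]
;;     ["blood-red" "skies" "?"]]
</pre>
</div>
<p>
Contractions and hyphenated words survive as single tokens, while commas, periods, and question marks become tokens of their own.
</p>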
</div>
</div>
<div id="outline-container-org6b7a95d" class="outline-3">
<h3 id="org6b7a95d"><span class="section-number-3">5.5</span> Data Exploration And Preparation</h3>
<div class="outline-text-3" id="text-5-5">
<p>
The primary data structure supporting exploration of the data is a Markov Trie.
</p>
<p>
The Trie data structure supports a <code>lookup</code> function that returns the child trie at a certain lookup key and a <code>children</code> function that returns all of the immediate children of a particular Trie.
</p>
<p>
All Trie code is hosted in the git repo located at <a href="https://github.com/eihli/clj-tightly-packed-trie">https://github.com/eihli/clj-tightly-packed-trie</a>.
</p>
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span><span style="color: #51afef;">defprotocol</span> <span style="color: #ECBE7B;">ITrie</span>
<span style="color: #c678dd;">(</span>children <span style="color: #98be65;">[</span>self<span style="color: #98be65;">]</span> <span style="color: #98be65;">"Immediate children of a node."</span><span style="color: #c678dd;">)</span>
<span style="color: #c678dd;">(</span>lookup <span style="color: #98be65;">[</span>self <span style="color: #bbc2cf; background-color: #21242b;">^</span><span style="color: #ECBE7B;">clojure.lang.PersistentList</span> ks<span style="color: #98be65;">]</span> <span style="color: #98be65;">"Return node at key."</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">deftype</span> <span style="color: #ECBE7B;">Trie</span> <span style="color: #c678dd;">[</span>key value <span style="color: #bbc2cf; background-color: #21242b;">^</span><span style="color: #ECBE7B;">clojure.lang.PersistentTreeMap</span> children-<span style="color: #c678dd;">]</span>
ITrie
<span style="color: #c678dd;">(</span>children <span style="color: #98be65;">[</span>trie<span style="color: #98be65;">]</span>
<span style="color: #98be65;">(</span>map
<span style="color: #a9a1e1;">(</span><span style="color: #51afef;">fn</span> <span style="color: #51afef;">[</span><span style="color: #c678dd;">[</span>k <span style="color: #bbc2cf; background-color: #21242b;">^</span><span style="color: #ECBE7B;">Trie</span> child<span style="color: #c678dd;">]</span><span style="color: #51afef;">]</span>
<span style="color: #51afef;">(</span>Trie. k
<span style="color: #c678dd;">(</span>.value child<span style="color: #c678dd;">)</span>
<span style="color: #c678dd;">(</span>.children- child<span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
children-<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span>
<span style="color: #c678dd;">(</span>lookup <span style="color: #98be65;">[</span>trie k<span style="color: #98be65;">]</span>
<span style="color: #98be65;">(</span><span style="color: #51afef;">loop</span> <span style="color: #a9a1e1;">[</span>k k
trie trie<span style="color: #a9a1e1;">]</span>
<span style="color: #a9a1e1;">(</span><span style="color: #51afef;">cond</span>
<span style="color: #5B6268;">;; </span><span style="color: #5B6268;">Allows `</span><span style="color: #a9a1e1;">update</span><span style="color: #5B6268;">` to work the same as with maps... can use `</span><span style="color: #a9a1e1;">fnil</span><span style="color: #5B6268;">`.</span>
<span style="color: #5B6268;">;; </span><span style="color: #5B6268;">(nil? trie') (throw (Exception. (format "Key not found: %s" k)))</span>
<span style="color: #51afef;">(</span>nil? trie<span style="color: #51afef;">)</span> <span style="color: #a9a1e1;">nil</span>
<span style="color: #51afef;">(</span>empty? k<span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>Trie. <span style="color: #c678dd;">(</span>.key trie<span style="color: #c678dd;">)</span>
<span style="color: #c678dd;">(</span>.value trie<span style="color: #c678dd;">)</span>
<span style="color: #c678dd;">(</span>.children- trie<span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #a9a1e1;">:else</span> <span style="color: #51afef;">(</span><span style="color: #51afef;">recur</span>
<span style="color: #c678dd;">(</span>rest k<span style="color: #c678dd;">)</span>
<span style="color: #c678dd;">(</span>get <span style="color: #98be65;">(</span>.children- trie<span style="color: #98be65;">)</span> <span style="color: #98be65;">(</span>first k<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span>
</pre>
</div>
</div>
</div>
<div id="outline-container-org1d3435f" class="outline-3">
<h3 id="org1d3435f"><span class="section-number-3">5.6</span> Data Visualization Functionalities For Data Exploration And Inspection</h3>
<div class="outline-text-3" id="text-5-6">
<p>
The functionality to explore and visualize data is baked into the Trie data structure.
</p>
<p>
Evaluating a Trie in a Clojure REPL prints its contents as a map from key paths to values, so you can inspect its structure directly.
</p>
<pre class="example" id="orgd88bf1a">
(let [initialized-trie (trie/make-trie "dog" "dog" "dot" "dot" "do" "do")]
initialized-trie)
;; => {(\d \o \g) "dog", (\d \o \t) "dot", (\d \o) "do", (\d) nil}
</pre>
<p>
This functionality is provided by the implementations of the <code>Associative</code> and <code>IPersistentMap</code> interfaces.
</p>
<div class="org-src-container">
<pre class="src src-clojure">clojure.lang.Associative
<span style="color: #51afef;">(</span>assoc <span style="color: #c678dd;">[</span>trie opath ovalue<span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span><span style="color: #51afef;">if</span> <span style="color: #98be65;">(</span>empty? opath<span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>IntKeyTrie. key ovalue children-<span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>IntKeyTrie. key value <span style="color: #a9a1e1;">(</span>update
children-
<span style="color: #51afef;">(</span>first opath<span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>fnil assoc <span style="color: #c678dd;">(</span>IntKeyTrie. <span style="color: #98be65;">(</span>first opath<span style="color: #98be65;">)</span> <span style="color: #a9a1e1;">nil</span> <span style="color: #98be65;">(</span>fast-sorted-map<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>rest opath<span style="color: #51afef;">)</span>
ovalue<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>entryAt <span style="color: #c678dd;">[</span>trie key<span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span>clojure.lang.MapEntry. key <span style="color: #98be65;">(</span>get trie key<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>containsKey <span style="color: #c678dd;">[</span>trie key<span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span>boolean <span style="color: #98be65;">(</span>get trie key<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
clojure.lang.IPersistentMap
<span style="color: #51afef;">(</span>assocEx <span style="color: #c678dd;">[</span>trie key val<span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span><span style="color: #51afef;">if</span> <span style="color: #98be65;">(</span>contains? trie key<span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span><span style="color: #51afef;">throw</span> <span style="color: #a9a1e1;">(</span>Exception. <span style="color: #51afef;">(</span>format <span style="color: #98be65;">"Value already exists at key %s."</span> key<span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>assoc trie key val<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>without <span style="color: #c678dd;">[</span>trie key<span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span>-without trie key<span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
</pre>
</div>
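<p>
As a rough illustration of the flattened view in the REPL output above, the sketch below walks a trie and emits one <code>[path value]</code> pair per node. It does not use the project’s Trie type: the nested-map representation, <code>sample-trie</code>, and <code>path-value-pairs</code> are hypothetical stand-ins, and the children-before-parent ordering is only inferred from the example output.
</p>
<div class="org-src-container">
<pre class="src src-clojure">;; Hypothetical stand-in for the Trie type: every node is a plain map with
;; a :value and a sorted map of :children keyed by one path element.
(def sample-trie
  {:value nil
   :children
   (sorted-map
    \d {:value nil
        :children
        (sorted-map
         \o {:value "do"
             :children
             (sorted-map
              \g {:value "dog" :children (sorted-map)}
              \t {:value "dot" :children (sorted-map)})})})})

(defn path-value-pairs
  "Depth-first walk that emits [path value] for every node below `node`,
  listing children before their parent."
  [node path]
  (mapcat (fn [[k child]]
            (let [child-path (conj path k)]
              (concat (path-value-pairs child child-path)
                      [[child-path (:value child)]])))
          (:children node)))

(path-value-pairs sample-trie [])
;; returns ([[\d \o \g] "dog"] [[\d \o \t] "dot"] [[\d \o] "do"] [[\d] nil])
</pre>
</div>
<p>
In the project itself no such helper is needed, because the interface implementations let the Trie be treated as a map directly.
</p>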
<p>
The Hidden Markov Model itself does not lend itself to any useful graphical visualization or exploration.
</p>
</div>
</div>
<div id="outline-container-orgec327c6" class="outline-3">
<h3 id="orgec327c6"><span class="section-number-3">5.7</span> Implementation Of Interactive Queries</h3>
<div class="outline-text-3" id="text-5-7">
</div>
<div id="outline-container-org92a52fa" class="outline-4">
<h4 id="org92a52fa"><span class="section-number-4">5.7.1</span> Generate Rhyming Lyrics</h4>
<div class="outline-text-4" id="text-5-7-1">
<p>
This interactive query returns a list of phrases that rhyme with any word or phrase you enter.
</p>
<p>
For example, the phrase <code>don't bother me</code> returns the following results.
</p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-left" />
<col class="org-right" />
<col class="org-left" />
<col class="org-right" />
</colgroup>
<tbody>
<tr>
<td class="org-left">Rhyme</td>
<td class="org-right">Quality</td>
<td class="org-left">Lyric</td>
<td class="org-right">Perplexity</td>
</tr>
<tr>
<td class="org-left">forsee</td>
<td class="org-right">5</td>
<td class="org-left">i’m not one of us forsee</td>
<td class="org-right">-0.150812027039802</td>
</tr>
<tr>
<td class="org-left">wholeheartedly</td>
<td class="org-right">5</td>
<td class="org-left">purification has replaced wholeheartedly</td>
<td class="org-right">-0.23227389702753784</td>
</tr>
<tr>
<td class="org-left">merci</td>
<td class="org-right">5</td>
<td class="org-left">domine, non merci</td>
<td class="org-right">-0.2567394520839273</td>
</tr>
<tr>
<td class="org-left">oversea</td>
<td class="org-right">5</td>
<td class="org-left">i let’s torch oversea</td>
<td class="org-right">-0.3940312599117676</td>
</tr>
<tr>
<td class="org-left">me</td>
<td class="org-right">4</td>
<td class="org-left">that is found in me</td>
<td class="org-right">-0.12708613143793374</td>
</tr>
<tr>
<td class="org-left">thee</td>
<td class="org-right">4</td>
<td class="org-left">you ask thee</td>
<td class="org-right">-0.20919974848757947</td>
</tr>
<tr>
<td class="org-left">free</td>
<td class="org-right">4</td>
<td class="org-left">direct from me free</td>
<td class="org-right">-0.29056603191271085</td>
</tr>
<tr>
<td class="org-left">harmony</td>
<td class="org-right">3</td>
<td class="org-left">it’s time to go, this harmony</td>
<td class="org-right">-0.06634608923365708</td>
</tr>
<tr>
<td class="org-left">society</td>
<td class="org-right">3</td>
<td class="org-left">mutilation rejected by society</td>
<td class="org-right">-0.10624747249791901</td>
</tr>
<tr>
<td class="org-left">prophecy</td>
<td class="org-right">3</td>
<td class="org-left">take us to the brink of disaster dreamer just a savage prophecy</td>
<td class="org-right">-0.13097443386137644</td>
</tr>
<tr>
<td class="org-left">honesty</td>
<td class="org-right">3</td>
<td class="org-left">for you my threw all that can be the power not honesty</td>
<td class="org-right">-0.2423380760939454</td>
</tr>
<tr>
<td class="org-left">constantly</td>
<td class="org-right">3</td>
<td class="org-left">i thrust my sword into the dragon’s annihilation that constantly</td>
<td class="org-right">-0.2474276676860057</td>
</tr>
<tr>
<td class="org-left">reality</td>
<td class="org-right">2</td>
<td class="org-left">smack of reality</td>
<td class="org-right">-0.14811632033013192</td>
</tr>
<tr>
<td class="org-left">eternity</td>
<td class="org-right">2</td>
<td class="org-left">with trust in loneliness in eternity</td>
<td class="org-right">-0.1507561510378151</td>
</tr>
<tr>
<td class="org-left">misery</td>
<td class="org-right">2</td>
<td class="org-left">reminiscing over misery</td>
<td class="org-right">-0.29506597978960253</td>
</tr>
</tbody>
</table>
<p>
The interactive query for the above can be found at <a href="https://darklimericks.com/rhymestorm/lyric-from-seed?seed=don%27t+bother+me">https://darklimericks.com/rhymestorm/lyric-from-seed?seed=don%27t+bother+me</a>. Note that, since these lyrics are randomly generated, your results will vary.
</p>
</div>
</div>
<div id="outline-container-org4eb310c" class="outline-4">
<h4 id="org4eb310c"><span class="section-number-4">5.7.2</span> Complete Lyric Containing Suffix</h4>
<div class="outline-text-4" id="text-5-7-2">
<p>
This interactive query will return a list of lyrics completing the given suffix with randomly generated prefixes.
</p>
<p>
For example, suppose a songwriter likes the phrase <code>rejected by society</code> above but wants to brainstorm different beginnings for that line.
</p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-left" />
<col class="org-right" />
<col class="org-right" />
</colgroup>
<tbody>
<tr>
<td class="org-left">Lyric</td>
<td class="org-right">OpenNLP Perplexity</td>
<td class="org-right">Per-word OpenNLP Perplexity</td>
</tr>
<tr>
<td class="org-left">we have rejected by society</td>
<td class="org-right">-0.6593112258099724</td>
<td class="org-right">-0.03878301328293955</td>
</tr>
<tr>
<td class="org-left">she rejected by society</td>
<td class="org-right">-1.0992937688019973</td>
<td class="org-right">-0.07852098348585694</td>
</tr>
<tr>
<td class="org-left">i was despised and rejected by society</td>
<td class="org-right">-3.5925278871864497</td>
<td class="org-right">-0.15619686466028043</td>
</tr>
<tr>
<td class="org-left">the exiled and rejected by society</td>
<td class="org-right">-3.6944350673672144</td>
<td class="org-right">-0.21731970984513027</td>
</tr>
<tr>
<td class="org-left">to smell the death mutilation rejected by society</td>
<td class="org-right">-5.899263654566813</td>
<td class="org-right">-0.2458026522736172</td>
</tr>
<tr>
<td class="org-left">time goes yearning again only to be rejected by society</td>
<td class="org-right">-2.764028722852962</td>
<td class="org-right">-0.08375844614705946</td>
</tr>
<tr>
<td class="org-left">you won’t survive the mutilation rejected by society</td>
<td class="org-right">-2.5299544352623986</td>
<td class="org-right">-0.09035551554508567</td>
</tr>
<tr>
<td class="org-left">your rejected by society</td>
<td class="org-right">-1.4840658880458661</td>
<td class="org-right">-0.10600470628899043</td>
</tr>
<tr>
<td class="org-left">dividing lands, rejected by society</td>
<td class="org-right">-2.2975947244849793</td>
<td class="org-right">-0.12764415136027663</td>
</tr>
<tr>
<td class="org-left">a voice summons all angry exiled and rejected by society</td>
<td class="org-right">-9.900290597751827</td>
<td class="org-right">-0.17679090353128263</td>
</tr>
<tr>
<td class="org-left">protect the rejected by society</td>
<td class="org-right">-4.210741684291847</td>
<td class="org-right">-0.28071611228612314</td>
</tr>
</tbody>
</table>
<p>
The interactive query for the above can be found at <a href="https://darklimericks.com/rhymestorm/rhyming-lyric?rhyming-lyric-target=rejected+by+society">https://darklimericks.com/rhymestorm/rhyming-lyric?rhyming-lyric-target=rejected+by+society</a>. Note again that your results will vary.
</p>
</div>
</div>
</div>
<div id="outline-container-org875011a" class="outline-3">
<h3 id="org875011a"><span class="section-number-3">5.8</span> Implementation Of Machine Learning Methods</h3>
<div class="outline-text-3" id="text-5-8">
<p>
The machine learning method chosen for this software is a Hidden Markov Model.
</p>
<p>
Each line of each song is split into “tokens” (words) and then the previous <code>n - 1</code> tokens are used to predict the <code>nth</code> token.
</p>
<p>
The algorithm is implemented in several parts which are demonstrated below.
</p>
<ol class="org-ol">
<li>Read each song line-by-line.</li>
<li>Split each line into tokens.</li>
<li>Partition the tokens into sequences of length <code>n</code>.</li>
<li>Insert each sequence into a Trie, incrementing the count of how many times that sequence has been encountered.</li>
</ol>
<p>
That is the process for building the Hidden Markov Model.
</p>
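<p>
The four steps above can be sketched with core Clojure alone. This is a simplification under stated assumptions: a flat map of counts stands in for the Trie, lines are passed in directly instead of being read from files, and <code>tokenize</code> and <code>ngram-counts</code> are hypothetical names, not the project’s functions.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(require '[clojure.string :as string])

(defn tokenize
  "Lower-case a line and split it on whitespace. A simplification; the
  project's own tokenizer is not reproduced here."
  [line]
  (remove empty? (string/split (string/lower-case line) #"\s+")))

(defn ngram-counts
  "Count every run of `n` consecutive tokens across `lines`.
  A flat map stands in for the Trie used by the project."
  [n lines]
  (frequencies
   (mapcat (fn [line] (partition n 1 (tokenize line)))
           lines)))

(ngram-counts 2 ["the dark night" "the dark sky"])
;; returns {("the" "dark") 2, ("dark" "night") 1, ("dark" "sky") 1}
</pre>
</div>
<p>
A Trie stores the same counts but shares common prefixes between sequences, which is what makes looking up the continuations of a given context cheap.
</p>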
<p>
The algorithm for generating predictions from the HMM is as follows.
</p>
<ol class="org-ol">
<li>Look up the <code>n - 1</code> tokens in the Trie.</li>
<li>Normalize the frequencies of the children of the <code>n - 1</code> tokens into percentage likelihoods.</li>
<li>Account for “unseen” <code>n-grams</code> using Simple Good-Turing smoothing.</li>
<li>Sort results by maximum likelihood.</li>
</ol>
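<p>
Steps 1, 2, and 4 of the prediction process can be sketched as follows. The sketch assumes n-gram counts held in a flat map and scans them linearly instead of doing a Trie lookup; it omits the Simple Good-Turing smoothing of step 3 entirely. The names <code>counts</code> and <code>next-token-likelihoods</code> are hypothetical.
</p>
<div class="org-src-container">
<pre class="src src-clojure">;; Hypothetical n-gram counts; in the project these live in the Trie.
(def counts
  {["dark" "night"] 3
   ["dark" "sky"] 1
   ["the" "dark"] 4})

(defn next-token-likelihoods
  "Given n-gram counts and a context of n - 1 tokens, return
  [token likelihood] pairs sorted from most to least likely.
  No smoothing for unseen n-grams is applied."
  [counts context]
  (let [context (seq context)
        ;; Keep only the n-grams whose first n - 1 tokens match the context.
        matches (filter (fn [[ngram _]] (= context (butlast ngram))) counts)
        total (reduce + (map second matches))]
    (sort-by (comp - second)
             (map (fn [[ngram freq]] [(last ngram) (/ freq total)])
                  matches))))

(next-token-likelihoods counts ["dark"])
;; returns (["night" 3/4] ["sky" 1/4])
</pre>
</div>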
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span>require '<span style="color: #c678dd;">[</span>com.owoga.prhyme.data-transform <span style="color: #a9a1e1;">:as</span> data-transform<span style="color: #c678dd;">]</span>
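'<span style="color: #c678dd;">[</span>clojure.pprint <span style="color: #a9a1e1;">:as</span> pprint<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>clojure.string <span style="color: #a9a1e1;">:as</span> string<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>clojure.java.io <span style="color: #a9a1e1;">:as</span> io<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>com.owoga.trie <span style="color: #a9a1e1;">:as</span> trie<span style="color: #c678dd;">]</span><span style="color: #51afef;">)</span>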
<span style="color: #51afef;">(</span><span style="color: #51afef;">defn</span> <span style="color: #c678dd;">file-seq->markov-trie</span>
<span style="color: #83898d;">"For forwards markov."</span>
<span style="color: #c678dd;">[</span>database files n m<span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span>transduce
<span style="color: #98be65;">(</span>comp
<span style="color: #a9a1e1;">(</span>map slurp<span style="color: #a9a1e1;">)</span>
<span style="color: #a9a1e1;">(</span>map #<span style="color: #51afef;">(</span><span style="color: #ECBE7B;">string</span>/split <span style="color: #dcaeea;">%</span> #<span style="color: #98be65;">"[</span><span style="color: #98be65; font-weight: bold;">\n</span><span style="color: #98be65;">+</span><span style="color: #98be65; font-weight: bold;">\?\.</span><span style="color: #98be65;">]"</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
<span style="color: #a9a1e1;">(</span>map <span style="color: #51afef;">(</span>partial transduce <span style="color: #ECBE7B;">data-transform</span>/xf-tokenize conj<span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
<span style="color: #a9a1e1;">(</span>map <span style="color: #51afef;">(</span>partial transduce <span style="color: #ECBE7B;">data-transform</span>/xf-filter-english conj<span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
<span style="color: #a9a1e1;">(</span>map <span style="color: #51afef;">(</span>partial remove empty?<span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
<span style="color: #a9a1e1;">(</span>map <span style="color: #51afef;">(</span>partial into <span style="color: #c678dd;">[]</span> <span style="color: #c678dd;">(</span><span style="color: #ECBE7B;">data-transform</span>/xf-pad-tokens <span style="color: #98be65;">(</span>dec m<span style="color: #98be65;">)</span> <span style="color: #98be65;">"<s>"</span> <span style="color: #da8548; font-weight: bold;">1</span> <span style="color: #98be65;">"</s>"</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
<span style="color: #a9a1e1;">(</span>map <span style="color: #51afef;">(</span>partial mapcat <span style="color: #c678dd;">(</span>partial <span style="color: #ECBE7B;">data-transform</span>/n-to-m-partitions n <span style="color: #98be65;">(</span>inc m<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
<span style="color: #a9a1e1;">(</span>mapcat <span style="color: #51afef;">(</span>partial mapv <span style="color: #c678dd;">(</span><span style="color: #ECBE7B;">data-transform</span>/make-database-processor database<span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>completing
<span style="color: #a9a1e1;">(</span><span style="color: #51afef;">fn</span> <span style="color: #51afef;">[</span>trie lookup<span style="color: #51afef;">]</span>
<span style="color: #51afef;">(</span>update trie lookup <span style="color: #c678dd;">(</span>fnil #<span style="color: #98be65;">(</span>update <span style="color: #dcaeea;">%</span> <span style="color: #da8548; font-weight: bold;">1</span> inc<span style="color: #98be65;">)</span> <span style="color: #98be65;">[</span>lookup <span style="color: #da8548; font-weight: bold;">0</span><span style="color: #98be65;">]</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span><span style="color: #ECBE7B;">trie</span>/make-trie<span style="color: #98be65;">)</span>
files<span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">let</span> <span style="color: #c678dd;">[</span>files <span style="color: #98be65;">(</span><span style="color: #51afef;">->></span> <span style="color: #98be65;">"/home/eihli/src/prhyme/dark-corpus"</span>
<span style="color: #ECBE7B;">io</span>/file
file-seq
<span style="color: #a9a1e1;">(</span>eduction <span style="color: #51afef;">(</span><span style="color: #ECBE7B;">data-transform</span>/xf-file-seq <span style="color: #da8548; font-weight: bold;">501</span> <span style="color: #da8548; font-weight: bold;">2</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
database <span style="color: #98be65;">(</span>atom <span style="color: #a9a1e1;">{</span><span style="color: #a9a1e1;">:next-id</span> <span style="color: #da8548; font-weight: bold;">1</span><span style="color: #a9a1e1;">}</span><span style="color: #98be65;">)</span>
trie <span style="color: #98be65;">(</span>file-seq->markov-trie database files <span style="color: #da8548; font-weight: bold;">1</span> <span style="color: #da8548; font-weight: bold;">3</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span><span style="color: #ECBE7B;">pprint</span>/pprint <span style="color: #98be65;">[</span><span style="color: #a9a1e1;">(</span>map <span style="color: #51afef;">(</span>comp <span style="color: #c678dd;">(</span>partial map @database<span style="color: #c678dd;">)</span> first<span style="color: #51afef;">)</span> <span style="color: #51afef;">(</span>take <span style="color: #da8548; font-weight: bold;">10</span> <span style="color: #c678dd;">(</span>drop <span style="color: #da8548; font-weight: bold;">105</span> trie<span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">]</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
</pre>
</div>
<pre class="example" id="orgeb7813e">
[(("<s>" "call" "me")
("<s>" "call")
("<s>" "right" "</s>")
("<s>" "right")
("<s>" "that's" "proportional")
("<s>" "that's")
("<s>" "don't" "</s>")
("<s>" "don't")
("<s>" "yourself" "in")
("<s>" "yourself"))]
</pre>
<p>
The results above show a sample of 10 elements in a 1-to-3-gram trie.
</p>
<p>
The code sample below demonstrates training a Hidden Markov Model on a set of lyrics where each line gets reversed. This model is useful for predicting words backwards, so that you can start with the rhyming end of a word or phrase and generate backwards to the start of the lyric.
</p>
<p>
It also performs compaction and serialization. Song lyrics are typically provided as text files, and reading them from disk is expensive. We therefore run the expensive training process only once and save the resulting Markov Model in a more memory-efficient format.
</p>
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span>require '<span style="color: #c678dd;">[</span>com.owoga.corpus.markov <span style="color: #a9a1e1;">:as</span> markov<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>taoensso.nippy <span style="color: #a9a1e1;">:as</span> nippy<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>com.owoga.prhyme.data-transform <span style="color: #a9a1e1;">:as</span> data-transform<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>clojure.pprint <span style="color: #a9a1e1;">:as</span> pprint<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>clojure.string <span style="color: #a9a1e1;">:as</span> string<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>com.owoga.trie <span style="color: #a9a1e1;">:as</span> trie<span style="color: #c678dd;">]</span>
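'<span style="color: #c678dd;">[</span>com.owoga.tightly-packed-trie <span style="color: #a9a1e1;">:as</span> tpt<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>clojure.java.io <span style="color: #a9a1e1;">:as</span> io<span style="color: #c678dd;">]</span><span style="color: #51afef;">)</span>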
<span style="color: #51afef;">(</span><span style="color: #51afef;">defn</span> <span style="color: #c678dd;">train-backwards</span>
<span style="color: #83898d;">"For building lines backwards so they can be seeded with a target rhyme."</span>
<span style="color: #c678dd;">[</span>files n m trie-filepath database-filepath tightly-packed-trie-filepath<span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span><span style="color: #51afef;">let</span> <span style="color: #98be65;">[</span>database <span style="color: #a9a1e1;">(</span>atom <span style="color: #51afef;">{</span><span style="color: #a9a1e1;">:next-id</span> <span style="color: #da8548; font-weight: bold;">1</span><span style="color: #51afef;">}</span><span style="color: #a9a1e1;">)</span>
trie <span style="color: #a9a1e1;">(</span><span style="color: #ECBE7B;">markov</span>/file-seq->backwards-markov-trie database files n m<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">]</span>
<span style="color: #98be65;">(</span><span style="color: #ECBE7B;">nippy</span>/freeze-to-file trie-filepath <span style="color: #a9a1e1;">(</span>seq trie<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>println <span style="color: #98be65;">"Froze"</span> trie-filepath<span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span><span style="color: #ECBE7B;">nippy</span>/freeze-to-file database-filepath @database<span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>println <span style="color: #98be65;">"Froze"</span> database-filepath<span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span><span style="color: #ECBE7B;">markov</span>/save-tightly-packed-trie trie database tightly-packed-trie-filepath<span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span><span style="color: #51afef;">let</span> <span style="color: #a9a1e1;">[</span>loaded-trie <span style="color: #51afef;">(</span><span style="color: #51afef;">->></span> trie-filepath
<span style="color: #ECBE7B;">nippy</span>/thaw-from-file
<span style="color: #c678dd;">(</span>into <span style="color: #98be65;">(</span><span style="color: #ECBE7B;">trie</span>/make-trie<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
loaded-db <span style="color: #51afef;">(</span><span style="color: #51afef;">->></span> database-filepath
<span style="color: #ECBE7B;">nippy</span>/thaw-from-file<span style="color: #51afef;">)</span>
loaded-tightly-packed-trie <span style="color: #51afef;">(</span><span style="color: #ECBE7B;">tpt</span>/load-tightly-packed-trie-from-file
tightly-packed-trie-filepath
<span style="color: #c678dd;">(</span><span style="color: #ECBE7B;">markov</span>/decode-fn loaded-db<span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">]</span>
<span style="color: #a9a1e1;">(</span>println <span style="color: #98be65;">"Loaded trie:"</span> <span style="color: #51afef;">(</span>take <span style="color: #da8548; font-weight: bold;">5</span> loaded-trie<span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
<span style="color: #a9a1e1;">(</span>println <span style="color: #98be65;">"Loaded database:"</span> <span style="color: #51afef;">(</span>take <span style="color: #da8548; font-weight: bold;">5</span> loaded-db<span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
<span style="color: #a9a1e1;">(</span>println <span style="color: #98be65;">"Loaded tightly-packed-trie:"</span> <span style="color: #51afef;">(</span>take <span style="color: #da8548; font-weight: bold;">5</span> loaded-tightly-packed-trie<span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
<span style="color: #a9a1e1;">(</span>println <span style="color: #98be65;">"Successfully loaded trie and database."</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">let</span> <span style="color: #c678dd;">[</span>files <span style="color: #98be65;">(</span><span style="color: #51afef;">->></span> <span style="color: #98be65;">"/home/eihli/src/prhyme/dark-corpus"</span>
<span style="color: #ECBE7B;">io</span>/file
file-seq
<span style="color: #a9a1e1;">(</span>eduction <span style="color: #51afef;">(</span><span style="color: #ECBE7B;">data-transform</span>/xf-file-seq <span style="color: #da8548; font-weight: bold;">0</span> <span style="color: #da8548; font-weight: bold;">4</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">[</span>trie database<span style="color: #98be65;">]</span> <span style="color: #98be65;">(</span>train-backwards
files
<span style="color: #da8548; font-weight: bold;">1</span>
<span style="color: #da8548; font-weight: bold;">5</span>
<span style="color: #98be65;">"/tmp/markov-trie-4-gram-backwards.bin"</span>
<span style="color: #98be65;">"/tmp/markov-database-4-gram-backwards.bin"</span>
<span style="color: #98be65;">"/tmp/markov-tightly-packed-trie-4-gram-backwards.bin"</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">]</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">def</span> <span style="color: #dcaeea;">markov-trie</span> <span style="color: #c678dd;">(</span>into <span style="color: #98be65;">(</span><span style="color: #ECBE7B;">trie</span>/make-trie<span style="color: #98be65;">)</span> <span style="color: #98be65;">(</span><span style="color: #ECBE7B;">nippy</span>/thaw-from-file <span style="color: #98be65;">"/tmp/markov-trie-4-gram-backwards.bin"</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">def</span> <span style="color: #dcaeea;">database</span> <span style="color: #c678dd;">(</span><span style="color: #ECBE7B;">nippy</span>/thaw-from-file <span style="color: #98be65;">"/tmp/markov-database-4-gram-backwards.bin"</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">def</span> <span style="color: #dcaeea;">markov-tight-trie</span>
<span style="color: #c678dd;">(</span><span style="color: #ECBE7B;">tpt</span>/load-tightly-packed-trie-from-file
<span style="color: #98be65;">"/tmp/markov-tightly-packed-trie-4-gram-backwards.bin"</span>
<span style="color: #98be65;">(</span><span style="color: #ECBE7B;">markov</span>/decode-fn database<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>println <span style="color: #98be65;">"</span><span style="color: #98be65; font-weight: bold;">\n\n</span><span style="color: #98be65;"> Example n-gram frequencies from Hidden Markov Model:</span><span style="color: #98be65; font-weight: bold;">\n</span><span style="color: #98be65;">"</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #ECBE7B;">pprint</span>/pprint
<span style="color: #c678dd;">(</span><span style="color: #51afef;">->></span> markov-tight-trie
<span style="color: #98be65;">(</span>drop <span style="color: #da8548; font-weight: bold;">600</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>take <span style="color: #da8548; font-weight: bold;">10</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span>map
<span style="color: #a9a1e1;">(</span><span style="color: #51afef;">fn</span> <span style="color: #51afef;">[</span><span style="color: #c678dd;">[</span>ngram-ids <span style="color: #98be65;">[</span>id freq<span style="color: #98be65;">]</span><span style="color: #c678dd;">]</span><span style="color: #51afef;">]</span>
<span style="color: #51afef;">[</span><span style="color: #c678dd;">(</span><span style="color: #ECBE7B;">string</span>/join <span style="color: #98be65;">" "</span> <span style="color: #98be65;">(</span>map database ngram-ids<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span> freq<span style="color: #51afef;">]</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
</pre>
</div>
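<p>
The line-reversal idea behind the backwards model can be isolated in a small sketch. It assumes whitespace tokenization and uses a hypothetical helper, <code>reversed-ngrams</code>, which is not part of the project.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(require '[clojure.string :as string])

(defn reversed-ngrams
  "Reverse the tokens of `line`, then emit every run of `n` tokens."
  [n line]
  (partition n 1 (reverse (string/split line #"\s+"))))

(reversed-ngrams 2 "rejected by society")
;; returns (("society" "by") ("by" "rejected"))
</pre>
</div>
<p>
A model trained on these sequences predicts <code>by</code> after <code>society</code>, so generation can begin at the rhyming word and proceed right to left.
</p>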
</div>
</div>
<div id="outline-container-org5824f12" class="outline-3">
<h3 id="org5824f12"><span class="section-number-3">5.9</span> Functionalities To Evaluate The Accuracy Of The Data Product</h3>
<div class="outline-text-3" id="text-5-9">
<p>
Since creative brainstorming is the goal, “accuracy” is subjective.
</p>
<p>
We can, however, compare language generation algorithms by measuring how “expected” a generated phrase is given the training data. This measurement is called “perplexity”.
</p>
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span>require '<span style="color: #c678dd;">[</span>taoensso.nippy <span style="color: #a9a1e1;">:as</span> nippy<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>com.owoga.tightly-packed-trie <span style="color: #a9a1e1;">:as</span> tpt<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>com.owoga.corpus.markov <span style="color: #a9a1e1;">:as</span> markov<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>clojure.java.io <span style="color: #a9a1e1;">:as</span> io<span style="color: #c678dd;">]</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">def</span> <span style="color: #dcaeea;">database</span> <span style="color: #c678dd;">(</span><span style="color: #ECBE7B;">nippy</span>/thaw-from-file <span style="color: #98be65;">(</span><span style="color: #ECBE7B;">io</span>/resource <span style="color: #98be65;">"models/markov-database-4-gram-backwards.bin"</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">def</span> <span style="color: #dcaeea;">markov-tight-trie</span>
<span style="color: #c678dd;">(</span><span style="color: #ECBE7B;">tpt</span>/load-tightly-packed-trie-from-file
<span style="color: #98be65;">(</span><span style="color: #ECBE7B;">io</span>/resource <span style="color: #98be65;">"models/markov-tightly-packed-trie-4-gram-backwards.bin"</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">(</span><span style="color: #ECBE7B;">markov</span>/decode-fn database<span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">let</span> <span style="color: #c678dd;">[</span>likely-phrase <span style="color: #98be65;">[</span><span style="color: #98be65;">"a"</span> <span style="color: #98be65;">"hole"</span> <span style="color: #98be65;">"</s>"</span> <span style="color: #98be65;">"</s>"</span><span style="color: #98be65;">]</span>
less-likely-phrase <span style="color: #98be65;">[</span><span style="color: #98be65;">"this"</span> <span style="color: #98be65;">"hole"</span> <span style="color: #98be65;">"</s>"</span> <span style="color: #98be65;">"</s>"</span><span style="color: #98be65;">]</span>
least-likely-phrase <span style="color: #98be65;">[</span><span style="color: #98be65;">"that"</span> <span style="color: #98be65;">"hole"</span> <span style="color: #98be65;">"</s>"</span> <span style="color: #98be65;">"</s>"</span><span style="color: #98be65;">]</span><span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span>run!
<span style="color: #98be65;">(</span><span style="color: #51afef;">fn</span> <span style="color: #a9a1e1;">[</span>word<span style="color: #a9a1e1;">]</span>
<span style="color: #a9a1e1;">(</span>println
<span style="color: #51afef;">(</span>format
<span style="color: #98be65;">"</span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;">%s</span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;"> has preceeded </span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;">hole</span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;"> </span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;"></s></span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;"> </span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;"></s></span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;"> a total of %s times"</span>
word
<span style="color: #c678dd;">(</span>second <span style="color: #98be65;">(</span>get markov-tight-trie <span style="color: #a9a1e1;">(</span>map database <span style="color: #51afef;">[</span><span style="color: #98be65;">"</s>"</span> <span style="color: #98be65;">"</s>"</span> <span style="color: #98be65;">"hole"</span> word<span style="color: #51afef;">]</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">[</span><span style="color: #98be65;">"a"</span> <span style="color: #98be65;">"this"</span> <span style="color: #98be65;">"that"</span><span style="color: #98be65;">]</span><span style="color: #c678dd;">)</span>
<span style="color: #c678dd;">(</span>run!
<span style="color: #98be65;">(</span><span style="color: #51afef;">fn</span> <span style="color: #a9a1e1;">[</span>word<span style="color: #a9a1e1;">]</span>
<span style="color: #a9a1e1;">(</span><span style="color: #51afef;">let</span> <span style="color: #51afef;">[</span>seed <span style="color: #c678dd;">[</span><span style="color: #98be65;">"</s>"</span> <span style="color: #98be65;">"</s>"</span> <span style="color: #98be65;">"hole"</span> word<span style="color: #c678dd;">]</span><span style="color: #51afef;">]</span>
<span style="color: #51afef;">(</span>println
<span style="color: #c678dd;">(</span>format
<span style="color: #98be65;">"%s is the perplexity of </span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;">%s</span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;"> </span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;">hole</span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;"> </span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;"></s></span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;"> </span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;"></s></span><span style="color: #98be65; font-weight: bold;">\"</span><span style="color: #98be65;">"</span>
<span style="color: #98be65;">(</span><span style="color: #51afef;">->></span> seed
<span style="color: #a9a1e1;">(</span>map database<span style="color: #a9a1e1;">)</span>
<span style="color: #a9a1e1;">(</span><span style="color: #ECBE7B;">markov</span>/perplexity <span style="color: #da8548; font-weight: bold;">4</span> markov-tight-trie<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
word<span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span>
<span style="color: #98be65;">[</span><span style="color: #98be65;">"a"</span> <span style="color: #98be65;">"this"</span> <span style="color: #98be65;">"that"</span><span style="color: #98be65;">]</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
</pre>
</div>
<pre class="example">
"a" has preceeded "hole" "</s>" "</s>" a total of 250 times
"this" has preceeded "hole" "</s>" "</s>" a total of 173 times
"that" has preceeded "hole" "</s>" "</s>" a total of 45 times
-12.184088569934774 is the perplexity of "a" "hole" "</s>" "</s>"
-12.552930899563904 is the perplexity of "this" "hole" "</s>" "</s>"
-13.905719644461469 is the perplexity of "that" "hole" "</s>" "</s>"
</pre>
<p>
The results above make intuitive sense. The most common word to precede “hole” at the end of a sentence is “a”: 250 sentences end in “… a hole.”, compared to 173 that end in “… this hole.” and 45 that end in “… that hole.”.
</p>
<p>
Therefore, “… a hole.” has the lowest “perplexity”: its score of -12.18 is the closest to zero of the three.
</p>
<p>
This standardized measure of accuracy can be used to compare different language generation algorithms.
</p>
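<p>
The internals of <code>markov/perplexity</code> are not shown here, and the negative values above suggest it reports a log-scale score. For comparison, the sketch below computes the textbook definition of perplexity from per-token probabilities. The names <code>log2</code> and <code>perplexity</code> are defined only for this illustration and are not the project’s API.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(defn log2 [x]
  (/ (Math/log x) (Math/log 2)))

(defn perplexity
  "Textbook perplexity of a sequence, given the probability the model
  assigned to each of its N tokens: 2^(-(1/N) * sum(log2 p))."
  [probabilities]
  (let [n (count probabilities)
        mean-log-prob (/ (reduce + (map log2 probabilities)) n)]
    (Math/pow 2 (- mean-log-prob))))

;; A sequence whose every token had probability 1/2 is as "surprising"
;; as one fair coin flip per token.
(perplexity [0.5 0.5 0.5 0.5])
;; => 2.0
</pre>
</div>
<p>
Under this definition, a phrase made of more probable tokens always scores lower, which matches the ordering of the three phrases above.
</p>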
</div>
</div>
<div id="outline-container-org88dc329" class="outline-3">
<h3 id="org88dc329"><span class="section-number-3">5.10</span> Security Features</h3>
<div class="outline-text-3" id="text-5-10">
<p>
Artists and songwriters place a lot of value in the secrecy of their content. Therefore, all communication with the web-based interface occurs over a secure connection using HTTPS.
</p>
<p>
Security certificates are issued by Let’s Encrypt, and an Nginx web server handles SSL termination.
</p>
<p>
With this precaution in place, attackers will not be able to snoop on the content that songwriters send to or receive from the servers.
</p>
</div>
</div>
<div id="outline-container-org613bd8f" class="outline-3">
<h3 id="org613bd8f"><span class="section-number-3">5.11</span> Tools To Monitor And Maintain The Product</h3>
<div class="outline-text-3" id="text-5-11">
<p>
By placing the application server behind an HAProxy load balancer, we can take advantage of the built-in HAProxy stats page to monitor the amount of traffic and the health of the application servers.
</p>
<div id="org2112f75" class="figure">
<p><img src="images/stats.png" alt="stats.png" />
</p>
</div>
<p>
<a href="http://darklimericks.com:8404/stats">http://darklimericks.com:8404/stats</a>
</p>
<p>
That page is behind basic authentication with username: admin and password: admin.
</p>
<p>
The server also includes the <code>certbot</code> script for updating and maintaining the SSL certificates issued by Let’s Encrypt.
</p>
</div>
</div>
<div id="outline-container-orgc6266b7" class="outline-3">
<h3 id="orgc6266b7"><span class="section-number-3">5.12</span> A User-Friendly, Functional Dashboard That Includes At Least Three Visualization Types</h3>
<div class="outline-text-3" id="text-5-12">
<p>
You can access an example of the user interface at <a href="https://darklimericks.com/rhymestorm">https://darklimericks.com/rhymestorm</a>.
</p>
<p>
You’ll see 3 input fields.
</p>
<p>
The first input field is for a word or phrase for which you wish to find a rhyme. Submitting that field will return three visualizations to help you pick a rhyme.
</p>
<p>
The first visualization is a scatter plot of rhyming words with the “quality” of the rhyme on the Y axis and the number of times that rhyming word/phrase occurs in the training corpus on the X axis.
</p>
<div id="org6aa1adf" class="figure">
<p><img src="images/rhymestorm-vis.png" alt="rhymestorm-vis.png" />
</p>
</div>
<p>
The second visualization is a word cloud where the size of each word is based on the frequency with which the word appears in the training corpus.
</p>
<div id="org950c96a" class="figure">
<p><img src="images/rhymestorm-vis-cloud.png" alt="rhymestorm-vis-cloud.png" />
</p>
</div>
<p>
The third visualization is a table that lists all of the rhymes, their pronunciations, the rhyme quality, and the frequency. The table is sorted first by the rhyme quality then by the frequency.
</p>
<div id="org215dc00" class="figure">
<p><img src="images/rhymestorm-vis-table.png" alt="rhymestorm-vis-table.png" />
</p>
</div>
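<p>
The table’s sort order (rhyme quality first, then frequency) can be sketched in a few lines of Clojure. This is an illustration only: the row shape, the frequency numbers, and the name <code>sort-rhymes</code> are made up for the example, and descending order is assumed for both keys.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(def rhymes
  ;; Hypothetical rows; the real table is built from the trained model.
  [{:rhyme "policies"    :quality 5 :frequency 12}
   {:rhyme "pathologies" :quality 7 :frequency 3}
   {:rhyme "apologies"   :quality 7 :frequency 40}])

(defn sort-rhymes
  "Best rhyme quality first; ties broken by highest corpus frequency."
  [rows]
  ;; juxt builds a [quality frequency] sort key; negating each number
  ;; turns the default ascending sort into a descending one.
  (sort-by (juxt (comp - :quality) (comp - :frequency)) rows))

(map :rhyme (sort-rhymes rhymes))
;; => ("apologies" "pathologies" "policies")
</pre>
</div>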
</div>
</div>
</div>
<div id="outline-container-remaining-documentation" class="outline-2">
<h2 id="remaining-documentation"><span class="section-number-2">6</span> D. Documentation</h2>
<div class="outline-text-2" id="text-remaining-documentation">
</div>
<div id="outline-container-org9df4605" class="outline-3">
<h3 id="org9df4605"><span class="section-number-3">6.1</span> Business Vision</h3>
<div class="outline-text-3" id="text-6-1">
<p>
Supercharge songwriters’ abilities with automated rhyming lyric suggestions for brainstorming.
</p>
<p>
Without the physical constraints imposed by paperback rhyming dictionaries, and with the full power of machine learning training, RhymeStorm™ will find rhymes that don’t show up in typical rhyming dictionaries.
</p>
<p>
Rhymes and lyric suggestions will further be honed to target specific genres based on the training data set.
</p>
<p>
These two features combine with the speed of modern-day processing to provide rapid-fire rhyming suggestions never before seen.
</p>
</div>
<div id="outline-container-orga3bdd1c" class="outline-4">
<h4 id="orga3bdd1c"><span class="section-number-4">6.1.1</span> Requirements</h4>
<div class="outline-text-4" id="text-6-1-1">
<ul class="org-ul">
<li class="on"><code>[X]</code> Given a word or phrase, suggest rhymes (ranked by quality) (Trie)</li>
<li class="trans"><code>[-]</code> Given a word or phrase, suggest lyric completion (Hidden Markov Model)
<ul class="org-ul">
<li class="off"><code>[ ]</code> (Future iteration) Restrict suggestion by syllable count</li>
<li class="on"><code>[X]</code> Sort suggestions by frequency of occurrence in training corpus</li>
<li class="on"><code>[X]</code> Sort suggestions by rhyme quality</li>
<li class="off"><code>[ ]</code> (Future iteration) Show graph of suggestions with perplexity on one axis and rhyme quality on the other</li>
</ul></li>
</ul>
</div>
</div>
</div>
<div id="outline-container-orgd136d58" class="outline-3">
<h3 id="orgd136d58"><span class="section-number-3">6.2</span> Data Sets</h3>
<div class="outline-text-3" id="text-6-2">
<p>
I obtained the dataset from <a href="http://darklyrics.com">http://darklyrics.com</a>.
</p>
<p>
The code that I used to download all of the lyrics is at <a href="https://github.com/eihli/prhyme/blob/master/src/com/owoga/corpus/darklyrics.clj">https://github.com/eihli/prhyme/blob/master/src/com/owoga/corpus/darklyrics.clj</a>.
</p>
<p>
In the interest of being nice to the owners of <a href="http://darklyrics.com">http://darklyrics.com</a>, I’m keeping private the files containing the lyrics.
</p>
<p>
The trained data model is available.
</p>
<p>
See <code>web/resources/models/</code>
</p>
</div>
</div>
<div id="outline-container-orgf736042" class="outline-3">
<h3 id="orgf736042"><span class="section-number-3">6.3</span> Data Analysis</h3>
<div class="outline-text-3" id="text-6-3">
<p>
I wrote code to perform certain types of data analysis, but I didn’t find it useful to meet the business requirements of this project.
</p>
<p>
For example, there is natural language processing code at <a href="https://github.com/eihli/prhyme/blob/master/src/com/owoga/prhyme/nlp/core.clj">https://github.com/eihli/prhyme/blob/master/src/com/owoga/prhyme/nlp/core.clj</a> that parses a line into a grammar tree. I wrote several functions to manipulate and aggregate information about the grammar trees that compose the corpus. But I didn’t use any of that information in the creation of the n-gram Hidden Markov Model or in the user display. For tasks related to brainstorming rhyming lyrics, that extra information lacked significant value.
</p>
</div>
</div>
<div id="outline-container-org407721c" class="outline-3">
<h3 id="org407721c"><span class="section-number-3">6.4</span> Assessment Of Hypothesis</h3>
<div class="outline-text-3" id="text-6-4">
<p>
I’ll use an example output to subjectively assess the results of the project.
</p>
<p>
Below are some of the lyrics suggested to rhyme with the word “technologies”.
</p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-left" />
<col class="org-right" />
<col class="org-left" />
<col class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Rhyme</th>
<th scope="col" class="org-right">Quality</th>
<th scope="col" class="org-left">Lyric</th>
<th scope="col" class="org-right">Perplexity</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">technologies</td>
<td class="org-right">8</td>
<td class="org-left">you will tear the skin from the nuclear technologies</td>
<td class="org-right">-0.04695091652785746</td>
</tr>
<tr>
<td class="org-left">pathologies</td>
<td class="org-right">7</td>
<td class="org-left">there’s no hope for body’s pathologies</td>
<td class="org-right">-0.09800371561934312</td>
</tr>
<tr>
<td class="org-left">apologies</td>
<td class="org-right">7</td>
<td class="org-left">swimming in a grey world dying it’s time for apologies</td>
<td class="org-right">-0.14781111654643642</td>
</tr>
<tr>
<td class="org-left">chronologies</td>
<td class="org-right">7</td>
<td class="org-left">damn god damn the seed lurks in chronologies</td>
<td class="org-right">-0.20912909334441387</td>
</tr>
<tr>
<td class="org-left">anomalies</td>
<td class="org-right">6</td>
<td class="org-left">yesterday was born i encounter the anomalies</td>
<td class="org-right">-0.19578505194217627</td>
</tr>
<tr>
<td class="org-left">atrocities</td>
<td class="org-right">6</td>
<td class="org-left">there’s no return and and the pimp your atrocities</td>
<td class="org-right">-0.21516240668167685</td>
</tr>
<tr>
<td class="org-left">ideologies</td>
<td class="org-right">6</td>
<td class="org-left">entrenched ideologies</td>
<td class="org-right">-0.27407234083849513</td>
</tr>
<tr>
<td class="org-left">monopolies</td>
<td class="org-right">6</td>
<td class="org-left">monopolies</td>
<td class="org-right">-0.8472654185540912</td>
</tr>
<tr>
<td class="org-left">qualities</td>
<td class="org-right">5</td>
<td class="org-left">with such qualities</td>
<td class="org-right">-0.0793752454750395</td>
</tr>
<tr>
<td class="org-left">policies</td>
<td class="org-right">5</td>
<td class="org-left">stop looking at insurance policies</td>
<td class="org-right">-0.11580898408112054</td>
</tr>
<tr>
<td class="org-left">colonies</td>
<td class="org-right">5</td>
<td class="org-left">betwixt my heels, through the tears you collapse the colonies</td>
<td class="org-right">-0.1610184959356118</td>
</tr>
<tr>
<td class="org-left">harmonies</td>
<td class="org-right">5</td>
<td class="org-left">broken harmonies</td>
<td class="org-right">-0.18655087962492334</td>
</tr>
<tr>
<td class="org-left">prophecies</td>
<td class="org-right">5</td>
<td class="org-left">seek the truth prophecies</td>
<td class="org-right">-0.24506696021938001</td>
</tr>
<tr>
<td class="org-left">festivities</td>
<td class="org-right">4</td>
<td class="org-left">you have touching the festivities</td>
<td class="org-right">-0.09271388814221376</td>
</tr>
<tr>
<td class="org-left">delicacies</td>
<td class="org-right">4</td>
<td class="org-left">grey that consumes what it never was sun and the delicacies</td>
<td class="org-right">-0.14553081854920977</td>
</tr>
<tr>
<td class="org-left">anybody’s</td>
<td class="org-right">4</td>
<td class="org-left">your eyes, will remain violent the anybody’s</td>
<td class="org-right">-0.17560987263626957</td>
</tr>
<tr>
<td class="org-left">extremities</td>
<td class="org-right">4</td>
<td class="org-left">i am missing extremities</td>
<td class="org-right">-0.30386279996641197</td>
</tr>
<tr>
<td class="org-left">casualties</td>
<td class="org-right">3</td>
<td class="org-left">feed the casualties</td>
<td class="org-right">-0.23600199637494926</td>
</tr>
</tbody>
</table>
<p>
Do these lyrics provide benefit to the brainstorming process?
</p>
<p>
The lines “make sense” to varying degrees.
</p>
<p>
The “pathologies” line, for example, contains a sensible 2-gram of “body’s pathologies”. The model has learned that the possessive form of “body” is a reasonable prefix to the word “pathologies”.
</p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-left" />
<col class="org-right" />
<col class="org-left" />
<col class="org-right" />
</colgroup>
<tbody>
<tr>
<td class="org-left">pathologies</td>
<td class="org-right">7</td>
<td class="org-left">there’s no hope for body’s pathologies</td>
<td class="org-right">-0.09800371561934312</td>
</tr>
</tbody>
</table>
<p>
And the beginning of that line contains a phrase, “there’s no hope”, that fits perfectly with the genre/context of the training set (dark heavy metal).
</p>
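<p>
To make the n-gram terminology concrete, the sketch below splits a line into overlapping n-grams using <code>partition</code>. The helper name <code>ngrams</code> is invented for this example; the project’s own tokenization may differ.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(require '[clojure.string :as string])

(defn ngrams
  "Splits a line on whitespace and returns every run of n adjacent words."
  [n line]
  (partition n 1 (string/split line #"\s+")))

;; The final 2-gram of the "pathologies" line from the table above.
(last (ngrams 2 "there’s no hope for body’s pathologies"))
;; => ("body’s" "pathologies")
</pre>
</div>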
<p>
It’s clear that the training worked. The output is relevant to the genre and grammatically reasonable.
</p>
<p>
There’s also a wide variety in the output, which is beneficial for
brainstorming. Suggestions range from clean and clear rhymes, like
“technologies” and “pathologies”, to more abstract rhymes like “technologies”
and “anybody’s”, which some artists can creatively manipulate effectively.
</p>
<p>
I assess that this version of the product is viable, and there are exciting
possibilities for improvement, such as making suggestions that meet
certain stress patterns, preferring phrases that contain synonyms or antonyms,
and more.
</p>
</div>
</div>
<div id="outline-container-org2d951c6" class="outline-3">
<h3 id="org2d951c6"><span class="section-number-3">6.5</span> Visualizations</h3>
<div class="outline-text-3" id="text-6-5">
<p>
RhymeStorm™ provides three visualizations to help songwriters find the perfect lyric.
</p>
<p>
The first visualization is a scatterplot comparing rhyme quality to the frequency with which the rhyming word or phrase appears in the training corpus.
</p>
<div id="orgfb59f99" class="figure">
<p><img src="images/rhyme-scatterplot.png" alt="rhyme-scatterplot.png" />
</p>
</div>
<p>
The second visualization is a word cloud where each word’s size is in proportion to the frequency with which the word appears in the training corpus.
</p>
<div id="org3403aad" class="figure">
<p><img src="images/wordcloud.png" alt="wordcloud.png" />
</p>
</div>
<p>
And the third visualization is a sorted table of rhyme suggestions. The rhymes are sorted first by quality and then by popularity.
</p>
<div id="orgea5f528" class="figure">
<p><img src="images/rhyme-table.png" alt="rhyme-table.png" />
</p>
</div>
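<p>
As a rough illustration of how the scatterplot’s axes map onto the data, here is a chart specification written as a Clojure map. It assumes the chart is drawn with Vega-Lite (the rendering code is not shown in this document), and the data rows and the name <code>scatterplot-spec</code> are hypothetical.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(def scatterplot-spec
  ;; A Vega-Lite style spec; it would be serialized to JSON before rendering.
  {:data {:values [{:rhyme "apologies"   :quality 7 :frequency 40}
                   {:rhyme "pathologies" :quality 7 :frequency 3}
                   {:rhyme "policies"    :quality 5 :frequency 12}]}
   :mark "point"
   :encoding {;; Corpus frequency on the X axis, rhyme quality on the Y axis.
              :x {:field "frequency" :type "quantitative"}
              :y {:field "quality" :type "quantitative"}
              ;; Show the rhyming word when hovering over a point.
              :tooltip {:field "rhyme" :type "nominal"}}})

(get-in scatterplot-spec [:encoding :x :field])
;; => "frequency"
</pre>
</div>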
</div>
</div>
<div id="outline-container-org60086e9" class="outline-3">
<h3 id="org60086e9"><span class="section-number-3">6.6</span> Accuracy</h3>
<div class="outline-text-3" id="text-6-6">
<p>
It’s difficult to objectively test the model’s accuracy since the goal of “brainstorming new lyrics” is so subjective. A valid test of that goal would require many human subjects to evaluate their performance while using the tool compared to their performance without it.
</p>
<p>
If we allow ourselves the assumption that the closer a generated phrase is to a valid English sentence, the better it is at helping a songwriter brainstorm, then one objective assessment measure is the percentage of generated lyrics that are valid English sentences.
</p>
</div>
<div id="outline-container-orgd2e3d30" class="outline-4">
<h4 id="orgd2e3d30"><span class="section-number-4">6.6.1</span> Percentage Of Generated Lines That Are Valid English Sentences</h4>
<div class="outline-text-4" id="text-6-6-1">
<p>
We can use <a href="https://opennlp.apache.org/">Apache OpenNLP</a> to parse sentences into a grammar structure conforming to the parts of speech specified by the <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">University of Pennsylvania’s Treebank Project</a>.
</p>
<p>
If OpenNLP parses a line of text into a “simple declarative clause” from the Treebank Tag Set, as described <a href="https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html">here</a>, then we consider it a valid sentence.
</p>
<p>
Using this technique on a (small) sample of 100 generated sentences reveals that 47 of them are valid.
</p>
<p>
This is just one of many possible assessment techniques we could use. It’s simple but could be expanded to include valid phrases other than Treebank’s clauses. For the purpose of having a measurement by which to compare changes to the algorithm, this suffices.
</p>
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span>require '<span style="color: #c678dd;">[</span>com.darklimericks.linguistics.core <span style="color: #a9a1e1;">:as</span> linguistics<span style="color: #c678dd;">]</span>
'<span style="color: #c678dd;">[</span>com.owoga.prhyme.nlp.core <span style="color: #a9a1e1;">:as</span> nlp<span style="color: #c678dd;">]</span><span style="color: #51afef;">)</span>
<span style="color: #5B6268;">;; </span><span style="color: #5B6268;">rhymestorm-lyric-suggestions returns 20 suggestions. Each suggestion is a vector of</span>
<span style="color: #5B6268;">;; </span><span style="color: #5B6268;">the rhyming word/quality/frequency and the sentence/parse. The function below</span>
<span style="color: #5B6268;">;; </span><span style="color: #5B6268;">returns just the sentences. The sentences can be further filtered using</span>
<span style="color: #5B6268;">;; </span><span style="color: #5B6268;">OpenNLP to only those that are grammatically valid English sentences.</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">defn</span> <span style="color: #c678dd;">sample-of-20</span>
<span style="color: #c678dd;">[]</span>
<span style="color: #c678dd;">(</span><span style="color: #51afef;">->></span> <span style="color: #98be65;">"technology"</span>
<span style="color: #ECBE7B;">linguistics</span>/rhymestorm-lyric-suggestions
<span style="color: #98be65;">(</span>map <span style="color: #a9a1e1;">(</span>comp first second<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #51afef;">defn</span> <span style="color: #c678dd;">average-valid-of-100-suggestions</span> <span style="color: #c678dd;">[]</span>
<span style="color: #c678dd;">(</span><span style="color: #51afef;">let</span> <span style="color: #98be65;">[</span>generated-suggestions <span style="color: #a9a1e1;">(</span>apply concat <span style="color: #51afef;">(</span>repeatedly <span style="color: #da8548; font-weight: bold;">5</span> sample-of-20<span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span>
valid-english <span style="color: #a9a1e1;">(</span>filter <span style="color: #ECBE7B;">nlp</span>/valid-sentence? generated-suggestions<span style="color: #a9a1e1;">)</span><span style="color: #98be65;">]</span>
<span style="color: #98be65;">(</span>/ <span style="color: #a9a1e1;">(</span>count valid-english<span style="color: #a9a1e1;">)</span> <span style="color: #da8548; font-weight: bold;">100</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>println <span style="color: #c678dd;">(</span>average-valid-of-100-suggestions<span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #5B6268;">;; </span><span style="color: #5B6268;">47/100</span>
</pre>
</div>
<pre class="example">
47/100
</pre>
<p>
Where <code>nlp/valid-sentence?</code> is defined as follows.
</p>
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span><span style="color: #51afef;">defn</span> <span style="color: #c678dd;">valid-sentence?</span>
<span style="color: #83898d;">"Tokenizes and parses the phrase using OpenNLP models from</span>
<span style="color: #83898d;"> http://opennlp.sourceforge.net/models-1.5/</span>
<span style="color: #83898d;"> If the parse tree has a clause as the top-level tag, then</span>
<span style="color: #83898d;"> we consider it a valid English sentence."</span>
<span style="color: #c678dd;">[</span>phrase<span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span><span style="color: #51afef;">->></span> phrase
tokenize
<span style="color: #98be65;">(</span><span style="color: #ECBE7B;">string</span>/join <span style="color: #98be65;">" "</span><span style="color: #98be65;">)</span>
vector
parse
first
<span style="color: #ECBE7B;">tb</span>/make-tree
<span style="color: #a9a1e1;">:chunk</span>
first
<span style="color: #a9a1e1;">:tag</span>
<span style="color: #ECBE7B;">tb2</span>/clauses
boolean<span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
</pre>
</div>
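<p>
Because the sample is small, the 47/100 figure carries some sampling uncertainty. As a rough illustration (not part of the project’s code), the standard error of a sample proportion can be computed as below. The function name is made up for this sketch.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(defn proportion-standard-error
  "Standard error of a sample proportion: sqrt(p * (1 - p) / n)."
  [valid total]
  (let [p (/ (double valid) total)]
    (Math/sqrt (/ (* p (- 1 p)) total))))

;; For 47 valid sentences out of 100 this is roughly 0.05, so repeated
;; samples of 100 would be expected to vary by a few percentage points.
(proportion-standard-error 47 100)
</pre>
</div>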
</div>
</div>
</div>
<div id="outline-container-org8d29ef2" class="outline-3">
<h3 id="org8d29ef2"><span class="section-number-3">6.7</span> Testing</h3>
<div class="outline-text-3" id="text-6-7">
<p>
My language of choice for this project encourages a programming technique known as REPL-driven development. REPL stands for Read-Eval-Print Loop. This is a way to write and test code in real time without a separate compilation step. Individual code chunks can be evaluated inside an editor, resulting in rapid feedback.
</p>
<p>
Therefore, many “tests” exist as comments immediately following the code under test. For example:
</p>
<div class="org-src-container">
<pre class="src src-clojure"><span style="color: #51afef;">(</span><span style="color: #51afef;">defn</span> <span style="color: #c678dd;">perfect-rhyme</span>
<span style="color: #c678dd;">[</span>phones<span style="color: #c678dd;">]</span>
<span style="color: #c678dd;">(</span><span style="color: #51afef;">->></span> phones
reverse
<span style="color: #98be65;">(</span><span style="color: #ECBE7B;">util</span>/take-through <span style="color: #ECBE7B;">stress-manip</span>/primary-stress?<span style="color: #98be65;">)</span>
first
reverse
<span style="color: #98be65;">(</span>#<span style="color: #a9a1e1;">(</span>cons <span style="color: #51afef;">(</span>first <span style="color: #dcaeea;">%</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span><span style="color: #ECBE7B;">stress-manip</span>/remove-any-stress-signifiers <span style="color: #c678dd;">(</span>rest <span style="color: #dcaeea;">%</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span><span style="color: #51afef;">)</span>
<span style="color: #51afef;">(</span>comment
<span style="color: #c678dd;">(</span>perfect-rhyme <span style="color: #98be65;">(</span>first <span style="color: #a9a1e1;">(</span><span style="color: #ECBE7B;">phonetics</span>/get-phones <span style="color: #98be65;">"technology"</span><span style="color: #a9a1e1;">)</span><span style="color: #98be65;">)</span><span style="color: #c678dd;">)</span>
<span style="color: #5B6268;">;; </span><span style="color: #5B6268;">=> ("AA1" "L" "AH" "JH" "IY")</span>
<span style="color: #51afef;">)</span>
</pre>
</div>
<p>
The code inside that comment can be evaluated with a simple keystroke while
inside an editor. It serves as both a test and a form of documentation, as you
can see the input and the expected output.
</p>
<p>
Supporting libraries have a more robust test suite, since their purpose is to be used more widely across other projects with contributions accepted from anyone.
</p>
<p>
Here is an example of the test suite for the code related to syllabification: <a href="https://github.com/eihli/phonetics/blob/main/test/com/owoga/phonetics/syllabify_test.clj">https://github.com/eihli/phonetics/blob/main/test/com/owoga/phonetics/syllabify_test.clj</a>.
</p>
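<p>
For comparison, a REPL check like the one above could also be written as an automated <code>clojure.test</code> test. The sketch below is self-contained and therefore uses a simplified stand-in, <code>rhyming-tail</code>, rather than the project’s <code>perfect-rhyme</code>. The helper names are hypothetical, and the phones for “technology” are written out by hand in CMU-dictionary style instead of being looked up.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(require '[clojure.test :refer [deftest is]]
         '[clojure.string :as string])

(defn primary-stress? [phone]
  (string/ends-with? phone "1"))

(defn remove-stress [phone]
  (string/replace phone #"\d" ""))

(defn rhyming-tail
  "Phones from the last primary-stressed vowel to the end of the word,
  with stress markers removed from everything after that vowel."
  [phones]
  (let [idx (->> phones
                 (keep-indexed (fn [i p] (when (primary-stress? p) i)))
                 last)
        tail (drop (or idx 0) phones)]
    (cons (first tail) (map remove-stress (rest tail)))))

(deftest rhyming-tail-test
  (is (= ["AA1" "L" "AH" "JH" "IY"]
         (rhyming-tail ["T" "EH0" "K" "N" "AA1" "L" "AH0" "JH" "IY0"]))))
</pre>
</div>
<p>
Running <code>(clojure.test/run-tests)</code> in the namespace executes the test, so the expectation is checked automatically instead of by eye.
</p>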
</div>
</div>
<div id="outline-container-orgbcd20cb" class="outline-3">
<h3 id="orgbcd20cb"><span class="section-number-3">6.8</span> Source Code</h3>
<div class="outline-text-3" id="text-6-8">
<p>
I wrote three Clojure libraries and one Clojure application that combine to make RhymeStorm™.
</p>
</div>
<div id="outline-container-orgb5bde0d" class="outline-4">
<h4 id="orgb5bde0d"><span class="section-number-4">6.8.1</span> Tightly Packed Trie</h4>
<div class="outline-text-4" id="text-6-8-1">
<p>
This is the data structure that backs the Hidden Markov Model.
</p>
<p>
<a href="https://github.com/eihli/clj-tightly-packed-trie">https://github.com/eihli/clj-tightly-packed-trie</a>
</p>
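<p>
To illustrate the idea of a trie of n-gram counts, here is a minimal sketch that uses plain nested maps. It is not the library’s implementation, which (as its name suggests) uses a tightly packed representation. The name <code>add-ngram</code> and the <code>:count</code> key are invented for this example. The words are stored in reverse order to mirror the backwards lookups shown in section 5.9.
</p>
<div class="org-src-container">
<pre class="src src-clojure">(defn add-ngram
  "Walks (or creates) the path for ngram and increments its count."
  [trie ngram]
  (update-in trie (conj (vec ngram) :count) (fnil inc 0)))

(def trie
  ;; Each n-gram is reversed: the last word of the phrase comes first.
  (reduce add-ngram {} [["hole" "a"] ["hole" "a"] ["hole" "this"]]))

;; How many times was "a hole" seen?
(get-in trie ["hole" "a" :count])
;; => 2
</pre>
</div>
<p>
Because every n-gram that ends in the same word shares a prefix in the reversed trie, all candidate preceding words for “hole” live under a single node.
</p>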
</div>
</div>
<div id="outline-container-org68009bd" class="outline-4">
<h4 id="org68009bd"><span class="section-number-4">6.8.2</span> Phonetics</h4>
<div class="outline-text-4" id="text-6-8-2">
<p>
This is the helper library that syllabifies and manipulates words, phones, and syllables.
</p>
<p>
<a href="https://github.com/eihli/phonetics">https://github.com/eihli/phonetics</a>
</p>
</div>
</div>
<div id="outline-container-org615c902" class="outline-4">
<h4 id="org615c902"><span class="section-number-4">6.8.3</span> Rhyming</h4>
<div class="outline-text-4" id="text-6-8-3">
<p>
This library contains code for analyzing rhymes and sentence structure and for manipulating corpora.
</p>
<p>
<a href="https://github.com/eihli/prhyme">https://github.com/eihli/prhyme</a>
</p>
</div>
</div>
<div id="outline-container-org8ffc320" class="outline-4">
<h4 id="org8ffc320"><span class="section-number-4">6.8.4</span> Web Server And User Interface</h4>
<div class="outline-text-4" id="text-6-8-4">
<p>
This application is not publicly available. I’ll upload it with the project submission.
</p>
</div>
</div>
</div>
<div id="outline-container-org9010313" class="outline-3">
<h3 id="org9010313"><span class="section-number-3">6.9</span> Quick Start</h3>
<div class="outline-text-3" id="text-6-9">
</div>
<div id="outline-container-org00f3e76" class="outline-4">
<h4 id="org00f3e76"><span class="section-number-4">6.9.1</span> How To Initialize Development Environment</h4>
<div class="outline-text-4" id="text-6-9-1">
</div>
<ol class="org-ol">
<li><a id="org3ad8643"></a>Required Software<br />
<div class="outline-text-5" id="text-6-9-1-1">
<ul class="org-ul">
<li><a href="https://www.docker.com/">Docker</a></li>
<li><a href="https://clojure.org/releases/downloads">Clojure Version 1.10+</a></li>
<li><a href="https://github.com/clojure-emacs/cider">Emacs and CIDER</a></li>
</ul>
</div>
</li>
<li><a id="org5eaa8dc"></a>Steps<br />
<div class="outline-text-5" id="text-6-9-1-2">
<ol class="org-ol">
<li>Run <code>./db/run.sh &amp;&amp; ./kv/run.sh</code> to start the Docker containers for the database and key-value store.
<ol class="org-ol">
<li>The <code>run.sh</code> scripts only need to run once. They initialize the development data containers. Subsequent development can continue with <code>docker start db &amp;&amp; docker start kv</code>.</li>
</ol></li>
<li>Start a Clojure REPL in Emacs, evaluate the <code>dev/user.clj</code> namespace, and run <code>(init)</code>.</li>
<li>Visit <code>http://localhost:8000/rhymestorm</code></li>
</ol>
</div>
</li>
</ol>
</div>
<div id="outline-container-org7cd2611" class="outline-4">
<h4 id="org7cd2611"><span class="section-number-4">6.9.2</span> How To Run Software Locally</h4>
<div class="outline-text-4" id="text-6-9-2">
</div>
<ol class="org-ol">
<li><a id="orga03ff0d"></a>Requirements<br />
<div class="outline-text-5" id="text-6-9-2-1">
<ul class="org-ul">
<li><a href="https://www.java.com/download/ie_manual.jsp">Java</a></li>
<li><a href="https://www.docker.com/">Docker</a></li>
</ul>
</div>
</li>
<li><a id="org37b7c9e"></a>Steps<br />
<div class="outline-text-5" id="text-6-9-2-2">
<ol class="org-ol">
<li>Run <code>./db/run.sh &amp;&amp; ./kv/run.sh</code> to start the Docker containers for the database and key-value store.
<ol class="org-ol">
<li>The <code>run.sh</code> scripts only need to run once. They initialize the development data containers. Subsequent runs can reuse them with <code>docker start db &amp;&amp; docker start kv</code>.</li>
</ol></li>
<li>Build the application’s <code>jar</code> by running <code>make</code> from the root directory (see <a href="../Makefile">Makefile</a>).</li>
<li>From the root directory of this Git repository, run <code>java -jar darklimericks.jar</code>.</li>
<li>Visit <a href="http://localhost:8000/rhymestorm">http://localhost:8000/rhymestorm</a></li>
</ol>
</div>
</li>
</ol>
</div>
</div>
</div>
<div id="outline-container-orgffa2fb6" class="outline-2">
<h2 id="orgffa2fb6"><span class="section-number-2">7</span> Citations</h2>
<div class="outline-text-2" id="text-7">
<p>
Wikimedia Foundation. (2021, July 16). Markov Model. Wikipedia.
<a href="https://en.wikipedia.org/wiki/Markov_model">https://en.wikipedia.org/wiki/Markov_model</a>.
</p>
<p>
Wikimedia Foundation. (2021, June 25). Trie. Wikipedia.
<a href="https://en.wikipedia.org/wiki/Trie">https://en.wikipedia.org/wiki/Trie</a>.
</p>
<p>
Wikimedia Foundation. (2021, June 15). HiQ Labs v. LinkedIn. Wikipedia.
<a href="https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn">https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn</a>.
</p>
<p>
Gershgorn, D. (2021, July 7). GitHub’s automatic coding tool rests on untested
legal ground. The Verge.
<a href="https://www.theverge.com/2021/7/7/22561180/github-copilot-legal-copyright-fair-use-public-code">https://www.theverge.com/2021/7/7/22561180/github-copilot-legal-copyright-fair-use-public-code</a>.
</p>
<p>
Germann, U., Joanis, E., &amp; Larkin, S. (2009). Tightly packed tries: How to
fit large models into memory, and make them load fast, too. Proceedings of the
Workshop on Software Engineering, Testing, and Quality Assurance for Natural
Language Processing (SETQA-NLP 2009), 31–39.
</p>
</div>
</div>
</div>
<div id="postamble" class="status">
<p class="author">Author: Eric Ihli</p>
<p class="date">Created: 2021-07-23 Fri 17:16</p>
</div>
</body>
</html>