Update README

4 years ago · 6813945bd5
parent 505c0d81a2
commit 6813945bd5
2 changed files with 10 additions and 376 deletions
--- a/README.org
+++ b/README.org
@ -1,11 +1,14 @@
 #+TITLE: Prhyme Rhyme Generation

-README is moving from RST to Org.
+Utilities for rhyming, NLP corpus text cleaning, syllabification, markov modeling, etc...

-[[https://github.com/eihli/prhyme]]
+This repo is a bit of a scratch-pad. As utilities solidify, I've been extracting them out.

-* TODO
- [X] Function to read/write tightly-packed-trie from/to file.
- [X] Pull code out of Prhyme and into its own library.
- [ ] Darklyrics Corpus as tightly-packed-trie
- [ ] Instead of a value as the last element of the list, take a function.
+- https://github.com/eihli/clj-tightly-packed-trie
+- https://github.com/eihli/phonetics
+- https://github.com/eihli/darklimericks
+
+Some cool things that haven't been extracted yet and remain in an alpha state:
+
+- Part-of-speech tagging/processing using OpenNLP https://github.com/eihli/prhyme/blob/main/src/com/owoga/prhyme/nlp/core.clj
+- Simple Good-Turing frequency estimation https://github.com/eihli/prhyme/blob/main/src/com/owoga/prhyme/generation/simple_good_turing.clj
--- a/README.rst
+++ b/README.rst
@ -1,369 +0,0 @@
-======
- TODO
-======
-
- Allow a depth with thesaurus lookups.
- Allow restriction to rhymes with certain number of syllables.
- Word graph with weights to form most likely sentences.
- Use CMU LexTool to find pronunciations for words not in dictionary.
-  http://www.speech.cs.cmu.edu/tools/lextool.html
-
-=============
- Terminology
-=============
-
-Use case:
---------
-
-I want to find phrase B that rhymes with phrase A where phrase B has a
-specifiable sentiment.
-
-Something like:
-"please turn on your magic beam"
-"queeze churn horrific bloodstream"
-
-I want these settings to be optional:
- phrase B to conform to certain grammatical structure.
- config of which words I prefer rhymes in phrase be
- config for rhyming rimes, onsets, and/or nuclei
- preferred number of syllables
-
-Some of those settings make more sense with individual words than with phrases.
-
-Here's a tricky consideration. Let's break those phrases down into syllables.
-
-"please turn on your magic beam"
-"queeze churn horrific bloodstream"
-
-("P" "L" "IY" "Z") ("T" "ER" "N") ("AO" "N") ("Y" "AO" "R") ("M" "AE" "JH" "IH" "K") ("B" "IY" "M")
-("K" "W" "IY" "Z") ("CH" "ER" "N") ("H" "AO" "R" "IH" "F" "EU" "K") ("B" "L" "AH" "D" "S" "T" "R" "IY" "M")
-
-We are imagining turning
-("AO" "N") ("Y" "AO" "R") ("M" "AE" "JH")
-into
-("H" "AO" "R" "IH" "F" "IH" "K")
-
-There's this difficulty in deciding how to group the syllables into words.
-
-("P" "L" "IY" "Z" "T" "ER" "N" "AO" "N" "Y" "AO" "R" "M" "AE" "JH" "IH" "K" "B" "IY" "M")
-("K" "W" "IY" "Z" "CH" "ER" "N" "H" "AO" "R" "IH" "F" "IH" "K" "B" "L" "AH" "D" "S" "T" "R" "IY" "M")
-
-If you take just the raw syllables and ignore the words, you get
-("P" "L" "IY" "Z") ("T" "ER" "N")  ("AO" "N")     ("Y" "AO" "R")   ("M" "AE") ("JH" "IH" "K")     ("B" "IY" "M")
-and
-("K" "W" "IY" "Z") ("CH" "ER" "N") ("H" "AO" "R") ("IH" "F")       ("IH" "K") ("B" "L" "AH" "D") ("S" "T" "R" "IY" "M")
-
-If you leave the syllables grouped into words, then MAGIC rhymes with
-TRAGIC, PELAGIC, etc... If you ignore the groupings of syllables into
-words, then you're stuck trying to rhyme with the single syllable
-"word" ("M" "AE"), which doesn't have any rhymes.
-
-Implementation
--------------
-
-2021-06-09
-++++++++++
-
-Most generation tasks are going to require some big data structures, like a Trie of n-grams.
-
-A ``context`` is an atom that gets updated with those data structures.
-
-Loading some of these data structures can take a long time, so only load what you need.
-
-An example of the different data structures you might load:
-
-Alliterations - From the database of n-grams, convert each n-gram to syllables then create a trie of the alliterations.
-Perfect rhymes - Again, from the database of n-grams, convert n-gram to syllables and create trie of reverse of syllables.
-Imperfect rhymes - Perform some manipulation of the syllables so that you can be more flexible with your rhymes.
-
-One key that is probably always required is the ``database``. This maps words to their IDs and IDs to their words. The integer
-IDs are necessary for tightly packed tries.
-
-2020-10-20
-++++++++++
-
-Start with a phrase.
-
-"Please turn on your magic beam".
-
-Convert it to syllables.
-
-``...(Y AO R) (M AE) (J IH K) (B IY M)``
-
-Find all words that rhyme in any way whatsoever.
-
-Weight each possible rhyming word.
-
-Choose by weight.
-
-Remove syllables from target phrase equal to syllables of chosen word.
-
-Find all words that either rhyme or are markov selections of previous word.
-
-Weight each possible word.
-
-Choose by weight.
-
-
-Prev
-++++
-
-What would it look like to solve the problem for a single grouping of syllables into words?
-
-In the case of
-("P" "L" "IY" "Z") ("T" "ER" "N") ("AO" "N") ("Y" "AO" "R") ("M" "AE" "JH" "IH" "K") ("B" "IY" "M")
-
-We wouldn't get
-("K" "W" "IY" "Z") ("CH" "ER" "N") ("H" "AO" "R" "IH" "F" "EU" "K") ("B" "L" "AH" "D" "S" "T" "R" "IY" "M")
-
-We might get the following, since the syllable groupings align.
-"queeze churn don more tragic fiends"
-
-We don't want to always restrict it to matching syllable groups,
-especially not for single words. If we give it the word "nation" we
-almost surely want words like "approbation" and "creation"; speaking
-from the use case of trying to find a rhyme to the last word of a
-phrase.
-
-Back to the entire phrase - the idea is we solve the problem for a
-single grouping of syllables and then we use the "partitions" function
-to get every possible combination of grouping of syllables and apply
-the solution to each of those.
-
-Performance
-+++++++++++
-
-Let's say we want to see all possible rhyming phrases of
-("P" "L" "IY" "Z") ("T" "ER" "N") ("AO" "N") ("Y" "AO" "R") ("M" "AE" "JH" "IH" "K") ("B" "IY" "M")
-
-Let's assume each syllable grouping has an average of 10 rhyming words.
-
-That's 10^6 possible phrases. We need a way to limit our search for
-both computational reasons and for UX reasons.
-
-There's going to be a lot of redundancy there.
-
-We don't need each of:
-"please churn don poor tragic fiend"
-"squeeze churn don poor tragic fiend"
-"freeze churn don poor tragic fiend"
-"sneeze churn don poor tragic fiend"
-\...
-
-So maybe we cycle through each word list.
-
-please  churn don  poor  tragic  fiend
-squeeze burn  con  door  plagic mean
-freeze  turn  fawn store tragic  stream
-sneeze  yearn don  more  plagic steam
-peas    churn con  four  tragic  deem
-queeze  burn  fawn yore  plagic reem
-
-A seperate process can search through these phrases and rank them by grammatical structure, sentiment, etc...
-
-We could also pre-filter the possible words by sentiment.
-
-Or, we could assign grammatical restrictions to each word and
-pre-filter the words by grammar. Then we'd get something like
-(adjective noun verb adjective noun) and it would really reduce the
-search space.
-
-But would we do that by hand? That might work for an individual
-grouping of syllables, but how would we restrict to that for each
-possible combination of grouping of syllables?
-
-One possible solution would be to have a list of all valid grammar structures for a certain number of words.
-
-"please churn don poor tragic fiends"
-(adj noun adv verb adj noun)
-(adj adj noun verb adj noun)
-(noun verb noun conj noun verb)
-\...
-
-Output
-++++++
-
-::
-
-    (["TEASE" "STERN" "CON" "SCORE" "MANIC" "STEAM"]
-    ["SQUEEZE" "BURN" "WAN" "OR" "BEATNIK" "TEAM"]
-    ["WHEEZE" "CHURN" "ON" "WHORE" "FABRIC" "SCREAM"]
-    ["SNEEZE" "TURN" "CON" "GORE" "FRANTIC" "SCHEME"]
-    ["FREEZE" "EARN" "WAN" "CORE" "EPIC" "STREAM"]
-    ["EASE" "STERN" "ON" "FLOOR" "CRYPTIC" "SEAM"]
-    ["SEIZE" "BURN" "CON" "BORE" "TOPIC" "THEME"]
-    ["TEASE" "CHURN" "WAN" "SNORE" "TOXIC" "DREAM"]
-    ["SQUEEZE" "TURN" "ON" "STORE" "TONIC" "STEAM"]
-    ["WHEEZE" "EARN" "CON" "SORE" "MYSTIC" "TEAM"]
-    ["SNEEZE" "STERN" "WAN" "ROAR" "STATIC" "SCREAM"]
-    ["FREEZE" "BURN" "ON" "FOR" "CLASSIC" "SCHEME"]
-    ["EASE" "CHURN" "CON" "CORPS" "SEPTIC" "STREAM"]
-    ["SEIZE" "TURN" "WAN" "BOAR" "CRITIC" "SEAM"]
-    ["TEASE" "EARN" "ON" "POUR" "CHRONIC" "THEME"]
-    ["SQUEEZE" "STERN" "CON" "SCORE" "LIPSTICK" "DREAM"]
-    ["WHEEZE" "BURN" "WAN" "OR" "PANIC" "STEAM"]
-    ["SNEEZE" "CHURN" "ON" "WHORE" "SEISMIC" "TEAM"]
-    ["FREEZE" "TURN" "CON" "GORE" "FROLIC" "SCREAM"]
-    ["EASE" "EARN" "WAN" "CORE" "GOTHIC" "SCHEME"]
-    ["SEIZE" "STERN" "ON" "FLOOR" "TRAGIC" "STREAM"]
-    ["TEASE" "BURN" "CON" "BORE" "CATHOLIC" "SEAM"]
-    ["SQUEEZE" "CHURN" "WAN" "SNORE" "CYNIC" "THEME"]
-    ["WHEEZE" "TURN" "ON" "STORE" "COMIC" "DREAM"]
-    ["SNEEZE" "EARN" "CON" "SORE" "PSYCHIC" "STEAM"]
-    ["FREEZE" "STERN" "WAN" "ROAR" "RELIC" "TEAM"]
-    ["EASE" "BURN" "ON" "FOR" "COSMIC" "SCREAM"]
-    ["SEIZE" "CHURN" "CON" "CORPS" "DRASTIC" "SCHEME"])
-
-
-::
-
-    TEASE   STERN CON SCORE  MANIC      STEAM
-    BREEZE  BURN  WAN OR     FABRIC     GLEAM
-    SQUEEZE CHURN ON  WHORE  FRANTIC    TEAM
-    WHEEZE  TURN      GORE   EPIC       SCREAM
-    SNEEZE            CORE   CRYPTIC    SCHEME
-    FREEZE            FLOOR  PUBLIC     STREAM
-    EASE              BORE   TOPIC      BEAM
-    SEIZE             SNORE  TOXIC      SEAM
-                      STORE  TONIC      THEME
-                      SORE   MYSTIC     DREAM
-                      ROAR   STATIC
-                      WAR    SIDEKICK
-                      FOR    SEPTIC
-                      CORPS  BROOMSTICK
-                      DRAWER CHRONIC
-                      POUR   LIPSTICK
-                             PANIC
-                             SEISMIC
-                             LOGIC
-                             FROLIC
-                             TRAGIC
-                             ATTIC
-                             CYNIC
-                             RELIC
-                             COSMIC
-                             DRASTIC
-
-Features
--------
-
-Given an output like the above, a user might see a word or phrase they really
-like.
-
-"FRANTIC SCREAM" for example.
-
-The rest of the sentence doesn't need to rhyme or necesarily contain words that
-are synonyms.
-
-Can we provide suggestions for the rest of the sentence? We know the number of
-syllables we want and the sentiment we want.
-
-Could we use something like a Markov chain to work backwards? Given some corpus,
-what words most likely preceed "frantic scream" that also align with our
-syllabic requirements?
-
-==============
- Articulation
-==============
-
-Terminology and types of rhymes
-------------------------------
-
-1. HAT - CAT
-2. HAT - HALF
-3. HAT - PACK
-
-The first of those examples clearly rhymes by anyone's definition of "rhyme". The first sound of the syllable known as the "onset", differs. The vowel sound, known as the "nuclei", and the final consonant sound, known as the "coda", are the same.
-
-The second example might not technically rhyme, but it can still be useful. The "onset", the "H" sound, and the "nuclei", the "AE" sound, are the same in both HAT and HALF. But they differ in their "coda".
-
-The third example is even less of a proper rhyme, but again it can be useful. The only matching sound is the "nuclei", the "AE" sound.
-
-Words with multiple syllables give us even more options.
-
-What is more important: to find the fewest words that rhyme any number of syllables (STUPIFIED - DIGNIFIED), or to find the fewest words that rhyme the greatest number of onsets/nuclei/codas (STUPIFIED - SCOOBY DIED)?
-
-1. STUPIFIED - SCOOBY DIED
-1. STUPIFIED - GROOVY FINE
-1. STUPIFIED - DIGNIFIED
-1. STUPIFIED - PRIDE
-
-Program Output
--------------
-
-Perfect rhymes
-DOG     -> [ [ [FOG COG HOG ...] ] ]
-
-Onset rhymes
-DOG     -> [ [ [DOLL DAWN ...] ] ]
-
-Nuclei rhymes
-DOG     -> [ [ [BALL CAUGHT FOUGHT ...] ] ]
-
-For multiple syllables, show rhymes for each possible partitioning of syllables.
-Order by rhymes that use the fewest number of words.
-BEEHIVE -> [ [ [REVIVE DEPRIVE] ]
-             [ [SEE WE BE ...] [THRIVE DIVE ...] ] ]
-
-For multi-syllable words, remove restriction to rhyme on every syllable.
-Order by words matching greatest number of syllables.
-BEEHIVE -> [ [ [REVIVE DEPRIVE ALIVE] ]
-             [ [SEE WE BE ... ] [THRIVE DIVE ...] ] ]
-
-Syllables
---------
-Typical model
-
-In the typical theory[citation needed] of syllable structure, the general structure of a syllable (σ) consists of three segments. These segments are grouped into two components:
-
-Onset (ω)
-    a consonant or consonant cluster, obligatory in some languages, optional or even restricted in others
-Rime (ρ)
-    right branch, contrasts with onset, splits into nucleus and coda
-
-    Nucleus (ν)
-        a vowel or syllabic consonant, obligatory in most languages
-    Coda (κ)
-        consonant, optional in some languages, highly restricted or prohibited in others
-
-Rules
-~~~~~
-
-Also, for "ellipsis", /ps/ is not a legal internal coda in English. The /s/ can only occur as an appendix, e.g. the plural -s at the end of a word. So it should be e.lip.sis
-
-http://www.glottopedia.org/index.php/Sonority_hierarchy
-
-http://www.glottopedia.org/index.php/Maximal_Onset_Principle
-
-Nasal
-----
-
-Air flow goes through nose.
-
-Examples: "n" in "nose", "m" in "may", "ŋ" in "funk".
-
-"ŋ" is known as the letter "eng" and the technical name of the consonant is the "voiced velar nasal"
-
-"voiced" in the above sentence refers to whether or not your vocal chords are active. Your voice chord doesn't vibrate with voiceless consonants, like "sh" "th" "p" "f". In contrast, notice the vibration in phonemes like "m" "r" "z".
-
-
-=========
- Example
-=========
-
-Mister Sandman, bring me a dream
-Make him the cutest that I've ever seen
-Give him two lips like roses in clover
-Then tell him that his lonesome nights are over
-
-Mister Sandman, bring me a dream
-Blood guts and gore, a nightmare machine
-
-
-\...
-
-Please turn on your magic beam
-Mister Sandman bring me a dream
-
-Fire burn attrocious bloodstream
-Mister Sandman, bring me a dream