From 6813945bd5317735e873ea4a56136c8691ca1dc8 Mon Sep 17 00:00:00 2001 From: Eric Ihli Date: Thu, 16 Dec 2021 07:30:27 -0600 Subject: [PATCH] Update README --- README.org | 17 ++- README.rst | 369 ----------------------------------------------------- 2 files changed, 10 insertions(+), 376 deletions(-) delete mode 100644 README.rst diff --git a/README.org b/README.org index d322a6d..3c40115 100644 --- a/README.org +++ b/README.org @@ -1,11 +1,14 @@ #+TITLE: Prhyme Rhyme Generation -README is moving from RST to Org. +Utilities for rhyming, NLP corpus text cleaning, syllabification, markov modeling, etc... -[[https://github.com/eihli/prhyme]] +This repo is a bit of a scratch-pad. As utilities solidify, I've been extracting them out. -* TODO -- [X] Function to read/write tightly-packed-trie from/to file. -- [X] Pull code out of Prhyme and into its own library. -- [ ] Darklyrics Corpus as tightly-packed-trie -- [ ] Instead of a value as the last element of the list, take a function. +- https://github.com/eihli/clj-tightly-packed-trie +- https://github.com/eihli/phonetics +- https://github.com/eihli/darklimericks + +Some cool things that haven't been extracted yet and remain in an alpha state: + +- Part-of-speech tagging/processing using OpenNLP https://github.com/eihli/prhyme/blob/main/src/com/owoga/prhyme/nlp/core.clj +- Simple Good-Turing frequency estimation https://github.com/eihli/prhyme/blob/main/src/com/owoga/prhyme/generation/simple_good_turing.clj diff --git a/README.rst b/README.rst deleted file mode 100644 index 1561486..0000000 --- a/README.rst +++ /dev/null @@ -1,369 +0,0 @@ -====== - TODO -====== - -- Allow a depth with thesaurus lookups. -- Allow restriction to rhymes with certain number of syllables. -- Word graph with weights to form most likely sentences. -- Use CMU LexTool to find pronunciations for words not in dictionary. 
- http://www.speech.cs.cmu.edu/tools/lextool.html - -============= - Terminology -============= - -Use case: ---------- - -I want to find phrase B that rhymes with phrase A where phrase B has a -specifiable sentiment. - -Something like: -"please turn on your magic beam" -"queeze churn horrific bloodstream" - -I want these settings to be optional: -- phrase B to conform to a certain grammatical structure. -- config of which words I prefer rhymes in phrase B -- config for rhyming rimes, onsets, and/or nuclei -- preferred number of syllables - -Some of those settings make more sense with individual words than with phrases. - -Here's a tricky consideration. Let's break those phrases down into syllables. - -"please turn on your magic beam" -"queeze churn horrific bloodstream" - -("P" "L" "IY" "Z") ("T" "ER" "N") ("AO" "N") ("Y" "AO" "R") ("M" "AE" "JH" "IH" "K") ("B" "IY" "M") -("K" "W" "IY" "Z") ("CH" "ER" "N") ("H" "AO" "R" "IH" "F" "IH" "K") ("B" "L" "AH" "D" "S" "T" "R" "IY" "M") - -We are imagining turning -("AO" "N") ("Y" "AO" "R") ("M" "AE" "JH") -into -("H" "AO" "R" "IH" "F" "IH" "K") - -There's this difficulty in deciding how to group the syllables into words. - -("P" "L" "IY" "Z" "T" "ER" "N" "AO" "N" "Y" "AO" "R" "M" "AE" "JH" "IH" "K" "B" "IY" "M") -("K" "W" "IY" "Z" "CH" "ER" "N" "H" "AO" "R" "IH" "F" "IH" "K" "B" "L" "AH" "D" "S" "T" "R" "IY" "M") - -If you take just the raw syllables and ignore the words, you get -("P" "L" "IY" "Z") ("T" "ER" "N") ("AO" "N") ("Y" "AO" "R") ("M" "AE") ("JH" "IH" "K") ("B" "IY" "M") -and -("K" "W" "IY" "Z") ("CH" "ER" "N") ("H" "AO" "R") ("IH" "F") ("IH" "K") ("B" "L" "AH" "D") ("S" "T" "R" "IY" "M") - -If you leave the syllables grouped into words, then MAGIC rhymes with -TRAGIC, PELAGIC, etc... If you ignore the groupings of syllables into -words, then you're stuck trying to rhyme with the single syllable -"word" ("M" "AE"), which doesn't have any rhymes. 
- -Implementation --------------- - -2021-06-09 -++++++++++ - -Most generation tasks are going to require some big data structures, like a Trie of n-grams. - -A ``context`` is an atom that gets updated with those data structures. - -Loading some of these data structures can take a long time, so only load what you need. - -An example of the different data structures you might load: - -Alliterations - From the database of n-grams, convert each n-gram to syllables then create a trie of the alliterations. -Perfect rhymes - Again, from the database of n-grams, convert n-gram to syllables and create trie of reverse of syllables. -Imperfect rhymes - Perform some manipulation of the syllables so that you can be more flexible with your rhymes. - -One key that is probably always required is the ``database``. This maps words to their IDs and IDs to their words. The integer -IDs are necessary for tightly packed tries. - -2020-10-20 -++++++++++ - -Start with a phrase. - -"Please turn on your magic beam". - -Convert it to syllables. - -``...(Y AO R) (M AE) (J IH K) (B IY M)`` - -Find all words that rhyme in any way whatsoever. - -Weight each possible rhyming word. - -Choose by weight. - -Remove syllables from target phrase equal to syllables of chosen word. - -Find all words that either rhyme or are markov selections of previous word. - -Weight each possible word. - -Choose by weight. - - -Prev -++++ - -What would it look like to solve the problem for a single grouping of syllables into words? - -In the case of -("P" "L" "IY" "Z") ("T" "ER" "N") ("AO" "N") ("Y" "AO" "R") ("M" "AE" "JH" "IH" "K") ("B" "IY" "M") - -We wouldn't get -("K" "W" "IY" "Z") ("CH" "ER" "N") ("H" "AO" "R" "IH" "F" "EU" "K") ("B" "L" "AH" "D" "S" "T" "R" "IY" "M") - -We might get the following, since the syllable groupings align. -"queeze churn don more tragic fiends" - -We don't want to always restrict it to matching syllable groups, -especially not for single words. 
If we give it the word "nation" we -almost surely want words like "approbation" and "creation"; speaking -from the use case of trying to find a rhyme to the last word of a -phrase. - -Back to the entire phrase - the idea is we solve the problem for a -single grouping of syllables and then we use the "partitions" function -to get every possible combination of grouping of syllables and apply -the solution to each of those. - -Performance -+++++++++++ - -Let's say we want to see all possible rhyming phrases of -("P" "L" "IY" "Z") ("T" "ER" "N") ("AO" "N") ("Y" "AO" "R") ("M" "AE" "JH" "IH" "K") ("B" "IY" "M") - -Let's assume each syllable grouping has an average of 10 rhyming words. - -That's 10^6 possible phrases. We need a way to limit our search for -both computational reasons and for UX reasons. - -There's going to be a lot of redundancy there. - -We don't need each of: -"please churn don poor tragic fiend" -"squeeze churn don poor tragic fiend" -"freeze churn don poor tragic fiend" -"sneeze churn don poor tragic fiend" -\... - -So maybe we cycle through each word list. - -please churn don poor tragic fiend -squeeze burn con door plagic mean -freeze turn fawn store tragic stream -sneeze yearn don more plagic steam -peas churn con four tragic deem -queeze burn fawn yore plagic reem - -A separate process can search through these phrases and rank them by grammatical structure, sentiment, etc... - -We could also pre-filter the possible words by sentiment. - -Or, we could assign grammatical restrictions to each word and -pre-filter the words by grammar. Then we'd get something like -(adjective noun verb adjective noun) and it would really reduce the -search space. - -But would we do that by hand? That might work for an individual -grouping of syllables, but how would we restrict to that for each -possible combination of grouping of syllables? - -One possible solution would be to have a list of all valid grammar structures for a certain number of words. 
- -"please churn don poor tragic fiends" -(adj noun adv verb adj noun) -(adj adj noun verb adj noun) -(noun verb noun conj noun verb) -\... - -Output -++++++ - -:: - - (["TEASE" "STERN" "CON" "SCORE" "MANIC" "STEAM"] - ["SQUEEZE" "BURN" "WAN" "OR" "BEATNIK" "TEAM"] - ["WHEEZE" "CHURN" "ON" "WHORE" "FABRIC" "SCREAM"] - ["SNEEZE" "TURN" "CON" "GORE" "FRANTIC" "SCHEME"] - ["FREEZE" "EARN" "WAN" "CORE" "EPIC" "STREAM"] - ["EASE" "STERN" "ON" "FLOOR" "CRYPTIC" "SEAM"] - ["SEIZE" "BURN" "CON" "BORE" "TOPIC" "THEME"] - ["TEASE" "CHURN" "WAN" "SNORE" "TOXIC" "DREAM"] - ["SQUEEZE" "TURN" "ON" "STORE" "TONIC" "STEAM"] - ["WHEEZE" "EARN" "CON" "SORE" "MYSTIC" "TEAM"] - ["SNEEZE" "STERN" "WAN" "ROAR" "STATIC" "SCREAM"] - ["FREEZE" "BURN" "ON" "FOR" "CLASSIC" "SCHEME"] - ["EASE" "CHURN" "CON" "CORPS" "SEPTIC" "STREAM"] - ["SEIZE" "TURN" "WAN" "BOAR" "CRITIC" "SEAM"] - ["TEASE" "EARN" "ON" "POUR" "CHRONIC" "THEME"] - ["SQUEEZE" "STERN" "CON" "SCORE" "LIPSTICK" "DREAM"] - ["WHEEZE" "BURN" "WAN" "OR" "PANIC" "STEAM"] - ["SNEEZE" "CHURN" "ON" "WHORE" "SEISMIC" "TEAM"] - ["FREEZE" "TURN" "CON" "GORE" "FROLIC" "SCREAM"] - ["EASE" "EARN" "WAN" "CORE" "GOTHIC" "SCHEME"] - ["SEIZE" "STERN" "ON" "FLOOR" "TRAGIC" "STREAM"] - ["TEASE" "BURN" "CON" "BORE" "CATHOLIC" "SEAM"] - ["SQUEEZE" "CHURN" "WAN" "SNORE" "CYNIC" "THEME"] - ["WHEEZE" "TURN" "ON" "STORE" "COMIC" "DREAM"] - ["SNEEZE" "EARN" "CON" "SORE" "PSYCHIC" "STEAM"] - ["FREEZE" "STERN" "WAN" "ROAR" "RELIC" "TEAM"] - ["EASE" "BURN" "ON" "FOR" "COSMIC" "SCREAM"] - ["SEIZE" "CHURN" "CON" "CORPS" "DRASTIC" "SCHEME"]) - - -:: - - TEASE STERN CON SCORE MANIC STEAM - BREEZE BURN WAN OR FABRIC GLEAM - SQUEEZE CHURN ON WHORE FRANTIC TEAM - WHEEZE TURN GORE EPIC SCREAM - SNEEZE CORE CRYPTIC SCHEME - FREEZE FLOOR PUBLIC STREAM - EASE BORE TOPIC BEAM - SEIZE SNORE TOXIC SEAM - STORE TONIC THEME - SORE MYSTIC DREAM - ROAR STATIC - WAR SIDEKICK - FOR SEPTIC - CORPS BROOMSTICK - DRAWER CHRONIC - POUR LIPSTICK - PANIC - SEISMIC - LOGIC - FROLIC - 
 TRAGIC - ATTIC - CYNIC - RELIC - COSMIC - DRASTIC - -Features --------- - -Given an output like the above, a user might see a word or phrase they really -like. - -"FRANTIC SCREAM" for example. - -The rest of the sentence doesn't need to rhyme or necessarily contain words that -are synonyms. - -Can we provide suggestions for the rest of the sentence? We know the number of -syllables we want and the sentiment we want. - -Could we use something like a Markov chain to work backwards? Given some corpus, -what words most likely precede "frantic scream" that also align with our -syllabic requirements? - -============== - Articulation -============== - -Terminology and types of rhymes -------------------------------- - -1. HAT - CAT -2. HAT - HALF -3. HAT - PACK - -The first of those examples clearly rhymes by anyone's definition of "rhyme". The first sound of the syllable, known as the "onset", differs. The vowel sound, known as the "nucleus", and the final consonant sound, known as the "coda", are the same. - -The second example might not technically rhyme, but it can still be useful. The "onset", the "H" sound, and the "nucleus", the "AE" sound, are the same in both HAT and HALF. But they differ in their "coda". - -The third example is even less of a proper rhyme, but again it can be useful. The only matching sound is the "nucleus", the "AE" sound. - -Words with multiple syllables give us even more options. - -What is more important: to find the fewest words that rhyme any number of syllables (STUPIFIED - DIGNIFIED), or to find the fewest words that rhyme the greatest number of onsets/nuclei/codas (STUPIFIED - SCOOBY DIED)? - -1. STUPIFIED - SCOOBY DIED -1. STUPIFIED - GROOVY FINE -1. STUPIFIED - DIGNIFIED -1. STUPIFIED - PRIDE - -Program Output --------------- - -Perfect rhymes -DOG -> [ [ [FOG COG HOG ...] ] ] - -Onset rhymes -DOG -> [ [ [DOLL DAWN ...] ] ] - -Nuclei rhymes -DOG -> [ [ [BALL CAUGHT FOUGHT ...] 
] ] - -For multiple syllables, show rhymes for each possible partitioning of syllables. -Order by rhymes that use the fewest words. -BEEHIVE -> [ [ [REVIVE DEPRIVE] ] - [ [SEE WE BE ...] [THRIVE DIVE ...] ] ] - -For multi-syllable words, remove the restriction to rhyme on every syllable. -Order by words matching the greatest number of syllables. -BEEHIVE -> [ [ [REVIVE DEPRIVE ALIVE] ] - [ [SEE WE BE ... ] [THRIVE DIVE ...] ] ] - -Syllables ---------- -Typical model - -In the typical theory of syllable structure, the general structure of a syllable (σ) consists of three segments. These segments are grouped into two components: - -Onset (ω) - a consonant or consonant cluster, obligatory in some languages, optional or even restricted in others -Rime (ρ) - right branch, contrasts with onset, splits into nucleus and coda - - Nucleus (ν) - a vowel or syllabic consonant, obligatory in most languages - Coda (κ) - consonant, optional in some languages, highly restricted or prohibited in others - -Rules -~~~~~ - -Also, for "ellipsis", /ps/ is not a legal internal coda in English. The /s/ can only occur as an appendix, e.g. the plural -s at the end of a word. So it should be e.lip.sis. - -http://www.glottopedia.org/index.php/Sonority_hierarchy - -http://www.glottopedia.org/index.php/Maximal_Onset_Principle - -Nasal ------ - -Air flows through the nose. - -Examples: "n" in "nose", "m" in "may", "ŋ" in "funk". - -"ŋ" is known as the letter "eng" and the technical name of the consonant is the "voiced velar nasal". - -"voiced" in the above sentence refers to whether or not your vocal cords are active. Your vocal cords don't vibrate with voiceless consonants, like "sh" "th" "p" "f". In contrast, notice the vibration in phonemes like "m" "r" "z". 
- - -========= - Example -========= - -Mister Sandman, bring me a dream -Make him the cutest that I've ever seen -Give him two lips like roses in clover -Then tell him that his lonesome nights are over - -Mister Sandman, bring me a dream -Blood guts and gore, a nightmare machine - - -\... - -Please turn on your magic beam -Mister Sandman bring me a dream - -Fire burn atrocious bloodstream -Mister Sandman, bring me a dream