Introduction

I recently wrote a program 1 to randomly generate words that would be considered pronounceable under standard English phonics rules, such as for generating names and designations in works of fiction. In this study I intend to use statistical analysis of real English words for comparison with the generator and to possibly improve the quality of its output.

Formula

Setup

The original Random Name Generator uses static lists of vowels, consonants, and equivalent blends and digraphs to dynamically generate syllables using a regular-expression interpreter. The program begins by establishing constants for vowels and consonants (as defined in English), along with other symbol groups.

vowels <- unlist(strsplit("aeiou",""))
consonants <- unlist(strsplit("qwrtypsdfghjklzxcvbnm",""))
symbols <-unlist(strsplit("!@#$%^&*()_+-=[]{}\\|'\";:/?.>,<`~",""))
separators <- unlist(strsplit(" -=*/\\",""))
digits <- unlist(strsplit("1234567890",""))
letters = c(vowels,consonants)
vowels = c(vowels,unlist(strsplit("yw",""))) #Y and W are vowels
characters = c(letters,symbols," ",digits); #Any character

print(length(vowels))
## [1] 7
print(length(consonants))
## [1] 21

It then establishes a set of extended vowels, extended consonants, and ending-only consonants. These are digraphs and blends that function as a vowel or as a consonant within a syllable. In addition, the consonant sets include an empty string for syllables that start or end with a vowel. This allows us to define a syllable as “extended_consonant extended_vowel ending_consonant”.

xVowels <- c(vowels)
for(i in 1:6){ #Make controlled vowels for a-y. No "wr" vowel.
            xVowels <- c(xVowels, paste(vowels[i],"r",sep=""))
            #Digraph time!
            if (vowels[i] == "a" || vowels[i] =="o"){ #A's and O's
               for(j in 1:5){ #combine with aeiou
                  xVowels <- c(xVowels,paste(vowels[i],vowels[j], sep=""))
               }
            } else if(vowels[i] == "i" || vowels[i] == "u"){ #I's and U's
               for(j in 2:3){ #combine with ei
                  xVowels <- c(xVowels,paste(vowels[i],vowels[j], sep=""))
               }
            }else if(vowels[i] == "e"){ #E's
               for(j in 1:5){ #combine with aeiu
                  if(j!=4){ #skip o
                     xVowels <- c(xVowels,paste(vowels[i],vowels[j], sep=""))
                  }
               }
            }
         }
print(length(xVowels))
## [1] 31
print(xVowels)
##  [1] "a"  "e"  "i"  "o"  "u"  "y"  "w"  "ar" "aa" "ae" "ai" "ao" "au" "er"
## [15] "ea" "ee" "ei" "eu" "ir" "ie" "ii" "or" "oa" "oe" "oi" "oo" "ou" "ur"
## [29] "ue" "ui" "yr"

Extended Vowels are defined as the 7 vowels and half-vowels “aeiouyw”, along with two-letter vowel digraphs and controlled vowels (eg “ar”) where appropriate.

xConsonants = c(consonants,"ch","wh","th","ph","sh","gh","st","squ")#An extended consonant
eConsonants = c(xConsonants,"pt","ft","ct","lt","rr","ll","zz","ss","ff","sque") #Ending consonant
l = length(xConsonants)
for(i in 1:l){
            #Digraph time! Adding r's
            if(xConsonants[i]!="q" && xConsonants[i]!="r" && xConsonants[i]!="l" && xConsonants[i]!="z" &&xConsonants[i]!="s" && xConsonants[i]!="y" && xConsonants[i]!="h" && xConsonants[i]!="j" && xConsonants[i]!="n" && xConsonants[i]!="m" && xConsonants[i]!="x" && xConsonants[i]!="wh" && xConsonants[i]!="gh" && xConsonants[i]!="pt" && xConsonants[i]!="ft" && xConsonants[i]!="ct" && xConsonants[i]!="lt" && xConsonants[i]!="squ"){
               xConsonants <- c(xConsonants, paste(xConsonants[i],"r",sep=""))
            }
            #Not all q's need u's, but some should have them
            if(xConsonants[i]=="q"){
               xConsonants <- c(xConsonants,"qu")
            }
            
         }
xConsonants <- c(xConsonants,"") #Empty string as consonants aren't always necessary to a syllable.
print(length(xConsonants))
## [1] 46
print(xConsonants)
##  [1] "q"   "w"   "r"   "t"   "y"   "p"   "s"   "d"   "f"   "g"   "h"  
## [12] "j"   "k"   "l"   "z"   "x"   "c"   "v"   "b"   "n"   "m"   "ch" 
## [23] "wh"  "th"  "ph"  "sh"  "gh"  "st"  "squ" "qu"  "wr"  "tr"  "pr" 
## [34] "dr"  "fr"  "gr"  "kr"  "cr"  "vr"  "br"  "chr" "thr" "phr" "shr"
## [45] "str" ""

Extended consonants include the 21 standard consonants along with the 2 and 3 letter blends. Ending consonants contain more ‘xt’ blends, the double letter endings common to short vowels, and no blends ending with ‘r’.

print(length(eConsonants))
## [1] 39
print(eConsonants)
##  [1] "q"    "w"    "r"    "t"    "y"    "p"    "s"    "d"    "f"    "g"   
## [11] "h"    "j"    "k"    "l"    "z"    "x"    "c"    "v"    "b"    "n"   
## [21] "m"    "ch"   "wh"   "th"   "ph"   "sh"   "gh"   "st"   "squ"  "pt"  
## [31] "ft"   "ct"   "lt"   "rr"   "ll"   "zz"   "ss"   "ff"   "sque"

Production

The actual generator consists of a regular expression interpreter that chooses from among these options as specified in the pattern. In this analysis we will consider only the syllable generator:

syllable <- function(open = 3){
    # \s = \w\x or \w\x\e
    syll <- sample(xConsonants,1)
    syll <- paste(syll, sample(xVowels,1), sep="")
    close = sample(1:10,1)
    if(close > open){
        syll <- paste(syll, sample(eConsonants,1), sep="")
    }
    return(syll)
}

Test of Generator

To test this generator’s usability we will take a simple random sample consisting of 100 two-syllable randomly generated “words”.

set.seed(20)
words = vector()
for(i in 1:100){
    word <- syllable()
    word <- paste(word, syllable(), sep="")
    words <- c(words, word)
}
head(words)
## [1] "chroephuit" "taistfrad"  "maeloow"    "mituer"     "trailtfeeb"
## [6] "thriisteb"

(For the full list of generated words, see Appendix 1) The pronounceability of these “words” was tested by running them through the Text to Speech function of Mac OS X. None of the words forced the computer to attempt to spell them out. However, 9 of the words I identified as being harder to pronounce under the standard phonetic rules of English.

What are the mathematical characteristics of the generated words?

lengths = vector()
for(word in words){
    lengths <- c(lengths,length(unlist(strsplit(word,""))))
}
summary(lengths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0     8.0     9.0     8.8    10.0    13.0
qqnorm(lengths, main="Lengths of Generated Words")
qqline(lengths)

Lengths for these ostensibly two-syllable words 2 range from 5 to 13 characters, and they are distributed fairly normally about 9. This makes the average syllable length between 4 and 5 characters.

xlengths = vector()
for(cons in xConsonants){
    xlengths <- c(xlengths,length(unlist(strsplit(cons,""))))
}
summary(xlengths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    2.00    1.63    2.00    3.00
vlengths = vector()
for(vow in xVowels){
    vlengths <- c(vlengths,length(unlist(strsplit(vow,""))))
}
summary(vlengths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   1.774   2.000   2.000
elengths = vector()
for(cons in eConsonants){
    elengths <- c(elengths,length(unlist(strsplit(cons,""))))
}
summary(elengths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.538   2.000   4.000
#All combined
summary(c(xlengths,vlengths,elengths))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.638   2.000   4.000
hist(c(xlengths,vlengths,elengths))

boxplot(xlengths,vlengths,elengths,c(xlengths,vlengths,elengths), names=c("Extended consonants", "Extended Vowels", "Ending consonants","Combined"))

At present the selection of extended consonants and vowels gives an equal chance to every item. We can see a slightly higher chance of any given component being more than two letters, but this is offset somewhat by the 30% chance of the generator producing an open syllable (one that ends with a vowel.)

Comparison with Real English Words

Hypothesis

How then does this compare to real words? Can we use a sample of English words to improve the formula to produce output that looks more like English? Cryptanalysis efforts have produced a wide understanding of character frequency counts in verious languages, so can we apply similar techniques to see how syllables are structured?

To accomplish this I have taken a sample of real English words, broken them down into syllables, and then used statistical inference to find the proportion of blends and single letters used in a “\x\w\e” syllable system. This proportion will then be used to refine the random picker.

Data

Collection

The reference set of English words is pulled by a random sample of pages taken from The American Heritage College Dictionary, 4th Edition. 30 pages were selected from the dictionary section (pages 1-1596) without replacement using R’s sample function. 3 The entry words from the first column were divided into syllables and each syllable entered onto a CSV file along with variables indicating its relative location in the word. Entries consisting of multiple words were treated as separate words. Acronyms and abbreviations were omitted, and occasionally there were words pronounced as one syllable (eg. “blitzed”) that were entered as two to fit the “\x\w\e” syllable system. Otherwise, a syllable containing a silent “e” or “ue” after the second consonant had that noted in a separate column, while possessives and regular plurals with a separate entry or as part of a compound word were reverted to singular form. There was one word “Mahilyow”, that could not be forced into this format, and therefore the alternate form “Mogilev” was used.

pages <- sample(1:1596,30)
print(pages)
##  [1]  730   40  154  744 1583  107  229  337 1547  287  236 1287  224 1230
## [15] 1129  424 1181 1064 1372  673 1366  716 1517  833    1  118 1269  201
## [29] 1441 1134
dictionary <- read.table("dictionary.csv",header=TRUE,sep=",")
#Convert factors to characters
i <- sapply(dictionary, is.factor)
dictionary[i] <- lapply(dictionary[i], as.character)
tail(dictionary)
##      start_cons vowel end_cons silent_e location_front syllables
## 2002         tr     e                                2         4
## 2003          f     a        c                       3         4
## 2004          t     i        v        e              4         4
## 2005          p     u                                1         3
## 2006         tr     e                                2         3
## 2007          f     y                                3         3
Scope

These 2007 syllables from 870 words represent less than 10% of English words. The sample includes words from all the common English roots (including old English/ Old High German, Greek, Latin, French, Spanish, Arabic, Norse, and Sanskrit). Inferences are generalizable to the target population of English words as a whole, but it would be foolish to attempt claims of causality without blocking for various etymological factors.

Exploratory Data Analysis

We are looking for general characteristics of syllables in English words, and also to see if there is a relationship between these syllables and their position in the word.

Syllable composition
dictionary$len_start_cons <- as.factor(nchar(dictionary$start_cons))
dictionary$len_vowel <- as.factor(nchar(dictionary$vowel))
dictionary$len_end_cons <- as.factor(nchar(dictionary$end_cons))
dictionary$len_silent_e <- as.factor(nchar(dictionary$silent_e))
prop.table(table(dictionary$len_start_cons))
## 
##           0           1           2           3 
## 0.221225710 0.638266069 0.130543099 0.009965122
prop.table(table(dictionary$len_vowel))
## 
##           1           2           3 
## 0.890881913 0.106128550 0.002989537
prop.table(table(dictionary$len_end_cons))
## 
##           0           1           2           3 
## 0.339312407 0.544095665 0.108121574 0.008470354
prop.table(table(dictionary$len_silent_e))
## 
##           0           1           2 
## 0.934230194 0.064275037 0.001494768

We can see that in major contrast to the randomly generated syllables, the use of blends is fairly rare. Consonant blends account for about 14% of starting consonants and about 12% of ending consonants. Controlled vowels and vowel blends are used less than 11% of the time, and very few of those were three letters long.

prop.table(table(xlengths))
## xlengths
##          0          1          2          3 
## 0.02173913 0.45652174 0.39130435 0.13043478
prop.table(table(vlengths))
## vlengths
##         1         2 
## 0.2258065 0.7741935
prop.table(table(elengths))
## elengths
##          1          2          3          4 
## 0.53846154 0.41025641 0.02564103 0.02564103

In comparison, nearly half the generator’s “extended” and “ending”" consonants and over three quarters of the “extended vowels” were composed of 2 or more letters. This is a clear opportunity for improvement.

Syllables by location

But is there a relation between syllable structures and their position in the word?

To see this, lets create another column containing the types of syllables,

classify <- function(x){
    #Classify each syllable into open, closed front, closed end, and closed
#     levels = 0:3
#     labels = c( "open","closed.front", "closed.end", "closed"))
    class <- 0
    if (x['len_start_cons']>0) {
        class <- 1 #closed.front
        if (x['len_end_cons'] > 0) {
            class <- 3 #closed
        }
    }else if(x['len_end_cons'] > 0){
        class <- 2 #closed.end
    }
    return(class)

}

dictionary$class <- factor(apply(dictionary[1:nrow(dictionary),], 1, classify), levels = 0:3, labels = c( "open","closed.front", "closed.end", "closed"))
head(dictionary)
##   start_cons vowel end_cons silent_e location_front syllables
## 1                i        n                       1         3
## 2          v     e        r                       2         3
## 3          t     a        s        e              3         3
## 4                i        n                       1         4
## 5          v     e        r                       2         4
## 6          t     e                                3         4
##   len_start_cons len_vowel len_end_cons len_silent_e        class
## 1              0         1            1            0   closed.end
## 2              1         1            1            0       closed
## 3              1         1            1            1       closed
## 4              0         1            1            0   closed.end
## 5              1         1            1            0       closed
## 6              1         1            0            0 closed.front

and take a mosaic plot comparing the syllable type to the location as measured from the front of the word.

mosaicplot(table(dictionary$class,dictionary$location_front),xlab="Syllable type", ylab="Location")

Reading this plot, we see that about half the syllables are closed, but most of these are in the first syllable of the word. This is a situation of interest as this could be influenced by the number of one-syllable words. (r sum(dictionary$syllables==1))

Another situation of interest is the very small incidence of vowel-only syllables in the first syllable of a word.

Repeating the mosaic plot on only polysyllabic words failed to produce a drastic visible change.

polys <- subset(dictionary,syllables>1)
mosaicplot(table(polys$class,polys$location_front),xlab="Syllable type", ylab="Location")

What about when considering the syllable’s position relative to the end of the word?

mosaicplot(table(dictionary$class,dictionary$syllables - dictionary$location_front),xlab="Syllable type", ylab="Location from end")

Another situation of interest is the large jump in closed-ended ending syllables.

mosaicplot(table(polys$class,polys$syllables-polys$location_front),xlab="Syllable type", ylab="Location from end")

Repeating this on polysyllable words likewise had no drastic effect.

While these findings could bear investigation, for the purposes of the random name generator the relationship between the syllable’s location and its type is too weak to be of interest.

Inference

We are looking for two things: the true proportions of the various syllable classes and of digraphs.

Syllable classes

Using the DASI inference function, we can establish a confidence interval for each class individually.

source("http://bit.ly/dasi_inference")
#Create columns recoded into 2 levels for a proportion CI
dictionary$open[dictionary$class=="open"]<-"open"
dictionary$open[dictionary$class!="open"]<-"not"
inference(y=dictionary$open,est='proportion',type="ci",method="theoretical", success="open", eda_plot=FALSE)
## Single proportion -- success: open 
## Summary statistics: p_hat = 0.0912 ;  n = 2007 
## Check conditions: number of successes = 183 ; number of failures = 1824 
## Standard error = 0.0064 
## 95 % Confidence interval = ( 0.0786 , 0.1038 )
dictionary$closed.front[dictionary$class=="closed.front"]<-"closed.front"
dictionary$closed.front[dictionary$class!="closed.front"]<-"not"
inference(y=dictionary$closed.front,est='proportion',type="ci",method="theoretical", success="closed.front", eda_plot=FALSE)
## Single proportion -- success: closed.front 
## Summary statistics: p_hat = 0.2481 ;  n = 2007 
## Check conditions: number of successes = 498 ; number of failures = 1509 
## Standard error = 0.0096 
## 95 % Confidence interval = ( 0.2292 , 0.267 )
dictionary$closed.end[dictionary$class=="closed.end"]<-"closed.end"
dictionary$closed.end[dictionary$class!="closed.end"]<-"not"
inference(y=dictionary$closed.end,est='proportion',type="ci",method="theoretical", success="closed.end", eda_plot=FALSE)
## Single proportion -- success: closed.end 
## Summary statistics: p_hat = 0.13 ;  n = 2007 
## Check conditions: number of successes = 261 ; number of failures = 1746 
## Standard error = 0.0075 
## 95 % Confidence interval = ( 0.1153 , 0.1448 )
dictionary$closed[dictionary$class=="closed"]<-"closed"
dictionary$closed[dictionary$class!="closed"]<-"not"
inference(y=dictionary$closed,est='proportion',type="ci",method="theoretical", success="closed", eda_plot=FALSE)
## Single proportion -- success: closed 
## Summary statistics: p_hat = 0.5306 ;  n = 2007 
## Check conditions: number of successes = 1065 ; number of failures = 942 
## Standard error = 0.0111 
## 95 % Confidence interval = ( 0.5088 , 0.5525 )

We are 95% confident that the true proportion of syllables in English words are distributed with:

  • between 7.9 and 10.4 % open
  • between 22.9 and 26.7% closed in front
  • between 11.5 and 14.5% closed at the end, and
  • between 50.9 and 55.3% closed on both sides.
Digraphs

Likewise we can obtain a confidence interval for the use of digraphs in each section of the syllable. For the tests we need to limit our data to syllables containing the part of interest.

fronts <- subset(dictionary, len_start_cons!=0, select=c(len_start_cons))
fronts$length[fronts$len_start_cons!=1]<-"digraph"
fronts$length[fronts$len_start_cons==1]<-"single"
inference(y=fronts$length,est='proportion',type="ci",method="theoretical", success='single', eda_plot=FALSE)
## Single proportion -- success: single 
## Summary statistics: p_hat = 0.8196 ;  n = 1563 
## Check conditions: number of successes = 1281 ; number of failures = 282 
## Standard error = 0.0097 
## 95 % Confidence interval = ( 0.8005 , 0.8386 )
vowels <- subset(dictionary, len_vowel!=0, select=c(len_vowel))
vowels$length[vowels$len_vowel!=1]<-"digraph"
vowels$length[vowels$len_vowel==1]<-"single"
inference(y=vowels$length,est='proportion',type="ci",method="theoretical", success='single', eda_plot=FALSE)
## Single proportion -- success: single 
## Summary statistics: p_hat = 0.8909 ;  n = 2007 
## Check conditions: number of successes = 1788 ; number of failures = 219 
## Standard error = 0.007 
## 95 % Confidence interval = ( 0.8772 , 0.9045 )
ends <- subset(dictionary, len_end_cons!=0, select=c(len_end_cons))
ends$length[ends$len_end_cons!=1]<-"digraph"
ends$length[ends$len_end_cons==1]<-"single"
inference(y=ends$length,est='proportion',type="ci",method="theoretical", success='single', eda_plot=FALSE)
## Single proportion -- success: single 
## Summary statistics: p_hat = 0.8235 ;  n = 1326 
## Check conditions: number of successes = 1092 ; number of failures = 234 
## Standard error = 0.0105 
## 95 % Confidence interval = ( 0.803 , 0.844 )

We are 95% confident that the true proportion of non-digraph sounds in English words are distributed with:

  • between 80.1 and 83.9 % single opening consonants
  • between 87.7 and 90.5% single vowels, and
  • between 80.3 and 84.4% single ending consonants.

A procedural algorithm for generating syllables from this pattern should be able to work through the syllable from front to back, so we also need to obtain the proportion of fully closed syllables given a closed front. Since these variables are dependent we use Bayes’ Theorem:

\[P(A | B) = P(A & B)/P(B)\]

\[P(end | front) = P(end & front)/P(front)\]

\[P(end | front) = P(both)/P(front)\]

dictionary$all_closed_front[dictionary$len_start_cons != 0]<-"closed.front"
dictionary$all_closed_front[dictionary$len_start_cons == 0 ]<-"open.front"
#head(dictionary)
inference(y=dictionary$all_closed_front,est='proportion',type="ci",method="theoretical", success="closed.front", eda_plot=FALSE)
## Single proportion -- success: closed.front 
## Summary statistics: p_hat = 0.7788 ;  n = 2007 
## Check conditions: number of successes = 1563 ; number of failures = 444 
## Standard error = 0.0093 
## 95 % Confidence interval = ( 0.7606 , 0.7969 )

The expected proportion of all syllables with a closed front is between about 77.9%, and the proportion of fully closed syllables is about 53.1%. Plugging in yields

\[P(end | front) = .531/.779 = 68.2%\]

Revising the Random Name Generator

With this information, we can now improve the syllable generation function. Syllable classes will be determined using the estimated population proportions in a linear fashion, and blends will be kept in a separate set from the single consonants and vowels. We still need a separate ending consonant digraph set to keep syllables from ending in a consonant flanked by two r’s.

#This is modified xVowel generation code, no longer including the single vowels or y and w
vowels <- unlist(strsplit("aeiou",""))
dVowels <- vector()
for(i in 1:5){ #Make controlled vowels for a-u.
    dVowels <- c(dVowels, paste(vowels[i],"r",sep=""))
    #Digraph time!
    if (vowels[i] == "a" || vowels[i] =="o"){ #A's and O's
        for(j in 1:5){ #combine with aeiou
            dVowels <- c(dVowels,paste(vowels[i],vowels[j], sep=""))
            }
        } else if(vowels[i] == "i" || vowels[i] == "u"){ #I's and U's
            for(j in 2:3){ #combine with ei
                dVowels <- c(dVowels,paste(vowels[i],vowels[j], sep=""))
                }
            }else if(vowels[i] == "e"){ #E's
                for(j in 1:5){ #combine with aeiu
                    if(j!=4){ #skip o
                        dVowels <- c(dVowels,paste(vowels[i],vowels[j], sep=""))
                        }
                    }
                }
    }

dConsonants = c("ch","wh","th","ph","sh","gh","st","squ")#An extended consonant
eDConsonants = c(dConsonants,"pt","ft","ct","lt","rr","ll","zz","ss","ff","sque") #Ending consonant
l = length(dConsonants)
for(i in 1:l){
            #Digraph time! Adding r's
            if(dConsonants[i]!="q" && dConsonants[i]!="r" && dConsonants[i]!="l" && dConsonants[i]!="z" &&dConsonants[i]!="s" && dConsonants[i]!="y" && dConsonants[i]!="h" && dConsonants[i]!="j" && dConsonants[i]!="n" && dConsonants[i]!="m" && dConsonants[i]!="x" && dConsonants[i]!="wh" && dConsonants[i]!="gh" && dConsonants[i]!="pt" && dConsonants[i]!="ft" && dConsonants[i]!="ct" && dConsonants[i]!="lt" && dConsonants[i]!="squ"){
               dConsonants <- c(dConsonants, paste(dConsonants[i],"r",sep=""))
            }
            #Not all q's need u's, but some should have them
            if(dConsonants[i]=="q"){
               dConsonants <- c(dConsonants,"qu")
            }
            
         }
#We no longer need to include an empty string as a consonant.

And the revised syllable function:

syllableV2 <- function(){
    # \s Revised to have
    ## 77.9% chance of a front consonant
    ### 82.0% of which should be single
    ## A vowel 100% of the time
    ### 89.1% of which should be single
    ## 68.2% chance of an ending consonant
    ### 82.4% of which should be single
    front.closed <- (sample(1:1000,1) <= 779)
    front.single <- (sample(1:1000,1) <= 820)
    vowel.single <- (sample(1:1000,1) <= 891)
    end.closed <- (sample(1:1000,1) <= 682)
    end.single <- (sample(1:1000,1) <= 824)
    syll <- ""
    if(front.closed){
        if(front.single){
            syll <- paste(syll, sample(consonants,1), sep="")
        }else{
            syll <- paste(syll, sample(dConsonants,1), sep="")
        }
    }
    if(vowel.single){
            syll <- paste(syll, sample(vowels,1), sep="")
        }else{
            syll <- paste(syll, sample(dVowels,1), sep="")
    }
    if(end.closed){
        if(end.single){
            syll <- paste(syll, sample(consonants,1), sep="")
        }else{
            syll <- paste(syll, sample(eDConsonants,1), sep="")
        }
    }
    return(syll)
}

Testing

words2 = vector()
for(i in 1:100){
    word <- syllableV2()
    word <- paste(word, syllableV2(), sep="")
    words2 <- c(words2, word)
}
head(words2)
## [1] "jugyes"    "niene"     "hossrast"  "atgey"     "payag"     "jussshruw"

The computer voice synthesization test also worked well, and this time I could only identify 3 words as being somewhat more difficult.

Mathematical characteristics:

lengths2 = vector()
for(word in words2){
    lengths2 <- c(lengths2,length(unlist(strsplit(word,""))))
}
summary(lengths2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.78    6.25   10.00
qqnorm(lengths2, main="Lengths of Generated Words v2")
qqline(lengths2)

Note only did these words more consistently read as 2 syllables, they were substantially shorter, ranging from 3 to 10 letters, and were no longer distributed normally. The effect of the revised generator function can be demonstrated by comparing the lengths of the 2-syllable words from the old function, the new function, and the dataset:

twos <- subset(dictionary, syllables==2)
lengths3 <- vector()
len_so_far <- 0
for(i in 1:nrow(twos)){
    #row <- twos[i,]
    if (i %%2 > 0) {
        len_so_far <- nchar(twos[i,'start_cons'])+nchar(twos[i,'vowel'])+nchar(twos[i,'end_cons'])
    }else{
        length <- len_so_far + nchar(twos[i,'start_cons'])+nchar(twos[i,'vowel'])+nchar(twos[i,'end_cons'])
        lengths3 <- c(lengths3, length)
        len_so_far <-0
    }
}
boxplot(lengths,lengths2,lengths3, names=c("Original","Revised","Data"))

Conclusion

The quality of the Random Name Generator’s output was able to be improved by imitating the proportions of syllable structures from English words. The syllable function should also be able to be tuned to specific characteristics by taking proportions from other corpora, such as a definitive survey of the English language, surveys of actual names, or any other set of words you wished to emulate.

The link between syllable structure and its placement in the word was too weak to attempt to imitate. Blocking for confounding variables such as part of speech and etymology may show a relationship that could then be imitated.

The Random Name Generator might also be to imitate a given corpus in a more realistic manner by weighting the letters themselves. Further testing would be necessary to see if it is possible to make a procedurally generated random word generator using such techniques.

Revised version

The revised version of the Random Name Generator is live.

Appendices

Appendix 1: Generated Words from Original Formula

##   [1] "chroephuit"    "taistfrad"     "maeloow"       "mituer"       
##   [5] "trailtfeeb"    "thriisteb"     "grerptbii"     "whaaythah"    
##   [9] "waushrew"      "brirghad"      "ghurmwhii"     "gurshroal"    
##  [13] "thuthaeff"     "crouptceff"    "whyrghthroess" "phwsquexoo"   
##  [17] "riemleal"      "wyllyell"      "thructsi"      "froupoug"     
##  [21] "chrouqueem"    "zedeex"        "styrchpeiwh"   "graabchei"    
##  [25] "chraustyr"     "foanorp"       "sturvbryr"     "gheustbraull" 
##  [29] "choecroasque"  "mourrprui"     "phroistroa"    "prehly"       
##  [33] "tarlcheaf"     "grooshbroab"   "kropfrur"      "sarshu"       
##  [37] "chrepuir"      "aajeem"        "siellvaf"      "squiesqubith" 
##  [41] "ghiishzouft"   "brirdroir"     "churghgaov"    "ghirshreesh"  
##  [45] "squarwhquouw"  "waamgreach"    "squyzpreuj"    "looquir"      
##  [49] "biizzthaich"   "tiishauj"      "bruesphrork"   "stecroey"     
##  [53] "kreypreiw"     "phirctfurr"    "laipgroe"      "foiwjeip"     
##  [57] "wynhupt"       "duijqir"       "xoiftaect"     "dryrhghyr"    
##  [61] "craigreest"    "chebreu"       "kwzzphiell"    "chychw"       
##  [65] "jaapgorn"      "xywhrur"       "phrwzue"       "ghoongreu"    
##  [69] "wherrrstaep"   "vraictloa"     "higghup"       "zarssghii"    
##  [73] "wroishfrer"    "kusskuip"      "siwrorl"       "murssyrq"     
##  [77] "tojyr"         "fiewhnou"      "phimdrux"      "quiedcraph"   
##  [81] "neipaev"       "hirwhkrorph"   "ziivwhars"     "fasquooct"    
##  [85] "vuesqueleass"  "hieptzirt"     "vreezkau"      "ziisqua"      
##  [89] "serxstir"      "roimmyr"       "vorftwhoek"    "vryrdfraa"    
##  [93] "frisscyr"      "dreejgry"      "frotchi"       "qeirrstass"   
##  [97] "vrabxoath"     "breqquor"      "shesquthie"    "ghuissbruh"

The less pronounceable were:

## [1] "trailtfeeb"    "whyrghthroess" "phwsquexoo"    "styrchpeiwh"  
## [5] "squarwhquouw"  "dryrhghyr"     "chychw"        "fiewhnou"     
## [9] "hirwhkrorph"

Appendix 2: Sampled Pages

The following pages were sampled from The American Heritage (R) College Dictionary, 4th edition.

  • 730: invertase - invulnerable
  • 40: al-Qaeda - also (most of the page is taken up by a table of alphabets.)
  • 154: blindworm - block signal
  • 744: jerk - jewelry
  • 1583: writer - wyvern
  • 107: bailor - bald cypress
  • 229: Catoctin Mountains - cause
  • 337: crimp\(^2\) - critique
  • 1547: warthog - wast
  • 287: comforter - commensurate
  • 236: cerebroside - cervine
  • 1287: Shylock - sickroom
  • 224: cash crop - cast
  • 1230: sang - San Sebastian
  • 1129: pullon - punch
  • 424: down - Down’s Syndrome
  • 1181: repressed memory - repulsed
  • 1064: plan - plant
  • 1372: stumblebum - styloid
  • 673: Howard - hubby
  • 1366: Strayhorn - stress
  • 716: innumerate - INS
  • 1517: varicella - vasectomy
  • 833: maguey - mailbox
  • 1: a - abandoned (The opening page of a letter’s section does not include printed guide words.)
  • 118: base pair - Basque
  • 1269: set\(^2\) - seven
  • 201: cable television - -cade
  • 1441: Tyllich - timeous
  • 1134: putamen - pycnogonid

Appendix 3: Generated Words from Revised Formula

##   [1] "jugyes"     "niene"      "hossrast"   "atgey"      "payag"     
##   [6] "jussshruw"  "zisis"      "ishtha"     "mokrux"     "iver"      
##  [11] "ukof"       "dabba"      "ephpi"      "phamko"     "strocat"   
##  [16] "eescar"     "dithod"     "uphni"      "paod"       "wanuq"     
##  [21] "ciphaw"     "taftut"     "vumrect"    "shexqur"    "teqeg"     
##  [26] "whirzil"    "kuyge"      "tewqo"      "cinul"      "lorhob"    
##  [31] "kaftsquuch" "uslug"      "ibda"       "kowmash"    "pei"       
##  [36] "hemosqu"    "uvaoj"      "gesu"       "jomya"      "kufney"    
##  [41] "whegek"     "fuxpog"     "ruut"       "eipol"      "wigshro"   
##  [46] "xabvaiq"    "irbi"       "kukyi"      "cimjek"     "ejcuw"     
##  [51] "samphrar"   "yaebsen"    "kirssu"     "nuon"       "pixo"      
##  [56] "boakphi"    "feqyo"      "ohtu"       "bachro"     "bavwa"     
##  [61] "chucroq"    "nuphlach"   "durult"     "nevhoc"     "joauct"    
##  [66] "phagkaq"    "uhxis"      "jane"       "vikpez"     "lafsuft"   
##  [71] "nanneb"     "munsuj"     "yihthron"   "wuheph"     "oygom"     
##  [76] "ghopha"     "phaihib"    "strabroc"   "kinban"     "xusoy"     
##  [81] "yudbos"     "juzi"       "mafgej"     "dame"       "oybi"      
##  [86] "xijstell"   "warfiz"     "tozsof"     "vural"      "tigiin"    
##  [91] "gevjeag"    "wopga"      "ipbo"       "ralduy"     "noirzoft"  
##  [96] "zoftuq"     "idfuz"      "chrormoc"   "tiffep"     "suhwi"

The less pronounceable were:

## [1] "paod"     "uvaoj"    "chrormoc"

  1. The program was written in Javascript. For reproducibility of research, I have ported sections of the code to R to use in this document.

  2. While most of the words could be pronounced as two syllables, it occasionally made more sense to pronounce them as three. This is acceptable for a program whose main purpose is to generate fantastical names for fiction writing.

  3. The pages’ guide words, along with a page of the data are printed in Appendix 2.