Ortatious Andith ib Staylatt Neller

Harys Dalvi

March 2022


There are a few videos on the internet about what English sounds like to people who don't speak it. They often sound something like this:

Stringe canna is like a string paggard hasheter ominent if you think diadout forday and comminent paymin for the trainable.
In other words, the videos make no sense, and they shouldn't — but it always annoys me how you can make out a lot of actual English words, even if it's gibberish. That got me thinking about a way to generate fake English with a computer and avoid actual English words.

I thought of two main ways to do this: manually coding linguistic rules and using a neural network. On this page, I will go through both of these and compare the results. With that, let's start — or as they say in faux English, brind shass!

Hard-Coding Phonotactics

When I first tried to hard-code English, I had trouble finding the resources I needed. In particular, I couldn't find a source on English phonotactics that was detailed enough to turn into a computer program. Phonotactics deals with how sounds are allowed to combine in a language: for example, “treels” is a valid combination of sounds in English, even though it isn't in the dictionary. “Gvprtskvni” is definitely not a valid combination in English, but believe it or not, it is an actual Georgian word.

გვფრცქვნი

Gvprtskvni: it means “you peel us” in Georgian.

Fortunately, there is a very well-known and detailed source on Japanese phonotactics: Japanese writing. I first wrote a program to create fake Japanese. This is how it works:

  1. List all sounds that exist in Japanese.
  2. Make rules for all the ways those sounds can combine and make syllables.
  3. Adjust to make some sounds more probable than others [1].
  4. The program will read this data and output fake Japanese, roughly as sketched below.
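To give a rough idea of what that looks like in code, here is a heavily simplified Python sketch. The syllable inventory, rules, and weights below are tiny placeholders, much smaller than the tables my actual program uses:

    import random

    # A heavily simplified (consonant)(vowel)(coda) syllable model. The real
    # program uses a much fuller inventory, plus frequency data from [1];
    # all of the weights below are made up.
    ONSETS  = ["", "k", "s", "t", "n", "h", "m", "y", "r", "w"]
    VOWELS  = ["a", "i", "u", "e", "o"]
    CODAS   = ["", "n"]   # Japanese codas are very limited; long vowels and doubled consonants are ignored here
    ONSET_W = [3, 2, 2, 2, 2, 1, 1, 1, 2, 1]   # step 3: some onsets more likely than others

    def syllable():
        onset = random.choices(ONSETS, weights=ONSET_W)[0]
        vowel = random.choice(VOWELS)
        coda  = random.choices(CODAS, weights=[9, 1])[0]
        return onset + vowel + coda

    def word(min_syll=1, max_syll=4):
        return "".join(syllable() for _ in range(random.randint(min_syll, max_syll)))

    print(" ".join(word() for _ in range(8)))   # e.g. "kanu so ritenka yo ..."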
As a non-Japanese speaker, I thought the result was fairly convincing:
/oːri ribonu ku jokin roːpːuwaː. tɕin kisuoko beon ehoːn ri./
おうり りぼぬ く よきん ろうっぷわあ。ちん きすおこ べおん えほうん り。
Ōri ribonu ku yokin rōppuwā. Chin kisuoko beon ehōn ri.

Japanese phonotactics is a lot simpler than English phonotactics, though, so I couldn't go straight to English. I tried various languages along the way, referencing Wikipedia phonology pages, but I was a little less satisfied with the results. Note that in the following examples, I had to hand-pick the words from the program that sounded best and make some of my own edits instead of taking the results as-is. In Mandarin Chinese, a lot of the syllables the program gave corresponded to actual characters. In the other languages, I sometimes had to manually remove real words. Here's (partially) fake Mandarin Chinese:

/nau̯1i4 rɨpjɛn4 ʈʂʰaŋ2tɕy2 pʰaŋmi ʂan1mwən1. tsʰau̯tɕʰiŋ3 lwai̯2tsʰɨ tsən1sɨ tsjaŋ1lwan tswai̯4./
Nāoyì ripiàn chángjú pangmi shānmūn. Caoqǐng luáici zēnsi ziāngluan zuài.
Here's fake Spanish:
/buˈɲasa ˈela fiˈneɾo pɾiˈnifɾio ʝuˈsela. ˈtʃeɲa luˈposa ˈdjenos setiˈgio raˈdɾaske./
Buñasa ela finero prinífrio llusela. Cheña luposa dienos setiguío radrasque.
Fake French:
/vœ̃ bənɛ̃ lœʒ ʁɔ̃ adɔ̃pan. sekɑ̃ ʒaʁi tuvəble kɥɛl bʁadlyʒe./
Veun benain lœuge ron adompanne. Sécan jarrie touvebler cuelle bradlugée.
Fake Hindi-Urdu:
/kəfmi ɖʱape kʰənːã dənam dʒʊbai. ətʃʰ ɦɪnɔli sɛxa ʈamba ɦoɽʱ./
कफ़मी ढापे खन्नाँ दनाम जुबाई। अछ हिनौली सैख़ा टांबा होढ़।
کَفمی ڈھاپے کھَنّاں دَنام جُبائی۔ اَچھ ہِنَولی سَیخا ٹامبا ہوڑھ۔
Kafmi ḍhaapay khannã danaam jubaai. Achh hinauli saikha ṭaamba hoṛh.
For fake Arabic, I had to include the actual Arabic definite article because it's such a distinctive part of the sound:
/az-zaːtiːɣaː ʕabajb al-kawki radʒr qajziraː. ʃawaː al-majsˤawtaː θun ʕaːxaðˤat al-qaʕasaː./
الزَّاتِيغَا عَبَيب الْکَوکِ رَجْر قَيزِرَا۔ شَوَا الْمَيصَوتَا ثُن عَاخَظَت الْقَعَسَا۔
Az-zātīghā ʻabayb al-kawki rajr qayzirā. Shawā al-mayṣawtā thun ʻākhaẓat al-qaʻasā.
And finally, fake English:
/snaɪʃoʊz bɪp spɛbi jutɛɪθ hoɪbraɪ. hædneɪ sɒtməwɪdʒ sturə rutʃ ɒskeɪl./
Snyshows bip spebby uteith hoibrigh. Hadnay sotmawidge stoora rooch osscale.
Sny-shohs bip speh-bee yoo-tayth hoi-bry. Had-nay sot-muh-wij stoo-ruh rooch oss-kayl.
Here's an unedited fake English sample to give you an idea of my edits:
/peɪʒi eɪ ni kju leɪ. aʊəaɪk deɪ lɒtʃeɪ snaʊʒibʌs toʊt./
Peigee ay knee cue lay. Owa-ike day lochay snowgebus tote.
Pay-zhee ay nee kyoo lay. Ow-uh-ike day law-chay snahw-zhee-bus toht.
Knee, cue, lay, day, and tote are all actual English words, and the rest of the words somehow seem off to me.

GAN

Next I tried to accomplish the same task using a neural network. My first choice was a generative adversarial network (GAN). If you feed in some data (like pictures of people, or English words), the network should return more generated examples of what you put in. It works in two parts: a generator and a discriminator, which compete with each other (hence adversarial). The discriminator tries to discriminate between samples made by the generator and real data, while the generator tries to generate fakes that fool the discriminator. In the end, the generator should create such realistic examples that the discriminator can't tell what's real and what's not.

Sample outputs by the authors of StyleGAN. The people in these pictures are computer-generated, not real!
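In code, the adversarial setup looks roughly like the sketch below. This is a generic Keras version for word data, not my exact model, and the layer sizes and word length are made up. Each training step teaches the discriminator to tell real from fake, then teaches the generator to fool it:

    import tensorflow as tf
    from tensorflow.keras import layers, Sequential

    WORD_LEN, LATENT = 12, 16   # word length and noise size; these numbers are made up

    # Generator: random noise in, one number per character slot out.
    generator = Sequential([
        layers.Dense(64, activation="relu", input_shape=(LATENT,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(WORD_LEN),
    ])

    # Discriminator: a row of character numbers in, "how real does this look?" out.
    discriminator = Sequential([
        layers.Dense(64, activation="relu", input_shape=(WORD_LEN,)),
        layers.Dense(1, activation="sigmoid"),
    ])

    bce = tf.keras.losses.BinaryCrossentropy()
    g_opt = tf.keras.optimizers.Adam(1e-3)
    d_opt = tf.keras.optimizers.Adam(1e-3)

    @tf.function
    def train_step(real_words):   # real_words: (batch, WORD_LEN) float array of letter numbers
        noise = tf.random.normal((tf.shape(real_words)[0], LATENT))
        with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
            fake = generator(noise, training=True)
            real_pred = discriminator(real_words, training=True)
            fake_pred = discriminator(fake, training=True)
            # The discriminator learns to call real words 1 and fakes 0...
            d_loss = (bce(tf.ones_like(real_pred), real_pred)
                      + bce(tf.zeros_like(fake_pred), fake_pred))
            # ...while the generator learns to make its fakes get called 1.
            g_loss = bce(tf.ones_like(fake_pred), fake_pred)
        d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                                  discriminator.trainable_variables))
        g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                                  generator.trainable_variables))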

Unfortunately, this didn't work so well for me. Image GANs often use a convolutional neural network. This lets the discriminator take weighted combinations of nearby pixels in a way that detects features like edges. I wanted my network to consider nearby letters: “tha” is a lot more likely than “gvp”. But what's G times 0.24? Since I was using letters instead of numbers, it didn't make sense.

Without convolution, I couldn't think of a reasonable way to consider nearby letters. I ended up using only Dense layers, one of the most basic types of layer in a neural network, which was clearly not good enough. After a lot of training, with my computer fans going crazy, I got:

??? ? cda c ??
???? ???uhe ?? ??
? ?? a? ggddfi ?? ?
????? ?c bacad ?? ?
?? a?kafejdmd??
a? ? ?l?olmsha?? ?
?????? ??ig?a ?? ??
? ?? bdnhn? a????
??? ??b?bbd b??? ?
???? bahrh?a??? ?
The question marks are where the neural network's output was not a valid letter like letter #1 (A) or letter #2 (B), but something like letter #-4 or letter #49. There is a little bit of learning here: it is learning that words go in the middle, with spaces on either side. But I don't really think that's what English sounds like to people who don't speak it.
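For reference, the decoding step that produces those question marks looks roughly like this (a reconstruction; the exact mapping in my code may differ):

    LETTERS = " abcdefghijklmnopqrstuvwxyz"   # letter #0 is a space, #1 is A, and so on

    def decode(numbers):
        # Turn the generator's raw numbers back into characters.
        chars = []
        for x in numbers:
            i = round(float(x))
            # Anything outside the alphabet, like letter #-4 or #49, becomes '?'.
            chars.append(LETTERS[i] if 0 <= i < len(LETTERS) else "?")
        return "".join(chars)

    print(decode([0.9, 3.6, -4.2, 49.0, 0.1]))   # -> "ad?? "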

LSTM

Instead of using a GAN, I found an example online that used a Long Short-Term Memory (LSTM) network to generate text character-by-character [2]. LSTM networks are a type of recurrent neural network (RNN), meaning they can remember previous data in order to put their output in context [3]. This was exactly what I needed to generate fake English, because each letter depends on the letters around it: you can't have a word like “hdjafhkjsdjfh”.
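The core of that example, which I adapted, looks roughly like this. It's a simplified sketch: the input path is a placeholder, and my window length and training settings differed, as I describe below.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    SEQ_LEN = 40   # context window from the original example; I changed this, as discussed below

    text = open("udhr.txt").read().lower()   # placeholder path for the input text
    chars = sorted(set(text))
    char_to_idx = {c: i for i, c in enumerate(chars)}

    # Slice the text into overlapping SEQ_LEN-character windows, each labeled
    # with the one character that follows it.
    X = np.zeros((len(text) - SEQ_LEN, SEQ_LEN, len(chars)), dtype=bool)
    y = np.zeros((len(text) - SEQ_LEN, len(chars)), dtype=bool)
    for i in range(len(text) - SEQ_LEN):
        for t, c in enumerate(text[i:i + SEQ_LEN]):
            X[i, t, char_to_idx[c]] = True
        y[i, char_to_idx[text[i + SEQ_LEN]]] = True

    # An LSTM reads the window one character at a time; a softmax layer then
    # gives a probability for every possible next character.
    model = keras.Sequential([
        keras.Input(shape=(SEQ_LEN, len(chars))),
        layers.LSTM(128),
        layers.Dense(len(chars), activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    model.fit(X, y, batch_size=128, epochs=20)

Generating text then just means repeatedly feeding in the last SEQ_LEN characters and sampling the next character from the model's softmax output.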

While the original code used segments of 40 characters, I shortened it to 10 characters so it would be focused on generating words rather than sentences. I used the Universal Declaration of Human Rights [4] as input, first with the original text, then with a version transcribed into IPA phonetic characters. At first, the output didn't make much sense:

tdofend zulsance ins of dier ousshand ongantiinco und cous ariasdacpimcoancianl ancor toroglneibais asrarit, cacius or ankecmembecoousion ancen ortdas or macalitdgedd antecpudialitras artticevafs almideacicsiondkssmend actetjone irpoco ho erect pas biymonaitovinagiry alicof onyohperian an onde nceroncaed armandaes of eraqmane anuitaese aaleetandyicvetion the alterangererligeceocicaliandemtityarta
As the neural network learned more about the data, it started coming together:
furthermore, no forth other other bergenuin, by semplesng of las1 his the digrits to hid ortatious in lioged int inservational, touldiag in as chongrald themevees for the united nationaly in a pecils deace or the a the gnien of perserventied. irtist inciperhas arcaring of nock, huraliple for the bidivation social declaration or by law. article 24 everyone has the dignity and social protection of whis ded
At this point, there are some parts that make no sense (“las1”), some parts that sound like possible English words (“inservational”), and some actual English words (“social declaration”). When there are parts that make no sense, the model is clearly underfitting: it isn't able to match the data well enough. But when there are actual English words, the model is overfitting: it is making decisions based on overly specific data points rather than overall patterns in the data. For my purposes, I want to have words that sound possible but aren't real, like “bidivation”.

The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and it is likely to have a higher error rate on new unseen data, compared to the black line. (Chabacano, Wikipedia)

The overfitting is clear in this case, where the neural network is memorizing the data. But outside of a machine learning context, the whole problem of figuring out what English sounds like to people who don't speak it is closely connected to overfitting. The issue I had with other attempts at this problem is that they used many actual English words, just as my neural network ended up doing. Even my hard-coded phonotactic program had a few English words.
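One way to catch those stray real words automatically would be to check each generated word against an English word list. Here's a small sketch of that kind of check (the word list file is a hypothetical stand-in):

    # Drop any generated word that happens to be a real English word.
    # "english_words.txt" is a stand-in for whatever dictionary word list you have.
    with open("english_words.txt") as f:
        real_words = {line.strip().lower() for line in f}

    def keep_fake_only(words):
        return [w for w in words if w.lower() not in real_words]

    print(keep_fake_only(["snyshows", "bip", "day", "bidivation", "tote"]))
    # -> ['snyshows', 'bip', 'bidivation'], assuming "day" and "tote" are in the list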

In addition to fake English words, the neural network generated some questionable remarks on human rights. For a network trained entirely on a declaration of human rights, this really highlights the importance of AI ethics...

as to marriage, during marriage shall be subjected to torture or to cruel, indushis s.
everyone shall be held in slavery or servitule
peoples of territories... shall be subject to arbitrary arceslation
everyone has the right to seek... and shall be held guilty of any penal offence
the moral and education shall be arbitrarily deprived of his country, includes freedoms, themselves and among the people
everyone has the right to equal pay for equal write... and the slave trade
Between these dystopian sentences and the overfitting and underfitting in the model, this RNN showed how human input can still be an important supplement to AI. The same applied to my earlier program, where I had to make some manual edits on top of the computer-generated text. Maybe figuring out what English sounds like to people who don't speak it is just one of the many cases where collaboration between human and computer is the best approach.

References

The GitHub for this project is at https://github.com/crackalamoo/staylatt.
  1. Frequency of occurrence for units of phonemes, morae, and syllables appearing in a lexical corpus of a Japanese newspaper (Katsuo Tamaoka & Shogo Makioka)
  2. Character-level text generation with LSTM (Keras)
  3. A Gentle Introduction to Long Short-Term Memory Networks by the Experts (Jason Brownlee, Machine Learning Mastery)
  4. Universal Declaration of Human Rights (United Nations)