BardGPT

A miniature GPT model extended with a rhyme and meter model for generating poetry

July 2023

I asked ChatGPT to give me some Spanish words that end in the letter I. It gave me the following:

Amistad
Felicidad
Abril
Café
Bolígrafo
Lápiz
Universitari
Leviatán
Reír
Baladí

Of these, only universitari and baladí actually end in I. Furthermore, universitari isn't a Spanish word at all — it's Catalan. ChatGPT admittedly performed better with the same exercise in English, but this serves to highlight an important fact about large language models as they are currently built: they only have access to the words themselves, and have no idea what the words sound like or even how they're spelled beyond what is explicitly stated in their training data. How can we effectively add this information to language models?

Portrait of William Shakespeare — William Shakespeare, also known as “the Bard”

This leads into an even more important point. Scaling language models is great, but it comes with an enormous computational and environmental cost. How can we add more information into language models and achieve better results while managing the computational power required? In other words, how do we design language models more intelligently?

That's why I decided to make BardGPT, a miniature GPT model for generating poetry. BardGPT is also an exercise in extending the power of language models by giving them access to information beyond syntax and semantics. It uses information about the phonology of words to generate poetry more accurately than a plain transformer model while increasing the number of parameters only marginally.

How it Works

Data

Data for this model was sourced entirely from Project Gutenberg. In retrospect, this was a limitation, because the sources there are limited in both quantity and diversity. I only reached about 12 MB of raw data, which I then had to further process and condense. In the future I would like to use a larger and more diverse set of sources, hopefully with web scraping. This would also allow a larger model, because computing power wasn't the limiting factor here: overfitting was.

With that said, I did try a larger dataset from Kaggle. I found that the resulting poems were actually far worse because the Kaggle dataset had a lot of low-quality data. After finishing this project, I published my own dataset on Kaggle.

Once I had the data, there was a significant amount of preprocessing to be done. Most importantly, I used special tokens for common suffixes such as =ing (run =ing -> running) and =ed (use =ed -> used). This allowed a larger effective vocabulary size so I wouldn't need to store two separate tokens for run and running when the two ideas are related in a predictable way. Another trick to manage the vocabulary size was to make everything lowercase.
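As a rough illustration of the suffix tokens described above, here is a minimal sketch. It is not the code from the repository: the names `split_suffix`, `tokenize`, and `STEMS`, and the three spelling rules (plain stem, dropped "e", doubled consonant), are assumptions made for this example.

```python
# Hypothetical sketch: split an inflected word into a stem plus a suffix token.
SUFFIXES = ("ing", "ed")

# Assumed toy list of known stems; a real vocabulary would be built from the data.
STEMS = {"run", "use", "walk"}


def split_suffix(word, stems):
    """Return [stem, '=suffix'] if word is a known stem plus a suffix, else [word]."""
    word = word.lower()  # everything is lowercased, as in the preprocessing above
    for suffix in SUFFIXES:
        if not word.endswith(suffix) or len(word) <= len(suffix):
            continue
        base = word[: -len(suffix)]
        # Candidate stems: plain (walk+ing) and dropped e (use+ed).
        candidates = [base, base + "e"]
        # Doubled final consonant (run+n+ing).
        if len(base) >= 2 and base[-1] == base[-2]:
            candidates.append(base[:-1])
        for stem in candidates:
            if stem in stems:
                return [stem, "=" + suffix]
    return [word]


def tokenize(text, stems):
    """Whitespace tokenizer that applies split_suffix to every word."""
    tokens = []
    for word in text.split():
        tokens.extend(split_suffix(word, stems))
    return tokens


print(tokenize("Running and used", STEMS))  # ['run', '=ing', 'and', 'use', '=ed']
```

Words such as "king" or "bed" stay whole here because no known stem matches, which is the main reason the sketch checks against a stem list instead of stripping suffixes blindly.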

I also developed my own functions to determine rhymes and syllable counts. Initially I tried the eng-to-ipa Python module, but this was far too slow and didn't work for less common words. My functions sacrificed a small amount of accuracy for far greater speed and domain specificity. Check the GitHub repository crackalamoo/bardgpt for more details on how I stored information on rhyme and “meter”, loosely speaking.
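To give a feel for the trade-off between accuracy and speed, here is a minimal spelling-based sketch of syllable counting and rhyme detection. It is an assumption about what such heuristics might look like, not the functions from the repository; `count_syllables`, `rhyme_key`, and `rhymes` are hypothetical names, and the rules will misjudge many words (for example, "radiant" is counted as two syllables).

```python
import re

VOWELS = "aeiouy"


def _normalize(word):
    """Lowercase and keep only the letters a-z."""
    return re.sub(r"[^a-z]", "", word.lower())


def _strip_silent_e(word):
    """Drop a likely silent final 'e' ("love" -> "lov"); keep "the" and "table"."""
    if (
        len(word) > 2
        and word.endswith("e")
        and not word.endswith("le")
        and word[-2] not in VOWELS
        and re.search(r"[aeiouy]", word[:-1])
    ):
        return word[:-1]
    return word


def count_syllables(word):
    """Rough syllable count: the number of vowel groups after dropping a silent 'e'."""
    word = _strip_silent_e(_normalize(word))
    if not word:
        return 0
    return max(len(re.findall(r"[aeiouy]+", word)), 1)


def rhyme_key(word):
    """Spelling-based rhyme key: the last vowel group and the letters after it."""
    word = _strip_silent_e(_normalize(word))
    match = re.search(r"[aeiouy]+[^aeiouy]*$", word)
    return match.group(0) if match else word


def rhymes(a, b):
    """Two different words rhyme (by spelling) if their rhyme keys match."""
    return _normalize(a) != _normalize(b) and rhyme_key(a) == rhyme_key(b)


print(count_syllables("returning"))  # 3
print(rhymes("night", "bright"))     # True
```

Because it only looks at spelling, this kind of function runs in a single pass over each word and needs no pronunciation dictionary, at the cost of mistakes on words whose spelling and sound diverge.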

Finally, I specified some tokens that the model should not be allowed to predict, even though they are among the most frequent tokens in the dataset. These were mainly proper nouns such as “Lenore” (from Edgar Allan Poe's famous poem “The Raven”). Generally I kept only those proper nouns that have great significance outside of the literary work they are associated with, so Lenore was out but Jesus made the cut. Some cases were trickier, like Achilles from Homer's Iliad.
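One common way to stop a model from predicting certain tokens is to mask their logits before the softmax. The sketch below assumes that approach; the post does not say how the ban was implemented, and `mask_banned`, the toy vocabulary, and the logit values are all made up for illustration.

```python
import math


def mask_banned(logits, banned_ids):
    """Replace the logits of banned token ids with -inf so they get probability 0."""
    return [float("-inf") if i in banned_ids else x for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; assumes at least one logit is finite."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: "lenore" has the highest raw score but is banned.
vocab = ["the", "lenore", "night"]
logits = [1.0, 3.0, 1.0]
banned = {vocab.index("lenore")}

probs = softmax(mask_banned(logits, banned))
print(probs)  # [0.5, 0.0, 0.5]
```

Masking at prediction time keeps the banned words available as input context, so the model can still read "Lenore" in the training poems without ever producing it.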

Model Architecture

Diagram of the model. The inputs split into two branches: one with a GPT-style decoder, and the other with a rhyme/meter model. The rhyme/meter model has separate rhyme and meter encodings which are processed through linear layers and concatenation. The final results of the GPT and rhyme/meter branches are added together before a final softmax function. — A schematic of the BardGPT architecture, made with draw.io. Rounded rectangles indicate trainable operations, while sharp rectangles indicate fixed operations.

This schematic mostly covers the big picture of the BardGPT architecture. Check the GitHub repository crackalamoo/bardgpt for more technical details on the components of the model. There are just a few important points I would like to add here.

First, I tested three different architectures in this project. The simplest one was an n-gram model that tries to predict the next word given the last three words using only a set of linear layers. The next was a transformer model, which is based on a GPT architecture and resembles the schematic shown here but with the rhyme/meter sections cut out. The final model is the one shown here, which is a transformer model with additional phonological information.
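The last step of the schematic, where the outputs of the two branches are added before the final softmax, can be sketched in a few lines. This is a simplified illustration with made-up numbers: the real branches are neural networks, and `combine_branches` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_branches(gpt_logits, rhyme_meter_logits):
    """Add per-token scores from both branches, then normalize with softmax."""
    if len(gpt_logits) != len(rhyme_meter_logits):
        raise ValueError("both branches must score the same vocabulary")
    summed = [g + r for g, r in zip(gpt_logits, rhyme_meter_logits)]
    return softmax(summed)


# Toy example: the GPT branch is undecided between three tokens, and the
# rhyme/meter branch adds a bonus of log(2) to the second one.
gpt_logits = [0.0, 0.0, 0.0]
rhyme_meter_logits = [0.0, math.log(2), 0.0]

probs = combine_branches(gpt_logits, rhyme_meter_logits)
print([round(p, 2) for p in probs])  # [0.25, 0.5, 0.25]
```

Adding logits is equivalent to multiplying the unnormalized probabilities from each branch, so the rhyme/meter branch acts as a reweighting of what the GPT branch already prefers.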

It's also interesting to note the number of parameters for each model:

n-gram: 24.1 million
Transformer: 38.9 million
Bard: 39.4 million

The rhyme and meter layers of the bard model only add about 500,000 parameters, which is a small fraction of the total.
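The arithmetic behind that figure, using the rounded parameter counts listed above (so the results are approximate):

```python
# Parameter counts in millions, as listed above (rounded values).
PARAMS_MILLIONS = {"n-gram": 24.1, "transformer": 38.9, "bard": 39.4}

extra = PARAMS_MILLIONS["bard"] - PARAMS_MILLIONS["transformer"]
relative = extra / PARAMS_MILLIONS["transformer"]

print(f"extra parameters: about {extra:.1f} million")  # about 0.5 million
print(f"relative increase: about {relative:.1%}")      # about 1.3%
```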

Results

༄༅༅ the frozen pang
o thou life! returning now to open suns, ascend
to find unseen she dwelt! thy gleam thy bright loom act girl
to ransom radiant order crown the righteous speed.
loss of many whose gifts fail! string thou further,
bring to spy my anguish ajax, king heir!
thy elder burial hand must heart abhor,
think me canst thou then time these cared forbear,
thine sister here posterity, she need,
and my loosened passion stoops in vain.
— BardGPT

Each model reached the following minimum perplexities:

n-gram: 119.18
Transformer: 83.23
Bard: 80.18
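For context, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below shows that standard definition along with the relative reduction between two perplexities; the function names are chosen for this example, and the post does not describe its exact evaluation code.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# A model that always gives the correct token probability 1/4 has perplexity 4:
# it is as uncertain as a uniform choice among four options.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0

# Transformer vs. bard, using the minimum perplexities listed above.
print(f"{relative_reduction(83.23, 80.18):.1%}")  # 3.7%
```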

The bard model therefore reached a perplexity 3.7% lower than the transformer model, while adding only 1.3% more parameters. But if you look at the actual poem the model generated, it doesn't seem to rhyme very well. Has the bard model really learned anything new?

I suspect the rhyme encoding I designed did not work well enough to encourage the model to rhyme. Meter was more straightforward. Just by knowing how many syllables are in a line, the model can much better predict where to place <newline> tokens. To compare, here is a poem generated by the plain transformer model:

༄༅༅ haze footstep with a stray night mount went back long forward and down, one after plank and ten big star shone purple and austere,
sunk in a shining arm, prone his labor fled.
all gathered smoke came sympathy, by trenches devoting,
thrusting over all the age in door, roots went,
but unknown in happy rivulet, too soon began, that thou hast laughed re-indeed stirrest thy gates
that is the stream whose ruddiest song are heard?

Notice how some lines are much longer than others. (You may have to scroll.) This is much less of an issue in the bard model.
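One way to see why a syllable count helps with line breaks is to track how many syllables the current line already contains at each position. The sketch below computes such a running count; it is only an illustration of the idea, not the meter encoding used in BardGPT, and the function name and the syllable counts in the example are assumptions.

```python
def syllables_so_far(tokens, syllable_counts, newline="<newline>"):
    """For each token, return the syllables already on the current line before it."""
    features = []
    running = 0
    for token in tokens:
        features.append(running)
        if token == newline:
            running = 0  # a new line starts after this token
        else:
            # Unknown tokens are assumed to have one syllable.
            running += syllable_counts.get(token, 1)
    return features


# Hypothetical syllable counts for a few tokens.
counts = {"thy": 1, "elder": 2, "burial": 3, "hand": 1, "think": 1, "me": 1}
tokens = ["thy", "elder", "burial", "hand", "<newline>", "think", "me"]

print(syllables_so_far(tokens, counts))  # [0, 1, 3, 6, 7, 0, 1]
```

With a feature like this, the position where <newline> becomes likely is tied to a number the model can read directly, instead of something it has to infer by counting words.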

And just for fun, here's a poem generated by the naive n-gram model:

༄༅༅ we as fold debate reside between with helmet to your city kingly words
attend stood your wild island quarter
that shing twice while alone, earth flashed heaven slept thou truth when beauty's order bled?
crown over these run. but what, if the you knew."--say no more.
no more together given round us draw once. prow " if aloft thou pledging wherein finds alone, and flashes at thy loose touch along his songs with fill been filled rigid retreats one to fetch afar;
sun and since feel, like humble witness north, kneels: do they bring. she smiled
to asks stand cap and queen of golden paint along her glade should strike love their age; perhaps thee troops fell is looked to ilion;
accord of, i win
before the king that sad bill lost blazing up against vain pursue;
her spirit fire alone
to solemn cottage sing sound foam all cow, bank move of this soul design, the neptune most bow of wield close of sleep.
such heaven was jest, my flood, in death now the trade away, of sculptured scar swaying you are near
even remains sweetly learning, race from heaven kiss,
descend the conquering rainbow shone."
why moved;
nor passion! they their drunken met their charge ran upon what glass unknown before away, such fears which still rise away."

So clearly the transformer model was a vast improvement on the n-gram model, and it seems the bard model was an improvement on the transformer model as well. But what are the implications of this improvement?

Beyond Rhyming

Does a language model even need to know how to rhyme? I asked Llama-2-7B to generate a poem:

A sunbeam's gentle touch
A fleeting gift from above,
It lights the path I travel on,
And fills my heart with joy and love.
The light it brings is pure and bright,
A beacon in the darkest night,
It guides me on my journey through,
And helps me find my inner light anew.

With 7 billion parameters but no special architecture for rhyming, Llama-2-7B was able to generate a poem that rhymes better than BardGPT. This is because in addition to the increase in parameters, Llama-2-7B was trained on a much larger dataset from which it can simply memorize which rhymes are common in poetry. BardGPT, on the other hand, was trained on a much smaller dataset and had far fewer parameters, so it was not able to memorize rhymes to this extent.

7 billion parameters isn't even that excessive for a large language model, so do we really need specific architecture for rhyming? And to raise an even bigger question, do we need any additional model architecture at all, or can we rely on scaling models and discovering emergent properties?

At the end of the day, a large language model is just a mathematical algorithm. A more sophisticated algorithm can do more sophisticated tasks, in principle including coding, mathematics, rhyming, and anything else as long as it has the relevant data and parameters. So perhaps we don't really need specific architecture for rhyme and meter.

But to perform extremely sophisticated tasks, approaching artificial superintelligence, we need a model with extremely sophisticated data and parameters. This requires an immense amount of computational power and results in a black box model that is difficult to understand and control. Rather than going down this route, I think it's important to look at ways we can improve large language models without just adding more data and parameters. Transformers were a huge step forward in this regard, and reinforcement learning from human feedback was another. Surely there are many more to come.

This approach of making more sophisticated models has three main benefits over simply scaling the models:

If done well, it allows us to achieve better results with less computational power and less environmental impact.
With multimodal learning, it can expand the capabilities of our best AI models beyond language.
It improves the interpretability of the models, which is increasingly important as models develop human-like intelligence and may one day surpass us, but we have little understanding of how they work.

It might seem difficult, and it is, but consider this: humans are still smarter than AI, but the energy expenditure of a human brain is much less than that of a ChatGPT supercomputer. This means it can be done: we humans just need to use our brains and find out how to do it responsibly, before our brains themselves lose their spot at the top of the intelligence ladder.

And of course, we can't ignore the question of whether we should try to make such a powerful model at all. I don't have an answer to that, because its impacts would surely be far-reaching both positively and negatively. But I think if it can be done, it will be done, so it's better to do it in a thoughtful way.

Among the poets in the dataset for BardGPT were Lord Byron and Percy Bysshe Shelley. It's interesting that Lord Byron's daughter, Ada Lovelace, was a central figure in the early development of computer science, and some even say she wrote the first computer program. Percy Shelley's wife, Mary Shelley, wrote Frankenstein, a cautionary tale whose message we would do well to heed in the AI race. As we develop more powerful AI, we have an opportunity to prevent an even worse environmental crisis and give humanity access to a benevolent intelligence at our fingertips. But it's possible that one day, we might have to get used to not being the smartest thing on the planet anymore. I hope and expect that day will be one where we live in prosperity. It should be a day where we do not look back on the arrogant humanity of the past the way Percy Bysshe Shelley looked back on the arrogant pharaoh Ozymandias:

And on the pedestal, these words appear:
My name is Ozymandias, King of Kings;
Look on my Works, ye Mighty, and despair!
Nothing beside remains. Round the decay
Of that colossal Wreck, boundless and bare
The lone and level sands stretch far away.


References