DiscordGPT

Harys Dalvi

March 2023


Have you ever thought about uploading your consciousness to a computer and achieving immortality? In this tutorial, you will do the next best thing: upload a corpus of your words to a GPT model and get a simulation of you and your friends in conversation. I will be using Discord to get the corpus for dialogue, but you can use a different source as well. What matters is that you get a CSV containing blocks of dialogue.

This is approximately the quality of output you can expect. (You might be able to get better output if you have a computer with a powerful GPU; details below!)

Bob: I never thought I would have time
Bob: I'm getting into biology but it seems like a decent field
Bob: I'm interested in physics but not in any particular fields yet
Alice: but i think it can be a good background for math
Bob: I'm interested in the math side of it
Alice: very interesting
Bob: I think it can also be a good way to pick up data
Charlie: Yes but not that much for me
Charlie: For me for math the most desirable is to have a deeper understanding of systems
Alice: true
More examples
Alice: ive heard it's very popular but never actually tried it yet
Alice: I don't even know if it does any cool stuff
Alice: just think about this idk how to apply to classes
Bob: Well I applied to a few lol lol
Alice: https://tenor.com/view/the-funny-crack-of-saray-birshad-funny-dance-gif-24861177
Bob: I also applied as a freshman but it was the same thing. Also it's not super open to minorities lol
Alice: ive never been to this lol
Alice: ok it was a fun college course back in the early 90s
Alice: but I wonder if it would be any fun now
Alice: im very lazy ima play my pong right now
Bob: If I did an actual degree I think I could make a decent living
Alice: yeah
Bob: Or can you take the cap course?
Alice: no way
Alice: oh wait yes
Alice: idk
Alice: hi?
Bob: Maybe he used the code from one book or something
Alice: ok that is cool
Bob: You could do that and it would be kinda cool
Alice: i would like to see some other people do the reverse
Bob: Then you make them do it too lol
Bob: Oh maybe it would work
Alice: oh wow
Bob: So I don't have to do reverse trigs to calc all the stuff
Alice: wow good idea
Bob: For one i would say reverse trigs?
Bob: I know
Bob: But at the same time the reverse trig isn't the reverse trig
Alice: hi
Bob: I don't think so
Bob: But it's cool for it to be the inverse, but also can't be the reverse trig
Alice: wow
Alice: i wonder if that will work better
Alice: but idk if it actually will if you dont do trigs
Alice: theres so much stuff it makes it hard to learn

These examples were AI-generated from a GPT-2 model trained on one of my Discord channels. Names have been changed. As you can tell, we are nerds.

Obtaining Data

If you're using Discord, you can use the Discord Chat Exporter to get all the messages in a channel in CSV format. First, clone the repository onto your computer. Then, once you obtain your token and a channel ID, you can run the Unix command

dotnet DiscordChatExporter.Cli.dll export -t <YOUR TOKEN HERE> -c <CHANNEL ID HERE> -f Csv
to download the channel's messages as a CSV. (The Windows command will be similar; check the Discord Chat Exporter wiki for more details.)

At this point, you should have a CSV in the following format:

AuthorID,Author,Date,Content,Attachments,Reactions
"961292453880303616","Alice#2027","29-Mar-23 02:01 PM","yo guys","",""
"961292453880303616","Alice#2027","29-Mar-23 02:01 PM","i have an idea","",""
"864940450982014276","Bob#1091","29-Mar-23 02:01 PM","what is it","",""
"961292453880303616","Alice#2027","29-Mar-23 02:02 PM","what if we are in a simulation?","",""
"864940450982014276","Bob#1091","29-Mar-23 02:02 PM","thats a stupid idea","",""
"418156904730524386","Charlie#7181","29-Mar-23 02:02 PM","Yea fr","",""
Call it something like channel.csv.

Preprocessing

In the end, we want a CSV in the following format:

Conversation,
"A: hi! B: hi, nice to meet u A: nice to meet u too! B: what are ur thoughts on the high energy consumption of training LLMs A: is this your typical icebreaker",
"A: yo guys A: i have an idea B: what is it A: what if we are in a simulation? B: thats a stupid idea C: Yea fr",
This is quite straightforward in Python. Note that for this step, you will need pandas and numpy. If you don't have them, you can run the command
pip install pandas numpy
(or pip3 if you use the python3 command) in Unix to install the libraries. Once you have them, create a file called something like preprocess.py to format the messages from Discord.
import pandas as pd
import numpy as np

df = pd.read_csv('channel.csv', sep=',', header=None).to_numpy()
df = df[1:] # ignore the first line with the field names AuthorID,Author,Date,Content,Attachments,Reactions
authors = df[:,1].astype('str') # author is field index 1
messages = df[:,3].astype('str') # content is field index 3

codenames = {
  "Alice#2027": "A",
  "Bob#1091": "B",
  "Charlie#7181": "C"
} # change this based on the users in your channel. exclude bots that you don't want to be included in the GPT output.

dialogue = "\""
for i in range(messages.shape[0]):
  try:
    # in this line, we will:
    # 1) use the "code name" for this user (use a unique initial like A, B, C)
    # 2) replace newlines with spaces, and double quotes with single quotes, so that the resulting CSV format is valid
    dialogue += codenames[authors[i]]+": "+messages[i].replace('\n',' ').replace("\"", "'")+" "
    # the result will look like B: hi, nice to meet u
    if i % 64 == 63: # after every 64 messages,
      dialogue += "\",\n\"" # end this line of the CSV and start a new conversation block
  except KeyError: continue # ignore users that are not in the `codenames` dict
dialogue += "\"," # complete the last line of the CSV

# write the result to disk
file = open('discord.csv', 'w+')
file.write("Conversation,\n") # include a basic CSV header. we only need one field for this task.
file.write(dialogue)
file.close()
Let's take a closer look at the codenames dictionary. Why do we use initials? If we use the usernames or real names of the people in the conversation, our tokenizer could have problems. For usernames, something like Alice#2027 will be unfamiliar to the tokenizer, and will likely take multiple tokens. To concentrate on dialogue generation and not tokenization issues, it's easiest to just create single-letter aliases for each user.

If we use real names, the GPT model might have preexisting notions of the role of each person based on their names. For example, people with names like Jesus or Muhammad might be confused with people that the pretrained GPT model already has information on. On the other hand, people with names like crackalamoo that are less common in the corpus might be treated strangely by the model. Using letter initials for all names gets around both of these problems, but make sure the letters for each person are unique.
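To see the difference for yourself, you can tokenize a username and an alias with the same GPT-2 tokenizer we'll load later (a quick sketch; the exact splits depend on the tokenizer's learned vocabulary):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2-medium')
print(tokenizer.tokenize("Alice#2027: hi")) # the raw username fragments into several tokens
print(tokenizer.tokenize("A: hi")) # the single-letter alias stays short and predictable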

After running the above code, you should have a file discord.csv that contains all messages in your Discord channel blocked into conversations of 64 messages each.
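Before moving on, it's worth a quick sanity check that the new CSV parses cleanly (a minimal sketch):
import pandas as pd
check = pd.read_csv('discord.csv')
print(len(check), "conversation blocks")
print(check['Conversation'][0][:200]) # preview the start of the first block
# pandas may also show an extra unnamed column from the trailing commas; that's fine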

Training

At this point, it's important to note that you will probably need some kind of GPU. If your computer has a GPU, you can create a file gpt.py and start training. Otherwise, you will need to use something like Google Colab to train your model.

To build our model, we will get help from the open source models at Hugging Face. Make sure you have the libraries datasets and transformers from Hugging Face as well as torch (PyTorch) for the main model. If you are on your computer rather than Colab, you can use this Unix command:

pip install datasets transformers
or pip3 if you use the python3 command. Installing PyTorch on a local machine can be a little more involved, because you need a build with GPU (CUDA) support rather than the CPU-only build; see pytorch.org for the right install command for your system.
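Once PyTorch is installed, it's worth verifying that it can actually see your GPU before going any further:
import torch
print(torch.__version__)
print(torch.cuda.is_available()) # should print True if your install has working GPU support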

First of all, we want to load the CSV you just created as a Hugging Face dataset so we can use it with our GPT-2 model. If you're using your own computer's GPU, this will look something like:

import torch, datasets, transformers
import pandas as pd
assert torch.cuda.is_available() # make sure we have the GPU
FILE = "discord.csv" # replace with the directory of the file you created in preprocess.py
df = pd.read_csv(FILE)
dataset = datasets.Dataset.from_pandas(df)
# if you have a lot of data, or not a lot of time, you can do something like:
# dataset = datasets.Dataset.from_pandas(df.sample(2000))
# replace 2000 with something that works on your hardware.
Alternatively, if you're using Google Colab, you will have to load the file from your Drive. Upload the file you created in preprocess.py to somewhere in your Google Drive and create a Google Colab notebook. Then you can write
!pip install datasets transformers numpy
import datasets, transformers
import pandas as pd

# set up Google Drive access
from google.colab import drive
drive.mount('/content/gdrive')

FILE = "Your Directory Here/subfolder/discord.csv" # replace with the directory in your My Drive of the file you created in preprocess.py
df = pd.read_csv('gdrive/My Drive/'+FILE)
dataset = datasets.Dataset.from_pandas(df)
# if you have a lot of data, or not a lot of time, you can do something like:
# dataset = datasets.Dataset.from_pandas(df.sample(2000))

Now we want to create a train-test split. We mostly care about training; the real test of the model will be subjective, since we will personally judge its ability to generate dialogue similar to that in your Discord channel. Therefore, we will hold out only 10% of the data for testing.

dataset = dataset.train_test_split(test_size=0.1)
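After the split, dataset is a DatasetDict with a train split and a test split; printing it shows how many conversation blocks ended up in each:
print(dataset)
# DatasetDict({ train: Dataset(...), test: Dataset(...) })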
Next we need a tokenizer. A tokenizer takes in a sentence like
I see the Apple store but I don't see any apples.
and produces a series of tokens like
"I", "see", "the", "Apple", "store", "but", "I", "do", "_n't", "see", "any", "apple", "_s", "."
You can see that this roughly corresponds to splitting the sentence into words. Ideally, though, a tokenizer also splits words into morphemes, the smallest units of meaning in a language, which are sometimes smaller than words. This is why there are tokens like "_n't" and "_s" in addition to whole words. (GPT-2's tokenizer actually uses byte-pair encoding, which learns subword splits statistically rather than from linguistic rules, but the effect is similar.) Luckily, Hugging Face has a tokenizer for us to use that handles all this.
tokenizer = transformers.AutoTokenizer.from_pretrained('gpt2-medium')
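If you're curious what the tokens actually look like, you can run this tokenizer on the example sentence above. GPT-2 marks tokens that start a new word with a special 'Ġ' character standing in for the space:
print(tokenizer.tokenize("I see the Apple store but I don't see any apples."))
# expect tokens like 'ĠApple', 'Ġstore', and "'t" for the contraction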
Now we want to tokenize all of our data.
def tokenize_conversation(csv_row):
  return tokenizer(csv_row['Conversation'], truncation=True)
tokenized_dataset = dataset.map(tokenize_conversation, batched=True, remove_columns=dataset['train'].column_names)
Now that we have our tokenized_dataset, we can create equally-sized groups of tokens to train the model on.
block_size = 256
def group_texts(examples):
  # concatenate all the token lists in this batch into one long list per field
  concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  total_length = len(concatenated_examples[list(examples.keys())[0]])
  # drop the last partial block so every block is exactly block_size tokens
  total_length = (total_length // block_size) * block_size
  # slice the long lists into blocks of block_size tokens each
  result = {
    k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
    for k, t in concatenated_examples.items()
  }
  # for causal language modeling, the labels are just a copy of the inputs
  result["labels"] = result["input_ids"].copy()
  return result

lm_dataset = tokenized_dataset.map(group_texts, batched=True)
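As a quick check that the grouping worked, every example should now be exactly block_size tokens long:
print(len(lm_dataset['train'][0]['input_ids'])) # should print 256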
Next we will use a DataCollatorForLanguageModeling, which batches our examples together and pads them when they aren't all the same length. For padding we use a special token, the end-of-sequence (EOS) token, which also lets the model know that a sample of conversation is over. The same collator can do masked language modeling, where some words are randomly masked and the model learns to predict them, but GPT-2 is a causal language model (it only predicts the next token), so we disable that with mlm=False.
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
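You can sanity-check the collator by batching a couple of examples and looking at the tensor shape (a minimal sketch, assuming your training set has at least two blocks):
batch = data_collator([lm_dataset['train'][i] for i in range(2)])
print(batch['input_ids'].shape) # should be (2, 256) after the grouping step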
This next step is important: we will actually load the model! If you're on Google Colab, the biggest model you'll be able to use is gpt2-medium.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('gpt2-medium')
# you can also try gpt2-large or gpt2-xl if you have the hardware for it.
# this will need a pretty big GPU!
Finally, we can actually train the model.
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
  output_dir="dialogue-model",
  evaluation_strategy="epoch",
  learning_rate=2e-5,
  weight_decay=0.01,
  num_train_epochs=1, # this is how many times we go through the entire dataset. try 2 if you have a lot of time.
  per_device_train_batch_size=4,
  per_device_eval_batch_size=8
)

torch.cuda.empty_cache() # get the GPU ready for training
trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=lm_dataset['train'],
  eval_dataset=lm_dataset['test'],
  data_collator=data_collator
)

trainer.train() # this will take a while! about 15-20 minutes for me on Colab.
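Since training takes a while, it's worth saving the fine-tuned weights so you don't have to retrain later (the output path here is just an example):
trainer.save_model("dialogue-model-final") # example path
tokenizer.save_pretrained("dialogue-model-final") # save the tokenizer alongside the weights
# later, you can reload with AutoModelForCausalLM.from_pretrained("dialogue-model-final")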

Testing

Now there's just one last step: seeing if the model actually works! We will use the Hugging Face text generation pipeline, which handles using our model to generate text from a prompt.

from transformers import pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer) # pass the tokenizer explicitly, since we give the pipeline a model object rather than a model name

def generate_messages(prompt='', num=10, max_new_tokens=128):
  outputs = generator(prompt, num_return_sequences=num, max_new_tokens=max_new_tokens)
  for output in outputs:
    print("-"*20)
    text = output['generated_text']
    print(text)
To call this function, we can easily do something like
generate_messages(": ", num=2, max_new_tokens=256)
if we're on Colab or a Jupyter notebook. I chose the prompt ": " to make sure that the model generates dialogue. You can also try an empty prompt like "", but it's a little less reliable. If you want a message from a specific person, you can try a prompt like "A: ".

If you're running the code locally without a Jupyter notebook, you might want something like this:

while True:
  prompt = input("Prompt: ")
  generate_messages(prompt, num=1)
The output will be in the form A: hi B: hi A: how are you. If you want something cleaner, you can replace the print(text) part of generate_messages with something like
  text = output['generated_text']
  text = text.replace("A: ", "\nAlice: ")
  text = text.replace("B: ", "\nBob: ")
  text = text.replace("C: ", "\nCharlie: ")
  print(text)

Results

With the gpt2-medium model, I was able to get some pretty good results like

Alice: How do you solve this type of problem on your own
Bob: i figured out how to use a vector to go from one node to another
Bob: like you could do this for y-axis
Alice: Wow
Bob: i remember that
Bob: wow
Alice: Lol
Alice: I remember the answer was so simple
Alice: Since I was doing a neural network thing I thought I could solve this using some kind of machine learning algorithm

as well as some weird output like

Charlie: ive never had one
Charlie: But I got some of my cousin's at a local Asian grocery store
Charlie: And now she won't buy anything from me because of it
Bob: True
Charlie: But it's ok that its just because i don’t like it
Charlie: I'm ok with the idea of the app because she can still go to Amazon and do the equivalent of what i was doing, except use my data instead of using Amazon to make my own decisions
Alice: lol
Alice: @Eve dude
Alice: hi

Observations

The dialogues tend to be grammatically correct, but don't make much sense. Still, it's clear that the model is learning from the Discord dataset because of its ability to create a dialogue format with the correct characters as well as its use of relevant subject matter. (For my channel, this meant a lot of talk about college life and math.)

The gpt2-medium model I used has 355 million parameters. The gpt2-xl model I would have liked to use has 1.5 billion parameters: a pretty significant step up. If I could use that model, I would expect much better performance and more coherent sentences.

Both of these GPT-2 models absolutely pale in comparison to GPT-3, which has 175 billion parameters. If I had a larger dataset, and the computational power to feed it to GPT-3, what would happen? The result would not truly be conscious, but judging from the performance of large language models like ChatGPT, it might feel that way. A language model trained on your words would probably be aware of a large portion of your skills, personality, and experiences. It would sound a lot like you, almost as if your consciousness had been uploaded. Even this very basic language model picked up some ideas about math and other fields based on conversations in my Discord channel.

ChatGPT is in essence just a scaled-up version of what was done here. (Very, very scaled up, in many different ways.) These scaled-up models have a conceptually similar architecture; they just take up far more space, a victory for the "bigger is better" design philosophy. However, computational cost is not the only issue with large language models (LLMs). Making LLMs smart, safe, and personalized is an extremely difficult technical and societal problem.

Playing with Discord and LLMs conjures a world in which everyone has a personal language model. Would we really need to consult people in such a world when we can just consult their chatbots? How can we ensure ourselves a place in society that can't be replaced by AI?

These are hard questions. Perhaps it's worth asking my simulated friends.

Bob: How can we ensure ourselves a place in society that can't be replaced by AI?
Charlie: The main idea is that we should be able to use machines to become better people at whatever we decide to be better at. It's not that we should get smarter or better at anything for ourselves, but that we as individuals have choices about what to do with our potential in our lives. I'm sure I don't have the same choices if I were to become an AI super scientist (but I'm not), so I think it'll be fine.
Alice: I don't think that's a good idea
Alice: I think we'll see very little innovation in those fields
Alice: But AI will make it easier to work with other people
Bob: Good point
Bob: But they are also an important part of humanity
Alice: I agree
Alice: I think in my view AI will be a huge asset, and the best way I can see of how human beings will work together to make this happen is by fostering the idea of AI and learning from its mistakes
Eve: AI has always seemed dangerous, like a ticking time bomb
Eve: But AI seems like a solution to all the problems we have with the system itself
Eve: People have always been nervous about computers, they'll make a mistake sometimes, but in general are smart enough to learn from mistakes
Eve: And maybe they'll never get sick

References

The GitHub repository for this project is at crackalamoo/discordgpt.

A special thanks to the Brown Machine Intelligence Community and to Tyrrrz's Discord Chat Exporter for providing much of the code and concepts that I worked from.